Chunker params for very large files

From: Alex Gorbachev
Date: 2015-08-21 @ 06:07
Hello, what would be a good chunker setting for a handful (15) of files
with sizes from 20 GB to 150 GB, totaling 2.3 TB per day?  These are
database backups that cannot be made incremental.

Reading https://borgbackup.github.io/borgbackup/internals.html#chunks

CHUNK_MIN_EXP = 10 (minimum chunk size = 2^10 B = 1 kiB)
CHUNK_MAX_EXP = 23 (maximum chunk size = 2^23 B = 8 MiB)
HASH_MASK_BITS = 16 (statistical medium chunk size ~= 2^16 B = 64 kiB)
HASH_WINDOW_SIZE = 4095 [B] (0xFFF)

An existing recommendation of 19,23,21,4095 for huge files from
https://borgbackup.github.io/borgbackup/usage.html appears to
translate into:

minimum chunk of 512 KiB
maximum chunk of 8 MiB
medium chunk of 2 MiB

In a 100GB file we are looking at 51200 chunks.  Would it be
beneficial to raise these further?  The machine I have doing this has
plenty of RAM (32 GB) and 8 CPU cores at 2.3 GHz, so RAM/compute is
not a problem.  The main goal is processing speed, followed by
deduplication efficiency.
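
For reference, a quick back-of-the-envelope sketch of that arithmetic in
Python (treating GB/TB as GiB/TiB for simplicity; the 19,23,21 values are
the ones quoted above):

# Translate the recommended chunker params 19,23,21,4095 into sizes.
CHUNK_MIN_EXP, CHUNK_MAX_EXP, HASH_MASK_BITS = 19, 23, 21
KiB, MiB, GiB = 2 ** 10, 2 ** 20, 2 ** 30

min_chunk = 2 ** CHUNK_MIN_EXP       # 512 KiB
max_chunk = 2 ** CHUNK_MAX_EXP       # 8 MiB
medium_chunk = 2 ** HASH_MASK_BITS   # 2 MiB statistical medium

print(min_chunk // KiB, "KiB min /", max_chunk // MiB, "MiB max /",
      medium_chunk // MiB, "MiB medium")

# Approximate chunk counts, assuming chunks land at the medium size.
print(100 * GiB // medium_chunk, "chunks for a 100 GB file")        # ~51200
print(int(2.3 * 1024 * GiB) // medium_chunk, "chunks for 2.3 TB")   # ~1.2 million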

Thank you,
Alex

Re: [borgbackup] Chunker params for very large files

From: Thomas Waldmann
Date: 2015-08-21 @ 11:48
Hi Alex,

> Hello, what would be a good chunker setting for a handful (15) of files
> with sizes from 20 GB to 150 GB, totaling 2.3 TB per day?  These are
> database backups that cannot be made incremental.

That depends a bit on your goals.

If you have enough space and mainly care about good speed and low
management overhead (and not so much about deduplicating with very
fine-grained blocks), use a higher value for HASH_MASK_BITS, like 20 or
21, so the statistical medium chunk size is larger. It sounds like this
matches your case.

If you care about very fine-grained deduplication, don't have that much
data, and can live with the management overhead, use a small chunk size
(a small HASH_MASK_BITS, like the default 16).
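
To get a rough feel for that tradeoff, here is a small Python estimate
(assuming chunks land at the statistical medium size of 2^HASH_MASK_BITS
bytes; the counts are only ballpark figures for ~2.3 TB of daily data):

# Estimated medium chunk size and chunk count for ~2.3 TB of backups,
# for a few HASH_MASK_BITS values (real chunk sizes vary around the medium).
KiB, GiB = 2 ** 10, 2 ** 30
daily_total = int(2.3 * 1024 * GiB)  # treating 2.3 TB as ~2.3 TiB

for hash_mask_bits in (16, 19, 20, 21):
    medium = 2 ** hash_mask_bits
    chunks = daily_total // medium
    print("HASH_MASK_BITS={}: ~{} KiB medium chunk, ~{:,} chunks".format(
        hash_mask_bits, medium // KiB, chunks))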

> An existing recommendation of 19,23,21,4095 for huge files from
> https://borgbackup.github.io/borgbackup/usage.html appears to
> translate into:
> 
> minimum chunk of 512 KiB
> maximum chunk of 8 MiB
> medium chunk of 2 MiB
> 
> In a 100GB file we are looking at 51200 chunks.

You need to take the total amount of your data (~2 TB) and compute the
chunk count (roughly 1,000,000 at 2 MiB per chunk). Then use the resource
formula from the docs to compute the sizes of the index files (and the
RAM needed).
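
A minimal sketch of that calculation in Python; the per-entry byte cost
below is only a placeholder assumption, the real constants are in the
resource usage section of the docs:

# Chunk count from the total data size, then a rough index size estimate.
# BYTES_PER_INDEX_ENTRY is a placeholder, NOT the real borg constant --
# look up the actual per-entry costs in the internals docs.
MiB, GiB = 2 ** 20, 2 ** 30

total_data = int(2.3 * 1024 * GiB)   # ~2.3 TB/day, treated as TiB
medium_chunk = 2 ** 21               # HASH_MASK_BITS = 21 -> 2 MiB medium
BYTES_PER_INDEX_ENTRY = 100          # placeholder assumption

chunk_count = total_data // medium_chunk
index_size = chunk_count * BYTES_PER_INDEX_ENTRY
print("~{:,} chunks, index on the order of {:.0f} MiB".format(
    chunk_count, index_size / MiB))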

In your case this looks quite reasonable; you could also use 1 MiB
chunks, but better not use 64 KiB chunks.

> beneficial to raise these further?  The machine I have doing this has
> plenty of RAM (32 GB) and 8 CPU cores at 2.3 GHz, so RAM/compute is
> not a problem.

Right. But if your index is rather big, it'll need to copy around a lot
of data (for transactions, and for resyncing the cache in case you back
up multiple machines to the same repo).


Cheers, Thomas

----

GPG Fingerprint: 6D5B EF9A DD20 7580 5747  B70F 9F88 FB52 FAF7 B393
Encrypted E-Mail is preferred / Verschluesselte E-Mail wird bevorzugt.

Re: [borgbackup] Chunker params for very large files

From: Alex Gorbachev
Date: 2015-08-24 @ 02:26
Hi Thomas,

On Fri, Aug 21, 2015 at 7:48 AM, Thomas Waldmann <tw@waldmann-edv.de> wrote:

> If you have enough space and mainly care about good speed and low
> management overhead (and not so much about deduplicating with very
> fine-grained blocks), use a higher value for HASH_MASK_BITS, like 20 or
> 21, so the statistical medium chunk size is larger. It sounds like this
> matches your case.
>
> If you care about very fine-grained deduplication, don't have that much
> data, and can live with the management overhead, use a small chunk size
> (a small HASH_MASK_BITS, like the default 16).
>
>> An existing recommendation of 19,23,21,4095 for huge files from
>> https://borgbackup.github.io/borgbackup/usage.html appears to
>> translate into:
>>
>> minimum chunk of 512 KiB
>> maximum chunk of 8 MiB
>> medium chunk of 2 MiB
>>
>> In a 100GB file we are looking at 51200 chunks.
>
> You need to take the total amount of your data (~2 TB) and compute the
> chunk count (roughly 1,000,000 at 2 MiB per chunk). Then use the resource
> formula from the docs to compute the sizes of the index files (and the
> RAM needed).
>
> In your case this looks quite reasonable; you could also use 1 MiB
> chunks, but better not use 64 KiB chunks.

Thank you for the clarification.  Is HASH_WINDOW_SIZE tunable in any
way, or is it useful to change?

Best regards,
Alex

>
>> beneficial to raise these further?  The machine I have doing this has
>> plenty of RAM (32 GB) and 8 CPU cores at 2.3 GHz, so RAM/compute is
>> not a problem.
>
> Right. But if your index is rather big, it'll need to copy around a lot
> of data (for transactions, and for resyncing the cache in case you back
> up multiple machines to the same repo).
>
>
> Cheers, Thomas
>
> ----
>
> GPG Fingerprint: 6D5B EF9A DD20 7580 5747  B70F 9F88 FB52 FAF7 B393
> Encrypted E-Mail is preferred / Verschluesselte E-Mail wird bevorzugt.

Re: [borgbackup] Chunker params for very large files

From: Thomas Waldmann
Date: 2015-08-24 @ 09:45
A short note for all who want to play with --chunker-params:

currently you need to use the git master branch for that; the 0.24
release does not work:

https://github.com/borgbackup/borg/issues/154

Cheers, Thomas

-- 

GPG Fingerprint: 6D5B EF9A DD20 7580 5747  B70F 9F88 FB52 FAF7 B393
Encrypted E-Mail is preferred / Verschluesselte E-Mail wird bevorzugt.