
Borg speed tuning on large files

From: Alex Gorbachev
Date: 2015-08-25 @ 01:35
Hello, new challenge on performance - running a machine with two 4
core 2 GHz CPUs, 32 GB RAM and pretty fast disks.  Trying to run a
dedup backup of 14 files, 20-150 GB in size with a total of about 2TB.

When borg runs I see IO rates via iostat that are far below the
storage subsystem capabilities.  top shows 97-99% load for borg and
around 100 MB RSS.

I am assuming the bottleneck is the CPU as borg is single threaded.
Is there anything we could do to speed the process up though - more
RAM caching somehow?

Thank you,
Alex

Re: [borgbackup] Borg speed tuning on large files

From: Thomas Waldmann
Date: 2015-08-25 @ 08:30
On 08/25/2015 03:35 AM, Alex Gorbachev wrote:
> Hello, new challenge on performance - running a machine with two 4
> core 2 GHz CPUs, 32 GB RAM and pretty fast disks.  Trying to run a
> dedup backup of 14 files, 20-150 GB in size with a total of about 2TB.
> 
> When borg runs I see IO rates via iostat that are far below the
> storage subsystem capabilities.  top shows 97-99% load for borg and
> around 100 MB RSS.

Do you use compression or encryption?

I am a bit surprised that you get so close to 100%: due to the
single-threaded and relatively simple internal workings of the current
release, one usually only gets that close when using high compression
(in which case the compression itself can be the bottleneck) or
encryption without CPU acceleration (or both).

If you use encryption and openssl can't use AES-NI (because the cpu does
not support it or the drivers are not loaded), that can also slow down
things.
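
A quick way to check that (assuming Linux and a stock openssl; the cipher
name below is just an example):

grep -m1 -o aes /proc/cpuinfo     # prints "aes" if the CPU advertises AES-NI
openssl speed -evp aes-128-cbc    # the EVP code path uses AES-NI when it is available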

So, for first speed tests, I'd recommend no or fast compression and no
encryption. With the 0.24 release that is --compression 0 (default) or
--compression 1.
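
For example (the repo and data paths below are just placeholders, not from
your setup):

borg init --encryption=none /backup/testrepo
borg create --compression 1 /backup/testrepo::speedtest /data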

The next release will add super fast lz4 compression (which I think you
will like if your I/O system is rather fast); I hope I can release it in
a few days.

Keep an eye on borg's cpu load and get it to a bit lower value (maybe
50-80%),
so it is in the sweet spot in the middle of being I/O bound and being
CPU bound.
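
One way to watch that (assuming the sysstat package is installed; plain
top works as well):

pidstat -u -p $(pgrep -n -f 'borg create') 5   # %CPU of the newest borg create process, sampled every 5 seconds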

Also, as I've already said, I am sorry that 0.24 had broken
--chunker-params parameter parsing, so do not use that right now; this
will also be fixed in the next release asap.

> I am assuming the bottleneck is the CPU as borg is single threaded.
> Is there anything we could do to speed the process up though - more
> RAM caching somehow?

I am working on a multithreaded implementation (which is not trivial and
not expected to be finished soon) which can use the CPU cores and I/O
capabilities of a system much better.

If you want to play with it, it is in multithreading branch of the repo,
but do NOT use that for real backups, it has still failing tests and
also the crypto might be unsecure there.

I've seen > 300% CPU load with that code on a dual-core cpu with
hyperthreading and also the wallclock runtime was better than with
single-threaded code (but not 3x better, there is also some overhead).

-- 

GPG Fingerprint: 6D5B EF9A DD20 7580 5747  B70F 9F88 FB52 FAF7 B393
Encrypted E-Mail is preferred / Verschluesselte E-Mail wird bevorzugt.

Re: [borgbackup] Borg speed tuning on large files

From: Alex Gorbachev
Date: 2015-08-25 @ 15:43
On Tue, Aug 25, 2015 at 4:30 AM, Thomas Waldmann <tw@waldmann-edv.de> wrote:

> On 08/25/2015 03:35 AM, Alex Gorbachev wrote:
> > Hello, new challenge on performance - running a machine with two 4
> > core 2 GHz CPUs, 32 GB RAM and pretty fast disks.  Trying to run a
> > dedup backup of 14 files, 20-150 GB in size with a total of about 2TB.
> >
> > When borg runs I see IO rates via iostat that are far below the
> > storage subsystem capabilities.  top shows 97-99% load for borg and
> > around 100 MB RSS.
>
> Do you use compression or encryption?
>

No encryption.  Tested with compression and without - observed the behavior
you described below, but the volume of data pretty much requires
compression to properly function.  I tried compression levels of 0 and 3.


>
> I am a bit surprised that you get so close to 100%: due to the
> single-threaded and relatively simple internal workings of the current
> release, one usually only gets that close when using high compression
> (in which case the compression itself can be the bottleneck) or
> encryption without CPU acceleration (or both).
>
> If you use encryption and openssl can't use AES-NI (because the cpu does
> not support it or the drivers are not loaded), that can also slow down
> things.
>
> So, for first speed tests, I'd recommend no or fast compression and no
> encryption. With the 0.24 release that is --compression 0 (default) or
> --compression 1.
>
> The next release will add super fast lz4 compression (which I think you
> will like if your I/O system is rather fast); I hope I can release it in
> a few days.
>

Oh, can't wait - this is what ZFS uses and ours is plenty fast.  I also
realized that we are going from a compressed ZFS filesystem to its snapshot
and then to an uncompressed target destination with borg (as borg already
compresses), so there is overhead from ZFS decompression... but that should
run on another core.


>
> Keep an eye on borg's cpu load and get it to a bit lower value (maybe
> 50-80%),
> so it is in the sweet spot in the middle of being I/O bound and being
> CPU bound.
>

I turned off hyperthreading and enabled aggressive CPU power mode, and am
running a test, which will take about 25+ hours on the 2 TB.


>
> Also, as I've already said, I am sorry that 0.24 had broken
> --chunker-params parameter parsing, so do not use that right now; this
> will also be fixed in the next release asap.
>

Thanks, I compiled right away from the git tree and had no problems there.
Thank you for the fast response.


>
> > I am assuming the bottleneck is the CPU as borg is single threaded.
> > Is there anything we could do to speed the process up though - more
> > RAM caching somehow?
>
> I am working on a multithreaded implementation (which is not trivial and
> not expected to be finished soon) which can use the CPU cores and I/O
> capabilities of a system much better.
>

Completely understood, and I am assuming starting multiple borg processes
in parallel for each file is not a good idea?


>
> If you want to play with it, it is in multithreading branch of the repo,
> but do NOT use that for real backups, it has still failing tests and
> also the crypto might be unsecure there.
>
> I've seen > 300% CPU load with that code on a dual-core cpu with
> hyperthreading and also the wallclock runtime was better than with
> single-threaded code (but not 3x better, there is also some overhead).
>
> --
>
> GPG Fingerprint: 6D5B EF9A DD20 7580 5747  B70F 9F88 FB52 FAF7 B393
> Encrypted E-Mail is preferred / Verschluesselte E-Mail wird bevorzugt.
>

Re: [borgbackup] Borg speed tuning on large files

From: Thomas Waldmann
Date: 2015-08-25 @ 22:45
> No encryption.  Tested with compression and without - observed the
> behavior you described below, but the volume of data pretty much
> requires compression to properly function.  I tried compression levels
> of 0 and 3.

Maybe try 1 until lz4 is available. That's relatively fast and still
compresses.

> Completely understood, and I am assuming starting multiple borg
> processes in parallel for each file is not a good idea?

You can run up to N borg processes in parallel (if N is your CPU core
count), but only if the target repo of each is a different one; otherwise
they will block each other.
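
For example (two subsets, two repos; all paths are placeholders):

borg create --compression 1 /backup/repo-a::run1 /data/db1 /data/db2 &
borg create --compression 1 /backup/repo-b::run1 /data/db3 /data/db4 &
wait   # both runs proceed concurrently, each against its own repository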



-- 

GPG ID: FAF7B393
GPG FP: 6D5B EF9A DD20 7580 5747 B70F 9F88 FB52 FAF7 B393

Re: [borgbackup] Borg speed tuning on large files

From: Alex Gorbachev
Date: 2015-08-28 @ 22:36
could not decode message

Re: [borgbackup] Borg speed tuning on large files

From: Thomas Waldmann
Date: 2015-08-29 @ 13:23
> Tool            | Parameters       | Data size (apparent) | Repo size  | Hrs | Ratio | C Rat | C MB/s
> gzip            | c3               | 2308843696           | 560376600  | 22  | 24%   | 4.1   | 7
> Attic First Run | default          | 2251760621           | 531964928  | 48  | 24%   | 4.2   | 3
> Attic Next Run  | default          | 2308843696           | 234398336  | 32  | 10%   | 9.9   | 2
> Borg First Run  | C0,19,23,21,4095 | 2330579192           | 2354907008 | 26  | 101%  | 1     | 25
> Borg Next Run   | C0,19,23,21,4095 | 2270686256           | 1341393408 | 18  | 59%   | 1.7   | 21
> Borg First Run  | C3,19,23,21,4095 | 2270686256           | 568351360  | 33  | 25%   | 4     | 5
> Borg Next Run   | C3,19,23,21,4095 | 2268472600           | 302165632  | 23  | 13%   | 7.5   | 4
> Borg Next Run   | C1,19,23,21,4095 | 2247244128           | 422037120  | 24  | 19%   | 5.3   | 5

Nice to see confirmation that we are quite a bit faster than Attic. :)

Hmm, should the last line read "Borg First Run ... C1"?

In general, to evaluate the speed, it might be easier to only do "first
runs", because then a known amount of data (== all input data) gets
processed every time.

In "next run", the amount of data actually needing processing might vary
widely, depending on how much change there is between first and next run.

BTW, note for other readers: the "Parameters" column can't be given that
way to borg, it needs to be (e.g.):
borg create -C1 --chunker-params 19,23,21,4095 repo::archive data

Or in 0.25:
borg create -C zlib,1 --chunker-params ....

> Here is a picture in case the text does not come through well:

Yeah, that looked better. :)

BTW, what you currently have in the C MB/s column is how many compressed
MB/s it actually writes to storage (and if that is a limiting factor, it
would be your target storage, not borg).

Maybe more interesting would be how much uncompressed data it can
process per second.

> Oddly, compression setting of 1 took longer than C3.

Either there is a mistake in your table or your cpu is so fast that
higher compression saves more time by avoiding I/O than it needs for the
better compression.

With 0.25.0 you could try:
- lz4 = superfast, but low compression
- lzma = slow/expensive, but high compression
- none - no compression, no overhead (this is not zlib,0 any more)
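
For example, with the 0.25.0 syntax (repo::archive and data are placeholders):

borg create -C lz4 repo::archive data    # very fast, modest compression
borg create -C lzma repo::archive data   # slow, high compression
borg create -C none repo::archive data   # no compression at all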

> C0 shows the actual dedup capability of this data.

Doesn't seem to find significant amounts of "internal" duplication
within a "first run". Historical dedup seems to work and help, though.

Does that match your expectations considering the contents of your files?

In case you measure again, keep an eye on CPU load.

>  My business goal here is to get
> the data in within a day, so about 12 hours or so.  

If you can partition your data set somehow into N pieces and use N
separate repos, you could save some time by running N borgs in parallel
(assuming your I/O isn't a bottleneck then).

N ~= core count of your CPU
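
A rough sketch of that idea (the /data/part* layout and the repo naming are
hypothetical, one partition per core):

N=$(nproc)                      # number of CPU cores
for i in $(seq 1 "$N"); do
    borg create -C lz4 /backup/repo$i::daily /data/part$i &
done
wait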

At some time in the future, borg might be able to do a similar thing via
internal multithreading, but that is not ready for production yet.

There are also some other optimizations possible in the code (using
different hashes, different crypto modes, ...) - we'll try making it
much faster.

-- 


GPG ID: FAF7B393
GPG FP: 6D5B EF9A DD20 7580 5747 B70F 9F88 FB52 FAF7 B393

Re: [borgbackup] Borg speed tuning on large files

From: Alex Gorbachev
Date: 2015-08-31 @ 03:27
Hi Thomas,

On Sat, Aug 29, 2015 at 9:23 AM, Thomas Waldmann <tw@waldmann-edv.de> wrote:
>> Tool            | Parameters       | Data size (apparent) | Repo size  | Hrs | Ratio | C Rat | C MB/s
>> gzip            | c3               | 2308843696           | 560376600  | 22  | 24%   | 4.1   | 7
>> Attic First Run | default          | 2251760621           | 531964928  | 48  | 24%   | 4.2   | 3
>> Attic Next Run  | default          | 2308843696           | 234398336  | 32  | 10%   | 9.9   | 2
>> Borg First Run  | C0,19,23,21,4095 | 2330579192           | 2354907008 | 26  | 101%  | 1     | 25
>> Borg Next Run   | C0,19,23,21,4095 | 2270686256           | 1341393408 | 18  | 59%   | 1.7   | 21
>> Borg First Run  | C3,19,23,21,4095 | 2270686256           | 568351360  | 33  | 25%   | 4     | 5
>> Borg Next Run   | C3,19,23,21,4095 | 2268472600           | 302165632  | 23  | 13%   | 7.5   | 4
>> Borg Next Run   | C1,19,23,21,4095 | 2247244128           | 422037120  | 24  | 19%   | 5.3   | 5
>
> Nice to see confirmation that we are quite a bit faster than Attic. :)
>
> Hmm, should the last line read "Borg First Run ... C1"?

Yes, I switched the [now obsolete] parameter to level 1 for a "next run".

>
> In general, to evaluate the speed, it might be easier to only do "first
> runs", because then a known amount of data (== all input data) gets
> processed every time.

But...in that case gzip beats all :).

>
> In "next run", the amount of data actually needing processing might vary
> widely, depending on how much change there is between first and next run.

Understood, though the point of dedup is to save space on
shared/unchanged data regions.  In my case the data is likely not as
similar: with 59% at no compression, it means we only found 41% of
"same data", whereas I know that in these databases 10% of change a
day is high.  So maybe I need to go chunk-size hunting.  For others
this will likely work more efficiently.
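
For example (the numbers here are only an illustration, nothing I have
tested yet; smaller target chunks should expose more shared regions, at
the cost of a bigger chunk index):

borg create -C lz4 --chunker-params 14,23,16,4095 repo::archive data   # mask bits 16 -> ~64 KiB average chunks instead of ~2 MiB with 21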

> BTW, note for other readers: the "Parameters" column can't be given that
> way to borg, it needs to be (e.g.):
> borg create -C1 --chunker-params 19,23,21,4095 repo::archive data
>
> Or in 0.25:
> borg create -C zlib,1 --chunker-params ....
>
>> Here is a picture in case the text does not come through well:
>
> Yeah, that looked better. :)
>
> BTW, what you currently have in the C MB/s column is how many compressed
> MB/s it actually writes to storage (and if that is a limiting factor, it
> would be your target storage, not borg).

Sorry, I should have commented: C stands for "computed", i.e. size
divided by time.  I assume storage is not an issue, as uncompressed
data can be pumped here at 50+ MB/s.

>
> Maybe more interesting would be how much uncompressed data it can
> process per second.
>
>> Oddly, compression setting of 1 took longer than C3.
>
> Either there is a mistake in your table or your cpu is so fast that
> higher compression saves more time by avoiding I/O than it needs for the
> better compression.

That makes sense, CPU on this box is quite powerful.

>
> With 0.25.0 you could try:
> - lz4 = superfast, but low compression
> - lzma = slow/expensive, but high compression
> - none - no compression, no overhead (this is not zlib,0 any more)

Started lz4 trials tonight, will update!

>
>> C0 shows the actual dedup capability of this data.
>
> Doesn't seem to find significant amounts of "internal" duplication
> within a "first run". Historical dedup seems to work and help, though.
>
> Does that match your expectations considering the contents of your files?

It's a big mystery - a highly esoteric database (think MUMPS :) - but I
know overall change is unlikely to exceed 10% of "business content"
per day.  So I am not finding the right chunk size yet.

>
> In case you measure again, keep an eye on CPU load.

I see borg taking 99% of one core and the load average in the 3-4
range, but other processes are working, so this may be a bit muddled;
I will observe at idle times.

>
>>  My business goal here is to get
>> the data in within a day, so about 12 hours or so.
>
> If you can partition your data set somehow into N pieces and use N
> separate repos, you could save some time by running N borgs in parallel
> (assuming your I/O isn't a bottleneck then).
>
> N ~= core count of your CPU
>
> At some time in the future, borg might be able to do a similar thing via
> internal multithreading, but that is not ready for production yet.

Understood, hard to do and make safe.  Thanks.

>
> There are also some other optimizations possible in the code (using
> different hashes, different crypto modes, ...) - we'll try making it
> much faster.

Much appreciated - I have a good high-stress, real-life playground to test this.

Alex

>
> --
>
>
> GPG ID: FAF7B393
> GPG FP: 6D5B EF9A DD20 7580 5747 B70F 9F88 FB52 FAF7 B393
>

Re: [borgbackup] Borg speed tuning on large files

From: Alex Gorbachev
Date: 2015-09-01 @ 11:29
<snip>

>>
>> With 0.25.0 you could try:
>> - lz4 = superfast, but low compression
>> - lzma = slow/expensive, but high compression
>> - none - no compression, no overhead (this is not zlib,0 any more)
>
> Started lz4 trials tonight, will update!

Indeed, with lz4 the compression speed is the fastest it has been.  On
the first run, data is 35% of original size (same 2TB volume) and it
took 17 hours to compress vs. the previous 33 with LZMA.
Computed speed is 12 MB/s vs. the previous 5 MB/s, and we are not at
all disk bound (we do 100+ MB/s network transfers from it).

I will run an incremental run shortly.

Thanks,
Alex

Re: [borgbackup] Borg speed tuning on large files

From: Thomas Waldmann
Date: 2015-09-01 @ 12:24
> Indeed, with lz4 the compression speed is the fastest it has been.  On
> the first run, data is 35% of original size (same 2TB volume) and it
> took 17 hours to compress vs. the previous 33 with LZMA.

Ah, that's impressive.

lz4 seems to like your data (usually it doesn't compress to 35%). :D

A zlib,1 comparison value would have been nice here (as that is the
fastest compression zlib can do [but usually slower than lz4]).
lzma is known to be rather slow (but high compression).

> Computed speed is 12 MB/s vs. the previous 5 MB/s, and we are not at
> all disk bound (we do 100+ MB/s network transfers from it).

If I divide 2TB original data by 17h backup time, I get 32MB/s.
Your data rate is based on the compressed data.
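
Roughly (both sizes are ballpark figures):

echo $(( 2000000 / (17 * 3600) ))   # ~2 TB of input, in MB, over 17 h -> 32 MB/s uncompressed
echo $((  700000 / (17 * 3600) ))   # ~35% of that actually written    -> 11, i.e. your ~12 MB/s figure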


-- 

GPG ID: FAF7B393
GPG FP: 6D5B EF9A DD20 7580 5747 B70F 9F88 FB52 FAF7 B393

Re: [borgbackup] Borg speed tuning on large files

From: Alex Gorbachev
Date: 2015-09-11 @ 22:27
Here is the latest round of benchmarks - lz4 is definitely a lot faster.

Tool           | Parameters        | Data size (apparent) | Repo size | Start              | End                | Hrs  | Ratio | C Rat | C MB/s
Borg First Run | lz4,19,23,20,4095 | 2256910752           | 782964864 | 8/31/2015 13:51:00 | 9/1/2015 7:12:00   | 17.4 | 35%   | 2.88  | 35
Borg Next Run  | lz4,19,23,20,4095 | 2238757296           | 436979584 | 9/1/2015 7:20:00   | 9/1/2015 21:58:00  | 14.6 | 20%   | 5.12  | 42
Borg First Run | lz4,17,22,19,4095 | 2265618240           | 791271040 | 9/2/2015 7:15:00   | 9/2/2015 23:37:00  | 16.4 | 35%   | 2.86  | 38
Borg Next Run  | lz4,17,22,19,4095 | 2284494216           | 396072576 | 9/3/2015 15:54:00  | 9/4/2015 6:58:00   | 15.1 | 17%   | 5.77  | 41
Borg Next Run  | lz4,17,22,19,4095 | 2320084880           | 374725248 | 9/4/2015 13:20:00  | 9/5/2015 4:08:00   | 14.8 | 16%   | 6.19  | 43
Borg First Run | lz4,19,23,21,4095 | 2358155568           | 811703168 | 9/5/2015 10:21:00  | 9/6/2015 3:13:00   | 16.9 | 34%   | 2.91  | 38
Borg Next Run  | lz4,19,23,21,4095 | 2345623288           | 387065856 | 9/6/2015 20:20:00  | 9/7/2015 11:52:00  | 15.5 | 17%   | 6.06  | 41

On Tue, Sep 1, 2015 at 8:24 AM, Thomas Waldmann <tw@waldmann-edv.de> wrote:

> > Indeed, with lz4 the compression speed is the fastest it has been.  On
> > the first run, data is 35% of original size (same 2TB volume) and it
> > took 17 hours to compress vs. the previous 33 with LZMA.
>
> Ah, that's impressive.
>
> lz4 seems to like your data (usually it doesn't compress to 35%). :D
>
> A zlib,1 comparison value would have been nice here (as that is the
> fastest compression zlib can do [but usually slower than lz4]).
> lzma is known to be rather slow (but high compression).
>
> > Computed speed is 12 MB/s vs. the previous 5 MB/s, and we are not at
> > all disk bound (we do 100+ MB/s network transfers from it).
>
> If I divide 2TB original data by 17h backup time, I get 32MB/s.
> Your data rate is based on the compressed data.
>
>
> --
>
> GPG ID: FAF7B393
> GPG FP: 6D5B EF9A DD20 7580 5747 B70F 9F88 FB52 FAF7 B393
>
>