Speed of cache re-sync

From: Dmitry Astapov
Date: 2014-05-27 @ 20:59
Hi,

Can you please tell me:
1) Should I expect the "Initializing cache" process to be IO-bound or CPU-bound?
2) What are the ballpark figures for estimating how long it could take?
3) Are there any ways to speed it up?

My situation in full:

I have a repo that contains 153 backups of the same local fileset (so the
source and destination are on the same filesystem, no network/ssh involved).
The fileset contains roughly 70,000 files. The total size of the repo is 357 GB.
All of this is happening on a 2 GHz Intel Atom box with 1 GB of memory (a NAS
with a Debian chroot). Attic is built from git,
revision 3ab53b776d9226facab9494d8bf7c565e4fb3d08 (the latest as of 27th
May 2014).

Recently one of the backup runs was terminated (I killed the wrong process),
and on the next invocation attic started to re-sync the local cache.

Judging from the speed with which the first 10 backups were analysed, it would
be doing that for the next 6-7 days(!) - 9 archives were analysed in 8
hours so far, and I have 153 of them.

I am, naturally, curious as to what I could do to potentially speed the
process up (now, and in the future, if a similar situation happens again).

What is puzzling (to me) is that attic does not appear to be CPU-bound.
According to top, attic consumes 5-15% of the CPU and ~600 MB of memory,
and while there is quite a bit of IO happening, the box is definitely not
struggling. According to strace, attic is opening files from the data/ folder
in no particular order and reading from them at a measured pace, and
according to ltrace, there is a lot of memset/free going on between
easily explainable memcpy's and other byte-churning calls:

30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 memset(0xe2d66f88, '\000', 12)                   = 0xe2d66f88
30736 malloc(112)                                      = 0x084ee088
30736 memcpy(0xf6edc768, "acl", 3)                     = 0xf6edc768
30736 memcpy(0xf6edc540, "chunks", 6)                  = 0xf6edc540
30736 malloc(56)                                       = 0x08494130
30736 memset(0x08494130, '\000', 56)                   = 0x08494130
30736 malloc(12)                                       = 0x084d7650
30736 memset(0x084d7650, '\000', 12)                   = 0x084d7650
30736 memcpy(0xe2be8f48, "\023\254\376W ]X\364\306uX:\211\004\352E", 32) = 0xe2be8f48
30736 malloc(12)                                       = 0x085b3508
....
30736 free(0x085bd6b8)                                 = <void>
30736 free(0x084ee118)                                 = <void>
30736 free(0x084d4f60)                                 = <void>
30736 free(0x084bcc60)                                 = <void>
30736 free(0x08553a88)                                 = <void>
30736 free(0x084d74e0)                                 = <void>
30736 free(0x084e4370)                                 = <void>
30736 free(0x084e4940)                                 = <void>
30736 free(0x08535fe0)                                 = <void>
30736 free(0x08624a90)                                 = <void>
30736 free(0x084bcc80)                                 = <void>
30736 free(0x084ee238)                                 = <void>
30736 free(0x085b3518)                                 = <void>
30736 free(0x08568af8)                                 = <void>
30736 free(0x084f05b0)                                 = <void>
30736 free(0x084ea880)                                 = <void>
30736 free(0x084ee408)                                 = <void>
30736 free(0x085ac860)                                 = <void>

It is interesting to note that the repeated memset calls seem to happen once
for every block read, and that memory for decoding/processing a block is always
allocated in small chunks.

I am not familiar with Python at such depth, but maybe it is possible to
force bigger allocations to reduce the memory churn somewhat?

This process is not going anywhere for the next week, so if you want me to
run any debugging/tracing against it, I would happily do that.

-- 
Dmitry Astapov

Re: Speed of cache re-sync

From: Dmitry Astapov
Date: 2014-05-27 @ 22:06
Apologies, I just realized that I forgot to include the most interesting
piece of evidence which prompted me to ask my question about allocations.

I ran "ltrace -p <pid> -f -c" for a while, which collates the time spent in
various library calls, and after 5 minutes (measured by wall clock) I got
this (just the top of the output is included):

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
* 52.04  154.513758         166    927789 memset*
 16.68   49.526806         155    318898 malloc
 13.77   40.872722         131    310545 free
 13.21   39.219940         122    319394 memcpy
  1.69    5.024017       12560       400 read
  0.76    2.246910         434      5169 pthread_mutex_unlock
  0.74    2.199215       91633        24 open64
  0.73    2.164815         628      3446 pthread_cond_signal
  0.09    0.276356         457       604 inflate

I understand that some of these memsets are really necessary, but from
looking at the expanded ltrace output, most of them are repeated over and over
again, and it is kind of sad that they take 50% of the time ...



On Tue, May 27, 2014 at 9:59 PM, Dmitry Astapov <dastapov@gmail.com> wrote:

> Hi,
>
> Can you please tell me:
> 1)Should I expect "Initializing cache" process to be IO-bound or CPU-bound?
> 2)What's the ballpark figures for estimating how long it could take?
> 3)Any ways to speed it up?
>
> My situation in full:
>
> I have a repo that contains 153 backups of the same local fileset (so
> target and destination filesystem is the same, no network/ssh involved).
> Fileset containst roughly 70.000 files. Total size of the repo is 357 Gb.
> All is happening on 2gHz intel atom box with 1 gb of memory (NAS with
> debian chroot). Attic is built from git,
> revision 3ab53b776d9226facab9494d8bf7c565e4fb3d08 (the latest, as of 27th
> May 2014).
>
> Recently one of the backup runs was terminated (killed the wrong process)
> and on the next invocation attic started to re-sync the local cache.
>
> Judging from the speed with which first 10 backups were analysed, it would
> be doing that for the next 6-7 days(!) - 9 archives were analysed in 8
> hours so far, and i have 153 of them.
>
> I am, naturally, curious as to what I could do to potentially speed the
> process up (now, and in the future, if similar occasion happens again).
>
> What is puzzling (to me) is that attic does not appear to be cpu-bound.
> According to top, attic consumes 5-15% of the CPU and ~600 Mb of memory,
> and while there is quite a bit of IO happening, box is definitely not
> struggling. According to strace, attic is opening files from data/ folder
> in no particular order and reading from them at a measured pace, and
> according to ltrace, there is a lot of memset/free going one between
> easily-explainable memcpy's and other byte churning calls:
>
> *snip*
>
> Interesting to note that repeated memset calls seem to happen once for
> every block read, and memory for decoding/processing block is always
> allocated in small chunks.
>
> I am not familiar with python at such depth, but maybe it is possible to
> force bigger allocations to reduce memory churn somewhat?
>
> This process is not going anywhere for the next week, so if you want me to
> run any debugging/tracing against it, I would happily do that.
>
> --
> Dmitry Astapov
>



-- 
Dmitry Astapov

Re: [attic] Re: Speed of cache re-sync

From: Jonas Borgström
Date: 2014-05-28 @ 10:58
On 28/05/14 00:06, Dmitry Astapov wrote:
> Apologies, I just realized that I forgot to include the most interesting
> piece of evidence which prompted me to ask my question about allocations.
> 
> I ran "ltrace -p <pid> -f -c" for a while, which collates time spent in
> various library calls, and after 5 minutes (measured by wall clock) I
> got this (just top of the output included):
> 
> % time     seconds  usecs/call     calls      function
> ------ ----------- ----------- --------- --------------------
> * 52.04  154.513758         166    927789 memset*
>  16.68   49.526806         155    318898 malloc
>  13.77   40.872722         131    310545 free
>  13.21   39.219940         122    319394 memcpy
>   1.69    5.024017       12560       400 read
>   0.76    2.246910         434      5169 pthread_mutex_unlock
>   0.74    2.199215       91633        24 open64
>   0.73    2.164815         628      3446 pthread_cond_signal
>   0.09    0.276356         457       604 inflate
> 
> I understand that some of these memsets are really necessary, but from
> looking at expanded ltrace output, most of the are repeated over and
> over again, and it is kind of sad that they take 50% of the time ...

I guess what happened is that the Attic process was terminated in the
critical window between the repository commit operation and the cache
commit operation. This window is usually very short.

If this happens, Attic will detect that the cache is not in sync with the
repository and automatically start rebuilding it.

The current approach is very naive and first zeros all chunk reference
counters and then updates them by processing the metadata for all
archives in the repository.

This results in lots and lots of decryption, decompression and msgpack
parsing and is usually cpu-bound unless the repository IO is really slow.

I'm not sure where the memset calls come from, but I guess they happen inside
Python itself when creating all the different Python objects during
parsing of the msgpack-formatted archive metadata.
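
To give an idea of the shape of the work, the rebuild is conceptually something
like this (a simplified sketch only, not the actual Cache.sync code; here the
archives are assumed to be already fetched, decrypted and decompressed):

    import msgpack

    def rebuild_chunk_cache(archives, chunk_index):
        # archives: iterable of lists of msgpack-packed items, one list per
        # archive in the repository.
        # chunk_index: dict mapping chunk id -> (refcount, size, csize).

        # Step 1: zero every chunk reference counter.
        for chunk_id, (_, size, csize) in list(chunk_index.items()):
            chunk_index[chunk_id] = (0, size, csize)

        # Step 2: walk every archive and count every chunk reference again.
        for packed_items in archives:
            for packed in packed_items:
                # Every unpack creates fresh small Python objects (dicts,
                # lists, bytes) for each field, which could explain the
                # malloc/memset churn in the ltrace output.
                item = msgpack.unpackb(packed)
                chunks = item.get(b'chunks', item.get('chunks', []))
                for chunk_id, size, csize in chunks:
                    refcount, _, _ = chunk_index.get(chunk_id, (0, size, csize))
                    chunk_index[chunk_id] = (refcount + 1, size, csize)

With 153 archives of roughly 70,000 files each, that is tens of millions of
short-lived Python objects, which would be consistent with the malloc/memset
counts in your ltrace summary.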

> On Tue, May 27, 2014 at 9:59 PM, Dmitry Astapov <dastapov@gmail.com
> <mailto:dastapov@gmail.com>> wrote:
> 
>     Hi,
> 
>     Can you please tell me:
>     1)Should I expect "Initializing cache" process to be IO-bound or
>     CPU-bound?

Usually CPU-bound.
Is the repository placed on a local filesystem?
Is the IO wait high?

>     2)What's the ballpark figures for estimating how long it could take?

The re-sync linearly scans through all archives, so your estimate is
probably correct. But I've never seen such slow numbers.

>     3)Any ways to speed it up?

Not in the short term. Making the resync incremental (when possible) is
fairly high on my todo list, but that will probably not happen until
Attic 0.14.
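
One possible shape for that (just a sketch of the general idea, not a committed
design; the side-cache layout and parse_archive_chunks() below are made up) is
to keep a per-archive chunk summary on disk so a later resync only has to parse
archives it has never seen before:

    import os
    import pickle

    def incremental_resync(archive_ids, cache_dir, parse_archive_chunks):
        # archive_ids: hex ids of all archives currently in the repository.
        # parse_archive_chunks: the expensive step (fetch, decrypt, decompress,
        # msgpack-parse); returns a dict chunk_id -> (count, size, csize).
        totals = {}
        for archive_id in archive_ids:
            path = os.path.join(cache_dir, archive_id + '.chunks')
            if os.path.exists(path):
                # Archive was already analysed during an earlier sync.
                with open(path, 'rb') as f:
                    per_archive = pickle.load(f)
            else:
                # New archive: pay the full parsing cost exactly once.
                per_archive = parse_archive_chunks(archive_id)
                with open(path, 'wb') as f:
                    pickle.dump(per_archive, f)
            for chunk_id, (count, size, csize) in per_archive.items():
                refcount, _, _ = totals.get(chunk_id, (0, size, csize))
                totals[chunk_id] = (refcount + count, size, csize)
        return totals

Summaries for archives that no longer exist would of course need to be pruned
as well.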


>     My situation in full:
> 
>     I have a repo that contains 153 backups of the same local fileset
>     (so target and destination filesystem is the same, no network/ssh
>     involved). Fileset containst roughly 70.000 files. Total size of the
>     repo is 357 Gb. All is happening on 2gHz intel atom box with 1 gb of
>     memory (NAS with debian chroot). Attic is built from git,
>     revision 3ab53b776d9226facab9494d8bf7c565e4fb3d08 (the latest, as of
>     27th May 2014).
> 
>     Recently one of the backup runs was terminated (killed the wrong
>     process) and on the next invocation attic started to re-sync the
>     local cache.
> 
>     Judging from the speed with which first 10 backups were analysed, it
>     would be doing that for the next 6-7 days(!) - 9 archives were
>     analysed in 8 hours so far, and i have 153 of them.
> 
>     I am, naturally, curious as to what I could do to potentially speed
>     the process up (now, and in the future, if similar occasion happens
>     again). 
> 
>     What is puzzling (to me) is that attic does not appear to be
>     cpu-bound. According to top, attic consumes 5-15% of the CPU and
>     ~600 Mb of memory, and while there is quite a bit of IO happening,
>     box is definitely not struggling. According to strace, attic is
>     opening files from data/ folder in no particular order and reading
>     from them at a measured pace, and according to ltrace, there is a
>     lot of memset/free going one between easily-explainable memcpy's and
>     other byte churning calls:

*snip*

>     Interesting to note that repeated memset calls seem to happen once
>     for every block read, and memory for decoding/processing block is
>     always allocated in small chunks. 
> 
>     I am not familiar with python at such depth, but maybe it is
>     possible to force bigger allocations to reduce memory churn somewhat?
> 
>     This process is not going anywhere for the next week, so if you want
>     me to run any debugging/tracing against it, I would happily do that.

memset can't be the real problem if the cpu usage is only at 5-15%. It
looks like the IO system for some reason is not able to keep up with the
random IO requests generated by Attic.

Can you check if the system IO wait is high?

A bonnie++ test of the filesystem would also be interesting.

/ Jonas

Re: [attic] Re: Speed of cache re-sync

From: Dmitry Astapov
Date: 2014-05-29 @ 07:59
[This message could not be decoded by the archive; its content survives quoted
in the 2014-06-20 reply below.]

Re: [attic] Re: Speed of cache re-sync

From: Dmitry Astapov
Date: 2014-06-20 @ 21:05
Hello, Jonas.

Sorry for the "bump", but I wonder if you had a chance to see/evaluate my
patches (esp. delayed_refcount_updates)?


On Thu, May 29, 2014 at 8:59 AM, Dmitry Astapov <dastapov@gmail.com> wrote:

> On Wed, May 28, 2014 at 11:58 AM, Jonas Borgström <jonas@borgstrom.se>
> wrote:
>
>> On 28/05/14 00:06, Dmitry Astapov wrote:
>> > Apologies, I just realized that I forgot to include the most interesting
>> > piece of evidence which prompted me to ask my question about
>> allocations.
>> >
>> > I ran "ltrace -p <pid> -f -c" for a while, which collates time spent in
>> > various library calls, and after 5 minutes (measured by wall clock) I
>> > got this (just top of the output included):
>> >
>> > % time     seconds  usecs/call     calls      function
>> > ------ ----------- ----------- --------- --------------------
>> > * 52.04  154.513758         166    927789 memset*
>> >  16.68   49.526806         155    318898 malloc
>> >  13.77   40.872722         131    310545 free
>> >  13.21   39.219940         122    319394 memcpy
>> >   1.69    5.024017       12560       400 read
>> >   0.76    2.246910         434      5169 pthread_mutex_unlock
>> >   0.74    2.199215       91633        24 open64
>> >   0.73    2.164815         628      3446 pthread_cond_signal
>> >   0.09    0.276356         457       604 inflate
>> >
>> > I understand that some of these memsets are really necessary, but from
>> > looking at expanded ltrace output, most of the are repeated over and
>> > over again, and it is kind of sad that they take 50% of the time ...
>>
>> I guess what happened is that the Attic process was terminated in the
>> critical window between the repository commit operation and the cache
>> commit operation. This window is usually very short.
>>
>> If this happens Attic will detect that the cache is not in sync with the
>> repository and automatically starts rebuilding it.
>>
>
> I agree, it seems that it is exactly what happened.
>
>
>>
>> The current approach is very naive and first zeros all chunk reference
>> counters and then updates them by processing the metadata for all
>> archives in the repository.
>>
>> This results in lots and lots of decryption, decompression and msgpack
>> parsing and is usually cpu-bound unless the repository IO is really slow.
>>
>> I'm not sure where the memset comes from but I guess it's used by python
>> itself when creating all the different python objects created during
>> parsing of the msgpack formatted archive metadata.
>>
>
> I think it is coming from all the tuples created and destroyed inside
> "add" (in Cache.sync).
>
>>
>> > On Tue, May 27, 2014 at 9:59 PM, Dmitry Astapov <dastapov@gmail.com
>> > <mailto:dastapov@gmail.com>> wrote:
>> >
>> >     Hi,
>> >
>> >     Can you please tell me:
>> >     1)Should I expect "Initializing cache" process to be IO-bound or
>> >     CPU-bound?
>>
>> Usually CPU-bound.
>> Is the repository placed on a local filesystem?
>>
>
> Yes it is. The filesystem is ext4 on an LVM volume backed by 4 physical disks
> in raid5, no special mount options. I also tested a cache resync on a remote
> client (running on Linux, ext3 filesystem), and got similar behavior - low
> CPU load, slow cache rebuild speed. More on this below - I think I have a
> theory.
>
>
>> Is the IO wait high?
>>
>
> Yes it is (I was able to measure it better on the remote client). As I said,
> I think I have a theory - more on this later.
>
> [snip]
>
>> >     Interesting to note that repeated memset calls seem to happen once
>> >     for every block read, and memory for decoding/processing block is
>> >     always allocated in small chunks.
>> >
>> >     I am not familiar with python at such depth, but maybe it is
>> >     possible to force bigger allocations to reduce memory churn
>> somewhat?
>> >
>> >     This process is not going anywhere for the next week, so if you want
>> >     me to run any debugging/tracing against it, I would happily do that.
>>
>> memset can't be the real problem if the cpu usage is only at 5-15%. It
>> looks like the IO system for some reason is not able to keep up with the
>> random IO requests generated by Attic.
>>
>> Can you check if the system IO wait is high?
>>
>> A bonnie++ test of the filesystem would also be interesting.
>>
>>
> Here is my current theory: mmap'ed chunks cache + lots of small updates to
> it are killing me.
>
> It seems that if your IO consists mostly of flushing buffers from the FS
> cache, "top" and "vmstat" would not give you meaningful numbers to look at.
> Once I was able to get my hands on "iotop" on that box, the picture became
> clearer: according to iotop, attic was generating writes at 20-40 MB/sec,
> pretty much all the time, without breaks. On the other hand, I haven't seen
> any file-descriptor-related write calls in the strace output, so I went
> looking for mmap and found it in _hashindex.c.
>
> From the debugging output that I put in Cache.sync earlier I knew that my
> repo has 5756144 unique chunk ids, and that sharing between archives is
> pretty high, so each of these chunk ids would see lots of updates, and each
> update would touch a block in the hashindex. It looks like these blocks would
> be scattered all over the mmaped file, so the kernel would start flushing
> dirty blocks to disk aggressively. I observed this behavior with the same
> repo on two different boxes, both running Linux. The boxes had different
> hardware, kernels, underlying filesystems, disks and different Linux distros,
> so I am pretty sure that this is not a quirk of my local setup.
>
> To test my theory, I delayed all the updates of the chunks hashindex until
> the end of the sync with the patch sync_delayed_updates.patch (attached).
> With this change, scanning a single archive took 1 minute, but after all of
> them were scanned, attic ate quite a lot of memory and updating self.chunks
> took a loooooong time since the box started to swap quite a lot. I waited 3
> hours after the archives were scanned, then killed the process.
>
> My second test was to put chunks into the hashset as soon as they are seen,
> but defer refcount updates until the very end, with the patch
> sync_delayed_refcount_updates.patch (also attached). With this patch the
> whole cache sync was able to complete in ~5.5 hours without a noticeable
> increase in memory consumption and without excessive IO (there was some at
> the beginning, but as the process hits more and more chunks that were already
> seen, it levels off). This time, attic was clearly CPU-bound, as expected.
>
> I also tried putting the archive directory on tmpfs and running attic without
> any alterations - this alleviates the situation a lot (CPU load is higher,
> disks are not thrashing like crazy), but
> attic+sync_delayed_refcount_updates+cache dir on tmpfs easily beats it - I
> guess there is a price for aggressive buffer flushes even to tmpfs.
>
> I also tried a dirty hack on _hashindex.c where the mmap is done with
> MAP_PRIVATE, and hashindex_close just opens the file and fwrite's the index
> contents there. This has the most pronounced effect, which, in my mind,
> confirms my theory about mmap completely.
>
> Now, I understand why mmap of the chunk index is necessary - simplicity of
> the code, coherence of cache state in case the process is killed, etc.
> However, maybe it would be beneficial to be able to turn it off for the
> duration of Cache.sync?
>
> What do you think?
>
> --
> Dmitry Astapov
>
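
(For review convenience: the delayed_refcount_updates idea from the quoted mail
boils down to roughly the following. This is only an illustrative sketch with a
plain dict standing in for the hashindex, not the patch itself.)

    from collections import Counter

    def sync_with_delayed_refcounts(index, archives_chunk_refs):
        # index: chunk_id -> (refcount, size, csize); stands in for the
        # mmap-backed hashindex, where every write dirties a random page.
        # archives_chunk_refs: per archive, an iterable of (id, size, csize).
        pending = Counter()
        for chunk_refs in archives_chunk_refs:
            for chunk_id, size, csize in chunk_refs:
                if chunk_id not in index:
                    # Insert new ids right away so membership checks keep working...
                    index[chunk_id] = (0, size, csize)
                # ...but don't touch the index once per reference.
                pending[chunk_id] += 1
        # Apply all accumulated refcount updates in a single pass at the end.
        for chunk_id, count in pending.items():
            refcount, size, csize = index[chunk_id]
            index[chunk_id] = (refcount + count, size, csize)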



-- 
Dmitry Astapov

Re: [attic] Re: Speed of cache re-sync

From: Jonas Borgström
Date: 2014-06-21 @ 20:22
On 2014-06-20 23:05, Dmitry Astapov wrote:
> Hello, Jonas.
> 
> Sorry for the "bump", but I wonder if you had a chance to see/evaluate
> my patches (esp. delayed_refcount_updates)?

Hi Dmitry,

I've not had very much time to spend on Attic lately and your mail ended
up among many other things on my todo list, sorry about that.

Anyway, I think you're correct in that the mmap(MAP_SHARED) usage by the
chunk cache is less than ideal, and can cause a lot of unnecessary IO on
low memory systems.

I've never really noticed this issue myself, but that's probably because
most of my systems have plenty of ram and fast IO.

Anyway, as you and Jeremy suggested, simply switching to MAP_PRIVATE is
a quick way to get better control of the IO. But as Jeremy noted, the
cache directory transaction semantics also need to be reworked a bit.

Anyway, my current plan is to release Attic 0.13 soon(tm). After that I
might take a look at this. But before doing any more performance related
work like this I need to get a better test environment.

/ Jonas


Re: [attic] Re: Speed of cache re-sync

From: Jonas Borgström
Date: 2014-07-01 @ 21:11
On 2014-06-21 22:22, Jonas Borgström wrote:
> On 2014-06-20 23:05, Dmitry Astapov wrote:
>> Hello, Jonas.
>>
>> Sorry for the "bump", but I wonder if you had a chance to see/evaluate
>> my patches (esp. delayed_refcount_updates)?
> 
> Hi Dmitry,
> 
> I've not had very much time to spend on Attic lately and your mail ended
> up among many other things on my todo list, sorry about that.
> 
> Anyway, I think you're correct in that the mmap(MAP_SHARED) usage by the
> chunk cache is less than ideal, and can cause a lot of unnecessary IO on
> low memory systems.
> 
> I've never really noticed this issue myself, but that's probably because
> most of my systems have plenty of ram and fast IO.
> 
> Anyway, as you and Jeremy suggested, simply switching to MAP_PRIVATE is
> a quick way to get better control of the IO. But as Jeremy noticed the
> cache directory transaction semantics also needs to be reworked a bit.
> 
> Anyway, my current plan is to release Attic 0.13 soon(tm). After that I
> might take a look at this. But before doing any more performance related
> work like this I need to get a better test environment.

Hi again,

I've created an experimental non-mmap-based hashindex implementation, in case
someone would like to give it a try:

https://gist.github.com/jborg/03f8e8ea87ee701c59ad

The patch is still a bit rough around the edges but seems to work well.
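
For anyone wondering what "non-mmap based" means in practice, the general idea
is roughly the following (an illustration of the approach only, not the actual
code in the gist):

    import os
    import tempfile

    class BufferedIndex:
        # The whole index file lives in an in-memory buffer, so lookups and
        # updates never dirty mapped pages; the data hits the disk exactly
        # once, on close.
        def __init__(self, path):
            self.path = path
            with open(path, 'rb') as f:
                self.data = bytearray(f.read())

        def close(self):
            # Write to a temporary file and rename it into place so a killed
            # process can never leave a half-written index behind.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or '.')
            with os.fdopen(fd, 'wb') as f:
                f.write(self.data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, self.path)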

So far I've only tested it on my laptop and there the performance seems
to be unchanged. But hopefully we will see some improvements on
IO-starved systems...

/ Jonas

Re: [attic] Re: Speed of cache re-sync

From: Dmitry Astapov
Date: 2014-07-03 @ 21:57
Hello,

Regardless of your plans regarding this patch, I am definitely keeping it!
Thank you :)

I've tried three things:
1) Listing a medium-sized (70 GB) remote repo without a local cache
2) Backing up to a different remote repo from a client with lots of RAM (4 GB)
and no noticeable system load
3) Backing up to a biggish (550 GB) local repo on a different box with 1 GB of
RAM and some activity (LA 2-4, iowait 10%-20%)

I tested attic with and without the patch.
For (1) performance was the same (time measured with "time" + CPU and IO
load observed via "atop").
For (2) the version with the patch was about 5% faster, but it might've been a
measurement artefact.
However, for (3) I got drastic improvements.

The repo mentioned in (3) is the repo that started this whole thread. Just to
remind you, an attempt to re-sync the cache for that repo with unpatched attic
~2 months ago was taking about 1 hour per archive (and I had 150+ of
them). Then with my patches it went down to 1-2 minutes per archive + ~an hour
at the end to actually update the block counts in the mmaped cache.

With your patch the analysis proceeds at a speed of 0.5-1 minute per archive
(I moved the cache aside and started a cache resync, but I haven't waited for
all archives to be analyzed).

Then I tried running the backup. The source location for that repo usually has
little or no updates, and it nevertheless takes attic at least an hour to
process it. I've selected some recent runs where no files were updated,
and the durations were:
Duration: 2 hours 6 minutes 19.32 seconds
Duration: 1 hours 58 minutes 31.67 seconds
Duration: 2 hours 4 minutes 54.68 seconds
Duration: 1 hours 53 minutes 15.63 seconds
Duration: 1 hours 54 minutes 4.05 seconds
Duration: 1 hours 59 minutes 16.86 seconds
Duration: 2 hours 4 minutes 47.40 seconds
Duration: 2 hours 9 minutes 25.96 seconds
Duration: 2 hours 16 minutes 9.18 seconds
Duration: 2 hours 20 minutes 52.38 seconds
Duration: 2 hours 25 minutes 54.90 seconds

It is worth noting that attic would allocate 600 MB RSS and 1 GB+ VSS (on a
box with 1 GB of physical RAM), which probably accounts for the quite
aggressive dirty page writes for the mmapped cache, the high iowait and the
general slowness of the process.

The version with the patch, however, took much less time. Much, much less:
Duration: 4 minutes 30.43 seconds

I tried deleting some archives, and (as expected) the speed gain is noticeable
there as well: 2 minutes w/patch vs 30-40 minutes w/o patch.

So, as I said, regardless of whether you include this patch or not, I
am keeping it :)

Thanks a lot!

On Tue, Jul 1, 2014 at 10:11 PM, Jonas Borgström <jonas@borgstrom.se> wrote:

> On 2014-06-21 22:22, Jonas Borgström wrote:
> > On 2014-06-20 23:05, Dmitry Astapov wrote:
> >> Hello, Jonas.
> >>
> >> Sorry for the "bump", but I wonder if you had a chance to see/evaluate
> >> my patches (esp. delayed_refcount_updates)?
> >
> > Hi Dmitry,
> >
> > I've not had very much time to spend on Attic lately and your mail ended
> > up among many other things on my todo list, sorry about that.
> >
> > Anyway, I think you're correct in that the mmap(MAP_SHARED) usage by the
> > chunk cache is less than ideal, and can cause a lot of unnecessary IO on
> > low memory systems.
> >
> > I've never really noticed this issue myself, but that's probably because
> > most of my systems have plenty of ram and fast IO.
> >
> > Anyway, as you and Jeremy suggested, simply switching to MAP_PRIVATE is
> > a quick way to get better control of the IO. But as Jeremy noticed the
> > cache directory transaction semantics also needs to be reworked a bit.
> >
> > Anyway, my current plan is to release Attic 0.13 soon(tm). After that I
> > might take a look at this. But before doing any more performance related
> > work like this I need to get a better test environment.
>
> Hi again,
>
> I've created an experimental non-mmap based hashindex implementation if
> someone would like to give it a try:
>
> https://gist.github.com/jborg/03f8e8ea87ee701c59ad
>
> The patch is still a bit rough around the edges but seems to work well.
>
> So far I've only tested it on my laptop and there the performance seems
> to be unchanged. But hopefully we will see some improvements on
> IO-starved systems...
>
> / Jonas
>
>
>


-- 
Dmitry Astapov

Re: [attic] Re: Speed of cache re-sync

From: Jonas Borgström
Date: 2014-07-10 @ 13:51
On 2014-07-03 23:57, Dmitry Astapov wrote:
> Hello,
>
> Regardless of your plans regarding this patch, I am definitely keeping
> it! Thank you :)
>
> I've tried three things:
> 1)Listing medium-sized (70Gb) remote repo without local cache
> 2)Backing up to different remote repo from client with lots of ram
> (4Gb) and no noticeable system load
> 3)Backing in to biggish (550Gb) local repo on a different box with 1Gb
> of ram and some activity (LA 2-4, iowait 10%-20%)
>
> I tested attic with and without the patch
> For (1) performance was the same (time measured with "time" + cpu and
> io load observed via "atop")
> For (2) version with patch was about 5% faster, but it might've been a
> measurement artefact
> However, for (3) I got drastic improvements.
>
> Repo mentioned in (3) is the repo that started this whole thread. Just
> to remind you, attempt to re-sync cache for that repo with unpatched
> attic as of ~2 month ago was taking about 1 hour per archive (and I
> had 150+ of them). Then with my patches it went down to 1-2 minutes
> per archive + ~hour at the end to actually update block counts in
> mmaped cache.
>
> With your patch analysis proceed with a speed of 0.5-1 minute per
> archive (I moved cache aside, started cache resync, but I haven't
> waited for all archives to be analyzed).
>
> Then I tried running the backup. Source location for that repo usually
> has little or no updates, and it takes attic at least an hour to
> process it nevertheless. I've selected some recent runs where no files
> were updated, and durations were:
> Duration: 2 hours 6 minutes 19.32 seconds
> Duration: 1 hours 58 minutes 31.67 seconds
> Duration: 2 hours 4 minutes 54.68 seconds
> Duration: 1 hours 53 minutes 15.63 seconds
> Duration: 1 hours 54 minutes 4.05 seconds
> Duration: 1 hours 59 minutes 16.86 seconds
> Duration: 2 hours 4 minutes 47.40 seconds
> Duration: 2 hours 9 minutes 25.96 seconds
> Duration: 2 hours 16 minutes 9.18 seconds
> Duration: 2 hours 20 minutes 52.38 seconds
> Duration: 2 hours 25 minutes 54.90 seconds
>
> It is worth noting that attic would allocat 600Mb RSS and 1Gb+ VSS (on
> a box with 1 Gb physical ram), which probably accounts for quite
> aggressive dirty page writes for mmapped cache, high iowait and
> general slowness of the process.
>
> Version with the patch, however, took much less time. Much, much less:
> Duration: 4 minutes 30.43 seconds
>
> I tried deleting some archives, and (as expected) speed gain is
> noticeable there as well: 2 minutes w/patch vs 30-40 minutes w/o patch.
>
> So, as I said, regardless of whether you would include this patch or
> not, I am keeping it :)
>
> Thanks a lot!

That's excellent news, thanks for your help. I've just pushed this to
the git master branch to give this code some wider testing. But it sure
looks promising!

/ Jonas

Re: [attic] Re: Speed of cache re-sync

From: Dmitry Astapov
Date: 2014-07-10 @ 14:25
Wonderful!

Just so you know, I am still keeping an eye on the numbers, and the
improvements in time are not going away.


On Thu, Jul 10, 2014 at 2:51 PM, Jonas Borgström <jonas@borgstrom.se> wrote:

> On 2014-07-03 23:57, Dmitry Astapov wrote:
> > Hello,
> >
> > Regardless of your plans regarding this patch, I am definitely keeping
> > it! Thank you :)
> >
> > I've tried three things:
> > 1)Listing medium-sized (70Gb) remote repo without local cache
> > 2)Backing up to different remote repo from client with lots of ram
> > (4Gb) and no noticeable system load
> > 3)Backing in to biggish (550Gb) local repo on a different box with 1Gb
> > of ram and some activity (LA 2-4, iowait 10%-20%)
> >
> > I tested attic with and without the patch
> > For (1) performance was the same (time measured with "time" + cpu and
> > io load observed via "atop")
> > For (2) version with patch was about 5% faster, but it might've been a
> > measurement artefact
> > However, for (3) I got drastic improvements.
> >
> > Repo mentioned in (3) is the repo that started this whole thread. Just
> > to remind you, attempt to re-sync cache for that repo with unpatched
> > attic as of ~2 month ago was taking about 1 hour per archive (and I
> > had 150+ of them). Then with my patches it went down to 1-2 minutes
> > per archive + ~hour at the end to actually update block counts in
> > mmaped cache.
> >
> > With your patch analysis proceed with a speed of 0.5-1 minute per
> > archive (I moved cache aside, started cache resync, but I haven't
> > waited for all archives to be analyzed).
> >
> > Then I tried running the backup. Source location for that repo usually
> > has little or no updates, and it takes attic at least an hour to
> > process it nevertheless. I've selected some recent runs where no files
> > were updated, and durations were:
> > Duration: 2 hours 6 minutes 19.32 seconds
> > Duration: 1 hours 58 minutes 31.67 seconds
> > Duration: 2 hours 4 minutes 54.68 seconds
> > Duration: 1 hours 53 minutes 15.63 seconds
> > Duration: 1 hours 54 minutes 4.05 seconds
> > Duration: 1 hours 59 minutes 16.86 seconds
> > Duration: 2 hours 4 minutes 47.40 seconds
> > Duration: 2 hours 9 minutes 25.96 seconds
> > Duration: 2 hours 16 minutes 9.18 seconds
> > Duration: 2 hours 20 minutes 52.38 seconds
> > Duration: 2 hours 25 minutes 54.90 seconds
> >
> > It is worth noting that attic would allocat 600Mb RSS and 1Gb+ VSS (on
> > a box with 1 Gb physical ram), which probably accounts for quite
> > aggressive dirty page writes for mmapped cache, high iowait and
> > general slowness of the process.
> >
> > Version with the patch, however, took much less time. Much, much less:
> > Duration: 4 minutes 30.43 seconds
> >
> > I tried deleting some archives, and (as expected) speed gain is
> > noticeable there as well: 2 minutes w/patch vs 30-40 minutes w/o patch.
> >
> > So, as I said, regardless of whether you would include this patch or
> > not, I am keeping it :)
> >
> > Thanks a lot!
>
> That's excellent news, thanks for your help. I've just pushed this to
> the git master branch to give this code some wider testing. But it sure
> looks promising!
>
> / Jonas
>
>
>


-- 
Dmitry Astapov

Re: [attic] Re: Speed of cache re-sync

From: Jeremy Maitin-Shepard
Date: 2014-06-20 @ 22:05
I think you would do better to just do a proper implementation of the
MAP_PRIVATE-based approach. In fact this would incur no extra overhead,
because attic currently makes a copy of every hash table file into a
temporary file prior to opening it for writing and mmaping. This is
necessary for the transaction semantics. Therefore, a data copy is
already happening. Instead, you could just MAP_PRIVATE the existing file
without copying it, then when it is time to commit it, write the contents
to a new file.
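
In Python terms the idea is roughly this (a sketch of the approach only, not a
patch against _hashindex.c; mmap.ACCESS_COPY is Python's spelling of a private,
copy-on-write mapping):

    import mmap
    import os
    import tempfile

    def open_index_private(path):
        # Map the existing file copy-on-write: the mapping can be modified
        # freely, but no dirty pages are ever written back to the file itself.
        f = open(path, 'rb')
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
        return f, buf

    def commit_index(buf, path):
        # At commit time, dump the privately modified mapping to a new file
        # and atomically rename it into place.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
        with os.fdopen(fd, 'wb') as out:
            out.write(buf[:])
            out.flush()
            os.fsync(out.fileno())
        os.replace(tmp, path)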


On Fri, Jun 20, 2014 at 2:05 PM, Dmitry Astapov <dastapov@gmail.com> wrote:

> Hello, Jonas.
>
> Sorry for the "bump", but I wonder if you had a chance to see/evaluate my
> patches (esp. delayed_refcount_updates)?
>
>
> *snip*