librelist archives

amount of chunks / scalability

From:
Thomas Waldmann
Date:
2015-04-01 @ 13:07
Just to keep you all updated:

Now that crypto and compression are more or less done (in merge-all), I was
looking for other things to improve (in speed and/or scalability).

I did some experiments with the chunker. Some stuff I found:

chunk size
========

attic's targeted chunk size is rather small (64KB), and so is the minimum
chunk size (1KB).

While this is good for getting the best deduplication, it is bad for
several other things:
- if you have a lot of data, it creates a huge number of chunks, which means
a big chunk cache and high memory usage (see the rough estimate below)
- we only have ~64KB to feed into the compression at a time. The blosc code I
used can parallelize internally, but then it chops that already rather small
chunk into multiple even smaller pieces. This adds quite a bit of overhead
and probably hurts compression, because the compression dictionary starts
from scratch for each piece and never gets much data to work with.
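
To put rough numbers on the first point (back-of-the-envelope only; the ~100
bytes of cache/index overhead per chunk is an assumption for illustration,
not a measured attic figure):

    # Back-of-the-envelope estimate; assumes ~100 bytes of index/cache
    # overhead per chunk, a guess for illustration, not a measured attic figure.
    TB = 1024 ** 4

    def cache_estimate(data_bytes, target_chunk_size, bytes_per_entry=100):
        chunks = data_bytes // target_chunk_size
        return chunks, chunks * bytes_per_entry

    print(cache_estimate(1 * TB, 64 * 1024))     # ~16.8M chunks, ~1.7 GB of index
    print(cache_estimate(1 * TB, 1024 * 1024))   # ~1.0M chunks, ~0.1 GB of index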

I did some chunking + compression benchmarks: the deduped, compressed output
size slowly crept upwards with increasing chunk size (more or less as
expected), but nothing spectacular. For bigger target sizes it produced far
fewer chunks, though.

So I am thinking about significantly increasing the target chunk size (e.g.
to 1MB) and also the minimum chunk size.
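
For illustration, here is a rough sketch of a content-defined chunker (NOT
attic's actual chunker code; the window size, hash and constants are made up)
just to show that the target size is essentially one parameter, the number of
hash bits that must match:

    # Sketch of content-defined chunking with a Rabin-Karp style rolling hash.
    # Not attic's chunker; WINDOW/BASE/MOD are illustrative values only.
    WINDOW = 48                 # bytes in the rolling window
    BASE, MOD = 257, (1 << 31) - 1

    def cut_points(data, mask_bits=16, min_size=1024):
        """Yield chunk end offsets; expected chunk size is ~2**mask_bits bytes."""
        mask = (1 << mask_bits) - 1
        base_w = pow(BASE, WINDOW, MOD)          # factor to drop the oldest byte
        h, start = 0, 0
        for i, b in enumerate(data):
            h = (h * BASE + b) % MOD             # roll the new byte in
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * base_w) % MOD  # roll the oldest byte out
            if i + 1 - start >= min_size and (h & mask) == mask:
                yield i + 1                      # low mask_bits all set -> cut here
                start = i + 1
        if start < len(data):
            yield len(data)

    # mask_bits=16, min_size=1K  -> ~64KB chunks (roughly the current defaults)
    # mask_bits=20, min_size=64K -> ~1MB chunks (the kind of change discussed here)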

chunker parallelism
===============

crypto and compression are super fast now (in the merge-all branch, overall
backup performance in my tests was about 3x that of original attic), but
somehow they don't get fed fast enough (compression is NOT cpu bound in my
measurements).

As I am working on SSDs, I suspect this is not due to slow I/O but rather to
the chunker being a bottleneck.

The chunker currently runs on one core, so if it maxes that core out, that's
it: compression won't put significant load on the other cores because it
simply does not get data fast enough (and the same goes for crypto).

Multi-threading likely would not help (Python's GIL), so I am considering
whether the multiprocessing module could help by creating multiple worker
processes. I ran a synthetic benchmark (just the chunker and the compressor,
nothing else) and saw more load on the CPU cores and some speedup.
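
Roughly the kind of synthetic setup I mean (a sketch, not the actual
benchmark code; zlib and fixed-size reads stand in for the real compressor
and chunker, and "testdata.img" is just a placeholder path):

    # Sketch only: workers compress chunks in parallel while the main
    # process keeps reading/"chunking".
    import multiprocessing as mp
    import zlib

    def compress(chunk):
        return zlib.compress(chunk, 6)

    def chunk_file(path, size=1024 * 1024):
        # stand-in "chunker": fixed-size reads instead of content-defined chunks
        with open(path, "rb") as f:
            while True:
                piece = f.read(size)
                if not piece:
                    break
                yield piece

    if __name__ == "__main__":
        with mp.Pool() as pool:              # one worker process per core by default
            sizes = pool.imap(compress, chunk_file("testdata.img"), chunksize=4)
            print(sum(len(c) for c in sizes))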

Things get a bit tricky in real-life attic, though, because attic already
uses quite a lot of memory for the chunks cache, and one would rather not
multiply that by 4, 8, 16 or more (the core count). So the master process
should hold all the caches in memory, while workers should just do simple
tasks and not keep their own copy of the cache.

I am also not sure how fast inter-process communication is and whether it is
suited to pumping lots of data through it. Maybe "remote access" to the
in-memory cache via inter-process communication could be a solution.
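
One way to prototype that with just the stdlib (a sketch only; the real
chunks cache stores more than a length, and the worker here is a toy):

    # Sketch: a multiprocessing.Manager keeps ONE copy of the chunk index in
    # a server process; workers access it via IPC proxies instead of copies.
    import hashlib
    import multiprocessing as mp

    def worker(chunk_index, jobs, results):
        for data in iter(jobs.get, None):            # None = stop marker
            chunk_id = hashlib.sha256(data).hexdigest()
            seen = chunk_id in chunk_index           # proxy call -> goes over IPC
            if not seen:
                chunk_index[chunk_id] = len(data)    # real code: refcount/size/csize
            results.put((chunk_id, seen))

    if __name__ == "__main__":
        with mp.Manager() as mgr:
            chunk_index = mgr.dict()                 # lives only in the manager process
            jobs, results = mp.Queue(), mp.Queue()
            procs = [mp.Process(target=worker, args=(chunk_index, jobs, results))
                     for _ in range(4)]
            for p in procs:
                p.start()
            # toy input; the repeated chunk can be detected in the shared index
            for data in (b"a" * 4096, b"b" * 4096, b"a" * 4096):
                jobs.put(data)
            for _ in procs:
                jobs.put(None)
            for _ in range(3):
                print(results.get())
            for p in procs:
                p.join()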

Re: [attic] amount of chunks / scalability

From:
Dan Williams
Date:
2015-04-01 @ 13:34
Hi Thomas

 

Interestingly, Obnam uses a larger chunk size by default, and it performs
poorly at de-duplication when compared to Attic and Bup. I suspect that may
be because of a different method - from what I understand, it does not use a
rolling hash - but that's just an observation.

 

However, another aspect of Obnam is that it allows you to tune various
parameters such as the chunk size and memory usage. This makes me a little
uneasy though, as I think it confuses users and it's hard to know how to get
the best experience without a lot of testing.

 

Ideally, perhaps Attic should adjust its parameters based on conditions? I
don't know if that's possible. For instance, if it finds itself running on a
system with a lot of memory available, could it decide that high memory
usage is okay, and on a lower-memory system, adjust differently? Or would
this cause problems later when running on different systems? (E.g., if a
repository is created on a machine with a lot of memory, would that prevent
later adding to it or using it from a low-memory system?)
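
Something along these lines is what I have in mind (a sketch only; the
thresholds are invented, and reading total RAM this way is Linux-specific):

    # Sketch of an adaptive default; thresholds are invented and the
    # os.sysconf() keys used here are Linux-specific.
    import os

    def pick_target_chunk_bits():
        """Choose a chunker target size (2**bits bytes) from installed RAM."""
        ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        if ram >= 8 * 1024**3:
            return 16        # >= 8GB RAM: keep small 64KB chunks for best dedup
        if ram >= 4 * 1024**3:
            return 18        # mid-range box: ~256KB chunks
        return 20            # low-memory box: ~1MB chunks, much smaller cache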

 

Equally, maybe it should adjust the chunk size based on the size of the
repository, or the amount of information de-duplicated? Again, I do not know
if that is possible to change on-the-fly.

 

I would say that for me personally, disk usage over multiple backups
(successful de-duplication of data shared between generations) is more
important than initial repository size, although I would like the size to be
as small as possible. I am not sure about speed vs size - again perhaps this
is like gzip where you should be able to choose. I have plenty of disk space
and don't really mind it being used, but I'd like to reduce the time of long
operations. But on other systems that may be different.

 

I have not noticed any memory usage issues at all, and I have also been
happy enough with the speed of Attic. There are some areas that I think
could be sped up, but it sounds like you have made strides in that direction
anyway. I'm really keen to try those improvements once they are deemed
safe/stable.

 

Regarding the parallelism, I don't like the sound of multiplying the memory
use! I don't deal with Python very much so I don't know about how its
threads stack up, but if you are looking to go multi-process I would agree
that the workers should not each hold a copy of the cache. What about using
an existing facility to store all of that? I don't know what's available to
use in this context, but systems like Redis or memcached would provide a
single cache store with fast access that would support multiple-worker
access. So maybe there's something along those lines that would be quick
enough and suitable for inclusion into the Attic codebase. This is assuming
that you can't utilise shared memory between the workers? I was sure Python
was able to do that.
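
For example, something like this (a sketch assuming the third-party redis-py
client and a local Redis server; the key layout is made up and is not
anything Attic does today):

    # Sketch: chunk index kept in one Redis server instead of per-worker
    # copies. Assumes the redis-py package and a Redis instance on localhost;
    # the chunk:<id>/refs:<id> key layout is invented for illustration.
    import hashlib
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def register_chunk(data):
        chunk_id = hashlib.sha256(data).hexdigest()
        # SET ... NX is atomic, so two workers cannot both claim a new chunk
        is_new = bool(r.set("chunk:" + chunk_id, len(data), nx=True))
        r.incr("refs:" + chunk_id)    # reference count for later pruning
        return chunk_id, is_new       # is_new -> this worker compresses/stores it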

 

I assume the chunker is a bottleneck simply because it has to go searching
the cache for matches, which it is doing sequentially. Hence using an
existing method as a central storage solution might work, or even better,
shared memory if that is possible.

 

Finally, my experience of inter-process communication is that it is very
quick, so that should not introduce any problems.

 

 

 


Re: [attic] amount of chunks / scalability

From:
Dmitry Astapov
Date:
2015-04-01 @ 14:02
My two cents:

I would love to see bigger chunk sizes supported by attic. I am backing up
0.5TB of data in 100K files on a regular basis, the chunk cache has already
grown to almost 1GB, and given that attic needs roughly twice that to keep it
in memory, my backup box is starting to feel the strain (it has 2GB of RAM
and routinely swaps during backups).
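
The arithmetic behind that (rough figures; the per-entry cost is just derived
from my observed cache size, not taken from attic's code):

    # Rough figures only; the per-entry cost below is derived from the
    # observed ~1GB cache, not taken from attic's code.
    data = 0.5 * 1024**4                      # 0.5 TB of source data
    chunks_64k = data / (64 * 1024)           # ~8.4 million chunks at 64KB targets
    per_entry = (1 * 1024**3) / chunks_64k    # ~1GB / 8.4M chunks ~= 128 bytes each
    chunks_1m = data / (1024 * 1024)          # ~0.5 million chunks at a 1MB target
    print(chunks_64k, per_entry, chunks_1m * per_entry)   # would shrink to ~64MB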

-- 
Dmitry Astapov