Chunking issues

From: Jeremy Maitin-Shepard
Date: 2014-05-06 @ 22:51
I noticed a few issues related to encryption and chunking:

1. Encryption does not appear to adequately protect file size information:
small files are encoded as a single chunk, and therefore the exact size (of
the compressed data) is visible in the repository.

2. Even with larger files, there is no reason to believe the chunking is
cryptographically secure.  In fact, it is susceptible to a brute-force attack,
since it depends only on a 32-bit key.  This is understandable, since there
may not be any existing cryptographically secure chunking algorithms;
still, it is a concern.

Thus, an adversary with access to the encrypted repository can check if a
set of files with known sizes may be present in the repository.
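
As a rough illustration of such a check (a sketch only, with made-up numbers;
this only applies to files small enough to be stored as a single chunk):

from collections import Counter

def may_contain(observed_chunk_sizes, target_file_sizes):
    """True if every known (compressed) file size appears among the observed
    chunk payload sizes at least as often as in the target document set."""
    observed = Counter(observed_chunk_sizes)
    needed = Counter(target_file_sizes)
    return all(observed[size] >= count for size, count in needed.items())

# Made-up numbers: sizes read off an encrypted repository vs. the known
# compressed sizes of a particular document set.
repo_sizes = [1380, 2217, 4096, 9731, 2217, 512]
leak_sizes = [2217, 9731, 1380]
print(may_contain(repo_sizes, leak_sizes))   # True -> the set may be present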

It might be worth considering using the tarsnap chunking algorithm (the
code is not free to use/modify, but the algorithm is not patented) as I
believe it is designed to be as secure as possible and uses a much longer
key, though it also hasn't been analyzed by anyone other than the author.

Tarsnap avoids the problem of small chunks by combining small files and the
tails of files into a single chunk stream, which disguises file size to
some extent.  Obviously this adds a fair bit of additional complexity, and
depending on various trade-offs, may result in greater space usage in the
repository over time and/or require more time and I/O to perform a backup
even when no data has changed.

The easiest way to ensure privacy would be to directly encrypt the
repository segment files and index files (and then the data in the
repository would not need to be encrypted).  This would mean the repository
server mode could not be used securely, but perhaps it would be possible to
still access a repository over the network reasonably efficiently by doing
the following:
-  Segments are encrypted and written to the server as they are produced,
with an LRU cache on the client
-  It may be beneficial to use separate segments for file data and
metadata, as this would allow greater locality in accessing the metadata
-  Client maintains the repository index locally, and periodically uploads
an encrypted version to the server, but not after every commit; instead it
would only be done once the number of entries written has exceeded some
threshold relative to the size of the index (possibly it would never be
useful to upload the index).
-  Each segment would be written as two files rather than one, with one
file containing just the repository metadata (i.e. key, tag, offset,
checksum?), and the other containing the data.  (It is not clear whether
the checksum should go with the metadata or data.)
-  A client with a lost index or out-of-date index can efficiently rebuild
it by reading the index file from the server (if more recent than what is
present on the client) and any necessary segment metadata files.

This scheme would also work with a cloud storage provider.

The downside is that compaction has to be done entirely client side, and
coordinating access by multiple clients is more difficult.
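
A minimal sketch of the two-file segment idea (the fixed-width entry format
and field names here are made up, not Attic's actual segment format):

import hashlib
import struct

def write_segment(basename, entries):
    """Write one segment as <basename>.data (chunk payloads) and
    <basename>.meta (key, tag, offset, length, checksum per entry)."""
    with open(basename + ".data", "wb") as data, \
         open(basename + ".meta", "wb") as meta:
        offset = 0
        for key, tag, payload in entries:        # key: 32-byte chunk id
            data.write(payload)
            checksum = hashlib.sha256(payload).digest()
            meta.write(struct.pack("<32sBQQ32s", key, tag, offset,
                                   len(payload), checksum))
            offset += len(payload)

# A client with a lost index could rebuild it from the small .meta files
# alone, without transferring the much larger .data files.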

Re: [attic] Chunking issues

From: Cyril Roussillon
Date: 2014-05-07 @ 05:39
Just my two cents...

On 2014/05/07 00:51, Jeremy Maitin-Shepard wrote:
> I noticed a few issues related to encryption and chunking:
>
> 1. Encryption does not appear to adequately protect file size
> information: small files are encoded as a single chunk, and therefore
> the exact size (of the compressed data) is visible in the repository.
That is correct, but there is no way to know whether a small chunk is a whole
small file, the end of a large file, or simply a small chunk of a large
file. Still, statistically it does leak information.

> 2. Even with larger files, there is no reason to believe the chunking
> is crytographically secure.  In fact it susceptible to a brute-force
> attack since it only depends on a 32-bit key.  This is understandable
> since there may not be any existing cryptographically-secure chunking
> algorithms; still, it is a concern.
I had also considered that the simple xor of the table with the seed
could be too weak, then thought that 32 bits was enough, but you are
right that it could actually be attacked by brute force, even if I'm not
sure how to exploit it. An idea I had was to use a 256-bit key to
generate a permutation of the table (using the key-scheduling step of the
RC4 algorithm) instead of the xor (it would allow 256! possible tables
instead of 2^32). Actually it could be used in addition to the xor for more
obfuscation.
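
Roughly what I have in mind (just a sketch; the base table stands in for the
256-entry buzhash table, the function names are made up, and only the RC4
key-scheduling step is used, not the cipher itself):

def rc4_permutation(key):
    """Key-scheduling step of RC4: derive a secret permutation of 0..255
    from a key of up to 256 bytes (e.g. a 32-byte / 256-bit key)."""
    perm = list(range(256))
    j = 0
    for i in range(256):
        j = (j + perm[i] + key[i % len(key)]) % 256
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def keyed_table(base_table, key, seed=0):
    """Permute the 256-entry substitution table with the secret permutation,
    optionally still xoring each entry with the 32-bit seed on top of it."""
    perm = rc4_permutation(key)
    return [base_table[perm[i]] ^ seed for i in range(256)]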

-- Cyril Roussillon http://crteknologies.fr/

Re: [attic] Chunking issues

From: Jeremy Maitin-Shepard
Date: 2014-05-07 @ 06:23
One scenario in which protecting the file sizes could be important:

The user may wish to hide the fact that he possesses some particular set of
documents, i.e. a collection of leaked documents.  The local government may
know or be able to guess the precise file sizes of these documents, and if
the documents are made up of a large number of files, the sizes may provide
a reliable signature.  By searching the encrypted backups of a large number
of users, the local government may be able to identify the source of the
leak; alternatively, if a particular person is suspected, information about
the file sizes could provide strong evidence that the person does possess
the documents.

As far as modifying the chunking algorithm to make it more secure, I think
it is unwise to rely on the security of any new cryptographic algorithm
that has not been thoroughly analyzed.  Chunking seems to be quite a bit
more complicated a problem than many other cryptographic tasks, for which
many seemingly correct but insecure methods have been devised.

I think the assumption should be that the file sizes are leaked.  It would
be great if there were a way to keep the key/value repository itself
unencrypted without leaking file size information, but I'm not sure what it
is.  The only clearly safe method I can think of is to pad all chunks up to
the maximum size.  This would eliminate the problem at the cost of a large
(and certainly unacceptable) space overhead.  Directly encrypting the
repository segments is simple and safe but means a loss of some
functionality (compaction, indexing and repository integrity verification
without needing the encryption key).


Re: [attic] Chunking issues

From: Cyril Roussillon
Date: 2014-05-07 @ 13:07
could not decode message

Re: [attic] Chunking issues

From: Jeremy Maitin-Shepard
Date: 2014-05-07 @ 16:30
On May 7, 2014 6:11 AM, "Cyril Roussillon" <cyril42e@gmail.com> wrote:
>
> Actually tarsnap method of storing small files and tails of files looks
> perfectly safe as well. If you push them in a separate stream that you feed
> to the same chunking algorithm, all visible chunks sizes will be due to the
> chunking algorithm and not the plain files sizes (except the very last one
> of each archive maybe), and except implementation complexity I can't see
> any major drawback.

I believe the tarsnap scheme is as safe as the chunking algorithm.

> There will be greater space usage of course, but not more than the one
> resulting of having an average chunk size of 64kB instead of less.
> Why do you think it would require more time or I/O when no data has
> changed ?

Currently, when processing an unchanged file present in the cache, Attic
doesn't have to open the file at all, since the cache specifies the list of
chunk ids for the file.  Attic can just verify that the chunks are still
present in the repository and write the list of chunk ids to the new archive.

With a tail stream, for any file with a tail, which might be some large
fraction of them, it is necessary to write the tail to the tail stream,
even if the file hasn't changed.  Tarsnap stores a compressed copy of the
tail in the cache to at least get better locality in disk access, but in
any case it is certainly more expensive than what Attic does now.

It might be possible to avoid having to process unchanged tails, but it
would require more complicated caching.  In particular, each cache entry
would include the sha256 hash of the tail stream's in-progress chunk before
and after the tail portion is added, as well as the id and length of any
chunks that are completed by the current cache entry.  For each cached file
encountered, if the tail stream state matches the hash in the cache, there
is no need to read or process the tail data.  However, if a mismatch is
found, it is necessary to go back and actually create the in-progress chunk
so that the new data can be added.  It would also be important to ensure
that the filesystem traversal happens in the same order each time, as far as
possible; this already matters for reusing metadata chunks, but it would be
especially important with a tail stream.
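
Roughly, each cache entry and the check could look something like this (a
sketch only; the names and types are made up, not Attic's actual cache
format):

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class TailCacheEntry:
    state_before: bytes   # sha256 of the in-progress tail chunk before this tail
    state_after: bytes    # sha256 of the in-progress tail chunk after this tail
    completed: List[Tuple[bytes, int]] = field(default_factory=list)
                          # (chunk id, length) of any chunks finished here

def process_cached_tail(entry: TailCacheEntry, current_state: bytes,
                        reprocess: Callable[[], Tuple[bytes, list]]):
    """If the tail stream is in the state recorded in the cache, the file's
    tail does not need to be read at all; otherwise fall back to re-reading
    the tail and rebuilding the in-progress chunk."""
    if current_state == entry.state_before:
        return entry.state_after, entry.completed   # cache hit: no file I/O
    return reprocess()                              # mismatch: go back and redo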

>
> I agree that keeping the key/value store unencrypted is very important
> for integrity checks without the key and the other functionalities you
> mentioned.

Note that it could still be possible to verify each segment file without
decrypting it, provided that an appropriate checksum or signature is
included.

>
> As for the chunking algorithm, of course you are right, but without being
> able to find a proven cryptographically-secure chunking algorithm, that's
> probably the best we can do. Even without a thorough analysis it's pretty
> clear that Attic chunking algorithm is more secure than rsyncrypto's one
> for instance (sum of past 8192 bytes equals 0 modulo 4096), and that
> increasing the size of the key and the number of combinations will reduce
> the efficiency of brute force attacks, and won't increase the efficiency of
> a smart attack.

If Attic used some hand-rolled encryption scheme instead of AES, or
hand-rolled authentication instead of HMAC, we wouldn't have any confidence
in it.  Likewise there is no reason to trust the chunking algorithm.

>
> Maybe it's worth asking on crypto.stackexchange.com to clarify about the
> existence of proven cryptographically-secure chunking algorithms, or the
> existence of algorithms that could be used to this end such as secure
> rolling HMAC or secure single byte-level encryption (which is probably
> impossible as it necessarily corresponds to ECB mode of operation). Maybe
> we could have input from experts about security of Attic's chunking
> algorithm as well.

Yes, definitely a good idea.

Re: [attic] Chunking issues

From: Cyril Roussillon
Date: 2014-05-07 @ 19:06
On 2014/05/07 18:30, Jeremy Maitin-Shepard wrote:
> I believe the tarsnap scheme is as safe as the chunking algorithm.

Is there a description of this scheme and of the chunking algorithm
somewhere on the Internet, or did you analyze the source code? I could
not find detailed information about it on their website.

> > There will be greater space usage of course, but not more than the
> one resulting of having an average chunk size of 64kB instead of less.
> > Why do you think it would require more time or I/O when no data has
> changed ?
>
> Currently, when processing an unchanged file present in the cache,
> Attic doesn't have to open the file at all, since the cache specifies
> the list of chunk ids for the file.  Attic can just verify that the
> chunks are still present in the repository and write the list of chunk
> ids to new archive.
>
> With a tail steam, for any file with a tail, which might be some large
> fraction of them, it is necessary to write the tail to the tail
> stream, even if the file hasn't changed.  Tarsnap stores a compressed
> copy of the tail in the cache to at least have better locality in disk
> access, but in any case it is certainly more expensive than what Attic
> does now.
>
> It might be possible to avoid having to process unchanged tails, but
> it would require more complicated caching.  In particular, each cache
> entry would include the sha256 hash of the tail stream's in-progress
> chunk before and after the tail portion is added, as well as the id
> and length of any chunks that are completed by the current cache
> entry.  For each cached file encountered, if the tail stream state
> matches the hash in the cache, there is no need to read or process the
> tail data.  However, if a mismatch is found, it is necessary to go
> back and actually create the in-progress chunk so that the new data
> can be added.  It would be especially important to ensure that the
> filesystem traversal happens in the same order each time as much as
> possible.  This is already important for reusing metadata chunks but
> would be especially important with a tail stream.
>

Why do you think it is necessary to write the tail to the tail stream
even if the file hasn't changed? If you don't process it at all and
reuse the chunk ids from the cache, incrementing the chunks' reference
counts, then everything should be valid, shouldn't it?
Only if you need to chunk the file again because it has changed would you
have to put its tail in the tail stream, because there is no way
to know whether the tail changed or not (except if the cache also
contained the hash of the tail, but it's probably not worth it).

The only modifications I was thinking of were that the cache and
metadata should refer to pairs of chunk id and subchunk reference
(instead of only chunk id), and that we would have to wait until the
tail stream generates a full chunk to know its id and write it back to
the metadata of all the previous files that need to refer to it (we can
leave room, push the addresses onto a stack, and bulk-fill them afterwards;
as we also have to wait until that moment to push the metadata into its
chunkifier, it is necessary to put a size limit on the pending metadata
to control memory usage).
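
Roughly, the bookkeeping could look like this (just a sketch; for simplicity
it cuts the tail stream at a fixed size instead of feeding it through the
content-defined chunker, and all names are made up):

class TailStream:
    """Collects small file tails into one stream.  Each file's metadata gets
    a placeholder; the real (chunk id, offset, length) reference is backfilled
    once the tail-stream chunk containing it has been completed and stored."""

    def __init__(self, store_chunk, target_size=65536):
        self.store = store_chunk   # writes one chunk, returns its chunk id
        self.target = target_size  # simplification: fixed cut size
        self.buffer = bytearray()
        self.pending = []          # (metadata_slot, offset, length) awaiting an id

    def add_tail(self, metadata_slot, tail):
        self.pending.append((metadata_slot, len(self.buffer), len(tail)))
        self.buffer += tail
        if len(self.buffer) < self.target:
            return
        # Cut on a tail boundary so no tail ever straddles two chunks.
        chunk_id = self.store(bytes(self.buffer))
        for slot, offset, length in self.pending:
            slot.subchunk = (chunk_id, offset, length)   # bulk backfill
        self.pending.clear()
        self.buffer.clear()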

A drawback is that you will lose deduplication for small files, which
should not hurt in general, but you can think of pathological cases: for
instance, if your whole repository is made of triplets of identical files
of 16kB, you won't have any deduplication, whereas you could have divided
the storage size by 3.

> >
> > I agree that keeping the key/value store unencrypted is very
> important for integrity checks without the key and the other
> functionalities you mentioned.
>
> Note that it could still be possible to verify each segment file
> without decrypting it provides that an appropriate checksum or
> signature is included.
>

Yes, but you could not detect a missing segment (though it might be
possible by adding a cleartext manifest of the segments).
Anyway, there are still the compaction and index-building problems.

Re: [attic] Chunking issues

From: Jonas Borgström
Date: 2014-05-07 @ 21:20
Hi Cyril and Jeremy, sorry for joining this discussion late.

First of all, if we're concerned that it would be too easy to brute-force
the chunk seed, it would be fairly easy to extend the seed to 64, 128 or
even 256 bits.

The tail length hiding might be a bit more difficult without breaking
compatibility and affecting performance too much. But more on that below.

On 2014-05-07 21:06, Cyril Roussillon wrote:
> 
> On 2014/05/07 18:30, Jeremy Maitin-Shepard wrote:
>> I believe the tarsnap scheme is as safe as the chunking algorithm.
> 
> Is there a description of this scheme and of the chunking algorithm
> somewhere on the Internet, or did you analyze the source code ? I could
> not find detailed information about it on their website.
> 
>>> There will be greater space usage of course, but not more than the
>> one resulting of having an average chunk size of 64kB instead of less.
>>> Why do you think it would require more time or I/O when no data has
>> changed ?
>>
>> Currently, when processing an unchanged file present in the cache,
>> Attic doesn't have to open the file at all, since the cache specifies
>> the list of chunk ids for the file.  Attic can just verify that the
>> chunks are still present in the repository and write the list of chunk
>> ids to new archive.
>>
>> With a tail steam, for any file with a tail, which might be some large
>> fraction of them, it is necessary to write the tail to the tail
>> stream, even if the file hasn't changed.  Tarsnap stores a compressed
>> copy of the tail in the cache to at least have better locality in disk
>> access, but in any case it is certainly more expensive than what Attic
>> does now.

Wouldn't that cache be huge? If you back up a couple of million files
with a couple of kB of tail data per file, things start to add up...

I don't know about tarsnap, but Attic loads the cache into RAM, so the
memory usage is already high without including any actual file data.

>> It might be possible to avoid having to process unchanged tails, but
>> it would require more complicated caching.  In particular, each cache
>> entry would include the sha256 hash of the tail stream's in-progress
>> chunk before and after the tail portion is added, as well as the id
>> and length of any chunks that are completed by the current cache
>> entry.  For each cached file encountered, if the tail stream state
>> matches the hash in the cache, there is no need to read or process the
>> tail data.  However, if a mismatch is found, it is necessary to go
>> back and actually create the in-progress chunk so that the new data
>> can be added.  It would be especially important to ensure that the
>> filesystem traversal happens in the same order each time as much as
>> possible.  This is already important for reusing metadata chunks but
>> would be especially important with a tail stream.
>>
> 
> Why do you think it is necessary to write the tail to the tail stream
> even if the file hasn't changed ? If you don't process it at all and
> reuse the chunk ids from the cache, incrementing the chunks reference
> count, then everything should be valid, shouldn't it ?
> Only if you need to chunkify the file again because it has changed, you
> would have to put its tail in the tail stream, because there is no way
> to know whether the tail changed or not (except if the cache also
> contained the hash of the tail indeed, but it's probably not worth it).
> 
> The only modifications I was thinking of were that the cache and
> metadata should refer to pairs of chunk id and subchunk reference
> (instead of only chunk id), and that we would have to wait until the
> tail stream generates a full chunk to know its id and write it back to
> all the previous files metadata that need to refer to it (we can leave
> the room, push the addresses in a stack, and bulk fill them afterwards ;
> as we also have to wait that moment to push the metadata into its
> chunkifier, it is necessary to put a size limit on the pending metadata
> to control memory usage).

Yeah, a separate tail stream might work, but it's hard to tell how
difficult it would be to implement the subchunk reference thing; I'll
have to give it some more thought.

> A drawback is that you will lose deduplication for small files, which
> should not hurt in general, but you can think of vicious cases : for
> instance if your whole repository is made of triplets of identical files
> of 16kB, you won't have any deduplication whereas you could have divided
> storage size by 3.

It would definitely hurt our deduplication ratio a bit but it's hard to
tell how much without doing some tests.


Since the tail stream approach will waste some extra disk space, how
about this completely different approach:

We alter the encryption envelope format to support including a
variable number of padding bytes to mask the real payload length.
Each "tail" chunk is padded with a random number of bytes in the [0, N]
interval.

But I have no idea how to figure out a safe value for N.

(As an added bonus this approach will not negatively affect unencrypted
repositories at all)
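
Something like this sketch (the length prefix and the value of N are
placeholders, and the actual encryption step is left out):

import os
import struct

MAX_PAD = 1024   # "N": maximum number of padding bytes added to a tail chunk

def pad(payload):
    """Prefix the real length and append 0..MAX_PAD random bytes.  The result
    would then go through whatever encryption envelope the repository already
    uses, so only the padded length is visible."""
    n = int.from_bytes(os.urandom(2), "big") % (MAX_PAD + 1)
    return struct.pack("<I", len(payload)) + payload + os.urandom(n)

def unpad(padded):
    (real_len,) = struct.unpack("<I", padded[:4])
    return padded[4:4 + real_len]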

What do you guys think about that?

/ Jonas

Re: [attic] Chunking issues

From: Jeremy Maitin-Shepard
Date: 2014-05-07 @ 22:19
On Wed, May 7, 2014 at 2:20 PM, Jonas Borgström <jonas@borgstrom.se> wrote:

> Hi Cyril and Jeremy, sorry for joining this discussion late.
>
> First of all, if we're concerned that it would be to easy to brute force
> the chunk seed it would be fairly easy to extend the seed to 64, 128 or
> even 256 bits.
>

Yes, that would probably be better, though I would prefer that there were
some cryptographic analysis of the rolling hash function.  I don't like the
grey area that the rolling checksum occupies in terms of security.  We
should either consider it completely insecure, or have reason to believe it
is as secure as AES/SHA-256.

Note: Tarsnap computes the coefficients as the first 4 bytes of HMAC('x' +
c) for character c in [0,256).  It also computes the other constants using
HMAC.  See tarsnap/tar/multitape/chunkify.c.  (Of course there is no
license to use the code.)

Also this presentation of his may be of interest as it relates to chunking,
though not to its security:

http://www.daemonology.net/papers/EuroBSDCon13.pdf


> The tail length hiding might be a bit more difficult without breaking
> compatibility and affecting performance too much. But more on that below.
>
> On 2014-05-07 21:06, Cyril Roussillon wrote:
> >
> > On 2014/05/07 18:30, Jeremy Maitin-Shepard wrote:
> >> I believe the tarsnap scheme is as safe as the chunking algorithm.
> >
> > Is there a description of this scheme and of the chunking algorithm
> > somewhere on the Internet, or did you analyze the source code ? I could
> > not find detailed information about it on their website.
> >
> >>> There will be greater space usage of course, but not more than the
> >> one resulting of having an average chunk size of 64kB instead of less.
> >>> Why do you think it would require more time or I/O when no data has
> >> changed ?
> >>
> >> Currently, when processing an unchanged file present in the cache,
> >> Attic doesn't have to open the file at all, since the cache specifies
> >> the list of chunk ids for the file.  Attic can just verify that the
> >> chunks are still present in the repository and write the list of chunk
> >> ids to new archive.
> >>
> >> With a tail steam, for any file with a tail, which might be some large
> >> fraction of them, it is necessary to write the tail to the tail
> >> stream, even if the file hasn't changed.  Tarsnap stores a compressed
> >> copy of the tail in the cache to at least have better locality in disk
> >> access, but in any case it is certainly more expensive than what Attic
> >> does now.
>
> Wouldn't that cache be huge? If you backup a couple of million files
> with a couple of kB tail data per file things start to add up...
>

It does seem the cache could get pretty large, particularly with small
files.  With tarsnap, not every file has a tail.  Only if the tail is less
than 4096 bytes does it get put on the tail stream.  I haven't looked at
exactly how the cache is stored, but it could be pretty efficient to access
it from disk as it could be stored in the traversal order.  Still,
definitely annoying.


>
> I don't know about tarsnap but Attic loads the cache into ram so the
> memory usage is high as it is without including any actual file data.
>
> >> It might be possible to avoid having to process unchanged tails, but
> >> it would require more complicated caching.  In particular, each cache
> >> entry would include the sha256 hash of the tail stream's in-progress
> >> chunk before and after the tail portion is added, as well as the id
> >> and length of any chunks that are completed by the current cache
> >> entry.  For each cached file encountered, if the tail stream state
> >> matches the hash in the cache, there is no need to read or process the
> >> tail data.  However, if a mismatch is found, it is necessary to go
> >> back and actually create the in-progress chunk so that the new data
> >> can be added.  It would be especially important to ensure that the
> >> filesystem traversal happens in the same order each time as much as
> >> possible.  This is already important for reusing metadata chunks but
> >> would be especially important with a tail stream.
> >>
> >
> > Why do you think it is necessary to write the tail to the tail stream
> > even if the file hasn't changed ? If you don't process it at all and
> > reuse the chunk ids from the cache, incrementing the chunks reference
> > count, then everything should be valid, shouldn't it ?
> > Only if you need to chunkify the file again because it has changed, you
> > would have to put its tail in the tail stream, because there is no way
> > to know whether the tail changed or not (except if the cache also
> > contained the hash of the tail indeed, but it's probably not worth it).
> >
> > The only modifications I was thinking of were that the cache and
> > metadata should refer to pairs of chunk id and subchunk reference
> > (instead of only chunk id), and that we would have to wait until the
> > tail stream generates a full chunk to know its id and write it back to
> > all the previous files metadata that need to refer to it (we can leave
> > the room, push the addresses in a stack, and bulk fill them afterwards ;
> > as we also have to wait that moment to push the metadata into its
> > chunkifier, it is necessary to put a size limit on the pending metadata
> > to control memory usage).
>
> Yeah, a separate tail stream might work but it's hard to tell how
> difficult it would be to implement the subchunk reference thing, I'll
> have to give it some more thought.
>

I think you could have a separate stream that stores the chunk ids (with
corresponding chunk lengths) for the tail stream (which we can call the
tail chunk stream), and in the archive metadata stream just write the
offset in the tail stream (or an offset into the tail chunk stream).


>
> > A drawback is that you will lose deduplication for small files, which
> > should not hurt in general, but you can think of vicious cases : for
> > instance if your whole repository is made of triplets of identical files
> > of 16kB, you won't have any deduplication whereas you could have divided
> > storage size by 3.
>
> It would definitely hurt our deduplication ratio a bit but it's hard to
> tell how much without doing some tests.
>
>
> Since the tail stream approach will waste some extra disk space how
> about this completely different approach:
>
> We alter the the encryption envelope format to support including a
> variable number of padding bytes to mask the real payload length.
> Each "tail" chunk is padded with a random number of bytes in the [0-N]
> interval.
>
> But I have no idea how to figure out a safe value for N.
>
> (As an added bonus this approach will not negatively affect unencrypted
> repositories at all)
>
> What do you guys think about that?
>

That certainly would be a very convenient solution as long as N could be
small relative to the chunk size.  Ideally there would be some form of
zero-knowledge guarantee: that the encrypted data is indistinguishable by
an attacker from the encryption of random data (except for some information
about the total amount of data).

With only a small amount of padding it seems unavoidable that you leak some
information: in particular, if you have a bunch of small files, you will
end up with a bunch of small chunks, whereas if you have long files you
will end up with a more even distribution of chunks.

Re: [attic] Chunking issues

From: Jonas Borgström
Date: 2014-05-08 @ 10:29
On 08/05/14 00:19, Jeremy Maitin-Shepard wrote:
> 
> On Wed, May 7, 2014 at 2:20 PM, Jonas Borgström <jonas@borgstrom.se
> <mailto:jonas@borgstrom.se>> wrote:
> 
>     Hi Cyril and Jeremy, sorry for joining this discussion late.
> 
>     First of all, if we're concerned that it would be to easy to brute force
>     the chunk seed it would be fairly easy to extend the seed to 64, 128 or
>     even 256 bits.
> 
> 
> Yes that would probably be better, though I would prefer if there were
> some cryptographic analysis of the rolling hash function.  I don't like
> the grey area that the rolling checksum occupies in terms of security. 
> We should either consider it completely insecure, or should have reason
> to believe it is as secure as AES/sha256. 

A cryptographic analysis would be ideal, but one is not easy to come
by. The same applies to rolling hash functions, as far as I know.

> Note: Tarsnap computes the coefficients as the first 4 bytes of HMAC('x'
> + c) for character c in [0,256).  It also computes the other constants
> using HMAC.  See tarsnap/tar/multitape/chunkify.c.  (Of course there is
> no liecnse to use the code.)
> 
> Also this presentation of his may be of interest as it relates to
> chunking, though not to its security:
> 
> http://www.daemonology.net/papers/EuroBSDCon13.pdf
> 
> 
>     The tail length hiding might be a bit more difficult without breaking
>     compatibility and affecting performance too much. But more on that
>     below.
> 
>     On 2014-05-07 21:06, Cyril Roussillon wrote:
>     >
>     > On 2014/05/07 18:30, Jeremy Maitin-Shepard wrote:
>     >> I believe the tarsnap scheme is as safe as the chunking algorithm.
>     >
>     > Is there a description of this scheme and of the chunking algorithm
>     > somewhere on the Internet, or did you analyze the source code ? I
>     could
>     > not find detailed information about it on their website.
>     >
>     >>> There will be greater space usage of course, but not more than the
>     >> one resulting of having an average chunk size of 64kB instead of
>     less.
>     >>> Why do you think it would require more time or I/O when no data has
>     >> changed ?
>     >>
>     >> Currently, when processing an unchanged file present in the cache,
>     >> Attic doesn't have to open the file at all, since the cache specifies
>     >> the list of chunk ids for the file.  Attic can just verify that the
>     >> chunks are still present in the repository and write the list of
>     chunk
>     >> ids to new archive.
>     >>
>     >> With a tail steam, for any file with a tail, which might be some
>     large
>     >> fraction of them, it is necessary to write the tail to the tail
>     >> stream, even if the file hasn't changed.  Tarsnap stores a compressed
>     >> copy of the tail in the cache to at least have better locality in
>     disk
>     >> access, but in any case it is certainly more expensive than what
>     Attic
>     >> does now.
> 
>     Wouldn't that cache be huge? If you backup a couple of million files
>     with a couple of kB tail data per file things start to add up...
> 
> 
> It does seem the cache could get pretty large, particularly with small
> files.  With tarsnap, not every file has a tail.  Only if the tail is
> less than 4096 bytes does it get put on the tail stream.  I haven't
> looked at exactly how the cache is stored, but it could be pretty
> efficient to access it from disk as it could be stored in the traversal
> order.  Still, definitely annoying.

Interesting, so the tail handling in tarsnap only helps to reduce
storage overhead for tiny files and does not prevent file size leaks?

>     I don't know about tarsnap but Attic loads the cache into ram so the
>     memory usage is high as it is without including any actual file data.
> 
>     >> It might be possible to avoid having to process unchanged tails, but
>     >> it would require more complicated caching.  In particular, each cache
>     >> entry would include the sha256 hash of the tail stream's in-progress
>     >> chunk before and after the tail portion is added, as well as the id
>     >> and length of any chunks that are completed by the current cache
>     >> entry.  For each cached file encountered, if the tail stream state
>     >> matches the hash in the cache, there is no need to read or
>     process the
>     >> tail data.  However, if a mismatch is found, it is necessary to go
>     >> back and actually create the in-progress chunk so that the new data
>     >> can be added.  It would be especially important to ensure that the
>     >> filesystem traversal happens in the same order each time as much as
>     >> possible.  This is already important for reusing metadata chunks but
>     >> would be especially important with a tail stream.
>     >>
>     >
>     > Why do you think it is necessary to write the tail to the tail stream
>     > even if the file hasn't changed ? If you don't process it at all and
>     > reuse the chunk ids from the cache, incrementing the chunks reference
>     > count, then everything should be valid, shouldn't it ?
>     > Only if you need to chunkify the file again because it has
>     changed, you
>     > would have to put its tail in the tail stream, because there is no way
>     > to know whether the tail changed or not (except if the cache also
>     > contained the hash of the tail indeed, but it's probably not worth
>     it).
>     >
>     > The only modifications I was thinking of were that the cache and
>     > metadata should refer to pairs of chunk id and subchunk reference
>     > (instead of only chunk id), and that we would have to wait until the
>     > tail stream generates a full chunk to know its id and write it back to
>     > all the previous files metadata that need to refer to it (we can leave
>     > the room, push the addresses in a stack, and bulk fill them
>     afterwards ;
>     > as we also have to wait that moment to push the metadata into its
>     > chunkifier, it is necessary to put a size limit on the pending
>     metadata
>     > to control memory usage).
> 
>     Yeah, a separate tail stream might work but it's hard to tell how
>     difficult it would be to implement the subchunk reference thing, I'll
>     have to give it some more thought.
> 
> 
> I think you could have a separate stream that stores the chunk ids (with
> corresponding chunk lengths) for the tail stream (which we can call the
> tail chunk stream), and in the archive metadata stream just write the
> offset in the tail stream (or an offset into the tail chunk stream).
>  
> 
> 
>     > A drawback is that you will lose deduplication for small files, which
>     > should not hurt in general, but you can think of vicious cases : for
>     > instance if your whole repository is made of triplets of identical
>     files
>     > of 16kB, you won't have any deduplication whereas you could have
>     divided
>     > storage size by 3.
> 
>     It would definitely hurt our deduplication ratio a bit but it's hard to
>     tell how much without doing some tests.
> 
> 
>     Since the tail stream approach will waste some extra disk space how
>     about this completely different approach:
> 
>     We alter the the encryption envelope format to support including a
>     variable number of padding bytes to mask the real payload length.
>     Each "tail" chunk is padded with a random number of bytes in the [0-N]
>     interval.
> 
>     But I have no idea how to figure out a safe value for N.
> 
>     (As an added bonus this approach will not negatively affect unencrypted
>     repositories at all)
> 
>     What do you guys think about that?
> 
> 
> That certainly would be a very convenient solution as long as N could be
> small relative to the chunk size.  Ideally there would be some form of
> zero-knowledge guarantee: that the encrypted data is indistinguishable
> by an attacker from the encryption of random data (except for some
> information about the total amount of data).
> 
> With only a small amount of padding it seems unavoidable that you leak
> some information: in particular, if you have a bunch of small files, you
> will end up with a bunch of small chunks, whereas if you have long files
> you will end up with a more even distribution of chunks.

Yeah, but that's difficult to achieve. Chunking is, for example, applied
before compression, so repositories containing mostly JPEG files will
have a higher average chunk size than repositories that contain more
easily compressible files such as source code.

/ Jonas

Re: [attic] Chunking issues

From: Cyril Roussillon
Date: 2014-05-08 @ 11:11
could not decode message

Re: [attic] Chunking issues

From: Jeremy Maitin-Shepard
Date: 2014-05-08 @ 20:15
On Thu, May 8, 2014 at 4:11 AM, Cyril Roussillon <cyril42e@gmail.com> wrote:

>
>  On 2014/05/08 00:19, Jeremy Maitin-Shepard wrote:
>
>
> On Wed, May 7, 2014 at 2:20 PM, Jonas Borgström <jonas@borgstrom.se>wrote:
>
>> Hi Cyril and Jeremy, sorry for joining this discussion late.
>>
>> First of all, if we're concerned that it would be to easy to brute force
>> the chunk seed it would be fairly easy to extend the seed to 64, 128 or
>> even 256 bits.
>>
>
>  Yes that would probably be better, though I would prefer if there were
> some cryptographic analysis of the rolling hash function.  I don't like the
> grey area that the rolling checksum occupies in terms of security.  We
> should either consider it completely insecure, or should have reason to
> believe it is as secure as AES/sha256.
>
> Note: Tarsnap computes the coefficients as the first 4 bytes of HMAC('x' +
> c) for character c in [0,256).  It also computes the other constants using
> HMAC.  See tarsnap/tar/multitape/chunkify.c.  (Of course there is no
> liecnse to use the code.)
>
>  Also this presentation of his may be of interest as it relates to
> chunking, though not to its security:
>
> http://www.daemonology.net/papers/EuroBSDCon13.pdf
>
>
> Actually after more thinking I think that the xor seed scheme may indeed
> have a weakness. By xoring the table with the seed d, you are changing in
> the wikipedia formula (
> http://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial) all terms
> like this :
> s^k[h(c)] -> s^k[h(c)+d]
> where s is the binary rotation and + the xor.
>
> But the binary rotation and xor have the following property :
> s[a+b] = s[a]+s[b]
>
> So if H(d) is the rolling hash with seed d, then :
> H(d) = H(0) + sumxor(s^i[d], i=0..k)
>
> The sum happens to be d (if even number of 1 bits in d) or ~d (if odd
> number of 1 bits in d) for k = 32 (because the rolling hash is 32 bits),
> and always d for k = n*64, including k=4096 that you are using.
>
> So knowing H(0), for instance with a known plain text attack, and knowing
> that the last 6 bits of H(d) are null (because the cut decision was taken),
> allows to recover the last 6 bits of the seed d which are the last 6 bits
> of H(0).
>

Good analysis.  Another worrisome property of buzhash is that for some byte
sequences the hash value is independent of the substitution chosen: for
instance, with a 64-byte window, any window whose two 32-byte halves are
identical (e.g. the bytes 0..31 repeated twice) is guaranteed to hash to 0.
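
This is easy to check with the textbook cyclic-polynomial formula over a
single 64-byte window (a sketch, not Attic's actual chunker code):

import random

def cyclic_poly_hash(window, table, bits=32):
    """Buzhash / cyclic polynomial over a whole window: xor of the table
    entries, each rotated by its distance from the end of the window."""
    mask = (1 << bits) - 1
    h = 0
    for i, byte in enumerate(window):
        r = (len(window) - 1 - i) % bits
        v = table[byte]
        h ^= ((v << r) | (v >> (bits - r))) & mask
    return h

# Rotation by 32 is the identity, so any 64-byte window whose two 32-byte
# halves are equal cancels out pairwise, regardless of the table contents.
window = bytes(range(32)) * 2
for _ in range(5):
    table = [random.getrandbits(32) for _ in range(256)]
    assert cyclic_poly_hash(window, table) == 0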


>
> That's not the whole seed, there may exist better attacks, so it would
> probably be a good idea to generate the table in a stronger way, for
> instance with HMAC of the indexes as Jeremy is suggesting. It still won't
> guarantee that it is not possible to recover the table values with a chosen
> text attack, or to infer something about the data from the cut points
> though.
>
>
> Jeremy, did you understand completely tarsnap's chunking algorithm from
> the source code ? I find it quite obscure and have difficulties to figure
> out the detailed scheme.
> What makes you believe it is more secure than buzhash with secret table ?
>
>
I have not taken the time to fully understand all of the arithmetic, but
basically the scheme is as follows: (there is some general information
starting on page 8 of that BSDCon pdf I linked)


Tarsnap is computing the Rabin-Karp hash.  The byte substitutions, alpha,
and the prime modulus p are all computed pseudorandomly.  However, instead of
using a fixed window size, it breaks the chunk if the hash equals zero for
any window size within a certain range.  According to the presentation,
this is for the purpose of producing a better distribution than the
exponential one that is obtained by using a single window size.

In the presentation he mentions using all window sizes, but tarsnap
actually limits the windows as follows:

k is the current length of the chunk
mu is the mean chunk length (a parameter) = 65536 in tarsnap

Only start computing hash after:  r = floor(sqrt(4 * k - mu)) > 0
This corresponds to k >= 16385 with the mu above.

Only windows starting at byte position 16385+1 are considered, and the
window length is limited to the range:
(w = 32) [inclusive] to (r + w) [exclusive]

Note that since r depends on the chunk length k, the range of window sizes
increases as the chunk gets longer.  Presumably this is done for its effect
on the distribution of chunk sizes.  With a maximum chunk length of 261120,
the maximum value of r is 989.
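
In code form, just transcribing the parameters above (an illustration of the
arithmetic only, not tarsnap's implementation):

import math

MU = 65536          # mean chunk length parameter
W = 32              # minimum window length
MAX_CHUNK = 261120  # maximum chunk length

def window_range(k):
    """Range of window lengths [W, r + W) tried once the chunk is k bytes."""
    if 4 * k <= MU:
        return None                 # too short: no cut point possible yet
    r = math.isqrt(4 * k - MU)
    return (W, r + W) if r > 0 else None

print(window_range(16384))      # None -> hashing has not started yet
print(window_range(16385))      # (32, 34)
print(window_range(MAX_CHUNK))  # (32, 1021) -> r reaches its maximum of 989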

Thus the minimum possible chunk length obtained due to the hash (as opposed
to due to running out of data in the file) is 16385+1+w.  In contrast,
chunks at least 4096 bytes long will be stored as is, rather than put on
the tail stream.

Some of these values seem to be artifacts of the implementation rather than
deliberately chosen values, but maybe I'll ask the author for clarification.

The points in favor of tarsnap's approach:
 - seeding done using a cryptographic pseudorandom sequence (but this is easy
to fix in Attic, as Jonas said)
 - Rabin-Karp seems to have fewer obvious weaknesses than buzhash, though
this isn't very well founded

In a private email exchange with Colin Percival several years ago, he said
that he doesn't have any proof of security but the best attack he could
come up with involved about 10^18 bytes of known plaintext data.  He also
said he didn't consider hiding known plaintext a goal for tarsnap.

Still, the sizes of files between 4096 and 16385+w bytes are directly
leaked.

Incidentally, he had the same idea as Jonas for masking the chunk sizes:
just adding random padding (not currently implemented in tarsnap).

> This solution seems a reasonable compromise between efficiency and
> difficulty of implementation.
> I think a value of N between 2 and 4 would be enough to mess up signatures
> (even if depending on the case fuzzy matches could still bring
> information), without generating too much wasted space.
>
>
2 - 4 bytes seems pretty small to achieve much effect.  Possibly it is a
lost cause to try to hide file sizes, though.

Re: [attic] Chunking issues

From: Jeremy Maitin-Shepard
Date: 2014-05-08 @ 22:41
One question is: independent of the issues of privacy, is it a problem to
have too many small chunks?  It seems the main cost is in maintaining the
repository index.  Also, with extremely small files, the space overhead could
add up.


In other words, could it be that the tail stream is important for reasons
other than masking file sizes?



Re: [attic] Chunking issues

From: Cyril Roussillon
Date: 2014-05-09 @ 07:18
could not decode message

Re: [attic] Chunking issues

From: Jeremy Maitin-Shepard
Date: 2014-05-09 @ 08:09
As I only just joined crypto.stackexchange.com, I don't have sufficient
privileges to add comments (only to suggest edits or post an answer, which
is not really what I want to do).

Encrypting the hash is just equivalent to comparing against some value
other than zero (and that would be a lot simpler and more efficient); it
effectively adds just one more parameter.  How useful this is, I don't know.

I don't think Rabin-Karp can be solved as a linear system, since the
modulus, the coefficients, and the value at which the polynomial is
evaluated are all unknown.  For tarsnap, even the window size is not known
precisely, only to within a range.  With a known plaintext, only some
equality relationships between the coefficients are known.


On Fri, May 9, 2014 at 12:18 AM, Cyril Roussillon <cyril42e@gmail.com> wrote:

>
> FYI I posted a question on crypto.stackexchange.com about the chunking
> function :
> 
> http://crypto.stackexchange.com/questions/16082/cryptographically-secure-keyed-rolling-hash-function
>
> There's already an answer, but I'm not sure it is valid.
>
> Let me know if you have ideas for improving the description of the problem.
>

Re: [attic] Chunking issues

From:
Cyril Roussillon
Date:
2014-05-09 @ 10:48
could not decode message

Re: [attic] Chunking issues

From:
Jeremy Maitin-Shepard
Date:
2014-05-09 @ 18:20
On Fri, May 9, 2014 at 3:48 AM, Cyril Roussillon <cyril42e@gmail.com> wrote:

>
> On 2014/05/09 10:09, Jeremy Maitin-Shepard wrote:
>
>  As I only just joined stackoverflow, I don't have sufficient privileges
> to add comments (only to suggest edits or post an answer, which is not
> really what I want to do).
>
> Encrypting the hash is just equivalent to comparing against some value
> other than zero (and that would be a lot simpler and more efficient); it
> effectively adds just one more parameter.  How useful this is, I don't know.
>
>
> I'm not sure it is equivalent:
> - with comparison to a secret value, if you once know the data in a window
> where it cut, you know its clear hash, so you know that the secret value is
> the last bits of this hash. Then every time the function triggers, you know
> that the last bits of the hash of the data are the (no longer secret) value.
> - with encryption of the hash, if you once know the data in a window where
> it cut, you know its clear hash, so you know one hash value that has an
> encrypted image with trailing zeros, but the next times the function triggers
> with unknown data it won't give you much information.
>


Okay, I agree.  I was only thinking of the case of tarsnap's method, where
the rolling hash is checked to equal exactly 0; the sum is evaluated modulo
p for a prime p chosen to be the right size to give the desired chunk
sizes.  Incidentally, the tarsnap prime is approximately 16777216 (2^24),
and the randomness only results in at most 256 different primes.

In the more general case of requiring only a prefix of the bits to be zero,
encrypting effectively means we split on a pseudorandom subset of bit
patterns of a given size (with tarsnap's method the size being 1).
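
To make the cut predicates floating around this thread concrete, here is a
minimal C sketch (names and constants are illustrative, not Attic's or
tarsnap's actual code): the plain mask test against zero, the same mask
test against a per-repository secret, and a tarsnap-style reduction modulo
a prime:

    #include <stdint.h>

    /* Attic-style: the low bits of the rolling sum must be zero. */
    static int cut_mask_zero(uint32_t sum, uint32_t mask)
    {
        return (sum & mask) == 0;
    }

    /* Variant: compare the low bits against a per-repository secret. */
    static int cut_mask_secret(uint32_t sum, uint32_t mask, uint32_t secret)
    {
        return (sum & mask) == (secret & mask);
    }

    /* Tarsnap-style: the Rabin-Karp sum, reduced modulo a prime p of
     * roughly the mean chunk size, must be exactly zero. */
    static int cut_mod_prime(uint64_t sum, uint64_t p)
    {
        return sum % p == 0;
    }

With the mask variants the expected chunk length is roughly 2^popcount(mask)
bytes; with the modulus variant it is roughly p bytes.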

> Even without trying to optimize the window size as tarsnap is doing or
> changing the underlying rolling hash algorithm, making the window size a
> secret fixed for a given repo would also add one unknown parameter. As
> there is no requirement for the window size to be a power of 2, instead of
> being 4096 it could be random between 4096-512 and 4096*512 for instance,
> which should not impact the deduplication too much.
>
That's true, though it would only add 10 bits.


>
> By the way, isn't there any risk that making the window size dependent on
> the data, as tarsnap does, will hurt the repeatability of chunks? If a
> modification is made at the beginning, the next cut point will be affected,
> but the one after that as well because it depends on the chunk length, i.e.
> the previous cut point, and so on. What proves that it will end up
> resynchronizing, and that it resynchronizes fast enough to prevent an
> unacceptable loss of deduplication efficiency?
>
It seems to me the case you are worried about is there being a lot of
long-window break points (L) followed closely by medium-window break points
(M) in the original data.  Originally, the long-window break points are
chosen and the medium-window break points are skipped because they occur
too soon after each previous break point.

i.e.

.....L..M...L...M...L...M...L...

Some modification causes the first L breakpoint not to be taken, and
instead the first M breakpoint is taken.  Consequently, the following L
occurs too soon after the M breakpoint to be considered, and in this way it
never synchronizes.

However, this same problem can occur with a fixed window size, if there is
a sequence of break points such that two consecutive break points are too
close to both be chosen, but every other break point can be chosen.

....A..B..A..B..A..B

It is hard to know to what extent the tarsnap method makes things worse.

Re: [attic] Chunking issues

From:
Jonas Borgström
Date:
2014-05-10 @ 12:51
On 2014-05-09 12:48, Cyril Roussillon wrote:
> 
> On 2014/05/09 10:09, Jeremy Maitin-Shepard wrote:
>> As I only just joined stackoverflow, I don't have sufficient
>> privileges to add comments (only to suggest edits or post an answer,
>> which is not really what I want to do).
>>
>> Encrypting the hash is just equivalent to comparing against some value
>> other than zero (and that would be a lot simpler and more efficient);
>> it effectively adds just one more parameter.  How useful this is, I
>> don't know.
> 
> I'm not sure it is equivalent:
> - with comparison to a secret value, if you once know the data in a window
> where it cut, you know its clear hash, so you know that the secret value
> is the last bits of this hash. Then every time the function triggers, you
> know that the last bits of the hash of the data are the (no longer secret)
> value.
> - with encryption of the hash, if you once know the data in a window
> where it cut, you know its clear hash, so you know one hash value that
> has an encrypted image with trailing zeros, but the next times the function
> triggers with unknown data it won't give you much information.
> 
> Actually, encrypting the hash with AES-ECB might be quite secure for our
> specific application, because we only reveal *part* of the ciphered data
> (in addition to revealing only a little ciphered data):
> - for a known-plaintext attack, knowing one pair of plain data + truncated
> ciphered image does not reveal a full entry of the dictionary, because a
> lot of other data could have the same truncated ciphered image.
> - for a statistical attack as well, the truncation of the ciphered image
> prevents a simple analysis. In addition the result of the rolling hash
> function might partially hide the plain data statistics.
> 
> It would probably be even better to combine the two approaches, as each
> one prevents an attack on the other one. For the same reason it would
> also be better to keep the initial substitution.
> 
> However I'm not sure that encrypting the hash won't be too slow, because
> we need to encrypt a 128-bit block for each byte, so it will be 16 times
> slower than raw AES. Still, we could use a weaker and faster algorithm,
> with a block size of 32 bits to avoid wasting time encrypting padding
> bytes. We would be back to the lack of strong guarantees, but it would
> certainly improve security by performing operations that a rolling hash
> cannot perform in order to remain a rolling hash.

I just did a quick test to get a feel about how this would affect the
performance. And I'm afraid it doesn't look too good...

On my laptop (Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz) the current
implementation has a throughput of about 300MB/s and applying a single
round of AES_encrypt (256 bit key) the speed drops down to 5.7 MB/s

My very hackish implementation with a hard coded key:

https://gist.github.com/jborg/2f4d6702201228be1de4
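
Roughly, the idea being measured is the following sketch (this is not the
gist itself, just the general approach; the AES_KEY would be prepared once
with AES_set_encrypt_key, and all names here are illustrative): the 32-bit
rolling sum is padded into a 16-byte block and encrypted once per input
byte, which is why the throughput collapses.

    #include <stdint.h>
    #include <string.h>
    #include <openssl/aes.h>

    /* Take the cut decision on AES(sum) instead of on the raw buzhash sum.
     * One full AES block is encrypted per input byte. */
    static int cut_on_encrypted_sum(uint32_t sum, const AES_KEY *key,
                                    uint32_t chunk_mask)
    {
        unsigned char in[16] = {0}, out[16];
        uint32_t secure_sum;

        memcpy(in, &sum, sizeof(sum));   /* 32-bit sum, zero-padded to a block */
        AES_encrypt(in, out, key);
        memcpy(&secure_sum, out, sizeof(secure_sum));
        return (secure_sum & chunk_mask) == 0;
    }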

/ Jonas

Re: [attic] Chunking issues

From:
Cyril Roussillon
Date:
2014-05-12 @ 20:16
On 2014/05/10 14:51, Jonas Borgström wrote:
> I just did a quick test to get a feel about how this would affect the
> performance. And I'm afraid it doesn't look too good...
>
> On my laptop (Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz) the current
> implementation has a throughput of about 300MB/s and applying a single
> round of AES_encrypt (256 bit key) the speed drops down to 5.7 MB/s
>
> My very hackish implementation with a hard coded key:
>
> https://gist.github.com/jborg/2f4d6702201228be1de4

I tested your implementation on an Intel(R) Core(TM) i7-4800MQ CPU @
2.70GHz with AES-NI:
- I get only 80MB/s for buzhash alone, which is strange because my CPU
is supposed to be faster than yours... (I commented out the AES
instructions and replaced them by "*secure_sum = *insecure_sum")
- I also get only 6.2 MB/s with the round of AES (corresponding to
100MB/s for raw AES). Even though I have AES-NI instructions and openssl
compiled with -march=native, they don't seem to be used.

Apparently you need to use the EVP API so that the processor is detected
at runtime and the AES-NI instructions used. I reach 700MB/s (which
makes the rolling hash run at 44MB/s).

Using a 128 bits key also slightly increases performance (140MB/s
without AES-NI, 900MB/s with them)
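
For reference, the EVP calls that enable the runtime AES-NI dispatch look
roughly like this; error handling is trimmed and this is only a sketch of
the interface, not the actual crypto.pyx change:

    #include <openssl/evp.h>

    /* Encrypt one 16-byte block through the EVP interface, which selects
     * the AES-NI code path at runtime when the CPU supports it. */
    static int evp_encrypt_block(const unsigned char key[16],
                                 const unsigned char in[16],
                                 unsigned char out[16])
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int outlen = 0, ok;

        ok = EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, key, NULL)
          && EVP_CIPHER_CTX_set_padding(ctx, 0)
          && EVP_EncryptUpdate(ctx, out, &outlen, in, 16);
        EVP_CIPHER_CTX_free(ctx);
        return ok && outlen == 16;
    }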


But I agree that we cannot rely on AES-NI instructions which are too
recent, and that even these speeds are still too low. I'm investigating
to use something faster than AES, for instance SipHash which would meet
the security requirements and make the rolling hash run at 51MB/s on my
machine without using any particular instruction set.




Re: [attic] Chunking issues

From:
Jeremy Maitin-Shepard
Date:
2014-05-12 @ 22:28
One unfortunate thing about encrypting or hashing the rolling checksum like
this is that it prevents using the multiple-window-size trick employed by
tarsnap.  The improved distribution obtained by using multiple window sizes
means that for a given average chunk size, a random single-byte change is
more likely to fall in a smaller size chunk than if only a single window
size is used; thus less new data has to be written.  I haven't computed
statistics for multiple-window chunking yet, but I think it may result in
only about 60%-70% as much data written as with a single window size (for a
random single-byte change).
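
For reference, a rough sketch of that multi-window rule as described
earlier in the thread; the constants come from that description, and the
window_hash() stand-in and everything else is illustrative rather than
tarsnap's actual code:

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    #define W  32        /* minimum window length */
    #define MU 65536     /* mean chunk length parameter */

    /* Stand-in for the real pseudorandomly keyed Rabin-Karp hash of the
     * window data[end-len .. end). */
    static uint64_t window_hash(const unsigned char *data, size_t end,
                                size_t len)
    {
        uint64_t h = 0;
        for (size_t i = end - len; i < end; i++)
            h = h * 257 + data[i] + 1;
        return h % 65521;
    }

    /* Nonzero if a cut should be taken after byte k of the current chunk:
     * every window length in [W, r + W) is checked, and the range grows
     * with k.  A real implementation would do this incrementally instead
     * of rehashing every window. */
    static int multi_window_cut(const unsigned char *data, size_t k)
    {
        long v = 4 * (long)k - MU;
        if (v <= 0)
            return 0;                      /* too early to start hashing */
        long r = (long)floor(sqrt((double)v));
        for (size_t len = W; len < (size_t)r + W && len <= k; len++)
            if (window_hash(data, k, len) == 0)
                return 1;
        return 0;
    }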

The same hash table trick that tarsnap uses can be used with Buzhash: if
you have the Buzhash of the strings a and ab, h(b) = h(a) xor h(ab).
Therefore you could just store h(x) & CHUNK_MASK in the hash table and wait
for a collision.  Naturally this isn't possible though if the hash is
encrypted in some way.



Re: [attic] Chunking issues

From:
Cyril Roussillon
Date:
2014-05-14 @ 16:12
could not decode message

Re: [attic] Chunking issues

From:
Jeremy Maitin-Shepard
Date:
2014-05-14 @ 17:27
On Wed, May 14, 2014 at 9:12 AM, Cyril Roussillon <cyril42e@gmail.com> wrote:

>
> If the goal is only to improve the distribution by increasing the
> probability of a cut when the chunk size increases, you could also make the
> chunk mask dependent on the chunk size (decreasing number of 1 bits :
> 0xffff, 0x7fff, 0x3fff, 0x1fff, 0x0fff, 0x0eff ...).
>
That's true, and certainly a lot simpler.
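
A sketch of that variant (the thresholds and starting mask are made-up
illustration values): the mask loses one bit each time the chunk grows past
another multiple of the target size, so the cut probability rises with the
chunk length.

    #include <stddef.h>
    #include <stdint.h>

    /* Chunk mask that loses one bit for every 64 kiB already accumulated
     * in the current chunk, down to a floor of 8 bits. */
    static uint32_t mask_for_length(size_t chunk_len)
    {
        uint32_t mask = 0xffff;            /* 16 bits -> ~64 kiB average */
        size_t steps = chunk_len / 65536;

        while (steps-- > 0 && mask > 0x00ff)
            mask >>= 1;
        return mask;
    }

    /* Cut decision: the low bits selected by the length-dependent mask
     * must all be zero. */
    static int cut(uint32_t sum, size_t chunk_len)
    {
        return (sum & mask_for_length(chunk_len)) == 0;
    }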

Re: [attic] Chunking issues

From:
Jeremy Maitin-Shepard
Date:
2014-05-15 @ 00:40
It seems to me that the only sure way to prevent watermarking attacks is to
ensure every chunk has the same size.  Since any window-based content
slicing technique can be manipulated by varying the repetitiveness vs.
randomness of the data, and compression potentially leaks information as
well, it seems impossible to ensure no information is leaked if the chunk
sizes are variable.

Unfortunately, padding each chunk up to the maximum size would waste a lot
of space.  A way around this is to compute and compress the chunks exactly
as is done currently.  However, instead of writing these compressed chunks
directly to the repository, where their sizes would be revealed, we break
each chunk up into a sequence of fixed-size sectors.  Only the last sector
needs to be zero-padded.  Each sector is then written to the repository as
a separate entry.  Provided that the chunk size is significantly larger
than the sector size, this doesn't waste too much space.  Since
deduplication still happens at the level of chunks, rather than sectors, it
does unfortunately mean that the size of the chunk index is multiplied by
the average number of sectors per chunk, and in return we get no
improvement in deduplication granularity.  Conveniently, the size of the
metadata would not necessarily need to increase.
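
A sketch of the sector idea (the sector size and the storage callback are
illustrative): the compressed, encrypted chunk is split into fixed-size
pieces and only the final piece is zero-padded, so the repository only ever
sees entries of a single size.

    #include <stddef.h>
    #include <string.h>

    #define SECTOR_SIZE 4096   /* illustrative */

    /* Stands in for whatever the repository layer would do with one
     * fixed-size sector. */
    typedef void (*store_sector_fn)(const unsigned char *sector, void *arg);

    /* Split an already compressed and encrypted chunk into fixed-size
     * sectors, zero-padding only the last one, so chunk lengths are not
     * visible in the repository.  Returns the number of sectors written. */
    static size_t store_chunk_as_sectors(const unsigned char *chunk,
                                         size_t len,
                                         store_sector_fn store, void *arg)
    {
        unsigned char sector[SECTOR_SIZE];
        size_t off, n = 0;

        for (off = 0; off < len; off += SECTOR_SIZE, n++) {
            size_t take = len - off < SECTOR_SIZE ? len - off : SECTOR_SIZE;
            memset(sector, 0, sizeof(sector));
            memcpy(sector, chunk + off, take);
            store(sector, arg);
        }
        return n;
    }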

Re: [attic] Chunking issues

From:
Jonas Borgström
Date:
2014-05-13 @ 10:58
On 12/05/14 22:16, Cyril Roussillon wrote:
> On 2014/05/10 14:51, Jonas Borgström wrote:
>> I just did a quick test to get a feel about how this would affect the
>> performance. And I'm afraid it doesn't look too good...
>>
>> On my laptop (Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz) the current
>> implementation has a throughput of about 300MB/s and applying a single
>> round of AES_encrypt (256 bit key) the speed drops down to 5.7 MB/s
>>
>> My very hackish implementation with a hard coded key:
>>
>> https://gist.github.com/jborg/2f4d6702201228be1de4
> 
> I tested your implementation on an Intel(R) Core(TM) i7-4800MQ CPU @
> 2.70GHz with AES-NI :
> - I get only 80MB/s for buzhash alone, which is strange because my CPU
> is supposed to be faster than yours... (I commented out the AES
> instructions and replaced them by "*secure_sum = *insecure_sum")

That's weird. It must be some kind of compiler optimization issue, can
you re-run this test on an unpatched version of Attic?

My original numbers were from an unpatched version of Attic compiled on
Arch Linux (gcc 4.9).

I also tested your method on a Macbook Pro (i7) and got 245MB/s.

> - I also get only 6.2 MB/s with the round of AES (corresponding to
> 100MB/s for raw AES). Even though I have AES-NI instructions and openssl
> compiled with -march=native, they don't seem to be used.
> 
> Apparently you need to use the EVP API so that the processor is detected
> at runtime and the AES-NI instructions used. I reach 700MB/s (which
> makes the rolling hash run at 44MB/s).

Ah, interesting. I guess we should update crypto.pyx to use the EVP api
then.

> Using a 128 bits key also slightly increases performance (140MB/s
> without AES-NI, 900MB/s with them)
> 
> 
> But I agree that we cannot rely on AES-NI instructions which are too
> recent, and that even these speeds are still too low. I'm investigating
> to use something faster than AES, for instance SipHash which would meet
> the security requirements and make the rolling hash run at 51MB/s on my
> machine without using any particular instruction set.

That's better but still probably slower than the IO system on most
systems. We should try to keep Attic IO bound instead of CPU bound for
unmodified data.

Unless we can figure out something really clever we probably have to
consider providing two different encryption modes, with or without a
cryptographically secure chunking.

But this "paranoid" mode should then probably also somehow include a
proper fix for the tail problem.

/ Jonas

Re: [attic] Chunking issues

From:
Cyril Roussillon
Date:
2014-05-13 @ 20:43
On 2014/05/13 12:58, Jonas Borgström wrote:
> On 12/05/14 22:16, Cyril Roussillon wrote:
>> On 2014/05/10 14:51, Jonas Borgström wrote:
>>> I just did a quick test to get a feel about how this would affect the
>>> performance. And I'm afraid it doesn't look too good...
>>>
>>> On my laptop (Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz) the current
>>> implementation has a throughput of about 300MB/s and applying a single
>>> round of AES_encrypt (256 bit key) the speed drops down to 5.7 MB/s
>>>
>>> My very hackish implementation with a hard coded key:
>>>
>>> https://gist.github.com/jborg/2f4d6702201228be1de4
>> I tested your implementation on an Intel(R) Core(TM) i7-4800MQ CPU @
>> 2.70GHz with AES-NI :
>> - I get only 80MB/s for buzhash alone, which is strange because my CPU
>> is supposed to be faster than yours... (I commented out the AES
>> instructions and replaced them by "*secure_sum = *insecure_sum")
> That's weird. It must be some kind of compiler optimization issue, can
> you re-run this test on an unpatched version of Attic?
>
> My original numbers were from an unpatched version of Attic compiled on
> Arch Linux (gcc 4.9).
>
> I also tested your method on a Macbook Pro (i7) and got 245MB/s.

I did it from a fresh clone and build of Attic and it was the same, but
then I noticed that there was no optimization flag in the gcc
compilation line:
x86_64-pc-linux-gnu-gcc -pthread -fPIC -I/usr/include/python3.3 -c
attic/chunker.c -o build/temp.linux-x86_64-3.3/attic/chunker.o

After manually recompiling with -O2 and relinking I got 340MB/s.

I'm compiling with "python setup.py build", and installing with "python
setup.py install".
Maybe it is because I'm not using pip?

By the way, by doing the modulo ( & 0x1f ) inside the formula for the
barrel shift, you are apparently preventing the compiler from recognizing
the barrel shift and using the "rol*" asm instructions. If I compute
the "& 0x1f" outside and assign it to a 32-bit variable, then it runs
at 420MB/s (and I checked that no rol instruction was used before, and
that they are used after). You can check what I did here:

http://crteknologies.fr/download/0001-chunker-move-the-modulo-out-of-the-barrel-shift-and-.patch
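
For reference, the rotate idiom that gcc turns into a single rol
instruction looks like this (a generic sketch, not the exact chunker.c
code); the point is that the rotate count is reduced to [0,31] in its own
variable before it appears in the shift expression:

    #include <stdint.h>

    /* Barrel rotate left by n bits.  The extra "& 0x1f" on the right
     * shift also avoids undefined behaviour when n is a multiple of 32. */
    static inline uint32_t rol32(uint32_t x, uint32_t n)
    {
        n &= 0x1f;
        return (x << n) | (x >> ((32 - n) & 0x1f));
    }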

>> - I also get only 6.2 MB/s with the round of AES (corresponding to
>> 100MB/s for raw AES). Even though I have AES-NI instructions and openssl
>> compiled with -march=native, they don't seem to be used.
>>
>> Apparently you need to use the EVP API so that the processor is detected
>> at runtime and the AES-NI instructions used. I reach 700MB/s (which
>> makes the rolling hash run at 44MB/s).
> Ah, interesting. I guess we should update crypto.pyx to use the EVP api
> then.
>
>> Using a 128 bits key also slightly increases performance (140MB/s
>> without AES-NI, 900MB/s with them)
>>
>>
>> But I agree that we cannot rely on AES-NI instructions which are too
>> recent, and that even these speeds are still too low. I'm investigating
>> to use something faster than AES, for instance SipHash which would meet
>> the security requirements and make the rolling hash run at 51MB/s on my
>> machine without using any particular instruction set.
> That's better but still probably slower than the IO system on most
> systems. We should try to keep Attic IO bound instead of CPU bound for
> unmodified data.

I optimized the reference implementation of SipHash to specialize it for
4-byte inputs and repeated calls with the same key and size, and it
runs at over 95MB/s in the secure rolling hash scenario (hashing four bytes
of data for each input byte).
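
The secure rolling hash scenario measured below works roughly like this
sketch (keyed_prf_fn stands for SipHash-2-4 from the reference
implementation, or one round of AES in the other rows, and the mask is
illustrative): the cheap 32-bit rolling sum is fed through a keyed PRF for
every input byte, and the cut decision is taken on the PRF output.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Keyed PRF over a short message with a 16-byte key, e.g. SipHash-2-4. */
    typedef uint64_t (*keyed_prf_fn)(const unsigned char *msg, size_t len,
                                     const unsigned char key[16]);

    /* Cut decision of the secure rolling hash scenario: only the low bits
     * of PRF(rolling sum) are examined, never the raw sum itself. */
    static int secure_cut(uint32_t rolling_sum, keyed_prf_fn prf,
                          const unsigned char key[16], uint64_t chunk_mask)
    {
        unsigned char buf[4];
        memcpy(buf, &rolling_sum, sizeof(buf));
        return (prf(buf, sizeof(buf), key) & chunk_mask) == 0;
    }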

Here is a summary of the performance of different solutions on different
machines (the test program is available at
http://crteknologies.fr/download/test_prf.zip), in MB/s for the secure
rolling hash scenario (but without the rolling hash itself):

              CPU1  CPU2  CPU3  CPU4
AES128         9.3   4.4   1.6   5.7
AES128-EVP    68.0   4.1   1.5  12.8
SipHash       56.4   8.4  12.9  37.0
SipHashOptim  97.5  12.0  25.3  57.5
(BuzHash)    340    64

CPU1 (laptop, modern high-end): Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz
CPU2 (laptop, old high-end)   : Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
CPU3 (server, modern low-end) : Intel(R) Atom(TM) CPU N2800 @ 1.86GHz
CPU4 (desktop, old high-end)  : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz

So it's still too slow on some platforms.

According to the comments on stackexchange there may be other fast block
ciphers or PRFs, but I didn't find anything else with a quick search. The
real question is how weak it could be: given that very little information
is disclosed, it probably does not have to be so strong.


> Unless we can figure out something really clever we probably have to
> consider providing two different encryption modes, with or without a
> cryptographically secure chunking.
>
> But this "paranoid" mode should then probably also somehow include a
> proper fix for the tail problem.

Yes, I agree that if the price of perfect security is too high, it can be a
separate mode, but then it really has to be perfect and fix the file
size leak as well.

Re: [attic] Chunking issues

From:
Jonas Borgström
Date:
2014-05-13 @ 21:41
On 2014-05-13 22:43, Cyril Roussillon wrote:
> 
> On 2014/05/13 12:58, Jonas Borgström wrote:
>> On 12/05/14 22:16, Cyril Roussillon wrote:
>>> On 2014/05/10 14:51, Jonas Borgström wrote:
>>>> I just did a quick test to get a feel about how this would affect the
>>>> performance. And I'm afraid it doesn't look too good...
>>>>
>>>> On my laptop (Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz) the current
>>>> implementation has a throughput of about 300MB/s and applying a single
>>>> round of AES_encrypt (256 bit key) the speed drops down to 5.7 MB/s
>>>>
>>>> My very hackish implementation with a hard coded key:
>>>>
>>>> https://gist.github.com/jborg/2f4d6702201228be1de4
>>> I tested your implementation on an Intel(R) Core(TM) i7-4800MQ CPU @
>>> 2.70GHz with AES-NI :
>>> - I get only 80MB/s for buzhash alone, which is strange because my CPU
>>> is supposed to be faster than yours... (I commented out the AES
>>> instructions and replaced them by "*secure_sum = *insecure_sum")
>> That's weird. It must be some kind of compiler optimization issue, can
>> you re-run this test on an unpatched version of Attic?
>>
>> My original numbers were from an unpatched version of Attic compiled on
>> Arch Linux (gcc 4.9).
>>
>> I also tested your method on a Macbook Pro (i7) and got 245MB/s.
> 
> I did it from a fresh clone and build of Attic and it was the same, but
> then I noticed that there was no optimization flag in the gcc
> compilation line :
> x86_64-pc-linux-gnu-gcc -pthread -fPIC -I/usr/include/python3.3 -c
> attic/chunker.c -o build/temp.linux-x86_64-3.3/attic/chunker.o
> 
> After manually recompiling with -O2 and relinking I got 340MB/s
> 
> I'm compiling with "python setup.py build", and installing with "python
> setup.py install".
> Maybe is it because I'm not using pip ?

That's weird, what does your "python3-config --cflags" output?

$ python3-config --cflags
-I/usr/include/python3.4m -I/usr/include/python3.4m  -Wno-unused-result
-Werror=declaration-after-statement -march=x86-64 -mtune=generic -O2
-pipe -fstack-protector --param=ssp-buffer-size=4
-DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -g -fwrapv -O3 -Wall
-Wstrict-prototypes


> By the way, by doing the modulo ( & 0x1f ) inside the formula for the
> barrel shift, you are apparently preventing the compiler to recognize
> the barrel shift and to use the "rol*" asm instructions. If I compute
> the "& 0x1f" outside and assign it to a 32 bits variable, then it runs
> at 420MB/s (and I checked that no rol instruction was used before, and
> that they are used after). You can check what I did here :
> 
http://crteknologies.fr/download/0001-chunker-move-the-modulo-out-of-the-barrel-shift-and-.patch

Cool, that's a nice performance boost. I've just merged that change, thanks!

> 
>>> - I also get only 6.2 MB/s with the round of AES (corresponding to
>>> 100MB/s for raw AES). Even though I have AES-NI instructions and openssl
>>> compiled with -march=native, they don't seem to be used.
>>>
>>> Apparently you need to use the EVP API so that the processor is detected
>>> at runtime and the AES-NI instructions used. I reach 700MB/s (which
>>> makes the rolling hash run at 44MB/s).
>> Ah, interesting. I guess we should update crypto.pyx to use the EVP api
>> then.

I pushed a change implementing that earlier. On my laptop AES-NI gives
me an impressive 20x performance boost on AES encryption (on a 64 kiB
message).
But since zlib compress() is the real bottleneck right now, this will
not translate into that much of an overall performance boost.

>>> Using a 128 bits key also slightly increases performance (140MB/s
>>> without AES-NI, 900MB/s with them)
>>>
>>>
>>> But I agree that we cannot rely on AES-NI instructions which are too
>>> recent, and that even these speeds are still too low. I'm investigating
>>> to use something faster than AES, for instance SipHash which would meet
>>> the security requirements and make the rolling hash run at 51MB/s on my
>>> machine without using any particular instruction set.
>> That's better but still probably slower than the IO system on most
>> systems. We should try to keep Attic IO bound instead of CPU bound for
>> unmodified data.
> 
> I optimized the reference implementation of SipHash to specialize it for
> 4 bytes inputs and repetitive calls with same key and same size, and it
> runs over 95MB/s in the secure rolling hash scenario (hashing four bytes
> of data for each input byte).
> 
> Here is a summary of the performance of different solutions on different
> machines (the test program is available at
> http://crteknologies.fr/download/test_prf.zip), in MB/s for secure
> rolling hash scenario (but without the rolling hash) :
> 
>               CPU1  CPU2  CPU3  CPU4
> AES128         9.3   4.4   1.6   5.7
> AES128-EVP    68.0   4.1   1.5  12.8
> SipHash       56.4   8.4  12.9  37.0
> SipHashOptim  97.5  12.0  25.3  57.5
> (BuzHash)    340    64
> 
> CPU1 (laptop, modern high-end): Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz
> CPU2 (laptop, old high-end)   : Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
> CPU3 (server, modern low-end) : Intel(R) Atom(TM) CPU N2800 @ 1.86GHz
> CPU4 (desktop, old high-end)  : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
> 
> So it's still too slow on some platforms.

Yeah, I had high hopes for SipHash but it seems a bit too slow to be our
default chunker, at least. Something that is 50% faster than SipHash
would have been awesome...

Excellent work on benchmarking and testing btw!


> Apparently according to the comments on stackexchange there may be other
> fast block ciphers or PRF, but I didn't find anything else with a quick
> search. The real question is how weak could it be, knowing that very
> little information is disclosed it probably does not have to be so strong.
> 
> 
>> Unless we can figure out something really clever we probably have to
>> consider providing two different encryption modes, with or without a
>> cryptographically secure chunking.
>>
>> But this "paranoid" mode should then probably also somehow include a
>> proper fix for the tail problem.
> 
> Yes I agree if the price of perfect security is too high, it can be a
> separate mode, but then it really has to be perfect and fix the file
> size leak as well.

agreed

Re: [attic] Chunking issues

From:
Jeremy Maitin-Shepard
Date:
2014-05-13 @ 21:46
One other idea that occurred to me:

Use any rolling checksum method to find the next chunk boundary.  However,
after finding the chunk boundary, compute an HMAC of the window used for
the rolling checksum (or possibly a subset of it, I'm not really sure what
the difference would be), and based on this HMAC, decide how much of the
window to include in the chunk.  In other words, suppose the chunk boundary
is found after n bytes, with a window size of w.  We compute
HMAC(data[n-w:n]) (using a separate HMAC key) and use that to sample the
actual chunk boundary position uniformly from [n-w,n].  As long as the
boundary position occurs somewhere in the window, this is very unlikely to
make the deduplication any worse.

It seems to me this would have a similar effect as adding a random amount
of padding during encryption, but with the advantage of not wasting any
space.  It also would add only negligible computation overhead over the
original chunking scheme.  However, it may not be as secure as using a
cryptographically secure pseudorandom function for deciding which checksum
values constitute chunk boundaries, as is obtained by using SipHash.
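
A sketch of that sampling step, using OpenSSL's one-shot HMAC with SHA-256
for concreteness (the key handling and names are illustrative, and the
small modulo bias is negligible for realistic window sizes):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <openssl/evp.h>
    #include <openssl/hmac.h>

    /* The rolling checksum declared a boundary after n bytes, using a
     * window of w bytes (n >= w).  Pick the actual cut position in
     * [n - w, n] from an HMAC over that window, keyed separately from the
     * chunking seed, so an attacker cannot predict the offset. */
    static size_t sample_cut_position(const unsigned char *data,
                                      size_t n, size_t w,
                                      const unsigned char *key, int key_len)
    {
        unsigned char mac[EVP_MAX_MD_SIZE];
        unsigned int mac_len = 0;
        uint64_t x = 0;

        HMAC(EVP_sha256(), key, key_len, data + n - w, w, mac, &mac_len);
        memcpy(&x, mac, sizeof(x));              /* first 8 bytes of the MAC */
        return (n - w) + (size_t)(x % (w + 1));  /* offset in [0, w] */
    }

Any position inside the window keeps the boundary data-dependent in the
same way, so deduplication behaviour should be essentially unchanged.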



Re: [attic] Chunking issues

From:
Jonas Borgström
Date:
2014-05-08 @ 15:46
On 08/05/14 13:11, Cyril Roussillon wrote:
> 
> On 2014/05/08 00:19, Jeremy Maitin-Shepard wrote:
>>
>> On Wed, May 7, 2014 at 2:20 PM, Jonas Borgström <jonas@borgstrom.se> wrote:
>>
>>     Hi Cyril and Jeremy, sorry for joining this discussion late.
>>
>>     First of all, if we're concerned that it would be too easy to brute
>>     force the chunk seed, it would be fairly easy to extend the seed to
>>     64, 128 or even 256 bits.
>>
>>
>> Yes that would probably be better, though I would prefer if there were
>> some cryptographic analysis of the rolling hash function.  I don't
>> like the grey area that the rolling checksum occupies in terms of
>> security.  We should either consider it completely insecure, or should
>> have reason to believe it is as secure as AES/sha256. 
>>
>> Note: Tarsnap computes the coefficients as the first 4 bytes of
>> HMAC('x' + c) for character c in [0,256).  It also computes the other
>> constants using HMAC.  See tarsnap/tar/multitape/chunkify.c.  (Of
>> course there is no license to use the code.)
>>
>> Also this presentation of his may be of interest as it relates to
>> chunking, though not to its security:
>>
>> http://www.daemonology.net/papers/EuroBSDCon13.pdf
> 
> Actually after more thinking I think that the xor seed scheme may indeed
> have a weakness. By xoring the table with the seed d, you are changing
> in the wikipedia formula
> (http://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial) all terms
> like this :
> s^k[h(c)] -> s^k[h(c)+d]
> where s is the binary rotation and + the xor.
> 
> But the binary rotation and xor have the following property :
> s[a+b] = s[a]+s[b]
> 
> So if H(d) is the rolling hash with seed d, then :
> H(d) = H(0) + sumxor(s^i[d], i=0..k)
> 
> The sum happens to be d (if even number of 1 bits in d) or ~d (if odd
> number of 1 bits in d) for k = 32 (because the rolling hash is 32 bits),
> and always d for k = n*64, including k=4096 that you are using.
> 
> So knowing H(0), for instance from a known-plaintext attack, and
> knowing that the last 6 bits of H(d) are null (because the cut decision
> was taken), makes it possible to recover the last 6 bits of the seed d,
> which are the last 6 bits of H(0).

I'm not sure I follow that but I guess that could be true.

> That's not the whole seed, there may exist better attacks, so it would
> probably be a good idea to generate the table in a stronger way, for
> instance with HMAC of the indexes as Jeremy is suggesting. It still
> won't guarantee that it is not possible to recover the table values with
> a chosen text attack, or to infer something about the data from the cut
> points though.

One easy and probably fairly safe method would be to extend the seed
from 32 bits to a 256-bit AES key.

The table would then be created by encrypting an empty (all-zero)
table using AES in CTR mode, the same way we do with our file data.

The AES CTR mode key stream is basically a random number generator.
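
A sketch of that table generation through the EVP interface (error checks
and real nonce handling are omitted; the zero IV is a placeholder only):

    #include <stdint.h>
    #include <string.h>
    #include <openssl/evp.h>

    /* Derive the 256-entry buzhash table from the AES-256-CTR keystream by
     * encrypting an all-zero buffer, so the table is determined entirely
     * by the secret key. */
    static int derive_buzhash_table(const unsigned char key[32],
                                    uint32_t table[256])
    {
        unsigned char zeros[256 * sizeof(uint32_t)] = {0};
        unsigned char stream[sizeof(zeros)];
        unsigned char iv[16] = {0};
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int outlen = 0, ok;

        ok = EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv)
          && EVP_EncryptUpdate(ctx, stream, &outlen, zeros, sizeof(zeros));
        EVP_CIPHER_CTX_free(ctx);
        if (!ok || outlen != (int)sizeof(zeros))
            return 0;
        memcpy(table, stream, sizeof(zeros));    /* keystream == table */
        return 1;
    }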


> Jeremy, did you completely understand tarsnap's chunking algorithm from
> the source code? I find it quite obscure and have difficulty figuring
> out the detailed scheme.
> What makes you believe it is more secure than buzhash with a secret table?
> 
> 
>>
>>     It would definitely hurt our deduplication ratio a bit but it's
>>     hard to tell how much without doing some tests.
>>
>>
>>     Since the tail stream approach will waste some extra disk space, how
>>     about this completely different approach:
>>
>>     We alter the encryption envelope format to support including a
>>     variable number of padding bytes to mask the real payload length.
>>     Each "tail" chunk is padded with a random number of bytes in the [0-N]
>>     interval.
>>
>>     But I have no idea how to figure out a safe value for N.
>>
>>     (As an added bonus this approach will not negatively affect
>>     unencrypted repositories at all)
>>
>>     What do you guys think about that?
>>
>>
>> That certainly would be a very convenient solution as long as N could
>> be small relative to the chunk size.  Ideally there would be some form
>> of zero-knowledge guarantee: that the encrypted data is
>> indistinguishable by an attacker from the encryption of random data
>> (except for some information about the total amount of data).
>>
>> With only a small amount of padding it seems unavoidable that you leak
>> some information: in particular, if you have a bunch of small files,
>> you will end up with a bunch of small chunks, whereas if you have long
>> files you will end up with a more even distribution of chunks.
> 
> This solution seems a reasonable compromise between efficiency and
> difficulty of implementation.
> I think a value of N between 2 and 4 would be enough to mess up
> signatures (even if depending on the case fuzzy matches could still
> bring information), without generating too much wasted space.

Probably, but it would be nice to have some kind of proof that this
actually makes a difference before implementing it.

/ Jonas

Re: [attic] Chunking issues

From:
Jeremy Maitin-Shepard
Date:
2014-05-07 @ 21:31
On Wed, May 7, 2014 at 12:06 PM, Cyril Roussillon <cyril42e@gmail.com> wrote:

>
> On 2014/05/07 18:30, Jeremy Maitin-Shepard wrote:
> > I believe the tarsnap scheme is as safe as the chunking algorithm.
>
> Is there a description of this scheme and of the chunking algorithm
> somewhere on the Internet, or did you analyze the source code ? I could
> not find detailed information about it on their website.
>

I just analyzed the source code.  There is no description except occasional
comments on his blog and on the tarsnap mailing list.


> Why do you think it is necessary to write the tail to the tail stream
> even if the file hasn't changed? If you don't process it at all and
> reuse the chunk ids from the cache, incrementing the chunks' reference
> counts, then everything should be valid, shouldn't it?
> Only if you need to chunkify the file again because it has changed would
> you have to put its tail in the tail stream, because there is no way
> to know whether the tail changed or not (except if the cache also
> contained the hash of the tail indeed, but it's probably not worth it).
>

You could do that, but then over time your space efficiency may go down:
you might have a 10-byte file contained in a 1MB chunk, with no remaining
archives referencing anything else in that chunk.  It seems to me that with
typical usage patterns this could well happen with a large fraction of
files.  You could try to take care of this by adding another layer of
compaction (which would require the encryption keys), but then it seems you
might as well do the simpler approach of encrypting the repository directly.


>
> The only modifications I was thinking of were that the cache and
> metadata should refer to pairs of chunk id and subchunk reference
> (instead of only chunk id), and that we would have to wait until the
> tail stream generates a full chunk to know its id and write it back to
> all the previous files' metadata that need to refer to it (we can leave
> room for them, push the addresses onto a stack, and bulk-fill them
> afterwards; as we also have to wait until that moment to push the metadata
> into its chunkifier, it is necessary to put a size limit on the pending
> metadata to control memory usage).
>

I think there are various approaches that can be used to avoid having to
delay writing the metadata.  I haven't looked carefully enough to see
exactly what tarsnap does, but basically you can write the chunk ids for
the tail stream into a separate metadata stream, and just write an offset
into the main archive metadata.


>
> A drawback is that you will lose deduplication for small files, which
> should not hurt in general, but you can think of pathological cases: for
> instance if your whole repository is made of triplets of identical 16kB
> files, you won't have any deduplication whereas you could have divided
> the storage size by 3.
>

Well, zlib might take care of deduplication within each chunk.