Ensuring repository integrity

From:
Jeremy Maitin-Shepard
Date:
2014-05-18 @ 04:47
Right now, a client can verify that the entire repository state is
consistent with a manifest, but it must download all of the data in order
to do so.  The repository has CRC32 checksums for each entry, but there is
no way (without decrypting it) to verify its state as a whole, in
particular to verify that no entries have been lost.  That is, it would be
useful for the server to be able to compute a checksum of the repository as
a whole (without decrypting it), and for the client to be able to verify
that it is what is expected.  In combination with some sort of replication,
potentially it could be done in such a way as to allow for data recovery.
The client should be able to compute the whole-repository checksum based
only on the contents of the client-side cache, i.e. the metadata.

The usual way to compute a checksum of an unordered set seems to be to
either sum or xor the checksums of each element.  (Any commutative group
could be used; it might be better if every element is not its own inverse,
as in the case of xor.)  The checksum of each element would presumably be
something like the sha256(key|sha256(value)) (| denotes concatenation).
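
For concreteness, here is a minimal Python sketch of that construction,
assuming addition modulo 2**256 as the commutative group and
sha256(key | sha256(value)) as the per-entry checksum (none of this is
attic code, just an illustration):

import hashlib

MOD = 2 ** 256  # combine per-entry checksums by addition modulo 2**256

def entry_checksum(key: bytes, value: bytes) -> int:
    # sha256(key | sha256(value)), interpreted as an integer
    value_hash = hashlib.sha256(value).digest()
    return int.from_bytes(hashlib.sha256(key + value_hash).digest(), 'big')

def set_checksum(entries) -> int:
    # entries is any iterable of (key, value) pairs; because addition
    # commutes, the result does not depend on the order of the entries
    total = 0
    for key, value in entries:
        total = (total + entry_checksum(key, value)) % MOD
    return total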

Each segment could store its checksum (and the checksum of the whole
repository is just the sum of the checksum of the segments).

Each segment written as part of a compaction could specify which segments
are being superseded.  That way it could be verified that the sum of the
checksums of the new segments plus the superseded segments is equal to zero.
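
Continuing the same sketch (reusing MOD and set_checksum from above; all
of the names are hypothetical), the per-segment checksum, the
whole-repository checksum and the compaction check could look like this,
where "new plus superseded equals zero" is read as the two totals
cancelling out (with xor the two formulations coincide):

def segment_checksum(segment_entries) -> int:
    # the checksum stored alongside a segment: the sum over its own entries
    return set_checksum(segment_entries)

def repository_checksum(segment_checksums) -> int:
    # the whole-repository checksum is just the sum of the segment checksums
    return sum(segment_checksums) % MOD

def compaction_ok(new_segment_checksums, superseded_segment_checksums) -> bool:
    # the new segments must carry exactly the net content of the segments
    # they supersede, so the two totals have to cancel
    new = sum(new_segment_checksums) % MOD
    old = sum(superseded_segment_checksums) % MOD
    return (new - old) % MOD == 0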

Two main changes are needed to be able to compute these checksums:
1. Delete entries would need to store the appropriate hash of the data
being deleted.  (Simple enough.)
2. The client-side cache needs to store sha256(value) for every entry.
This is more troublesome since it means an extra 32 bytes per entry in the
cache, and all of the metadata that specifies keys would also need to
specify the extra 32-byte value hash (both changes are sketched below).
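
To make the above concrete, here is a hypothetical sketch (not attic's
actual format) of how the client side could maintain the expected
whole-repository checksum from its cache alone, with delete entries
carrying the value hash they remove:

import hashlib

MOD = 2 ** 256

class ChecksumTracker:
    """Hypothetical client-side bookkeeping for the expected checksum."""

    def __init__(self):
        self.total = 0
        self.value_hashes = {}  # key -> sha256(value): the extra 32 bytes per cache entry

    def put(self, key: bytes, value: bytes) -> None:
        value_hash = hashlib.sha256(value).digest()
        self.value_hashes[key] = value_hash
        h = int.from_bytes(hashlib.sha256(key + value_hash).digest(), 'big')
        self.total = (self.total + h) % MOD

    def delete(self, key: bytes) -> None:
        # a delete entry would need to store this value hash so the server
        # can subtract the same contribution without decrypting anything
        value_hash = self.value_hashes.pop(key)
        h = int.from_bytes(hashlib.sha256(key + value_hash).digest(), 'big')
        self.total = (self.total - h) % MOD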

The effective doubling of the metadata size does seem like a significant
barrier, but being able to track repository integrity on the server side
seems like an extremely useful feature.  Possibly a 128-bit hash could be
used instead.  I also don't know what theoretical guarantees there are for
a commutative hash of this type.

Re: [attic] Ensuring repository integrity

From:
Jonas Borgström
Date:
2014-05-18 @ 12:34
On 18/05/14 06:47, Jeremy Maitin-Shepard wrote:
> Right now, a client can verify that the entire repository state is
> consistent with a manifest, but it must download all of the data in
> order to do so.  The repository has CRC32 checksums for each entry, but
> there is no way (without decrypting it) to verify its state as a whole,
> in particular to verify that no entries have been lost.  That is, it
> would be useful for the server to be able to compute a checksum of the
> repository as a whole (without decrypting it), and for the client to be
> able to verify that it is what is expected.  In combination with some
> sort of replication, potentially it could be done in such a way as to
> allow for data recovery.  The client should be able to compute the
> whole-repository checksum based only on the contents of the client-side
> cache, i.e. the metadata.

The current server side "attic check --repository-only" check already is
smart enough to detect bit-rot and other types of corruption, in
addition to detecting extra or missing keys as a result of missing or
truncated segment files.
And best of all, it can do this without needing access to any encryption
keys or client cache/state.

So can you elaborate a bit more about the scenario you are thinking about?

> The usual way to compute a checksum of an unordered set seems to be to
> either sum or xor the checksums of each element.  (Any commutative group
> could be used; it might be better if every element is not its own
> inverse, as in the case of xor.)  The checksum of each element would
> presumably be something like the sha256(key|sha256(value)) (| denotes
> concatenation).
> 
> Each segment could store its checksum (and the checksum of the whole
> repository is just the sum of the checksum of the segments).
> 
> Each segment written as part of a compaction could specify which
> segments are being superseded.  That way it could be verified that the
> sum of the checksums of the new segments plus the superseded segments is
> equal to zero.
> 
> Two main changes are needed to be able to compute these checksums:
> 1. Delete entries would need to store the appropriate hash of the data
> being deleted.  (Simple enough.)
> 2. The client-side cache needs to store sha256(value) for every entry. 
> This is more troublesome since it means an extra 32-bytes per entry in
> the cache, and all of the metadata that specifies keys would also need
> to specify the extra 32-byte value hash.
> 
> The effective doubling of the metadata size does seem like a significant
> barrier, but being able to track repository integrity on the server side
> seems like an extremely useful feature.  Possibly a 128-bit hash could
> be used instead.  I also don't know what theoretical guarantees there
> are for a commutative hash of this type.

I guess that could work, but it sounds like an overly complex solution
that would both require a format change and double the memory
usage. Two things we really don't want to do unless there's a huge
upside.

So if we wanted to compare the contents of the client cache with the
repository, why not simply xor all keys in the client cache and compare
that with the xor'ed value of all keys in the repository index?
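
A minimal sketch of that key-only comparison (illustrative names, not
attic code):

from functools import reduce

def xor_keys(keys) -> bytes:
    # xor all 32-byte keys together; the result is order-independent
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), keys, bytes(32))

def cache_matches_index(client_cache_keys, repository_index_keys) -> bool:
    return xor_keys(client_cache_keys) == xor_keys(repository_index_keys)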

/ Jonas

Re: [attic] Ensuring repository integrity

From:
Jeremy Maitin-Shepard
Date:
2014-05-18 @ 18:50
On Sun, May 18, 2014 at 5:34 AM, Jonas Borgström <jonas@borgstrom.se> wrote:

>
> The current server side "attic check --repository-only" check already is
> smart enough to detect bit-rot and other types of corruption, in
> addition to detecting extra or missing keys as a result of missing or
> truncated segment files.
> And best of all, it can do this without needing access to any encryption
> keys or client cache/state.
>

It is true that Attic already does pretty well at detecting corruption,
particularly due to disk errors.

> So can you elaborate a bit more about the scenario you are thinking about?
>

My understanding is that Attic verifies that the index matches the
segments.  However, if there is a mismatch, all that can really be done is
to throw away and rebuild the index.  Trying to do anything more
complicated would seem to depend on heuristics and be tricky to get
right.  Therefore, I view that check as a way of verifying the index, but
not really the segments; the usual assumption with a log-structured
database is that the log is considered to be the "source of truth".

One source of corruption that the existing mechanism is not so effective at
detecting is bugs in the code.  While the repository code is admirably
short, there is still a constant churning of the repository data, and all
of the verification hinges on the assumption that the process is completely
correct.  In contrast, a checksum-based verification of the entire
repository means that you only have to trust the very simple checksum
computation code.  With a deduplicating backup system, it seems helpful to
be extra vigilant.

One scenario I'm thinking about is replication for the repository.  For one
thing, it might be best to do this via the segment files and not replicate
the index file directly, since the index can change a lot due to rehashing,
compaction, and the fact that there are a lot of random, small writes.
Rsyncing the index would also just replicate any corruption in it, and so
it still wouldn't really help in the event of corruption.  In any case, if
you did detect a difference between the replicas (maybe based on a list of
hashes of each segment file), it seems it would depend on various
heuristics to decide how to correct it, and it would be hard to be
confident that the correct state of the repository was really restored.
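
As a rough illustration of that kind of comparison (the directory layout
and names here are assumptions, not attic's actual on-disk format), one
could hash each segment file and diff the listings of the two replicas:

import hashlib
import os

def segment_file_hashes(repo_path):
    # map relative segment-file path -> sha256 of its contents
    hashes = {}
    data_dir = os.path.join(repo_path, 'data')  # assumed layout
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, 'rb') as f:
                hashes[os.path.relpath(path, repo_path)] = hashlib.sha256(f.read()).hexdigest()
    return hashes

def differing_segments(primary, replica):
    a, b = segment_file_hashes(primary), segment_file_hashes(replica)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))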

> I guess that could work, but it sounds like an overly complex solution
> that would both require a format change and double the memory
> usage. Two things we really don't want to do unless there's a huge
> upside.
>

It seems that a 64-bit or even 32-bit hash might be sufficient given that
it is just for detecting accidental corruption.  (This could also replace
the existing checksum in the segment files.)  At this size, the memory
overhead would be pretty small.  For instance, with a 64-bit checksum, the
chunk cache size would increase from 44 to 52 bytes, just 18%.  The archive
metadata per chunk would increase from 40 to 48 bytes, 20%.

I also have been thinking about another (admittedly minor) issue related to
the encryption: if you add a file chunk, delete it, then add it again
later, it will have the same key, which will indicate to an eavesdropper
that it is the same data, and patterns due to this might reveal something
about the data.  To avoid this, the key could be encrypted using a random
nonce, but the nonce would have to be stored and transmitted along with the
key.  Conveniently, the same hash code used for verifying the repository
integrity could be used as the nonce.  Since the hash code depends on the
encryption envelope, which includes a random nonce, the hash code is also
random.
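
Purely as an illustration of that idea (not attic's scheme, and not a
vetted construction): the stored key could be masked with a keystream
derived under a secret from the per-entry nonce/hash, so re-adding the
same chunk later yields a different-looking stored key, while the client,
knowing the secret and the nonce from its cache, can recompute it:

import hashlib
import hmac

def mask_key(secret: bytes, key: bytes, nonce: bytes) -> bytes:
    # derive a keystream from the nonce under a secret and xor it onto the
    # 32-byte key; xor is its own inverse, so unmasking is the same operation
    keystream = hmac.new(secret, nonce, hashlib.sha256).digest()
    return bytes(k ^ s for k, s in zip(key, keystream))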



> So if we wanted to compare the contents of the client cache with the
> repository, why not simply xor all keys in the client cache and compare
> that with the xor'ed value of all keys in the repository index?
>

That is essentially what I'm suggesting, except that the values would be
included as well.