librelist archives

remote verify uses tons of bandwidth

From:
Dan Christensen
Date:
2014-02-11 @ 02:22
I'm just starting to use attic to do some real backups.  I did a backup
to a remote server using ssh (with attic running at the remote end),
and it worked quite well.  But I was surprised to find that

  attic verify user@host:/path/to/repo::archive

caused a lot of data to be downloaded.  I killed it before it finished,
since it would have taken hours, but my guess is that the entire archive
was being downloaded.

Since attic is running on the remote machine, couldn't the remote copy
of attic do the verifying?  Then it would be equivalent to

  ssh user@host attic verify /path/to/repo::archive

which is fast.

I haven't tried the new check code (maybe post a note when you think
it's in a good state to test), but I wonder if the same issues will
come up there.

Dan

Re: [attic] remote verify uses tons of bandwidth

From:
Jonas Borgström
Date:
2014-02-11 @ 19:54
On 2014-02-11 03:22, Dan Christensen wrote:
> I'm just starting to use attic to do some real backups.  I did a backup
> to a remote server using ssh (with attic running at the remote end),
> and it worked quite well.  But I was surprised to find that
> 
>   attic verify user@host:/path/to/repo::archive
> 
> caused a lot of data to be downloaded.  I killed it before it finished,
> since it would have taken hours, but my guess is that the entire archive
> was being downloaded.

Yes, verify is pretty much the same as extract except nothing is
actually written to disk.

> 
> Since attic is running on the remote machine, couldn't the remote copy
> of attic do the verifying?  Then it would be equivalent to
> 
>   ssh user@host attic verify /path/to/repo::archive

Not easily or securely. If the repository is encrypted, all data stored
in the repository is indistinguishable from random bytes without the
corresponding key file and/or passphrase. So doing any verification at
this level would require sending the key file and/or passphrase to the
remote host, which is obviously not secure.
This is why the attic server protocol is nothing more than a minimal
protocol to store and delete binary blobs.
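To make the constraint concrete, here is a toy Python sketch (illustrative only, not attic's actual format): each chunk is authenticated with a key that never leaves the client, so a keyless server cannot verify anything beyond the raw stored bytes.

```python
import hashlib
import hmac
import os

# Toy model, not attic's real format: the client authenticates each
# plaintext chunk with a secret key before upload.
key = os.urandom(32)          # stays on the client, never sent to the server
plaintext = b"some file chunk"
mac = hmac.new(key, plaintext, hashlib.sha256).digest()

# The server stores only opaque bytes. Without `key` it cannot recompute
# the MAC, so any attempt at verification with a different key fails.
wrong_key = os.urandom(32)
forged = hmac.new(wrong_key, plaintext, hashlib.sha256).digest()
assert forged != mac  # keyless/wrong-key verification cannot succeed
```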

But of course, if the remote repository is unencrypted, or if you are okay
with sending your encryption keys to the remote host, you can always run
the verify command as you suggested.

> I haven't tried the new check code (maybe post a note when you think
> it's in a good state to test), but I wonder if the same issues will
> come up there.

"attic check" currently performs the following steps:

1. Detect repository bit-rot by verifying the crc32 checksum of each
object (blob) stored in the repository.
2. Verify the repository index/transaction consistency.

These steps do not require any encryption keys, so these checks are
performed directly on the remote host when a remote repository is used.

Future steps, not implemented yet:
3. Verify archive metadata consistency. This will require access to
encryption keys and will need to download the metadata (1-2% of total
repository size).

4. Verify the cryptographic checksum of all file chunks. This will need to
download the rest of the repository data. This step is optional since
it's expensive and not strictly needed to detect corruption/bit-rot, since
that's covered by step 1. It will, however, detect malicious data tampering.
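Step 1 can be sketched in a few lines of Python (names and layout are illustrative, not attic's on-disk format): each stored object carries a crc32 of its payload, and the check simply recomputes it.

```python
import zlib

# Illustrative sketch of step 1; attic's real segment format differs.
def store(payload: bytes) -> tuple[int, bytes]:
    """Pair a payload with its crc32, as if writing it to the repository."""
    return zlib.crc32(payload), payload

def check_object(crc: int, payload: bytes) -> bool:
    """Recompute the checksum to detect bit-rot in the stored bytes."""
    return zlib.crc32(payload) == crc

crc, payload = store(b"chunk data")
assert check_object(crc, payload)        # an intact object passes
corrupted = payload[:-1] + b"\x00"       # simulate a damaged byte
assert not check_object(crc, corrupted)  # corruption is detected
```

Note that crc32 catches accidental corruption but, being unkeyed, says nothing about deliberate tampering, which is why step 4 exists.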

"attic check" (without --repair) is read-only and safe if you want to
test it.

/ Jonas

Re: [attic] remote verify uses tons of bandwidth

From:
Dan Christensen
Date:
2014-02-13 @ 00:32
Jonas Borgström <jonas@borgstrom.se> writes:

> "attic check" (without --repair) is read-only and safe if you want to
> test it.

While running "attic check" on a repository on an external USB drive, I
accidentally (honestly!) wiggled the USB cable and caused the drive to
disconnect.  I was surprised to find that after reconnecting the drive,
attic couldn't check the repo, complaining about problems creating an
index.tmp file in the repo.  I found some disk corruption with that
file, so I ran e2fsck, which fixed the trouble, and then "attic check"
ran successfully.

What surprises me about this is that attic was writing to a file in
the repo while doing "attic check".  Ideally, this operation would
work on a read-only file system.  If you're trying to recover from
a disk error, you don't want to be writing to the disk unless you
really have to...

Dan

Re: [attic] remote verify uses tons of bandwidth

From:
Jonas Borgström
Date:
2014-02-16 @ 22:46
On 2014-02-13 01:32, Dan Christensen wrote:
> Jonas Borgström <jonas@borgstrom.se> writes:
> 
>> "attic check" (without --repair) is read-only and safe if you want to
>> test it.
> 
> While running "attic check" on a repository on an external USB drive, I
> accidentally (honestly!) wiggled the USB cable and caused the drive to
> disconnect.  I was surprised to find that after reconnecting the drive,
> attic couldn't check the repo, complaining about problems creating an
> index.tmp file in the repo.  I found some disk corruption with that
> file, so I ran e2fsck, which fixed the trouble, and then "attic check"
> ran successfully.
> 
> What surprises me about this is that attic was writing to a file in
> the repo while doing "attic check".  Ideally, this operation would
> work on a read-only file system.  If you're trying to recover from
> a disk error, you don't want to be writing to the disk unless you
> really have to...

Attic uses a file called "index.tmp" for accounting during the
repository check. Right now this file is placed inside the repository
directory for simplicity. But you're right, it should be placed
somewhere else. Just to be clear, though: a regular "attic check" without
"--repair" will not modify any repository files.
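The fix being discussed could be as simple as this Python sketch (paths and names are illustrative): scratch files for a read-only check go in the system temp directory, never inside the repository.

```python
import os
import tempfile

repo_dir = "/path/to/repo"  # illustrative; nothing below writes here

# Scratch accounting data lives in the system temp directory, so a
# read-only check never needs write access to the repository filesystem.
with tempfile.NamedTemporaryFile(prefix="attic-index-", suffix=".tmp") as tmp:
    tmp.write(b"accounting data")
    outside_repo = os.path.dirname(tmp.name) != repo_dir
    print(outside_repo)  # True
```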

Btw, I've just pushed initial support for check and --repair of archive
metadata itself. So attic should now be able to detect and repair most
types of repository corruption. But use with care since this code still
needs to be tested more.

/ Jonas

Re: [attic] remote verify uses tons of bandwidth

From:
Dan Christensen
Date:
2014-02-19 @ 01:47
Jonas Borgström <jonas@borgstrom.se> writes:

> Btw, I've just pushed initial support for check and --repair of archive
> metadata itself. So attic should now be able to detect and repair most
> types of repository corruption. But use with care since this code still
> needs to be tested more.

I tried "attic check --repair" for a real problem, and it worked!

I was doing an "attic prune" on a repository over sshfs, and I hit control-c
to interrupt it.  After that, "attic list" gave an error, and "attic check"
reported "Key found in more than one segment."  Then I ran "attic check
--repair", and it fixed the repo.

Great work!

Is it expected that interrupting an attic prune operation might cause
corruption, or could this be an example of sshfs not implementing the
filesystem operations (e.g. fsync) correctly?

Dan

Re: [attic] remote verify uses tons of bandwidth

From:
Jonas Borgström
Date:
2014-02-19 @ 22:02
On 2014-02-19 02:47, Dan Christensen wrote:
> Jonas Borgström <jonas@borgstrom.se> writes:
> 
>> Btw, I've just pushed initial support for check and --repair of archive
>> metadata itself. So attic should now be able to detect and repair most
>> types of repository corruption. But use with care since this code still
>> needs to be tested more.
> 
> I tried "attic check --repair" for a real problem, and it worked!
> 
> I was doing an "attic prune" on a repository over sshfs, and I hit control-c
> to interrupt it.  After that, "attic list" gave an error, and "attic check"
> reported "Key found in more than one segment."  Then I ran "attic check
> --repair", and it fixed the repo.

Cool,

> Great work!
> 
> Is it expected that interrupting an attic prune operation might cause
> corruption, or could this be an example of sshfs not implementing the
> filesystem operations (e.g. fsync) correctly?

you don't happen to have a copy of the exact error message from "attic
list"?

"attic prune" is pretty much just a bunch of delete operations followed
by a single put where the manifest is updated. And all this is wrapped
in a single transaction. So this should be an atomic operation as long
as the filesystem works correctly (and attic is bug free :)
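The commit idea can be sketched like this (illustrative Python, not attic's actual transaction code): stage everything, then make the new state visible with one atomic rename.

```python
import json
import os
import tempfile

# Illustrative sketch of an all-or-nothing commit; file names are made up.
def commit_manifest(repo_dir: str, manifest: dict) -> None:
    fd, tmp = tempfile.mkstemp(dir=repo_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())   # make sure the staged data is on disk
    # os.replace is atomic on POSIX: readers see either the old manifest
    # or the new one, never a partially written file.
    os.replace(tmp, os.path.join(repo_dir, "manifest"))

with tempfile.TemporaryDirectory() as repo:
    commit_manifest(repo, {"archives": ["kept-archive"]})
    with open(os.path.join(repo, "manifest")) as f:
        print(json.load(f))  # {'archives': ['kept-archive']}
```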

Are you able to reproduce this?

/ Jonas

Re: [attic] remote verify uses tons of bandwidth

From:
Dan Christensen
Date:
2014-02-19 @ 22:53
Jonas Borgström <jonas@borgstrom.se> writes:

> you don't happen to have a copy of the exact error message from "attic
> list"?

See below.

> "attic prune" is pretty much just a bunch of delete operations followed
> by a single put where the manifest is updated. And all this is wrapped
> in a single transaction. So this should be an atomic operation as long
> as the filesystem works correctly (and attic is bug free :)
>
> Are you able to reproduce this?

Here I've reproduced it on a local ext4 filesystem.  I used a directory
tree with ~3MB of small files, and did a create operation 100 times,
with no changes to the files.  So the repo was about 3MB in total.  Then
I hit control-c about 0.5 to 1s into the prune operation:

$ attic prune -H 1 test-prune-intr.attic
^Cattic: Error: Keyboard interrupt

$ attic list test-prune-intr.attic
attic: Error: Inconsistency detected. Please run "attic check test-prune-intr.attic"

$ attic check test-prune-intr.attic
Starting repository check...
No suitable index found
attic: Exiting with failure status due to previous errors

$ cp -a test-prune-intr.attic test-prune-repair.attic
$ attic check --repair test-prune-repair.attic
attic: Warning: 'check --repair' is an experimental feature that might result
in data loss.

Type "Yes I am sure" if you understand this and want to continue.

Do you want to continue? Yes I am sure
Starting repository check...
No suitable index found
Starting archive consistency check...
Analyzing archive 1-10 (1/91)
Analyzing archive 4-1 (2/91)
Analyzing archive 7-9 (3/91)
Analyzing archive 8-10 (4/91)
Analyzing archive 7-10 (5/91)
Analyzing archive 7-8 (6/91)
Analyzing archive 4-4 (7/91)
Analyzing archive 9-4 (8/91)
Analyzing archive 9-7 (9/91)
Analyzing archive 9-6 (10/91)
Analyzing archive 4-8 (11/91)
Analyzing archive 7-2 (12/91)
Analyzing archive 3-9 (13/91)
Analyzing archive 4-10 (14/91)
Analyzing archive 9-2 (15/91)
Analyzing archive 6-8 (16/91)
Analyzing archive 6-9 (17/91)
Analyzing archive 6-6 (18/91)
Analyzing archive 6-7 (19/91)
Analyzing archive 6-4 (20/91)
Analyzing archive 6-5 (21/91)
Analyzing archive 6-2 (22/91)
Analyzing archive 6-3 (23/91)
Analyzing archive 6-1 (24/91)
Analyzing archive 4-2 (25/91)
Analyzing archive 8-8 (26/91)
Analyzing archive 8-9 (27/91)
Analyzing archive 8-4 (28/91)
Analyzing archive 8-5 (29/91)
Analyzing archive 8-6 (30/91)
Analyzing archive 8-7 (31/91)
Analyzing archive 8-1 (32/91)
Analyzing archive 8-2 (33/91)
Analyzing archive 8-3 (34/91)
Analyzing archive 7-3 (35/91)
Analyzing archive 3-10 (36/91)
Analyzing archive 2-2 (37/91)
Analyzing archive 2-3 (38/91)
Analyzing archive 2-1 (39/91)
Analyzing archive 2-6 (40/91)
Analyzing archive 2-7 (41/91)
Analyzing archive 2-4 (42/91)
Analyzing archive 2-5 (43/91)
Analyzing archive 2-8 (44/91)
Analyzing archive 2-9 (45/91)
Analyzing archive 9-8 (46/91)
Analyzing archive 9-1 (47/91)
Analyzing archive 10 (48/91)
Analyzing archive 9-3 (49/91)
Analyzing archive 4-3 (50/91)
Analyzing archive 9-5 (51/91)
Analyzing archive 4-5 (52/91)
Analyzing archive 4-6 (53/91)
Analyzing archive 4-7 (54/91)
Analyzing archive 9-9 (55/91)
Analyzing archive 4-9 (56/91)
Analyzing archive 7-1 (57/91)
Analyzing archive 5-2 (58/91)
Analyzing archive 7-7 (59/91)
Analyzing archive 7-6 (60/91)
Analyzing archive 7-5 (61/91)
Analyzing archive 7-4 (62/91)
Analyzing archive 10-10 (63/91)
Analyzing archive 5-10 (64/91)
Analyzing archive 3-3 (65/91)
Analyzing archive 3-2 (66/91)
Analyzing archive 6-10 (67/91)
Analyzing archive 1-9 (68/91)
Analyzing archive 1-8 (69/91)
Analyzing archive 1-1 (70/91)
Analyzing archive 1-3 (71/91)
Analyzing archive 1-2 (72/91)
Analyzing archive 1-5 (73/91)
Analyzing archive 1-4 (74/91)
Analyzing archive 1-7 (75/91)
Analyzing archive 1-6 (76/91)
Analyzing archive 5-5 (77/91)
Analyzing archive 5-4 (78/91)
Analyzing archive 5-7 (79/91)
Analyzing archive 5-6 (80/91)
Analyzing archive 5-1 (81/91)
Analyzing archive 5-3 (82/91)
Analyzing archive 3-8 (83/91)
Analyzing archive 3-7 (84/91)
Analyzing archive 3-6 (85/91)
Analyzing archive 3-5 (86/91)
Analyzing archive 3-4 (87/91)
Analyzing archive 5-9 (88/91)
Analyzing archive 5-8 (89/91)
Analyzing archive 3-1 (90/91)
Analyzing archive 2-10 (91/91)
Archive consistency check complete, no problems found.

And now all is fine with the repo.

I can't share the test files I used, but since my first attempt at
reproducing this worked, you will hopefully be able to as well.

Version:

Attic 0.10-46-g7bcb0f9

Dan

Re: [attic] remote verify uses tons of bandwidth

From:
Jonas Borgström
Date:
2014-02-20 @ 11:51
On 2014-02-19 23:53, Dan Christensen wrote:
> Jonas Borgström <jonas@borgstrom.se> writes:
> 
>> you don't happen to have a copy of the exact error message from "attic
>> list"?
> 
> See below.
> 
>> "attic prune" is pretty much just a bunch of delete operations followed
>> by a single put where the manifest is updated. And all this is wrapped
>> in a single transaction. So this should be an atomic operation as long
>> as the filesystem works correctly (and attic is bug free :)
>>
>> Are you able to reproduce this?
> 
> Here I've reproduced it on a local ext4 filesystem.  I used a directory
> tree with ~3MB of small files, and did a create operation 100 times,
> with no changes to the files.  So the repo was about 3MB in total.  Then
> I hit control-c about 0.5 to 1s into the prune operation:

Do you know if you hit control-c when there were still archives left to
prune or while the transaction was being committed?

> $ attic prune -H 1 test-prune-intr.attic
> ^Cattic: Error: Keyboard interrupt
> 
> $ attic list test-prune-intr.attic
> attic: Error: Inconsistency detected. Please run "attic check test-prune-intr.attic"
> 
> $ attic check test-prune-intr.attic
> Starting repository check...
> No suitable index found
> attic: Exiting with failure status due to previous errors

So the only error found was "No suitable index found". You got "Key
found in more than one segment." when you used sshfs, right?

"No suitable index found" isn't really a problem with the actual
repository data. Attic just noticed that the index is out of date and
needs to be refreshed. Attic 0.10 actually did this automatically, but I
moved it into "check --repair" since I didn't think it ever happened.

But I'll try to reproduce this myself and then maybe add the index
rebuild code back if necessary.

/ Jonas

Re: [attic] remote verify uses tons of bandwidth

From:
Dan Christensen
Date:
2014-02-20 @ 15:23
Jonas Borgström <jonas@borgstrom.se> writes:

> Do you know if you hit control-c when there were still archives left to
> prune or while the transaction was being committed?

For the test I did on a local disk, of the 100 copies, 99 should have
been pruned, but after the repair, 91 were still there.  So I think
there were still archives left to prune.

For the test over sshfs, after I ran the repair operation, all the
pruning was done.

> So the only error found was "No suitable index found". You got "Key
> found in more than one segment." when you used sshfs, right?

You are right, the errors are different.  Over sshfs, I think I only got
a brief message from "attic list", but I don't know for sure.  Then,
when I ran "attic check", I definitely got the "Key found in more than
one segment" errors.  I'll see if I can reproduce (with or without
sshfs), but I won't have time for a little while.

I wonder if it's worth setting up some systematic stress testing that
runs attic repeatedly, killing it at random intervals (with SIGKILL), or
killing an ssh connection it is using (maybe to the local host), etc.
If there's a way to forcibly unmount a filesystem while attic is writing
to it, that could be part of the test as well.  Maybe there's even a
framework for this kind of testing?
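A minimal harness along these lines might look like the following Python sketch (the command under test is a stand-in; a real harness would run attic against a scratch repository and then check it):

```python
import random
import signal
import subprocess
import sys
import time

# Hypothetical stress harness: start a process and SIGKILL it at a random
# moment, simulating a hard crash mid-operation (POSIX only).
def kill_at_random(cmd, max_delay=0.2):
    proc = subprocess.Popen(cmd)
    time.sleep(random.uniform(0, max_delay))
    proc.send_signal(signal.SIGKILL)
    proc.wait()
    return proc.returncode

# Stand-in workload; in practice this would be e.g. an "attic prune" run,
# followed by "attic check" on the repository to look for corruption.
rc = kill_at_random([sys.executable, "-c", "import time; time.sleep(10)"])
print(rc)  # negative signal number on POSIX, i.e. -9 for SIGKILL
```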

> "No suitable index found" isn't really any problem with the actual
> repository data. Attic just noticed that the index is out of date and
> needs to be refreshed. Attic 0.10 actually did this automatically but I
> moved it into "check --repair" since I didn't think it ever happened.
>
> But I'll try to reproduce this myself and then maybe add the index
> rebuild code back if necessary.

As we discussed earlier, it's a bit disconcerting if "attic list"
changes the repo.  So it would be better if there was a way to avoid
having (or minimize the chance of having) the index get out of date,
and then maybe suggest "check --repair" if there is a problem?

Dan

Re: [attic] remote verify uses tons of bandwidth

From:
Jonas Borgström
Date:
2014-02-20 @ 16:19
On 2014-02-20 16:23, Dan Christensen wrote:
> Jonas Borgström <jonas@borgstrom.se> writes:
> 
>> Do you know if you hit control-c when there were still archives left to
>> prune or while the transaction was being committed?
> 
> For the test I did on a local disk, of the 100 copies, 99 should have
> been pruned, but after the repair, 91 were still there.  So I think
> there were still archives left to prune.
> 
> For the test over sshfs, after I ran the repair operation, all the
> pruning was done.
> 
>> So the only error found was "No suitable index found". You got "Key
>> found in more than one segment." when you used sshfs, right?
> 
> You are right, the errors are different.  Over sshfs, I think I only got
> a brief message from "attic list", but I don't know for sure.  Then,
> when I ran "attic check", I definitely got the "Key found in more than
> one segment" errors.  I'll see if I can reproduce (with or without
> sshfs), but I won't have time for a little while.

I'm actually able to reproduce the "Key found in more than one..." error
quite reliably (even on ext4). But those errors are caused when the
"compaction" step (that is run after each commit) is interrupted. And
this will auto-correct itself after next successful commit (or by
running repair).
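Why an interrupted compaction produces that error can be sketched with a toy model (plain Python dicts standing in for segment files; attic's real format differs): compaction first copies live keys into a new segment and only then deletes the old copies, so a crash between the two steps leaves a key in both.

```python
# Toy model of segment compaction; dicts stand in for segment files.
segments = {0: {"k1": b"live data"}, 1: {}}

def compact(segments, interrupt_after_copy=False):
    segments[1].update(segments[0])   # 1) copy live entries forward
    if interrupt_after_copy:
        return                        # crash here: "k1" is now in both
    segments[0].clear()               # 2) drop the superseded copies

compact(segments, interrupt_after_copy=True)
dupes = [k for k in segments[0] if k in segments[1]]
print(dupes)  # ['k1'] -> "Key found in more than one segment"

compact(segments)  # the next successful pass auto-corrects the duplicate
print([k for k in segments[0] if k in segments[1]])  # []
```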

> I wonder if it's worth setting up some systematic stress testing that
> runs attic repeatedly, killing it at random intervals (with SIGKILL), or
> killing an ssh connection it is using (maybe to the local host), etc.
> If there's a way to forcibly unmount a filesystem while attic is writing
> to it, that could be part of the test as well.  Maybe there's even a
> framework for this kind of testing?

That would be nice to have but probably not that easy to automate and
maintain.

>> "No suitable index found" isn't really any problem with the actual
>> repository data. Attic just noticed that the index is out of date and
>> needs to be refreshed. Attic 0.10 actually did this automatically but I
>> moved it into "check --repair" since I didn't think it ever happened.
>>
>> But I'll try to reproduce this myself and then maybe add the index
>> rebuild code back if necessary.
> 
> As we discussed earlier, it's a bit disconcerting if "attic list"
> changes the repo.  So it would be better if there was a way to avoid
> having (or minimize the chance of having) the index get out of date,
> and then maybe suggest "check --repair" if there is a problem?


Agreed, but it's also not very good if a program crash could "wedge" a
scheduled backup requiring manual intervention.

The "index rebuild" is also a lot quicker than a full "check --repair".

Anyway, HEAD is currently not as robust as 0.10, which will probably need
to change before we can release 0.11.

/ Jonas

Re: [attic] remote verify uses tons of bandwidth

From:
Dan Christensen
Date:
2014-02-20 @ 16:36
Jonas Borgström <jonas@borgstrom.se> writes:

> I'm actually able to reproduce the "Key found in more than one..." error
> quite reliably (even on ext4). But those errors are caused when the
> "compaction" step (that is run after each commit) is interrupted. And
> this will auto-correct itself after next successful commit (or by
> running repair).

Ok, good.  That's probably what happened when I had the problem over
sshfs.  So at least sshfs is out of the equation.

>> As we discussed earlier, it's a bit disconcerting if "attic list"
>> changes the repo.  So it would be better if there was a way to avoid
>> having (or minimize the chance of having) the index get out of date,
>> and then maybe suggest "check --repair" if there is a problem?
>
> Agreed, but it's also not very good if a program crash could "wedge" a
> scheduled backup requiring manual intervention.

Well, it's certainly reasonable for "attic create" to do "safe" cleanup
operations, since we are expecting it to change the repo anyway.

But it's less clear whether "attic list" should change the repo.  One
option would be to default to letting it make safe changes to the repo,
but provide a "--read-only" switch to use when investigating a possibly
corrupt repo.  Or the default could be to not make changes unless a
switch like "--force" is provided.

If the default is to allow "attic list" to make changes, it would also
be good if "attic list" gracefully handled the situation where the repo
is on a filesystem that is mounted read-only.

Dan

Re: [attic] remote verify uses tons of bandwidth

From:
Jonas Borgström
Date:
2014-02-20 @ 23:05
On 2014-02-20 17:36, Dan Christensen wrote:
> Jonas Borgström <jonas@borgstrom.se> writes:
> 
>> I'm actually able to reproduce the "Key found in more than one..." error
>> quite reliably (even on ext4). But those errors are caused when the
>> "compaction" step (that is run after each commit) is interrupted. And
>> this will auto-correct itself after next successful commit (or by
>> running repair).
> 
> Ok, good.  That's probably what happened when I had the problem over
> sshfs.  So at least sshfs is out of the equation.
> 
>>> As we discussed earlier, it's a bit disconcerting if "attic list"
>>> changes the repo.  So it would be better if there was a way to avoid
>>> having (or minimize the chance of having) the index get out of date,
>>> and then maybe suggest "check --repair" if there is a problem?
>>
>> Agreed, but it's also not very good if a program crash could "wedge" a
>> scheduled backup requiring manual intervention.
> 
> Well, it's certainly reasonable for "attic create" to do "safe" cleanup
> operations, since we are expecting it to change the repo anyways.
> 
> But it's less clear whether "attic list" should change the repo.  One
> option would be to default to letting it make safe changes to the repo,
> but provide a "--read-only" switch to use when investigating a possibly
> corrupt repo.  Or the default could be to not make changes unless a
> switch like "--force" is provided.
> 
> If the default is to allow "attic list" to make changes, it would also
> be good if "attic list" gracefully handled the situation where the repo
> is on a filesystem that is mounted read-only.

I didn't find any good way to do this, so for now I've reinstated the
automatic index rebuild.

This should however be "safe" since it only touches the index file and
nothing else. It will also fail fairly gracefully with:

attic: Error: Failed to acquire write lock on xxxx

if the repository filesystem is read only.

Since there have been a fair number of changes to repository.py since the
last release, I'll do some more testing to make sure there are no other
regressions.

/ Jonas

Re: [attic] remote verify uses tons of bandwidth

From:
Dan Christensen
Date:
2014-02-12 @ 02:19
Jonas Borgström <jonas@borgstrom.se> writes:

> This is why the attic server protocol is nothing more than a minimal
> protocol to store and delete binary blobs.

Seems reasonable.  Thanks for the explanation.  Incidentally, this makes
me wonder how much of the server protocol would work over sftp.

> "attic check" (without --repair) is read-only and safe if you want to
> test it.

Ok, I'll probably use attic check instead of verify, since it should be
faster.  I see that currently you have to check the whole repo.  I
imagine this might get slow as the repo gets large.  Would it be
reasonable to support checking just one archive in a repo?

Dan

Re: [attic] remote verify uses tons of bandwidth

From:
Jonas Borgström
Date:
2014-02-12 @ 20:31
On 2014-02-12 03:19, Dan Christensen wrote:
> Jonas Borgström <jonas@borgstrom.se> writes:
> 
>> This is why the attic server protocol is nothing more than a minimal
>> protocol to store and delete binary blobs.
> 
> Seems reasonable.  Thanks for the explanation.  Incidentally, this makes
> me wonder how much of the server protocol would work over sftp.
> 
>> "attic check" (without --repair) is read-only and safe if you want to
>> test it.
> 
> Ok, I'll probably use attic check instead of verify, since it should be
> faster.  I see that currently you have to check the whole repo.  I
> imagine this might get slow as the repo gets large.  Would it be
> reasonable to support checking just one archive in a repo?

The repository check part of "attic check" is fairly fast and should be
I/O bound. So as long as you have fast disks it should not take that long
to check a repo.

It would be possible to do some checking of a single archive, but by
looking at all the metadata it is possible to detect some types of
inconsistencies that would otherwise be undetectable.

/ Jonas