librelist archives

--exclude-from ?

From:
Dan Christensen
Date:
2014-02-03 @ 22:03
Rather than using "-e pattern" many times on the command line, I thought
it might be convenient to be able to list patterns in a file, one per
line.  Lines beginning with # would be ignored.  Would this be a
reasonable feature?
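
A minimal sketch of how such a file could be read, assuming the format described above (one pattern per line, lines beginning with # ignored). The name `load_exclude_patterns` is hypothetical and not part of attic, and skipping blank lines is an extra assumption.

```python
def load_exclude_patterns(lines):
    """Return the patterns found in an iterable of text lines.

    Hypothetical helper: lines beginning with '#' are comments.
    Blank lines are skipped too (an assumption, not in the proposal).
    """
    patterns = []
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip() or line.startswith('#'):
            continue
        patterns.append(line)
    return patterns


example = ["# build products\n", "*.o\n", "\n", "/home/*/junk\n"]
print(load_exclude_patterns(example))  # ['*.o', '/home/*/junk']
```

An open file object iterates over its lines, so it could be passed in directly.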

One question is whether these patterns would be recorded in the archive.
Currently, "attic info <archive>" gives the command line, which lets
you see which patterns were excluded.  It would be nice to be able
to see the exclusions even if they come from an external file.  But
it would also be good if they didn't obscure the rest of the command
line.

Any thoughts?

Dan

Re: [attic] --exclude-from ?

From:
Jonas Borgström
Date:
2014-02-03 @ 22:25
On 2014-02-03 23:03, Dan Christensen wrote:
> Rather than using "-e pattern" many times on the command line, I thought
> it might be convenient to be able to list patterns in a file, one per
> line.  Lines beginning with # would be ignored.  Would this be a
> reasonable feature?

Yeah, that's exactly how tar works, but I've never gotten around to
implementing it.

> One question is whether these patterns would be recorded in the archive.
> Currently, "attic info <archive>" gives the command line, which lets
> you see which patterns were excluded.  It would be nice to be able
> to see the exclusions even if they come from an external file.  But
> it would also be good if they didn't obscure the rest of the command
> line.
>
> Any thoughts?
>

I see your point but I think simply recording the command line is good
enough for now.

/ Jonas

Re: [attic] --exclude-from ?

From:
Dan Christensen
Date:
2014-02-06 @ 02:47
I'm trying to understand how the current exclude patterns work.  Here
are some things I think I do understand:

- The full path name must match the pattern.
- The patterns are shell glob patterns, using *, ?, [abc], [!abc],
  [a-z], etc.
- However, unlike in the shell, the path separator "/" is not
  treated specially, so "p*test" will match "path/to/filetest".
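
The three points above can be checked directly with Python's `fnmatch.fnmatchcase`, the function used in the code quoted later in this message. The variable names are only for illustration.

```python
from fnmatch import fnmatchcase

# '*' is free to match '/' characters.
crosses_slash = fnmatchcase("path/to/filetest", "p*test")
# The pattern must match the whole path, not just a part of it.
partial = fnmatchcase("path/to/filetest", "to/*")
# Character classes work as in the shell.
negated_class = fnmatchcase("file3", "file[!abc]")

print(crosses_slash, partial, negated_class)  # True False True
```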

There are a few things I don't understand.

First, what is the intention for how directories are handled?  Is
a trailing "/" on a pattern supposed to mean that this only matches
directories?  But without a "/", the pattern is permitted to match
either files or directories?

Here is the relevant code:

class ExcludePattern(IncludePattern):

    def __init__(self, pattern):
        self.pattern = self.dirpattern = pattern
        if not pattern.endswith('/'):
            self.dirpattern += '/*'

    def match(self, path):
        dir, name = os.path.split(path)
        return (fnmatchcase(path, self.pattern)
                or fnmatchcase(dir + '/', self.dirpattern))

It seems that this does roughly what I said, but there are a few things
that confuse me.  First, why is '/*' added to dirpattern instead of just
'/'?  Second, it seems like the directory itself is included but that
all of its contents are excluded.  Wouldn't it make more sense to
exclude the directory itself?  And, in fact, if the pattern doesn't
end in '/', won't the directory have already been excluded, meaning that
the += '/*' isn't ever getting used?  Is this logic somewhat convoluted
in order to avoid extra calls to stat to see whether a path is a
directory?

Here is some simpler code that might be equivalent to the above code:

    def __init__(self, pattern):
        self.pattern = pattern
        if pattern.endswith('/'):
            self.pattern += '*'

    def match(self, path):
        return fnmatchcase(path, self.pattern)

If the pattern is "path/to/foo/", then the contents of the directory foo
are excluded (if foo is a directory), but foo itself isn't excluded
(whether or not it is a directory).

And if the pattern is "path/to/foo", then foo is excluded (whether or
not it is a directory).  I think this implies that its contents are
excluded too, but maybe not if one of the source paths is beneath foo in
the file tree?  Maybe that's the intention of the '/*'?

I'm also wondering why adjust_patterns is called for extract and verify,
but not for create, and what the purpose is:

def adjust_patterns(paths, excludes):
    if paths:
        return ((excludes or [])
                + [IncludePattern(path) for path in paths]
                + [ExcludePattern('*')])
    else:
        return excludes

Are there plans for more general include/exclude rules, kind of like
rsync has, where the order matters?

Once I understand, I will write up some documentation.

Dan

Re: [attic] --exclude-from ?

From:
Jonas Borgström
Date:
2014-02-06 @ 17:05
On 2014-02-06 03:47, Dan Christensen wrote:
> I'm trying to understand how the current exclude patterns work.  Here
> are some things I think I do understand:
> 
> - The full path name must match the pattern.
> - The patterns are shell glob patterns, using *, ?, [abc], [!abc],
>   [a-z], etc.
> - However, unlike in the shell, the path separator "/" is not
>   treated specially, so "p*test" will match "path/to/filetest".
> 
> There are a few things I don't understand.
> 
> First, what is the intention for how directories are handled?  Is
> a trailing "/" on a pattern supposed to mean that this only matches
> directories?  But without a "/", the pattern is permitted to match
> either files or directories?
> 
> Here is the relevant code:
> 
> class ExcludePattern(IncludePattern):
> 
>     def __init__(self, pattern):
>         self.pattern = self.dirpattern = pattern
>         if not pattern.endswith('/'):
>             self.dirpattern += '/*'
> 
>     def match(self, path):
>         dir, name = os.path.split(path)
>         return (fnmatchcase(path, self.pattern)
>                 or fnmatchcase(dir + '/', self.dirpattern))
> 
> It seems that this does roughly what I said, but there are a few things
> that confuse me.  First, why is '/*' added to dirpattern instead of just
> '/'?  Second, it seems like the directory itself is included but that
> all of its contents are excluded.  Wouldn't it make more sense to
> exclude the directory itself?  And, in fact, if the pattern doesn't
> end in '/', won't the directory have already been excluded, meaning that
> the += '/*' isn't ever getting used?  Is this logic somewhat convoluted
> in order to avoid extra calls to stat to see whether a path is a
> directory?

The include/exclude functions are used by both create and extract. For
"create" it's easy. If a directory is excluded we won't descend into it
so all sub paths are automatically excluded as well.

For extract/verify/list it's a bit different since the include/exclude
functions are used to filter the list of files/directories contained in
an archive.
So if "--exclude /foo" is specified it's not enough to only match "/foo"
it must also match "/foo/bar".
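
To make this concrete, here is the ExcludePattern logic quoted above restated as a standalone function and applied to a made-up archive listing. The function name `original_exclude_match` and the listing are illustrative only, not part of attic.

```python
import os
from fnmatch import fnmatchcase


def original_exclude_match(path, pattern):
    """Standalone restatement of the ExcludePattern logic quoted above."""
    dirpattern = pattern if pattern.endswith('/') else pattern + '/*'
    parent, _name = os.path.split(path)
    return fnmatchcase(path, pattern) or fnmatchcase(parent + '/', dirpattern)


# Made-up archive listing, filtered the way extract would need to.
listing = ["/foo", "/foo/bar", "/foo/bar/baz", "/foobar"]
kept = [p for p in listing if not original_exclude_match(p, "/foo")]
print(kept)  # ['/foobar']
```

The '/*' suffix (rather than a bare '/') is what lets the parent of "/foo/bar/baz", namely "/foo/bar/", match the directory pattern.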


> Here is some simpler code that might be equivalent to the above code:
> 
>     def __init__(self, pattern):
>         self.pattern = pattern
>         if pattern.endswith('/'):
>             self.pattern += '*'
> 
>     def match(self, path):
>         return fnmatchcase(path, self.pattern)
> 
> If the pattern is "path/to/foo/", then the contents of the directory foo
> are excluded (if foo is a directory), but foo itself isn't excluded
> (whether or not it is a directory).
> 
> And if the pattern is "path/to/foo", then foo is excluded (whether or
> not it is a directory).  I think this implies that its contents are
> excluded too, but maybe not if one of the source paths is beneath foo in
> the file tree?  Maybe that's the intention of the '/*'?

Yes, as far as I can tell the above version does not pass the test suite?
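
Which test fails is not stated here. As a sketch, the following shows one input on which the simplified version (copied into a standalone class named `SimplifiedExclude` for illustration) does not match a path below an excluded directory, which matters when filtering an archive listing.

```python
from fnmatch import fnmatchcase


class SimplifiedExclude:
    """Standalone copy of the simplified code quoted above."""

    def __init__(self, pattern):
        self.pattern = pattern
        if pattern.endswith('/'):
            self.pattern += '*'

    def match(self, path):
        return fnmatchcase(path, self.pattern)


exclude = SimplifiedExclude("path/to/foo")
print(exclude.match("path/to/foo"))      # True
print(exclude.match("path/to/foo/bar"))  # False
```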

> I'm also wondering why adjust_patterns is called for extract and verify,
> but not for create, and what the purpose is:
> 
> def adjust_patterns(paths, excludes):
>     if paths:
>         return ((excludes or [])
>                 + [IncludePattern(path) for path in paths]
>                 + [ExcludePattern('*')])
>     else:
>         return excludes
> 
> Are there plans for more general include/exclude rules, kind of like
> rsync has, where the order matters?

Attic actually had an include/exclude system very similar to rsync before
but I decided to replace it with the current system since it was easier
to understand and use. But as you see some parts of the code still show
some traces of that old functionality. And it would be fairly easy to
add it back in the future if we think it's worth it.
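
A rough sketch of how an ordered rule list like the one built by adjust_patterns could be consulted. The thread does not show that code, so the first-match-wins loop in `is_excluded` is an assumption, and the `Include` and `Exclude` classes are simplified, hypothetical stand-ins for IncludePattern and ExcludePattern.

```python
from fnmatch import fnmatchcase


class Include:
    """Hypothetical include rule: the path itself or anything below it."""

    def __init__(self, path):
        self.path = path.rstrip('/')

    def match(self, path):
        return path == self.path or path.startswith(self.path + '/')


class Exclude:
    """Hypothetical exclude rule: a plain fnmatch pattern."""

    def __init__(self, pattern):
        self.pattern = pattern

    def match(self, path):
        return fnmatchcase(path, self.pattern)


def adjust_patterns(paths, excludes):
    # Same shape as the function quoted in the thread.
    if paths:
        return (excludes or []) + [Include(p) for p in paths] + [Exclude('*')]
    return excludes


def is_excluded(path, patterns):
    # Assumption: the first rule that matches decides the outcome.
    for rule in patterns or []:
        if rule.match(path):
            return isinstance(rule, Exclude)
    return False


rules = adjust_patterns(["test/f"], [Exclude("*.o")])
listing = ["test/f/bar", "test/f/bar.o", "test/foo/bar"]
print([p for p in listing if not is_excluded(p, rules)])  # ['test/f/bar']
```

Under that assumption, the trailing catch-all exclude is what restricts extraction to the paths named on the command line, while explicit excludes placed first still take priority.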

> Once I understand, I will write up some documentation.

Cool. I guess the bottom line is that Attic tries to behave the same way
as tar. And if it does not, it's probably a bug.

/ Jonas

Re: [attic] --exclude-from ?

From:
Dan Christensen
Date:
2014-02-06 @ 18:15
Jonas Borgström <jonas@borgstrom.se> writes:

> The include/exclude functions are used by both create and extract. For
> "create" it's easy. If a directory is excluded we won't descend into it
> so all sub paths are automatically excluded as well.

What is the intention if the user does:

  attic create --exclude '*/junk' attic.repo::1 /home /home/junk/important

And can you clarify whether the behaviour is supposed to depend on
whether the user puts a trailing slash on the exclude pattern?

Dan

Re: [attic] --exclude-from ?

From:
Jonas Borgström
Date:
2014-02-06 @ 20:21
On 2014-02-06 19:15, Dan Christensen wrote:
> Jonas Borgström <jonas@borgstrom.se> writes:
> 
>> The include/exclude functions are used by both create and extract. For
>> "create" it's easy. If a directory is excluded we won't descend into it
>> so all sub paths are automatically excluded as well.
> 
> What is the intention if the user does:
> 
>   attic create --exclude '*/junk' attic.repo::1 /home /home/junk/important
>
> And can you clarify whether the behaviour is supposed to depend on
> whether the user puts a trailing slash on the exclude pattern?
>

This would exclude the following files:

/home/junk
/home/junk/important
/home/junk/important/whatever

As far as I can tell this is also the behavior of GNU tar at least.

--exclude '*/junk/' would keep the /home/junk directory itself but
exclude the rest of the files.

GNU tar seems to exclude the /home/junk directory anyway so the trailing
slash does not seem to have any effect there.

I think our current behavior is a bit more useful but it might not be
worth it, what do you think?

/ Jonas

Re: [attic] --exclude-from ?

From:
Dan Christensen
Date:
2014-02-06 @ 22:14
Jonas Borgström <jonas@borgstrom.se> writes:

> On 2014-02-06 19:15, Dan Christensen wrote:
>> Jonas Borgström <jonas@borgstrom.se> writes:
>> 
>>> The include/exclude functions are used by both create and extract. For
>>> "create" it's easy. If a directory is excluded we won't descend into it
>>> so all sub paths are automatically excluded as well.
>> 
>> What is the intention if the user does:
>> 
>>   attic create --exclude '*/junk' attic.repo::1 /home /home/junk/important
>>
>> And can you clarify whether the behaviour is supposed to depend on
>> whether the user puts a trailing slash on the exclude pattern?
>>
>
> This would exclude the following files:
>
> /home/junk
> /home/junk/important
> /home/junk/important/whatever
>
> As far as I can tell this is also the behavior of GNU tar at least.

Yes, and it is documented here:
  
http://www.gnu.org/software/tar/manual/html_node/problems-with-exclude.html#SEC110

> GNU tar seems to exclude the /home/junk directory anyway so the trailing
> slash does not seem to have any effect there.

With my version of GNU tar, including a trailing slash means that the
exclude pattern has no effect at all.

There is one other difference with tar's exclude patterns:  The pattern
can match any trailing section of a path name broken at a "/", i.e. the
patterns "foo", "bar/foo" and "baz/bar/foo" all match the path
"baz/bar/foo", but the pattern "r/foo" does not.

I don't know whether attic should imitate that, since it can also
be obtained by having the user put "*/" at the start of the pattern
(except for files located at the top of the hierarchy, which don't
contain a leading slash in some cases).
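
The trailing-section rule can be sketched as follows. This only models the rule as described in this message; GNU tar's real matching has further options that are ignored here, and the function name `matches_trailing_section` is made up.

```python
from fnmatch import fnmatchcase


def matches_trailing_section(path, pattern):
    """True if pattern matches path, or any tail of path that starts
    right after a '/'."""
    parts = path.split('/')
    tails = ('/'.join(parts[i:]) for i in range(len(parts)))
    return any(fnmatchcase(tail, pattern) for tail in tails)


for pattern in ["foo", "bar/foo", "baz/bar/foo", "r/foo"]:
    print(pattern, matches_trailing_section("baz/bar/foo", pattern))
```

The first three patterns report True and "r/foo" reports False, because "r/foo" never lines up with a tail that begins at a component boundary.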

> I think our current behavior is a bit more useful but it might not be
> worth it, what do you think?

If we remove the code that adds '/*', the user can still exclude the
contents of a directory by specifying '/path/to/dir/*', so I would 
suggest getting rid of this behaviour.  I was thinking about how to
explain the current behaviour, and it is a bit complicated.  Motivated
by making things easy to explain, I would suggest the following:

class ExcludePattern(IncludePattern):

    def __init__(self, pattern):
        self.pattern = pattern
        if self.pattern.endswith('/'):
            self.pattern = self.pattern[:-1]

    def match(self, path):
        return (fnmatchcase(path, self.pattern)
                or fnmatchcase(path, self.pattern + '/*'))

Stripping the trailing slash is a convenience for the user, since
otherwise match() will always fail (except maybe for the case of the
root path "/").  If we omit those two lines, we are closer to tar,
but that seems like odd behaviour.

I haven't thought about how the include patterns work, or why they
are needed for extract.

Dan

Re: [attic] --exclude-from ?

From:
Dan Christensen
Date:
2014-02-06 @ 23:41
Dan Christensen <jdc@uwo.ca> writes:

> Jonas Borgström <jonas@borgstrom.se> writes:
>
>> I think our current behavior is a bit more useful but it might not be
>> worth it, what do you think?
>
> If we remove the code that adds '/*', the user can still exclude the
> contents of a directory by specifying '/path/to/dir/*', so I would 
> suggest getting rid of this behaviour.  

On second thought, since a pattern ending with '/' isn't useful,
maybe it is better to give it the meaning you suggest (exclude the
contents of the directory, but not the directory itself), like this:

class ExcludePattern(IncludePattern):

    def __init__(self, pattern):
        self.pattern = pattern
        if self.pattern.endswith('/'):
            self.pattern = self.pattern+'*'  # Only change

    def match(self, path):
        return (fnmatchcase(path, self.pattern)
                or fnmatchcase(path, self.pattern + '/*'))

(The second part of the "or" will be redundant if the user specifies
a path ending in '/'.)

Note that the existing code has a bug:

$ attic create -v -e test-errors/ test.attic::exclude2 test-errors test-errors/dir/j
test-errors
test-errors/dir/j

The subdirectory is not excluded, because no '*' is added to dirpattern
if it initially ends in a '/'.

I haven't actually tested the code I propose above, but I *think* it
implements the behaviour you want.
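
A quick standalone check of the existing logic against the proposal, using the pattern from the bug report. Both are restated as plain functions (`old_match` and `proposed_match` are illustrative names), so this exercises only the matching, not attic itself.

```python
import os
from fnmatch import fnmatchcase


def old_match(path, pattern):
    # Existing logic: '/*' is only appended when there is no trailing '/'.
    dirpattern = pattern if pattern.endswith('/') else pattern + '/*'
    parent, _name = os.path.split(path)
    return fnmatchcase(path, pattern) or fnmatchcase(parent + '/', dirpattern)


def proposed_match(path, pattern):
    # Proposal: a trailing '/' is turned into '/*' up front.
    if pattern.endswith('/'):
        pattern += '*'
    return fnmatchcase(path, pattern) or fnmatchcase(path, pattern + '/*')


paths = ["test-errors", "test-errors/dir", "test-errors/dir/j"]
old = [p for p in paths if old_match(p, "test-errors/")]
new = [p for p in paths if proposed_match(p, "test-errors/")]
print(old)  # ['test-errors/dir']
print(new)  # ['test-errors/dir', 'test-errors/dir/j']
```

In both cases the directory "test-errors" itself is kept. Only the proposal also excludes the nested path.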

Dan

Re: [attic] --exclude-from ?

From:
Dan Christensen
Date:
2014-02-07 @ 21:11
Dan Christensen <jdc@uwo.ca> writes:

> On second thought, since a pattern ending with '/' isn't useful,
> maybe it is better to give it the meaning you suggest (exclude the
> contents of the directory, but not the directory itself)

I've just sent a pull request with changes to IncludePattern,
ExcludePattern and PatternTestCase.  They keep the intended behaviour
as you described.

The changes to ExcludePattern fix this bug:

> Note that the existing code has a bug:
>
> $ attic create -v -e test-errors/ test.attic::exclude2 test-errors test-errors/dir/j
> test-errors
> test-errors/dir/j
>
> The subdirectory is not excluded, because no '*' is added to dirpattern
> if it initially ends in a '/'.

Also, I cache a compiled regular expression.  Even though both
fnmatchcase and re.match use caches, the cache lookups take a bit of
time.  E.g. for 20 patterns and 100000 files, it's 10 to 20 seconds with
the old code and 1-2 seconds with the new code.  For a backup where not
much changed, that would be noticeable.
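
A sketch of the caching idea, not necessarily the code in the pull request: `fnmatch.translate` turns a glob into a regular expression, which can be compiled once per pattern instead of being looked up on every call. The class name `CompiledExclude` and its two-regex layout are assumptions made for illustration.

```python
import re
from fnmatch import translate


class CompiledExclude:
    """Hypothetical variant that compiles each pattern once."""

    def __init__(self, pattern):
        if pattern.endswith('/'):
            pattern += '*'
        # One regex for an exact match, one for "pattern, then '/...'".
        self._exact = re.compile(translate(pattern))
        self._below = re.compile(translate(pattern + '/*'))

    def match(self, path):
        return (self._exact.match(path) is not None
                or self._below.match(path) is not None)


exclude = CompiledExclude("*.o")
print(exclude.match("/home/user/file.o"))    # True
print(exclude.match("/home/user/file.odt"))  # False
```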

For includes, I also fixed a different bug:

  $ mkdir test test/f test/foo
  $ touch test/f/bar test/foo/bar
  $ attic init test.attic
  ...
  $ attic create -v test.attic::1 test
  Initializing cache...
  test
  test/f
  test/f/bar
  test/foo
  test/foo/bar
  $ mv test test.save
  $ attic extract -v test.attic::1 test/f
  test/f
  test/f/bar
  test/foo/bar  # This shouldn't be here

And for includes, I explicitly ignore a trailing os.path.sep given in
the pattern, as I don't think there is any meaning that you want to give
to that.  (Right?)
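
The old IncludePattern code is not shown in this thread, so the cause sketched here is a guess: a plain string-prefix test is one way to get the symptom above, and requiring a component boundary avoids it. Both function names are hypothetical.

```python
def naive_include(path, prefix):
    # Plain string prefix test: also matches sibling names.
    return path.startswith(prefix)


def separator_aware_include(path, prefix):
    # Ignore a trailing separator, then require a component boundary.
    prefix = prefix.rstrip('/')
    return path == prefix or path.startswith(prefix + '/')


print(naive_include("test/foo/bar", "test/f"))            # True
print(separator_aware_include("test/foo/bar", "test/f"))  # False
print(separator_aware_include("test/f/bar", "test/f/"))   # True
```

With this approach the root path '/' is reduced to the empty string, and every absolute path then passes the boundary test.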

For both includes and excludes:

I use os.path.sep consistently.

I verified that the root path '/' is handled correctly.  As an include
pattern, it is stored as the empty string, so the repr may look funny,
but it works and makes sense.

I added test cases that detect both of the above bugs, and also a couple
of other test cases regarding trailing slashes.  All of the previous
tests pass.

I'll send another message about documentation.

Dan

Re: [attic] --exclude-from ?

From:
Dan Christensen
Date:
2014-02-07 @ 21:57
Ok, here's a try at some documentation:

Exclude patterns use a variant of shell pattern syntax, with '*'
matching any number of characters, '?' matching any single character,
'[...]' matching any single character specified, including ranges, and
'[!...]' matching any character not specified.  For the purpose of these
patterns, the path separator (e.g. '/' or '\') is not treated specially.
For a path to match a pattern, it must completely match from start to
end, or must match from the start to just before a path separator.
Except for the root path, paths will never end in the path separator
when matching is attempted.  Thus, if a given pattern ends in a path
separator, a '*' is appended before matching is attempted.

Examples:

# Exclude '/home/user/file.o' but not '/home/user/file.odt':
$ attic create -e '*.o' repo.attic /

# Exclude '/home/user/junk' and '/home/user/subdir/junk' but
# not '/home/user/importantjunk' or '/etc/junk':
$ attic create -e '/home/*/junk' repo.attic /

# Exclude the contents of '/home/user/cache' but not the
# directory itself:
$ attic create -e '/home/user/cache/' repo.attic /

# The file '/home/user/cache/important' is *not* backed up:
$ attic create -e '/home/user/cache/' repo.attic / /home/user/cache/important
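
The rule in the paragraph above can be sketched as a small function and checked against these examples. This is an illustration of the documented behaviour using '/' as the separator, not attic's code; the name `is_excluded` is made up.

```python
from fnmatch import fnmatchcase


def is_excluded(path, pattern):
    """Sketch of the documented rule."""
    if pattern.endswith('/'):
        pattern += '*'
    # Whole-path match, or match up to just before a '/'.
    return fnmatchcase(path, pattern) or fnmatchcase(path, pattern + '/*')


checks = [
    ('*.o', '/home/user/file.o'),
    ('*.o', '/home/user/file.odt'),
    ('/home/*/junk', '/home/user/subdir/junk'),
    ('/home/*/junk', '/home/user/importantjunk'),
    ('/home/*/junk', '/etc/junk'),
    ('/home/user/cache/', '/home/user/cache'),
    ('/home/user/cache/', '/home/user/cache/important'),
]
for pattern, path in checks:
    print(pattern, path, is_excluded(path, pattern))
```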

Feedback welcome.

But where should it go?  The Examples can go in usage.rst, but should
the paragraph be part of the epilog for create and extract (and verify,
for as long as it exists)?  Or could attic have a help command, so
the epilog for create and extract might say

  See "attic help patterns" for more information about pattern matching.

Mercurial does a combination of things.  E.g. "hg help commit" includes:

   See "hg help dates" for a list of formats valid for -d/--date.

and

  use "hg -v help commit" to show more info

Maybe "attic create -h -v" could include the details of how excludes
work, and even the examples as well?

Dan

Re: [attic] --exclude-from ?

From:
Jonas Borgström
Date:
2014-02-08 @ 13:38
On 2014-02-07 22:57, Dan Christensen wrote:
> Ok, here's a try at some documentation:
> 
> Exclude patterns use a variant of shell pattern syntax, with '*'
> matching any number of characters, '?' matching any single character,
> '[...]' matching any single character specified, including ranges, and
> '[!...]' matching any character not specified.  For the purpose of these
> patterns, the path separator (e.g. '/' or '\') is not treated specially.
> For a path to match a pattern, it must completely match from start to
> end, or must match from the start to just before a path separator.
> Except for the root path, paths will never end in the path separator
> when matching is attempted.  Thus, if a given pattern ends in a path
> separator, a '*' is appended before matching is attempted.
> 
> Examples:
> 
> # Exclude '/home/user/file.o' but not '/home/user/file.odt':
> $ attic create -e '*.o' repo.attic /
> 
> # Exclude '/home/user/junk' and '/home/user/subdir/junk' but
> # not '/home/user/importantjunk' or '/etc/junk':
> $ attic create -e '/home/*/junk' repo.attic /
> 
> # Exclude the contents of '/home/user/cache' but not the
> # directory itself:
> $ attic create -e '/home/user/cache/' repo.attic /
> 
> # The file '/home/user/cache/important' is *not* backed up:
> $ attic create -e '/home/user/cache/' repo.attic / /home/user/cache/important
>
> Feedback welcome.

That looks good, especially if the examples are included. Without them
it would probably take the average user some time to figure out how that
translates into command line arguments.

> But where should it go?  The Examples can go in usage.rst, but should
> the paragraph be part of the epilog for create and extract (and verify,
> for as long as it exists)?  Or could attic have a help command, so
> the epilog for create and extract might say
> 
>   See "attic help patterns" for more information about pattern matching.
> 
> Mercurial does a combination of things.  E.g. "hg help commit" includes:
> 
>    See "hg help dates" for a list of formats valid for -d/--date.
> 
> and
> 
>   use "hg -v help commit" to show more info
> 
> Maybe "attic create -h -v" could include the details of how excludes
> work, and even the examples as well?

I think the Mercurial way looks good. If I understand that correctly:

- attic <command> -h Works like it currently does but the epilog might
refer to other help topics (See "attic help patterns" for example)

- attic help <topic> can be used either as an alias for "attic <command>
-h" or to display other help topics such as patterns, encryption,
remote_repos, etc.

If we keep the help topics as restructured text we might be able to
include them into the html documentation as well.

/ Jonas

Re: [attic] --exclude-from ?

From:
Dan Christensen
Date:
2014-02-08 @ 14:47
Jonas Borgström <jonas@borgstrom.se> writes:

> I think the Mercurial way looks good. If I understand that correctly:
>
> - attic <command> -h Works like it currently does but the epilog might
> refer to other help topics (See "attic help patterns" for example)
>
> - attic help <topic> can be used either as an alias for "attic <command>
> -h" or to display other help topics such as patterns, encryption,
> remote_repos, etc.

Yes, that sounds good.  Mercurial hides "-h" from the user, but accepts
it.  Mercurial also allows "-v" to mean that more detailed help is
shown, but for now attic probably doesn't need that.  I think it's
cleaner to use "attic help topic" when more info is available.

> If we keep the help topics as restructured text we might be able to
> include them into the html documentation as well.

Sounds interesting.  Do you have an idea how this would work?
Currently, the plain text -h output is automatically included in the
.rst file, and while that works, I don't think it looks as nice as
properly formatted html.

For now, I've just submitted a pull request for something simpleminded
that at least gets the help text into the repo.  (With a couple of minor
updates: explain path separator; mention quoting of patterns; remove
quoting in Examples when not needed.)

Dan

Re: [attic] --exclude-from ?

From:
Dan Christensen
Date:
2014-02-07 @ 21:33
Dan Christensen <jdc@uwo.ca> writes:

> I verified that the root path '/' is handled correctly.  As an include
> pattern, it is stored as the empty string, so the repr may look funny,
> but it works and makes sense.

My latest commit stores every pattern with a '/' at the end, so
filenames will look funny, but the code is faster and shorter this way.

Dan