librelist archives

« back to archive

Padding distribution

Padding distribution

From:
Michael Rogers
Date:
2011-07-29 @ 11:59
Hi all,

I'm working on a problem that I guess some members of the list might
find interesting: given a message of size x, how to choose an amount of
padding, y, to add to the message, such that the size of the padded
message, x + y, reveals as little as possible about x?

"Reveals as little as possible" means that the entropy of an observer's
probability distribution over possible values of x, conditioned on x +
y, should be as high as possible.

Any thoughts about how to go about finding a good (deterministic or
probabilistic) function f(x) = y?

Thanks,
Michael

Re: [remailer] Padding distribution

From:
Lance Cottrell
Date:
2011-07-29 @ 14:09
Good hiding of information which could be used to track messages always 
requires a lot of overhead.

The decision in Mixmaster was to pad to a size that was not ridiculous for
small messages, and to break large messages in to independent units of 
that standard block size.

Any general padding function will always reveal that the message in 
question is smaller than the padded size. This is a particular tell for 
very large messages unless your padding function pads a significant 
fraction of messages to be extraordinarily large (with associated massive 
overhead).

It would help my analysis if I understood the context in which this would 
be used. Without knowing what the observer can see and what the space of 
observables looks like, the "entropy" is poorly defined.

	-Lance

On Jul 29, 2011, at 7:59 AM, Michael Rogers wrote:

> Hi all,
> 
> I'm working on a problem that I guess some members of the list might
> find interesting: given a message of size x, how to choose an amount of
> padding, y, to add to the message, such that the size of the padded
> message, x + y, reveals as little as possible about x?
> 
> "Reveals as little as possible" means that the entropy of an observer's
> probability distribution over possible values of x, conditioned on x +
> y, should be as high as possible.
> 
> Any thoughts about how to go about finding a good (deterministic or
> probabilistic) function f(x) = y?
> 
> Thanks,
> Michael
> 

--
Lance Cottrell
loki@obscura.com


Re: [remailer] Padding distribution

From:
Michael Rogers
Date:
2011-07-29 @ 17:07
Hi Lance,

On 29/07/11 15:09, Lance Cottrell wrote:
> It would help my analysis if I understood the context in which this
> would be used. Without knowing what the observer can see and what the
> space of observables looks like, the "entropy" is poorly defined.

Sorry, I should have explained the context in my first message. I'm
working on a communication system that uses a variety of different media
- memory sticks, Bluetooth, TCP connections, etc - to transport opaque
chunks of data. A higher-layer protocol interprets those chunks of data
and stitches the various media together into a delay-tolerant network.
But some of the media are susceptible to traffic analysis, so I'd like
to come up with a general-purpose mechanism that takes an opaque chunk
of data and pads it to conceal its size, as far as possible, from an
observer who may have some prior knowledge about likely sizes.

It would be possible to break each large chunk into smaller chunks, as
in Mixmaster, but the observer can still see how many of those smaller
chunks are being transmitted, so we don't get the same kind of
protection from that approach as Mixmaster does (where, IIRC, each chunk
is routed independently through the mix network?).

For some media it might be possible to conceal the true volume of
traffic by using a fixed transmission rate, buffering real traffic or
adding dummy traffic as necessary, but right now I'm focussing on the
media where that isn't possible - for example, transporting the data on
a memory stick or attaching it to an email.

Cheers,
Michael

Re: [remailer] Padding distribution

From:
Greg Rubino
Date:
2011-07-29 @ 15:13
On Fri, Jul 29, 2011 at 10:09 AM, Lance Cottrell <loki@obscura.com> wrote:
>
> Any general padding function will always reveal that the message in 
question is smaller than the padded size. This is a particular tell for 
very large messages unless your padding function pads a significant 
fraction of messages to be extraordinarily large (with associated massive 
overhead).
>

I was wondering about this...

Could the padding function actually include negative values and then
the negative padding could be a controlled amount of compression
applied to the message?  The resulting message being the original
message compressed to the point that it is shorter by the amount of
"negative" padding.  I think you could actually achieve this kind of
lengthening and shortening with an adaptation of Huffman's algorithm
where the comparison function used to determine the priority of the
subtrees inserted into the queue is modified such that it approaches
the desired size for the given input.  I'm not sure that such an
approach would actually hide information (because the encoding tree
would have to be transmitted along with the message), but it sounds
like it'd be a lot of fun to implement!

Just thinking out loud there...

--Greg

Re: [remailer] Padding distribution

From:
Lance Cottrell
Date:
2011-07-29 @ 15:41
I think this could be equivalently represented as padding everything as 
before but starting from a maximally compressed version of every message.

To me it looks like it would be an equivalent situation so no benefit.

	-Lance

On Jul 29, 2011, at 11:13 AM, Greg Rubino wrote:

> On Fri, Jul 29, 2011 at 10:09 AM, Lance Cottrell <loki@obscura.com> wrote:
>> 
>> Any general padding function will always reveal that the message in 
question is smaller than the padded size. This is a particular tell for 
very large messages unless your padding function pads a significant 
fraction of messages to be extraordinarily large (with associated massive 
overhead).
>> 
> 
> I was wondering about this...
> 
> Could the padding function actually include negative values and then
> the negative padding could be a controlled amount of compression
> applied to the message?  The resulting message being the original
> message compressed to the point that it is shorter by the amount of
> "negative" padding.  I think you could actually achieve this kind of
> lengthening and shortening with an adaptation of Huffman's algorithm
> where the comparison function used to determine the priority of the
> subtrees inserted into the queue is modified such that it approaches
> the desired size for the given input.  I'm not sure that such an
> approach would actually hide information (because the encoding tree
> would have to be transmitted along with the message), but it sounds
> like it'd be a lot of fun to implement!
> 
> Just thinking out loud there...
> 
> --Greg
> 

--
Lance Cottrell
loki@obscura.com


Re: [remailer] Padding distribution

From:
Greg Rubino
Date:
2011-07-29 @ 16:37
On Fri, Jul 29, 2011 at 11:41 AM, Lance Cottrell <loki@obscura.com> wrote:
> I think this could be equivalently represented as padding everything as 
before but starting from a maximally compressed version of every message.

I think I see what you mean, but...

I am n00b to information theory (in case that wasn't already obvious),
so can you provide a more precision wrt what you mean by equivalence?
That would probably help me think about the problem as well.

Thanks,

Greg

Re: [remailer] Padding distribution

From:
Lance Cottrell
Date:
2011-07-29 @ 17:33
As I understand the proposal, some files would get compressed by somewhere
between 0% and whatever the best available compression system can manage 
(m%), or the file can get padded up to some larger size from 0% to some 
maximum (n%).

So the file (original size X)  ends up being being a new size Y somewhere 
in the range  m X < Y < n X

This is really just the same as saying the original file is size m X and 
could be padded up to n X.

Thinking about both shrinking and growing makes it more complicated, so 
lets assume it has been shrunk completely and then only talk about padding
it by various amounts.

	-Lance


On Jul 29, 2011, at 12:37 PM, Greg Rubino wrote:

> On Fri, Jul 29, 2011 at 11:41 AM, Lance Cottrell <loki@obscura.com> wrote:
>> I think this could be equivalently represented as padding everything as
before but starting from a maximally compressed version of every message.
> 
> I think I see what you mean, but...
> 
> I am n00b to information theory (in case that wasn't already obvious),
> so can you provide a more precision wrt what you mean by equivalence?
> That would probably help me think about the problem as well.
> 
> Thanks,
> 
> Greg
> 

--
Lance Cottrell
loki@obscura.com


Re: [remailer] Padding distribution

From:
Tom Ritter
Date:
2011-07-29 @ 12:43
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> Any thoughts about how to go about finding a good (deterministic or 
> probabilistic) function f(x) = y?

I've thought a bit about this myself.  These are just ideas, nothing
formal.  Any padding of length y reveals that the message x is of a
length <= y, possibly after being compressed.  So there's always an
information leak there.

My first thought was padding 'buckets'.  For example: any message of
length x is padded to min(1024, 2^ceil(log2(x))).  A message of length
4095 is padded to 4096, but a message of length 2049 is padded to the
same.   And then small messages are always padded up to 1024.  A
disadvantage is you can say with certainty what the range of lengths for
the message was (e.g. 2087-4096).  The base is obviously adjustable to
widen the range.

Let's say a message was padded randomly, a message between 2087-4096 can
be padded anywhere from 0-maxrand bytes.  If you choose a sufficiently
large maxrand number, you get a good distribution.  But what must
maxrand be to really achieve a good distribution?  If it's <= x, then
the total length will be padded to <= 2x, and you can determine that a
message length is between y/2 and y.  Any maxrand based on the message
size would give a similar equation, and any constant maxrand would
provide a smaller-than-previous range when a message large enough is
sent.  Large values or large equations for maxrand expands the range,
but the range and equations still exists.  And there's the practical
problem that you don't want to be transmitting 2MB for a 2KB message.

If Alice and Bob exchange dozens of messages all of size 2048: is more
or less information gained from knowing their messages are 1029-2048 OR
watching them exchange dozens of messages, plotting their ranges on a
number line, finding the area of most overlap, and making a conjecture
narrowing the range?  Is more security gained first by adding a maxrand
value, and *then* padding it to a bucketsize?  Should there be a set of
equations determining maxrand, and an equation chosen at random?

I don't know, I certainly haven't run any statistical tests, and I'd
probably have a flaw in my tests if I tried.  But it's a thought.

- -tom
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)

iEYEARECAAYFAk4yqw4ACgkQJZJIJEzU09sm7wCgidpZCKqwXZQVy1rLUzy6Xsvg
UgAAnR+y8y+7nelytVc8z2wcpj/6WJu7
=WjQG
-----END PGP SIGNATURE-----

Re: [remailer] Padding distribution

From:
Michael Rogers
Date:
2011-07-29 @ 16:41
Hi Tom,

Thanks for the reply.

On 29/07/11 13:43, Tom Ritter wrote:
> If Alice and Bob exchange dozens of messages all of size 2048: is more
> or less information gained from knowing their messages are 1029-2048 OR
> watching them exchange dozens of messages, plotting their ranges on a
> number line, finding the area of most overlap, and making a conjecture
> narrowing the range?  Is more security gained first by adding a maxrand
> value, and *then* padding it to a bucketsize?  Should there be a set of
> equations determining maxrand, and an equation chosen at random?

The buckets idea is something I'd considered too, but the protection it
provides seems to depend heavily on the observer's prior knowledge about
likely message sizes. To take an extreme example, if the observer knows
that x must be either 4096 or 4097, the buckets scheme provides no
protection (the observer learns with certainty which value x has by
observing whether the padded message is 4096 or 8092 bytes long), but if
the observer knows that x must be either 4095 or 4096, the buckets
scheme provides perfect protection (the observer learns nothing because
the padded message is 4096 bytes long in either case).

That suggests we might need to take the observer's prior knowledge into
account when saying anything about the strength of a given padding
scheme. But can we still make general statements about how a given
scheme transforms the observer's prior knowledge into posterior knowledge?

For example, let's say we have a function sample() that returns a random
sample from some probability distribution over the positive integers. To
pad a message of size x, we keep calling sample() until it returns a
value >= x, and that's the size of our padded message.

Now, the observer, who has some prior distribution over possible values
of x, sees the padded message, and can calculate, for each possible
value of x, the probability of padding that value to the observed size.
So now the observer has a posterior distribution over possible values of x.

But that's where I get stuck. Can we say anything like "using
distribution A for sample() rather than distribution B leaves the
observer closer to the prior, therefore distribution A is preferable",
or does everything depend on the exact form of the prior (as in the
buckets example)?

Cheers,
Michael

Re: [remailer] Padding distribution

From:
Nick Mathewson
Date:
2011-08-01 @ 16:25
On Fri, Jul 29, 2011 at 7:59 AM, Michael Rogers <m--@gmx.com> wrote:
> Hi all,
>
> I'm working on a problem that I guess some members of the list might
> find interesting: given a message of size x, how to choose an amount of
> padding, y, to add to the message, such that the size of the padded
> message, x + y, reveals as little as possible about x?
>
> "Reveals as little as possible" means that the entropy of an observer's
> probability distribution over possible values of x, conditioned on x +
> y, should be as high as possible.
>
> Any thoughts about how to go about finding a good (deterministic or
> probabilistic) function f(x) = y?

Well, it's trivial to come up with something that meets your stated
needs without actually being a good idea: choose y = C - x for some
constant C larger than any larger possible value of x.  The total
value x+y is now equal to C, which reveals nothing about x.

Of course, this is probably not what you want, since if you want to
support messages up to, say, 100 MB, you would then have to make
_every_ message padded to at least 100MB long.

Here's another example that does what you asked for while at the same
time being an even *worse* idea: choose y = C - x + r() where C is a
constant larger than any possible value of x, .and r() is a large
randomly chosen integer.

Why is this worse?  Clearly, x+y doesn't reveal anything about the
size of x... but if this mix net is one where messages remain at
constant size as they move through the network, then having messages
of many different sizes makes those messages less likely to provide
cover for one another.  If message sizes change, but the way in which
they change preserves any information about the original size (for
instance, they only get shorter), this can still be used to help trace
messages.

If we model a large eavesdropping adversary in terms of a long-term
intersection attack (for example, Statistical Disclosure and its
variants), then having multiple possible message sizes effectively
puts every message on a separate mixnet.



-- 
Nick

Re: [remailer] Padding distribution

From:
Michael Rogers
Date:
2011-08-02 @ 11:04
Hi Nick,

On 01/08/11 17:25, Nick Mathewson wrote:
> Well, it's trivial to come up with something that meets your stated
> needs without actually being a good idea: choose y = C - x for some
> constant C larger than any larger possible value of x.  The total
> value x+y is now equal to C, which reveals nothing about x.
> 
> Of course, this is probably not what you want, since if you want to
> support messages up to, say, 100 MB, you would then have to make
> _every_ message padded to at least 100MB long.

Good point, I should have said something about the cost of padding.
Ideally I'd like to be able to measure the "effectiveness" of a given
padding scheme as well as its cost. Then an appropriate tradeoff can be
chosen for each application. We already know how to define cost, so the
question is how to define effectiveness.

> If we model a large eavesdropping adversary in terms of a long-term
> intersection attack (for example, Statistical Disclosure and its
> variants), then having multiple possible message sizes effectively
> puts every message on a separate mixnet.

Interesting. The scenario I have in mind isn't a mixnet - each relay can
replace the received padding with its own padding before forwarding the
message, so a message can change size while moving through the network -
but I guess it's still useful to ask whether it's more effective against
statistical attacks to pad the message to the same size each time (as in
Tom's buckets scheme, for example), or to an independently chosen size
each time.

Cheers,
Michael

Re: [remailer] Padding distribution

From:
Nick Mathewson
Date:
2011-08-02 @ 14:33
On Tue, Aug 2, 2011 at 7:04 AM, Michael Rogers <m--@gmx.com> wrote:
 [...]
> Interesting. The scenario I have in mind isn't a mixnet - each relay can
> replace the received padding with its own padding before forwarding the
> message, so a message can change size while moving through the network -

If every relay can replace padding with new padding, then won't the
relays have to know how much padding they can safely replace?  And if
they know how much padding they can safely replace, won't that tell
the relays something about the original message size?

I can imagine some scenarios where you can tell how much padding you
can remove.  For one example, there are plenty of mixnet designs where
a fixed-size header is removed at each hop and replaced by a fixed
amount of padding.  In these systems, the amount removed and replaced
reveals nothing about the underlying message size.  For another
example, there might be a mixnet design where padding is only added at
each step, but never removed.  In this case, you could learn some
information about messages by noting that a longer message would never
turn into a shorter message.

> but I guess it's still useful to ask whether it's more effective against
> statistical attacks to pad the message to the same size each time (as in
> Tom's buckets scheme, for example), or to an independently chosen size
> each time.

I think you might need to back up and try to figure out what kind of
threat model you're trying to resist here, and what kind of properties
you're trying to provide in the presence of that threat.

-- 
Nick

Re: [remailer] Padding distribution

From:
Michael Rogers
Date:
2011-08-02 @ 16:17
On 02/08/11 15:33, Nick Mathewson wrote:
> If every relay can replace padding with new padding, then won't the
> relays have to know how much padding they can safely replace?  And if
> they know how much padding they can safely replace, won't that tell
> the relays something about the original message size?

Yes - in my scenario that's not a problem, since the relays can see the
plaintext of the messages they're forwarding. I'll explain more below.

> I think you might need to back up and try to figure out what kind of
> threat model you're trying to resist here, and what kind of properties
> you're trying to provide in the presence of that threat.

OK - I was hoping this might be a general-purpose problem that would be
somewhat independent of my particular scenario, but perhaps that was
wishful thinking and I should explain the details of my scenario.

The network I'm considering is a delay-tolerant publish-subscribe
overlay, broadly similar to Usenet. Each node in the overlay is owned
and operated by an individual, and may be hosted on a personal computer
or a mobile device such as a smartphone. Direct communication only
occurs between nodes whose operators know and trust one another and have
exchanged encryption and authentication keys face-to-face.

Users (node operators) subscribe to discussion groups. Each group forms
a connected subgraph of the overlay. Subscribers can post messages to
groups. Each message is flooded to the group's subscribers across the
edges of the overlay that connect subscribers. Nodes do not receive
messages for groups to which they do not subscribe.

Connections between neighbouring nodes may use a wide range of
underlying media, from memory sticks to Bluetooth to TCP/IP. These media
have a wide range of characteristics, so I divide them into two classes:

1. Batch-mode connections. This class includes media that don't attempt
to provide bidirectional, reliable, ordered, timely delivery of data. A
memory stick tied to a carrier pigeon would be an example of a
batch-mode connection.

2. Stream-mode connections. This class includes media that attempt
(though don't necessarily succeed) to provide bidirectional, reliable,
ordered, timely delivery of data. A Bluetooth RFCOMM connection would be
an example of a stream-mode connection.

The threat model is as follows:

* All long-range communication infrastructure (internet, phone network,
etc) is monitored by the adversary.

* The adversary has a limited ability to monitor short-range
communication infrastructure (Bluetooth, wifi, etc).

* The adversary may operate an unlimited number of overlay nodes.

* The adversary only has a limited ability to persuade users to trust
the operators of the adversary's nodes - thus the size of the cut
between the adversary's nodes and the rest of the network is limited.

* The adversary can't break standard cryptographic primitives, and
therefore can't decrypt or undetectably modify communication between
non-adversarial nodes.

* There are some users who can keep their nodes secure - those who can't
are counted as working for the adversary.

Now, obviously there are a million issues to consider here, but the one
I'm currently interested in is how much the adversary can learn about
which messages are travelling across which connections by observing the
encrypted traffic on those connections.

This is a well-studied problem for black box models where the adversary
sees the inputs and outputs of an anonymity system and wants to match
inputs to outputs, but the situation here is slightly different - both
because the adversary has a partial view, and because the system as a
whole doesn't have inputs and outputs (though each node does).

So I thought I'd go back to square one and try to establish some basic
metrics, such as a metric for how much the size of a padded message
(together with the observer's prior knowledge) reveals about the size of
the original message - but maybe that's the wrong approach and there's a
better way to carve up the problem?

Thanks for your patience in reading this far!

Cheers,
Michael

Re: [remailer] Padding distribution

From:
Lance Cottrell
Date:
2011-08-02 @ 15:31
Yes


On Aug 2, 2011, at 10:33 AM, Nick Mathewson wrote:

> On Tue, Aug 2, 2011 at 7:04 AM, Michael Rogers <m--@gmx.com> wrote:
> [...]
>> Interesting. The scenario I have in mind isn't a mixnet - each relay can
>> replace the received padding with its own padding before forwarding the
>> message, so a message can change size while moving through the network -
> 
> If every relay can replace padding with new padding, then won't the
> relays have to know how much padding they can safely replace?  And if
> they know how much padding they can safely replace, won't that tell
> the relays something about the original message size?
> 
> I can imagine some scenarios where you can tell how much padding you
> can remove.  For one example, there are plenty of mixnet designs where
> a fixed-size header is removed at each hop and replaced by a fixed
> amount of padding.  In these systems, the amount removed and replaced
> reveals nothing about the underlying message size.  For another
> example, there might be a mixnet design where padding is only added at
> each step, but never removed.  In this case, you could learn some
> information about messages by noting that a longer message would never
> turn into a shorter message.
> 
>> but I guess it's still useful to ask whether it's more effective against
>> statistical attacks to pad the message to the same size each time (as in
>> Tom's buckets scheme, for example), or to an independently chosen size
>> each time.
> 
> I think you might need to back up and try to figure out what kind of
> threat model you're trying to resist here, and what kind of properties
> you're trying to provide in the presence of that threat.
> 
> -- 
> Nick
> 

--
Lance Cottrell
loki@obscura.com