librelist archives

RFC: ETags

From: Zed A. Shaw
Date: 2010-07-14 @ 17:46
I did up some basic ETags and stuff last night, and I'm now working on
finishing up the file serving features.  There's going to be some basic
caching of the headers/stat/file junk in the server, and I'm going with
a policy that etags are fixed at:

Etag: mtime-size

So this is what a response looks like:

---------------------
$ curl -i http://localhost:6767/tests/sample.html
HTTP/1.1 200 OK
Date: Wed, 14 Jul 10 17:40:41 +0000
Content-Type: text/plain
Content-Length: 9
Last-Modified: Mon, 05 Jul 10 08:14:20 +0000
ETag: 4c31945c-9
Connection: keep-alive

hi there
---------------------
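
For reference, the ETag in the response above is consistent with the
file's mtime and size each printed in lowercase hex and joined with a
dash: 0x4c31945c is 1278317660, the Unix time for the Last-Modified
value shown. Here is a minimal Python sketch of that scheme. The
function names are made up for illustration, and printing the size in
hex is an assumption (a size of 9 looks the same in decimal).

```python
import os


def make_etag(mtime: int, size: int) -> str:
    """Build an mtime-size ETag; both fields in lowercase hex (assumed)."""
    return f"{mtime:x}-{size:x}"


def etag_for_path(path: str) -> str:
    """Stat a file and derive its ETag from st_mtime and st_size."""
    st = os.stat(path)
    return make_etag(int(st.st_mtime), st.st_size)


# Values taken from the sample response: mtime of Mon, 05 Jul 2010
# 08:14:20 UTC and a Content-Length of 9.
print(make_etag(1278317660, 9))  # 4c31945c-9
```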

Questions I have are:

1. Should Etags be mtime-size-ctime?  Only cost is the extra bytes of
transfer.
2. I'm allowing keep-alive if the request is "small", but doing
connection:close if it's a large request.  This will cut down on hogs,
but I'm curious what people think of that?
3. No directory listings OK?
4. Assuming index.html for now alright?  This will become an option
later, but for now just keeping it simple.
5. Any other "file serving" things you wished were available?

Also, if anyone knows of some ninja tricks to get at CRC32 or MD5 hashes
the filesystem knows about let me know.  I may tinker with using that
instead of these mtime-size etags.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] RFC: ETags

From: Andreas Krennmair
Date: 2010-07-14 @ 20:45
* Zed A. Shaw <zedshaw@zedshaw.com> [2010-07-14 19:50]:
>I did up some basic ETags and stuff last night, and now working on
>finishing up the file serving features.  There's going to be some basic
>caching of the headers/stat/file junk in the server, and I'm going with
>a policy that etags are fixed at:
>
>Etag: mtime-size

Hmm, that looks too weak for me. Apache's approach of also incorporating a
file's inode is good enough in practice to guarantee freedom from
collisions, even though it has its own problems (like the etag validator
producing false negatives on redundant systems with mirrored files). But
different files with the same mtime and size are just too easy to generate.
But, to be fair, the HTTP/1.1 requirements on etags and strong validators are
IMHO too strict to be fulfilled in practice. If I argued in
RFC-lawyer-asshole mode, I'd say that even Apache violates the relevant
sections in the HTTP/1.1 RFC.
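
To make the collision concern concrete, here is a small sketch (not
Mongrel2 code) that overwrites a file with different content of the
same length and then resets its mtime. The helper names and the hex
formatting of the ETag are assumptions for illustration.

```python
import os
import tempfile


def etag_for_path(path):
    """mtime-size ETag, both fields in lowercase hex (assumed format)."""
    st = os.stat(path)
    return f"{int(st.st_mtime):x}-{st.st_size:x}"


def write_with_mtime(path, data, mtime):
    """Write data, then force the file's atime and mtime to a fixed value."""
    with open(path, "wb") as f:
        f.write(data)
    os.utime(path, (mtime, mtime))


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.html")
    write_with_mtime(path, b"hi there\n", 1278317660)
    before = etag_for_path(path)
    # Different bytes, same length, same mtime.
    write_with_mtime(path, b"bye then\n", 1278317660)
    after = etag_for_path(path)

# The validator is unchanged even though the content differs.
print(before, after, before == after)
```

Whether this matters in practice depends on how files get deployed;
tools that preserve or reset mtimes make it more likely.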

>1. Should Etags be mtime-size-ctime?  Only cost is the extra bytes of
>transfer.

I think that adds no value.

>3. No directory listings OK?

Strictly speaking, directory listings are a relic, and since Mongrel2 is 
focusing on web applications, I don't think it's a necessity.

>Also, if anyone knows of some ninja tricks to get at CRC32 or MD5 hashes
>the filesystem knows about let me know.  I may tinker with using that
>instead of these mtime-size etags.

That was my thought, too - if only the operating system already kept a current 
file hash available for us. But I'm not aware of anything even remotely 
portable. ZFS probably has something available, as it checksums virtually 
everything, but ZFS isn't available on any but a few operating systems.

Regards,
Andreas

Re: [mongrel2] RFC: ETags

From: Zed A. Shaw
Date: 2010-07-15 @ 01:05
On Wed, Jul 14, 2010 at 10:45:36PM +0200, Andreas Krennmair wrote:
> * Zed A. Shaw <zedshaw@zedshaw.com> [2010-07-14 19:50]:
> >Etag: mtime-size
> 
> Hmm, that looks too weak for me. Apache's approach of also incorporating a 
> file's inode is an approach good enough in practice to guarantee freedom from

Nope, inode breaks when you've got multiple boxes.  First thing people
do is disable that for mtime-size when they get a 2nd box (or they
should).
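
A rough illustration of the multi-box problem, simulated here with two
copies of a file in one temporary directory standing in for two
servers. The field order inode-mtime-size follows the description in
this thread and may not match Apache's actual output; the function
names are hypothetical.

```python
import os
import shutil
import tempfile


def mtime_size_etag(st):
    """ETag from mtime and size only."""
    return f"{int(st.st_mtime):x}-{st.st_size:x}"


def inode_mtime_size_etag(st):
    """ETag that also includes the inode number."""
    return f"{st.st_ino:x}-{int(st.st_mtime):x}-{st.st_size:x}"


with tempfile.TemporaryDirectory() as d:
    box1 = os.path.join(d, "box1-sample.html")
    box2 = os.path.join(d, "box2-sample.html")
    with open(box1, "wb") as f:
        f.write(b"hi there\n")
    os.utime(box1, (1278317660, 1278317660))
    # copy2 copies the contents and preserves the mtime, much like a
    # mirroring tool would; the copy still gets its own inode.
    shutil.copy2(box1, box2)
    st1, st2 = os.stat(box1), os.stat(box2)
    plain_match = mtime_size_etag(st1) == mtime_size_etag(st2)
    inode_match = inode_mtime_size_etag(st1) == inode_mtime_size_etag(st2)

print(plain_match, inode_match)
```

With the inode included, a client bouncing between the two copies
would never get a validator match and would re-download every time.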

> collisions, even though it suffers its own problems (like the etag validator 
> producing false negatives on redundant systems with mirrored files). But
> different files with the same mtime and size are just too easy to generate.  

Well, I think you're confusing what the etag does.  I don't find the
file by the etag, I find the file, then compare the etags.  So, if it's
the same path and has the same mtime and the same length, then it's a
duck.
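
The "find the file, then compare the etags" flow could be sketched
like this. It is an illustration only: respond() is a made-up helper,
the comparison is an exact match on the unquoted value as printed in
this thread, and real If-None-Match handling (quoted values, lists,
"*") is left out.

```python
import os
import tempfile


def etag_for_path(path):
    """mtime-size ETag, both fields in lowercase hex (assumed format)."""
    st = os.stat(path)
    return f"{int(st.st_mtime):x}-{st.st_size:x}"


def respond(path, if_none_match=None):
    """Return (status, headers) for a GET of an already-resolved path."""
    etag = etag_for_path(path)
    headers = {"ETag": etag}
    if if_none_match == etag:
        return 304, headers  # client copy is current, send no body
    return 200, headers      # send the file


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.html")
    with open(path, "wb") as f:
        f.write(b"hi there\n")
    os.utime(path, (1278317660, 1278317660))

    first = respond(path)
    revalidated = respond(path, if_none_match=first[1]["ETag"])
    stale = respond(path, if_none_match="4c31945b-9")

print(first[0], revalidated[0], stale[0])
```

The path is resolved before any ETag is looked at, so two different
files sharing an mtime and size never get confused with each other;
only a same-path, same-mtime, same-size rewrite slips through.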

The only totally unbustable way is with a crc32 or md5, but that I think
I'll leave for the module system later on so people can cook up
whatever.  For now this is the best practice so I'll go with it to get
the feature done.

> >Also, if anyone knows of some ninja tricks to get at CRC32 or MD5 hashes
> >the filesystem knows about let me know.  I may tinker with using that
> >instead of these mtime-size etags.
> 
> That was my thought, too - if only the operating system already kept a current 
> file hash available for us. But I'm not aware of anything even remotely 
> portable. ZFS probably has something available, as it checksums virtually 
> everything, but ZFS isn't available on any but a few operating systems.

I know they have *some* hash internally, just not sure WTF is available.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] RFC: ETags

From: John Aughey
Date: 2010-07-14 @ 22:21
The important thing is what happens if the etag is wrong. If it forces a
re-download of the data, it takes more time but the behavior is correct. If
it is wrong and the data actually changed, then it's just plain wrong.

Don't you think that a site that cares enough about this will be using a CDN
or other server set for static content anyway?  Let mongrel2 focus on the
dynamic content and not worry about the extreme configuration cases. It
should be correct and fast, but scope the solution appropriately.

John Aughey

Re: [mongrel2] RFC: ETags

From: Zed A. Shaw
Date: 2010-07-15 @ 01:06
On Wed, Jul 14, 2010 at 06:21:26PM -0400, John Aughey wrote:
> The important thing is what happens if the etag is wrong. If it forces a
> re-download of the data, it takes more time but the behavior is correct. If
> it is wrong and the data actually changed, then it's just plain wrong.
> 
> Don't you think that a site that cares enough about this will be using a CDN
> or other server set for static content anyway?  Let mongrel2 focus on the
> dynamic content and not worry about the extreme configuration cases. It
> should be correct and fast, but scope the solution appropriately.

I agree! :-)

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] RFC: ETags

From: Eric Wong
Date: 2010-07-14 @ 19:45
"Zed A. Shaw" <zedshaw@zedshaw.com> wrote:
> I did up some basic ETags and stuff last night, and now working on
> finishing up the file serving features.  There's going to be some basic
> caching of the headers/stat/file junk in the server, and I'm going with
> a policy that etags are fixed at:
> 
> Etag: mtime-size
> 
> So this is what a response looks like:
> 
> ---------------------
> $ curl -i http://localhost:6767/tests/sample.html
> HTTP/1.1 200 OK
> Date: Wed, 14 Jul 10 17:40:41 +0000
> Content-Type: text/plain
> Content-Length: 9
> Last-Modified: Mon, 05 Jul 10 08:14:20 +0000
> ETag: 4c31945c-9
> Connection: keep-alive
> 
> hi there
> ---------------------
> 
> Questions I have are:
> 
> 1. Should Etags be mtime-size-ctime?  Only cost is the extra bytes of
> transfer.

ctime is not possible to synchronize across multiple machines if
you're load balancing, so stick with mtime-size.

> 2. I'm allowing keep-alive if the request is "small", but doing
> connection:close if it's a large request.  This will cut down on hogs,
> but I'm curious what people think of that?

I've been debating that myself.  I think it's alright to always enable
keepalive for idempotent requests like GET/HEAD.  Maybe clients can be
smart enough to make a decision to fire off a GET request in another
connection if it notices a large response being sent....

Mainstream browsers double the number of parallel connections if they
detect keep-alive is off, so in some cases it can lead to better
performance if there are large transfers while small ones are happening
(and it can also hurt the server).
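
The policy from question 2 might look something like the sketch below.
It reads "small" as the size of the file being served, which is an
assumption, and the 64 KiB cutoff and the function name are invented
for illustration.

```python
# Hypothetical cutoff; the thread does not say what counts as "small".
KEEPALIVE_MAX_BYTES = 64 * 1024


def connection_header(content_length, limit=KEEPALIVE_MAX_BYTES):
    """Pick the Connection header value from the response body size."""
    return "keep-alive" if content_length <= limit else "close"


print(connection_header(9))          # the 9-byte sample file
print(connection_header(10_000_000)) # a large download
```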

> Also, if anyone knows of some ninja tricks to get at CRC32 or MD5 hashes
> the filesystem knows about let me know.  I may tinker with using that
> instead of these mtime-size etags.

Since you already use sqlite, I would just lazily compute them and store
them in sqlite.  You can compute+store them as extended attributes, too,
but not everybody has or enables them.
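
A minimal sketch of the lazy sqlite idea, using Python's sqlite3 and
hashlib modules rather than anything from Mongrel2. The table name,
column layout, and function names are assumptions.

```python
import hashlib
import os
import sqlite3
import tempfile


def open_cache(db_path=":memory:"):
    """Open (or create) the hash cache; the schema is illustrative."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS etag_cache ("
        "path TEXT PRIMARY KEY, mtime INTEGER, size INTEGER, md5 TEXT)"
    )
    return db


def md5_etag(db, path):
    """Return (md5_hex, was_cached), hashing the file only on a miss."""
    st = os.stat(path)
    mtime, size = int(st.st_mtime), st.st_size
    row = db.execute(
        "SELECT md5 FROM etag_cache WHERE path = ? AND mtime = ? AND size = ?",
        (path, mtime, size),
    ).fetchone()
    if row is not None:
        return row[0], True
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    db.execute(
        "INSERT OR REPLACE INTO etag_cache VALUES (?, ?, ?, ?)",
        (path, mtime, size, digest),
    )
    db.commit()
    return digest, False


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.html")
    with open(path, "wb") as f:
        f.write(b"hi there\n")
    db = open_cache()
    first = md5_etag(db, path)   # computed, then stored
    second = md5_etag(db, path)  # served from the cache
    db.close()

print(first[1], second[1])
```

Note that the cached row is still invalidated by mtime and size, so
this inherits the same blind spot as a plain mtime-size ETag unless
the hash is recomputed some other way.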

-- 
Eric Wong

Re: [mongrel2] RFC: ETags

From: Zed A. Shaw
Date: 2010-07-14 @ 20:29
On Wed, Jul 14, 2010 at 12:45:47PM -0700, Eric Wong wrote:
> > 1. Should Etags be mtime-size-ctime?  Only cost is the extra bytes of
> > transfer.
> 
> ctime is not possible to synchronize across multiple machines if
> you're load balancing, so stick with mtime-size.

Ahhh, hadn't thought of that.

> > 2. I'm allowing keep-alive if the request is "small", but doing
> > connection:close if it's a large request.  This will cut down on hogs,
> > but I'm curious what people think of that?
> 
> Mainstream browsers double the number of parallel connections if they
> detect keep-alive is off, so in some cases it can lead to better
> performance if there are large transfers while small ones are happening
> (and it can also hurt the server).

Yeah, I'll have to test how it works in practice, and maybe just make it
an option.

> > Also, if anyone knows of some ninja tricks to get at CRC32 or MD5 hashes
> > the filesystem knows about let me know.  I may tinker with using that
> > instead of these mtime-size etags.
> 
> Since you already use sqlite, I would just lazily compute them and store
> them in sqlite.  You can compute+store them as extended attributes, too,
> but not everybody has or enables them.

Well, I keep trying to find how you access this information, and it's
basically impossible.  So, oh well.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] RFC: ETags

From: Alex Gartrell
Date: 2010-07-14 @ 23:18
On Wed, Jul 14, 2010 at 4:29 PM, Zed A. Shaw <zedshaw@zedshaw.com> wrote:

> On Wed, Jul 14, 2010 at 12:45:47PM -0700, Eric Wong wrote:
> > > 1. Should Etags be mtime-size-ctime?  Only cost is the extra bytes of
> > > transfer.
> >
> > ctime is not possible to synchronize across multiple machines if
> > you're load balancing, so stick with mtime-size.
>
> Ahhh, hadn't thought of that.
>
> > > 2. I'm allowing keep-alive if the request is "small", but doing
> > > connection:close if it's a large request.  This will cut down on hogs,
> > > but I'm curious what people think of that?
> >
> > Mainstream browsers double the number of parallel connections if they
> > detect keep-alive is off, so in some cases it can lead to better
> > performance if there are large transfers while small ones are happening
> > (and it can also hurt the server).
>
> Yeah, I'll have to test how it works in practice, and maybe just make it
> an option.
>
> > > Also, if anyone knows of some ninja tricks to get at CRC32 or MD5
> hashes
> > > the filesystem knows about let me know.  I may tinker with using that
> > > instead of these mtime-size etags.
> >
> > Since you already use sqlite, I would just lazily compute them and store
> > them in sqlite.  You can compute+store them as extended attributes, too,
> > but not everybody has or enables them.
>
> Well, I keep trying to find how you access this information, and it's
> basically impossible.  So, oh well.
>

I think he's asking why we don't just wait for a request for the file, check
for its md5 in a cache, and otherwise either calculate it immediately (a
slow path on the send back that does read, then hash, then send)
or omit the ETag and let some background service or low-priority task
populate the entry in the cache.
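
The second option (omit the ETag on a miss and fill the cache later)
could be sketched as follows. The names are hypothetical, a plain dict
and deque stand in for a real cache and task queue, and read_file is
passed in so the example runs without touching the disk.

```python
import hashlib
from collections import deque

cache = {}         # path -> md5 hex digest
pending = deque()  # paths waiting for the background task


def etag_or_none(path):
    """Return the cached ETag, or None (and queue the path) on a miss."""
    etag = cache.get(path)
    if etag is None and path not in pending:
        pending.append(path)
    return etag


def run_background_once(read_file):
    """Drain the queue, hashing each file with the supplied reader."""
    while pending:
        path = pending.popleft()
        cache[path] = hashlib.md5(read_file(path)).hexdigest()


files = {"/tests/sample.html": b"hi there\n"}

miss = etag_or_none("/tests/sample.html")  # first request: no ETag sent
run_background_once(files.__getitem__)     # low-priority task catches up
hit = etag_or_none("/tests/sample.html")   # later requests get the hash

print(miss, hit is not None)
```

The fast path never blocks on hashing; the cost is that the first
response for each file goes out without a validator.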


Re: [mongrel2] RFC: ETags

From: Zed A. Shaw
Date: 2010-07-15 @ 01:00
On Wed, Jul 14, 2010 at 07:18:08PM -0400, Alex Gartrell wrote:
> On Wed, Jul 14, 2010 at 4:29 PM, Zed A. Shaw <zedshaw@zedshaw.com> wrote:
> > Well, I keep trying to find how you access this information, and it's
> > basically impossible.  So, oh well.
> >
> 
> I think he's asking why we don't just wait for a request for the file, check
> for its md5 in a cache, and otherwise either calculate it immediately (a
> slowpath on the send back that does read then send with a hash in between)
> or omit the ETag and let some background service or low priority task
> populate the entry in the cache.

I thought about that, but I think I'll hold that for later when there's
a module system and people can cook up their own crazy schemes.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] RFC: ETags

From: Billy Gray
Date: 2010-07-14 @ 19:32
On Wed, Jul 14, 2010 at 1:46 PM, Zed A. Shaw <zedshaw@zedshaw.com> wrote:

>
> 5. Any other "file serving" things you wished were available?
>
>
Along the lines of the index.html lookup, one of the most dang useful things
about nginx is being able to do something like this:

  # If the file exists as a static file serve it directly without
  # running all the other rewrite tests on it
  if (-f $request_filename) {
    break;
  }

  # index.html
  if (-f $request_filename/index.html) {
    rewrite (.*) $1/index.html break;
  }

  # rails caching
  if (-f $request_filename.html) {
    rewrite (.*) $1.html break;
  }
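
Outside of nginx, the same lookup order (exact file, then index.html
inside a directory, then the path with .html appended) can be sketched
in a few lines. This is only an illustration of the fallback chain;
resolve() is a made-up name and the sketch does no sanitizing of ".."
in the URI.

```python
import os
import tempfile


def resolve(docroot, uri):
    """Return the first existing file for uri, or None if nothing matches."""
    base = os.path.join(docroot, uri.lstrip("/"))
    candidates = (
        base,                              # static file as requested
        os.path.join(base, "index.html"),  # directory index
        base + ".html",                    # Rails-style page cache
    )
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "blog"))
    for name in ("style.css", "about.html", os.path.join("blog", "index.html")):
        with open(os.path.join(root, name), "wb") as f:
            f.write(b"hi there\n")

    found = [resolve(root, u) for u in ("/style.css", "/blog", "/about", "/nope")]
    relative = [p and os.path.relpath(p, root) for p in found]

print(relative)
```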

Re: [mongrel2] RFC: ETags

From: Andrew Cholakian
Date: 2010-07-14 @ 20:01
mtime-size scares me. Apache uses inode-mtime-size for a reason. Size isn't
much of a differentiator, and I can definitely see someone for some strange
(not necessarily good) reason having files with the same mtime.

Personally, I think if-modified-since, rather than etags, is the way to go
if you're only going to use mtime + size.
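
For comparison, an If-Modified-Since check needs only the mtime. The
sketch below uses Python's email.utils to parse the header; the
function name is hypothetical and treating a zone-less date as UTC is
an assumption.

```python
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime


def not_modified(if_modified_since, mtime):
    """True if the file's mtime is no newer than the client's date."""
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # unparseable header: fall back to a full response
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    return mtime <= since.timestamp()


mtime = 1278317660
header = formatdate(mtime, usegmt=True)  # what Last-Modified would carry

print(header)
print(not_modified(header, mtime), not_modified(header, mtime + 1))
```

The date has one-second resolution and carries no size, so it catches
strictly less than an mtime-size ETag does.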

-- 
Andrew Cholakian
http://www.andrewvc.com

Re: [mongrel2] RFC: ETags

From: Zed A. Shaw
Date: 2010-07-14 @ 20:27
On Wed, Jul 14, 2010 at 01:01:14PM -0700, Andrew Cholakian wrote:
> mtime-size scares me. Apache uses inode-mtime-size for a reason. Size isn't
> much of a differentiator, and I can definitely see someone for some strange
> (not necessarily good) reason having files with the same mtime.
> 
> Personally, I think if-modified-since is the way, rather than etags, to go
> if you're only going to use mtime + size.

Nope, turns out once you have more than one server, inode is the death
since they're different on every server.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] RFC: ETags

From: Zed A. Shaw
Date: 2010-07-14 @ 19:40
On Wed, Jul 14, 2010 at 03:32:48PM -0400, Billy Gray wrote:
> On Wed, Jul 14, 2010 at 1:46 PM, Zed A. Shaw <zedshaw@zedshaw.com> wrote:
> Along the lines of the index.html lookup, one of the most dang useful things
> about nginx is being able to do something like this:
> 
>   # If the file exists as a static file serve it directly without
>   # running all the other rewrite tests on it
>   if (-f $request_filename) {
>     break;
>   }
> 
>   # index.html
>   if (-f $request_filename/index.html) {
>     rewrite (.*) $1/index.html break;
>   }
> 
>   # rails caching
>   if (-f $request_filename.html) {
>     rewrite (.*) $1.html break;
>   }

That'll come with filters, and, like, actually make sense.

-- 
Zed A. Shaw
http://zedshaw.com/