Re: [mongrel2] some tnetstring feedback
From: Zed A. Shaw
Date: 2011-04-08 @ 17:06
On Wed, Apr 06, 2011 at 10:52:53AM +1000, Ryan Kelly wrote:
>
> Hi All,
>
> * By far the most frequent comment: Why not put the type indicator at
> the front?
It's harder to parse. The parsing changes from:
    split at :, that's the length
To:
    split at regex([:,{\[#])
Which is both slower and harder to implement and get right.
It also doesn't buy you anything because you still have to read the
whole string of bytes contained and put them in the data structure.
Might as well just read the length, :, data, ending and then process it.
Having the { first means people will try to cheat and recurse into the
bytes and read in a stream which is how they get buffer overflows and
just about every other problem out there. Think about it this way,
people thinking this code is better:
    read size
    read type
    while size > 0:
        read datum
        reduce size
        put into structure
Are the reason we have buffer overflows and inefficient servers.
Here's just some of the problems:
* They'll build a partial dict or list, then hit an error mid-stream and
have to clean up, which they won't.
* They'll get the size math wrong and stall or have a buffer overflow.
* They'll have to check the size to avoid giant payloads, which they
won't because they'll be doing it in a stream, and then make a massive
dict.
* They'll get the recursion wrong and screw up the loops.
Contrast that with what we have now:
    read size
    read data
    read type
    build_type(data)
This works reliably, is always safe because you have to actually load
the RAM directly, is immediately abortable when there's an error, works
faster because the parsing happens on the RAM immediately, and is easy
to implement (as demonstrated by the ones we've done).
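A minimal Python sketch of the read size / read data / read type / build_type flow, not the actual mongrel2 or tnetstring module code. The names `parse`, `build_type`, and `MAX_PAYLOAD` are hypothetical, and the type tags (`,` string, `#` integer, `!` boolean, `~` null, `]` list, `}` dict) are assumed from the published tnetstring spec. Floats are left out, since their status is still being argued in this thread.

```python
MAX_PAYLOAD = 1024 * 1024  # hypothetical cap; pick one that suits your application


def parse(data: bytes):
    """Parse one tnetstring from the front of data; return (value, remainder)."""
    # read size: everything before the first ':' is the decimal length
    length_text, sep, rest = data.partition(b":")
    if not sep or not length_text.isdigit():
        raise ValueError("missing or malformed length prefix")
    size = int(length_text)
    if size > MAX_PAYLOAD:
        raise ValueError("payload exceeds MAX_PAYLOAD")
    # read data and read type: the payload plus its one-byte trailing tag
    if len(rest) < size + 1:
        raise ValueError("truncated message")
    payload, tag, remainder = rest[:size], rest[size:size + 1], rest[size + 1:]
    return build_type(payload, tag), remainder


def build_type(payload: bytes, tag: bytes):
    """Turn a payload that is already fully in memory into a Python value."""
    if tag == b",":
        return payload
    if tag == b"#":
        return int(payload)  # raises ValueError on garbage
    if tag == b"!":
        if payload == b"true":
            return True
        if payload == b"false":
            return False
        raise ValueError("bad boolean payload")
    if tag == b"~":
        if payload:
            raise ValueError("null must have an empty payload")
        return None
    if tag == b"]":
        items = []
        while payload:
            item, payload = parse(payload)
            items.append(item)
        return items
    if tag == b"}":
        result = {}
        while payload:
            key, payload = parse(payload)
            if not isinstance(key, bytes):
                raise ValueError("dict keys must be strings")
            if not payload:
                raise ValueError("dict key without a value")
            value, payload = parse(payload)
            result[key] = value
        return result
    raise ValueError("unknown type tag")
```

For example, `parse(b"5:hello,")` returns `(b"hello", b"")`. Nested values are parsed out of a payload whose size has already been checked, so any error aborts with a `ValueError` before a partially built list or dict is handed to the caller.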
Now, the next response is, "Well what if I want to stream a 4 GIGABYTE
DVD at the server." Answer: send multiple messages. Everyone wants to
stream all of the data in one single message, but this is why things
like HTTP uploads blow. The better way to do this is to pick a fixed max
message size, and then an application format that lets you transmit any
size data inside a sequence of frames within the fixed max.
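A rough Python sketch of the "sequence of frames" idea, under loud assumptions: the frame layout here (each chunk sent as one tnetstring string, with an empty string marking the end) is a made-up application format for illustration, and `frames`, `dump_bytes`, and the `max_frame` parameter are hypothetical names.

```python
def dump_bytes(chunk: bytes) -> bytes:
    """Encode a byte string as a tnetstring string: length, ':', data, ','."""
    return str(len(chunk)).encode("ascii") + b":" + chunk + b","


def frames(data: bytes, max_frame: int):
    """Yield data as a series of messages, each carrying at most max_frame bytes."""
    for start in range(0, len(data), max_frame):
        yield dump_bytes(data[start:start + max_frame])
    # Assumed convention: an empty frame tells the receiver the transfer is done.
    yield dump_bytes(b"")
```

With `max_frame=4`, `list(frames(b"abcdefghij", 4))` gives `[b"4:abcd,", b"4:efgh,", b"2:ij,", b"0:,"]`. The receiver never has to accept a message larger than the agreed maximum, whatever the total size of the transfer.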
In general, knowing the type buys you nothing in terms of
implementation, streaming, or speed.
> * would be much more cache-friendly and therefore faster.
That's total bullshit. Anytime someone claims some change would improve
cache misses, without first actually making the change and providing
evidence, they're full of shit.
> * I tried a basic implementation this morning, and couldn't
> measure any difference in speed for front vs back placement.
Awesome, glad you tested this to confirm. Exactly what I said, there's
no difference. The idea that somehow magically you're going to get
performance gains because you parsed 5 chars instead of 1 is a giant
load of crap. Sorry to put it that way, but "cache friendly ROFL scale"
dudes piss me off. They never have any evidence of this and usually
their claims of higher performance because of cache misses get
invalidated when a revision of any of the CPUs comes out.
> * people nonplussed about the back-placement making a parser
> easier to write, since they wouldn't be writing one.
They should care. If the parser is easier to write and harder to screw up
then their servers and clients don't crash and have buffer overflows.
You should start this presentation by telling them Pi is equal to 3.
Take some time to prove that Pi is 3 because it's faster. It's just so
much faster to have Pi be a single integer. It fits into caches better,
the works. Then when they're all worked up over how accuracy is more
important than tiny bits of speed, point out that this is the same for
protocols. Don't matter how fast they are if they crash or fail to work
right.
> * Perhaps a separate type tag for ints and floats? Make parsing
> easier instead of having a single "number" type.
Actually, that's why I took floats out of mine. We have to think about
the statement that we're sending a float since they don't translate
reliably between platforms. If we can make it *very* clear that, no
insane math dude, you do not get the exact same number in Haskell as you
do in Javascript, then it should be fine to do another type for floats.
> * Why spell out "true" and "false" in full? 1:t! and 1:f! are almost
> as easy to read and much more efficient.
Because javascript uses it, so it's easy to translate, but that's not a
requirement.
> * Add a separate type for unicode strings.
No! A thousand times no. This is immediately the road to hell in
protocols. First off, you even got it wrong, since it should be a utf-8
string not "unicode". More importantly though, that's an application
level display issue, not a protocol issue. If someone wants their data
to have utf-8 then they define their strings as being utf-8 and do the
translation. Putting it in the protocol means that *all* languages have
to deal with this, and utf-8 translation is nasty and error prone.
The general rule is that unicode encodings are an application display
system and not a protocol or data storage format. Data is stored the
way a computer understands it as arrays of bytes. It's transmitted that
way too. Computers and networks don't need this concept, only people so
the application is responsible for it.
Whew, anyway great getting feedback. I know this isn't you saying this,
but it pretty much hit every one of my network protocol pet peeves. :-)
--
Zed A. Shaw
http://zedshaw.com/
Re: [mongrel2] some tnetstring feedback
From: Ryan Kelly
Date: 2011-04-09 @ 01:26
On Fri, 2011-04-08 at 10:06 -0700, Zed A. Shaw wrote:
> On Wed, Apr 06, 2011 at 10:52:53AM +1000, Ryan Kelly wrote:
>
> > * people nonplussed about the back-placement making a parser
> > easier to write, since they wouldn't be writing one.
>
> They should care. If the parser is easier to write and harder to screw up
> then their servers and clients don't crash and have buffer overflows.
>
> You should start this presentation by telling them Pi is equal to 3.
> Take some time to prove that Pi is 3 because it's faster. It's just so
> much faster to have Pi be a single integer. It fits into caches better,
> the works. Then when they're all worked up over how accuracy is more
> important than tiny bits of speed, point out that this is the same for
> protocols. Don't matter how fast they are if they crash or fail to work
> right.
Spot on. It wasn't in my slides, but on the page about "almost
impossible to get wrong" the point I made was "I'd much rather this in
my webserver than a JSON parser".
> > * Perhaps a separate type tag for ints and floats? Make parsing
> > easier instead of having a single "number" type.
>
> Actually, that's why I took floats out of mine. We have to think about
> the statement that we're sending a float since they don't translate
> reliably between platforms. If we can make it *very* clear that, no
> insane math dude, you do not get the exact same number in Haskell as you
> do in Javascript, then it should be fine to do another type for floats.
>
I had a good chat about this with a C++ guy (the more-cache-efficient
guy) and I hadn't realised what a can of worms float parsing is. E.g.
you need to write out 17 decimal places to guarantee 15 places of
accuracy when reading back in.
Apparently newer python versions have some magic to output "the shortest
string guaranteed to parse back without loss of accuracy" but I don't
understand it enough to make any suggestions.
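A small Python illustration of the round-trip problem, assuming CPython 3 with IEEE 754 doubles, where `repr` of a float produces the shortest string that parses back to the same value. The variable names are just for illustration.

```python
x = 0.1 + 0.2  # not exactly 0.3 in IEEE 754 double precision

short = "%.15g" % x    # 15 significant digits
full = "%.17g" % x     # 17 significant digits
shortest = repr(x)     # shortest string that parses back to the same double

# Compare each text form with the original value after parsing it back.
print(short, float(short) == x)
print(full, float(full) == x)
print(shortest, float(shortest) == x)
```

Under those assumptions the 15-digit form prints as `0.3` and does not read back to the same double, while the 17-digit form and `repr` both do. That only covers one platform talking to itself; it says nothing about what another language's parser makes of the same text.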
> Whew, anyway great getting feedback. I know this isn't you saying this,
> but it pretty much hit every one of my network protocol pet peeves. :-)
Yeah, I figured it probably would, thanks for the considered responses.
Cheers,
Ryan
--
Ryan Kelly
http://www.rfk.id.au | This message is digitally signed. Please visit
ryan@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details
Re: [mongrel2] some tnetstring feedback
From: Jon Rosebaugh
Date: 2011-04-06 @ 05:51
On Wednesday, April 6, 2011, Ryan Kelly <ryan@rfk.id.au> wrote:
> * Add a separate type for unicode strings.
> * This is probably an artifact of the Python crowd where
> unicode-vs-bytes is such a big deal. But no-one seems to care
> that JSON is missing a "bytes" type.
Except Unicode strings would have to be encoded in bytes anyway, so
we'd have to have a convention 'use utf-8' or something. Why not
just have bytes, and let your higher-level protocol specify what
encoding to use?
Re: [mongrel2] some tnetstring feedback
From: Ryan Kelly
Date: 2011-04-06 @ 06:11
On Tue, 2011-04-05 at 22:51 -0700, Jon Rosebaugh wrote:
> On Wednesday, April 6, 2011, Ryan Kelly <ryan@rfk.id.au> wrote:
> > * Add a separate type for unicode strings.
> > * This is probably an artifact of the Python crowd where
> > unicode-vs-bytes is such a big deal. But no-one seems to care
> > that JSON is missing a "bytes" type.
>
> Except Unicode strings would have to be encoded in bytes anyway, so
> we'd have to have a convention 'use utf-8' or something. Why not
> just have bytes, and let your higher-level protocol specify what
> encoding to use?
Oh, I completely agree. It's only a problem when you've got some
combination of unicode and bytes in a single datastructure, for example:
["hello",u"world"]
Currently there's no way to round-trip this structure in the tnetstring
python module, because there's nothing to say "this string was
originally bytes" or "this string was originally unicode". But then,
what kind of busted protocol is going to be sending data like this
anyway?
I'm in favour of simply saying "it's the protocol's problem", just feeding
back impressions from the outside world.
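To make the round-trip issue concrete, here is a stripped-down Python 3 sketch. It is not the tnetstring module's code: `dump_string`, `dump_list`, and `load_list` are hypothetical helpers that only handle a flat list of strings, and encoding text as UTF-8 on the way out is an assumption.

```python
def dump_string(value) -> bytes:
    """Encode bytes or text as a tnetstring string; text is assumed to be UTF-8."""
    raw = value.encode("utf-8") if isinstance(value, str) else value
    return str(len(raw)).encode("ascii") + b":" + raw + b","


def dump_list(items) -> bytes:
    """Encode a flat list of strings as a tnetstring list."""
    body = b"".join(dump_string(item) for item in items)
    return str(len(body)).encode("ascii") + b":" + body + b"]"


def load_list(data: bytes):
    """Decode a flat tnetstring list of strings; every item comes back as bytes."""
    size_text, _, rest = data.partition(b":")
    size = int(size_text)
    body, tag = rest[:size], rest[size:size + 1]
    if tag != b"]":
        raise ValueError("not a list")
    items = []
    while body:
        length_text, _, body = body.partition(b":")
        length = int(length_text)
        items.append(body[:length])
        body = body[length + 1:]  # skip the payload and its trailing ','
    return items
```

`load_list(dump_list([b"hello", "world"]))` comes back as `[b"hello", b"world"]`: the wire format carries only bytes, so which items started out as text has to be settled by the protocol built on top.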
Cheers,
Ryan
--
Ryan Kelly
http://www.rfk.id.au | This message is digitally signed. Please visit
ryan@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details
Re: [mongrel2] some tnetstring feedback
From: joshua simmons
Date: 2011-04-06 @ 05:57
I liked the type at the front only when you're not keeping backwards
compatibility. Then you can use it to delimit the length from the body and
save a byte. :D
Otherwise though, it's largely irrelevant. This isn't a storage format that
you're supposed to be reading anyway, not as a rule. Parser comes before a
little thing like type placement.
8 bit clean data is vastly superior to forcing an encoding too for a wire
protocol. It does cause issues in JSON because you then have to validate it
which is an unnecessary overhead at the protocol level.
On Wed, Apr 6, 2011 at 3:51 PM, Jon Rosebaugh <chairos@gmail.com> wrote:
> On Wednesday, April 6, 2011, Ryan Kelly <ryan@rfk.id.au> wrote:
> > * Add a separate type for unicode strings.
> > * This is probably an artifact of the Python crowd where
> > unicode-vs-bytes is such a big deal. But no-one seems to care
> > that JSON is missing a "bytes" type.
>
> Except Unicode strings would have to be encoded in bytes anyway, so
> we'd have to have a convention 'use utf-8' or something. Why not
> just have bytes, and let your higher-level protocol specify what
> encoding to use?
>