Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Armando Singer
- Date:
- 2011-04-18 @ 03:25
> <unicode strings commentary>
My thinking is as follows:
- Some language families have separate byte[] and string types that
incur conversion and/or memory overhead to convert one to the other.
- These languages must parse the payload as a byte array to stay
performant.
- It's fine in these languages to use a byte[] type and convert it
to whatever type is needed in the host language, as with my mongrel2
handler impl using plain old netstrings. However, the usefulness of
a typed wire format is a bit lessened in the above languages if we
have to convert from byte[] to String in the host language anyway,
for what is probably the most common case (obviously depends on the
app).
I'm not proposing adding a unicode type to the tnetstring spec at all.
Rather, the , and " types would be identical, carrying a payload of
ASCII encoded bytes.
In other words, the type character (, vs ") is *just a type hint* so
that languages w/ distinct byte[] and string types can create the
appropriate data structures in the host language. tnetstrings or
mongrel2 does not need to know that in this language, for example,
Strings are arrays of double byte UTF-16 unsigned chars.
So in Java, I'd have:
case '"': return new String(msg, i, len, ASCII);       // text: decode to a String
case ',': return Arrays.copyOfRange(msg, i, i + len);  // bytes: copy the raw payload
But in python we'd do the same thing for both:
elif payload_type == ',' or payload_type == '"':
    value = payload
Similarly, in the host language, I *have* to deal w/ peculiarities
such as Java Strings being unicode aware. Most libraries are
using the built in types, so we have to deal with dumping those types
in the host language. But there's no need to pollute the tnetstrings
wire format w/ any of a language's peculiarities. So I want to easily
be able to dump all Strings, StringBuilders...all CharSequences,
boolean, long, short, byte, byte[], char, for example. All of these
types are handled by the 5 methods below in this host language:
public static String dump(CharSequence data) { return asciiLength(data) + ":" + data + '"'; }
public static String dump(byte[] data) { return data.length + ":" + new String(data, ASCII) + ','; }
public static String dump(boolean data) { return data ? "4:true!" : "5:false!"; }
public static String dump(long data) { return numberString(Long.toString(data)); }
public static String dump(char data) { return asciiLength(data) + ":" + data + '"'; }
But none of that is polluting tnetstrings. Round tripping works well
in this case regardless of language, but each would have dump methods
specific to that language.
>> - The integer type in the reference implementation is limited to
>> sys.maxint. It might be a good idea to be specific in the spec about
>> what the max integer is allowed to be
>
> Indeed, sys.maxint is different on 32-bit vs 64-bit python so no
> ambiguity is resolved here.
Yes. Some options:
- integer is defined as architecture dependent, and the spec is very
clear on what this means. Not very portable.
- integer is always defined as 64 bit, signed or unsigned or whatever.
- integer is defined as arbitrarily large and signed. I think this
would be the easiest to implement across languages and it fully
supports that part of the numerical tower (in scheme terms).
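A minimal sketch of the third option, assuming Python 3 (whose int type
is already arbitrary precision) and assuming the integer type tag is #.
parse_int is a hypothetical helper for illustration, not part of the
reference implementation:

```python
def parse_int(data: bytes) -> int:
    """Parse one tnetstring integer such as b'5:12345#'."""
    length, sep, rest = data.partition(b":")
    if not sep or not length.isdigit():
        raise ValueError("bad length prefix")
    n = int(length)
    payload, tag = rest[:n], rest[n:n + 1]
    if len(payload) != n or tag != b"#":
        raise ValueError("not an integer tnetstring")
    # Python 3 ints are arbitrary precision, so no sys.maxint-style
    # limit applies to the parsed value.
    return int(payload.decode("ascii"))


# A value one past the unsigned 64-bit range still round-trips.
print(parse_int(b"20:18446744073709551616#") == 2 ** 64)
```

Languages without a built-in big integer would need a bignum library
for this option, which is the cost to weigh against the fixed 64 bit
definition.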
Cheers,
Armando
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Ryan Kelly
- Date:
- 2011-04-18 @ 04:09
On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
> > <unicode strings commentary>
>
>
> I'm not proposing adding a unicode type to the tnetstring spec at all.
>
> Rather, the , and " type would be identical, carrying a payload of
> ASCII encoded bytes.
>
> In other words, the type character (, vs ") is *just a type hint* so
> that languages w/ distinct byte[] and string types can create the
> appropriate data structures in the host language.
Right. Trying to do anything else at the tnetstring level is asking for
trouble.
Perhaps I'm just confusing the issue by saying "unicode" everywhere.
Sorry. It's a type distinction between "text" and "bytes" and it's
about how you want to work with the object after it has been
deserialized. Agree?
But, and correct me if I'm wrong, the whole trouble here is that the
"string" object is invariably designed to represent unicode characters.
So there is encoding going on somewhere, even if it's the implicit
encoding that your host language does to store the things in memory.
Can the java String object represent an arbitrary byte sequence? One of
the issues faced by python is that you can't really represent e.g. null
bytes in a unicode string object.
> tnetstrings or mongrel2 does not need to know that in this language,
> for example, Strings are arrays of double byte UTF-16 unsigned chars.
I think we can all agree that we don't want tnetstrings to touch any
encoding issues :-)
> So in Java, I'd have:
>
> case '"': new String(msg, i, len, ASCII);
> case ',': Arrays.copyOfRange(msg, i, i + len);
>
> But in python we'd do the same thing for both:
>
> elif payload_type == ',' or payload_type == '"':
> value = payload
I think this behaviour would be very surprising to python programmers.
If you've said "this stuff is text" in your type tag, they would expect
to get a unicode string object.
Probably I just don't understand enough about how Java strings work.
Sounds like the distinction between String/byte[] is sufficiently
different to the bytes/unicode distinction in python that my intuition
is off.
Is the whole point of , vs " that you end up with either a byte[] filled
with ASCII bytes, or a String() filled with ASCII bytes? If so, it
sounds like a hack to work around the inefficiencies of java's String
and/or byte[] objects and I don't think it's worth the complication.
What happens if someone passes in a string containing some non-ascii
unicode characters? Does it error out, or wind up on the wire in UTF16?
Ryan
--
Ryan Kelly
http://www.rfk.id.au | This message is digitally signed. Please visit
ryan@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Zed A. Shaw
- Date:
- 2011-04-18 @ 08:24
On Mon, Apr 18, 2011 at 02:09:07PM +1000, Ryan Kelly wrote:
> On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
> > > <unicode strings commentary>
> > In other words, the type character (, vs ") is *just a type hint* so
> > that languages w/ distinct byte[] and string types can create the
> > appropriate data structures in the host language.
>
> Right. Trying to do anything else at the tnetstring level is asking for
> trouble.
I'm going to make it easier:
When tnetstrings uses the word "strings" it means, "A sequence of 8bit
bytes (octets) that has no meaning beyond this definition". They are
not UTF-8, ascii, byte[], or anything other than this definition. Your
application then specifies what it is sending either in code or in
metadata for the request. That means, if you want UTF-8 for the
transport, then tell the receivers it's UTF-8.
Would that clear it up?
--
Zed A. Shaw
http://zedshaw.com/
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Austin Wise
- Date:
- 2011-04-19 @ 04:01
Having "strings" just mean "a sequence of 8bit bytes" makes sense
and works fine in my C# TNetString implementation. However it would
be helpful if Mongrel2's handler format specified something like "all
header keys and values are ASCII and the request body is just a
sequence of bytes" so that I know how to interpret the header bytes.
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Zed A. Shaw
- Date:
- 2011-04-19 @ 22:03
On Mon, Apr 18, 2011 at 09:01:54PM -0700, Austin Wise wrote:
> Having "strings" just mean "a sequence of 8bit bytes" makes sense
> and works fine in my C# TNetString implementation. However it would
> be helpful if Mongrel2's handler format specified something like "all
> header keys and values are ASCII and the request body is just a
> sequence of bytes" so that I know how to interpret the header bytes.
I believe that's how it's specified in the older docs, so I can update
the tnetstring version to be the same. The parser would actually
enforce this too.
--
Zed A. Shaw
http://zedshaw.com/
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- joshua simmons
- Date:
- 2011-04-19 @ 04:17
Mongrel2 already specifies the headers as being ASCII; I'm not sure if
there are any particulars for the request body.
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Loic d'Anterroches
- Date:
- 2011-04-19 @ 06:29
On 2011-04-19 06:17, joshua simmons wrote:
> Mongrel2 already specifies the headers as being ASCII, not sure if
> there's any particulars for the request body.
The request body encoding is specified in the headers. It can basically
be anything: http://www.ietf.org/rfc/rfc2388.txt
loïc
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Armando Singer
- Date:
- 2011-04-18 @ 17:19
On Apr 18, 2011, at 1:24 AM, Zed A. Shaw wrote:
> When tnetstrings uses the word "strings" it means, "A sequence of 8bit
> bytes (octets) that has no meaning beyond this definition". They are
> not UTF-8, ascii, byte[], or anything other than this definition. Your
> application then specifies what it is sending either in code or in
> metadata for the request. That means, if you want UTF-8 for the
> transport, then tell the receivers it's UTF-8.
>
> Would that clear it up?
Yes. But why have the word "strings" in there at all? I guess strings are
in the name "tnetstrings", but otherwise the spec should say 8 bit bytes (octets).
Thank you,
Armando
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Zed A. Shaw
- Date:
- 2011-04-19 @ 22:03
On Mon, Apr 18, 2011 at 10:19:41AM -0700, Armando Singer wrote:
> Yes. But why have the word "strings" in there at all? I guess strings are
> in the name "tnetstrings", but otherwise the spec should say 8 bit bytes
> (octets).
>
"8 bit bytes (octets)" is kind of ridiculous don't you think? How
about, since "strings" has been bastardized to mean so many things in so
many languages we use "Blob". It's a common term from databases that
means what we're saying, and isn't overloaded.
The downside to blobs is implementers will feel it necessary to actually
create a Blob class to hold them even when they aren't needed, so I'll
probably need a table of how to map those in different languages. Like
this:
Python | str
Java   | byte[]
C      | char[]
And so on.
--
Zed A. Shaw
http://zedshaw.com/
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Armando Singer
- Date:
- 2011-04-19 @ 22:18
On Apr 19, 2011, at 3:03 PM, Zed A. Shaw wrote:
> On Mon, Apr 18, 2011 at 10:19:41AM -0700, Armando Singer wrote:
>> Yes. But why have the word "strings" in there at all? I guess strings are
>> in the name "tnetstrings", but otherwise the spec should say 8 bit
>> bytes (octets).
>>
>
> "8 bit bytes (octets)" is kind of ridiculous don't you think? How
> about, since "strings" has been bastardized to mean so many things in so
> many languages we use "Blob". It's a common term from databases that
> means what we're saying, and isn't overloaded.
Yes, that makes sense.
>
> The downside to blobs is implementers will feel it necessary to actually
> create a Blob class to hold them even when they aren't needed, so I'll
> probably need a table of how to map those in different languages. Like
> this:
>
> Python | str
> Java | byte[]
> C | char[]
>
> And so on.
Yes, that would be helpful.
Thank you,
Armando
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Ryan Kelly
- Date:
- 2011-04-18 @ 09:23
On Mon, 2011-04-18 at 01:24 -0700, Zed A. Shaw wrote:
> On Mon, Apr 18, 2011 at 02:09:07PM +1000, Ryan Kelly wrote:
> > On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
> > > > <unicode strings commentary>
> > > In other words, the type character (, vs ") is *just a type hint* so
> > > that languages w/ distinct byte[] and string types can create the
> > > appropriate data structures in the host language.
> >
> > Right. Trying to do anything else at the tnetstring level is asking for
> > trouble.
>
> I'm going to make it easier:
>
> When tnetstrings uses the word "strings" it means, "A sequence of 8bit
> bytes (octets) that has no meaning beyond this definition". They are
> not UTF-8, ascii, byte[], or anything other than this definition. Your
> application then specifies what it is sending either in code or in
> metadata for the request. That means, if you want UTF-8 for the
> transport, then tell the receivers it's UTF-8.
>
> Would that clear it up?
So let me summarize the unicode-friendliness I want to put in my python
module.
Tnetstrings deal only in sequences of 8bit bytes. When you read in a
string without telling it anything else, that's what you'll get:
>>> tns.loads("8:5:hello,]")
["hello"]
If you somehow specify the encoding out-of-band, then you are free to
interpret strings according to that encoding. The tnetstring protocol
doesn't care, but the API can make it easier for you:
>>> tns.loads("8:5:hello,]", "utf8")
[u"hello"]
But if you want to mix interpreted and uninterpreted strings (say, a
dict with unicode-string keys and bytestring values) then you're on your
own.
>>> # I want to get {u"hello": "\xFF"} but can't
>>> tns.loads("12:5:hello,1:\xFF,}","utf8")
Traceback
...blah blah...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff
So you'll have to pick apart the result and decode the bits you want:
>>> d = tns.loads("12:5:hello,1:\xFF,}")
>>> for k in d.keys():
...     d[k.decode("utf8")] = d.pop(k)
>>> d
{u"hello": "\xFF"}
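A sketch of the same separate pass for Python 3, where keys() is a live
view and popping from the dict while iterating over it is not safe.
decode_keys is a hypothetical helper, and the parsed dict is a stand-in
for whatever a loads call would return:

```python
def decode_keys(d, encoding="utf8"):
    """Return a copy of d with bytes keys decoded to str.

    Values are left untouched, so they stay as raw bytes.
    """
    return {k.decode(encoding): v for k, v in d.items()}


# Stand-in for the result of parsing "12:5:hello,1:\xFF,}" as raw bytes.
parsed = {b"hello": b"\xff"}
print(decode_keys(parsed))
```

Building a new dict rather than mutating in place keeps the pass safe
regardless of how the keys are ordered or hashed.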
It would be convenient if there were a separate "bytes" type so that you
could do, say:
>>> tns.loads("12:5:hello,1:\xFF$}", "utf8")
{u"hello": "\xFF"}
But it wouldn't be such a big convenience that I'm going to say any more
about it on this list :-)
Zed, would you be happy to see such an API inside a tnetstrings
module?
Or, would it be better/easier/cleaner to have people do a separate pass
over their data to coerce things to/from bytes as they see fit?
Cheers,
Ryan
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Zed A. Shaw
- Date:
- 2011-04-19 @ 21:59
On Mon, Apr 18, 2011 at 07:23:21PM +1000, Ryan Kelly wrote:
> Tnetstrings deal only in sequences of 8bit bytes. When you read in a
> string without telling it anything else, that's what you'll get:
>
> >>> tns.loads("8:5:hello,]")
> ["hello"]
>
> If you somehow specify the encoding out-of-band, then you are free to
> interpret strings according to that encoding. The tnetstring protocol
> doesn't care, but the API can make it easier for you:
>
> >>> tns.loads("8:5:hello,]", "utf8")
> [u"hello"]
Hmmm, yeah that could be helpful, and yes I think this is better. Only
thing is, this *only* uses the encoding on the *contents*. Every other
part is ASCII. So, if I have your line:
8:5:hello,]
The only part that is converted is the hello. Everything else stays
ASCII always.
The reason is that this reduces the attack surface for situations where
people find bizarre unicode sequences that still compare equal to, say,
":" but are not really ":", so you miss them in parsing and scanning.
That make sense?
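A minimal sketch of that split, assuming Python 3 and covering only the
string (,) and list (]) types. loads and _parse here are hypothetical
names written for illustration, not the reference implementation. The
length prefix, colon and type tag are matched as raw ASCII bytes, and
the optional encoding is applied to string payloads only:

```python
def loads(data: bytes, encoding=None):
    """Parse one tnetstring; decode string payloads only if encoding is given."""
    value, rest = _parse(data, encoding)
    if rest:
        raise ValueError("trailing data after tnetstring")
    return value


def _parse(data: bytes, encoding):
    # Framing is handled strictly as ASCII bytes, never as decoded text.
    length, sep, rest = data.partition(b":")
    if not sep or not length.isdigit():
        raise ValueError("bad length prefix")
    n = int(length)
    payload, tag, remain = rest[:n], rest[n:n + 1], rest[n + 1:]
    if len(payload) != n or not tag:
        raise ValueError("truncated tnetstring")
    if tag == b",":
        # Only the contents are ever run through the codec.
        return (payload.decode(encoding) if encoding else payload), remain
    if tag == b"]":
        items = []
        while payload:
            item, payload = _parse(payload, encoding)
            items.append(item)
        return items, remain
    raise ValueError("unsupported type tag in this sketch")


print(loads(b"8:5:hello,]"))
print(loads(b"8:5:hello,]", "utf8"))
```

Because the separator search runs on bytes, a look-alike such as the
fullwidth colon (U+FF1A) encoded as UTF-8 is never mistaken for the
framing ":" and the input is rejected.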
Now, as for my usage I just do this:
http://chardet.feedparser.org/
In my protocols, I'll have an encoding metadata field, and then assume
the sender could be lying and use the above to confirm it. Take a look
at chardet as a sort of "guess" option for the API.
--
Zed A. Shaw
http://zedshaw.com/
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Ryan Kelly
- Date:
- 2011-04-19 @ 22:06
On Tue, 2011-04-19 at 14:59 -0700, Zed A. Shaw wrote:
> On Mon, Apr 18, 2011 at 07:23:21PM +1000, Ryan Kelly wrote:
> > Tnetstrings deal only in sequences of 8bit bytes. When you read in a
> > string without telling it anything else, that's what you'll get:
> >
> > >>> tns.loads("8:5:hello,]")
> > ["hello"]
> >
> > If you somehow specify the encoding out-of-band, then you are free to
> > interpret strings according to that encoding. The tnetstring protocol
> > doesn't care, but the API can make it easier for you:
> >
> > >>> tns.loads("8:5:hello,]", "utf8")
> > [u"hello"]
>
> Hmmm, yeah that could be helpful, and yes I think this is better. Only
> thing is, this *only* uses the encoding on the *contents*. Every other
> part is ASCII. So, if I have your line:
>
> 8:5:hello,]
>
> The only part that is converted is the hello. Everything else stays
> ASCII always.
>
> The reason is this reduces the attack surface for situations where
> people find bizarre unicode sequences that can still equal say : but are
> not really : so you miss them in parsing and scanning.
>
> That make sense?
Absolutely. Underneath, the parser core is still working on a char* one
byte at a time; this would only happen way up in the code that says
"turn this chunk of bytes into a Python string".
It had never even occurred to me to do otherwise.
> Now, as for my usage I just do this:
>
> http://chardet.feedparser.org/
>
> In my protocols, I'll have an encoding metadata field, and then assume
> the sender could be lying and use the above to confirm it. Take a look
> at chardet as a sort of "guess" option for the API.
Is this what you used for unicode-handling in Lamson? I think I
remember reading about its awesome powers, will definitely check it out.
Ryan
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Zed A. Shaw
- Date:
- 2011-04-19 @ 22:29
On Wed, Apr 20, 2011 at 08:06:23AM +1000, Ryan Kelly wrote:
> > http://chardet.feedparser.org/
> >
> > In my protocols, I'll have an encoding metadata field, and then assume
> > the sender could be lying and use the above to confirm it. Take a look
> > at chardet as a sort of "guess" option for the API.
>
> Is this what you used for unicode-handling in Lamson? I think I
> remember reading about its awesome powers, will definitely check it out.
Yes, it did wonders on cleaning up email, which has tons of badly
specified encodings. In fact, MIME is a prime example of why mixing
your framing and your encodings in protocols is a bad idea.
--
Zed A. Shaw
http://zedshaw.com/
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Loic d'Anterroches
- Date:
- 2011-04-18 @ 07:30
On 2011-04-18 06:09, Ryan Kelly wrote:
> On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
>>> <unicode strings commentary>
>> I'm not proposing adding a unicode type to the tnetstring spec at all.
>>
>> Rather, the , and " type would be identical, carrying a payload of
>> ASCII encoded bytes.
They are not ASCII encoded; they are just binary data.
>> In other words, the type character (, vs ") is *just a type hint* so
>> that languages w/ distinct byte[] and string types can create the
>> appropriate data structures in the host language.
If this is just a type hint, this means you can create two python
implementations which are incompatible with each other. For me this does
not feel right.
The problem I see in your discussion is that you always consider a kind
of implicit encoding for the handling of the strings. What is this
encoding? Python can use ascii or utf-8 or whatever you configured. PHP
can use whatever you configured.
If you want a string as such, you need to give the encoding with it,
because a string is just a byte array interpreted in a given way.
For me, trying to add string support/hinting is really opening
a can of worms. Can you remember the mess of MySQL where people were
storing utf-8 in another "implicitly encoded" storage and then were
surprised they were not able to dump the data into something working?
It feels like that.
loïc
--
Dr Loïc d'Anterroches
Founder Céondo Ltd
w: www.ceondo.com | e: loic@ceondo.com
t: +44 (0)207 183 0016 | f: +44 (0)207 183 0124
Céondo Ltd
Dalton House
60 Windsor Avenue
London
SW19 2RR / United Kingdom
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Armando Singer
- Date:
- 2011-04-18 @ 17:16
On Apr 18, 2011, at 12:30 AM, Loic d'Anterroches wrote:
> On 2011-04-18 06:09, Ryan Kelly wrote:
>> On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
>>>> <unicode strings commentary>
>>> I'm not proposing adding a unicode type to the tnetstring spec at all.
>>>
>>> Rather, the , and " type would be identical, carrying a payload of
>>> ASCII encoded bytes.
>
> They are not ascii encoded they are just binary data.
Good point, tnetstrings doesn't specify an encoding except for the
size string. I had picked ASCII up from the Handler netstrings impl
(Note 3: Sorry, Unicodians, It’s All ASCII...). My bad!
However, the reference Python implementation dumps a python string to
ASCII encoded bytes, correct? I'm probably wrong there. But that would mean
to remain compatible I'd have to dump Java Strings to ASCII encoded
bytes. Otherwise, I'd have to pick some encoding when dumping byte[],
so it might as well be UTF-8.
Better: we need to always specify the encoding:
byte[] dump(String javaString, Charset charset)...
String parseString(byte[] tnetstring, Charset charset)...
>
>>> In other words, the type character (, vs ") is *just a type hint* so
>>> that languages w/ distinct byte[] and string types can create the
>>> appropriate data structures in the host language.
>
> If this is just a type hint, this means you can create two python
> implementations which are incompatible with each other. For me this does
> not feel right.
>
> The problem I see in your discussion, is that you always consider a kind
> of implicit encoding for the handling of the strings. What is this
> encoding? Python can use ascii or utf-8 or whatever you configured. PHP
> can use whatever you configured.
Yes, if there were always an encoding specified on the wire, then we
could always convert to the platform's String type w/o specifying an
encoding. But since it's intentionally not specified, we must always
specify an encoding anyway to get a String in one's platform.
>
> If you want a string as such, you need to give the encoding with,
> because a string is just a byte array interpreted in a given way.
>
> For me, trying to add the string support/hinting, this is really opening
> a can of worms. Can you remember the mess of MySQL where people were
> storing utf-8 in another "implicit encoded" storage and then were
> surprised they were not able to dump the data in something working? It
> feels like that.
I agree.
Thank you for the feedback.
Armando
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Zed A. Shaw
- Date:
- 2011-04-19 @ 21:54
On Mon, Apr 18, 2011 at 10:16:34AM -0700, Armando Singer wrote:
> However, the reference Python implementation dumps a python string to
> ASCII encoded bytes, correct? I'm probably wrong there. But that would mean
Do you mean this:
return '%d:' % len(data) + data + ','
That actually doesn't output the data as ASCII, it outputs it as bytes.
Python's strings can hold anything so they're more like byte arrays. If
it were this however:
return '%d:%s,' % (len(data), data)
Then it would get screwed up the way you think. If you think that's
wrong, can you work up a counter case that shows it with the python
implementation?
> Better, we always need to specify encoding:
>
> byte[] dump(String javaString, Charset charset)...
>
> String parseString(byte[] tnetstring, Charset charset)...
Uh, wouldn't this just be back to square-one and have you specifying
charsets when the contents should be unspecified (man java makes this
confusing). I'll take a look at your code and maybe rewrite it to what
I'm thinking of. Code is probably better than English to say this.
--
Zed A. Shaw
http://zedshaw.com/
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Armando Singer
- Date:
- 2011-04-19 @ 22:14
On Apr 19, 2011, at 2:54 PM, Zed A. Shaw wrote:
> On Mon, Apr 18, 2011 at 10:16:34AM -0700, Armando Singer wrote:
>> However, the reference Python implementation dumps a python string to
>> ASCII encoded bytes, correct? I'm probably wrong there. But that would mean
>
> Do you mean this:
>
> return '%d:' % len(data) + data + ','
>
> That actually doesn't output the data as ASCII, it outputs it as bytes.
> Python's strings can hold anything so they're more like byte arrays. If
> it were this however:
>
> return '%d:%s,' % (len(data), data)
>
> Then it would get screwed up the way you think. If you think that's
> wrong, can you work up a counter case that shows it with the python
> implementation?
The reference impl is correct. I had thought about it more and concluded
that it's just dumping bytes.
>
>> Better, we always need to specify encoding:
>>
>> byte[] dump(String javaString, Charset charset)...
>>
>> String parseString(byte[] tnetstring, Charset charset)...
>
> Uh, wouldn't this just be back to square-one and have you specifying
> charsets when the contents should be unspecified (man java makes this
> confusing). I'll take a look at your code and maybe rewrite it to what
> I'm thinking of. Code is probably better than English to say this.
Please take a look at my 2nd implementation. I believe it's correct, where
no assumption is made about encoding of the contents.
There is one main parse method:
/** @return byte[] or Long or Boolean or Map<byte[], Object> or List<Object> or null */
public static <T> T parse(final byte[] msg)
There is also 1 convenience method to parse the contents as a Java String.
It's not strictly needed, but 1) it's a bit easier to use when the host
language is working with libraries that need a String, and 2) it's
internally optimized so we don't have extra copies (first getting a byte[],
which is a copy of a range, then converting that to a Java String, which
causes another copy and decoding). Because we don't make any assumptions
about the encoding of the contents, the user must specify a charset if they
want a Java String:
/** convenience method to parse to Java String and optimized to prevent double copy */
public static String parseString(final byte[] msg, final Charset charset)
Same with the dump() methods. We're dumping everything to byte[]. We
convert each Java type such as String, char, long, int, short, etc. to
byte[], but any character data must specify an encoding, else we don't
know how to properly convert it to a byte[].
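The same split can be sketched in Python 3. This is only a rough
illustration of the idea, not a port of the Java code: the names parse and
parse_string are borrowed from the signatures above, only the scalar types
are handled, and the type tags are the ones from the tnetstring spec.

```python
def parse(msg: bytes):
    """Parse one scalar tnetstring; string payloads come back as raw bytes."""
    length, _, rest = msg.partition(b":")
    size = int(length)
    payload, tag = rest[:size], rest[size:size + 1]
    if tag == b",":
        return payload                # no assumption about encoding
    if tag == b"#":
        return int(payload)
    if tag == b"!":
        return payload == b"true"
    if tag == b"~":
        return None
    raise ValueError("unsupported type tag: %r" % tag)


def parse_string(msg: bytes, charset: str) -> str:
    """Convenience wrapper: the caller, not the parser, names the charset."""
    value = parse(msg)
    if not isinstance(value, bytes):
        raise ValueError("payload is not a string")
    return value.decode(charset)
```

The parser itself never guesses an encoding; decoding only happens when the
caller asks for text and says which charset to use.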
Hope I'm making sense!
Thank you,
Armando
>
>
> --
> Zed A. Shaw
> http://zedshaw.com/
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Ryan Kelly
- Date:
- 2011-04-19 @ 22:09
On Tue, 2011-04-19 at 14:54 -0700, Zed A. Shaw wrote:
> On Mon, Apr 18, 2011 at 10:16:34AM -0700, Armando Singer wrote:
> > However, the reference Python implementation dumps a python string to
> > ASCII encoded bytes, correct? I'm probably wrong there. But that would mean
>
> Do you mean this:
>
> return '%d:' % len(data) + data + ','
>
> That actually doesn't output the data as ASCII, it outputs it as bytes.
> Python's strings can hold anything so they're more like byte arrays. If
> it were this however:
>
> return '%d:%s,' % (len(data), data)
>
> Then it would get screwed up the way you think.
Really? I always understood the two forms to be equivalent. Can you
give an example of some data that gets mangled by the latter but not the
former?
Cheers,
Ryan
--
Ryan Kelly
http://www.rfk.id.au | This message is digitally signed. Please visit
ryan@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Zed A. Shaw
- Date:
- 2011-04-19 @ 22:27
On Wed, Apr 20, 2011 at 08:09:02AM +1000, Ryan Kelly wrote:
> > return '%d:%s,' % (len(data), data)
> >
> > Then it would get screwed up the way you think.
>
> Really? I always understood the two forms to be equivalent. Can you
> give an example of some data that gets mangled by the latter but not the
> former?
Yep, you're right:
http://dpaste.de/iTAE/
It's because print and writing to files try to do conversions and
other stupid stuff, not the use of %s.
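The equivalence is easy to check. The sketch below assumes Python 3.5 or
later, where bytes objects support % formatting (in the Python 2 of this
thread the plain str type played that role); the two helper names are made
up for illustration.

```python
def dump_concat(data: bytes) -> bytes:
    # First form from the thread: format the length, then concatenate.
    return b"%d:" % len(data) + data + b","


def dump_interpolate(data: bytes) -> bytes:
    # Second form: interpolate the length and the payload in one go.
    return b"%d:%s," % (len(data), data)


# Every possible byte value, so any mangling of high bytes would show up.
every_byte = bytes(range(256))
same = dump_concat(every_byte) == dump_interpolate(every_byte)
```

Both forms produce identical bytes for arbitrary binary payloads.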
--
Zed A. Shaw
http://zedshaw.com/
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Ryan Kelly
- Date:
- 2011-04-18 @ 00:37
On Mon, 2011-04-18 at 10:27 +1000, Ryan Kelly wrote:
> On Sun, 2011-04-17 at 16:31 -0700, Armando Singer wrote:
> >
> > - It might be a good idea to have a separate string type:
> >
> > " string
> > , byte array
> >
> > I have implemented this in the attached code. It adds 1 line to the
> > parsing. Having just a byte[] will work fine, but we're getting
> > pretty close to netstrings as we'll have to convert to String any time
> > we want one, which would be common.
>
> I feel sorry for this poor dead horse, but I suspect it's going to keep
> getting beaten.
>
>
> On one hand, for a general-purpose library in a language that has
> distinct "bytes" and "string" types, it would be very nice to be able to
> round-trip mixed data structures, e.g.:
>
> ["hello",u"world"] == tns.loads(tns.dumps(["hello",u"world"]))
>
> [...snip...]
>
> But if the proposal is simply to indicate "these bytes are a unicode
> string in whatever encoding you've decided to use for this application,
> you deal with it" then I think, based on my experiences with the python
> module, it would be worth adding as a separate type tag.
By the way, I'm aware that this is probably just my general-purpose
python library bias showing, so I'm quite prepared to be shot down.
Just want to get it all out on the table.
The python lib *will* have to deal conveniently with unicode strings
eventually, and the API will probably look like this:
>>> tns.dumps(u"hello")
ValueError: you must specify an encoding for unicode strings
>>>
>>> # if tnetstrings grows a string type
>>> tns.dumps(u"hello","utf8")
5:hello"
>>>
>>> # if tnetstrings doesn't grow a string type
>>> tns.dumps(u"hello","utf8")
5:hello,
>>>
So really, the only horse I have in this race is "can we unambiguously
mix strings and bytes in a single document".
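A minimal sketch of that API for the case where tnetstrings doesn't grow a
string type, written for Python 3 (where u"hello" is simply str). This is
not the actual library code; it only handles bytes and str, and it assumes
the length prefix counts encoded bytes, not characters.

```python
def dumps(value, encoding=None) -> bytes:
    """Dump bytes as-is; text must come with an explicit encoding."""
    if isinstance(value, str):
        if encoding is None:
            raise ValueError("you must specify an encoding for unicode strings")
        value = value.encode(encoding)
    if not isinstance(value, bytes):
        raise TypeError("this sketch only handles bytes and str")
    # The length is the number of bytes on the wire.
    return b"%d:%s," % (len(value), value)
```

Note that a five-character string can occupy more than five bytes once
encoded, which is why the encoding has to be applied before the length is
computed.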
Cheers,
Ryan
--
Ryan Kelly
http://www.rfk.id.au | This message is digitally signed. Please visit
ryan@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- joshua simmons
- Date:
- 2011-04-18 @ 00:42
IIRC .net and the jvm intern strings too, which makes messing with them not
particularly performant unless you use a string builder. Working with a
byte[] is vastly superior until you need string semantics, and that's up to
your application.
Also with a byte[] it should be relatively easy to duck into unsafe code to
get some serious performance if there are nasty spots in the code.
String.Split is something I'd avoid anyway since once again it means you
search the string then make two new strings (interned) which still need to
be processed. If possible parsing in-place and extracting the valuable data
should prove much faster.
Having separate string and blob types just complicates matters. tnetstrings
are 8 bit clean, so store whatever rubbish you want. But it's not the
protocol's problem as to how you encode your data.
To this end as well, iirc it's easy enough to get a byte[] and length from a
native string, and to convert between utf-8 / ascii / whatever.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.getbytes(v=VS.100).aspx
for example. imo your api should handle byte[]'s only, and then let the
application decide encoding, this then reflects the actual protocol
semantics and stops anybody from getting confused.
On Mon, Apr 18, 2011 at 10:27 AM, Ryan Kelly <ryan@rfk.id.au> wrote:
> On Sun, 2011-04-17 at 16:31 -0700, Armando Singer wrote:
> >
> > - It might be a good idea to have a separate string type:
> >
> > " string
> > , byte array
> >
> > I have implemented this in the attached code. It adds 1 line to the
> > parsing. Having just a byte[] will work fine, but we're getting
> > pretty close to netstrings as we'll have to convert to String any time
> > we want one, which would be common.
>
> I feel sorry for this poor dead horse, but I suspect it's going to keep
> getting beaten.
>
>
> On one hand, for a general-purpose library in a language that has
> distinct "bytes" and "string" types, it would be very nice to be able to
> round-trip mixed data structures, e.g.:
>
> ["hello",u"world"] == tns.loads(tns.dumps(["hello",u"world"]))
>
> On the other hand, you most definitely do NOT want mongrel2 trying to
> deal with encoding/decoding unicode strings. Bad bad bad.
>
>
> And that's not even going into the details of encoding. To quote the
> tnetstrings spec:
>
> "String encoding is an application level, political, and display
> specification. Transport protocols should not have to decode random
> character encodings accurately to function properly."
>
> A big +1 from me on that!
>
>
> When the tnetstring adventure was just starting out, Zed's original
> proposal was to have separate "string" and "bytes" type tags, but have a
> policy that "tnetstring doesn't do encoding". So this:
>
> 5:hello,
>
> Means "here is a byte array". While this:
>
> 5:hello"
>
> Means "here is a string in whatever encoding you're using up there".
>
>
> I fought back against having bytes in a potentially ambiguous encoding.
> I now wish I'd kept my mouth shut.
>
>
> As I see it there are two options:
>
> 1) Just tell unicode strings to piss off. This is not a
> general-purpose serialisation library, it's a special-purpose format for
> communicating between bytestream-based services.
>
> 2) Allow a separate string type, but refuse to accept or generate it
> within mongrel2.
>
>
> In a previous email on this topic, people jumped in to say that when I
> said "unicode strings" like above what I really meant was "utf8
> strings". Not so.
>
> If the proposed solution were indeed to be "encode all unicode strings
> in utf8 and decode them in the parser" I would be against it.
>
> But if the proposal is simply to indicate "these bytes are a unicode
> string in whatever encoding you've decided to use for this application,
> you deal with it" then I think, based on my experiences with the python
> module, it would be worth adding as a separate type tag.
>
>
> The ability to transparently round-trip both bytearrays and strings
> would actually be an additional bonus of tnetstrings over JSON, which
> demands that all strings be unicode.
>
> (Worse actually: the whole JSON document is a big unicode string in one
> of several different encodings, and your parser is supposed to examine
> the pattern of zeros in the first few chars of the document to determine
> which encoding it is in. Of course no-one does this, so in the wild
> JSON is almost always in utf8.)
>
> > - The integer type in the reference implementation is limited to
> > sys.maxint. It might be a good idea to be specific in the spec about
> > what the max integer is allowed to be
>
> Indeed, sys.maxint is different on 32-bit vs 64-bit python so no
> ambiguity is resolved here.
>
> > - I'm also not handling floating point numbers. Is this correct? Not
> > having floats seems the only way to fulfill rule #1
>
> I'd like to see a separate float type in the interests of completeness.
> I propose:
>
> 7:3.14159^
>
> Because the caret reminds me of exponentiation. Surely every language
> has some facility to convert float <=> string, accuracy be damned?
>
>
>
>
> Cheers,
>
>
> Ryan
>
> --
> Ryan Kelly
> http://www.rfk.id.au | This message is digitally signed. Please visit
> ryan@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details
>
>
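The float proposal quoted above (7:3.14159^) is small enough to sketch in
Python 3. The function names are invented for illustration, and using
repr() for the text form is an assumption; the proposal does not say how
the float should be formatted.

```python
def dump_float(value: float) -> bytes:
    # repr() gives the shortest text that converts back to the same float.
    text = repr(value).encode("ascii")
    return b"%d:%s^" % (len(text), text)


def parse_float(msg: bytes) -> float:
    length, _, rest = msg.partition(b":")
    size = int(length)
    if rest[size:size + 1] != b"^":
        raise ValueError("not a float payload")
    return float(rest[:size].decode("ascii"))
```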
Re: [mongrel2] Another tnetstring impl and feedback on the spec
- From:
- Armando Singer
- Date:
- 2011-04-18 @ 03:45
> IIRC .net and the jvm intern strings too, which makes messing with them
> not particularly performant unless you use a string builder. Working with
> a byte[] is vastly superior until you need string semantics, and that's up
> to your application.
Correct, you want to just index into the byte[] and copy ranges to
create types in the host language.
>
> Also with a byte[] it should be relatively easy to duck into unsafe code
> to get some serious performance if there are nasty spots in the code.
> String.Split is something I'd avoid anyway since once again it means you
> search the string then make two new strings (interned) which still need to
> be processed. If possible parsing in-place and extracting the valuable
> data should prove much faster.
Yup. In my impl, the byte[] is parsed in place by jumping to different
offsets once we find the length before the ':'
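A rough Python 3 sketch of that offset-jumping approach, to make the idea
concrete. It is not the Java implementation; scan and list_spans are
hypothetical names, and the only slice taken is the short run of length
digits.

```python
def scan(msg: bytes, offset: int = 0):
    """Locate one element: (payload_start, payload_end, tag, next_offset)."""
    colon = msg.index(b":", offset)
    size = int(msg[offset:colon])     # only the length digits are sliced
    start = colon + 1
    end = start + size
    if end >= len(msg):
        raise ValueError("truncated tnetstring")
    return start, end, msg[end:end + 1], end + 1


def list_spans(msg: bytes, start: int, end: int):
    """Walk the elements packed inside a list payload by offset only."""
    spans = []
    pos = start
    while pos < end:
        s, e, tag, pos = scan(msg, pos)
        spans.append((s, e, tag))
    return spans
```

For the message 16:5:hello,5:world,] the outer scan reports a list payload
spanning offsets 3 to 19, and list_spans then yields the spans of the two
strings inside it; payload bytes are only copied when the caller finally
slices a span.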
>
> Having separate string and blob types just complicates matters.
> tnetstrings are 8 bit clean, so store whatever rubbish you want. But it's
> not the protocol's problem as to how you encode your data.
I'm suggesting string and blob types are the same on the wire, the
type char is the only difference and they are just type hints for host
languages. Some languages don't care, some do.
It's not the protocol's problem how you encode your data, but if the
protocol proposes cross-language types, one would want those types to
be useful in all the common cases. I'm not saying that byte[], a
subset of integers, booleans, null and lists and maps aren't useful,
but a string type hint would be mighty useful in some languages.
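To make the type-hint idea concrete, here is a small Python 3 sketch. The
" tag is the proposal under discussion and is not part of the tnetstring
spec; dump_hinted, parse_hinted and the text_encoding parameter are
hypothetical names, and the encoding is purely the application's choice.

```python
def dump_hinted(value, text_encoding: str = "ascii") -> bytes:
    """Same bytes on the wire either way; only the trailing tag differs."""
    if isinstance(value, str):
        data = value.encode(text_encoding)
        return b'%d:%s"' % (len(data), data)
    return b"%d:%s," % (len(value), value)


def parse_hinted(msg: bytes, text_encoding: str = "ascii"):
    length, _, rest = msg.partition(b":")
    size = int(length)
    payload, tag = rest[:size], rest[size:size + 1]
    if tag == b",":
        return payload                        # hint: raw bytes
    if tag == b'"':
        return payload.decode(text_encoding)  # hint: host-language string
    raise ValueError("unsupported type tag: %r" % tag)
```

A language without the bytes/string distinction could treat both tags
identically, while one with distinct types gets bytes and text back in the
same shape they went in.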
>
> To this end as well, iirc it's easy enough to get a byte[] and length
> from a native string, and to convert between utf-8 / ascii / whatever.
> http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.getbytes(v=VS.100).aspx
> for example. imo your api should handle byte[]'s only, and then let the
> application decide encoding, this then reflects the actual protocol
> semantics and stops anybody from getting confused.
Yes, it's pretty easy to convert from byte[] to one's platform's
string representation. But if you're doing this anyway, why not just
use plain old netstrings?
Cheers,
Armando