librelist archives

Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-17 @ 23:31
I've kicked the tires on the tnetstring format by making an
implementation, attached. Here's my feedback, as well as some notes
that would apply to both jvm languages and .NET languages.

Implementation notes:

- Parses a byte array (byte[] type) instead of strings. Strings are
  not a byte[] or a byte[] wrapper in the jvm and .NET languages. A
  byte[] is also what we get from 0mq.

- This means no using split() and friends.

- The byte[] impl still meets rule 1. It's still trivial to parse. We're just
  slicing ranges out of the byte[] and creating the appropriate type
  from that slice.
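A minimal sketch of the byte[] slicing described above, not the attached
implementation. The class name TnetSlice and its fields are hypothetical.
It reads the ASCII length prefix, copies the payload range, and reports the
type tag byte, without ever building a String from the message.

```java
import java.util.Arrays;

/** Hypothetical helper, not taken from the attached implementation. */
final class TnetSlice {
  final byte[] payload; // copy of the payload byte range
  final char type;      // type tag that follows the payload, e.g. ',' or '#'
  final int end;        // index just past the type tag

  private TnetSlice(byte[] payload, char type, int end) {
    this.payload = payload;
    this.type = type;
    this.end = end;
  }

  /** Splits the tnetstring that starts at offset into payload and type tag. */
  static TnetSlice parse(byte[] msg, int offset) {
    int i = offset;
    int size = 0;
    // Read the ASCII decimal length up to the ':' separator.
    while (i < msg.length && msg[i] != ':') {
      int d = msg[i] - '0';
      if (d < 0 || d > 9) {
        throw new IllegalArgumentException("bad length digit at index " + i);
      }
      size = Math.addExact(Math.multiplyExact(size, 10), d);
      i++;
    }
    if (i == offset || i >= msg.length) {
      throw new IllegalArgumentException("missing length prefix or ':'");
    }
    int start = i + 1;
    long stop = (long) start + size; // long avoids int overflow on huge sizes
    if (stop >= msg.length) {
      throw new IllegalArgumentException("payload or type tag is truncated");
    }
    byte[] payload = Arrays.copyOfRange(msg, start, (int) stop);
    return new TnetSlice(payload, (char) (msg[(int) stop] & 0xFF), (int) stop + 1);
  }
}
```

Calling TnetSlice.parse(msg, 0) on the bytes of "5:hello," yields a five-byte
payload and the type tag ','. The payload can hold arbitrary binary data
because nothing is decoded as text.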

Language peculiarities (apply to jvm and .NET languages).

- A Java or C# string is not equivalent to, or a wrapper around, a byte
  array. Strings are backed by a char[], each element of which is a
  UTF-16 double-byte value.

- We most definitely do not want to convert our tnetstring to a String
  before parsing. We would undergo sad conversions and waste memory
  for any messages that are images, multi-part video messages, etc.

- An impl in java, clojure, scala, C#, etc. is not going to enjoy
  similar performance to python, ruby, etc if the impl of those is simply
  ported over. I have seen implementations of the old netstring and
  current tnetstring specs that are sadly converting everything to a
  String. The naive implementation breaks rules 3 & 4:

  3. Fast and low resource intensive.

  4. Makes no assumptions about string contents and can store binary
     data without escaping or encoding them.

- Also, for dumping the data, one cannot just do string.length + ":" +
  data. We want the length in bytes of the payload, and in java,
  string.length is the length of the underlying double byte UTF-16
  char[] array. Sadly, one needs to do:

  private static final char MAX_SINGLE_BYTE = '\u007F';

  private static int asciiLength(char c) {
    return c > MAX_SINGLE_BYTE ? 2 : 1;
  }

  To properly calculate the byte length of each character.

  This is another problem I've seen in naive impls (and a common
  problem I see in general).

- It might be a good idea to have a separate string type:

  "  string
  ,  byte array

  I have implemented this in the attached code. It adds 1 line to the
  parsing. Having just a byte[] will work fine, but we're getting
  pretty close to netstrings as we'll have to convert to String any time
  we want one, which would be common.
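To illustrate the length problem from the list above: a sketch, assuming the
payload is encoded as UTF-8, that compares String.length() with the encoded
byte length. The class and method names are hypothetical. Note that UTF-8
needs three bytes for characters from U+0800 upward and four bytes for
surrogate pairs, so a two-way split at U+007F is only right for text that
stays below U+0800.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper names; only standard JDK calls are used. */
final class ByteLengthSketch {

  /** Bytes the text occupies once encoded. Simple, but allocates the array. */
  static int byteLength(CharSequence text, Charset charset) {
    return text.toString().getBytes(charset).length;
  }

  /**
   * UTF-8 length computed without encoding. Assumes well-formed UTF-16
   * input (no unpaired surrogates).
   */
  static int utf8Length(CharSequence text) {
    int n = 0;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c < 0x80) {
        n += 1;                       // ASCII
      } else if (c < 0x800) {
        n += 2;                       // e.g. Latin-1 supplement
      } else if (Character.isHighSurrogate(c)
          && i + 1 < text.length()
          && Character.isLowSurrogate(text.charAt(i + 1))) {
        n += 4;                       // surrogate pair, one code point
        i++;
      } else {
        n += 3;                       // rest of the Basic Multilingual Plane
      }
    }
    return n;
  }
}
```

The simplest safe route when dumping is to encode first and use the length of
the resulting byte[] as the tnetstring size prefix.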

Other notes

- The integer type in the reference implementation is limited to
  sys.maxint. It might be a good idea to be specific in the spec about
  what the max integer is allowed to be--a system max? Arbitrarily
  large integers? This impl parses the integers into a Java
  long, which is sort of the closest thing to the reference impl. I'd
  back it by BigInteger for an arbitrarily long integer.

- I'm also not handling floating point numbers. Is this correct? Not
  having floats seems the only way to fulfill rule #1.

- Similarly, I assume tnetstrings should not support other integer
  notations such as scientific, octal, hex. I didn't go out of my way
  to reject these, though, as using the built in number parsing is the
  easiest impl.

Cheers,
Armando

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-18 @ 08:36
On Sun, Apr 17, 2011 at 04:31:02PM -0700, Armando Singer wrote:
> I've kicked the tires on the tnetstring format by making an
> implementation, attached. Here's my feedback, as well as some notes
> that would apply to both jvm languages and .NET languages.

Cool, I'll take a look tomorrow sometime.  Some quick notes:

> Implementation notes:
> 
> - Parses a byte array (byte[] type) instead of strings.

This is correct.

> - This means no using split() and friends.

That's alright, probably faster to not do that anyway, the Python code
is meant to be small and correct, not fast.

> - The byte[] impl still meets rule 1. It's still trivial to parse. We're just
>   slicing ranges out of the byte[] and creating the appropriate type
>   from that slice.

Yes in Java, it's a byte[] and that's it.  Any conversion after that is
app specific.  We just call that a "string" since Java coopted the
phrase and turned it into this bastard child of Cthulhu where you can't
even reliably read a damn file anymore (man I'm glad I don't do Java IO
anymore).

> Language peculiarities (apply to jvm and .NET languages).
> 
> - A Java or C# string is not equivalent to, or a wrapper around, a byte
>   array. Strings are backed by a char[], each element of which is a
>   UTF-16 double-byte value.

Sort of don't care, that's their problem.  Seriously, if they went with
UTF16 with no way to convert to the easier UTF8 or ASCII they're
seriously broken.

> - We most definitely do not want to convert our tnetstring to a String
>   before parsing.

Yep, that should probably be a specific convert function that can be
called for convenience.

> - It might be a good idea to have a separate string type:
> 
>   "  string
>   ,  byte array

No, see my other email on this thread, but adding any "display like
String that's not a sequence of bytes" would open tnetstrings to
bickering over what's the one true string format.  Better to just keep
it array of bytes and let the app decide what it is.

>   I have implemented this in the attached code.

I'd remove the format " and just have a function or setting for it.
It's better that way and then your code works with everyone else's code.

> Other notes
> 
> - The integer type in the reference implementation is limited to
>   sys.maxint. It might be a good idea to be specific in the spec about
>   what the max integer is allowed to be--a system max? Arbitrarily
>   large integers? This impl parses the integers into a Java
>   long, which is sort of the closest thing to the reference impl. I'd
>   back it by BigInteger for an arbitrarily long integer.

It should be a long integer type, but I could make it specifically an
int64 or int32 and stop there.  It's not particularly meant for sending
insane precision or encrypted bignums, it's just for reasonable integer
data with strings for huge binary information.  For example, if you
wanted to send a bignum then you'd encode the raw bytes into a string
or do it as hex.
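A sketch of the bignum suggestion above, using Java's BigInteger. The class
and method names are hypothetical, and the choice of two's-complement bytes
(what BigInteger.toByteArray() returns) is an assumption that sender and
receiver would have to agree on; nothing in the spec mandates it.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: ships a bignum as a tnetstring byte string. */
final class BignumBlobSketch {

  /** Raw big-endian two's-complement bytes, framed as "len:bytes,". */
  static byte[] dumpAsBlob(BigInteger value) {
    return frame(value.toByteArray());
  }

  /** Lowercase hex digits, framed as "len:hex,". */
  static byte[] dumpAsHex(BigInteger value) {
    return frame(value.toString(16).getBytes(StandardCharsets.US_ASCII));
  }

  // Prepends the ASCII length and ':' and appends the ',' string tag.
  private static byte[] frame(byte[] payload) {
    byte[] prefix = (payload.length + ":").getBytes(StandardCharsets.US_ASCII);
    byte[] out = new byte[prefix.length + payload.length + 1];
    System.arraycopy(prefix, 0, out, 0, prefix.length);
    System.arraycopy(payload, 0, out, prefix.length, payload.length);
    out[out.length - 1] = ',';
    return out;
  }
}
```

On the receiving side, new BigInteger(payloadBytes) or
new BigInteger(hexString, 16) recovers the value, so the integer type itself
never has to carry more than a native long.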

> - I'm also not handling floating point numbers. Is this correct? Not
>   having floats seems the only way to fulfill rule #1

Yep, no floats yet as I haven't figured out a way to reliably specify
what they are.

> - Similarly, I assume tnetstrings should not support other integer
>   notations such as scientific, octal, hex. I didn't go out of my way
>   to reject these, though, as using the built in number parsing is the
>   easiest impl.

Nope, no other notations.  Only base 10 sequence of digits.  It's
easiest to get right both in parsing and writing.

Cool I'll look at your code tomorrow and if you can take out the :"
format then I'll add it to the list.  Thanks again.


-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Jason Miller
Date:
2011-04-19 @ 20:39
> > Other notes
> > 
> > - The integer type in the reference implementation is limited to
> >   sys.maxint. It might be a good idea to be specific in the spec about
> >   what the max integer is allowed to be--a system max? Arbitrarily
> >   large integers? This impl parses the integers into a Java
> >   long, which is sort of the closest thing to the reference impl. I'd
> >   back it by BigInteger for an arbitrarily long integer.
> 
> It should be a long integer type, but I could make it specifically an
> int64 or int32 and stop there.  It's not particularly meant for sending
> insane precision or encrypted bignums, it's just for reasonable integer
> data with strings for huge binary information.  For example, if you
> wanted to send a bignum then you'd encode the raw bytes into a string
> or do it as hex.
I think leaving it ambiguous, or perhaps "support the largest native 
integer type of the language" is good.  We should allow sending 64-bit 
integers for 64-bit machines, but 32-bit implementations shouldn't 
necessarily require bignums, and javascript implementations should be 
allowed to round anything larger than 2^53, because javascript is 
stupid.  Just like you can crap out for really long strings, you can
crap out for really large ints, where "really large" is implementation
defined.
> 
> > - I'm also not handling floating point numbers. Is this correct? Not
> >   having floats seems the only way to fulfill rule #1
> 
> Yep, no floats yet as I haven't figured out a way to reliably specify
> what they are.
Well since you don't reliably specify what integers are, is this really
a problem?  You could specify a minimum number of base 10 digits for the
mantissa and exponent (not including leading zeroes) and that would
enforce a minimum precision.  You could just require it formatted
with "%.53e" or equivalent, for example.

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-19 @ 22:07
On Tue, Apr 19, 2011 at 01:39:55PM -0700, Jason Miller wrote:
> > It should be a long integer type, but I could make it specifically an
> > int64 or int32 and stop there.  It's not particularly meant for sending
> > insane precision or encrypted bignums, it's just for reasonable integer
> > data with strings for huge binary information.  For example, if you
> > wanted to send a bignum then you'd encode the raw bytes into a string
> > or do it as hex.

> I think leaving it ambiguous, or perhaps "support the largest native 
> integer type of the language" is good.  We should allow sending 64-bit 
> integers for 64-bit machines, but 32-bit implementations shouldn't 
> necessarily require bignums, and javascript implementations should be 
> allowed to round anything larger than 2^53, because javascript is 
> stupid.  Just like you can crap out for really long strings, you can
> crap out for really large ints, where "really large" is implementation
> defined.

Alright, that makes sense, and then the receiver can just go by max
number of digits they'll accept before trying to convert, and they can
reject if they don't like it, just like with strings.  Other than that,
there's no spec on what a number is and that's left to the application.
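A sketch of that receiver-side check in Java, under a few assumptions: the
names are hypothetical, the digit cap is passed in by the caller, and a
leading '-' is accepted for negative values. It reads base-10 digits straight
from the byte range, so hex, octal prefixes, and scientific notation are
rejected rather than silently parsed.

```java
/** Hypothetical helper: strict base-10 parse of a byte range into a long. */
final class IntegerRangeSketch {

  /**
   * Parses msg[from, to) as a decimal integer. Throws NumberFormatException
   * for anything that is not an optional '-' followed by 1..maxDigits digits,
   * and ArithmeticException if the value does not fit in a long.
   */
  static long parseLong(byte[] msg, int from, int to, int maxDigits) {
    if (from >= to) {
      throw new NumberFormatException("empty integer");
    }
    int i = from;
    boolean negative = msg[i] == '-';
    if (negative) {
      i++;
    }
    int digits = to - i;
    if (digits == 0 || digits > maxDigits) {
      throw new NumberFormatException("unacceptable digit count: " + digits);
    }
    // Accumulate as a negative number so Long.MIN_VALUE is representable.
    long value = 0;
    for (; i < to; i++) {
      int d = msg[i] - '0';
      if (d < 0 || d > 9) {
        throw new NumberFormatException("not a base-10 digit at index " + i);
      }
      value = Math.subtractExact(Math.multiplyExact(value, 10L), (long) d);
    }
    return negative ? value : Math.negateExact(value);
  }
}
```

A cap of 19 digits covers every value a Java long can hold. A JavaScript
receiver might pick a smaller cap, which matches the "implementation defined"
limit discussed above.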

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-18 @ 17:28
On Apr 18, 2011, at 1:36 AM, Zed A. Shaw wrote:

> Cool I'll look at your code tomorrow and if you can take out the :"
> format then I'll add it to the list.  Thanks again.

Thanks for the feedback. I'll rewrite my impl now that I have my
confusion cleared up.

You will need to specify a charset when parsing to a Java string
and when dumping from a java string or char.

  public static String parseString(byte[] msg, Charset charset)...

And:

  private static final byte[] COMMA_BYTES = new byte[] { ',' };
    
  public static byte[] dump(CharSequence data, Charset charset) {
    return concat((byteLength(data) + ":").getBytes(ASCII),
      String.valueOf(data).getBytes(charset), COMMA_BYTES);
  }

Internally, when I parse the tnetstring, I'll have an optimization
to prevent an extra copy of the byte range if you want to parse the
tnetstring into a Java String.
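A self-contained sketch of the two charset-aware methods, since the concat
and byteLength helpers used above are not shown in this message. This is one
possible shape rather than the attached code: it encodes the text first and
uses the encoded array's length for the size prefix, and it decodes directly
from the payload range with the String(byte[], int, int, Charset)
constructor, which avoids copying the range into a temporary byte[].

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical class name; method names mirror the ones in the message. */
final class StringCodecSketch {

  /** Encodes text with the given charset and frames it as "len:bytes,". */
  static byte[] dump(CharSequence data, Charset charset) {
    byte[] payload = data.toString().getBytes(charset);
    byte[] prefix = (payload.length + ":").getBytes(StandardCharsets.US_ASCII);
    byte[] out = new byte[prefix.length + payload.length + 1];
    System.arraycopy(prefix, 0, out, 0, prefix.length);
    System.arraycopy(payload, 0, out, prefix.length, payload.length);
    out[out.length - 1] = ',';
    return out;
  }

  /** Decodes a single string-typed tnetstring into a Java String. */
  static String parseString(byte[] msg, Charset charset) {
    int colon = 0;
    while (colon < msg.length && msg[colon] != ':') {
      colon++;
    }
    if (colon == 0 || colon >= msg.length) {
      throw new IllegalArgumentException("missing length prefix or ':'");
    }
    int size = Integer.parseInt(new String(msg, 0, colon, StandardCharsets.US_ASCII));
    int start = colon + 1;
    if (size < 0 || (long) start + size >= msg.length || msg[start + size] != ',') {
      throw new IllegalArgumentException("not a string-typed tnetstring");
    }
    // Decode straight from the range; no intermediate byte[] copy.
    return new String(msg, start, size, charset);
  }
}
```

Round-tripping a value through dump and parseString with the same charset
returns the original text, and the size prefix always counts bytes, not chars.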

Thank you,
Armando

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-19 @ 22:06
On Mon, Apr 18, 2011 at 10:28:15AM -0700, Armando Singer wrote:
> On Apr 18, 2011, at 1:36 AM, Zed A. Shaw wrote:
> 
> > Cool I'll look at your code tomorrow and if you can take out the :"
> > format then I'll add it to the list.  Thanks again.
> 
> Thanks for the feedback. I'll rewrite my impl now that I have my
> confusion cleared up.
> 
> You will need to specify a charset when parsing to a Java string
> and when dumping from a java string or char.
> 
>   public static String parseString(byte[] msg, Charset charset)...

Ahhh, I see why, disregard my last message, since this makes sense and
is similar to what Ryan was saying.  Very cool.

> And:
> 
>   private static final byte[] COMMA_BYTES = new byte[] { ',' };
>     
>   public static byte[] dump(CharSequence data, Charset charset) {
>     return concat((byteLength(data) + ":").getBytes(ASCII),
>       String.valueOf(data).getBytes(charset), COMMA_BYTES);
>   }

Ok, but then can I dump a PNG image with this?

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-19 @ 22:25
On Apr 19, 2011, at 3:06 PM, Zed A. Shaw wrote:

> On Mon, Apr 18, 2011 at 10:28:15AM -0700, Armando Singer wrote:
>> On Apr 18, 2011, at 1:36 AM, Zed A. Shaw wrote:
>> 
>>> Cool I'll look at your code tomorrow and if you can take out the :"
>>> format then I'll add it to the list.  Thanks again.
>> 
>> Thanks for the feedback. I'll rewrite my impl now that I have my
>> confusion cleared up.
>> 
>> You will need to specify a charset when parsing to a Java string
>> and when dumping from a java string or char.
>> 
>>  public static String parseString(byte[] msg, Charset charset)...
> 
> Ahhh, I see why, disregard my last message, since this makes sense and
> is similar to what Ryan was saying.  Very cool.

Yup!

> 
>> And:
>> 
>>  private static final byte[] COMMA_BYTES = new byte[] { ',' };
>> 
>>  public static byte[] dump(CharSequence data, Charset charset) {
>>    return concat((byteLength(data) + ":").getBytes(ASCII),
>>      String.valueOf(data).getBytes(charset), COMMA_BYTES);
>>  }
> 
> Ok, but then can I dump a PNG image with this?

That's just part of the API. The full impl has a set of overloaded methods.
This is idiomatic in the host language, and we dispatch to the correct impl
according to the type of args.

/** @return byte[] or Long or Boolean or Map<byte[], Object> or List<Object> or null */
T parse(byte[])
/** convenience method to parse to Java String and optimized to prevent double copy */
String parseString(byte[], Charset)

byte[] dump(boolean)
byte[] dump(byte[])
byte[] dump(char, Charset)
byte[] dump(CharSequence, Charset) // for String, StringBuilder, etc.
byte[] dump(List<Object>)
byte[] dump(long) // also handles int, short, byte
byte[] dump(Map<byte[], Object>)
byte[] dump(Object) // handles null & all of above

So you can dump anything from the host language, unless I'm missing 
something obvious.
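A reduced sketch of how such a set of overloads can dispatch, covering only
scalars (lists and maps are omitted to keep it short). It is not the attached
implementation: the class name, the frame helper, and the dumpAny name are
hypothetical. Each branch casts to the concrete type so that the compiler
picks the specific overload instead of the catch-all method.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical reduced API: scalar overloads plus an Object dispatcher. */
final class DumpDispatchSketch {

  static byte[] dump(byte[] data) {
    return frame(data, ',');
  }

  static byte[] dump(long data) {
    return frame(Long.toString(data).getBytes(StandardCharsets.US_ASCII), '#');
  }

  static byte[] dump(boolean data) {
    return frame((data ? "true" : "false").getBytes(StandardCharsets.US_ASCII), '!');
  }

  static byte[] dump(CharSequence data, Charset charset) {
    return frame(data.toString().getBytes(charset), ',');
  }

  /** Catch-all entry point; the charset is only used for character data. */
  static byte[] dumpAny(Object data, Charset charset) {
    if (data == null) return frame(new byte[0], '~');
    if (data instanceof byte[]) return dump((byte[]) data);
    if (data instanceof CharSequence) return dump((CharSequence) data, charset);
    if (data instanceof Boolean) return dump(((Boolean) data).booleanValue());
    if (data instanceof Long || data instanceof Integer
        || data instanceof Short || data instanceof Byte) {
      return dump(((Number) data).longValue());
    }
    throw new IllegalArgumentException("unsupported type: " + data.getClass());
  }

  // Builds "len:payload" followed by the one-byte type tag.
  private static byte[] frame(byte[] payload, char tag) {
    byte[] prefix = (payload.length + ":").getBytes(StandardCharsets.US_ASCII);
    byte[] out = new byte[prefix.length + payload.length + 1];
    System.arraycopy(prefix, 0, out, 0, prefix.length);
    System.arraycopy(payload, 0, out, prefix.length, payload.length);
    out[out.length - 1] = (byte) tag;
    return out;
  }
}
```

A PNG would go through dump(byte[]) untouched, while text goes through the
charset-aware overload, so binary data is never pushed through a String.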

Cheers,
Armando

> 
> -- 
> Zed A. Shaw
> http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-19 @ 22:30
On Tue, Apr 19, 2011 at 03:25:42PM -0700, Armando Singer wrote:
> > Ok, but then can I dump a PNG image with this?
> 
> That's just part of the API. The full impl has a set of overloaded methods.
> This is idiomatic in the host language, and we dispatch to the correct impl
> according to the type of args.

Ah, very cool, alright that'll work just fine.  I'll take a look for
when I do the updated tnetstrings.org page.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-05-02 @ 08:09
More tnetstrings and mongrel2 handler questions and feedback.

I've updated my Java handler impl to handle tnetstrings using
the TNetstring impl that I've been asking questions about
recently. Attached is a near-final draft of both the updated
TNetstring parser and dumper and the updated Java Mongrel2 handler
(the one linked to from mongrel2.org).

As with the Python impl, headers are handled whether they are passed
as json or tnetstring by first parsing as a tnetstring. If we get a
byte[], then we parse the byte[] as json.

However, with strong types it's a bit of a mess because we have to
cast to one of several possible types if we get headers from a
tnetstring, or to a Map<String, Object> if we get headers from json. More
on this below.

First, I wanted to clarify all the charsets used for
encoding/decoding. I believe what I've attached is strictly correct in
each area, but it would be good to get confirmation that these are the
charsets handlers need to use.

CHARSETS FOR MONGREL2 HANDLERS:

HTTP header encoding: ISO-8859-1 (RFC 2616)

   deliverHTTP & replyHTTP send an HTTP message. We send specified
   headers as well. Handler implementers would need to encode
   characters outside of ISO-8859-1 according to RFC 2047, correct?

   Do we need to be more specific about what header encoding should be
   used for deliverHTTP and replyHTTP?

Parsing json (headers or body): Unicode (default UTF-8, but can
  also be UTF-16 (BE or LE), UTF-32 (BE or LE)...) (RFC 4627)

  "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

  We also need to auto-detect which unicode encoding is used by
  checking the first 2-4 octets. The attached updated impl does this as well.
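  A sketch of that detection, following the null-octet patterns listed in
  RFC 4627 section 3. It is not the attached code; the class and method names
  are hypothetical. It returns a charset name rather than a Charset object,
  on the assumption that the caller checks whether its JVM supports the
  UTF-32 variants.

```java
/** Hypothetical helper: guesses the Unicode encoding of a JSON text. */
final class JsonEncodingSketch {

  /**
   * Relies on the first two characters of a JSON text being ASCII, so the
   * positions of zero octets in the first four bytes identify the encoding.
   */
  static String detect(byte[] json) {
    if (json.length >= 4) {
      boolean z0 = json[0] == 0;
      boolean z1 = json[1] == 0;
      boolean z2 = json[2] == 0;
      boolean z3 = json[3] == 0;
      if (z0 && z1 && z2 && !z3) return "UTF-32BE";  // 00 00 00 xx
      if (z0 && !z1 && z2 && !z3) return "UTF-16BE"; // 00 xx 00 xx
      if (!z0 && z1 && z2 && z3) return "UTF-32LE";  // xx 00 00 00
      if (!z0 && z1 && !z2 && z3) return "UTF-16LE"; // xx 00 xx 00
    }
    // Anything else, including texts shorter than four bytes, is UTF-8.
    return "UTF-8";
  }
}
```

  The returned name can be handed to Charset.forName before decoding the
  body or headers.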

Delivering json: UTF-8.

  One must make a choice of Unicode encoding when delivering valid
  JSON. UTF-8 is 8 bit clean and is the most widely supported and is
  the default for JSON--a no brainer. (I don't want to present an
  unnecessary option just to deliver json. Best to pick the default
  valid encoding.)

Mongrel2 header encoding: Unicode if headers from json (no specified
    encoding if tnetstring)

  - Previously, mongrel headers were specified w/ JSON strings. Since
    json strings *must* be encoded in Unicode (default UTF-8, but can
    also be UTF-16 (BE or LE), UTF-32 (BE or LE)...)--this meant that
    valid headers previously were unicode

  - Previously, it was possible to obtain mongrel2 headers as a Java
    String type, because it's possible to auto-detect the unicode
    encoding by checking the pattern of nulls of the first 4 octets.

  - With the addition of tnetstrings support, headers can also be
    retrieved as byte[] or Map<byte[], byte[]> or ... w/ no encoding
    specified. If working w/ character data, the app developers will
    decide which encoding they want to use.

  1 Do we want to define the mongrel2 header encoding? It has been
    implicitly unicode because of json but can now be unicode or unspecified.

  2 Do we want to restrict the *type* that headers can be as a subset
    of JSON and tnetstrings?

    Currently, it's possible to get headers as a byte[], int, List,
    Map, etc.

    The API would be nicer in strongly typed languages if a
    TNetstring header is specified to always be a dict with blob
    keys and values. (Map<byte[], byte[]> w/ a convenience method that
    accepts a charset to obtain a Map<String, String>)

    It's sort of weird to get headers as anything other than a
    dict. And if we prevent it, the API is nicer in strongly typed
    languages.

    Similarly, JSON headers could be specified to be a JSON Object
    with json string keys and values (Map<String, String>)

    If we both restricted the types of headers and specified an
    encoding for non-json headers, then we could return a Map<String,
    String> whether the headers were supplied as either tnetstring or
    json, and the single method would be just as nice as the python
    version.

    (The attached impl doesn't restrict types or encoding. I'm not sure
    it's a good idea, but I want to make the tradeoffs clear.)
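    A sketch of the convenience method floated in point 2, assuming headers
    arrive as a dict whose keys and values are all blobs. The class and
    method names are hypothetical and this is not part of the attached
    implementation.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: decodes blob headers into String headers. */
final class HeaderViewSketch {

  /**
   * Decodes every key and value with the given charset. Insertion order of
   * the source map's iteration is preserved in the result.
   */
  static Map<String, String> asStrings(Map<byte[], byte[]> headers, Charset charset) {
    Map<String, String> out = new LinkedHashMap<>();
    for (Map.Entry<byte[], byte[]> e : headers.entrySet()) {
      out.put(new String(e.getKey(), charset), new String(e.getValue(), charset));
    }
    return out;
  }
}
```

    One practical reason for the String view: Java arrays use identity-based
    equals and hashCode, so a Map<byte[], byte[]> cannot be queried with a
    freshly built key, whereas the decoded map supports get("METHOD").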

Also here are my differences from the python reference impl:

- Added a deliverTNetstring method in addition to the existing deliver
  and deliverJson methods, which are also in the python impl. I'm
  assuming the python impl wants a deliver_tnetstring as well since we
  are now handling tnetstring headers and bodies.

- The python impl has recv and recvJson. I removed recvJson in this
  revision of the Java version because we need to get the json data
  with a separate method anyway.

  req.getBody() returns the Java version of the tnetstring types (a byte[] or
    Long or Boolean or Map<byte[], Object> or List<Object> or null)

  req.getJsonData() returns Map<String, Object>

  The methods aren't separate just because of the different types
  returned--Like the python impl, getJsonData returns an empty map if
  the "METHOD" header is not "JSON". I think recvJson can go away in
  the python impl as well as it only seems to signal an eager json parse.

  My impl is lazy so getBody(), getHeaders(), getJsonData() etc only
  parse their range of the 0mq byte[] msg when called. One only needs
  to call recv and data will be parsed according to get method that's
  called.

  And I renamed getData() to getJsonData() because it's confusing when
  both getBody() and getData() can return complex types. getData() is really
  only for json data.

I'll put the updated handler and tnetstring implementations to a permanent
location if all looks good.

Cheers,
Armando

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Bobby Powers
Date:
2011-05-10 @ 22:13
Hi Armando,

I seem to need the following patch to prevent an infinite recursion
when dumping a string:

diff --git a/src/main/java/com/paperculture/codec/TNetstring.java b/src/main/java/com/paperculture/codec/TNetstring.
index 44cf78d..338a430 100644
--- a/src/main/java/com/paperculture/codec/TNetstring.java
+++ b/src/main/java/com/paperculture/codec/TNetstring.java
@@ -197,7 +197,7 @@ public final class TNetstring {
   private static final byte[] NULL_BYTES = "0:~".getBytes(ASCII);

   public static byte[] dump(final Object data) {
-    if (data instanceof String) return dump(data);
+      if (data instanceof String) return dump((String)data, Charset.defaultCharset());
     else if (data instanceof byte[]) return dump((byte[]) data);
     else if (data instanceof Number) return numberBytes(data.toString());
     else if (data instanceof Boolean) return dump(((Boolean) data).equals(true));

Any chance you'll put your java up in something like github at some point?

yours,
Bobby

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-05-10 @ 23:23
On May 10, 2011, at 3:13 PM, Bobby Powers wrote:

> Hi Armando,
> 
> I seem to need the following patch to prevent an infinite recursion
> when dumping a string:

Hi Bobby,

Thanks for pointing that out. I had caught that bug and have actually written
unit tests, but I haven't yet put my impl up someplace public. I'll put it
somewhere tonight hopefully. Yes, maybe I'll put it on github, too.

Attached is my latest. I had since gone on to micro-benchmark and optimize. This
version now doesn't produce any garbage, for example (well, except the
results). So the size and longs are parsed straight from the byte array
range. I've also found I could convert to and from String to bytes about 2 to 3x
faster than with string.getBytes(charset), or string.getBytes(charsetName) for
example (and between those 2, oddly passing in the charsetName is faster.)

Let me know if you see anything funny!

Cheers,
Armando

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-05-02 @ 21:06
On Mon, May 02, 2011 at 01:09:09AM -0700, Armando Singer wrote:
> More tnetstrings and mongrel2 handler questions and feedback.

Great, btw have you seen:

https://github.com/kwo/mojaha

> First, I wanted to clarify all the charsets used for
> encoding/decoding. I believe what I've attached is strictly correct in
> each area, but it would be good to get confirmation that these are the
> charsets handlers need to use.

Yeah, this is all fine (but man Java is one OCD kind of language).

>   1 Do we want to define the mongrel2 header encoding? It has been
>     implicitly unicode because of json but can now be unicode or unspecified.

It's already specified in the parser in src/http11/http11_parser.rl
exactly.  English words can't do better than that parser as a
specification.

>   2 Do we want to restrict the *type* that headers can be as a subset
>     of JSON and tnetstrings?
>
>     Currently, it's possible to get headers as a byte[], int, List,
>     Map, etc.

It's defined in code and currently you'll only get string=string or
string=[string] (in your terms byte[]=byte[] or byte[]=array<byte[]>).

>     The API would be nicer in strongly typed languages if a
>     TNetstring header is specified to always be a dict with blob
>     keys and values. (Map<byte[], byte[]> w/ a convenience method that
>     accepts a charset to obtain a Map<String, String>)

You could do that as a convenience for Mongrel2 handlers, and it might
protect them, but that's specific to Java not really anyone else.

>     (The attached impl doesn't restrict types or encoding. I'm not sure
>     it's a good idea, but I want to make the tradeoffs clear.)

Just out of curiosity, why is this such a huge issue?  I've implemented
this in many different languages, and the worst that happens is you just
drop the request and log an error.  Are you imagining that this will
have some sort of huge problems if not specified completely exactly?

> Also here are my differences from the python reference impl:
> 
> - Added a deliverTNetstring method in addition to the existing deliver
>   and deliverJson methods, which are also in the python impl. I'm
>   assuming the python impl wants a deliver_tnetstring as well since we
>   are now handling tnetstring headers and bodies.

Uh, no IIRC that's only if the *client* (not mongrel2) wants json, so
like for JSSockets.  I suppose you can add it for completeness but it's
not something I envision people using.

> I'll put the updated handler and tnetstring implementations to a permanent
> location if all looks good.

Awesome, let me know and I'll point at it.


-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-05-02 @ 22:07
Thanks Zed. Comments below:

On May 2, 2011, at 2:06 PM, Zed A. Shaw wrote:

> On Mon, May 02, 2011 at 01:09:09AM -0700, Armando Singer wrote:
>> More tnetstrings and mongrel2 handler questions and feedback.
> 
> Great, btw have you seen:
> 
> https://github.com/kwo/mojaha

Yup. This one was sent out a couple days ago. I'm providing the author
feedback.

I implemented mine last year and have now been adding support for
tnetstrings as well as clarifying encoding for everything.[1]

After a quick look, mojaha aims to model HTTP requests like the servlet
API. I'm not sure about how to make non-http replies w/ reply, deliverJson,
etc. and I don't believe it supports tnetstrings yet. Good to see new
impls coming out.

>> First, I wanted to clarify all the charsets used for
>> encoding/decoding. I believe what I've attached is strictly correct in
>> each area, but it would be good to get confirmation that these are the
>> charsets handlers need to use.
> 
> Yeah, this is all fine (but man Java is one OCD kind of language).

True

> 
>>  1 Do we want to define the mongrel2 header encoding? It has been
>>    implicitly unicode because of json but can now be unicode or unspecified.
> 
> It's already specified in the parser in src/http11/http11_parser.rl
> exactly.  English words can't do better than that parser as a
> specification.
> 
>>  2 Do we want to restrict the *type* that headers can be as a subset
>>    of JSON and tnetstrings?
>> 
>>    Currently, it's possible to get headers as a byte[], int, List,
>>    Map, etc.
> 
> It's defined in code and currently you'll only get string=string or
> string=[string] (in your terms byte[]=byte[] or byte[]=array<byte[]>).

Great. I must have missed that.

> 
>>    The API would be nicer in strongly typed languages if a
>>    TNetstring header is specified to always be a dict with blob
>>    keys and values. (Map<byte[], byte[]> w/ a convenience method that
>>    accepts a charset to obtain a Map<String, String>)
> 
> You could do that as a convenience for Mongrel2 handlers, and it might
> protect them, but that's specific to Java not really anyone else.

Ok. 

> 
>>    (The attached impl doesn't restrict types or encoding. I'm not sure
>>    it's a good idea, but I want to make the tradeoffs clear.)
> 
> Just out of curiosity, why is this such a huge issue?  I've implemented
> this in many different languages, and the worst that happens is you just
> drop the request and log an error.  Are you imagining that this will
> have some sort of huge problems if not specified completely exactly?

Oh, it's not a huge issue. I'm just trying to be sure I made everything
perfect. :) To be clear, I meant "I'm not sure restricting the encoding
is a good idea".

In an HTTP Java API, one would expect to get Map<String, String> for headers,
for example. I realize we're not doing the same thing here. So we need to
get Map<byte[], byte[]> or Map<byte[], List<byte[]>> for headers and provide a way
to convert to Strings given a charset.

Not mongrel2's fault, just a shame that it's so painful in Java. The API
is a bit clunkier to use unless encodings are specified.

Handler apps like this would be a PITA in common cases:

final Map<byte[], byte[]> headers = req.getHeaders();
byte[] fooVal = headers.get("foo".getBytes(mycharset));
search(new String(fooVal, mycharset));

I think the best we can do for the common case of getting Strings is:

final Map<String, String> headers = req.getHeadersWithBytesAsString(mycharset);
search(headers.get("foo"));
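
For comparison, here is a rough Python 3 sketch of the same convenience. The name `headers_as_text` and the sample headers are made up for illustration and are not part of any mongrel2 handler library. It assumes header values arrive either as a single blob or as a list of blobs.

```python
def headers_as_text(headers, charset="ascii"):
    """Decode a {bytes: bytes or [bytes]} header mapping into str keys and values."""
    decoded = {}
    for key, value in headers.items():
        if isinstance(value, list):
            # A repeated header arrives as a list of blobs; decode each one.
            decoded[key.decode(charset)] = [v.decode(charset) for v in value]
        else:
            decoded[key.decode(charset)] = value.decode(charset)
    return decoded


# Hypothetical raw headers, as a handler might receive them.
raw = {b"foo": b"bar", b"accept": [b"text/html", b"text/plain"]}
headers = headers_as_text(raw)
```

The raw byte mapping stays available to callers that need it; the decoded view is only a convenience layered on top.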

> 
>> Also here are my differences from the python reference impl:
>> 
>> - Added a deliverTNetstring method in addition to the existing deliver
>>  and deliverJson methods, which are also in the python impl. I'm
>>  assuming the python impl wants a deliver_tnetstring as well since we
>>  are now handling tnetstring headers and bodies.
> 
> Uh, no IIRC that's only if the *client* (not mongrel2) wants json, so
> like for JSSockets.  I suppose you can add it for completeness but it's
> not something I envision people using.

That makes sense. I'll remove this to keep the impl close to the python impl.

> 
>> I'll put the updated handler and tnetstring implementations to a permanent
>> location if all looks good.
> 
> Awesome, let me know and I'll point at it.

Thanks again for your time, Zed.

Cheers,
Armando

[1] http://www.paperculture.com/code/java-mongrel2-handler.html
    (this is already linked from mongrel2.org but I'm adding the tnetstring
    parser and updated handler as well)
> 
> 
> -- 
> Zed A. Shaw
> http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Karl Ostendorf
Date:
2011-05-03 @ 09:45
On Tue, May 3, 2011 at 00:07, Armando Singer <armando.singer@gmail.com> wrote:

> In an HTTP Java API, one would expect to get Map<String, String> for headers,
> for example. I realize we're not doing the same thing here. So we need to
> get Map<byte[], byte[]> or Map<byte[], List<byte[]>> for headers and
> provide a way to convert to Strings given a charset.

I wasn't aware that HTTP headers could be in any charset other than
ASCII. Isn't this overkill?

Cheers,
Karl

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Loic d'Anterroches
Date:
2011-05-03 @ 11:53

On 2011-05-03 11:45, Karl Ostendorf wrote:
> On Tue, May 3, 2011 at 00:07, Armando Singer <armando.singer@gmail.com> wrote:
> 
>> In an HTTP Java API, one would expect to get Map<String, String> for headers,
>> for example. I realize we're not doing the same thing here. So we need to
>> get Map<byte[], byte[]> or Map<byte[], List<byte[]>> for headers and
>> provide a way to convert to Strings given a charset.
> 
> I wasn't aware that HTTP headers could be in any charset other than
> ASCII. Isn't this overkill?

Practically, everybody is sending in ASCII. But do not forget that HTTP
was created in the 90's by people living mostly in Europe with a strong
background in research. That is, people used to work in different
languages (mit nackische füße?). So they tried to do the right thing to
not treat one encoding different than another.

Fast forward 15 years... oops, the web is not really what it was
anymore.

So, the now de facto way to do things is to have headers in ASCII,
with the definition of the encoding of the payload in them. You
basically split headers in ASCII and payload in whatever you/the server
want. By forcing ASCII headers, I don't think you will break a client.

loïc

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-05-03 @ 19:00
On May 3, 2011, at 4:53 AM, Loic d'Anterroches wrote:
> On 2011-05-03 11:45, Karl Ostendorf wrote:
>> On Tue, May 3, 2011 at 00:07, Armando Singer <armando.singer@gmail.com> wrote:
>> 
>>> In an HTTP Java API, one would expect to get Map<String, String> for headers,
>>> for example. I realize we're not doing the same thing here. So we need to
>>> get Map<byte[], byte[]> or Map<byte[], List<byte[]>> for headers and
>>> provide a way to convert to Strings given a charset.
>> 
>> I wasn't aware that HTTP headers could be in any charset other than
>> ASCII. Isn't this overkill?

Hi Karl - the HTTP rfc is 8859-1 and the servlet API that you're
familiar with, the commons HTTPClient libs, and many many parsers use 8859-1.

However, after looking at the mongrel2 parser, it looks like Zed is
intentionally restricting headers to only a sane subset of ASCII.

> Practically, everybody is sending in ASCII. But do not forget that HTTP
> was created in the 90's by people living mostly in Europe with a strong
> background in research. That is, people used to work in different
> languages (mit nackische füße?). So they tried to do the right thing to
> not treat one encoding different than another.
> 
> Fast forward 15 years... oops, the web is not really what it was
> anymore.
> 
> So, the now de facto way to do things is to have headers in ASCII,
> with the definition of the encoding of the payload in them. You
> basically split headers in ASCII and payload in whatever you/the server
> want. By forcing ASCII headers, I don't think you will break a client.

Hi Loïc - this makes sense. As I mentioned, 8859-1 is used in popular
java http parsers at least. But mongrel forcing ascii makes sense.

So I can automatically convert the headers to Java Strings (as
Map<String, String> or Map<String, List<String>>) using ascii because
1) the mongrel2 http parser ensures I will only get ascii and 2) It
would have been weird if I decoded the header bytes as 8859-1, but
valid JSON *must* be encoded with a unicode charset. So now if I get
headers as json, I can decode with ascii, and the json headers are
still unicode--the subset of utf-8 down in the ascii range.
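
The reasoning above relies on 7-bit ASCII being a common subset of both ISO-8859-1 and UTF-8. A small Python 3 sketch of that property follows; `is_ascii_only` is a hypothetical helper, not something mongrel2 provides.

```python
header = b"Content-Type"

# Pure ASCII bytes decode to the same text under all three charsets.
as_ascii = header.decode("ascii")
as_utf8 = header.decode("utf-8")
as_latin1 = header.decode("iso-8859-1")


def is_ascii_only(blob):
    """Return True if every byte is in the 7-bit ASCII range (0x00-0x7F)."""
    return all(b < 0x80 for b in blob)
```

A byte such as 0xE9 is where the charsets diverge: it is a valid ISO-8859-1 character but not a valid UTF-8 sequence on its own, so a strict ASCII check sidesteps the question.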

Thanks guys!

Armando

> 
> loïc

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-05-03 @ 20:54
On Tue, May 03, 2011 at 12:00:10PM -0700, Armando Singer wrote:
> However, after looking at the mongrel2 parser, it looks like Zed is
> intentionally restricting headers to only a sane subset of ASCII.

Yep, the rationale being that, when I researched security holes in HTTP
servers, they were due to ambiguity in the grammar of the protocol either
because of encodings or because of loose English specification.  By
writing a parser and saying "Nope I do ASCII only" I block a huge number
of attacks that try to jerk around with the encoding.  Years later,
many many web sites using my parser to handle HTTP show that *all* well
written clients use ASCII for the HTTP protocol, and only malicious
attacks try to do anything else.

> So I can automatically convert the headers to Java Strings (as
> Map<String, String> or Map<String, List<String>>) using ascii because
> 1) the mongrel2 http parser ensures I will only get ascii and 2) It
> would have been weird if I decoded the header bytes as 8859-1, but
> valid JSON *must* be encoded with a unicode charset. So now if I get
> headers as json, I can decode with ascii, and the json headers are
> still unicode--the subset of utf-8 down in the ascii range.

Yep, that should work fine.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-18 @ 22:21
Here's take 2. First, a quick aside:

The spec says dict keys must be strings only. I'm assuming this also
means octets and means I need to use byte[] in jvm and .NET languages,
so we're returning and dumping Map<byte[], Object> instead of
Map<String, Object>. Sort of a pain to use, but correct.

The main parse method is:

  /** @return byte[] or Long or Boolean or Map<byte[], Object> or List<Object> or null */
  public static <T> T parse(final byte[] tnetstring)...

And a convenience method to get a Java String, which is optimized
internally to prevent an extra copy (byte[] range -> byte[] -> array
copy + conversion to make String)

  /** convenience method to parse to Java String and optimized to prevent double copy */
  public static String parseString(final byte[] tnetstring, final Charset charset)...
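
To illustrate the slice-based approach without the attached Java source, here is a minimal Python 3 sketch of a parser that works directly on bytes. It is not the attached implementation. The function name `parse` and its `(value, rest)` return shape are assumptions made for this example, and it covers only the blob, integer, boolean, null, list and dict tags.

```python
def parse(data):
    """Parse one tnetstring from bytes and return (value, remaining_bytes)."""
    colon = data.index(b":")
    length = int(data[:colon].decode("ascii"))
    start = colon + 1
    end = start + length
    payload = data[start:end]
    tag = data[end:end + 1]
    rest = data[end + 1:]
    if len(payload) != length or not tag:
        raise ValueError("truncated tnetstring")

    if tag == b",":
        return payload, rest              # blob: raw bytes, never decoded here
    if tag == b"#":
        return int(payload.decode("ascii")), rest
    if tag == b"!":
        return payload == b"true", rest
    if tag == b"~":
        return None, rest
    if tag == b"]":
        items = []
        while payload:                    # a list is a run of nested tnetstrings
            item, payload = parse(payload)
            items.append(item)
        return items, rest
    if tag == b"}":
        result = {}
        while payload:                    # a dict is a run of key/value pairs
            key, payload = parse(payload)
            value, payload = parse(payload)
            result[key] = value
        return result, rest
    raise ValueError("unknown type tag: %r" % tag)
```

Dict keys come back as bytes, mirroring the Map<byte[], Object> choice described above.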

Then there are a bunch of dump methods to dump from the various Java types:

byte[] dump(final CharSequence data, final Charset charset)
byte[] dump(final byte[] data)
byte[] dump(final char data, final Charset charset)
byte[] dump(final Map<byte[], Object> data)

etc...
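
The dump side can be sketched the same way in Python 3. This is an illustration only, not the attached code. The single `dump` function and its `charset` default are assumptions for this example. The point it shows is that the length prefix counts encoded bytes, not characters.

```python
def dump(value, charset="utf-8"):
    """Serialize a Python value to tnetstring bytes."""
    if value is None:
        return b"0:~"
    if isinstance(value, bool):           # check bool before int: bool subclasses int
        payload, tag = (b"true" if value else b"false"), b"!"
    elif isinstance(value, int):
        payload, tag = str(value).encode("ascii"), b"#"
    elif isinstance(value, str):
        # Text is encoded first, so the length is measured in bytes.
        payload, tag = value.encode(charset), b","
    elif isinstance(value, bytes):
        payload, tag = value, b","
    elif isinstance(value, (list, tuple)):
        payload, tag = b"".join(dump(v, charset) for v in value), b"]"
    elif isinstance(value, dict):
        payload = b"".join(dump(k, charset) + dump(v, charset)
                           for k, v in value.items())
        tag = b"}"
    else:
        raise TypeError("cannot dump %r" % type(value))
    return str(len(payload)).encode("ascii") + b":" + payload + tag
```

A one-character string such as "é" becomes a two-byte payload under UTF-8, which is why taking the character count as the length would produce a corrupt message.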

The approach I've taken should be quite performant and of reference
quality, especially after I actually test it and perhaps get a round
of feedback.

Feedback would be much appreciated! I'll put it in a more permanent location
if it looks good.

Cheers,
Armando

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Ryan Kelly
Date:
2011-04-18 @ 00:27
On Sun, 2011-04-17 at 16:31 -0700, Armando Singer wrote:
>
> - It might be a good idea to have a separate string type:
> 
>   "  string
>   ,  byte array
> 
>   I have implemented this in the attached code. It adds 1 line to the
>   parsing. Having just a byte[] will work fine, but we're getting
>   pretty close to netstrings as we'll have to convert to String any time
>   we want one, which would be common.

I feel sorry for this poor dead horse, but I suspect it's going to keep
getting beaten.


On one hand, for a general-purpose library in a language that has
distinct "bytes" and "string" types, it would be very nice to be able to
round-trip mixed data structures, e.g.:

   ["hello",u"world"] == tns.loads(tns.dumps(["hello",u"world"]))

On the other hand, you most definitely do NOT want mongrel2 trying to
deal with encoding/decoding unicode strings.  Bad bad bad.


And that's not even going into the details of encoding.  To quote the
tnetstrings spec:

  "String encoding is an application level, political, and display
specification.  Transport protocols should not have to decode random
character encodings accurately to function properly."

A big +1 from me on that!


When the tnetstring adventure was just starting out, Zed's original
proposal was to have separate "string" and "bytes" type tags, but have a
policy that "tnetstring doesn't do encoding".  So this:

   5:hello,

Means "here is a byte array".  While this:

   5:hello"

Means "here is a string in whatever encoding you're using up there".


I fought back against having bytes in a potentially ambiguous encoding.
I now wish I'd kept my mouth shut.


As I see it there are two options:

  1) Just tell unicode strings to piss off.  This is not a
general-purpose serialisation library, it's a special-purpose format for
communicating between bytestream-based services.

  2) Allow a separate string type, but refuse to accept or generate it
within mongrel2.


In a previous email on this topic, people jumped in to say that when I
said "unicode strings" like above what I really meant was "utf8
strings".  Not so.

If the proposed solution were indeed to be "encode all unicode strings
in utf8 and decode them in the parser" I would be against it.

But if the proposal is simply to indicate "these bytes are a unicode
string in whatever encoding you've decided to use for this application,
you deal with it" then I think, based on my experiences with the python
module, it would be worth adding as a separate type tag.


The ability to transparently round-trip both bytearrays and strings
would actually be an additional bonus of tnetstrings over JSON, which
demands that all strings be unicode.

(Worse actually: the whole JSON document is a big unicode string in one
of several different encodings, and your parser is supposed to examine
the pattern of zeros in the first few chars of the document to determine
which encoding it is in.  Of course no-one does this, so in the wild
JSON is almost always in utf8.)

> - The integer type in the reference implementation is limited to
>   sys.maxint. It might be a good idea to be specific in the spec about
>   what the max integer is allowed to be

Indeed, sys.maxint is different on 32-bit vs 64-bit python so no
ambiguity is resolved here.

> - I'm also not handling floating point numbers. Is this correct? Not
>   having floats seems the only way to fulfill rule #1

I'd like to see a separate float type in the interests of completeness.
I propose:

    7:3.14159^

Because the caret reminds me of exponentiation.  Surely every language
has some facility to convert float <=> string, accuracy be damned?
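
A Python 3 sketch of what the proposed `^` tag could look like in practice. The helper names `dump_float` and `parse_float` are made up for this example, and the textual form is whatever Python's `repr` produces, which is an assumption rather than anything the spec defines.

```python
def dump_float(value):
    """Serialize a float using the proposed '^' type tag."""
    payload = repr(float(value)).encode("ascii")
    return str(len(payload)).encode("ascii") + b":" + payload + b"^"


def parse_float(data):
    """Parse a single '^'-tagged tnetstring back into a float."""
    length, _, rest = data.partition(b":")
    size = int(length.decode("ascii"))
    payload, tag = rest[:size], rest[size:size + 1]
    if tag != b"^":
        raise ValueError("not a float tnetstring")
    return float(payload.decode("ascii"))
```

Whether two languages agree on the decimal text for a given float is the open question; this sketch only round-trips within Python.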




  Cheers,


      Ryan

-- 
Ryan Kelly
http://www.rfk.id.au  |  This message is digitally signed. Please visit
ryan@rfk.id.au        |  http://www.rfk.id.au/ramblings/gpg/ for details

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-18 @ 03:25
> <unicode strings commentary>

My thinking is as follows:

- Some language families have a separate byte[] and string types that
  incur conversion and/or memory overhead to convert one to the other.

- These languages must parse the payload as a byte array to stay
  performant.

- It's fine in these languages to use a byte[] type and convert them
  to whatever type is needed in the host language, as with my mongrel2
  handler impl using plain old netstrings. However, the usefulness of
  a typed wire format is a bit lessened in the above languages if we
  have to convert from byte[] to String in the host language anyway,
  for what is probably the most common case (obviously depends on the
  app).

I'm not proposing adding a unicode type to the tnetstring spec at all.

Rather, the , and " type would be identical, carrying a payload of
ASCII encoded bytes.

In other words, the type character (, vs ") is *just a type hint* so
that languages w/ distinct byte[] and string types can create the
appropriate data structures in the host language. tnetstrings or
mongrel2 does not need to know that in this language, for example,
Strings are arrays of double byte UTF-16 unsigned chars.

So in Java, I'd have:

    case '"': return new String(msg, i, len, ASCII);
    case ',': return Arrays.copyOfRange(msg, i, i + len);

But in python we'd do the same thing for both:

    elif payload_type == ',' or payload_type == '"':
        value = payload

Similarly, in the host language, I *have* to deal w/ peculiarities
that Java Strings, for example, are unicode aware. Most libraries are
using the built in types, so we have to deal with dumping types in the
host language. But no need to pollute the tnetstrings wire format w/
any of the language's peculiarities. So I want to easily be able to
dump all Strings, StringBuilders...all CharSequences, boolean, long,
short, byte, byte[], char, for example. All of these types are handled
by the 5 methods below in this host language:

  public static String dump(CharSequence data) { return asciiLength(data) + ":" + data + '"'; }

  public static String dump(byte[] data) { return data.length + ":" + new String(data, ASCII) + ','; }

  public static String dump(boolean data) { return data ? "4:true!" : "5:false!"; }

  public static String dump(long data) { return numberString(Long.toString(data)); }

  public static String dump(char data) { return asciiLength(data) + ":" + data + '"'; }

But none of that is polluting tnetstrings. Round tripping works well
in this case regardless of language, but each would have dump methods
specific to that language.

>> - The integer type in the reference implementation is limited to
>>  sys.maxint. It might be a good idea to be specific in the spec about
>>  what the max integer is allowed to be
> 
> Indeed, sys.maxint is different on 32-bit vs 64-bit python so no
> ambiguity is resolved here.

Yes. Some options:

- integer is defined as architecture dependent and is very clear on
  what this means. Not very portable.

- integer is defined as always 64 bit signed or unsigned or whatever.

- integer is defined as arbitrarily large and signed. I think this
  would be the easiest to implement across languages and it fully
  supports that part of the numerical tower (in scheme terms).
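
To make the second and third options concrete, here is a small Python 3 sketch. The name `parse_integer` and the `bounded` flag are made up for illustration. Python integers are arbitrary precision, so the unbounded case needs no extra work, while the 64-bit signed case is a range check on top.

```python
INT64_MIN = -2 ** 63
INT64_MAX = 2 ** 63 - 1


def parse_integer(payload, bounded=False):
    """Parse an integer payload; optionally enforce a signed 64-bit range."""
    value = int(payload.decode("ascii"))
    if bounded and not (INT64_MIN <= value <= INT64_MAX):
        raise OverflowError("integer does not fit in 64 bits: %d" % value)
    return value
```

A language without built-in big integers would need a library type for the unbounded option, which is the portability cost to weigh against the fixed-width definition.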

Cheers,
Armando

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Ryan Kelly
Date:
2011-04-18 @ 04:09
On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
> > <unicode strings commentary>
> 
>
> I'm not proposing adding a unicode type to the tnetstring spec at all.
>
> Rather, the , and " type would be identical, carrying a payload of
> ASCII encoded bytes.
> 
> In other words, the type character (, vs ") is *just a type hint* so
> that languages w/ distinct byte[] and string types can create the
> appropriate data structures in the host language.

Right.  Trying to do anything else at the tnetstring level is asking for
trouble.

Perhaps I'm just confusing the issue by saying "unicode" everywhere.
Sorry.  It's a type distinction between "text" and "bytes" and it's
about how you want to work with the object after it has been
deserialized.  Agree?

But, and correct me if I'm wrong, the whole trouble here is that the
"string" object is invariably designed to represent unicode characters.
So there is encoding going on somewhere, even if it's the implicit
encoding that your host language does to store the things in memory.

Can the java String object represent an arbitrary byte sequence?  One of
the issues faced by python is that you can't really represent e.g. null
bytes in a unicode string object.

>  tnetstrings or mongrel2 does not need to know that in this language,
> for example, Strings are arrays of double byte UTF-16 unsigned chars.

I think we can all agree that we don't want tnetstrings to touch any
encoding issues :-)

> So in Java, I'd have:
> 
>     case '"': return new String(msg, i, len, ASCII);
>     case ',': return Arrays.copyOfRange(msg, i, i + len);
> 
> But in python we'd do the same thing for both:
> 
>     elif payload_type == ',' or payload_type == '"':
>         value = payload

I think this behaviour would be very surprising to python programmers. 

If you've said "this stuff is text" in your type tag, they would expect
to get a unicode string object.

Probably I just don't understand enough about how Java strings work.
Sounds like the distinction between String/byte[] is sufficiently
different to the bytes/unicode distinction in python that my intuition
is off.

Is the whole point of , vs " that you end up with either a byte[] filled
with ASCII bytes, or a String() filled with ASCII bytes?  If so, it
sounds like a hack to work around the inefficiencies of java's String
and/or byte[] objects and I don't think it's worth the complication.

What happens if someone passes in a string containing some non-ascii
unicode characters?  Does it error out, or wind up on the wire in UTF16?


   Ryan

-- 
Ryan Kelly
http://www.rfk.id.au  |  This message is digitally signed. Please visit
ryan@rfk.id.au        |  http://www.rfk.id.au/ramblings/gpg/ for details

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-18 @ 08:24
On Mon, Apr 18, 2011 at 02:09:07PM +1000, Ryan Kelly wrote:
> On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
> > > <unicode strings commentary>
> > In other words, the type character (, vs ") is *just a type hint* so
> > that languages w/ distinct byte[] and string types can create the
> > appropriate data structures in the host language.
> 
> Right.  Trying to do anything else at the tnetstring level is asking for
> trouble.

I'm going to make it easier:

When tnetstrings uses the word "strings" it means, "A sequence of 8bit
bytes (octets) that has no meaning beyond this definition".  They are
not UTF-8, ascii, byte[], or anything other than this definition.  Your
application then specifies what it is sending either in code or in
metadata for the request.  That means, if you want UTF-8 for the
transport, then tell the receivers it's UTF-8.

Would that clear it up?

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Austin Wise
Date:
2011-04-19 @ 04:01
Having "strings" just mean "a sequence of 8bit bytes" makes sense
and works fine in my C# TNetString implementation.  However it would
be helpful if Mongrel2's handler format specified something like "all
header keys and values are ASCII and the request body is just a
sequence of bytes" so that I know how to interpret the header bytes.

On Mon, Apr 18, 2011 at 1:24 AM, Zed A. Shaw <zedshaw@zedshaw.com> wrote:
> On Mon, Apr 18, 2011 at 02:09:07PM +1000, Ryan Kelly wrote:
>> On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
>> > > <unicode strings commentary>
>> > In other words, the type character (, vs ") is *just a type hint* so
>> > that languages w/ distinct byte[] and string types can create the
>> > appropriate data structures in the host language.
>>
>> Right.  Trying to do anything else at the tnetstring level is asking for
>> trouble.
>
> I'm going to make it easier:
>
> When tnetstrings uses the word "strings" it means, "A sequence of 8bit
> bytes (octets) that has no meaning beyond this definition".  They are
> not UTF-8, ascii, byte[], or anything other than this definition.  Your
> application then specifies what it is sending either in code or in
> metadata for the request.  That means, if you want UTF-8 for the
> transport, then tell the receivers it's UTF-8.
>
> Would that clear it up?
>
> --
> Zed A. Shaw
> http://zedshaw.com/
>

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-19 @ 22:03
On Mon, Apr 18, 2011 at 09:01:54PM -0700, Austin Wise wrote:
> Having "strings" just mean "a sequence of 8bit bytes" makes sense
> and works fine in my C# TNetString implementation.  However it would
> be helpful if Mongrel2's handler format specified something like "all
> header keys and values are ASCII and the request body is just a
> sequence of bytes" so that I know how to interpret the header bytes.

I believe that's how it's specified in the older docs, so I can update
the tnetstring version to be the same.  The parser would actually
enforce this too.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
joshua simmons
Date:
2011-04-19 @ 04:17
Mongrel2 already specifies the headers as being ASCII, not sure if there's
any particulars for the request body.

On Tue, Apr 19, 2011 at 2:01 PM, Austin Wise <austinwise@gmail.com> wrote:

> Having "strings" just mean "a sequence of 8bit bytes" makes sense
> and works fine in my C# TNetString implementation.  However it would
> be helpful if Mongrel2's handler format specified something like "all
> header keys and values are ASCII and the request body is just a
> sequence of bytes" so that I know how to interpret the header bytes.
>
> On Mon, Apr 18, 2011 at 1:24 AM, Zed A. Shaw <zedshaw@zedshaw.com> wrote:
> > On Mon, Apr 18, 2011 at 02:09:07PM +1000, Ryan Kelly wrote:
> >> On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
> >> > > <unicode strings commentary>
> >> > In other words, the type character (, vs ") is *just a type hint* so
> >> > that languages w/ distinct byte[] and string types can create the
> >> > appropriate data structures in the host language.
> >>
> >> Right.  Trying to do anything else at the tnetstring level is asking for
> >> trouble.
> >
> > I'm going to make it easier:
> >
> > When tnetstrings uses the word "strings" it means, "A sequence of 8bit
> > bytes (octets) that has no meaning beyond this definition".  They are
> > not UTF-8, ascii, byte[], or anything other than this definition.  Your
> > application then specifies what it is sending either in code or in
> > metadata for the request.  That means, if you want UTF-8 for the
> > transport, then tell the receivers it's UTF-8.
> >
> > Would that clear it up?
> >
> > --
> > Zed A. Shaw
> > http://zedshaw.com/
> >
>

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Loic d'Anterroches
Date:
2011-04-19 @ 06:29

On 2011-04-19 06:17, joshua simmons wrote:
> Mongrel2 already specifies the headers as being ASCII, not sure if
> there's any particulars for the request body.

The request body encoding is specified in the headers. It can basically
be anything: http://www.ietf.org/rfc/rfc2388.txt

loïc

> On Tue, Apr 19, 2011 at 2:01 PM, Austin Wise <austinwise@gmail.com
> <mailto:austinwise@gmail.com>> wrote:
> 
>     Having "strings" just mean "a sequence of 8bit bytes" makes sense
>     and works fine in my C# TNetString implementation.  However it would
>     be helpful if Mongrel2's handler format specified something like "all
>     header keys and values are ASCII and the request body is just a
>     sequence of bytes" so that I know how to interpret the header bytes.
> 
>     On Mon, Apr 18, 2011 at 1:24 AM, Zed A. Shaw <zedshaw@zedshaw.com
>     <mailto:zedshaw@zedshaw.com>> wrote:
>     > On Mon, Apr 18, 2011 at 02:09:07PM +1000, Ryan Kelly wrote:
>     >> On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
>     >> > > <unicode strings commentary>
>     >> > In other words, the type character (, vs ") is *just a type hint* so
>     >> > that languages w/ distinct byte[] and string types can create the
>     >> > appropriate data structures in the host language.
>     >>
>     >> Right.  Trying to do anything else at the tnetstring level is asking for
>     >> trouble.
>     >
>     > I'm going to make it easier:
>     >
>     > When tnetstrings uses the word "strings" it means, "A sequence of 8bit
>     > bytes (octets) that has no meaning beyond this definition".  They are
>     > not UTF-8, ascii, byte[], or anything other than this definition.  Your
>     > application then specifies what it is sending either in code or in
>     > metadata for the request.  That means, if you want UTF-8 for the
>     > transport, then tell the receivers it's UTF-8.
>     >
>     > Would that clear it up?
>     >
>     > --
>     > Zed A. Shaw
>     > http://zedshaw.com/
>     >
> 
> 

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-18 @ 17:19
On Apr 18, 2011, at 1:24 AM, Zed A. Shaw wrote:

> When tnetstrings uses the word "strings" it means, "A sequence of 8bit
> bytes (octets) that has no meaning beyond this definition".  They are
> not UTF-8, ascii, byte[], or anything other than this definition.  Your
> application then specifies what it is sending either in code or in
> metadata for the request.  That means, if you want UTF-8 for the
> transport, then tell the receivers it's UTF-8.
> 
> Would that clear it up?

Yes. But why have the word "strings" in there at all? I guess strings are
in the name "tnetstrings", but otherwise the spec should say 8 bit bytes (octets).

Thank you,
Armando

> 
> -- 
> Zed A. Shaw
> http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-19 @ 22:03
On Mon, Apr 18, 2011 at 10:19:41AM -0700, Armando Singer wrote:
> Yes. But why have the word "strings" in there at all? I guess strings are
> in the name "tnetstrings", but otherwise the spec should say 8 bit bytes
> (octets).
> 

"8 bit bytes (octets)" is kind of ridiculous don't you think?  How
about, since "strings" has been bastardized to mean so many things in so
many languages we use "Blob".  It's a common term from databases that
means what we're saying, and isn't overloaded.

The downside to blobs is implementers will feel it necessary to actually
create a Blob class to hold them even when they aren't needed, so I'll
probably need a table of how to map those in different languages.  Like
this:

Python | str
Java   | byte[]
C      | char[]

And so on.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-19 @ 22:18
On Apr 19, 2011, at 3:03 PM, Zed A. Shaw wrote:

> On Mon, Apr 18, 2011 at 10:19:41AM -0700, Armando Singer wrote:
>> Yes. But why have the word "strings" in there at all? I guess strings are
>> in the name "tnetstrings", but otherwise the spec should say 8 bit
>> bytes (octets).
>> 
> 
> "8 bit bytes (octets)" is kind of ridiculous don't you think?  How
> about, since "strings" has been bastardized to mean so many things in so
> many languages we use "Blob".  It's a common term from databases that
> means what we're saying, and isn't overloaded.

Yes, that makes sense.

> 
> The downside to blobs is implementers will feel it necessary to actually
> create a Blob class to hold them even when they aren't needed, so I'll
> probably need a table of how to map those in different languages.  Like
> this:
> 
> Python | str
> Java   | byte[]
> C      | char[]
> 
> And so on.

Yes, that would be helpful.

Thank you,
Armando

> 
> -- 
> Zed A. Shaw
> http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Ryan Kelly
Date:
2011-04-18 @ 09:23
On Mon, 2011-04-18 at 01:24 -0700, Zed A. Shaw wrote:
> On Mon, Apr 18, 2011 at 02:09:07PM +1000, Ryan Kelly wrote:
> > On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
> > > > <unicode strings commentary>
> > > In other words, the type character (, vs ") is *just a type hint* so
> > > that languages w/ distinct byte[] and string types can create the
> > > appropriate data structures in the host language.
> > 
> > Right.  Trying to do anything else at the tnetstring level is asking for
> > trouble.
> 
> I'm going to make it easier:
> 
> When tnetstrings uses the word "strings" it means, "A sequence of 8bit
> bytes (octets) that has no meaning beyond this definition".  They are
> not UTF-8, ascii, byte[], or anything other than this definition.  Your
> application then specifies what it is sending either in code or in
> metadata for the request.  That means, if you want UTF-8 for the
> transport, then tell the receivers it's UTF-8.
> 
> Would that clear it up?

So let me summarize the unicode-friendliness I want to put in my python
module.

Tnetstrings deal only in sequences of 8bit bytes.  When you read in a
string without telling it anything else, that's what you'll get:

    >>> tns.loads("8:5:hello,]")
    ["hello"]

If you somehow specify the encoding out-of-band, then you are free to
interpret strings according to that encoding.  The tnetstring protocol
doesn't care, but the API can make it easier for you:

    >>> tns.loads("8:5:hello,]", "utf8")
    [u"hello"]
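A minimal sketch of the API described above, written for Python 3, where bytes and str are already distinct types. The names `loads` and `_parse` are assumptions for illustration and this is not the actual tnetstring module; it only covers the `,` `#` `!` `~` `]` and `}` tags.

```python
def loads(data, encoding=None):
    """Parse one tnetstring from `data` (bytes).

    If `encoding` is given, ',' payloads are decoded to str; otherwise
    they are returned as raw bytes.
    """
    value, rest = _parse(data, encoding)
    if rest:
        raise ValueError("trailing data after tnetstring")
    return value


def _parse(data, encoding):
    # Framing is always ASCII: b"<length>:<payload><tag>".
    head, sep, tail = data.partition(b":")
    if not sep or not head.isdigit():
        raise ValueError("bad length prefix")
    length = int(head)
    if len(tail) < length + 1:
        raise ValueError("truncated tnetstring")
    payload = tail[:length]
    tag = tail[length:length + 1]
    rest = tail[length + 1:]

    if tag == b",":
        # Raw bytes unless the caller supplied an out-of-band encoding.
        return (payload.decode(encoding) if encoding else payload), rest
    if tag == b"#":
        return int(payload), rest
    if tag == b"!":
        return payload == b"true", rest
    if tag == b"~":
        return None, rest
    if tag == b"]":
        items = []
        while payload:
            item, payload = _parse(payload, encoding)
            items.append(item)
        return items, rest
    if tag == b"}":
        result = {}
        while payload:
            key, payload = _parse(payload, encoding)
            val, payload = _parse(payload, encoding)
            result[key] = val
        return result, rest
    raise ValueError("unknown type tag %r" % tag)
```

With this sketch, `loads(b"8:5:hello,]")` gives a list of bytes and `loads(b"8:5:hello,]", "utf8")` gives a list of str. The encoding is applied to every string payload, which is exactly why mixed text and binary documents fail.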

But if you want to mix interpreted and uninterpreted strings (say, a
dict with unicode-string keys and bytestring values) then you're on your
own.

    >>> # I want to get {u"hello": "\xFF"} but can't
    >>> tns.loads("12:5:hello,1:\xFF,}","utf8")
    Traceback
        ...blah blah...
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xff

So you'll have to pick apart the result and decode the bits you want:

    >>> d = tns.loads("12:5:hello,1:\xFF,}")
    >>> for k in d.keys():
    ...     d[k.decode("utf8")] = d.pop(k)
    >>> d
    {u"hello": "\xFF"}
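The same separate pass, sketched for Python 3. The loop above relies on Python 2's `keys()` returning a list; in Python 3 it is a live view, and changing the dict while iterating over it is not safe. The helper name `decode_keys` is an assumption for illustration; it builds a new dict instead of modifying the old one.

```python
def decode_keys(d, encoding="utf8"):
    """Return a new dict with bytes keys decoded to str.

    Values are left untouched, so binary payloads survive as bytes.
    """
    return {k.decode(encoding): v for k, v in d.items()}


parsed = {b"hello": b"\xff"}      # what a bytes-only parser would return
decoded = decode_keys(parsed)     # {"hello": b"\xff"}
```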


It would be convenient if there were a separate "bytes" type so that you
could do, say:

    >>> d = tns.loads("12:5:hello,1:\xFF$}", "utf8")
    {u"hello": "\xFF"}


But it wouldn't be such a big convenience that I'm going to say any more
about it on this list :-)


Zed, would you be happy to see such an API inside a tnetstrings
module?  

Or, would it be better/easier/cleaner to have people do a separate pass
over their data to coerce things to/from bytes as they see fit?


  Cheers,

     Ryan

-- 
Ryan Kelly
http://www.rfk.id.au  |  This message is digitally signed. Please visit
ryan@rfk.id.au        |  http://www.rfk.id.au/ramblings/gpg/ for details

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-19 @ 21:59
On Mon, Apr 18, 2011 at 07:23:21PM +1000, Ryan Kelly wrote:
> Tnetstrings deal only in sequences of 8bit bytes.  When you read in a
> string without telling it anything else, that's what you'll get:
> 
>     >>> tns.loads("8:5:hello,]")
>     ["hello"]
> 
> If you somehow specify the encoding out-of-band, then you are free to
> interpret strings according to that encoding.  The tnetstring protocol
> doesn't care, but the API can make it easier for you:
> 
>     >>> tns.loads("8:5:hello,]", "utf8")
>     [u"hello"]

Hmmm, yeah that could be helpful, and yes I think this is better.  Only
thing is, this *only* uses the encoding on the *contents*.  Every other
part is ASCII.  So, if I have your line:

8:5:hello,]

The only part that is converted is the hello.  Everything else stays
ASCII always.

The reason is this reduces the attack surface for situations where
people find bizarre unicode sequences that can still equal say : but are
not really : so you miss them in parsing and scanning.
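A minimal sketch of that rule in Python 3: the framing is scanned as raw bytes, so only ASCII digits and the single byte 0x3A are accepted, and lookalike characters never count as a delimiter. The function name `split_frame` is an assumption for illustration.

```python
ASCII_DIGITS = b"0123456789"


def split_frame(data):
    """Split b"<len>:<payload><tag>..." into (payload, tag, rest).

    Works byte by byte and never decodes anything.
    """
    i = 0
    while i < len(data) and data[i] in ASCII_DIGITS:
        i += 1
    # The delimiter must be the single byte 0x3A; nothing else counts as ':'.
    if i == 0 or i >= len(data) or data[i] != 0x3A:
        raise ValueError("length prefix must be ASCII digits followed by ':'")
    length = int(data[:i])
    start = i + 1
    end = start + length
    if end >= len(data):
        raise ValueError("truncated tnetstring")
    return data[start:end], data[end:end + 1], data[end + 1:]
```

A fullwidth colon (U+FF1A) encoded as UTF-8 is three bytes, none of them 0x3A, so `split_frame("5\uff1ahello,".encode("utf-8"))` raises ValueError. The same three bytes inside a payload are passed through untouched.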

That make sense?

Now, as for my usage I just do this:

http://chardet.feedparser.org/

In my protocols, I'll have an encoding metadata field, and then assume
the sender could be lying and use the above to confirm it.  Take a look
at chardet as a sort of "guess" option for the API.
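A rough sketch of the "sender could be lying" check, using only the standard library. This does not use chardet and does no statistical guessing; it only tries a strict decode with the declared encoding and then falls back to a fixed list of candidates. The function name `decode_checked` and the fallback list are assumptions for illustration.

```python
def decode_checked(payload, declared, fallbacks=("utf-8", "latin-1")):
    """Decode `payload` with the declared encoding, else try fallbacks.

    Returns (text, encoding_used). A strict decode that succeeds does not
    prove the declaration was right, it only shows it was plausible.
    """
    for name in (declared,) + tuple(fallbacks):
        try:
            return payload.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong declaration or unknown encoding name: try the next one.
            continue
    raise ValueError("payload does not decode under any candidate encoding")
```

For example, bytes declared as "ascii" that are really UTF-8 come back decoded as "utf-8". A detector such as chardet could replace the fixed fallback list.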

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Ryan Kelly
Date:
2011-04-19 @ 22:06
On Tue, 2011-04-19 at 14:59 -0700, Zed A. Shaw wrote:
> On Mon, Apr 18, 2011 at 07:23:21PM +1000, Ryan Kelly wrote:
> > Tnetstrings deal only in sequences of 8bit bytes.  When you read in a
> > string without telling it anything else, that's what you'll get:
> > 
> >     >>> tns.loads("8:5:hello,]")
> >     ["hello"]
> > 
> > If you somehow specify the encoding out-of-band, then you are free to
> > interpret strings according to that encoding.  The tnetstring protocol
> > doesn't care, but the API can make it easier for you:
> > 
> >     >>> tns.loads("8:5:hello,]", "utf8")
> >     [u"hello"]
> 
> Hmmm, yeah that could be helpful, and yes I think this is better.  Only
> thing is, this *only* uses the encoding on the *contents*.  Every other
> part is ASCII.  So, if I have your line:
> 
> 8:5:hello,]
> 
> The only part that is converted is the hello.  Everything else stays
> ASCII always.
> 
> The reason is this reduces the attack surface for situations where
> people find bizarre unicode sequences that can still equal say : but are
> not really : so you miss them in parsing and scanning.
> 
> That make sense?

Absolutely.  Underneath the parser core is still working on a char* one
byte at a time, this would only happen way up in the code that says
"turn this chunk of bytes into a Python string".

It had never even occurred to me to do otherwise.

> Now, as for my usage I just do this:
> 
> http://chardet.feedparser.org/
> 
> In my protocols, I'll have an encoding metadata field, and then assume
> the sender could be lying and use the above to confirm it.  Take a look
> at chardet as a sort of "guess" option for the API.

Is this what you used for unicode-handling in Lamson?  I think I
remember reading about its awesome powers, will definitely check it out.


   Ryan

-- 
Ryan Kelly
http://www.rfk.id.au  |  This message is digitally signed. Please visit
ryan@rfk.id.au        |  http://www.rfk.id.au/ramblings/gpg/ for details

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-19 @ 22:29
On Wed, Apr 20, 2011 at 08:06:23AM +1000, Ryan Kelly wrote:
> > http://chardet.feedparser.org/
> > 
> > In my protocols, I'll have an encoding metadata field, and then assume
> > the sender could be lying and use the above to confirm it.  Take a look
> > at chardet as a sort of "guess" option for the API.
> 
> Is this what you used for unicode-handling in Lamson?  I think I
> remember reading about its awesome powers, will definitely check it out.

Yes, it did wonders on cleaning up email, which has tons of badly
specified encodings.  In fact, MIME is a prime example of why mixing
your framing and your encodings in protocols is a bad idea.

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Loic d'Anterroches
Date:
2011-04-18 @ 07:30

On 2011-04-18 06:09, Ryan Kelly wrote:
> On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
>>> <unicode strings commentary>
>> I'm not proposing adding a unicode type to the tnestring spec at all.
>>
>> Rather, the , and " type would be identical, carrying a payload of
>> ASCII encoded bytes.

They are not ASCII encoded; they are just binary data.

>> In other words, the type character (, vs ") is *just a type hint* so
>> that languages w/ distinct byte[] and string types can create the
>> appropriate data structures in the host language.

If this is just a type hint, this means you can create two python
implementations which are incompatible with each other. For me this does
not feel right.

The problem I see in your discussion, is that you always consider a kind
of implicit encoding for the handling of the strings. What is this
encoding? Python can use ascii or utf-8 or whatever you configured. PHP
can use whatever you configured.

If you want a string as such, you need to give the encoding with it,
because a string is just a byte array interpreted in a given way.

For me, trying to add the string support/hinting, this is really opening
a can of worms. Can you remember the mess of MySQL where people were
storing utf-8 in another "implicit encoded" storage and then were
surprised they were not able to dump the data in something working? It
feels like that.

loïc



> 
> Right.  Trying to do anything else at the tnetstring level is asking for
> trouble.
> 
> Perhaps I'm just confusing the issue by saying "unicode" everywhere.
> Sorry.  It's a type distinction between "text" and "bytes" and it's
> about how you want to work with the object after it has been
> deserialized.  Agree?
> 
> But, and correct me if I'm wrong, the whole trouble here is that the
> "string" object is invariably designed to represent unicode characters.
> So there is encoding going on somewhere, even if it's the implicit
> encoding that your host language does to store the things in memory.
> 
> Can the java String object represent an arbitrary byte sequence?  One of
> the issues faced by python is that you can't really represent e.g. null
> bytes in a unicode string object.
> 
>> tnetstrings or mongrel2 does not need to know that in this language,
>> for example, Strings are arrays of double byte UTF-16 unsigned chars.
> 
> I think we can all agree that we don't want tnetstrings to touch any
> encoding issues :-)
> 
>> So in Java, I'd have:
>>
>>     case '"': new String(msg, i, len, ASCII);
>>     case ',': Arrays.copyOfRange(msg, i, i + len);
>>
>> But in python we'd do the same thing for both:
>>
>>     elif payload_type == ',' or payload_type == '"':
>>         value = payload
> 
> I think this behaviour would be very surprising to python programmers. 
> 
> If you've said "this stuff is text" in your type tag, they would expect
> to get a unicode string object.
> 
> Probably I just don't understand enough about how Java strings work.
> Sounds like the distinction between String/byte[] is sufficiently
> different to the bytes/unicode distinction in python that my intuition
> is off.
> 
> Is the whole point of , vs " that you end up with either a byte[] filled
> with ASCII bytes, or a String() filled with ASCII bytes?  If so, it
> sounds like a hack to workaround the inefficiencies of java's String
> and/or byte[] objects and I don't think it's worth the complication.
> 
> What happens if someone passes in a string containing some non-ascii
> unicode characters?  Does it error out, or wind up on the wire in UTF16?
> 
> 
>    Ryan
> 

-- 
Dr Loïc d'Anterroches
Founder Céondo Ltd

w: www.ceondo.com       |  e: loic@ceondo.com
t: +44 (0)207 183 0016  |  f: +44 (0)207 183 0124

Céondo Ltd
Dalton House
60 Windsor Avenue
London
SW19 2RR / United Kingdom

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-18 @ 17:16
On Apr 18, 2011, at 12:30 AM, Loic d'Anterroches wrote:

> On 2011-04-18 06:09, Ryan Kelly wrote:
>> On Sun, 2011-04-17 at 20:25 -0700, Armando Singer wrote:
>>>> <unicode strings commentary>
>>> I'm not proposing adding a unicode type to the tnestring spec at all.
>>> 
>>> Rather, the , and " type would be identical, carrying a payload of
>>> ASCII encoded bytes.
> 
> They are not ascii encoded they are just binary data.

Good point, tnetstrings doesn't specify an encoding except for the
size string. I had picked ASCII up from the Handler netstrings impl
(Note 3: Sorry, Unicodians, It’s All ASCII...). My bad!

However, the reference Python implementation dumps a python string to
ASCII encoded bytes, correct? I'm probably wrong there. But that would mean
to remain compatible I'd have to dump Java Strings to ASCII encoded
bytes. Otherwise, I'd have to pick some encoding when dumping byte[],
so it might as well be UTF-8.

Better, we need to always specify the encoding:

    byte[] dump(String javaString, Charset charset)...

    String parseString(byte [] tnestring, Charset charset)...

> 
>>> In other words, the type character (, vs ") is *just a type hint* so
>>> that languages w/ distinct byte[] and string types can create the
>>> appropriate data structures in the host language.
> 
> If this is just a type hint, this means you can create two python
> implementations which are incompatible with each other. For me this does
> not feel right.
> 
> The problem I see in your discussion, is that you always consider a kind
> of implicit encoding for the handling of the strings. What is this
> encoding? Python can use ascii or utf-8 or whatever you configured. PHP
> can use whatever you configured.

Yes, if there were always an encoding specified on the wire, then we could
always convert to the platform's String type w/o specifying an
encoding. But since it's intentionally not specified, we must always
specify an encoding anyway to get a String in one's platform.

> 
> If you want a string as such, you need to give the encoding with,
> because a string is just a byte array interpreted in a given way.
> 
> For me, trying to add the string support/hinting, this is really opening
> a can of worms. Can you remember the mess of MySQL where people were
> storing utf-8 in another "implicit encoded" storage and then were
> surprised they were not able to dump the data in something working? It
> feels like that.

I agree.

Thank you for the feedback.

Armando

> 
> loïc
> 
> 
> 
>> 
>> Right.  Trying to do anything else at the tnetstring level is asking for
>> trouble.
>> 
>> Perhaps I'm just confusing the issue by saying "unicode" everywhere.
>> Sorry.  It's a type distinction between "text" and "bytes" and it's
>> about how you want to work with the object after it has been
>> deserialized.  Agree?
>> 
>> But, and correct me if I'm wrong, the whole trouble here is that the
>> "string" object is invariably designed to represent unicode characters.
>> So there is encoding going on somewhere, even if it's the implicit
>> encoding that your host language does to store the things in memory.
>> 
>> Can the java String object represent an arbitrary byte sequence?  One of
>> the issues faced by python is that you can't really represent e.g. null
>> bytes in a unicode string object.
>> 
>>> tnetstrings or mongrel2 does not need to know that in this language,
>>> for example, Strings are arrays of double byte UTF-16 unsigned chars.
>> 
>> I think we can all agree that we don't want tnetstrings to touch any
>> encoding issues :-)
>> 
>>> So in Java, I'd have:
>>> 
>>>    case '"': new String(msg, i, len, ASCII);
>>>    case ',': Arrays.copyOfRange(msg, i, i + len);
>>> 
>>> But in python we'd do the same thing for both:
>>> 
>>>    elif payload_type == ',' or payload_type == '"':
>>>        value = payload
>> 
>> I think this behaviour would be very surprising to python programmers. 
>> 
>> If you've said "this stuff is text" in your type tag, they would expect
>> to get a unicode string object.
>> 
>> Probably I just don't understand enough about how Java strings work.
>> Sounds like the distinction between String/byte[] is sufficiently
>> different to the bytes/unicode distinction in python that my intuition
>> is off.
>> 
>> Is the whole point of , vs " that you end up with either a byte[] filled
>> with ASCII bytes, or a String() filled with ASCII bytes?  If so, it
>> sounds like a hack to workaround the inefficiencies of java's String
>> and/or byte[] objects and I don't think it's worth the complication.
>> 
>> What happens if someone passes in a string containing some non-ascii
>> unicode characters?  Does it error out, or wind up on the wire in UTF16?
>> 
>> 
>>   Ryan
>> 
> 
> -- 
> Dr Loïc d'Anterroches
> Founder Céondo Ltd
> 
> w: www.ceondo.com       |  e: loic@ceondo.com
> t: +44 (0)207 183 0016  |  f: +44 (0)207 183 0124
> 
> Céondo Ltd
> Dalton House
> 60 Windsor Avenue
> London
> SW19 2RR / United Kingdom

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-19 @ 21:54
On Mon, Apr 18, 2011 at 10:16:34AM -0700, Armando Singer wrote:
> However, the reference Python implementation dumps a python string to
> ASCII encoded bytes, correct? I'm probably wrong there. But that would mean

Do you mean this:

return '%d:' % len(data) + data + ','

That actually doesn't output the data as ASCII, it outputs it as bytes.
Python's strings can hold anything so they're more like byte arrays.  If
it were this however:

return '%d:%s,' % (len(data), data)

Then it would get screwed up the way you think.  If you think that's
wrong, can you work up a counter case that shows it with the python
implementation?

> Better, we need to always specify the encoding:
> 
>     byte[] dump(String javaString, Charset charset)...
> 
>     String parseString(byte [] tnestring, Charset charset)...

Uh, wouldn't this just be back to square-one and have you specifying
charsets when the contents should be unspecified (man java makes this
confusing).  I'll take a look at your code and maybe rewrite it to what
I'm thinking of.  Code is probably better than English to say this.


-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-19 @ 22:14
On Apr 19, 2011, at 2:54 PM, Zed A. Shaw wrote:

> On Mon, Apr 18, 2011 at 10:16:34AM -0700, Armando Singer wrote:
>> However, the reference Python implementation dumps a python string to
>> ASCII encoded bytes, correct? I'm probably wrong there. But that would mean
> 
> Do you mean this:
> 
> return '%d:' % len(data) + data + ','
> 
> That actually doesn't output the data as ASCII, it outputs it as bytes.
> Python's strings can hold anything so they're more like byte arrays.  If
> it were this however:
> 
> return '%d:%s,' % (len(data), data)
> 
> Then it would get screwed up the way you think.  If you think that's
> wrong, can you work up a counter case that shows it with the python
> implementation?

The reference impl is correct. I had thought about it more and concluded
that it's just dumping bytes.

> 
>> Better, we need to always specify the encoding:
>> 
>>    byte[] dump(String javaString, Charset charset)...
>> 
>>    String parseString(byte [] tnestring, Charset charset)...
> 
> Uh, wouldn't this just be back to square-one and have you specifying
> charsets when the contents should be unspecified (man java makes this
> confusing).  I'll take a look at your code and maybe rewrite it to what
> I'm thinking of.  Code is probably better than English to say this.

Please take a look at my 2nd implementation. I believe it's correct, where
no assumption is made about encoding of the contents.

There is one main parse method:

  /** @return byte[] or Long or Boolean or Map<byte[], Object> or List<Object> or null */
  public static <T> T parse(final byte[] msg)

There is also 1 convenience method to parse the contents as a Java String.
It's not strictly needed, but 1) it's a bit easier to use when the host
language is working with libraries that need a String, and 2) it's
internally optimized so we don't have extra copies (first getting a byte[],
which is a copy of a range, then converting that to a Java String, which
causes another copy and decoding). Because we don't make any assumptions
about the encoding of the contents, the user must specify a charset if they
want a Java String:

  /** convenience method to parse to Java String and optimized to prevent double copy */
  public static String parseString(final byte[] msg, final Charset charset)

Same with the dump() methods. We're dumping everything to byte[]. We
convert each Java type such as String, char, long, int, short, etc. to
byte[], but any character data must specify an encoding, else we don't
know how to properly convert it to a byte[].

Hope I'm making sense!

Thank you,
Armando

> 
> 
> -- 
> Zed A. Shaw
> http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Ryan Kelly
Date:
2011-04-19 @ 22:09
On Tue, 2011-04-19 at 14:54 -0700, Zed A. Shaw wrote:
> On Mon, Apr 18, 2011 at 10:16:34AM -0700, Armando Singer wrote:
> > However, the reference Python implementation dumps a python string to
> > ASCII encoded bytes, correct? I'm probably wrong there. But that would mean
> 
> Do you mean this:
> 
> return '%d:' % len(data) + data + ','
> 
> That actually doesn't output the data as ASCII, it outputs it as bytes.
> Python's strings can hold anything so they're more like byte arrays.  If
> it were this however:
> 
> return '%d:%s,' % (len(data), data)
> 
> Then it would get screwed up the way you think.

Really?  I always understood the two forms to be equivalent.  Can you
give an example of some data that gets mangled by the latter but not the
former?

  Cheers,

     Ryan

-- 
Ryan Kelly
http://www.rfk.id.au  |  This message is digitally signed. Please visit
ryan@rfk.id.au        |  http://www.rfk.id.au/ramblings/gpg/ for details

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Zed A. Shaw
Date:
2011-04-19 @ 22:27
On Wed, Apr 20, 2011 at 08:09:02AM +1000, Ryan Kelly wrote:
> > return '%d:%s,' % (len(data), data)
> > 
> > Then it would get screwed up the way you think.
> 
> Really?  I always understood the two forms to be equivalent.  Can you
> give an example of some data that gets mangled by the latter but not the
> former?

Yep, you're right:

http://dpaste.de/iTAE/

It's because print and writing to files tries to do conversions and
other stupid stuff, not the use of %s.
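A quick way to check that conclusion in Python 3, where the payload is a bytes object and %-formatting on bytes is available (3.5 and later). The function names are assumptions for illustration; the two bodies mirror the two forms discussed above.

```python
def dump_concat(data):
    # b"%d:" % len(data) is evaluated first, then the pieces are joined.
    return b"%d:" % len(data) + data + b","


def dump_format(data):
    # On bytes, %s inserts the payload bytes unchanged.
    return b"%d:%s," % (len(data), data)


# Every possible byte value, so nothing can hide behind "it was just ASCII".
payload = bytes(range(256))
same = dump_concat(payload) == dump_format(payload)
```

Both forms produce identical output for all 256 byte values. The place to be careful in Python 3 is mixing types: formatting a bytes payload into a str template inserts its repr (b'...') rather than the raw bytes.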

-- 
Zed A. Shaw
http://zedshaw.com/

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Ryan Kelly
Date:
2011-04-18 @ 00:37
On Mon, 2011-04-18 at 10:27 +1000, Ryan Kelly wrote:
> On Sun, 2011-04-17 at 16:31 -0700, Armando Singer wrote:
> >
> > - It might be a good idea to have a separate string type:
> > 
> >   "  string
> >   ,  byte array
> > 
> >   I have implemented this in the attached code. It adds 1 line to the
> >   parsing. Having just a byte[] will work fine, but we're getting
> >   pretty close to netstrings as we'll have to convert to String any time
> >   we want one, which would be common.
> 
> I feel sorry for this poor dead horse, but I suspect it's going to keep
> getting beaten.
> 
> 
> On one hand, for a general-purpose library in a language that has
> distinct "bytes" and "string" types, it would be very nice to be able to
> round-trip mixed data structures, e.g.:
> 
>    ["hello",u"world"] == tns.loads(tns.dumps(["hello",u"world"]))
>
>  [...snip...]
> 
> But if the proposal is simply to indicate "these bytes are a unicode
> string in whatever encoding you've decided to use for this application,
> you deal with it" then I think, based on my experiences with the python
> module, it would be worth adding as a separate type tag.

By the way, I'm aware that this is probably just my general-purpose
python library bias showing, so I'm quite prepared to be shot down.
Just want to get it all out on the table.

The python lib *will* have to deal conveniently with unicode strings
eventually, and the API will probably look like this:


   >>> tns.dumps(u"hello")
   ValueError: you must specify an encoding for unicode strings
   >>>
   >>> # if tnetstrings grows a string type
   >>> tns.dumps(u"hello","utf8")
   5:hello"
   >>>
   >>> # if tnetstrings doesn't grow a string type
   >>> tns.dumps(u"hello","utf8")
   5:hello,
   >>>
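A minimal sketch of the "doesn't grow a string type" variant of this API in Python 3. The name `dumps` and its signature follow the examples above but are otherwise assumptions, and only bytes and str are handled here.

```python
def dumps(value, encoding=None):
    """Serialise bytes or str to a tnetstring.

    str values need an explicit encoding; bytes are written as they are.
    """
    if isinstance(value, str):
        if encoding is None:
            raise ValueError("you must specify an encoding for unicode strings")
        value = value.encode(encoding)
    if not isinstance(value, bytes):
        raise TypeError("this sketch only handles bytes and str")
    # The length prefix counts bytes on the wire, not characters.
    return b"%d:" % len(value) + value + b","
```

Note that `dumps("h\u00e9llo", "utf8")` has a length prefix of 6, not 5, because the accented character takes two bytes in UTF-8.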


So really, the only horse I have in this race is  "can we unambiguously
mix strings and bytes in a single document".


  Cheers,

     Ryan


-- 
Ryan Kelly
http://www.rfk.id.au  |  This message is digitally signed. Please visit
ryan@rfk.id.au        |  http://www.rfk.id.au/ramblings/gpg/ for details

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
joshua simmons
Date:
2011-04-18 @ 00:42
IIRC .net and the jvm intern strings too, which makes messing with them not
particularly performant unless you use a string builder. Working with a
byte[] is vastly superior until you need string semantics, and that's up to
your application.

Also with a byte[] it should be relatively easy to duck into unsafe code to
get some serious performance if there are nasty spots in the code.
String.Split is something I'd avoid anyway since once again it means you
search the string then make two new strings (interned) which still need to
be processed. If possible parsing in-place and extracting the valuable data
should prove much faster.

Having separate string and blob types just complicates matters. tnetstrings
are 8 bit clean, so store whatever rubbish you want. But it's not the
protocol's problem as to how you encode your data.

To this end as well, iirc it's easy enough to get a byte[] and length from a
native string, and to convert between utf-8 / ascii / whatever.

http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.getbytes(v=VS.100).aspx
for example. imo your api should handle byte[]'s only, and then let the
application decide encoding, this then reflects the actual protocol
semantics and stops anybody from getting confused.

On Mon, Apr 18, 2011 at 10:27 AM, Ryan Kelly <ryan@rfk.id.au> wrote:

> On Sun, 2011-04-17 at 16:31 -0700, Armando Singer wrote:
> >
> > - It might be a good idea to have a separate string type:
> >
> >   "  string
> >   ,  byte array
> >
> >   I have implemented this in the attached code. It adds 1 line to the
> >   parsing. Having just a byte[] will work fine, but we're getting
> >   pretty close to netstrings as we'll have to convert to String any time
> >   we want one, which would be common.
>
> I feel sorry for this poor dead horse, but I suspect it's going to keep
> getting beaten.
>
>
> On one hand, for a general-purpose library in a language that has
> distinct "bytes" and "string" types, it would be very nice to be able to
> round-trip mixed data structures, e.g.:
>
>   ["hello",u"world"] == tns.loads(tns.dumps(["hello",u"world"]))
>
> On the other hand, you most definitely do NOT want mongrel2 trying to
> deal with encoding/decoding unicode strings.  Bad bad bad.
>
>
> And that's not even going into the details of encoding.  To quote the
> tnetstrings spec:
>
>  "String encoding is an application level, political, and display
> specification.  Transport protocols should not have to decode random
> character encodings accurately to function properly."
>
> A big +1 from me on that!
>
>
> When the tnetstring adventure was just starting out, Zed's original
> proposal was to have separate "string" and "bytes" type tags, but have a
> policy that "tnetstring doesn't do encoding".  So this:
>
>   5:hello,
>
> Means "here is a byte array".  While this:
>
>   5:hello"
>
> Means "here is a string in whatever encoding you're using up there".
>
>
> I fought back against having bytes in a potentially ambiguous encoding.
> I now wish I'd kept my mouth shut.
>
>
> As I see it there are two options:
>
>  1) Just tell unicode strings to piss off.  This is not a
> general-purpose serialisation library, it's a special-purpose format for
> communicating between bytestream-based services.
>
>  2) Allow a separate string type, but refuse to accept or generate it
> within mongrel2.
>
>
> In a previous email on this topic, people jumped in to say that when I
> said "unicode strings" like above what I really meant was "utf8
> strings".  Not so.
>
> If the proposed solution were indeed to be "encode all unicode strings
> in utf8 and decode them in the parser" I would be against it.
>
> But if the proposal is simply to indicate "these bytes are a unicode
> string in whatever encoding you've decided to use for this application,
> you deal with it" then I think, based on my experiences with the python
> module, it would be worth adding as a separate type tag.
>
>
> The ability to transparently round-trip both bytearrays and strings
> would actually be an additional bonus of tnetstrings over JSON, which
> demands that all strings be unicode.
>
> (Worse actually: the whole JSON document is a big unicode string in one
> of several different encodings, and your parser is supposed to examine
> the pattern of zeros in the first few chars of the document to determine
> which encoding it is in.  Of course no-one does this, so in the wild
> JSON is almost always in utf8.)
>
> > - The integer type in the reference implementation is limited to
> >   sys.maxint. It might be a good idea to be specific in the spec about
> >   what the max integer is allowed to be
>
> Indeed, sys.maxint is different on 32-bit vs 64-bit python so no
> ambiguity is resolved here.
>
> > - I'm also not handling floating point numbers. Is this correct? Not
> >   having floats seems the only way to fulfill rule #1
>
> I'd like to see a separate float type in the interests of completeness.
> I propose:
>
>    7:3.14159^
>
> Because the caret reminds me of exponentiation.  Surely every language
> has some facility to convert float <=> string, accuracy be damned?
>
>
>
>
>  Cheers,
>
>
>      Ryan
>
> --
> Ryan Kelly
> http://www.rfk.id.au  |  This message is digitally signed. Please visit
> ryan@rfk.id.au        |  http://www.rfk.id.au/ramblings/gpg/ for details
>
>

Re: [mongrel2] Another tnetstring impl and feedback on the spec

From:
Armando Singer
Date:
2011-04-18 @ 03:45
> IIRC .net and the jvm intern strings too, which makes messing with them
> not particularly performant unless you use a string builder. Working with
> a byte[] is vastly superior until you need string semantics, and that's up
> to your application.

Correct, you want to just index into the byte[] and copy ranges to
create types in the host language.

> 
> Also with a byte[] it should be relatively easy to duck into unsafe code
> to get some serious performance if there are nasty spots in the code.
> String.Split is something I'd avoid anyway since once again it means you
> search the string then make two new strings (interned) which still need to
> be processed. If possible parsing in-place and extracting the valuable
> data should prove much faster.

Yup. In my impl, the byte[] is parsed in place by jumping to different
offsets once we find the length before the ':'
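The offset-jumping idea, sketched in Python 3 rather than Java. This is not the attached implementation; the name `iter_frames` is an assumption for illustration. It walks a buffer of concatenated tnetstrings and reports where each payload sits, without split() and without copying any payload.

```python
def iter_frames(buf):
    """Yield (tag, start, end) for each top-level tnetstring in `buf`.

    `start` and `end` are offsets of the payload inside `buf`.
    """
    pos = 0
    n = len(buf)
    while pos < n:
        # Accumulate the ASCII length digits one byte at a time.
        length = 0
        i = pos
        while i < n and 0x30 <= buf[i] <= 0x39:
            length = length * 10 + (buf[i] - 0x30)
            i += 1
        if i == pos or i >= n or buf[i] != 0x3A:
            raise ValueError("bad length prefix at offset %d" % pos)
        # Jump straight over the payload to the type tag.
        start = i + 1
        end = start + length
        if end >= n:
            raise ValueError("truncated tnetstring at offset %d" % pos)
        yield chr(buf[end]), start, end
        pos = end + 1


buf = b"5:hello,2:42#"
frames = list(iter_frames(buf))
```

A caller can then build the host type from `buf[start:end]` only when it is needed, or wrap `buf` in a memoryview to slice without copying.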

> 
> Having separate string and blob types just complicates matters.
> tnetstrings are 8 bit clean, so store whatever rubbish you want. But it's
> not the protocol's problem as to how you encode your data.

I'm suggesting string and blob types are the same on the wire, the
type char is the only difference and they are just type hints for host
languages. Some languages don't care, some do.

It's not the protocol's problem how you encode your data, but if the
protocol proposes cross-language types, one would want those types to
be useful in all the common cases. I'm not saying that byte[], a
subset of integers, booleans, null and lists and maps aren't useful,
but a string type hint would be mighty useful in some languages.

> 
> To this end as well, iirc it's easy enough to get a byte[] and length
> from a native string, and to convert between utf-8 / ascii / whatever.
> http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.getbytes(v=VS.100).aspx
> for example. imo your api should handle byte[]'s only, and then let the
> application decide encoding, this then reflects the actual protocol
> semantics and stops anybody from getting confused.

Yes, it's pretty easy to convert from byte[] to one's platform's
string representation. But if you're doing this anyway, why not just
use plain old netstrings?

Cheers,
Armando