librelist archives

How to correctly parse strings

From: Andre Leiradella
Date: 2014-07-01 @ 22:30

Hi All,

My lexer is defined as:

%lex

%%

"#"\s*"line"\s+\d+\s*'"'[^"]*'"'      /* ignore line directives */
\s+                                   /* skip whitespace */
[a-zA-Z_][a-zA-Z_0-9]* {
   var kw = keywords[yytext];
   if (kw) return kw;
   return 'ID';
}
"-"?[0-9]+"."[0-9]*([eE][+-]?[0-9]+)?f? return 'FLOAT_LITERAL'
"0x"[0-9a-fA-F]+                        return 'HEX_LITERAL'
"0b"[01]+                               return 'BIN_LITERAL'
"-"?[0-9]+                              return 'INT_LITERAL'
'"'[^"]*'"'                             return 'STR'
"'"[^']*"'"                             return 'STR'
"||"                                    return 'LOGOR'
"&&"                                    return 'LOGAND'
"<="                                    return 'LE'
">="                                    return 'GE'
"=="                                    return 'EQ'
"!="                                    return 'NE'
"["                                     return '['
"]"                                     return ']'
[-+;:.@{}<>,()=|]                       return yytext
<<EOF>>                                 return 'EOF'

/lex

%%

I'm having a problem where unterminated strings are not generating 
errors. How can I make the generated parser error on strings that span 
the end of the line or the end of the input file?

Thanks,

Andre

Re: [jison] How to correctly parse strings

From: Doug
Date: 2014-07-02 @ 09:46

Just a guess, but you could simply try:
 '"'[^"\n\r\f]*'"'                         return 'STR'
Then an unterminated string would not match the rule, and there
would probably be a parsing error where STR was expected.
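
To see what excluding line breaks from the character class changes, here is a minimal JavaScript sketch. It assumes that Jison lexer patterns end up as JavaScript regular expressions (as far as I know they do); the helper name matchAtStart and the sample input are made up for illustration and are not part of Jison.

```javascript
// Hypothetical helper: try a pattern at the very start of the input,
// roughly the way a lexer rule is tried at the current position.
function matchAtStart(pattern, input) {
  const m = new RegExp('^(?:' + pattern + ')').exec(input);
  return m ? m[0] : null;
}

// An unterminated string, followed by a later line that happens to
// contain another double quote.
const input = '"ha);\n  other("x");';

const loose = '"[^"]*"';            // original rule: may cross line breaks
const strict = '"[^"\\n\\r\\f]*"';  // suggested rule: stops at line breaks

// The loose rule runs on to the first quote on the next line.
const looseMatch = matchAtStart(loose, input);
// The strict rule finds no closing quote before the line break: null.
const strictMatch = matchAtStart(strict, input);

console.log(JSON.stringify(looseMatch));
console.log(strictMatch);
```

Note that a failed match only means this one rule does not apply; what happens next depends on the remaining rules and on the grammar.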

Alternatively, this might give a more explicit syntax error:
switch to a new lexer state (start condition) when you hit the first
quote, accept everything except newlines in that state until the
closing quote, then pop the state stack. I have used something like
this method for string parsing:

%lex
%x dquote squote 
("\r\n"|\n|\r|\f)   /* ignore newlines  (maybe  \s+ does this too?) */

<INITIAL>'"'        { this.begin('dquote'); return 'DQUOT'; }
<dquote>["]/[\s]    { this.popState(); return 'DQUOT';  }
<dquote>("\r\n"|\n|\r|\f)   {return  'NEWLINE'}
<dquote>(\\\"|[^"\n\r\f])*/["]  { return 'STR'; }  

ditto for squote

The last pattern says: while in dquote mode, allow escaped quotes
or anything other than a quote or newline, as long as it's followed
by a quote (a positive lookahead assertion).
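
The string-body pattern can be tried on its own in plain JavaScript. This is only a sketch: it assumes the trailing /["] corresponds to a JavaScript lookahead (?="), and the function name body and both sample inputs are invented for the example.

```javascript
// Body of a double-quoted string: escaped quotes, or any character that is
// neither a quote nor a line break, followed (but not consumed) by a quote.
const strBody = /^(?:\\"|[^"\n\r\f])*(?=")/;

// Hypothetical helper: return the matched body, or null if there is none.
function body(input) {
  const m = strBody.exec(input);
  return m ? m[0] : null;
}

// Input as seen just after the opening quote has been consumed.
const terminated = body('say \\"hi\\" now" rest');   // body up to the closing quote
const unterminated = body('ha);\n  next line');      // no quote on this line: null

console.log(terminated);
console.log(unterminated);
```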

A grammar production rule should look like this:

string_value  
:  DQUOT STR DQUOT 
     { $$ = $2; }
 | SQUOT STR SQUOT 
     { $$ = $2; }
;

If an unterminated quoted string is encountered you should get a
syntax error because there will be no STR after the first DQUOT.

Alternatively, if you remove the trailing
lookahead assertion /["]  from the last regex pattern, 
then an unterminated string would look like the production 
   DQUOT STR  NEWLINE
and the error should be something like: 
  encountered a NEWLINE when a DQUOT was expected 
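
The token sequences described above can be reproduced with a small hand-written tokenizer. This is a stand-in written for illustration, not Jison output: the function tokenize, the sample inputs, and the choice to ignore everything outside strings are all assumptions. It follows the variant without the trailing lookahead.

```javascript
// Two states: 'INITIAL' looks for an opening quote, 'dquote' reads the body.
function tokenize(input) {
  const tokens = [];
  let state = 'INITIAL';
  let pos = 0;
  while (pos < input.length) {
    const rest = input.slice(pos);
    let m;
    if (state === 'INITIAL') {
      if (rest[0] === '"') {
        state = 'dquote';          // like this.begin('dquote')
        tokens.push('DQUOT');
      }
      pos += 1;                    // anything else is ignored in this sketch
    } else if (rest[0] === '"') {
      state = 'INITIAL';           // like this.popState()
      tokens.push('DQUOT');
      pos += 1;
    } else if ((m = /^(?:\r\n|\n|\r|\f)/.exec(rest))) {
      tokens.push('NEWLINE');      // line break inside a string
      pos += m[0].length;
    } else {
      m = /^(?:\\"|[^"\n\r\f])+/.exec(rest);
      tokens.push('STR');          // string body, no lookahead required
      pos += m[0].length;
    }
  }
  return tokens;
}

const good = tokenize('label("ha");').join(' ');   // DQUOT STR DQUOT
const bad = tokenize('label("ha);\n').join(' ');   // DQUOT STR NEWLINE

console.log(good);
console.log(bad);
```

With the rule DQUOT STR DQUOT, a parser fed the second sequence would be expected to fail at NEWLINE.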
 

Something like that anyway.
cheers
Doug






Re: [jison] How to correctly parse strings

From: Andre Leiradella
Date: 2014-07-03 @ 00:56

On 02/07/2014 06:46, Doug wrote:
> '"'[^"\n\r\f]*'"' return 'STR'
That was what I tried before posting to the list. It does not error on
unterminated strings. In fact, I'm puzzled by the results. I'm feeding
this file to the parser (note the unterminated string):

------------------------8<------------------------
import "Core/DDLFoundation/DDLVector3.ddl"

cppinclude "CodeGen/DDLFoundation/DDLVector3.h"

pragma DataBufferRead true
pragma DataBufferWrite true

// Serializable AABB
struct DDLAABB, tag( Serializable )
{
   DDLVector3 Center, label("ha);
   DDLVector3 Extents;
}
------------------------8<------------------------

My parser just builds an AST, which in the case of the file above is this:

------------------------8<------------------------
{
   "pragmas": [
     {
       "id": "DataBufferRead",
       "value": true,
       "line": 5
     },
     {
       "id": "DataBufferWrite",
       "value": true,
       "line": 6
     }
   ],
   "imports": [
     "X:/core/devel/code/Core/DDLFoundation/DDLVector3.ddl"
   ],
   "cppincludes": [
     "CodeGen/DDLFoundation/DDLVector3.h"
   ],
   "aggregates": [
     {
       "type": "struct",
       "name": "DDLAABB",
       "tags": [
         {
           "name": "Serializable",
           "value": [],
           "line": 9
         }
       ],
       "fields": [
         {
           "info": {
             "arraytype": "scalar",
             "type": "DDLVector3",
             "name": "Extents"
           },
           "name": "Extents",
           "line": 11,
           "tags": []
         }
       ],
       "line": 9
     }
   ]
}
------------------------8<------------------------

Note that the Center field is not present in the AST, i.e. instead of an 
error the parser somehow suppresses the entire line from the input. If I 
fix the missing double quote, the output is this (which is correct):

------------------------8<------------------------
{
   "pragmas": [
     {
       "id": "DataBufferRead",
       "value": true,
       "line": 5
     },
     {
       "id": "DataBufferWrite",
       "value": true,
       "line": 6
     }
   ],
   "imports": [
     "X:/core/devel/code/Core/DDLFoundation/DDLVector3.ddl"
   ],
   "cppincludes": [
     "CodeGen/DDLFoundation/DDLVector3.h"
   ],
   "aggregates": [
     {
       "type": "struct",
       "name": "DDLAABB",
       "tags": [
         {
           "name": "Serializable",
           "value": [],
           "line": 9
         }
       ],
       "fields": [
         {
           "info": {
             "arraytype": "scalar",
             "type": "DDLVector3",
             "name": "Center"
           },
           "name": "Center",
           "line": 11,
           "tags": [
             {
               "name": "label",
               "value": "ha",
               "line": 11
             }
           ]
         },
         {
           "info": {
             "arraytype": "scalar",
             "type": "DDLVector3",
             "name": "Extents"
           },
           "name": "Extents",
           "line": 12,
           "tags": []
         }
       ],
       "line": 9
     }
   ]
}
------------------------8<------------------------
>
> %lex
>
> %x dquote squote
>
> ("\r\n"|\n|\r|\f) /* ignore newlines (maybe \s+ does this too?) */
>
> <INITIAL>'"' { this.begin('dquote'); return 'DQUOT'; }
>
> <dquote>["]/[\s] { this.popState(); return 'DQUOT'; }
>
> <dquote>("\r\n"|\n|\r|\f) {return 'NEWLINE'}
>
> <dquote>(\\\"|[^"\n\r\f])*/["] { return 'STR'; }
>
> ditto for squote
>
I couldn't make that work; the resulting parser refuses to parse even
correctly formed strings.

Thanks!

Re: [jison] How to correctly parse strings

From: Doug du Boulay
Date: 2014-07-03 @ 02:05

Sorry,
Pushed for time at the moment, but
I forgot to change this rule:

<dquote>["]/[\s] { this.popState(); return 'DQUOT'; }

Remove the /[\s] term.
I copied it from my code, where closing quotes are significant only when
followed by whitespace
(hence the positive lookahead assertion).

Sorry about that.
Doug.


Re: [jison] How to correctly parse strings

From: Andre Leiradella
Date: 2014-07-03 @ 02:45

On 02/07/2014 23:05, Doug du Boulay wrote:
>
> Sorry,
>
No need for apologies!
>
> Pushed for time at the moment, but
> I forgot to change this rule:
>
> <dquote>["]/[\s] { this.popState(); return 'DQUOT'; }
>
> Remove the /[\s] term.
>
My lex now is:

------------------------8<------------------------
%lex
%x dquote
%%

"#"\s*"line"\s+\d+\s*'"'[^"]*'"'      /* ignore line directives */
\s+                                   /* skip whitespace */
[a-zA-Z_][a-zA-Z_0-9]* {
   var kw = keywords[yytext];
   if (kw) return kw;
   return 'ID';
}
"-"?[0-9]+"."[0-9]*([eE][+-]?[0-9]+)?f? return 'FLOAT_LITERAL'
"0x"[0-9a-fA-F]+                        return 'HEX_LITERAL'
"0b"[01]+                               return 'BIN_LITERAL'
"-"?[0-9]+                              return 'INT_LITERAL'
"||"                                    return 'LOGOR'
"&&"                                    return 'LOGAND'
"<="                                    return 'LE'
">="                                    return 'GE'
"=="                                    return 'EQ'
"!="                                    return 'NE'
"["                                     return '['
"]"                                     return ']'
[-+;:.@{}<>,()=|]                       return yytext
<<EOF>>                                 return 'EOF'

<INITIAL>'"' { this.begin('dquote'); return 'DQUOT'; }
<dquote>["] { this.popState(); return 'DQUOT'; }
<dquote>("\r\n"|\n|\r|\f) {return 'NEWLINE'}
<dquote>(\\\"|[^"\n\r\f])*/["] { return 'STR'; }

/lex

%%
------------------------8<------------------------

and I have the following rule:

------------------------8<------------------------
str
   : DQUOT STR DQUOT { $$ = $2; }
   ;
------------------------8<------------------------

but the problem persists: valid strings compile OK, while invalid strings
appear to make the parser dismiss the entire input line without errors.

I wonder if the \s+ /* skip whitespace */ line is the culprit, but
without it, how is the lexer going to ignore whitespace?
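
On the \s+ question, one thing that can be checked in isolation is what \s covers. The sketch below assumes the pattern is handed to JavaScript's regular expression engine; the variable names are invented for the example.

```javascript
// Every kind of line break mentioned in the thread: CRLF, LF, CR, form feed.
const lineBreaks = '\r\n\n\r\f';

// \s+ consumes all of them and stops at the first non-whitespace character.
const skipped = /^\s+/.exec(lineBreaks + 'x')[0];

console.log(skipped === lineBreaks);
console.log(skipped.length);
```

So in the INITIAL state a \s+ rule already skips newlines. Since dquote is declared with %x (exclusive), my understanding is that a rule without a start-condition prefix, such as \s+, is not tried inside the string state, but that is worth verifying against the Jison version in use.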

$ jison --version
0.4.13

Thanks!

Re: [jison] How to correctly parse strings

From: Doug du Boulay
Date: 2014-07-03 @ 02:56

Try to append the rule

. return 'INVALID'

as your final lex rule.
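
Presumably the idea behind the catch-all is to turn any character no other rule accepts into a token the grammar never expects, so that the parser reports it. A rough JavaScript sketch of that idea follows; it assumes rules are tried in order and the first match wins (my understanding of Jison's default), and the rule table, the PUNCT token name, and the function lex are hypothetical.

```javascript
// Hypothetical rule table, tried top to bottom; the first match wins.
const rules = [
  [/^\s+/, null],                       // skip whitespace
  [/^[a-zA-Z_][a-zA-Z_0-9]*/, 'ID'],
  [/^[;(),]/, 'PUNCT'],
  [/^./, 'INVALID'],                    // catch-all, deliberately last
];

function lex(input) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    const hit = rules.find(([re]) => re.test(rest));
    if (!hit) throw new Error('unrecognized text: ' + rest[0]);
    const [re, name] = hit;
    if (name !== null) tokens.push(name);
    rest = rest.slice(re.exec(rest)[0].length);
  }
  return tokens;
}

// No rule knows about quotes here, so the stray quote becomes INVALID.
const withQuote = lex('label("ha);').join(' ');

console.log(withQuote);
```

Placing the catch-all anywhere but last would shadow the rules after it under first-match ordering.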

Re: [jison] How to correctly parse strings

From: Andre Leiradella
Date: 2014-07-03 @ 03:02

On 02/07/2014 23:56, Doug du Boulay wrote:
> . return 'INVALID'
Now it errors on the first line of the input file, even if there are no 
errors in it. I've made the rule the last one in the list.

Re: [jison] How to correctly parse strings

From: Doug du Boulay
Date: 2014-07-03 @ 04:13

Hmm.
What if you add a rule to ignore newlines but only in the INITIAL state?
I.e.
<INITIAL>("\r\n"|\n|\r|\f)   /* skip newlines */


Re: [jison] How to correctly parse strings

From: Andre Leiradella
Date: 2014-07-03 @ 15:42

On 03/07/2014 01:13, Doug du Boulay wrote:
> Hmm.
> What if you add a rule to ignore newlines but only in the INITIAL 
> state? I.e.
> <INITIAL>("\r\n"|\n|\r|\f)   /* skip newlines */
Same thing: unfinished strings don't cause an error and "eat" the line
away :(