librelist archives

« back to archive

Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Robert Plummer
Date:
2014-02-17 @ 00:28
FYI, Doug chimes in on the performance issues dealing with offset.  He is
just awesome!
---------- Forwarded message ----------
From: "Douglas Crockford" <douglas@crockford.com>
Date: Feb 16, 2014 7:23 PM
Subject: Re: High performance regexp using "^" (beginning with) and
RegExlastIndex
To: "Robert Plummer" <robertleeplummerjr@gmail.com>
Cc:

Mistakes were made.
>
> ES6 regexp may have a /y flag that will use lastIndex correctly.
>
>
> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>
>> Hey Doug,
>> Recently I was using jison (Bison clone in json + javascript) to parse
>> some larger files, about 1.2mb in size.  There is a HUGE performance
>> bottleneck on the use of substr or slice (which are used as matches are
>> found, for 3 attributes, "matched", past tense, and all matches up till
>> now, "match" which is the current match found in the lexer, "_input" the
>> yet unmatched string(s)), and after further research I came to the
>> realization that in javascript, strings are immutable, and every
>> modification to them causes another string to be created in memory.
>> This got me searching for a way to analyse the string input, but by
>> keeping track of where/what has been analysed, so as to offset this
>> performance and resource issue.  In practice this is somewhat very
>> similar to c#'s and java's "StringReader".
>>
>> Then I found RexEx's lastIndex
>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>, thinking
>> this was the answer, I started on an abstraction of the "_input"
>> variable that would track the current position, allow for regexp
>> matching, return strings ONLY when needed, and was considerably more
>> performant.  I succeeded!... or so I thought.   inside of jison all the
>> regular expressions have '^' at the beginning, so they are only looking
>> for the next match, and nothing after that.  What I found is that even
>> when you set lastIndex, it doesn't care about that you are essentially
>> telling it to forget anything before the index, if you specify '^' and
>> the position of _input is greater than 0, you will never get a match!
>> This, to me, is absolutely absurd.  In most other languages I'm dealing
>> with (php, and c# for example) you can tell the regexp to have an offset
>> before the match begins, why is javascript the odd one out?  The savings
>> for this little bitty settings could be massive!  Especially in the
>> context of a parser where string analysis and regex's are used
>> frequently, and potentially with large strings.  The resource savings
>> with a 1.2mb file could literally be gigs of memory!
>>
>> Am I missing something?  Can you elaborate?  What other technique could
>> I use?  Any pointers?
>>
>>
>> As always, thanks for your time, and I hope you have a nice weekend!
>>
>> --
>> Robert Plummer
>>
>

Re: [jison] Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Zachary Carter
Date:
2014-02-17 @ 00:58
Thanks for looking in to it, Robert. Yes, the y flag is the "sticky mode" I
mentioned in the other thread. Besides an untriaged Chromium bug[1], I
don't see much activity on the part of other engines to support it yet,
unfortunately. Opera used to have it before they switched to Blink.

Thanks,
Zach

[1] https://code.google.com/p/chromium/issues/detail?id=142334


On Sun, Feb 16, 2014 at 4:28 PM, Robert Plummer <
robertleeplummerjr@gmail.com> wrote:

> FYI, Doug chimes in on the performance issues dealing with offset.  He is
> just awesome!
> ---------- Forwarded message ----------
> From: "Douglas Crockford" <douglas@crockford.com>
> Date: Feb 16, 2014 7:23 PM
> Subject: Re: High performance regexp using "^" (beginning with) and
> RegExlastIndex
> To: "Robert Plummer" <robertleeplummerjr@gmail.com>
> Cc:
>
> Mistakes were made.
>>
>> ES6 regexp may have a /y flag that will use lastIndex correctly.
>>
>>
>> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>>
>>> Hey Doug,
>>> Recently I was using jison (Bison clone in json + javascript) to parse
>>> some larger files, about 1.2mb in size.  There is a HUGE performance
>>> bottleneck on the use of substr or slice (which are used as matches are
>>> found, for 3 attributes, "matched", past tense, and all matches up till
>>> now, "match" which is the current match found in the lexer, "_input" the
>>> yet unmatched string(s)), and after further research I came to the
>>> realization that in javascript, strings are immutable, and every
>>> modification to them causes another string to be created in memory.
>>> This got me searching for a way to analyse the string input, but by
>>> keeping track of where/what has been analysed, so as to offset this
>>> performance and resource issue.  In practice this is somewhat very
>>> similar to c#'s and java's "StringReader".
>>>
>>> Then I found RexEx's lastIndex
>>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>, thinking
>>> this was the answer, I started on an abstraction of the "_input"
>>> variable that would track the current position, allow for regexp
>>> matching, return strings ONLY when needed, and was considerably more
>>> performant.  I succeeded!... or so I thought.   inside of jison all the
>>> regular expressions have '^' at the beginning, so they are only looking
>>> for the next match, and nothing after that.  What I found is that even
>>> when you set lastIndex, it doesn't care about that you are essentially
>>> telling it to forget anything before the index, if you specify '^' and
>>> the position of _input is greater than 0, you will never get a match!
>>> This, to me, is absolutely absurd.  In most other languages I'm dealing
>>> with (php, and c# for example) you can tell the regexp to have an offset
>>> before the match begins, why is javascript the odd one out?  The savings
>>> for this little bitty settings could be massive!  Especially in the
>>> context of a parser where string analysis and regex's are used
>>> frequently, and potentially with large strings.  The resource savings
>>> with a 1.2mb file could literally be gigs of memory!
>>>
>>> Am I missing something?  Can you elaborate?  What other technique could
>>> I use?  Any pointers?
>>>
>>>
>>> As always, thanks for your time, and I hope you have a nice weekend!
>>>
>>> --
>>> Robert Plummer
>>>
>>


-- 
Zach Carter

Re: [jison] Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Robert Plummer
Date:
2014-02-17 @ 01:51
If we have to do it ourselves....  We do what we gotta do.
On Feb 16, 2014 7:58 PM, "Zachary Carter" <zack.carter@gmail.com> wrote:

> Thanks for looking in to it, Robert. Yes, the y flag is the "sticky mode"
> I mentioned in the other thread. Besides an untriaged Chromium bug[1], I
> don't see much activity on the part of other engines to support it yet,
> unfortunately. Opera used to have it before they switched to Blink.
>
> Thanks,
> Zach
>
> [1] https://code.google.com/p/chromium/issues/detail?id=142334
>
>
> On Sun, Feb 16, 2014 at 4:28 PM, Robert Plummer <
> robertleeplummerjr@gmail.com> wrote:
>
>> FYI, Doug chimes in on the performance issues dealing with offset.  He is
>> just awesome!
>> ---------- Forwarded message ----------
>> From: "Douglas Crockford" <douglas@crockford.com>
>> Date: Feb 16, 2014 7:23 PM
>> Subject: Re: High performance regexp using "^" (beginning with) and
>> RegExlastIndex
>> To: "Robert Plummer" <robertleeplummerjr@gmail.com>
>> Cc:
>>
>> Mistakes were made.
>>>
>>> ES6 regexp may have a /y flag that will use lastIndex correctly.
>>>
>>>
>>> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>>>
>>>> Hey Doug,
>>>> Recently I was using jison (Bison clone in json + javascript) to parse
>>>> some larger files, about 1.2mb in size.  There is a HUGE performance
>>>> bottleneck on the use of substr or slice (which are used as matches are
>>>> found, for 3 attributes, "matched", past tense, and all matches up till
>>>> now, "match" which is the current match found in the lexer, "_input" the
>>>> yet unmatched string(s)), and after further research I came to the
>>>> realization that in javascript, strings are immutable, and every
>>>> modification to them causes another string to be created in memory.
>>>> This got me searching for a way to analyse the string input, but by
>>>> keeping track of where/what has been analysed, so as to offset this
>>>> performance and resource issue.  In practice this is somewhat very
>>>> similar to c#'s and java's "StringReader".
>>>>
>>>> Then I found RexEx's lastIndex
>>>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>, thinking
>>>> this was the answer, I started on an abstraction of the "_input"
>>>> variable that would track the current position, allow for regexp
>>>> matching, return strings ONLY when needed, and was considerably more
>>>> performant.  I succeeded!... or so I thought.   inside of jison all the
>>>> regular expressions have '^' at the beginning, so they are only looking
>>>> for the next match, and nothing after that.  What I found is that even
>>>> when you set lastIndex, it doesn't care about that you are essentially
>>>> telling it to forget anything before the index, if you specify '^' and
>>>> the position of _input is greater than 0, you will never get a match!
>>>> This, to me, is absolutely absurd.  In most other languages I'm dealing
>>>> with (php, and c# for example) you can tell the regexp to have an offset
>>>> before the match begins, why is javascript the odd one out?  The savings
>>>> for this little bitty settings could be massive!  Especially in the
>>>> context of a parser where string analysis and regex's are used
>>>> frequently, and potentially with large strings.  The resource savings
>>>> with a 1.2mb file could literally be gigs of memory!
>>>>
>>>> Am I missing something?  Can you elaborate?  What other technique could
>>>> I use?  Any pointers?
>>>>
>>>>
>>>> As always, thanks for your time, and I hope you have a nice weekend!
>>>>
>>>> --
>>>> Robert Plummer
>>>>
>>>
>
>
> --
> Zach Carter
>

Re: [jison] Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Robert Plummer
Date:
2014-10-08 @ 16:36
A community update, looks like this is progressing quite well in v8 (for
example, unit tests:

https://code.google.com/p/v8/source/browse/branches/bleeding_edge/test/mjsunit/harmony/regexp-sticky.js?spec=svn24065&r=24065)
and will inevitably be in later versions of it.  A company I'm working with
is helping to progress this.  As a means of progress I'm going to be
building an input reader that optionally falls back to the existing method
if sticky mode isn't available, however if it is available, it will be
used.  How does that sound?

On Sun, Feb 16, 2014 at 8:51 PM, Robert Plummer <
robertleeplummerjr@gmail.com> wrote:

> If we have to do it ourselves....  We do what we gotta do.
> On Feb 16, 2014 7:58 PM, "Zachary Carter" <zack.carter@gmail.com> wrote:
>
>> Thanks for looking in to it, Robert. Yes, the y flag is the "sticky mode"
>> I mentioned in the other thread. Besides an untriaged Chromium bug[1], I
>> don't see much activity on the part of other engines to support it yet,
>> unfortunately. Opera used to have it before they switched to Blink.
>>
>> Thanks,
>> Zach
>>
>> [1] https://code.google.com/p/chromium/issues/detail?id=142334
>>
>>
>> On Sun, Feb 16, 2014 at 4:28 PM, Robert Plummer <
>> robertleeplummerjr@gmail.com> wrote:
>>
>>> FYI, Doug chimes in on the performance issues dealing with offset.  He
>>> is just awesome!
>>> ---------- Forwarded message ----------
>>> From: "Douglas Crockford" <douglas@crockford.com>
>>> Date: Feb 16, 2014 7:23 PM
>>> Subject: Re: High performance regexp using "^" (beginning with) and
>>> RegExlastIndex
>>> To: "Robert Plummer" <robertleeplummerjr@gmail.com>
>>> Cc:
>>>
>>> Mistakes were made.
>>>>
>>>> ES6 regexp may have a /y flag that will use lastIndex correctly.
>>>>
>>>>
>>>> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>>>>
>>>>> Hey Doug,
>>>>> Recently I was using jison (Bison clone in json + javascript) to parse
>>>>> some larger files, about 1.2mb in size.  There is a HUGE performance
>>>>> bottleneck on the use of substr or slice (which are used as matches are
>>>>> found, for 3 attributes, "matched", past tense, and all matches up till
>>>>> now, "match" which is the current match found in the lexer, "_input"
>>>>> the
>>>>> yet unmatched string(s)), and after further research I came to the
>>>>> realization that in javascript, strings are immutable, and every
>>>>> modification to them causes another string to be created in memory.
>>>>> This got me searching for a way to analyse the string input, but by
>>>>> keeping track of where/what has been analysed, so as to offset this
>>>>> performance and resource issue.  In practice this is somewhat very
>>>>> similar to c#'s and java's "StringReader".
>>>>>
>>>>> Then I found RexEx's lastIndex
>>>>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>, thinking
>>>>> this was the answer, I started on an abstraction of the "_input"
>>>>> variable that would track the current position, allow for regexp
>>>>> matching, return strings ONLY when needed, and was considerably more
>>>>> performant.  I succeeded!... or so I thought.   inside of jison all the
>>>>> regular expressions have '^' at the beginning, so they are only looking
>>>>> for the next match, and nothing after that.  What I found is that even
>>>>> when you set lastIndex, it doesn't care about that you are essentially
>>>>> telling it to forget anything before the index, if you specify '^' and
>>>>> the position of _input is greater than 0, you will never get a match!
>>>>> This, to me, is absolutely absurd.  In most other languages I'm dealing
>>>>> with (php, and c# for example) you can tell the regexp to have an
>>>>> offset
>>>>> before the match begins, why is javascript the odd one out?  The
>>>>> savings
>>>>> for this little bitty settings could be massive!  Especially in the
>>>>> context of a parser where string analysis and regex's are used
>>>>> frequently, and potentially with large strings.  The resource savings
>>>>> with a 1.2mb file could literally be gigs of memory!
>>>>>
>>>>> Am I missing something?  Can you elaborate?  What other technique could
>>>>> I use?  Any pointers?
>>>>>
>>>>>
>>>>> As always, thanks for your time, and I hope you have a nice weekend!
>>>>>
>>>>> --
>>>>> Robert Plummer
>>>>>
>>>>
>>
>>
>> --
>> Zach Carter
>>
>


-- 
Robert Plummer

Re: [jison] Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Zachary Carter
Date:
2014-10-11 @ 18:58
On Wed, Oct 8, 2014 at 9:36 AM, Robert Plummer <robertleeplummerjr@gmail.com
> wrote:

> A community update, looks like this is progressing quite well in v8 (for
> example, unit tests:
> 
https://code.google.com/p/v8/source/browse/branches/bleeding_edge/test/mjsunit/harmony/regexp-sticky.js?spec=svn24065&r=24065)
> and will inevitably be in later versions of it.  A company I'm working with
> is helping to progress this.
>

Amazing!


> As a means of progress I'm going to be building an input reader that
> optionally falls back to the existing method if sticky mode isn't
> available, however if it is available, it will be used.  How does that
> sound?
>

Two thumbs up.


>
> On Sun, Feb 16, 2014 at 8:51 PM, Robert Plummer <
> robertleeplummerjr@gmail.com> wrote:
>
>> If we have to do it ourselves....  We do what we gotta do.
>> On Feb 16, 2014 7:58 PM, "Zachary Carter" <zack.carter@gmail.com> wrote:
>>
>>> Thanks for looking in to it, Robert. Yes, the y flag is the "sticky
>>> mode" I mentioned in the other thread. Besides an untriaged Chromium
>>> bug[1], I don't see much activity on the part of other engines to support
>>> it yet, unfortunately. Opera used to have it before they switched to Blink.
>>>
>>> Thanks,
>>> Zach
>>>
>>> [1] https://code.google.com/p/chromium/issues/detail?id=142334
>>>
>>>
>>> On Sun, Feb 16, 2014 at 4:28 PM, Robert Plummer <
>>> robertleeplummerjr@gmail.com> wrote:
>>>
>>>> FYI, Doug chimes in on the performance issues dealing with offset.  He
>>>> is just awesome!
>>>> ---------- Forwarded message ----------
>>>> From: "Douglas Crockford" <douglas@crockford.com>
>>>> Date: Feb 16, 2014 7:23 PM
>>>> Subject: Re: High performance regexp using "^" (beginning with) and
>>>> RegExlastIndex
>>>> To: "Robert Plummer" <robertleeplummerjr@gmail.com>
>>>> Cc:
>>>>
>>>> Mistakes were made.
>>>>>
>>>>> ES6 regexp may have a /y flag that will use lastIndex correctly.
>>>>>
>>>>>
>>>>> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>>>>>
>>>>>> Hey Doug,
>>>>>> Recently I was using jison (Bison clone in json + javascript) to parse
>>>>>> some larger files, about 1.2mb in size.  There is a HUGE performance
>>>>>> bottleneck on the use of substr or slice (which are used as matches
>>>>>> are
>>>>>> found, for 3 attributes, "matched", past tense, and all matches up
>>>>>> till
>>>>>> now, "match" which is the current match found in the lexer, "_input"
>>>>>> the
>>>>>> yet unmatched string(s)), and after further research I came to the
>>>>>> realization that in javascript, strings are immutable, and every
>>>>>> modification to them causes another string to be created in memory.
>>>>>> This got me searching for a way to analyse the string input, but by
>>>>>> keeping track of where/what has been analysed, so as to offset this
>>>>>> performance and resource issue.  In practice this is somewhat very
>>>>>> similar to c#'s and java's "StringReader".
>>>>>>
>>>>>> Then I found RexEx's lastIndex
>>>>>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>, thinking
>>>>>> this was the answer, I started on an abstraction of the "_input"
>>>>>> variable that would track the current position, allow for regexp
>>>>>> matching, return strings ONLY when needed, and was considerably more
>>>>>> performant.  I succeeded!... or so I thought.   inside of jison all
>>>>>> the
>>>>>> regular expressions have '^' at the beginning, so they are only
>>>>>> looking
>>>>>> for the next match, and nothing after that.  What I found is that even
>>>>>> when you set lastIndex, it doesn't care about that you are essentially
>>>>>> telling it to forget anything before the index, if you specify '^' and
>>>>>> the position of _input is greater than 0, you will never get a match!
>>>>>> This, to me, is absolutely absurd.  In most other languages I'm
>>>>>> dealing
>>>>>> with (php, and c# for example) you can tell the regexp to have an
>>>>>> offset
>>>>>> before the match begins, why is javascript the odd one out?  The
>>>>>> savings
>>>>>> for this little bitty settings could be massive!  Especially in the
>>>>>> context of a parser where string analysis and regex's are used
>>>>>> frequently, and potentially with large strings.  The resource savings
>>>>>> with a 1.2mb file could literally be gigs of memory!
>>>>>>
>>>>>> Am I missing something?  Can you elaborate?  What other technique
>>>>>> could
>>>>>> I use?  Any pointers?
>>>>>>
>>>>>>
>>>>>> As always, thanks for your time, and I hope you have a nice weekend!
>>>>>>
>>>>>> --
>>>>>> Robert Plummer
>>>>>>
>>>>>
>>>
>>>
>>> --
>>> Zach Carter
>>>
>>
>
>
> --
> Robert Plummer
>



-- 
Zach Carter

Re: [jison] Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Robert Plummer
Date:
2014-10-11 @ 19:12
I'm half way done.  I'll have the rest next week.
On Oct 11, 2014 2:59 PM, "Zachary Carter" <zack.carter@gmail.com> wrote:

>
>
> On Wed, Oct 8, 2014 at 9:36 AM, Robert Plummer <
> robertleeplummerjr@gmail.com> wrote:
>
>> A community update, looks like this is progressing quite well in v8 (for
>> example, unit tests:
>> 
https://code.google.com/p/v8/source/browse/branches/bleeding_edge/test/mjsunit/harmony/regexp-sticky.js?spec=svn24065&r=24065)
>> and will inevitably be in later versions of it.  A company I'm working with
>> is helping to progress this.
>>
>
> Amazing!
>
>
>> As a means of progress I'm going to be building an input reader that
>> optionally falls back to the existing method if sticky mode isn't
>> available, however if it is available, it will be used.  How does that
>> sound?
>>
>
> Two thumbs up.
>
>
>>
>> On Sun, Feb 16, 2014 at 8:51 PM, Robert Plummer <
>> robertleeplummerjr@gmail.com> wrote:
>>
>>> If we have to do it ourselves....  We do what we gotta do.
>>> On Feb 16, 2014 7:58 PM, "Zachary Carter" <zack.carter@gmail.com> wrote:
>>>
>>>> Thanks for looking in to it, Robert. Yes, the y flag is the "sticky
>>>> mode" I mentioned in the other thread. Besides an untriaged Chromium
>>>> bug[1], I don't see much activity on the part of other engines to support
>>>> it yet, unfortunately. Opera used to have it before they switched to Blink.
>>>>
>>>> Thanks,
>>>> Zach
>>>>
>>>> [1] https://code.google.com/p/chromium/issues/detail?id=142334
>>>>
>>>>
>>>> On Sun, Feb 16, 2014 at 4:28 PM, Robert Plummer <
>>>> robertleeplummerjr@gmail.com> wrote:
>>>>
>>>>> FYI, Doug chimes in on the performance issues dealing with offset.  He
>>>>> is just awesome!
>>>>> ---------- Forwarded message ----------
>>>>> From: "Douglas Crockford" <douglas@crockford.com>
>>>>> Date: Feb 16, 2014 7:23 PM
>>>>> Subject: Re: High performance regexp using "^" (beginning with) and
>>>>> RegExlastIndex
>>>>> To: "Robert Plummer" <robertleeplummerjr@gmail.com>
>>>>> Cc:
>>>>>
>>>>> Mistakes were made.
>>>>>>
>>>>>> ES6 regexp may have a /y flag that will use lastIndex correctly.
>>>>>>
>>>>>>
>>>>>> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>>>>>>
>>>>>>> Hey Doug,
>>>>>>> Recently I was using jison (Bison clone in json + javascript) to
>>>>>>> parse
>>>>>>> some larger files, about 1.2mb in size.  There is a HUGE performance
>>>>>>> bottleneck on the use of substr or slice (which are used as matches
>>>>>>> are
>>>>>>> found, for 3 attributes, "matched", past tense, and all matches up
>>>>>>> till
>>>>>>> now, "match" which is the current match found in the lexer, "_input"
>>>>>>> the
>>>>>>> yet unmatched string(s)), and after further research I came to the
>>>>>>> realization that in javascript, strings are immutable, and every
>>>>>>> modification to them causes another string to be created in memory.
>>>>>>> This got me searching for a way to analyse the string input, but by
>>>>>>> keeping track of where/what has been analysed, so as to offset this
>>>>>>> performance and resource issue.  In practice this is somewhat very
>>>>>>> similar to c#'s and java's "StringReader".
>>>>>>>
>>>>>>> Then I found RexEx's lastIndex
>>>>>>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>,
>>>>>>> thinking
>>>>>>> this was the answer, I started on an abstraction of the "_input"
>>>>>>> variable that would track the current position, allow for regexp
>>>>>>> matching, return strings ONLY when needed, and was considerably more
>>>>>>> performant.  I succeeded!... or so I thought.   inside of jison all
>>>>>>> the
>>>>>>> regular expressions have '^' at the beginning, so they are only
>>>>>>> looking
>>>>>>> for the next match, and nothing after that.  What I found is that
>>>>>>> even
>>>>>>> when you set lastIndex, it doesn't care about that you are
>>>>>>> essentially
>>>>>>> telling it to forget anything before the index, if you specify '^'
>>>>>>> and
>>>>>>> the position of _input is greater than 0, you will never get a match!
>>>>>>> This, to me, is absolutely absurd.  In most other languages I'm
>>>>>>> dealing
>>>>>>> with (php, and c# for example) you can tell the regexp to have an
>>>>>>> offset
>>>>>>> before the match begins, why is javascript the odd one out?  The
>>>>>>> savings
>>>>>>> for this little bitty settings could be massive!  Especially in the
>>>>>>> context of a parser where string analysis and regex's are used
>>>>>>> frequently, and potentially with large strings.  The resource savings
>>>>>>> with a 1.2mb file could literally be gigs of memory!
>>>>>>>
>>>>>>> Am I missing something?  Can you elaborate?  What other technique
>>>>>>> could
>>>>>>> I use?  Any pointers?
>>>>>>>
>>>>>>>
>>>>>>> As always, thanks for your time, and I hope you have a nice weekend!
>>>>>>>
>>>>>>> --
>>>>>>> Robert Plummer
>>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Zach Carter
>>>>
>>>
>>
>>
>> --
>> Robert Plummer
>>
>
>
>
> --
> Zach Carter
>

Re: [jison] Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Robert Plummer
Date:
2014-10-16 @ 15:20
Ok, it is essentially feature complete and bug tested with the exception of
unput and backtrack_lexer.  I went ahead and put in pull requests to get it
implemented so the team can both see its implementation, and work to refine
it further.

jison pull request:
https://github.com/robertleeplummerjr/jison/compare/zaach:master...master
jison-lex pull request:
https://github.com/robertleeplummerjr/jison-lex/compare/zaach:master...master
It working in real world project:

https://github.com/Spreadsheets/jQuery.sheet/commit/f72932d1e16a9837cf8329dd150208f91194ec1c
The new input reader with sticky and fallback:

https://github.com/robertleeplummerjr/jison/blob/3db2245534ed24cb4773690ff3f6880b204ddf4f/lib/util/InputReader.js

I'd like to thank Douglas Crockford for taking the time to explain the
issue, Zach for knowing about it, and those who have worked on jison to
make it as powerful and refined as it is.  Again, thanks!


On Sat, Oct 11, 2014 at 3:12 PM, Robert Plummer <
robertleeplummerjr@gmail.com> wrote:

> I'm half way done.  I'll have the rest next week.
> On Oct 11, 2014 2:59 PM, "Zachary Carter" <zack.carter@gmail.com> wrote:
>
>>
>>
>> On Wed, Oct 8, 2014 at 9:36 AM, Robert Plummer <
>> robertleeplummerjr@gmail.com> wrote:
>>
>>> A community update, looks like this is progressing quite well in v8 (for
>>> example, unit tests:
>>> 
https://code.google.com/p/v8/source/browse/branches/bleeding_edge/test/mjsunit/harmony/regexp-sticky.js?spec=svn24065&r=24065)
>>> and will inevitably be in later versions of it.  A company I'm working with
>>> is helping to progress this.
>>>
>>
>> Amazing!
>>
>>
>>> As a means of progress I'm going to be building an input reader that
>>> optionally falls back to the existing method if sticky mode isn't
>>> available, however if it is available, it will be used.  How does that
>>> sound?
>>>
>>
>> Two thumbs up.
>>
>>
>>>
>>> On Sun, Feb 16, 2014 at 8:51 PM, Robert Plummer <
>>> robertleeplummerjr@gmail.com> wrote:
>>>
>>>> If we have to do it ourselves....  We do what we gotta do.
>>>> On Feb 16, 2014 7:58 PM, "Zachary Carter" <zack.carter@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for looking in to it, Robert. Yes, the y flag is the "sticky
>>>>> mode" I mentioned in the other thread. Besides an untriaged Chromium
>>>>> bug[1], I don't see much activity on the part of other engines to support
>>>>> it yet, unfortunately. Opera used to have it before they switched to Blink.
>>>>>
>>>>> Thanks,
>>>>> Zach
>>>>>
>>>>> [1] https://code.google.com/p/chromium/issues/detail?id=142334
>>>>>
>>>>>
>>>>> On Sun, Feb 16, 2014 at 4:28 PM, Robert Plummer <
>>>>> robertleeplummerjr@gmail.com> wrote:
>>>>>
>>>>>> FYI, Doug chimes in on the performance issues dealing with offset.
>>>>>> He is just awesome!
>>>>>> ---------- Forwarded message ----------
>>>>>> From: "Douglas Crockford" <douglas@crockford.com>
>>>>>> Date: Feb 16, 2014 7:23 PM
>>>>>> Subject: Re: High performance regexp using "^" (beginning with) and
>>>>>> RegExlastIndex
>>>>>> To: "Robert Plummer" <robertleeplummerjr@gmail.com>
>>>>>> Cc:
>>>>>>
>>>>>> Mistakes were made.
>>>>>>>
>>>>>>> ES6 regexp may have a /y flag that will use lastIndex correctly.
>>>>>>>
>>>>>>>
>>>>>>> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>>>>>>>
>>>>>>>> Hey Doug,
>>>>>>>> Recently I was using jison (Bison clone in json + javascript) to
>>>>>>>> parse
>>>>>>>> some larger files, about 1.2mb in size.  There is a HUGE performance
>>>>>>>> bottleneck on the use of substr or slice (which are used as matches
>>>>>>>> are
>>>>>>>> found, for 3 attributes, "matched", past tense, and all matches up
>>>>>>>> till
>>>>>>>> now, "match" which is the current match found in the lexer,
>>>>>>>> "_input" the
>>>>>>>> yet unmatched string(s)), and after further research I came to the
>>>>>>>> realization that in javascript, strings are immutable, and every
>>>>>>>> modification to them causes another string to be created in memory.
>>>>>>>> This got me searching for a way to analyse the string input, but by
>>>>>>>> keeping track of where/what has been analysed, so as to offset this
>>>>>>>> performance and resource issue.  In practice this is somewhat very
>>>>>>>> similar to c#'s and java's "StringReader".
>>>>>>>>
>>>>>>>> Then I found RexEx's lastIndex
>>>>>>>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>,
>>>>>>>> thinking
>>>>>>>> this was the answer, I started on an abstraction of the "_input"
>>>>>>>> variable that would track the current position, allow for regexp
>>>>>>>> matching, return strings ONLY when needed, and was considerably more
>>>>>>>> performant.  I succeeded!... or so I thought.   inside of jison all
>>>>>>>> the
>>>>>>>> regular expressions have '^' at the beginning, so they are only
>>>>>>>> looking
>>>>>>>> for the next match, and nothing after that.  What I found is that
>>>>>>>> even
>>>>>>>> when you set lastIndex, it doesn't care about that you are
>>>>>>>> essentially
>>>>>>>> telling it to forget anything before the index, if you specify '^'
>>>>>>>> and
>>>>>>>> the position of _input is greater than 0, you will never get a
>>>>>>>> match!
>>>>>>>> This, to me, is absolutely absurd.  In most other languages I'm
>>>>>>>> dealing
>>>>>>>> with (php, and c# for example) you can tell the regexp to have an
>>>>>>>> offset
>>>>>>>> before the match begins, why is javascript the odd one out?  The
>>>>>>>> savings
>>>>>>>> for this little bitty settings could be massive!  Especially in the
>>>>>>>> context of a parser where string analysis and regex's are used
>>>>>>>> frequently, and potentially with large strings.  The resource
>>>>>>>> savings
>>>>>>>> with a 1.2mb file could literally be gigs of memory!
>>>>>>>>
>>>>>>>> Am I missing something?  Can you elaborate?  What other technique
>>>>>>>> could
>>>>>>>> I use?  Any pointers?
>>>>>>>>
>>>>>>>>
>>>>>>>> As always, thanks for your time, and I hope you have a nice weekend!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Robert Plummer
>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Zach Carter
>>>>>
>>>>
>>>
>>>
>>> --
>>> Robert Plummer
>>>
>>
>>
>>
>> --
>> Zach Carter
>>
>


-- 
Robert Plummer

Re: [jison] Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Paul Harper
Date:
2014-02-17 @ 02:19
You might not have to do it from scratch. Check out
xregexp<http://xregexp.com/>.
I haven't used it, but my tests show that the y flag does *not* work, but
there are plugins and it *may* be possible/simpler to implement this as a
plugin. Here is the code I tried:

XRegExp.exec(' asdf', XRegExp('^asdf'), 1, false); // null
XRegExp('^asdf', 'y'); // SyntaxError: invalid regular expression flag y


On Sun, Feb 16, 2014 at 5:51 PM, Robert Plummer <
robertleeplummerjr@gmail.com> wrote:

> If we have to do it ourselves....  We do what we gotta do.
>  On Feb 16, 2014 7:58 PM, "Zachary Carter" <zack.carter@gmail.com> wrote:
>
>> Thanks for looking in to it, Robert. Yes, the y flag is the "sticky mode"
>> I mentioned in the other thread. Besides an untriaged Chromium bug[1], I
>> don't see much activity on the part of other engines to support it yet,
>> unfortunately. Opera used to have it before they switched to Blink.
>>
>> Thanks,
>> Zach
>>
>> [1] https://code.google.com/p/chromium/issues/detail?id=142334
>>
>>
>> On Sun, Feb 16, 2014 at 4:28 PM, Robert Plummer <
>> robertleeplummerjr@gmail.com> wrote:
>>
>>> FYI, Doug chimes in on the performance issues dealing with offset.  He
>>> is just awesome!
>>> ---------- Forwarded message ----------
>>> From: "Douglas Crockford" <douglas@crockford.com>
>>> Date: Feb 16, 2014 7:23 PM
>>> Subject: Re: High performance regexp using "^" (beginning with) and
>>> RegExlastIndex
>>> To: "Robert Plummer" <robertleeplummerjr@gmail.com>
>>> Cc:
>>>
>>> Mistakes were made.
>>>>
>>>> ES6 regexp may have a /y flag that will use lastIndex correctly.
>>>>
>>>>
>>>> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>>>>
>>>>> Hey Doug,
>>>>> Recently I was using jison (Bison clone in json + javascript) to parse
>>>>> some larger files, about 1.2mb in size.  There is a HUGE performance
>>>>> bottleneck on the use of substr or slice (which are used as matches are
>>>>> found, for 3 attributes, "matched", past tense, and all matches up till
>>>>> now, "match" which is the current match found in the lexer, "_input"
>>>>> the
>>>>> yet unmatched string(s)), and after further research I came to the
>>>>> realization that in javascript, strings are immutable, and every
>>>>> modification to them causes another string to be created in memory.
>>>>> This got me searching for a way to analyse the string input, but by
>>>>> keeping track of where/what has been analysed, so as to offset this
>>>>> performance and resource issue.  In practice this is somewhat very
>>>>> similar to c#'s and java's "StringReader".
>>>>>
>>>>> Then I found RexEx's lastIndex
>>>>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>, thinking
>>>>> this was the answer, I started on an abstraction of the "_input"
>>>>> variable that would track the current position, allow for regexp
>>>>> matching, return strings ONLY when needed, and was considerably more
>>>>> performant.  I succeeded!... or so I thought.   inside of jison all the
>>>>> regular expressions have '^' at the beginning, so they are only looking
>>>>> for the next match, and nothing after that.  What I found is that even
>>>>> when you set lastIndex, it doesn't care about that you are essentially
>>>>> telling it to forget anything before the index, if you specify '^' and
>>>>> the position of _input is greater than 0, you will never get a match!
>>>>> This, to me, is absolutely absurd.  In most other languages I'm dealing
>>>>> with (php, and c# for example) you can tell the regexp to have an
>>>>> offset
>>>>> before the match begins, why is javascript the odd one out?  The
>>>>> savings
>>>>> for this little bitty settings could be massive!  Especially in the
>>>>> context of a parser where string analysis and regex's are used
>>>>> frequently, and potentially with large strings.  The resource savings
>>>>> with a 1.2mb file could literally be gigs of memory!
>>>>>
>>>>> Am I missing something?  Can you elaborate?  What other technique could
>>>>> I use?  Any pointers?
>>>>>
>>>>>
>>>>> As always, thanks for your time, and I hope you have a nice weekend!
>>>>>
>>>>> --
>>>>> Robert Plummer
>>>>>
>>>>
>>
>>
>> --
>> Zach Carter
>>
>

Re: [jison] Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Robert Plummer
Date:
2014-02-17 @ 13:34
Does node.js support it that you know of?


On Sun, Feb 16, 2014 at 7:58 PM, Zachary Carter <zack.carter@gmail.com>wrote:

> Thanks for looking in to it, Robert. Yes, the y flag is the "sticky mode"
> I mentioned in the other thread. Besides an untriaged Chromium bug[1], I
> don't see much activity on the part of other engines to support it yet,
> unfortunately. Opera used to have it before they switched to Blink.
>
> Thanks,
> Zach
>
> [1] https://code.google.com/p/chromium/issues/detail?id=142334
>
>
> On Sun, Feb 16, 2014 at 4:28 PM, Robert Plummer <
> robertleeplummerjr@gmail.com> wrote:
>
>> FYI, Doug chimes in on the performance issues dealing with offset.  He is
>> just awesome!
>> ---------- Forwarded message ----------
>> From: "Douglas Crockford" <douglas@crockford.com>
>> Date: Feb 16, 2014 7:23 PM
>> Subject: Re: High performance regexp using "^" (beginning with) and
>> RegExlastIndex
>> To: "Robert Plummer" <robertleeplummerjr@gmail.com>
>> Cc:
>>
>> Mistakes were made.
>>>
>>> ES6 regexp may have a /y flag that will use lastIndex correctly.
>>>
>>>
>>> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>>>
>>>> Hey Doug,
>>>> Recently I was using jison (Bison clone in json + javascript) to parse
>>>> some larger files, about 1.2mb in size.  There is a HUGE performance
>>>> bottleneck on the use of substr or slice (which are used as matches are
>>>> found, for 3 attributes, "matched", past tense, and all matches up till
>>>> now, "match" which is the current match found in the lexer, "_input" the
>>>> yet unmatched string(s)), and after further research I came to the
>>>> realization that in javascript, strings are immutable, and every
>>>> modification to them causes another string to be created in memory.
>>>> This got me searching for a way to analyse the string input, but by
>>>> keeping track of where/what has been analysed, so as to offset this
>>>> performance and resource issue.  In practice this is somewhat very
>>>> similar to c#'s and java's "StringReader".
>>>>
>>>> Then I found RexEx's lastIndex
>>>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>, thinking
>>>> this was the answer, I started on an abstraction of the "_input"
>>>> variable that would track the current position, allow for regexp
>>>> matching, return strings ONLY when needed, and was considerably more
>>>> performant.  I succeeded!... or so I thought.   inside of jison all the
>>>> regular expressions have '^' at the beginning, so they are only looking
>>>> for the next match, and nothing after that.  What I found is that even
>>>> when you set lastIndex, it doesn't care about that you are essentially
>>>> telling it to forget anything before the index, if you specify '^' and
>>>> the position of _input is greater than 0, you will never get a match!
>>>> This, to me, is absolutely absurd.  In most other languages I'm dealing
>>>> with (php, and c# for example) you can tell the regexp to have an offset
>>>> before the match begins, why is javascript the odd one out?  The savings
>>>> for this little bitty settings could be massive!  Especially in the
>>>> context of a parser where string analysis and regex's are used
>>>> frequently, and potentially with large strings.  The resource savings
>>>> with a 1.2mb file could literally be gigs of memory!
>>>>
>>>> Am I missing something?  Can you elaborate?  What other technique could
>>>> I use?  Any pointers?
>>>>
>>>>
>>>> As always, thanks for your time, and I hope you have a nice weekend!
>>>>
>>>> --
>>>> Robert Plummer
>>>>
>>>
>
>
> --
> Zach Carter
>



-- 
Robert Plummer

Re: [jison] Fwd: Re: High performance regexp using "^" (beginning with) and RegExlastIndex

From:
Paul Harper
Date:
2014-02-17 @ 22:43
It doesn't seem to. I just started an experiment to port pcre to
javascript<https://github.com/benekastah/empcre>.
It's compiling via emscripten and basically working. I have to figure out
how to access my RegExpMatch struct in js and add features, but it should
work. Although I just realized I didn't check if it has the feature we
need... I'm pretty sure it does. I'm having fun, anyway.


On Mon, Feb 17, 2014 at 5:34 AM, Robert Plummer <
robertleeplummerjr@gmail.com> wrote:

> Does node.js support it that you know of?
>
>
> On Sun, Feb 16, 2014 at 7:58 PM, Zachary Carter <zack.carter@gmail.com>wrote:
>
>> Thanks for looking in to it, Robert. Yes, the y flag is the "sticky mode"
>> I mentioned in the other thread. Besides an untriaged Chromium bug[1], I
>> don't see much activity on the part of other engines to support it yet,
>> unfortunately. Opera used to have it before they switched to Blink.
>>
>> Thanks,
>> Zach
>>
>> [1] https://code.google.com/p/chromium/issues/detail?id=142334
>>
>>
>> On Sun, Feb 16, 2014 at 4:28 PM, Robert Plummer <
>> robertleeplummerjr@gmail.com> wrote:
>>
>>> FYI, Doug chimes in on the performance issues dealing with offset.  He
>>> is just awesome!
>>> ---------- Forwarded message ----------
>>> From: "Douglas Crockford" <douglas@crockford.com>
>>> Date: Feb 16, 2014 7:23 PM
>>> Subject: Re: High performance regexp using "^" (beginning with) and
>>> RegExlastIndex
>>> To: "Robert Plummer" <robertleeplummerjr@gmail.com>
>>> Cc:
>>>
>>> Mistakes were made.
>>>>
>>>> ES6 regexp may have a /y flag that will use lastIndex correctly.
>>>>
>>>>
>>>> On 2014-02-16 10:58 AM, Robert Plummer wrote:
>>>>
>>>>> Hey Doug,
>>>>> Recently I was using jison (Bison clone in json + javascript) to parse
>>>>> some larger files, about 1.2mb in size.  There is a HUGE performance
>>>>> bottleneck on the use of substr or slice (which are used as matches are
>>>>> found, for 3 attributes, "matched", past tense, and all matches up till
>>>>> now, "match" which is the current match found in the lexer, "_input"
>>>>> the
>>>>> yet unmatched string(s)), and after further research I came to the
>>>>> realization that in javascript, strings are immutable, and every
>>>>> modification to them causes another string to be created in memory.
>>>>> This got me searching for a way to analyse the string input, but by
>>>>> keeping track of where/what has been analysed, so as to offset this
>>>>> performance and resource issue.  In practice this is somewhat very
>>>>> similar to c#'s and java's "StringReader".
>>>>>
>>>>> Then I found RexEx's lastIndex
>>>>> <http://www.w3schools.com/jsref/jsref_regexp_lastindex.asp>, thinking
>>>>> this was the answer, I started on an abstraction of the "_input"
>>>>> variable that would track the current position, allow for regexp
>>>>> matching, return strings ONLY when needed, and was considerably more
>>>>> performant.  I succeeded!... or so I thought.   inside of jison all the
>>>>> regular expressions have '^' at the beginning, so they are only looking
>>>>> for the next match, and nothing after that.  What I found is that even
>>>>> when you set lastIndex, it doesn't care about that you are essentially
>>>>> telling it to forget anything before the index, if you specify '^' and
>>>>> the position of _input is greater than 0, you will never get a match!
>>>>> This, to me, is absolutely absurd.  In most other languages I'm dealing
>>>>> with (php, and c# for example) you can tell the regexp to have an
>>>>> offset
>>>>> before the match begins, why is javascript the odd one out?  The
>>>>> savings
>>>>> for this little bitty settings could be massive!  Especially in the
>>>>> context of a parser where string analysis and regex's are used
>>>>> frequently, and potentially with large strings.  The resource savings
>>>>> with a 1.2mb file could literally be gigs of memory!
>>>>>
>>>>> Am I missing something?  Can you elaborate?  What other technique could
>>>>> I use?  Any pointers?
>>>>>
>>>>>
>>>>> As always, thanks for your time, and I hope you have a nice weekend!
>>>>>
>>>>> --
>>>>> Robert Plummer
>>>>>
>>>>
>>
>>
>> --
>> Zach Carter
>>
>
>
>
> --
> Robert Plummer
>