librelist archives

parsing large input

From: Melissa Whittington
Date: 2011-11-21 @ 18:00
I have a fairly complex parslet parser that has been working great so
far. Input contains multiple "documents" and each document has
multiple sections with multiple subsections etc., so it's fairly
nested.

I'm having an issue now with input that is very large, specifically a
document has a section with 5000+ subsections. The total input has
over 500k lines!

My problem is that parslet consumes all the RAM/CPU available, so I'm
looking for a way to parse large input without killing the machine it
is running on. I can split up the input into subsections and parse
each subsection separately, and that sort of works but I'd rather not
do that if I don't have to.

I'm assuming that parslet is holding on to information so that it can
backtrack when it doesn't match a rule. Does anyone have advice on how
to structure rules or hint to parslet that it doesn't need to
backtrack after a certain point and free up the information it's
saving?

-mj

Re: parsing large input

From: Kaspar Schiess
Date: 2011-11-23 @ 08:46
> I'm having an issue now with input that is very large, specifically a
> document has a section with 5000+ subsections. The total input has
> over 500k lines!
>
> My problem is that parslet consumes all the RAM/CPU available  [...]
>
> I'm assuming that parslet is holding on to information so that it can
> backtrack when it doesn't match a rule. [...]

Hei Melissa,

I was kind of wondering when the first request like this would come 
along. Parslet will currently run into trouble with large inputs. Part 
of that is by design: returning one large result at the end of a lengthy 
parsing process means that parslet has to hold on to all of it for the 
whole time.

But let's make sure that we've really run into this limitation. Is it 
possible to provide us with your parser for analysis? Can you be more 
specific (how much RAM, for instance)? How big is your document really? 
And is the information really so structured that it cannot be parsed 
with a simpler approach, like regexps? (The right tool for the right 
job...) If you convert the parslet parser into a treetop parser, does 
that one manage to complete?

That it would consume 'all your CPU' is quite natural - that's called 
work ;)

Also, you are hinting that your document is really just a sequence of 
sections that you can easily split off. That leaves me with the
impression that there is really a top layer of meaning in your document 
that can be extracted using a simple str.split. If that solves your 
trouble, why look further?

regards,
kaspar

Re: parsing large input

From: Kaspar Schiess
Date: 2011-11-25 @ 09:15
Hei Melissa,

In short: Please, give us something to work with, we'll try to improve!

kaspar

Re: [ruby.parslet] Re: parsing large input

From: Melissa Whittington
Date: 2011-12-01 @ 19:35
Sorry, I was out on vacation for a while. Back now. :)

Input of 15 MB: 4 minutes of runtime, and it's using 1.9 GB of RAM
(only 2 GB total).
Input of 1 MB: used 1.1 GB of RAM and took 3:21 to complete.
Input of 0.25 MB: used 308 MB of RAM and took a little over 22 seconds
to complete.
I have not tried treetop.

The parser is for EDI files, specifically EDI 837 for now. Input looks
like this:

ST*837*987654~BHT*0019*00*A12345*20010801*1800~NM1*41*2*TOO GOOD
HOSPITAL*****46*999008888~NM1*40*2*MY STATE DATA
AGENCY*****46*12000~HL*1**20*1~NM1*85*2*TOO GOOD
HOSPITAL*****24*999008888~REF*1J*898989~HL*2*1*22*1~SBR*P********BL~
NM1*IL*1*GREENE*SCOTT*A**MI*GRNESSC1234~N3*1313 MOCKINGBIRD
LANE~N4*ANYTOWN*NY*09090~DMG*D8*19760706*M**::RET:3::RET:2~REF*SY*130281234~

There are three delimiters: ~, *, and :.
~ is the segment delimiter.
* is the field delimiter within a segment.
: is the subfield delimiter within a field.
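
(Structurally, that means a segment decomposes with nothing more than
nested splits - for example:

  seg = 'DMG*D8*19760706*M**::RET:3::RET:2'
  header, *fields = seg.split('*')            # header => "DMG"
  subfields = fields.map { |f| f.split(':') } # last field has subfields

though a plain split like that doesn't give me the hierarchy I need.)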

(For reference, the 15 MB input has 643k segments, the 1 MB input has
44k segments, and the 0.25 MB input has 11k segments.)

The first field of a segment is the header that says what kind of segment it is.
Segments can be grouped together in loops that can be repeated.

The hierarchy of input goes like this:

Document 1
- Group 1a
- - Group 2a
- - - Group 3a
- - - - Group 4a
- - - - Group 4b
- - - Group 3b
- - - - Group 4a
- - - - Group 4b
- - Group 2b
... etc.
- Group 1b
.. etc.
Document 2
... etc.

Each group can be repeated any number of times below its parent group.
Each group has a specification of certain beginning segments, loops of
segments, sub groups, and ending segments.
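
In parslet terms, each group rule ends up shaped roughly like this (a
sketch - group3_begin, group3_loops and friends are stand-ins for my
real segment and group rules):

  rule(:group3) do
    group3_begin >>        # the segment(s) that open this group
    group3_loops >>        # loops of segments belonging to this group
    group4.repeat >>       # sub groups, repeated any number of times
    group3_end.maybe       # ending segments, when present
  end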

It's positively atrocious. :)

The reason I am attempting to use parslet is:

1) Need the output to be in a hierarchical hash that matches the input
hierarchy. - Parslet output is this already!
2) Need to be able to convert every instance of certain segments into
certain formats - Transform works great! (see the sketch after this list)
3) Need to be able to handle dirty input that does not follow the
spec. Other solutions out there for these file types either require
that data follow the specification and proper segment order or are
cumbersome to customize.
4) Needs to be in Ruby.
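
To illustrate (2), the kind of transform rule I mean - a simplified
sketch, not one of my actual rules:

  class SegmentCleanup < Parslet::Transform
    # e.g. reduce a captured field down to a plain string
    rule(:field => simple(:f)) { f.to_s }
  end

  # applied to the parse result: SegmentCleanup.new.apply(tree)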

So I wrote rules that define fields and segments in general, and used
those to define a rule for each kind of segment.
Using those segment rules, I can quite nicely define a parser for EDI
837 input. A rule looks like this:

  rule(:entity) do
    nm1.as(:_nm1).as(:name) >>
    address.as(:address).maybe >>
    dates.maybe >>
    dmg.as(:_dmg).as(:demographics).maybe >>
    prv.as(:_prv).as(:speciality).maybe >>
    ref.as(:_ref).repeat.as(:_merge).as(:reference).maybe >>
    per.as(:_per).repeat.as(:contact).maybe
  end

address/dates are rules for groups of segments.
nm1, dmg, prv, ref, per are all specific segment rules, which are
defined like this:

rule(:nm1)    { str('NM1').as(:id) >> fields }
rule(:fields) { field.repeat.as(:fields) >> segment_delimiter }
rule(:field)  { field_delimiter >> data.repeat(0, nil).as(:field) }
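
For completeness, the delimiter and data rules those build on are
nothing special - roughly this:

rule(:segment_delimiter)  { str('~') }
rule(:field_delimiter)    { str('*') }
rule(:subfield_delimiter) { str(':') }
# data: a run of anything that isn't one of the three delimiters
rule(:data) { match['^~*:'].repeat(1) }

(The real rules handle subfields and a few edge cases, but that's the
idea.)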

Everything except the first segment in a rule is .maybe because it
might not be there.
When I need to handle a segment that doesn't follow the spec, I can
write a new :entity rule (for example) that includes a different set
of segments, and ta-da! Working parser!

By splitting the input, parsing it one top-level group at a time, and
sort of building the hash hierarchy myself, I can keep the memory usage
down.
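
The splitting itself is nothing fancy - roughly this, where group_start?
and parse_group are stand-ins for my real header check and parser call:

  chunks  = []
  current = []
  input.split('~').each do |segment|
    # start a new chunk whenever a top-level group begins
    if group_start?(segment) && !current.empty?
      chunks << current.join('~') + '~'
      current = []
    end
    current << segment
  end
  chunks << current.join('~') + '~' unless current.empty?

  trees = chunks.map { |chunk| parse_group(chunk) }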

I am trying to find ways that maybe I can write the rules better, but
I might just have to parse smaller bits of it at a time. Any tips or
ideas you have, I'll try.

Hope I didn't bore anyone too much. :)

-mj

On Fri, Nov 25, 2011 at 4:15 AM, Kaspar Schiess <eule@space.ch> wrote:
> Hei Melissa,
>
> In short: Please, give us something to work with, we'll try to improve!
>
> kaspar
>

Re: parsing large input

From: Kaspar Schiess
Date: 2011-12-07 @ 09:02
Hei Melissa,

> Input of 15 MB: 4 minutes of runtime, and it's using 1.9 GB of RAM
> (only 2 GB total).
> Input of 1 MB: used 1.1 GB of RAM and took 3:21 to complete.
> Input of 0.25 MB: used 308 MB of RAM and took a little over 22 seconds
> to complete.
> I have not tried treetop.

These numbers just about add up - it looks as if you have hit real
limitations of the current parslet release. Parslet currently keeps a
few things around that might be adding up to that memory consumption:

   * an unreduced parse result (containing all the itsy bits, even those
you did not capture with .as())
   * a packrat cache mapping positions in the file to parse results
(this alone will probably be around 300 MB in your case)
   * a line cache mapping ranges in the input to line numbers

In addition to that, parslet generates a lot of transient objects
(errors, result objects, etc.), putting load on the GC.

Two initial remarks to the above facts:

   * Parslet is currently really made for parsing source code: 
relatively small input files with a high amount of complexity. 
Implementation choices that derive from that make it unsuitable for 
parsing large files of low complexity.

   * Choice of Ruby implementation affects parslet heavily. You might 
experiment with other implementations (REE?).

This isn't to say that I don't want to lift these limitations sometime
in the future; it's just hard to give a schedule for it.

> The parser is for EDI files, specifically EDI 837 for now. Input looks
> like this:
> [... deletia ...]
>
> It's positively atrocious. :)
>
> The reason I am attempting to use parslet is:
>
> 1) Need the output to be in a hierarchical hash that matches the input
> hierarchy. - Parslet output is this already!
> 2) Need to be able to convert every instance of certain segments into
> certain formats - Transform works great!
> 3) Need to be able to handle dirty input that does not follow the
> spec. Other solutions out there for these file types either require
> that data follow the specification and proper segment order or are
> cumbersome to customize.
> 4) Needs to be in Ruby.

If it has to be parslet, you might want to throw hardware at this 
problem right now. When I look at your grammar specification, though, I 
notice that you might be better off with a hand-written parser; your 
grammar isn't very complex (it's akin to the simpler parts of XML) and 
you mention wanting to parse degenerate input - something parslet will 
fail horribly at.

> By splitting the input, parsing it one top-level group at a time, and
> sort of building the hash hierarchy myself, I can keep the memory usage
> down.
This sounds like another good practical solution to me.

> I am trying to find ways that maybe I can write the rules better, but
> I might just have to parse smaller bits of it at a time. Any tips or
> ideas you have, I'll try.
I am not convinced that there is a lot you can do. For an input/parser 
combination like this:

   'abc'
   str('a') >> str('b').as(:b) >> str('c')

parslet will first build this internally:

   ['a'@0, {:b => 'b'@1}, 'c'@2]

which will then at the very end be reduced to:

   {:b => 'b'@1}

So jiggling the rules will probably not affect memory footprint during 
parsing.

regards,
kaspar

Re: parsing large input

From: Kaspar Schiess
Date: 2011-12-07 @ 09:08
>> By splitting the input, parsing it one top-level group at a time, and
>> sort of building the hash hierarchy myself, I can keep the memory
>> usage down.
> This sounds like another good practical solution to me.

Another reason why this appeals to me is that you could split the file 
up and then run parslet on each segment in multiple processes using 
something like 'procrastinate'. This would distribute the parsing over 
multiple cores, giving you a parallel speedup!
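
Sketched with plain Process.fork (Unix only) rather than procrastinate,
with parse_chunk standing in for your parser plus transform:

  chunks.each_with_index do |chunk, i|
    fork do
      result = parse_chunk(chunk)
      # hand the result back to the parent via a file
      # (assumes the reduced result marshals cleanly)
      File.open("chunk-#{i}.dump", 'wb') { |f| f.write(Marshal.dump(result)) }
    end
  end
  Process.waitall   # then merge the dumps in the parent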

k

Re: [ruby.parslet] parsing large input

From: Chris Corbyn
Date: 2011-12-01 @ 22:23
Hi Melissa,

So is this example one "document", where "ST" is always the header that
starts the document and each subsequent header inside it starts a new
group within the document? Presumably the parts with long strings of
***** mean there are empty fields in that segment. I'm trying to get my
head around how that hierarchy looks in practice. Can you add whitespace
to the actual input, for display purposes only, to indicate where the
grouping and nesting occurs in the excerpt you posted?

Cheers,

Chris


On 02/12/2011, at 6:35 AM, Melissa Whittington wrote:

> The parser is for EDI files, specifically EDI 837 for now. Input looks
> like this:
>
> [... deletia ...]

Re: [ruby.parslet] parsing large input

From: Melissa Whittington
Date: 2011-12-01 @ 23:07
Correct about strings of ***** being a bunch of empty fields.

There is always a certain segment that begins a new level, or group of segments.
Here are all the important segment names and their hierarchy:
https://gist.github.com/1420503

The example I pasted is actually the beginning of a transaction, which
is already a few nested levels deep.
Here I've structured the example from my previous email in a similar way:
https://gist.github.com/1420518

-mj

On Thu, Dec 1, 2011 at 5:23 PM, Chris Corbyn <chris@w3style.co.uk> wrote:
> Hi Melissa,
>
> So is this example one "document", where "ST" is always the header that
> starts the document and each subsequent header inside it starts a new
> group within the document? Presumably the parts with long strings of
> ***** mean there are empty fields in that segment. I'm trying to get my
> head around how that hierarchy looks in practice. Can you add whitespace
> to the actual input, for display purposes only, to indicate where the
> grouping and nesting occurs in the excerpt you posted?
>
> Cheers,
>
> Chris