Parsing the NCBI Genetic Code Table

From: Stefan Rohlfing
Date: 2011-08-08 @ 04:21

Hi,

I am trying to parse the NCBI genetic code table
<ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt>:

https://github.com/bytesource/CodonTableParser/blob/master/data/codons.txt

to extract those lines of each block that contain either "name", "id",
"ncbieaa", or "sncbieaa".

As each line either contains the content I am interested in or text that can
be discarded, I started by first parsing the document on a per-line basis:

https://github.com/bytesource/CodonTableParser/blob/master/parser.rb

Unfortunately, parsing the file resulted in an error message that tells me
Parslet failed to parse line 233, which is the very last line of the file:

Expected at least 1 of LINE NEWLINE at line 1 char 1.
`- Expected at least 1 of LINE NEWLINE at line 1 char 1.
   `- Failed to match sequence (LINE NEWLINE) at line 233 char 1.
      `- Failed to match sequence (LF CR?) at line 233 char 1.
         `- Premature end of input at line 233 char 1.

However, apart from knowing where the problem is located, I have difficulty
finding out where my code went wrong.

I already read Parslet's documentation without finding a solution, so now I
hope someone on this list might help me with my problem.

On a side note, I am often not sure when to use 'repeat(1)' instead of just
'repeat'. I know the latter repeats the rule zero or more times, but how do I
decide when zero is enough? Is there a rule to follow?

Thanks again in advance!

Stefan

Re: [ruby.parslet] Parsing the NCBI Genetic Code Table

From: Melissa Whittington
Date: 2011-08-08 @ 11:47

Stefan,

You're getting that error on the last line because there is no newline
at the end of the last line, so just switch the rule to
'newline.maybe'.

Your :line rule also does not need the .repeat because there will only
be one of either a :codon or a :comment and not more. The :line rule's
repeat is what is describing multiple lines.

Also, I don't know what "repeat(1)" by itself does, but you probably
don't mean that?

Don't forget that 'any' only matches one character. You should probably
not use 'any', either. Your :content and :no_value rules should be
matching everything on a line (sans a possible newline). You could use
'any.repeat' to parse the rest of the line, but it will try to match
*anything*, including newlines, and run on into the following lines,
which is not what you want.

So, it'll probably be helpful to be a little more descriptive.

Hope that helps you make a little more progress!

-mj

Re: [ruby.parslet] Parsing the NCBI Genetic Code Table

From: Melissa Whittington
Date: 2011-08-08 @ 11:49

Whoops, I meant "The :file rule's repeat is what is describing multiple lines."

-mj

Re: [ruby.parslet] Parsing the NCBI Genetic Code Table

From: Stefan Rohlfing
Date: 2011-08-09 @ 08:57

Melissa,

Thanks for your help!

However, after fixing the problems you pointed out, I got stuck again:

https://github.com/bytesource/CodonTableParser/blob/master/parser.rb

I am realizing that I am more or less relying on trial and error here. In
other words, I still lack the knowledge needed to translate a document into
its Backus-Naur form, which I can then feed to the parser (Parslet).

As I have no background in computer science, I would be interested in any
resources (printed or online) you have found valuable in laying the basis
for building a parser. This question is for everyone, as I am always
interested in different opinions.

Stefan


Re: [ruby.parslet] Parsing the NCBI Genetic Code Table

From: Melissa Whittington
Date: 2011-08-09 @ 13:06

Stefan,

Ah! I missed one important mistake that I've easily made myself
before. You can't use 'match' to match multiple characters; the
regular expression can only match one character at a time. I find that
slightly unintuitive, and it gives no warning if you try to do this.

I tried this:
  rule(:content)  { str('  id ') >> match('\d').repeat >> textdata.repeat }
  rule(:no_value) { textdata.repeat(1) }

Because it tries to match :content first, it will only match :no_value
if it didn't match :content. That matched all the lines with "id".

For me, learning Parslet has been fairly trial and error too. And
Google thinks 'parsley' is a much better word to search for than
'parslet', heh.

-mj

Re: [ruby.parslet] Parsing the NCBI Genetic Code Table

From: Stefan Rohlfing
Date: 2011-08-10 @ 08:39

Melissa,

I really thought 'match' would take any regular expression, but I looked it
up and you are right:

# Returns an atom matching a character class. All regular expressions can be
# used, as long as they match only a single character at a time.

With this information I got :id to work, but got stuck again at :name. A
:name value sometimes spans two lines, but the document had already been
split into lines at each newline.

I then tried to first split the document into blocks before and after a
parenthesis, but did not succeed. However, I will try to solve this problem
in the next few days.

Thanks again for your help

Stefan



