Loofah and multi-pass tree-walking vs. one-pass SAX filtering

From:
Andrew S. Townley
Date:
2011-03-21 @ 12:04
Hi,

I found Loofah last week, and I think it's pretty cool since I'm already 
using both Nokogiri and libxml2 directly in my projects.  However, the 
"chaining scrubbers" section made me a little nervous when I first read 
it, and when I looked at the code, it made me a bit more nervous than I 
was before.  My use case is running Loofah on generally small-to-medium-sized 
user input that can contain HTML; however, there are times when I want 
to apply chained filters (e.g. nofollow) and times when I don't.
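
Concretely, I mean usage along these lines (these are Loofah's documented 
built-in scrubbers; the condition and variable names are just illustrative):

  require 'loofah'

  doc = Loofah.fragment(user_input)
  doc.scrub!(:strip)                        # always sanitize the markup
  doc.scrub!(:nofollow) if external_author  # ...but only sometimes chain nofollow
  clean_html = doc.to_s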

From the code, it seems like one of the major advantages of using libxml -
the fast, one-pass SAX parsing - isn't being leveraged by Loofah.  Maybe 
the performance advantage would be minimal, but since you're either 
ignoring or escaping offending nodes, I would certainly think this would be 
better done with a scrubber chain applied in a SAX parser event handler 
rather than requiring a pass over the whole document for each scrubber.

I was just wondering if you'd looked at this approach and rejected it for 
some reason.  I generally do SAX-based parsing of XML whenever possible, 
and even wrote some XPath-based node selection code that reuses the XPath 
parts of REXML over a light-weight node implementation on top of a libxml 
SAX parser, so that you can still operate selectively on tree nodes.  In 
the next while, I'm going to publish this code along with something else, 
but I bring it up to highlight that it is possible to do a lot with SAX 
directly without using Nokogiri.

Right now, it seems Loofah does a parse (libxml+Nokogiri), followed by 
tree-walk(s) to scrub, followed by another tree-walk to generate the 
resulting text (admittedly not technically part of Loofah).  My suspicion 
is that the minimum of three passes currently required is why Loofah 
doesn't outperform other options for small documents.  If it were treated 
as a SAX-based, one-pass transformation using a scrubber chain that 
directly generated the clean output (even where scrubbers could be applied 
based on XPath node selection), I think you could have the same 
functionality in ~1/2 to 1/3 of the time for any size of XML document.
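
To make the idea concrete, here's a rough sketch using Nokogiri's SAX API.
The filter protocol and all the names are mine, not Loofah's; it assumes 
attributes arrive as [name, value] pairs, that dirty_html is your input 
string, and it ignores void elements, comments, CDATA, etc.:

  require 'nokogiri'
  require 'cgi'

  class ScrubbingEchoHandler < Nokogiri::XML::SAX::Document
    attr_reader :out

    def initialize(filters)
      @filters    = filters     # each responds to call(name, attrs) and
      @out        = String.new  # returns [name, attrs]; a nil name drops the node
      @skip_depth = 0           # > 0 while inside a dropped element's subtree
    end

    def start_element(name, attrs = [])
      return @skip_depth += 1 if @skip_depth > 0
      @filters.each do |f|
        name, attrs = f.call(name, attrs)
        break if name.nil?
      end
      return @skip_depth = 1 if name.nil?   # drop this element and its subtree
      pairs = attrs.map { |k, v| %( #{k}="#{CGI.escapeHTML(v.to_s)}") }.join
      @out << "<#{name}#{pairs}>"
    end

    def end_element(name)
      return @skip_depth -= 1 if @skip_depth > 0
      @out << "</#{name}>"
    end

    def characters(text)
      @out << CGI.escapeHTML(text) if @skip_depth.zero?
    end
  end

  # one link-scrubbing filter in the chain: force rel="nofollow" on <a>
  nofollow = lambda do |name, attrs|
    attrs = attrs.reject { |k, _| k == "rel" } << ["rel", "nofollow"] if name == "a"
    [name, attrs]
  end

  handler = ScrubbingEchoHandler.new([nofollow])
  Nokogiri::HTML::SAX::Parser.new(handler).parse(dirty_html)
  clean_html = handler.out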

Interested in any thoughts on the above, and look forward to using Loofah 
for a long time. :)

Cheers,

ast
--
Andrew S. Townley <ast@atownley.org>
http://atownley.org

Re: [loofah] Loofah and multi-pass tree-walking vs. one-pass SAX filtering

From:
Mike Dalessio
Date:
2011-03-21 @ 13:27
On Mon, Mar 21, 2011 at 8:04 AM, Andrew S. Townley <ast@atownley.org> wrote:

> Hi,
>
> I found Loofah last week, and I think it's pretty cool since I'm already
> using both Nokogiri and libxml2 directly in my projects.  However, the
> "chaining scrubbers" section made me a little nervous when I first read it,
> and when I looked at the code, it made me a bit more nervous than I was
> before.  My use case is running Loofah on generally small-to-medium-sized
> user input that can contain HTML; however, there are times when I want to
> apply chained filters (e.g. nofollow) and times when I don't.
>
> From the code, it seems like one of the major advantages of using libxml -
> the fast, one-pass SAX parsing - isn't being leveraged by Loofah.


You're correct. Loofah uses the Nokogiri DOM parser and loads the entire
document into memory before starting processing/sanitizing/stripping.


> Maybe the performance advantage would be minimal, but since you're either
> ignoring or escaping offending nodes, I would certainly think this would be
> better done with a scrubber chain applied in a SAX parser event handler
> rather than requiring a pass over the whole document for each scrubber.


> I was just wondering if you'd looked at this approach and rejected it for
> some reason.


This is partly because a fully-parsed document is much more functional than
what either the SAX parser or the Reader parser gives you (e.g., you can
access node.ancestors, or node.document, etc.; and you have the flexibility
to traverse bottom-up/depth-first).
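
For instance, a custom scrubber along these lines (the policy here is
invented for illustration) leans on exactly that:

  # strip <span> tags that aren't inside a <p>; node.ancestors isn't
  # available in a SAX callback without tracking that state yourself
  scrubber = Loofah::Scrubber.new(:direction => :bottom_up) do |node|
    if node.name == "span" && node.ancestors.none? { |a| a.name == "p" }
      node.remove
    end
  end
  Loofah.fragment(html).scrub!(scrubber).to_s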

However, the primary reason is that the SAX and Reader parsers do not
allow modification of the document; in fact, there is no document with
these parsers; there is only the context node. Loofah would have to
construct the document on the fly from the SAX callbacks, which is harder
than it sounds (this is how Nokogiri used to parse fragments, and it was
always buggy) and much, much slower than letting libxml2 do that parsing
once, up-front.

Plus, implementing both document construction as well as some sort of
chaining mechanism (I'm imagining something along the lines of ARel's scope
proxies to defer execution until evaluation?) for multiple scrubbers sounds
hard. Complex, and really hard.

And again, SAX is slower than (and equally memory-inefficient as) the DOM
parser if you are building a fully-constructed document (which Loofah does),
assuming your SAX parser is implemented in Ruby and that most of the nodes
in the document are being kept (not removed). And the code would be more
complicated and harder to maintain.


> I generally do SAX-based parsing of XML whenever possible, and even wrote
> some XPath-based node selection code that reuses the XPath parts of REXML
> over a light-weight node implementation on top of a libxml SAX parser, so
> that you can still operate selectively on tree nodes.  In the next while,
> I'm going to publish this code along with something else, but I bring it up
> to highlight that it is possible to do a lot with SAX directly without
> using Nokogiri.
>
> Right now, it seems Loofah does a parse (libxml+Nokogiri), followed by
>
> tree-walk(s) to scrub, followed by another tree-walk to generate the
> resulting text (admittedly not technically part of Loofah).  My suspicion is
> that the minimum of three passes currently required is why Loofah doesn't
> outperform other options for small documents.


No, the benchmarks are showing that libxml2's DOM parser has constant
startup overhead, independent of document size. Traversing the document is
fast, as shown by Loofah's beating all comers in benchmarks on large
documents.


> If it were treated as a SAX-based, one-pass transformation using a scrubber
> chain that directly generated the clean output (even where scrubbers could
> be applied based on XPath node selection), I think you could have the same
> functionality in ~1/2 to 1/3 of the time for any size of XML document.
>

If you have concrete ideas on how to implement this in a performant manner,
I'd love to see the code. However, my suspicion is that such a change would
require significant changes to the Loofah API (limiting what node attributes
can be accessed), would be significantly more complex in implementation, and
would be of only questionable performance benefit.

Loofah's sweet spot is that it is a) flexible and general-purpose, b) just
fast enough to be used in interesting places, and c) obvious how to use. And
in my imagination, conversion to a SAX parser would require API changes that
adversely affect (a) and (c), without necessarily improving (b) in a
compelling way.

Honestly, if Loofah can sanitize a small snippet of markup (~tens of bytes)
in 0.4 milliseconds on my three-year-old laptop, how much faster would it
have to be to justify the additional complexity? IMHO, 0.4 ms is plenty fast
enough for the majority of places in which it would be useful. Anybody who
needs more performance always has the option of using Nokogiri's SAX or
Reader APIs directly, or of implementing a solution in C using libxml2's API.

I'm not trying to shut this idea down, since I'm obviously interested in
creative solutions to boost performance. However, I'm not optimistic. That
said, I'd love for you to show me I'm wrong!

Thanks for using Loofah, and cheers!
-mike

---
mike dalessio / @flavorjones

Re: [loofah] Loofah and multi-pass tree-walking vs. one-pass SAX filtering

From:
Andrew S. Townley
Date:
2011-03-21 @ 14:15
Hi Mike,

All valid points. :)

In my case, I don't care about the document in 99% of the cases, so the 
only thing I need to do is take one string, process it, and convert it to 
another string.  Doing this with a SAX-based echo handler, chaining the 
scrubbers as a compound Command Object invoked at the appropriate point, 
just seems simpler in this case.
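
In code, the compound command is just (names mine):

  class ScrubberChain
    def initialize(*scrubbers)
      @scrubbers = scrubbers
    end

    # the SAX handler calls this once per event; each scrubber in the
    # chain gets a chance to transform (or veto) the element
    def call(name, attrs)
      @scrubbers.each do |s|
        name, attrs = s.call(name, attrs)
        break if name.nil?
      end
      [name, attrs]
    end
  end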

However, I agree that if you're maintaining a bunch of state (which I 
wasn't intending to do) and/or want to use the parsed document for other 
things, then that's a whole different deal.

As I said, I've done some similar stuff before when handling XML, so if I 
get a chance over the next while, I'll have a look.  The one that really 
gets me is having to walk the tree again for nofollow when you've already 
walked it once.  The others are mutually exclusive, so this isn't a 
concern.  For small trees, and in the C-land where libxml lives, I agree 
the tree traversal is probably fast enough for what I need to do.

Thanks for the rationale; I'll post something if I have a go at doing
something like this.

Cheers,

ast

--
Andrew S. Townley <ast@atownley.org>
http://atownley.org