Hi, I found Loofah last week, and I think it's pretty cool since I'm already using both Nokogiri and libxml2 directly in my projects. However, the "chaining scrubbers" section made me a little nervous when I first read it, but when I looked at the code, it makes me a bit more nervous than I was before. My use-case is to use Loofah on generally small to medium sized user input that can take HTML, however there are times when I want to apply chained filters, e.g. nofollow and times when I don't. From the code, it seems like one of the major advantages of using libxml - the fast, one-pass SAX parsing - isn't being leveraged by Loofah. Maybe the performance advantage would be minimal, but I would certainly think since you're either ignoring or escaping offending nodes, this would better be done by creating a scrubber chain which was applied in a SAX parser event handler rather than requiring multiple passes of the whole document for each scrubber. I was just wondering if you'd looked at this approach and rejected it for some reason. I generally do SAX-based parsing of XML whenever possible, and even wrote some XPath-based node selection code based on using the XPath parts of REXML and using it directly with a light-weight node implementation over a libxml SAX parser so that you could still only selectively operate on tree nodes. In the next while, I'm going to publish this code along with something else, but I bring it up to highlight that it is possible to do a lot with SAX directly without using Nokogiri. Right now, it seems Loofah's doing a parse (libxml+Nokogiri) followed by tree-walk(s) to scrub followed by another tree-walk to generate the resulting text (admittedly not technically part of Loofah). My suspicion is the minimum of 3-passes currently required is why Loofah doesn't outperform other options for small document. If it was treated as a SAX-based, one-pass transformation using a scrubber chain that would directly generate the clean output (even where scrubbers could be applied based on XPath node selection), I think you could have the same functionality in ~1/2 to 1/3 of the time for any size XML document. Interested in any thoughts on the above, and look forward to using Loofah for a long time. :) Cheers, ast -- Andrew S. Townley <ast@atownley.org> http://atownley.org
On Mon, Mar 21, 2011 at 8:04 AM, Andrew S. Townley <ast@atownley.org> wrote: > Hi, > > I found Loofah last week, and I think it's pretty cool since I'm already > using both Nokogiri and libxml2 directly in my projects. However, the > "chaining scrubbers" section made me a little nervous when I first read it, > but when I looked at the code, it makes me a bit more nervous than I was > before. My use-case is to use Loofah on generally small to medium sized > user input that can take HTML, however there are times when I want to apply > chained filters, e.g. nofollow and times when I don't. > > >From the code, it seems like one of the major advantages of using libxml - > the fast, one-pass SAX parsing - isn't being leveraged by Loofah. You're correct. Loofah uses the Nokogiri DOM parser and loads the entire document into memory before starting processing/sanitizing/stripping. > Maybe the performance advantage would be minimal, but I would certainly > think since you're either ignoring or escaping offending nodes, this would > better be done by creating a scrubber chain which was applied in a SAX > parser event handler rather than requiring multiple passes of the whole > document for each scrubber. > I was just wondering if you'd looked at this approach and rejected it for > some reason. This is partly because a fully-parsed document is much more functional than either the SAX parser or the Reader parser (e.g., you can access node.ancestors, or node.document, etc.; and you have the flexibility to traverse bottom-up/depth-first). However, the primary reason is that the SAX parser and Reader parsers do not allow for the modification of the document; in fact, there is no document with these parsers; there is only the context node. Loofah would have to construct the document on the fly from the SAX callbacks (which is harder than it sounds (this is how Nokogiri used to parse fragments, and it was always buggy), and much much slower than allowing libxml2 to do this parsing once, up-front). Plus, implementing both document construction as well as some sort of chaining mechanism (I'm imagining something along the lines of ARel's scope proxies to defer execution until evaluation?) for multiple scrubbers sounds hard. Complex, and really hard. And again, SAX is slower than (and equally memory-inefficient as) the DOM parser if you are building a fully-constructed document (which Loofah does), assuming your SAX parser is implemented in Ruby and that most of the nodes in the document are being kept (not removed). And the code would be more complicated and harder to maintain. > I generally do SAX-based parsing of XML whenever possible, and even wrote > some XPath-based node selection code based on using the XPath parts of REXML > and using it directly with a light-weight node implementation over a libxml > SAX parser so that you could still only selectively operate on tree nodes. > In the next while, I'm going to publish this code along with something > else, but I bring it up to highlight that it is possible to do a lot with > SAX directly without using Nokogiri. > > Right now, it seems Loofah's doing a parse (libxml+Nokogiri) followed by > tree-walk(s) to scrub followed by another tree-walk to generate the > resulting text (admittedly not technically part of Loofah). My suspicion is > the minimum of 3-passes currently required is why Loofah doesn't outperform > other options for small document. No, the benchmarks are showing that libxml2's DOM parser has constant startup overhead, independent of document size. Traversing the document is fast, as shown by Loofah's beating all comers in benchmarks on large documents. > If it was treated as a SAX-based, one-pass transformation using a scrubber > chain that would directly generate the clean output (even where scrubbers > could be applied based on XPath node selection), I think you could have the > same functionality in ~1/2 to 1/3 of the time for any size XML document. > If you have concrete ideas on how to implement this in a performant manner, I'd love to see the code. However, my suspicion is that such a change would require significant changes to the Loofah API (limiting what node attributes can be accessed), would be significantly more complex in implementation, and would be of only questionable performance benefit. Loofah's sweet spot is that it is a) flexible and general-purpose, b) just fast enough to be used in interesting places, and c) obvious how to use. And in my imagination, conversion to a SAX parser would require API changes that adversely affect (a) and (c), without necessarily compellingly improving (b). Honestly, if Loofah can sanitize a small snippet of markup (~tens of bytes) in 0.4 milliseconds on my 3-yo laptop, how much faster would it have to be to justify the additional complexity? IMHO, 0.4 ms is plenty fast enough for the majority of places in which it would be useful. Anybody who needs more performance always has the option using Nokogiri's SAX or Reader APIs directly; or implementing a solution in C using libxml2's API. I'm not trying to shut this idea down, since I'm obviously interested in creative solutions to boost performance. However, I'm not optimistic. That said, I'd love for you to show me I'm wrong! Thanks for using Loofah, and cheers! -mike --- mike dalessio / @flavorjones
Hi Mike, All valid points. :) In my case, I don't care about the document in 99% of the cases, so the only thing I need to do is take one string, process it and convert it to another string. Doing this with a SAX-based echo handler and chaining the scrubbers as an implementation of the compound Command Object pattern called at the appropriate point just seems simpler in this case. However, I agree that if you're maintaining a bunch of state (which I wasn't intending to do) and/or want to use the parsed document for other things, then that's a whole different deal. As I said, I've done some similar stuff before when handling XML, so if I get a chance over the next while, I'll have a look. The one that really gets me is having to walk the tree again for nofollow when you've already walked it once. The others are mutually exclusive, so this isn't a concern. In small trees and in C-land where libxml lives, I agree the tree traversal is probably fast enough for what I need to do. Thanks for the rationale, and I'll post something if I have a go trying to do something like this. Cheers, ast On 21 Mar 2011, at 1:27 PM, Mike Dalessio wrote: > > > On Mon, Mar 21, 2011 at 8:04 AM, Andrew S. Townley <ast@atownley.org> wrote: > Hi, > > I found Loofah last week, and I think it's pretty cool since I'm already using both Nokogiri and libxml2 directly in my projects. However, the "chaining scrubbers" section made me a little nervous when I first read it, but when I looked at the code, it makes me a bit more nervous than I was before. My use-case is to use Loofah on generally small to medium sized user input that can take HTML, however there are times when I want to apply chained filters, e.g. nofollow and times when I don't. > > >From the code, it seems like one of the major advantages of using libxml - the fast, one-pass SAX parsing - isn't being leveraged by Loofah. > > You're correct. Loofah uses the Nokogiri DOM parser and loads the entire document into memory before starting processing/sanitizing/stripping. > > Maybe the performance advantage would be minimal, but I would certainly think since you're either ignoring or escaping offending nodes, this would better be done by creating a scrubber chain which was applied in a SAX parser event handler rather than requiring multiple passes of the whole document for each scrubber. > > I was just wondering if you'd looked at this approach and rejected it for some reason. > > This is partly because a fully-parsed document is much more functional than either the SAX parser or the Reader parser (e.g., you can access node.ancestors, or node.document, etc.; and you have the flexibility to traverse bottom-up/depth-first). > > However, the primary reason is that the SAX parser and Reader parsers do not allow for the modification of the document; in fact, there is no document with these parsers; there is only the context node. Loofah would have to construct the document on the fly from the SAX callbacks (which is harder than it sounds (this is how Nokogiri used to parse fragments, and it was always buggy), and much much slower than allowing libxml2 to do this parsing once, up-front). > > Plus, implementing both document construction as well as some sort of chaining mechanism (I'm imagining something along the lines of ARel's scope proxies to defer execution until evaluation?) for multiple scrubbers sounds hard. Complex, and really hard. > > And again, SAX is slower than (and equally memory-inefficient as) the DOM parser if you are building a fully-constructed document (which Loofah does), assuming your SAX parser is implemented in Ruby and that most of the nodes in the document are being kept (not removed). And the code would be more complicated and harder to maintain. > > I generally do SAX-based parsing of XML whenever possible, and even wrote some XPath-based node selection code based on using the XPath parts of REXML and using it directly with a light-weight node implementation over a libxml SAX parser so that you could still only selectively operate on tree nodes. In the next while, I'm going to publish this code along with something else, but I bring it up to highlight that it is possible to do a lot with SAX directly without using Nokogiri. > > Right now, it seems Loofah's doing a parse (libxml+Nokogiri) followed by tree-walk(s) to scrub followed by another tree-walk to generate the resulting text (admittedly not technically part of Loofah). My suspicion is the minimum of 3-passes currently required is why Loofah doesn't outperform other options for small document. > > No, the benchmarks are showing that libxml2's DOM parser has constant startup overhead, independent of document size. Traversing the document is fast, as shown by Loofah's beating all comers in benchmarks on large documents. > > If it was treated as a SAX-based, one-pass transformation using a scrubber chain that would directly generate the clean output (even where scrubbers could be applied based on XPath node selection), I think you could have the same functionality in ~1/2 to 1/3 of the time for any size XML document. > > If you have concrete ideas on how to implement this in a performant manner, I'd love to see the code. However, my suspicion is that such a change would require significant changes to the Loofah API (limiting what node attributes can be accessed), would be significantly more complex in implementation, and would be of only questionable performance benefit. > > Loofah's sweet spot is that it is a) flexible and general-purpose, b) just fast enough to be used in interesting places, and c) obvious how to use. And in my imagination, conversion to a SAX parser would require API changes that adversely affect (a) and (c), without necessarily compellingly improving (b). > > Honestly, if Loofah can sanitize a small snippet of markup (~tens of bytes) in 0.4 milliseconds on my 3-yo laptop, how much faster would it have to be to justify the additional complexity? IMHO, 0.4 ms is plenty fast enough for the majority of places in which it would be useful. Anybody who needs more performance always has the option using Nokogiri's SAX or Reader APIs directly; or implementing a solution in C using libxml2's API. > > I'm not trying to shut this idea down, since I'm obviously interested in creative solutions to boost performance. However, I'm not optimistic. That said, I'd love for you to show me I'm wrong! > > Thanks for using Loofah, and cheers! > -mike > > --- > mike dalessio / @flavorjones > > -- Andrew S. Townley <ast@atownley.org> http://atownley.org