Nokogiri, Your New Swiss Army Knife

Monday, November 17, 2008

Prologue

Today I'd like to talk about the use of regular expressions to parse and modify HTML. Or rather, the misuse.

I'm going to try to convince you that it's a very bad idea to use regexes for HTML. And I'm going to introduce you to Nokogiri, my new best friend and life companion, who can do this job way better, and nearly as fast.

For those of you who just want the meat without all the starch:

  1. You don't parse Ruby or YAML with regular expressions, so don't do it with HTML, either.
  2. If you know how to use Hpricot, you know how to use Nokogiri.
  3. Nokogiri can parse and modify HTML more robustly than regexes, with less of a performance penalty than rendering Markdown or Textile.
  4. Nokogiri is 4 to 10 times faster than Hpricot at the typical HTML-munging operations benchmarked below.

The Scene

On one of the open-source projects I contribute to (names will be withheld for the protection of the innocent; this isn't Daily WTF), I came across the following code:

def spanify_links(text)
  text.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>')
end

In case it's not clear, the goal of this method is to insert a <span> element inside the link, converting hyperlinks from

<a href='http://foo.com/'> Foo! </a>

to

<a href='http://foo.com/'> <span> Foo! </span> </a>

for CSS styling.

The Problem

Look, I love regexes as much as the next guy, but this regex is seriously busticated. If there is more than one <a> tag on a line, only the final one will be spanified. If the tag contains an embedded newline, nothing will be spanified. There are probably other unobvious bugs, too, and that means there's a code smell here.
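To see both failures concretely, here is a minimal sketch that runs the method above against two inputs. The sample strings are made up for illustration; they are not from the project in question.

```ruby
# The method from the project, unchanged.
def spanify_links(text)
  text.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>')
end

# Hypothetical inputs, invented for this demonstration.
two_links = "<a href='/one'>One</a> and <a href='/two'>Two</a>"
multiline = "<a href='/one'>\nOne\n</a>"

# The greedy (.*) swallows the first link whole, so only the last one gets a span.
puts spanify_links(two_links)
# prints: <a href='/one'>One</a> and <a href='/two'><span>Two</span></a>

# Without the /m flag, "." never matches a newline, so nothing matches at all.
puts spanify_links(multiline) == multiline
# prints: true
```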

Sure, the regex could be fixed to work in these cases. But does a trivial feature like this justify the time spent writing test cases and playing whack-a-mole with regex bugs? Code smell.

Let's look at it another way: If you were going to modify Ruby code programmatically, would you use regular expressions? I seriously doubt it. You'd use something like ParseTree, which understands all of Ruby's syntax and will correctly interpret everything in context, not just in isolation.

What about YAML? Would you modify YAML files with regular expressions? Hells no. You'd slurp it with YAML.load(), modify the in-memory data structures, and then write it back out.
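For comparison, the YAML round trip looks roughly like this. It's only a sketch: the config content and key names are invented for illustration.

```ruby
require "yaml"

# Hypothetical config, invented for this example.
source = <<~CONFIG
  site:
    title: Old Title
    tags:
      - ruby
CONFIG

data = YAML.load(source)              # parse into plain hashes and arrays
data["site"]["title"] = "New Title"   # modify the in-memory structure
data["site"]["tags"] << "nokogiri"
output = YAML.dump(data)              # serialize back to YAML text

puts output
```

No regex ever touches the YAML text itself, so quoting and indentation are the library's problem, not yours.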

Why wouldn't you do the same with HTML, which has its own nontrivial (and DTD-dependent) syntax?

Regular expressions just aren't the right tool for this job. Jamie Zawinski said it best:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Why, God? Why?

So, what drives otherwise intelligent people (myself included) to whip out regular expressions when it comes time to munge HTML?

My only guess is this: A lack of worthy XML/HTML libraries.

Whoa, whoa, put down the flamethrower and let me explain myself. By "worthy", I mean three things:

  • fast, high-performance, suitable for use in a web server
  • nice API, easy for a developer to learn and use
  • will successfully parse broken HTML commonly found on the intarwebs

libxml2 and libxml-ruby have been around for ages, and they're incredibly fast. But have you seen the API? It's totally sadistic, and as a result it's not easily usable in simple cases like the one described above.

Now, Hpricot is pure genius. It's pretty fast, and the API is absolutely delightful to work with. It supports CSS as well as XPath queries. I've even used it (with feed-normalizer) in a Rails application, and it performed reasonably well. But it's still much slower than regexes. Here's a (totally unfair) sample benchmark comparing Hpricot to a comparable (though buggy) regular expression (see below for a link to the benchmark gist):

For an html snippet 2374 bytes long ...
                          user     system      total        real
regex * 1000          0.160000   0.010000   0.170000 (  0.182207)
hpricot * 1000        5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                          user     system      total        real
regex * 10            0.100000   0.020000   0.120000 (  0.122117)
hpricot * 10          3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long

So, historically, I haven't used Hpricot everywhere I could have, and that's because I was overly cautious about performance.

Get On With It, Already

Oooooh, if only there were a library with libxml2's speed and Hpricot's API. Then maybe people wouldn't keep trying to use regular expressions where an HTML parser is needed.

Oh wait, there is. Everyone, meet Nokogiri.

Check out the full benchmark, comparing the same operation (spanifying links and removing possibly-unsafe tags) across regular expressions, Hpricot and Nokogiri:

For an html snippet 2374 bytes long ...
                          user     system      total        real
regex * 1000          0.160000   0.010000   0.170000 (  0.182207)
nokogiri * 1000       1.440000   0.060000   1.500000 (  1.537546)
hpricot * 1000        5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0015 seconds for Nokogiri to parse and operate on an HTML snippet 2374 bytes long
it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                          user     system      total        real
regex * 10            0.100000   0.020000   0.120000 (  0.122117)
nokogiri * 10         0.310000   0.020000   0.330000 (  0.322290)
hpricot * 10          3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.0322 seconds for Nokogiri to parse and operate on an HTML snippet 97517 bytes long
it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long

Wow! Nokogiri parsed and modified blog-sized HTML snippets in under 2 milliseconds! That's significantly slower than regular expressions, but it's still fast enough for me to consider using it in a web application server.

Hell, that's as fast as (faster than, actually) BlueCloth or RedCloth can render Markdown or Textile of similar length. If you can justify using those in your web application, you can certainly afford the overhead of Nokogiri.

And as for usability, let's compare the regular expressions to the Nokogiri operations:

html.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>') # broken regex
html.gsub(/<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*<\/\1>/, '')

doc.search("a/text()").wrap("<span></span>")
doc.search("script","noscript","object","embed","style","frameset","frame","iframe").unlink

The Nokogiri version is much clearer. More maintainable, more robust and, for me, just fast enough to start jamming into all kinds of places.

Where Else Can I Use Nokogiri?

You can use Nokogiri anywhere you read, write or modify HTML or XML. It's your new Swiss Army knife.

What about your test cases? Merb is using Nokogiri extensively in their controller tests, and they're reportedly much faster than before. And those Merb dudes are S-M-R-T.

Have you thought about using Nokogiri::XML::Builder to generate XML, instead of the default Rails XML template builder? Boy, I have. Upcoming blog post, hopefully.

Let me know where else you've found Nokogiri useful! Or better yet, join the mailing list and tell the community!