Parsing HTML with regular expressions for fun and profit

You shan't use Regexp for HTML parsing

It's well known that parsing HTML with regular expressions quickly leads to failure and madness.

A famous Stack Overflow answer explains this beautifully.

That's because HTML is context-dependent: what a piece of text means depends on which tag or attribute you are in, and tags can nest to any depth. Regular expressions have no way of tracking that context.
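To make the context problem concrete, here is a minimal sketch using Python's standard re module. The HTML snippets are made up for illustration; the point is only that neither a lazy nor a greedy pattern can pair up opening and closing tags correctly.

```python
import re

# Nested tags: a lazy match stops at the FIRST closing tag it meets,
# so the outer <div> is cut short.
nested = "<div>outer <div>inner</div> tail</div>"
lazy_inner = re.search(r"<div>(.*?)</div>", nested).group(1)
print(lazy_inner)  # outer <div>inner

# Sibling tags: a greedy match runs to the LAST closing tag,
# so two separate <div> elements are glued together.
siblings = "<div>a</div><div>b</div>"
greedy_inner = re.search(r"<div>(.*)</div>", siblings).group(1)
print(greedy_inner)  # a</div><div>b
```

Neither pattern knows which closing tag belongs to which opening tag, which is exactly the context a real HTML parser keeps track of.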

But let me tell you a secret: Parseur processes HTML with regular expressions all the time.

Gandalf you shall not parse

Or shall you?

Actually, Parseur is "just" a regular expression generator.

Of course, the trick is to keep the UI simple enough that you never have to be aware of the underlying machinery, while still leveraging its power.

Why use regular expressions?

Parseur is not really parsing HTML with regular expressions!

Parseur doesn't need to parse the whole HTML tree. It only needs to extract fragments from a document whose structure can change quite a bit.
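As an illustration of extracting a fragment without parsing the tree, here is a small Python sketch. The "Booking reference" label, the six-character code format, and the extract_ref helper are all hypothetical, and this is not Parseur's actual generated expression. The pattern anchors on a text label and skips over whatever tags happen to sit between the label and the value.

```python
import re

# Hypothetical field: a 6-character code that follows a fixed label.
# (?:<[^>]+>\s*)* skips any run of tags between the label and the value,
# so the pattern does not care how the surrounding markup is laid out.
FIELD = re.compile(
    r"Booking reference:\s*(?:<[^>]+>\s*)*(?P<ref>[A-Z0-9]{6})"
)

def extract_ref(document):
    """Return the booking reference, or None if the label is absent."""
    match = FIELD.search(document)
    return match.group("ref") if match else None

inline_layout = "<p>Booking reference: <b>ABC123</b></p>"
table_layout = "<td>Booking reference:</td>\n<td><span class='x'>XYZ789</span></td>"
plain_text = "Booking reference: QWE456"

for doc in (inline_layout, table_layout, plain_text):
    print(extract_ref(doc))
```

The same expression handles an inline layout, a table layout, and plain text with no markup at all, because it never tries to understand the document structure.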

Not all text is HTML

We wanted our parsing engine to work with plain text documents as well as HTML ones. Once parsing HTML with regular expressions works, it's easy to convert documents from PDF, MS Word or other formats into HTML (or text), then process them like any other HTML document.

Because performance matters at scale

From the beginning, we wanted Parseur to be able to self-select the right template from a (potentially long) list.

For example, if you process plane tickets, we can extract the interesting data from a ticket regardless of which airline issued it, as long as there is a template (a carefully crafted regular expression) for it.

Another use case that works very well with Parseur is extracting relevant data from real estate emails.
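One way to picture template self-selection is to try each template in turn and keep the first one that matches. The sketch below assumes exactly that; the template names, the two ticket layouts, and the parse function are invented for illustration and do not describe how Parseur selects templates internally.

```python
import re

# Hypothetical templates: one regular expression per ticket layout.
TEMPLATES = {
    "airline_a": re.compile(
        r"Flight\s+(?P<flight>[A-Z]{2}\d+)\s+departs\s+(?P<time>\d{2}:\d{2})"
    ),
    "airline_b": re.compile(
        r"Departure:\s*(?P<time>\d{2}:\d{2}).*?Flight no\.\s*(?P<flight>[A-Z]{2}\d+)",
        re.DOTALL,  # let .*? cross line breaks between the two fields
    ),
}

def parse(document):
    """Return (template_name, fields) for the first template that matches."""
    for name, pattern in TEMPLATES.items():
        match = pattern.search(document)
        if match:
            return name, match.groupdict()
    return None, {}

ticket_a = "Flight AB123 departs 09:45"
ticket_b = "<p>Departure: 18:30</p>\n<p>Flight no. CD456</p>"

print(parse(ticket_a))
print(parse(ticket_b))
```

Because a failed match is cheap, the list of templates can grow without the caller ever having to say which airline issued the ticket.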

Regular expressions are fast enough (if you design them carefully) to allow hundreds of matches per second (per core) on complex documents.
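If you want to check throughput for your own patterns, a rough measurement can be sketched as follows. The pattern, the synthetic document, and the matches_per_second helper are assumptions made for this example, and the figure printed depends entirely on the machine and the pattern. Compiling the expression once and anchoring it on a literal prefix are two common ways of designing for speed.

```python
import re
import time

# Compiled once and reused; the literal "Total:" prefix lets the engine
# discard most positions in the document quickly.
PATTERN = re.compile(r"Total:\s*(?P<amount>\d+\.\d{2})")

def matches_per_second(pattern, document, runs=200):
    """Time repeated searches and return a rough searches-per-second rate."""
    start = time.perf_counter()
    for _ in range(runs):
        pattern.search(document)
    elapsed = time.perf_counter() - start
    return runs / elapsed if elapsed > 0 else float("inf")

# Synthetic document: lots of filler markup with the field at the very end.
document = "<p>filler line</p>\n" * 2000 + "<p>Total: 42.50</p>"

rate = matches_per_second(PATTERN, document)
print(f"{rate:.0f} searches per second on this machine")
```

Patterns with nested quantifiers or several overlapping wildcards can backtrack heavily on non-matching input, so it is worth timing them against documents that should not match as well.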

If you're interested in knowing more about Parseur, you can contact us at hello@parseur.com
