How to create an email parser from scratch

Portrait of Sylvain Josserand
by Sylvain Josserand
11 mins read

So, your boss just asked you to solve that "email problem" that slows the company down. Hundreds of automated emails on which data entry is done manually, every morning, clogging the employees' mailboxes.

You, being smart and efficient, immediately sees the potential to create an email parsing system. Great idea! Though it might be a little more involved than just a few scripts and elbow grease. Here are six steps to create an email parser and successfully automate your email data entry workflow.

How to create an email parser cover image

Before we start: let's define parse and what is parsing

In Computer Science, parsing is the action of splitting a text into sub parts, following a set of rules.

An email parser is a way to make a computer read emails and act on it according to a set of rules. Ideally, that system will automatically extract relevant data from those emails and feed it to your back-office application. Check out the following article on a deep dive about email parsing is.

Shameless plug: have you met Parseur?

Building your own email parser is a fun project to understand how things work underneath.

But it's a time-consuming one.

Parseur was created from scratch in late 2015, and it took around 5000 man-hours over the course of 6 years, just to build the back-end. The front-end (all the user interface, including the template editor) also took thousands of man-hours to build. And the team behind Parseur is made of seasoned developers with more than 20 years of professional coding behind their belt.

We're not finished and can't even estimate how long it would take to create a "sufficiently good" text parser.

If you need results quick, you should try Parseur. Parseur is a managed and user-friendly email parser that will save you hours setting up your own solution. Check out the extensive set of Parseur features.

Try out Parseur email parser
Free to start with. All features enabled on our Free plan!

1. Get the emails

For now, the emails are arriving at the employees' individual inboxes, team mailing lists, or company-wide mailbox.

The first step would be to setup an e-mail account to centralize all those mailboxes. Or even, god forbid, set up your own email server, also known as an SMTP server.

If you know what you are doing, here are a few SMTP servers that are quite popular at the moment:

  • Exim is a free, open-source email transfer agent (yet another name for email server). It is the most popular SMTP server, and gaining in popularity a bit faster than the second, Postfix.
  • Postfix is also free and open-source. It has the reputation to "just work", with minimal problems. According to this article about email server market share, Exim and Postfix together represent 80% of all email servers.
  • On Microsoft side, the ubiquitous Exchange. You can get the emails from it through EWS instead of more old-fashioned POP3 or IMAP. Nowadays, you can even get Microsoft to host it for you, for a fee.
  • Build your own. That path will be long and winding, but you will learn a lot on the way. In the end, your server might better fit your needs. If your needs does not imply compatibility with the gazillions emails clients out there, that is. If you are determined to go down that path, Python has a lovely module in its standard library to get you started. Have a look at smtpd.

Note that sending a lot of emails without being blacklisted is an art in itself and better left to the specialists.

Also note that the popularity of setting up one's own email server is dwindling. In our era of cloud and SaaS, it's more convenient to use a hosted email service that does the dirty email work for you. Here are the major players in this space:

  • Postmark focuses deliverability and reliability. Also, it has a free plan.
  • Mandrill had a first mover advantage and remains popular. It focuses on marketing and transactional emails.
  • Sendgrid also position itself as a marketing and transactional email platform.
  • Mailgun focuses more on developer and API. Also, it has a free plan.

We love Postmark here at Parseur. Their API is great and the documentation stellar. There are many SDK for all the more popular programming languages around there.

2. Translate email into a proper data format

Email is an old format, the "created before Star Wars" kind of old, and it has accumulated a few warts over the decades. For example, international (non-US) characters handling was not part of the initial specification. To handle special characters, like €, you need to take 3 technical documents (also called RFC) into account:

  • RFC 2047 provides support for international names and subject lines, in the email header
  • RFC 5890 provides support for international domain names in the Domain Name System (DNS)
  • RFC 6532 allows the use of UTF-8 (another way to store international text) in a mail header section

Once again, services like Postmark or Mailgun can save your day here and do the translation for you. You can forget horror stories involving UTF-8, MIME and cp1252 (never heard of UTF-8, MIME or cp1252? I envy your life).

For example, if using mailgun, servers will receive the email for you and transform it into an easy-to-handle JSON document, taking care of all the RFCs known to mankind. It will then post it to your own server at whatever URL you want as webhook in a single HTTP POST request.

For the curious, here is a list of all SMTP-related RFCs. You are welcome.

For example, a simple email received on Mailgun will arrive at your server looking like this:

{
  subject: "My favorite café",
  sender: "John Doe <john.doe@yourcompany.com>",
  recipient: "Mr. Parseur <yourcompany.recipe.parser@parseur.com>",
  message: "It's called Awesome Café! See directions in the attachment. Bye.",
  attachements: [
    { name: "directions.pdf", content: "https://url.with.content" },
    { name: "cappucino.jpg", content: "https://another.content.url"}
  ]
  [... other interesting pieces of data here (read the doc, Luke) ...]
}

Isn't it wonderful? Compare this with a traditional email format:

MIME-Version: 1.0
Received: by 102.29.23.176 with HTTP; Sat, 12 Aug 2016 14:13:31 -0700 (PDT)
Date: Sat, 12 Aug 2016 14:13:31 -0700
Delivered-To: =?ISO-8859-1?Q?Mr. Parseur <yourcompany.recipe.parser@parseur.com>
Message-ID: <CAAJL_=kPAJZ=fryb21wBOALp8-XOEL-h9j84s3SjpXYQjN3Z3A@mail.gmail.com>
Subject: =?ISO-8859-1?Q?My=20Favorite=20Caf=E9
From: =?ISO-8859-1?Q?John Doe <john.doe@yourcompany.com>
To: =?ISO-8859-1?Q?Mr. Parseur <yourcompany.recipe.parser@parseur.com>
Content-Type: multipart/mixed; boundary=mixed
==mixed
Content-Type: multipart/alternative; boundary=alternative
==alternative
Content-Type: text/plain; charset="utf-8"
It's called Awesome Caf=C3=A9! See directions in the attachm= ent. Bye.
==alternative
Content-Type: text/html; charset="utf-8"
It's called <b>Awesome Caf=C3=A9</b>! See directions in the = attachment. Bye. ==alternative== ==mixed
Content-Type: document/pdf; name="directions.pdf"
Content-Disposition: attachment; filename="directions.pdf"
Content-Transfer-Encoding: base64
iVBORw [... the whole encoded attachment here ...] RK5CYII=
==mixed
Content-Type: image/jpg; name="capuccino.jpg"
Content-Disposition: attachment; filename="capuccino.jpg"
Content-Transfer-Encoding: base64
G+aHAAAA [... another attachment encoded here ...] ORK5CYII=
==mixed==

Fortunately, most decent programming languages come with library to decipher emails, such as the email module for Python, or Ruby's RubyMail library.

3. Get the data into the database

From here on, you can count on your coding skills to handle all these HTTP requests and turn them into nice entries into your database of choice.

Here are some popular programming languages and frameworks to help you on the task, in order of increasing trendiness:

The code involved should be trivial if you're not targeting any particular format. But you might have to find out about which format your business software is accepting and then convert to this format. Popular interchange formats include CSV and JSON, but some business applications are using more obscure, binary formats.

If all you need is storage (possibly for your own, custom business application), then you just have to pick how you will store the data.

If you know that you will never ever need to do statistics or non-sequential operations on these stored emails, you may consider using MongoDB for example. Although, I will advise you against it, using arguments from this awesome blog post.

Any relational database management system, based on SQL will store your emails just fine. At a minimum, you will define 2 tables, one for emails, and another for their attachments, if you decide to store them.

Any SQL database engine should handle that as long as your volume and load fit on a server. There are a few popular choices for relational databases nowadays:

  • MySQL, and its recommended, but non-official, fork MariaDB are basic and still popular choices of database servers. Note that since Oracle bought MySQL, support is not as strong as it used to be. Surprise.
  • Postgresql is a larger, feature rich database engine with more options to scale and a more complex setup than MySQL.
  • Other than these free, open-source databases, there is, of course, Oracle, with a truckload of features to answer the needs of large companies. Very large, complex, and expensive. Are you sure that your simple email storage solution needs that much scalability?
  • Also on the commercial side, Microsoft SQL server has much improved in the last years and now appears as a viable competitor to Oracle.

Here we are. If you wanted to put your emails' content as is into your application's database, you're basically done.

But why stop here? You now have a lot of interesting data at your fingertips. This data set is very interesting because it is relevant to your core business. Your emails are probably full of invoices, travel expenses, estimates, prospects and customers.

How about going one step further and extract relevant data from these emails? Refining the data you have can help you automate your business workflow, saving time to you and your employees.

4. Extract relevant text from each email

This is where the actual parsing really takes place. Ideally, we want to do this:

Schematics of an email parser transforming a received email into structured data (for example, a spreadsheet, or a database)

Schematics of an email parser transforming a received email into structured data (for example, a spreadsheet, or a database)

Here are a few approaches to solve this vast problem:

Statistical Word Analysis, or "word counting"

Statistical analysis is well-adapted to email without any predefined form, typically emails written by a human. You could define several email categories with a set of words belonging to each of these categories. You then parse each email, count the words in it from each category, and then decide if the email falls into one or more of these categories.

This works pretty well for sentiment analysis. For example, you could define a "happy customer" category and a "furious customer" category and send the happy customers email to your boss and the furious customer to the trash bin. Just kidding, but you get the idea.

But as you may know, human-to-human communication is prone to error, ambiguities and very sensitive to context. And as long as we don't have real artificial intelligence around, these same ambiguities won't be lifted. They will make your system unreliable at best, and useless at worst.

Regular expressions

This approach works best for automatically generated emails with most of the text staying the same between emails.

For example, say you want to parse a million booking emails from American Airlines and extract the passenger name for each of them. This could be done by creating a regular expression that matches the whole email and only captures the passenger name. Sound easy, right? But what happen when other parts of the email changes too? And what if there are three passengers on that one flight, instead of just one? Oops.

Python has a nice library for regular expressions. Regular expressions, or regexp for short, are part of Ruby Core as the Regexp module. They are also first class citizens in JavaScript, too.

THe downside is that Regular Expressions are complex to maintain and their readability is passable as best. Many Parseur customers told us they initially started developing their own parsing engine using Regexps, but were not able to keep maintaining it against the flow of ever-changing emails they were receiving.

5. A managed solution? Parseur can help!

Wouldn't it be nice to just get the data you want, sorted into the right columns of an Excel spreadsheet or database?

Well, that's our goal here at Parseur. We are providing you with a simple "point and click" interface, to define what data is relevant to you, once and for all. You can then send similar emails, and their data will get extracted and automagically placed into an Excel spreadsheet.

You don't have to create an email parser from scratch yourself. You don't have to do any manual processing after that first short session of pointing and clicking. Each email turns into an Excel row by itself.

Try Parseur out!
Free to start with. All features enabled on our Free plan!

6. Integrate into your business software

Once your extracted data is sitting neatly into your Excel Spreadsheet, you "just" have to get it where it matters: into your business application.

Tools like Zapier or Integromat can help you tremendously here, as they can connect your own email application with your business application. All you have to do is to write a connector for those services. You can then enjoy the many other connectors that are part of their ecosystem.

Parseur integrates with Google Sheets, Zapier, Integromat, Microsoft Flow, Getswift, opening your parsed data to thousands of applications in just a few clicks.

Good luck!

Last updated: