How to create an email parser from scratch?

So, your boss just asked you to solve that "email problem" that slows the company down. Hundreds of automated emails on which data entry is done manually, every morning, clogging the employees' mailboxes.

You, being smart and efficient, immediately sees the potential to create an email parser systemWhat is an email parser? You may ask. In short, it's a way to make a computer read emails and act on it according to a set of rules.

Ideally, that system will automatically extract relevant data from those emails and feed it to your back-office application.

Great idea! Though it might be a little more involved than just a few scripts and elbow grease. Here are five steps to create an email parser and successfully automate your email data entry workflow.

1. Get the emails

For now, the emails are arriving at the employees' individual inboxes, team mailing lists, or company-wide mailbox.

The first step would be to setup an e-mail account to centralize all those mailboxes. Or even, god forbid, set up your own email server, also known as an SMTP server.

If you are brave enough to go down that road, here are a few SMTP servers that are quite popular at the moment:

  • Exim is a free, open-source email transfer agent (yet another name for email server). It is the most popular SMTP server, and gaining in popularity a bit faster than the second, Postfix.
  • Postfix is also free and open-source. It has the reputation to "just work", with minimal problems. According to this article about email server market share, Exim and Postfix together represent 80% of all email servers.
  • On Microsoft side, the ubiquitous (in SMEs, at least) Exchange. You can get the emails from it through EWS instead of more old-fashioned POP3 or IMAP. Nowadays, you can even get Microsoft to host it for you, for a fee.
  • Build your own. That path will be long and winding, but you will learn a lot on the way. In the end, your server might better fit your needs. If your needs does not imply compatibility with the gazillions emails clients out there, that is. If you are determined to go down that path, Python has a lovely module in its standard library to get you started. Have a look at**`smtpd <https://docs.python.org/3/library/smtpd.html>`__.**

Note that sending a lot of emails without being blacklisted is an art in itself and better left to the specialists.

Also note that the popularity of setting up one's own email server is dwindling. In our era of cloud and SaaS, it's more convenient to use a hosted email service that does the dirty email work for you. Here are the major players in this space:

  • Mandrill had a first mover advantage and remains popular. It focuses on marketing and transactional emails.
  • Sendgrid also position itself as a marketing and transactional email platform.
  • Mailgun focuses more on developer and API. Also, it has a free plan.
  • Postmark focuses deliverability and reliability. Also, it has a free plan.
Don't leave me alone with that Postfix server!

We love Mailgun and Postmark here at parseur.com. Their APIs are great and the documentation stellar. There are many SDK for all the more popular programming languages around there.

2. Translate email into a proper data format

Email is an old format, the "created before Star Wars" kind of old, and it has accumulated a few warts over the decades. For example, international (non-US) characters handling was not part of the initial specification. To handle special characters, like €, you need to take 3 technical documents (also called RFC) into account:

  • RFC 2047 provides support for international names and subject lines, in the email header
  • RFC 5890 provides support for international domain names in the Domain Name System
  • RFC 6532 allows the use of UTF-8 (another way to store international text) in a mail header section

SMTP is probably older than Yoda

Yoda and Emails. Many similarities.

Yoda and Emails. Many similarities.

Once again, Mailgun will save your day here and do the translation for you. You can forget horror stories involving UTF-8, MIME and cp1252 (never heard of UTF-8, MIME or cp1252? I envy your life).

For example Mailgun servers will receive the email for you and transform it into an easy-to-handle JSON document, taking care of all the RFCs known to mankind. It will then post it to your own server at whatever URL you want, in a single, beautiful, HTTP POST request.

For the curious, here is a list of all SMTP-related RFCs. You are welcome.

For example, a simple email received on Mailgun will arrive at your server looking like this:

{
  subject: "My favorite café",
  sender: "John Doe <john.doe@yourcompany.com>",
  recipient: "Mr. Parseur <yourcompany.recipe.parser@parseur.com>",
  message: "It's called Awesome Café! See directions in the attachment. Bye.",
  attachements: [
    { name: "directions.pdf", content: "https://url.with.content" },
    { name: "cappucino.jpg", content: "https://another.content.url"}
  ]
  [... other interesting pieces of data here (read the doc, Luke) ...]
}

Isn't it wonderful? Compare this with a traditional email format:

MIME-Version: 1.0
Received: by 102.29.23.176 with HTTP; Sat, 12 Aug 2016 14:13:31 -0700 (PDT)
Date: Sat, 12 Aug 2016 14:13:31 -0700
Delivered-To: =?ISO-8859-1?Q?Mr. Parseur <yourcompany.recipe.parser@parseur.com>
Message-ID: <CAAJL_=kPAJZ=fryb21wBOALp8-XOEL-h9j84s3SjpXYQjN3Z3A@mail.gmail.com>
Subject: =?ISO-8859-1?Q?My=20Favorite=20Caf=E9
From: =?ISO-8859-1?Q?John Doe <john.doe@yourcompany.com>
To: =?ISO-8859-1?Q?Mr. Parseur <yourcompany.recipe.parser@parseur.com>
Content-Type: multipart/mixed; boundary=mixed
==mixed
Content-Type: multipart/alternative; boundary=alternative
==alternative
Content-Type: text/plain; charset="utf-8"
It's called Awesome Caf=C3=A9! See directions in the attachm= ent. Bye.
==alternative
Content-Type: text/html; charset="utf-8"
It's called <b>Awesome Caf=C3=A9</b>! See directions in the = attachment. Bye. ==alternative== ==mixed
Content-Type: document/pdf; name="directions.pdf"
Content-Disposition: attachment; filename="directions.pdf"
Content-Transfer-Encoding: base64
iVBORw [... the whole encoded attachment here ...] RK5CYII=
==mixed
Content-Type: image/jpg; name="capuccino.jpg"
Content-Disposition: attachment; filename="capuccino.jpg"
Content-Transfer-Encoding: base64
G+aHAAAA [... another attachment encoded here ...] ORK5CYII=
==mixed==

Fortunately, most decent programming languages come with library to decipher emails, such as the email module for Python, or Ruby's RubyMail library.

3. Get the data into the database

From here on, you can count on yours (or your nephew's) amazing coding skills to handle all these HTTP requests and turn them into nice entries into your database of choice.

Here are some popular programming languages and frameworks to help you on the task, in order of increasing trendiness:

The code involved should be trivial if you're not targeting any particular format. But you might have to find out about which format your business software is accepting and then convert to this format. Popular interchange formats include CSV and JSON, but some business applications are using more obscure, binary formats.

If all you need is storage (possibly for your own, custom business application), then you just have to pick how you will store the data.

If you know that you will never ever need to do statistics or non-sequential operations on these stored emails, you may consider using MongoDB for example. Although, I will advise you against it, using arguments from this awesome blog post.

Any relational database management system, based on SQL will store your emails just fine. At a minimum, you will define 2 tables, one for emails, and another for their attachments, if you decide to store them.

Any SQL database engine should handle that easy as pie, as long as your volume and load fit on a server. There are a few popular choices for relational databases nowadays:

  • MySQL, and its recommended, but non-official fork MariaDB are basic and still popular choices of database servers. Note that since Oracle bought MySQL, support is not as strong as it used to be. Surprise.
  • Postgresql is a larger, feature rich database engine with more options to scale and a more complex setup than MySQL.
  • Other than these free, open-source databases, there is, of course, Oracle, with a truckload of features to answer the needs of large companies. Very large, complex, and expensive. Are you sure that your simple email storage solution needs that much scalability?
  • Also on the commercial side, Microsoft SQL server has much improved in the last years and now appears as a viable competitor to Oracle.

Here we are. If you wanted to put your emails' content as is into your application's database, you're basically done.

But why stop here? You now have a lot of interesting data at your fingertips. This data set is very interesting because it is relevant to your core business. Your emails are probably full of invoices, travel expenses, estimates, prospects and customers.

How about going one step further and extract relevant data from these emails? Refining the data you have can help you streamline your business workflow, saving time to you and your employees.

4. Create an email parser to extract relevant data from each email

This is where the actual parsing really takes place. Ideally, we want to do this:

Schematics of an email parser transforming a received email into structured data (for example, a spreadsheet, or a database)

Schematics of an email parser transforming a received email into structured data (for example, a spreadsheet, or a database)

Here are a few approaches to solve this vast problem:

Statistical Word Analysis, or "word counting"

Statistical analysis is well-adapted to email without any predefined form, typically emails written by a human. You could define several email categories with a set of words belonging to each of these categories. You then parse each email, count the words in it from each category, and then decide if the email falls into one or more of these categories.

This works pretty well for sentiment analysis. For example, you could define a "happy customer" category and a "furious customer" category and send the happy customers email to your boss and the furious customer to the trash bin. Just kidding, but you get the idea.

But as you may know, human-to-human communication is prone to error, ambiguities and very sensitive to context. And as long as we don't have real artificial intelligence around, these same ambiguities won't be lifted. They will make your system unreliable at best, and useless at worst.

Regular expressions

This approach works best for automatically generated emails with most of the text staying the same between emails.

For example, say you want to parse a million booking emails from America Airlines and extract the passenger name for each of them. This could be done by creating a regular expression that matches the whole email and only captures the passenger name. Sound easy, right? But what happen when other parts of the email changes too? And what if there are three passengers on that one flight, instead of just one? Oops.

Python has a nice library for regular expressions, read the nice documentation if you need to wrap your head around the concept. Regular expressions, or regexp for short, are part of Ruby Core as the Regexp module. They are also first class citizens in JavaScript, too.

5. A managed solution? Parseur can help!

Wouldn't it be nice to just get the data you want, sorted into the right columns of an Excel spreadsheet or database?

Well, that's our goal here at Parseur. We are providing you with a simple "point and click" interface, to define what data is relevant to you, once and for all. You can then send similar emails, and their data will get extracted and *automagically* placed into an Excel spreadsheet. You don't have to create an email parser from scratch yourself. You don't have to do any manual processing after that first short session of pointing and clicking. Each email turns into an Excel row by itself.

It's magic

As Shia LaBoeuf says, it's just magic

Check how Parseur can help parsing your emails and get started in minutes!

6. Integrate into your business software

Once your extracted data is sitting neatly into your Excel Spreadsheet, you "just" have to get it where it matters: into your business application.

Zapier can help you tremendously here, as they can connect your own email application with your business application. All you have to do is to write a connector to Zapier. You can then enjoy the many other connectors that are part of the Zapier ecosystem.

Good luck!

Last updated: