How to create an email parser from scratch

by Sylvain Josserand

So, your boss just asked you to solve the "email problem" that slows the company down. Every morning, hundreds of automated emails clog the employees' mailboxes, and their data has to be entered manually.

You, being smart and efficient, immediately see the potential to create an email parsing system. Great idea! Though it might be a little more involved than just a few scripts and elbow grease. Here are six steps to create an email parser and successfully automate your email data entry workflow.

Before we start: what is parsing?

In computer science, parsing is the action of splitting a text into subparts, following a set of rules.

An email parser is a way to make a computer read emails and act on them according to a set of rules. Ideally, that system will automatically extract relevant data from those emails and feed it to your back-office application. For a deep dive into email parsing, check out our dedicated article.

Shameless plug: Have you met Parseur?

Building your own email parser is a fun project to understand how things work underneath.

But it's a time-consuming one.


Parseur was created from scratch in late 2015, and it took around 5,000 man-hours over the course of six years, just to build the back-end. The front-end (all the user interface, including the template editor) also took thousands of man-hours to build. The team behind Parseur is made of seasoned developers with more than 20 years of professional coding under their belts.

We're not finished and can't even estimate how long it would take to create a "sufficiently good" text parser.

If you need results quickly, you should try Parseur. Parseur is a managed and user-friendly email parser that will save you the hours you would spend setting up your own solution. Check out the extensive set of Parseur features.

1. Get the emails

For now, the emails are arriving in the employees' individual inboxes, team mailing lists, or company-wide mailbox.

The first step would be to set up an email account to centralize all those mailboxes. Or even, God forbid, set up your own email server, also known as an SMTP server.

If you know what you are doing, here are a few SMTP servers that are quite popular at the moment:

  • Exim is a free, open-source mail transfer agent (yet another name for an email server). It is the most popular SMTP server and is gaining popularity a bit faster than the runner-up, Postfix.
  • Postfix is also free and open-source. It has a reputation for "just working", with minimal problems. According to this article about email server market share, Exim and Postfix together represent 80% of all email servers.
  • On the Microsoft side, there is the ubiquitous Exchange. You can get emails from it through EWS (Exchange Web Services) instead of the more old-fashioned POP3 or IMAP. Nowadays, you can even get Microsoft to host it for you, for a fee.
  • Build your own. That path will be long and winding, but you will learn a lot on the way. In the end, your server might fit your needs better, as long as those needs don't include compatibility with the gazillion email clients out there. If you are determined to go down that path, Python's standard library long shipped a lovely module to get you started: have a look at smtpd (deprecated since Python 3.6 and removed in Python 3.12; the third-party aiosmtpd package is the recommended replacement).
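Whichever server you choose, your code still has to read the centralized mailbox. Here is a minimal sketch, assuming the mailbox is reachable over IMAP and using Python's standard imaplib and email modules. The function names, host, and credentials are placeholders made up for this illustration.

```python
import imaplib
from email import message_from_bytes, policy


def connect(host, user, password):
    """Open an encrypted IMAP connection (host and credentials are placeholders)."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    return conn


def fetch_unseen(conn, mailbox="INBOX"):
    """Return every unread message of the mailbox as an EmailMessage object."""
    conn.select(mailbox)
    _status, data = conn.search(None, "UNSEEN")
    messages = []
    for num in data[0].split():
        # Fetching RFC822 downloads the full raw message (and marks it as read).
        _status, parts = conn.fetch(num, "(RFC822)")
        raw = next(part[1] for part in parts if isinstance(part, tuple))
        messages.append(message_from_bytes(raw, policy=policy.default))
    return messages
```

Typical use would be `fetch_unseen(connect("imap.example.com", "user", "secret"))`, run on a schedule. A hosted service with webhooks, described below, removes the need for polling altogether.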

Note that sending a lot of emails without being blacklisted is an art in itself and better left to the specialists.

Also note that the popularity of setting up one's own email server is dwindling. In our era of cloud and SaaS, it's more convenient to use a hosted email service that does the dirty email work for you. Here are the major players in this space:

  • Postmark focuses on deliverability and reliability. Also, it has a free plan.
  • Mandrill had a first-mover advantage and remains popular. It focuses on marketing and transactional emails.
  • Sendgrid also positions itself as a marketing and transactional email platform.
  • Mailgun focuses more on developers and its API. Also, it has a free plan.

We love Postmark here at Parseur. Their API is great and the documentation stellar. There are SDKs for most of the popular programming languages out there.

2. Translate email into a proper data format

Email is an old format, the "created before Star Wars" kind of old, and it has accumulated a few warts over the decades. For example, the handling of international (non-ASCII) characters was not part of the initial specification. To handle special characters like €, you need to take three technical documents (called RFCs) into account:

  • RFC 2047 provides support for international names and subject lines, in the email header
  • RFC 5890 provides support for international domain names in the Domain Name System (DNS)
  • RFC 6532 allows the use of UTF-8 (another way to store international text) in a mail header section
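To give a feel for what RFC 2047 means in practice, here is a small sketch that decodes an "encoded word" from a header using Python's standard email.header module. The helper name decode_mime_header is made up for this example.

```python
from email.header import decode_header, make_header


def decode_mime_header(value):
    """Turn an RFC 2047 encoded header value into a regular Unicode string."""
    # decode_header splits the value into (bytes, charset) chunks;
    # make_header reassembles them into readable text.
    return str(make_header(decode_header(value)))


print(decode_mime_header("=?ISO-8859-1?Q?My=20Favorite=20Caf=E9?="))
# prints: My Favorite Café
```

The same helper handles the base64 flavor (`=?UTF-8?B?...?=`) and leaves plain ASCII headers untouched.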

Once again, services like Postmark or Mailgun can save your day here and do the translation for you. You can forget horror stories involving UTF-8, MIME and cp1252 (never heard of UTF-8, MIME or cp1252? I envy your life).

For example, if you use Mailgun, its servers will receive the email for you and transform it into an easy-to-handle JSON document, taking care of all the RFCs known to mankind. Mailgun will then send it to your own server as a webhook: a single HTTP POST request to whatever URL you choose.

For the curious, here is a list of all SMTP-related RFCs. You are welcome.

For example, a simple email received on Mailgun will arrive at your server looking like this:

{
  "subject": "My favorite café",
  "sender": "John Doe <[email protected]>",
  "recipient": "Mr. Parseur <[email protected]>",
  "message": "It's called Awesome Café! See directions in the attachment. Bye.",
  "attachments": [
    { "name": "directions.pdf", "content": "https://url.with.content" },
    { "name": "cappuccino.jpg", "content": "https://another.content.url" }
  ]
  /*... other interesting pieces of data here (read the doc, Luke) ...*/
}

Isn't it wonderful? Compare this with a traditional email format:

  MIME-Version: 1.0
  Received: by 102.29.23.176 with HTTP; Fri, 12 Aug 2016 14:13:31 -0700 (PDT)
  Date: Fri, 12 Aug 2016 14:13:31 -0700
  Delivered-To: =?ISO-8859-1?Q?Mr=2E_Parseur?= <[email protected]>
  Message-ID: <CAAJL_=kPAJZ=fryb21wBOALp8-XOEL-h9j84s3SjpXYQjN3Z3A@mail.gmail.com>
  Subject: =?ISO-8859-1?Q?My=20Favorite=20Caf=E9?=
  From: =?ISO-8859-1?Q?John_Doe?= <[email protected]>
  To: =?ISO-8859-1?Q?Mr=2E_Parseur?= <[email protected]>
  Content-Type: multipart/mixed; boundary=mixed

  --mixed
  Content-Type: multipart/alternative; boundary=alternative

  --alternative
  Content-Type: text/plain; charset="utf-8"
  Content-Transfer-Encoding: quoted-printable

  It's called Awesome Caf=C3=A9! See directions in the attachm=
  ent. Bye.
  --alternative
  Content-Type: text/html; charset="utf-8"
  Content-Transfer-Encoding: quoted-printable

  It's called <b>Awesome Caf=C3=A9</b>! See directions in the =
  attachment. Bye.
  --alternative--
  --mixed
  Content-Type: application/pdf; name="directions.pdf"
  Content-Disposition: attachment; filename="directions.pdf"
  Content-Transfer-Encoding: base64

  iVBORw [... the whole encoded attachment here ...] RK5CYII=
  --mixed
  Content-Type: image/jpeg; name="cappuccino.jpg"
  Content-Disposition: attachment; filename="cappuccino.jpg"
  Content-Transfer-Encoding: base64

  G+aHAAAA [... another attachment encoded here ...] ORK5CYII=
  --mixed--

Fortunately, most decent programming languages come with a library to decipher emails, such as the email module for Python or the RubyMail library for Ruby.
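As an illustration, here is a minimal sketch that uses Python's email module to turn a raw MIME message into a simple dictionary. The sample message, its example.com addresses, and the parse_email helper are invented for this example, and the output keys are just one possible layout.

```python
from email import message_from_string, policy

# A tiny, made-up MIME message: one text part and one (fake) PDF attachment.
RAW = "\n".join([
    "MIME-Version: 1.0",
    "Subject: =?ISO-8859-1?Q?My=20Favorite=20Caf=E9?=",
    "From: John Doe <john@example.com>",
    "To: parseur@example.com",
    'Content-Type: multipart/mixed; boundary="mixed"',
    "",
    "--mixed",
    'Content-Type: text/plain; charset="utf-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "It's called Awesome Caf=C3=A9! See directions in the attachm=",
    "ent. Bye.",
    "--mixed",
    'Content-Type: application/pdf; name="directions.pdf"',
    'Content-Disposition: attachment; filename="directions.pdf"',
    "Content-Transfer-Encoding: base64",
    "",
    "JVBERi0xLjQK",
    "--mixed--",
    "",
])


def parse_email(raw):
    """Decode headers, body and attachments of a raw email into a dict."""
    msg = message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": str(msg["Subject"]),  # RFC 2047 decoding happens here
        "sender": str(msg["From"]),
        "message": body.get_content() if body is not None else "",
        "attachments": [
            {"name": part.get_filename(), "content": part.get_content()}
            for part in msg.iter_attachments()
        ],
    }


parsed = parse_email(RAW)
print(parsed["subject"])  # prints: My Favorite Café
```

The modern `policy.default` does the heavy lifting here: it decodes the header, the quoted-printable body, and the base64 attachment without any manual work.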

3. Get the data into the database

From here on, you can count on your coding skills to handle all these HTTP requests and turn them into nice entries in your database of choice.
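To make this concrete, here is a bare-bones sketch of a webhook receiver built only with Python's standard library. It assumes the email service posts a JSON body with "subject", "sender", "recipient" and "message" fields, plus an optional "attachments" list; check your provider's documentation for the real payload. The names email_record, WebhookHandler and serve are made up, and the in-memory list stands in for a real database write.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUIRED_FIELDS = ("subject", "sender", "recipient", "message")


def email_record(payload):
    """Keep only the fields we care about; raise KeyError if one is missing."""
    record = {field: payload[field] for field in REQUIRED_FIELDS}
    record["attachments"] = list(payload.get("attachments", []))
    return record


class WebhookHandler(BaseHTTPRequestHandler):
    received = []  # stand-in for a database write

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            record = email_record(json.loads(self.rfile.read(length)))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or missing fields: refuse the request.
            self.send_response(400)
            self.end_headers()
            return
        self.received.append(record)
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Block forever, handling one webhook call per received email."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. In production you would rather rely on a web framework and put the endpoint behind HTTPS and some form of request authentication.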

Here are some popular programming languages and frameworks to help you on the task, in order of increasing trendiness:

The code involved should be trivial if you're not targeting any particular format. However, you might have to find out about the format that your business software accepts and convert to this format. Popular interchange formats include CSV and JSON, but some business applications use more obscure, binary formats.

If all you need is storage (possibly for your own custom business application), then you just have to pick how you will store the data.

If you know that you will never need to do statistics or non-sequential operations on these stored emails, you may consider using MongoDB, for example. However, I advise against it, using arguments from this awesome blog post.

Any SQL-based relational database management system will store your emails just fine. At a minimum, you will need to define two tables: one for emails and another for their attachments, if you decide to store them.
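Here is a sketch of that two-table layout. SQLite is used only because it ships with Python; the same idea applies to any of the engines discussed in this section. Table names, column names, and the shape of the email dictionary are assumptions made for this example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS emails (
    id        INTEGER PRIMARY KEY,
    subject   TEXT NOT NULL,
    sender    TEXT NOT NULL,
    recipient TEXT NOT NULL,
    message   TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS attachments (
    id          INTEGER PRIMARY KEY,
    email_id    INTEGER NOT NULL REFERENCES emails(id) ON DELETE CASCADE,
    name        TEXT NOT NULL,
    content_url TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def store_email(conn, email):
    """Insert one email and its attachments in a single transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO emails (subject, sender, recipient, message) "
            "VALUES (?, ?, ?, ?)",
            (email["subject"], email["sender"], email["recipient"], email["message"]),
        )
        email_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO attachments (email_id, name, content_url) VALUES (?, ?, ?)",
            [(email_id, a["name"], a["content"]) for a in email.get("attachments", [])],
        )
    return email_id
```

Storing attachments in their own table keeps the emails table small and lets one email carry any number of files.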

Any SQL database engine should handle that, as long as your volume and load fit on a server. There are a few popular choices for relational databases nowadays:

  • MySQL, and its recommended, but non-official, fork MariaDB are basic and still popular choices of database servers. Note that since Oracle bought MySQL, support is not as strong as it used to be. Surprise.
  • PostgreSQL is a larger, feature-rich database engine with more options to scale and a more complex setup than MySQL.
  • Other than these free, open-source databases, there is, of course, Oracle, with a truckload of features to answer the needs of large companies. Very large, complex, and expensive. Are you sure that your simple email storage solution needs that much scalability?
  • Also on the commercial side, Microsoft SQL Server has improved a lot in recent years and now appears as a viable competitor to Oracle.

Here we are. If you wanted to put your emails' content as is into your application's database, you're basically done.

But why stop here? You now have a lot of interesting data at your fingertips. This data set is very interesting because it is relevant to your core business. Your emails are probably full of invoices, travel expenses, estimates, prospects and customers.

How about going one step further and extracting relevant data from these emails? Refining the data you have can help you automate your business workflow, saving time for you and your employees.

4. Extract relevant text from each email

This is where the actual parsing really takes place. Ideally, we want to do this:

Schematics of an email parser transforming a received email into structured data (for example, a spreadsheet, or a database)

Here are a few approaches to solve this vast problem:

Statistical Word Analysis, or "word counting"

Statistical analysis is well adapted to emails without any predefined form, typically emails written by a human. You could define several email categories with a set of words belonging to each of these categories. You would then parse each email, count the words in it from each category, and then decide if the email falls into one or more of these categories.

This works pretty well for sentiment analysis. For example, you could define a "happy customer" category and a "furious customer" category and send the happy customers' emails to your boss and the furious customers' emails to the trash bin. Just kidding, but you get the idea.
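A toy version of this word-counting idea could look like the sketch below. The two word lists are invented for the example and far too small for real use, and the threshold parameter is an arbitrary choice.

```python
import re
from collections import Counter

# Hypothetical vocabularies; a real system would need much larger ones.
CATEGORIES = {
    "happy customer": {"thanks", "great", "love", "perfect", "excellent"},
    "furious customer": {"refund", "terrible", "broken", "angry", "worst"},
}


def categorize(text, categories=CATEGORIES, threshold=1):
    """Return the categories whose words appear at least `threshold` times."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = {
        name: sum(words[word] for word in vocabulary)
        for name, vocabulary in categories.items()
    }
    return [name for name, score in scores.items() if score >= threshold]


print(categorize("Thanks, I love it! Great service."))
# prints: ['happy customer']
print(categorize("Great product, but it arrived broken."))
# prints: ['happy customer', 'furious customer']
```

The second call already shows the weakness of the method: a mixed message lands in both categories, and the counts alone cannot tell which one the customer really means.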

But, as you may know, human-to-human communication is prone to errors and ambiguities, and is very sensitive to context. And, as long as we don't have real artificial intelligence, these same ambiguities won't be resolved. They can make your system unreliable at best, and useless at worst.

Regular expressions

This approach works best for automatically generated emails, with most of the text staying the same between emails.

For example, let's say you want to parse a million booking emails from American Airlines and extract the passenger name from each of them. This could be done by creating a regular expression that matches the entire email and only captures the passenger's name. Sounds easy, right? But what happens when other parts of the email change as well? And what if there are three passengers on that one flight instead of just one? Oops.

Python has a nice library for regular expressions, the re module. Regular expressions, or regexps for short, are part of Ruby Core as the Regexp class. They are first-class citizens in JavaScript, too.
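Here is a short sketch of the regular-expression approach in Python. The email layout (one "Passenger: ..." line per traveler) is invented for the example and is not the real format of any airline. Instead of one expression matching the entire email, it matches only the lines of interest, which copes with several passengers and with changes elsewhere in the message.

```python
import re

# One match per "Passenger: <name>" line, wherever it sits in the email.
PASSENGER_RE = re.compile(
    r"^Passenger:[ \t]*(?P<name>\S.*?)[ \t]*$", re.MULTILINE
)


def passenger_names(body):
    """Return every passenger name found in the email body, in order."""
    body = body.replace("\r\n", "\n")  # normalize Windows line endings
    return [match.group("name") for match in PASSENGER_RE.finditer(body)]


SAMPLE = """Your trip is confirmed!
Passenger: John Doe
Passenger: Jane Doe
Passenger: Jim Doe
Flight: AA100
"""

print(passenger_names(SAMPLE))
# prints: ['John Doe', 'Jane Doe', 'Jim Doe']
```

This still breaks the day the sender renames the label or moves the names into an HTML table, which is exactly the maintenance burden described next.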

The downside is that regular expressions are complex to maintain and their readability is passable at best. Many Parseur customers told us they initially started developing their own parsing engine using regexps, but could not keep up with the flow of ever-changing emails they were receiving.

5. A managed solution? Parseur can help!

Wouldn't it be nice to just get the data you want, sorted into the correct columns of an Excel spreadsheet or database?

Well, that's our goal here at Parseur. We are providing you with a simple "point and click" interface to define what data is relevant to you once and for all. You can then send similar emails, and their data will get extracted and automatically placed into an Excel spreadsheet.

You don't have to create an email parser from scratch yourself. You don't have to do any manual processing after that first short session of pointing and clicking. Each email becomes an Excel row by itself.

6. Integrate into your business software

Once your extracted data is neatly sitting in your Excel spreadsheet, you "just" have to get it where it matters: into your business application.

Tools like Zapier or Make can help you tremendously here, as they can connect your email application with your business application. All you have to do is write a connector for those services. You can then enjoy the many other connectors that are part of their ecosystem.

Parseur integrates with Google Sheets, Zapier, Make (formerly Integromat), Microsoft Power Automate (formerly Microsoft Flow), and Getswift, opening your parsed data to thousands of applications in just a few clicks.

Good luck!
