How to normalize parsed data

When you use an email parser, the raw data you get is not always in the format you need. This article describes how to normalize parsed data in Parseur to get consistent, structured results.

Why normalize parsed data?

Email parsers such as Parseur are responsible for reliably extracting text from emails and other documents and converting them into workable data.

Often, text extraction is not enough. The raw data extracted from your documents may contain:

  • extra spaces,
  • extra new lines,
  • formatting code such as HTML code,
  • different formats of dates, numbers, names, addresses,
  • different encoding standards,
  • etc.

You need to perform some normalization in order to "clean" the data and turn it into a structured data set.

By default, Parseur takes care of all this automatically: formatting code is removed, extra spaces are removed, text encoding is streamlined, etc. All this makes Parseur "just work" out of the box.

Sometimes, you need to go the extra mile and use data normalization based on the kind of data, for example to properly format dates, times or addresses.

Here too, Parseur has your back. Let's see how!

Data normalizing using field formats

Assign an output format to a field to control how its value is normalized. A format tells Parseur which kind of data a particular field contains and how to normalize it.

Available output formats are:

  • Multi-line Text (this is the default)
  • Single-line Text
  • Date
  • Time
  • Date and Time
  • Number
  • Full Name (Person's name)
  • Address
  • List and Table
  • Linked Document

To assign an output format to a field in Parseur:

  1. Go to the Template editor
  2. Select the piece of text you want to extract and create a field
  3. Click on the edit button next to the field name
  4. Select the format from the Output Format drop down menu
  5. Click Update to close the field settings
  6. Once you have made all your changes, save the template

From the field settings, you can change the field format

For some output formats, you can also choose a related input format. Unlike output formats, which are global for a mailbox, input formats are specific to a template.

In the following sections, we detail how and when to use each output format.

Text normalization

Parseur does most of the text sanitizing automatically. However, it offers two variations for standard Text fields, depending on whether you want to keep new lines.

Text (Multi-line) format

This is the default format when creating a field.

The "Text" format will extract all visible text from your emails, including visible new lines. It will also strip out any formatting and HTML elements and just keep the text.

When selecting "Text" output format, you can further tweak the format by selecting an input format:

  • HTML text (default): tells Parseur that the documents contain HTML. Parseur will use HTML markup to determine line breaks and then remove all HTML markup from field result
  • Raw Text: tells Parseur that the document is text-only. Parseur will keep line breaks and any HTML markup in the original value

Text (single-line) format

The "One line text" format will extract all visible text from your emails, excluding visible new lines. Like the text format, it will also strip out any formatting and HTML elements and just keep the text.

Use the One-line Text format if you require the result field to be on a single line and exclude any line breaks.

As with the multi-line format, you can further tweak the format by selecting an input format:

  • HTML text (default): tells Parseur that the documents contain HTML. Parseur will remove all HTML markup from field result
  • Raw Text: tells Parseur that the document is text-only. Parseur will collapse consecutive spaces but keep any HTML markup from the original value
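
To make the difference between the two text formats concrete, here is a minimal Python sketch of the same idea outside Parseur. It is not Parseur's implementation: the `TextExtractor` class, the `to_multi_line` and `to_single_line` helpers, and the set of tags treated as line breaks are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Tags treated as line breaks in this sketch (an assumption, not Parseur's rules).
LINE_BREAK_TAGS = {"br", "p", "div", "li", "tr"}


class TextExtractor(HTMLParser):
    """Collects visible text and records a newline at each line-breaking tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in LINE_BREAK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Whitespace in the HTML source is not a visible line break: collapse it.
        self.parts.append(re.sub(r"\s+", " ", data))


def to_multi_line(html_text):
    """Strip markup but keep one output line per visible line."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    raw_lines = "".join(parser.parts).split("\n")
    lines = (re.sub(r" +", " ", line).strip() for line in raw_lines)
    return "\n".join(line for line in lines if line)


def to_single_line(html_text):
    """Same cleanup, then fold everything onto one line."""
    return " ".join(to_multi_line(html_text).split())


sample = "<p>Order  <b>#1234</b></p><p>Ships to:<br>Paris</p>"
print(to_multi_line(sample))   # three lines: "Order #1234", "Ships to:", "Paris"
print(to_single_line(sample))  # Order #1234 Ships to: Paris
```

The multi-line variant uses the markup only to decide where lines end, while the single-line variant discards those breaks as well.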

Date normalization

Dates and times can take all kinds of shapes in your documents. Quite often, applications integrated with Parseur require date fields to be formatted in a specific way. Normalizing dates from emails is almost always a required step before sending and using dates in other applications.

Parseur offers 3 types of date and time formats. They are rather self-explanatory:

Date format

The "Date" format will sanitize a field into a date. If the field contains a date and a time, only the date information part will be kept. Examples of dates recognized by Parseur:

  • 12 Jan 2018
  • 2018-1-2
  • Wed Jan 24th, 2018 1:58pm
  • 01/12/2018: this date can either be the 12th of January or the 1st of December, depending on the locale and conventions. See the section below to tell Parseur how to disambiguate that situation.
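
The ambiguity in the last example can be reproduced with Python's standard library. This is only an illustration of the month-first versus day-first choice; the `parse_ambiguous_date` helper and its `day_first` parameter are hypothetical names, not part of Parseur.

```python
from datetime import date, datetime


def parse_ambiguous_date(text: str, day_first: bool) -> date:
    """Parse a NN/NN/YYYY date using the chosen convention."""
    layout = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, layout).date()


print(parse_ambiguous_date("01/12/2018", day_first=False))  # 2018-01-12
print(parse_ambiguous_date("01/12/2018", day_first=True))   # 2018-12-01
```

The same string gives two different dates, which is why the convention has to be stated once rather than guessed per document.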

Time format

The "Time" format will sanitize a field into a time. If the field contains a date and a time, only the time information part will be kept. Examples of times recognized by Parseur:

  • 1:58pm
  • 13:58:23
  • 12h36
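
As a rough idea of what normalizing these three examples involves, the sketch below tries a short list of layouts with Python's `datetime.strptime`. The `TIME_LAYOUTS` tuple and the `parse_time` helper are assumptions for illustration, and Parseur recognizes more shapes than these.

```python
from datetime import datetime, time

# Candidate layouts, tried in order. The list is illustrative, not exhaustive.
TIME_LAYOUTS = ("%I:%M%p", "%H:%M:%S", "%Hh%M", "%H:%M")


def parse_time(text: str) -> time:
    """Return the first successful interpretation of a time string."""
    cleaned = text.strip()
    for layout in TIME_LAYOUTS:
        try:
            return datetime.strptime(cleaned, layout).time()
        except ValueError:
            continue
    raise ValueError(f"unrecognized time: {text!r}")


for raw in ("1:58pm", "13:58:23", "12h36"):
    print(raw, "->", parse_time(raw).isoformat())
# 1:58pm -> 13:58:00
# 13:58:23 -> 13:58:23
# 12h36 -> 12:36:00
```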

Date and time format

The "Date and Time" format will sanitize a field into a datetime. If the field contains no time information, 00:00:00 will be used for the time part. Examples of datetimes recognized by Parseur:

  • Wed Jan 24th, 2018 1:58pm
  • 12 Jan 2018 13:58:23
  • 2018-01-24T05:18:44.841813+00:00

Using Field formats, Parseur can decode and sanitize dates or times of any shape into a common format.
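
The rule that a missing time becomes 00:00:00 can be seen in a small Python sketch. The `parse_datetime` helper and its list of layouts are hypothetical and cover only two of the examples above; this is not how Parseur parses dates internally.

```python
from datetime import datetime

# Layouts tried after ISO 8601. Illustrative only.
FALLBACK_LAYOUTS = ("%d %b %Y %H:%M:%S", "%d %b %Y")


def parse_datetime(text: str) -> datetime:
    """Parse ISO 8601 first, then fall back to a few textual layouts."""
    cleaned = text.strip()
    try:
        return datetime.fromisoformat(cleaned)
    except ValueError:
        pass
    for layout in FALLBACK_LAYOUTS:
        try:
            # A layout without time fields leaves the time at 00:00:00.
            return datetime.strptime(cleaned, layout)
        except ValueError:
            continue
    raise ValueError(f"unrecognized datetime: {text!r}")


print(parse_datetime("12 Jan 2018 13:58:23").isoformat())  # 2018-01-12T13:58:23
print(parse_datetime("12 Jan 2018").isoformat())           # 2018-01-12T00:00:00
print(parse_datetime("2018-01-24T05:18:44.841813+00:00").utcoffset())  # 0:00:00
```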

Configuring date input format

To change your default date input preferences (in your user preferences):

  1. Click on your name in the navigation bar in the top right corner
  2. Click on Settings on the left menu
  3. Click on the Default formats tab
  4. In Input, change your input format preferences:
    • Under Timezone, select the time zone of your documents (most likely your time zone). Default is GMT.
    • Under Date format as found in emails, tell Parseur how ambiguous dates should be treated (either month first or day first)
  5. Click on Update

The input format preference form
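
To see why the Timezone setting matters, consider a time found in a document without any zone information. The sketch below attaches a fixed UTC offset to such a value and converts it to UTC. The `localize` helper is hypothetical, and a fixed offset is a simplification: real time zones with daylight saving rules would need something like Python's `zoneinfo` module.

```python
from datetime import datetime, timedelta, timezone


def localize(naive: datetime, utc_offset_hours: int) -> datetime:
    """Interpret a zone-less datetime as local time at the given UTC offset."""
    return naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))


found_in_email = datetime(2018, 1, 24, 13, 58)  # "Wed Jan 24th, 2018 1:58pm"
local = localize(found_in_email, -6)

print(local.isoformat())                           # 2018-01-24T13:58:00-06:00
print(local.astimezone(timezone.utc).isoformat())  # 2018-01-24T19:58:00+00:00
```

The same wall-clock time maps to different instants depending on the zone assumed, so the setting should match where your documents come from.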

Configuring date output format

Now that you've told Parseur that some fields are dates, you can specify the exact output format you would like those dates to be formatted in. The resulting fields are formatted according to your user preferences.

To change your result output preferences:

  1. Click on your name in the navigation bar in the top right corner
  2. Click on Settings on the left menu
  3. Click on the Default formats tab
  4. In Output, change your result output preferences (use this page to get the list of all available options)
  5. Click on Update

The output format preference form
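
Once a value is known to be a date, rendering it in any layout is mechanical. The sketch below uses Python's `strftime` directives purely as an illustration; the format options Parseur offers are the ones listed in its settings and may use a different syntax.

```python
from datetime import datetime

parsed = datetime(2018, 1, 24, 13, 58)

print(parsed.strftime("%Y-%m-%d"))        # 2018-01-24
print(parsed.strftime("%d/%m/%Y %H:%M"))  # 24/01/2018 13:58
print(parsed.isoformat())                 # 2018-01-24T13:58:00
```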

Number normalization

Parseur lets you easily parse numbers (including spaces, commas, etc.) into real numbers using the Number format.

Number format

The "Number" format will transform any number represented in a text into a "real" number. For instance, it will strip out any space, comma and additional formatting characters from the number.

Changing the decimal separator

By default, Parseur will use the period character (".") as the decimal separator. If your documents use the comma (","), change your user preferences. To do that, go to your user preferences and update the Decimal separator setting.
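
The effect of the decimal separator setting can be sketched in a few lines of Python. The `parse_number` helper and its `decimal_separator` parameter are hypothetical names used for illustration; this is not Parseur's own code.

```python
import re
from decimal import Decimal


def parse_number(text: str, decimal_separator: str = ".") -> Decimal:
    """Drop everything except digits, minus sign and the decimal separator."""
    pattern = rf"[^0-9{re.escape(decimal_separator)}\-]"
    kept = re.sub(pattern, "", text)
    return Decimal(kept.replace(decimal_separator, "."))


print(parse_number("$ 1,234.56"))                            # 1234.56
print(parse_number("1 234,56 EUR", decimal_separator=","))   # 1234.56
print(parse_number("1.234,56", decimal_separator=","))       # 1234.56
```

With the wrong separator, "1.234,56" would be read as 1.23456, which is why the setting has to match your documents.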

Full name normalization

Working with people's names can be hard. On top of the usual first name, last name sequence, some people have a middle name or a title, or choose to leave only their first name on your form. That makes parsing complex.

The Full Name format in Parseur makes it easy to automatically parse a person's name.

Example: say you have the following name in your document: Mr. Enrique S. de la Vega

Capturing that name in a field named "LeadName" with a Full Name format will give you the following result:

  • LeadName.title: Mr.
  • LeadName.first: Enrique
  • LeadName.middle: S.
  • LeadName.last: de la Vega
  • LeadName.full: Mr. Enrique S. de la Vega
Note: Our current name parsing algorithm is primarily able to parse English forms of names. It may give varying results for names that follow other conventions, such as Slavic, Chinese or Latin names. Contact us if you notice incorrect results for your use case and we'll see how we can improve our parsing algorithm.
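
To illustrate why name parsing is harder than splitting on spaces, here is a deliberately naive Python sketch that handles the example above. It is not Parseur's algorithm: the `split_full_name` helper and the `TITLES` and `PARTICLES` sets are assumptions, and the approach only suits simple English-style names.

```python
# Small, hand-picked sets: assumptions for this sketch only.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
PARTICLES = {"de", "la", "del", "van", "von", "der", "di", "da"}


def split_full_name(full: str) -> dict:
    """Split a name into title, first, middle and last parts."""
    tokens = full.split()
    result = {"title": "", "first": "", "middle": "", "last": "",
              "full": " ".join(tokens)}
    if tokens and tokens[0].rstrip(".").lower() in TITLES:
        result["title"] = tokens.pop(0)
    if tokens:
        result["first"] = tokens.pop(0)
    if tokens:
        # The last name is the final token plus any particles right before it.
        start = len(tokens) - 1
        while start > 0 and tokens[start - 1].lower() in PARTICLES:
            start -= 1
        result["last"] = " ".join(tokens[start:])
        result["middle"] = " ".join(tokens[:start])
    return result


print(split_full_name("Mr. Enrique S. de la Vega"))
```

Without the particle rule, "de" and "la" would end up in the middle name, which is the kind of case a dedicated format handles for you.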

Address parsing and normalization

Parseur can automatically parse and normalize an address. It can also fill in the gaps for partial addresses, determine coordinates and provide a Google Maps link.

Note: Parsing addresses costs one additional credit for each address parsed in a document. For example, if your document contains 2 fields with an Address format, Parseur will charge 3 credits (1 credit for parsing the base document and 1 credit for parsing each of the 2 addresses).
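
The credit rule in the note is simple arithmetic, sketched here for estimating usage. The `credits_for_document` helper is a hypothetical name, not a Parseur function.

```python
def credits_for_document(address_fields: int) -> int:
    """One credit for the document plus one per field with an Address format."""
    if address_fields < 0:
        raise ValueError("address_fields must be zero or more")
    return 1 + address_fields


print(credits_for_document(0))  # 1
print(credits_for_document(2))  # 3
```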

Example: Say you have the following address in your document: 500 Chartres Street Appt 34, New Orleans

Capturing that address in a field named "Location" with an Address format will give you the following results:

  • Location.original: 500 Chartres Street Appt 34, New Orleans
  • Location.normalized: 500 Chartres St #34, New Orleans, LA 70130, USA
  • Location.number: 500
  • Location.street: Chartres St
  • Location.address1: 500 Chartres St
  • Location.address2: #34
  • Location.city: New Orleans
  • Location.zip: 70130
  • Location.county: Orleans Parish
  • Location.state: Louisiana
  • Location.state_code: LA
  • Location.country: United States
  • Location.country_code: US
  • Location.found: True
  • Location.lat: 29.9558754
  • Location.lng: -90.065056
  • Location.map: link

If Parseur is not able to determine the address, it will set the found flag to false. The result will look like this:

  • Location.original: 220 Strangely named road
  • Location.found: False
  • Location.normalized: 220 Strangely named road
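
When you consume the parsed result in your own code, the found flag is a convenient way to route unresolved addresses to a manual check. The sketch below assumes the Location sub-fields arrive as a Python dictionary keyed by the names shown above; that shape and the `address_for_export` helper are assumptions for illustration, so adapt them to the export you actually use.

```python
def address_for_export(location: dict):
    """Return (address text, needs_review) for one parsed Address field."""
    if location.get("found"):
        return location["normalized"], False
    # Lookup failed: fall back to the text as captured and flag it for review.
    return location["original"], True


resolved = {
    "original": "500 Chartres Street Appt 34, New Orleans",
    "normalized": "500 Chartres St #34, New Orleans, LA 70130, USA",
    "found": True,
}
unresolved = {
    "original": "220 Strangely named road",
    "normalized": "220 Strangely named road",
    "found": False,
}

print(address_for_export(resolved))
print(address_for_export(unresolved))
```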

Lists and Tables sanitizing

Parseur can transform lists, tables and other repetitive blocks of text into properly formatted list data. Head over to the following article for more information about using Lists and Tables: How to extract tables from emails.
