How to sanitize parsed data

When using a mail parser, the raw data you get is not always in the format you need. This article describes how you can sanitize parsed data in Parseur and get consistent and "clean" results.

Why sanitize parsed data?

Email parsers such as Parseur are responsible for reliably extracting text from emails and other documents and converting it into workable data.

But often, text extraction is not enough. The raw data extracted from your documents may contain leftover formatting code, extra spaces or inconsistent text encoding.

You need to perform some sanitation in order to "clean" the data and turn it into something workable.

By default, Parseur takes care of all this automatically: formatting code is removed, extra spaces are removed, text encoding is streamlined, etc.

All this makes Parseur "just work" out of the box.

But sometimes, you need to go the extra mile and use advanced sanitation techniques, for example to properly format dates and times.

Here too, Parseur has your back. Let's see how!

Data sanitizing in Parseur using Field Formats

Parseur controls your parsed data output by letting you assign an output format to a field. A format tells Parseur which kind of data a particular field contains and how to sanitize it.

Available output formats are:

  • Multi-line Text (this is the default)
  • Single-line Text
  • Number
  • Date
  • Time
  • Date and Time
  • List and Table
  • Linked Document

To assign an output format to a field in Parseur:

  1. Go to the Template editor
  2. Select the piece of text you want to extract and create a field
  3. Click on the edit button next to the field name
  4. Select the format from the Output Format drop-down menu
  5. Click Update to close the field settings
  6. Once you have made all your changes, save the template

From the field settings, you can change the field format

For some output formats, you will also be able to choose a related input format. Unlike output formats, which are global for a mailbox, input formats are specific to a template.

In the following sections, we detail how and when to use each output format.

Text sanitizing

As mentioned before, Parseur does most of the text sanitizing automatically. However, it offers two variations for standard Text fields, depending on whether you want to keep new lines.

Text (Multi-line) format (default)

This is the default format when creating a field.

The "Text" format will extract all visible text from your emails, including visible new lines. It will also strip out any formatting and HTML elements and just keep the text.

Text (single-line) format

The "Single-line Text" format will extract all visible text from your emails, excluding visible new lines. Like the Multi-line Text format, it will also strip out any formatting and HTML elements and just keep the text.

Use the Single-line Text format if you require the resulting field to be on a single line, without any line breaks.
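
To make the difference between the two text formats concrete, here is a minimal Python sketch. It is not Parseur's implementation: the function names to_multi_line_text and to_single_line_text are hypothetical, and the tag stripping is deliberately simplistic.

```python
import html
import re


def to_multi_line_text(raw: str) -> str:
    """Strip HTML but keep visible line breaks (hypothetical helper)."""
    # Treat <br> as a visible new line before removing the other tags.
    text = re.sub(r"<br\s*/?>", "\n", raw, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)   # drop remaining tags
    text = html.unescape(text)            # decode entities such as &amp;
    # Collapse runs of spaces/tabs and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def to_single_line_text(raw: str) -> str:
    """Same cleanup, then join everything onto one line."""
    return " ".join(to_multi_line_text(raw).split("\n"))


raw = "<p>John  <b>Doe</b><br>Smith &amp; Co.</p>"
print(to_multi_line_text(raw))
print(to_single_line_text(raw))
```

The first call keeps the line break between the two pieces of text, while the second returns them on a single line separated by a space.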

Number sanitizing

With Parseur, you can easily transform formatted numbers (including spaces, commas, etc.) into real numbers using the Number format.

Number format

The "Number" format will transform any number represented in a text into a "real" number. For instance, it will strip out any space, comma and additional formatting characters from the number.

Changing the decimal separator

By default, Parseur will use the dot character (".") as the decimal separator. If your documents use the comma (",") instead, go to your user preferences and update the Decimal separator setting.
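
As a rough illustration of what number sanitizing involves, the sketch below strips formatting characters and honors a configurable decimal separator. The parse_number function and its decimal_separator parameter are hypothetical names chosen for this example; Parseur's actual logic may differ.

```python
import re


def parse_number(text: str, decimal_separator: str = ".") -> float:
    """Turn a formatted number such as '1,234.56' into a float (hypothetical helper)."""
    if decimal_separator not in (".", ","):
        raise ValueError("decimal_separator must be '.' or ','")
    # Keep only digits, the minus sign and the two possible separators.
    cleaned = re.sub(r"[^0-9.,\-]", "", text)
    # The separator that is NOT the decimal one only groups thousands: drop it.
    thousands = "," if decimal_separator == "." else "."
    cleaned = cleaned.replace(thousands, "")
    # Normalize the decimal separator to a dot so float() accepts it.
    return float(cleaned.replace(decimal_separator, "."))


print(parse_number("$ 1,234.56"))                            # dot as decimal separator
print(parse_number("1 234,56 EUR", decimal_separator=","))   # comma as decimal separator
```

Both calls yield the same value, which is the point of the setting: the same field can be read correctly whichever convention your documents follow.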

Dates sanitizing

Dates and times can take all kinds of shapes in your emails. Quite often, applications integrated with Parseur require date fields to be formatted in a specific way. Sanitizing dates from emails is almost always a required step before sending and using dates in other applications.

Parseur offers 3 types of date and time formats. They are rather self-explanatory:

Date format

The "Date" format will sanitize a field into a date. If the field contains a date and a time, only the date information part will be kept. Examples of dates recognized by Parseur:

  • 12 Jan 2018
  • 2018-1-2
  • Wed Jan 24th, 2018 1:58pm
  • 01/12/2018: this date can either be the 12th of January or the 1st of December, depending on the locale and conventions. See the section below to tell Parseur how to disambiguate that situation.
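
The ambiguity of 01/12/2018 is easy to reproduce with Python's standard library. In this sketch, the parse_ambiguous_date function and its day_first flag are hypothetical stand-ins for the month-first or day-first preference described later in this article.

```python
from datetime import date, datetime


def parse_ambiguous_date(text: str, day_first: bool) -> date:
    """Parse a slash-separated date using the chosen convention (hypothetical helper)."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date()


print(parse_ambiguous_date("01/12/2018", day_first=False))  # read as month first
print(parse_ambiguous_date("01/12/2018", day_first=True))   # read as day first
```

The same text yields the 12th of January under the month-first convention and the 1st of December under the day-first one.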

Time format

The "Time" format will sanitize a field into a time. If the field contains a date and a time, only the time information part will be kept. Examples of times recognized by Parseur:

  • 1:58pm
  • 13:58:23
  • 12h36
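
One simple way to recognize several time notations is to try a list of known patterns in turn. The sketch below does this with Python's datetime.strptime; the TIME_FORMATS list and the parse_time function are assumptions made for illustration, not a description of how Parseur works internally.

```python
from datetime import datetime, time

# Candidate patterns, tried in order (an illustrative list, not exhaustive).
# Note: %p (am/pm) matching depends on the locale; an English or C locale is assumed.
TIME_FORMATS = ("%I:%M%p", "%H:%M:%S", "%H:%M", "%Hh%M")


def parse_time(text: str) -> time:
    """Return the first successful interpretation of the text as a time."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).time()
        except ValueError:
            continue  # this pattern did not match, try the next one
    raise ValueError(f"Unrecognized time: {text!r}")


for sample in ("1:58pm", "13:58:23", "12h36"):
    print(sample, "->", parse_time(sample).isoformat())
```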

Date and time format

The "Date and Time" format will sanitize a field into a datetime. If the field contains no time information, 00:00:00 will be used for the time part. Examples of datetimes recognized by Parseur:

  • Wed Jan 24th, 2018 1:58pm
  • 12 Jan 2018 13:58:23
  • 2018-01-24T05:18:44.841813+00:00

Using Field formats, Parseur can decode and sanitize dates or times of any shape into a common format.
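
The rule that a missing time becomes 00:00:00 can be sketched as follows. The parse_datetime function is hypothetical and only handles a few of the shapes listed above; month names are assumed to be in English.

```python
from datetime import datetime


def parse_datetime(text: str) -> datetime:
    """Parse ISO 8601 text or a 'day month year' date with an optional time."""
    try:
        return datetime.fromisoformat(text)  # e.g. 2018-01-24T05:18:44.841813+00:00
    except ValueError:
        pass
    # %b expects abbreviated month names in the current locale (English assumed).
    for fmt in ("%d %b %Y %H:%M:%S", "%d %b %Y"):
        try:
            # With the date-only pattern, strptime leaves the time at 00:00:00.
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date and time: {text!r}")


print(parse_datetime("12 Jan 2018 13:58:23"))
print(parse_datetime("12 Jan 2018"))
print(parse_datetime("2018-01-24T05:18:44.841813+00:00"))
```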

Configuring the input format

Most of the time, you won't need to specify the date input format found in your documents: Parseur will understand and decode it automatically. However, if you notice incorrect results, you can help Parseur by changing your document format preferences (in your user preferences).

To change your input format preferences:

  1. Click on your name in the navigation bar in the top right corner
  2. Click on Settings on the left menu
  3. Click on the Default formats tab
  4. In Input, change your input format preferences:
    • Under Timezone, select the time zone of your documents (most likely your time zone). Default is GMT.
    • Under Date format as found in emails, tell Parseur how ambiguous dates should be treated (either month first, or day first)
  5. Click on Update

The input format preference form
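
To see why the Timezone setting matters, consider a time found in a document without any offset. The sketch below is an illustration under stated assumptions: the localize function is hypothetical, a fixed offset stands in for a real time zone (so daylight saving is ignored), and the choice to leave already-offset values untouched is ours, not documented Parseur behavior.

```python
from datetime import datetime, timedelta, timezone

GMT = timezone.utc  # the default mentioned in the steps above


def localize(naive: datetime, document_tz: timezone = GMT) -> datetime:
    """Attach the document time zone to a datetime that carries none."""
    if naive.tzinfo is not None:
        return naive  # assumption: an explicit offset in the text is kept as-is
    return naive.replace(tzinfo=document_tz)


# Fixed +01:00 offset used as a simplified stand-in for a European time zone.
plus_one = timezone(timedelta(hours=1))
parsed = datetime(2018, 1, 24, 13, 58)

print(localize(parsed).isoformat())            # interpreted as GMT
print(localize(parsed, plus_one).isoformat())  # interpreted as +01:00
```

The same "13:58" refers to two different instants depending on the setting, which is why applications receiving the data may show shifted times if the wrong time zone is selected.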

Changing the output format

Now that you've told Parseur that some fields are dates, you can specify the exact output format you would like those dates to be formatted in. The resulting fields are formatted according to your user settings.

To change your result output preferences:

  1. Click on your name in the navigation bar in the top right corner
  2. Click on Settings on the left menu
  3. Click on the Default formats tab
  4. In Output, change your result output preferences (use this page to get the list of all available options)
  5. Click on Update

The output format preference form
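
Conceptually, an output format is a pattern applied to an already-parsed date. The sketch below uses Python's strftime directives as a stand-in; the option syntax that Parseur accepts in its settings may be different, so treat the patterns here as illustrative only.

```python
from datetime import datetime

# A value as it might look once a field has been parsed as "Date and Time".
parsed = datetime(2018, 1, 24, 13, 58, 23)

# The same value rendered with two different output patterns.
iso_style = parsed.strftime("%Y-%m-%d %H:%M:%S")
day_first_style = parsed.strftime("%d/%m/%Y")

print(iso_style)
print(day_first_style)
```

Whatever shape the date had in the original email, every document then produces the same output shape, which is what downstream applications usually require.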

Lists and Tables sanitizing

Parseur can also transform lists, tables and other repetitive blocks of text into properly formatted list data.

Head over to the following article for more information about using Lists and Tables: How to extract tables from emails.
