How to extract text from email attachments and documents

With Parseur, you can easily extract text from email attachments such as .csv, .pdf and .docx files. Parseur parses the attached files and extracts the data you choose, which you can then download or send to any app relevant to your business, all in a few clicks. Let's see how that works!

Why extract text from attachments?

Email attachments often contain valuable information one would like to parse.

For example:

  • Extract text from PDF invoices, purchase orders, shipping manifests or delivery confirmations
  • Parse travel confirmations in PDF format like flight tickets
  • Extract text from machine-generated Microsoft Word documents
  • Consolidate CSV and Excel spreadsheets

To minimize the time and cost of manually extracting text from documents, it is worth setting up a fully automated data extraction pipeline that sends your parsed data to your spreadsheet, your CRM, your accounting software or anywhere else you need it.

This is where Parseur helps. Parseur is powerful document parsing software that makes it easy to set up automated data entry and automate your business.

How to extract text and parse email attachments?

Note: The rest of this article assumes you already have a Parseur account.

If you don't, create your free account first.

When you create a mailbox in Parseur, attachment parsing is enabled by default (you can change that setting, see below).

When you send an email with attachments, Parseur creates a separate document for each attachment. You can then create a template for those attachments, as you would for any other document in Parseur.
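To illustrate the one-document-per-attachment behaviour, here is a minimal Python sketch that builds an email with two attachments using only the standard library. The addresses, file names and file contents are placeholders invented for this example, and the sketch only builds the message; sending it is left to your usual mail client or SMTP setup.

```python
from email.message import EmailMessage


def build_message(sender, mailbox_address, subject, body, attachments):
    """Build an email with one MIME part per attachment.

    attachments: list of (filename, data_bytes, maintype, subtype) tuples.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = mailbox_address
    msg["Subject"] = subject
    msg.set_content(body)
    for filename, data, maintype, subtype in attachments:
        msg.add_attachment(
            data, maintype=maintype, subtype=subtype, filename=filename
        )
    return msg


def attachment_names(msg):
    """List the file names of all attachments in the message."""
    return [part.get_filename() for part in msg.iter_attachments()]


# Placeholder addresses and dummy file contents, for illustration only.
msg = build_message(
    "me@example.com",
    "my.mailbox@example.com",  # hypothetical mailbox address
    "Documents to parse",
    "See attached files.",
    [
        ("invoice.pdf", b"%PDF-1.4 dummy", "application", "pdf"),
        ("report.docx", b"dummy docx bytes", "application",
         "vnd.openxmlformats-officedocument.wordprocessingml.document"),
    ],
)
print(attachment_names(msg))
```

Following the behaviour described above, this message would produce one document for each of the two attached files.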

Read our Getting Started guide for more information.

When choosing the type of mailbox, make sure to choose emails and attachments as shown below:

Select emails and attachments

Parse and consolidate CSV and Excel attachments

Parseur can automatically combine CSV and Excel files sent by email without even creating a template. Parseur will combine the files based on their column headers.

All you have to do is send your spreadsheets as email attachments to your Parseur mailbox.
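To give an intuition for what combining files "based on their column headers" means, here is a rough Python sketch. It is not Parseur's implementation, only one plausible way to merge CSV files whose columns appear in a different order or only partially overlap. The function name and the sample data are made up for this example.

```python
import csv
import io


def combine_csv(texts):
    """Merge several CSV documents into one, matching columns by header."""
    headers = []  # union of all headers, in first-seen order
    rows = []
    for text in texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in headers:
                headers.append(name)
        rows.extend(reader)

    out = io.StringIO()
    # restval fills in columns that a given file did not have.
    writer = csv.DictWriter(
        out, fieldnames=headers, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Two sample files with different column order and one extra column.
first = "name,total\nAcme,10\n"
second = "total,name,currency\n20,Globex,USD\n"
combined = combine_csv([first, second])
print(combined)
```

In this sketch, values land in the right column regardless of the order in which each file lists its headers, and columns missing from a file are left empty.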

Refer to the following article for more information: How to combine CSV files automatically

Extract text from PDF attachments

Parseur supports extracting text from PDF documents. PDFs need to be text-based: Parseur does not currently support parsing scanned PDF documents (i.e. Parseur doesn't do OCR).

Attachments will be converted to a text document. By default, Parseur will preserve the layout of the document.

Tips for parsing PDF documents with layout

To preserve the layout, converted PDF documents use space characters to separate different blocks on the same line. The number of spaces can vary from one document to the next.

Parseur uses delimiters around fields to locate them in a document (see our article on how Parseur works for more information).

When creating fields in PDF documents with layout, it is recommended to include some of the spaces surrounding the text you want to capture. This makes parsing more reliable when the number of spaces around blocks of text changes.
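The effect of varying spaces can be seen in a small Python sketch. This is not how Parseur locates fields internally; it only illustrates why treating a run of spaces as a flexible separator, rather than expecting an exact count, keeps extraction stable. The sample lines and the two-space threshold are assumptions made for this example.

```python
import re


def split_blocks(line):
    """Split a layout-preserved line into blocks separated by 2+ spaces."""
    return [block for block in re.split(r" {2,}", line.strip()) if block]


# The same two blocks, separated by a different number of spaces.
line_a = "Invoice No: 1234      Date: 2024-01-15"
line_b = "Invoice No: 1234                Date: 2024-01-15"

print(split_blocks(line_a))
print(split_blocks(line_b))
```

Both lines yield the same two blocks, even though the gap between them differs.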

Basic PDF to text conversion

By default, Parseur will try to preserve the layout of the document. This is the best option in most situations.

If you want Parseur to only extract text, without the layout, go to your mailbox settings and change the PDF conversion settings.

Select "Convert to text (basic)" to get rid of the PDF layout altogether

List of all supported document formats

Parseur can extract text from most attachments, as long as they are in a text format.

Here is the list of supported document formats that you can extract text from:

Format  Description
abw     AbiWord Document
csv     Comma-Separated Values
djvu    DjVu Document
doc     Microsoft Word
docm    Microsoft Office Open XML with Macros Enabled
docx    Microsoft Office Open XML
html    HTML Document
htm     HTML Document
lwp     Lotus Word Pro
md      Markdown Documentation File
odt     ODF Text Document
pages   Pages Document, Zipped Pages Document
pdf     Portable Document Format
rst     reStructuredText
rtf     Rich Text Format
sdw     StarWriter 5.0
tex     LaTeX Source Document
txt     Text Document
wpd     WordPerfect Document
wps     Microsoft Works Document
xls     Microsoft Excel Document
xlsx    Microsoft Excel Document Open XML
xlsm    Microsoft Excel Document Open XML with Macros Enabled
zabw    Compressed AbiWord Document
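If you want to check files before sending them, a small helper like the one below can compare a file name's extension against the list above. This is only a convenience sketch: the function name is invented here, and a matching extension does not guarantee the file is text-based (a scanned PDF still ends in .pdf).

```python
from pathlib import PurePath

# Extensions taken from the list of supported formats above.
SUPPORTED_EXTENSIONS = frozenset({
    "abw", "csv", "djvu", "doc", "docm", "docx", "html", "htm", "lwp",
    "md", "odt", "pages", "pdf", "rst", "rtf", "sdw", "tex", "txt",
    "wpd", "wps", "xls", "xlsx", "xlsm", "zabw",
})


def has_supported_extension(filename):
    """Return True if the file's last extension is in the supported list."""
    extension = PurePath(filename).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS


for name in ["Invoice.PDF", "orders.xlsx", "photo.png"]:
    print(name, has_supported_extension(name))
```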

How to keep the relationship between emails and attachments?

Sometimes you need to extract text from both an email and its attachments, and you want to link those two sets of parsed data.

While Parseur processes every email and attachment independently, it remembers which email each attachment belongs to. You can expose that link using the DocumentID and ParentID Extra Fields.

An attachment's ParentID is the same as the DocumentID of the email it was attached to.

To enable the DocumentID and ParentID extra fields:

  • Open your Parseur mailbox
  • Click on the Fields section
  • Scroll down to the Extra Fields panel
  • Check the DocumentID and ParentID fields

Check out the following article to learn more about using Extra Fields in Parseur.
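Once both fields are enabled, the parsed data can be regrouped downstream. The Python sketch below shows one way to do that, assuming the parsed results are available as a list of dictionaries and that records for emails have an empty or missing ParentID. Both of those are assumptions made for this example, as are the Subject and FileName keys; only DocumentID and ParentID come from the description above.

```python
from collections import defaultdict


def link_attachments(documents):
    """Map each email's DocumentID to the records of its attachments."""
    by_parent = defaultdict(list)
    for doc in documents:
        parent_id = doc.get("ParentID")
        if parent_id:  # assumption: only attachments carry a ParentID
            by_parent[parent_id].append(doc)

    return {
        doc["DocumentID"]: by_parent.get(doc["DocumentID"], [])
        for doc in documents
        if not doc.get("ParentID")
    }


# Made-up parsed records: two emails, one of which has two attachments.
documents = [
    {"DocumentID": "e1", "ParentID": "", "Subject": "Order 42"},
    {"DocumentID": "a1", "ParentID": "e1", "FileName": "invoice.pdf"},
    {"DocumentID": "a2", "ParentID": "e1", "FileName": "items.csv"},
    {"DocumentID": "e2", "ParentID": "", "Subject": "No attachments"},
]

linked = link_attachments(documents)
for email_id, attachments in linked.items():
    print(email_id, [a["FileName"] for a in attachments])
```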

How to disable attachment parsing?

By default, when you create a new mailbox, Parseur also parses every email attachment.

If you would like to disable attachment parsing, go to your mailbox settings and uncheck the attachment parsing box.

Uncheck the box to disable attachment parsing
