How to extract text from email attachments and documents
With Parseur, you can extract text from email attachments such as .csv, .pdf and .docx files. Parseur parses the attached files and extracts the data you choose, which you can then download or send to any app relevant to your business, all in a few clicks. Let's see how that works!
Why extract text from attachments?
Email attachments often contain valuable information that you would like to parse. For example, you may want to:
- Extract text from PDF invoices, purchase orders, shipping manifests or delivery confirmations
- Parse travel confirmations in PDF format like flight tickets
- Extract text from machine-generated Microsoft Word documents
- Consolidate CSV and Excel spreadsheets
To minimize the time and cost of manually extracting text from documents, it pays to set up a fully automated data extraction pipeline that sends your parsed data to your spreadsheet, your CRM, your accounting software or anywhere else you need it.
This is where Parseur will help. Parseur is powerful document parsing software that makes it easy to set up automated data entry and automate your business.
How to extract text and parse email attachments?
Note: The rest of this article assumes you already have a Parseur account.
Click here to create your free account if you don't.
When you create a mailbox in Parseur, attachment parsing is enabled by default (you can change that setting, see below).
When you send an email with attachments to your mailbox, Parseur creates a separate document for each attachment. You can then create a template for those attachments, as you would for any document in Parseur.
Read our Getting Started guide for more information.
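As an illustration of what "an email with attachments" looks like on the sending side, here is a minimal Python sketch that builds such a message with the standard library. The mailbox address, sender address, and file contents are hypothetical placeholders, and actually sending the message (for example over SMTP) is left out. Each attachment in the message would become its own document in Parseur.

```python
from email.message import EmailMessage

# Hypothetical address, used only for illustration.
MAILBOX_ADDRESS = "my.mailbox@example.com"


def build_message(sender, attachments):
    """Build an email with one attachment per (filename, data, maintype, subtype) tuple."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = MAILBOX_ADDRESS
    msg["Subject"] = "Documents to parse"
    msg.set_content("See attached documents.")
    for filename, data, maintype, subtype in attachments:
        msg.add_attachment(data, maintype=maintype, subtype=subtype, filename=filename)
    return msg


message = build_message(
    "me@example.com",
    [
        ("invoice.pdf", b"%PDF-1.4 placeholder", "application", "pdf"),
        ("orders.csv", b"id,total\n1,9.99\n", "text", "csv"),
    ],
)

# List the attachments carried by the message, in order.
attachment_names = [part.get_filename() for part in message.iter_attachments()]
print(attachment_names)
```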
Parse and consolidate CSV and Excel attachments
Parseur can automatically combine CSV and Excel files sent by email without even creating a template. Parseur will combine the files based on their column headers.
All you have to do is send your spreadsheets as email attachments to your Parseur mailbox.
Refer to the following article for more information: How to combine CSV files automatically
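To make the idea of combining files by column header concrete, here is a small Python sketch. It is not Parseur's implementation. It assumes that columns with the same header are merged and that columns missing from a file are left empty, which the article does not specify.

```python
import csv
import io


def combine_csv(texts):
    """Merge CSV documents into one list of rows, aligning values by column header."""
    headers = []  # union of all headers, in first-seen order
    rows = []
    for text in texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in headers:
                headers.append(name)
        rows.extend(reader)
    # Assumption: a column missing from a given file is filled with an empty string.
    combined = [{h: row.get(h, "") for h in headers} for row in rows]
    return headers, combined


# Two sample files with the same columns in a different order, plus one extra column.
first = "name,email\nAda,ada@example.com\n"
second = "email,name,phone\nbob@example.com,Bob,555-0100\n"

headers, combined = combine_csv([first, second])
print(headers)
for row in combined:
    print(row)
```

Matching on header names rather than column positions is what lets files with reordered columns land in the same output columns.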
Extract text from PDF attachments
Parseur supports extracting text from PDF documents. PDFs need to be text-based: Parseur does not support parsing scanned PDF documents at this point (i.e. Parseur doesn't do OCR).
Attachments will be converted to a text document. By default, Parseur will preserve the layout of the document.
Tips for parsing PDF documents with layout
To preserve the layout, converted PDF documents use space characters to separate different blocks on the same line. The number of spaces can vary from one document to another.
Parseur uses delimiters around fields to locate them in a document (see this article for more information about how Parseur works).
When creating fields in PDF documents with layout, it is recommended to include some of the spaces surrounding the text you want to capture. This keeps Parseur reliable when the number of spaces around blocks of text changes.
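The following Python sketch shows why an exact space count is fragile. It is only an illustration of layout-preserved text, not of how Parseur locates fields. The sample lines and the two-space threshold are assumptions.

```python
import re


def split_blocks(line, min_spaces=2):
    """Split a layout-preserved line into blocks separated by runs of spaces."""
    pattern = r" {%d,}" % min_spaces
    return [block for block in re.split(pattern, line.strip()) if block]


# Same layout, different amount of padding between the two blocks.
doc_a = "Invoice No: 1042      Date: 2023-01-15"
doc_b = "Invoice No: 1043  Date: 2023-02-01"

print(split_blocks(doc_a))
print(split_blocks(doc_b))
```

Both lines split into the same two blocks even though the padding differs, which is the kind of tolerance you get by including surrounding spaces in a field rather than relying on a fixed number of them.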
Basic PDF to text conversion
By default, Parseur will try to preserve the layout of the document. This is the best option in most situations.
If you want Parseur to only extract text, without the layout, go to your mailbox settings and change the PDF conversion settings.
List of all supported document formats
Parseur can extract text from most attachments, as long as they are in a text format.
Here is the list of supported document formats that you can extract text from:
|csv||Comma Separated Value|
|docm||Microsoft Office Open XML with Macros Enabled|
|docx||Microsoft Office Open XML|
|lwp||Lotus Word Pro|
|md||Markdown Documentation File|
|odt||ODF Text Document|
|pages.zip||Zipped Pages Document|
|pdf||Portable Document Format|
|rtf||Rich Text Format|
|tex||LaTeX Source Document|
|wps||Microsoft Works Document|
|xls||Microsoft Excel Document|
|xlsx||Microsoft Excel Document Open XML|
|xlsm||Microsoft Excel Document Open XML with Macros Enabled|
|zabw||Compressed AbiWord Document|
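If you want to check a file before sending it, a simple extension test against the list above is enough. This Python sketch is a hypothetical helper, not part of Parseur. It only looks at the file name and does not verify that a PDF is text-based.

```python
# Extensions taken from the list of supported formats above.
SUPPORTED_EXTENSIONS = {
    "csv", "docm", "docx", "lwp", "md", "odt", "pages.zip", "pdf",
    "rtf", "tex", "wps", "xls", "xlsx", "xlsm", "zabw",
}


def is_supported(filename):
    """Return True if the file name ends with one of the supported extensions."""
    name = filename.lower()
    return any(name.endswith("." + ext) for ext in SUPPORTED_EXTENSIONS)


for candidate in ["Invoice.PDF", "notes.pages.zip", "photo.png", "archive.zip"]:
    print(candidate, is_supported(candidate))
```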
How to keep the relationship between emails and attachments?
Sometimes you need to extract text from both the email and its attachments and you want to be able to make a link between those two sets of parsed data.
While Parseur processes every email and attachment document independently, it remembers the email every attachment belongs to and you can expose the link using Parseur DocumentID and ParentID Extra Fields.
An attachment's ParentID is the same as the DocumentID of the email it was attached to.
To enable the DocumentID and ParentID extra fields:
- Open your Parseur mailbox
- Click on the Fields section
- Scroll down to the Extra Fields panel
- Check the DocumentID and ParentID fields
Check out the following article to learn more about using Extra Fields in Parseur.
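Once both fields are present in your parsed data, linking attachments back to their email is a matter of matching ParentID to DocumentID. The Python sketch below assumes each parsed document is available as a dictionary with "DocumentID" and "ParentID" keys. The record shape, the other field names, and the ID values are hypothetical.

```python
from collections import defaultdict


def group_by_email(records):
    """Map each email's DocumentID to the attachment records that point to it."""
    attachments = defaultdict(list)
    for record in records:
        parent = record.get("ParentID")
        if parent:
            attachments[parent].append(record)
    # Assumption: a record without a ParentID is an email, not an attachment.
    return {
        record["DocumentID"]: attachments.get(record["DocumentID"], [])
        for record in records
        if not record.get("ParentID")
    }


# Hypothetical parsed results: one email with two attachments, one email with none.
records = [
    {"DocumentID": "doc-1", "ParentID": None, "Subject": "Your order"},
    {"DocumentID": "doc-2", "ParentID": "doc-1", "InvoiceNumber": "1042"},
    {"DocumentID": "doc-3", "ParentID": "doc-1", "TrackingCode": "ZX-77"},
    {"DocumentID": "doc-4", "ParentID": None, "Subject": "Hello"},
]

linked = group_by_email(records)
for email_id, parts in linked.items():
    print(email_id, [part["DocumentID"] for part in parts])
```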
How to disable attachment parsing?
By default, when you create a new mailbox, Parseur will also parse every email attachment.
If you would like to disable attachment parsing, go to your mailbox settings and uncheck the attachment parsing box.