Key takeaways:

Python can help you automate data extraction from invoices, but there is no silver bullet
Parseur leverages Python to extract data from invoices for you
PDF is not a data format, but a typographic representation of a paper document

The PDF format

The PDF format is versatile, enabling the accurate representation of paper documents, like invoices, without limiting their design. It comes from the world of print and is designed to be a digital representation of a printed page. This flexibility gives PDF creators significant freedom to express themselves and adhere to various standards and regulations.

However, the challenge arises when data is locked within a PDF. The format's free-form and complex nature can conflict with the structured and consistent approach necessary for managing the large volume of data a company processes daily.

PDF file format layers

What are the steps to extract data from an invoice?

An invoice is a document that usually comes in the PDF format. An invoice formalizes a transaction between a supplier and a customer, where a product or service is exchanged for a precise amount of money. Here are the steps required to extract data from this document:

Define a schema for the data you want to extract from your invoices
Convert your invoice from image to text
Extract the text from your invoice according to your data schema
Collect the extracted data

Invoice data extraction process

Define a schema for your invoice data

Invoices come from different suppliers, and each supplier tends to customize the way its invoices look. Despite this real-life diversity in form, the substance of all invoices is basically the same: you need a supplier, a customer, an invoice reference, a date and a list of items, each with a quantity, a description and a cost.

A good starting point for defining your invoice format is your accounting software, as that is most probably where your extracted invoice data will end up. If you just want a data format that covers all corner cases, consider the schema.org website, which conveniently defines a series of industry-standard data formats for many things, including invoices.

Parseur defines a default data schema for your invoices, but you can change it to fit your use case by renaming the fields in your invoice mailbox. Once your data format is defined, you can convert your invoice from image to text.

For example, you can define the following fields for your invoice in a Swagger-style JSON schema:

{
    "InvoiceNumber": {
        "type": "string",
        "description": "The invoice number"
    },
    "InvoiceIssueDate": {
        "type": "string",
        "description": "The invoice date"
    },
    "Items": {
        "type": "array",
        "description": "The list of items in the invoice",
        "items": {
            "type": "object",
            "properties": {
                "quantity": {
                    "type": "number",
                    "description": "The quantity of the item"
                },
                "description": {
                    "type": "string",
                    "description": "The description of the item"
                },
                "unit_price": {
                    "type": "number",
                    "description": "The unit price of the item"
                },
                "price": {
                    "type": "number",
                    "description": "The total price of the item"
                }
            }
        }
    }
}
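If the extracted data is later handled in Python, the same fields can be mirrored in code. The following is a minimal sketch, assuming Python dataclasses are an acceptable container; the class names `Invoice` and `InvoiceItem`, the helper `invoice_from_dict` and the sample values are made up for illustration and are not part of any library or of Parseur.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InvoiceItem:
    # One entry of the "Items" array in the schema above
    quantity: float
    description: str
    unit_price: float
    price: float


@dataclass
class Invoice:
    # Field names follow the schema above
    InvoiceNumber: str
    InvoiceIssueDate: str
    Items: List[InvoiceItem] = field(default_factory=list)


def invoice_from_dict(data: dict) -> Invoice:
    """Build an Invoice from a dict shaped like the schema (hypothetical helper)."""
    items = [InvoiceItem(**item) for item in data.get("Items", [])]
    return Invoice(
        InvoiceNumber=data["InvoiceNumber"],
        InvoiceIssueDate=data["InvoiceIssueDate"],
        Items=items,
    )


# Sample values made up for illustration
sample = {
    "InvoiceNumber": "INV-1234",
    "InvoiceIssueDate": "2024-05-31",
    "Items": [
        {"quantity": 2, "description": "Widget", "unit_price": 10.0, "price": 20.0},
    ],
}
invoice = invoice_from_dict(sample)
print(invoice.InvoiceNumber, len(invoice.Items))
```

A missing required field raises a `KeyError` here, which is one simple way to notice early that an extraction did not find everything the schema expects.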

Convert your invoice from image to text

Picture of an invoice taken from a smartphone

A PDF file can contain an image. For example, your employee may snap a quick shot of an invoice with their smartphone camera, save it as a PDF and send it to your accounting department. Your accounting team is then in charge of extracting the data from this invoice and getting it into your accounting system without any mistakes.

The next step is to convert this image to text using an Optical Character Recognition (OCR) system. Many such systems exist, and their results vary wildly depending on their underlying technology and the quality of the scan they are working on. One of the most popular OCR systems is Tesseract, which is written in C and C++. To use Tesseract from a Python program, you need a binding such as PyTesseract. A binding is a way to call a software library (here, Tesseract) from a language it is not written in (here, Python).

Parseur transparently detects if your document is an image and automatically converts it into text, internally. Once the document's data is in text form, it is ready to be extracted.
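As a rough illustration of this step, here is a minimal sketch in Python. It assumes that a page whose extracted text is empty is probably an image and needs OCR; this heuristic and the function names `pages_needing_ocr` and `ocr_image` are assumptions made for the example, not a description of how Parseur works. The OCR call itself requires the Tesseract program, the pytesseract package and Pillow to be installed, and rendering a PDF page to an image beforehand is left out.

```python
def pages_needing_ocr(pages):
    """Return the indexes of pages whose extracted text is empty or whitespace."""
    return [index for index, text in enumerate(pages) if not text.strip()]


def ocr_image(path):
    """Convert an image file to text with Tesseract, through pytesseract."""
    # Imported here so the rest of the script still runs without OCR installed
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


# Page texts as a text extraction library might return them (made-up values)
pages = ["Invoice number: INV-1234\n", "   \n", ""]
print(pages_needing_ocr(pages))
```

With these made-up pages, the second and third entries are flagged, since they contain no text at all. A real scan can still produce poor text after OCR, so the result is worth a manual check.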

Extract the text from your invoice according to your data schema

Once your PDF is in text (or searchable) form, you can use the pdftotext Python library to get the data out of the PDF file, as text. Here is a code snippet to extract the text from a PDF file:

import pdftotext

# Load your invoice
with open("invoice.pdf", "rb") as file_handle:
    pdf = pdftotext.PDF(file_handle)

# Iterate over all the pages
for page in pdf:
    print(page)

Name this script convert_pdf_to_text.py and run it to print the invoice as text to the standard output. If you want to redirect the output to a file, you can run:

$ python convert_pdf_to_text.py > invoice.txt

Now that you have the invoice in text form, you can extract the data you want from it, using any combination of the following techniques:

You can use a regular expression to extract the data you want. Regular expressions are a powerful way to extract data from text, but they are also very brittle. If the invoice format changes, you'll need to update your regular expression. Also, regular expressions are not very good at extracting data from tables.
You can use a visual templating system, ideally leveraging Dynamic OCR and Zonal OCR. This is a more advanced way to extract data from text. It's more robust than regular expressions, but it's also more complex to implement.
Finally, you can use a machine learning system. Machine learning is a very powerful way to extract data from text. It's also the most complex to implement. You'll need to train your machine learning system with a lot of data, which is a very time-consuming process. Also, machine learning systems are not perfect and you'll need to manually review the results to make sure they are correct.
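To give an idea of the second technique, here is a minimal sketch of zonal extraction in Python. It assumes an OCR engine has already produced words with pixel positions; the `Word` and `Zone` types, the `text_in_zone` helper and all coordinates are hypothetical and only illustrate the principle of reading the text found inside a predefined rectangle.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    left: int  # horizontal position of the word's top-left corner, in pixels
    top: int   # vertical position of the word's top-left corner, in pixels


class Zone(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def text_in_zone(words: List[Word], zone: Zone) -> str:
    """Join the words whose top-left corner falls inside the zone."""
    inside = [
        w for w in words
        if zone.left <= w.left < zone.right and zone.top <= w.top < zone.bottom
    ]
    # Naive reading order: top to bottom, then left to right
    inside.sort(key=lambda w: (w.top, w.left))
    return " ".join(w.text for w in inside)


# Made-up word positions for a single page
words = [
    Word("Invoice", 400, 40),
    Word("number:", 470, 40),
    Word("INV-1234", 540, 40),
    Word("Total", 400, 700),
    Word("99.50", 540, 700),
]
header_zone = Zone(left=380, top=20, right=700, bottom=80)
print(text_in_zone(words, header_zone))
```

The zone plays the role of a template: it works as long as the supplier keeps the invoice number in the same area of the page, and needs to be redefined when the layout moves.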

Let's extract data from your invoice with re, Python's regular expression module. Here is a code snippet to extract the invoice number from your invoice:

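import re

# Load your invoice
with open("invoice.txt", "r") as file_handle:
    invoice = file_handle.read()

# Extract the invoice number; [\w-]+ also matches the hyphen in "INV-1234"
match = re.search(r"Invoice number: ([\w-]+)", invoice)
if match:
    print(match.group(1))
else:
    # re.search returns None when the line is missing from the invoice
    print("Invoice number not found")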

Name this script extract.py and run it to print the invoice number to the standard output:

$ python extract.py

And you will get something like:

INV-1234

Note that this only works for the invoices that are formatted with a line like Invoice number: INV-1234. If the invoice format changes, you'll need to update your regular expression. This can quickly become a time sink if you have a lot of different invoice formats to deal with.
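One common way to cope with a handful of formats is to try several patterns in turn. The sketch below assumes three label variants that are made up for the example; the list `INVOICE_NUMBER_PATTERNS` and the function `find_invoice_number` are hypothetical names, and real invoices will need their own patterns.

```python
import re

# Hypothetical label variants; extend this list as new invoice layouts appear
INVOICE_NUMBER_PATTERNS = [
    r"Invoice number:\s*([\w-]+)",
    r"Invoice #\s*([\w-]+)",
    r"Invoice no\b\.?:?\s*([\w-]+)",
]


def find_invoice_number(text):
    """Return the first invoice number matched by any known pattern, or None."""
    for pattern in INVOICE_NUMBER_PATTERNS:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return None


print(find_invoice_number("Invoice number: INV-1234"))
print(find_invoice_number("INVOICE NO. A77"))
print(find_invoice_number("Delivery note"))
```

Every new supplier layout still means a new pattern to write and maintain, which is exactly where the maintenance cost comes from.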

Parseur can help you with that.

If you decide to use regular expression parsing, our template engine will help you manage them, and we also give you access to a library of templates that we built over the years. We let you select the best extraction method for your use case: You can use regular expressions, but also visual template building and matching (OCR engine), or machine learning (AI engine). You can even combine them to get the best of all worlds. Parseur also lets you review the results of the extraction and correct them if needed. This is a very important step as it will help you improve the extraction accuracy over time.

Collect the extracted data

With Python, you can iterate over the invoice files in a given folder and extract the data from them. Let's say we extract the invoice number and total amount, and output the result in CSV format:

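import csv
import os
import re
import sys

import pdftotext

# Write CSV rows to the standard output; the csv module quotes values that contain commas
writer = csv.writer(sys.stdout, lineterminator="\n")

# Print the CSV column header once, before processing any file
writer.writerow(["InvoiceNumber", "TotalAmount"])

# Iterate over all the PDF files in the folder
for filename in sorted(os.listdir("invoices/")):
    if not filename.endswith(".pdf"):
        continue

    # Load your invoice
    with open(os.path.join("invoices", filename), "rb") as file_handle:
        pdf = pdftotext.PDF(file_handle)

    # Iterate over all the pages
    for page in pdf:
        # Extract the invoice number and the total amount
        # (assumes a plain number such as 1,234.56 follows "Total amount: ")
        invoice_number = re.search(r"Invoice number: ([\w-]+)", page)
        total_amount = re.search(r"Total amount: ([\d.,]+)", page)
        # Skip pages that do not contain both fields
        if invoice_number and total_amount:
            writer.writerow([invoice_number.group(1), total_amount.group(1)])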

Name this script extract_to_csv.py and run it to print the invoice numbers and total amounts to the standard output. You can redirect that output to a CSV file and later open it with your favorite spreadsheet software, like Excel:

$ python extract_to_csv.py > invoices.csv
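Amounts extracted as text usually need to be converted to numbers before they go into accounting software. Here is a minimal sketch, assuming amounts use "," as a thousands separator and "." as the decimal point; invoices that use other conventions, such as a decimal comma, would need different handling. The function name `parse_amount` is made up for this example.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(raw):
    """Convert a string such as "1,234.56" to a Decimal, or return None."""
    # Assumption: "," is a thousands separator and "." is the decimal point
    cleaned = raw.strip().replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # The text is not a number, for example "n/a" or an empty string
        return None


print(parse_amount("1,234.56"))
print(parse_amount("n/a"))
```

Decimal is used rather than float so that monetary values keep their exact decimal digits.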

In Parseur, once you have extracted the data from your invoices, you can download it all as a spreadsheet, or directly export it to your accounting software, using direct webhook integration, Make, Zapier or Microsoft Power Automate. You can also use our API to retrieve the data in JSON format.

Conclusion

I hope this article was useful to you. In short, extracting data from invoices with Python is a complex process, but it can be done, as long as your invoices are consistent and you have a lot of time to spend on it. In case your time is limited, you can use the accumulated experience and knowledge that's been built into Parseur over the years, to get to your destination quicker, and extract data from all your invoices with flexibility and accuracy.

Last updated on May 31st, 2024
