OCR-based parsing engine for PDF documents

by Sylvain Josserand

Howdy, I'm Sylvain, building software here at Parseur. We just released a major feature: a new system to parse PDF files visually.


New: Extract data from PDF visually

Parsing PDF documents using OCR is the most requested feature on our feature upvote page.

Improved reliability for complex documents

We used to convert PDF documents into text, trying to preserve the original layout of the pages. It worked great for simple documents (and that's why we are keeping the text engine along with the new one).

However, the conversion to text made it particularly difficult for our legacy, text-based engine to reliably extract data from complex PDF documents.

That is why we are introducing a new parsing engine, called OCR (for Optical Character Recognition). The OCR template editor allows you to create templates by drawing boxes around the text you want to extract. You can also define labels that act as landmarks or anchors in your document, helping the engine position the fields on the page.
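To make the idea of labels as anchors more concrete, here is a minimal sketch in Python. It is not Parseur's actual implementation: the `Box` class, the `reposition_field` function and the coordinates are all hypothetical, and the sketch assumes a label only shifts a field by a simple horizontal and vertical offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangle on the page (hypothetical structure, units are arbitrary)."""
    x: float
    y: float
    width: float
    height: float


def reposition_field(template_label: Box, template_field: Box, found_label: Box) -> Box:
    """Shift a field box by the same offset the label moved by.

    Assumption: the field keeps its size and its position relative to the label.
    """
    dx = found_label.x - template_label.x
    dy = found_label.y - template_label.y
    return Box(template_field.x + dx, template_field.y + dy,
               template_field.width, template_field.height)


# In the template, the label sits at y=100 with the field box to its right.
template_label = Box(50, 100, 80, 12)
template_field = Box(140, 100, 120, 12)

# In a new document, the same label is found 30 units lower on the page.
found_label = Box(50, 130, 80, 12)

shifted_field = reposition_field(template_label, template_field, found_label)
print(shifted_field)
```

In this sketch, the field box follows the label down the page, which is the intuition behind using labels as landmarks.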

You'll find more detailed information on our support page here: Create your first OCR template.

Optional fields, at last!

This new engine allows you to define optional fields, and is more resilient to small changes in the document layout. Templates are also faster to build and easier to adjust, without having to create them from scratch, because you can attach several samples to a given template. Attaching several samples is also what allows you to define fields that may show up on some documents but not all.

Complete backward compatibility

All the current features, such as tables, metadata, post-processing and static fields, keep working with the new engine. The output data format is the same, and webhooks are unchanged.

This new engine works alongside the current one, and you can even mix and match the templates from both engines in the same mailbox, to get the best of both worlds.

If you have both text-based and OCR templates in your mailbox, the template with the most fields will take priority over the others.
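The priority rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Template` class, the `pick_template` function and the sample templates are hypothetical, and how ties are broken is not something this post specifies (here, the first template listed wins).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Template:
    """Hypothetical stand-in for a template in a mailbox."""
    name: str
    engine: str  # "text" or "ocr"
    fields: List[str]


def pick_template(matching: List[Template]) -> Optional[Template]:
    """Return the matching template with the most fields, whatever its engine.

    Assumption: on a tie, the first template in the list is kept.
    """
    if not matching:
        return None
    return max(matching, key=lambda template: len(template.fields))


matching_templates = [
    Template("invoice-text", "text", ["number", "total"]),
    Template("invoice-ocr", "ocr", ["number", "total", "due_date"]),
]

chosen = pick_template(matching_templates)
print(chosen.name)
```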

Per-page pricing

One credit is now charged for each successfully parsed page. If a document is not divided into pages (like a long email or a spreadsheet), then just one credit is charged when that document is successfully processed, regardless of the document's length, as usual.
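As a quick worked example of the pricing rule, here is a small Python sketch. The function name `credits_charged` and its parameters are hypothetical and only restate the rule described above; this is not our billing code.

```python
def credits_charged(paginated: bool, pages_parsed_ok: int = 0, processed_ok: bool = False) -> int:
    """Credits used by one document under the per-page rule (illustrative only).

    paginated=True  -> one credit per successfully parsed page (e.g. a PDF)
    paginated=False -> one credit if the document was processed successfully,
                       regardless of its length (e.g. an email or a spreadsheet)
    """
    if paginated:
        return pages_parsed_ok
    return 1 if processed_ok else 0


pdf_credits = credits_charged(paginated=True, pages_parsed_ok=5)
email_credits = credits_charged(paginated=False, processed_ok=True)
failed_email_credits = credits_charged(paginated=False, processed_ok=False)

print(pdf_credits, email_credits, failed_email_credits)
```

So a PDF with five successfully parsed pages would use five credits, while a long email still uses a single credit.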

Beta starts now

We are now starting the beta phase: you can try the new engine simply by asking us in the chat or by email at support@parseur.com. The goal here is to collect your feedback, improve the system, fix bugs and implement the features you need. If you decide to join the beta program, expect bugs and imperfections, but please report them to us; we'll fix them as quickly as we can.

What's next?

After the beta phase is over and the new OCR engine is available to everyone, we plan to make it work with all HTML documents, such as emails and web pages.
