OCR engine for parsing PDF documents

by Sylvain Josserand

Howdy, I'm Sylvain, building software here at Parseur. We just released our biggest feature yet: a new system to parse PDF files visually.

New: Extract data from PDF visually

Parsing PDF documents using OCR is the most requested feature on our feature upvote page.

Improved reliability for complex documents

We used to convert PDF documents into text, trying to preserve the original layout of the pages. It worked great for simple documents (and that is why we are keeping the text engine alongside the new one).

However, this conversion to text made it particularly difficult for our legacy, text-based engine to reliably extract data from complex PDF documents.

That is why we are introducing a new parsing engine, called OCR (for Optical Character Recognition). The OCR template editor allows you to create templates by drawing boxes around the text you want to extract. You can also define labels that act as landmarks or anchors in your document, helping the engine position the fields on the page.

You'll find more detailed information on our support page: Create your first OCR template.
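
To picture how a label can act as an anchor, here is a minimal Python sketch. It is an illustration only, not Parseur's actual implementation: the `Box` type, the `reposition_field` helper, and the idea of storing each field as a fixed offset from its label are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle on a page, in arbitrary page units (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float


def reposition_field(template_label: Box, template_field: Box, found_label: Box) -> Box:
    """Shift the field box by the same amount the label moved.

    Assumption: the field keeps a constant offset from its label.
    """
    dx = found_label.x - template_label.x
    dy = found_label.y - template_label.y
    return Box(
        template_field.x + dx,
        template_field.y + dy,
        template_field.width,
        template_field.height,
    )


# In the template, a "Total" label sits at (50, 700) and the amount box is drawn to its right.
label_in_template = Box(50, 700, 40, 12)
field_in_template = Box(100, 700, 60, 12)

# In a new document the label is found 30 units further along the y axis,
# so the field box is moved by the same 30 units.
label_in_document = Box(50, 730, 40, 12)
field_in_document = reposition_field(label_in_template, field_in_template, label_in_document)
print(field_in_document)
```

The point of the sketch is only that a field positioned relative to a label follows that label when the layout shifts, which is what makes a label useful as a landmark.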

Optional fields, at last!

This new engine allows you to define optional fields and is more resilient to small changes in the document layout. Templates are also faster to build and easier to adjust, without having to recreate them from scratch, because you can attach several samples to a given template. Attaching several samples is also what lets you define fields that show up on some documents but not all.

Complete backward compatibility

All the current features, such as tables, metadata, post-processing and static fields, keep working with the new engine. The output data format is the same, and webhooks are unchanged.

This new engine works alongside the current one, and you can even mix and match templates from both engines in the same mailbox to get the best of both worlds.

If you have both text-based and OCR templates in your mailbox, the template with the most fields will take priority over the others.
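
As a rough illustration of that priority rule, here is a small Python sketch. The `Template` class and `pick_template` function are hypothetical names invented for this example, and the sketch assumes the candidates passed in are the templates that could apply to the document. How ties are broken is not stated here, so the sketch simply keeps the first template with the highest field count.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Template:
    """Hypothetical stand-in for a template in a mailbox."""
    name: str
    engine: str  # "text" or "ocr"
    field_count: int


def pick_template(candidates: Iterable[Template]) -> Optional[Template]:
    """Return the candidate with the most fields, or None if there are no candidates."""
    return max(candidates, key=lambda t: t.field_count, default=None)


mailbox_templates = [
    Template("invoice-text", "text", 6),
    Template("invoice-ocr", "ocr", 9),
]

chosen = pick_template(mailbox_templates)
print(chosen.name if chosen else "no template")
```

Note that the engine type plays no role in the choice: only the number of fields is compared.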

Per-page pricing

One credit is now counted for each successfully parsed page. If a document is not divided into pages (like a long email or a spreadsheet), then just one credit is counted when that document is successfully processed, regardless of the document's length, as usual.
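
The arithmetic can be sketched in a few lines of Python. The `credits_used` function and its parameters are hypothetical and only restate the rule above. They are not part of any Parseur API.

```python
from typing import Optional


def credits_used(parsed_pages: Optional[int], processed_ok: bool = True) -> int:
    """Estimate the credits counted for one document.

    parsed_pages: number of successfully parsed pages for a paginated document
                  (such as a PDF), or None for a document without pages
                  (such as an email or a spreadsheet).
    processed_ok: whether a document without pages was processed successfully.
                  It is ignored for paginated documents.
    """
    if parsed_pages is None:
        # No pages: a single credit, whatever the document's length.
        return 1 if processed_ok else 0
    if parsed_pages < 0:
        raise ValueError("parsed_pages cannot be negative")
    # Paginated: one credit per successfully parsed page.
    return parsed_pages


print(credits_used(12))    # 12-page PDF, every page parsed
print(credits_used(None))  # long email, processed successfully
```

Under these assumptions, a 12-page PDF in which every page parses counts 12 credits, while a long email still counts 1.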

What's next?

After the beta phase is over and the new OCR engine is available to everyone, we plan to make it work with all HTML documents, such as emails and web pages.

Live updates on our progress to public release

April 2022

  • Added custom page header and footer margin setup for table fields.
  • Added option to split a PDF into several documents every X pages.
  • Added row merge options to table fields.
  • Improved field-level error messages in template editor and debugger.
  • Improved parsing engine accuracy.
  • Improved UX on the template editor.
  • Fixed bugs reported to us by our fearless beta testers.
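
To make the "split every X pages" option from the list above concrete, here is a minimal Python sketch of how page ranges could be computed. The function name `split_page_ranges` is hypothetical, and the behavior for a final, shorter chunk (it simply gets the remaining pages) is an assumption rather than documented behavior.

```python
from typing import List, Tuple


def split_page_ranges(total_pages: int, every: int) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges of at most `every` pages."""
    if total_pages < 0:
        raise ValueError("total_pages cannot be negative")
    if every < 1:
        raise ValueError("every must be at least 1")
    return [
        (first, min(first + every - 1, total_pages))
        for first in range(1, total_pages + 1, every)
    ]


# A 10-page PDF split every 4 pages gives three documents.
print(split_page_ranges(10, 4))  # [(1, 4), (5, 8), (9, 10)]
```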

May 2022

  • Enrolled more users into the beta testing program.
  • Added template sample management (add description, remove samples).
  • Improved the template editor to highlight optional fields, and to highlight the labels related to a field on hover (and vice versa).
  • Improved text extraction accuracy by using the encoded text layer in the PDF, when present, rather than OCR.
  • Opened the beta program to anybody via self opt-in on the account page (scroll to the Beta features section).
  • Squashed bugs reported by our customers.
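
The text-layer improvement in the list above boils down to a simple fallback, sketched here in Python. This is an assumption-laden illustration, not our production code: `page_text` is a hypothetical helper, and `run_ocr` is a placeholder callable standing in for whatever OCR step would be used.

```python
from typing import Callable, Optional


def page_text(embedded_text: Optional[str], run_ocr: Callable[[], str]) -> str:
    """Prefer the text already encoded in the PDF page and fall back to OCR otherwise."""
    if embedded_text is not None and embedded_text.strip():
        # The page carries a usable text layer: use it as-is, with no recognition errors.
        return embedded_text
    # No text layer (for example a scanned page): recognize the text from the image.
    return run_ocr()


def fake_ocr() -> str:
    """Placeholder OCR step used only for this example."""
    return "text recognized from the page image"


print(page_text("Invoice #123", fake_ocr))  # uses the embedded text layer
print(page_text(None, fake_ocr))            # falls back to the OCR step
```

Passing the OCR step as a callable means it only runs when it is actually needed.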

June 2022

  • We're close to public release. Several customers are now using the new engine every day to parse their PDFs!
  • Enrolled more users into the beta testing program.
  • Improved line detection and extraction of multiline fields.
  • Improved detection and extraction of table rows and cells.
  • Created additional support documentation: Create OCR Template, Use Labels to position fields, Extract PDF tables.
  • Squashed more bugs reported by our customers (thanks everybody!).

July 2022: we are live 🎉

After months of work and weeks of testing, the OCR engine is live for everybody! This marks version 4 of Parseur, our biggest feature update yet.

  • Activated the OCR parsing engine for all our users.
  • Squashed some bugs and improved the user experience across the board with many small usability enhancements.
  • Published a 13-minute tutorial on how to extract text from PDFs using our new OCR engine.
