Howdy, I'm Sylvain, building software here at Parseur. We just released our biggest feature yet: a new system to parse PDF files visually.

New: Extract data from PDF visually

Parsing PDF documents using OCR is the most requested feature on our feature upvote page.

Improved reliability for complex documents

We used to convert PDF documents into text, trying to preserve the original layout of the pages. It worked great for simple documents (and that's why we are keeping the text engine along with the new one).

However, the conversion cannot always preserve the layout, which made it particularly difficult for our legacy, text-based engine to reliably extract data from complex PDF documents.

That is why we are introducing a new parsing engine, called OCR (for Optical Character Recognition). The OCR template editor allows you to create templates by drawing boxes around the text you want to extract. You can also define labels that act as landmarks or anchors in your document, helping the engine position the fields on the page.
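To make the idea of labels as anchors more concrete, here is a minimal sketch of how a field box could follow a label when the layout shifts. It is an illustration only, not how the Parseur engine is implemented: the `Box` type, the `reposition_field` function and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangle on the page (hypothetical units, origin at the top-left)."""
    x: float
    y: float
    width: float
    height: float


def reposition_field(field_in_sample: Box, label_in_sample: Box, label_in_document: Box) -> Box:
    """Move the field box by the same offset the label moved between sample and document."""
    dx = label_in_document.x - label_in_sample.x
    dy = label_in_document.y - label_in_sample.y
    return Box(
        field_in_sample.x + dx,
        field_in_sample.y + dy,
        field_in_sample.width,
        field_in_sample.height,
    )


# In the sample, a "Total" label sits at (400, 700) with the amount box to its right.
label_sample = Box(400, 700, 50, 12)
field_sample = Box(460, 700, 80, 12)

# In a new document, the same label is found 30 units lower on the page.
label_doc = Box(400, 730, 50, 12)

moved = reposition_field(field_sample, label_sample, label_doc)
print(moved)
```

Because the field is located relative to the label rather than to the page, the box still lands on the right text when an extra line pushes the content down.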

You'll find more detailed information on our support page here: Create your first OCR template.

Optional fields, at last!

This new engine allows you to define optional fields, and is more resilient to small changes in the document layout. Templates are also faster to build and easier to adjust, without having to create them from scratch. This is because you can attach several samples to a given template, which lets you define fields that show up on some documents but not all.

Complete retro-compatibility

All the current features, such as tables, metadata, post-processing and static fields, keep working with the new engine. The output data format is the same, and webhooks are unchanged.

This new engine works alongside the current one, and you can even mix and match the templates from both engines in the same mailbox, to get the best of both worlds.

If you have both text-based and OCR templates in your mailbox, the template with the most fields will take priority over the others.
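The priority rule above can be sketched in a few lines. This is only an illustration under assumptions: the `Template` class and `pick_template` function are hypothetical names, the candidates are assumed to be the templates that already match the document, and the tie-breaking behavior (first one wins) is a guess rather than documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Template:
    """Hypothetical stand-in for a mailbox template."""
    name: str
    engine: str  # "text" or "ocr"
    field_count: int


def pick_template(matching: List[Template]) -> Optional[Template]:
    """Return the matching template with the most fields, whatever its engine."""
    if not matching:
        return None
    # max() keeps the first candidate on ties; real tie-breaking is an assumption here.
    return max(matching, key=lambda t: t.field_count)


templates = [
    Template("invoice-text", "text", 4),
    Template("invoice-ocr", "ocr", 7),
]
chosen = pick_template(templates)
print(chosen.name if chosen else "no template matched")
```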

Per-page pricing

One credit is now counted for each successfully parsed page. If a document is not divided into pages (like a long email or a spreadsheet), then just one credit is counted when that document gets successfully processed, regardless of the document's length, as usual.
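As a quick worked example of the per-page rule, here is a small sketch. The function names are hypothetical, and it assumes that a document that fails to process costs no credit, which is how we read "successfully" above.

```python
def credits_for_paged_document(successfully_parsed_pages: int) -> int:
    """Paged documents such as PDFs: one credit per successfully parsed page."""
    if successfully_parsed_pages < 0:
        raise ValueError("page count cannot be negative")
    return successfully_parsed_pages


def credits_for_unpaged_document(processed_successfully: bool) -> int:
    """Emails, spreadsheets and other documents without pages: one credit, whatever the length."""
    return 1 if processed_successfully else 0


# A 5-page PDF where every page parses, plus one long email.
total = credits_for_paged_document(5) + credits_for_unpaged_document(True)
print(total)
```

In this example the PDF accounts for 5 credits and the email for 1, so the total is 6.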

What's next?

After the beta phase is over and the new OCR engine is available for all, we plan to make it work with all HTML documents such as emails and web pages.

Live updates on our progress to public release

April 2022

Added custom page header and footer margin setup for table fields.
Added option to split a PDF into several documents every X pages.
Added row merge options to table fields.
Improved field-level error messages in template editor and debugger.
Improved parsing engine accuracy.
Improved UX on the template editor.
Fixed bugs reported to us by our fearless beta testers.
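The option to split a PDF every X pages, listed above, comes down to chunking page numbers. Here is a minimal sketch; `split_every` is a hypothetical helper, and keeping a shorter last chunk when the page count is not a multiple of X is an assumption.

```python
from typing import List


def split_every(total_pages: int, pages_per_document: int) -> List[List[int]]:
    """Group 1-based page numbers into consecutive chunks of at most pages_per_document."""
    if pages_per_document < 1:
        raise ValueError("pages_per_document must be at least 1")
    pages = list(range(1, total_pages + 1))
    return [
        pages[start:start + pages_per_document]
        for start in range(0, total_pages, pages_per_document)
    ]


# A 7-page PDF split every 3 pages gives three documents, the last one a single page.
chunks = split_every(7, 3)
print(chunks)
```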

May 2022

Enrolled more users into the beta testing program.
Added template sample management (add description, remove samples).
Improved the template editor to highlight optional fields, and to highlight the labels related to a field on hover and vice versa.
Improved text extraction accuracy by using the PDF's encoded text layer, if present, rather than OCR.
Opened the beta program to anybody via self opt-in on the account page.
Squashed bugs reported by our customers.

June 2022

We're close to public release. Several customers are now using the new engine every day to parse their PDFs!
Enrolled more users into the beta testing program.
Improved line detection and extraction of multiline fields.
Improved table row and cell detection and extraction.
Created additional support documentation: Create OCR Template, Use Labels to position fields, Extract PDF tables.
Squashed more bugs reported by our customers (thanks everybody!).

July 2022: we are live 🎉

After months of work and weeks of testing, the OCR engine is live for everybody! This marks version 4 of Parseur, our biggest feature update yet.

Activated the OCR parsing engine for all our users.
Squashed some bugs and improved user experience across the board with many small usability enhancements.
Published a 13-minute tutorial on how to extract text from PDFs using our new OCR engine.

Last updated on March 7th, 2022

OCR engine for parsing PDF documents
