Howdy, I'm Sylvain, building software here at Parseur. We just released our biggest feature yet: a new system to parse PDF files visually.
New: Extract data from PDF visually
Parsing PDF documents using OCR is the most requested feature on our feature upvote page.
Improved reliability for complex documents
We used to convert PDF documents into text, trying to preserve the original layout of the pages. It worked great for simple documents (and that is why we are keeping the text engine along with the new one).
However, this text conversion made it particularly difficult for our legacy, text-based engine to reliably extract data from complex PDF documents.
That is why we are introducing a new parsing engine, called OCR (for Optical Character Recognition). The OCR template editor lets you create templates by drawing boxes around the text you want to extract. You can also define labels that act as landmarks, or anchors, in your document, helping the engine position the fields on the page.
You'll find more detailed information on our support page here: Create your first OCR template.
Optional fields, at last!
This new engine allows you to define optional fields, and is more resilient to small changes in the document layout. Templates are also faster to build and easier to adjust, without having to recreate them from scratch, because you can attach several samples to a given template. Attaching several samples is also what lets you define fields that show up on some documents but not all.
Complete retro-compatibility
All the current features, such as tables, metadata, post-processing and static fields, keep working with the new engine. The output data format is the same, and webhooks are unchanged.
This new engine works alongside the current one, and you can even mix and match the templates from both engines in the same mailbox, to get the best of both worlds.
If you have both text-based and OCR templates in your mailbox, the template with the most fields will take priority over the others.
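As a rough illustration of that priority rule (a sketch, not Parseur's actual implementation), the selection could look like this in Python. The `Template` class, its attributes, and `pick_template` are hypothetical names made up for this example, and the tie-breaking behavior is an assumption, since it is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical stand-in for a template in a mailbox."""
    name: str
    engine: str  # "text" or "ocr"
    fields: list[str] = field(default_factory=list)


def pick_template(candidates: list[Template]) -> Template:
    """Return the candidate template that defines the most fields.

    Assumption: on a tie, the first candidate in the list wins
    (max() keeps the first maximal element it encounters).
    """
    if not candidates:
        raise ValueError("no candidate templates")
    return max(candidates, key=lambda t: len(t.fields))


text_tpl = Template("invoice-text", "text", ["total"])
ocr_tpl = Template("invoice-ocr", "ocr", ["total", "date", "vendor"])
chosen = pick_template([text_tpl, ocr_tpl])  # the OCR template has more fields
```

Note that the engine type plays no role in the choice here: only the number of fields matters.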
Per-page pricing
One credit is now counted for each successfully parsed page. If a document is not divided into pages (like a long email or a spreadsheet), then just one credit is counted when that document is successfully processed, regardless of its length, as usual.
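To make the arithmetic concrete, here is a minimal sketch of the per-page rule. The function name `credits_used` and the convention of passing `None` for documents without pages are assumptions made for this example, not part of Parseur's API.

```python
from typing import Optional


def credits_used(parsed_pages: Optional[int]) -> int:
    """Credits counted for one successfully processed document.

    parsed_pages is the number of successfully parsed pages for a
    paginated document such as a PDF, or None for a document without
    pages such as an email or a spreadsheet (hypothetical convention).
    """
    if parsed_pages is None:
        return 1  # documents without pages count for a single credit
    if parsed_pages < 0:
        raise ValueError("parsed_pages cannot be negative")
    return parsed_pages  # one credit per successfully parsed page


pdf_credits = credits_used(5)       # a PDF with 5 successfully parsed pages
email_credits = credits_used(None)  # a long email, regardless of its length
```

Under these assumptions, the five-page PDF counts for five credits and the email for one.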
What's next?
After the beta phase is over and the new OCR engine is available for all, we plan to make it work with all HTML documents such as emails and web pages.
Live updates on our progress to public release
April 2022
- Added custom page header and footer margin setup for table fields.
- Added option to split a PDF into several documents every X pages.
- Added row merge options to table fields.
- Improved field-level error messages in template editor and debugger.
- Improved parsing engine accuracy.
- Improved UX on the template editor.
- Fixed bugs reported to us by our fearless beta testers.
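The "split a PDF into several documents every X pages" option listed above boils down to grouping page numbers into fixed-size chunks. Here is a small sketch of that idea. The function name `split_every` is hypothetical, and the handling of a shorter final chunk is an assumption rather than documented behavior.

```python
def split_every(page_count: int, pages_per_document: int) -> list[list[int]]:
    """Group 1-based page numbers into consecutive chunks.

    Assumption: when page_count is not a multiple of
    pages_per_document, the last chunk holds the remaining pages.
    """
    if pages_per_document < 1:
        raise ValueError("pages_per_document must be at least 1")
    chunks = []
    for start in range(1, page_count + 1, pages_per_document):
        end = min(start + pages_per_document, page_count + 1)
        chunks.append(list(range(start, end)))
    return chunks


# A 7-page PDF split every 3 pages: [[1, 2, 3], [4, 5, 6], [7]]
documents = split_every(7, 3)
```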
May 2022
- Enrolled more users into the beta testing program.
- Added template sample management (add description, remove samples).
- Improved the template editor to highlight optional fields and, on hover, the labels related to a field (and vice versa).
- Improved text extraction accuracy by using the text layer encoded in the PDF, when present, rather than OCR.
- Opened the beta program to anybody via self opt-in in the account page.
- Squashed bugs reported by our customers.
June 2022
- We're close to public release. Several customers are now using the new engine every day to parse their PDFs!
- Enrolled more users into the beta testing program.
- Improved line detection and extraction of multiline fields.
- Improved detection and extraction of table rows and cells.
- Created additional support documentation: Create OCR Template, Use Labels to position fields, Extract PDF tables.
- Squashed more bugs reported by our customers (thanks everybody!).
July 2022: we are live 🎉
After months of work and weeks of testing, the OCR engine is live for everybody! This marks version 4 of Parseur, our biggest feature update yet.
- Activated the OCR parsing engine for all our users.
- Squashed some bugs and improved the user experience across the board with many small usability enhancements.
- Published a 13-minute tutorial on how to extract text from PDFs using our new OCR engine.