What are the preprocessing options for documents?
Docparser offers several preprocessing options that are particularly helpful when you are dealing with scanned documents. As a rule of thumb, the higher the quality of your scanned document, the easier it is to extract data reliably.
The following preprocessing options are available in Docparser and can be activated in 'Settings > Preprocessing' in your document parser.
Please note: preprocessing filters are not applied to documents already present in your parser. Documents you wish to apply new preprocessing filters to must be re-uploaded to your parser.
Flatten & Rewrite Documents
PDF documents can sometimes contain minor errors which lead to parsing problems. Rewriting the document will fix common errors and create a clean version of your PDF file. Activate this option if your position-based parsing rules (e.g. Table Extraction, Fixed Position Text) are not returning data as expected. This option is also helpful for making PDF form data accessible.
Optical Character Recognition (OCR)
Optical Character Recognition is applied automatically whenever we don't find any text data in your document. In some cases you might want to force the OCR, e.g. when you have documents with text hidden inside images.
Clean & Normalize Scans For OCR
Scanned documents are sometimes misaligned, contain scanning artefacts or lack sufficient contrast. The accuracy of Optical Character Recognition (OCR) and position-based data extraction can be dramatically improved when you clean & normalize your scanned documents.
Keep & Drop Specific Pages When Importing
Getting rid of pages which do not contain any useful data can make the creation of parsing rules much easier. You can remove pages by page range or keyword matches.
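To illustrate the idea of dropping pages by page range or keyword match, here is a minimal Python sketch. It is not Docparser's implementation and does not use any Docparser API: the function name `keep_pages`, its parameters, and the sample page texts are all hypothetical, and each page is assumed to be available as a plain text string.

```python
def keep_pages(pages, drop_ranges=(), drop_keywords=()):
    """Return (page_number, text) pairs for pages that are not dropped.

    pages:         list of page texts, first entry is page 1 (assumption)
    drop_ranges:   iterable of (first, last) page numbers, inclusive
    drop_keywords: iterable of strings; a page containing any of them
                   (case-insensitive) is dropped
    """
    kept = []
    for number, text in enumerate(pages, start=1):
        in_range = any(first <= number <= last for first, last in drop_ranges)
        has_keyword = any(k.lower() in text.lower() for k in drop_keywords)
        if not (in_range or has_keyword):
            kept.append((number, text))
    return kept


# Hypothetical four-page scan: a cover sheet, two invoices, and a terms page.
pages = [
    "Cover sheet",
    "Invoice 1001 Total: 50.00",
    "Terms and Conditions",
    "Invoice 1002 Total: 75.00",
]
kept = keep_pages(pages, drop_ranges=[(1, 1)], drop_keywords=["terms and conditions"])
kept_numbers = [number for number, _ in kept]
```

In this sketch only the two invoice pages remain, so parsing rules would only need to handle pages that actually carry data.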
Split Documents When Importing
Docparser works best when each data set is represented by one document, e.g. one invoice per imported document. You can use our document splitting function for documents containing multiple data sets on multiple pages. Once set, we will automatically split up new documents into individual files according to the rules you provide.
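As a rough illustration of what a splitting rule does, the following Python sketch starts a new document whenever a page contains a given keyword. This is an assumption about one possible rule, not a description of how Docparser splits files internally; `split_on_keyword` and the sample page texts are hypothetical.

```python
def split_on_keyword(pages, keyword):
    """Group page texts into documents, starting a new one at each keyword hit.

    pages:   list of page texts in reading order
    keyword: string marking the first page of a data set (case-insensitive)
    """
    documents = []
    for text in pages:
        # Open a new document on a keyword match, or for the very first page.
        if keyword.lower() in text.lower() or not documents:
            documents.append([])
        documents[-1].append(text)
    return documents


# Hypothetical batch scan containing three invoices on five pages.
pages = [
    "Invoice No. 1001",
    "Line items continued",
    "Invoice No. 1002",
    "Invoice No. 1003",
    "Line items continued",
]
docs = split_on_keyword(pages, "Invoice No.")
page_counts = [len(d) for d in docs]
```

Each resulting group corresponds to one data set, which matches the "one invoice per imported document" layout described above.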
Decrypt Password Protected Documents
Password-protected files can be decrypted automatically if you provide the password.