What are the preprocessing options for documents?

Docparser comes with a variety of preprocessing options that are particularly helpful when you are dealing with scanned documents. As a rule of thumb, the higher the quality of your scanned document, the easier it is to extract data reliably.

The following preprocessing options are available in Docparser and can be activated in 'Settings > Preprocessing' in your document parser.

Please note: preprocessing filters are not applied to documents that are already in your parser. To apply new preprocessing filters to a document, re-upload it to your parser.

Flatten & Rewrite Documents

PDF documents can sometimes contain minor errors which lead to parsing problems. Rewriting the document will fix common errors and create a clean version of your PDF file. Activate this option if your position-based parsing rules (e.g. Table Extraction, Fixed Position Text) are not returning data as expected. This option also helps make PDF form data accessible.

Optical Character Recognition (OCR)

Optical Character Recognition is applied automatically whenever we don't find any text data in your document. In some cases you might want to force OCR, e.g. when your documents have text hidden inside images.

Clean & Normalize Scans For OCR

Scanned documents are sometimes misaligned, contain scanning artefacts, or do not have enough contrast. The accuracy of Optical Character Recognition (OCR) and position-based data extraction can be dramatically improved when you clean & normalize your scanned documents.

Keep & Drop Specific Pages When Importing

Getting rid of pages which do not contain any useful data can make the creation of parsing rules much easier. You can remove pages by page range or keyword matches.
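
To make the idea of dropping pages by range or keyword more concrete, here is a minimal Python sketch. It is only an illustration of the concept and not how Docparser implements this option: the function name `drop_pages`, its parameters, and the sample page texts are all hypothetical, and each page is assumed to be available as a plain text string.

```python
def drop_pages(pages, drop_range=None, drop_keywords=()):
    """Return the page texts that survive the drop rules (hypothetical helper).

    pages: list of page texts, one string per page.
    drop_range: optional (first, last) tuple of 1-based page numbers, inclusive.
    drop_keywords: a page containing any of these keywords is dropped
                   (the match is case-insensitive).
    """
    kept = []
    for number, text in enumerate(pages, start=1):
        in_range = drop_range is not None and drop_range[0] <= number <= drop_range[1]
        has_keyword = any(keyword.lower() in text.lower() for keyword in drop_keywords)
        if not (in_range or has_keyword):
            kept.append(text)
    return kept


# Made-up sample document: a cover sheet, two invoices and a terms page.
pages = [
    "Cover sheet",
    "Invoice 1001 Total: 50.00",
    "Terms and Conditions",
    "Invoice 1002 Total: 75.00",
]

# Drop page 1 by range and the terms page by keyword.
remaining = drop_pages(pages, drop_range=(1, 1), drop_keywords=["terms and conditions"])
print(remaining)
```

In this sample, only the two invoice pages remain, so parsing rules only need to handle pages that actually contain data.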

Split Documents When Importing

Docparser works best when each data set is represented by one document, e.g. one invoice per imported document. You can use our document splitting function for documents containing multiple data sets on multiple pages. Once set up, we will automatically split new documents into individual files according to the rules you provide.
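
As a rough illustration of what a splitting rule does, the following Python sketch starts a new document whenever a page contains a given keyword. This is an assumption made for the example, not a description of Docparser's actual splitting rules: the function name `split_on_keyword`, the keyword, and the sample pages are hypothetical.

```python
def split_on_keyword(pages, start_keyword):
    """Group page texts into documents (hypothetical helper).

    A new document starts on every page that contains start_keyword
    (case-insensitive). Pages before the first match form their own document.
    """
    documents = []
    for text in pages:
        starts_new = start_keyword.lower() in text.lower()
        if starts_new or not documents:
            documents.append([])
        documents[-1].append(text)
    return documents


# Made-up sample: one uploaded file containing two invoices.
pages = [
    "Invoice No. 1001",
    "Line items (continued)",
    "Invoice No. 1002",
]

for document in split_on_keyword(pages, "Invoice No."):
    print(document)
```

Here the three-page upload becomes two documents, one per invoice, which matches the "one data set per document" setup described above.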

Decrypt Password Protected Documents

Password-protected files can be decrypted automatically when you provide the password.

