How to improve the accuracy of Optical Character Recognition (OCR)?

Optical Character Recognition (OCR) is the process of turning a scanned document into editable and machine-readable text. At Docparser, we automatically apply OCR whenever we detect that your uploaded document is a scanned image.

OCR accuracy is usually close to 100% if your document comes in professional scan quality. There are, however, various situations where OCR can yield less accurate results, including:

  • The font size of the document is very small
  • The scanned image contains scanning artifacts (pixel noise, black paper borders, ...)
  • The text is not surrounded by a white background
  • The scanned image has low contrast between black and white
  • The document was not well aligned during scanning and the image is skewed

Docparser comes with a variety of pre-processing filters to programmatically improve OCR accuracy. While these filters try to minimize the effects of most scanning issues, the most reliable way to improve OCR accuracy is to provide a high-quality scan.
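
To give a rough idea of what a pre-processing filter does, here is a minimal Python sketch of two common steps: contrast stretching followed by black-and-white thresholding. This is an illustration only and not Docparser's actual implementation. The function names `stretch_contrast` and `binarize`, the threshold of 128, and the image representation (a list of rows of grayscale values from 0 to 255) are all assumptions made for this example.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values so the darkest pixel maps to 0
    and the brightest to 255 (hypothetical helper, for illustration)."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A completely flat image has no contrast to stretch.
        return [row[:] for row in pixels]
    scale = 255 / (hi - lo)
    return [[round((value - lo) * scale) for value in row] for row in pixels]


def binarize(pixels, threshold=128):
    """Turn every pixel into pure white (255) or pure black (0).
    The default threshold of 128 is an assumed value."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


# A tiny low-contrast "scan": gray text (90) on a light gray background (150).
scan = [
    [150, 150, 150, 150],
    [150, 90, 90, 150],
    [150, 150, 150, 150],
]
cleaned = binarize(stretch_contrast(scan))
```

After both steps the background is pure white and the text is pure black, which is the kind of input OCR engines generally handle best. Filters like these cannot recover detail that was never captured, which is why a good scan remains the better fix.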

A high-quality scan has the following attributes:

  • A resolution of 200 to 300 DPI
  • Well aligned, with no skew
  • High contrast between black and white
  • No scanning artifacts (pixel noise, black paper borders, ...)
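
If you are unsure whether a scan meets the resolution guideline above, you can estimate its DPI from the image's pixel dimensions and the physical page size. The Python sketch below shows the arithmetic. The function names are hypothetical, and the default page size (US Letter, 8.5 x 11 inches) is an assumption; pass your own dimensions for A4 or other formats.

```python
def effective_dpi(width_px, height_px, page_width_in=8.5, page_height_in=11.0):
    """Estimate scan resolution as pixels per inch of paper.
    The smaller of the two axes is used so the estimate is conservative."""
    return min(width_px / page_width_in, height_px / page_height_in)


def resolution_status(width_px, height_px, min_dpi=200, max_dpi=300, **page_size):
    """Compare the estimated DPI with the 200 to 300 DPI guideline."""
    dpi = effective_dpi(width_px, height_px, **page_size)
    if dpi < min_dpi:
        return "below"
    if dpi > max_dpi:
        return "above"
    return "within"


# A US Letter page scanned to 1700 x 2200 pixels works out to 200 DPI.
print(effective_dpi(1700, 2200))      # 200.0
print(resolution_status(1700, 2200))  # within
print(resolution_status(850, 1100))   # below (only 100 DPI)
```

A result of "below" suggests rescanning at a higher resolution setting rather than enlarging the existing image, since upscaling does not add real detail.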
