How to improve the accuracy of Optical Character Recognition (OCR)?
Optical Character Recognition (OCR) is the process of turning a scanned document into editable and machine-readable text. At Docparser, we automatically apply OCR whenever we detect that your uploaded document is a scanned image.
The accuracy of OCR is usually close to 100% if your document comes in professional scan quality. There are, however, various situations where OCR can yield less accurate results, including:
- The font size of the document is very small
- The scanned image contains scanning artifacts (pixel noise, black paper borders, ...)
- The text is not surrounded by a white background
- The scanned image has low contrast between black and white
- The document was not well aligned during scanning and the image is skewed
Docparser comes with a variety of pre-processing filters to programmatically improve OCR accuracy. While those filters try to minimize the effects of most scanning issues, the most reliable way of improving OCR accuracy is to provide a high-quality scan.
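To give a feel for what a pre-processing filter can do, here is a minimal sketch of one common technique, binarization with Otsu's threshold, which turns a low-contrast grayscale scan into pure black and white. This is an illustration only and not Docparser's actual implementation; the function names `otsu_threshold` and `binarize` and the flat list of 0-255 grayscale values are assumptions made for the example.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels.

    `pixels` is a flat list of integer grayscale values (hypothetical input format).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of their gray values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # no light pixels left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor)
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Example: gray text (40-60) on a light gray background (200-220)
scan = [40, 50, 60, 200, 210, 220]
t = otsu_threshold(scan)
clean = binarize(scan, t)
```

A filter like this can help with low contrast, but it cannot recover detail that the scan never captured, which is why a good source scan remains the more reliable fix.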
A high-quality scan has the following attributes:
- A resolution of 200 to 300 DPI
- Good alignment with no skew
- High contrast between black and white
- No scanning artifacts (pixel noise, black paper borders, ...)
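If you want to check a scan before uploading it, the resolution and contrast attributes above can be estimated with a few lines of code. The following is a rough sketch under stated assumptions: the helper names (`estimate_dpi`, `contrast_spread`, `check_scan`) are hypothetical, the page width in inches must be known, the pixels are given as a flat list of 0-255 grayscale values, and the contrast cutoff of 0.5 is an arbitrary illustrative value. Only the 200 DPI minimum comes from the list above.

```python
MIN_DPI = 200        # lower bound of the recommended 200 to 300 DPI range
MIN_CONTRAST = 0.5   # assumed cutoff for illustration, not an official value


def estimate_dpi(pixel_width, page_width_inches):
    """Dots per inch = pixels across the page / physical page width."""
    return pixel_width / page_width_inches


def contrast_spread(pixels):
    """Fraction of the 0-255 range covered between darkest and lightest pixel."""
    return (max(pixels) - min(pixels)) / 255


def check_scan(pixel_width, page_width_inches, pixels):
    """Return a list of human-readable warnings (empty if none were found)."""
    warnings = []
    dpi = estimate_dpi(pixel_width, page_width_inches)
    if dpi < MIN_DPI:
        warnings.append(f"Resolution is about {dpi:.0f} DPI; aim for at least {MIN_DPI}.")
    if contrast_spread(pixels) < MIN_CONTRAST:
        warnings.append("Low contrast between text and background.")
    return warnings


# Example: a US Letter page (8.5 inches wide) scanned 2550 pixels wide
print(estimate_dpi(2550, 8.5))
print(check_scan(850, 8.5, [100, 120, 140]))
```

This sketch does not detect skew or scanning artifacts; those are usually easier to judge by looking at the scanned image directly.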