How to extract data from variable positions?
Docparser offers many different ways of extracting data from your PDF documents. Depending on your specific use-case, different approaches can be applied to get the data you are looking for.
The most straightforward way to extract data from your documents is to use position-based extraction rules. Position-based parsing rules are a great fit if your data is always located at the exact same position within your document, for example a reference number in the document header.
If your data is located at variable positions inside your document, however, one of the following approaches might be right for you:
Extract data with anchor keywords
One common approach is to rely on anchor keywords inside your document. The idea behind this approach is simple: first, you define a specific keyword which is located near the data you want to extract. In a second step, you define where your data is located relative to that keyword.
For example, you know that somewhere in your document is a line which reads "Reference Number: #1234567". In this case, the phrase "Reference Number:" is the anchor keyword and "#1234567" would be the data field you want to extract.
An easy way to get started with this method is to use the parsing rule preset "Text Variable Position" as shown in the screenshots below. This parsing rule template will first return the entire text of your document. You can then use multiple text filters to boil down the original text until you are left with a specific data field.
You can, for example, add filters like 'Set start after <A_TEXT_MARKER>', 'Set End to next blank line', 'Remove empty lines', and so on. Combining two filters should already get you close to the data you are after.
1/ Choose the preset "Text Variable Position"
2/ Set the start position to "Text Match: after"
3/ Set the end position and refine results
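Outside of the Docparser interface, the same anchor keyword idea can be sketched in a few lines of Python. This is only an illustration of the technique, not how Docparser implements it; the function names `set_start_after`, `set_end_at_line_end` and `extract_after_anchor`, as well as the sample text, are made up for this example.

```python
from typing import Optional


def set_start_after(text: str, marker: str) -> Optional[str]:
    """Keep only the text that follows the first occurrence of `marker`."""
    index = text.find(marker)
    if index == -1:
        return None  # anchor keyword not present in this document
    return text[index + len(marker):]


def set_end_at_line_end(text: str) -> str:
    """Cut the text at the first line break."""
    return text.split("\n", 1)[0]


def extract_after_anchor(text: str, anchor: str) -> Optional[str]:
    """Chain both filters: start after the anchor, stop at the end of the line."""
    remainder = set_start_after(text, anchor)
    if remainder is None:
        return None
    return set_end_at_line_end(remainder).strip()


# Hypothetical document text; the anchor line can sit anywhere in it.
sample = "ACME Corp\nInvoice\n\nReference Number: #1234567\nDate: 2020-01-31\n"
reference = extract_after_anchor(sample, "Reference Number:")
```

Because the function searches for the keyword instead of relying on a fixed position, it keeps working when the "Reference Number:" line moves up or down in the document.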
Extract data based on patterns
If the anchor keyword method described above is not an option, you can also try to extract your data based on what the data looks like. Pattern-based extraction works well if you are after specific data fields, such as dates in a specific format, reference numbers, email addresses, etc.
Docparser comes with a variety of parsing rule presets which can extract data points following a specific format or pattern. Some of our pattern-based parsing rule templates are:
- Date Extraction
- Number Extraction
- Product Numbers Extraction
- Tax ID Extraction
- Phone Number Extraction
- Regular Expressions
All of these filters let you define an approximate area of your document where you expect the data to reside. In a second step, the extracted text is boiled down to the data field you are after. The example below shows our tracking number extraction tool.
1/ Define the approximate location of the data
2/ Extract the data with pattern-based filters
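As a rough illustration of the two steps, the Python sketch below first crops the text to an approximate area and then applies a regular expression to it. It does not reflect Docparser's internal implementation; the helper `crop_lines`, the two patterns and the sample text are assumptions chosen for this example, and the date pattern only covers the `YYYY-MM-DD` format.

```python
import re
from typing import List

# Illustrative patterns: ISO-style dates and simple email addresses.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def crop_lines(text: str, first_line: int, last_line: int) -> str:
    """Step 1: keep only the approximate area (a range of lines, end exclusive)."""
    lines = text.splitlines()
    return "\n".join(lines[first_line:last_line])


def find_all(pattern: "re.Pattern", text: str) -> List[str]:
    """Step 2: return every substring of `text` that matches the pattern."""
    return pattern.findall(text)


# Hypothetical document text.
sample = (
    "ACME Corp\n"
    "billing@acme.example\n"
    "Invoice date: 2020-01-31\n"
    "Due date: 2020-02-29\n"
    "Total: 120.00\n"
)

header_area = crop_lines(sample, 0, 3)          # first three lines only
header_dates = find_all(DATE_PATTERN, header_area)
all_dates = find_all(DATE_PATTERN, sample)
emails = find_all(EMAIL_PATTERN, header_area)
```

Restricting the search to an approximate area first reduces false matches: here the header area contains only the invoice date, while a search over the whole text would also return the due date.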
What to do if all your documents look entirely different?
We recommend creating one Document Parser for each layout. For example, if you have 20 different vendors sending you 20 different types of PDF documents, you should create 20 Document Parsers. Each Document Parser holds a set of parsing rules tailored to a specific document layout. This is front-loaded work, but it will save you hours in the long run, as the new invoices you receive will fit a parser you have already created. If the data is in the same place and only the layout differs, start with either the "Table Rows" or "Text Fixed Position" parsing rules.
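The "one parser per layout" idea can be pictured as one rule set per vendor layout. The Python sketch below is a conceptual illustration only and not Docparser's API; the layout names, field names, patterns and the `parse_document` function are all hypothetical.

```python
import re
from typing import Dict, Optional

# One rule set per layout; each rule maps a field name to a regular
# expression whose first capture group holds the value. All names and
# patterns here are made up for illustration.
PARSERS: Dict[str, Dict[str, str]] = {
    "vendor_a": {
        "invoice_number": r"Invoice No\.\s*(\S+)",
        "total": r"Total:\s*([\d.]+)",
    },
    "vendor_b": {
        "invoice_number": r"Ref #(\S+)",
        "total": r"Amount due\s+([\d.]+)",
    },
}


def parse_document(layout: str, text: str) -> Dict[str, Optional[str]]:
    """Apply the rule set that belongs to `layout` to the document text."""
    rules = PARSERS[layout]
    result: Dict[str, Optional[str]] = {}
    for field, pattern in rules.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result


invoice_a = parse_document("vendor_a", "Invoice No. A-17\nTotal: 99.50\n")
invoice_b = parse_document("vendor_b", "Ref #B-204\nAmount due 12.00\n")
```

Both vendors end up with the same output fields even though their documents are worded differently, which is why the up-front effort of one rule set per layout pays off once new documents arrive.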