How to parse repeating text blocks?
In general, our PDF table extraction filter does a great job of extracting "table like" data. However, in some cases you are dealing with lists that are built up from complex text blocks. These text blocks can be spread over multiple lines and contain irregularities. The method described in this article will help you parse this kind of data set.
Our 'Repeating Text Block' filter lets you parse tabular data which is spread over several lines. The filter provides several options for recognising such repeating blocks. Individual text blocks can be recognised in one of three ways:
- by defining how many lines one text block has
- by text patterns, e.g. if a line starts with a certain word
- by separating them whenever there are empty lines
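To make the three recognition options more concrete, here is a minimal Python sketch of each grouping strategy. It is an illustration only and not the filter's actual implementation; the function names and the sample pattern are assumptions made for this example.

```python
import re

def blocks_by_line_count(lines, size):
    """Group every `size` consecutive lines into one block."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def blocks_by_pattern(lines, start_pattern):
    """Start a new block whenever a line matches `start_pattern` at its beginning."""
    blocks = []
    for line in lines:
        if re.match(start_pattern, line) or not blocks:
            blocks.append([line])      # this line opens a new block
        else:
            blocks[-1].append(line)    # this line continues the current block
    return blocks

def blocks_by_empty_lines(lines):
    """Close the current block whenever one or more empty lines occur."""
    blocks, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```

For example, `blocks_by_pattern(lines, r"Item:")` would start a new block at every line that begins with the word "Item:", which tends to cope better with blocks of varying length than a fixed line count does.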
To obtain good results, it is important that you remove all text before and after your list. You can do this by adding text filters such as 'Define Start Position' and 'Define End Position'.
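The effect of trimming the text around your list can be sketched as follows. This is a rough Python illustration of the idea, not how the 'Define Start Position' and 'Define End Position' filters are implemented; `trim_to_list` and its marker parameters are hypothetical names.

```python
def trim_to_list(text, start_marker, end_marker):
    """Keep only the text between the first start_marker and the next end_marker."""
    start = text.find(start_marker)
    # If the start marker is missing, keep the text from the beginning.
    start = 0 if start == -1 else start + len(start_marker)
    end = text.find(end_marker, start)
    # If the end marker is missing, keep the text up to the end.
    if end == -1:
        end = len(text)
    return text[start:end].strip("\n")
```

With the header and footer removed, only the repeating blocks remain, so stray lines cannot be mistaken for the start of a block.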
After your text blocks are recognised correctly, you can further refine the parsed data with table filters, such as splitting and removing columns.
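As a rough illustration of this refinement step, the Python sketch below treats each recognised block as a table row whose lines are the columns, then splits and removes columns. The helper names are hypothetical and the code only mirrors the idea, not the table filters themselves.

```python
def split_column(rows, index, separator):
    """Split the column at `index` in two at the first occurrence of `separator`."""
    result = []
    for row in rows:
        left, _, right = row[index].partition(separator)
        result.append(row[:index] + [left.strip(), right.strip()] + row[index + 1:])
    return result

def remove_column(rows, index):
    """Drop the column at `index` from every row."""
    return [row[:index] + row[index + 1:] for row in rows]
```

For instance, splitting the first column of `["Item: A", "qty 1"]` on ":" and then removing the resulting label column leaves `["A", "qty 1"]`.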