How does the Regular Expression (RegEx) filter work?

Our RegEx filter allows you to extract text data from your PDF documents based on regular expression. This advanced filter comes in handy if you know how the data field which you want to extract looks like, but you don't know where it is located inside the document, for example a tracking numbers with a specific format, a number following a specific label, etc.

The regular expression which you provide to this filter needs to follow the PCRE syntax. Some basic examples are shown below which should get you started. For more in-depth information, please have a look at the  PCRE syntax documentation.

Example #1: Find all integer numbers which are between seven and nine digits long 
/[0-9]{7,9}/

Example #2: Find all text tokens which consist of two letters followed by a dash symbol and seven digits: 
/[A-Z]{2}-[0-9]{7}/

Next to writing down the actual pattern, a delimiter character needs to be placed at the start and the end of the expression. Common delimiter characters are / or #. 

Instead of defining the character sets like shown in the example above, one can also use specific letters representing character classes (for example, \d is the same as [0-9]). Below we will list the most common PCRE syntax elements.

Base Character Classes 

\w Any “word” character (a-z 0-9 _)
\W Any non “word” character
\s Whitespace (space, tab CRLF)
\S Any non whitepsace character
\d Digits (0-9)
\D Any non digit character
. (Period) – Any character except newline
Meta Characters
^ Start of subject (or line in multiline mode)
$ End of subject (or line in multiline mode)
[ Start character class definition
] End character class definition
| Alternates, eg (a|b) matches a or b
( Start subpattern
) End subpattern
\ Escape character
Quantifiers
n* Zero or more of n
n+ One or more of n
n? Zero or one occurrences of n
{n} n occurrences exactly
{n,} At least n occurrences
{,m} At most m occurrences
{n,m} Between n and m occurrences (inclusive)

Still need help? Contact Us Contact Us