9 Extraction Rules

9.1 Overview

Extraction rules enable automatic extraction of information from PDF documents. The extracted data can be used as placeholders in file names, email texts, target folders, and many other contexts.

To open: In the Profile Settings, under Data Extraction, click Add or Edit.

Typical Applications

Application     Example
--------------  -------------------------------------------
File names      <InvoiceDate>_<InvoiceNumber>.pdf
Target folders  D:\Archive\<Year>\<Month>\
Email subject   Invoice <InvoiceNumber> dated <InvoiceDate>
CSV export      All extracted values in a table

Structure of a Rule

Each rule consists of several components:

Component      Description
-------------  ------------------------------
General        Name, source, data type
Determination  How the value is found
Clean-up       Preprocessing of the raw value
Verification   Checking the found value
Format         Post-formatting of the value

9.2 General

The General tab contains basic settings of the rule.

9.2.1 Name

The name of the rule. This name is used for placeholders.

Format: <RuleId:N(RuleName)>. The rule ID N is determined automatically.

Tip: Use meaningful names without special characters, for example CustomerNumber or InvoiceDate.

Note: If you create multiple rules with the same name, only one of them needs to produce a valid result. The program automatically uses the first successful result. This is useful for fallback scenarios, e.g., when one rule fails for certain document types.

9.2.2 Comment

Optional field for notes about the rule.

9.2.3 Data Source

Determines where the data is extracted from:

Source             Description
-----------------  --------------------------------------------------------
Document Text      Text of the PDF document
Barcode            Content of a barcode in the PDF
PDF Property       PDF metadata (title, author, etc.)
File Property      File properties (name, path, date)
Custom Text        Fixed or calculated value, for example n/a
Placeholder Value  Reference to another rule located above the current rule
Form Field         Value of a PDF form field

9.2.4 Data Type

The required type of the extracted value:

Data Type          Description
-----------------  --------------------------------------
Text               Any text
Date               Date values with automatic detection
Number             Numeric values
Query              Conditional value selection
Query (with List)  Value from a static or file-based list

9.3 Data Source Document Text or Barcode - Determination: Position

With position-based determination, a resizable selection rectangle marks the desired area on the page.

9.3.1 Determine Page

Option                  Description
----------------------  -------------------------------------------------------------------------------
Specify page number     Selection rectangle is always positioned on the specified page number
Find page with keyword  Selection rectangle is always positioned on the page with the specified keyword

9.3.2 Mark in PDF Viewer

Mark the desired area directly in the page preview:

  1. Click Change position and adjust the position and size of the selection rectangle to define the desired area.
  2. Click Fix position.


9.4 Data Source Document Text or Barcode - Determination: Keyword

With keyword determination, a value is extracted relative to a search term (keyword).

9.3.1 Determine Page

Option                      Description
--------------------------  ----------------------------------------------------------------
No determination necessary  The page is defined by the keyword specified in Define Data Area
Specify page number         The page is defined by a specified page number
Find page with keyword      The page is defined by the keyword specified here

9.4.1 Define Data Area

9.4.1.1 Keyword

The text searched for in the document.

Example: Use “Invoice number:” as the keyword to find the number to the right of it.

9.4.1.2 Search Options

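Option                   Description
-----------------------  ------------------------------------------------------------------------
Case sensitive           Respects capitalization
Regular expression       Interprets the keyword as a regular expression
On multiple occurrences  Selects which occurrence of the keyword is used; normally the first one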
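
The three search options can be illustrated on plain text. The following is a minimal sketch, not the program's implementation: the function name find_keyword, its parameters, and the case-insensitive default are assumptions made for illustration.

```python
import re

def find_keyword(text, keyword, case_sensitive=False, use_regex=False, occurrence=1):
    """Return (start, end) of the requested keyword occurrence, or None.

    Hypothetical helper: the parameter names mirror the options in the
    table above but are not taken from the program itself.
    """
    # A plain keyword is escaped so that characters like "." match literally.
    pattern = keyword if use_regex else re.escape(keyword)
    flags = 0 if case_sensitive else re.IGNORECASE
    matches = list(re.finditer(pattern, text, flags))
    if occurrence < 1 or occurrence > len(matches):
        return None
    match = matches[occurrence - 1]
    return match.start(), match.end()

text = "Invoice number: 4711\nOrder No. 12\ninvoice number: 4712"
first = find_keyword(text, "Invoice number:")
exact = find_keyword(text, "invoice number:", case_sensitive=True)
by_regex = find_keyword(text, r"Order\s+No\.", use_regex=True)
second = find_keyword(text, "Invoice number:", occurrence=2)
```

Here first and second point to different occurrences of the same keyword, while exact only matches the lowercase spelling.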

9.4.1.3 Data Position (Position Relative to Keyword)

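Position             Description
-------------------  --------------------------------------------------------------------------------
Right                Text to the right of the keyword
Left                 Text to the left of the keyword
Above                Text above the keyword
Below                Text below the keyword
Found location area  The area of the keyword itself (best suited as a starting point that you then
                     extend to define the desired area)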
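
The relative positions can be approximated on lines of text. This is a simplified sketch under the assumption that the document is available as a list of text lines; the program itself works with areas on the page, and the function name value_near_keyword is hypothetical.

```python
def value_near_keyword(lines, keyword, position="right"):
    """Return the text found relative to the first line containing keyword.

    Simplification (assumption): "above" and "below" mean the previous
    and next text line, not a geometric area on the page.
    """
    for index, line in enumerate(lines):
        column = line.find(keyword)
        if column < 0:
            continue
        if position == "right":
            return line[column + len(keyword):].strip()
        if position == "left":
            return line[:column].strip()
        if position == "above":
            return lines[index - 1].strip() if index > 0 else ""
        if position == "below":
            return lines[index + 1].strip() if index + 1 < len(lines) else ""
        raise ValueError(f"unknown position: {position}")
    return None  # keyword not found

lines = ["Date", "01.12.2024", "Invoice number: 4711"]
number = value_near_keyword(lines, "Invoice number:", "right")
invoice_date = value_near_keyword(lines, "Date", "below")
```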

9.4.2 Extend Data Area

Relocates and/or extends the data area that was found via the keyword:

Setting       Description
------------  ------------------------------------------------------------------------
To the left   Relocates the left edge of the data area by a positive or negative value
To the right  Relocates the right edge of the data area by a positive or negative value
Upward        Relocates the upper edge of the data area by a positive or negative value
Downward      Relocates the lower edge of the data area by a positive or negative value
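
As a rough illustration of how the four settings move the edges of a rectangle, consider the sketch below. The Area type, the function name extend_area, the coordinate system (y grows downward), and the sign convention (positive values enlarge the area) are all assumptions, not documented behavior.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Area:
    """Rectangle on a page; y is assumed to grow downward."""
    left: float
    top: float
    right: float
    bottom: float

def extend_area(area, to_left=0.0, to_right=0.0, upward=0.0, downward=0.0):
    """Relocate each edge; positive values enlarge, negative values shrink (assumption)."""
    return Area(
        left=area.left - to_left,
        top=area.top - upward,
        right=area.right + to_right,
        bottom=area.bottom + downward,
    )

keyword_area = Area(left=100, top=200, right=180, bottom=212)
# Extend 150 units to the right of the keyword and trim 2 units at the bottom.
data_area = extend_area(keyword_area, to_right=150, downward=-2)
```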

9.4.3 Adjust Data Area Extension

If the preceding data area extension references a keyword, you can fine-tune the resulting area here.

9.4.4 Visualization in PDF Viewer

The PDF viewer displays:

  - Red: The found keyword
  - Green: The data area
  - Blue: The extracted value


9.5 Data Source Document Text - Determination: Text of Page(s)

With this determination, the entire text of one or more pages is used as a basis.

9.5.1 Data Determination (Page Text)

9.5.1.1 Determine Page

Option                      Description
--------------------------  ---------------------------------------------------------
No determination necessary  Uses the text of all pages
Specify page number         Uses the text of the page with the specified page number
Find page with keyword      Uses the text of the page with the specified keyword

9.5.1.2 Combination with Clean-up

Data determination using page text often yields a lot of text. Use clean-up to extract the relevant part.
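
To show why clean-up matters for page text, the following sketch reduces a whole page to a single value in two steps. The helper names are modeled on the clean-up tasks "Remove text before marker" and "Extract line X", but their exact behavior here (for example, that the marker itself is removed too) is an assumption.

```python
def remove_text_before_marker(text, marker):
    """Drop everything up to and including the marker (assumption);
    the text is returned unchanged if the marker is absent."""
    _before, found, after = text.partition(marker)
    return after if found else text

def extract_line(text, number):
    """Return line `number` (1-based), stripped, or "" if it does not exist."""
    lines = text.splitlines()
    return lines[number - 1].strip() if 1 <= number <= len(lines) else ""

page_text = "ACME GmbH\nInvoice number: RE-2024-0815\nTotal: 119,00 EUR"
remainder = remove_text_before_marker(page_text, "Invoice number:")
invoice_number = extract_line(remainder, 1)
```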


9.6 Data Types

9.6.1 Text

For extraction, verification, and formatting of text.

For most cases, the data type Text is the right choice.

9.6.2 Date

For extraction and verification of a date.

With the data type Date, all dates in the text are automatically evaluated. If you don’t specify a keyword, the first date found is used. With this data type, all date components are available separately when you use the placeholder in a path or file name. For example, you can use only the four-digit year and the month name.

9.6.3 Number

For extraction and verification of a number.

9.6.4 Simple Query

A query determines a value based on conditions. You define the conditions and the associated return values:

Document text contains: "X<OR>Y<OR>Z", then use as result "Delivery Note", else ""
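
The condition above can be read as "if any of the alternatives occurs in the document text, return the result, otherwise the else value". A minimal sketch of that reading follows; the function name simple_query and the case-sensitive substring comparison are assumptions.

```python
def simple_query(document_text, terms, result, otherwise=""):
    """Return `result` if any <OR>-separated term occurs in the text."""
    alternatives = [term for term in terms.split("<OR>") if term]
    if any(term in document_text for term in alternatives):
        return result
    return otherwise

terms = "Delivery note<OR>Lieferschein<OR>Packing slip"
doc_type = simple_query("Lieferschein Nr. 5", terms, "Delivery Note")
no_match = simple_query("Invoice 4711", terms, "Delivery Note")
```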

9.6.5 Query (with List)

You can use the data type “Query (with List)” to search for the occurrence of a term and use the associated value as the result, e.g., an email address or a folder name.

List Format: Search term and result value are separated by a semicolon.

Example 1: Assign email addresses based on customer numbers:

Customer number : 19006;x@y.de
Customer number : 1900;a@b.de
Customer number : 18765;c@d.de

If the PDF contains “Customer number : 19006”, “x@y.de” is used as the result.

Example 2: Search IBAN, use company name as result:

DE02120300000000202051<OR>DE02 1203 0000 0000 2020 51;Mustermann GmbH
DE02500105170137075030;Musterfrau GmbH

Here the IBAN (with or without spaces) is searched for and the associated company name is returned.
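
The list format can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes that entries are tested from top to bottom with a plain substring search and that the first matching entry wins. Under that assumption the order in Example 1 matters, because “Customer number : 1900” is also contained in “Customer number : 19006”.

```python
def parse_list(list_text):
    """Parse lines of the form 'term[<OR>term...];result' into (terms, result) pairs."""
    entries = []
    for line in list_text.splitlines():
        search, separator, result = line.partition(";")
        if not separator:
            continue  # skip lines without a semicolon
        terms = [term.strip() for term in search.split("<OR>") if term.strip()]
        entries.append((terms, result.strip()))
    return entries

def query_with_list(document_text, entries):
    """Return the result of the first entry with a term found in the text (assumption)."""
    for terms, result in entries:
        if any(term in document_text for term in terms):
            return result
    return None

customers = parse_list(
    "Customer number : 19006;x@y.de\n"
    "Customer number : 1900;a@b.de\n"
    "Customer number : 18765;c@d.de"
)
ibans = parse_list(
    "DE02120300000000202051<OR>DE02 1203 0000 0000 2020 51;Mustermann GmbH\n"
    "DE02500105170137075030;Musterfrau GmbH"
)
email = query_with_list("Customer number : 19006", customers)
company = query_with_list("IBAN: DE02 1203 0000 0000 2020 51", ibans)
```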


9.7 Data Source: Form Field

Extracts values from PDF form fields.

9.7.1 Field Selection

Shows all form fields present in the PDF:

Field Type   Description
-----------  ------------------------
TextBox      Text input field
CheckBox     Selection field (Yes/No)
RadioButton  Option button
ComboBox     Dropdown list
ListBox      Selection list

9.7.2 Form Field (Field Name)

Select the form field by its name. The name is defined in the PDF form settings.


9.8 Clean-up

Clean-up enables preprocessing of the extracted raw value.

9.8.1 Available Clean-up Tasks

Replace Operations

Task                        Description
--------------------------  -----------------------------------
Replace text                Replaces one text with another
Replace text before marker  Replaces everything before a marker
Replace text after marker   Replaces everything after a marker
Replace regex result        Replaces regex matches
Replace line breaks         Replaces line breaks with text
Replace with Excel file     Replaces based on Excel mapping

Insert Operations

Task                  Description
--------------------  -----------------------------------
Insert before marker  Inserts text before a marker
Insert after marker   Inserts text after a marker
Insert at position    Inserts text at a specific position

Remove Operations

Task                          Description
----------------------------  -----------------------------------------
Remove text                   Removes a specific text
Remove text before marker     Removes everything before a marker
Remove first/last characters  Removes X characters at the beginning/end
Remove regex result           Removes regex matches
Remove blank lines            Removes all blank lines
Remove lines with regex       Removes lines matching a pattern

Line Operations

Task                  Description
--------------------  ------------------------------------
Extract line X        Extracts only a specific line
Move line X           Moves a line to another position
Move lines with text  Moves lines containing specific text

9.8.2 Clean-up Order

Multiple clean-up tasks are executed in the defined order. Use the arrow buttons to adjust the order.
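
The order matters because each task works on the output of the previous one. The sketch below models tasks as simple functions applied in sequence; the task factories are hypothetical stand-ins for "Remove first/last characters" and "Replace text", not the program's own code.

```python
def remove_first_chars(count):
    """Task: remove `count` characters at the beginning."""
    return lambda value: value[count:]

def replace_text(old, new):
    """Task: replace every occurrence of `old` with `new`."""
    return lambda value: value.replace(old, new)

def run_cleanup(value, tasks):
    """Apply the tasks one after another, in the given order."""
    for task in tasks:
        value = task(value)
    return value

raw = "RE-2024-0815"
strip_prefix = remove_first_chars(3)
drop_hyphens = replace_text("-", "")

prefix_first = run_cleanup(raw, [strip_prefix, drop_hyphens])   # "20240815"
hyphens_first = run_cleanup(raw, [drop_hyphens, strip_prefix])  # "0240815"
```

With the hyphens removed first, the prefix is one character shorter, so removing three characters cuts into the number itself.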


9.9 Verification: Text

Text verifications check the extracted value for certain conditions.

9.9.1 Available Checks

Check                      Description
-------------------------  --------------------------------
Text equals                Exact match
Text does not equal        No match
Text contains              Contains the search term
Text does not contain      Does not contain the search term
Text starts with           Starts with the search term
Text ends with             Ends with the search term
Text matches regex         Matches the regular expression
Text does not match regex  Does not match the expression
Extracted text is empty    No value extracted
Number of characters       Checks text length
Number of lines            Checks line count

9.9.2 Character Verification

Checks individual characters at specific positions:

Check            Description
---------------  ----------------------------
Is digit         Character is 0-9
Is letter        Character is A-Z or a-z
Is uppercase     Character is A-Z
Is lowercase     Character is a-z
Is alphanumeric  Character is letter or digit
Matches regex    Character matches a pattern
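
A character check at a given position could look like the sketch below. The function name check_char, the check identifiers, and the 1-based position are assumptions; the character classes follow the ASCII ranges listed in the table.

```python
import re
import string

def check_char(value, position, check, pattern=None):
    """Check the character at `position` (1-based, assumption) of `value`."""
    if not 1 <= position <= len(value):
        return False  # no character at that position
    char = value[position - 1]
    if check == "regex":
        return re.fullmatch(pattern, char) is not None
    classes = {
        "digit": string.digits,
        "letter": string.ascii_letters,
        "uppercase": string.ascii_uppercase,
        "lowercase": string.ascii_lowercase,
        "alphanumeric": string.ascii_letters + string.digits,
    }
    return char in classes[check]

value = "RE-2024"
starts_upper = check_char(value, 1, "uppercase")
hyphen_is_alnum = check_char(value, 3, "alphanumeric")
fourth_is_digit = check_char(value, 4, "digit")
```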

9.10 Verification: Date

Date verifications check whether the extracted value is a valid date.

9.10.1 Available Checks

Check            Description
---------------  -----------------------------------
Date is valid    Value is a recognizable date
Date is between  Date is within the specified period

9.10.2 Date Formats

The system automatically recognizes various date formats:

  - 01.12.2024 (German)
  - 12/01/2024 (American)
  - 2024-12-01 (ISO)
  - December 1, 2024 (with month name)
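
The recognition of these four formats and the "Date is between" check can be approximated with Python's standard library. This is a sketch under assumptions: the format list, the try-in-order strategy, and the inclusive period boundaries are illustrative and not taken from the program. Note that a value such as 12/01/2024 is read as month/day here, matching the American example.

```python
from datetime import date, datetime

# Tried in order; %B matches the full English month name in the default locale.
FORMATS = ["%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y"]

def parse_date(text):
    """Return a date for the first matching format, or None if none matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

def is_between(value, start, end):
    """'Date is between' check with inclusive boundaries (assumption)."""
    return start <= value <= end

samples = ["01.12.2024", "12/01/2024", "2024-12-01", "December 1, 2024"]
parsed = [parse_date(sample) for sample in samples]
in_2024 = is_between(parsed[0], date(2024, 1, 1), date(2024, 12, 31))
```

All four sample strings describe the same day, so every entry in parsed is the same date.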


9.11 Verification: Number

Number verifications check numeric values.

9.11.1 Available Checks

Check              Description
-----------------  ------------------------------
Number is valid    Value is a recognizable number
Number is between  Value is within range

9.11.2 Number Formats

Recognized formats:

  - 1234 (integer)
  - 1.234,56 (German)
  - 1,234.56 (English)
  - -123.45 (negative)
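
One way to handle both German and English notation is to treat whichever separator appears last as the decimal separator. The sketch below follows that heuristic, which is an assumption and not necessarily how the program decides. Values with only one kind of separator, such as 1.234 or 1,234, are ambiguous; this sketch reads that separator as a decimal separator.

```python
import re

def parse_number(text):
    """Parse German or English number notation; return a float or None."""
    value = text.strip()
    if not re.fullmatch(r"-?\d[\d.,]*", value):
        return None
    last_dot, last_comma = value.rfind("."), value.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        if last_comma > last_dot:
            # German style: dots group thousands, comma is the decimal separator.
            value = value.replace(".", "").replace(",", ".")
        else:
            # English style: commas group thousands.
            value = value.replace(",", "")
    elif last_comma >= 0:
        # Only commas present: read as decimal comma (ambiguous, assumption).
        value = value.replace(",", ".")
    try:
        return float(value)
    except ValueError:
        return None

numbers = [parse_number(s) for s in ["1234", "1.234,56", "1,234.56", "-123.45"]]
```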


9.12 Verification: Query

Query verifications check values based on conditions.

9.12.1 Available Checks

Check                 Description
--------------------  -------------------------
Query returns result  The query returns a value

9.13 Formatting

Formatting enables post-processing of the verified value. The main difference from clean-up is that, for replacements in formatting, the search term must be present in the value.

9.14 Using Placeholders

Extracted values can be used as placeholders in many contexts.

9.14.1 Placeholder Syntax

Syntax                Description
--------------------  -----------------------
<RuleName>            Simple placeholder
<RuleId:1(RuleName)>  Complete syntax with ID
<RuleName{DatePart}>  Extract date part

9.14.2 Date Parts

DatePart              Description             Example
--------------------  ----------------------  --------
Year4                 Four-digit year         2024
Year2                 Two-digit year          24
Month                 Month (two digits)      12
MonthName             Month name              December
MonthNameAbbreviated  Abbreviated month name  Dec
Day                   Day (two digits)        15

Example: <InvoiceDate{Year4}>-<InvoiceDate{Month}> yields “2024-12”
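
How such placeholders might be resolved can be sketched as follows. The function names and the regular expression are assumptions for illustration; the sketch covers only the forms <RuleName> and <RuleName{DatePart}>, not the complete syntax with rule ID, and it leaves unknown placeholders untouched.

```python
import re
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

def date_part(value, part):
    """Format one component of a date according to the DatePart table."""
    name = MONTH_NAMES[value.month - 1]
    parts = {
        "Year4": f"{value.year:04d}",
        "Year2": f"{value.year % 100:02d}",
        "Month": f"{value.month:02d}",
        "MonthName": name,
        "MonthNameAbbreviated": name[:3],
        "Day": f"{value.day:02d}",
    }
    return parts[part]

PLACEHOLDER = re.compile(r"<(\w+)(?:\{(\w+)\})?>")

def expand(template, values):
    """Replace <RuleName> and <RuleName{DatePart}> with extracted values."""
    def substitute(match):
        name, part = match.group(1), match.group(2)
        if name not in values:
            return match.group(0)  # leave unknown placeholders as they are
        value = values[name]
        return date_part(value, part) if part else str(value)
    return PLACEHOLDER.sub(substitute, template)

values = {"InvoiceDate": date(2024, 12, 15), "InvoiceNumber": "4711"}
period = expand("<InvoiceDate{Year4}>-<InvoiceDate{Month}>", values)
file_name = expand("Invoice_<InvoiceNumber>_<InvoiceDate{Day}><InvoiceDate{MonthNameAbbreviated}>.pdf", values)
```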

9.14.3 Fallback Rules

If multiple rules have the same name, the first successful rule is used. This enables fallback values:

  1. Rule “Date” - Attempts extraction from document text
  2. Rule “Date” - If failed: Uses file date
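
The fallback behavior can be pictured as trying same-named rules from top to bottom until one returns a value. In this sketch, rules are plain functions and the document is a dictionary; both are hypothetical stand-ins, and the sketch treats an empty result as a failure.

```python
def first_successful(rules, document):
    """Return the result of the first rule that yields a non-empty value."""
    for rule in rules:
        value = rule(document)
        if value:
            return value
    return None

# Two hypothetical rules that share the name "Date".
def date_from_text(document):
    return document.get("date_in_text")  # None if no date was found in the text

def date_from_file(document):
    return document.get("file_date")

date_rules = [date_from_text, date_from_file]
scanned = {"date_in_text": None, "file_date": "2024-12-01"}
digital = {"date_in_text": "2024-11-28", "file_date": "2024-12-01"}

fallback_value = first_successful(date_rules, scanned)
primary_value = first_successful(date_rules, digital)
```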