9 Extraction Rules

9.1 Overview

Extraction rules enable automatic extraction of information from PDF documents. The extracted data can be used as placeholders in file names, email texts, target folders, and many other contexts.

Opening: Click Add or Edit in the Profile Settings under Data Extraction.

Typical Applications

Application	Example
File names	`<InvoiceDate>_<InvoiceNumber>.pdf`
Target folders	`D:\Archive\<Year>\<Month>\`
Email subject	`Invoice <InvoiceNumber> dated <InvoiceDate>`
CSV export	All extracted values in a table

Structure of a Rule

Each rule consists of several components:

Component	Description
General	Name, source, data type
Determination	How the value is found
Clean-up	Preprocessing of the raw value
Verification	Checking the found value
Format	Post-formatting of the value

9.2 General

The General tab contains basic settings of the rule.

9.2.1 Name

The name of the rule. This name is used for placeholders.

Format: `<RuleId:N(RuleName)>`. The rule ID N is assigned automatically.

Tip: Use meaningful names without special characters, for example CustomerNumber or InvoiceDate.

Note: If you create multiple rules with the same name, only one of them needs to produce a valid result. The program automatically uses the first successful result. This is useful for fallback scenarios, e.g., when one rule fails for certain document types.

9.2.2 Comment

Optional field for notes about the rule.

9.2.3 Data Source

Determines where the data is extracted from:

Source	Description
Document Text	Text of the PDF document
Barcode	Content of a barcode in the PDF
PDF Property	PDF metadata (title, author, etc.)
File Property	File properties (name, path, date)
Custom Text	Fixed or calculated value, for example `n/a`
Placeholder Value	Reference to another rule located above the current rule
Form Field	Value of a PDF form field

9.2.4 Data Type

The required type of the extracted value:

Data Type	Description
Text	Any text
Date	Date values with automatic detection
Number	Numeric values
Query	Conditional value selection
Query (with List)	Value from a static or file-based list

9.3 Data Source `Document Text or Barcode` - Determination: Position

With position-based determination, a resizable selection rectangle marks the desired area on the page.

9.3.1 Determine Page

Option	Description
Specify page number	Selection rectangle is always positioned on the specified page number
Find page with keyword	Selection rectangle is always positioned on the page with the specified keyword

9.3.2 Mark in PDF Viewer

Mark the desired area directly in the page preview:

1. Click Change position and adjust the position and size of the selection rectangle to define the desired area.
2. Click Fix position.

9.4 Data Source `Document Text or Barcode` - Determination: Keyword

With keyword determination, a value is extracted relative to a search term (keyword).

9.3.1 Determine Page

Option	Description
No determination necessary	The page is defined by the keyword specified in `Define Data Area`
Specify page number	The page is defined by a specified page number
Find page with keyword	The page is defined by the keyword specified here

9.4.1 Define Data Area

9.4.1.1 Keyword

The text searched for in the document.

Example: Invoice number: to find the number to the right of it.

9.4.1.2 Search Options

Option	Description
Case sensitive	Respects capitalization
Regular expression	Interpret keyword as regex
On multiple occurrences	Selects which occurrence of the keyword is used; this should normally be the first occurrence
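The search options can be illustrated with a minimal text-only sketch. It assumes the document is available as a plain string and ignores page geometry, which the program itself works with. The function names `find_keyword` and `value_right_of`, their parameters, and the case-insensitive default are hypothetical and not part of the product.

```python
import re

def find_keyword(text, keyword, case_sensitive=False, is_regex=False, occurrence=1):
    """Return (start, end) of the requested occurrence of the keyword, or None."""
    # A plain keyword is escaped so that characters like "." are matched literally.
    pattern = keyword if is_regex else re.escape(keyword)
    flags = 0 if case_sensitive else re.IGNORECASE
    matches = list(re.finditer(pattern, text, flags))
    if occurrence < 1 or occurrence > len(matches):
        return None
    match = matches[occurrence - 1]
    return match.start(), match.end()

def value_right_of(text, keyword, **options):
    """Return the rest of the line to the right of the keyword, stripped."""
    span = find_keyword(text, keyword, **options)
    if span is None:
        return None
    line_end = text.find("\n", span[1])
    if line_end == -1:
        line_end = len(text)
    return text[span[1]:line_end].strip()

sample = "Invoice number: 4711\nCustomer: ACME\ninvoice number: 9999"
print(value_right_of(sample, "Invoice number:"))
print(value_right_of(sample, "Invoice number:", occurrence=2))
print(value_right_of(sample, "Invoice number:", case_sensitive=True, occurrence=2))
```

In this sketch the second occurrence is only found when the search ignores capitalization, because the second line in the sample starts with a lowercase letter.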

9.4.1.3 Data Position (Position Relative to Keyword)

Position	Description
Right	Text to the right of the keyword
Left	Text to the left of the keyword
Above	Text above
Below	Text below
Found location area	The area of the keyword itself (a good starting point when you want to define the desired area by extending the data area)

9.4.2 Extend Data Area

Moves or resizes the data area found via the keyword, that is, the area from which data is extracted:

Setting	Description
To the left	Relocates the left edge of the data area by a positive or negative value
To the right	Relocates the right edge of the data area by a positive or negative value
Upward	Relocates the upper edge of the data area by a positive or negative value
Downward	Relocates the lower edge of the data area by a positive or negative value
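The effect of the four settings can be pictured as moving the edges of a rectangle. The following is a minimal sketch under stated assumptions: the origin is at the top left, y grows downward, and a positive value moves an edge outward while a negative value moves it inward. The `Rect` type, the `extend` function, and the units are hypothetical; the manual does not specify the program's coordinate system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

def extend(area, to_left=0.0, to_right=0.0, upward=0.0, downward=0.0):
    """Relocate each edge of the area.

    Assumption: positive values move an edge outward (the area grows),
    negative values move it inward (the area shrinks).
    """
    return Rect(
        left=area.left - to_left,
        top=area.top - upward,
        right=area.right + to_right,
        bottom=area.bottom + downward,
    )

# Area of the found keyword, then widened to the right and trimmed at the bottom.
keyword_area = Rect(left=100, top=50, right=180, bottom=62)
data_area = extend(keyword_area, to_right=120, downward=-2)
print(data_area)
```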

9.4.3 Adjust Data Area Extension

If the preceding data area extension references a keyword, you can fine-tune the extension here.

9.4.4 Visualization in PDF Viewer

The PDF viewer displays:

- Red: The found keyword
- Green: The data area
- Blue: The extracted value

9.5 Data Source `Document Text` - Determination: Text of Page(s)

With this determination, the entire text of one or more pages is used as a basis.

9.5.1 Data Determination (Page Text)

9.5.1.1 Determine Page

Option	Description
No determination necessary	Uses the text of all pages
Specify page number	Uses the text of the page with the specified page number
Find page with keyword	Uses the text of the page with the specified keyword

9.5.1.2 Combination with Clean-up

Data determination using page text often yields a lot of text. Use clean-up to extract the relevant part.

9.6 Data Types

9.6.1 Text

For extraction, verification, and formatting of text

For most cases, the data type Text is the right choice.

9.6.2 Date

For extraction and verification of a date

With the data type Date, all dates in the text are evaluated automatically. If you don’t specify a keyword, the first date found is used. Each date component is available separately when the placeholder is used in a path or file name, so you can, for example, use only the four-digit year and the month name.

9.6.3 Number

For extraction and verification of a number

9.6.4 Simple Query

With queries, a value is determined based on conditions.

A query defines conditions and their associated return values, for example:

Document text contains: "X<OR>Y<OR>Z", then use as result "Delivery Note", else ""
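The logic of such a condition can be sketched as follows. This is only an illustration: the function `simple_query` is hypothetical, and it assumes that `<OR>` separates alternative search terms and that matching is a case-sensitive substring test, neither of which the manual states explicitly.

```python
def simple_query(document_text, condition, result, default=""):
    """Return `result` if any <OR>-separated term occurs in the text, else `default`."""
    terms = [term for term in condition.split("<OR>") if term]
    if any(term in document_text for term in terms):
        return result
    return default

print(simple_query("Delivery note no. 5", "Delivery note<OR>Lieferschein", "Delivery Note"))
print(repr(simple_query("Invoice no. 7", "Delivery note<OR>Lieferschein", "Delivery Note")))
```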

9.6.5 Query (with List)

You can use the data type “Query (with List)” to search for the occurrence of a term and use the associated value as the result, e.g., an email address or a folder name.

List Format: Search term and result value are separated by semicolon.

Example 1: Assign email addresses based on customer numbers:

Customer number : 19006;x@y.de
Customer number : 1900;a@b.de
Customer number : 18765;c@d.de

If the PDF contains “Customer number : 19006”, “x@y.de” is used as the result.

Example 2: Search IBAN, use company name as result:

DE02120300000000202051<OR>DE02 1203 0000 0000 2020 51;Mustermann GmbH
DE02500105170137075030;Musterfrau GmbH

Here the IBAN (with or without spaces) is searched and the associated company name is returned.
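The list lookup from both examples can be sketched in a few lines. The sketch assumes that each line is split at the first semicolon, that `<OR>` separates alternative search terms, and that the first matching line in list order wins. The functions `parse_list` and `lookup` are hypothetical and only approximate the behavior described above.

```python
def parse_list(lines):
    """Turn 'search term;result' lines into (terms, result) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if ";" not in line:
            continue  # skip empty or malformed lines
        search, _, result = line.partition(";")
        terms = [term for term in search.split("<OR>") if term]
        entries.append((terms, result))
    return entries

def lookup(document_text, entries):
    """Return the result of the first entry with a term found in the text."""
    for terms, result in entries:
        if any(term in document_text for term in terms):
            return result
    return None

customers = parse_list([
    "Customer number : 19006;x@y.de",
    "Customer number : 1900;a@b.de",
    "Customer number : 18765;c@d.de",
])
ibans = parse_list([
    "DE02120300000000202051<OR>DE02 1203 0000 0000 2020 51;Mustermann GmbH",
    "DE02500105170137075030;Musterfrau GmbH",
])

print(lookup("Customer number : 19006", customers))
print(lookup("IBAN: DE02 1203 0000 0000 2020 51", ibans))
```

Under the first-match assumption of this sketch, the order of the list matters: "Customer number : 1900" is a prefix of "Customer number : 19006", so the longer term has to come first, as it does in Example 1.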

9.7 Data Source: Form Field

Extracts values from PDF form fields.

9.7.1 Field Selection

Shows all form fields present in the PDF:

Field Type	Description
TextBox	Text input field
CheckBox	Selection field (Yes/No)
RadioButton	Option button
ComboBox	Dropdown list
ListBox	Selection list

9.7.2 Form Field (Field Name)

Select the form field by its name. The name is defined in the PDF form settings.

9.8 Clean-up

Clean-up enables preprocessing of the extracted raw value.

9.8.1 Available Clean-up Tasks

Replace Operations

Task	Description
Replace text	Replaces one text with another
Replace text before marker	Replaces everything before a marker
Replace text after marker	Replaces everything after a marker
Replace regex result	Replaces regex matches
Replace line breaks	Replaces line breaks with text
Replace with Excel file	Replaces based on Excel mapping

Insert Operations

Task	Description
Insert before marker	Inserts text before a marker
Insert after marker	Inserts text after a marker
Insert at position	Inserts text at a specific position

Remove Operations

Task	Description
Remove text	Removes a specific text
Remove text before marker	Removes everything before a marker
Remove first/last characters	Removes X characters at the beginning/end
Remove regex result	Removes regex matches
Remove blank lines	Removes all blank lines
Remove lines with regex	Removes lines matching a pattern

Line Operations

Task	Description
Extract line X	Extracts only a specific line
Move line X	Moves a line to another position
Move lines with text	Moves lines containing specific text

9.8.2 Clean-up Order

Multiple clean-up tasks are executed in the defined order. Use the arrow buttons to adjust the order.
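Why the order matters can be shown with a small sketch that models each clean-up task as a function and applies the tasks one after another. The task functions below are hypothetical stand-ins loosely named after "Remove blank lines", "Extract line X", and "Remove regex result"; they are not the program's implementation.

```python
import re

def remove_blank_lines(value):
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in value.splitlines() if line.strip())

def extract_line(number):
    """Build a task that keeps only the line with the given 1-based number."""
    def task(value):
        lines = value.splitlines()
        return lines[number - 1] if 1 <= number <= len(lines) else ""
    return task

def remove_regex(pattern):
    """Build a task that removes every match of the pattern."""
    def task(value):
        return re.sub(pattern, "", value)
    return task

def run_cleanup(value, tasks):
    """Apply the tasks in the given order; each task sees the previous result."""
    for task in tasks:
        value = task(value)
    return value

raw = "Invoice\n\nNo. 2024-0042\nPage 1"
strip_prefix = remove_regex(r"^No\.\s*")

cleaned = run_cleanup(raw, [remove_blank_lines, extract_line(2), strip_prefix])
reordered = run_cleanup(raw, [extract_line(2), remove_blank_lines, strip_prefix])
print(repr(cleaned))
print(repr(reordered))
```

With blank lines removed first, line 2 is the invoice number line. With the order swapped, line 2 of the raw value is the blank line, so the result is empty.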

9.9 Verification: Text

Text verifications check the extracted value for certain conditions.

9.9.1 Available Checks

Check	Description
Text equals	Exact match
Text does not equal	No match
Text contains	Contains the search term
Text does not contain	Does not contain the search term
Text starts with	Starts with the search term
Text ends with	Ends with the search term
Text matches regex	Matches the regular expression
Text does not match regex	Does not match the expression
Extracted text is empty	No value extracted
Number of characters	Checks text length
Number of lines	Checks line count

9.9.2 Character Verification

Checks individual characters at specific positions:

Check	Description
Is digit	Character is 0-9
Is letter	Character is A-Z or a-z
Is uppercase	Character is A-Z
Is lowercase	Character is a-z
Is alphanumeric	Character is letter or digit
Matches regex	Character matches a pattern
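A character check at a given position can be sketched as below. The sketch assumes 1-based positions, restricts letters to A-Z and a-z as in the table, and treats an out-of-range position as a failed check. The check names and the function `check_character` are hypothetical.

```python
import re

def _is_digit(char):
    return "0" <= char <= "9"

def _is_upper(char):
    return "A" <= char <= "Z"

def _is_lower(char):
    return "a" <= char <= "z"

CHECKS = {
    "is_digit": _is_digit,
    "is_letter": lambda c: _is_upper(c) or _is_lower(c),
    "is_uppercase": _is_upper,
    "is_lowercase": _is_lower,
    "is_alphanumeric": lambda c: _is_digit(c) or _is_upper(c) or _is_lower(c),
}

def check_character(value, position, check, pattern=None):
    """Check the character at a 1-based position; False if the position is out of range."""
    if not 1 <= position <= len(value):
        return False
    char = value[position - 1]
    if check == "matches_regex":
        # The pattern has to match the single character as a whole.
        return re.fullmatch(pattern, char) is not None
    return CHECKS[check](char)

print(check_character("RE-2024", 1, "is_uppercase"))
print(check_character("RE-2024", 3, "is_digit"))
print(check_character("RE-2024", 3, "matches_regex", pattern=r"[-/]"))
```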

9.10 Verification: Date

Date verifications check whether the extracted value is a valid date.

9.10.1 Available Checks

Check	Description
Date is valid	Value is a recognizable date
Date is between	Date is within the specified period

9.10.2 Date Formats

The system automatically recognizes various date formats:

- 01.12.2024 (German)
- 12/01/2024 (American)
- 2024-12-01 (ISO)
- December 1, 2024 (with month name)
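Recognizing several formats and checking a period can be sketched with Python's standard `datetime` module. The list of formats, the order in which they are tried, and the functions `parse_date` and `is_between` are assumptions for illustration; the manual does not describe how the program resolves ambiguous values such as 12/01/2024. Month names are matched in English, which is the default when no other locale has been set.

```python
from datetime import date, datetime

# Tried in order; the first format that fits wins.
FORMATS = [
    "%d.%m.%Y",   # 01.12.2024 (German)
    "%m/%d/%Y",   # 12/01/2024 (American)
    "%Y-%m-%d",   # 2024-12-01 (ISO)
    "%B %d, %Y",  # December 1, 2024 (with month name)
]

def parse_date(text):
    """Return a date if the text fits one of the known formats, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

def is_between(value, start, end):
    """Check whether the date lies within the period, both ends included."""
    return start <= value <= end

for text in ["01.12.2024", "12/01/2024", "2024-12-01", "December 1, 2024", "31.02.2024"]:
    print(text, "->", parse_date(text))
```

In this sketch an impossible date such as 31.02.2024 is rejected, which corresponds to a failed "Date is valid" check.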

9.11 Verification: Number

Number verifications check numeric values.

9.11.1 Available Checks

Check	Description
Number is valid	Value is a recognizable number
Number is between	Value is within range

9.11.2 Number Formats

Recognized formats:

- 1234 (integer)
- 1.234,56 (German)
- 1,234.56 (English)
- -123.45 (negative)
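One way to normalize these formats is sketched below. It relies on a heuristic that is an assumption of this example, not documented program behavior: if both a dot and a comma occur, the one that appears last is the decimal separator; if only a comma occurs, it is treated as the decimal separator. The function `parse_number` is hypothetical.

```python
def parse_number(text):
    """Return the value as a float, or None if the text is not a recognizable number."""
    cleaned = text.strip().replace(" ", "")
    last_dot = cleaned.rfind(".")
    last_comma = cleaned.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: the later one is the decimal separator.
        decimal_sep = "." if last_dot > last_comma else ","
        thousands_sep = "," if decimal_sep == "." else "."
        cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    elif last_comma != -1:
        # Only commas present: treated as a decimal comma (assumption).
        cleaned = cleaned.replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None

def is_between(value, low, high):
    """Check whether the number lies within the range, both ends included."""
    return low <= value <= high

for text in ["1234", "1.234,56", "1,234.56", "-123.45", "abc"]:
    print(text, "->", parse_number(text))
```

A value with a single separator, such as 1.234, is ambiguous between one thousand two hundred thirty-four and a decimal number; this sketch reads it as a decimal number.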

9.12 Verification: Query

Query verifications check values based on conditions.

9.12.1 Available Checks

Check	Description
Query returns result	The query returns a value

9.13 Formatting

Formatting enables post-processing of the verified value. The main difference from clean-up is that a replacement in formatting requires the search term to be present in the value.

9.14 Using Placeholders

Extracted values can be used as placeholders in many contexts.

9.14.1 Placeholder Syntax

Syntax	Description
`<RuleName>`	Simple placeholder
`<RuleId:1(RuleName)>`	Complete syntax with ID
`<RuleName{DatePart}>`	Extract date part

9.14.2 Date Parts

DatePart	Description	Example
`Year4`	Four-digit year	2024
`Year2`	Two-digit year	24
`Month`	Month (two digits)	12
`MonthName`	Month name	December
`MonthNameAbbreviated`	Abbreviated month name	Dec
`Day`	Day (two digits)	15

Example: <InvoiceDate{Year4}>-<InvoiceDate{Month}> yields “2024-12”
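How placeholders and date parts could be expanded is sketched below. The sketch covers only the forms `<RuleName>` and `<RuleName{DatePart}>`, not the complete syntax with a rule ID. It assumes English month names, leaves unknown placeholders untouched, and renders a date without a date part in ISO form. The functions `date_part` and `expand` are hypothetical.

```python
import re
from datetime import date

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def date_part(value, part):
    """Format one component of a date according to the date part name."""
    parts = {
        "Year4": f"{value.year:04d}",
        "Year2": f"{value.year % 100:02d}",
        "Month": f"{value.month:02d}",
        "MonthName": MONTH_NAMES[value.month - 1],
        "MonthNameAbbreviated": MONTH_NAMES[value.month - 1][:3],
        "Day": f"{value.day:02d}",
    }
    return parts[part]

# Matches <Name> and <Name{Part}>.
PLACEHOLDER = re.compile(r"<(\w+)(?:\{(\w+)\})?>")

def expand(template, values):
    """Replace placeholders in the template with the extracted values."""
    def replace(match):
        name, part = match.group(1), match.group(2)
        if name not in values:
            return match.group(0)  # unknown placeholder stays as it is
        value = values[name]
        return date_part(value, part) if part else str(value)
    return PLACEHOLDER.sub(replace, template)

values = {"InvoiceDate": date(2024, 12, 15), "InvoiceNumber": "4711"}
print(expand("<InvoiceDate{Year4}>-<InvoiceDate{Month}>", values))
print(expand("D:\\Archive\\<InvoiceDate{Year4}>\\<InvoiceDate{MonthName}>\\", values))
print(expand("Invoice <InvoiceNumber> dated <InvoiceDate>", values))
```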

9.14.3 Fallback Rules

If multiple rules have the same name, the first successful rule is used. This enables fallback values:

Rule “Date” - Attempts extraction from document text
Rule “Date” - If failed: Uses file date
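The fallback behavior can be sketched as an ordered list of rules in which the first successful rule per name wins. The sketch treats a non-empty result as successful, which simplifies the verification step. The document is modeled as a plain dictionary, and `resolve` and `date_from_text` are hypothetical names.

```python
import re

def date_from_text(document):
    """First rule: look for an ISO date in the document text."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", document["text"])
    return match.group(0) if match else None

def date_from_file(document):
    """Second rule: fall back to the file date."""
    return document["file_date"]

def resolve(rules, document):
    """Evaluate rules top to bottom; keep the first successful value per rule name."""
    results = {}
    for name, extractor in rules:
        if name in results:
            continue  # an earlier rule with this name already succeeded
        value = extractor(document)
        if value:  # assumption: non-empty means successful
            results[name] = value
    return results

rules = [("Date", date_from_text), ("Date", date_from_file)]

print(resolve(rules, {"text": "Invoice dated 2024-11-30", "file_date": "2024-12-01"}))
print(resolve(rules, {"text": "No date in this text", "file_date": "2024-12-01"}))
```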
