9.1 Overview
Extraction rules enable automatic extraction of information from PDF documents. The extracted data can be used as placeholders in file names, email texts, target folders, and many other contexts.
Opening: Click Add or Edit in the Profile Settings under Data Extraction.
Typical Applications
| Application | Example |
| --- | --- |
| File names | <InvoiceDate>_<InvoiceNumber>.pdf |
| Target folders | D:\Archive\<Year>\<Month>\ |
| Email subject | Invoice <InvoiceNumber> dated <InvoiceDate> |
| CSV export | All extracted values in a table |
Structure of a Rule
Each rule consists of several components:
| Component | Description |
| --- | --- |
| General | Name, source, data type |
| Determination | How the value is found |
| Clean-up | Preprocessing of the raw value |
| Verification | Checking the found value |
| Format | Post-formatting of the value |
9.2 General
The General tab contains basic settings of the rule.
9.2.1 Name
The name of the rule. This name is used for placeholders.
Format: <RuleId:N(RuleName)>. The rule ID N is assigned automatically.
Tip: Use meaningful names without special characters, for example CustomerNumber or InvoiceDate.
Note: If you create multiple rules with the same name, it is sufficient if one of these rules achieves a valid result. The program automatically uses the first successful result. This is useful for fallback scenarios, e.g., when one rule fails for certain document types.
Optional field for notes about the rule.
9.2.3 Data Source
Determines where the data is extracted from:
| Source | Description |
| --- | --- |
| Document Text | Text of the PDF document |
| Barcode | Content of a barcode in the PDF |
| PDF Property | PDF metadata (title, author, etc.) |
| File Property | File properties (name, path, date) |
| Custom Text | Fixed or calculated value, for example n/a |
| Placeholder Value | Reference to another rule located above the current rule |
| Form Field | Value of a PDF form field |
9.2.4 Data Type
The required type of the extracted value:
| Data Type | Description |
| --- | --- |
| Text | Any text |
| Date | Date values with automatic detection |
| Number | Numeric values |
| Query | Conditional value selection |
| Query (with List) | Value from a static or file-based list |
9.3 Data Source Document Text or Barcode - Determination: Position
With position-based determination, a resizable selection rectangle marks the desired area on the page.
9.3.1 Determine Page
| Option | Description |
| --- | --- |
| Specify page number | Selection rectangle is always positioned on the specified page number |
| Find page with keyword | Selection rectangle is always positioned on the page with the specified keyword |
9.3.2 Mark in PDF Viewer
Mark the desired area directly in the page preview:
1. Click Change position and adjust the position and size of the selection rectangle to define the desired area.
2. Click Fix position.
9.4 Data Source Document Text or Barcode - Determination: Keyword
With keyword determination, a value is extracted relative to a search term (keyword).
9.3.1 Determine Page
| Option | Description |
| --- | --- |
| No determination necessary | The page is defined by the keyword specified in Define Data Area |
| Specify page number | The page is defined by a specified page number |
| Find page with keyword | The page is defined by the keyword specified here |
9.4.1 Define Data Area
9.4.1.1 Keyword
The text searched for in the document.
Example: Invoice number: to find the number to the right of it.
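The keyword idea can be approximated on plain text with a regular expression. The following is only a minimal sketch: the program works with positions on the page, whereas this example assumes that the value follows the keyword on the same text line. The function name `value_right_of` and its parameters are hypothetical, and the case-insensitive default is an assumption.

```python
import re

def value_right_of(text, keyword, case_sensitive=False, occurrence=1):
    """Return the first token to the right of the given keyword occurrence.

    Simplified text-only stand-in for the position "Right"; returns None
    if the keyword (or the requested occurrence) is not found.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    # Keyword taken literally, optional spaces/tabs, then one non-blank token.
    pattern = re.compile(re.escape(keyword) + r"[ \t]*(\S+)", flags)
    matches = list(pattern.finditer(text))
    if len(matches) < occurrence:
        return None
    return matches[occurrence - 1].group(1)

sample = "Invoice number: RE-4711\nDate: 01.12.2024"
invoice_number = value_right_of(sample, "Invoice number:")
```

With the sample text, `invoice_number` holds the token that directly follows the keyword on the same line.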
9.4.1.2 Search Options
| Option | Description |
| --- | --- |
| Case sensitive | Respects capitalization |
| Regular expression | Interprets the keyword as a regular expression |
| On multiple occurrences | Selects which occurrence of the keyword is used; normally the first |
9.4.1.3 Data Position (Position Relative to Keyword)
| Position | Description |
| --- | --- |
| Right | Text to the right of the keyword |
| Left | Text to the left of the keyword |
| Above | Text above the keyword |
| Below | Text below the keyword |
| Found location area | The area of the keyword itself (a good starting point when you extend the data area to define the desired area) |
9.4.2 Extend Data Area
Allows relocating and/or extending the area found via the keyword from which data is extracted:
| Setting | Description |
| --- | --- |
| To the left | Relocates the left edge of the data area by a positive or negative value |
| To the right | Relocates the right edge of the data area by a positive or negative value |
| Upward | Relocates the upper edge of the data area by a positive or negative value |
| Downward | Relocates the lower edge of the data area by a positive or negative value |
9.4.3 Adjust Data Area Extension
If a keyword was referenced in the previous data area extension, you can fine-tune the extension here.
9.4.4 Visualization in PDF Viewer
The PDF viewer displays:
- Red: The found keyword
- Green: The data area
- Blue: The extracted value
9.5 Data Source Document Text - Determination: Text of Page(s)
With this determination, the entire text of one or more pages is used as a basis.
9.5.1 Data Determination (Page Text)
9.5.1.1 Determine Page
| Option | Description |
| --- | --- |
| No determination necessary | Uses the text of all pages |
| Specify page number | Uses the text of the page with the specified page number |
| Find page with keyword | Uses the text of the page with the specified keyword |
9.5.1.2 Combination with Clean-up
Data determination using page text often yields a lot of text. Use clean-up to extract the relevant part.
9.6 Data Types
9.6.1 Text
For extraction, verification, and formatting of text
For most cases, the data type Text is the right choice.
9.6.2 Date
For extraction and verification of a date
With the data type Date, all dates in the text are evaluated automatically. If you don’t specify a keyword, the first date found is used. Each date component is available separately when you use the placeholder in a path or file name. For example, you can use only the four-digit year and the month name.
9.6.3 Number
For extraction and verification of a number
9.6.4 Simple Query
With queries, a value is determined based on conditions.
Defines conditions and associated return values:
Document text contains: "X<OR>Y<OR>Z", then use as result "Delivery Note", else ""
9.6.5 Query (with List)
You can use the data type “Query (with List)” to search for the occurrence of a term and use the associated value as the result, e.g., an email address or a folder name.
List Format: Search term and result value are separated by semicolon.
Example 1: Assign email addresses based on customer numbers:
Customer number : 19006;x@y.de
Customer number : 1900;a@b.de
Customer number : 18765;c@d.de
If the PDF contains “Customer number : 19006”, “x@y.de” is used as the result.
Example 2: Search IBAN, use company name as result:
DE02120300000000202051<OR>DE02 1203 0000 0000 2020 51;Mustermann GmbH
DE02500105170137075030;Musterfrau GmbH
Here the IBAN (with or without spaces) is searched and the associated company name is returned.
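The list format can be illustrated with a short sketch. It assumes that each line is split at the first semicolon, that `<OR>` separates alternative search terms, that a term matches when it occurs anywhere in the document text, and that the first matching line wins. The function names `parse_list` and `lookup` are hypothetical and not part of the program.

```python
def parse_list(lines):
    """Turn 'term<OR>term;result' lines into (terms, result) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        terms, _, result = line.partition(";")
        entries.append(([t for t in terms.split("<OR>") if t], result))
    return entries

def lookup(text, entries):
    """Return the result of the first entry with a term contained in text."""
    for terms, result in entries:
        if any(term in text for term in terms):
            return result
    return ""

customers = parse_list([
    "Customer number : 19006;x@y.de",
    "Customer number : 1900;a@b.de",
    "Customer number : 18765;c@d.de",
])
ibans = parse_list([
    "DE02120300000000202051<OR>DE02 1203 0000 0000 2020 51;Mustermann GmbH",
    "DE02500105170137075030;Musterfrau GmbH",
])
```

Under these assumptions the order of the lines matters: "Customer number : 1900" is also contained in "Customer number : 19006", so the longer term has to come first, as it does in Example 1.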
Extracts values from PDF form fields.
9.7.1 Field Selection
Shows all form fields present in the PDF:
| Field Type | Description |
| --- | --- |
| TextBox | Text input field |
| CheckBox | Selection field (Yes/No) |
| RadioButton | Option button |
| ComboBox | Dropdown list |
| ListBox | Selection list |
Select the form field by its name. The name is defined in the PDF form settings.
9.8 Clean-up
Clean-up enables preprocessing of the extracted raw value.
9.8.1 Available Clean-up Tasks
Replace Operations
| Task | Description |
| --- | --- |
| Replace text | Replaces one text with another |
| Replace text before marker | Replaces everything before a marker |
| Replace text after marker | Replaces everything after a marker |
| Replace regex result | Replaces regex matches |
| Replace line breaks | Replaces line breaks with text |
| Replace with Excel file | Replaces based on Excel mapping |
Insert Operations
| Task | Description |
| --- | --- |
| Insert before marker | Inserts text before a marker |
| Insert after marker | Inserts text after a marker |
| Insert at position | Inserts text at a specific position |
Remove Operations
| Task | Description |
| --- | --- |
| Remove text | Removes a specific text |
| Remove text before marker | Removes everything before a marker |
| Remove first/last characters | Removes X characters at the beginning/end |
| Remove regex result | Removes regex matches |
| Remove blank lines | Removes all blank lines |
| Remove lines with regex | Removes lines matching a pattern |
Line Operations
| Task | Description |
| --- | --- |
| Extract line X | Extracts only a specific line |
| Move line X | Moves a line to another position |
| Move lines with text | Moves lines containing specific text |
9.8.2 Clean-up Order
Multiple clean-up tasks are executed in the defined order. Use the arrow buttons to adjust the order.
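Why the order matters can be shown with a small sketch that models each clean-up task as a function applied to the result of the previous one. The task functions below are simplified stand-ins for "Remove blank lines", "Extract line X", and "Replace text"; their names and exact behavior are assumptions made for illustration.

```python
def remove_blank_lines(text):
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())

def extract_line(number):
    """Build a task that keeps only line `number` (counted from 1)."""
    def task(text):
        lines = text.splitlines()
        return lines[number - 1] if 1 <= number <= len(lines) else ""
    return task

def replace_text(old, new):
    """Build a task that replaces every occurrence of `old` with `new`."""
    return lambda text: text.replace(old, new)

def run_cleanup(raw_value, tasks):
    """Apply the tasks one after another, in the given order."""
    value = raw_value
    for task in tasks:
        value = task(value)
    return value

raw = "\nInvoice\n\nNo. 4711\n"
in_order = run_cleanup(raw, [remove_blank_lines, extract_line(2), replace_text("No. ", "")])
reordered = run_cleanup(raw, [extract_line(2), remove_blank_lines, replace_text("No. ", "")])
```

In this sketch `in_order` ends up as "4711", while `reordered` ends up as "Invoice", because line 2 is extracted before the blank lines have been removed.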
9.9 Verification: Text
Text verifications check the extracted value for certain conditions.
9.9.1 Available Checks
| Check | Description |
| --- | --- |
| Text equals | Exact match |
| Text does not equal | No match |
| Text contains | Contains the search term |
| Text does not contain | Does not contain the search term |
| Text starts with | Starts with the search term |
| Text ends with | Ends with the search term |
| Text matches regex | Matches the regular expression |
| Text does not match regex | Does not match the expression |
| Extracted text is empty | No value extracted |
| Number of characters | Checks text length |
| Number of lines | Checks line count |
9.9.2 Character Verification
Checks individual characters at specific positions:
| Check | Description |
| --- | --- |
| Is digit | Character is 0-9 |
| Is letter | Character is A-Z or a-z |
| Is uppercase | Character is A-Z |
| Is lowercase | Character is a-z |
| Is alphanumeric | Character is letter or digit |
| Matches regex | Character matches a pattern |
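The character checks can be sketched as follows. The example assumes that positions are counted from 1 and that a position outside the value fails the check; the names `CHECKS` and `check_character` are hypothetical. The regex check is omitted for brevity.

```python
import string

# Character classes exactly as listed in the table (ASCII only).
CHECKS = {
    "Is digit": lambda ch: ch in string.digits,
    "Is letter": lambda ch: ch in string.ascii_letters,
    "Is uppercase": lambda ch: ch in string.ascii_uppercase,
    "Is lowercase": lambda ch: ch in string.ascii_lowercase,
    "Is alphanumeric": lambda ch: ch in string.ascii_letters + string.digits,
}

def check_character(value, position, check):
    """Apply the named check to the character at a 1-based position."""
    if not 1 <= position <= len(value):
        return False
    return CHECKS[check](value[position - 1])

# Example: an invoice number that should start with two uppercase letters.
starts_with_letters = all(
    check_character("RE-2024", pos, "Is uppercase") for pos in (1, 2)
)
```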
9.10 Verification: Date
Date verifications check whether the extracted value is a valid date.
9.10.1 Available Checks
| Check | Description |
| --- | --- |
| Date is valid | Value is a recognizable date |
| Date is between | Date is within the specified period |
The system automatically recognizes various date formats:
- 01.12.2024 (German)
- 12/01/2024 (American)
- 2024-12-01 (ISO)
- December 1, 2024 (with month name)
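A comparable check can be sketched by trying a fixed list of format patterns in turn. This is not the program's detection logic: the pattern list, its order, and the function names are assumptions, and the month name is parsed with Python's default English locale.

```python
from datetime import date, datetime

# One pattern per format listed above; slash dates are read month first.
FORMATS = ["%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y"]

def parse_date(text):
    """Return a date for the first matching pattern, or None if none match."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

def date_is_between(text, start, end):
    """Sketch of 'Date is between': valid date within start..end (inclusive)."""
    parsed = parse_date(text)
    return parsed is not None and start <= parsed <= end

in_2024 = date_is_between("01.12.2024", date(2024, 1, 1), date(2024, 12, 31))
```

All four sample values from the list above parse to the same date, 1 December 2024, and an impossible date such as 31.02.2024 is rejected.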
9.11 Verification: Number
Number verifications check numeric values.
9.11.1 Available Checks
| Check | Description |
| --- | --- |
| Number is valid | Value is a recognizable number |
| Number is between | Value is within range |
Recognized formats:
- 1234 (integer)
- 1.234,56 (German)
- 1,234.56 (English)
- -123.45 (negative)
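One way to handle both the German and the English notation is to treat the separator that appears last as the decimal separator. The sketch below uses that heuristic; it is an assumption and not the program's documented behavior, and a value with a single separator (for example 1.234) is read as a decimal number. The function name `parse_number` is hypothetical.

```python
from decimal import Decimal, InvalidOperation

def parse_number(text):
    """Parse German or English number notation; return None if invalid."""
    s = text.strip().replace(" ", "")
    last_dot, last_comma = s.rfind("."), s.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both separators present: the later one is the decimal separator.
        decimal_sep = "." if last_dot > last_comma else ","
    elif s.count(".") == 1 and last_comma < 0:
        decimal_sep = "."
    elif s.count(",") == 1 and last_dot < 0:
        decimal_sep = ","
    else:
        # No separator, or one separator repeated: grouping only.
        decimal_sep = None
    grouping = {".": ",", ",": ".", None: ".,"}[decimal_sep]
    for ch in grouping:
        s = s.replace(ch, "")
    if decimal_sep == ",":
        s = s.replace(",", ".")
    try:
        return Decimal(s)
    except InvalidOperation:
        return None

def number_is_between(text, low, high):
    """Sketch of 'Number is between': valid number within low..high."""
    value = parse_number(text)
    return value is not None and low <= value <= high
```

`Decimal` is used instead of `float` so that amounts such as 1234.56 are kept exactly as written.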
9.12 Verification: Query
Query verifications check values based on conditions.
9.12.1 Available Checks
| Check | Description |
| --- | --- |
| Query returns result | The query returns a value |
Formatting enables post-processing of the verified value. The main difference from clean-up is that here the search word must be present for replacements.
9.14 Using Placeholders
Extracted values can be used as placeholders in many contexts.
9.14.1 Placeholder Syntax
| Syntax | Description |
| --- | --- |
| <RuleName> | Simple placeholder |
| <RuleId:1(RuleName)> | Complete syntax with ID |
| <RuleName{DatePart}> | Extract date part |
9.14.2 Date Parts
| DatePart | Description | Example |
| --- | --- | --- |
| Year4 | Four-digit year | 2024 |
| Year2 | Two-digit year | 24 |
| Month | Month (two digits) | 12 |
| MonthName | Month name | December |
| MonthNameAbbreviated | Abbreviated month name | Dec |
| Day | Day (two digits) | 15 |
Example: <InvoiceDate{Year4}>-<InvoiceDate{Month}> yields “2024-12”
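How such placeholders could be expanded is sketched below. The mapping of DatePart names to `strftime` codes and the function name `expand` are assumptions for illustration. The sketch handles only the simple form and the date-part form; the complete syntax with a rule ID is left untouched, and month names follow Python's default English locale.

```python
import re
from datetime import date

# Hypothetical mapping from DatePart names to strftime codes.
DATE_PARTS = {
    "Year4": "%Y",
    "Year2": "%y",
    "Month": "%m",
    "MonthName": "%B",
    "MonthNameAbbreviated": "%b",
    "Day": "%d",
}

# Matches <RuleName> and <RuleName{DatePart}>.
PLACEHOLDER = re.compile(r"<(\w+)(?:\{(\w+)\})?>")

def expand(template, values):
    """Replace known placeholders; unknown ones stay as they are."""
    def substitute(match):
        name, part = match.group(1), match.group(2)
        if name not in values:
            return match.group(0)
        value = values[name]
        if part is None:
            return str(value)
        if part not in DATE_PARTS or not isinstance(value, date):
            return match.group(0)
        return value.strftime(DATE_PARTS[part])
    return PLACEHOLDER.sub(substitute, template)

values = {"InvoiceDate": date(2024, 12, 15), "InvoiceNumber": "4711"}
folder = expand("<InvoiceDate{Year4}>-<InvoiceDate{Month}>", values)
```

With the sample values, `folder` is "2024-12", matching the example above.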
9.14.3 Fallback Rules
If multiple rules have the same name, the first successful rule is used. This enables fallback values:
- Rule “Date” - Attempts extraction from document text
- Rule “Date” - If failed: Uses file date
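The fallback behavior amounts to trying rules with the same name in order and keeping the first non-empty result. A minimal sketch, in which each rule is modeled as a function and the document as a dictionary (both are assumptions made for illustration):

```python
def first_successful(rules, document):
    """Return the first non-empty result, trying the rules in order."""
    for rule in rules:
        value = rule(document)
        if value:
            return value
    return None

def date_from_text(document):
    # Stand-in for the rule that extracts the date from the document text.
    return document.get("text_date")

def date_from_file(document):
    # Stand-in for the fallback rule that uses the file date.
    return document.get("file_date")

date_rules = [date_from_text, date_from_file]
scanned = {"text_date": None, "file_date": "2024-12-01"}
result = first_successful(date_rules, scanned)
```

Here the first rule finds no date in the text, so `result` comes from the file date.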