70.2 Text Extraction
70.2.1 Overview
Text extraction extends the placeholder logic so that you can pull specific partial values out of emails or attachments - e.g. an invoice number from the subject, a booking code from the body, or a contract partner from an attached TXT or CSV file.
In contrast to the fixed placeholders (see chapter 70.1), text extraction rules are configurable: per rule, you define which part of the email is searched, with which boundaries (from / to), and with which additional constraint (regex, number of characters).
70.2.2 Direct regex in subject or body
The simplest variant is direct regex extraction - without a separate rule definition. In any input field you can write:
<BeginOfSubjectRegex>INV-\d{4}-\d{3}<EndOfRegex>$1
Mechanics: The program applies the regex to the subject. The first match (or its capture groups) is stored in the back-references $1, $2, … The <BeginOf…>…<EndOfRegex> block itself is removed from the result. You therefore need to place the back-reference ($1) separately at the position where the found value should appear.
- Without parentheses (capture groups) in the pattern: $1 contains the full match.
- With parentheses in the pattern: $1 contains the first capture group, $2 the second, and so on.
Analogously, there is <BeginOfBodyRegex>...<EndOfRegex> for the body.
Examples:
| Subject | Placeholder in path | Result |
| --- | --- | --- |
| Invoice INV-2026-456 Mueller Ltd | <BeginOfSubjectRegex>INV-\d{4}-\d{3}<EndOfRegex>$1 | INV-2026-456 |
| Invoice INV-2026-456 Mueller Ltd | <BeginOfSubjectRegex>INV-(\d{4})-(\d{3})<EndOfRegex>$1-$2 | 2026-456 |
| Order Number 78901 | <BeginOfSubjectRegex>Number (\d+)<EndOfRegex>$1 | 78901 |
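The mechanics described above can be imitated in a few lines of Python. This is only a sketch for experimenting with patterns, not the program's implementation: the function name `expand_subject_regex` is hypothetical, Python's `re` engine may differ in details from the one the program uses, and the handling of a pattern that finds nothing (back-references become empty) is an assumption.

```python
import re

# Hypothetical helper that imitates the documented behaviour for one subject block.
BLOCK = re.compile(r"<BeginOfSubjectRegex>(.*?)<EndOfRegex>")

def expand_subject_regex(template: str, subject: str) -> str:
    """Apply the regex block to the subject and fill in $1, $2, ..."""
    block = BLOCK.search(template)
    if block is None:
        return template  # nothing to expand

    # Only the first match in the subject is used.
    match = re.search(block.group(1), subject)

    # The block itself is removed from the result.
    result = template[:block.start()] + template[block.end():]

    if match is None:
        values = []  # assumption: unmatched back-references become empty
    elif match.re.groups == 0:
        values = [match.group(0)]  # no capture groups: $1 is the full match
    else:
        values = list(match.groups(default=""))

    def backref(m: re.Match) -> str:
        index = int(m.group(1)) - 1
        return values[index] if 0 <= index < len(values) else ""

    return re.sub(r"\$(\d+)", backref, result)

subject = "Invoice INV-2026-456 Mueller Ltd"
print(expand_subject_regex(r"<BeginOfSubjectRegex>INV-\d{4}-\d{3}<EndOfRegex>$1", subject))
# INV-2026-456
print(expand_subject_regex(r"<BeginOfSubjectRegex>INV-(\d{4})-(\d{3})<EndOfRegex>$1-$2", subject))
# 2026-456
```

The two calls reproduce the first two example rows; the third row works the same way with the subject "Order Number 78901".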
Multiple matches: If the pattern occurs more than once in the same mail, only the first match is used - further matches are ignored. If the desired position is not the first match, tighten the pattern (e.g. with a more specific prefix or word boundaries \b).
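The effect of "first match only" is easy to check outside the program. The following Python snippet uses a made-up subject line and Python's `re.search`, which also returns only the first match; it is meant as an illustration of why a tighter pattern helps, not as a statement about the program's regex engine.

```python
import re

# Hypothetical subject containing two numbers.
subject = "Ref 1234 / Order Number 78901"

# A loose pattern stops at the first run of digits it finds.
loose = re.search(r"\d+", subject)

# A more specific prefix plus word boundaries selects the intended number.
tight = re.search(r"\bNumber (\d+)\b", subject)

print(loose.group(0))  # 1234
print(tight.group(1))  # 78901
```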
Multiple regex blocks in the same path: When several <BeginOf…>…<EndOfRegex> blocks are used in the same input field, the ambiguity of $1 can be resolved through the numbered back-references $R1G1, $R2G1, … - $R1G1 for group 1 of the first block, $R2G1 for group 1 of the second block.
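As a rough illustration of the numbered back-references, the sketch below resolves $RnGm against several subject blocks in Python. Everything beyond the $R1G1 / $R2G1 naming is an assumption: the function name `expand_numbered` is hypothetical, only subject blocks are handled, blocks are numbered in their order of appearance, and references that cannot be resolved become empty.

```python
import re

BLOCK = re.compile(r"<BeginOfSubjectRegex>(.*?)<EndOfRegex>")

def expand_numbered(template: str, subject: str) -> str:
    """Resolve $R<block>G<group> against every subject regex block (hypothetical)."""
    patterns = BLOCK.findall(template)      # regex of block 1, block 2, ...
    result = BLOCK.sub("", template)        # the blocks themselves are removed
    matches = [re.search(p, subject) for p in patterns]

    def backref(m: re.Match) -> str:
        block_no, group_no = int(m.group(1)), int(m.group(2))
        if not 1 <= block_no <= len(matches):
            return ""                       # assumption: unknown block -> empty
        match = matches[block_no - 1]
        if match is None or not 1 <= group_no <= match.re.groups:
            return ""                       # assumption: no match / no such group -> empty
        return match.group(group_no) or ""

    return re.sub(r"\$R(\d+)G(\d+)", backref, result)

template = (r"<BeginOfSubjectRegex>INV-(\d{4})<EndOfRegex>"
            r"<BeginOfSubjectRegex>Customer (\w+)<EndOfRegex>"
            r"$R2G1_$R1G1")
print(expand_numbered(template, "Invoice INV-2026-456 Customer Mueller"))
# Mueller_2026
```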
70.2.3 Text Extraction Rules
For more complex extractions (e.g. multi-step range narrowing, application to attachments, encoding control), use text extraction rules, which are defined in the profile editor under Text Extraction.
For each rule you configure:
| Field | Description |
| --- | --- |
| Name | Unique identifier (for the placeholder reference) |
| Source | Message body or Attachment (with file filter) |
| Encoding | ANSI, UTF-8, Unicode, or explicit code page (for attachments with a special format) |
| Range from | Search string or regex from which extraction begins |
| Range to | Search string or regex at which extraction ends |
| Constraint | First X characters, Last X characters, or Regex on the extracted range |
| Value conversion | Optional lookup table that further maps the extracted value (e.g. code -> plain text) |
70.2.4 Using the Rule as a Placeholder
You reference a configured rule as a placeholder:
| Placeholder | Effect |
| --- | --- |
| <MRuleId:5(InvoiceNumber)> | Applies the rule with ID 5 (display name “InvoiceNumber”) to the message body |
| <FRuleId:7(BookingCode)> | Applies the rule with ID 7 (display name “BookingCode”) to the matching attachment |
MRuleId stands for Message-Rule (message body), FRuleId for File-Rule (file attachment). The ID is the unique key of the rule; the suffix in parentheses is only a readable display name and is ignored during processing.
Selection is made in the placeholder menu - all defined rules appear under “Text Extraction”.
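To make the structure of these placeholders concrete, the following Python sketch parses one into its source and rule ID and deliberately discards the display name. The names `RuleRef` and `parse_rule_ref` are hypothetical and the regular expression is inferred only from the two placeholders shown above; the program's actual parser may accept other forms.

```python
import re
from typing import NamedTuple, Optional

class RuleRef(NamedTuple):
    source: str   # "message" for MRuleId, "file" for FRuleId
    rule_id: int  # unique key of the rule

# Pattern inferred from the documented examples only.
RULE_REF = re.compile(r"<([MF])RuleId:(\d+)\([^)]*\)>")

def parse_rule_ref(placeholder: str) -> Optional[RuleRef]:
    """Return source and ID of a rule placeholder; the display name is ignored."""
    m = RULE_REF.fullmatch(placeholder)
    if m is None:
        return None
    source = "message" if m.group(1) == "M" else "file"
    return RuleRef(source, int(m.group(2)))

print(parse_rule_ref("<MRuleId:5(InvoiceNumber)>"))
# RuleRef(source='message', rule_id=5)
print(parse_rule_ref("<MRuleId:5(RenamedLater)>") == parse_rule_ref("<MRuleId:5(InvoiceNumber)>"))
# True
```

The second call shows the practical consequence of the ID being the key: a different display name still resolves to the same rule.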
70.2.5 Range Narrowing
The two-step range narrowing (from + to) is the central logic:
- Range from: Search string identifies the starting position. Everything before it is ignored.
- Range to: Search string identifies the end position. Everything after it is ignored.
- The text in between is the raw match.
- The constraint is applied to the raw match (e.g. first 20 characters).
- Optional: value conversion through a lookup table.
Example email body:
Dear Sir or Madam,
We hereby send you Invoice Number INV-2026-456
with a total amount of 1,234.56 EUR.
Best regards
Rule:
- Range from: Number
- Range to: with
- Constraint: none
Result: INV-2026-456
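The narrowing steps can be traced with a small Python sketch. It is a simplified model under stated assumptions, not the program's logic: the function `extract_range` and its parameters are hypothetical, the boundaries are treated as plain search strings (no regex), only the "First X characters" constraint is modelled, surrounding whitespace is assumed to be trimmed from the raw match, and a missing boundary is assumed to yield no result.

```python
from typing import Optional

def extract_range(text: str, range_from: str, range_to: str,
                  first_chars: Optional[int] = None,
                  lookup: Optional[dict] = None) -> Optional[str]:
    """Narrow text to the part between range_from and range_to (hypothetical model)."""
    start = text.find(range_from)
    if start == -1:
        return None                      # assumption: no start boundary -> no result
    start += len(range_from)             # everything up to and including the marker is ignored

    end = text.find(range_to, start)
    if end == -1:
        return None                      # assumption: no end boundary -> no result

    raw = text[start:end].strip()        # assumption: whitespace is trimmed
    if first_chars is not None:
        raw = raw[:first_chars]          # constraint: first X characters
    if lookup is not None:
        raw = lookup.get(raw, raw)       # optional value conversion
    return raw

body = """Dear Sir or Madam,
We hereby send you Invoice Number INV-2026-456
with a total amount of 1,234.56 EUR.
Best regards"""

print(extract_range(body, "Number", "with"))
# INV-2026-456
print(extract_range(body, "Number", "with", first_chars=3))
# INV
```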
70.2.6 Encoding and Attachment Sources
For file-based extraction (source: attachment), the program reads the attachment with the configured encoding:
| Encoding | When to use |
| --- | --- |
| ANSI | Classic Windows text files |
| UTF-8 | Modern text files, JSON, XML |
| Unicode | UTF-16 Little-Endian (typical Windows email bodies) |
| Code page | Explicit code page (e.g. 1252, 850) for legacy formats |
Text extraction only works for pure text attachments (e.g. TXT, CSV, XML, JSON, HTML). Binary formats are not supported.
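Why the encoding setting matters can be shown with Python's built-in codecs. The mapping below is an assumption made for illustration: "ANSI" is equated with Windows code page 1252 (on a real system it is the machine's active code page), and the function `decode_attachment` is hypothetical rather than part of the program.

```python
# Hypothetical mapping from the documented encoding options to Python codec names.
CODECS = {
    "ANSI": "cp1252",      # assumption: Western European Windows code page
    "UTF-8": "utf-8-sig",  # also tolerates a leading byte-order mark
    "Unicode": "utf-16-le",
}

def decode_attachment(data: bytes, encoding: str) -> str:
    """Decode attachment bytes; any other value is treated as a code page number."""
    codec = CODECS.get(encoding, f"cp{encoding}")  # e.g. "850" -> "cp850"
    return data.decode(codec)

utf8_bytes = "Müller".encode("utf-8")

print(decode_attachment(utf8_bytes, "UTF-8"))  # Müller
print(decode_attachment(utf8_bytes, "ANSI"))   # MÃ¼ller
```

The second line shows the typical symptom of a wrong setting: the file is read without error, but umlauts and other non-ASCII characters come out garbled, so a rule that searches for them will not match.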
70.2.7 Use case
Invoice number from subject
Email subject: “Invoice INV-2026-456 from May 7.”
Path pattern: D:\Incoming-Invoices\<EmailYear4>\<BeginOfSubjectRegex>INV-\d{4}-\d{3}<EndOfRegex>$1.pdf
The regex finds INV-2026-456, stores this value in $1, and removes the <BeginOf…>…<EndOfRegex> block from the path.
Final result: D:\Incoming-Invoices\2026\INV-2026-456.pdf
70.2.8 Tips
- Value conversion through a lookup table is powerful: you can convert an extracted code directly into readable plain text (see chapter 70.3).
- Test new rules on sample emails in the profile editor; the preview shows the result directly.
70.2.9 Related how-tos
- How to extract text with regex - step-by-step instructions for the direct-regex variant and text extraction rules, including a beginner-friendly introduction to regex building blocks