70.2 Text Extraction

70.2.1 Overview

Text extraction extends the placeholder logic so that you can read specific partial values from emails or attachments - e.g. an invoice number from the subject, a booking code from the body, or a contract partner from an attached TXT or CSV file.

In contrast to the fixed placeholders (see chapter 70.1), text extraction rules are configurable: per rule, you define which part of the email is searched, with which boundaries (from / to), and with which additional constraint (regex, number of characters).


70.2.2 Direct regex in subject or body

The simplest variant is direct regex extraction - without a separate rule definition. In any input field you can write:

<BeginOfSubjectRegex>INV-\d{4}-\d{3}<EndOfRegex>$1

Mechanics: The program applies the regex to the subject. The first match (or its capture groups) is stored in the back-references $1, $2, … The <BeginOf…>…<EndOfRegex> block itself is removed from the result. You therefore need to place the back-reference ($1) separately at the position where the found value should appear.

  • Without parentheses (capture groups) in the pattern: $1 contains the full match.
  • With parentheses in the pattern: $1 contains the first capture group, $2 the second, and so on.

Analogously, there is <BeginOfBodyRegex>...<EndOfRegex> for the body.

Examples:

Subject                          | Placeholder in path                                       | Result
Invoice INV-2026-456 Mueller Ltd | <BeginOfSubjectRegex>INV-\d{4}-\d{3}<EndOfRegex>$1        | INV-2026-456
Invoice INV-2026-456 Mueller Ltd | <BeginOfSubjectRegex>INV-(\d{4})-(\d{3})<EndOfRegex>$1-$2 | 2026-456
Order Number 78901               | <BeginOfSubjectRegex>Number (\d+)<EndOfRegex>$1           | 78901
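
The mechanics described above can be approximated in a few lines of Python. This is only a sketch for trying out patterns outside the program: the function name expand_subject_regex is hypothetical, Python's re module stands in for the program's regex engine (whose dialect is not stated here), and replacing back-references with an empty string when nothing matches is an assumption.

```python
import re

# Matches one <BeginOfSubjectRegex>...<EndOfRegex> block and captures the pattern.
BLOCK = re.compile(r"<BeginOfSubjectRegex>(.*?)<EndOfRegex>")


def expand_subject_regex(template: str, subject: str) -> str:
    """Emulate a single direct-regex block in an input field (hypothetical helper)."""
    block = BLOCK.search(template)
    if block is None:
        return template

    # The block itself is removed from the result.
    rest = template[:block.start()] + template[block.end():]

    match = re.search(block.group(1), subject)  # only the first match is used
    if match is None:
        values = []  # assumption: unresolved back-references become empty
    else:
        # Without capture groups, $1 is the full match; otherwise $n is group n.
        values = list(match.groups()) or [match.group(0)]

    def backref(ref: re.Match) -> str:
        index = int(ref.group(1)) - 1
        if 0 <= index < len(values) and values[index] is not None:
            return values[index]
        return ""

    return re.sub(r"\$(\d+)", backref, rest)


print(expand_subject_regex(
    r"<BeginOfSubjectRegex>INV-(\d{4})-(\d{3})<EndOfRegex>$1-$2",
    "Invoice INV-2026-456 Mueller Ltd",
))
```

Called with the three rows of the table, the sketch reproduces the values in the Result column.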

Multiple matches: If the pattern occurs more than once in the same mail, only the first match is used - further matches are ignored. If the desired position is not the first match, tighten the pattern (e.g. with a more specific prefix or word boundaries \b).
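
The first-match rule and the effect of tightening a pattern can be tried out with Python's re module, used here purely as a stand-in for the program's regex engine. The sample subject is made up for illustration.

```python
import re

# Hypothetical subject that contains two invoice numbers.
subject = "Re: INV-2025-001 superseded by Invoice INV-2026-456"

# re.search returns only the first occurrence, like the rule described above.
first = re.search(r"INV-\d{4}-\d{3}", subject).group(0)

# A more specific prefix plus a word boundary selects the intended occurrence.
tightened = re.search(r"Invoice (INV-\d{4}-\d{3})\b", subject).group(1)

print(first)      # INV-2025-001
print(tightened)  # INV-2026-456
```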

Multiple regex blocks in the same path: When several <BeginOf…>…<EndOfRegex> blocks are used in the same input field, the ambiguity of $1 can be resolved through the numbered back-references $R1G1, $R2G1, … - $R1G1 for group 1 of the first block, $R2G1 for group 1 of the second block.
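
How the numbered back-references resolve can be sketched as follows. The function expand_numbered is hypothetical, Python's re module stands in for the program's regex engine, and both the mixing of subject and body blocks in one field and the empty result for unmatched references are assumptions made for this illustration.

```python
import re

BLOCK = re.compile(r"<BeginOf(Subject|Body)Regex>(.*?)<EndOfRegex>")
NUMBERED_REF = re.compile(r"\$R(\d+)G(\d+)")


def expand_numbered(template: str, subject: str, body: str) -> str:
    """Resolve $R<block>G<group> references (hypothetical helper)."""
    sources = {"Subject": subject, "Body": body}

    # One list of group values per block, in order of appearance.
    results = []
    for block in BLOCK.finditer(template):
        match = re.search(block.group(2), sources[block.group(1)])
        if match is None:
            results.append([])
        else:
            results.append(list(match.groups()) or [match.group(0)])

    # All blocks are removed; only the back-references remain in place.
    rest = BLOCK.sub("", template)

    def lookup(ref: re.Match) -> str:
        r, g = int(ref.group(1)) - 1, int(ref.group(2)) - 1
        if 0 <= r < len(results) and 0 <= g < len(results[r]):
            return results[r][g] or ""
        return ""  # assumption: unknown references resolve to nothing

    return NUMBERED_REF.sub(lookup, rest)


template = (
    r"<BeginOfSubjectRegex>INV-(\d{4})-(\d{3})<EndOfRegex>"
    r"<BeginOfBodyRegex>Customer (\w+)<EndOfRegex>"
    r"$R2G1_$R1G1-$R1G2"
)
print(expand_numbered(template, "Invoice INV-2026-456", "Customer Mueller ordered"))
```

Here $R2G1 takes group 1 of the second block (the body), while $R1G1 and $R1G2 take the two groups of the first block (the subject).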


70.2.3 Text Extraction Rules

For more complex extractions (e.g. multi-step range narrowing, application to attachments, encoding control), use text extraction rules, which are defined in the profile editor under Text Extraction.

For each rule you configure:

Field            | Description
Name             | Unique identifier (for the placeholder reference)
Source           | Message body or Attachment (with file filter)
Encoding         | ANSI, UTF-8, Unicode, or explicit code page (for attachments with a special format)
Range from       | Search string or regex from which extraction begins
Range to         | Search string or regex at which extraction ends
Constraint       | First X characters, Last X characters, or Regex on the extracted range
Value conversion | Optional lookup table that further maps the extracted value (e.g. code -> plain text)

70.2.4 Using the Rule as a Placeholder

You reference a configured rule as a placeholder:

Placeholder                | Effect
<MRuleId:5(InvoiceNumber)> | Applies the rule with ID 5 (display name “InvoiceNumber”) to the message body
<FRuleId:7(BookingCode)>   | Applies the rule with ID 7 (display name “BookingCode”) to the matching attachment

MRuleId stands for Message-Rule (message body), FRuleId for File-Rule (file attachment). The ID is the unique key of the rule; the suffix in parentheses is only a readable display name and is ignored during processing.
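
The placeholder syntax can be taken apart with a small parser, for example when checking exported profiles with a script. This is a sketch only: RuleRef and parse_rule_placeholder are hypothetical names, and treating the display name as optional is an assumption.

```python
import re
from typing import NamedTuple, Optional


class RuleRef(NamedTuple):
    source: str   # "message" for MRuleId, "file" for FRuleId
    rule_id: int  # the unique key of the rule


# The display name in parentheses is matched but deliberately not captured.
RULE_PLACEHOLDER = re.compile(r"<([MF])RuleId:(\d+)(?:\([^)]*\))?>")


def parse_rule_placeholder(text: str) -> Optional[RuleRef]:
    """Return the rule reference found in text, or None (hypothetical helper)."""
    m = RULE_PLACEHOLDER.search(text)
    if m is None:
        return None
    source = "message" if m.group(1) == "M" else "file"
    return RuleRef(source, int(m.group(2)))


print(parse_rule_placeholder("<MRuleId:5(InvoiceNumber)>"))
print(parse_rule_placeholder("<FRuleId:7(BookingCode)>"))
```

Because only the ID is evaluated, a placeholder with a changed display name yields the same RuleRef.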

Selection is made in the placeholder menu - all defined rules appear under “Text Extraction”.


70.2.5 Range Narrowing

The two-step range narrowing (from + to) is the central logic:

  1. Range from: The search string or regex marks the starting position. Everything before it is ignored.
  2. Range to: The search string or regex marks the end position. Everything after it is ignored.
  3. The text in between is the raw match.
  4. The constraint is applied to the raw match (e.g. first 20 characters).
  5. Optional: value conversion through a lookup table.

Example email body:

Dear Sir or Madam,
We hereby send you Invoice Number INV-2026-456
with a total amount of 1,234.56 EUR.
Best regards

Rule:

  • Range from: Number
  • Range to: with
  • Constraint: none

Result: INV-2026-456
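
The five steps can be reproduced with a short Python sketch. The function extract_range and its parameters are hypothetical. It assumes plain search strings (no regex), that "Range to" is searched only after the start position, that surrounding whitespace is trimmed from the raw match, and that a missing boundary yields no result; the program itself may behave differently in these details.

```python
from typing import Dict, Optional


def extract_range(
    text: str,
    range_from: str,
    range_to: str,
    first_chars: Optional[int] = None,
    lookup: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Two-step range narrowing with optional constraint and lookup (sketch)."""
    # Step 1: everything up to and including "Range from" is ignored.
    start = text.find(range_from)
    if start == -1:
        return None
    start += len(range_from)

    # Step 2: everything from "Range to" onwards is ignored.
    end = text.find(range_to, start)
    if end == -1:
        return None

    # Step 3: the text in between is the raw match (trimming is an assumption).
    raw = text[start:end].strip()

    # Step 4: constraint, here only "First X characters".
    if first_chars is not None:
        raw = raw[:first_chars]

    # Step 5: optional value conversion; unknown values pass through unchanged.
    if lookup is not None:
        raw = lookup.get(raw, raw)
    return raw


body = (
    "Dear Sir or Madam,\n"
    "We hereby send you Invoice Number INV-2026-456\n"
    "with a total amount of 1,234.56 EUR.\n"
    "Best regards"
)
print(extract_range(body, "Number", "with"))
```

With the example body and the rule above, the sketch yields INV-2026-456; adding first_chars=8 would shorten this to INV-2026.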


70.2.6 Encoding and Attachment Sources

For file-based extraction (source: attachment), the program reads the attachment with the configured encoding:

Encoding  | When to use
ANSI      | Classic Windows text files
UTF-8     | Modern text files, JSON, XML
Unicode   | UTF-16 Little-Endian (typical Windows email bodies)
Code page | Explicit code page (e.g. 1252, 850) for legacy formats

Text extraction only works for pure text attachments (e.g. TXT, CSV, XML, JSON, HTML). Binary formats are not supported.
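
To check in advance which encoding setting fits a given attachment, the decoding can be imitated in Python. The function decode_attachment is hypothetical, and mapping ANSI to code page 1252 is an assumption: the actual ANSI code page depends on the Windows system locale.

```python
def decode_attachment(data: bytes, encoding: str) -> str:
    """Decode attachment bytes according to an encoding setting (sketch)."""
    codecs_by_setting = {
        "ANSI": "cp1252",        # assumption: Western European Windows default
        "UTF-8": "utf-8-sig",    # also drops a leading byte order mark
        "Unicode": "utf-16-le",  # UTF-16 Little-Endian
    }
    codec = codecs_by_setting.get(encoding)
    if codec is None:
        # Anything else is treated as an explicit code page number, e.g. "850".
        codec = f"cp{int(encoding)}"

    text = data.decode(codec)
    # utf-16-le keeps a byte order mark as a character; remove it if present.
    return text[1:] if text.startswith("\ufeff") else text


sample = "Müller GmbH"
print(decode_attachment(sample.encode("cp850"), "850"))
print(decode_attachment(sample.encode("utf-16-le"), "Unicode"))
```

If the wrong setting is chosen, umlauts and other non-ASCII characters typically come out garbled or decoding fails, which is a quick way to spot a mismatch.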


70.2.7 Use case

Invoice number from subject

Email subject: “Invoice INV-2026-456 from May 7.”

Path pattern: D:\Incoming-Invoices\<EmailYear4>\<BeginOfSubjectRegex>INV-\d{4}-\d{3}<EndOfRegex>$1.pdf

The program applies the regex to the subject, finds INV-2026-456, stores this value in $1, and removes the <BeginOf…>…<EndOfRegex> block from the path. <EmailYear4> supplies the year of the email (2026 in this example).

Final result: D:\Incoming-Invoices\2026\INV-2026-456.pdf


70.2.8 Tips

  • Value conversion through a lookup table lets you convert an extracted code directly into readable plain text (see chapter 70.3).
  • Test new rules on sample emails in the profile editor - the preview shows the result directly.

70.2.9 Related how-tos

  • How to extract text with regex - step-by-step instructions for the direct-regex variant and text extraction rules, including a beginner-friendly introduction to regex building blocks