Understanding Data Extraction
Learn how to extract data from PDF documents
At a Glance
- Difficulty: Beginner
- Time required: ~20 minutes
- Prerequisites: Getting Started tutorial
- What you'll learn: Keywords, data areas, extraction rules, data types
What is data extraction?
Data extraction allows you to automatically read specific values from PDF documents - such as invoice numbers,
dates, customer names, or amounts. These extracted values can then be used as placeholders in
file names, folder paths, emails, and other tasks.
Important: Data extraction only works with PDFs that contain searchable text.
Scanned documents (image-only PDFs) must first be processed with OCR (text recognition) before data can be extracted.
The basic concept: Keywords and data areas
Data extraction works by finding a keyword in the PDF text and then reading the
data area relative to that keyword. Think of it like this:
Example PDF content:
Invoice Number: INV-2024-0042
Invoice Date: December 15, 2024
Customer: ACME Corporation
Total Amount: $1,234.56
To extract the invoice number INV-2024-0042, you would:
- Set the keyword to "Invoice Number:"
- Configure the data area to read the text after the keyword
The keyword acts as an anchor point - it tells the program where to look. The data area defines
exactly which text to capture relative to that anchor.
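The anchor idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how the program itself is implemented: the function name `extract_after_keyword` is hypothetical, and the sketch assumes the data area is simply "the rest of the line after the keyword".

```python
from typing import Optional

SAMPLE_TEXT = """Invoice Number: INV-2024-0042
Invoice Date: December 15, 2024
Customer: ACME Corporation
Total Amount: $1,234.56"""


def extract_after_keyword(text: str, keyword: str) -> Optional[str]:
    """Return the rest of the line after the first occurrence of keyword."""
    anchor = text.find(keyword)
    if anchor == -1:
        return None  # keyword not found, so there is nothing to anchor on
    value_start = anchor + len(keyword)
    line_end = text.find("\n", value_start)
    if line_end == -1:
        line_end = len(text)  # keyword is on the last line
    return text[value_start:line_end].strip()


print(extract_after_keyword(SAMPLE_TEXT, "Invoice Number:"))  # INV-2024-0042
```

The keyword fixes the position; everything else is a decision about how far the data area reaches from that position.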
Step 1: Add sample files
Before creating extraction rules, you need sample PDF files. These files are used to preview and test
your extraction configuration without processing actual documents.
- Open the profile settings (double-click a profile or click "Edit profile...")
- Go to the Example Files category
- Click Add... and select 5 or more PDF files similar to those you want to process
- Choose files from a separate folder that won't be processed by the profile
Why multiple files? Having several sample files helps ensure your extraction rules work
consistently across different documents, not just one specific file.
Step 2: Open the rule editor
- In the profile settings, go to the Data Extraction category
- Click Create/Edit Rules... to open the Rule Management window
- Click New Rule... to create your first extraction rule
The rule editor shows a preview of your sample PDF on the left and the configuration options on the right.
As you configure the rule, the extracted result updates in real time.
Step 3: Configure the keyword
The keyword is the text that identifies where your data is located. Enter a word or phrase that
appears consistently in your documents, directly before or near the data you want to extract.
Good keywords:
- "Invoice Number:" (a specific label that appears before the value)
- "Total:" (a clear identifier for the amount)
- "Date:" (a common label for date fields)
Avoid these keywords:
- "Invoice" (too generic, may appear multiple times)
- "the", "and", "of" (common words that appear everywhere)
- Variable text such as actual values or dates
In the PDF preview, the keyword is highlighted in red
when found. If the keyword appears multiple times, you can specify which occurrence to use.
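To see why a generic keyword is risky, it can help to count how often a candidate appears in the text of a sample file. The sketch below is a hypothetical helper, not part of the program; it assumes a case-sensitive, non-overlapping search and 1-based occurrence numbers.

```python
from typing import List, Optional


def find_occurrences(text: str, keyword: str) -> List[int]:
    """Return the start index of every non-overlapping occurrence of keyword."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    positions = []
    start = text.find(keyword)
    while start != -1:
        positions.append(start)
        start = text.find(keyword, start + len(keyword))
    return positions


def nth_occurrence(text: str, keyword: str, n: int) -> Optional[int]:
    """Return the start index of the n-th occurrence (1-based), or None."""
    positions = find_occurrences(text, keyword)
    return positions[n - 1] if 1 <= n <= len(positions) else None


PAGE_TEXT = (
    "Invoice\n"
    "Invoice Number: INV-2024-0042\n"
    "Please pay this Invoice within 14 days."
)

print(len(find_occurrences(PAGE_TEXT, "Invoice")))          # 3
print(len(find_occurrences(PAGE_TEXT, "Invoice Number:")))  # 1
```

A keyword that matches exactly once needs no occurrence setting; one that matches several times only works if the wanted occurrence is at the same position in every document.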
Step 4: Define the data area
The data area determines which text is captured relative to the keyword. By default, the program
captures the text block immediately following the keyword.
Data area options:
| Setting | Description | Use when... |
| Text Block | Captures the adjacent text block after the keyword | The data directly follows the keyword and is clearly separated from other text |
| First Character | Captures only the first character; you then extend the area manually | The text block includes unwanted adjacent data |
| Extend Data Area | Adds characters before or after, or sets a fixed length | You need precise control over the captured text |
In the PDF preview, the data area is highlighted in green.
Check that only the desired text is highlighted.
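The difference between capturing a whole text block and capturing a fixed length can be sketched as follows. How the program actually delimits a text block is not described here, so the sketch makes a loud assumption: a block ends at a run of two or more spaces or at the end of the line. All function names are hypothetical.

```python
import re
from typing import Optional

# Assumed sample line where unrelated text follows the value after a single space.
LINE = "Invoice Number: INV-2024-0042 Page 1 of 2"


def value_start(text: str, keyword: str) -> Optional[int]:
    """Index of the first non-space character after the keyword, or None."""
    pos = text.find(keyword)
    if pos == -1:
        return None
    pos += len(keyword)
    while pos < len(text) and text[pos] == " ":
        pos += 1
    return pos


def capture_text_block(text: str, keyword: str) -> Optional[str]:
    """Capture up to a gap of 2+ spaces or a line break (assumed block rule)."""
    start = value_start(text, keyword)
    if start is None:
        return None
    gap = re.search(r" {2,}|\n", text[start:])
    end = start + gap.start() if gap else len(text)
    return text[start:end].rstrip()


def capture_fixed_length(text: str, keyword: str, length: int) -> Optional[str]:
    """Start at the first character and extend to a fixed number of characters."""
    start = value_start(text, keyword)
    if start is None:
        return None
    return text[start:start + length]


print(capture_text_block(LINE, "Invoice Number:"))        # INV-2024-0042 Page 1 of 2
print(capture_fixed_length(LINE, "Invoice Number:", 13))  # INV-2024-0042
```

In this sample the block capture picks up the page note as well, which is the situation where starting from the first character and extending by a fixed length gives a cleaner result.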
Step 5: Choose the data type
The data type determines how the extracted value is processed and what options are available when using it.
| Data Type | Description | Example |
| Text | General text extraction - use for most values | Invoice numbers, names, IDs |
| Date | Recognizes date formats, allows accessing year/month/day separately | Invoice date, due date, order date |
| Number | Extracts numeric values, handles different formats | Amounts, quantities, page counts |
| Query | Returns a value based on whether keywords are found | "Yes" if "Paid" found, "No" otherwise |
| Query with List | Matches against a list to determine category | Document type, customer name from list |
Tip: When extracting dates, always use the Date data type.
This allows you to reformat dates (e.g., convert "December 15, 2024" to "2024-12-15") using date placeholders.
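What each data type adds on top of the raw captured text can be sketched in Python. This is a rough analogy under stated assumptions, not the program's logic: dates are assumed to look like "December 15, 2024" (English month names), amounts are assumed to use a comma as thousands separator and a dot as decimal separator, and all function names are hypothetical.

```python
from datetime import date, datetime
from decimal import Decimal
from typing import List, Optional


def parse_date(raw: str) -> date:
    """Date type: turn the captured text into a value with year/month/day."""
    # %B matches the full month name in the current locale (English assumed).
    return datetime.strptime(raw.strip(), "%B %d, %Y").date()


def parse_amount(raw: str) -> Decimal:
    """Number type: strip currency symbol and thousands separators."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


def query(text: str, keyword: str, if_found: str = "Yes", if_missing: str = "No") -> str:
    """Query type: return one of two values depending on whether keyword occurs."""
    return if_found if keyword in text else if_missing


def query_with_list(text: str, candidates: List[str]) -> Optional[str]:
    """Query with List type: return the first list entry found in the text."""
    for candidate in candidates:
        if candidate in text:
            return candidate
    return None


invoice_date = parse_date("December 15, 2024")
print(invoice_date.isoformat())   # 2024-12-15
print(invoice_date.year)          # 2024
print(parse_amount("$1,234.56"))  # 1234.56
```

Treating the value as plain text would keep "December 15, 2024" exactly as written; parsing it as a date is what makes the year, month, and day available separately.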
Step 6: Give the rule a name
Enter a descriptive name for your extraction rule. This name will appear in placeholder menus
throughout the application. Use clear, meaningful names like:
InvoiceNumber
InvoiceDate
CustomerName
TotalAmount
Avoid spaces and special characters in rule names for easier use in placeholders.
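If you keep a list of planned rule names, a quick check like the one below can flag names that contain spaces or special characters. The exact set of characters the program accepts is not stated here, so the pattern (a letter followed by letters, digits, or underscores) is an assumption chosen to be conservative.

```python
import re

# Assumed "safe" shape: starts with a letter, then letters, digits, or underscores.
SIMPLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_simple_rule_name(name: str) -> bool:
    """Return True if the name has no spaces or special characters."""
    return SIMPLE_NAME.fullmatch(name) is not None


for candidate in ["InvoiceNumber", "Invoice Number", "Total-Amount"]:
    print(candidate, is_simple_rule_name(candidate))
```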
Step 7: Test with all sample files
After configuring the rule, test it with all your sample files:
- Use the file selector at the top of the rule editor to switch between sample files
- Verify that the extraction works correctly for each file
- Check the preview area to see the extracted value
- Adjust the configuration if extraction fails on some files
Click OK to save the rule when you're satisfied with the results.
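The reason for checking every sample file can be shown with a small sketch that applies one rule to several documents and lists the ones where nothing was found. The sample texts and file names are invented for illustration, and `extract_after_keyword` is a hypothetical stand-in for a rule, not the program's own logic.

```python
from typing import Dict, List, Optional


def extract_after_keyword(text: str, keyword: str) -> Optional[str]:
    """Return the rest of the line after the first occurrence of keyword."""
    anchor = text.find(keyword)
    if anchor == -1:
        return None
    value_start = anchor + len(keyword)
    line_end = text.find("\n", value_start)
    if line_end == -1:
        line_end = len(text)
    return text[value_start:line_end].strip()


def failing_samples(samples: Dict[str, str], keyword: str) -> List[str]:
    """Return the names of samples where the rule yields no value."""
    return [
        name
        for name, text in samples.items()
        if not extract_after_keyword(text, keyword)
    ]


# Invented sample texts; the third uses a different label for the same field.
SAMPLES = {
    "sample1.pdf": "Invoice Number: INV-2024-0042\nTotal Amount: $1,234.56",
    "sample2.pdf": "Invoice Number: INV-2024-0043\nTotal Amount: $99.00",
    "sample3.pdf": "Invoice No.: INV-2024-0044\nTotal Amount: $10.00",
}

print(failing_samples(SAMPLES, "Invoice Number:"))  # ['sample3.pdf']
```

A rule that looks correct on the first file can still miss documents that label the field differently, which only shows up when every sample is checked.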
Using extracted data
Once you've created extraction rules, you can use the extracted values as placeholders in various tasks:
- Rename files: <RuleId:1(InvoiceNumber)>.pdf
- Create subfolders: <RuleId:2(CustomerName)>\<RuleId:3(InvoiceDate){Year4}>
- Email subject: Invoice <RuleId:1(InvoiceNumber)> from <RuleId:3(InvoiceDate)>
Learn more about using placeholders in the Placeholder System Explained tutorial.
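Conceptually, a placeholder is replaced by the value its rule extracted from the current document. The sketch below imitates that substitution for the placeholder shapes shown above; it is a simplified model, not the program's implementation. The rule IDs and values are illustrative, and `Year4` is the only format handled because it is the only one that appears in this tutorial.

```python
import re
from datetime import date
from typing import Dict, Union

# Matches <RuleId:N(Name)> with an optional {Format} suffix.
PLACEHOLDER = re.compile(r"<RuleId:(\d+)\((\w+)\)(?:\{(\w+)\})?>")

Value = Union[str, date]


def expand(template: str, values: Dict[int, Value]) -> str:
    """Replace each placeholder with the value extracted by its rule."""

    def replace(match: "re.Match[str]") -> str:
        rule_id = int(match.group(1))
        fmt = match.group(3)
        if rule_id not in values:
            return match.group(0)  # leave unknown placeholders untouched
        value = values[rule_id]
        if fmt == "Year4" and isinstance(value, date):
            return f"{value.year:04d}"
        return str(value)

    return PLACEHOLDER.sub(replace, template)


# Illustrative values, keyed by rule ID.
VALUES: Dict[int, Value] = {
    1: "INV-2024-0042",
    2: "ACME Corporation",
    3: date(2024, 12, 15),
}

print(expand("<RuleId:1(InvoiceNumber)>.pdf", VALUES))
# INV-2024-0042.pdf
print(expand(r"<RuleId:2(CustomerName)>\<RuleId:3(InvoiceDate){Year4}>", VALUES))
# ACME Corporation\2024
```

In this model the number after `RuleId:` is what selects the rule; the name in parentheses is treated as a readable label only.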
Result
After completing this tutorial, you understand:
- How keywords and data areas work together
- How to create and configure extraction rules
- Which data type to choose for different values
- How to test extraction with sample files
- How extracted values become placeholders for other tasks
Common issues & solutions
| Problem | Solution |
| Keyword not found (not highlighted in red) | Check that the spelling exactly matches the PDF text; try a shorter or different keyword; verify that the PDF contains searchable text (not a scan) |
| Wrong text captured (green highlight includes extra text) | Switch to "First Character" and use "Extend Data Area"; set a fixed character count; use "Stop at" to define where extraction ends |
| Extraction works for some files but not others | Check whether the keyword appears differently in the failing files; use a more generic keyword that appears in all documents; consider creating multiple rules for different document formats |
| Date not recognized correctly | Ensure you selected the "Date" data type; check whether the date format is supported; adjust the data area to capture the complete date |
Next steps
Now that you understand data extraction, a natural next step is the Placeholder System Explained tutorial, which covers how to use extracted values in other tasks.