Tutorial - Understanding Data Extraction

Automatic PDF Processor - automatically process PDF files

Understanding Data Extraction

Learn how to extract data from PDF documents

At a Glance

Difficulty: Beginner
Time required: ~20 minutes
Prerequisites: Getting Started tutorial
What you'll learn: Keywords, data areas, extraction rules, data types

What is data extraction?

Data extraction allows you to automatically read specific values from PDF documents - such as invoice numbers, dates, customer names, or amounts. These extracted values can then be used as placeholders in file names, folder paths, emails, and other tasks.

Important: Data extraction only works with PDFs that contain searchable text. Scanned documents (image-only PDFs) must first be processed with OCR (text recognition) before data can be extracted.

The basic concept: Keywords and data areas

Data extraction works by finding a keyword in the PDF text and then reading the data area relative to that keyword. Think of it like this:

Example PDF content:

Invoice Number: INV-2024-0042
Invoice Date:   December 15, 2024
Customer:       ACME Corporation
Total Amount:   $1,234.56

To extract the invoice number INV-2024-0042, you would:

Set the keyword to Invoice Number:
Configure the data area to read the text after the keyword

The keyword acts as an anchor point - it tells the program where to look. The data area defines exactly which text to capture relative to that anchor.
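The anchor idea can be pictured with a few lines of Python. This is only a sketch of the concept and not how the program works internally: it assumes the PDF text is already available as a plain string, and the function name extract_after_keyword is made up for this illustration.

```python
# Plain-text stand-in for the example PDF content shown above.
SAMPLE_TEXT = """\
Invoice Number: INV-2024-0042
Invoice Date:   December 15, 2024
Customer:       ACME Corporation
Total Amount:   $1,234.56
"""


def extract_after_keyword(text, keyword):
    """Return the text that follows `keyword` on the same line, or None."""
    start = text.find(keyword)           # the keyword is the anchor point
    if start == -1:
        return None                      # keyword not present in the text
    data_start = start + len(keyword)
    line_end = text.find("\n", data_start)
    if line_end == -1:
        line_end = len(text)
    # In this sketch the "data area" is simply the rest of the line, trimmed.
    return text[data_start:line_end].strip()


print(extract_after_keyword(SAMPLE_TEXT, "Invoice Number:"))  # INV-2024-0042
```

The same call with "Customer:" returns ACME Corporation, and a keyword that does not occur in the text returns None.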

Step 1: Add sample files

Before creating extraction rules, you need sample PDF files. These files are used to preview and test your extraction configuration without processing actual documents.

Open the profile settings (double-click a profile or click "Edit profile...")
Go to the Example Files category
Click Add... and select 5 or more PDF files similar to those you want to process
Choose files from a separate folder that won't be processed by the profile

Why multiple files? Having several sample files helps ensure your extraction rules work consistently across different documents, not just one specific file.

Step 2: Open the rule editor

In the profile settings, go to the Data Extraction category
Click Create/Edit Rules... to open the Rule Management window
Click New Rule... to create your first extraction rule

The rule editor shows a preview of your sample PDF on the left and the configuration options on the right. As you configure the rule, the extracted result updates in real time.

Step 3: Configure the keyword

The keyword is the text that identifies where your data is located. Enter a word or phrase that appears consistently in your documents, directly before or near the data you want to extract.

Good keywords:

Invoice Number: - specific label that appears before the value
Total: - clear identifier for the amount
Date: - common label for date fields

Avoid these keywords:

Invoice - too generic, may appear multiple times
the, and, of - common words that appear everywhere
Variable text like actual values or dates

In the PDF preview, the keyword is highlighted in red when found. If the keyword appears multiple times, you can specify which occurrence to use.
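To see why a generic keyword is risky, it helps to count how often it occurs. The following sketch assumes the document text is a plain string; find_occurrence is a hypothetical helper for this illustration, not a function of the program.

```python
def find_occurrence(text, keyword, occurrence=1):
    """Return the index of the n-th (1-based) occurrence of keyword, or -1."""
    if occurrence < 1:
        raise ValueError("occurrence must be 1 or greater")
    index = -1
    for _ in range(occurrence):
        # Continue searching just past the previous hit.
        index = text.find(keyword, index + 1)
        if index == -1:
            return -1                    # fewer occurrences than requested
    return index


text = (
    "Invoice for ACME\n"
    "Invoice Number: INV-2024-0042\n"
    "Invoice Date: December 15, 2024"
)

print(text.count("Invoice"))             # 3 (ambiguous anchor)
print(text.count("Invoice Number:"))     # 1 (unambiguous anchor)
print(find_occurrence(text, "Invoice", 2))  # 17, the start of the second line
```

A keyword that occurs exactly once needs no occurrence setting at all, which is why specific labels make the most robust anchors.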

Step 4: Define the data area

The data area determines which text is captured relative to the keyword. By default, the program captures the text block immediately following the keyword.

Data area options:

Setting	Description	Use when...
Text Block	Captures the adjacent text block after the keyword	Data is directly after keyword with clear separation
First Character	Captures only the first character; you then extend the data area manually	The text block includes unwanted adjacent data
Extend Data Area	Adds characters before or after, or sets a fixed length	You need precise control over the captured text

In the PDF preview, the data area is highlighted in green. Check that only the desired text is highlighted.
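The difference between capturing a whole text block and capturing a fixed length can be sketched as follows. The sketch makes a loud assumption: a gap of two or more spaces is treated as the end of a text block. The program itself works on the PDF layout, so this is a stand-in, and both function names are invented for the example.

```python
import re

# Hypothetical line in which unrelated text follows the value.
LINE = "Invoice Number: INV-2024-0042   Page 1 of 3"


def _after_keyword(line, keyword):
    """Return everything after the keyword, without leading spaces."""
    _, found, rest = line.partition(keyword)
    if not found:
        raise ValueError(f"keyword {keyword!r} not found")
    return rest.lstrip()


def text_block(line, keyword):
    """Capture the block after the keyword, ending at a gap of 2+ spaces."""
    rest = _after_keyword(line, keyword)
    return re.split(r"\s{2,}", rest, maxsplit=1)[0]


def fixed_length(line, keyword, length):
    """Start at the first character after the keyword and take `length` characters."""
    return _after_keyword(line, keyword)[:length]


print(text_block(LINE, "Invoice Number:"))        # INV-2024-0042
print(fixed_length(LINE, "Invoice Number:", 8))   # INV-2024
```

A fixed length is the more predictable choice when every value has the same number of characters; a block-based capture adapts better to values of varying length.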

Step 5: Choose the data type

The data type determines how the extracted value is processed and what options are available when using it.

Data Type	Description	Example
Text	General text extraction - use for most values	Invoice numbers, names, IDs
Date	Recognizes date formats, allows accessing year/month/day separately	Invoice date, due date, order date
Number	Extracts numeric values, handles different formats	Amounts, quantities, page counts
Query	Returns a value based on whether keywords are found	"Yes" if "Paid" is found, "No" otherwise
Query with List	Matches against a list to determine category	Document type, customer name from list

Tip: When extracting dates, always use the Date data type. This allows you to reformat dates (e.g., convert "December 15, 2024" to "2024-12-15") using date placeholders.
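What a data type adds on top of raw text can be illustrated in Python. The functions below are hypothetical and only mirror the idea behind the Date, Number, and Query types; the formats the program actually supports are not covered here.

```python
from datetime import datetime
from decimal import Decimal


def parse_date(raw, fmt="%B %d, %Y"):
    """Turn text such as 'December 15, 2024' into a date object.

    %B expects English month names in Python's default locale.
    """
    return datetime.strptime(raw.strip(), fmt).date()


def parse_amount(raw):
    """Turn text such as '$1,234.56' into a Decimal (US-style format assumed)."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


def query(text, keyword, if_found="Yes", if_missing="No"):
    """Return one of two values depending on whether the keyword occurs."""
    return if_found if keyword in text else if_missing


invoice_date = parse_date("December 15, 2024")
print(invoice_date.isoformat())        # 2024-12-15
print(invoice_date.year)               # 2024
print(parse_amount("$1,234.56"))       # 1234.56
print(query("Status: Paid", "Paid"))   # Yes
```

Once a value is a real date rather than a piece of text, its year, month, and day can be used separately, which is the point of choosing the Date type.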

Step 6: Give the rule a name

Enter a descriptive name for your extraction rule. This name will appear in placeholder menus throughout the application. Use clear, meaningful names like:

InvoiceNumber
InvoiceDate
CustomerName
TotalAmount

Avoid spaces and special characters in rule names for easier use in placeholders.
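If you keep a list of planned rule names, a quick check like the one below flags names that contain spaces or special characters. The exact set of characters the program accepts is not stated in this tutorial, so the pattern (letters and digits only, starting with a letter) is an assumption chosen to be strict.

```python
import re

# Assumed convention: a letter followed by letters or digits only.
SIMPLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_simple_rule_name(name):
    """Return True if the name has no spaces or special characters."""
    return SIMPLE_NAME.fullmatch(name) is not None


for name in ["InvoiceNumber", "Invoice Number", "Total-Amount"]:
    print(name, is_simple_rule_name(name))
# InvoiceNumber True
# Invoice Number False
# Total-Amount False
```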

Step 7: Test with all sample files

After configuring the rule, test it with all your sample files:

Use the file selector at the top of the rule editor to switch between sample files
Verify that the extraction works correctly for each file
Check the preview area to see the extracted value
Adjust the configuration if extraction fails on some files

Click OK to save the rule when you're satisfied with the results.
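The reason for checking every sample file can be shown with a small sketch. The sample texts below are invented and stand in for text that has already been read from PDF files; check_rule is a hypothetical helper, not part of the program.

```python
def extract_after_keyword(text, keyword):
    """Return the rest of the first line that contains the keyword, or None."""
    for line in text.splitlines():
        _, found, rest = line.partition(keyword)
        if found:
            return rest.strip() or None
    return None


def check_rule(samples, keyword):
    """Apply one keyword to every sample and list the samples that fail."""
    results = {
        name: extract_after_keyword(text, keyword)
        for name, text in samples.items()
    }
    failures = [name for name, value in results.items() if value is None]
    return results, failures


# Invented sample texts; the second one uses a different label.
SAMPLES = {
    "sample1.pdf": "Invoice Number: INV-2024-0042\nTotal: $10.00",
    "sample2.pdf": "Invoice No.: INV-2024-0043\nTotal: $25.00",
}

results, failures = check_rule(SAMPLES, "Invoice Number:")
print(results["sample1.pdf"])   # INV-2024-0042
print(failures)                 # ['sample2.pdf']
```

A rule that looks correct on the first file can still fail on a file with a slightly different label, which only becomes visible when all samples are checked.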

Using extracted data

Once you've created extraction rules, you can use the extracted values as placeholders in various tasks:

Rename files: <RuleId:1(InvoiceNumber)>.pdf
Create subfolders: <RuleId:2(CustomerName)>\<RuleId:3(InvoiceDate){Year4}>
Email subject: Invoice <RuleId:1(InvoiceNumber)> from <RuleId:3(InvoiceDate)>

Learn more about using placeholders in the Placeholder System Explained tutorial.
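To make the placeholder idea concrete, here is a sketch that fills templates written in the syntax shown above. It is an illustration only: the program performs this substitution itself, the expand function is hypothetical, and the reading of {Year4} as a four-digit year is an assumption based on its name.

```python
import re
from datetime import date

# Matches <RuleId:N(Name)> with an optional {Part} before the closing bracket.
PLACEHOLDER = re.compile(r"<RuleId:(\d+)\(([^)]*)\)(?:\{([^}]*)\})?>")


def expand(template, values):
    """Replace each placeholder with the value stored under its rule ID."""

    def replace(match):
        rule_id, _name, part = match.groups()
        value = values[int(rule_id)]
        if part == "Year4":              # assumed meaning: four-digit year
            return f"{value.year:04d}"
        if part is not None:
            raise ValueError(f"format part {part!r} is not handled in this sketch")
        return str(value)

    return PLACEHOLDER.sub(replace, template)


# Invented extraction results, keyed by rule ID.
values = {1: "INV-2024-0042", 2: "ACME Corporation", 3: date(2024, 12, 15)}

print(expand("<RuleId:1(InvoiceNumber)>.pdf", values))
# INV-2024-0042.pdf
print(expand(r"<RuleId:2(CustomerName)>\<RuleId:3(InvoiceDate){Year4}>", values))
# ACME Corporation\2024
```

In this sketch the rule ID selects the value, and the name in parentheses only serves as a readable label.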

Result

After completing this tutorial, you understand:

How keywords and data areas work together
How to create and configure extraction rules
Which data type to choose for different values
How to test extraction with sample files
How extracted values become placeholders for other tasks

Common issues & solutions

Problem	Solution
Keyword not found (not highlighted in red)	Check that the spelling exactly matches the PDF text; try a shorter or different keyword; verify that the PDF contains searchable text (not scanned)
Wrong text captured (green highlight includes extra text)	Switch to "First Character" and use "Extend Data Area"; set a fixed character count; use "Stop at" to define where extraction ends
Extraction works for some files but not others	Check whether the keyword appears differently in the failing files; use a more generic keyword that appears in all documents; consider creating multiple rules for different document formats
Date not recognized correctly	Make sure you selected the "Date" data type; check whether the date format is supported; adjust the data area to capture the complete date

Next steps

Now that you understand data extraction, continue with these tutorials:

Placeholder System Explained - Learn how to use extracted data in file names and paths
Rename PDF files automatically - Apply extraction to rename files
Move PDF files automatically - Organize files into folders using extracted data
