Understanding Data Extraction

Learn how to extract data from PDF documents

At a Glance

  • Difficulty: Beginner
  • Time required: ~20 minutes
  • Prerequisites: Getting Started tutorial
  • What you'll learn: Keywords, data areas, extraction rules, data types

What is data extraction?

Data extraction allows you to automatically read specific values from PDF documents - such as invoice numbers, dates, customer names, or amounts. These extracted values can then be used as placeholders in file names, folder paths, emails, and other tasks.

Important: Data extraction only works with PDFs that contain searchable text. Scanned documents (image-only PDFs) must first be processed with OCR (text recognition) before data can be extracted.


The basic concept: Keywords and data areas

Data extraction works by finding a keyword in the PDF text and then reading the data area relative to that keyword. Think of it like this:

Example PDF content:

Invoice Number: INV-2024-0042
Invoice Date:   December 15, 2024
Customer:       ACME Corporation
Total Amount:   $1,234.56

To extract the invoice number INV-2024-0042, you would:

  1. Set the keyword to "Invoice Number:"
  2. Configure the data area to read the text after the keyword

The keyword acts as an anchor point - it tells the program where to look. The data area defines exactly which text to capture relative to that anchor.
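The anchor idea can be sketched in a few lines of Python. This is only an illustration on plain text, not how the program itself is implemented; the function name extract_after_keyword and the rule "read up to the end of the line" are assumptions made for the example.

```python
# Illustrative only: mimics the keyword / data area idea on plain text.
SAMPLE_TEXT = """\
Invoice Number: INV-2024-0042
Invoice Date:   December 15, 2024
Customer:       ACME Corporation
Total Amount:   $1,234.56
"""


def extract_after_keyword(text, keyword):
    """Return the text that follows `keyword` on the same line, or None."""
    start = text.find(keyword)
    if start == -1:
        return None  # the anchor was not found

    # Assumed data area: everything after the keyword up to the line break.
    area_start = start + len(keyword)
    area_end = text.find("\n", area_start)
    if area_end == -1:
        area_end = len(text)
    return text[area_start:area_end].strip()


print(extract_after_keyword(SAMPLE_TEXT, "Invoice Number:"))  # INV-2024-0042
print(extract_after_keyword(SAMPLE_TEXT, "Total Amount:"))    # $1,234.56
```

If the keyword is missing, the sketch returns None, which corresponds to a keyword that is not found in the document.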


Step 1: Add sample files

Before creating extraction rules, you need sample PDF files. These files are used to preview and test your extraction configuration without processing actual documents.

  1. Open the profile settings (double-click a profile or click "Edit profile...")
  2. Go to the Example Files category
  3. Click Add... and select 5 or more PDF files similar to those you want to process
  4. Choose files from a separate folder that won't be processed by the profile

Why multiple files? Having several sample files helps ensure your extraction rules work consistently across different documents, not just one specific file.


Step 2: Open the rule editor

  1. In the profile settings, go to the Data Extraction category
  2. Click Create/Edit Rules... to open the Rule Management window
  3. Click New Rule... to create your first extraction rule

The rule editor shows a preview of your sample PDF on the left and the configuration options on the right. As you configure the rule, the extracted result updates in real time.


Step 3: Configure the keyword

The keyword is the text that identifies where your data is located. Enter a word or phrase that appears consistently in your documents, directly before or near the data you want to extract.

Good keywords:

  • "Invoice Number:" - specific label that appears before the value
  • "Total:" - clear identifier for the amount
  • "Date:" - common label for date fields

Avoid these keywords:

  • "Invoice" - too generic, may appear multiple times
  • "the", "and", "of" - common words that appear everywhere
  • Variable text such as actual values or dates, which change from document to document

In the PDF preview, the keyword is highlighted in red when found. If the keyword appears multiple times, you can specify which occurrence to use.
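To see why a generic keyword is risky and what "which occurrence" means, here is a small Python sketch on plain text. The helper find_occurrence and the sample text are hypothetical and only illustrate the idea; the program handles this through its own settings.

```python
def find_occurrence(text, keyword, occurrence=1):
    """Return the index of the n-th (1-based) occurrence of keyword, or -1."""
    if occurrence < 1:
        raise ValueError("occurrence must be 1 or greater")
    index = -1
    for _ in range(occurrence):
        index = text.find(keyword, index + 1)
        if index == -1:
            return -1  # fewer occurrences than requested
    return index


page_text = (
    "Invoice\n"
    "Invoice Number: INV-2024-0042\n"
    "Thank you for paying this Invoice."
)

# A generic keyword matches several places, a specific one matches once.
print(page_text.count("Invoice"))          # 3
print(page_text.count("Invoice Number:"))  # 1

# Selecting the second occurrence of the generic keyword.
print(find_occurrence(page_text, "Invoice", occurrence=2))  # 8
```

A keyword that matches exactly once per document needs no occurrence setting at all, which is why specific labels are usually the safer choice.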


Step 4: Define the data area

The data area determines which text is captured relative to the keyword. By default, the program captures the text block immediately following the keyword.

Data area options:

Setting | Description | Use when...
Text Block | Captures the adjacent text block after the keyword | Data is directly after keyword with clear separation
First Character | Captures only the first character, then extend manually | Text block includes unwanted adjacent data
Extend Data Area | Add characters before/after or specify fixed length | Need precise control over captured text

In the PDF preview, the data area is highlighted in green. Check that only the desired text is highlighted.
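The difference between the data area options can be pictured with a short Python sketch. It assumes that a "text block" ends at a gap of two or more spaces, which is a simplification chosen for this example and not a statement about how the program detects blocks. All function names are hypothetical.

```python
import re

KEYWORD = "Invoice Number:"


def data_area_start(line, keyword):
    """Index of the first non-space character after the keyword.

    Raises ValueError if the keyword is not in the line.
    """
    pos = line.index(keyword) + len(keyword)
    while pos < len(line) and line[pos] == " ":
        pos += 1
    return pos


def text_block(line, keyword):
    """Assumed 'Text Block' behaviour: read up to a gap of 2+ spaces."""
    start = data_area_start(line, keyword)
    return re.split(r"\s{2,}", line[start:], maxsplit=1)[0]


def first_character(line, keyword):
    """'First Character': capture a single character to extend from."""
    start = data_area_start(line, keyword)
    return line[start:start + 1]


def fixed_length(line, keyword, length):
    """Extended data area with a fixed number of characters."""
    start = data_area_start(line, keyword)
    return line[start:start + length]


clean = "Invoice Number: INV-2024-0042   Page 1 of 2"
crowded = "Invoice Number: INV-2024-0042 Page 1 of 2"

print(text_block(clean, KEYWORD))        # INV-2024-0042
print(text_block(crowded, KEYWORD))      # INV-2024-0042 Page 1 of 2
print(first_character(crowded, KEYWORD)) # I
print(fixed_length(crowded, KEYWORD, 13))  # INV-2024-0042
```

The second line shows the typical case for switching away from the text block: the value is followed too closely by other text, so a fixed length gives the cleaner result.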


Step 5: Choose the data type

The data type determines how the extracted value is processed and what options are available when using it.

Data Type | Description | Example
Text | General text extraction - use for most values | Invoice numbers, names, IDs
Date | Recognizes date formats, allows accessing year/month/day separately | Invoice date, due date, order date
Number | Extracts numeric values, handles different formats | Amounts, quantities, page counts
Query | Returns a value based on whether keywords are found | "Yes" if "Paid" found, "No" otherwise
Query with List | Matches against a list to determine category | Document type, customer name from list

Tip: When extracting dates, always use the Date data type. This allows you to reformat dates (e.g., convert "December 15, 2024" to "2024-12-15") using date placeholders.
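What a data type adds on top of raw text can be illustrated in Python. The sketch below is an analogy only: the function names are hypothetical, the date format string covers just the "December 15, 2024" style, and the number parser assumes US formatting with "$" and "," as in the sample invoice.

```python
from datetime import datetime
from decimal import Decimal


def parse_date(raw, fmt="%B %d, %Y"):
    """Turn text like 'December 15, 2024' into a date object.

    %B expects English month names under Python's default locale.
    """
    return datetime.strptime(raw.strip(), fmt).date()


def parse_number(raw):
    """Turn text like '$1,234.56' into a Decimal (US format assumed)."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


def query(text, keyword, if_found="Yes", if_missing="No"):
    """Return one of two values depending on whether keyword occurs."""
    return if_found if keyword in text else if_missing


invoice_date = parse_date("December 15, 2024")
print(invoice_date.isoformat())       # 2024-12-15
print(invoice_date.year)              # 2024
print(parse_number("$1,234.56"))      # 1234.56
print(query("Status: Paid", "Paid"))  # Yes
print(query("Status: Open", "Paid"))  # No
```

Once the value is a real date rather than plain text, the year, month, and day can be used separately, which is the same benefit the Date data type provides for placeholders.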


Step 6: Give the rule a name

Enter a descriptive name for your extraction rule. This name will appear in placeholder menus throughout the application. Use clear, meaningful names like:

  • InvoiceNumber
  • InvoiceDate
  • CustomerName
  • TotalAmount

Avoid spaces and special characters in rule names for easier use in placeholders.


Step 7: Test with all sample files

After configuring the rule, test it with all your sample files:

  1. Use the file selector at the top of the rule editor to switch between sample files
  2. Verify that the extraction works correctly for each file
  3. Check the preview area to see the extracted value
  4. Adjust the configuration if extraction fails on some files

Click OK to save the rule when you're satisfied with the results.
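The reason for checking every sample file can be shown with a small Python sketch. It runs one hypothetical rule against several sample texts and lists the files where nothing was extracted. The file names, sample texts, and helper functions are invented for the illustration.

```python
def extract_after_keyword(text, keyword):
    """Return the rest of the line after keyword, or None if not found."""
    start = text.find(keyword)
    if start == -1:
        return None
    rest = text[start + len(keyword):]
    return rest.split("\n", 1)[0].strip()


def check_rule(samples, keyword):
    """Map each sample name to the extracted value (None means failure)."""
    return {
        name: extract_after_keyword(text, keyword)
        for name, text in samples.items()
    }


# Hypothetical sample texts; the second one labels the field differently.
samples = {
    "invoice_a.pdf": "Invoice Number: INV-2024-0042\nTotal: $1,234.56",
    "invoice_b.pdf": "Invoice No.: INV-2024-0043\nTotal: $99.00",
}

results = check_rule(samples, "Invoice Number:")
failures = [name for name, value in results.items() if value is None]

print(results["invoice_a.pdf"])  # INV-2024-0042
print(failures)                  # ['invoice_b.pdf']
```

A rule that looks perfect on the first file fails silently on the second because the label differs, which is exactly what switching between sample files in the rule editor is meant to reveal.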


Using extracted data

Once you've created extraction rules, you can use the extracted values as placeholders in various tasks:

  • Rename files: <RuleId:1(InvoiceNumber)>.pdf
  • Create subfolders: <RuleId:2(CustomerName)>\<RuleId:3(InvoiceDate){Year4}>
  • Email subject: Invoice <RuleId:1(InvoiceNumber)> from <RuleId:3(InvoiceDate)>

Learn more about using placeholders in the Placeholder System Explained tutorial.
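To picture what the placeholders resolve to, the following Python sketch substitutes sample values into the templates shown above. It is a rough emulation written for this tutorial, not the program's own logic: the regular expression, the expand function, and the assumption that {Year4} yields a four-digit year are all illustrative.

```python
import re
from datetime import date

# Matches <RuleId:N(Name)> with an optional {Modifier} before the ">".
PLACEHOLDER = re.compile(r"<RuleId:(\d+)\((\w+)\)(?:\{(\w+)\})?>")


def expand(template, values):
    """Replace each placeholder with the value stored under its rule ID."""

    def replace(match):
        rule_id, _name, modifier = match.groups()
        value = values[int(rule_id)]
        if modifier == "Year4":
            # Assumption: Year4 means the four-digit year of a date value.
            return f"{value.year:04d}"
        if modifier is not None:
            raise ValueError(f"unsupported modifier in this sketch: {modifier}")
        return str(value)

    return PLACEHOLDER.sub(replace, template)


# Sample values keyed by rule ID, as if extracted from one invoice.
values = {
    1: "INV-2024-0042",
    2: "ACME Corporation",
    3: date(2024, 12, 15),
}

print(expand("<RuleId:1(InvoiceNumber)>.pdf", values))
print(expand(r"<RuleId:2(CustomerName)>\<RuleId:3(InvoiceDate){Year4}>", values))
```

With the sample values, the first template becomes INV-2024-0042.pdf and the second becomes ACME Corporation\2024. The rule ID selects the value, while the name in parentheses only makes the placeholder readable.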


Result

After completing this tutorial, you understand:

  • How keywords and data areas work together
  • How to create and configure extraction rules
  • Which data type to choose for different values
  • How to test extraction with sample files
  • How extracted values become placeholders for other tasks

Common issues & solutions

Problem: Keyword not found (not highlighted in red)

  • Check that the spelling exactly matches the PDF text
  • Try a shorter or different keyword
  • Verify the PDF contains searchable text (not scanned)

Problem: Wrong text captured (green highlight includes extra text)

  • Switch to "First Character" and use "Extend Data Area"
  • Set a fixed character count
  • Use "Stop at" to define where extraction ends

Problem: Extraction works for some files but not others

  • Check if the keyword appears differently in failing files
  • Use a more generic keyword that appears in all documents
  • Consider creating multiple rules for different document formats

Problem: Date not recognized correctly

  • Ensure you selected the "Date" data type
  • Check if the date format is supported
  • Adjust the data area to capture the complete date
