19 OCR (Text Recognition)

Task: OCR

19.1 Description

The OCR task (Optical Character Recognition) converts scanned documents or image PDFs into searchable PDFs. After processing, text in the PDF can be selected, copied, and searched.

Typical Use Cases

  • Archiving: Make scanned documents searchable
  • Data Extraction: Extract text from scanned invoices for further processing
  • Compliance: Prepare documents for full-text search in document management systems
  • Accessibility: Make PDFs accessible for screen readers

Important: This task creates a new file in the configured target folder. The original file remains unchanged. Further tasks contained in the current profile all refer to the original file. The searchable PDF created by OCR must be further processed with a separate profile that monitors the corresponding output folder if needed.


19.2 General Settings

Enabled

Enable this option so the task is executed for matching PDF files. Disabled tasks are skipped.


19.3 Language

Primary Language

Select the main language of the text to be recognized. Correct language selection significantly improves recognition accuracy.

Available Languages: Over 100 languages, including: - German (German Best) - English (English Best) - French, Spanish, Italian - And many more

Note: Languages not installed can be downloaded via Tools → Install Languages.

Use Secondary Language

Enable this option when documents contain text in two languages (e.g., German-English contracts).

Secondary Language

Select the second language appearing in the document.


19.4 DPI Setting (Resolution)

The DPI setting (Dots Per Inch) affects text recognition quality and output file size.

Optimized

The program automatically selects an optimal DPI setting based on document content.

Based on Images

DPI is determined based on the resolution of images contained in the PDF. This is the default setting.

Custom

You can set a fixed DPI value:

DPI Usage
96-150 Fast processing, lower quality
175-225 Good balance between quality and speed (recommended)
300 High quality, longer processing
600-1200 Maximum quality, only for special requirements

19.5 Image Optimization

Correct Skew (Deskew)

Automatically corrects slightly skewed scanned documents. Improves recognition for non-exactly aligned scans.

Sharpen

Increases image sharpness before text recognition. Helpful for slightly blurry scans.

Increase Contrast

Improves contrast between text and background. Useful for faded or weak text.

Convert PDF Content to Images First

Converts entire PDF content to images before OCR. Enable this option for PDFs containing both text and images with text.

Upscale Low Resolutions

Automatically upscales low-resolution images to improve recognition accuracy.


19.6 Handling Files That Already Contain Text

Specify how to handle PDFs that already contain searchable text:

Option Description
Ignore PDF is not processed (default)
Process Anyway OCR is performed even if text is present
Only Copy to Target Directory PDF is copied to target folder without OCR

Recommendation: Leave setting on “Ignore” unless you have a specific reason for another choice.


19.7 Character Restriction

Restrict Characters

Enable this option to limit recognition to specific characters. This can improve accuracy for specialized documents.

Only Allow Following Characters (Whitelist)

Specify characters that should be recognized. All others are ignored.

Default Characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.,-?!

Exclude Following Characters (Blacklist)

Specify characters that should not be recognized.

Use Case: When recognizing article numbers consisting only of numbers and letters, special characters can be excluded.


19.8 Processing

Multithreading

Enables parallel processing of multiple pages. Significantly speeds up OCR on systems with multiple processor cores.

Recommendation: Keep enabled unless stability issues occur.


19.9 Storage Location

Directory

Specify the target directory for searchable PDFs.

Note: It’s recommended to use a separate folder for each processing step to ensure clear separation.

Filename

Set the name for the output file. You can: - Leave field empty (original name is used) - Enter a fixed name - Use placeholders for dynamic names

Examples:

Input Result
(empty) Scan001.pdf (original name)
<FileName>_OCR Scan001_OCR.pdf
<TodaysYear4>-<TodaysMonth>-<TodaysDay>_<FileName> 2024-12-15_Scan001.pdf

Name Collisions

Choose what should happen if a file with the target name already exists:

Option Description
Overwrite Existing file is replaced
Append number Adds a number
Append date Adds processing date
Append date and time Adds date and time
Cancel operation OCR is not performed

19.10 File Date

Adjust Creation and Modification Date

Optionally, you can change the file date of the output file:

Option Description
Do not change File automatically receives current date
Creation date of original file Uses original creation date
Modification date of original file Uses modification date
PDF creation date Date from PDF metadata
Extracted date A date obtained with an extraction rule
Current date Sets today’s date

19.11 Afterwards

Call External Program

After OCR, an external program can be started automatically.

Program: Path to executable file

Parameters: Command line parameters. Available placeholders: - <PathIncludingFilename> - Full path of OCR file - <ParentDirectory> - Path of parent folder - <Filename> - Filename of OCR file


19.12 Example: Make Scanned Invoices Searchable

Initial Situation

Your scanner creates image PDFs without searchable text. These should be automatically processed with OCR so you can later search for invoice numbers or amounts.

Configuration

  1. Enabled: Yes
  2. Primary Language: English Best
  3. DPI Setting: Based on Images
  4. Correct Skew: Yes
  5. Handling files with text: Only copy to target directory
  6. Directory: D:\Archive\OCR
  7. Filename: <FileName>
  8. On name collision: Append number

Result

Original File OCR File
C:\Scanner\Scan001.pdf (image) D:\Archive\OCR\Scan001.pdf (searchable)

19.13 Example: Process Multilingual Documents

Initial Situation

You process international contracts containing both German and English text.

Configuration

  1. Primary Language: German Best
  2. Use Secondary Language: Yes
  3. Secondary Language: English Best
  4. DPI Setting: 225 (Custom)
  5. Sharpen: Yes

Note

Using two languages increases processing time but significantly improves recognition accuracy for mixed-language documents.


19.7 Tips and Notes

Further Processing of OCR PDF

The created searchable PDF is in the configured target folder. To process it further (e.g., data extraction, renaming, email), create a separate profile that monitors this target folder.

Install Languages

Not all OCR languages are installed by default. Use Tools → Install Languages to download additional languages.

Improve Recognition Quality

If text recognition is unsatisfactory: 1. Increase DPI setting 2. Enable image optimizations (Sharpen, Contrast, Deskew) 3. Ensure correct language is selected 4. For problems with certain characters: Use character restriction

Processing Time

OCR processing is computationally intensive. Factors affecting speed: - Document page count - Selected DPI setting - Enabled image optimizations - Use of two languages - Available processor power

Quality vs. Speed

Priority Recommended Settings
Speed Low DPI (150), no image optimization
Balance DPI 225, Correct skew (default)
Quality DPI 300+, all image optimizations enabled

Already Searchable PDFs

When you choose “Ignore” for files with text, already searchable PDFs are not processed again. This saves time and prevents quality loss from multiple processing.

Password-Protected PDFs

Password-protected PDFs can be processed if the password is stored in the password list (program options) or in the profile and content extraction is allowed.