Cleanup Scans / OCR (Optical Character Recognition)


Indigo PDF can clean up scanned pages and perform Optical Character Recognition (OCR) on PDF documents, with options for text extraction and document enhancement. The steps below explain how to use this feature:

  1. Select PDF for Cleanup and OCR:
    • Use the application’s file selection or upload feature to choose the PDF document that you want to clean up and apply OCR to.
  2. Select Languages:
    • Choose the languages that you want the OCR process to detect within the PDF. Currently, English is supported, and additional languages may be added in the future.
  3. Choose OCR Mode:
    • There are three OCR modes to select from:
      • Ignore Pages with Interactive Text: This mode runs OCR only on pages that are images and skips pages that already contain interactive or selectable text.
      • Force OCR: This mode runs OCR on every page and removes all original text elements.
      • Normal: This mode stops with an error if the PDF already contains text, so OCR is only ever applied to images.
  4. Additional Settings:
    • You have the option to configure additional settings for the OCR process:
      • Produce Text File: Check this option to create a separate text file containing the OCR text alongside the OCR’d PDF.
      • Correct Skewed Pages: Enable this option if some pages in the PDF were scanned at a skewed angle. Those pages are automatically rotated into the correct orientation.
      • Clean Page to Reduce Background Noise: Clean pages before OCR so that background noise is less likely to be picked up as text. The cleaned pages can be used for recognition only, leaving the output unchanged, or they can also be kept in the output.
      • Remove Images After OCR: Remove all images from the document once OCR is complete. This is mainly useful when OCR is one step in a conversion process.
  5. Render Type (Advanced):
    • There are two render types available for advanced users:
      • HOCR (HTML OCR): This format provides the OCR text in an HTML structure, making it accessible for further processing and integration.
      • Sandwich: The Sandwich method combines the original image with an OCR text layer, allowing the text to be selected and copied while retaining the image’s appearance.
  6. Start Cleanup and OCR:
    • After configuring your OCR preferences and additional settings, click the “Start Cleanup” or “Begin OCR” button to initiate the cleanup and OCR process.
  7. Download the Processed PDF:
    • Once the cleanup and OCR process is complete, you will be prompted to download the cleaned and OCR’d PDF document. Select a destination folder on your computer and, if needed, provide a name for the processed PDF.
  8. Confirmation:
    • You will receive confirmation that the PDF has been successfully cleaned up and OCR’d, making the text content searchable and selectable.
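
The three OCR modes in step 3 differ only in which pages they send to OCR. The sketch below models that decision in Python as an illustration. It is not Indigo PDF's own code: the names `OcrMode`, `PdfHasTextError`, and `pages_to_ocr` are hypothetical, and it assumes you already know, for each page, whether it contains selectable text.

```python
from enum import Enum


class OcrMode(Enum):
    # Hypothetical names for the three modes described in step 3.
    IGNORE_TEXT_PAGES = "Ignore Pages with Interactive Text"
    FORCE = "Force OCR"
    NORMAL = "Normal"


class PdfHasTextError(Exception):
    """Raised in Normal mode when the PDF already contains text."""


def pages_to_ocr(page_has_text, mode):
    """Return the zero-based indices of the pages that would be OCR'd.

    page_has_text holds one bool per page: True if the page already
    contains selectable text, False if it is image-only.
    """
    all_pages = list(range(len(page_has_text)))

    if mode is OcrMode.FORCE:
        # Every page is OCR'd, whether or not it already has text.
        return all_pages

    if mode is OcrMode.IGNORE_TEXT_PAGES:
        # Only image-only pages are OCR'd; pages with text are skipped.
        return [i for i, has_text in enumerate(page_has_text) if not has_text]

    # Normal mode: refuse to run if any page already contains text.
    if any(page_has_text):
        raise PdfHasTextError("PDF already contains text")
    return all_pages


# Example: a three-page scan where page 2 already has selectable text.
pages = [False, True, False]
print(pages_to_ocr(pages, OcrMode.IGNORE_TEXT_PAGES))  # [0, 2]
print(pages_to_ocr(pages, OcrMode.FORCE))              # [0, 1, 2]
```

With the same input, Normal mode would raise `PdfHasTextError` instead of returning a page list, which mirrors the error described in step 3.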

By following these steps, you can clean up scans and perform OCR on your PDF documents using Indigo PDF, turning scanned images into searchable, selectable text.