PDF OCR Batch-processing with ScanSnap

The ScanSnap S1500M (as well as the S1500 and the earlier S510 and S510M models) comes with ABBYY FineReader for OCR support. This is great for creating searchable PDFs, which is especially handy on a Mac with Spotlight. The problem is that in the default setup, each scan is OCRed right after it is scanned, and depending on the age of your machine (my G5 is getting a little long in the tooth) it can take quite a while. When you’re in the process of scanning many hundreds of pages of paper documents, you don’t want to have to wait for the computer to do its OCR; you’d rather feed it all the documents and let it do the OCR while you’re doing something else.

Fortunately, this is possible. After reading all the way through the handbook and the ABBYY online help, I found out that you can scan to PDF only and convert the PDFs with ABBYY FineReader afterwards. Here’s how to do it:

  1. Start ScanSnap Manager and set it up for scan to PDF, but without OCR
  2. Do all your scans and save the files to a folder
  3. Locate the “Scan to Searchable PDF” application in the “ABBYY FineReader for ScanSnap 4.0” folder in your Applications folder. Drag it to the Dock.
  4. Start the application and set the OCR preferences using the “Preferences…” menu option. For best results and smallest file sizes, choose the high quality setting and the same format as used for the scan.
  5. Now drag all your previously scanned PDFs (up to 100 at a time) onto the “Scan to Searchable PDF” Dock icon.
  6. Each file will be processed separately and saved in the source location under a new filename. If the original file was named “filename.pdf”, the processed file will be named “filename processed by FineReader.pdf”.
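
If you have more than 100 scans, or you are not sure which files have already been through OCR, a small script can sort that out before you drag anything. The following is only a rough Python sketch, not something that ships with ScanSnap or ABBYY. It relies on the naming convention from step 6 and the 100-file limit from step 5. The function names are made up for illustration, and `send_batch` assumes that the macOS `open -a` command hands files to the app the same way a drop on the Dock icon does, which I have not verified for FineReader.

```python
from pathlib import Path
import subprocess

SUFFIX = " processed by FineReader"  # naming convention described in step 6
BATCH_LIMIT = 100                    # per-drop limit mentioned in step 5


def pending_pdfs(folder):
    """Return scanned PDFs in `folder` that have no processed counterpart yet."""
    pdfs = sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")
    # Skip files that are themselves FineReader output.
    originals = [p for p in pdfs if not p.stem.endswith(SUFFIX)]
    return [p for p in originals
            if not p.with_name(p.stem + SUFFIX + ".pdf").exists()]


def batches(items, size=BATCH_LIMIT):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def send_batch(batch, app="Scan to Searchable PDF"):
    """Hand one batch to the app via macOS `open -a`.

    Assumption: this behaves like dragging the files onto the Dock icon.
    """
    subprocess.run(["open", "-a", app, *map(str, batch)], check=True)
```

Calling `batches(pending_pdfs("/path/to/scans"))` gives you the groups to process. Nothing in the sketch waits for FineReader to finish, so send one batch, let it complete, and then send the next.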

Regarding the OCR preferences: I scan old account statements in B&W, for instance, so I choose B&W for the format. Medium quality results in dithered text output (probably because it uses JPEG compression). It seems that OCR performed from within ScanSnap also uses high quality and the same format as the scan, because I get very similar results and file sizes using this approach.
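
To check the file sizes for your own settings, you can compare each original with its processed counterpart. This is a minimal Python sketch that only assumes the “processed by FineReader” naming described above. The function name `size_report` is hypothetical, and the script merely reads file sizes from disk; it knows nothing about the quality setting that produced them.

```python
from pathlib import Path

SUFFIX = " processed by FineReader"  # naming convention used by the OCR step


def size_report(folder):
    """Return (name, original_bytes, processed_bytes, ratio) per finished pair."""
    rows = []
    for original in sorted(Path(folder).glob("*.pdf")):
        if original.stem.endswith(SUFFIX):
            continue  # this file is OCR output, not an original scan
        processed = original.with_name(original.stem + SUFFIX + ".pdf")
        if not processed.exists():
            continue  # not processed yet
        before = original.stat().st_size
        after = processed.stat().st_size
        ratio = after / before if before else None
        rows.append((original.name, before, after, ratio))
    return rows
```

A ratio well above 1 for a B&W scan would suggest that the chosen format or quality setting does not match the one used for the scan.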

Performing OCR on many documents will take a while, so it’s best run overnight. You could spend the time shredding the scanned documents or scanning more documents. Since ScanSnap Manager does not use ABBYY while scanning, both processes can run in parallel. It would be nice if ScanSnap Manager had this batch processing functionality out of the box, but the workaround presented here works well. Last but not least, I should mention that the same approach also works with the “Scan to Word” or “Scan to Excel” functionality of ABBYY, but I don’t have much use for this feature.
