MS-Windows Platform and Mac OS X Platform and Software and Tips & Tricks and Workflow Tobi on 27 Dec 2009 03:23 pm
PDF OCR Batch-processing with ScanSnap
The ScanSnap S1500M (as well as the S1500 and the earlier S510 and S510M models) comes with ABBYY FineReader for OCR support. This is great for creating searchable PDFs, which is especially handy on a Mac with Spotlight. The problem is that in the default setup, each scan is OCRed right after the scan, and depending on the age of your machine (my G5 is getting a little long in the tooth) it can take quite a while. When you’re in the process of scanning many hundreds of pages of paper documents, you don’t want to have to wait for the computer to do its OCR; you’d rather feed it all the documents and let it do the OCR while you’re doing something else.
Fortunately, this is possible. Reading all the way through the handbook as well as the ABBYY online help, I found out that you can scan to PDF only and then convert the PDFs with ABBYY FineReader afterwards. Here’s how to do it:
- Start ScanSnap Manager and set it up for scan to PDF, but without OCR
- Do all your scans and save the files to a folder
- Locate the “Scan to Searchable PDF” application in the “ABBYY FineReader for ScanSnap 4.0” folder in your Applications folder and drag it to the Dock.
- Start the application and set the OCR preferences using the “Preferences…” menu option. For best results and smallest file sizes, choose the high quality setting and the same format as used for the scan.
- Now drag all your previously scanned PDFs (up to 100 at a time) onto the “Scan to Searchable PDF” dock icon.
- Each file will be processed separately and saved in the source location under a new filename. If the original file was named “filename.pdf”, the new processed file will be named “filename processed by FineReader.pdf”.
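If you have more than 100 files, or you lose track of which ones have already been through FineReader, a small script can help with the bookkeeping. What follows is only a sketch of my own, not anything that ships with ScanSnap or ABBYY: it assumes the naming pattern described above (“filename processed by FineReader.pdf”) and the 100-file limit, and the function names `pending_pdfs` and `batches` are made up for illustration. It does not run the OCR itself; it only tells you which files are still waiting and splits them into groups you can drag onto the dock icon.

```python
from pathlib import Path

# Assumptions taken from the steps above, not from any ABBYY documentation:
SUFFIX = " processed by FineReader"  # appended to the name of each processed file
BATCH_LIMIT = 100                    # maximum number of files per drag-and-drop


def pending_pdfs(folder):
    """Return the scanned PDFs in `folder` that have no processed counterpart yet."""
    folder = Path(folder)
    pdfs = sorted(folder.glob("*.pdf"))
    # Skip files that are themselves FineReader output.
    originals = [p for p in pdfs if not p.stem.endswith(SUFFIX)]
    # Keep only originals whose processed version does not exist.
    return [p for p in originals
            if not (folder / f"{p.stem}{SUFFIX}.pdf").exists()]


def batches(items, size=BATCH_LIMIT):
    """Split `items` into consecutive lists of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Calling `batches(pending_pdfs("/path/to/scans"))` (the path is a placeholder) gives you a list of groups, each small enough to drop onto the dock icon in one go. Running it again after an overnight OCR session shows whatever is still unprocessed.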
Regarding the OCR preferences: I scan old account statements in B&W, so I choose B&W for the format. Medium quality results in dithered text output (probably because it uses JPEG compression). It seems that when performing OCR from within ScanSnap, it also selects high quality and the same format as used for the scan, because I get very similar results and file sizes using this approach.
Performing OCR on many documents will take a while, so it’s best run overnight. You could spend the time shredding the scanned documents or scanning more documents. Since ScanSnap Manager does not use ABBYY while scanning, both processes can run in parallel this way. It would be nice if ScanSnap Manager had this batch processing functionality out of the box, but the workaround presented here works well. Last but not least, I should mention that the same approach also works with the “Scan to Word” or “Scan to Excel” functionality of ABBYY, but I don’t have much use for this feature.
on 05 Jan 2010 at 16:02 # Brooks @ DocumentSnap
Hey thanks for this post. I’ve done batch ScanSnap posts using Acrobat but hadn’t seen one using FineReader so this is great. Blogged about it!
on 24 Jan 2010 at 18:22 # Tobi
Update:
The same approach does not work well for color scans. When I scan color at 300 dpi and then do the OCR with the high quality color option (for documents), the file size grows approximately 2-3 fold. For color OCR, there are two options: (1) documents, and (2) photo. Interestingly, picking the high quality photo option for OCR creates high quality OCR even of text at only a fraction of the file size. The medium and low quality text settings result in small file sizes and very poor quality; the medium quality text setting is really quite useless. The medium quality photo setting is similar to medium quality text. ==> My choice is the “high quality photo” setting for color text OCR.
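If you want to check the file size growth on your own scans rather than eyeball it in the Finder, something like the following should do. Again, this is just a sketch under the same assumption about the “processed by FineReader” naming; the function name `size_ratios` is my own invention. It compares each original PDF with its processed counterpart and reports the ratio, so a value of about 2 to 3 would correspond to the growth I saw with the high quality color document setting.

```python
from pathlib import Path

# Assumed naming pattern of the FineReader output files (see the post above).
SUFFIX = " processed by FineReader"


def size_ratios(folder):
    """Map each original PDF name to processed_size / original_size."""
    folder = Path(folder)
    ratios = {}
    for original in sorted(folder.glob("*.pdf")):
        if original.stem.endswith(SUFFIX):
            continue  # this file is FineReader output, not an original scan
        processed = folder / f"{original.stem}{SUFFIX}.pdf"
        original_size = original.stat().st_size
        # Only report files that have been processed; skip empty originals.
        if processed.exists() and original_size > 0:
            ratios[original.name] = processed.stat().st_size / original_size
    return ratios
```

Running it on one folder per OCR setting makes it easy to compare the settings side by side.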