How to batch convert scanned PDF files to searchable PDF files and remove the background color from the newly created (OCRed) PDF files?

I'm looking for a way to convert thousands of PDFs to searchable PDFs. I've used an OCR program, but it can't process a whole folder: you have to go into each subfolder, select the files to convert, and then move on to the next folder.

What is another way to convert a large number of PDFs to searchable PDFs?

I haven't had any suggestions. Surely there must be a way to batch convert PDFs?

Customer
-------------------------------------

This is a good question. VeryPDF has a solution that batch converts all of the PDF files in a folder and its subfolders to searchable PDF files with a single command line, quickly and easily.

In general, the "Image to PDF OCR Converter Command Line" software alone is enough for this work. However, because some PDF files contain color information, we first use PDF to Image Converter Command Line to remove the color and grayscale background from the PDF files, and then use "Image to PDF OCR Converter Command Line" to convert the resulting image files to searchable PDF files.

Please follow the steps below to finish this work.

1. Please download the PDF to Image Converter Command Line software from this web page first:

https://www.verypdf.com/app/pdf-to-image-converter/try-and-buy.html#buy-cmd
https://www.verypdf.com/dl2.php/pdf2image_win.zip

After you download and unzip it to a folder, you can run the following command line to convert a PDF file to a black and white TIFF file and remove the color background from the TIFF file:

pdf2img.exe -$ "XXXXXXXXXXXXXX" -r 300 -threshold 180 "D:\verypdf.pdf" "D:\out.tif"

The "-threshold 180" option removes colors whose value is less than the threshold of 180, so the background color is removed automatically.
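
To make the idea of a threshold concrete, here is a minimal Python sketch of conventional black and white thresholding. It is only an illustration under an assumption: the internal algorithm of pdf2img.exe is not documented here, and the function names `to_gray` and `binarize` are hypothetical. In this sketch, pixels darker than the threshold become black and everything else, including a pale colored background, becomes white.

```python
def to_gray(r, g, b):
    """Approximate 8-bit luminance using the ITU-R BT.601 weights."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def binarize(gray_pixels, threshold=180):
    """Map 8-bit gray values to black (0) or white (255).

    Assumption: values below the threshold are kept as black ink,
    values at or above it are treated as background and turned white.
    """
    return [0 if value < threshold else 255 for value in gray_pixels]


# Dark text, mid-gray text, a pale yellow background, and white paper
samples = [30, 120, to_gray(255, 250, 200), 255]
print(binarize(samples))  # [0, 0, 255, 255]
```

If a scan loses faint text, a higher threshold keeps more pixels as black; a lower one removes more of the background.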

2. Please download the "Image to PDF OCR Converter Command Line" software from this web page:

https://www.verypdf.com/app/image-to-pdf-ocr-converter/try-and-buy.html#buy-ocr-cmd
https://www.verypdf.com/tif2pdf/image2pdf_cmd_ocr_trial.zip

After you download and unzip it to a folder, you can run the following command line to convert and combine the modified TIFF files into a multi-page PDF file, with the OCR option enabled:

img2pdfnew.exe -$ XXXXXXXXXXXXXXXXXX -width 595 -height 842 -ocr 1 -tsocr -tsocrlang eng "D:\out*.tif" "D:\VeryPDF.pdf"
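
The values 595 and 842 match the A4 page size (210 mm x 297 mm) expressed in points at 72 points per inch. The short Python sketch below shows that arithmetic. That the -width and -height options are interpreted in points is an assumption; the helper names are hypothetical.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to points (1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def points_to_pixels(points, dpi):
    """Pixel count needed to cover a length in points at a given resolution."""
    return round(points / POINTS_PER_INCH * dpi)


# A4 is 210 mm x 297 mm
print(round(mm_to_points(210)), round(mm_to_points(297)))  # 595 842

# Approximate pixel size of such a page rendered with "-r 300"
print(points_to_pixels(595, 300), points_to_pixels(842, 300))
```

For a different paper size, the same conversion gives the values to pass instead, for example US Letter (8.5 x 11 inches) is 612 x 792 points.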

With the above two steps, you will be able to remove the background color from a PDF file and create a searchable PDF file.

If you have thousands of PDF files in a folder, you can use the following .bat file to batch convert all of the PDF files in this folder to searchable PDF files on the fly:
-------------------------------------
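ECHO ON
REM Folders for the source PDFs, the searchable output, and temporary TIFF files
set "InputFolder=D:\downloads\pdf"
set "OutputFolder=D:\downloads\pdfocr"
set "TempFolder=D:\test"

if not exist "%OutputFolder%" mkdir "%OutputFolder%"
if not exist "%TempFolder%" mkdir "%TempFolder%"

for %%F in ("%InputFolder%\*.pdf") do (

REM Clear TIFF files left over from an earlier run with the same file name
del /Q "%TempFolder%\%%~nF*.tif" 2>nul

REM Step 1: convert the PDF to black and white TIFF and remove the background
.\pdf2image_win\pdf2img.exe -$ "XXXXXXXXXXXXXX" -r 300 -threshold 180 "%%F" "%TempFolder%\%%~nF.tif"

REM Step 2: OCR the TIFF pages and combine them into one searchable PDF
.\img2pdfnew.exe -$ XXXXXXXXXXXXXXXXXX -width 595 -height 842 -ocr 1 -tsocr -tsocrlang eng "%TempFolder%\%%~nF*.tif" "%OutputFolder%\%%~nF.pdf"

)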
-------------------------------------

The above .bat file works on one folder at a time. If you want to process subfolders automatically, you can use the following .bat file, which writes each result next to its source file with an "-ocr" suffix:
-------------------------------------
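ECHO ON
REM Root folder to scan recursively, and a folder for temporary TIFF files
set "InputFolder=D:\downloads\pdf"
set "TempFolder=D:\test"

if not exist "%TempFolder%" mkdir "%TempFolder%"

for /r "%InputFolder%" %%F in (*.pdf) do (

REM Clear TIFF files left over from an earlier run with the same file name
del /Q "%TempFolder%\%%~nF*.tif" 2>nul

REM Step 1: convert the PDF to black and white TIFF and remove the background
.\pdf2image_win\pdf2img.exe -$ "XXXXXXXXXXXXXX" -r 300 -threshold 180 "%%F" "%TempFolder%\%%~nF.tif"

REM Step 2: OCR the TIFF pages and write name-ocr.pdf next to the source PDF
.\img2pdfnew.exe -$ XXXXXXXXXXXXXXXXXX -width 595 -height 842 -ocr 1 -tsocr -tsocrlang eng "%TempFolder%\%%~nF*.tif" "%%~dpnF-ocr.pdf"

)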
-------------------------------------

Even if you have thousands and thousands of PDF files in a folder and its subfolders, the above .bat script will OCR all of them with one command line.
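
Before running the recursive script on a large tree, it can help to preview which files it would touch. The following Python sketch is a dry run only: it does not call pdf2img.exe or img2pdfnew.exe, it just lists the source PDF, the temporary TIFF, and the "-ocr.pdf" output for each file. The function name `plan_jobs` is hypothetical, and skipping files that already end in "-ocr" is an added assumption so that a second run does not pick up earlier results.

```python
from pathlib import Path


def plan_jobs(input_folder, temp_folder):
    """Return (pdf, tiff, output) path triples for every PDF under input_folder."""
    jobs = []
    for pdf in sorted(Path(input_folder).rglob("*.pdf")):
        if pdf.stem.endswith("-ocr"):
            continue  # assumption: this file was produced by an earlier run
        tiff = Path(temp_folder) / (pdf.stem + ".tif")
        output = pdf.with_name(pdf.stem + "-ocr.pdf")
        jobs.append((pdf, tiff, output))
    return jobs


if __name__ == "__main__":
    # Folder names taken from the .bat file above
    for pdf, tiff, output in plan_jobs(r"D:\downloads\pdf", r"D:\test"):
        print(f"{pdf} -> {tiff} -> {output}")
```

Note that PDF files with the same name in different subfolders share one temporary TIFF name, which is safe only because the files are processed one at a time.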


We have another option that lets you monitor a folder and its subfolders automatically. Once a PDF file is copied into the monitored folder, the monitor application converts it to a searchable PDF file automatically. This is done by the "PHP Folder Watcher" application, which you can download and buy from this web page:

https://veryutils.com/php-folder-watcher

PHP Folder Watcher is a PHP script that monitors folders recursively. It also supports an xcopy function to back up files.

PHP Folder Watcher is a convenient, automated way to monitor folders in the background. It monitors one or more folders on your computer for new files. When a file is added to a monitored folder, it calls an external script or application to process the new file automatically.

PHP Folder Watcher is especially useful for quickly posting scanned documents to a folder. If you save a scanned document to a folder that is being monitored by PHP Folder Watcher, the file posting workflow begins automatically.
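
The general idea of folder monitoring can be sketched in a few lines of Python. This is not PHP Folder Watcher's code or its interface, only a generic polling loop under stated assumptions: the names `find_new_pdfs`, `watch`, and `handle` are hypothetical, and `handle` stands in for whatever script or application processes each new file.

```python
import time
from pathlib import Path


def find_new_pdfs(folder, seen):
    """Return PDFs under folder that are not yet in seen, and record them."""
    new_files = []
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        if pdf not in seen:
            seen.add(pdf)
            new_files.append(pdf)
    return new_files


def watch(folder, handle, interval=5.0):
    """Poll folder forever, calling handle(pdf) once for each new PDF."""
    seen = set()
    while True:
        for pdf in find_new_pdfs(folder, seen):
            handle(pdf)
        time.sleep(interval)
```

If the handler writes its output back into the watched folder, those output files should be filtered out by name, otherwise they would be processed again on the next poll.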

Enjoy!

VeryPDF
