Even though we have entered the era of electronic documents, paper documents have not gone away. When handling paper documents, we spend a lot of time classifying, storing, finding and preserving them. Many years ago, many companies began scanning paper documents to PDF or image files and saving them to disk. However, this leaves another problem: when you need to retrieve the content of a scanned file, you cannot simply copy and paste, because the scan contains only a picture of the text.
To meet this need, VeryPDF developed OCR to Any Converter Command Line, which extracts content from scanned PDF and image files and supports more than 30 OCR languages. In the following part, I will show you how to use this software.
Step 1. Download OCR to Any Converter
- This is a Windows command line application. Once the download finishes, you will have a zip file. Please extract it to a folder, then check readme.txt and find the executable file.
- Run the bat file to see the conversion results and check more examples.
Step 2. OCR a scanned PDF file from the command line and download the language package
- Here is the usage for your reference.
- Usage: ocr2any.exe [options] <PDF-file> <Text-file>
- Here is the list of languages supported by this software, each with its language package. Before processing, please download the OCR language package that matches the language of the content in your PDF or image files.
Bulgarian (bul.zip), Catalan (cat.zip), Czech (ces.zip), Danish (dan.zip), German (deu.zip), Greek (ell.zip), English (eng.zip)
Finnish (fin.zip), French (fra.zip), Hungarian (hun.zip), Indonesian (ind.zip), Italian (ita.zip), Latvian (lav.zip)
Lithuanian (lit.zip), Dutch (nld.zip), Norwegian (nor.zip), Polish (pol.zip), Portuguese (por.zip), Romanian (ron.zip)
Russian (rus.zip), Slovak (slk.zip), Slovenian (slv.zip), Spanish (spa.zip), Serbian (srp.zip), Swedish (swe.zip)
Tagalog (tgl.zip), Turkish (tur.zip), Ukrainian (ukr.zip), Vietnamese (vie.zip)
- When you launch the OCR engine, add the parameter -lang to choose the language for the OCR engine. Here are some examples for your reference.
pdf2txtocr.exe -ocr -lang eng C:\in.pdf C:\out.txt
With the command line above, you can convert an English PDF to a text file. Since English is the default language, you can also omit the -lang parameter.
pdf2txtocr.exe -ocr -lang deu C:\in.pdf C:\out.txt
With the command line above, you can launch the OCR engine to convert a German PDF to a text file.
pdf2txtocr.exe -lang spa C:\in.tif C:\out.txt
With the command line above, you can convert a Spanish TIFF file to a text file.
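If you need to run the same command for many files, the examples above can be generated by a small script. The following Python snippet is only a sketch under some assumptions: the function name build_ocr_command and the LANG_CODES table are made up for illustration, the language codes are taken from the package names in the list above (deu.zip gives deu), and the executable name is a parameter because this article shows both ocr2any.exe and pdf2txtocr.exe. It always adds -ocr, as the first two examples do.

```python
# Hypothetical helper: builds the argument list for one OCR run.
# Codes are assumed to match the language package names (e.g. deu.zip -> "deu").
LANG_CODES = {
    "English": "eng",
    "German": "deu",
    "Spanish": "spa",
    "French": "fra",
}


def build_ocr_command(src, dst, language="English", exe="pdf2txtocr.exe"):
    """Return a command in the form: exe -ocr -lang <code> <input> <output>."""
    if language not in LANG_CODES:
        raise ValueError("no language code known for " + language)
    return [exe, "-ocr", "-lang", LANG_CODES[language], src, dst]


if __name__ == "__main__":
    cmd = build_ocr_command(r"C:\in.pdf", r"C:\out.txt", "German")
    # Print the command so it can be checked before it is run.
    print(" ".join(cmd))
```

On a Windows machine where the executable is in the current folder or on the PATH, the returned list could be passed to subprocess.run(cmd, check=True) from the standard library to start the conversion.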
Now let us check related parameters.
-lang <string> : choose the language for OCR engine
-ocrmode <int> : set OCR mode
-ocrmode 0: output to text file
-ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
-ocrmode 2: output to plain text based PDF file
-ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
-ocrmode 4: output to OCRed PDF file (Color) with hidden text layer
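When the command line is assembled by a script, it can help to check the -ocrmode value before starting the OCR engine. The Python snippet below is a minimal sketch: the names OCR_MODES and ocrmode_args are hypothetical, and the descriptions are copied from the parameter list above. It assumes the option is placed among [options], before the input file, as the usage line suggests.

```python
# Mode numbers and descriptions copied from the parameter list above.
OCR_MODES = {
    0: "output to text file",
    1: "OCR PDF pages and insert new text layer under original PDF pages",
    2: "output to plain text based PDF file",
    3: "output to OCRed PDF file (BW) with hidden text layer",
    4: "output to OCRed PDF file (Color) with hidden text layer",
}


def ocrmode_args(mode):
    """Return ["-ocrmode", "<n>"] for a documented mode, else raise ValueError."""
    if mode not in OCR_MODES:
        raise ValueError("unknown -ocrmode value: " + str(mode))
    return ["-ocrmode", str(mode)]


if __name__ == "__main__":
    # Show the arguments and the meaning of each documented mode.
    for mode, meaning in OCR_MODES.items():
        print(" ".join(ocrmode_args(mode)), ":", meaning)
```

The returned pair can be inserted into the argument list ahead of the input and output file names.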
Now let us check the conversion result in the following snapshot.
All the German characters are preserved correctly. With this method, you can extract German content from a PDF to a text file. If you have any questions while using the software, please contact us.