How to extract text from a scanned multipage TIFF file by command line?

Sometimes we need to extract the text from a scanned file so that its content can be reused. In this article, VeryPDF shows one way to extract text from a scanned image file by command line, using a multipage TIFF file as the example. The software used is PDF to Text OCR Converter Command Line, which can extract content from image-based PDF files and from other files such as TIFF.

Step 1. Download PDF to Text OCR Converter Command Line

  • Only the server version and the developer version are listed on our website. If you are an ordinary user working on a laptop or desktop computer, please use the server version.
  • Once the download finishes, extract the zip file and open an MS-DOS command prompt window. You can then run the conversion from there.

Step 2. Extract text from a multipage TIFF file by command line

  • When you use this software, please refer to its usage and examples.
  • Here is the usage for your reference:  pdf2txtocr.exe [options] <PDF-file> <Text-file>
  • To extract text from a TIFF file, use the following command line templates. You can convert a scanned TIFF file either to a text file or to a text-based (searchable) PDF file.
  • pdf2txtocr.exe C:\in.tif C:\out.txt
    This command extracts the content of the scanned TIFF file directly to a text file.
  • pdf2txtocr.exe -ocrmode 3 -threshold 200 -ocr C:\in.tif C:\out.pdf
    This command converts the TIFF file to a searchable PDF from which text can be copied freely. Here -threshold 200 sets the lightness threshold for the black-and-white conversion, and -ocrmode 3 outputs an OCRed PDF file (BW) with a hidden text layer.
  • pdf2txtocr.exe -ocrmode 4 -rotate 90 -ocr C:\in.tif C:\out.pdf
    This command rotates the pages by 90 degrees before OCR and then converts the TIFF file to a searchable PDF. Here -ocrmode 4 outputs an OCRed PDF file (Color) with a hidden text layer.
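
The three command templates above can also be assembled from a script. The following is a minimal Python sketch and is not part of the VeryPDF package. The helper name build_command is hypothetical, and the sketch assumes pdf2txtocr.exe can be found on the PATH. It appends the -ocr switch only because the templates above use it; the exact effect of that switch is not described in this article.

```python
import subprocess

# Assumption: pdf2txtocr.exe is reachable via PATH; otherwise pass a full path as exe.
EXE = "pdf2txtocr.exe"


def build_command(src, dst, ocrmode=None, threshold=None, rotate=None,
                  ocr=False, exe=EXE):
    """Build the argument list for one pdf2txtocr.exe call (hypothetical helper)."""
    cmd = [exe]
    if ocrmode is not None:
        cmd += ["-ocrmode", str(ocrmode)]
    if threshold is not None:
        cmd += ["-threshold", str(threshold)]
    if rotate is not None:
        cmd += ["-rotate", str(rotate)]
    if ocr:
        # Mirrors the -ocr switch used in the article's PDF examples.
        cmd.append("-ocr")
    cmd += [src, dst]
    return cmd


# The three templates from the article, expressed as argument lists.
to_text = build_command(r"C:\in.tif", r"C:\out.txt")
to_bw_pdf = build_command(r"C:\in.tif", r"C:\out.pdf",
                          ocrmode=3, threshold=200, ocr=True)
to_color_pdf = build_command(r"C:\in.tif", r"C:\out.pdf",
                             ocrmode=4, rotate=90, ocr=True)

if __name__ == "__main__":
    for cmd in (to_text, to_bw_pdf, to_color_pdf):
        # Only prints the command line; swap print(...) for
        # subprocess.run(cmd, check=True) to actually run the converter.
        print(subprocess.list2cmdline(cmd))
```

Passing the arguments as a list rather than one string means paths that contain spaces do not need manual quoting.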

Now let us look at the parameters used in these conversions.

-rotate <int>       : rotate pages before OCR
-threshold <int>    : lightness threshold used to convert the image to B&W
-ocrmode <int>      : set OCR mode
  -ocrmode 0: output to text file
  -ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
  -ocrmode 2: output to plain text based PDF file
  -ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
  -ocrmode 4: output to OCRed PDF file (Color) with hidden text layer
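
The mode list above suggests that only -ocrmode 0 writes a plain text file, while the other modes write a PDF. A small Python sketch of that rule follows. The function name output_path_for and the choice of file extension per mode are assumptions made for illustration, not behavior documented by the tool.

```python
from pathlib import PureWindowsPath

# Mode descriptions copied from the parameter list in this article.
OCR_MODES = {
    0: "output to text file",
    1: "OCR PDF pages and insert new text layer under original PDF pages",
    2: "output to plain text based PDF file",
    3: "output to OCRed PDF file (BW) with hidden text layer",
    4: "output to OCRed PDF file (Color) with hidden text layer",
}


def output_path_for(src, ocrmode):
    """Suggest an output file name next to src (hypothetical helper).

    Assumption: mode 0 produces a .txt file and modes 1-4 produce a .pdf file.
    """
    if ocrmode not in OCR_MODES:
        raise ValueError(f"unknown -ocrmode value: {ocrmode!r}")
    suffix = ".txt" if ocrmode == 0 else ".pdf"
    # PureWindowsPath parses Windows-style paths on any operating system.
    return str(PureWindowsPath(src).with_suffix(suffix))


if __name__ == "__main__":
    for mode, description in OCR_MODES.items():
        print(mode, description, output_path_for(r"C:\in.tif", mode))
```

Rejecting unknown mode numbers before calling the converter makes a typing mistake in a batch script show up immediately.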

There are too many functions to list here. If you need to know more parameters, please check them in readme.txt. The following snapshots show the extraction result. If you have any questions while using the software, please contact us.

[Snapshot: output text from TIFF. This is from the output text file.]

[Snapshot: input TIFF file]


