How to extract columns of text from a PDF file with an OCR command line application?

Question: I need to extract text from PDF files using an application. The problem is that some PDF files contain two columns, and when I extract the text I get a text file in which the columns are merged (i.e. text from both columns appears on the same line). Is there any VeryPDF solution that can OCR a PDF with two columns?

Answer: To extract text from a two-column PDF by OCR, you can try the free trial of VeryPDF OCR to Any Converter Command Line, which can OCR PDF files with multiple columns automatically. The software supports more than 40 OCR languages, so you can extract text from PDF files in many languages. Because it is a command line application, you can also call it from programs written in other languages. Please find more information about this software on its homepage. The following steps show how to use it.

Step 1. Download the free trial of OCR to Any Converter Command Line

  • This is a command line application, so the download is a zip file. Extract it to a folder, and then you can find and run the executable file from an MS-DOS command prompt window.
  • Before you use this software, please read its usage and examples.
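
As a rough sketch of the unpacking step above, the following Python snippet extracts a downloaded zip archive to a folder and searches that folder for the executable. The archive name and target folder in the usage note are hypothetical. Only the name ocr2any.exe comes from this article, and the layout inside the real archive may differ.

```python
import zipfile
from pathlib import Path
from typing import Optional


def extract_and_find_exe(
    zip_path: str,
    target_dir: str,
    exe_name: str = "ocr2any.exe",
) -> Optional[Path]:
    """Extract zip_path into target_dir and return the path of exe_name, if present.

    The folder layout inside the archive is not known here, so the whole
    extracted tree is searched and the first match (in sorted order) is returned.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Unpack the whole archive into the target folder.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)

    # Look for the executable anywhere below the target folder,
    # comparing names case-insensitively as Windows does.
    for candidate in sorted(target.rglob("*")):
        if candidate.is_file() and candidate.name.lower() == exe_name.lower():
            return candidate
    return None
```

For example, extract_and_find_exe(r"C:\Downloads\ocr2any.zip", r"C:\ocr2any") would return the full path of ocr2any.exe if the archive contains it, or None otherwise. Both paths here are placeholders.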

Step 2. Extract text from a PDF with multiple columns

  • The usage of this software is: ocr2any.exe [options] <PDF-file> <Text-file>
  • To extract text from multiple columns, please refer to the following command line templates. You do not need to add any other parameters for the columns; the software processes them automatically.
  • When extracting text from a text based PDF with multiple columns, please refer to the following command lines:
  • ocr2any.exe C:\in.pdf C:\out.txt
    ocr2any.exe -firstpage 1 -lastpage 1 C:\in.pdf C:\out.txt
    With the second command line, you can specify the page range to convert.

  • When processing an image based PDF with multiple columns, please refer to the following command lines:
  • ocr2any.exe -ocr -lang eng C:\in.pdf C:\out.txt
    ocr2any.exe -ocr -bitcount 1 C:\in.pdf C:\out.txt
    ocr2any.exe -ocr -bitcount 8 C:\in.pdf C:\out.txt
    ocr2any.exe -ocr -bitcount 24 C:\in.pdf C:\out.txt
    With the -bitcount command lines above, you can adjust the PDF bit count (1, 8 or 24) and then extract text, whether the PDF has a single column or multiple columns.
    ocr2any.exe -ocr -lang deu C:\in.pdf C:\out.txt
    This command line extracts text from a German-language PDF with multiple columns.
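
Since the tool can be called from other programs, the command lines above can also be assembled in code. The following is a minimal Python sketch, not an official wrapper: the function names are hypothetical, and it uses only the options that appear in this article (-ocr, -lang, -bitcount, -firstpage, -lastpage). Whether the page range options can be combined with -ocr is an assumption; please check the tool's own usage text.

```python
import subprocess
from typing import List, Optional


def build_ocr2any_command(
    pdf_path: str,
    txt_path: str,
    exe: str = "ocr2any.exe",
    ocr: bool = False,
    lang: Optional[str] = None,
    bitcount: Optional[int] = None,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
) -> List[str]:
    """Build an argument list: ocr2any.exe [options] <PDF-file> <Text-file>."""
    # The article only shows these three bit counts, so reject anything else.
    if bitcount is not None and bitcount not in (1, 8, 24):
        raise ValueError("only -bitcount 1, 8 and 24 appear in the article")

    cmd = [exe]
    if ocr:
        cmd.append("-ocr")  # needed for image based PDF files
    if lang is not None:
        cmd += ["-lang", lang]  # e.g. "eng" or "deu", as in the article
    if bitcount is not None:
        cmd += ["-bitcount", str(bitcount)]
    if first_page is not None:
        cmd += ["-firstpage", str(first_page)]
    if last_page is not None:
        cmd += ["-lastpage", str(last_page)]
    cmd += [pdf_path, txt_path]
    return cmd


def run_ocr2any(cmd: List[str]) -> int:
    """Run the command and return its exit code.

    Assumes ocr2any.exe is on PATH or that cmd[0] is its full path.
    """
    return subprocess.run(cmd).returncode
```

For instance, build_ocr2any_command(r"C:\in.pdf", r"C:\out.txt", ocr=True, lang="deu") produces the same arguments as the German example above, and the result can be passed to run_ocr2any on a machine where the tool is installed.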

  • Now let us check the conversion result in the following snapshots.

[Snapshot: input PDF file, a PDF with two columns]

[Snapshot: output text]

  • The software processes one column at a time and outputs each column's text once that column is finished, so text from different columns is not mixed together.
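
To illustrate why reading order matters, here is a toy Python sketch that contrasts row-wise reading (which merges the columns, as described in the question) with column-wise reading. It is only an illustration on made-up data and does not claim to show how the VeryPDF software detects or processes columns internally.

```python
from typing import List, Tuple

# A made-up two-column page: each tuple is one visual row,
# holding the left-column text and the right-column text.
PAGE: List[Tuple[str, str]] = [
    ("Left line 1", "Right line 1"),
    ("Left line 2", "Right line 2"),
]


def read_row_wise(page: List[Tuple[str, str]]) -> List[str]:
    """Naive reading: both columns end up on the same output line."""
    return [f"{left} {right}" for left, right in page]


def read_column_wise(page: List[Tuple[str, str]]) -> List[str]:
    """Column-aware reading: the whole left column first, then the right one."""
    left_column = [left for left, _ in page]
    right_column = [right for _, right in page]
    return left_column + right_column
```

With the sample data, read_row_wise(PAGE) yields merged lines such as "Left line 1 Right line 1", while read_column_wise(PAGE) keeps every line of the left column before the first line of the right column.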

With this OCR application, we can easily extract text from PDF files with multiple columns. If you have any questions while using it, please contact us as soon as possible.
