Programmatically recognize text from scanned PDF or scanned image files

Question: I have a PDF file containing data that we need to import into a database. The files appear to be PDF scans of printed alphanumeric text, apparently 10 pt. Times New Roman. Are there any tools or components that will allow me to recognize and parse this text? Any advice on VeryPDF? Thanks.

Answer: Common PDF tools cannot recognize text in scanned PDF files unless they include an OCR function. VeryPDF offers several tools with an advanced OCR function. To recognize text from scanned files programmatically, try a command line OCR application such as VeryPDF OCR to Any Converter Command Line. With this software you can extract data from a scanned file to text, Word, Excel or other file formats, and then import it into a database. Note that it is currently a command line tool only; no component version is available. More information about the software is on its homepage. The following sections show how to use it.

Step 1. Download OCR to Any Converter Command Line for free

  • The software is currently available under two license types. The Server License (commercial use on one server or PC) lets you call the software from anywhere on that server. The Developer License (royalty-free) lets you integrate the software into your own application and redistribute it royalty-free. Choose the license that fits your needs; either one lets you recognize text from scanned files programmatically.
  • The download is a zip file. Extract it to a folder, where you will find the program files, help documents and other files.

Step 2. Recognize text from scanned PDF and scan files.

  • Input: the software accepts scanned PDF, TIFF and image files (JPEG, JPG, PNG, BMP, GIF, PCX, TGA, PBM, PNM, PPM).
  • Output: it can convert scanned PDF and image files to editable Word, Excel, CSV, HTML, TXT, Pure Text Layer PDF, Invisible Text Layer PDF and other formats.
  • Calling it programmatically: you can invoke the command line from C#, VB.NET, ASP.NET, VB, VC, Delphi, ASP, PHP, JavaScript, VBScript, etc.
  • The parameters this software provides for recognizing text from scanned PDF and other scanned files are:
      -ocr                    : enable OCR function for scanned PDF file
      -lang <string>          : choose the language for OCR engine
      -ocrmode <int>          : set OCR mode
        -ocrmode 0: output to text file
        -ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
        -ocrmode 2: output to plain text based PDF file
        -ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
        -ocrmode 4: output to OCRed PDF file (Color) with hidden text layer
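
As a rough sketch, the parameters above could be assembled into a command line from a script. The Python below is illustrative only: the executable name `ocr2any.exe`, the order of the input and output file arguments, and the `eng` language value used in the usage note are assumptions that this article does not confirm, so please check readme.txt for the actual usage. Only `-ocr`, `-lang` and `-ocrmode` come from the list above.

```python
import subprocess

# OCR modes as described in the parameter list above.
OCR_MODES = {
    0: "output to text file",
    1: "OCR PDF pages and insert new text layer under original PDF pages",
    2: "output to plain text based PDF file",
    3: "output to OCRed PDF file (BW) with hidden text layer",
    4: "output to OCRed PDF file (Color) with hidden text layer",
}


def build_ocr_command(exe, input_file, output_file, ocrmode=0, lang=None):
    """Build the argument list for one OCR run without executing it."""
    if ocrmode not in OCR_MODES:
        raise ValueError(f"unsupported -ocrmode value: {ocrmode!r}")
    cmd = [str(exe), "-ocr", "-ocrmode", str(ocrmode)]
    if lang is not None:
        cmd += ["-lang", str(lang)]
    # ASSUMPTION: input file first, output file second (verify in readme.txt).
    cmd += [str(input_file), str(output_file)]
    return cmd


def run_ocr(exe, input_file, output_file, ocrmode=0, lang=None):
    """Run the command line tool and raise CalledProcessError on failure."""
    cmd = build_ocr_command(exe, input_file, output_file, ocrmode, lang)
    # Passing a list (not a shell string) avoids quoting problems with paths.
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For example, `run_ocr("ocr2any.exe", "scan.pdf", "scan.txt", ocrmode=0, lang="eng")` would ask for a plain text file, assuming the executable is on the PATH and accepts arguments in that order. Keeping the command builder separate from the runner makes it possible to log or inspect the command before running it.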

  • The readme.txt file lists the remaining parameters, which are too many to cover here. The following snapshot shows the result of recognizing text from a scanned file.

[Snapshot: a scanned TIFF file and the output text file]
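
Once a text file like the one in the snapshot exists, the database import from the original question could be sketched as follows. This is a minimal example under stated assumptions, not part of the VeryPDF software: it uses Python's built-in SQLite module, a hypothetical table named `ocr_lines`, and stores one row per non-empty line. Real data would usually need further parsing into columns.

```python
import sqlite3


def import_text_lines(conn, source_name, text):
    """Store each non-empty line of OCR text as one row; return the row count."""
    # Hypothetical schema: the table and column names are illustrative only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_lines ("
        "id INTEGER PRIMARY KEY, "
        "source TEXT NOT NULL, "
        "line_no INTEGER NOT NULL, "
        "content TEXT NOT NULL)"
    )
    rows = [
        (source_name, line_no, line.strip())
        for line_no, line in enumerate(text.splitlines(), start=1)
        if line.strip()  # skip blank lines but keep original line numbers
    ]
    # Parameterized inserts keep the OCR text from being interpreted as SQL.
    conn.executemany(
        "INSERT INTO ocr_lines (source, line_no, content) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

A typical call reads the OCR output from disk first, for example `import_text_lines(sqlite3.connect("scans.db"), "scan.txt", open("scan.txt", encoding="utf-8").read())`. The file name and encoding here are also assumptions; OCR output may use a different encoding.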

If you have any questions while using the software, please contact us.
