[VeryPDF Release Notes] OCR Scanned PDF or TIFF into plain text PDF file

VeryPDF has released a new version of "OCR to Any Converter Command Line" today. The new version adds the ability to convert scanned PDF files to plain text based PDF files. The options included in the "OCR to Any Converter Command Line" software are listed below.

http://www.verypdf.com/app/ocr-to-any-converter-cmd/try-and-buy.html

VeryPDF OCR to Any Converter Command Line v5.0
Web: http://www.verypdf.com
Web: http://www.verydoc.com
Email: support@verypdf.com
Release Date: Nov 22 2015
-------------------------------------------------------
Description:
  1. Convert text based PDF files to plain text files.
  2. Convert scanned PDF files and image files to plain text files and searchable PDF files by OCR technology.
  3. Convert embedded fonts in PDF file to a new searchable PDF file.
  4. Keep color when converting PDF, TIFF and image formats to searchable PDF files.
  5. Deskew, Despeckle and Noise Removal, Auto-Orientation, Dithering, Black Border Removal.
  6. Use Enhanced OCR Technology to convert Scanned PDF, TIFF and image files to RTF, DOC, TXT, CSV, Excel, HTML formats.
  7. Create MS Excel document in several layouts.
  8. PDF to Excel Converter: Convert tables from PDF and image files to Microsoft Excel spreadsheets.
  9. PDF to HTML Converter: Convert your PDFs to high quality reflowed HTML while preserving styles, tables, etc.
10. Table Recovery: Superior reconstruction of bordered and borderless tables as table objects, with formatting, in Word & HTML.
Input formats:
  1. Text based PDF files
  2. Scanned PDF files
  3. Scanned single page and multi-page TIFF files
  4. Scanned JPEG, PNG, BMP, GIF, PCX, TGA, PBM, PNM, PPM files
Output formats:
  1. Plain text files without layout
  2. Plain text files with layout
  3. Plain text based PDF files (PDF contains text only)
  4. Attach OCRed text layer to original PDF file
  5. OCRed BW PDF files with hidden text layer
  6. OCRed Color PDF files with hidden text layer
  7. OCRed Grayscale PDF files with hidden text layer
  8. Output to TIFF, PNG, BMP, TGA, GIF with Deskew, Despeckle, etc. options
  9. Scanned PDF, TIFF and image files to RTF format
10. Scanned PDF, TIFF and image files to DOC format
11. Scanned PDF, TIFF and image files to Tab Text format
12. Scanned PDF, TIFF and image files to CSV format
13. Scanned PDF, TIFF and image files to MS Excel format
14. Scanned PDF, TIFF and image files to HTML format
15. Extract X1, Y1, X2, Y2 coordinates for each character
16. Extract X1, Y1, X2, Y2 coordinates for each word
-------------------------------------------------------
Usage: ocr2any.exe [options] <PDF-file> <Text-file>
  -firstpage <int>        : first PDF page to convert
  -lastpage <int>         : last PDF page to convert
  -res <int>              : set resolution, the unit is DPI (default is 300 dpi)
  -ownerpwd <string>      : set owner password for encrypted PDF file
  -userpwd <string>       : set user password for encrypted PDF file
  -layout                 : maintain original physical layout
  -layout2                : pdf to table conversion with Best Column Alignment
  -table                  : same as -layout2
  -pdf2table              : same as -layout2
  -noc                    : don't insert page breaks (0x0C) between pages in text file
  -bitcount <int>         : set color depth when rendering PDF pages to image data; can be 1, 8 or 24 (default is 8)
  -rotate <int>           : rotate pages before OCR
  -threshold <int>        : lightness threshold used to convert image to B&W, from 1 to 255, 0 is auto, default is -1
  -imageopt               : deskew and despeckle images automatically
  -dither <int>           : convert the color image to B&W using the desired method:
    -dither 0: Floyd-Steinberg
    -dither 1: Ordered-Dithering (4x4)
    -dither 2: Burkes
    -dither 3: Stucki
    -dither 4: Jarvis-Judice-Ninke
    -dither 5: Sierra
    -dither 6: Stevenson-Arce
    -dither 7: Bayer (4x4 ordered dithering)
  -resizewidth <int>      : resize the image's width, only available when -resizeheight is used
  -resizeheight <int>     : resize the image's height, only available when -resizewidth is used
  -flip                   : flip the image vertically
  -mirror                 : mirror the image horizontally
  -ocr                    : enable OCR function for scanned PDF file
  -lang <string>          : choose the language for OCR engine
  -ocrmode <int>          : set OCR mode
    -ocrmode 0: output to text file
    -ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
    -ocrmode 2: output to plain text based PDF file
    -ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
    -ocrmode 4: output to OCRed PDF file (Color) with hidden text layer
  -text <string>          : add additional text at end of each text page, this parameter supports the following variables:
    %PageNumber%: current page number
    %PageCount% : total page count of PDF file
  -outboxfile             : output [X, Y, Width, Height] information for each word during OCR
  -producer <string>      : Set 'producer' to output PDF file
  -creator <string>       : Set 'creator' to output PDF file
  -subject <string>       : Set 'subject' to output PDF file
  -title <string>         : Set 'title' to output PDF file
  -author <string>        : Set 'author' to output PDF file
  -keywords <string>      : Set 'keywords' to output PDF file
  -ownerpwdout <string>   : Set 'owner password' to output PDF file
  -openpwdout <string>    : Set 'open password' to output PDF file
  -keylen <int>           : Key length (40 or 128 bit)
        -keylen 0:  40 bit RC4 encryption (Acrobat 3 or higher)
        -keylen 1: 128 bit RC4 encryption (Acrobat 5 or higher)
        -keylen 2: 128 bit RC4 encryption (Acrobat 6 or higher)
  -encryption <int>       : Restrictions
        -encryption    0: Encrypt the file only
        -encryption 3900: Deny anything
        -encryption    4: Deny printing
        -encryption    8: Deny modification of contents
        -encryption   16: Deny copying of contents
        -encryption   32: No commenting
        ===128 bit encryption only -> ignored if 40 bit encryption is used
        -encryption  256: Deny FillInFormFields
        -encryption  512: Deny ExtractObj
        -encryption 1024: Deny Assemble
        -encryption 2048: Disable high res. printing
        -encryption 4096: Do not encrypt metadata
  -ocr2                   : use enhanced OCR module to convert scanned PDF and image files to PDF, RTF, DOC, TXT, XLS, CSV, Excel, HTML files
  -ocr2aor                : detect page direction and rotate it automatically when -ocr2 used
  -ocr2autorotate         : same as -ocr2aor
  -ocr2excelmode <int>    : set output Excel format when -ocr2 used
    -ocr2excelmode 0: One big sheet + All page sheets
    -ocr2excelmode 1: All page sheets
    -ocr2excelmode 2: One big sheet, default mode
  -dumpcharpos            : Output to a Text file with coordinates for each character
  -dumpwordpos            : Output to a Text file with coordinates for each word
  -$ <string>             : input your License Key
Examples:
  ocr2any.exe C:\in.pdf C:\out.txt
  ocr2any.exe -firstpage 1 -lastpage 1 C:\in.pdf C:\out.txt
  ocr2any.exe -ocr -res 300 C:\in.pdf C:\out.txt
  ocr2any.exe -ownerpwd 123 -userpwd 456 C:\in.pdf C:\out.txt
  ocr2any.exe -layout C:\in.pdf C:\out.txt
  ocr2any.exe -layout2 C:\in.pdf C:\out.txt
  ocr2any.exe -table C:\in.pdf C:\out.txt
  ocr2any.exe -pdf2table C:\in.pdf C:\out.txt
  ocr2any.exe -noc C:\in.pdf C:\out.txt
  ocr2any.exe C:\in.tif C:\out.txt
  ocr2any.exe C:\in.jpg C:\out.txt
  ocr2any.exe C:\in.bmp C:\out.txt
  ocr2any.exe C:\in.png C:\out.txt
  ocr2any.exe -ocr -lang eng C:\in.pdf C:\out.txt
  ocr2any.exe -ocr -bitcount 1 C:\in.pdf C:\out.txt
  ocr2any.exe -ocr -bitcount 8 C:\in.pdf C:\out.txt
  ocr2any.exe -ocr -bitcount 24 C:\in.pdf C:\out.txt
  ocr2any.exe -ocr -lang deu C:\in.pdf C:\out.txt
  ocr2any.exe -lang deu C:\in.tif C:\out.txt
  ocr2any.exe -text "PageText %PageNumber% of %PageCount%" C:\in.pdf C:\out.txt
  ocr2any.exe -subject "subject" C:\in.pdf C:\out.pdf
  ocr2any.exe -ownerpwdout 123 -keylen 2 -encryption 3900 C:\in.pdf C:\out.pdf
  ocr2any.exe -subject "subject" -title "title" C:\in.pdf C:\out.pdf
  ocr2any.exe -ocr -lang eng -ocrmode 0 C:\in.pdf C:\out.txt
  ocr2any.exe -ocr -lang deu -ocrmode 1 C:\in.pdf C:\out.pdf
  ocr2any.exe -ocr -lang eng -ocrmode 2 C:\in.pdf C:\out.pdf
  ocr2any.exe -ocr -lang eng -ocrmode 3 C:\in.pdf C:\out.pdf
  ocr2any.exe -ocr -lang eng -ocrmode 2 -outboxfile C:\in.pdf C:\out.pdf
  ocr2any.exe -ocr -lang fra -ocrmode 1 C:\in.pdf C:\out.pdf
  ocr2any.exe -ocr -lang ita -ocrmode 1 C:\in.pdf C:\out.pdf
  ocr2any.exe -ocr -lang nld -ocrmode 1 C:\in.pdf C:\out.pdf
  ocr2any.exe -ocr -lang spa -ocrmode 1 C:\in.pdf C:\out.pdf
  ocr2any.exe -bitcount 24 -ocrmode 4 -ocr C:\in.pdf C:\out.pdf
  ocr2any.exe -bitcount 8 -ocrmode 4 -ocr C:\in.pdf C:\out.pdf
  ocr2any.exe -ocrmode 4 -ocr C:\in.tif C:\out.pdf
  ocr2any.exe -ocrmode 3 -threshold 200 -ocr C:\in.tif C:\out.pdf
  ocr2any.exe -ocrmode 4 -rotate 90 -ocr C:\in.tif C:\out.pdf
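
The -encryption values in the usage text look like bit flags: the "Deny anything" value 3900 is exactly the sum of the eight individual deny values (4 + 8 + 16 + 32 + 256 + 512 + 1024 + 2048). The following Python sketch shows how such a value could be assembled. It assumes the tool accepts any combination of these values, which the usage text does not state (it only documents the single values and 3900). The constant and function names are made up for illustration; only the numbers and option names come from the usage text.

```python
# Restriction values copied from the -encryption list in the usage text.
DENY_PRINTING = 4
DENY_MODIFY = 8
DENY_COPY = 16
NO_COMMENTING = 32
# The usage text marks the following values as 128 bit encryption only.
DENY_FILL_FORMS = 256
DENY_EXTRACT = 512
DENY_ASSEMBLE = 1024
DENY_HIGH_RES_PRINT = 2048

ALL_DENY_FLAGS = (
    DENY_PRINTING, DENY_MODIFY, DENY_COPY, NO_COMMENTING,
    DENY_FILL_FORMS, DENY_EXTRACT, DENY_ASSEMBLE, DENY_HIGH_RES_PRINT,
)


def combine(*flags):
    """OR the given restriction values into a single integer."""
    mask = 0
    for flag in flags:
        mask |= flag
    return mask


def encryption_args(owner_password, mask, keylen=2):
    """Build the output-encryption options of an ocr2any.exe command line."""
    return ["-ownerpwdout", owner_password,
            "-keylen", str(keylen),
            "-encryption", str(mask)]


# All deny values together reproduce the documented "Deny anything" value.
print(combine(*ALL_DENY_FLAGS))  # 3900

# Hypothetical combination: deny printing and copying only (4 | 16 = 20).
print(encryption_args("123", combine(DENY_PRINTING, DENY_COPY)))
```

The returned list can be placed between ocr2any.exe and the input and output file names, in the same position as the options in the -ownerpwdout example above.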

Use Enhanced OCR options:
  ocr2any.exe -ocr2 -ocr2aor C:\in.tif C:\out.rtf
  ocr2any.exe -ocr2 -ocr2aor C:\in.tif C:\out.doc
  ocr2any.exe -ocr2 -ocr2aor C:\in.tif C:\out.xls
  ocr2any.exe -ocr2 -ocr2aor C:\in.pdf C:\out.rtf
  ocr2any.exe -ocr2 -ocr2aor C:\in.pdf C:\out.doc
  ocr2any.exe -ocr2 -ocr2excelmode 0 C:\in.pdf C:\out.xls
  ocr2any.exe -ocr2 -ocr2excelmode 1 C:\in.pdf C:\out.xls
  ocr2any.exe -ocr2 -ocr2excelmode 2 C:\in.pdf C:\out.xls
  ocr2any.exe -ocr2 C:\in.pdf C:\out.doc
  ocr2any.exe -ocr2 C:\in.pdf C:\out.rtf
  ocr2any.exe -ocr2 C:\in.png C:\out.xls
  ocr2any.exe -ocr2 C:\in.tif C:\out.csv
  ocr2any.exe -ocr2 C:\in.bmp C:\out.txt
  ocr2any.exe -ocr2 C:\in.gif C:\out.htm
  ocr2any.exe -ocr2 C:\in.pdf C:\out.html
  ocr2any.exe -ocr2 D:\temp\*.pdf D:\temp\*.html
  ocr2any.exe -ocr2 D:\temp\*.pdf D:\temp\*.doc
  ocr2any.exe -ocr2 -lang deu C:\in.pdf C:\out.doc
  ocr2any.exe -ocr2 -lang deu C:\in.pdf C:\out.xls
  ocr2any.exe -ocr2 -dumpcharpos C:\in.pdf C:\out.txt
  ocr2any.exe -ocr2 -dumpwordpos C:\in.pdf C:\out.txt
  ocr2any.exe -ocr2 -dumpcharpos C:\in.pdf C:\out.rtf
  ocr2any.exe -ocr2 -dumpwordpos C:\in.pdf C:\out.rtf
  ocr2any.exe -ocr2 C:\in.pdf C:\text.pdf
  ocr2any.exe -ocr2 C:\in.tif C:\out.pdf
  ocr2any.exe -ocr2 C:\in.png C:\out.pdf
  ocr2any.exe -ocr2 C:\in.jpg C:\out.pdf
  ocr2any.exe -ocr2 C:\in.tif C:\out.doc
  ocr2any.exe -ocr2 C:\in.tif C:\out.rtf
  ocr2any.exe -ocr2 C:\in.tif C:\out.txt
  ocr2any.exe -ocr2 C:\in.tif C:\out.xls
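
In the -ocr2 examples the output format appears to follow the extension of the output file name. The Python sketch below builds such a command line as an argument list and rejects extensions that do not appear in the examples. The function name, its parameters and the extension check are illustrative assumptions, not part of the tool; the sketch only assembles arguments and does not run ocr2any.exe.

```python
from pathlib import PureWindowsPath

# Output extensions that appear in the -ocr2 examples.
OCR2_EXTENSIONS = {".pdf", ".rtf", ".doc", ".txt", ".xls", ".csv", ".htm", ".html"}


def build_ocr2_command(src, dst, lang=None, autorotate=False, excel_mode=None):
    """Return the argument list for one enhanced-OCR conversion."""
    ext = PureWindowsPath(dst).suffix.lower()
    if ext not in OCR2_EXTENSIONS:
        raise ValueError(f"no -ocr2 example uses the {ext!r} extension")

    args = ["ocr2any.exe", "-ocr2"]
    if autorotate:
        args.append("-ocr2aor")
    if lang is not None:
        args += ["-lang", lang]
    if excel_mode is not None:
        # The usage text documents modes 0, 1 and 2 only.
        if excel_mode not in (0, 1, 2):
            raise ValueError("-ocr2excelmode must be 0, 1 or 2")
        # Assumption: the Excel layout option is only meaningful for .xls output.
        if ext != ".xls":
            raise ValueError("-ocr2excelmode is intended for .xls output")
        args += ["-ocr2excelmode", str(excel_mode)]
    return args + [src, dst]


print(build_ocr2_command(r"C:\in.tif", r"C:\out.xls", autorotate=True))
print(build_ocr2_command(r"C:\in.pdf", r"C:\out.doc", lang="deu"))
```

Building a list instead of a single string avoids quoting problems when file names contain spaces.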

Process image files with Deskew, Despeckle and Noise Removal, Black Border Removal options:
  ocr2any.exe -imageopt C:\in.tif C:\out.tif
  ocr2any.exe -imageopt -rotate 45 C:\in.png C:\out.tif
  ocr2any.exe -imageopt -rotate 90 C:\in.png C:\out.tif
  ocr2any.exe -imageopt -threshold 0 C:\in.tif C:\out.bmp
  ocr2any.exe -threshold 240 C:\in.tif C:\out.bmp
  ocr2any.exe -dither 0 C:\in.bmp C:\out.png
  ocr2any.exe -dither 7 C:\in.bmp C:\out.png
  ocr2any.exe -imageopt -resizewidth 800 -resizeheight 600 C:\in.gif C:\out.tga
  ocr2any.exe -imageopt -flip C:\in.png C:\out.gif
  ocr2any.exe -imageopt -mirror C:\in.tif C:\out.pcx
  ocr2any.exe -imageopt C:\in.bmp C:\out.tif
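
To illustrate the difference between the -threshold and -dither options, the following Python sketch converts a grayscale image (a list of rows of 0-255 values) to black and white in two ways: with a fixed lightness cutoff, and with Floyd-Steinberg error diffusion, the method listed as -dither 0. This is a textbook version written for illustration, not the code used inside ocr2any.exe. Treating values at or above the cutoff as white is an assumption.

```python
def threshold_bw(gray, threshold=128):
    """Map each gray value to 0 (black) or 255 (white) using a fixed cutoff."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray]


def floyd_steinberg(gray):
    """Dither a grayscale image to black and white with error diffusion."""
    height = len(gray)
    width = len(gray[0]) if height else 0
    # Working copy that accumulates the diffused quantisation error.
    work = [[float(value) for value in row] for row in gray]
    out = [[0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255 if old >= 128 else 0
            out[y][x] = new
            err = old - new
            # Push the error to the not yet visited neighbours
            # with the classic weights 7/16, 3/16, 5/16 and 1/16.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


# A flat dark-gray 8x8 test image.
flat_gray = [[100] * 8 for _ in range(8)]

# A fixed cutoff turns every pixel black, so the gray level is lost.
print(threshold_bw(flat_gray)[0])

# Dithering keeps the overall lightness by mixing black and white pixels.
for row in floyd_steinberg(flat_gray):
    print("".join("#" if value == 0 else "." for value in row))
```

A fixed threshold tends to suit clean text scans, while dithering preserves shading in photographs at the cost of a dotted pattern.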

The following command line will OCR all PDF files in the D:\temp\ folder to text files:
  for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr -lang deu "%F" "%~dpnF.txt"

The following command line will OCR all PDF files in the D:\temp\ folder and its subdirectories to text files:
  for /r D:\temp %F in (*.pdf) do ocr2any.exe -ocr "%F" "%~dpnF.txt"

The following command line will OCR all PDF files from the D:\temp\ folder and output text files to the C:\test folder:
  for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr "%F" "C:\test\%~nF.txt"
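
The three loops above differ only in how the input files are found and where the output file goes. A minimal Python sketch of the same path logic is shown below. The function names and parameters are made up for illustration, and the sketch only builds argument lists; it does not run ocr2any.exe.

```python
from pathlib import Path


def plan_batch(src_dir, pattern="*.pdf", out_ext=".txt", out_dir=None, recursive=False):
    """Return (input, output) path pairs for every matching file."""
    src_dir = Path(src_dir)
    # rglob also walks subdirectories, like "for /r" in the cmd examples.
    files = src_dir.rglob(pattern) if recursive else src_dir.glob(pattern)
    pairs = []
    for src in sorted(files):
        if not src.is_file():
            continue
        if out_dir is None:
            # Same folder and base name, new extension (like %~dpnF.txt).
            dst = src.with_suffix(out_ext)
        else:
            # Base name only, placed in another folder (like %~nF.txt).
            dst = Path(out_dir) / (src.stem + out_ext)
        pairs.append((src, dst))
    return pairs


def batch_commands(pairs, options=("-ocr",)):
    """Turn path pairs into ocr2any.exe argument lists, one per file."""
    return [["ocr2any.exe", *options, str(src), str(dst)] for src, dst in pairs]


# Hand-built pair, so the demo does not depend on files existing on disk.
demo = [(Path("in.pdf"), Path("in.txt"))]
print(batch_commands(demo, options=("-ocr", "-lang", "deu")))
```

On a machine where ocr2any.exe is on the PATH, each list could be passed to subprocess.run(cmd, check=True) to perform the conversion.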

The following command lines will use Enhanced OCR options:
  for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr2 -lang deu "%F" "%~dpnF.txt"
  for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr2 -lang eng "%F" "%~dpnF.doc"
  for %F in (D:\temp\*.tif) do ocr2any.exe -ocr2 "%F" "%~dpnF.doc"
  for %F in (D:\temp\*.tif) do ocr2any.exe -ocr2 -ocr2autorotate "%F" "%~dpnF.xls"
  for /r D:\temp %F in (*.pdf) do ocr2any.exe -ocr2 "%F" "%~dpnF.rtf"
  for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr2 "%F" "C:\test\%~nF.html"
  ocr2any.exe -ocr2 D:\temp\*.tif D:\temp\*.html
  ocr2any.exe -ocr2 -ocr2excelmode 0 D:\temp\*.pdf D:\temp\*.xls
  ocr2any.exe -ocr2 D:\temp\*.png D:\temp\*.rtf
  ocr2any.exe -ocr2 D:\temp\*.tif D:\temp\*.csv
  ocr2any.exe -ocr2 D:\temp\*.pdf D:\temp\*.doc

Please follow these steps to convert your scanned PDF or TIFF files to a plain text PDF file:

1. This is the original scanned PDF file:

image

2. We run the following command line to OCR this scanned PDF file to a plain text PDF file:

ocr2any.exe -ocr2 test_table_ocr.pdf _ocr_text_test_table_ocr.pdf

The -ocr2 option uses our best and most accurate OCR engine, which has 99.99% OCR accuracy.

This is the generated plain text PDF file,

image

This is the full screenshot. As you can see, the text contents are selectable and copyable:

image

3. If you want to get a text file, you can run the following command line on the generated PDF file:

ocr2any.exe -layout "_ocr_text_test_table_ocr.pdf" "ocr_text1.txt"

You will be able to get a plain text file with perfect layout:

image
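
The two steps above can be chained in a small script. The Python sketch below reuses one variable for the intermediate PDF, so the output name of the OCR step and the input name of the text extraction step cannot drift apart. The function names are made up for illustration. The sketch also assumes that ocr2any.exe reports failure through a non-zero exit code, which is not stated in the usage text.

```python
import subprocess


def two_step_commands(scanned_pdf, text_pdf, text_file):
    """Return the OCR command and the text extraction command, in run order."""
    ocr_step = ["ocr2any.exe", "-ocr2", scanned_pdf, text_pdf]
    extract_step = ["ocr2any.exe", "-layout", text_pdf, text_file]
    return [ocr_step, extract_step]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for cmd in steps:
        runner(cmd, check=True)


steps = two_step_commands(
    "test_table_ocr.pdf",            # original scanned PDF
    "_ocr_text_test_table_ocr.pdf",  # plain text PDF written by step 2
    "ocr_text1.txt",                 # plain text file written by step 3
)
for cmd in steps:
    print(" ".join(cmd))
```

Calling run_steps(steps) on a machine where ocr2any.exe is on the PATH would perform both conversions.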

If you have any questions about this new version of the "OCR to Any Converter Command Line" software, please don't hesitate to contact us through the http://support.verypdf.com ticket system. We are glad to hear from you.
