Extract Characters and Words with coordinates and positions from scanned PDF and TIFF files

VeryPDF has released a new version of "VeryPDF OCR to Any Converter Command Line" today. The new version can extract characters and words, together with their coordinates, from scanned PDF and TIFF files. You can download it from the following pages to try it:

http://www.verypdf.com/app/ocr-to-any-converter-cmd/try-and-buy.html
http://www.verypdf.com/dl2.php/ocr2any_cmd.zip

After you download it, you can run the following command lines to extract characters and words with their coordinates and positions:

ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos test_multi_columns.tif _test_multi_columns_1.txt

ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos test_multi_columns.tif _test_multi_columns_2.rtf
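If you need to process many scanned files, the command lines above can be driven from a script. The following is only a minimal Python sketch: it uses just the options shown above (-ocr2, -dumpcharpos, -dumpwordpos), while the helper names build_command and convert_folder, the output naming scheme, and the assumption that ocr2any.exe returns a non-zero exit code on failure are all hypothetical and not taken from the product documentation.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(exe: str, source: Path, target: Path,
                  chars: bool = True, words: bool = True) -> List[str]:
    """Build the argument list using only the options shown in this post."""
    cmd = [exe, "-ocr2"]
    if chars:
        cmd.append("-dumpcharpos")
    if words:
        cmd.append("-dumpwordpos")
    cmd += [str(source), str(target)]
    return cmd


def convert_folder(exe: str, folder: str) -> None:
    """Run the converter once per .tif file in a folder (hypothetical helper)."""
    for tif in sorted(Path(folder).glob("*.tif")):
        # Output name mirrors the "_name_1.txt" pattern used in the examples.
        target = tif.with_name("_" + tif.stem + "_1.txt")
        # check=True assumes the tool signals failure via its exit code.
        subprocess.run(build_command(exe, tif, target), check=True)


# Show the command that would be run for the sample file.
print(build_command("ocr2any.exe",
                    Path("test_multi_columns.tif"),
                    Path("_test_multi_columns_1.txt")))
```

Calling convert_folder("ocr2any.exe", ".") would then convert every TIFF file in the current folder, provided ocr2any.exe can be found on the PATH.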

This is the original scanned TIFF file:

[image: test_multi_columns.tif]

You can run the following command line to get a text file with the coordinates of each character:

..\ocr2any.exe -ocr2 -dumpcharpos test_multi_columns.tif _test_multi_columns_1.txt

You will get a text file with the following contents. Each line contains the coordinates of one character:

[0328 0456 0337 0478] t
[0338 0452 0355 0478] h
[0357 0461 0368 0478] r
[0370 0461 0386 0478] o
[0387 0461 0405 0478] u
[0407 0461 0424 0486] g
[0426 0452 0443 0478] h
[0444 0461 0460 0478] o
[0462 0461 0480 0478] u
[0482 0456 0491 0478] t
[0501 0456 0510 0478] t
[0512 0452 0529 0478] h
[0531 0461 0545 0478] e
[0556 0461 0571 0478] c
[0573 0461 0590 0486] y
[0592 0456 0601 0478] t
[0602 0461 0618 0478] o
[0620 0461 0637 0486] p
[0639 0452 0648 0478] l
[0651 0461 0665 0478] a
[0666 0461 0679 0478] s
[0681 0461 0708 0478] m
[0712 0474 0716 0478] .
[0729 0453 0739 0478] I
[0741 0461 0758 0478] n
[0760 0456 0769 0478] t
[0770 0461 0784 0478] e
[0787 0461 0798 0478] r
[0799 0461 0813 0478] e
[0816 0461 0829 0478] s
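Output in this format is easy to load into a program. The following is a minimal Python sketch, not part of the product itself. It assumes that the four numbers are the left, top, right and bottom edges of the bounding box in pixels, which is what the sample values suggest but is not stated in the documentation. The names Box and parse_dump are hypothetical.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int
    text: str


# One record per line: "[0328 0456 0337 0478] t"
LINE_RE = re.compile(r"^\[(\d+) (\d+) (\d+) (\d+)\] (.*)$")


def parse_dump(dump: str) -> List[Box]:
    """Parse the dump text into Box records, skipping lines that do not match."""
    boxes = []
    for line in dump.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        left, top, right, bottom = (int(match.group(i)) for i in range(1, 5))
        boxes.append(Box(left, top, right, bottom, match.group(5)))
    return boxes


sample = """[0328 0456 0337 0478] t
[0338 0452 0355 0478] h
[0357 0461 0368 0478] r"""

boxes = parse_dump(sample)
for box in boxes:
    # Print each character with the width and height of its box.
    print(box.text, box.right - box.left, box.bottom - box.top)
```

The same function also works for the word output, because both files use the same line layout.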

You can run the following command line to get a text file with the coordinates of each word:

..\ocr2any.exe -ocr2 -dumpwordpos test_multi_columns.tif _test_multi_columns_1.txt

You will get a text file with the following contents. Each line contains the coordinates of one word:

[0328 0452 0491 0486] throughout
[0501 0452 0545 0478] the
[0556 0452 0716 0486] cytoplasm.
[0729 0452 0922 0486] Interestingly,
[0935 0452 1018 0486] Golgi
[1028 0452 1187 0486] complexes
[1197 0452 1225 0478] in
[0327 0495 0551 0529] placebo+CC14
[0571 0504 0655 0529] group
[0676 0495 0784 0521] contain
[0804 0495 0884 0521] small
[0904 0495 1079 0529] low-density
[1098 0495 1223 0521] vesicles.
[0329 0539 0412 0573] Golgi
[0426 0539 0584 0573] complexes
[0598 0539 0626 0565] in
[0640 0539 0683 0565] the
[0697 0539 0844 0573] processed
[0855 0539 0987 0565] Morinda
[1002 0539 1130 0573] citrifolia
[1144 0539 1224 0573] prod-
[0327 0584 0495 0609] ucts+CC14
[0518 0592 0603 0617] group
[0628 0583 0736 0609] contain
[0760 0583 0831 0617] large
[0855 0583 0973 0609] vesicles
[0995 0583 1061 0609] with
[1085 0583 1225 0609] increased
[0328 0626 0447 0652] electron
[0458 0626 0571 0660] density,
[0584 0626 0635 0652] and
[0647 0626 0730 0660] Golgi
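As an example of reusing the positions, the words can be grouped back into text lines. The following is a minimal Python sketch under two assumptions that are not confirmed by the documentation: the four numbers are left, top, right and bottom in pixels, and the words are written in reading order. A word joins the current line when its vertical center falls inside that line's top-to-bottom range. The names parse_words and group_into_lines are hypothetical.

```python
import re
from typing import List, Tuple

Word = Tuple[int, int, int, int, str]  # left, top, right, bottom, text

LINE_RE = re.compile(r"^\[(\d+) (\d+) (\d+) (\d+)\] (.*)$")


def parse_words(dump: str) -> List[Word]:
    """Parse "[x1 y1 x2 y2] word" lines into tuples."""
    words = []
    for line in dump.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            words.append((int(m.group(1)), int(m.group(2)),
                          int(m.group(3)), int(m.group(4)), m.group(5)))
    return words


def group_into_lines(words: List[Word]) -> List[List[Word]]:
    """Group words, taken in file order, into lines by vertical position."""
    lines: List[List[Word]] = []
    current: List[Word] = []
    top = bottom = 0
    for word in words:
        _, w_top, _, w_bottom, _ = word
        center = (w_top + w_bottom) / 2
        if current and top <= center <= bottom:
            current.append(word)
            # Widen the line range to cover ascenders and descenders.
            top, bottom = min(top, w_top), max(bottom, w_bottom)
        else:
            if current:
                lines.append(current)
            current = [word]
            top, bottom = w_top, w_bottom
    if current:
        lines.append(current)
    return lines


sample = """[1028 0452 1187 0486] complexes
[1197 0452 1225 0478] in
[0327 0495 0551 0529] placebo+CC14
[0571 0504 0655 0529] group"""

text_lines = [" ".join(w[4] for w in line)
              for line in group_into_lines(parse_words(sample))]
print(text_lines)
```

The sketch keeps the file order instead of sorting by the vertical coordinate, because sorting a multi-column page by height would mix lines from different columns. It does not try to detect column boundaries on its own.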

This is a great feature for analyzing and understanding the text in scanned documents. You can format and reuse the text easily based on its position.

If you have any questions or suggestions about this software, please feel free to let us know:

http://support.verypdf.com/open.php
