Command Line Tool for Extracting Text Coordinates in PDF

Q: We are trying to extract coordinates from PDFs based on a text search string. We would prefer to keep you as our sole vendor for this sort of thing. Do you have a product that would fit the bill?

--------

A: One option is VeryPDF PDF Extract Tool Command Line (for Windows, Linux, or Mac). You can download the utility from the downloads page ( https://www.verypdf.com/app/pdf-extract-tool/index.html ). It can extract text word by word, or as XML output that includes both positioning and styling information.

For example, you can run the following command to extract the text and its position from a PDF file into a text file:

pdfextract.exe -outfolder _test-embedded-fonts test-embedded-fonts.pdf _test-embedded-fonts.pdf.log

You will get a word list like the one below:

word: x=90.02..108.95 y=74.16..85.20 base=82.44 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=111.50..124.47 y=74.16..85.20 base=82.44 fontSize=11.04 rot=0 link=00000000 'for'

word: x=126.86..144.40 y=74.16..85.20 base=82.44 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=146.90..195.13 y=74.16..85.20 base=82.44 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=197.63..224.89 y=74.16..85.20 base=82.44 fontSize=11.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..112.10 y=99.02..110.06 base=108.50 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=117.62..134.18 y=99.02..110.06 base=108.50 fontSize=11.04 rot=0 link=00000000 'for'

word: x=139.58..156.14 y=99.02..110.06 base=108.50 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=161.66..205.70 y=99.02..110.06 base=108.50 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=210.98..244.10 y=99.02..110.06 base=108.50 fontSize=11.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..111.44 y=126.13..136.35 base=134.06 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=114.50..127.43 y=126.13..136.35 base=134.06 fontSize=11.04 rot=0 link=00000000 'for'

word: x=130.46..152.44 y=126.13..136.35 base=134.06 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=155.54..208.73 y=126.13..136.35 base=134.06 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=211.70..242.34 y=126.13..136.35 base=134.06 fontSize=11.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..461.37 y=165.30..174.31 base=172.94 fontSize=8.04 rot=0 link=00000000 'Test for PDF Embedded Fonts. '

word: x=90.02..104.67 y=203.00..215.85 base=213.86 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=106.94..117.30 y=203.00..215.85 base=213.86 fontSize=11.04 rot=0 link=00000000 'for'

word: x=119.30..132.51 y=203.00..215.85 base=213.86 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=134.54..169.76 y=203.00..215.85 base=213.86 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=171.94..192.50 y=203.00..215.85 base=213.86 fontSize=11.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..115.75 y=230.10..242.37 base=239.90 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=118.46..138.43 y=230.10..242.37 base=239.90 fontSize=11.04 rot=0 link=00000000 'for'

word: x=141.14..160.42 y=230.10..242.37 base=239.90 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=163.22..218.72 y=230.10..242.37 base=239.90 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=221.52..256.39 y=230.10..242.37 base=239.90 fontSize=11.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..119.11 y=254.72..282.19 base=274.61 fontSize=18.00 rot=0 link=00000000 'Test'

word: x=123.62..145.15 y=254.72..282.19 base=274.61 fontSize=18.00 rot=0 link=00000000 'for'

word: x=149.54..178.59 y=254.72..282.19 base=274.61 fontSize=18.00 rot=0 link=00000000 'PDF'

word: x=183.02..259.05 y=254.72..282.19 base=274.61 fontSize=18.00 rot=0 link=00000000 'Embedded'

word: x=263.50..305.40 y=254.72..282.19 base=274.61 fontSize=18.00 rot=0 link=00000000 'Fonts.'

word: x=90.02..150.83 y=301.53..330.48 base=324.65 fontSize=26.04 rot=0 link=00000000 'Test'

word: x=157.22..204.40 y=301.53..330.48 base=324.65 fontSize=26.04 rot=0 link=00000000 'for'

word: x=210.77..256.24 y=301.53..330.48 base=324.65 fontSize=26.04 rot=0 link=00000000 'PDF'

word: x=262.73..394.28 y=301.53..330.48 base=324.65 fontSize=26.04 rot=0 link=00000000 'Embedded'

word: x=400.79..483.31 y=301.53..330.48 base=324.65 fontSize=26.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..109.54 y=349.41..361.90 base=359.33 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=112.34..125.72 y=349.41..361.90 base=359.33 fontSize=11.04 rot=0 link=00000000 'for'

word: x=128.54..148.11 y=349.41..361.90 base=359.33 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=150.86..202.41 y=349.41..361.90 base=359.33 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=205.11..234.33 y=349.41..361.90 base=359.33 fontSize=11.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..132.55 y=374.21..404.71 base=401.33 fontSize=27.96 rot=0 link=00000000 'Test'

word: x=139.34..170.33 y=374.21..404.71 base=401.33 fontSize=27.96 rot=0 link=00000000 'for'

word: x=177.14..218.80 y=374.21..404.71 base=401.33 fontSize=27.96 rot=0 link=00000000 'PDF'

word: x=225.77..337.72 y=374.21..404.71 base=401.33 fontSize=27.96 rot=0 link=00000000 'Embedded'

word: x=344.57..407.17 y=374.21..404.71 base=401.33 fontSize=27.96 rot=0 link=00000000 'Fonts.'

word: x=90.02..127.68 y=422.45..445.88 base=441.07 fontSize=20.04 rot=0 link=00000000 'Test'

word: x=132.62..155.92 y=422.45..445.88 base=441.07 fontSize=20.04 rot=0 link=00000000 'for'

word: x=160.94..205.31 y=422.45..445.88 base=441.07 fontSize=20.04 rot=0 link=00000000 'PDF'

word: x=210.29..304.02 y=422.45..445.88 base=441.07 fontSize=20.04 rot=0 link=00000000 'Embedded'

word: x=308.91..362.67 y=422.45..445.88 base=441.07 fontSize=20.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..127.72 y=460.09..498.79 base=491.59 fontSize=36.00 rot=0 link=00000000 'Test'

word: x=133.10..158.43 y=460.09..498.79 base=491.59 fontSize=36.00 rot=0 link=00000000 'for'

word: x=163.82..195.32 y=460.09..498.79 base=491.59 fontSize=36.00 rot=0 link=00000000 'PDF'

word: x=200.66..286.81 y=460.09..498.79 base=491.59 fontSize=36.00 rot=0 link=00000000 'Embedded'

word: x=292.21..342.72 y=460.09..498.79 base=491.59 fontSize=36.00 rot=0 link=00000000 'Fonts.'

word: x=90.02..141.72 y=513.86..541.08 base=534.43 fontSize=24.00 rot=0 link=00000000 'Test'

word: x=149.66..180.14 y=513.86..541.08 base=534.43 fontSize=24.00 rot=0 link=00000000 'for'

word: x=188.06..268.80 y=513.86..541.08 base=534.43 fontSize=24.00 rot=0 link=00000000 'PDF'

word: x=276.77..378.46 y=513.86..541.08 base=534.43 fontSize=24.00 rot=0 link=00000000 'Embedded'

word: x=386.33..464.91 y=513.86..541.08 base=534.43 fontSize=24.00 rot=0 link=00000000 'Fonts.'

word: x=90.02..130.45 y=562.09..596.34 base=585.55 fontSize=27.96 rot=0 link=00000000 'Test'

word: x=134.90..161.29 y=562.09..596.34 base=585.55 fontSize=27.96 rot=0 link=00000000 'for'

word: x=165.74..224.99 y=562.09..596.34 base=585.55 fontSize=27.96 rot=0 link=00000000 'PDF'

word: x=229.49..326.32 y=562.09..596.34 base=585.55 fontSize=27.96 rot=0 link=00000000 'Embedded'

word: x=330.84..391.24 y=562.09..596.34 base=585.55 fontSize=27.96 rot=0 link=00000000 'Fonts.'

word: x=90.02..108.95 y=614.14..625.18 base=622.42 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=111.50..124.47 y=614.14..625.18 base=622.42 fontSize=11.04 rot=0 link=00000000 'for'

word: x=126.86..144.40 y=614.14..625.18 base=622.42 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=146.90..195.13 y=614.14..625.18 base=622.42 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=197.63..224.89 y=614.14..625.18 base=622.42 fontSize=11.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..115.49 y=639.90..650.94 base=648.70 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=117.74..138.84 y=639.90..650.94 base=648.70 fontSize=11.04 rot=0 link=00000000 'for'

word: x=141.02..162.16 y=639.90..650.94 base=648.70 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=164.30..222.23 y=639.90..650.94 base=648.70 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=224.43..260.82 y=639.90..650.94 base=648.70 fontSize=11.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..115.49 y=665.46..676.50 base=674.26 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=117.74..138.84 y=665.46..676.50 base=674.26 fontSize=11.04 rot=0 link=00000000 'for'

word: x=141.02..162.16 y=665.46..676.50 base=674.26 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=164.30..222.23 y=665.46..676.50 base=674.26 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=224.43..260.82 y=665.46..676.50 base=674.26 fontSize=11.04 rot=0 link=00000000 'Fonts.'

word: x=90.02..115.49 y=691.14..702.18 base=699.94 fontSize=11.04 rot=0 link=00000000 'Test'

word: x=117.74..138.84 y=691.14..702.18 base=699.94 fontSize=11.04 rot=0 link=00000000 'for'

word: x=141.02..162.16 y=691.14..702.18 base=699.94 fontSize=11.04 rot=0 link=00000000 'PDF'

word: x=164.30..222.23 y=691.14..702.18 base=699.94 fontSize=11.04 rot=0 link=00000000 'Embedded'

word: x=224.43..260.82 y=691.14..702.18 base=699.94 fontSize=11.04 rot=0 link=00000000 'Fonts.'

 

The text line list is shown below:

line: x=90.02..224.89 y=74.16..85.20 base=82.44 'Test for PDF Embedded Fonts.'

line: x=90.02..244.10 y=99.02..110.06 base=108.50 'Test for PDF Embedded Fonts.'

line: x=90.02..242.34 y=126.13..136.35 base=134.06 'Test for PDF Embedded Fonts.'

line: x=90.02..461.37 y=165.30..174.31 base=172.94 'Test for PDF Embedded Fonts. '

line: x=90.02..192.50 y=203.00..215.85 base=213.86 'Test for PDF Embedded Fonts.'

line: x=90.02..256.39 y=230.10..242.37 base=239.90 'Test for PDF Embedded Fonts.'

line: x=90.02..305.40 y=254.72..282.19 base=274.61 'Test for PDF Embedded Fonts.'

line: x=90.02..483.31 y=301.53..330.48 base=324.65 'Test for PDF Embedded Fonts.'

line: x=90.02..234.33 y=349.41..361.90 base=359.33 'Test for PDF Embedded Fonts.'

line: x=90.02..407.17 y=374.21..404.71 base=401.33 'Test for PDF Embedded Fonts.'

line: x=90.02..362.67 y=422.45..445.88 base=441.07 'Test for PDF Embedded Fonts.'

line: x=90.02..342.72 y=460.09..498.79 base=491.59 'Test for PDF Embedded Fonts.'

line: x=90.02..464.91 y=513.86..541.08 base=534.43 'Test for PDF Embedded Fonts.'

line: x=90.02..391.24 y=562.09..596.34 base=585.55 'Test for PDF Embedded Fonts.'

line: x=90.02..224.89 y=614.14..625.18 base=622.42 'Test for PDF Embedded Fonts.'

line: x=90.02..260.82 y=639.90..650.94 base=648.70 'Test for PDF Embedded Fonts.'

line: x=90.02..260.82 y=665.46..676.50 base=674.26 'Test for PDF Embedded Fonts.'

line: x=90.02..260.82 y=691.14..702.18 base=699.94 'Test for PDF Embedded Fonts.'
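Comparing the two listings, each line's box appears to be the union of the boxes of the words on it. The same idea is useful when a search phrase covers several consecutive words and you need one rectangle for the whole match. The following Python sketch illustrates this; the `Box` tuple layout and the `union_box` name are assumptions made for this example and are not part of the tool.

```python
from typing import Iterable, Tuple

# Hypothetical layout for this sketch: (x0, x1, y0, y1), matching "x=x0..x1 y=y0..y1".
Box = Tuple[float, float, float, float]


def union_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that encloses every box in `boxes`."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one box is required")
    return (
        min(b[0] for b in boxes),  # leftmost x
        max(b[1] for b in boxes),  # rightmost x
        min(b[2] for b in boxes),  # smallest y
        max(b[3] for b in boxes),  # largest y
    )


# The five words of the first line in the word list above.
first_line_words = [
    (90.02, 108.95, 74.16, 85.20),   # 'Test'
    (111.50, 124.47, 74.16, 85.20),  # 'for'
    (126.86, 144.40, 74.16, 85.20),  # 'PDF'
    (146.90, 195.13, 74.16, 85.20),  # 'Embedded'
    (197.63, 224.89, 74.16, 85.20),  # 'Fonts.'
]

print(union_box(first_line_words))
```

For the first line this gives x=90.02..224.89 and y=74.16..85.20, which agrees with the first entry of the line list.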

 

You can write a simple application to parse and analyze text contents easily.
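As a rough illustration, here is a minimal Python sketch of such a parser. It assumes the log contains `word:` and `line:` records formatted exactly like the samples above; the names `Record`, `parse_log`, and `find_text` are made up for this example and are not part of the VeryPDF tool.

```python
import re
from typing import List, NamedTuple


class Record(NamedTuple):
    kind: str   # "word" or "line"
    x0: float
    x1: float
    y0: float
    y1: float
    text: str


NUM = r"(-?\d+(?:\.\d+)?)"

# Assumed record shape, taken from the sample output:
#   word: x=90.02..108.95 y=74.16..85.20 base=82.44 ... 'Test'
#   line: x=90.02..224.89 y=74.16..85.20 base=82.44 'Test for PDF Embedded Fonts.'
RECORD_RE = re.compile(
    r"^(word|line):\s+"
    r"x=" + NUM + r"\.\." + NUM + r"\s+"
    r"y=" + NUM + r"\.\." + NUM + r"\s+"
    r".*?'(.*)'\s*$"
)


def parse_log(text: str) -> List[Record]:
    """Parse every 'word:' or 'line:' record; other lines are skipped."""
    records = []
    for raw in text.splitlines():
        m = RECORD_RE.match(raw.strip())
        if m:
            kind, x0, x1, y0, y1, content = m.groups()
            records.append(
                Record(kind, float(x0), float(x1), float(y0), float(y1), content)
            )
    return records


def find_text(records: List[Record], needle: str, kind: str = "line") -> List[Record]:
    """Return records of the given kind whose text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [r for r in records if r.kind == kind and needle in r.text.lower()]


SAMPLE = """\
word: x=90.02..108.95 y=74.16..85.20 base=82.44 fontSize=11.04 rot=0 link=00000000 'Test'
word: x=126.86..144.40 y=74.16..85.20 base=82.44 fontSize=11.04 rot=0 link=00000000 'PDF'
line: x=90.02..224.89 y=74.16..85.20 base=82.44 'Test for PDF Embedded Fonts.'
"""

for hit in find_text(parse_log(SAMPLE), "embedded fonts"):
    print(hit.x0, hit.x1, hit.y0, hit.y1, repr(hit.text))
```

To search a real document, you would read the log file written by the command above and pass its contents to `parse_log`. Searching `line` records finds phrases, while `kind="word"` gives the box of a single word.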


One Reply to “Command Line Tool for Extracting Text Coordinates in PDF”

  1. Hi,

    I downloaded your pdfparsersdk2.dll and have some questions. It seems the SDK outputs text info as individual words. Is there a way for it to output sentences (with their bounding text info)? Also, is there an example of how to convert a file back to PDF? Thanks.

    Customer
    --------
    You can use our “VeryPDF PDF Extract Tool Command Line” software to extract sentences with bounding information from a PDF file. Please see the following web page for more information:

    http://www.verypdf.com/wordpress/201401/command-line-tool-for-extracting-text-coordinates-in-pdf-40235.html

    “VeryPDF PDF Extract Tool Command Line” can extract bounding text info for each word as well as for each sentence.

    VeryPDF
