PDF Extract Tool Command Line: problem with special characters in the XML file; extracting X, Y coordinates from a PDF file to a text file

Hello Support Team,

Thank you for responding so quickly. I do not have an Order ID, since I am
trying out the product. I hope this is not a problem.

Attached are 2 pdf files:

test_with_cropbox.pdf - this is a pdf in which I defined a cropbox that differs from the mediabox, for testing purposes. VeryPDF does not produce any xml for this pdf. Just errors unfortunately. The pdf without the cropbox is handled fine btw.

test_special_characters.pdf - this is an MS Word generated pdf, to illustrate the problem with the special xml characters &, <, >, " and ', as well as Unicode characters. The produced PageContents.xml is not well-formed, and I cannot parse it. Hope you can help...

Customer
--------------------------------------------------
Thanks for your message. We suggest you download "VeryPDF PDF Extract Tool Command Line" from the following web page and try it:

https://www.verypdf.com/app/pdf-extract-tool/try-and-buy.html
https://www.verypdf.com/dl2.php/verypdf_pdf_extract_tool.zip

You can use "VeryPDF PDF Extract Tool Command Line" to extract the contents of these PDF files to XML and other files easily, e.g.:

pdfextract.exe -outfolder D:\downloads\out D:\downloads\test_special_characters.pdf

pdfextract.exe -outfolder D:\downloads\out D:\downloads\test_with_cropbox.pdf

In the output folder, please refer to the "TextFileWithPosition.txt" file. This file contains the X, Y coordinates and text contents of each word and each line, and you can parse it easily:

[Page #1] *** line fragments ***
line: x=11.80..99.92 y=8.40..18.00 base=16.00 'Accezz International B.V.'
line: x=11.80..87.62 y=19.55..29.00 base=26.99 'Dijkshoornseweg 209'
line: x=11.80..120.25 y=30.55..40.00 base=37.99 '2614 KC Delft The Netherlands'
line: x=274.19..347.17 y=31.24..42.71 base=40.26 'Caballero Fabriek'
line: x=274.19..344.00 y=87.25..98.72 base=96.27 'The Netherlands'
line: x=11.80..49.69 y=151.60..160.99 base=158.98 'Betreft '
line: x=83.80..85.69 y=151.60..160.99 base=158.98 ' '
line: x=119.80..121.69 y=151.60..160.99 base=158.98 ' '
line: x=155.80..157.69 y=151.60..160.99 base=158.98 ' '
line: x=191.80..229.69 y=151.60..160.99 base=158.98 'Kenmerk '
line: x=263.80..265.69 y=151.60..160.99 base=158.98 ' '
line: x=299.80..301.69 y=151.60..160.99 base=158.98 ' '
line: x=335.80..337.69 y=151.60..160.99 base=158.98 ' '
line: x=371.80..397.99 y=151.60..160.99 base=158.98 'Datum '
line: x=407.80..409.69 y=151.60..160.99 base=158.98 ' '
line: x=443.80..445.69 y=151.60..160.99 base=158.98 ' '
line: x=11.80..47.42 y=162.53..175.21 base=169.98 ' Huisstijl'
line: x=191.74..227.13 y=163.73..175.21 base=172.75 'Huisstijl'
line: x=371.03..406.42 y=163.73..175.21 base=172.75 'Huisstijl'
line: x=12.03..132.87 y=203.54..216.06 base=213.38 'FFProfile Regular 11pt/14pt'
line: x=12.03..120.43 y=231.54..244.05 base=241.38 'Voorbeeld opsomming:'
line: x=12.03..82.89 y=259.53..272.05 base=269.37 '" Opsomming 1'
line: x=12.03..84.72 y=273.54..286.05 base=283.38 '" Opsomming 2'
line: x=12.03..84.38 y=287.54..300.06 base=297.38 '" Opsomming 3'
line: x=12.03..85.59 y=301.55..314.06 base=311.39 '" Opsomming 4'
line: x=12.03..84.51 y=315.55..328.07 base=325.39 '" Opsomming 5'
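
As an illustration of how such a file could be parsed, here is a minimal Python sketch. It assumes every record follows the layout of the sample above (a "[Page #N]" header followed by "line: x=.. y=.. base=.. '...'" records); other versions of the tool may produce a different layout. The names LineFragment and parse_fragments are hypothetical and are not part of the VeryPDF software.

```python
import re
from typing import Iterable, List, NamedTuple


class LineFragment(NamedTuple):
    """One 'line:' record; field meanings are inferred from the sample output."""
    page: int
    x0: float
    x1: float
    y0: float
    y1: float
    base: float
    text: str


# A decimal number such as 11.80 (a leading minus sign is allowed as a precaution).
NUM = r"(-?\d+(?:\.\d+)?)"

# Assumed page header format: "[Page #1] *** line fragments ***"
PAGE_RE = re.compile(r"^\[Page #(\d+)\]")

# Assumed record format: "line: x=A..B y=C..D base=E 'text'"
LINE_RE = re.compile(
    rf"^line:\s+x={NUM}\.\.{NUM}\s+y={NUM}\.\.{NUM}\s+base={NUM}\s+'(.*)'\s*$"
)


def parse_fragments(lines: Iterable[str]) -> List[LineFragment]:
    """Collect all 'line:' records, tagging each with the current page number."""
    fragments = []
    page = 0  # stays 0 until the first "[Page #N]" header is seen
    for raw in lines:
        line = raw.rstrip("\r\n")
        page_match = PAGE_RE.match(line)
        if page_match:
            page = int(page_match.group(1))
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines and anything that does not fit the pattern
        x0, x1, y0, y1, base = (float(v) for v in match.groups()[:5])
        # The greedy '(.*)' keeps apostrophes that occur inside the text itself.
        fragments.append(LineFragment(page, x0, x1, y0, y1, base, match.group(6)))
    return fragments


sample = """[Page #1] *** line fragments ***
line: x=11.80..99.92 y=8.40..18.00 base=16.00 'Accezz International B.V.'
line: x=11.80..49.69 y=151.60..160.99 base=158.98 'Betreft '
"""

for fragment in parse_fragments(sample.splitlines()):
    print(fragment.page, fragment.x0, fragment.y0, repr(fragment.text))
```

To process a real output file, the same function can be given an open file object instead of the sample string; the correct text encoding for the file is not stated here and would need to be checked.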

Please use the TextFileWithPosition.txt file instead of the PageContents.xml file. TextFileWithPosition.txt contains the X, Y and text information for each word, so it is better suited than PageContents.xml for PDF content analysis.

We hope the "VeryPDF PDF Extract Tool Command Line" software will work better for you. You may download it to try.

VeryPDF

See Also:

https://www.verypdf.com/wordpress/201401/command-line-tool-for-extracting-text-coordinates-in-pdf-40235.html

https://www.verypdf.com/app/pdf-extract-tool/user-guide.html

Online Demo:

https://www.verypdf.com/app/pdf-extract-tool/online.html
