TextFileWithPosition Word Records Containing Unprintable Characters with PDF Extractor Command Line software

To whom it may concern,

While using your PDF to Any Converter, we found a PDF that contains some spurious data. The data becomes obvious when all text is copied from the PDF in a reader such as Acrobat Reader. The PDF is sensitive, so I cannot provide it as it stands. If deemed necessary, I can attempt to sanitise it before passing it on, but I would prefer to start by explaining the behaviour we are seeing.

Although the PDF contains spurious data, it renders correctly in a PDF reader. When the converter is called, the resulting TextFileWithPosition contains unprintable characters in the word records. Is there a reason these are included in the file? Typically, unprintable characters such as CR and LF are ignored by the application and only readable words appear in the word records. Their presence in the PDF simply means that words are split into separate word records, for example:

[Page #1] *** initial words ***
word: x=48.24..65.24 y=70.87..85.77 base=82.46 fontSize=11.04 rot=0 link=00000000 'This'
word: x=48.24..55.29 y=86.59..101.49 base=98.18 fontSize=11.04 rot=0 link=00000000 'Is'
word: x=48.24..55.56 y=102.31..117.21 base=113.90 fontSize=11.04 rot=0 link=00000000 'A'
word: x=48.24..66.27 y=118.03..132.93 base=129.62 fontSize=11.04 rot=0 link=00000000 'Test'
word: x=224.75..265.75 y=81.42..94.02 base=91.20 fontSize=13.43 rot=0 link=00000000 '

image

Please see a sanitised snippet of the output in question attached. Note the presence of 0A, 0B, 0C and 0D characters.
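For anyone wanting to check their own output for the same symptom, here is a minimal Python sketch that flags word records whose quoted text contains 0A, 0B, 0C or 0D. The record layout (a line starting with `word: ` and the text held between the first and last single quote) is assumed from the sample lines above only, and the function name `find_suspect_records` is hypothetical, not part of any VeryPDF tool.

```python
import re

# Assumption: every record begins with "word: " at the start of a line,
# as in the sample output quoted in this post.
RECORD_START = re.compile(r"^word: ", re.MULTILINE)
CONTROL_CHARS = {"\x0a", "\x0b", "\x0c", "\x0d"}


def find_suspect_records(dump):
    """Return (record_index, text, hex_codes) for records containing 0A-0D."""
    starts = [m.start() for m in RECORD_START.finditer(dump)]
    suspects = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(dump)
        # Drop the line terminator that ends the record itself.
        record = dump[start:end].rstrip("\r\n")
        first = record.find("'")
        last = record.rfind("'")
        if first == -1 or last <= first:
            continue  # no complete quoted text, skip
        text = record[first + 1:last]
        found = sorted({f"{ord(c):02X}" for c in text if c in CONTROL_CHARS})
        if found:
            suspects.append((i, text, found))
    return suspects


# Made-up sample data in the assumed layout.
sample = (
    "word: x=48.24..65.24 y=70.87..85.77 base=82.46 fontSize=11.04 "
    "rot=0 link=00000000 'This'\n"
    "word: x=224.75..265.75 y=81.42..94.02 base=91.20 fontSize=13.43 "
    "rot=0 link=00000000 '\x0b\x0cAB\r'\n"
)
for index, text, codes in find_suspect_records(sample):
    print(index, repr(text), codes)
```

When reading a real file, it should be opened with `newline=""` (or in binary mode and decoded) so that Python does not translate CR characters before the check runs.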

Kind regards,
Customer
-----------------------------------------
VeryPDF PDF to Any Converter,

https://www.verypdf.com/app/pdf-to-any-converter/index.html

VeryPDF PDF Extract Tool Command Line,

https://www.verypdf.com/app/pdf-extract-tool/index.html

Thanks for your message. In general, these special characters are caused by a subset embedded font or a special character set embedded in the PDF file itself. The text looks correct in Adobe Reader because the font data for each character is embedded in the PDF file, but the PDF refers to each character only by its glyph index. It is impossible to get the original Unicode value from the glyph index, and this is the cause of the problem.

Unicode is the real text value of a character in the font data.

Glyph Index is the serial number of the character in the font data: the first character is 0, the second character is 1, the third character is 2, and so on.

Below is a screenshot of the Arial.ttf font:

image
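The difference between Unicode and glyph index can be illustrated with a small Python toy model. This is only a sketch: it does not parse real font data, the names `build_subset` and `extract_text` are hypothetical, and the fallback of emitting raw indices as character codes is an assumption about how an extractor might behave when no mapping table is available, not a description of VeryPDF's implementation.

```python
def build_subset(text):
    """Model a subset font: keep one glyph per distinct character used.

    Returns (indices, shapes). The position of a shape in `shapes`
    is its glyph index, so the first glyph is 0, the second is 1, etc.
    """
    shapes = []
    indices = []
    for ch in text:
        shape = f"<outline of {ch!r}>"  # stands in for the drawing data
        if shape not in shapes:
            shapes.append(shape)
        indices.append(shapes.index(shape))
    return indices, shapes


def extract_text(indices, to_unicode=None):
    """Recover text from glyph indices.

    With a glyph-index-to-Unicode table the real text comes back.
    Without one, this sketch falls back to treating each index as a
    character code (an assumption), which yields control characters.
    """
    if to_unicode is not None:
        return "".join(to_unicode[i] for i in indices)
    return "".join(chr(i) for i in indices)


indices, shapes = build_subset("Test")
print(indices)                       # glyph indices stored in the page content
print(repr(extract_text(indices)))   # no table: unprintable characters
print(extract_text(indices, {0: "T", 1: "e", 2: "s", 3: "t"}))
```

A viewer only needs `shapes` to draw the page, which is why the page looks right on screen, while text extraction needs the extra table that a subset font may not carry.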

If possible, please remove the confidential information from that PDF file, keeping only the "bad" characters, and send us the new PDF file. We will then be able to research this problem quickly.

VeryPDF
