Why does copying and pasting text from a PDF result in garbage?

Question: I am writing my Master's thesis on an NLP system. One of its components is an extractor that pulls plain text out of PDF files. A few PDF files cannot be extracted correctly. For these files the extractor (the PDFBox library) returns strings like these:

"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h" or "10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked every file that causes this extraction problem, and in all of them the text also cannot be copied and pasted from a PDF reader (Adobe Reader and Foxit Reader). The files display correctly in these readers, but when I select the content and copy it to the clipboard I get the same wrong text as described above: strings of meaningless characters, or strings of digits and letters. Can VeryPDF help me figure out this problem?
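For a pipeline that processes many files, it can help to flag this kind of output automatically so the affected PDFs can be routed to another extraction path. The following is only a rough sketch: `looks_garbled` and its threshold are hypothetical, not part of PDFBox, and the heuristic assumes the expected text is ordinary space-separated prose.

```python
import re


def looks_garbled(text: str, min_word_ratio: float = 0.5) -> bool:
    """Heuristically decide whether extracted text is garbage.

    Hypothetical helper for illustration; the 0.5 threshold is an
    assumption and should be tuned on real data.
    """
    stripped = text.strip()
    if not stripped:
        return True

    # Symptom 1: numbers joined by single letters, e.g. "10a61a91a22...".
    if re.fullmatch(r"\d+(?:[A-Za-z]\d+){3,}", stripped):
        return True

    # Symptom 2: too few tokens consist of letters only
    # (after trimming common punctuation from both ends).
    tokens = stripped.split()
    wordlike = sum(
        1 for token in tokens if token.strip(".,;:!?\"'()").isalpha()
    )
    return wordlike / len(tokens) < min_word_ratio
```

Applied to the two sample strings from the question, both symptoms are caught, while a plain English sentence passes. Files flagged this way are candidates for the OCR route described in the answer.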

Answer: From your description, I guess the problem PDF files contain embedded fonts. Open one of the problem files in Adobe Reader and press Ctrl+D to view its document properties. On the Fonts tab you will see the embedded fonts listed. When such a font uses a custom encoding and the PDF carries no mapping from its glyphs back to Unicode characters, copy and paste may appear to work, but the pasted text is garbled. To solve this problem, you can replace all the embedded fonts in the PDF with system fonts using VeryPDF PDF Editor. Or you can OCR the PDF into a normal searchable PDF file with OCR to Any Converter Command Line, after which you can copy and paste without any errors. In the following part, I will show you how to solve the embedded font problem by OCR. Here is an example of a PDF with embedded fonts; please have a look.

[Screenshot: checking the embedded fonts of a PDF]
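Why an embedded font can display correctly and still paste as garbage can be illustrated with a toy model. This is a deliberate simplification, not how any particular reader or PDFBox is implemented: the dictionaries and function names are invented for illustration, and real PDFs use font encodings and optional ToUnicode tables rather than Python dictionaries.

```python
# Toy model: the page stores character codes, and the embedded font
# decides which glyph shape each code draws. The codes are arbitrary.
GLYPH_FOR_CODE = {0x21: "H", 0x22: "e", 0x23: "l", 0x24: "o"}

# Codes as they would be stored in the page content.
PAGE_CODES = [0x21, 0x22, 0x23, 0x23, 0x24]


def render(codes):
    """What the viewer shows: each code is drawn with the font's glyph."""
    return "".join(GLYPH_FOR_CODE[code] for code in codes)


def extract(codes, to_unicode=None):
    """What copy and paste or a text extractor returns.

    With a code-to-Unicode table the text is recovered correctly.
    Without one, this toy extractor falls back to treating each code
    as if it were a character code, which yields garbage.
    """
    if to_unicode is None:
        return "".join(chr(code) for code in codes)
    return "".join(to_unicode[code] for code in codes)


print(render(PAGE_CODES))                    # Hello
print(extract(PAGE_CODES))                   # !"##$
print(extract(PAGE_CODES, GLYPH_FOR_CODE))   # Hello
```

The page looks right on screen because rendering only needs the glyph shapes. Extraction needs the extra mapping back to characters, and when that is missing the only remaining options are replacing the font or recognizing the rendered text by OCR.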

Step 1. Download OCR to Any Converter Command Line for free

  • With this software, you can convert a PDF file that uses embedded fonts into a new searchable PDF file. You can then copy and paste from the PDF file without any errors.
  • When the download finishes, you will have a zip file. Extract it to a folder, where you will find the executable file and its related files.

Step 2. Convert the PDF with embedded fonts to a searchable PDF for copying and pasting.

  • When you use this software, please refer to the usage and examples in readme.txt. Here is the usage for your reference: ocr2any.exe [options] <PDF-file> <Text-file>
  • With the options below, you can convert a PDF file with embedded fonts to a searchable PDF file.
  • -ocrmode <int>          : set OCR mode
        -ocrmode 0: output to text file
        -ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
        -ocrmode 2: output to plain text based PDF file
        -ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
        -ocrmode 4: output to OCRed PDF file (Color) with hidden text layer
    When you call this software, please refer to the following command line templates:
    ocr2any.exe -ocr -lang deu -ocrmode 1 C:\in.pdf C:\out.pdf
    ocr2any.exe -ocr -lang eng -ocrmode 2 C:\in.pdf C:\out.pdf
    ocr2any.exe -ocr -lang eng -ocrmode 3 C:\in.pdf C:\out.pdf
    ocr2any.exe -ocr -lang eng -ocrmode 2 -outboxfile C:\in.pdf C:\out.pdf
    ocr2any.exe -ocr -lang fra -ocrmode 1 C:\in.pdf C:\out.pdf
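When many files need converting, the command line templates above can be generated from a script. The sketch below is only one way to do it: it uses just the `-ocr`, `-lang` and `-ocrmode` options shown above, the function names are hypothetical, and it assumes `ocr2any.exe` can be found on the PATH or is given with its full path.

```python
import subprocess

# Modes 0 to 4, as listed for the -ocrmode option above.
VALID_OCR_MODES = range(5)


def build_ocr_command(pdf_in, pdf_out, lang="eng", ocrmode=1,
                      exe="ocr2any.exe"):
    """Build the argument list for one conversion (hypothetical helper)."""
    if ocrmode not in VALID_OCR_MODES:
        raise ValueError(f"ocrmode must be 0-4, got {ocrmode!r}")
    return [exe, "-ocr", "-lang", lang, "-ocrmode", str(ocrmode),
            str(pdf_in), str(pdf_out)]


def run_ocr(pdf_in, pdf_out, **options):
    """Run the converter and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_ocr_command(pdf_in, pdf_out, **options), check=True)


# Same arguments as the first template above.
print(build_ocr_command(r"C:\in.pdf", r"C:\out.pdf", lang="deu", ocrmode=1))
```

Passing the arguments as a list rather than one string avoids quoting problems with paths that contain spaces.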

Now let us check the conversion result in the following snapshot. You can copy the text and paste it into Word or a text file without any errors. If you have any questions while using the software, please contact us as soon as possible.

[Screenshot: copying and pasting text from the converted PDF]
