Question: I am writing my Master's thesis on an NLP system. One of its components is an extractor that pulls plain text out of PDF files. A few PDF files cannot be extracted correctly. For them, the extractor (PDFBox library) returns strings like these:
"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h" or "10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"
I checked every file that causes this extraction problem, and in all of them the text also cannot be copied and pasted from a PDF reader (Adobe Reader and Foxit Reader). The files display correctly in these readers, but after selecting the content and copying it to the clipboard I get the same wrong text as described above: strings of meaningless characters or strings of digits and letters. Can VeryPDF help me figure out this problem?
Answer: From your description, I guess that the problem PDF files contain embedded fonts. Open one of the problem PDF files in Adobe Reader and press Ctrl+D to view its document properties. On the Fonts tab, you will find that the fonts are embedded. With such fonts the page can display correctly while the character codes stored in the file do not map to the real characters, which is why both extraction and copy and paste return garbled text. You can handle these files in VeryPDF PDF Editor, or you can OCR them into normal searchable PDF files with OCR to Any Converter Command Line, after which copy and paste works without errors. In the following part, I will show you how to solve the problem by OCRing the PDF. Here is an example of a PDF with embedded fonts; please have a look. Even if copy and paste appears to succeed, the pasted text is garbled. To solve this problem, you can replace all the embedded fonts in the PDF with system fonts as follows.
Step 1. Download OCR to Any Converter Command Line.
- With this software, you can convert a PDF file with embedded fonts into a new searchable PDF file. You can then copy and paste from the PDF file without errors.
- When the download finishes, you will have a zip file. Extract it to a folder, where you will find the executable file and its related files.
Step 2. Convert the PDF with embedded fonts to a searchable PDF for copy and paste.
- When you use this software, please refer to the usage and examples in readme.txt. Here is the usage for your reference: ocr2any.exe [options] <PDF-file> <Text-file>
- With the following parameters, you can convert a PDF file with embedded fonts to a searchable PDF file.
-ocrmode <int> : set OCR mode
-ocrmode 0: output to text file
-ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
-ocrmode 2: output to plain text based PDF file
-ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
-ocrmode 4: output to OCRed PDF file (Color) with hidden text layer
When you call this software, please refer to the following command line templates:
ocr2any.exe -ocr -lang deu -ocrmode 1 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang eng -ocrmode 2 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang eng -ocrmode 3 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang eng -ocrmode 2 -outboxfile C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang fra -ocrmode 1 C:\in.pdf C:\out.pdf
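If the conversion has to run from a script, for example over a batch of problem files, the templates above can be assembled programmatically. The following Python sketch is only an illustration: the helper names build_ocr2any_command and run_ocr2any are hypothetical, it uses only the -ocr, -lang and -ocrmode options shown above, and it assumes that ocr2any.exe is on the PATH and reports failure through a non-zero exit code, which readme.txt should confirm.

```python
import subprocess

# Mode descriptions copied from the -ocrmode option list above.
OCR_MODES = {
    0: "output to text file",
    1: "OCR PDF pages and insert new text layer under original PDF pages",
    2: "output to plain text based PDF file",
    3: "output to OCRed PDF file (BW) with hidden text layer",
    4: "output to OCRed PDF file (Color) with hidden text layer",
}


def build_ocr2any_command(in_pdf, out_file, lang="eng", ocrmode=1,
                          exe="ocr2any.exe"):
    """Return the argument list for one ocr2any.exe call (hypothetical helper)."""
    if ocrmode not in OCR_MODES:
        raise ValueError(f"ocrmode must be one of {sorted(OCR_MODES)}")
    return [exe, "-ocr", "-lang", lang, "-ocrmode", str(ocrmode),
            str(in_pdf), str(out_file)]


def run_ocr2any(in_pdf, out_file, **options):
    """Run the converter once; assumes ocr2any.exe can be found on the PATH."""
    cmd = build_ocr2any_command(in_pdf, out_file, **options)
    # check=True raises CalledProcessError if the process exits with a non-zero code.
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Prints the same arguments as the first template above.
    print(build_ocr2any_command(r"C:\in.pdf", r"C:\out.pdf", lang="deu", ocrmode=1))
```

Passing the arguments as a list rather than one string avoids quoting problems when file paths contain spaces.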
Now let us check the conversion result in the following snapshot. You can copy the text and paste it into Word or a text file without any errors. If you have any questions while using the software, please contact us as soon as possible.
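For a pipeline like the one in the question, it may also help to detect which extracted strings are garbled, so that only those files are sent through OCR. The Python sketch below is a rough heuristic and not part of PDFBox or any VeryPDF tool: the names char_ratios and looks_garbled and both thresholds are assumptions for illustration and would need tuning on real documents. It flags text with an unusually high share of symbol characters (such as box-drawing characters) or of digits.

```python
import unicodedata


def char_ratios(text):
    """Return (symbol_ratio, digit_ratio) over the non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0, 0.0
    # Unicode categories starting with "S" are symbols (box drawing, math signs);
    # those starting with "C" are control, private-use or unassigned code points.
    symbols = sum(1 for c in chars if unicodedata.category(c)[0] in ("S", "C"))
    digits = sum(1 for c in chars if c.isdigit())
    return symbols / len(chars), digits / len(chars)


def looks_garbled(text, symbol_limit=0.1, digit_limit=0.5):
    """Heuristic check; both limits are illustrative assumptions."""
    symbol_ratio, digit_ratio = char_ratios(text)
    return symbol_ratio > symbol_limit or digit_ratio > digit_limit


if __name__ == "__main__":
    samples = [
        '┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h',
        "10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17",
        "The extractor returns plain text from PDF files.",
    ]
    for sample in samples:
        print(looks_garbled(sample), sample)
```

Pages that legitimately consist mostly of numbers, such as tables, would also be flagged by the digit check, so the result is best treated as a hint for manual review rather than a final decision.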