Q: I have a .pdf document that is laid out in three columns. I have tried exporting to plain text, saving as a .doc file, and copying and pasting highlighted text. In each case, the text comes out tangled: each line is read straight across all three columns, so the columns are interleaved and very tedious to separate and paste back into the correct order.
I extract a lot of text from .pdfs but have not run into this issue before. Is there a way to fix it?
A: VeryPDF PDF Columns Text Extractor is a simple-to-use utility that can extract tables and text from existing PDF documents as Text, HTML or XML.
PDF is a hugely popular format, and for good reason: with a PDF, you can be virtually assured that a document will display and print exactly the same way on different computers.
However, PDF documents have a drawback: they usually lack information specifying which content constitutes paragraphs, tables, figures, headers and footers, and so on. This lack of 'logical structure' information makes it difficult to edit files, to view documents on small screens, or to extract meaningful data from a PDF. In a sense, the content becomes 'trapped'.
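To see why missing structure leads to tangled columns, here is a small Python sketch. It is only an illustration of the general idea, not how any particular product works. It assumes each piece of text is available as a hypothetical (x, y, text) tuple, where x is the left edge of a line fragment and y grows down the page, and it assumes columns can be told apart by their left edges.

```python
from typing import List, Tuple

# Hypothetical input format: (x, y, text) for each line fragment on the page.
Fragment = Tuple[float, float, str]

def row_major_order(fragments: List[Fragment]) -> List[str]:
    """Read straight across the page. This is the order that tangles columns."""
    return [text for _x, _y, text in sorted(fragments, key=lambda f: (f[1], f[0]))]

def column_major_order(fragments: List[Fragment], tolerance: float = 20.0) -> List[str]:
    """Group fragments with similar left edges into columns, then read each
    column from top to bottom, left column first."""
    columns: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f[0]):
        # Same column if the left edge is close to the column's first fragment.
        if columns and frag[0] - columns[-1][0][0] <= tolerance:
            columns[-1].append(frag)
        else:
            columns.append([frag])
    ordered: List[str] = []
    for column in columns:
        ordered.extend(text for _x, _y, text in sorted(column, key=lambda f: f[1]))
    return ordered

# A made-up page with three columns (A, B, C) and two lines in each.
page: List[Fragment] = [
    (50, 100, "A1"), (250, 100, "B1"), (450, 100, "C1"),
    (50, 120, "A2"), (250, 120, "B2"), (450, 120, "C2"),
]

print(row_major_order(page))     # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2']
print(column_major_order(page))  # ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
```

The first result is the tangled order described in the question; the second is the order a reader expects. Real pages need more care (uneven columns, headings that span columns, tables), so treat the fixed tolerance as an assumption made for this example.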
"VeryPDF PDF Columns Text Extractor" is a simple to use command-line tool that can be used to recover tables, text, and reading order from existing PDF.
"VeryPDF PDF Columns Text Extractor" is included in PDF to Text OCR Converter Command Line software, you can download it from following web page,
after you download and unzip it to a folder, you can run following command line to convert your PDF file to text file with columns easily,
pdf2txtocr.exe -table test.pdf out.txt
The "-table" option analyses the contents of your PDF file and quickly lays the text out in columns in the output text file.
For example, this is the original PDF file, which contains multiple text columns:
This is the converted text file. As you can see, the text file contains multiple columns; "VeryPDF PDF Columns Text Extractor" keeps the columns intact.
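If you have many PDF files to convert, the pdf2txtocr.exe command shown above can be scripted. The following is only a sketch in Python. It assumes pdf2txtocr.exe is in the current folder or on your PATH, and the function names build_command and convert_folder are made up for this example. It uses only the -table option shown above.

```python
import subprocess
from pathlib import Path
from typing import List

def build_command(pdf_path: str, txt_path: str, exe: str = "pdf2txtocr.exe") -> List[str]:
    """Build the same command line as above: pdf2txtocr.exe -table in.pdf out.txt"""
    return [exe, "-table", pdf_path, txt_path]

def convert_folder(folder: str, exe: str = "pdf2txtocr.exe") -> None:
    """Convert every .pdf in `folder` to a .txt file with the same base name."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        out = pdf.with_suffix(".txt")
        # check=True raises an error if the converter exits with a failure code.
        subprocess.run(build_command(str(pdf), str(out), exe), check=True)
```

For example, convert_folder("C:/docs") would run the converter once per PDF in that folder and write each text file next to its PDF. The folder name here is only a placeholder.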