Is there a C++ library to extract text from a PDF file?

Question: Last year, I made an application in Java using PDFBox to get the raw text from some PDF files, and I now need to port that application to C++. I would like to know the best C++ alternative for this task. Here is an example in case it helps. Most files will look like this: http://www.jumbala.net/league.pdf . I have tried some software, but the results were poor.

The parts I actually need are in the weird characters at the beginning. Using Adobe Acrobat Reader X with Save As... Text (accessible), I get the following result: http://www.jumbala.net/league_good.pdf.txt . This is approximately what I get in Java using PDFBox, and it is what I want as output in C++. Is there a solution from VeryPDF?

Answer: Depending on your needs, you can try the free trial of either of these products: VeryPDF PDF to TXT Converter COM or VeryPDF PDF to Text OCR SDK for .NET. The difference between them is that the second has an OCR function and the first does not, but both can convert PDF to text. More information about each product is available on its homepage. The steps below show how to use the first one, PDF to TXT Converter COM.

Step 1. Download PDF to TXT COM

  • Download the software to your computer; it comes as a zip file. Extract it to a folder, where you will find the code templates and the executable file.
  • When using the software, follow the included examples and code templates.

Step 2. Convert PDF to text from C++

  • Here is a Visual C++ code template for converting PDF to text, for your reference.

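#include <stdio.h>    // fopen, fread, fclose, printf
#include <string.h>   // memset
#include <io.h>       // _filelength
#include <new>        // std::nothrow
#include <windows.h>  // LPCWSTR, TRUE
// The Set... and PDFBuffer2TextBuffer... functions come from the SDK; their
// declarations are assumed to be provided by the downloaded package.

// Note: textfile is not used by this template; the text is printed instead.
void ConvertPDFBuffer2TextBuffer(const char *pdffile, const char *textfile)
{
    int m_iFileLength = 0;
    char *pdfBuffer = NULL;

    // Read the whole PDF file into memory.
    FILE *file = fopen(pdffile, "rb");
    if (file)
    {
        m_iFileLength = _filelength(_fileno(file));
        if (m_iFileLength <= 0)
        {
            fclose(file);
            return;
        }
        // Plain new throws on failure; nothrow makes the NULL check meaningful.
        pdfBuffer = new (std::nothrow) char[m_iFileLength];
        if (pdfBuffer == NULL)
        {
            fclose(file);
            return;
        }
        memset(pdfBuffer, 0, m_iFileLength);
        size_t bytesRead = fread(pdfBuffer, 1, m_iFileLength, file);
        fclose(file);
        if (bytesRead != (size_t)m_iFileLength)
        {
            delete[] pdfBuffer;
            return;
        }
    }
    if (pdfBuffer == NULL || m_iFileLength <= 0)
        return;

    // Conversion options.
    SetPageSeparator("<<<<<<<<<********>>>>>>>>>>>>>");
    SetZoomRatio(100);
    SetTXTFormat(1);
    SetOpenResultFile(0);
    SetDeleteBlankLine(TRUE);

    // Convert to a narrow-character text buffer and print it.
    int textBufferSize = 0;
    const char *textbuffer = PDFBuffer2TextBuffer(
        pdfBuffer, m_iFileLength, &textBufferSize
    );
    if (textbuffer)
    {
        printf("%s\n", textbuffer);
        PDF2TextFreeBuffer(textbuffer);
    }

    // Convert to a wide-character text buffer.
    LPCWSTR textbufferW = PDFBuffer2TextBufferW(
        pdfBuffer, m_iFileLength, &textBufferSize
    );
    if (textbufferW)
        PDF2TextFreeBufferW(textbufferW);

    // Extended wide-character variant; arguments kept as in the template.
    textbufferW = PDFBuffer2TextBufferWEx(
        pdfBuffer, m_iFileLength, &textBufferSize,
        0, 0, NULL, NULL, NULL, 0
    );
    if (textbufferW)
        PDF2TextFreeBufferW(textbufferW);

    delete[] pdfBuffer;
}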
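
The template above reads the whole PDF into memory with fopen and _filelength, which are specific to the Microsoft C runtime. If you prefer standard C++ for that step, a minimal sketch could look like the following. ReadFileToBuffer is a hypothetical helper name and is not part of the VeryPDF SDK; only the file-reading step is shown.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the VeryPDF SDK): reads an entire file
// into memory in binary mode. Returns an empty vector if the file cannot
// be opened, is empty, or cannot be read completely.
std::vector<char> ReadFileToBuffer(const std::string& path)
{
    // Open at the end so tellg() reports the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in)
        return {};

    const std::streamsize size = in.tellg();
    if (size <= 0)
        return {};

    std::vector<char> buffer(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    if (!in.read(buffer.data(), size))
        return {};

    return buffer;
}
```

The buffer's data() pointer and size() could then be passed where the template passes pdfBuffer and m_iFileLength, assuming the SDK accepts a mutable char pointer and an int length as the template suggests. The vector frees its memory automatically, so no delete[] is needed.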

The following snapshot shows the conversion result. If you have any questions while using the software, please contact us.

input PDF and output text
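
Because the template sets a page separator with SetPageSeparator, the returned text can be split back into pages afterwards. The sketch below shows one way to do that with the standard library. SplitPages is a hypothetical helper, not an SDK function, and it assumes the separator string appears verbatim between pages in the output; exactly where the SDK places it (for example, on its own line or after the last page) is not confirmed here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the VeryPDF SDK): splits extracted text
// into pages wherever the separator string occurs. Surrounding newlines are
// left in place because the SDK's exact output layout is an assumption.
std::vector<std::string> SplitPages(const std::string& text,
                                    const std::string& separator)
{
    std::vector<std::string> pages;
    if (separator.empty())
    {
        pages.push_back(text);  // nothing to split on
        return pages;
    }

    std::size_t start = 0;
    for (;;)
    {
        const std::size_t pos = text.find(separator, start);
        if (pos == std::string::npos)
        {
            pages.push_back(text.substr(start));  // remainder after last separator
            break;
        }
        pages.push_back(text.substr(start, pos - start));
        start = pos + separator.size();
    }
    return pages;
}
```

With the separator used in the template, a call would look like SplitPages(text, "<<<<<<<<<********>>>>>>>>>>>>>"), where text holds a copy of the buffer returned by the conversion function.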
