PDF to Text OCR Converter SDK for .NET package

Your OCR SDK feature page says: "Create Text file containing the coordination information of text in original PDF, include [X, Y, Width, Height] information for each word when OCR".

We would like to OCR some PDF documents. What unit do you use for X, Y, Width and Height? Is it pixels?

We have tried the PDF to Text OCR Command Line, but its functionality is too limited for our requirements. Do you have a programming reference guide for the SDK? We would like to look through it before deciding whether to buy the PDF to Text OCR Converter SDK for .NET package (not the command line version).

Customer
----------------------------------------------

Thanks for your message. The unit is pixels.

You can pass the following command line to the SDK from C# code to get the X, Y, Width and Height of each word in a PDF or TIFF file:

// Requires: using System; using System.Reflection; using System.Windows.Forms;
private void button1_Click(object sender, EventArgs e)
{
    string strStartupPath = System.Windows.Forms.Application.StartupPath + "\\";

    // Look up the SDK's COM class; the result is null if the ProgID is not registered.
    System.Type pdf2vecName = Type.GetTypeFromProgID("pdfcom.pdfclass");
    if (pdf2vecName != null)
    {
        object pdf2vec = Activator.CreateInstance(pdf2vecName);
        string strInFile = strStartupPath + "test-color.tif";
        string strOutFile = strStartupPath + "_test-color.pdf";

        // XXXXXXXXXXXXXXXXXXXX is a placeholder; file paths are quoted so that spaces are handled.
        string strCmd = "-$ XXXXXXXXXXXXXXXXXXXX -ocr -lang eng -ocrmode 2 -outboxfile \"" + strInFile + "\" \"" + strOutFile + "\"";

        MessageBox.Show(strCmd);

        // Pass the whole command line as a single string argument to the COM method.
        object[] argn = new object[1];
        argn[0] = strCmd;
        int nRet = (int)pdf2vecName.InvokeMember("com_PDFToTextOCRSDKShell", BindingFlags.InvokeMethod, null, pdf2vec, argn);
        MessageBox.Show("Return Value is: " + string.Format("{0}", nRet));
    }
}

The "-outboxfile" option outputs the [X, Y, Width, Height] information for each word during OCR.

The output ".box.txt" file looks like the sample below, with one word per line, so you can easily read it into your application for further processing:
-------------------------------------
328,234,60,34,US
404,234,284,34,2005/0118291
704,234,57,34,A1
328,452,163,30,throughout
501,452,44,30,the
556,452,160,30,cytoplasm.
729,452,193,30,Interestingly,
935,452,83,30,Golgi
1028,452,159,30,complexes
1197,452,28,30,in
327,495,224,30,placebo+CC14
571,495,84,30,group
676,495,108,30,contain
804,495,80,30,small
904,495,175,30,loW-density
1098,495,125,30,vesicles.
329,539,83,30,Golgi
426,539,158,30,complexes
598,539,28,30,in
640,539,43,30,the
697,539,147,30,processed
855,539,132,30,Morinda
1002,539,128,30,citrifolia
1144,539,80,30,prod-
327,583,168,30,ucts+CC14
518,583,85,30,group
628,583,108,30,contain
760,583,71,30,large
855,583,118,30,vesicles
995,583,66,30,With
1085,583,140,30,increased
-------------------------------------
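
Reading this file back could look something like the sketch below. It assumes every line has exactly the form X,Y,Width,Height,Text shown in the sample, with integer pixel values. WordBox and BoxFileParser are hypothetical names used only for illustration; they are not part of the SDK. The split is limited to five parts because the word itself may contain a comma (for example "Interestingly,").

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// One OCR'd word and its bounding box in pixels (hypothetical helper type).
public sealed class WordBox
{
    public int X;
    public int Y;
    public int Width;
    public int Height;
    public string Text;
}

// Hypothetical helper for reading ".box.txt" lines; not part of the SDK.
public static class BoxFileParser
{
    // Parses one "X,Y,Width,Height,Text" line.
    // Returns null for blank lines, separator lines, or anything that does not match.
    public static WordBox ParseLine(string line)
    {
        if (string.IsNullOrWhiteSpace(line)) return null;

        // Split into at most 5 parts so commas inside the word are preserved.
        string[] parts = line.Split(new[] { ',' }, 5);
        if (parts.Length != 5) return null;

        int x, y, w, h;
        if (!int.TryParse(parts[0], NumberStyles.Integer, CultureInfo.InvariantCulture, out x) ||
            !int.TryParse(parts[1], NumberStyles.Integer, CultureInfo.InvariantCulture, out y) ||
            !int.TryParse(parts[2], NumberStyles.Integer, CultureInfo.InvariantCulture, out w) ||
            !int.TryParse(parts[3], NumberStyles.Integer, CultureInfo.InvariantCulture, out h))
        {
            return null;
        }

        return new WordBox { X = x, Y = y, Width = w, Height = h, Text = parts[4] };
    }

    // Parses many lines and silently skips those that are not word entries.
    public static List<WordBox> ParseLines(IEnumerable<string> lines)
    {
        var words = new List<WordBox>();
        foreach (string line in lines)
        {
            WordBox word = ParseLine(line);
            if (word != null) words.Add(word);
        }
        return words;
    }
}
```

To load a whole file, something like BoxFileParser.ParseLines(System.IO.File.ReadLines(path)) should work, where path points at the ".box.txt" file written next to your output.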

By the way, we released a new version of PDF to Text OCR SDK for .NET today. Please download the new package from the following URLs:

http://www.verypdf.com/dl2.php/pdf2txtocrsdk.zip
http://www.verypdf.com/app/pdf-to-text-ocr-converter/sdk-for-net.html

Here are more calling examples for the new version. We hope they will be useful to you:

http://www.verypdf.com/wordpress/201408/verypdf-release-notes-verypdf-releases-a-new-version-of-pdf-to-text-ocr-sdk-for-net-today-40914.html

VeryPDF
