VeryPDF Knowledge Base

Call PDF to Text OCR from Web or Windows Service

2011/06/06

I downloaded the trial version of PDF2TXT yesterday and attempted to extract the text from the attached PDF. While I am aware that the trial version only extracts a limited number of actual pages per document, it seems that calling the cmd line interface from .NET is not working as far as extracting the text layer. When I use the GUI based version, it seems to work fine.

See attachments for source and output files.

Below is the code from the .NET text app. I’ve used the same process in the past with your Image2PDF OCR product and it works very well. All we are trying to do is to extract the text from the PDF so we can then index for full text search.

Is there anything in particular that we should be doing to get PDF2Text to generate via .NET? Any specific parameters that we MUST send to the cmd line?

I need to get successful runs from this right away, or move to another product. We’ve had good success in the past with your products and wish to continue using.

Thanks for your assistance!

            inputPDFFilePath = @"C:\temp\indexRoot\test1.pdf";
            outputTextFilePath = string.Format(@"C:\temp\indexRoot\test_{0}_.txt",DateTime.Now.Ticks.ToString());
            // Start the child process.
            Process p = new Process();
            // Redirect the output stream of the child process.

            p.StartInfo.CreateNoWindow = true;
            p.StartInfo.UseShellExecute = false;
            p.StartInfo.ErrorDialog = true;
            p.StartInfo.RedirectStandardOutput = true;
            p.StartInfo.RedirectStandardError = true;
            p.StartInfo.RedirectStandardInput = true;
            p.StartInfo.FileName = @"C:\source\visioncore\source\SLS\PDFTextExtraction\pdf2txt.exe";
            //p.StartInfo.FileName = "pdf2txt.exe";
            p.StartInfo.Arguments = string.Format("{0} {1} ", inputPDFFilePath, outputTextFilePath);
            bool retval = p.Start();

            return retval;
==================================

We recently purchased the PDF to Text OCR Converter Command Line product for use within a .NET Windows service. PDF2Text works fine in a development environment; however, when the service is installed, PDF2Text does not seem to run.

Both environments are running WinXP Pro. The non-dev environment is a VM. This same VM successfully runs your Image 2 PDF product.

A bit of background on the service implementation:

The service creates a FileSystemWatcher that listens to a directory that then calls the method below.

As the .Arguments go, I've tried with and without "-$" "XXXXXXXXXXXXXXXX".

private bool extractText( string inputPDFFilePath, string outputTextFilePath)
{
             // Start the child process.
            Process p = new Process();
            // Redirect the output stream of the child process.

            p.StartInfo.CreateNoWindow = true;
            p.StartInfo.UseShellExecute = false;
            p.StartInfo.ErrorDialog = true;
            p.StartInfo.RedirectStandardOutput = true;
            p.StartInfo.RedirectStandardError = true;
            p.StartInfo.RedirectStandardInput = true;
            p.StartInfo.FileName = @"C:\source\visioncore\source\SLS\PDFTextExtraction\pdf2txt.exe";
            p.StartInfo.Arguments = string.Format("{0} {1} ", inputPDFFilePath, outputTextFilePath);
            bool retval = p.Start();
            return retval;
}

I also tried running pdf2txt.exe using commandline on the VM environment and it works fine.

Any suggestions would be greatly appreciated.
==========================================

We suggest running your C# code under an Administrator user account instead of the default system account. Does it work when you run it under an Administrator account?

VeryPDF
=========================================
We've tried running this under both an admin acct and a local system account, but it still doesn't work. What is odd is that I can see that pdf2txtocr.exe is running in Task Manager, but no text file is output UNTIL I stop the service. Then, the text file is written as originally expected.

Any ideas?
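
One general pitfall worth ruling out here (it is not confirmed as the cause in this thread): the method above sets RedirectStandardOutput and RedirectStandardError but never reads from those streams. A child process that writes enough output can then block until the pipe is emptied or closed. The sketch below shows the pattern of draining both pipes, written in Python for brevity. run_and_drain is an illustrative name, and the child process used here is a stand-in, not pdf2txt.exe.

```python
import subprocess
import sys


def run_and_drain(args, timeout=300):
    """Run a child process and read its stdout/stderr to completion.

    subprocess.run with capture_output=True keeps reading both pipes
    until the child exits, so the child cannot stall on a full pipe.
    """
    completed = subprocess.run(
        args,
        capture_output=True,        # redirect AND drain stdout/stderr
        stdin=subprocess.DEVNULL,   # the child never waits for input
        timeout=timeout,
    )
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    # Stand-in child that prints far more than a typical pipe buffer holds.
    child = [sys.executable, "-c", "print('x' * 200000)"]
    code, out, err = run_and_drain(child)
    print(code, len(out.strip()))
```

In .NET the same idea would mean either not redirecting the streams at all or reading them until the process exits.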
=========================================
We suggest running the conversion from an interactive user account, as described below.
Running a conversion inside an interactive user account from a service or web application:

Please follow these steps to run the document conversion inside an interactive user account.

1. We assume that pdf2txtocr.exe and the PDF files are in the C:\test folder.
Please add the "Everyone" user account to the "C:\test" folder and its sub-folders, and give it "Full Control" permission.

2. Download CmdAsUser.exe from the following page:

http://www.verydoc.com/exeshell.html

You can also download it directly from the following URL:

http://www.verydoc.com/download/cmdasuser.zip

3. Run the following command line to test CmdAsUser.exe:

C:\test\CmdAsUser.exe Administrator . /p password /c "C:\test\pdf2txtocr.exe" C:\test\in.pdf C:\test\out.txt

If this command line runs correctly in a Command Prompt window, call the same command line from your service or web application (for example, from PHP with the shell_exec() function), and it should work there as well.
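
As a rough sketch of calling the same command line from another program, the Python below assembles the arguments in the order used in step 3. build_cmdasuser_args is a hypothetical helper, and naming the second argument "domain" (the "." in the example) is an assumption made for readability; only the argument order and the /p and /c switches come from the command line above.

```python
import subprocess


def build_cmdasuser_args(cmdasuser, user, domain, password, program, *program_args):
    """Return the argument list for the CmdAsUser.exe call from step 3.

    The parameter names are illustrative; the order mirrors the example.
    """
    return [cmdasuser, user, domain, "/p", password, "/c", program, *program_args]


args = build_cmdasuser_args(
    r"C:\test\CmdAsUser.exe", "Administrator", ".", "password",
    r"C:\test\pdf2txtocr.exe", r"C:\test\in.pdf", r"C:\test\out.txt",
)
print(subprocess.list2cmdline(args))

# On the Windows machine itself, the call would be:
# subprocess.run(args, check=True)
```

Passing the arguments as a list avoids building the command string by hand, which matters once a path contains spaces.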

Please note:
1. Replace the "Administrator" and "password" parameters in the command line above with the correct user name and password. CmdAsUser.exe will launch the target program (pdf2txtocr.exe in this example) from that user account with administrator privileges.
2. On some Windows systems you may encounter Error 1314 when switching between user accounts. This is caused by a permission setting; please follow the steps below to resolve it.

ERROR 1314:
~~~~~~~~~~~~~
1314 A required privilege is not held by the client. ERROR_PRIVILEGE_NOT_HELD
~~~~~~~~~~~~~

To resolve this issue:
1. Click Start, click Run, type "secpol.msc", and then press ENTER.
2. Double-click "Local Policies".
3. Double-click "User Rights Assignment".
4. Double-click "Replace a process level token".
5. Click "Add", and then double-click the "Everyone" group.
6. Click "OK".
7. You may have to log out or even reboot for this change to take effect.

Please refer to the following two screenshots for these steps:

http://www.verydoc.com/images/err1314-1.png
http://www.verydoc.com/images/err1314-2.png

Please see the following page for details about ERROR 1314:

http://www.verydoc.com/exeshell.html

If you still have the same problem, please create a remote desktop account on your test machine. After we log in to your test machine, we will investigate the problem for you as soon as possible.

VeryPDF
==============================
Thanks, I'll give it a try.

==============================
I have another question regarding the Img2PDF product. We currently use that in our application and have it running on a Win XP Pro VM.

Is there an updated version that will run on Windows Server 2003/2008?

We are currently set up with a SaaS implementation; however, we have clients that will want a stand-alone version behind their firewall. It's not likely our clients will want to maintain a VM with an antiquated OS, so future purchases of the OCR tool we use would need to operate in a server OS environment.
===============================
Our Img2PDF product does support Windows Server 2003/2008. If you encounter any problem on Windows Server 2003/2008, please let us know and we will assist you as soon as possible.

VeryPDF
===============================


advanced pdf tools

Need to preserve PDF/A mode in PDF Tools COM

2011/06/06

We recently noticed that if a PDF has been saved as PDF/A, the PDF/A mode is lost when the file is saved again. Can an update be made so that both the 32-bit and 64-bit versions respect the PDF/A mode? Thanks.

Before
After "cleaning" the .pdf file
================================
Can you please let us know which functions you used to clear the properties? Did you call the Save() function to save to a new PDF file? If possible, please email us your sample code and the PDF file in question. After we reproduce the problem on our system, we will work out a solution for you as soon as possible.

VeryPDF
===============================
Looks like the DeleteMetadata method removes the PDF/A setting

Sub test()

    Dim Opdf As PDFSDKCOMLib.pdftools
    Set Opdf = New PDFSDKCOMLib.pdftools

    Opdf.Open "C:\TestPDFAFile.pdf"   ''' <<< Has PDF/A at this point
    If Opdf.IsOpened = False Then
        MsgBox "Open PDF file failed."
    Else
        '''MsgBox "Open PDF file successful."
    End If

    Opdf.DeleteMetadata

    Opdf.Save "C:\NEW.pdf"    ''' <<< Does not have PDF/A

End Sub
===============================================
Yes, you are right. A PDF/A file MUST contain metadata; if you delete the metadata, the resulting PDF file will no longer comply with the PDF/A specification.

Once a PDF file is stored in PDF/A format, you can't delete the summaries or metadata from it without damaging its PDF/A conformance, because the PDF/A specification requires them.

VeryPDF
===============================================
OK, thanks for the clarification, PDF/A has become important to our clients and we are fairly new to it as well.

So we are thinking we may need a new property on the COM object that tells us whether the document is in PDF/A format (.IsPDFA?).

Secondly, we've noticed that properties of a PDF/A document aren't returned properly. For example, in the PDF I sent you, the "title" is "test.docx" but Opdf.Title reports blank. Is it possible to get the correct value returned?
================================================
Yes, we can check whether a PDF file complies with the PDF/A format, but the result may not be accurate. If you want to verify a PDF/A file accurately, you can use the "Preflight" option in "Adobe Acrobat 9 Pro Extended".

If you need a CheckPDFA() function, please let us know and we will add it for you shortly.

Yes, PDF/A stores this information in the /Metadata section rather than in the Document Summaries section, so we can't read the /Title from some PDF/A files.

If possible, please also email us the sample PDF file in question. After we check your PDF file, we will get back to you shortly.
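
To illustrate why such a check is only approximate, here is a minimal heuristic sketch in Python. It is not the CheckPDFA() implementation. It assumes the XMP metadata packet is stored uncompressed and only looks for the pdfaid:part entry with which a file claims PDF/A conformance; it does not validate anything. The function names are hypothetical.

```python
import re

# XMP writers use either element form (<pdfaid:part>1</pdfaid:part>)
# or attribute form (pdfaid:part="1"); accept both.
_PDFA_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")


def claimed_pdfa_part(data: bytes):
    """Return the PDF/A part the data claims (e.g. "1"), or None."""
    match = _PDFA_PART.search(data)
    return match.group(1).decode("ascii") if match else None


def claimed_pdfa_part_of_file(path: str):
    """Same check, reading the raw bytes of a file on disk."""
    with open(path, "rb") as handle:
        return claimed_pdfa_part(handle.read())


if __name__ == "__main__":
    sample = b"<rdf:Description><pdfaid:part>1</pdfaid:part></rdf:Description>"
    print(claimed_pdfa_part(sample))
```

A file can carry this marker and still violate the specification, which is why a full validator such as Preflight remains the accurate option.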

VeryPDF
==============================================
Yes, we would like the CheckPDFA() functionality. Are there any other document properties that cannot be read in some PDF/A files? Is there any way to read (but not write I would suppose) these values?
==============================================
We have created a 64-bit version of Advanced PDF Tools COM for you; please download it from the following URL:

XXXXXXXXXXXXXXXXXXXXXXX

The new version contains the CheckPDFA() function; you can call it to check whether a PDF file is a PDF/A file.

The current version of pdfsdkcom.dll can't read all information from PDF/A files, because this information is stored in /Metadata and is not available in the /Info section.

However, we can develop a new function for you that reads the document properties from the /Metadata section, at additional cost. If this solution is acceptable to you, please let us know and we will provide a quotation shortly.

VeryPDF

=============================
Did you crank a 32 bit version with the CheckPDFA function?

We just checked and the 64 bit version doesn’t have this CheckPDFA() function either. Perhaps you sent us the wrong set of files? And the close function did get renamed, but we did make our own updates for that as well.
===============================
I am resending the package to you:

XXXXXXXXXXXXXXXXXXXX

After you unzip it to a folder, run the following two command lines to register the components:

regsvr32.exe pdfsdkcom.dll
PDFSDKCOMExe.exe /regserver

The new version contains the CheckPDFA() function; you can call it to check whether a PDF file is a PDF/A file.

You can use the OLE Viewer application to check the methods in PDFSDKCOMExe.exe. As shown in the screenshot, PDFSDKCOMExe.exe contains both the CheckPDFA() and CreatePDFA() methods.

If you still can't see it, please create a remote desktop account on your test machine. After I log in to your test machine, I will investigate the problem for you shortly.

VeryPDF
=============================
pdfsdkcom.dll is a 32-bit DLL; it contains the 32-bit CheckPDFA() function.

PDFSDKCOMExe.exe is the 64-bit COM component; it contains the 64-bit CheckPDFA() function.

You can call either pdfsdkcom.dll or PDFSDKCOMExe.exe from your code.

VeryPDF


pdf print

C# example for PDF Print SDK

2011/06/06

Good afternoon,

I am testing your VeryPDF PDFPrint command line tool. I am thinking about buying this tool, but I wanted to test your VeryPDF PDFPrint SDK tool. I downloaded this tool and am looking at your C# example, and it is not much help. Do you have any other C# examples I could try?

Thanks,
==========================
Hi,

The "C#" folder in the test package contains the C# example; you can compile and run it as is.

PDFPrint SDK supports all options included in the PDFPrint Command Line version, for example:

~~~~~~~~~~~~~~~~~~~~~~~~~~
[DllImport("pdfprintsdk.dll")]
public static extern int VeryPDF_PDFPrint(string CommandLine);

        public static long PrintDoc(string FullDocumentName, string PrinterName)
        {
            string PrintCommand = PrintCommandTemplate;
            if (LicenseKey != null)
                PrintCommand = PrintCommand.Replace("[LicenseKey]", LicenseKey);
            else
                PrintCommand = PrintCommand.Replace("[LicenseKey]", "XXXXXXXXXXXXXXXX");
            PrintCommand = PrintCommand.Replace("<PrinterName>", PrinterName).Replace("<DocFullPath>", FullDocumentName);

            MessageBox.Show(PrintCommand);
            return VeryPDF_PDFPrint(PrintCommand);
        }

        private void button1_Click(object sender, EventArgs e)
        {
            string appPath = Path.GetDirectoryName(Application.ExecutablePath);
            string strPDFFile = (appPath + "\\readme.pdf");
            long nRet = PrintDoc(strPDFFile, "docPrint");
            MessageBox.Show(nRet.ToString());
        }
~~~~~~~~~~~~~~~~~~~~~~~~~~

You can test the command line options with PDFPrint Command Line first, and then pass the same options to the VeryPDF_PDFPrint() function in PDFPrint SDK; the SDK should then print your PDF file correctly.

VeryPDF


docprint pro

docPrint Pro is compatible with Windows Server 2008 R2 (64-bit OS)

2011/06/06

We have been having problems running your software on 64-bit Windows Server 2008. Is docPrint Pro v.5.0 compatible with Windows Server 2008 R2(64-bit OS)?
========================
Yes, docPrint Pro v.5.0 is compatible with Windows Server 2008 R2 (64-bit OS). Can you please let us know what problem you encountered on this system?

VeryPDF
========================
The software is not writing PDF, even with command line. This is the string I am using: doc2pdf -i C:\Documents and Settings\csingleton\Desktop\TEST.doc -o C:\Documents and Settings\csingleton\Desktop\TEST.pdf

I enter that into the command line, hit enter and nothing happens.
========================
Please enclose the input and output filenames in double quotes ("") and try again, for example:

doc2pdf -i "C:\Documents and Settings\csingleton\Desktop\TEST.doc" -o "C:\Documents and Settings\csingleton\Desktop\TEST.pdf"

We hope this will work for you.
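
The effect of the missing quotes can be sketched in a few lines of Python. Splitting on whitespace only approximates how an unquoted command line is broken into arguments, and subprocess.list2cmdline is a standard-library helper that is not part of the formally documented API, so treat this as an illustration rather than a description of how doc2pdf parses its arguments.

```python
import subprocess

UNQUOTED = (r"doc2pdf -i C:\Documents and Settings\csingleton\Desktop\TEST.doc "
            r"-o C:\Documents and Settings\csingleton\Desktop\TEST.pdf")

# Whitespace splitting approximates how an unquoted command line is broken up:
# each path falls apart at the spaces in "Documents and Settings".
tokens = UNQUOTED.split()

# Passing each path as one list element keeps it intact; list2cmdline adds
# double quotes around any argument that contains a space.
args = [
    "doc2pdf",
    "-i", r"C:\Documents and Settings\csingleton\Desktop\TEST.doc",
    "-o", r"C:\Documents and Settings\csingleton\Desktop\TEST.pdf",
]
quoted = subprocess.list2cmdline(args)

print(len(tokens), "tokens without quotes")
print(quoted)
```

Without quotes, the program receives "C:\Documents" as the input file name and treats the remaining fragments as separate arguments.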

VeryPDF


pdf parser & modify sdk

Modify your PDF file contents with the PDF Modify SDK product

2011/06/06

Hello,
We have attempted to work with your code. I have a few questions.
1. Why do we need the png file in the code below?
'Render PDF pages to PNG image files first
strInPDFFile = Application.StartupPath() & "\F_COC.pdf"
strOutFile = Application.StartupPath() & "\out.png"
strOptions = "-r " & CStr(nDPI) & " -$ XXXXXXXXXXXXXXXXXXXX"
nRet = VeryPDF_PDFParserSDK(strInPDFFile, strOutFile, strOptions)
strLogMsg = strInPDFFile & vbCrLf & strOutFile & vbCrLf & strOptions & vbCrLf & "nRet = " & Str(nRet)
MsgBox(strLogMsg)

2. I’ve used F_COC.pdf file in my testing (see attached). Coordinates for text "Voucher" (as marked on attached jpg file) returned by PDFParser (Phase I) are:
X1=226; Y1=404; W1=75; H1=18 OldString was "Voucher" and NewString was "ABC":
I’ve used the following calculation in my code:
       Dim x, y, w, h, hPDF, bRet, nPage As Integer
        Dim x1, y1, w1, h1 As Integer
        Dim strOldText, strNewText As String
        x = x1 * 72.0 / nDPI
        y = y1 * 72.0 / nDPI
        w = w1 * 72.0 / nDPI
        h = h1 * 72.0 / nDPI

        'x = 1333 * 72.0 / nDPI
        'y = 237 * 72.0 / nDPI
        'w = 151 * 72.0 / nDPI
        'h = 27 * 72.0 / nDPI
        nPage = 1
        strOldText = txtOld.Text
        strNewText = txtNew.Text

3. But, probably the most critical question is:
How will these two DLLs (Phase I and Phase II) co-exist efficiently?
It looks like I have to open and process the PDF file and its pages twice.
a) First, I need to open the PDF file and ask for text and page images to get text with coordinates.
b) Then I need to open the file a second time to update the text and images.
Clearly, opening and processing the file twice could make the solution unworkable.
Please provide me with a VB.NET code example that shows the architecture of these two DLLs, so that the process will be very efficient, as promised.
Thanks.
==============================================
Hi,

We have attempted to work with your code. I have a few questions.
1. Why do we need the png file in the code below?
'Render PDF pages to PNG image files first
strInPDFFile = Application.StartupPath() & "\F_COC.pdf"
strOutFile = Application.StartupPath() & "\out.png"
strOptions = "-r " & CStr(nDPI) & " -$ XXXXXXXXXXXXXXXXXXXX"
nRet = VeryPDF_PDFParserSDK(strInPDFFile, strOutFile, strOptions)
strLogMsg = strInPDFFile & vbCrLf & strOutFile & vbCrLf & strOptions & vbCrLf & "nRet = " & Str(nRet)
MsgBox(strLogMsg)

This code is used to get the position of a specific word. If you already know the X, Y, Width and Height of the word, you don't need to call VeryPDF_PDFParserSDK() to generate a PNG image first.
2. I’ve used F_COC.pdf file in my testing (see attached). Coordinates for text "Voucher" (as marked on attached jpg file) returned by PDFParser (Phase I) are:
X1=226; Y1=404; W1=75; H1=18 OldString was "Voucher" and NewString was "ABC":
I’ve used the following calculation in my code:
       Dim x, y, w, h, hPDF, bRet, nPage As Integer
        Dim x1, y1, w1, h1 As Integer
        Dim strOldText, strNewText As String
        x = x1 * 72.0 / nDPI
        y = y1 * 72.0 / nDPI
        w = w1 * 72.0 / nDPI
        h = h1 * 72.0 / nDPI

        'x = 1333 * 72.0 / nDPI
        'y = 237 * 72.0 / nDPI
        'w = 151 * 72.0 / nDPI
        'h = 27 * 72.0 / nDPI
        nPage = 1
        strOldText = txtOld.Text
        strNewText = txtNew.Text

        strOutFile = Application.StartupPath() & "\modified.pdf"
        hPDF = VeryPDF_ModifyPDF_OpenFile(strInPDFFile, strOutFile)
        bRet = VeryPDF_ModifyPDF_ModifyText(hPDF, nPage, x, y, w, h, strOldText, strNewText)
        VeryPDF_ModifyPDF_CloseFile(hPDF)
The resulting \modified.pdf file was not changed.
a) What did I do wrong?

The default DPI is 150, so your position "X1=226; Y1=404; W1=75; H1=18" was calculated at 150 DPI, whereas the VB.NET_2 example parses this PDF file at 300 DPI. When F_COC.pdf is parsed at 300 DPI, the position of "Voucher" is [452, 809, 150, 35]; please refer to the following information in the output HTML file:
~~~~~~~~~~~~~~~~~
<div style="position:absolute;left:452;top:809;width:150;height:35"><span style="font-style:normal;font-weight:700;font-size:36px;font-family:Helvetica-Bold;color:#000000;">Voucher</span></div>
~~~~~~~~~~~~~~~~~
b) Please, provide me with a vb.net example which works following my pdf sample.

Yes, no problem, please download a new test package from following URL,

XXXXXXXXXXXXXXXXXXXXXX

The new version can change "Voucher" to "VeryPDF" properly, as shown in the screenshot.

c) What DPI is used in PDFParser Phase I code? I see no options there to control it. Do I need to?

The default DPI in PDFParser Phase I is 150; however, you can use the "-r" parameter to change it, for example:

nRet = VeryPDF_PDFParserSDK(strInPDFFile, strOutFile, "-r 300 -html -$ XXXXXXXXXXXXXXXXXXXX")
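
The DPI mismatch can be checked with a little arithmetic. The Python sketch below applies the same pixel * 72 / DPI conversion used in the customer's code to the two positions quoted in this thread. The function names are illustrative, and the sketch only demonstrates the scaling; it does not call the SDK.

```python
import math


def pixels_to_points(value_px, dpi):
    """Convert a pixel measurement taken at `dpi` to PDF points (1/72 inch)."""
    return value_px * 72.0 / dpi


def rect_to_points(rect_px, dpi):
    """Convert an (x, y, w, h) rectangle from pixels to points."""
    return tuple(pixels_to_points(v, dpi) for v in rect_px)


at_150 = rect_to_points((226, 404, 75, 18), 150)    # parsed at the default DPI
at_300 = rect_to_points((452, 809, 150, 35), 300)   # parsed with -r 300
mismatch = rect_to_points((226, 404, 75, 18), 300)  # 150 DPI pixels, wrong DPI

print(at_150)
print(at_300)
print(mismatch)
print(math.isclose(at_150[0], at_300[0]))  # same x position once DPI matches
```

When the DPI used in the conversion matches the DPI used for parsing, both rectangles land at nearly the same place on the page. Mixing 150 DPI pixel values with a 300 DPI conversion halves every coordinate, so the text is looked for in the wrong place.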

3. But, probably the most critical question is:
How will these two DLLs (Phase I and Phase II) co-exist efficiently?
It looks like I have to open and process the PDF file and its pages twice.
a) First, I need to open the PDF file and ask for text and page images to get text with coordinates.
b) Then I need to open the file a second time to update the text and images.
Clearly, opening and processing the file twice could make the solution unworkable.
Please provide me with a VB.NET code example that shows the architecture of these two DLLs, so that the process will be very efficient, as promised.

Phase I and Phase II have been integrated into a single pdfparsersdk2.dll library, so there are no longer two DLL files. To modify a PDF file, call VeryPDF_ModifyPDF_OpenFile() to open an output PDF file for writing, and then call VeryPDF_ModifyPDF_ModifyText() to modify each word.

VeryPDF
=============================================
Thanks. We will test the cases described.
=============================================
Hello,
I tested the Modify SDK today. I will do more testing later today.
The Modify SDK seems to be working as designed for "file based" operations, but:
My application has two modes:
1. Design Mode: the user interacts with our UI to determine the changes needed to the PDF pages.
2. Batch Mode: documents are processed automatically based on rule definitions the user defined in Design Mode.
The current Modify SDK seems to be adequate for Batch Mode. In Design Mode, the user expects an immediate response from our application.
I don't think we can achieve the performance goals with the current "file based" operations, and I think we need to add memory based calls.

I am guessing that current Modify SDK works like:
hPDF = VeryPDF_ModifyPDF_OpenFile(PDFFile, strOutFile) – input and output files are open.
bRet = VeryPDF_ModifyPDF_ModifyText(hPDF, nPage, x, y, w, h, strOldText, strNewText) – input pages are read, modifications are performed and output file written.
VeryPDF_ModifyPDF_CloseFile(hPDF) – input and output files are closed.

As indicated above, in Design Mode performance needs to be very, very high, so I think I need something like:
newPageImage = VeryPDF_ModifyPDF_GetImageModifyText(oldPageImage, x, y, w, h, strOldText, strNewText)
where:
- oldPageImage is the image I receive from Phase I SDK.
- newPageImage is a modified oldPageImage.
If needed to improve the speed, I could also pass the text coordinates for a page, which are also available to me from the Phase I SDK at the time of this call.
So, the function may look like:
newPageImage = VeryPDF_ModifyPDF_ModifyText(oldPageImage, oldTextCoordinateArray, iRowID, strOldText, strNewText)
What do you think?
Thanks.
=============================================
Thanks for your new version of the PDF Modify SDK product; everything works fine now.

=============================================
We will include the following new functions in PDF Modify SDK Phase II:

//This function will place image included in "NewImage" on a PDF page in rectangle defined by x,y,w,h.
BOOL VeryPDF_ModifyPDF_ModifyImage(hPDF, nPage, x, y, w, h, NewImage)

//This function will place text included in "strNewText" on a PDF page in rectangle defined by x,y,w,h and according to Font Definition.
//FontEnhacement is Bold, Italic, Underline, etc.
//You can use any Windows Font to render the new text
BOOL VeryPDF_ModifyPDF_ModifyTextRich(hPDF, nPage, x, y, w, h, strNewText, FontType, FontSize, FontColor, FontEnhacement)

VeryPDF
