We have some PDF files that contain fillable forms on top of a background image. Users fill in data in the form fields, which are placed over the image. We currently use "VeryPDF PDF Parser & Modify Component for .NET Developer License" to extract the text content from these PDF files.
In this PDF form, part of the content is available in the htm output as text. The only remaining challenge is to get the labels, which are present as an image.
Converting the PDF to an image and running OCR on it will reduce accuracy. As you are the PDF experts, please suggest a way to get the label content, which is present in this PDF as an image/metadata.
Since this is a form PDF, is there any way to get the key-value pairs from it?
>>In this form, part of the content is available in the htm output as text. The only remaining challenge is to get the labels, which are present as an image.
We have double-checked your PDF file. Yes, it contains one full-page background image with some fillable form fields placed on top of it. Please look at the background image below.
The filled-in data is text content, but the background is one large image, and users fill in the data directly on top of that image.
We have a solution for getting key-value pairs from this form PDF file. For example:
Step 1. First, we extract the user-filled text contents and their coordinates from the PDF file.
Step 2. We use OCR to convert the entire PDF page (or only the background image) to text.
Step 3. We combine the data from Steps 1 and 2: label text from OCR plus user-filled text from the PDF Parser SDK. With this approach, we will be able to get the final key-value pairs from the form PDF.
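The combining step can be sketched roughly as below. This is only a minimal illustration in Python, not code from the PDF Parser SDK or the OCR SDK: the `Box` type, the `pair_labels` function, and the sample coordinates are all hypothetical. It assumes both tools have already produced text with bounding boxes, and that the boxes have been converted to one shared coordinate system (here, top-left origin with y growing downward; PDF page coordinates normally start at the bottom-left, so a conversion would be needed first). Each filled value is paired with the nearest label that sits to its left on the same row or directly above it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    """A piece of text with its bounding box (hypothetical type for this sketch)."""
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom (top-left origin, y grows downward)


def label_cost(label: Box, field: Box) -> Optional[float]:
    """Return the gap between a label and a field, or None if the label
    is neither to the left of the field on the same row nor above it."""
    # Same row: the boxes overlap vertically and the label ends before the field starts.
    v_overlap = min(label.y1, field.y1) - max(label.y0, field.y0)
    if v_overlap > 0 and label.x1 <= field.x0:
        return field.x0 - label.x1
    # Same column: the boxes overlap horizontally and the label ends above the field.
    h_overlap = min(label.x1, field.x1) - max(label.x0, field.x0)
    if h_overlap > 0 and label.y1 <= field.y0:
        return field.y0 - label.y1
    return None


def pair_labels(labels: List[Box], fields: List[Box]) -> List[Tuple[Optional[str], str]]:
    """Pair every filled value with its nearest label (None if no label qualifies)."""
    pairs = []
    for field in fields:
        best_text, best_cost = None, None
        for label in labels:
            cost = label_cost(label, field)
            if cost is not None and (best_cost is None or cost < best_cost):
                best_text, best_cost = label.text, cost
        pairs.append((best_text, field.text))
    return pairs


# Made-up sample data: labels as OCR might report them, values as a PDF parser might.
labels = [
    Box("Name", 50, 100, 100, 115),
    Box("Date of Birth", 50, 140, 150, 155),
    Box("Address", 300, 180, 360, 195),
]
fields = [
    Box("Jane Doe", 160, 98, 300, 116),
    Box("1990-01-01", 160, 139, 300, 156),
    Box("12 Main St", 300, 200, 450, 215),
]
pairs = pair_labels(labels, fields)
print(pairs)
```

A real form will likely need extra handling, for example a maximum allowed distance, or merging OCR words into multi-word labels before matching.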
In summary, we think PDF Parser SDK + OCR SDK can do this job for you: we get the user-filled data from the PDF Parser SDK and the text labels from the OCR software, and by combining them we will be able to get the final key-value pairs from this form PDF file.
If you have any questions about this solution, please feel free to let me know.
Related software in this article:
VeryPDF PDF Parse & Modify Component for .NET
VeryPDF OCR to Any Converter Command Line