How to get key value pairs from both scanned PDF files and plain text PDF files (or text and image mixed PDF files)?

Hi,

We are using the "VeryPDF PDF Parser & Modify Component for .NET Developer License" component to extract text from PDF files and to convert PDF to TIF. As you are the experts in PDF processing, we seek your suggestions on how to process these form PDFs.

Our concerns about the suggestion are as follows:

1. As you are aware, OCR will degrade the accuracy. It is better to retrieve the key value pairs present in these form PDFs directly to achieve better accuracy. The customer's expectations will not be met if OCR is used, as it will lower the accuracy.

2. We rely on the VeryPDF output HTM file to get the text elements and image elements. As you suggested, OCR could be run on all the image elements and the results used further. But in this case, the HTM output contains only the text elements. So what would be the indicator for performing OCR? Running OCR every time is not a good solution, as it affects both processing time and performance.

3. Since these are form controls, the static text will also be a control value and be saved in the PDF as an object/data. Can we get the control names and values from the form PDFs?

Please guide.

Regards
Customer
--------------------------------------

>>1. As you are aware, OCR will degrade the accuracy. It is better to retrieve the key value pairs present in these form PDFs directly to achieve better accuracy. The customer's expectations will not be met if OCR is used, as it will lower the accuracy.

Thanks for your message. I know that OCR will degrade the accuracy, but OCR is only a fallback option: it is needed for PDF files that contain only images. If a PDF file contains only text contents, you will get the text contents with "VeryPDF PDF Parser & Modify Component for .NET Developer License" as before.

The whole workflow would work as follows:
1. Your application calls "VeryPDF PDF Parser & Modify Component for .NET Developer License" to extract the text contents from a PDF file.
2. "VeryPDF PDF Parser SDK" extracts the text contents from the PDF file and saves them to an XML file. These two steps already work today.
3. "VeryPDF PDF Parser SDK" determines whether the PDF file contains a background image. If not, it returns to your application.
4. If the PDF file contains a background image that holds the form names, "VeryPDF PDF Parser SDK" uses an OCR engine to extract the text contents from the image and appends the new text contents to the XML file created in step #2.
5. Your application reads the XML file to get the full text contents, including both the original text contents and the OCRed text contents. You may sort them by X/Y position and remove any OCRed text contents that already exist in the pre-OCR text contents.
6. You get final text data containing both the original text contents and the OCRed text contents, from which you can find the key value pairs easily.

As you can see, "VeryPDF PDF Parser & Modify Component" only covers the workflow up to step #2; the remaining steps need an OCR engine. With an OCR engine, you will be able to get text from both the original text contents and the OCRed text contents. OCR is a supplement to the "VeryPDF PDF Parser" SDK product.
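
The merge described in step 5 could be sketched in Python as shown here. This is only an illustration, not part of any VeryPDF product: the `TextItem` class, the `merge_text_items` function, and the 5-unit position tolerance are all assumptions, and it presumes your application has already read the text and its X/Y positions out of the XML file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    """Hypothetical container for one piece of text read from the XML file."""
    text: str
    x: float
    y: float
    source: str  # "native" for original PDF text, "ocr" for OCRed text


def _normalize(text):
    # Compare text case-insensitively and ignore differences in spacing.
    return " ".join(text.lower().split())


def merge_text_items(native, ocr, tolerance=5.0):
    """Merge native and OCRed items, dropping OCR duplicates of native text.

    An OCR item counts as a duplicate when a native item has the same
    normalized text and lies within `tolerance` units on both axes.
    """
    merged = list(native)
    for item in ocr:
        is_duplicate = any(
            _normalize(item.text) == _normalize(n.text)
            and abs(item.x - n.x) <= tolerance
            and abs(item.y - n.y) <= tolerance
            for n in native
        )
        if not is_duplicate:
            merged.append(item)
    # Sort into rough reading order: group by row, then left to right.
    # Assumes Y grows downward; negate the row key if your Y axis grows upward.
    return sorted(merged, key=lambda t: (round(t.y / tolerance), t.x))
```

With the items in reading order, a label that came from the background image (for example a form name) ends up next to the field value that came from the original text, which makes pairing keys with values simpler.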

>>2. We rely on the VeryPDF output HTM file to get the text elements and image elements. As you suggested, OCR could be run on all the image elements and the results used further. But in this case, the HTM output contains only the text elements. So what would be the indicator for performing OCR? Running OCR every time is not a good solution, as it affects both processing time and performance.

Yes, I agree with you that OCR is not a good fit for every file. However, OCR is a good supplement when you want to extract text contents from a background image; for this type of PDF file, OCR is the only choice.

As I said in #1, you can use "VeryPDF PDF Parser SDK" to extract text contents from text-based PDF files. For image-based PDF files and mixed image and text PDF files, you may use an OCR engine to extract the text contents.
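
The decision of when to run OCR could be sketched as below, so that OCR is skipped for text-only files. This is a minimal sketch under assumptions: `classify_page` and `needs_ocr` are hypothetical helpers, and the two counts are assumed to come from your own inspection of the parser output or of the PDF itself (how to detect a background image is not shown here).

```python
def classify_page(text_count, image_count):
    """Classify a page from the number of text and image elements found.

    Both counts are assumed inputs supplied by the calling application.
    """
    if text_count == 0 and image_count == 0:
        return "empty"
    if image_count == 0:
        return "text_only"
    if text_count == 0:
        return "image_only"
    return "mixed"


def needs_ocr(text_count, image_count):
    """Run OCR only when the page has image content that may hold text."""
    return classify_page(text_count, image_count) in ("image_only", "mixed")
```

Text-only pages skip OCR entirely, so the time and performance cost is paid only for image-based and mixed files.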

>>3. Since these are form controls, the static text will also be a control value and be saved in the PDF as an object/data. Can we get the control names and values from the form PDFs?

Yes, this is possible. You can extract control names and values from fillable (form-based) PDF files with "VeryPDF PDF Toolbox Command Line for Windows" or "VeryPDF PDF Toolbox Component for .NET". You may download the trial version of either product from this web page:

http://www.verypdf.com/app/pdftoolbox/try-and-buy.html#buy

After you download it, you may run the following command line to extract the form names and form values from your PDF file:

pdftoolbox.exe "D:\downloads\Sample1.pdf" -outformdata

OR

pdftoolbox.exe "D:\downloads\Sample1.pdf" -outformdata -outfile D:\out.txt

You will get the following contents, which list all of the form names and form values in your PDF file:
---------------------------------------------------
---
FieldType: Text
FieldName: NUCC USE
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: 40
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: 57
FieldFlags: 0
FieldJustification: Left
FieldMaxLength: 28
---
FieldType: Text
FieldName: 58
FieldFlags: 0
FieldJustification: Left
FieldMaxLength: 28
---
FieldType: Text
FieldName: 41
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: 50
FieldFlags: 0
FieldJustification: Left
FieldMaxLength: 19
---
FieldType: Text
FieldName: 73
FieldFlags: 0
FieldValue: Injury
FieldJustification: Left
---
FieldType: Text
FieldName: 74
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: 85
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: 96
FieldFlags: 0
FieldValue: NA
FieldJustification: Left
FieldMaxLength: 71
---
FieldType: Text
FieldName: 99icd
FieldFlags: 0
FieldJustification: Center
---
FieldType: Text
FieldName: 135
FieldFlags: 0
FieldJustification: Center
---
FieldType: Text
FieldName: 157
FieldFlags: 12582912
FieldJustification: Left
---
FieldType: Text
FieldName: 179
FieldFlags: 0
FieldJustification: Center
---
FieldType: Text
FieldName: 201
FieldFlags: 12582912
FieldJustification: Left
---
FieldType: Text
FieldName: 223
FieldFlags: 12582912
FieldJustification: Left
---
FieldType: Text
FieldName: 245
FieldFlags: 12582912
FieldJustification: Left
---
FieldType: Button
FieldName: 276
FieldNameAlt: On/Off Total
FieldFlags: 0
FieldJustification: Left
FieldStateOption: Off
FieldStateOption: Yes
---
FieldType: Button
FieldName: Clear Form
FieldNameAlt: Clear Form
FieldFlags: 65536
FieldJustification: Left
---
FieldType: Text
FieldName: insurance_name
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: insurance_address
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: insurance_address2
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: insurance_city_state_zip
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: pt_name
FieldFlags: 0
FieldValue: AAAA
FieldJustification: Left
---
FieldType: Text
FieldName: insurance_id
FieldFlags: 0
FieldValue: 14343467
FieldJustification: Left
---
FieldType: Text
FieldName: ins_name
FieldFlags: 0
FieldValue: AAA
FieldJustification: Left
---
FieldType: Button
FieldName: insurance_type
FieldFlags: 0
FieldValue: Medicare
FieldJustification: Left
FieldStateOption: Champva
FieldStateOption: Feca
FieldStateOption: Group
FieldStateOption: Medicaid
FieldStateOption: Medicare
FieldStateOption: Off
FieldStateOption: Other
FieldStateOption: Tricare
---
FieldType: Text
FieldName: birth_mm
FieldFlags: 0
FieldValue: 08
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: birth_dd
FieldFlags: 0
FieldValue: 23
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: birth_yy
FieldFlags: 0
FieldValue: 1978
FieldJustification: Left
---
FieldType: Button
FieldName: sex
FieldFlags: 0
FieldJustification: Left
FieldStateOption: F
FieldStateOption: M
FieldStateOption: Off
---
FieldType: Text
FieldName: pt_street
FieldFlags: 0
FieldValue: 3590 Northdale Blvd,Rogers. MN 55374
FieldJustification: Left
---
FieldType: Text
FieldName: pt_city
FieldFlags: 0
FieldValue: Rogers
FieldJustification: Left
---
FieldType: Text
FieldName: pt_state
FieldFlags: 12582912
FieldValue: Mn
FieldJustification: Left
FieldMaxLength: 3
---
FieldType: Text
FieldName: pt_zip
FieldFlags: 0
FieldValue: 55374
FieldJustification: Left
---
FieldType: Text
FieldName: pt_AreaCode
FieldFlags: 0
FieldValue: +1
FieldJustification: Center
FieldMaxLength: 3
---
FieldType: Text
FieldName: pt_phone
FieldFlags: 0
FieldValue: 56712121333
FieldJustification: Left
---
FieldType: Button
FieldName: rel_to_ins
FieldFlags: 0
FieldValue: S
FieldJustification: Left
FieldStateOption: C
FieldStateOption: M
FieldStateOption: O
FieldStateOption: Off
FieldStateOption: S
---
FieldType: Text
FieldName: ins_street
FieldFlags: 12582912
FieldValue: 3590 Northdale Blvd,Rogers. MN 55374
FieldJustification: Left
---
FieldType: Text
FieldName: ins_city
FieldFlags: 12582912
FieldValue: Rogers
FieldJustification: Left
---
FieldType: Text
FieldName: ins_state
FieldFlags: 0
FieldValue: MN
FieldJustification: Left
FieldMaxLength: 4
---
FieldType: Text
FieldName: ins_zip
FieldFlags: 12582912
FieldValue: 55374
FieldJustification: Left
---
FieldType: Text
FieldName: ins_phone area
FieldFlags: 0
FieldValue: +1
FieldJustification: Center
FieldMaxLength: 3
---
FieldType: Text
FieldName: ins_phone
FieldFlags: 12582912
FieldValue: 2354235252
FieldJustification: Left
---
FieldType: Text
FieldName: ins_policy
FieldFlags: 12582912
FieldValue: A8923
FieldJustification: Left
---
FieldType: Text
FieldName: ins_dob_mm
FieldFlags: 0
FieldValue: 08
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: ins_dob_dd
FieldFlags: 0
FieldValue: 23
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: ins_dob_yy
FieldFlags: 0
FieldValue: 1978
FieldJustification: Center
FieldMaxLength: 4
---
FieldType: Button
FieldName: ins_sex
FieldFlags: 0
FieldValue: MALE
FieldJustification: Left
FieldStateOption: FEMALE
FieldStateOption: MALE
FieldStateOption: Off
---
FieldType: Text
FieldName: other_ins_name
FieldFlags: 0
FieldValue: Allen
FieldJustification: Left
---
FieldType: Text
FieldName: other_ins_policy
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: ins_signature
FieldFlags: 0
FieldValue: Signature on file
FieldJustification: Left
---
FieldType: Button
FieldName: ins_benefit_plan
FieldFlags: 0
FieldValue: YES
FieldJustification: Left
FieldStateOption: NO
FieldStateOption: Off
FieldStateOption: YES
---
FieldType: Text
FieldName: ins_plan_name
FieldFlags: 0
FieldValue: BP-13131
FieldJustification: Left
---
FieldType: Text
FieldName: pt_signature
FieldFlags: 0
FieldValue: Signature on file
FieldJustification: Left
---
FieldType: Text
FieldName: pt_date
FieldFlags: 0
FieldValue: 08-05-2020
FieldJustification: Left
---
FieldType: Text
FieldName: cur_ill_mm
FieldFlags: 8388608
FieldValue: 04
FieldJustification: Center
---
FieldType: Text
FieldName: cur_ill_dd
FieldFlags: 0
FieldValue: 17
FieldJustification: Left
---
FieldType: Text
FieldName: cur_ill_yy
FieldFlags: 0
FieldValue: 2020
FieldJustification: Left
---
FieldType: Text
FieldName: ref_physician
FieldFlags: 0
FieldValue: Blue Plan
FieldJustification: Left
---
FieldType: Text
FieldName: id_physician
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: physician number 17a1
FieldFlags: 0
FieldJustification: Center
---
FieldType: Text
FieldName: physician number 17a
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: sim_ill_mm
FieldFlags: 0
FieldValue: 05
FieldJustification: Center
---
FieldType: Text
FieldName: sim_ill_dd
FieldFlags: 0
FieldValue: 05
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: sim_ill_yy
FieldFlags: 0
FieldValue: 2020
FieldJustification: Center
---
FieldType: Text
FieldName: work_mm_from
FieldFlags: 0
FieldValue: 04
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: work_dd_from
FieldFlags: 0
FieldValue: 20
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: work_yy_from
FieldFlags: 0
FieldValue: 2020
FieldJustification: Center
FieldMaxLength: 4
---
FieldType: Text
FieldName: work_mm_end
FieldFlags: 0
FieldValue: 05
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: work_dd_end
FieldFlags: 0
FieldValue: 20
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: work_yy_end
FieldFlags: 0
FieldValue: 2020
FieldJustification: Center
FieldMaxLength: 4
---
FieldType: Text
FieldName: hosp_mm_from
FieldFlags: 0
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: hosp_dd_from
FieldFlags: 0
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: hosp_yy_from
FieldFlags: 0
FieldJustification: Center
FieldMaxLength: 4
---
FieldType: Text
FieldName: hosp_mm_end
FieldFlags: 0
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: hosp_dd_end
FieldFlags: 0
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: hosp_yy_end
FieldFlags: 0
FieldJustification: Center
FieldMaxLength: 4
---
FieldType: Button
FieldName: lab
FieldFlags: 0
FieldValue: YES
FieldJustification: Left
FieldStateOption: NO
FieldStateOption: Off
FieldStateOption: YES
---
FieldType: Text
FieldName: charge
FieldFlags: 0
FieldValue: 321323423423
FieldJustification: Right
---
FieldType: Text
FieldName: medicaid_resub
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: original_ref
FieldFlags: 0
FieldValue: 3214124
FieldJustification: Left
---
FieldType: Text
FieldName: prior_auth
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: emg1
FieldFlags: 0
FieldJustification: Center
---
FieldType: Text
FieldName: local1a
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: sv1_mm_from
FieldFlags: 0
FieldValue: 04
FieldJustification: Center
---
FieldType: Text
FieldName: sv1_dd_from
FieldFlags: 0
FieldValue: 04
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: sv1_yy_from
FieldFlags: 0
FieldValue: 20
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: sv1_mm_end
FieldFlags: 8388608
FieldValue: 05
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: sv1_dd_end
FieldFlags: 0
FieldValue: 05
FieldJustification: Center
FieldMaxLength: 2
---
FieldType: Text
FieldName: sv1_yy_end
FieldFlags: 0
FieldValue: 20
FieldJustification: Center
FieldMaxLength: 2
---------------------------------------------------
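
The field dump above can be turned into key value pairs with a small parser. The following Python sketch is an illustration only: it assumes the output was saved with -outfile and keeps the layout shown above (records separated by `---` lines, one `Name: value` entry per line), and `parse_form_dump` and `form_key_values` are hypothetical helper names, not part of PDF Toolbox.

```python
# Short excerpt of the dump shown above, used as sample input.
SAMPLE = """\
---
FieldType: Text
FieldName: pt_name
FieldFlags: 0
FieldValue: AAAA
FieldJustification: Left
---
FieldType: Button
FieldName: sex
FieldFlags: 0
FieldJustification: Left
FieldStateOption: F
FieldStateOption: M
FieldStateOption: Off
"""


def parse_form_dump(text):
    """Split the dump into one dict per field record."""
    fields = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "---":
            # A line of exactly three dashes starts a new field record.
            current = {}
            fields.append(current)
            continue
        key, sep, value = line.partition(":")
        if current is None or not sep:
            # Skip the long dashed borders and anything before the first record.
            continue
        key, value = key.strip(), value.strip()
        if key == "FieldStateOption":
            # This key repeats within a record, so collect it into a list.
            current.setdefault(key, []).append(value)
        else:
            current[key] = value
    return fields


def form_key_values(fields):
    """Map each FieldName to its FieldValue ("" when no value is present)."""
    return {f["FieldName"]: f.get("FieldValue", "") for f in fields if "FieldName" in f}


if __name__ == "__main__":
    print(form_key_values(parse_form_dump(SAMPLE)))
```

Fields without a FieldValue line, such as the unfilled "sex" button in the excerpt, map to an empty string here; your application may prefer to skip them instead.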

"VeryPDF PDF Toolbox Command Line for Windows" and "VeryPDF PDF Toolbox Component for .NET" are great products for importing and exporting PDF form data. You can use this information to build key value pairs easily; you may give it a try.

VeryPDF
