How to get font size and coordinate information for each word in a TIFF file with the VeryPDF Cloud OCR API

Hi,

Can the Cloud API return each word or word group *along with its font size in the original text* that is provided as input for the OCR? We are a stealth startup that needs to process texts, and our algorithm needs to know the size of each part of the text so that we can recognize titles, headers and sub-headers in the original documents.

Thanks,
Customer
--------------------------------------------
Thanks for the accurate information. I will wait for the cloud service to match, as I can't use a desktop-grade program as part of our product infrastructure. It will not scale for us.

Customer
--------------------------------------------
Yes, of course. You can get the font size for each word in the original TIFF file.

For example, you can use the "format" option to get detailed information for each block, line, word and character in the input TIFF file:

http://online.verypdf.com/api/?apikey=XXXXXXXXXXXXX&app=ocr
&infile=http://online.verypdf.com/examples/cloud-api/multipage.tif
&outfile=out&lang=eng&format

OR

http://online.verypdf.com/api/?apikey=XXXXXXXXXXXXX&app=ocr
&infile=http://online.verypdf.com/examples/cloud-api/multipage.tif
&lang=eng&format
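
As a minimal sketch, the request URLs above could be assembled in Python as follows. The parameter names (apikey, app, infile, outfile, lang, format) are taken from the examples above; the helper name build_ocr_url is hypothetical, and the sketch only builds the URL string without sending a request.

```python
from urllib.parse import urlencode

API_BASE = "http://online.verypdf.com/api/"


def build_ocr_url(api_key, infile, lang="eng", outfile=None):
    """Build a Cloud OCR request URL that ends with the bare "format" flag."""
    params = [("apikey", api_key), ("app", "ocr"), ("infile", infile)]
    if outfile is not None:
        # "outfile" is optional, as in the second example URL above
        params.append(("outfile", outfile))
    params.append(("lang", lang))
    # safe=":/" leaves the infile URL unescaped, matching the examples above
    query = urlencode(params, safe=":/")
    # "format" is passed as a flag without a value, so it is appended by hand
    return f"{API_BASE}?{query}&format"


url = build_ocr_url(
    "XXXXXXXXXXXXX",
    "http://online.verypdf.com/examples/cloud-api/multipage.tif",
    outfile="out",
)
print(url)
```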

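The output HTML file will contain the following information. You only need to pay attention to the "ocrx_word" elements; each one contains the "bbox" information for its word. The four bbox numbers are the left, top, right and bottom pixel coordinates of the word, so the box height (bottom minus top) gives an estimate of the text size: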

<div class='ocr_page' id='page_1' title='image "20130929-220639-8563465574.tif"; bbox 0 0 2000 2388; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 22 28 1013 87">
<p class='ocr_par' dir='ltr' id='par_1' title="bbox 23 30 1012 83">
<span class='ocr_line' id='line_1' title="bbox 23 30 1012 83">
<span class='ocrx_word' id='word_1' title="bbox 23 32 262 74">Universal</span>
<span class='ocrx_word' id='word_2' title="bbox 278 31 569 73">Declaration</span>
<span class='ocrx_word' id='word_3' title="bbox 586 31 637 73">of</span>
<span class='ocrx_word' id='word_4' title="bbox 649 31 836 72">Human</span>
<span class='ocrx_word' id='word_5' title="bbox 853 30 1012 83">Rights</span>
</span>
</p>
</div>
<div class='ocr_carea' id='block_2_2' title="bbox 24 144 1845 258">
<p class='ocr_par' dir='ltr' id='par_2' title="bbox 24 148 1845 254">
<span class='ocr_line' id='line_2' title="bbox 24 148 1794 196">
<span class='ocrx_word' id='word_6' title="bbox 24 152 197 186">Whereas</span>
<span class='ocrx_word' id='word_7' title="bbox 212 151 438 196">recognition</span>
<span class='ocrx_word' id='word_8' title="bbox 453 150 497 185">of</span>
<span class='ocrx_word' id='word_9' title="bbox 507 150 567 185">the</span>
<span class='ocrx_word' id='word_10' title="bbox 582 150 745 185">inherent</span>
<span class='ocrx_word' id='word_11' title="bbox 759 150 899 195">dignity</span>
<span class='ocrx_word' id='word_12' title="bbox 914 149 983 184">and</span>
<span class='ocrx_word' id='word_13' title="bbox 998 149 1041 184">of</span>
<span class='ocrx_word' id='word_14' title="bbox 1051 149 1111 184">the</span>
<span class='ocrx_word' id='word_15' title="bbox 1126 149 1232 194">equal</span>
<span class='ocrx_word' id='word_16' title="bbox 1249 148 1319 183">and</span>
<span class='ocrx_word' id='word_17' title="bbox 1334 148 1551 183">inalienable</span>
<span class='ocrx_word' id='word_18' title="bbox 1565 148 1677 193">rights</span>
<span class='ocrx_word' id='word_19' title="bbox 1692 148 1736 183">of</span>
<span class='ocrx_word' id='word_20' title="bbox 1748 148 1794 183">all</span>
</span>
<span class='ocr_line' id='line_3' title="bbox 25 206 1845 254">
<span class='ocrx_word' id='word_21' title="bbox 25 209 206 244">members</span>
<span class='ocrx_word' id='word_22' title="bbox 222 209 265 244">of</span>
<span class='ocrx_word' id='word_23' title="bbox 275 209 334 244">the</span>
<span class='ocrx_word' id='word_24' title="bbox 349 209 483 243">human</span>
<span class='ocrx_word' id='word_25' title="bbox 499 208 628 254">family</span>
<span class='ocrx_word' id='word_26' title="bbox 642 208 673 243">is</span>
<span class='ocrx_word' id='word_27' title="bbox 688 208 747 243">the</span>
<span class='ocrx_word' id='word_28' title="bbox 763 207 976 243">foundation</span>
<span class='ocrx_word' id='word_29' title="bbox 991 207 1035 242">of</span>
<span class='ocrx_word' id='word_30' title="bbox 1045 206 1223 247">freedom,</span>
<span class='ocrx_word' id='word_31' title="bbox 1234 206 1368 252">justice</span>
<span class='ocrx_word' id='word_32' title="bbox 1383 206 1453 241">and</span>
<span class='ocrx_word' id='word_33' title="bbox 1467 217 1579 252">peace</span>
<span class='ocrx_word' id='word_34' title="bbox 1594 206 1630 240">in</span>
<span class='ocrx_word' id='word_35' title="bbox 1645 206 1704 241">the</span>
<span class='ocrx_word' id='word_36' title="bbox 1719 206 1845 247">world,</span>
</span>
</p>
</div>
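
The sample output above can be turned into per-word sizes with a short script. The following is a rough sketch using only the Python standard library: it reads each "ocrx_word" span, takes the bbox as left, top, right and bottom pixel coordinates, and uses the box height as a stand-in for font size. The names parse_bbox, HocrWordParser, extract_words and median_height_per_line are hypothetical helpers written for this illustration, not part of the VeryPDF API.

```python
from html.parser import HTMLParser
from statistics import median


def parse_bbox(title):
    """Return (left, top, right, bottom) from a title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect text, bbox and pixel height for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._line_id = None   # id of the most recent ocr_line span
        self._bbox = None      # bbox of the word span currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class") or ""
        if css_class == "ocr_line":
            self._line_id = attrs.get("id")
        elif css_class == "ocrx_word":
            # Words without a readable bbox are skipped
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            left, top, right, bottom = self._bbox
            self.words.append({
                "line": self._line_id,
                "text": "".join(self._text).strip(),
                "bbox": self._bbox,
                "height": bottom - top,
            })
            self._bbox = None


def extract_words(hocr):
    """Parse an hOCR string and return the list of word records."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def median_height_per_line(words):
    """Map each line id to the median pixel height of its words."""
    heights = {}
    for word in words:
        heights.setdefault(word["line"], []).append(word["height"])
    return {line: median(values) for line, values in heights.items()}
```

Applied to the sample above, the words of line_1 (the title) have a median height of 42 pixels, while those of line_2 have a median of 35, so comparing line medians is one possible way to separate titles from body text. The median is used because words with ascenders or descenders, such as "Rights", have taller boxes than the rest of their line. The heights are in pixels; converting them to point sizes would also require the scan resolution, which does not appear in the output shown here.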

If you need any more information, please feel free to let us know.

If you wish to change the format of the output information for each word, please let us know; we can help you change the output format too.

VeryPDF
