VeryPDF Layout Analysis SDK analyzes the layout of any document using complex algorithms that recognize the different kinds of areas on the page with high accuracy.
VeryPDF Layout Analysis SDK identifies the following types of areas:
- inverted text
- images (pictures or drawings)
- tables (rows, columns and cells)
- horizontal and vertical lines
Before character recognition takes place, the logical structure of the document has to be analyzed and defined. For example:
- Where are text blocks, paragraphs, lines?
- Is there a table that should be reconstructed?
- Are there any "images" on the page(s)?
- Are there any barcodes to read?
VeryPDF OCR technology contains several variants of Document Layout Analysis:
Automatic Document Analysis
The Document Analysis searches for and finds zones for recognition on the document images. Here is how it works:
- The Document Analysis algorithms detect different elementary objects on the image, e.g.
words or parts of words,
color gradients, inverted text areas.
- Then, based on this information, hypotheses for these blocks are formed and checked:
What is the type of the block?
Where are the borders of the block?
What type of document layout could it be (magazine, newspaper, book page)?
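As a rough illustration of how one such hypothesis could be formed, the sketch below groups detected word boxes into columns by looking for wide vertical gutters. It is not the VeryPDF algorithm: the function name `find_columns`, the `(x0, y0, x1, y1)` box format and the `min_gutter` threshold are all assumptions made for this example.

```python
def find_columns(word_boxes, min_gutter=20):
    """Group word boxes (x0, y0, x1, y1) into columns.

    Hypothetical helper: a new column starts whenever the horizontal gap
    between a box and the column built so far is at least min_gutter pixels.
    Returns a list of [x_start, x_end, boxes] entries, ordered left to right.
    """
    columns = []
    # Sorting tuples orders the boxes by their left edge (x0) first.
    for box in sorted(word_boxes):
        x0, _, x1, _ = box
        if columns and x0 - columns[-1][1] < min_gutter:
            # The box overlaps or nearly touches the current column: extend it.
            columns[-1][1] = max(columns[-1][1], x1)
            columns[-1][2].append(box)
        else:
            # A wide empty gutter precedes this box: start a new column.
            columns.append([x0, x1, [box]])
    return columns


if __name__ == "__main__":
    boxes = [
        (10, 10, 60, 20), (65, 10, 120, 20), (10, 30, 110, 40),  # left column
        (200, 10, 260, 20), (265, 10, 320, 20),                  # right column
    ]
    for x_start, x_end, members in find_columns(boxes):
        print(x_start, x_end, len(members))
```

A result with more than one column would support the hypothesis of a magazine or newspaper layout. A real analyzer would also have to deal with headlines that span several columns, which this sketch ignores.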
The following screenshot of VeryPDF OCR SDK shows the result of an analyzed layout (text, image and table blocks), as well as the reconstructed output.
Another example is a multi-column magazine page processed with intelligent layout analysis and reconstruction.
Generated MS Word document with two columns.
Without intelligent layout analysis, the whole page would be treated as one large text block, and the result may contain nothing but jumbled text.
Automatic Document Analysis can work in the different modes available in the OCR SDKs:
- Full layout analysis - text, images, tables and barcodes are detected (see the samples above)
- Index mode - tries to find as much text as possible on the image, even if it is embedded in pictures
- Mode for invoices and documents with complex tables
- Barcode mode - ignores text and images and only looks for barcodes
- Lines mode - only returns the text in lines, even in a multi-column document
To get the best result from the analysis, the image to process should be of the highest possible quality. Some of the VeryPDF Image Processing libraries can help with this, for example:
Deskew
With high-capacity scanners, the ADF (automatic document feeder) sometimes skews the paper. You can solve this problem with VeryPDF Deskew SDK: it corrects the inclination of the document automatically and quickly, so you get straight images without rescanning. You can deskew up to 45°, and the angle can be calculated using two methods: text analysis or black border detection. For more information, please take a look at VeryPDF Deskew SDK.
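To give an idea of how a text-based angle calculation can work, here is a minimal sketch of the classic projection-profile technique. It does not use the VeryPDF Deskew SDK and is not claimed to be its method. The function name `estimate_skew`, the list-of-points input and the 0.5° search step are assumptions for this example. The idea: when the page is rotated by the right angle, the black pixels of each text line fall into very few rows, so the sum of squared row counts peaks.

```python
import math


def estimate_skew(points, max_angle=45.0, step=0.5):
    """Estimate the skew angle (degrees) of text from black-pixel coordinates.

    points: iterable of (x, y) positions of black pixels.
    Tries every angle in [-max_angle, +max_angle] and returns the one whose
    row histogram is sharpest. Rotating the page by the negative of the
    returned angle straightens the text lines.
    """
    points = list(points)
    best_angle, best_score = 0.0, -1.0
    n_steps = int(round(2 * max_angle / step))
    for i in range(n_steps + 1):
        angle = -max_angle + i * step
        t = math.radians(angle)
        sin_t, cos_t = math.sin(t), math.cos(t)
        histogram = {}
        for x, y in points:
            # Row of the pixel after rotating the page by -angle.
            row = round(-x * sin_t + y * cos_t)
            histogram[row] = histogram.get(row, 0) + 1
        # Text lines aligned with the rows give a few very tall bins.
        score = sum(count * count for count in histogram.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


if __name__ == "__main__":
    # Three synthetic "text lines" tilted by 5 degrees.
    slope = math.tan(math.radians(5.0))
    pts = [(x, x * slope + c) for c in (0, 40, 80) for x in range(200)]
    print(estimate_skew(pts))
```

A finer `step` gives a more precise angle at the cost of more passes over the pixels. Production code would normally work on a downsampled image to keep this fast.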
Despeckle and noise removal
When scanning from copies or microfilm, dust and dirt may add noise to the images. You can avoid this problem with the VeryPDF Despeckle Library. You just need to specify how big a dust element can be (e.g. 2x2 pixels). For more information, visit the VeryPDF Despeckle SDK page.
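The size-based idea described above can be sketched with a simple connected-component filter. This is an illustration only, not the VeryPDF Despeckle Library. The function name `despeckle`, the list-of-lists image format (1 = black, 0 = white) and the `max_w` / `max_h` parameters are assumptions for this example.

```python
from collections import deque


def despeckle(image, max_w=2, max_h=2):
    """Return a copy of a binary image with small specks erased.

    image: list of rows, 1 = black pixel, 0 = white pixel.
    Every 8-connected group of black pixels whose bounding box fits within
    max_w x max_h pixels is treated as dust and turned white.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one connected component and collect its pixels.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            # Erase the component only if its bounding box is speck-sized.
            if max(ys) - min(ys) + 1 <= max_h and max(xs) - min(xs) + 1 <= max_w:
                for py, px in pixels:
                    out[py][px] = 0
    return out


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [0, 0, 0, 0, 0, 0],
    ]
    for row in despeckle(page):
        print(row)
```

Choosing the maximum speck size is a trade-off: too large a value can also erase punctuation marks or the dots on letters such as "i".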
Black border removal and auto-cropping
VeryPDF Black Border Removal SDK automatically detects and removes black borders in monochrome or grayscale images. A black border appears in images acquired by scanners when the paper is smaller than the scanning area, or in images acquired from microfilm, microfiche and aperture cards. Removing the border is an important pre-processing step: it improves the compression rate, which reduces file size, and it improves the appearance of the image. For more information, visit the VeryPDF Black Border Removal SDK page.
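A very simplified version of border detection and auto-cropping might look like the sketch below. It is not the VeryPDF Black Border Removal SDK. The names `find_content_box` and `crop`, the binary list-of-lists image format (1 = black, 0 = white) and the `black_ratio` threshold are assumptions for this example. A grayscale image would first need to be thresholded to this form.

```python
def find_content_box(image, black_ratio=0.9):
    """Locate the page content inside a black scanner border.

    image: list of rows, 1 = black pixel, 0 = white pixel.
    Edge rows and columns are peeled off one at a time while at least
    black_ratio of their pixels (within the remaining box) are black.
    Returns (top, left, bottom, right) with exclusive bottom/right.
    """
    top, bottom = 0, len(image)
    left, right = 0, len(image[0]) if image else 0

    def row_is_border(y):
        return sum(image[y][left:right]) >= black_ratio * (right - left)

    def col_is_border(x):
        black = sum(image[y][x] for y in range(top, bottom))
        return black >= black_ratio * (bottom - top)

    changed = True
    while changed and top < bottom and left < right:
        changed = True
        if row_is_border(top):
            top += 1
        elif row_is_border(bottom - 1):
            bottom -= 1
        elif col_is_border(left):
            left += 1
        elif col_is_border(right - 1):
            right -= 1
        else:
            changed = False
    return top, left, bottom, right


def crop(image, box):
    """Cut the (top, left, bottom, right) box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    scan = [
        [1, 1, 1, 1, 1, 1, 1],
        [1, 0, 0, 0, 0, 0, 1],
        [1, 0, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1],
    ]
    box = find_content_box(scan)
    print(box)
    for row in crop(scan, box):
        print(row)
```

This sketch assumes the border is straight and aligned with the image edges, so a skewed scan would need to be deskewed first.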
Pricing and ordering info
For more information about VeryPDF Layout Analysis Library, VeryPDF Deskew SDK, VeryPDF Despeckle SDK or VeryPDF Black Border Removal SDK, please feel free to contact us via the VeryPDF Ticket System. We will reply as soon as possible.