Page Layout Analysis for Scanned PDF and TIFF files. Generic layout analysis library or tool not based on OCR.

Layout analysis is a processing step of OCR that is important when recognizing complex documents with multiple columns, tables or embedded images. During layout analysis the OCR software examines the structure of the document, distinguishes between images and text, and tries to recognize the text flow of the document. Modern OCR software with good layout analysis can reproduce the document structure almost identically to the original and save it to a file (e.g. DOC, HTML or PDF).

Layout analysis, the division of page images into text blocks and lines and the determination of their reading order, is a major performance-limiting step in large-scale document digitization projects.
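The idea of dividing a page into lines and determining a reading order can be illustrated with a minimal Python sketch. It assumes the bounding boxes of the words are already known, that the page has exactly two columns, and that the caller supplies the x position where the columns split. The names `Box`, `group_into_lines` and `reading_order` are hypothetical and are not part of any particular product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """A word with its bounding box; y grows downward (assumed convention)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(boxes: List[Box], tolerance: float = 3.0) -> List[List[Box]]:
    """Group boxes whose top edge is within `tolerance` of the line's first box."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b.y0, b.x0)):
        if lines and abs(lines[-1][0].y0 - box.y0) <= tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    for line in lines:
        line.sort(key=lambda b: b.x0)  # left to right within a line
    return lines


def reading_order(boxes: List[Box], column_split: float) -> List[str]:
    """Read the left column top to bottom, then the right column."""
    left = [b for b in boxes if (b.x0 + b.x1) / 2 < column_split]
    right = [b for b in boxes if (b.x0 + b.x1) / 2 >= column_split]
    ordered: List[str] = []
    for column in (left, right):
        for line in group_into_lines(column):
            ordered.extend(b.text for b in line)
    return ordered


boxes = [
    Box("world", 60, 10, 90, 20),
    Box("Hello", 10, 10, 50, 20),
    Box("right", 210, 10, 250, 20),
    Box("again", 10, 30, 50, 40),
]
print(reading_order(boxes, column_split=200))
```

Real pages need more than this: the number of columns and their boundaries usually have to be detected, and skewed scans break the simple top-edge comparison.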

Document analysis, or more precisely document image analysis, is the process that performs the overall interpretation of document images. This process is the answer to the question, "How is everything that is known about language, document formatting, image processing and character recognition combined in order to deal with a particular application?" Thus document analysis is concerned with the global issues involved in the recognition of written language in images. It adds to OCR a superstructure that establishes the organization of the document and applies outside knowledge in interpreting it.

The process of determining document structure may be viewed as guided by a model, explicit or implicit, of the class of documents of interest. The model describes the physical appearance of the entities that make up the document and the relationships between them. OCR is often at the final level of this process, i.e., it provides a final encoding of the symbols contained in a logical entity such as a paragraph or table, once the latter has been isolated by other stages. However, it is important to realize that OCR can also participate in determining document layout. For example, as part of the process of extracting a newspaper article, the system may have to recognize the character string "continued on page 5" at the bottom of a page image in order to locate the entire text.
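The newspaper example can be sketched in a few lines of Python. This is only an illustration: it assumes the text at the bottom of the page has already been recognized, and the function name `continuation_page` and the exact wording of the pattern are assumptions.

```python
import re
from typing import Optional

# Matches phrases such as "continued on page 5", ignoring case and extra spaces.
CONTINUATION = re.compile(r"continued\s+on\s+page\s+(\d+)", re.IGNORECASE)


def continuation_page(recognized_text: str) -> Optional[int]:
    """Return the page number an article continues on, or None if not found."""
    match = CONTINUATION.search(recognized_text)
    return int(match.group(1)) if match else None


print(continuation_page("...the council voted. Continued on page 5"))
```

A layout system could use the returned page number to decide which page image to search for the rest of the article.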

In practice, then, a document analysis system performs the basic tasks of image segmentation, layout understanding, symbol recognition and application of contextual rules in an integrated manner. Current work in this area can be summarized under four main classes of applications.

Here is a question from a customer:

I am looking for layout analysis libraries or tools that can be applied on text PDFs to identify main text content versus sidebars, chapter headings, section headings (possibly even fancy ones having decorations/shading and underlines) etc. Are there libraries which can do the same WITHOUT OCR? It is possible to extract text and images from text PDFs and give an input that contains positions of text and images to the tool; using OCR for such files would be rather circuitous.
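The approach the customer describes, working from extracted text positions instead of OCR, can be illustrated with a rough Python sketch. It assumes the text spans, their font sizes and their horizontal extents have already been extracted from the PDF by some other tool. The `Span` type, the `classify` function and both thresholds are hypothetical, and this is not how VeryPDF Layout Analysis SDK works internally.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """A run of text with its font size and horizontal extent (assumed input)."""
    text: str
    font_size: float
    x0: float
    x1: float


def classify(spans: List[Span], page_width: float,
             heading_ratio: float = 1.2,
             sidebar_max_width: float = 0.3) -> List[str]:
    """Label each span as 'heading', 'sidebar' or 'body' with simple heuristics."""
    if not spans:
        return []
    # Assume the most common font size on the page is the body text size.
    body_size = Counter(round(s.font_size, 1) for s in spans).most_common(1)[0][0]
    labels = []
    for s in spans:
        if s.font_size >= body_size * heading_ratio:
            labels.append("heading")   # clearly larger than body text
        elif (s.x1 - s.x0) <= page_width * sidebar_max_width:
            labels.append("sidebar")   # narrow block of body-sized text
        else:
            labels.append("body")
    return labels


spans = [
    Span("Chapter 1", 18, 72, 300),
    Span("Main text, first line", 10, 72, 540),
    Span("Main text, second line", 10, 72, 540),
    Span("Side note", 10, 450, 540),
]
print(classify(spans, page_width=612))
```

The width test is deliberately crude. A short last line of a paragraph would also be labeled as a sidebar, so a practical tool would group spans into blocks first and would also look at decorations such as shading and underlines.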

VeryPDF Layout Analysis SDK is a page layout analysis SDK and library that analyzes pages without OCR processing. It can be downloaded from the following web page:

The following screenshot shows VeryPDF Layout Analysis SDK in use. As you can see, it recognizes text and image areas properly:


If you encounter any problem with VeryPDF Layout Analysis SDK, please feel free to let us know.

