

       SECTION 10.7                                                             Tagged PDF



       applications all have their own ideas of what constitutes a word. It is not
       important for a Tagged PDF document to identify the words within the text
       stream according to a single, unambiguous definition that satisfies all of
       these clients. What is important is that there be enough information available
       for each client to make that determination for itself.

       The consumer of a Tagged PDF document finds words by sequentially examining
       the Unicode character stream, perhaps augmented by replacement text specified
       with ActualText (see Section 10.8.3, “Replacement Text”). The consumer does not
       need to guess about word breaks based on information such as glyph positioning
       on the page, font changes, or glyph sizes. The main consideration is to ensure that
       the spacing characters that would be present to separate words in a pure text
       representation are also present in the Tagged PDF.

       Note that the identification of what constitutes a word is unrelated to how the text
       happens to be grouped into show strings. The division into show strings has no
       semantic significance. In particular, a space or other word-breaking character is
       still needed even if a word break happens to fall at the end of a show string.
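
       The point about show strings can be illustrated with a minimal Python sketch.
       It assumes the show strings have already been decoded to Unicode; the function
       name and the sample strings are hypothetical and are not part of any PDF
       library. The consumer concatenates the strings and only then looks for words,
       so a missing space at a show-string boundary fuses two words.

```python
def words_from_show_strings(show_strings):
    """Join decoded show strings into one character stream, then split on whitespace.

    The boundaries between show strings are deliberately ignored: only the
    characters in the stream decide where words begin and end.
    """
    text = "".join(show_strings)
    return text.split()


# A word break falls at the end of the first show string and the space is present.
with_space = ["Tagged ", "PDF docu", "ment"]

# Same division into show strings, but the producer omitted the space character.
without_space = ["Tagged", "PDF docu", "ment"]

print(words_from_show_strings(with_space))     # ['Tagged', 'PDF', 'document']
print(words_from_show_strings(without_space))  # ['TaggedPDF', 'document']
```

       In the second case nothing in the character stream tells the consumer where
       "Tagged" ends, which is why the spacing character must be written explicitly.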

       Note: Some applications may identify words by simply separating them at every
       space character. Others may be slightly more sophisticated and treat punctuation
       marks such as hyphens or em dashes as word separators as well. Still other
       applications may identify possible line-break opportunities by using an
       algorithm similar to the one in Unicode Standard Annex #29, Text Boundaries,
       available from the Unicode Consortium (see the Bibliography).
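
       As a rough illustration of the first two strategies in the note, the following
       Python sketch applies each to the same character stream. The function names,
       the choice of separator characters, and the sample text are assumptions made
       for this example; real consumers may use different rules, and the Unicode
       annex algorithm is not implemented here.

```python
import re


def split_on_spaces(text):
    """Simplest strategy: words are separated only by whitespace."""
    return text.split()


def split_on_spaces_and_dashes(text):
    """Also treat hyphen-minus, hyphen (U+2010), and em dash (U+2014) as separators."""
    parts = re.split(r"[\s\-\u2010\u2014]+", text)
    return [p for p in parts if p]  # drop empty strings at the edges


sample = "A well-known rule\u2014often ignored"

print(split_on_spaces(sample))
# ['A', 'well-known', 'rule\u2014often', 'ignored']
print(split_on_spaces_and_dashes(sample))
# ['A', 'well', 'known', 'rule', 'often', 'ignored']
```

       Both results come from the same character stream; the document only has to
       supply the characters, and each consumer applies its own definition of a word.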


10.7.2 Basic Layout Model

       Tagged PDF’s standard structure types and attributes are interpreted in the
       context of a basic layout model that describes the arrangement of structure
       elements on the page. This model is designed to capture the general intent of
       the document’s underlying structure and does not necessarily correspond to the
       one actually used for page layout by the application creating the document. (The PDF
       content stream specifies the exact appearance.) The goal is to provide sufficient
       information for Tagged PDF consumers to make their own layout decisions while
       preserving the authoring application’s intent as closely as their own layout models
       allow.

       Note: The Tagged PDF layout model resembles the ones used in markup languages
       such as HTML, CSS, XSL, and RTF, but does not correspond exactly to any of them.
