PDF Format Reference - Adobe Portable Document Format

SECTION 10.7

Tagged PDF

plications all have their own ideas of what constitutes a word. It is not important
for a Tagged PDF document to identify the words within the text stream according
to a single, unambiguous definition that satisfies all of these clients. What is
important is that there be enough information available for each client to make
that determination for itself.

The consumer of a Tagged PDF document finds words by sequentially examining
the Unicode character stream, perhaps augmented by replacement text specified
with ActualText (see Section 10.8.3, “Replacement Text”). The consumer does not
need to guess about word breaks based on information such as glyph positioning
on the page, font changes, or glyph sizes. The main consideration is to ensure that
the spacing characters that would be present to separate words in a pure text
representation are also present in the Tagged PDF.

Note that the identification of what constitutes a word is unrelated to how the text
happens to be grouped into show strings. The division into show strings has no
semantic significance. In particular, a space or other word-breaking character is
still needed even if a word break happens to fall at the end of a show string.
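The point about show strings can be illustrated with a minimal Python sketch. It assumes the show strings have already been decoded to Unicode text; the function name `words_from_show_strings` and the sample strings are hypothetical and are not part of the PDF specification. The sketch uses plain whitespace splitting, which is only one of the word definitions a consumer might apply.

```python
def words_from_show_strings(show_strings):
    """Join decoded show strings into one character stream, then split on whitespace.

    Show-string boundaries are ignored on purpose: they have no semantic
    significance, so only characters in the stream can separate words.
    """
    stream = "".join(show_strings)
    return stream.split()


# Hypothetical producer output: the word break falls at the end of the first
# show string, but no space character was written, so the words fuse.
missing_space = ["Tagged", "PDF documents"]

# Same grouping, but the space is present in the stream.
with_space = ["Tagged ", "PDF documents"]

print(words_from_show_strings(missing_space))  # ['TaggedPDF', 'documents']
print(words_from_show_strings(with_space))     # ['Tagged', 'PDF', 'documents']
```

The first call shows why the space is still required: a consumer that relies only on the character stream cannot tell that a show-string boundary was meant as a word break.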

Note: Some applications may identify words by simply separating them at every
space character. Others may be slightly more sophisticated and treat punctuation
marks such as hyphens or em dashes as word separators as well. Still other
applications may identify possible line-break opportunities by using an algorithm
similar to the one in Unicode Standard Annex #29, Text Boundaries, available from
the Unicode Consortium (see the Bibliography).
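The first two strategies in the note above can be compared with a short Python sketch. The function names and the sample text are hypothetical, and the separator set (space, hyphen-minus, and em dash U+2014) is an assumption chosen for illustration. A real consumer may use a different set or a full Unicode boundary algorithm, which is not shown here.

```python
import re


def split_on_spaces(text):
    """Simplest strategy: break only at space characters."""
    return [word for word in text.split(" ") if word]


# Assumed separator set: space, hyphen-minus, and em dash (U+2014).
_SEPARATORS = re.compile(r"[ \-\u2014]+")


def split_on_spaces_and_dashes(text):
    """Slightly more sophisticated: hyphens and em dashes also separate words."""
    return [word for word in _SEPARATORS.split(text) if word]


sample = "well-known layout\u2014or structure"

print(split_on_spaces(sample))             # 3 words; the dashes stay inside words
print(split_on_spaces_and_dashes(sample))  # ['well', 'known', 'layout', 'or', 'structure']
```

Both functions read the same character stream and reach different answers, which is the situation the text describes: the document supplies the characters, and each client applies its own definition of a word.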

10.7.2 Basic Layout Model

Tagged PDF’s standard structure types and attributes are interpreted in the
context of a basic layout model that describes the arrangement of structure elements
on the page. This model is designed to capture the general intent of the
document’s underlying structure and does not necessarily correspond to the one
actually used for page layout by the application creating the document. (The PDF
content stream specifies the exact appearance.) The goal is to provide sufficient
information for Tagged PDF consumers to make their own layout decisions while
preserving the authoring application’s intent as closely as their own layout models
allow.

Note: The Tagged PDF layout model resembles the ones used in markup languages
such as HTML, CSS, XSL, and RTF, but does not correspond exactly to any of them.