PDF Format Reference - Adobe Portable Document Format

CHAPTER 5

Text

The Unicode standard defines a system for numbering all of the common characters used in a large number of languages. It is a suitable scheme for representing the information content of text, but not its appearance, since Unicode values identify characters, not glyphs. For information about Unicode, see the Unicode Standard by the Unicode Consortium (see the Bibliography).

When extracting character content, a consumer application can easily convert text to Unicode values if a font’s characters are identified according to a standard character set that is known to the application. This character identification can occur if either the font uses a standard named encoding or the characters in the font are identified by standard character names or CIDs in a well-known collection. Section 5.9.1, “Mapping Character Codes to Unicode Values,” describes in detail the overall algorithm for mapping character codes to Unicode values.
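
As a rough illustration of the standard-named-encoding case, the character codes of a simple font can be converted with a lookup table that the application already knows. The sketch below assumes that Python's built-in `cp1252` and `mac_roman` codecs are close enough stand-ins for WinAnsiEncoding and MacRomanEncoding for demonstration purposes; they are not exact equivalents of the PDF encoding tables, and the name `codes_to_unicode` is hypothetical.

```python
# Approximate stand-ins for two PDF named encodings. This is an assumption:
# Python's codecs differ from the PDF encoding tables in a few code positions.
APPROX_CODECS = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}

def codes_to_unicode(codes: bytes, encoding_name: str) -> str:
    """Convert single-byte character codes to a Unicode string."""
    codec = APPROX_CODECS.get(encoding_name)
    if codec is None:
        # No standard table is known, so additional information is needed.
        raise ValueError(f"no known mapping for {encoding_name!r}")
    # Codes the codec leaves undefined become U+FFFD instead of raising.
    return codes.decode(codec, errors="replace")

win_text = codes_to_unicode(b"caf\xe9", "WinAnsiEncoding")
mac_text = codes_to_unicode(b"caf\x8e", "MacRomanEncoding")
```

The same character (é) has a different code in each encoding, which is why the consumer must know which named encoding the font uses before it can recover Unicode values.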

If a font is not defined in one of these ways, the glyphs can still be shown, but the characters cannot be converted to Unicode values without additional information:

• This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see Section 5.9.2, “ToUnicode CMaps”), whose value is a stream object containing a special kind of CMap file that maps character codes to Unicode values.

• An ActualText entry for a structure element or marked-content sequence (see Section 10.8.3, “Replacement Text”) can be used to specify the text content directly.
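
To make the ToUnicode option more concrete, here is a minimal sketch of reading `bfchar` entries from the text of a ToUnicode CMap. It assumes the stream has already been decoded to text, handles only `beginbfchar`/`endbfchar` blocks (not `bfrange`), and interprets each destination hex string as UTF-16BE. The function name `parse_bfchar` and the sample CMap fragment are illustrative and not taken from any real file.

```python
import re

# Matches one "<srcCode> <dstUnicode>" pair of hex strings.
_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
# Captures the body of each beginbfchar ... endbfchar block.
_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)

def parse_bfchar(cmap_text: str) -> dict:
    """Return {character code: Unicode string} for bfchar entries only."""
    mapping = {}
    for block in _BLOCK.findall(cmap_text):
        for src_hex, dst_hex in _PAIR.findall(block):
            # Assumption: the destination is a sequence of UTF-16BE code units.
            mapping[int(src_hex, 16)] = bytes.fromhex(dst_hex).decode("utf-16-be")
    return mapping

# Hypothetical CMap fragment: code 0x01 is "A", code 0x02 is an "fi" ligature.
sample = """
2 beginbfchar
<01> <0041>
<02> <00660069>
endbfchar
"""
to_unicode = parse_bfchar(sample)
```

In this sketch a single character code can map to more than one Unicode character, as with the ligature entry, which a plain code-to-code-point table could not express.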

5.9.1 Mapping Character Codes to Unicode Values

A consumer application can use the following methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, must provide at least one of these methods (see “Unicode Mapping in Tagged PDF” on page 892):

• If the font dictionary contains a ToUnicode CMap (see Section 5.9.2, “ToUnicode CMaps”), use that CMap to convert the character code to Unicode.

• If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from
