

                                              470
      CHAPTER 5                                                                       Text



      The Unicode standard defines a system for numbering all of the common
      characters used in a large number of languages. It is a suitable scheme for representing
      the information content of text, but not its appearance, since Unicode values
      identify characters, not glyphs. For information about Unicode, see the Unicode
      Standard by the Unicode Consortium (see the Bibliography).

      When extracting character content, a consumer application can easily convert
      text to Unicode values if a font’s characters are identified according to a standard
      character set that is known to the application. This character identification can
      occur if either the font uses a standard named encoding or the characters in the
      font are identified by standard character names or CIDs in a well-known
      collection. Section 5.9.1, “Mapping Character Codes to Unicode Values,” describes in
      detail the overall algorithm for mapping character codes to Unicode values.
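
      As a rough illustration of the "standard named encoding" case, the sketch
      below decodes single-byte character codes with Python's built-in codecs.
      The pairing of WinAnsiEncoding with cp1252 and of MacRomanEncoding with
      mac_roman is an assumption made for this example: the codecs are close
      stand-ins, not exact copies of the PDF encoding tables. The names
      STAND_IN_CODECS and decode_standard are hypothetical.

```python
# Assumption: these Python codecs only approximate the PDF predefined
# encodings; a real consumer would use the exact tables from the specification.
STAND_IN_CODECS = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}


def decode_standard(codes: bytes, encoding_name: str) -> str:
    """Decode single-byte character codes using a stand-in codec."""
    codec = STAND_IN_CODECS[encoding_name]
    # errors="replace" substitutes U+FFFD for codes the codec leaves undefined.
    return codes.decode(codec, errors="replace")
```

      The same character can have a different code in each encoding: "é" is
      code 0xE9 in the cp1252 stand-in and code 0x8E in mac_roman, so the
      encoding name must be known before the codes can be interpreted.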

      If a font is not defined in one of these ways, the glyphs can still be shown, but the
      characters cannot be converted to Unicode values without additional
      information:

      • This information can be provided as an optional ToUnicode entry in the font
        dictionary (PDF 1.2; see Section 5.9.2, “ToUnicode CMaps”), whose value is a
        stream object containing a special kind of CMap file that maps character codes
        to Unicode values.
      • An ActualText entry for a structure element or marked-content sequence (see
        Section 10.8.3, “Replacement Text”) can be used to specify the text content
        directly.
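
      The lookup described above can be sketched as follows. This is a minimal
      illustration, not the specification's algorithm: the ToUnicode CMap is
      modeled as a plain dictionary that has already been parsed, the
      glyph-name table is a three-entry excerpt invented for the example, and
      falling through to character names when a code is missing from the
      ToUnicode table is an assumption. All function and variable names are
      hypothetical.

```python
from typing import Dict, Optional

# Hypothetical excerpt of a glyph-name-to-Unicode table. A real consumer
# would load a complete list of standard character names.
GLYPH_NAME_TO_UNICODE: Dict[str, str] = {
    "A": "A",
    "space": " ",
    "eacute": "\u00e9",
}


def code_to_unicode(
    code: int,
    to_unicode: Optional[Dict[int, str]] = None,
    code_to_name: Optional[Dict[int, str]] = None,
) -> Optional[str]:
    """Return the Unicode text for a character code, or None if unknown."""
    # 1. Prefer an explicit ToUnicode-style table when the font supplies one.
    #    A single code may map to several Unicode characters (e.g. a ligature).
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    # 2. Otherwise go through standard character names (assumed fallback).
    if code_to_name is not None:
        name = code_to_name.get(code)
        if name in GLYPH_NAME_TO_UNICODE:
            return GLYPH_NAME_TO_UNICODE[name]
    # 3. No mapping: the glyph can still be shown, but its text is unknown.
    return None
```

      A return value of None corresponds to the case in which the text content
      would have to come from another source, such as an ActualText entry.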


5.9.1 Mapping Character Codes to Unicode Values

      A consumer application can use the following methods, in the priority given, to
      map a character code to a Unicode value. Tagged PDF documents, in particular,
      must provide at least one of these methods (see “Unicode Mapping in Tagged
      PDF” on page 892):

      • If the font dictionary contains a ToUnicode CMap (see Section 5.9.2,
        “ToUnicode CMaps”), use that CMap to convert the character code to Unicode.
      • If the font is a simple font that uses one of the predefined encodings
        MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an
        encoding whose Differences array includes only character names taken from
