

                                        471
SECTION 5.9                                                Extraction of Text Content



  the Adobe standard Latin character set and the set of named characters in the
  Symbol font (see Appendix D):
    1. Map the character code to a character name according to Table D.1 on
       page 996 and the font’s Differences array.
    2. Look up the character name in the Adobe Glyph List (see the Bibliography)
       to obtain the corresponding Unicode value.
• If the font is a composite font that uses one of the predefined CMaps listed in
  Table 5.15 on page 442 (except Identity-H and Identity-V) or whose descendant
  CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1
  character collection:
    1. Map the character code to a character identifier (CID) according to the
       font’s CMap.
    2. Obtain the registry and ordering of the character collection used by the
       font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo
       dictionary.
    3. Construct a second CMap name by concatenating the registry and ordering
       obtained in step 2 in the format registry-ordering-UCS2 (for example,
       Adobe-Japan1-UCS2).
    4. Obtain the CMap with the name constructed in step 3 (available from the
       Adobe Solutions Network (ASN) Web site; see the Bibliography).
    5. Map the CID obtained in step 1 according to the CMap obtained in step 4,
       producing a Unicode value.
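
The five steps for composite fonts can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, the CMaps are modeled as plain in-memory dictionaries rather than parsed CMap resources, and the sample code, CID, and Unicode values are made up for the example rather than taken from the real Adobe-Japan1 data.

```python
from typing import Mapping, Optional


def ucs2_cmap_name(registry: str, ordering: str) -> str:
    """Step 3: build the registry-ordering-UCS2 CMap name."""
    return f"{registry}-{ordering}-UCS2"


def composite_code_to_unicode(
    code: int,
    font_cmap: Mapping[int, int],
    cid_system_info: Mapping[str, object],
    ucs2_cmaps: Mapping[str, Mapping[int, int]],
) -> Optional[str]:
    """Return the Unicode character for a character code, or None on failure."""
    # Step 1: character code -> CID, using the font's CMap.
    cid = font_cmap.get(code)
    if cid is None:
        return None
    # Step 2: registry and ordering from the CIDSystemInfo dictionary.
    registry = str(cid_system_info["Registry"])
    ordering = str(cid_system_info["Ordering"])
    # Steps 3 and 4: construct the second CMap name and obtain that CMap.
    to_unicode = ucs2_cmaps.get(ucs2_cmap_name(registry, ordering))
    if to_unicode is None:
        return None
    # Step 5: CID -> Unicode value.
    value = to_unicode.get(cid)
    return chr(value) if value is not None else None


# Made-up sample data (assumption: not real Adobe-Japan1 mappings).
sample_font_cmap = {0x0001: 42}
sample_system_info = {"Registry": "Adobe", "Ordering": "Japan1", "Supplement": 4}
sample_ucs2_cmaps = {"Adobe-Japan1-UCS2": {42: 0x3042}}

found = composite_code_to_unicode(
    0x0001, sample_font_cmap, sample_system_info, sample_ucs2_cmaps
)
missing = composite_code_to_unicode(
    0x0002, sample_font_cmap, sample_system_info, sample_ucs2_cmaps
)
```

Returning `None` at each stage keeps the failure cases (unmapped code, unavailable CMap, unmapped CID) distinct from a successful lookup.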

Note: Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1,
Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the
CIDSystemInfo dictionary) must have a supplement number corresponding to the
version of PDF supported by the application. See Table 5.16 on page 446 for a list of
the character collections corresponding to a given PDF version. (Other supplements
of these character collections can be used, but if the supplement is higher-numbered
than the one corresponding to the supported PDF version, only the CIDs in the latter
supplement are considered to be standard CIDs.)
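
The parenthetical rule in the note can be read as a simple range check, sketched below. The sketch assumes that each supplement of a character collection defines a contiguous range of CIDs starting at 0 and that later supplements only add CIDs. The function name and the CID counts in `hypothetical_cid_counts` are placeholders for illustration, not the real figures for any Adobe character collection.

```python
from typing import Mapping


def is_standard_cid(
    cid: int,
    supported_supplement: int,
    cid_counts: Mapping[int, int],
) -> bool:
    """True if the CID lies within the supplement the application supports.

    cid_counts maps a supplement number to the number of CIDs it defines
    (assumption: CIDs run from 0 to count - 1).
    """
    return 0 <= cid < cid_counts[supported_supplement]


# Placeholder counts: supplement number -> total CIDs defined (made up).
hypothetical_cid_counts = {0: 100, 1: 150, 2: 200}

# A font declares supplement 2, but the application supports supplement 1:
# only CIDs within supplement 1 count as standard.
within_supported = is_standard_cid(120, 1, hypothetical_cid_counts)
beyond_supported = is_standard_cid(180, 1, hypothetical_cid_counts)
```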

If these methods fail to produce a Unicode value, there is no way to determine
what the character code represents.
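
For the simple-font case (map the character code to a character name, then look the name up in the Adobe Glyph List), a lookup might be sketched as below, returning `None` when no Unicode value can be determined. The function names are hypothetical, and the tables are tiny excerpts chosen for illustration: `sample_base_encoding` stands in for the encoding of Table D.1, and `sample_glyph_list` for the full Adobe Glyph List. The sketch relies on the Differences array layout of an integer code followed by the names for consecutive codes starting at that code.

```python
from typing import Dict, Mapping, Optional, Sequence, Union


def apply_differences(
    base_encoding: Mapping[int, str],
    differences: Sequence[Union[int, str]],
) -> Dict[int, str]:
    """Overlay a Differences array on a base code-to-name encoding."""
    encoding = dict(base_encoding)
    code: Optional[int] = None
    for item in differences:
        if isinstance(item, int):
            code = item  # start of a new run of consecutive codes
        elif code is None:
            raise ValueError("Differences array must begin with a code")
        else:
            encoding[code] = item
            code += 1
    return encoding


def simple_code_to_unicode(
    code: int,
    base_encoding: Mapping[int, str],
    differences: Sequence[Union[int, str]],
    glyph_list: Mapping[str, int],
) -> Optional[str]:
    """Code -> character name -> Unicode; None if either lookup fails."""
    name = apply_differences(base_encoding, differences).get(code)
    if name is None:
        return None
    value = glyph_list.get(name)
    return chr(value) if value is not None else None


# Tiny illustrative excerpts, not the full tables.
sample_base_encoding = {65: "A", 66: "B"}
sample_differences = [66, "Euro", "bullet"]  # redefines 66, defines 67
sample_glyph_list = {"A": 0x0041, "Euro": 0x20AC, "bullet": 0x2022}

decoded = [
    simple_code_to_unicode(
        c, sample_base_encoding, sample_differences, sample_glyph_list
    )
    for c in (65, 66, 67, 68)
]
```

An application would typically try the applicable method for the font type and treat a `None` result as the case in which the character code cannot be interpreted.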
