

SECTION 3.8                                                   Common Data Structures



using PDFDocEncoding with the two characters thorn ydieresis, which is unlikely
to be a meaningful beginning of a word or phrase).

Note: Applications that process PDF files containing Unicode text strings should be
prepared to handle supplementary characters; that is, characters requiring more
than two bytes to represent.
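As an illustration of the note above, the following minimal Python sketch shows a supplementary character occupying four bytes (a surrogate pair) in UTF-16BE. The helper name utf16be_code_units is hypothetical and is not defined by this specification.

```python
def utf16be_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of `text` when encoded as UTF-16BE."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+0041 lies in the Basic Multilingual Plane: one code unit, two bytes.
bmp_units = utf16be_code_units("A")

# U+1D11E (MUSICAL SYMBOL G CLEF) is a supplementary character:
# two code units (a high and a low surrogate), four bytes.
clef_units = utf16be_code_units("\U0001D11E")

# Decoding must treat the surrogate pair as a single character.
round_trip = "\U0001D11E".encode("utf-16-be").decode("utf-16-be")
```

An application that assumes exactly two bytes per character would split such a pair and corrupt the text, which is the situation the note warns against.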

An escape sequence may appear anywhere in a Unicode text string to indicate the
language in which subsequent text is written, which is useful when the language
cannot be determined from the character codes used in the text. The escape
sequence consists of the following elements, in order:

1. The Unicode value U+001B (that is, the byte sequence 0 followed by 27; in hexadecimal, 00 1B).
2. A 2-character ISO 639 language code—for example, en for English or ja for
   Japanese. Character in this context means byte (as in ASCII character), not
   Unicode character.
3. (Optional) A 2-character ISO 3166 country code—for example, US for the
   United States or JP for Japan.
4. The Unicode value U+001B.
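The four elements above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names language_escape and split_leading_escape are hypothetical, and the parser handles only an escape sequence at the very start of the data, although the text permits one anywhere in the string.

```python
ESC_BYTES = b"\x00\x1b"  # U+001B as two bytes, high-order byte first


def language_escape(language: str, country: str = "") -> bytes:
    """Build ESC + 2-byte language code [+ 2-byte country code] + ESC."""
    for code in (language, country):
        if code and not (len(code) == 2 and code.isascii() and code.isalpha()):
            raise ValueError(f"expected a 2-letter ASCII code, got {code!r}")
    if not language:
        raise ValueError("a language code is required")
    # The codes are written one byte per character, not as UTF-16 code units.
    return ESC_BYTES + (language + country).encode("ascii") + ESC_BYTES


def split_leading_escape(data: bytes) -> tuple[str, bytes]:
    """Return (codes, remainder) if data starts with an escape, else ("", data)."""
    if data.startswith(ESC_BYTES):
        for end in (4, 6):  # language only, or language plus country
            if data[end:end + 2] == ESC_BYTES:
                return data[2:end].decode("ascii"), data[end + 2:]
    return "", data


# Usage: mark the text that follows as US English.
body = language_escape("en", "US") + "Hello".encode("utf-16-be")
codes, rest = split_leading_escape(body)
```

Because each code is two bytes long, the escape sequence always has an even length, so the 16-bit units that follow it stay aligned.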

The complete list of codes defined by ISO 639 and ISO 3166 can be obtained
from the International Organization for Standardization (see the Bibliography).


PDFDocEncoded String Type

A PDFDocEncoded string is similar to a string object, but it is a character string
where characters are represented in a single byte using PDFDocEncoding. Note
that PDFDocEncoding does not support all Unicode characters, whereas UTF-16BE
does.

Note: This type is not a true type. Rather, it is a string type that represents data
encoded using a specific convention.
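The difference between the two conventions can be sketched in Python. The sketch assumes that a Unicode text string is marked by the leading bytes 254 and 255 (FE FF), and it uses Latin-1 as a stand-in for PDFDocEncoding. That substitution is an assumption made only for illustration: the two encodings agree for printable ASCII and for most codes above 160, but not for every code, so a real implementation would need the full PDFDocEncoding table. The function name decode_text is hypothetical.

```python
UNICODE_MARKER = b"\xfe\xff"  # bytes 254, 255 at the start of a Unicode text string


def decode_text(data: bytes) -> str:
    """Decode a text string as UTF-16BE if marked, else as single-byte text."""
    if data.startswith(UNICODE_MARKER):
        return data[len(UNICODE_MARKER):].decode("utf-16-be")
    # ASSUMPTION: Latin-1 approximates PDFDocEncoding here; it is not identical.
    return data.decode("latin-1")


single_byte = decode_text(b"Title")            # one byte per character
unicode_text = decode_text(b"\xfe\xff\x30\x42")  # U+3042, not representable in one byte

# Read as single-byte text, the marker itself would be thorn followed by ydieresis.
marker_as_text = UNICODE_MARKER.decode("latin-1")
```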


Byte String Type

The byte string type is used for binary data represented as a series of 8-bit bytes,
where each byte can be any value representable in 8 bits. The string may
