

CHAPTER 10                                                     Document Interchange



Language identifiers can be based on codes defined by the International
Organization for Standardization in ISO 639 and ISO 3166 (see the Bibliography) or
registered with the Internet Assigned Numbers Authority (IANA, whose Web site is
located at <http://iana.org/>), or they can include codes created for private use. A
language identifier consists of a primary code optionally followed by one or more
subcodes (each preceded by a hyphen). The primary code can be any of the
following:

• A 2-character ISO 639 language code—for example, en for English or es for
  Spanish
• The letter i, designating an IANA-registered identifier
• The letter x, for private use

The first subcode can be a 2-character ISO 3166 country code, as in en-US, or a
3- to 8-character subcode registered with IANA, as in en-cockney or i-cherokee
(except in private identifiers, for which subcodes are not registered). Subcodes
beyond the first can be any that have been registered with IANA.

Although language codes are commonly represented using lowercase letters and
country codes are commonly represented using uppercase letters, all tags must be
treated as case insensitive.
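
The rules above can be sketched in Python as a small shape check. This is only an illustration under stated assumptions: the function names are hypothetical, subcodes are assumed to consist of ASCII letters (the text above does not say so), and nothing here verifies that a code is actually defined by ISO 639 or ISO 3166 or registered with IANA.

```python
import re

# Assumption: primary codes and subcodes consist of ASCII letters only.
_LETTERS = re.compile(r"[a-z]+")


def parse_language_identifier(tag):
    """Split a language identifier into (primary, subcodes), lowercased.

    Checks only the shape described in the text; it does not look codes up
    in ISO 639, ISO 3166, or the IANA registry.
    """
    # Tags are case insensitive, so normalize before inspecting them.
    parts = tag.lower().split("-")
    primary, subcodes = parts[0], parts[1:]

    # Primary code: a 2-character language code, "i" (IANA), or "x" (private).
    if not _LETTERS.fullmatch(primary) or not (
        len(primary) == 2 or primary in ("i", "x")
    ):
        raise ValueError(f"invalid primary code in {tag!r}")

    for index, subcode in enumerate(subcodes):
        if not _LETTERS.fullmatch(subcode):
            raise ValueError(f"invalid subcode in {tag!r}")
        # First subcode: 2 characters (country code) or 3 to 8 characters
        # (registered subcode), except in private-use identifiers.
        if index == 0 and primary != "x" and not 2 <= len(subcode) <= 8:
            raise ValueError(f"invalid first subcode in {tag!r}")

    return primary, subcodes


def same_language_identifier(a, b):
    """Compare two identifiers without regard to letter case."""
    return parse_language_identifier(a) == parse_language_identifier(b)
```

With this sketch, en-US and EN-us compare as equal, and a tag such as eng is rejected because its primary code is neither 2 characters long nor one of the letters i and x.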


Language Specification Hierarchy

The Lang entry in the document catalog specifies the natural language for all text
in the document except where overridden by language specifications for
structure elements or for marked-content sequences that are not in the structure
hierarchy (for example, within an entirely unstructured document). Examples in this
section illustrate the hierarchical manner in which the language for text in a
document is determined.
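
As a rough illustration of this override behavior (not one of the numbered examples of this section), the following Python sketch resolves the language in effect for a piece of text. The function name and the list-of-scopes representation are hypothetical, and the sketch assumes that the innermost enclosing specification takes precedence.

```python
def effective_language(document_lang, overrides=()):
    """Return the language in effect for a piece of text.

    document_lang: value of the Lang entry in the document catalog,
        or None if the entry is absent.
    overrides: Lang values of the scopes enclosing the text (structure
        elements or marked-content sequences), outermost first; None for
        a scope that specifies no language.

    Assumption: the innermost scope that specifies a language wins.
    """
    lang = document_lang
    for override in overrides:
        if override is not None:
            lang = override  # a more deeply nested specification overrides
    return lang


# Text with no enclosing language specification inherits the document's.
document_default = effective_language("en-US")

# Text inside a scope that carries its own Lang value uses that value.
overridden = effective_language("en-US", [None, "es-MX"])
```

Here document_default is en-US and overridden is es-MX; a scope whose value is None leaves the inherited language unchanged.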

Example 10.19 shows how a language specified for the document as a whole
could be overridden by one specified for a marked-content sequence within a
page’s content stream, independent of any logical structure. In this case, the Lang
entry in the document catalog (not shown) has the value en-US, meaning U.S.
English, and it is overridden by the Lang property attached (with the Span tag) to
the marked-content sequence Hasta la vista. The Lang property identifies the
language for this marked-content sequence with the value es-MX, meaning Mexican
Spanish.
