

                                                     947
         SECTION 10.9                                                                 Web Capture




                       TABLE 10.37 Entries in the Web Capture information dictionary
KEY   TYPE            VALUE

V     number          (Required) The Web Capture version number. For PDF 1.3, the version number is 1.0.
                      Note: This value is a single real number, not a major and minor version number. Thus, for
                      example, a version number of 1.2 would be considered greater than 1.15.

C     array           (Optional) An array of indirect references to Web Capture command dictionaries (see
                      “Command Dictionaries” on page 957) describing commands that were used in building
                      the PDF file. The commands appear in the array in the order in which they were executed
                      in building the file.
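
         The note on the V entry can be illustrated with a short comparison. This is
         a minimal sketch, not part of the specification; the function names are
         hypothetical. It contrasts treating the version as a single real number (the
         rule stated in the note) with a major/minor interpretation, which gives the
         opposite answer for 1.2 and 1.15.

```python
def compare_as_real(a: str, b: str) -> int:
    """Compare two version numbers as single real numbers.

    Returns 1 if a > b, -1 if a < b, 0 if they are equal.
    """
    x, y = float(a), float(b)
    return (x > y) - (x < y)


def compare_as_major_minor(a: str, b: str) -> int:
    """Compare as (major, minor) integer pairs. Shown for contrast only;
    this is NOT how the V entry is interpreted."""
    pa = tuple(int(part) for part in a.split("."))
    pb = tuple(int(part) for part in b.split("."))
    return (pa > pb) - (pa < pb)


print(compare_as_real("1.2", "1.15"))         # 1: 1.2 is greater than 1.15
print(compare_as_major_minor("1.2", "1.15"))  # -1: minor 2 is less than minor 15
```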

10.9.2 Content Database
         Web Capture retrieves HTML files from URLs and converts them to PDF. The
         resulting PDF file may contain the contents of multiple HTML pages. Conversely,
         since HTML pages do not have a fixed size, a single HTML page may give rise to
         multiple PDF pages. To keep track of the correspondences, Web Capture
         maintains a content database that maps URLs and digital identifiers to PDF
         objects such as pages and XObjects. By looking up digital identifiers in the
         database, Web Capture can determine whether newly downloaded content is
         identical to content already retrieved from a different URL. Thus, it can perform
         optimizations such as storing only one copy of an image that is referenced by
         multiple HTML pages.
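
         The single-copy optimization can be sketched as follows. This is an
         illustration only: the function name store_images and the example URLs are
         hypothetical, and an MD5 digest of the downloaded bytes is assumed to stand
         in for the digital identifier.

```python
import hashlib


def store_images(downloads):
    """Keep one stored copy per distinct piece of content.

    downloads: iterable of (url, data) pairs.
    Returns (stored, url_to_id).
    """
    stored = {}     # digital identifier -> the single stored copy
    url_to_id = {}  # URL -> digital identifier of its content
    for url, data in downloads:
        # Assumption: the identifier is an MD5 digest of the source data.
        ident = hashlib.md5(data).hexdigest()
        stored.setdefault(ident, data)  # store only if not already present
        url_to_id[url] = ident
    return stored, url_to_id


logo = b"GIF89a...same bytes..."
stored, url_to_id = store_images([
    ("http://example.com/a/logo.gif", logo),
    ("http://example.com/b/logo.gif", logo),
    ("http://example.com/other.gif", b"GIF89a...different..."),
])
print(len(url_to_id), "URLs,", len(stored), "stored copies")
# prints: 3 URLs, 2 stored copies
```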

         Web Capture’s content database is organized into content sets. Each content set is
         a dictionary holding information about a group of related PDF objects generated
         from the same source data. Content sets are of two subtypes: page sets and image
         sets. When Web Capture converts an HTML file to PDF pages, for example, it
         creates a page set to hold information about the pages. Similarly, when it converts
         a GIF image to one or more image XObjects, it creates an image set describing
         those XObjects.

         The content set corresponding to a given data source can be accessed in either of
         two ways:
         • By the URLs from which it was retrieved
         • By a digital identifier generated from the source data itself (see “Digital
              Identifiers” on page 950)
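
         The two access paths can be modeled as two indexes over the same content
         sets. This is a rough sketch under stated assumptions: the class and field
         names are hypothetical, plain Python dictionaries stand in for whatever
         lookup structure the file actually uses, and the identifier bytes are
         placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ContentSet:
    subtype: str          # "page set" or "image set"
    digital_id: bytes     # identifier generated from the source data
    urls: list = field(default_factory=list)     # URLs the data came from
    objects: list = field(default_factory=list)  # generated PDF objects


class ContentDatabase:
    """Hypothetical in-memory model of the content database."""

    def __init__(self):
        self.urls = {}  # URL -> content set
        self.ids = {}   # digital identifier -> content set

    def register(self, content_set: ContentSet) -> None:
        self.ids[content_set.digital_id] = content_set
        for url in content_set.urls:
            self.urls[url] = content_set

    def by_url(self, url: str):
        return self.urls.get(url)

    def by_id(self, digital_id: bytes):
        return self.ids.get(digital_id)


db = ContentDatabase()
page_set = ContentSet(
    subtype="page set",
    digital_id=b"\x01\x02\x03",  # placeholder identifier
    urls=["http://example.com/index.html"],
    objects=["page 1", "page 2"],
)
db.register(page_set)

# Both routes lead to the same content set.
print(db.by_url("http://example.com/index.html") is db.by_id(b"\x01\x02\x03"))
```

         Keeping both indexes pointed at one shared object means that content
         retrieved from several URLs is still described by a single content set.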

         The URLS and IDS entries in a PDF document’s name dictionary (see Section 3.6.3,
         “Name Dictionary”) contain name trees mapping URLs and digital identifiers,
         respectively, to Web Capture content sets. Figure 10.1 shows a simple example. An
