Dear sir or madam,
I am already using VeryDOC command line versions of PDF Compressor and PDF2Image and I am very satisfied.
I have two questions regarding your Image to PDF command line converter:
1) Can it create PDF/A compliant output?
2) Can I specify that the created PDFs have a certain size (e.g. DIN-A4) and that the images used for creation are resized accordingly?
Thanks for a reply and sincerely yours.
Thanks for your message. VeryPDF Image to PDF OCR Converter Command Line supports both functions. You can download VeryPDF Image to PDF OCR Converter Command Line from this web page and try it.
After you download it, you can run the following command line to convert a TIFF file to a PDF/A file with A4 paper size (595x842pt):
img2pdfnew.exe -pdfa -width 595 -height 842 bw.tif _tif2pdfa.pdf
You can also change the paper size easily with the -width and -height options.
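The 595x842pt figure follows from the A4 dimensions (210 x 297 mm) and the PDF convention of 72 points per inch. The sketch below shows that arithmetic and one plausible way to scale an image onto the page while keeping its aspect ratio. The function names are illustrative, and how img2pdfnew.exe itself resizes images is not stated here, so the scaling rule is an assumption.

```python
# Sketch: derive the A4 page size in PDF points and fit an image onto it.
# Assumes 1 pt = 1/72 inch and 1 inch = 25.4 mm; all names are illustrative.

MM_PER_INCH = 25.4
POINTS_PER_INCH = 72


def mm_to_points(mm):
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def fit_to_page(img_w, img_h, page_w, page_h):
    """Scale an image to fit inside the page, keeping its aspect ratio."""
    scale = min(page_w / img_w, page_h / img_h)
    return img_w * scale, img_h * scale


a4_width = round(mm_to_points(210))   # 595
a4_height = round(mm_to_points(297))  # 842

# A 1700 x 2200 pixel scan is limited by the page width, not the height.
fitted_w, fitted_h = fit_to_page(1700, 2200, a4_width, a4_height)
print(a4_width, a4_height, fitted_w, fitted_h)
```

Other paper sizes can be converted the same way and passed to -width and -height.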
Scan to PDF/A – some insights
Traditionally, a scanner produces a TIFF or JPEG image for each page. Some scanners can produce PDF files directly, and newer devices produce files conforming to the PDF/A standard. However, the quality of the produced files differs significantly. Why is this, and why is it worth using a central scan server?
Of course, the scan to PDF conversion process is not just about embedding an image in a PDF envelope. It can involve text and barcode recognition, embedding of metadata and digital signatures too. But in this article I'd like to concentrate on image data compression which is marketed as a main advantage of PDF/A over TIFF. It is said that PDF/A is better because it offers more advanced compression mechanisms than TIFF. So, let us have a closer look at this particular topic.
One of the main requirements in the scan to PDF/A conversion process is to reduce the file size. A smaller size is often achieved at the price of lower quality. Several factors influence the quality / size ratio:
* Color vs. Gray vs. Black / White
* Choice of compression algorithm (lossless vs. lossy)
* Multi-page vs. single page
* MRC (Mixed Raster Content) mechanism
The most widely used bi-tonal (black and white) compression algorithms are G4 (standard name ITU-T T.6) and JBIG2. G4 is lossless, whereas JBIG2 can be operated in lossless and lossy mode. To achieve a better compression rate, lossy JBIG2 may store symbols such as text characters in a table and reuse them. Using the symbol table can save a significant amount of space, especially in multi-page documents, since the JBIG2 symbol table can be shared by all pages. The downside of this mechanism is that it may unexpectedly mix up some symbols. That is why the lossy mode of JBIG2 is often disabled. But even in lossless mode, JBIG2 generally has a better compression rate than G4.
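The trade-off of a shared symbol table can be illustrated with a toy model. This is not the JBIG2 codec: glyphs are tiny bitmaps written as strings, and the max_diff threshold is a made-up parameter that stands in for the "similar enough" decision a lossy encoder makes.

```python
# Toy model of a symbol table shared across pages; NOT the real JBIG2 codec.
# Glyphs are 3x3 bitmaps written as strings of '0'/'1'; max_diff is hypothetical.

def hamming(a, b):
    """Number of differing pixels between two equally sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def encode(pages, max_diff=0):
    """Replace each glyph by an index into one table shared by all pages.

    max_diff=0 reuses only identical bitmaps (lossless);
    max_diff>0 also reuses merely similar bitmaps (lossy).
    """
    table, encoded = [], []
    for page in pages:
        refs = []
        for glyph in page:
            for index, symbol in enumerate(table):
                if len(symbol) == len(glyph) and hamming(symbol, glyph) <= max_diff:
                    refs.append(index)
                    break
            else:
                # No acceptable match: store the glyph as a new symbol.
                table.append(glyph)
                refs.append(len(table) - 1)
        encoded.append(refs)
    return table, encoded


def decode(table, encoded):
    """Rebuild the pages from the table and the per-page index lists."""
    return [[table[i] for i in refs] for refs in encoded]


GLYPH_C = "111100111"  # differs from GLYPH_E by a single pixel
GLYPH_E = "111110111"
pages = [[GLYPH_C, GLYPH_E, GLYPH_C], [GLYPH_E, GLYPH_E, GLYPH_C]]

table_lossless, enc_lossless = encode(pages, max_diff=0)
table_lossy, enc_lossy = encode(pages, max_diff=1)

print(len(table_lossless), decode(table_lossless, enc_lossless) == pages)
print(len(table_lossy), decode(table_lossy, enc_lossy) == pages)
```

With the tolerant threshold the table shrinks to a single symbol, but every "e" is decoded as a "c", which is the kind of silent substitution that leads many users to disable lossy mode.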
In VeryPDF Image to PDF OCR Converter Command Line software, you can run the following command line to convert a black and white TIFF file to a PDF file with JBIG2 compression:
img2pdfnew.exe -bwimg 2 bw-pdf-g4.pdf bw-pdf-jbig2.pdf
The "-bwimg 2" option compresses black and white TIFF files with JBIG2 compression.
For gray and color images, the most commonly used algorithms are JPEG and JPEG2000. JPEG can only be used in lossy mode, whereas JPEG2000 can again be used in both modes. In lossy mode, both algorithms offer a parameter that controls the quality / size ratio. Although JPEG2000 is more modern, it cannot be said to be 'better' than JPEG. Measurements show that JPEG2000 has better compression rates at higher quality settings, whereas JPEG is generally better at lower quality settings. The quality loss introduces image artifacts such as shadows, which are typical of both algorithms. JPEG has an additional artifact called blocking. It originates from the subdivision of the image into 8 x 8 pixel blocks, which are compressed independently. In addition, the JPEG algorithm usually reduces the resolution of the chrominance signal by a factor of 2 with respect to the luminance signal, which increases the compression rate but amplifies the blocking artifacts.
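A little arithmetic shows why color down-sampling both saves space and coarsens the block grid. The sketch below assumes the color signal is halved in both directions (often written 4:2:0). The text above only says "by 2", so that reading is an assumption, as are the function and key names.

```python
import math

# Sketch: sample and block counts for a JPEG-style layout.
# Assumes one brightness channel plus two color channels, with the color
# channels down-sampled by chroma_factor in both directions.

def jpeg_layout(width, height, chroma_factor=2, block=8):
    """Return sample counts, block counts and the image area per color block."""
    chroma_w = math.ceil(width / chroma_factor)
    chroma_h = math.ceil(height / chroma_factor)

    def blocks(w, h):
        # Partial blocks at the right and bottom edges still count as blocks.
        return math.ceil(w / block) * math.ceil(h / block)

    return {
        "luma_samples": width * height,
        "chroma_samples": 2 * chroma_w * chroma_h,
        "luma_blocks": blocks(width, height),
        "chroma_blocks": 2 * blocks(chroma_w, chroma_h),
        # Edge length, in image pixels, covered by one color block.
        "chroma_block_span": block * chroma_factor,
    }


subsampled = jpeg_layout(640, 480, chroma_factor=2)
full = jpeg_layout(640, 480, chroma_factor=1)
print(subsampled)
print(full)
```

With down-sampling, each color block spans 16 x 16 image pixels instead of 8 x 8, so block boundaries in the color channels become larger and easier to see, while the total number of samples to encode is halved.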
In VeryPDF Image to PDF OCR Converter Command Line software, you can run the following command lines to convert a color or grayscale image file to a PDF file with JPEG2000 compression:
img2pdfnew.exe -dpi 100 -quality 70 -colorimg 2 color.jpg color-jpeg2000-q70.pdf
img2pdfnew.exe -dpi 100 -quality 50 -colorimg 2 color.jpg color-jpeg2000-q50.pdf
The "-colorimg 2" option compresses color and grayscale image files (JPEG, PNG or other image formats) to PDF files with JPEG2000 compression.
When converting color scans to PDF, some sort of mixed raster content mechanism is often used. MRC separates the color information into layers: a background layer, a mask layer and a number of foreground layers. A typical example is a page that contains black text with some words emphasized in red and blue. The mask would then contain the shapes of the characters and the background layer the color of the text. The mask can be compressed efficiently with G4 or JBIG2, and the background layer with either JPEG or JPEG2000 at a very low resolution. With this mechanism, a scanned page can be reduced to approximately 40 KB with good quality. This result cannot be achieved by just using a lossy compression algorithm. However, if the page contains graphics or images, these have to be isolated and compressed with good quality in one or several foreground layers. This isolation process is called segmentation, and it is an essential part of the MRC mechanism.
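A rough calculation of uncompressed layer sizes shows why the separation pays off even before G4, JBIG2 or JPEG is applied. The resolutions used here (300 dpi for the full page and the mask, 50 dpi for the background) are illustrative assumptions, not values from any particular product, and the function name is hypothetical.

```python
import math

# Sketch: raw (uncompressed) size of page layers for an A4 page of 595 x 842 pt.
# The 300 dpi and 50 dpi resolutions are illustrative assumptions.

def raw_bytes(width_pt, height_pt, dpi, bits_per_pixel):
    """Raw raster size in bytes, with each pixel row padded to a whole byte."""
    width_px = round(width_pt / 72 * dpi)
    height_px = round(height_pt / 72 * dpi)
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px


full_color = raw_bytes(595, 842, dpi=300, bits_per_pixel=24)  # single color image
mask = raw_bytes(595, 842, dpi=300, bits_per_pixel=1)         # bi-tonal character shapes
background = raw_bytes(595, 842, dpi=50, bits_per_pixel=24)   # low-resolution color layer

print(full_color, mask, background)
print(full_color / (mask + background))
```

The mask keeps the sharp character edges at full resolution but needs only one bit per pixel, while the color layer can be stored at a fraction of the resolution. The layer-specific compressors then start from far less data than a single full-resolution color image.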
With VeryPDF Image to PDF OCR Converter Command Line software, you can run the following command lines to convert scanned TIFF files to searchable PDF files:
REM Convert scanned PDF, TIFF and Image files to plain text files and searchable PDF files:
img2pdfnew.exe -ocr 1 -tsocr -pidpi 200 -ocrtxt "%CD%\_ocrtxt_1.txt" "%CD%\test-color.tif" "%CD%\_test-color.pdf"
img2pdfnew.exe -ocr 1 -tsocr -tsocrlang deu -threshold 125 -ocrtxt "%CD%\_ocrtxt_2.txt" "%CD%\_test-color.pdf" "%CD%\_test-bw-out.pdf"
REM Convert scanned PDF, TIFF and Image files to plain text based PDF files:
img2pdfnew.exe -ocr 1 -tsocr -pidpi 200 -plaintextpdf "%CD%\test-color.tif" "%CD%\_test-color-text-only-1.pdf"
img2pdfnew.exe -ocr 1 -tsocr -pidpi 200 -plaintextpdf "%CD%\_test-color.pdf" "%CD%\_test-color-text-only-2.pdf"
REM Convert scanned PDF, TIFF and Image files to OCRed PDF file (BW) with hidden text layer:
img2pdfnew.exe -ocr 1 -tsocr -tsocrlang deu -threshold 125 "%CD%\test-color.tif" "%CD%\_test-color-bw-1.pdf"
img2pdfnew.exe -ocr 1 -tsocr -tsocrlang deu -threshold 125 -bitcount 8 "%CD%\_test-color.pdf" "%CD%\_test-color-bw-2.pdf"
img2pdfnew.exe -ocr 1 -tsocr -pidpi 200 -bitcount 1 "%CD%\_test-color.pdf" "%CD%\_test-color-bw-3.pdf"
REM Convert scanned PDF, TIFF and Image files to OCRed PDF file (Grayscale) with hidden text layer:
img2pdfnew.exe -ocr 1 -tsocr -pidpi 200 -bitcount 8 -grayscale "%CD%\test-color.tif" "%CD%\_test-color-grayscale-1.pdf"
img2pdfnew.exe -ocr 1 -tsocr -pidpi 200 -bitcount 8 -grayscale "%CD%\_test-color.pdf" "%CD%\_test-color-grayscale-2.pdf"
REM Convert scanned PDF, TIFF and Image files to OCRed PDF file (Color) with hidden text layer:
img2pdfnew.exe -ocr 1 -tsocr -pidpi 200 "%CD%\test-color.tif" "%CD%\_test-color-1.pdf"
img2pdfnew.exe -ocr 1 -tsocr -pidpi 200 -bitcount 24 "%CD%\_test-color.pdf" "%CD%\_test-color-2.pdf"
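To apply one of these command lines to a whole folder of scans, the argument list can be generated per file. The sketch below only builds the commands and does not run them. It reuses the flags exactly as they appear in the color OCR example above, without relying on vendor documentation for their meaning. The function names, the "-ocr.pdf" suffix and the folder layout are hypothetical.

```python
from pathlib import Path

# Sketch: build one OCR command per TIFF file in a folder.
# Flags (-ocr 1 -tsocr -pidpi 200) are copied from the examples above;
# the helper names and the output naming scheme are hypothetical.

def build_ocr_command(source, out_dir, exe="img2pdfnew.exe", dpi=200):
    """Return the argument list for converting one scan to a searchable PDF."""
    source = Path(source)
    target = Path(out_dir) / (source.stem + "-ocr.pdf")
    return [exe, "-ocr", "1", "-tsocr", "-pidpi", str(dpi), str(source), str(target)]


def build_batch(folder, out_dir, pattern="*.tif"):
    """Return the commands for every matching file, in sorted order."""
    return [build_ocr_command(path, out_dir) for path in sorted(Path(folder).glob(pattern))]


example = build_ocr_command("test-color.tif", "out")
print(" ".join(example))
```

Each list can be passed to subprocess.run(command, check=True) on a machine where img2pdfnew.exe is installed. Building the list first makes it easy to review the exact commands before anything is executed.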
Now, after reviewing the various compression schemes, it is time to discuss them in the context of archiving systems. File size is often the most important issue, but not always. In many scenarios, display speed is a crucial issue. With respect to this requirement, JPEG2000 has often proved too slow, especially if it is combined with an MRC mechanism. As we learned, JPEG is better at higher compression rates, so why not use it at least for the background layer? The disturbing blocking artifacts can be reduced by disabling the down-sampling of the chrominance signal. A bigger problem is that scanners deliver color images in JPEG compression only. This significantly reduces the power of server-based compression software, because the JPEG compression introduces artifacts which make segmentation and MRC compression much more difficult. But why not use the scanner's built-in image to PDF conversion feature? This may be useful in a personal environment, but in enterprise applications there are many reasons to use a central server. The most important are better quality, smaller file sizes, better OCR quality, post-scan processing steps and many more.
And, last but not least: is PDF/A better than TIFF? The answer is definitely yes, but not with respect to compression. TIFF offers essentially the same compression algorithms as PDF/A does. The real strength of PDF/A is that it provides for the embedding of color profiles, metadata and optically recognized text in a standard manner. Furthermore, PDF/A is a uniform standard for scanned as well as born-digital documents.