What is OCR?



Almost everyone is familiar with a scanner. Scanners are great: they let us take photographs and documents from the 3D world of pens and paper, and bring them into the digital world inhabited by our computers. Sometimes, however, even people who are familiar with scanners don't understand exactly what a computer does with something once it has been scanned. To put it briefly, computers have different ways of dealing with pictures and text. A scanner is more or less just a digital camera. When you scan something, the scanner sends it to your computer as though it were a picture. It doesn't matter whether you are scanning photos of your beloved neighborhood movie theatre, or that old term paper you lovingly typed out back when you were in college in the winter of 1975.

Why is that important? It's important because the computer can do all kinds of neat things with text. Type up something in your word processor, and the computer understands that you've given it a bunch of words and letters. If you want to find how many times you used the word "stinky," no problem... the computer can count them. If you decide that all of those letters are ugly, no problem... you can tell the computer to change the font, and with one command you can change the typeface on every single letter.

The computer can do all kinds of neat things with pictures, too, but they aren't really the kind of things that help you when you are dealing with the written word. Want to erase that jerk standing in the background of your perfect picture of the Gateway Arch? No problem: a couple of minutes with an image editing program like Paint Shop Pro or Photoshop, and he's gone. Want to remove that wart from Aunt Hortense's forehead? Presto... it's gone.

But let's think about that old college term paper again. Once you've scanned it, what if you decide you want to change the ugly monospaced font that your old typewriter used? Remember: as far as the computer is concerned, at this point, that term paper is still just a picture, no different from your picture of Aunt Hortense. If you wanted to change the font, you could do it... but it would require manually redrawing every last letter in the entire document.

More importantly, since your scan is still a picture, you can't do any of the other important things that we can do with text, like automatically finding and replacing misspelled words.

This is where OCR - Optical Character Recognition - comes to the rescue. An OCR program can look at the "picture" of your document, "read" the document, and convert it to text. Really, really smart programmers with big giant brains have devised methods of looking at the little black areas on the white paper and figuring out what character was typed there. Got a straight vertical line, with a shorter horizontal line extending out from the bottom at a right angle? Oh... that's a capital L. And so it goes.
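To make the "capital L" idea concrete, here is a toy sketch in Python. It assumes each character has already been cut out of the scan as a tiny 5x5 black-and-white grid, and it simply picks the stored letter shape that agrees with the scanned one on the most pixels. The three letter shapes and the function names are made up for illustration; real OCR programs use far more sophisticated methods than this.

```python
# Toy character recognition by template matching.
# The 5x5 letter shapes are invented for this sketch; "#" is ink, "." is paper.
TEMPLATES = {
    "L": ("#....",
          "#....",
          "#....",
          "#....",
          "#####"),
    "I": ("..#..",
          "..#..",
          "..#..",
          "..#..",
          "..#.."),
    "T": ("#####",
          "..#..",
          "..#..",
          "..#..",
          "..#.."),
}


def score(glyph, template):
    """Count the pixels on which the scanned glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template letter that agrees with the glyph on the most pixels."""
    return max(TEMPLATES, key=lambda letter: score(glyph, TEMPLATES[letter]))


# A "scanned" capital L with one speck of dirt in the top-right corner.
scanned = ("#...#",
           "#....",
           "#....",
           "#....",
           "#####")

print(recognize(scanned))  # prints L
```

The speck of dirt costs the L only one pixel of agreement, so it still wins easily. Pile on enough specks, though, and a different letter starts to look like the better match, which is roughly how a bad original ends up producing the wrong characters.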

There are some drawbacks, however. OCR programs are rarely perfect, and a poor-quality original - for example, a document that has been faxed and photocopied a couple of times - will produce text that is fraught with errors. My own experience with various consumer-grade OCR programs has been mixed. They work great on a clearly typed original document. But if the original isn't clear, well... if you're a fast typist, you might find that it takes less time to retype the whole document than to correct all the errors in the OCR output.

If you're going to do much scanning, it helps to know a little more about the way pictures are stored on the computer. There isn't just one format for storing pictures; there are several, each with various advantages and disadvantages. Most people are familiar with a few of these: GIFs, JPGs, and BMPs, for example. Images stored in the .gif and .jpg formats are commonly used on websites. TIFF, however, is a much better format for storing scanned documents. The TIFF file specification (TIFF files usually have the extension .tif) includes a way for the computer to recognize multi-page images. That means that if you scan a five-page document, the computer can store it as a single file. Most other formats would require the document to be stored as five separate image files (one for each page).
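For the curious, here is a rough Python sketch of how a program can tell how many pages a TIFF holds. A TIFF file contains a chain of "image file directories," typically one per page, and each directory ends with a pointer to the next one (or zero at the end of the chain). The sketch follows that chain and counts the links. Both function names are made up for this example, and build_toy_tiff only produces an empty skeleton with no picture data so the counter has something to read. Real files can also contain extra directories such as thumbnails, so treat the count as an approximation.

```python
import struct


def build_toy_tiff(page_count):
    """Build the bare skeleton of a little-endian TIFF with `page_count` directories.

    Hypothetical helper for this sketch: each directory holds only a width
    tag and no pixel data, so the result is not a viewable image.
    """
    header = struct.pack("<2sHI", b"II", 42, 8)  # byte order, magic 42, first directory at byte 8
    ifd_size = 2 + 12 + 4  # entry count + one 12-byte entry + pointer to next directory
    body = b""
    for page in range(page_count):
        is_last = page == page_count - 1
        next_offset = 0 if is_last else 8 + (page + 1) * ifd_size
        body += struct.pack("<H", 1)            # this directory has one entry
        body += struct.pack("<HHI", 256, 3, 1)  # tag 256 (ImageWidth), type 3 (SHORT), count 1
        body += struct.pack("<HH", 100, 0)      # the value 100, padded out to four bytes
        body += struct.pack("<I", next_offset)  # where the next directory starts (0 = none)
    return header + body


def count_tiff_pages(data):
    """Count the directories in a classic TIFF by following the chain of pointers."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    pages = 0
    visited = set()
    while offset != 0:
        if offset in visited:
            raise ValueError("directory chain loops back on itself")
        visited.add(offset)
        (entry_count,) = struct.unpack(endian + "H", data[offset:offset + 2])
        next_pointer_at = offset + 2 + 12 * entry_count
        (offset,) = struct.unpack(endian + "I", data[next_pointer_at:next_pointer_at + 4])
        pages += 1
    return pages


print(count_tiff_pages(build_toy_tiff(5)))  # prints 5
```

To try it on a real scan, you could read the file's bytes with open("scan.tif", "rb").read() (the file name here is just a placeholder) and pass them to count_tiff_pages.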

The problem with putting .tif images on a website is that most web browsers don't know how to display them. There are add-ins that can be installed to give the browser this ability, but few users would bother.

Then what is the best option for putting a scanned document on a website? From a functional standpoint, the best option is to OCR it and then put it online as a text or HTML document. Unfortunately, that's a lot of work, especially if you don't have good OCR software and a good, high-quality original document. There is, however, one other relatively easy workaround: You can convert your TIFF files to PDFs. This is sometimes called "TIFF wrapped in PDF."

The PDF format, designed by Adobe, is a "portable document format." Files saved as PDFs can be viewed on practically any type of computer. Almost all modern computers will have a PDF viewer installed (usually Adobe's Acrobat Reader), and most web browsers will automatically launch the viewer if a user tries to open a PDF file via the web.

A "TIFF wrapped in PDF" isn't really a proper PDF file, though. In an ordinary PDF document, the computer can still recognize the text as being text; that is, it's still possible to do things like copying that text and pasting it into another document, or searching for a particular word or phrase within the document. That isn't possible with a TIFF wrapped in PDF; as far as the computer is concerned, that document is still really just a series of pictures. Wrapping TIFFs in PDF, then, still isn't a perfect solution. But it does have the advantage of making scanned documents easily readable via the web.
