What is PDF? A Breakdown of the Various PDF Content Types

by Kurt Foss

PDF - Portable Document Format

In recent years, the PDF file format has emerged as a standard for sharing documents between users and for posting information on the Internet. In simplified terms, every PDF file is organized the same way, with a Header, Content, and Footer for each page of the document. The Header indicates that a new page is starting. Likewise, the Footer indicates that the page has ended. The Content is the part of the file that contains the information viewable by the user. This structure is the same for all PDF files and is illustrated below:

Page 1: PDF Header | Content | PDF Footer
Page 2: PDF Header | Content | PDF Footer
Page 3: PDF Header | Content | PDF Footer

This setup differs from a traditional word processing file (such as a Microsoft Word .doc file), which contains a single continuous column of information and is paginated only when printed. In a PDF file, each page is a holding area for that page’s specific block of information.

Where one PDF file may differ from another is in the format of the content. Many different formats may be used, including formatted text, unformatted ASCII text, raster images, vector images, or any combination of these. It is the nature of this content that determines how the final PDF file will function.

Many terms have been created and used to describe the PDF files resulting from different content types. Some of these are:

• PDF Normal
• PDF Image
• True PDF
• Wrapped PDF
• PDF Image + Text
• PDF-wrapped TIFF
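The page-by-page structure and the content formats described above can be modeled in a few lines of Python. This is only a minimal sketch of the article's simplified picture, not of the actual PDF byte format: the names ContentFormat, ContentBlock, Page, and page_is_searchable are hypothetical, and the payload strings are placeholders. It illustrates how the kind of content on a page decides whether text search can work on that page.

```python
from dataclasses import dataclass, field
from enum import Enum


class ContentFormat(Enum):
    # The four content formats named in the article.
    FORMATTED_TEXT = "formatted text"
    ASCII_TEXT = "unformatted ASCII text"
    RASTER_IMAGE = "raster image"
    VECTOR_IMAGE = "vector image"


@dataclass
class ContentBlock:
    format: ContentFormat
    data: str  # placeholder payload; a real file would hold encoded bytes


@dataclass
class Page:
    # Each Page object stands for one Header / Content / Footer unit
    # in the simplified model; the header and footer are implicit.
    number: int
    content: list[ContentBlock] = field(default_factory=list)


TEXT_FORMATS = {ContentFormat.FORMATTED_TEXT, ContentFormat.ASCII_TEXT}


def page_is_searchable(page: Page) -> bool:
    """A page supports text search only if its content includes text."""
    return any(block.format in TEXT_FORMATS for block in page.content)


# A three-page document in which every page holds its own block of content.
document = [
    Page(1, [ContentBlock(ContentFormat.FORMATTED_TEXT, "Quarterly report")]),
    Page(2, [ContentBlock(ContentFormat.RASTER_IMAGE, "scan-of-page-2")]),
    Page(3, [
        ContentBlock(ContentFormat.RASTER_IMAGE, "scan-of-page-3"),
        ContentBlock(ContentFormat.ASCII_TEXT, "text recovered from page 3"),
    ]),
]

searchable = [page.number for page in document if page_is_searchable(page)]
print(searchable)
```

In this sketch, page 2 holds only a raster image, so it is the one page on which a text search would find nothing.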

PDF Normal

Also known as True PDF and Real PDF, these documents represent the ideal PDF files for most applications. These documents have been created and published using PDF software. The content includes the original formatted text of the document. Tables in the document are also usually published as formatted text. Graphics or pictures will usually appear as cut images inserted into the text. The structure of this type of document is illustrated below:

Page 1: PDF Header | Formatted text of the document, including graphics and tables | PDF Footer
Page 2: PDF Header | Formatted text of the document, including graphics and tables | PDF Footer
Page 3: PDF Header | Formatted text of the document, including graphics and tables | PDF Footer

PDF Normal documents allow the user to search the text and to copy and paste it into other files. Because most of the information in these files is text, the files are also much smaller than image-based PDF files, which makes them easy to use and ideal to exchange.

PDF Image

The PDF Image is also called the Wrapped PDF or the PDF-wrapped TIFF. In these files, the content is simply an image file. The image file could be in any of several formats (GIF, TIFF, JPG, etc.) and could show many kinds of subject (a scanned page, a picture, a graphic design). The most common use is a scanned page in TIFF format. To create a PDF Image file from TIFF images, PDF creation software is used to insert the PDF Header and Footer information around each image to make it a PDF page. This process of “wrapping” the image with the PDF information is why these files are often referred to as Wrapped PDFs. The structure of the file is illustrated below:

Page 1: PDF Header | TIFF Image | PDF Footer
Page 2: PDF Header | TIFF Image | PDF Footer
Page 3: PDF Header | TIFF Image | PDF Footer
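The "wrapping" step described above can be sketched as follows. This is an illustration of the article's simplified model only: WrappedPage and wrap_images are hypothetical names, the header and footer are plain strings rather than real PDF data, and the TIFF file names are made up. Actual PDF creation software does considerably more work than this.

```python
from dataclasses import dataclass


@dataclass
class WrappedPage:
    header: str      # stands in for the PDF Header of the page
    image_file: str  # the image that forms the entire page content
    footer: str      # stands in for the PDF Footer of the page


def wrap_images(image_files: list[str]) -> list[WrappedPage]:
    """Turn each image into one page by placing a header and footer around it."""
    pages = []
    for number, image_file in enumerate(image_files, start=1):
        pages.append(
            WrappedPage(
                header=f"PDF Header (page {number})",
                image_file=image_file,
                footer=f"PDF Footer (page {number})",
            )
        )
    return pages


# Hypothetical scanned pages; one image becomes one page.
scans = ["scan-001.tif", "scan-002.tif", "scan-003.tif"]
wrapped = wrap_images(scans)

for page in wrapped:
    print(page.header, "|", page.image_file, "|", page.footer)
```

Note that nothing in a WrappedPage holds text, which is why the searching and copying functions discussed next are unavailable for this file type.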

Text searching and text copy/paste functions are not available with this type of PDF file because the only information it contains is image information. Although a scanned page may appear to contain text, it is actually just a bitmap of that text and not the text itself. Because image files are large, the resulting PDF files can also be quite large and, as a result, occasionally difficult to use.

PDF Image + Text

This file type represents a compromise between PDF Normal and PDF Image files. To make these files, the author begins with a hardcopy document. The document is scanned to get a TIFF image, making it similar to the PDF Image document described above. The scan, however, is then run through Optical Character Recognition (OCR) software such as OmniPage® to capture the text of the document and the position of the text on the page. This text information is then added to the content part of the file. The illustration below shows the structure of this type of PDF file:

Page 1: PDF Header | TIFF Image | OCR Text | PDF Footer
Page 2: PDF Header | TIFF Image | OCR Text | PDF Footer
Page 3: PDF Header | TIFF Image | OCR Text | PDF Footer
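To make the idea of stored OCR text and positions more concrete, here is a minimal sketch of a page that carries both an image and a list of recognized words. The names OcrWord, ImageTextPage, and find_word are hypothetical, the coordinates are assumed to be pixel offsets on the page image, and the word lists are hand-written stand-ins for what an OCR engine might return. No real OCR software or PDF library is called.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    x: int  # assumed: left edge of the word on the page image, in pixels
    y: int  # assumed: top edge of the word on the page image, in pixels


@dataclass
class ImageTextPage:
    image_file: str       # what the user sees on screen
    words: list[OcrWord]  # the text layer that sits behind the image


def find_word(pages: list[ImageTextPage], query: str) -> list[tuple[int, int, int]]:
    """Return (page number, x, y) for every OCR word matching the query."""
    hits = []
    for number, page in enumerate(pages, start=1):
        for word in page.words:
            if word.text.lower() == query.lower():
                hits.append((number, word.x, word.y))
    return hits


# Hand-written sample data standing in for OCR output.
pages = [
    ImageTextPage("scan-001.tif", [OcrWord("Invoice", 120, 80), OcrWord("Total", 120, 900)]),
    ImageTextPage("scan-002.tif", [OcrWord("invoice", 300, 400)]),
]

hits = find_word(pages, "invoice")
print(hits)

# Copy/paste works from the same text layer, not from the image.
copied = " ".join(word.text for word in pages[0].words)
print(copied)
```

Because each word keeps its position, a viewer built on this kind of data could highlight a search hit at the right place on the scanned image.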

When these files are viewed, the user sees the image on the screen. However, the text in the background is available for text searching and copy/paste functions. Because these files contain both the image and text information, their file size is even larger than that of PDF Image files.
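The size differences among the three file types can be made tangible with a rough, back-of-the-envelope estimate. Every number below is an assumption chosen for illustration and does not come from the article: a letter-size page, a 300 dpi black-and-white scan stored without compression, and about 3,000 one-byte characters of text per page. Real files differ because images are usually compressed and because text pages also carry fonts, formatting, and, for OCR text, position data.

```python
# Assumed values for illustration only; none of them come from the article.
PAGE_WIDTH_IN = 8.5     # letter-size page width in inches
PAGE_HEIGHT_IN = 11     # letter-size page height in inches
SCAN_DPI = 300          # scan resolution in dots per inch
BITS_PER_PIXEL = 1      # black-and-white (bilevel) scan
CHARS_PER_PAGE = 3000   # rough amount of text on a full page
BYTES_PER_CHAR = 1      # plain single-byte characters


def image_bytes_per_page() -> int:
    """Uncompressed size of one scanned page image."""
    width_px = int(PAGE_WIDTH_IN * SCAN_DPI)
    height_px = int(PAGE_HEIGHT_IN * SCAN_DPI)
    return width_px * height_px * BITS_PER_PIXEL // 8


def text_bytes_per_page() -> int:
    """Size of the raw characters on one page."""
    return CHARS_PER_PAGE * BYTES_PER_CHAR


estimates = {
    "PDF Normal": text_bytes_per_page(),
    "PDF Image": image_bytes_per_page(),
    "PDF Image + Text": image_bytes_per_page() + text_bytes_per_page(),
}

for file_type, size in estimates.items():
    print(f"{file_type}: about {size:,} bytes per page")
```

Under these assumptions the ordering matches the article: the text-only page is by far the smallest, and the image-plus-text page is slightly larger than the image-only page because it stores both.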

Copyright © 2003 ScanSoft, Inc. All Rights Reserved. The ScanSoft logo, Productivity Without Boundaries and OmniPage are trademarks or registered trademarks of ScanSoft, Inc. in the United States and/or other countries. All other company names or product names may be the trademarks of their respective owners.