Document Scanning For Data Capture


Touchstone Systems Limited

METAMATION Document Scanning Guide
Document Reference: META-SCAN-007

DATE

23RD NOVEMBER 2006

CONTACT DETAILS

For further information please contact:
Touchstone Systems Ltd
Nyumbani House
15 Belmont Heights
Hatch Warren
Basingstoke
Hampshire RG22 4RW

Telephone: +44 0845 434 8949
Email: support@touchstone-systems.co.uk

VERSION: V1.0



1 Copyright Notices

© 2006 Touchstone Systems Limited. All rights reserved.

Metamation and the Metamation logo are registered trademarks of Touchstone Systems Limited. Other trademarks and brands are property of their respective owners.

The information in this document belongs to Touchstone Systems Limited. It may not be used, reproduced or disclosed without the written approval of Touchstone Systems Limited.

Notice of non-liability: Touchstone Systems Limited is providing the information in this document to you on an “AS-IS” basis. Touchstone Systems Limited makes no warranties of any kind (whether implied or statutory) with respect to the information contained herein. Touchstone Systems Limited assumes no liability for damages (whether direct or indirect) caused by errors or omissions, or resulting from the use of this document or the information contained in this document, or resulting from the application or use of the product or service described herein. Touchstone Systems Limited reserves the right to make changes to any information herein without further notice.

MetaMation Document Scanning Guide V1.0

© Touchstone Systems Limited 2006.

Page 2 of 25


2 Table of Contents

1 Copyright Notices
2 Table of Contents
3 Introduction
  3.1 Back File and Forward Scanning
  3.2 Scope and Objectives
  3.3 Background
  3.4 Assumptions and Constraints
4 Document scanning processes
  4.1 General
  4.2 Preparation of paper documents
  4.3 Document batching
  4.4 Photocopying
  4.5 Scanning processes
  4.6 Quality control
  4.7 Evaluating image quality
  4.8 Checking scanner performance
  4.9 Rescanning
  4.10 Image processing
5 Image processing
  5.1 General
  5.2 Document skew
  5.3 Speckle, noise and background marks
  5.4 Black border removal
  5.5 Forms removal
6 Scanning specific types of document
  6.1 General
  6.2 Text, typed and printed
  6.3 Line drawings/art
  6.4 Handwritten material
  6.5 Charts, plans, and drawings
  6.6 Maps
  6.7 Half-tone material
  6.8 Continuous-tone images
  6.9 Mixed mode documents
  6.10 Documents with note sheets attached
  6.11 Microform documents



7 Design of documents for optimal scanning
  7.1 Machine-readable metadata on forms
  7.2 Barcodes
  7.3 Layout
  7.4 Positioning of barcode labels
  7.5 Colours, paper etc
8 Scanner specifications
  8.1 Recommended Scanners
    8.1.1 Recommended Scanner for small scale users: Kodak i60 Scanner
    8.1.2 Recommended Scanner for Medium sized users: Kodak i280 Scanner
    8.1.3 Recommended Scanner for Larger scanning situations: Bell & Howell 8080DB1-CE
9 Compression Techniques and Recommendations
  9.1 Introduction to Compression
  9.2 Background
  9.3 Samples
  9.4 Notes
  9.5 Results
  9.6 Findings
10 About Touchstone Systems



3 Introduction

This document provides guideline processes for document imaging that are compliant with BIP0008 and that help to ensure a successful deployment of a document scanning solution. It covers the following areas:

• Document preparation for scanning
• Batching of documents
• Scanning process
• Sample set (documents used for calibration)
• Quality control
• Re-scanning
• Image processing
• Scanner recommendations
• Design of documents for optimal scanning

3.1 Back File and Forward Scanning

Within this document, reference is made to the terms “Back File Scanning” and “Forward” or “Day Forward” scanning.

Back file scanning describes the scanning of historical paper records (such as files), converting them into an electronic medium so that the paper can be removed, which reduces storage and paper transfer costs. Forward or Day Forward scanning refers to the ongoing process of scanning new paper as it is received.

Reference is also made to “On Demand” scanning. This is a method of converting historical files to electronic media as required (as demanded). Typically, this occurs where large historical files on a subject (company, person, patient, project) exist, but back file scanning is not desired. With On Demand scanning, documents that still exist only on paper are converted to electronic format when they are first requested, thus gradually removing the historical paper files.

As back file scanning quickly frees up large amounts of paper storage space, it is normally the preferred method, followed by forward scanning for new paper. Back file scanning can be performed in the company's offices, or the files can be taken off-site and quickly scanned using industrial scanners as a service.

3.2 Scope and Objectives

When a company takes on a scanning solution for day forward and back file scanning, it is useful to first understand the best practices for scanning paper. This document highlights the recommended process for document preparation and scanning, and provides a recommendation of the types of scanners that will be required to implement the solution.

3.3 Background

This document has been created to answer the questions a company will have about the scanning process. The questions answered here include information on the Legal Admissibility of documents as defined in BIP0008 (to provide copies of documents as evidence in court), and on the techniques recommended for successful scanning and document management.

3.4 Assumptions and Constraints

The following gives the assumptions and constraints upon which this document is dependent.

Assumptions:

• The process held within this document will be defined during the implementation stage and based on the processes defined in the ‘BCH-PRS-0233 EDM Best Practice Output’ [1] document.
• This document is a general guide only. Prior to implementing a document scanning solution, it is strongly recommended that a scanning survey be carried out by Touchstone Systems. This survey will ensure that the best possible solution is deployed.



4 Document scanning processes

4.1 General

This section includes recommendations relating to the procedures relevant to document image capture. These recommendations cover procedures for:

• preparation of documents
• document batching
• photocopying (to improve scanning success)
• scanning
• image processing

4.2 Preparation of paper documents

All paper documents need to be examined prior to the scanning process, to ensure that a successful image is obtained. Attributes such as paper size, weight, physical state (thin paper, creased, stapled, etc.), binding, and print colour (black-and-white, colour, tonal range, etc.) can all affect the physical scanning process.

Where documents are found which are unlikely to be accepted by the scanner, there are a number of techniques that can be used. For example, the original could be photocopied or transparent wallets could be used.

• When removing staples, clips, or other document bindings, ensure that no damage is caused to the original that may affect the capture of the information from the document.
• Where a source document has physical attachments, for example, stick-on notes, they must be distinguished from the document to which they are attached and linked to the original document after scanning so that both can be viewed. This should be achieved by capturing a separate image of the attachment on the original page. The index data should record the fact that there is an attachment and a link to the original page. It is suggested that the document preparer photocopies the original page to scan as the first page of the attachment, with the copy and the attached note added as the second and successive pages where there are multiple notes.
• Where a source document has physical amendments, for example, white correction fluid, the workflow introduced should ensure that the presence of such amendments is noted. This should be through the use of a black ink pen to circle the amendment.
• All pages of multi-page documents should be kept together and in the appropriate order before, during, and after scanning.
• All pages which require specialised scanning, e.g. forms, oversize pages, low contrast pages etc., should be extracted for scanning in specialised scanners or with different scanning settings (colour, contrast, resolution, etc).

4.3 Document batching

Generally, documents can be scanned using one of two methods: batch scanning, where documents are sorted into batches by type and subject, or intelligent scanning, where OCR is performed on the scanned pages and text and/or markers are used to identify the storage context.

Batch scanning of documents generally requires the production of cover sheets to be inserted between each batch of pages. These cover sheets carry identification marks (normally bar codes) which indicate the type of document and context (company, patient, person, etc). The processing of batched pages within the scanning/reading process is very fast, but requires the pre-production of cover pages.

The intelligent recognition of pages removes the need to identify the context prior to scanning, but requires a table to be defined within Metamation with the storage context. The validation and identification is carried out in the background by the scanning reader engine (which performs the OCR on the documents), which means a slower background throughput, but less document preparation time.

Both batch and intelligent recognition can be used together for optimum performance (such as using intelligent recognition for document storage, but pre-printed cover pages for paper ‘forwarding’ to individuals within the organisation). The choice of the preferred method of batch/scanning control will depend on the types of documents to be scanned.
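As a rough illustration of batch scanning with cover sheets, the sketch below splits a stream of scanned pages into batches each time a cover sheet is met. The `Page` and `Batch` records, the `COVER:` barcode prefix and the `type|context` payload layout are assumptions made for this example only; they do not describe how Metamation stores or reads cover sheets.

```python
from dataclasses import dataclass, field
from typing import Optional

COVER_PREFIX = "COVER:"  # hypothetical marker carried by cover-sheet barcodes


@dataclass
class Page:
    image_name: str
    barcode: Optional[str] = None  # decoded barcode text, if one was read


@dataclass
class Batch:
    doc_type: str
    context: str
    pages: list = field(default_factory=list)


def split_into_batches(pages):
    """Group pages under the cover sheet that precedes them."""
    batches = []
    current = None
    for page in pages:
        if page.barcode and page.barcode.startswith(COVER_PREFIX):
            # Assumed payload layout: "COVER:<document type>|<context>"
            doc_type, context = page.barcode[len(COVER_PREFIX):].split("|", 1)
            current = Batch(doc_type, context)
            batches.append(current)
        elif current is None:
            raise ValueError(f"{page.image_name} appears before any cover sheet")
        else:
            current.pages.append(page)
    return batches


scanned = [
    Page("0001.tif", "COVER:invoice|ACME Ltd"),
    Page("0002.tif"),
    Page("0003.tif"),
    Page("0004.tif", "COVER:letter|J Smith"),
    Page("0005.tif"),
]
for batch in split_into_batches(scanned):
    print(batch.doc_type, batch.context, len(batch.pages))
```

The cover sheets themselves are not stored as content pages in this sketch; whether they should be retained is a decision for the implementation stage.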

4.4 Photocopying

It may be helpful for some documents to be photocopied prior to being scanned. Such documents include:

• documents that may be adversely affected by the scanning process, such as damaged or delicate documents
• documents where there are substantial contrast or density variations over the area of the original, and where photocopying demonstrably improves the image quality
• documents containing paper or ink colours that do not produce legible scanned images, and where photocopying demonstrably improves the image quality

Photocopiers and scanners may respond differently to different colours, and it is only in exceptional cases that the technique of photocopying prior to scanning does not produce satisfactory results.

Photocopies should be examined to ensure that there is no significant loss of information during this process.

Where an image was made from a photocopy, it should be stamped as a ‘photocopy’ or ‘original photocopy’, and indexed as having been captured from a photocopy, distinguishing between photocopies made during document preparation and source documents which are known photocopies.

4.5 Scanning processes

As a general rule of thumb, it is recommended that all scanning should be duplex, in colour, and at 400 dpi resolution. However, high-resolution colour images take up more storage space, so, as detailed within this document, different types of pages may need to be scanned at lower resolutions, in grey scale, etc.

As part of the set-up and configuration of a scanning process, the types of pages to be scanned can be checked and scanning ‘jobs’ can be created to scan with different settings. The scanning process would then need to factor in the separation of different document types based on the types, sizes, contrast and usability of the paper concerned.

To ensure that all documents in a batch are fully scanned, a count of captured documents should be compared with the number of documents in the batch.
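To give a feel for the storage trade-off mentioned above, the following sketch computes the uncompressed size of one scanned side of an A4 page at several settings. The choice of A4 and of these particular settings is an assumption for illustration; real files will be smaller once compressed.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw size of one scanned side, before any compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


A4 = (8.27, 11.69)  # A4 page size in inches (210 mm x 297 mm)

for label, dpi, bits in [
    ("400 dpi, 24-bit colour", 400, 24),
    ("400 dpi, 8-bit grey scale", 400, 8),
    ("300 dpi, 8-bit grey scale", 300, 8),
    ("300 dpi, 1-bit black-and-white", 300, 1),
]:
    size_mb = uncompressed_bytes(*A4, dpi, bits) / 1_000_000
    print(f"{label}: {size_mb:.1f} MB per side")
```

On these assumptions a single A4 side is roughly 46 MB at 400 dpi in 24-bit colour, against roughly 1 MB at 300 dpi in black-and-white, which is why separate scanning ‘jobs’ for different page types can be worthwhile.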

4.6 Quality control

Procedures are required which reduce the risk of scanned images being of unsatisfactory quality. The evidential weight of scanned images will be increased if it can be demonstrated that the images are of good quality, and that the scanner was working to agreed standards at the time of scanning.

A sample set of source documents should be assembled for the purposes of evaluating scanner results against agreed quality control criteria. It should consist of representative types of the documents to be scanned, and should include a duplex (front and back) content page.

Documents in the sample set should be representative of the complete set of documents that is to be scanned.

Documents in the sample set should include examples of source documents whose quality is poor relative to that of the majority of the documents.

Quality control criteria should cover:

• overall legibility
• smallest detail legibly captured (e.g. smallest type size for text; clarity of punctuation marks, including decimal points)
• completeness of detail (e.g. acceptability of broken characters, missing segments of lines)
• dimensional accuracy compared with the original
• scanner-generated speckle (i.e. speckle not present on the original)
• completeness of overall image area (i.e. missing information at the edges of the image area)
• density of solid ‘black’ areas, and colour fidelity

Quality control criteria for image quality should be realistic given the nature of the source material and the characteristics of the scanning equipment, and should be based upon the sample set of documents.

4.7 Evaluating image quality

The scanners should be set up using the sample set of documents and should be retested on a weekly schedule to ensure the best quality of the scanned images.

During operation, image quality should be evaluated using a 20” monitor with a resolution of 90-100 dpi, which should allow the validation operator to view the document as a complete page to ensure that the comparison with the original is complete.

The validation system should allow the operator to print suspect documents to verify that the image can be reproduced and validated against the original where the reproduced image is as good as the original. The scanned image should be printed on a colour printer with a greater resolution than the 400 dpi scanner. This is to ensure that all information is printed.

The results of all quality control checks should be stored in the Quality Control Log with the reason for rejection.

The sample rate should be every 5th page in the first month, reducing to every 10th page in the second month and to one page every hour in the third month. When the sample set is used, the whole set should be validated for accuracy.
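The sampling schedule above can be sketched as a small decision function, together with a minimal Quality Control Log record. The function names, the record fields, and the choice to keep the hourly rate after the third month are assumptions for illustration; the guide itself only defines the rates for the first three months.

```python
from datetime import datetime, timedelta


def should_sample(month, page_number, last_sample_time, now):
    """Decide whether a page is checked, following the schedule above.

    month is 1 for the first month of operation, 2 for the second, and so on.
    page_number counts scanned pages from 1.
    """
    if month == 1:
        return page_number % 5 == 0    # every 5th page
    if month == 2:
        return page_number % 10 == 0   # every 10th page
    # Third month (and, by assumption, later months): one page per hour
    return last_sample_time is None or now - last_sample_time >= timedelta(hours=1)


def log_entry(page_id, passed, reason=""):
    """Build one Quality Control Log record; a rejection must carry a reason."""
    if not passed and not reason:
        raise ValueError("a rejected page needs a reason for rejection")
    return {"page": page_id, "passed": passed, "reason": reason}


start = datetime(2006, 11, 23, 9, 0)
print(should_sample(1, 5, None, start))                             # month 1, 5th page
print(should_sample(3, 42, start, start + timedelta(minutes=30)))   # month 3, 30 minutes on
print(log_entry("batch-7/page-12", False, "banding across image"))
```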

4.8 Checking scanner performance

Optical and paper transfer rollers should be cleaned daily or on demand, for example when a clean original shows banding on the scanned image, which is produced by dirt on the optical system.

The sample document should become the scanner test target and should be used to monitor scanner performance. Scanner performance checks should be carried out weekly to ensure that the scanner performance is within agreed tolerances.

Hard copy prints should be made of the scanned images of the test targets and compared with the test targets themselves to determine whether the quality criteria are met.

4.9 Rescanning

All pages marked for rescanning should be identified and rescanned using a flatbed scanner where possible, to improve operator control over the scanning. The operator should have the ability to change the contrast adjustments or increase the resolution of the scanner to improve the scanned image. However, de-speckling of the image should not be allowed, as this can change the content of the scanned page, making the original un-reproducible.

All rescanned pages should replace the original scanned page which was marked for rescanning. The operator should ensure that the information on the page is accurately represented before replacing the image.

4.10 Image processing

The following sections (5 and 6) describe some different types of documents, and associated image processing facilities that may be used. Some of these operations are carried out during and/or after scanning.

The scanners should be set up to de-skew pages automatically. On occasion the operator may need to carry out the de-skew operation using the pull-down menus in the application. This should be at the operator's discretion; the alternative is to reject the page and rescan it.

Where documents are OCR'd or OMR'd, the operator should be required to verify the accuracy of the text or marks against the original page as well as the scanned page. This is to ensure that the content is accurately represented when carrying out free text searches.

De-speckling and border removal are NOT acceptable. If a page requires extra processing to remove noise, the page should be rejected and rescanned with a different scanner setting, and the rescanned page carefully validated for quality.

5 Image processing

5.1 General

For legal reproduction of the original scanned documents, the majority of image processing tools cannot be used, e.g. de-speckling. The following sections identify the acceptable tools that can be used.

5.2 Document skew

Document skew is a term used to describe the phenomenon of poor document alignment (rotation) during the scanning processes. In its most pronounced form, images can appear on a viewing screen as crooked or slanted. Even a small angle of skew is likely to affect data capture processes and thus reduce data recognition rates. Passing images through de-skewing processes may correct this problem.
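To see why even a small skew angle matters, the sketch below works out how far a line of text drifts vertically across the width of a page, and how a skew angle could be estimated from two points on an edge that should be horizontal. The A4 width, the 400 dpi setting and the function names are assumptions for illustration; scanner de-skew functions will use their own detection methods.

```python
import math


def edge_drift_px(page_width_in, dpi, skew_degrees):
    """Vertical drift of a text line across the full page width, in pixels."""
    width_px = page_width_in * dpi
    return width_px * math.tan(math.radians(skew_degrees))


def skew_angle_degrees(x1, y1, x2, y2):
    """Angle of the line through two points on what should be a horizontal edge."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


# A4 width (8.27 in) scanned at 400 dpi, skewed by one degree
print(round(edge_drift_px(8.27, 400, 1.0)))   # about 58 pixels
```

A drift of this size is larger than the nominal height of 6 point type at 400 dpi (about 33 pixels), so a recognition zone placed for a straight page can miss the text entirely.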

5.3 Speckle, noise and background marks

These features should not be used. They are included for information only.

Random black marks (speckles) which appear on an image may have been generated during the scanning process, or may be present on the original document. These speckles may be removed by systems involving special algorithms. These algorithms assume that small isolated clusters of pixels contain no information, and may be deleted.
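For information only, and consistent with the instruction that these features should not be used, the kind of algorithm described above can be sketched as follows. The image representation (a list of rows of 0 and 1), the 8-connected neighbourhood and the `max_size` threshold are assumptions for this example, not a description of any particular scanner's de-speckle function.

```python
def remove_speckles(image, max_size):
    """Return a copy of a binary image with small isolated black clusters cleared.

    image is a list of rows; 1 is a black pixel, 0 is white.
    Clusters of max_size pixels or fewer (8-connected) are deleted.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Collect the whole cluster with an iterative flood fill
            cluster, stack = [], [(y, x)]
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                cluster.append((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            if len(cluster) <= max_size:
                for cy, cx in cluster:
                    result[cy][cx] = 0
    return result


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],   # part of a character stroke (kept)
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],   # single isolated pixel (removed)
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_speckles(page, max_size=2)
```

The isolated pixel removed here is indistinguishable, to the algorithm, from a small punctuation mark such as a decimal point, which illustrates how de-speckling can change the content of a page.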



5.4 Black border removal

This feature should not be used. It is included for information only.

When scanning documents of mixed sizes using certain scanner types (such as rotary scanners), black borders may be left around the edges of smaller documents. Black border removal entails the deletion of such large areas of black pixels.

5.5 Forms removal

The scanning of textual information on a pre-printed form is common when automated data capture processes such as OCR and OMR replace a large keyboarding operation. To increase the accuracy of the recognition rate, images can be passed through a post-scanning process that will remove boxes, lines, and pre-printed text.

Where new forms have been designed and are intended for OMR and OCR, forms removal should be used. The forms will be defined during implementation; a list should be given to the operators, and the scanning systems set up to recognise the forms which are enabled for this method.

6 Scanning specific types of document

6.1 General

This section gives details of different types of documents, and the scanner characteristics needed to give acceptable results within the Metamation information management system. The characteristics detailed in this section are not applicable where Optical Character Recognition is to be performed on the scanned image.

6.2 Text, typed and printed

It is recommended that a resolution of 400 dpi be used as the minimum for the following reasons:

• At lower resolutions, some detail may be missing from some characters, particularly if they contain thin elements, including serifs; fonts under about 6 point on the original may not be captured very clearly.
• With material containing particularly small type sizes (e.g. superscripts and subscripts), a resolution of 600 dpi or more may be necessary.
• For material that may be processed using Optical (or ‘Intelligent’) Character Recognition, it may be beneficial to scan at a higher resolution than would be satisfactory for visual legibility. For example, while for much material 200 dpi would be satisfactory for visual representation, it may be preferable to use 300 dpi resolution if OCR/ICR is to be used; similarly, where 300 dpi may be visually satisfactory, 400 dpi may be better for OCR.
• Material which contains handwriting is known to be difficult to read; a resolution of greater than 300 dpi may be required.

No decisions should be made regarding choice of resolution without conducting tests against the sample set. Careful tests should be carried out to ensure that the resulting image remains an effectively ‘true’ facsimile of the original. These tests should use the sample set of documents, and hard copies should be made of scanned images. There should be no anomalies introduced into the enhanced image that are visible under normal office lighting conditions.

It is important to bear in mind that the validation monitor should have an effective resolution of about 90 to 100 dpi. This is normally adequate for typed material, but ‘zooming’ may be required with small sized print, and this requires that the scanning resolution should be substantially greater than the basic display resolution.

The results of these tests should be stored with other records of the scanning processes.
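The relationship between type size, scanning resolution and on-screen zoom can be worked through with simple arithmetic. The sketch below is illustrative only: the point size is the nominal size of the type (actual letter shapes are smaller), and the function names are assumptions for this example.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def type_height_px(point_size, dpi):
    """Nominal height in pixels of type of the given point size at a scan resolution."""
    return point_size / POINTS_PER_INCH * dpi


def max_zoom(scan_dpi, monitor_dpi):
    """Largest magnification at which each screen pixel still maps to a scanned pixel."""
    return scan_dpi / monitor_dpi


for dpi in (200, 300, 400, 600):
    print(f"6 point type at {dpi} dpi: {type_height_px(6, dpi):.0f} px")
print(f"zoom available at 400 dpi on a 100 dpi monitor: {max_zoom(400, 100):.0f}x")
```

At 200 dpi, 6 point type has a nominal height of only about 17 pixels, leaving very few pixels for thin strokes and serifs; at 400 dpi the same type has about 33 pixels, and the image can be enlarged about four times on a 100 dpi monitor before scanned pixels become visible.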

6.3 Line drawings/art

For line drawings/art which form part of otherwise text-oriented documents, the scanning resolutions applicable to text are typically satisfactory for the drawings also. With printed material, where fine lines are used in the artwork, 300 dpi may be too low, but this can only be determined via tests on sample documents.

6.4 Handwritten material

With material where a modern pen, ball-point, or pencil was used, 400 dpi will normally be adequate. For older material where a steel-nibbed fountain pen was used, the thinness of the upstrokes will often require 400 dpi as the minimum resolution which will satisfactorily capture the text without significant components of these upstrokes being lost.

Handwriting (or hand drawing) using pencils can be faint, and difficult to reproduce. Care should be taken to ensure that image brightness and contrast are appropriate for these images.

6.5 Charts, plans, and drawings

For hand-drawn charts, architectural, and engineering drawings, there may be finer lines present than would be the case with a typical ‘full-sized’ CAD drawing, and although 300 dpi will usually be a satisfactory resolution, tests should be done to ensure that the finest detail is captured. It may prove necessary to use 400 dpi.

If the scanning is to be done from copies of the originals, and if these copies have been reduced from the originals (which is quite common), then a higher resolution may be required than would otherwise have been satisfactory.

With drawings and Critical Care Unit (CCU) charts, dimensional accuracy may be important. Because of the large size of drawings, the paper or film may undergo dimensional change (due mainly to variations in moisture content). For working drawings it is often a requirement when scanning that dimensional inaccuracies are corrected, i.e. the scanned image may be post-processed to correct scale inaccuracies, skew or lack of orthogonality. Such corrections mean that the subsequent image is not a true facsimile of the original. Where legal admissibility may become an issue, it will be necessary to preserve an uncorrected version of the scanned image as well as the corrected version, with the appropriate links to both documents.
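The effect of scanning from reduced copies can be estimated with a simple ratio: detail that was shrunk by the copier needs proportionally more scanner resolution to be captured at the same fineness. This is a sketch under the assumption that the copy itself has not lost the detail; the function name is hypothetical.

```python
def required_scan_dpi(target_dpi_on_original, copy_scale):
    """Scan resolution needed on a reduced copy to match a resolution on the original.

    copy_scale is the linear size of the copy relative to the original,
    for example 0.5 for a drawing reduced from A1 to A3.
    """
    if copy_scale <= 0:
        raise ValueError("copy_scale must be positive")
    return target_dpi_on_original / copy_scale


# A drawing that would be scanned at 300 dpi as an original, reduced to half size
print(required_scan_dpi(300, 0.5))   # 600.0
```

As elsewhere in this guide, a figure obtained this way is only a starting point; tests against the sample set should confirm that the finest detail is captured.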

6.6 Maps

With maps, a minimum resolution of 400 dpi will be required, but much higher resolutions (e.g. up to 1000 dpi) may be required with some material which contains fine detail. As with drawings, scanned images of maps are frequently corrected after scanning for scale inaccuracies and lack of orthogonality in the original.

Where coloured maps are being scanned, and the colour is to be preserved, the scanner should be capable of capturing individual colours with the required discrimination. While the number of colours subjectively present may be quite small, 8-bit colour (256 colours) may be inadequate and it may be necessary to scan with 24-bit colour in order to provide the required colour discrimination. Tests should be done to determine how many ‘bits’ of colour are required.
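The colour depths mentioned above translate into numbers of distinguishable colours as follows. The helper names are assumptions for this example; the arithmetic shows only how many colour codes a bit depth provides, not how well a given scanner separates similar inks.

```python
def colours_available(bits):
    """Number of distinct colour values that a pixel of the given bit depth can hold."""
    return 2 ** bits


def bits_needed(distinct_colours):
    """Smallest bit depth able to give each colour its own code."""
    return max(1, (distinct_colours - 1).bit_length())


print(colours_available(8))    # 256
print(colours_available(24))   # 16777216
print(bits_needed(300))        # 9
```

A scanned map usually contains many more pixel values than the handful of inks used to print it, because paper tone and ink density vary across the sheet; this is one reason a test scan is needed before settling on a bit depth.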

6.7

Half-tone material Where half-tone material (black-and-white or colour separated) is present on a page along with text and/or line art, the outcome objectives of the scanning should be considered. If the objective is to produce a scanned image that is comparable in quality to a ‘normal’ black-and-white photocopy, then a scanner which produces a bitonal image (i.e. ‘black-and-white’) will suffice. The resolution may have to be higher than that which would be acceptable for text only: 400 dpi will be required to capture half-tone material.

If the half-tone content has value in the application context, following the recommendations that apply to scanning text or line art may result in the capture of images of unacceptable quality from the half-tones. Most scanners have different settings for scanning text or line art and scanning half-tones. It is a general problem when scanning mixed text or line art and half-tones with a ‘black-and-white’ scanner that the scanner settings that are optimal for text are far from optimal for the half-tones, and vice versa. When set for ‘text’, the quality of the half-tone images will generally be significantly worse than a photocopy; when set for ‘half-tone’ or ‘photographs’, the text may appear rather blurred in the scanned image, to the extent that the image would not form a good facsimile of the original text.

If the half-tone content has ‘cosmetic’ value only and does not contribute to the essential information content of the original, then the scanning should be done according to the recommendations which apply to text or line art material.

If the half-tone is to be captured to a quality level comparable to that of a typical (good quality) photocopy, then there are two options. One option is to scan the document with the scanner settings ‘normal’, at a higher resolution than would be necessary for the text alone; 400 dpi minimum is recommended. The other is to scan the document twice, to create two images, one where the text/line art is captured to satisfactory quality and the other where the half-tone material is satisfactorily captured. In the latter case a record should be kept that the production of the two images involved different scanner settings (affecting the processing performed on the images).
If the half-tone material is to be produced to a quality comparable to that of the original, then it should be processed according to the recommendations for photographs.

6.8

Continuous-tone images Continuous-tone images include photographs, medical and industrial radiographs (X-rays), and images generated by computer as photographic style images, including, for example, ultrasound images, CT and MR images. With material containing continuous-tone areas (grey scale or colour), where the tonal information should be preserved, scanning should be performed with a scanner capable of capturing the required number of grey levels and/or colours. The number of levels that is appropriate should be determined by benchmark tests on the sample set of documents. For images from photographic material, the number of grey levels will typically be 16, 64, or 256 (i.e. 4, 6, or 8 bits per pixel). For very high quality images, 256 levels are normally used, and for X-rays, up to 1024 levels of grey (i.e. 10 bits per pixel) may be necessary. For colour photographs, 24 bits per pixel of colour information are used in most applications, but for very high quality images, up to 36 bits per pixel may be necessary. Typically, 15 or 16 bits of colour are used; for source material containing only a small palette of colours, 256 levels may suffice. Tests should be performed to determine how many colour levels are required.

• With continuous-tone colour, most scanners capture 8 bits of colour information in three different regions of the colour spectrum: Red, Green, Blue (‘RGB’), resulting in 24 bits per pixel, or the ability to reproduce over 16 million colour variations.

• With only 8 bits of colour information (256 levels), there may be a noticeable ‘blockiness’ in the image if the original contains a broad range of colours.
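The relationship between bits per pixel and the number of grey levels or colours quoted above is a power of two. The following sketch simply evaluates it for the bit depths mentioned in this section; the function name is illustrative.

```python
def distinct_levels(bits_per_pixel: int) -> int:
    """Number of distinct grey levels or colours representable in the given bits."""
    if bits_per_pixel < 1:
        raise ValueError("bits_per_pixel must be at least 1")
    return 2 ** bits_per_pixel


# Grey scale depths quoted in the text: 4, 6, 8 and 10 bits per pixel.
for bits in (4, 6, 8, 10):
    print(f"{bits} bits per pixel -> {distinct_levels(bits)} grey levels")

# RGB capture at 8 bits per channel gives 3 * 8 = 24 bits per pixel.
print(f"24 bits per pixel -> {distinct_levels(3 * 8)} colours")
```

This confirms the figures in the text: 8 bits give 256 levels, 10 bits give 1024 levels, and 24 bits give 16,777,216 colours.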

Scanning resolution requirements for documents containing colour are normally similar to those for black-and-white material, particularly if there is text present on the original. Thus scanning may be performed at 200-400 dpi, referred to the original photograph. If there is no text present on the original, satisfactory images may be achieved at lower resolutions, down to television quality levels (about 350 lines per image frame); this would typically be satisfactory for identity photographs and similar applications.

To assess image quality, in general it is satisfactory to compare the screen images with the original. If there is likely to be use of high quality hard copy images then the comparison should be made between hard copies of the images, produced on a high quality colour printer, and the originals. Care should be taken when comparing screen colours with an original that the colours were correctly balanced at the time of image capture, and that the display system has also been calibrated correctly. Otherwise the displayed colours may be significantly different from the colours on the original. The same requirement applies when comparing the original with hard copies of the captured image. Where colour accuracy is important, a standard Colour Gamut test chart should be scanned at the same time as the original (or batch of originals scanned at the same time), and the image of this chart stored along with the original.

6.9

Mixed mode documents Mixed mode documents comprise more than one document type inside a single document (e.g. photograph, text). From a scanning perspective the documents described above containing halftone material are essentially of this type, even though the original has been created in a single print operation. As described in 6.7, the use of scanner settings optimized for one type of material can result in the loss of information in material of other types. As suggested in 6.7, one solution is to capture multiple images, with scanner settings (or even scanner type) selected to optimize the image quality for each material type. One option is to use a scanning system that can scan mixed mode documents automatically, with automatic detection of each type of material and automatic optimization of the settings for each type. These systems can also be set to select the most appropriate compression algorithm for each type of material. Benchmark testing should be done to ensure that the results are acceptable.

6.10

Documents with note sheets attached Some documents may have note sheets or notelets attached. Care should be taken when scanning such documents. It may be necessary to remove the attachment where, for example, it obscures information on the document. If removal is required, the note should be marked or stamped as being a part or page of the document to which it was attached, and scanned and indexed separately. The original page should also be indexed to indicate that it has an attachment. Where a system has a facility to indicate that a document has a related image, then this facility should be used.

6.11

Microform documents Microforms should be examined carefully prior to deciding upon the scanning approach. Within multi-frame microfilm media (roll film, microfiche, microfiche jackets, multi-frame aperture cards), automated frame detection should not be used unless the inter-frame gap can be detected unambiguously. If the gap is not detected, multiple frames may be merged into one image. Depending on the physical characteristics of the scanning system, it is possible that some part(s) of the digitized image may be lost. With jacketed film, film strips may overlap. The processing procedures should ensure that such overlaps are detected and corrected before scanning, otherwise some page images will be missing or illegible, in whole or in part. Where a rotary camera has been used, images on the film may not have a one-to-one correspondence with the original documents. For example, two pages may have been fed at once, so that on the film part or all of an original page may be missing.



7

Design of documents for optimal scanning The following guidance has been proposed to assist companies who currently use forms for capturing information in the redesign of forms and other documents, to make them as useful for future scanning as possible. It is suggested that, where forms are already in use, the design of any form that is due for reprinting be reviewed so that it can be optimised for scanning.

7.1

Machine-readable metadata on forms

• There is no technical constraint on which fields are used for barcodes or other machine-readable features. However, encoding too many fields will either require large codes or result in a raised failure rate in recognition.

• If a stick-on barcode label is to be used, it is recommended that only the context identifier be encoded in the barcode, which is assumed to be printed out from the system. If multiple labels are provided to a user, it is probable that the user will select the wrong label, causing the document to be incorrectly indexed.

• If such a label is not used, it is recommended that specialty, document type and context identifier all be encoded on the stationery at the time that it is printed. This may be through a combination of machine reading of printed text, mark recognition, and barcode reading. The company should apply a house style, with this information being consistently applied for human and machine use.

7.2

Barcodes

• Each organisation should use a single barcode protocol. The standard barcodes which should be used are CODE39 or CODE128.

• The barcode should be printed clearly at high resolution on a laser printer on good quality labels, with clear white space around the code to give the maximum probability of successful recognition.

7.3

Layout

• During capture, the system can be instructed to look anywhere on the image for a barcode.

• No critical information should be placed within 5mm of the edge of the sheet due to the risk of folding and information loss.

7.4

Positioning of barcode labels

• The systems are tolerant of label positioning and orientation within the area that they are instructed to search. For speed of assimilation and checking whether the label has been attached, it is recommended that a consistent location is used on forms, such as the top right corner of the form.

• The essential requirement is that the barcode label should be placed within the area searched. As placement will be by hand and often under stress, the tolerance for the label within the requested target area should be at least +/- 10mm for successful recognition. This allows for the label to be either out of position or tilted (or both).

• Where other marks are used (e.g. to identify the document type), these should be placed to within +/- 1mm.

• Barcode tilt (off the horizontal line) is permissible with a general tolerance of between -20 and +20 degrees. Beyond this, attached barcode labels may not be recognised.
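The placement tolerances above can be expressed as a simple check, for example when auditing a batch of sample forms. The sketch below is illustrative only: the class and function names are hypothetical, the label is represented by its centre point (a simplification), and the A4 sheet size is an assumption. Only the +/- 10mm, +/- 20 degree and 5mm figures come from this guide.

```python
from dataclasses import dataclass
from typing import List

POSITION_TOLERANCE_MM = 10.0   # label may be up to 10mm from its target position
MAX_TILT_DEG = 20.0            # tilt tolerance either side of horizontal
EDGE_MARGIN_MM = 5.0           # nothing critical within 5mm of the sheet edge
SHEET_WIDTH_MM = 210.0         # assumed A4 portrait sheet
SHEET_HEIGHT_MM = 297.0


@dataclass
class LabelPlacement:
    x_mm: float      # label centre, measured from the left edge
    y_mm: float      # label centre, measured from the top edge
    tilt_deg: float  # rotation off the horizontal


def check_placement(actual: LabelPlacement,
                    target_x_mm: float,
                    target_y_mm: float) -> List[str]:
    """Return a list of problems; an empty list means the placement is acceptable."""
    problems = []
    if (abs(actual.x_mm - target_x_mm) > POSITION_TOLERANCE_MM
            or abs(actual.y_mm - target_y_mm) > POSITION_TOLERANCE_MM):
        problems.append("label is more than 10mm from the target position")
    if abs(actual.tilt_deg) > MAX_TILT_DEG:
        problems.append("label is tilted by more than 20 degrees")
    nearest_edge = min(actual.x_mm, SHEET_WIDTH_MM - actual.x_mm,
                       actual.y_mm, SHEET_HEIGHT_MM - actual.y_mm)
    if nearest_edge < EDGE_MARGIN_MM:
        problems.append("label centre is within 5mm of the sheet edge")
    return problems


# A label 5mm off target and tilted by 5 degrees passes all three checks.
print(check_placement(LabelPlacement(180.0, 25.0, 5.0), 175.0, 20.0))
```

A real capture system would work from the detected outline of the label in the scanned image, not from a single centre point, so this should be read as a statement of the rules and not as a recognition algorithm.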



7.5

Colours, paper etc

• The best scanning performance occurs with the use of plain white A4 paper at a weight of 80gsm (standard office paper).

• The use of coloured stock can be useful in providing visual clues to users, so reducing risk and aiding identification, as long as the use of the correct colour can be assured. The tint should be slight (pastels) to maintain legibility in the paper and scanned forms; dark backgrounds and shading should be avoided.

• Black ink is preferred.

• Where forms need to span more than one sheet or to be folded, consideration needs to be given to maintaining them as a unit that does not fall apart during use, but is easily separated for scanning. Glue that remains tacky after separation contributes to a high rate of jams in scanning and must not be used.

8

Scanner specifications This section provides a brief introduction to the sizes and types of scanners that a company may use as part of introducing a scanning solution. A review of the types of paper to be scanned and volumes would identify the preferred or recommended scanners as part of a full investigation. Typical volume scanners recommended include:

• Bell & Howell Spectrum Series

• Kodak i-Series

• Canon DR series scanners

8.1

Recommended Scanners There are three levels of scanning that need to be considered:

• Low volume scanners for the purpose of desk top scanning of specific individual pages

• Medium volume scanners for the purpose of day forward or on demand scanning of daily received paperwork

• High volume scanners for the purpose of back file scanning of large volumes, where this is to be performed by the company

The following information is reproduced from the manufacturers’ publications. For definitive current information, the reader is directed to the manufacturers’ web sites.

8.1.1

Recommended Scanner for small scale users: Kodak i60 Scanner The Kodak i60 has a superb feeder and image enhancement capability, and is rated at 1,000 pages per day. It can also scan A3 documents, should this be required, and this, combined with the feeder and image enhancement, will allow most other document types to be scanned. It is fitted with 'ultrasonic multi-feed detection', which accurately tells the operator when more than one page feeds at a time, irrespective of paper length or thickness. It can stop the scanner on a multi-feed if required. The i60 scans in colour, greyscale, and black and white at up to 25 pages a minute. Resolution can be set in the range 75 to 600 dots per inch. This duplex model saves time by simultaneously capturing both sides of your documents at up to 50 images per minute. The shorter the paper path, the less likely that documents will jam. With the proven paper path design, you can count on thousands of hours of productive scanning. Thanks to the document feeder, you can finally use "reliable" and "automatic" in the same sentence. Load up to 75 sheets at a time, in a wide range of sizes and weights. A built-in flatbed handles your delicate onionskin and cumbersome bound pages. The rated duty cycle is up to 1000 pages a day. An illustration of the i60 Scanner is shown below:

Figure 12 - Kodak i60 Scanner

8.1.2

Recommended Scanner for Medium sized users: Kodak i280 Scanner The following features describe the faster i280 Scanner, which would be suitable for medium sized scanning situations: Exceptional speed, image quality, and flexibility. Fast scanning, automatic image processing and five output options mean you can scan up to 248 images per minute at 200 dpi. Choose from colour, bitonal, greyscale, simultaneous bitonal and greyscale or simultaneous bitonal and colour output. Perfect Page Scanning with iThresholding delivers clean, sharp images at full speed, dramatically reducing pre-sorting, re-scans, and post-image processing. Exclusive SurePath paper handling is designed for smooth paper transport and virtually jam-free operation. Electronic Colour Dropout allows you to optimize forms processing by removing irrelevant red, green, or blue background colour. Innovative options to meet more scanning needs: Dockable flatbed handles exception documents and offers space-saving flexibility (not included in indicative price). Post-scan imprinter lets you track documents after scanning (not included in indicative price).



Figure 13 - Kodak i280 Scanner

8.1.3

Recommended Scanner for Larger scanning situations: Bell & Howell 8080DB1-CE The following features describe the heavy duty, high volume Bell & Howell 8080DB1-CE scanner, which would be suitable for a more demanding situation: The indicative model is 8080DBI-CE, A4/A3, 65ppm, 100 to 400 dpi, SCSI i/f only, with Imprinter. Others are available within the range. The new features on Böwe Bell + Howell’s Copiscan 8000 Spectrum make production scanning a cost-effective option, even for scanning in colour. Here’s how: with Spectrum’s superior paper handling and image enhancement with VirtualReScan™ (VRS) from Kofax, documents can now be scanned the way they come to you, mixed together in every shape, size, and paper type. And now with Auto Colour Detect, colour documents scan in colour, bitonal in bitonal, all simultaneously on one system. If you scan in colour today, this can reduce the time needed for scanning by up to 60% over other methods. Even file sizes for your largest colour images, those with colour backgrounds, can be reduced by up to 40% with the Colour Background Saturation and Dropout feature. Imagine scanning in colour when you want it, bitonal when you don’t, all at maximum speed, minimal cost, and the quality to handle between 6,000 and 60,000 documents per day. Now you can. All Spectrum models handle up to 500 sheets of paper at a time, from 2.60" x 2.60" (66 mm x 66 mm) up to 11.70" x 40" (297 mm x 1016 mm), with optical resolution ranging from 100 to 400 dpi (dots per inch).

Figure 14 - B & H 8080DB-1 Scanner



9

Compression Techniques and Recommendations

9.1

Introduction to Compression When pages are scanned in, the format used to store the pages directly affects the amount of storage space required and the reproduction quality available. As a guide to compression and storage options, a test has been carried out on a variety of scans of sample documents to understand the impact of colour scanning on storage requirements and the impact of compression schemes on the legibility of different content types.
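As background to the storage figures discussed in this section, the size of an uncompressed scan can be estimated from the page size, the resolution and the bit depth. The sketch below assumes an A4 page of 8.27 x 11.69 inches, rows padded to whole bytes, and no file header overhead; the function name is illustrative and real scanner output will differ slightly.

```python
A4_INCHES = (8.27, 11.69)  # assumed page size (width, height) in inches


def raw_size_bytes(dpi: int, bits_per_pixel: int, page_inches=A4_INCHES) -> int:
    """Approximate size of an uncompressed page image, excluding file headers."""
    width_px = round(page_inches[0] * dpi)
    height_px = round(page_inches[1] * dpi)
    # Each row of pixels is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


for dpi in (200, 300, 400):
    colour_mb = raw_size_bytes(dpi, 24) / (1024 * 1024)
    bitonal_kb = raw_size_bytes(dpi, 1) / 1024
    print(f"{dpi} dpi: 24-bit colour ~{colour_mb:.1f} MB, bitonal ~{bitonal_kb:.0f} KB")
```

At 200 dpi this gives roughly 11 MB for 24-bit colour and a little under 500 KB for a bitonal image, which is of the same order as the uncompressed sizes measured in the tests that follow.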

9.2

Background

• Date of Tests: February 9th 2006

• Scanner: Bell & Howell Spectrum 8080

• Scanning Software: Kofax Ascent Capture 7.0

• Internal Processing Format: RAW (Uncompressed) TIFF (Batch Properties)

• JPEG compression quality set to 100 for Scan Sources (KSM Panel)

• Group IV & JPEG compressed TIFFs generated directly from scan software

• LZW & PDF outputs generated from RAW TIFF 24-bit colour scans (post-processed)

• All colour scanned at 24-bit / 16.8 million colours

9.3

Samples Five sample documents were chosen to represent a range of content types. These are displayed below with brief explanations for the choice made.

This document was chosen as an example of a detailed document chart with (arguably) cosmetic colour. The NHS logo (blue), warning message (red), and different pen inks used don’t necessarily convey any additional information. If storage cost becomes an issue and consideration is given to scanning more content as black and white, then this type of content is a candidate for black and white scanning. This image has a short code of “Cos” (Cosmetic) in the final results table.



This document was chosen as it is an example of a piece of content that is essentially black and white with a colour background. A variety of forms with different coloured backgrounds exist and this sample demonstrates the potential impact of scanning a colour document in black and white if the colour elements are not deemed meaningful or significant. This image has a short code of “BGC” in the final results table

This document was chosen to represent a situation where a black and white document has been annotated with coloured ink and that ink is considered to be significant in some way. This ‘extreme’ sample was created for the purposes of this exercise on a blank form. In addition to Blue, Red and Green a yellow highlighter was used on various areas of the form to test the ability to scan this very light low contrast colouring. This image has a short code of “Ano” (Annotation) in the final results table.



This image is taken from a magazine and was chosen as it shows a mixture of rich colour photographic detail with fine colour gradients alongside text. This combination of content represents a more significant challenge for a compression scheme. This image is included to represent the photographic images often found within brochures and other printed literature. This image has a short code of “Mag” (Magazine) in the final results table

This document was chosen to represent the fine detail of a clinical chart. This is an ECG with a red chart scale and black ink recording the values on the chart. This document will particularly test the ability of the compression schemes to retain fine detail while reducing content to small file sizes. This image has a short code of “ECG” in the final results table

9.4

Notes

• In the table below, some test combinations were not performed as they would not produce useful information or were simply inappropriate to the content type.

• Scans were performed at 200, 300 and 400dpi for comparative purposes.

• Black & white scans of some colour images were also performed for comparative purposes, to inform discussion of the cost difference between colour and black and white scanning.

• DjVu images are provided for comparative purposes only and do not conform to government e-GIF standards for image file formats. The DjVu viewer will be required to open these sample images.

9.5

Results The table below shows the file sizes of the scanned output combining the sample documents detailed above with various file formats and compression schemes.

Sample columns: Cos = Office Doc Cosmetic Colour; Ano = Office Doc Colour Annotation; BGC = Office Doc Coloured Background; ECG = Clinical Doc Fine Detail; Mag = Colour Photograph.

Colour scans

Output Format  Compression Scheme  Resolution  Cos      Ano      BGC      ECG      Mag      Average Size
TIFF           None (RAW)          200dpi      11241K   11201K   11280K   10623K   11009K   11071K
TIFF           None (RAW)          300dpi      24502K   25286K   25460K   23949K   24821K   24804K
TIFF           None (RAW)          400dpi      45179K   44946K   45257K   42617K   44170K   44434K
TIFF           JPEG 100%           200dpi      634K     547K     529K     1063K    604K     675K
TIFF           JPEG 100%           300dpi      1326K    1053K    1040K    2188K    1157K    1353K
TIFF           JPEG 100%           400dpi      1990K    1576K    1649K    3343K    1901K    2092K
TIFF           LZW                 200dpi      4532K    3322K    4653K    6603K    1943K    4211K
TIFF           LZW                 300dpi      8725K    6427K    9946K    18401K   10469K   10794K
TIFF           LZW                 400dpi      14252K   10142K   17012K   30166K   17706K   17856K
PDF            LuraTech            300dpi      88K      128K     73K      282K     101K     134K
DjVu           DjVu                300dpi      96K      52K      72K      254K     65K      108K

Black and white scans (only some samples were scanned; sizes are listed in source order, and the sample to which each size belongs is not recoverable)

Output Format  Compression Scheme  Resolution  Sample sizes                  Average Size
TIFF           None (RAW)          200dpi      474K, 471K, 474K              473K
TIFF           None (RAW)          300dpi      1064K, 1058K, 1066K, 1040K    1057K
TIFF           None (RAW)          400dpi      1901K, 1880K, 1894K, 1888K    -
TIFF           Group IV            200dpi      68K, 101K, 34K                68K
TIFF           Group IV            300dpi      210K, 176K, 52K, 152K         148K
TIFF           Group IV            400dpi      398K, 255K, 70K               241K

The 400dpi colour rows for PDF (LuraTech) and DjVu have no values that can be placed with confidence. Four further sizes (408K, 408K, 396K, 396K) appear in the source without a recoverable row or column.


The bar chart below shows the 300dpi colour scan average file sizes for the different compression schemes side by side.

[Bar chart: 300dpi Colour Scan Comparison. Average file size in kilobytes by compression scheme: None (RAW) 24804; LZW 10794; JPEG 100% 1353; LuraTech 134; DjVu 108.]

9.6

Findings

• The type of content scanned makes almost no difference to the size of the uncompressed scan files (colour and black & white), with 200, 300 and 400 dpi colour scans requiring about 11, 24 and 44MB per image respectively.

• LZW compression resulted in an average 57% reduction in file size. This is good but not excellent, as a 300dpi colour A4 scan still requires over 10MB on average. This is a reflection of the age of this compression scheme and its generality; the other schemes tested here are specifically designed to compress images.

• JPEG compression resulted in an average 95% reduction in file size. This is excellent, but still results in a significant file size increase versus a black and white Group IV scan: at 300dpi the JPEG compressed colour TIFF requires 1353K on average versus 148K for black and white. Clearly more information is stored, but the file (and therefore the storage cost) is roughly nine times the size.

• The LuraTech PDF compressor achieved even more significant file size reduction, averaging 99.5% compression and resulting in file sizes comparable to a traditional Group IV black and white TIFF scan. The implication is that this level of compression requires no storage cost premium for scanning in colour, though of course the cost of the compression software must be factored in. For this scheme PDF must be used as the container format. The scheme utilises a variety of recent compression techniques (JBIG2, Wavelet etc.) which are valid with PDF (and produce perfectly standard PDF output) but are not part of the (now ageing) TIFF standard.

• The DjVu format provides the greatest compression of all, achieving a slightly better ratio than even the LuraTech compressor (99.6%). However, this is at the cost of universal compatibility: very few desktops are likely to have the required software to view this file format, and the 0.1% gain does not seem significant enough to warrant overcoming this hurdle. In addition, DjVu is not an acceptable format within the government e-GIF technical standards.

• Lower scan colour bit depths (e.g. 8-bit) are unavailable. Most scanner drivers do not support this, as post processing and compression reduce palette ranges appropriately. The goal should always be to acquire the most information possible during the physical paper scan, as this is the most expensive element of the process to repeat.

• These tests say nothing about response times for display of images; this is about colour compression comparison only. Lower specification PCs may struggle to manipulate uncompressed colour TIFF images.
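The percentage reductions quoted in the findings can be reproduced from the 300dpi average file sizes reported in the results. The sketch below uses those averages (in kilobytes) and the 148K average for the 300dpi Group IV black and white scans; the variable and function names are illustrative.

```python
# Average 300dpi colour file sizes in kilobytes, taken from the results.
AVERAGE_300DPI_COLOUR_KB = {
    "None (RAW)": 24804,
    "LZW": 10794,
    "JPEG 100%": 1353,
    "LuraTech": 134,
    "DjVu": 108,
}
GROUP_IV_300DPI_BW_KB = 148  # average 300dpi black and white Group IV scan


def reduction_percent(raw_kb: float, compressed_kb: float) -> float:
    """Percentage by which the compressed file is smaller than the raw file."""
    return 100.0 * (1.0 - compressed_kb / raw_kb)


raw_kb = AVERAGE_300DPI_COLOUR_KB["None (RAW)"]
for scheme, size_kb in AVERAGE_300DPI_COLOUR_KB.items():
    if scheme == "None (RAW)":
        continue
    print(f"{scheme}: {reduction_percent(raw_kb, size_kb):.1f}% smaller than RAW")

# Size of a JPEG colour scan relative to a Group IV black and white scan.
jpeg_vs_bw = AVERAGE_300DPI_COLOUR_KB["JPEG 100%"] / GROUP_IV_300DPI_BW_KB
print(f"JPEG colour is {jpeg_vs_bw:.1f} times the size of Group IV black and white")
```

On these figures LZW comes out at about 56.5%, JPEG at about 94.5%, LuraTech at about 99.5% and DjVu at about 99.6%, and the JPEG colour scan is a little over nine times the size of the black and white scan.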



10

About Touchstone Systems Touchstone is a UK based independent freelancing and consultancy company specialising in data migration, data warehousing, report, application and health systems. Touchstone Systems has provided document scanning and data capture from documents over a wide range of projects over many years. We have extensive experience of multi page scanning, form scanning, and data capture from scanning. All our data scanning solutions are designed to provide rapid scanning and clean data capture, with the minimum of human assistance but with the highest possible quality of data captured. All of our work, including supplied products and contractor/consultancy resources, is fully covered by our 100% satisfaction guarantee. We can provide contractors and consultants with working knowledge in many areas of scanning, data processing and application design including:

• Scanning to image (PDF, TIFF, JPG etc)

• Scanning for data capture from forms

• Application development for data capture, including handheld devices, browser forms and fat client applications

• Database design for data capture and analysis, including migration of existing system data

• Integration with other systems, including data from scanning into other databases, and key to image retrieval

• Health Document Capture (EDM, scanning, etc)

For further information please contact: Touchstone Systems Ltd Nyumbani House 15 Belmont Heights Hatch Warren Basingstoke Hampshire RG22 4RW Telephone: +44 0845 434 8949

www.touchstone-systems.co.uk
