Noise Removal and Compression of Document Images
Aditya Zutshi

ABSTRACT
In this paper, I discuss the solution I provided, while working in the Multimedia Research Group at TCS Innovation Labs Delhi, to improve the compression and quality of document images for a pharmacy application built by TCS for one of its clients, which runs one of the largest pharmacy chains in the United States. The application involved scanning and storing hard-copy medical prescriptions. The prescription images generated by TCS's implementation were significantly larger than those generated by the current (legacy) system. This was a great concern to the client because it increased the required database size (on the order of terabytes of storage), and the client paid for storage by disk usage. Our solution performed better than the legacy system and was accepted by the client. It was also acknowledged by the CIO of the company at the TCS Summit. The solution is generic and can be applied to any document image. The code is implemented in Java and uses the Java Advanced Imaging API. It can easily be altered to save images in other commonly available formats.

1. INTRODUCTION
Tata Consultancy Services had built a pharmacy application for one of the largest pharmacy chains in the United States. The pharmacy chain sells prescription drugs and a wide assortment of general merchandise, including over-the-counter drugs, beauty products and cosmetics, film and photo finishing services, seasonal merchandise, greeting cards and convenience items. The application involved scanning and storing hard-copy medical prescriptions. The prescription images generated by TCS's JSane implementation (with TIFF G4 compression) were about 30% larger than those generated by the current (legacy) system. In two of the client's real stores, the average image size of prescriptions was found to be 87 KB as against an average size of 15 KB produced by the legacy system. The legacy system used a utility created by Fujitsu. This tool was built upon a TWAIN driver, used G4 compression and saved images in TIFF format. Because of some business constraints, the client wanted to switch from their existing tool to the TCS pharmacy application. The larger images were a great concern to the client because they increased the required database size (on the order of terabytes of storage), and the client paid for storage by disk usage.

The TCS Product Team for this client contacted Innovation Labs Delhi through the Chief Technology Officer, TCS, to solve this problem. A solution was urgently needed because the customer was considering consulting external companies to solve the noise problem. TCS Innovation Labs Delhi was able to provide the TCS Product Team with a solution offering better compression and image quality within 60 days. This solution was accepted by the client and appreciated at the TCS Summit.

2. DIGITAL IMAGE
An image (from Latin imago) is an artifact, usually two-dimensional (a picture), that has a similar appearance to some subject, usually a physical object or a person. An image defined in the "real world" is considered to be a function of two real variables, for example, a(x,y) with a as the amplitude (e.g. brightness) of the image at the real coordinate position (x,y). The 2D continuous image a(x,y) is divided into N rows and M columns. The intersection of a row and a column is termed a pixel. The value assigned to every pixel is the average brightness in the pixel rounded to the nearest integer value. The process of representing the amplitude of the 2D signal at a given coordinate as an integer value with L different gray levels is usually referred to as amplitude quantization or simply quantization.

The quality of an image depends on its resolution. DPI denotes dots (pixels) per inch and is the resolution of the medium: a description of how many of the smallest 'bits' of an image can be represented in 1 inch (horizontal or vertical) on a monitor or paper. The number of pixels in an image depends on DPI and image size. For example, a 6" x 8" image at 200 DPI contains 6 x 200 x 8 x 200 = 1.92 million pixels. Each color pixel takes 3 bytes, so a 6" x 8" color image at 200 DPI takes 1.92 x 3 = 5.76 MB of memory to store.

In computing, a grayscale digital image is an image in which the value of each pixel is a single sample, that is, it carries only intensity information. Images of this sort are composed exclusively of shades of gray varying from black at the weakest intensity to white at the strongest. For a grayscale image, the number of distinct gray levels is usually a power of 2, that is, L = 2^B where B is the number of bits in the binary representation of the brightness levels. When B > 1 we speak of a gray-level image; when B = 1 we speak of a binary image.
In a binary image there are just two gray levels for each pixel, which can be referred to, for example, as "black" and "white" or "0" and "1". A (digital) color image includes color information for each pixel. For visually acceptable results, it is necessary (and almost sufficient) to provide three samples (color channels) for each pixel, which are interpreted as coordinates in some color space. The RGB color space is commonly used in computer displays, but other spaces such as YCbCr and HSV are often used in other contexts.
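The storage arithmetic in the 6" x 8" example can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names (PixelBudget, pixelCount, rawBytes) are hypothetical and not part of the original code, and the figures are raw, uncompressed sizes.

```java
// Illustrative sketch only: names are hypothetical, sizes are uncompressed.
final class PixelBudget {

    /** Number of pixels for a page of the given size (in inches) scanned at the given DPI. */
    static long pixelCount(int widthInches, int heightInches, int dpi) {
        return (long) widthInches * dpi * heightInches * dpi;
    }

    /** Uncompressed size in bytes for the given number of bits stored per pixel. */
    static long rawBytes(long pixels, int bitsPerPixel) {
        return (pixels * bitsPerPixel + 7) / 8; // round up to whole bytes
    }

    public static void main(String[] args) {
        long pixels = pixelCount(6, 8, 200);
        System.out.println("Pixels:                 " + pixels);
        System.out.println("24-bit color (bytes):   " + rawBytes(pixels, 24));
        System.out.println("8-bit grayscale (bytes): " + rawBytes(pixels, 8));
        System.out.println("1-bit binary (bytes):   " + rawBytes(pixels, 1));
    }
}
```

The same page needs 24 times less raw storage as a binary image than as a 24-bit color image, which is one reason document scans are usually binarized before compression.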
3. IMAGE FORMATS
The three most common image file formats, and the most important for general purposes today, are JPEG, PNG and TIFF.

JPEG (Joint Photographic Experts Group; .JPG file extension, pronounced "jay-peg"), often compressed to perhaps only 1/10 of the size of the original data, is the right format for photo images that must be very small files, for example for web sites or email. However, this compression efficiency comes at a high price: JPEG uses lossy compression, meaning that some image quality is lost when the data is compressed and saved, and this quality can never be recovered.

PNG supports a large set of technical features, including superior lossless compression based on LZ77. Compression in PNG is called the ZIP method and is like the "deflate" method in PKZIP (and is royalty free). PNG incorporates special preprocessing filters that can greatly improve the lossless compression efficiency, especially for typical gradient data found in 24-bit photographic images. This filter preprocessing causes PNG to be a little slower than other formats when reading or writing the file. PNG has additional features, such as an alpha channel for a variable transparency mask. PNG files may also contain an embedded gamma value so that the image brightness can be viewed properly on both Windows and Macintosh screens.

TIFF is a flexible format with many options. New types are easy to invent, and this versatility can cause incompatibility, but almost any program will handle the standard TIFF types that we might encounter. TIFF can store data with bytes in either PC or Mac order (Intel and Motorola CPU chips differ in this way). Several compression formats are used with TIFF. TIFF with G4 compression is the universal standard for fax and multi-page line art documents.
4. ANALYSIS OF THE PROBLEM
To begin with, we asked for some of the prescriptions generated by TCS's implementation and started analysing them. We were not sure whether TIFF G4 was the most efficient format for saving these medical prescriptions, so we explored other commonly available image formats: JPEG, PNG and TIFF. Of these three formats, we were quickly able to rule out JPEG. With JPEG compression the document became unreadable, and we concluded that JPEG compression is not suitable for document images. Deciding between TIFF and PNG was more difficult. We took a representative sample of the client's image database, compressed all images in both PNG and TIFF, and did a comparative analysis of the two formats. Table 1 shows some of the prescriptions we took as a sample.

Table 1: Comparison of PNG and TIFF Formats

Image Name               Noise Level            Size of TIFF G4 (KB)   Size of PNG (KB)
Image 1                  Low                    24.9                   40.5
Image 2                  Low                    6.79                   10.2
Image 3                  High-Texture Noise     44.2                   33.4
Image 4                  Low                    9.44                   16
Image 5                  High-Texture Noise     19.4                   18.5
Image 6                  High-Texture Noise     173                    87
Image 7                  Medium-Random Noise    29.5                   45.8
Image 8                  High-Texture Noise     104                    67.8
Image 9                  High-Texture Noise     46.5                   38.5
Image 10                 Low                    15                     24
Average File Size (KB)                          47.273                 38.17

From the above experiment we concluded the following:
1. In case of more noise or texture in the image, PNG gives higher compression compared to TIFF G4.
2. In case of less noise, TIFF G4 gives higher compression compared to PNG.

After this experiment, we faced two important questions that had to be answered before we could decide which image format to use for saving the medical prescriptions:
1. In general, do stores running TCS's implementation have more prescriptions with texture than stores running the legacy system?
2. In stores running TCS's implementation, which are more numerous: prescriptions with texture (usually bigger in size) or prescriptions without texture (usually smaller in size)?

It was extremely difficult for the onsite team to provide us with data for the first question because the database of the legacy system was managed by the client team. For the second question, we found that a larger share of images contained little noise and a small share contained a lot of noise. Some images were as large as 500 KB, leading to a significant increase in the average file size of the scanned medical prescriptions.

Considering the results of our experiments and academic study, we finally decided to go with TIFF G4 compression for three major reasons:
1. TIFF G4 gives more compression for images with less noise. A larger share of images contained little noise while a smaller share contained a lot of noise. This made TIFF G4 better suited to the case.
2. For the images with noise, we could develop an image cleaning module. After image cleaning, TIFF G4 would provide better compression.
3. The client's legacy system was already saving scanned prescriptions as TIFF G4. Saving in the same format would make the solution easier for the client to accept.

Another important hypothesis we had to confirm was whether the larger file size was caused by random noise or by something else. For this we took a sample of scanned prescription images with a lot of noise and de-noised them using trial versions of commercially available tools. We compared the size of the images before and after cleaning and found a significant difference in file size. This hypothesis is also supported by information theory: randomness in the data adds information and therefore requires more storage. Moreover, comparing different prescriptions of the same image size but varying in noise, we observed that the noisy images had a much higher file size than the cleaner images (Figure 1).
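The link between noise and file size can be illustrated with a small sketch. G4 (CCITT T.6) works by coding the positions at which pixels change between white and black along each scan line, so the number of such changes is a rough proxy for how much there is to encode. The code below is not a G4 encoder and is not taken from the original implementation; the class name, the synthetic page and the fixed-stride speckle pattern are all assumptions made for illustration.

```java
// Illustrative sketch only: counts black/white changes as a rough proxy for
// the amount of data a transition-based coder such as G4 has to encode.
final class TransitionCount {

    /** Counts black/white changes along each row of a binary image (true = black). */
    static int transitions(boolean[][] image) {
        int count = 0;
        for (boolean[] row : image) {
            for (int x = 1; x < row.length; x++) {
                if (row[x] != row[x - 1]) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Synthetic clean page: white background with a black bar in columns 10..19 of every row. */
    static boolean[][] cleanPage(int height, int width) {
        boolean[][] page = new boolean[height][width];
        for (boolean[] row : page) {
            for (int x = 10; x < 20 && x < width; x++) {
                row[x] = true;
            }
        }
        return page;
    }

    /** Returns a copy with isolated pixels flipped at a fixed stride, a stand-in for speckle noise. */
    static boolean[][] withSpeckles(boolean[][] page, int stride) {
        boolean[][] noisy = new boolean[page.length][];
        for (int y = 0; y < page.length; y++) {
            noisy[y] = page[y].clone();
            for (int x = 25 + (y % stride); x < noisy[y].length - 1; x += stride) {
                noisy[y][x] = !noisy[y][x];
            }
        }
        return noisy;
    }

    public static void main(String[] args) {
        boolean[][] clean = cleanPage(100, 200);
        boolean[][] noisy = withSpeckles(clean, 7);
        System.out.println("Transitions, clean page: " + transitions(clean));
        System.out.println("Transitions, noisy page: " + transitions(noisy));
    }
}
```

Every isolated speck adds two extra transitions to its row, so even a light sprinkling of noise multiplies the number of events the coder has to describe, which is consistent with the size differences observed between noisy and clean prescriptions.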
Figure 1: Impact of noise on image size

With this we concluded that noise was the major reason for the higher file size of the scanned medical prescriptions, and that preventing or cleaning the noise would help reduce the file size. We were not yet clear about the cause of the noise in the scanned prescriptions. We narrowed it down to four possible causes:
1. Noise introduced during scanning of the medical prescription due to dirt in the scanner.
2. Noise introduced during scanning of the medical prescription due to an uneven prescription surface.
3. Noise introduced due to improper settings and configuration of the scanner.
4. Noise introduced after scanning and prior to saving of the image on disk.

After our talks with the onsite team, we confirmed that the noise problem was present at all locations, which made it quite unlikely that the noise was due to dirt in the scanner or to the uneven surface of the medical prescriptions. We also suggested some changes in the settings and configuration to the offshore team, but no significant improvement was observed in the quality and size of the scanned medical prescriptions. The only remaining possible cause was noise being introduced after scanning and prior to saving. The offshore team informed us of the various image manipulations done in their system between scanning and saving the image to disk. In TCS's implementation, the scanned color image was converted into a binary image, auto-cropped and then saved as TIFF G4 on disk. We asked the offshore team to provide us with the image output after every image manipulation. From the results, we observed that noise was being introduced when the image was binarized from color to black and white. Images with background texture picked up more noise during binarization than images without background texture. Having found this, we had two problems to solve:
1. Reduce the noise due to background texture by proper selection of the threshold.
2. De-noise the scanned prescription to remove the left-over noise.

We observed that, irrespective of the background texture, image cleaning was required. The noise due to texture after thresholding was of a random nature. We used the Java Advanced Imaging API to calculate the optimum threshold value using the different auto-thresholding techniques available and observed that the Maximum Variance Threshold gave the best results. Applying this auto-threshold algorithm significantly reduced the noise introduced by thresholding.

The next problem we had to tackle was the removal of the noise that remained even after the application of the auto-threshold. We tried some commercially available tools to remove the noise but did not find the results convincing. After some theoretical study and analysis, we started designing our own filters in Java Advanced Imaging. We observed that a Median Filter of size 3x3 gave the best results. We applied the Median Filter to the binarized image to remove the noise. We repeated this set of experiments over around 30 images and found the results convincing, so we decided to go with the Median Filter. Our next consideration was the slight loss of fine edges after application of the Median Filter to the binarized image. At times the handwriting became unreadable, which was not acceptable to the client. To set that right, we tuned our algorithm and applied the Median Filter before thresholding to remove most of the texture noise. This experimental algorithm solved the readability problem.
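A 3x3 median filter of the kind described above can be sketched in plain Java. The original work used filters built with Java Advanced Imaging; the version below is a stand-alone approximation for illustration, and its class name, array representation (one int per pixel, 0 to 255) and border handling are assumptions rather than details from the original code.

```java
import java.util.Arrays;

// Illustrative sketch only: a plain-Java 3x3 median filter on a grayscale array.
final class MedianFilter3x3 {

    /**
     * Returns a filtered copy of a grayscale image. Each interior pixel is
     * replaced by the median of its 3x3 neighbourhood; border pixels are
     * copied unchanged (an assumption made here for simplicity).
     */
    static int[][] apply(int[][] gray) {
        int height = gray.length;
        if (height == 0) {
            return new int[0][];
        }
        int width = gray[0].length;
        int[][] out = new int[height][];
        for (int y = 0; y < height; y++) {
            out[y] = gray[y].clone();
        }
        int[] window = new int[9];
        for (int y = 1; y < height - 1; y++) {
            for (int x = 1; x < width - 1; x++) {
                int k = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        window[k++] = gray[y + dy][x + dx];
                    }
                }
                Arrays.sort(window);
                out[y][x] = window[4]; // middle of the nine sorted values
            }
        }
        return out;
    }
}
```

Because the median ignores outliers, an isolated dark speck on light paper is replaced by the paper value, while strokes wider than a pixel or two largely survive. Running the filter on the grayscale image rather than on the binarized one leaves the thresholding step to decide the fate of thin strokes, which matches the ordering the text settles on.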
5. Solution
Figure 2 shows the block diagram of the TCS implementation integrated with our solution. The medical prescription is scanned by the scanner as a grayscale image. Image de-noising algorithms are applied to this image to filter it. Using a fast auto-threshold algorithm, the optimum threshold is calculated for the image, and the grayscale image is binarized using this threshold value. The binarized document image is then compressed using the TIFF G4 compression algorithm and saved on disk.
Figure 2: Block diagram of TCS Implementation integrated with our solution.
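The thresholding step of the pipeline can be sketched as follows. A maximum variance threshold picks the gray level that maximises the variance between the two resulting classes (commonly known as Otsu's method). Java Advanced Imaging exposes such a computation through its Histogram class (getMaxVarianceThreshold); whether the original code called exactly that method is an assumption, and the plain-Java version below, including its class and method names, is only an illustration that avoids the JAI dependency.

```java
// Illustrative sketch only: maximum between-class variance threshold and
// binarization for a grayscale image stored as int values in 0..255.
final class MaxVarianceThreshold {

    /** Returns the gray level t that maximises between-class variance for classes {<= t} and {> t}. */
    static int compute(int[][] gray) {
        long[] histogram = new long[256];
        long total = 0;
        for (int[] row : gray) {
            for (int value : row) {
                histogram[value]++; // assumes every value lies in 0..255
                total++;
            }
        }
        double sumAll = 0;
        for (int level = 0; level < 256; level++) {
            sumAll += (double) level * histogram[level];
        }
        long countLow = 0;
        double sumLow = 0;
        double bestVariance = -1;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            countLow += histogram[t];
            sumLow += (double) t * histogram[t];
            long countHigh = total - countLow;
            if (countLow == 0 || countHigh == 0) {
                continue; // one class is empty, variance is undefined
            }
            double meanLow = sumLow / countLow;
            double meanHigh = (sumAll - sumLow) / countHigh;
            double diff = meanLow - meanHigh;
            // Proportional to the between-class variance (constant factor dropped).
            double variance = (double) countLow * countHigh * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    /** Binarizes the image: true (black) where the pixel is at or below the threshold. */
    static boolean[][] binarize(int[][] gray, int threshold) {
        boolean[][] black = new boolean[gray.length][];
        for (int y = 0; y < gray.length; y++) {
            black[y] = new boolean[gray[y].length];
            for (int x = 0; x < gray[y].length; x++) {
                black[y][x] = gray[y][x] <= threshold;
            }
        }
        return black;
    }
}
```

In the order shown in the block diagram, the grayscale scan would first be de-noised, then passed to compute and binarize, and the resulting binary image handed to the G4 encoder.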
6. Result
After our solution was applied to TCS's implementation of the pharmacy application, there was a significant reduction in image size and a significant improvement in the quality of the scanned prescriptions. A side-by-side comparison of the legacy system and the TCS implementation with our solution showed that, with our solution, the average image size was 16 KB as against 16.25 KB for the same images from the legacy system (Table 2). The scanned prescriptions also had much better readability after our solution was applied (Figure 3 and Figure 4). Our solution was accepted by the client and was mentioned by the CIO of the company at the TCS Summit.
Table 2: A comparison between file sizes of images saved by the legacy system and those saved by the TCS implementation with our solution.

Image Name               File Size (KB) from Legacy Tool   File Size (KB) from TCS Implementation With Our Solution
Image 1                  12                                11
Image 2                  56                                38
Image 3                  20                                20
Image 4                  13                                14
Image 5                  18                                18
Image 6                  8                                 8
Image 7                  6                                 7
Image 8                  12                                16
Image 9                  6                                 7
Image 10                 7                                 9
Image 11                 10                                10
Image 12                 12                                13
Image 13                 11                                12
Image 14                 30                                22
Image 15                 11                                13
Image 16                 17                                16
Image 17                 11                                9
Image 18                 10                                10
Image 19                 12                                15
Image 20                 32                                17
Image 21                 14                                14
Image 22                 12                                14
Image 23                 42                                39
Image 24                 8                                 10
Image 25                 11                                13
Image 26                 12                                15
Image 27                 11                                13
Image 28                 9                                 15
Image 29                 53                                54
Image 30                 8                                 11
Image 31                 13                                15
Image 32                 13                                14
Average File Size (KB)   16.25                             16
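The two averages reported for Table 2 can be reproduced from the 32 per-image values. The snippet below is only a convenience check with a hypothetical class name; the numbers are copied from the table.

```java
// Illustrative sketch only: recomputes the Table 2 averages from the listed values.
final class Table2Averages {

    static final int[] LEGACY_KB = {
        12, 56, 20, 13, 18, 8, 6, 12, 6, 7, 10, 12, 11, 30, 11, 17,
        11, 10, 12, 32, 14, 12, 42, 8, 11, 12, 11, 9, 53, 8, 13, 13
    };

    static final int[] SOLUTION_KB = {
        11, 38, 20, 14, 18, 8, 7, 16, 7, 9, 10, 13, 12, 22, 13, 16,
        9, 10, 15, 17, 14, 14, 39, 10, 13, 15, 13, 15, 54, 11, 15, 14
    };

    /** Arithmetic mean of the given file sizes in KB. */
    static double average(int[] sizesKb) {
        long sum = 0;
        for (int size : sizesKb) {
            sum += size;
        }
        return (double) sum / sizesKb.length;
    }

    public static void main(String[] args) {
        System.out.println("Legacy tool average (KB):  " + average(LEGACY_KB));
        System.out.println("Our solution average (KB): " + average(SOLUTION_KB));
    }
}
```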
Figure 3: A sample noisy medical prescription saved by TCS Implementation prior to our solution. It took a disk space of approximately 82 KB.
Figure 4: After the application of our solution, the noise was significantly reduced, readability was enhanced and disk storage requirement was reduced from 82 KB to approximately 14 KB.
7. Scope of Solution and Future Work
Our algorithm is generic and works for any document image. The solution is implemented in Java and uses the Java Advanced Imaging API. The algorithm is time efficient and takes a couple of hundred milliseconds to process an image. Future work includes adding de-skewing and intelligent cropping, and integrating JBIG compression into our solution.
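For readers who want to experiment with the final compression step without Java Advanced Imaging, the sketch below encodes a binary image as TIFF with G4 compression using the TIFF writer that ships with the standard javax.imageio package in Java 9 and later, where G4 is selected with the compression type name "CCITT T.6". This is an assumption-laden illustration rather than the original code: the class and method names are hypothetical, and it writes to memory instead of disk.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Illustrative sketch only: assumes Java 9+ (built-in TIFF plugin for ImageIO).
final class G4TiffEncoder {

    /** Converts a boolean mask (true = black) into a 1-bit BufferedImage. */
    static BufferedImage toBinaryImage(boolean[][] black) {
        int height = black.length;
        int width = black[0].length;
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                image.setRGB(x, y, black[y][x] ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return image;
    }

    /** Encodes a 1-bit image as a TIFF with CCITT T.6 (G4) compression and returns the bytes. */
    static byte[] encode(BufferedImage binary) {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        if (!writers.hasNext()) {
            throw new IllegalStateException("No TIFF writer available (Java 9 or later is assumed)");
        }
        ImageWriter writer = writers.next();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType("CCITT T.6"); // G4
            writer.setOutput(out);
            writer.write(null, new IIOImage(binary, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }
}
```

Switching the output format, as mentioned in the abstract, would amount to requesting a different writer and compression type; the bytes returned here could equally be written to a file.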
8. Acknowledgement
I would like to thank Mr. K Ananth Krishnan, Chief Technology Officer, Tata Consultancy Services, for giving me this opportunity. I would also like to acknowledge the support offered by Dr. Hiranmay Ghosh, Head of the Multimedia Research Group, TCS Innovation Labs Delhi, and by the offshore and onsite project teams developing the pharmacy application for the client. They actively provided us with all the information and data, without which it would not have been possible to complete this project.