Spring’09 EE669-HW1 Multimedia data compression Neha Rathore-5994499980
[LOSSLESS DATA COMPRESSION] Study of various lossless data compression techniques including Huffman, adaptive Huffman, Lempel-Ziv and run-length coding.
EE669 Multimedia Data Compression HW1-LOSSLESS COMPRESSION PROBLEM 1:
Huffman Coding: Definition
Huffman coding is a procedure based on two observations regarding optimum prefix codes¹:
1. In an optimum code, symbols that occur more frequently (have a higher probability of occurrence) have shorter code words than symbols that occur less frequently.
2. In an optimum code, the two symbols that occur least frequently have code words of the same length.
3. No codeword for a symbol can be a prefix of the codeword for any other symbol.
This is logical because, if the symbols that occur more often had code words longer than those that occur less often, the average number of bits per symbol would be larger. Hence, for a code to be optimum, it should assign longer code words to symbols that occur less often, i.e. have the lowest probability. Huffman encoding exploits the redundancy in the statistical ordering of the data and uses this a priori knowledge to achieve compression. It is a variable-length coding technique in which the length of a code is inversely related to the probability of occurrence of its symbol.
Entropy Entropy is a measure of the average number of binary symbols needed to code the output of the source or the average self-information associated with the symbol set. Entropy uses the probability distribution of the symbol set and the information contained in each symbol occurrence. Entropy of a symbol set is given by:
H(S) = -Σ P(si)·log2 P(si), for i = 1 to K.
Average Length of Codes:
1. The length of a Huffman code depends on a number of things, including the size of the symbol set and the probabilities of the individual symbols.
2. The average codeword length L’ of an optimal code for a source S is greater than or equal to the entropy of the source.
¹ Chapter 3: Huffman Coding – Introduction to Data Compression, 3rd edition, Khalid Sayood.
For a source S with symbol set {s1, s2, …, sK} and probability model {P(s1), P(s2), …, P(sK)}, the average codeword length is given by
L’ = Σ P(si)·Li, for i = 1 to K.
OPTIMAL CODE CONDITION
The difference between the entropy and the average code length is given by:
H(S) - L’ = -Σ P(si)·log2 P(si) - Σ P(si)·Li, for i = 1 to K;
H(S) - L’ ≤ log2( Σ 2^(-Li) ), for i = 1 to K.
Since a Huffman code is a prefix code, the Kraft inequality gives Σ 2^(-Li) ≤ 1 and hence H(S) - L’ ≤ 0.
SYMBOL SET: We are working here with symbols of length 1 byte (8 bits); hence the symbol set used in this case is 0-255. The choice of symbol set is a very crucial step in the design of an efficient encoder or decoder. The larger the symbol set, the lower the probability of symbols being repeated; this in turn decreases the efficiency of the compression algorithm, since the redundancy introduced by repetitive symbols is also reduced. On the other hand, a very small symbol set leads to too many occurrences of each symbol; the extreme case is to treat each bit as a symbol, which would not help compression at all. Hence we take a 1-byte symbol, which is neither too short nor too long. It also gives us more flexibility, as most programming languages work conveniently on 1-byte units.
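To make the quantities above concrete, the fragment below computes the empirical entropy H(S) and, given a table of code lengths, the average codeword length L’ from a 256-entry byte-count histogram. It is only an illustration of the formulas; the function and variable names are ours, not those of huff.c.

#include <math.h>

/* Empirical entropy (bits/symbol) of a 256-symbol source given its counts. */
double entropy(const unsigned long counts[256], unsigned long total)
{
    double h = 0.0;
    for (int i = 0; i < 256; i++) {
        if (counts[i] == 0)
            continue;                         /* symbols that never occur add nothing */
        double p = (double) counts[i] / (double) total;
        h -= p * log2(p);                     /* H(S) = -sum P(si) log2 P(si) */
    }
    return h;
}

/* Average codeword length L' = sum P(si) * Li, with code lengths Li in bits. */
double average_length(const unsigned long counts[256], unsigned long total,
                      const int code_bits[256])
{
    double l = 0.0;
    for (int i = 0; i < 256; i++)
        l += ((double) counts[i] / (double) total) * code_bits[i];
    return l;
}

The gap between these two quantities is what appears as the DIFFERENCE column in the results table below.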
Encoding with Global Statistics
Algorithm
With global statistics we look at the whole file to determine the symbol set and the probability of occurrence of each symbol with respect to the whole-file statistics. The relative frequency of occurrence of each symbol in the file represents the probability of that symbol, and the information contained in a symbol is inversely related to its probability of occurrence. To design the Huffman code:
Method 1
• We first sort the symbols in descending order of their probability. For a symbol si, the code word for this symbol is denoted c(si).
• The two symbols with the lowest probability are, say, ai and ai-1.
• We assign them code words α1*0 and α1*1, where α1 is a binary string that serves as the codeword for the combined symbol s’ of the two least probable symbols, and * denotes concatenation.
• We define a new symbol s’ and a new symbol set with (n-1) symbols, where n is the original number of symbols in the symbol set.
• We sort this new symbol set and find its two least probable symbols. In the second iteration we again assign code words α2*0 and α2*1, where α2 is the codeword for the symbol combining the two least probable symbols of that iteration.
• We continue until we reach the last symbols in the set, where the final α is simply 0 or 1.
• As we move back up, we find the relations between the αi of the different levels and are thus able to generate the complete code words.
Method 2²
We use the fact that the Huffman code is a prefix code and can be represented as a binary tree in which the external nodes, or leaves, correspond to the symbols. The Huffman code for any given symbol can be obtained by traversing the tree from the root node to the leaf corresponding to that symbol, adding a 0 to the codeword every time the traversal takes an upper branch and a 1 otherwise.
• We build the binary tree starting at the leaf nodes. We know that the codewords for the two symbols with the smallest probabilities are identical except for the last bit. This means that the traversal from the root node to the leaves corresponding to these two symbols must be the same except for the last step, which in turn means that the leaves corresponding to the two symbols with the lowest probabilities are offspring of the same node.
• Once we have connected the leaves corresponding to the two lowest-probability symbols to a single node, we treat this node as a symbol of a reduced symbol set. The probability of this symbol is the sum of the probabilities of its offspring.
• We can now sort the nodes corresponding to the reduced symbol set and apply the same rule to generate a parent node for the nodes corresponding to the two symbols of the reduced symbol set with the lowest probabilities.
• Continuing in this manner, we end up with a single node, which is the root node.
• To obtain the code for each symbol, we traverse the tree from the root to each leaf node, assigning a 0 to the upper branch and a 1 to the lower branch.
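As an illustration of Method 2, the fragment below builds Huffman code lengths with a simple O(n²) merge loop: the two smallest-weight active nodes are repeatedly combined into a parent whose weight is the sum of the two, and the code length of each symbol is then its depth in the resulting tree. This is a minimal sketch for a 256-symbol alphabet with names of our own choosing, not the tree construction used in huff.c.

#define SYMBOLS 256

/* Builds the code length (in bits) of each symbol from its count, using the
 * "merge the two least probable nodes" rule.  O(n^2), which is fine for 256
 * symbols; zero-count symbols simply get no code, as noted later in the report. */
void build_code_lengths(const unsigned long count[SYMBOLS], int code_bits[SYMBOLS])
{
    unsigned long weight[2 * SYMBOLS] = { 0 };
    int parent[2 * SYMBOLS] = { 0 };
    int active[2 * SYMBOLS] = { 0 };
    int next = SYMBOLS;                       /* index of the next internal node */

    for (int i = 0; i < SYMBOLS; i++) {
        weight[i] = count[i];
        active[i] = (count[i] > 0);
    }
    for (;;) {
        int min1 = -1, min2 = -1;             /* the two smallest active nodes */
        for (int i = 0; i < next; i++) {
            if (!active[i]) continue;
            if (min1 < 0 || weight[i] < weight[min1])      { min2 = min1; min1 = i; }
            else if (min2 < 0 || weight[i] < weight[min2]) { min2 = i; }
        }
        if (min2 < 0) break;                  /* fewer than two nodes left: done */
        weight[next] = weight[min1] + weight[min2];   /* parent weight = sum of offspring */
        parent[min1] = parent[min2] = next;
        active[min1] = active[min2] = 0;
        active[next] = 1;
        next++;
    }
    /* The code length of a leaf is its depth: follow the parent links to the root. */
    for (int s = 0; s < SYMBOLS; s++) {
        code_bits[s] = 0;
        if (count[s] == 0) continue;
        for (int node = s; active[node] == 0; node = parent[node])
            code_bits[s]++;
    }
}

Assigning the actual bit patterns is then a matter of walking the same tree from the root, appending 0 for one branch and 1 for the other, exactly as described above.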
Explanation of huff.c: APPENDIX 1

Results

FILE         THEORETICAL ENTROPY   SYSTEM ENTROPY   DIFFERENCE   File size (original)   Compressed file size   Compression ratio
text.dat     4.3961                4.4325           0.0363       8358                   4719                   44%
image.dat    7.5931                7.6213           0.0282       65536                  62672                  5%
audio.dat    6.4559                6.4945           0.0386       65536                  53463                  19%
binary.dat   0.1832                1.0278           0.8445       65536                  8427                   88%
TEXT.dat

² Chapter 3: Huffman Coding – Introduction to Data Compression, 3rd edition, Khalid Sayood.
Frequency chart
Fig: The chart shows the frequencies of the symbol set from 0-255. The symbols not shown in the figure have a zero count, in which case the Huffman coder discards them and does not assign them a codeword.
The table below shows the output of the Huffman encoder for text.dat.
count: frequency; rf: relative frequency; I: information; H: entropy contribution; Symbol: symbol; Hcode: Huffman code
DISCUSSION for TEXT.dat
As shown in the histogram, the symbols are widely spread between the values 0-128, which is characteristic of text. The probabilities of these symbols vary considerably, so the file compresses well under the Huffman encoder. Clearly the symbols with very low frequencies are given the largest number of bits in their code words, and the prefix property is satisfied. Hence this is an optimum code for text.dat, as the difference between the entropy and the average length of the coded symbol set is only 0.0363, which is a very small value.
Image.dat
Binary.dat
Discussion for binary.dat: we see that the symbol set here contains essentially only the values 0 and 255, so there is a great deal of repetition in the source. The Huffman encoder needs only two codes to represent it and can thus achieve very high compression ratios. However, redundancy remains, as the encoder sends the same codeword again and again. If we additionally encoded this signal with run-length coding, we could achieve even higher compression ratios.
Audio.dat
[Per-symbol output of huff.c for audio.dat: for every symbol occurring in the file the program lists its count, relative frequency (rf), information (I), entropy contribution (H), the symbol itself, and the assigned Huffman code (Hcode).]
Discussion
We see that in the audio file the distribution has the shape of a Gaussian, so the frequency differences between neighbouring symbols in the set decrease very slowly, leaving very little variation in the relative frequencies. This leads to nearly uniform code lengths for the symbols and hence results in poor compression. As is clearly seen from the Huffman code listing above, almost all codes have a similar length, which increases the average code length; thus more redundancy remains in the compressed code.
DECOMPRESSION
Discussion -Encoding with Locally Adaptive Statistics
ADAPTIVE HUFFMAN CODING (AHC)
Huffman coding requires knowledge of the probabilities of the source sequence. If this knowledge is not available, Huffman coding becomes a two-pass procedure: the statistics are collected in the first pass and the source is encoded in the second pass. To convert this into a one-pass procedure, the adaptive Huffman coding technique was introduced, which bases the code on the statistics of the symbols already encountered.
Algorithm
To describe the working of AHC we introduce two more parameters into Method 2 for Huffman coding above: the weight of each node, which is written inside the node, and the node number. The weight of an external node (leaf) is simply the number of times the symbol corresponding to that leaf has been encountered; the weight of an internal node is the sum of the weights of its offspring. The node number is a unique number assigned to each external and internal node. In this method, neither the transmitter nor the receiver knows the statistics of the source sequence at the start of the transmission. The tree at both the Tx and the Rx consists of a single node that corresponds to all symbols not yet transmitted (NYT) and has a weight of zero. As transmission progresses, nodes corresponding to symbols that have been transmitted are added to the tree, and the tree is reconfigured using an update procedure. Before the beginning of the transmission, a fixed code for each symbol is agreed upon between the Tx and the Rx. A simple (short) fixed code is the following: if the source has a symbol set (s1, s2, s3, …, sm) of size m, then pick e and r such that m = 2^e + r and 0 ≤ r < 2^e. The symbol sk is encoded as the (e+1)-bit binary representation of k-1 if 1 ≤ k ≤ 2r; otherwise sk is encoded as the e-bit binary representation of k-r-1. When a symbol is encountered for the first time, the code for the NYT node is transmitted, followed by the fixed code for the symbol. A node for the symbol is then created, and the symbol is taken out of the NYT list. Both Tx and Rx start with the same tree structure, and the updating procedure is identical for both; thus the encoding and decoding processes remain synchronized.
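The fixed code described above is easy to compute. The small helper below (our own illustration, not a routine from ahuff.c) derives e and r for an alphabet of size m and returns the agreed-upon code of the k-th symbol (1-based) together with its length in bits.

/* Fixed code for a symbol's first occurrence: m = 2^e + r with 0 <= r < 2^e.
 * Symbol k (1 <= k <= m) is sent as the (e+1)-bit value k-1 when k <= 2r,
 * otherwise as the e-bit value k-r-1. */
unsigned int fixed_code(int m, int k, int *bits)
{
    int e = 0;
    while ((2 << e) <= m)                /* largest e with 2^e <= m */
        e++;
    int r = m - (1 << e);

    if (k <= 2 * r) {
        *bits = e + 1;
        return (unsigned int) (k - 1);
    } else {
        *bits = e;
        return (unsigned int) (k - r - 1);
    }
}

For a byte-oriented coder with m = 256 this gives e = 8 and r = 0, so the first occurrence of a symbol is simply its 8-bit value, sent right after the code for the NYT node.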
Update Procedure:
The update procedure requires that the nodes be kept in a fixed order. This ordering is preserved by numbering the nodes: the largest node number is given to the root of the tree and the smallest number is assigned to the NYT node. The numbers from the NYT node to the root of the tree are assigned in increasing order from left to right and from the lower levels to the upper levels. The set of nodes with the same weight makes up a block. The function of the update procedure is to preserve the sibling property. In order that the update procedures at the transmitter and the receiver operate with the same information, the tree at the Tx is updated after each symbol is encoded, and the tree at the Rx is updated after each symbol is decoded. After a symbol has been encoded or decoded, the external node corresponding to the symbol is examined to see if it has the largest node number in its block. If it does not, it is exchanged with the node that has the largest node number in the block, as long as the node with the higher number is not the parent of the node being updated. The weight of the external node is then incremented. If we did not exchange the nodes before incrementing the weight, it is very likely that the ordering required by the sibling property would be
destroyed. Once we have incremented the weight of the node, we have adapted the Huffman tree at that level. We then examine the parent of the node whose weight was incremented to see if it has the largest number in its block. If it does not, it is exchanged with the node with the largest number in the block; again, the exception is when the node with the higher number is the parent of the node under consideration. Once an exchange has taken place (or it has been determined that no exchange was needed), the weight of the parent node is incremented. We then proceed to the next parent node and repeat the process. This continues until the root of the tree is reached. If the symbol being encoded has occurred for the first time, a new external node is assigned to the symbol and a new NYT node is appended to the tree; both the new external node and the new NYT node are offspring of the old NYT node. We increment the weight of the new external node by 1, and since the old NYT node is the parent of the new external node, we increment its weight by one as well and then go on to update all the other nodes until we reach the root of the tree.
Encoding Procedure
Initially, the tree at both the encoder and the decoder consists of a single node, the NYT node. Therefore, the first codeword that appears is a previously agreed-upon fixed code. After the very first symbol, whenever we have to encode a symbol that is being encountered for the first time, we send the code for the NYT node followed by the previously agreed-upon fixed code for the symbol. The code for the NYT node is obtained by traversing the Huffman tree from the root to the NYT node; this alerts the receiver to the fact that the symbol whose code follows does not yet have a node in the Huffman tree. If the symbol to be encoded already has a corresponding node in the tree, the code for the symbol is generated by traversing the tree from the root to the external node corresponding to the symbol.
DECODING PROCEDURE
As the received binary string is read, we traverse the tree in a manner identical to that used in the encoding procedure. Once a leaf is encountered, the symbol corresponding to that leaf is decoded. If the leaf is the NYT node, we check the next e bits to see if the resulting number is less than r. If it is less than r, we read in one more bit to complete the code for the symbol. The index of the symbol is obtained by adding one to the decimal number corresponding to the e-bit or (e+1)-bit binary string. Once the symbol has been decoded, the tree is updated and the next received bit is used to start another traversal down the tree.
Explanation of ahuff.c – APPENDIX 2
Results Text.dat
Binary.dat
Audio.dat
Image.dat
Discussion
The only difference between Huffman and adaptive Huffman coding is that one is static and the other encodes on the fly. Since the final distribution of the source remains the same, the final codewords assigned to the source after compression are also similar, so the same discussion applies here. As the results for Huffman and adaptive Huffman encoding are essentially the same, we conclude that for the same source both Huffman and adaptive Huffman coding give similar performance, and the only difference between the two procedures is the nature of the encoding: static versus adaptive.
Problem 2:
Lempel-Ziv coding
Lempel-Ziv-77 (LZ77)³
LZ77 is a dictionary-based algorithm that addresses byte sequences in the former contents instead of the original data. In general only one coding scheme exists; all data are coded in the same form:
• Address of already coded contents
• Sequence length
• First deviating symbol
If no identical byte sequence is available in the former contents, the address 0, the sequence length 0 and the new symbol are coded. Because each byte sequence is extended by the first symbol deviating from the former contents, the set of already used symbols will continuously grow. No additional coding scheme is necessary. This allows an easy implementation with minimum requirements on the encoder and decoder.
Restrictions: To keep runtime and buffering capacity in an acceptable range, the addressing must be limited to a certain maximum. Contents exceeding this range are not regarded for coding and are not covered by the size of the addressing pointer.
Efficiency: The achievable compression rate depends only on repeating sequences. Other types of redundancy, such as an unequal probability distribution of the set of symbols, cannot be reduced. For that reason the compression of a pure LZ77 implementation is relatively low. A significantly better compression rate can be obtained by combining LZ77 with an additional entropy coding algorithm such as Huffman or Shannon-Fano coding.
Lempel-Ziv-78 (LZ78)⁴
LZ78 is based on a dictionary that is created dynamically at runtime. Both the encoding and the decoding process use the same rules to ensure that an identical dictionary is available. This dictionary contains every sequence already used to build the former contents. The compressed data have the general form:
• Index addressing an entry of the dictionary
• First deviating symbol
³ http://www.binaryessence.com/dct/en000138.htm
⁴ http://www.binaryessence.com/dct/en000140.htm
In contrast to LZ77, no combination of address and sequence length is used; instead, only the index into the dictionary is stored. The mechanism of adding the first deviating symbol is kept from LZ77. An LZ78 dictionary grows slowly, so for relevant compression a larger amount of data must be processed. Additionally, the compression depends mainly on the size of the dictionary, but a larger dictionary requires a higher effort for addressing and administration at runtime. In practice the dictionary is implemented as a tree to minimize the search effort: starting with the current symbol, the algorithm evaluates for every succeeding symbol whether it is available in the tree, and when a leaf node is reached the corresponding index is written to the compressed data. The decoder can be realized with a simple table, because the decoder does not need the search function. The size of the dictionary grows during the coding process, so the size needed for addressing the table increases continuously; in parallel, the requirements for storing and searching also grow. A limitation of the dictionary size and a corresponding update mechanism are therefore required. A minimal encoder along these lines is sketched below.
- Explanation of LZ12.c in APPENDIX 3
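The following fragment is a minimal LZ78-style encoder along the lines just described. It is our own sketch, not the LZ12.c implementation used for the results in this report: dictionary entries are kept in a flat array and searched linearly, and the (index, first deviating symbol) pairs are simply printed instead of being packed into bits.

#include <stdio.h>

#define DICT_MAX 4096                    /* illustrative limit on dictionary size */

/* One dictionary entry: the index of its prefix phrase plus the extending byte. */
struct entry { int prefix; int symbol; };

void lz78_encode(FILE *fp)
{
    struct entry dict[DICT_MAX];
    int n = 1;                           /* entry 0 is the empty phrase        */
    int w = 0;                           /* longest phrase matched so far      */
    int c;

    while ((c = fgetc(fp)) != EOF) {
        int found = 0;
        for (int i = 1; i < n; i++) {    /* is the phrase (w followed by c) known? */
            if (dict[i].prefix == w && dict[i].symbol == c) {
                w = i;                   /* yes: extend the current match      */
                found = 1;
                break;
            }
        }
        if (!found) {
            printf("(%d,%c)\n", w, c);   /* emit index + first deviating symbol */
            if (n < DICT_MAX) {          /* grow the dictionary up to the limit */
                dict[n].prefix = w;
                dict[n].symbol = c;
                n++;
            }
            w = 0;                       /* start the next phrase               */
        }
    }
    if (w != 0)
        printf("(%d,-)\n", w);           /* flush a trailing, fully matched phrase */
}

Run on the sequence of Problem 4b (A A A B A B B B A A B A B A A B A B), this prints exactly the (W,C) pairs derived in the table there: (0,A) (1,A) (0,B) (1,B) (3,B) (2,B) (4,A) (7,B).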
-Results
Discussion:
Text.dat: text consists of words formed from combinations of 21 consonants and 5 vowels. Because words recur, a dictionary-based algorithm can be very helpful: if the words of the text are such that the dictionary is referred to again and again rather than new phrases being sent, very good compression ratios can be achieved. We obtain a 44% compression ratio, which is very good for text.
Audio.dat: Looking at the histogram of the audio file, we see that the symbols are closely placed to each other; similar symbols lie in close proximity to one another rather than very different symbols being sparsely distributed. For example, if one phrase is "aa" and the next one is "aaa", the dictionary of the LZ encoder might already contain these as references and might not have to update the dictionary again and again. Hence the audio file gives very good compression ratios here compared with the other compression schemes.
Image.dat: For a smooth image with a uniform or Gaussian histogram, this scheme might give very nice compression ratios. But as we saw in the histogram of the image, the histogram is not uniform and the frequencies are widely separated. This implies that there is little correlation between two symbols and their occurrence. Thus for this image the scheme shows expansion rather than compression, as it needs to frequently update the dictionary and send codes for new characters, thereby increasing the average code length.
Binary.dat: Since the file contains only 2 symbols, the number of possible two-symbol combinations is only 4 (00, 01, 10 and 11), so the dictionary works with just a few entries and code words formed from them. Hence the compression ratios achieved are tremendously high (98%).
POST-PROCESSING with HUFFMAN CODING
Image
Overall compression: -6% Binary
Overall compression: 97% Audio
Overall compression: 28%
Text
Overall compression: 43%
- Discussion
It is believed that a significantly better compression rate can be obtained by combining LZ77 with an additional entropy coding algorithm such as Huffman or Shannon-Fano coding. A possible reason is that LZ coding brings structure into the source and combines more and more blocks together, so that their collective frequency of occurrence is taken into consideration. This scheme works wonders on some of the file types. If we additionally encode the already LZ-encoded file, we can achieve better results: due to the grouping of symbols, and possibly the removal of symbols that are very low in probability, we obtain a smaller maximum codeword length and hence a better average codeword length. As we see above, the audio file is about 1% more compressed, while the image file is about 10% more compressed; this is perhaps due to the reason mentioned above. However, for the other two files, text and binary, the Huffman encoding just adds extra codewords, thereby increasing the overall file size and decreasing the ratio.
b-3 input images
algorithms - explanation of LZ12.c - results (in terms of # symbols & characteristics of images)
(Which image is expected to have the best compression ratio, and were these expectations met?)
Discussion: we see from the images above that image 1 is smoother in its color variation than the other two, and this variation increases as we go from image 1 to image 3. We also know that Lempel-Ziv exploits the correlation between neighbouring pixels. Looking at the images, the variation in a way represents the correlation between adjacent pixels as we scan the image from left to right. With a smoother image there is less need to update the dictionary and hence good compression ratios can be achieved. In image 3 we see that although the number of grey levels is smaller, the placement of each grey level affects the phrases in the dictionary. Hence, as expected, the compression is highest for image 1 and lowest for image 3. The scanning order of the files is also, in a way, important for finding the correlation between adjacent pixels: scanning the images from top to bottom gives compression ratios of 20%, -5% and -10% respectively. However, it is difficult to predict a scanning order without fully analyzing the image, and hence no generic scanning order can be proven to be better than the others.
Problem # 3 Run Length Coding
Run-length coding is one of the most widely used coding schemes. The scheme is entirely based on the occurrence of repeated symbols in the incoming source.
BASIC SCHEME OF RUN LENGTH CODING
The basic idea is to encode a sequence of repeated symbols as the number of repetitions followed by the symbol itself. For example, "00 00 00 10 01 01 E8 E8 E8 E8" is encoded as "03 00 01 10 02 01 04 E8".
Basic format of the transmitted data:
COUNT|SYMBOL|COUNT|SYMBOL….
Code and algorithm: rle.c
int current, next;
int count_symbol = 1;                   /* length of the current run              */
const int max_count = 254;              /* so an emitted count never exceeds 255  */

current = fgetc(input);
while ((next = fgetc(input)) != EOF) {
    if (next == current && count_symbol <= max_count) {
        count_symbol++;                 /* extend the current run                 */
    } else {
        fputc(count_symbol, output);    /* emit COUNT|SYMBOL for the finished run */
        fputc(current, output);
        count_symbol = 1;
        current = next;
    }
}
if (current != EOF) {                   /* flush the last pending run             */
    fputc(count_symbol, output);
    fputc(current, output);
}
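For completeness, a matching decoder for this COUNT|SYMBOL format is trivial. The fragment below is our own sketch (it is not part of the assignment files) and assumes the same input/output FILE pointers as the encoder above.

int count, symbol;
while ((count = fgetc(input)) != EOF && (symbol = fgetc(input)) != EOF) {
    for (int i = 0; i < count; i++)      /* write SYMBOL out COUNT times */
        fputc(symbol, output);
}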
Result:

FILENAME     Original file size   Size after compression   Compression ratio
Text.dat     8358                 16400                    -96%
Image.dat    65536                124320                   -89%
Audio.dat    65536                108534                   -65%
Binary.dat   65536                4772                     93%
Discussion: RLC exploits the fact that a symbol can be repeated in the transmission. If a symbol from a 256-symbol set is repeated 255 times, we achieve the best case: the whole run is denoted by 2 bytes in the compressed file, one for the count and one for the symbol. However, if none of the symbols is repeated, the file size directly doubles, as we send 2 bytes for every 1-byte symbol.
Text.dat: we see that the compressed file for text.dat is 96% larger than the original file, so the scheme expands the file rather than compressing it. The reason, as mentioned above, is that in text it seldom occurs that symbols are repeated: we don't see "ssss" or "sss" in text very often. A symbol is sometimes repeated twice, but a very long repetition is highly improbable. Thus, as we send 2 bytes for every 1 byte of the text file, the file expands rather than compresses.
Image.dat: If an image had many identically repeated symbols, it would lack variation and have very little depth; usually a real-world image is considered good if it covers the whole band of color levels. When we view Image.dat in an image viewer, we notice that although the colors appear similar, there are lots of tiny variations in the color values of the image, leading to poor compression.
Audio.dat: the same reasoning applies to the audio signal. If a symbol is repeated too often it is perceived as bad music or noise; usually an audio file comprises a lot of variation between symbols, explaining the poor compression ratio.
Binary.dat: Clearly the binary file has a lot of repetition, as its symbol set contains only 2 symbols, explaining the high compression ratio.
POST-PROCESSING with Huffman coding
Text
Total effective compression 17% Audio
Total effective compression -6% ~ -5.9% Image
Total effective compression: -20% Binary
Total effective compression: 97%
Discussion: Clearly, post-processing with the Huffman coding scheme increases the compression ratio tremendously compared with the plain RLC scheme. The reason is that the RLE scheme converts the highly probable symbols into count/symbol pairs of 2 bytes each. In the Huffman coding that follows, each count value is also taken as a symbol; the counts become the frequently repeating symbols, while all the other symbols are written only once and are therefore roughly equiprobable, ending up with codewords of similar length. This might not give the best possible Huffman code, but it clearly improves on the plain RLE results.
b - Modified Scheme
This scheme is used to improve the performance of the basic RLC scheme. In this scheme we do not encode a count for singular symbols. For example, "00 10 11 00 00 00 00 21 21 C3 C3 F2" becomes "00 10 11 84 00 82 21 82 C3 81 F2": 84 has its MSB set to 1, which indicates that there are four continuous '00' symbols. The singular symbol "F2" is encoded as "81 F2" to avoid confusion, because "F2" also has its MSB set to 1.
Algorithm and code: rlem.c

int current, next;
int count_symbol = 1;                    /* counts are limited to 127 so that count+0x80 fits in one byte */

current = fgetc(input);
while ((next = fgetc(input)) != EOF) {
    if (next == current) {
        count_symbol++;
        if (count_symbol == 127) {       /* run is full: emit 0xFF | symbol          */
            fputc(0xff, output);
            fputc(next, output);
            current = fgetc(input);      /* start a new run                          */
            count_symbol = 1;
            if (current == EOF)
                break;                   /* nothing left after a full run            */
            continue;                    /* do not overwrite the fresh current below */
        }
    } else {
        if (count_symbol != 1) {         /* a run: emit (count|0x80), then the symbol */
            fputc(count_symbol + 0x80, output);
            fputc(current, output);
        }
        if (count_symbol == 1 && (current & 0x80)) {   /* single symbol with MSB set    */
            fputc(count_symbol + 0x80, output);        /* must be escaped as 81, symbol */
            fputc(current, output);
        }
        if (count_symbol == 1 && !(current & 0x80))    /* ordinary single symbol        */
            fputc(current, output);
        count_symbol = 1;
    }
    current = next;
}
if (current != EOF) {                    /* flush the last pending run or symbol */
    if (count_symbol > 1 || (current & 0x80))
        fputc(count_symbol + 0x80, output);
    fputc(current, output);
}
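A matching decoder for this modified format (again our own sketch, not part of rlem.c) only has to test the most significant bit of each byte it reads:

int b, symbol;
while ((b = fgetc(input)) != EOF) {
    if (b & 0x80) {                      /* 0x80|count followed by the symbol */
        int run = b & 0x7f;
        if ((symbol = fgetc(input)) == EOF)
            break;
        for (int i = 0; i < run; i++)
            fputc(symbol, output);
    } else {
        fputc(b, output);                /* literal single symbol (MSB clear) */
    }
}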
Results:
FILENAME     Original file size   Size after compression   Compression ratio
Text.dat     8358                 8343                     1%
Image.dat    65536                82766                    -26%
Audio.dat    65536                85205                    -30%
Binary.dat   65536                4435                     93%
DISCUSSION: Since we removed the extra byte that was being sent for counts smaller than 2, the text compression has increased to 1% compared with the basic scheme. The result for the binary file remains the same, for the obvious reason stated above. Sending the count only for repeating symbols helps reduce the file size. But since the maximum count is now limited to just 127, symbols with more than 127 repetitions have to be sent in two passes. Hence the outputs for all the input files have improved, but this limitation on the maximum count is a hindering factor for optimum compression.
c - Best Compression
The proposed RLE scheme uses three bytes, rather than two, to represent a run (a run being a number of repeated symbols). The first byte is a flag value indicating that the following two bytes are part of an encoded packet; the second byte is the count value, and the third byte is the run value (the symbol). When encoding, if a 1-, 2- or 3-byte character run is encountered, the character values are written directly to the compressed data stream; because no additional characters are written, no overhead is incurred. Since a run costs 3 bytes, there is no point in encoding a symbol repeated twice (this would increase the overhead by 1 byte) or a symbol repeated three times (it would make no difference to the total byte count).
When decoding, a character is read; if the character is the flag value, the run count and run value are read, expanded, and the resulting run is written to the data stream. If the character read is not the flag value, it is written directly to the uncompressed data stream. There are two potential drawbacks to this method:
• The minimum useful run length is increased from three characters to four. This could affect compression efficiency with some types of data.
• If the unencoded data stream contains a character value equal to the flag value, it must be compressed into a 3-byte encoded packet as a run length of one. This prevents erroneous flag values from occurring in the compressed data stream. If many of these flag-value characters are present, poor compression will result. The RLE algorithm must therefore use a flag value that rarely occurs in the uncompressed data stream.
Choosing a flag value: As the compression is done on different files with different characteristics and symbol sets, choosing the flag value is very important. In our algorithm, if a symbol is equal to the flag, we send FLAG|COUNT|FLAG irrespective of whether the count is 1 or more than one. If the flag is a very frequently occurring value, this algorithm can introduce redundancy in the code, as one (flag) symbol will be represented by 3 bytes rather than 1 byte. We can choose the flag in two ways:
1- Assume a flag value that is expected to occur least often in the input type. For example, in an image file the probability of occurrence of 0 or 255 is usually lower than that of other values, if the image has a well-distributed range of grey values; hence choosing a flag value of 255 can help. In a binary file, however, this flag value can be fatal. Although this method keeps the whole algorithm adaptive, it has some unpredictability attached to it.
2- Proposed solution: we take the statistics of the file first and collect them in a data buffer. We then calculate the relative frequencies of the symbols and find the symbol with the lowest frequency. We choose this symbol as our flag byte and store it in the first location of the file, so the decoder knows which flag byte was used for encoding.
a. This gives us the certainty that the flag is the least-used symbol in the whole file.
b. In that way it produces better results.
c. The drawback of this algorithm is that it needs a buffer to store the data before processing it; this can sometimes be an issue when streaming large files over a network.
d. It converts the algorithm from adaptive to static.
A minimal sketch of this proposed encoder is given below.
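The fragment below is our own illustration of the proposed scheme (the function name, buffer handling and the 4-byte run threshold follow the description above), not the exact code used to produce the results. The whole file is buffered, the least frequent byte is chosen as the flag and written first, runs of four or more identical bytes are packed as FLAG|COUNT|SYMBOL, any occurrence of the flag byte itself is escaped in the same packet form, and everything else is copied through unchanged.

#include <stdio.h>
#include <stdlib.h>

void rle_flag_encode(FILE *input, FILE *output)
{
    unsigned long count[256] = { 0 };
    unsigned char *buf = NULL;
    long size = 0, cap = 0;
    int c;

    while ((c = fgetc(input)) != EOF) {          /* buffer the file and gather statistics */
        if (size == cap) {
            cap = cap ? cap * 2 : 4096;
            unsigned char *tmp = realloc(buf, cap);
            if (tmp == NULL) { free(buf); return; }
            buf = tmp;
        }
        buf[size++] = (unsigned char) c;
        count[c]++;
    }

    int flag = 0;                                /* least frequent byte value becomes the flag */
    for (int i = 1; i < 256; i++)
        if (count[i] < count[flag])
            flag = i;
    fputc(flag, output);                         /* first byte tells the decoder the flag */

    for (long i = 0; i < size; ) {
        long run = 1;
        while (i + run < size && buf[i + run] == buf[i] && run < 255)
            run++;                               /* counts are capped at 255 (one byte)   */
        if (run >= 4 || buf[i] == flag) {        /* long run, or the flag must be escaped */
            fputc(flag, output);
            fputc((int) run, output);
            fputc(buf[i], output);
        } else {
            for (long j = 0; j < run; j++)       /* short run: copy the bytes literally   */
                fputc(buf[i], output);
        }
        i += run;
    }
    free(buf);
}

The decoder reads the stored flag byte first and then copies input bytes straight through until it meets the flag, at which point it reads the count and the symbol and expands the run.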
Results:

FILENAME     Original file size   Size after compression   Compression ratio
Text.dat     8358                 8348                     1%
Image.dat    65536                65512                    1.52%
Audio.dat    65536                64617                    1.4%
Binary.dat   65536                5645                     98.6%
The proposed scheme clearly proved to be better than all the other RLE schemes, as all the compression ratios are positive, thereby showing true compression. The reasons for this compression were given in the explanation of the scheme above.
PROBLEM 4a-
4b. The LZ coding algorithm
Input sequence (positions 1-18): A A A B A B B B A A B A B A A B A B, parsed as A | AA | B | AB | BB | AAB | ABA | ABAB.

Code developed   Input positions & symbols   Dictionary entry   Output (W,C)
0                -                           -                  (0,NULL)
1                1: A                        A                  (0,A)
2                2-3: A A                    AA                 (1,A)
3                4: B                        B                  (0,B)
4                5-6: A B                    AB                 (1,B)
5                7-8: B B                    BB                 (3,B)
6                9-11: A A B                 AAB                (2,B)
7                12-14: A B A                ABA                (4,A)
8                15-18: A B A B              ABAB               (7,B)

4C:
APPENDIX 1
Huff.c
We use a main function to call these files, which contain the compression and expansion routines. The functions are self-explanatory and are otherwise commented to describe what they are doing.
/************************** Start of HUFF.C *************************
 *
 * This is the Huffman coding module used in Chapter 3.
 * Compile with BITIO.C, ERRHAND.C, and either MAIN-C.C or MAIN-E.C
 */
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <ctype.h> #include "bitio.h" #include "errhand.h" #include "main.h" #include "math.h" /* * The NODE structure is a node in the Huffman decoding tree. It has a * count, which is its weight in the tree, and the node numbers of its * two children. The saved_count member of the structure is only * there for debugging purposes, and can be safely taken out at any * time. It just holds the intial count for each of the symbols, since * the count member is continually being modified as the tree grows. */ typedef struct tree_node { unsigned int count; unsigned int saved_count; int child_0;
int child_1; } NODE; /* * A Huffman tree is set up for decoding, not encoding. When encoding, * I first walk through the tree and build up a table of codes for * each symbol. The codes are stored in this CODE structure. */ typedef struct code { unsigned int code; int code_bits; } CODE; /* * The special EOS symbol is 256, the first available symbol after all * of the possible bytes. When decoding, reading this symbols * indicates that all of the data has been read in. */ #define END_OF_STREAM 256 /* * Local function prototypes, defined with or without ANSI prototypes. */ #ifdef __STDC__ double r_freq[256]; void count_bytes( FILE *input, unsigned long *long_counts ); void scale_counts( unsigned long *long_counts, NODE *nodes ); int build_tree( NODE *nodes ); void convert_tree_to_code( NODE *nodes, CODE *codes, unsigned int code_so_far, int bits, int node ); void output_counts( BIT_FILE *output, NODE *nodes ); void input_counts( BIT_FILE *input, NODE *nodes ); void print_model( NODE *nodes, CODE *codes ); void compress_data( FILE *input, BIT_FILE *output, CODE *codes ); void expand_data( BIT_FILE *input, FILE *output, NODE *nodes, int root_node ); void print_char( int c ); #else /* __STDC__ */
void count_bytes(); void scale_counts(); int build_tree(); void convert_tree_to_code(); void output_counts(); void input_counts(); void print_model(); void compress_data(); void expand_data(); void print_char(); #endif /* __STDC__ */ /* * These two strings are used by MAIN-C.C and MAIN-E.C to print * messages of importance to the user of the program. */ char *CompressionName = "static order 0 model with Huffman coding"; char *Usage = "infile outfile [-d]\n\nSpecifying -d will dump the modeling data\n"; /* * CompressFile is the compression routine called by MAIN-C.C. It * looks for a single additional argument to be passed to it from * the command line: "-d". If a "-d" is present, it means the * user wants to see the model data dumped out for debugging * purposes. * * This routine works in a fairly straightforward manner. First, * it has to allocate storage for three different arrays of data. * Next, it counts all the bytes in the input file. The counts * are all stored in long int, so the next step is scale them down * to single byte counts in the NODE array. After the counts are * scaled, the Huffman decoding tree is built on top of the NODE * array. Another routine walks through the tree to build a table * of codes, one per symbol. Finally, when the codes are all ready, * compressing the file is a simple matter. After the file is * compressed, the storage is freed up, and the routine returns. * */ void CompressFile( input, output, argc, argv )
FILE *input; BIT_FILE *output; int argc; char *argv[]; { unsigned long *counts; NODE *nodes; CODE *codes; int root_node; int i=0; double actualavg=0; counts = (unsigned long *) calloc( 256, sizeof( unsigned long ) ); if ( counts == NULL ) fatal_error( "Error allocating counts array\n" ); if ( ( nodes = (NODE *) calloc( 514, sizeof( NODE ) ) ) == NULL ) fatal_error( "Error allocating nodes array\n" ); if ( ( codes = (CODE *) calloc( 257, sizeof( CODE ) ) ) == NULL ) fatal_error( "Error allocating codes array\n" ); count_bytes( input, counts ); // frequency calculate scale_counts( counts, nodes ); output_counts( output, nodes ); root_node = build_tree( nodes ); convert_tree_to_code( nodes, codes, 0, 0, root_node ); for(i=0;i<256;i++) { actualavg = actualavg + codes[i].code_bits*(r_freq[i]); // compute actual average code word length = #codebits*probability } printf(" actual average: %f \n ",actualavg); if ( argc > 0 && strcmp( argv[ 0 ], "-d" ) == 0 ) print_model( nodes, codes );
compress_data( input, output, codes ); free( (char *) counts ); free( (char *) nodes ); free( (char *) codes ); }
/* * ExpandFile is the routine called by MAIN-E.C to expand a file that * has been compressed with order 0 Huffman coding. This routine has * a simpler job than that of the Compression routine. All it has to * do is read in the counts that have been stored in the compressed * file, then build the Huffman tree. The data can then be expanded * by reading in a bit at a time from the compressed file. Finally, * the node array is freed and the routine returns. * */ void ExpandFile( input, output, argc, argv ) BIT_FILE *input; FILE *output; int argc; char *argv[]; { NODE *nodes; int root_node; if ( ( nodes = (NODE *) calloc( 514, sizeof( NODE ) ) ) == NULL ) fatal_error( "Error allocating nodes array\n" ); input_counts( input, nodes ); root_node = build_tree( nodes ); if ( argc > 0 && strcmp( argv[ 0 ], "-d" ) == 0 ) print_model( nodes, 0 ); expand_data( input, output, nodes, root_node ); free( (char *) nodes ); } /* * In order for the compressor to build the same model, I have to store * the symbol counts in the compressed file so the expander can read * them in. In order to save space, I don't save all 256 symbols * unconditionally. The format used to store counts looks like this: * * start, stop, counts, start, stop, counts, ... 0 * * This means that I store runs of counts, until all the non-zero * counts have been stored. At this time the list is terminated by * storing a start value of 0. Note that at least 1 run of counts has * to be stored, so even if the first start value is 0, I read it in. * It also means that even in an empty file that has no counts, I have
* to pass at least one count. * * In order to efficiently use this format, I have to identify runs of * non-zero counts. Because of the format used, I don't want to stop a * run because of just one or two zeros in the count stream. So I have * to sit in a loop looking for strings of three or more zero values in * a row. * * This is simple in concept, but it ends up being one of the most * complicated routines in the whole program. A routine that just * writes out 256 values without attempting to optimize would be much * simpler, but would hurt compression quite a bit on small files. * */ void output_counts( output, nodes ) BIT_FILE *output; NODE *nodes; { int first; int last; int next; int i; first = 0; while ( first < 255 && nodes[ first ].count == 0 ) first++; /* * Each time I hit the start of the loop, I assume that first is the * number for a run of non-zero values. The rest of the loop is * concerned with finding the value for last, which is the end of the * run, and the value of next, which is the start of the next run. * At the end of the loop, I assign next to first, so it starts in on * the next run. */ for ( ; first < 256 ; first = next ) { last = first + 1; for ( ; ; ) { for ( ; last < 256 ; last++ ) if ( nodes[ last ].count == 0 ) break; last--; for ( next = last + 1; next < 256 ; next++ )
if ( nodes[ next ].count != 0 ) break; if ( next > 255 ) break; if ( ( next - last ) > 3 ) break; last = next; }; /* * Here is where I output first, last, and all the counts in between. */ if ( putc( first, output->file ) != first ) fatal_error( "Error writing byte counts\n" ); if ( putc( last, output->file ) != last ) fatal_error( "Error writing byte counts\n" ); for ( i = first ; i <= last ; i++ ) { if ( putc( nodes[ i ].count, output->file ) != (int) nodes[ i ].count ) fatal_error( "Error writing byte counts\n" ); } } if ( putc( 0, output->file ) != 0 ) fatal_error( "Error writing byte counts\n" ); } /* * When expanding, I have to read in the same set of counts. This is * quite a bit easier that the process of writing them out, since no * decision making needs to be done. All I do is read in first, check * to see if I am all done, and if not, read in last and a string of * counts. */ void input_counts( input, nodes ) BIT_FILE *input; NODE *nodes; { int first; int last; int i; int c;
for ( i = 0 ; i < 256 ; i++ ) nodes[ i ].count = 0; if ( ( first = getc( input->file ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); if ( ( last = getc( input->file ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); for ( ; ; ) { for ( i = first ; i <= last ; i++ ) if ( ( c = getc( input->file ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); else nodes[ i ].count = (unsigned int) c; if ( ( first = getc( input->file ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); if ( first == 0 ) break; if ( ( last = getc( input->file ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); } nodes[ END_OF_STREAM ].count = 1; } /* * This routine counts the frequency of occurence of every byte in * the input file. It marks the place in the input stream where it * started, counts up all the bytes, then returns to the place where * it started. In most C implementations, the length of a file * cannot exceed an unsigned long, so this routine should always * work. */ #ifndef SEEK_SET #define SEEK_SET 0 #endif
void count_bytes( input, counts ) FILE *input; unsigned long *counts; { long input_marker;
    int c;
    int i = 0, j = 0;
    double filesize = 0;
    double relativefrequency;
    // int total=0;
    double information[ 256 ];        // declared here because it is used below
    double entropy = 0.0;
    double Entropy_symbol[ 256 ];
    double Total_entropy = 0.0;
    double info;
    double Probability;

    input_marker = ftell( input );
    while ( ( c = getc( input ) ) != EOF )
        counts[ c ]++;
    filesize = ftell( input );        // file size is the total symbol count
    fseek( input, input_marker, SEEK_SET );
    printf( "---COUNT---------------------------RELATIVE FREQUENCY-------------\n" );
    printf( "filesize: %.0f\n", filesize );
    for ( i = 0 ; i < 256 ; i++ ) {
        if ( counts[ i ] != 0 ) {
            printf( "count: %lu ", counts[ i ] );
            // information calculation
            relativefrequency = counts[ i ] / filesize;
            r_freq[ i ] = relativefrequency;
            printf( "rf: %f ", r_freq[ i ] );
            Probability = 1 / r_freq[ i ];            // 1/P(a)
            info = log( Probability ) / log( 2 );     // i(a) = log2(1/P(a)) = ln(1/P(a))/ln 2
            information[ i ] = info;
            printf( "I: %f ", information[ i ] );
            entropy = r_freq[ i ] * information[ i ];
            Entropy_symbol[ i ] = entropy;
            printf( "H: %f\n", Entropy_symbol[ i ] );
            Total_entropy = Total_entropy + entropy;
        }
    }
    printf( "total entropy for file: %f bits/symbol\n", Total_entropy );   // entropy calculation
} /* * In order to limit the size of my Huffman codes to 16 bits, I scale * my counts down so they fit in an unsigned char, and then store them * all as initial weights in my NODE array. The only thing to be * careful of is to make sure that a node with a non-zero count doesn't * get scaled down to 0. Nodes with values of 0 don't get codes. */ void scale_counts( counts, nodes ) unsigned long *counts; NODE *nodes; { unsigned long max_count; int i; max_count = 0; for ( i = 0 ; i < 256 ; i++ ) if ( counts[ i ] > max_count ) max_count = counts[ i ]; if ( max_count == 0 ) { counts[ 0 ] = 1; max_count = 1; } max_count = max_count / 255; max_count = max_count + 1; for ( i = 0 ; i < 256 ; i++ ) { nodes[ i ].count = (unsigned int) ( counts[ i ] / max_count ); if ( nodes[ i ].count == 0 && counts[ i ] != 0 ) nodes[ i ].count = 1; } nodes[ END_OF_STREAM ].count = 1; } /* * Building the Huffman tree is fairly simple. All of the active nodes
* are scanned in order to locate the two nodes with the minimum * weights. These two weights are added together and assigned to a new * node. The new node makes the two minimum nodes into its 0 child * and 1 child. The two minimum nodes are then marked as inactive. * This process repeats until their is only one node left, which is the * root node. The tree is done, and the root node is passed back * to the calling routine. * * Node 513 is used here to arbitratily provide a node with a guaranteed * maximum value. It starts off being min_1 and min_2. After all active * nodes have been scanned, I can tell if there is only one active node * left by checking to see if min_1 is still 513. */ int build_tree( nodes ) NODE *nodes; { int next_free; int i; int min_1; int min_2; nodes[ 513 ].count = 0xffff; for ( next_free = END_OF_STREAM + 1 ; ; next_free++ ) { min_1 = 513; min_2 = 513; for ( i = 0 ; i < next_free ; i++ ) if ( nodes[ i ].count != 0 ) { if ( nodes[ i ].count < nodes[ min_1 ].count ) { min_2 = min_1; min_1 = i; } else if ( nodes[ i ].count < nodes[ min_2 ].count ) min_2 = i; } if ( min_2 == 513 ) break; nodes[ next_free ].count = nodes[ min_1 ].count + nodes[ min_2 ].count; nodes[ min_1 ].saved_count = nodes[ min_1 ].count; nodes[ min_1 ].count = 0; nodes[ min_2 ].saved_count = nodes[ min_2 ].count; nodes[ min_2 ].count = 0; nodes[ next_free ].child_0 = min_1;
nodes[ next_free ].child_1 = min_2; } next_free--; nodes[ next_free ].saved_count = nodes[ next_free ].count; return( next_free ); } /* * Since the Huffman tree is built as a decoding tree, there is * no simple way to get the encoding values for each symbol out of * it. This routine recursively walks through the tree, adding the * child bits to each code until it gets to a leaf. When it gets * to a leaf, it stores the code value in the CODE element, and * returns. */ void convert_tree_to_code( nodes, codes, code_so_far, bits, node ) NODE *nodes; CODE *codes; unsigned int code_so_far; int bits; int node; { if ( node <= END_OF_STREAM ) { codes[ node ].code = code_so_far; codes[ node ].code_bits = bits; return; } code_so_far <<= 1; bits++; convert_tree_to_code( nodes, codes, code_so_far, bits, nodes[ node ].child_0 ); convert_tree_to_code( nodes, codes, code_so_far | 1, bits, nodes[ node ].child_1 ); } /* * If the -d command line option is specified, this routine is called * to print out some of the model information after the tree is built. * Note that this is the only place that the saved_count NODE element * is used for anything at all, and in this case it is just for * diagnostic information. By the time I get here, and the tree has * been built, every active element will have 0 in its count. */
void print_model( nodes, codes ) NODE *nodes; CODE *codes; { int i; for ( i = 0 ; i < 513 ; i++ ) { if ( nodes[ i ].saved_count != 0 ) { //printf( "Symbol value= %d" ,i ); // printf( "Symbol = " ); print_char( i ); if(i<256){ printf( " Symbol= %c" ,i );} //print_char( i ); // printf( " count=%3d", nodes[ i ].saved_count ); // printf( " child_0=" ); // print_char( nodes[ i ].child_0 ); // printf( " child_1=" ); // print_char( nodes[ i ].child_1 ); if ( codes && i <= END_OF_STREAM ) { printf( " Hcode=" ); FilePrintBinary( stdout, codes[ i ].code, codes[ i ].code_bits ); } printf( "\n" ); } } } /* * The print_model routine uses this function to print out node numbers. * The catch is, if it is a printable character, it gets printed out * as a character. Makes the debug output a little easier to read. */ void print_char( c ) int c; { if ( c >= 0x20 && c < 127 ) printf( "'%c'", c ); else
printf( "%3d", c ); } /* * Once the tree gets built, and the CODE table is built, compressing * the data is a breeze. Each byte is read in, and its corresponding * Huffman code is sent out. */ void compress_data( input, output, codes ) FILE *input; BIT_FILE *output; CODE *codes; { int c; while ( ( c = getc( input ) ) != EOF ) OutputBits( output, (unsigned long) codes[ c ].code, codes[ c ].code_bits ); OutputBits( output, (unsigned long) codes[ END_OF_STREAM ].code, codes[ END_OF_STREAM ].code_bits ); // calculate average code length
} /* * Expanding compressed data is a little harder than the compression * phase. As each new symbol is decoded, the tree is traversed, * starting at the root node, reading a bit in, and taking either the * child_0 or child_1 path. Eventually, the tree winds down to a * leaf node, and the corresponding symbol is output. If the symbol * is the END_OF_STREAM symbol, it doesn't get written out, and * instead the whole process terminates. */ void expand_data( input, output, nodes, root_node ) BIT_FILE *input; FILE *output;
NODE *nodes;
int root_node;
{
    int node;

    for ( ; ; ) {
        node = root_node;
        do {
            if ( InputBit( input ) )
                node = nodes[ node ].child_1;
            else
                node = nodes[ node ].child_0;
        } while ( node > END_OF_STREAM );
        if ( node == END_OF_STREAM )
            break;
        if ( ( putc( node, output ) ) != node )
            fatal_error( "Error trying to write expanded byte to output" );
    }
}
/*************************** End of HUFF.C **************************/

APPENDIX 2

The procedure for adaptive Huffman coding is similar to the Huffman C code above; it is a modified version that accommodates the adaptive nature of the model.

/*
 * This is the adaptive Huffman coding module used in Chapter 4.
 * Compile with BITIO.C, ERRHAND.C, and either MAIN-C.C or MAIN-E.C
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "bitio.h"
#include "errhand.h"
char *CompressionName = "Adaptive Huffman coding, with escape codes";
char *Usage = "infile outfile [ -d ]";
#define END_OF_STREAM 256
#define ESCAPE 257
#define SYMBOL_COUNT 258
#define NODE_TABLE_COUNT ( ( SYMBOL_COUNT * 2 ) - 1 )
#define ROOT_NODE 0
#define MAX_WEIGHT 0x8000
#define TRUE 1
#define FALSE 0
/*
 * This data structure is all that is needed to maintain an adaptive
 * Huffman tree for both encoding and decoding. The leaf array is a
 * set of indices into the nodes that indicate which node is the
 * parent of a symbol. For example, to encode 'A', we would find the
 * leaf node by way of leaf[ 'A' ]. The next_free_node index is used
 * to tell which node is the next one in the array that can be used.
 * Since nodes are allocated when characters are read in for the first
 * time, this pointer keeps track of where we are in the node array.
 * Finally, the array of nodes is the actual Huffman tree. The child
 * index is either an index pointing to a pair of children, or an
 * actual symbol value, depending on whether 'child_is_leaf' is true
 * or false.
 */
typedef struct tree {
    int leaf[ SYMBOL_COUNT ];
    int next_free_node;
    struct node {
        unsigned int weight;
        int parent;
        int child_is_leaf;
        int child;
    } nodes[ NODE_TABLE_COUNT ];      // the array of nodes is the actual Huffman tree
} TREE;
/*
 * The Tree used in this program is a global structure. Under other
 * circumstances it could just as well be a dynamically allocated
 * structure built when needed, since all routines here take a TREE
 * pointer as an argument.
 */
TREE Tree;
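As a quick illustration of how the leaf[] and nodes[] arrays cooperate, the length of the current code for a symbol can be found by following parent links from its leaf up to the root. This is a hypothetical fragment for illustration only, not part of the adaptive Huffman module:

/* Hypothetical fragment: count how many bits the current adaptive code
 * for symbol 'A' would take.  leaf[ 'A' ] is -1 if 'A' has not been
 * seen yet, in which case the ESCAPE code plus 8 raw bits are sent
 * instead. */
int node = Tree.leaf[ 'A' ];
int bits = 0;
while ( node != -1 && node != ROOT_NODE ) {
    bits++;
    node = Tree.nodes[ node ].parent;
}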
/* * Function prototypes for both ANSI C compilers and their K&R brethren. */ #ifdef __STDC__ void CompressFile( FILE *input, BIT_FILE *output, int argc, char *argv[] ); void ExpandFile( BIT_FILE *input, FILE *output, int argc, char *argv[] ); void InitializeTree( TREE *tree ); void EncodeSymbol( TREE *tree, unsigned int c, BIT_FILE *output ); int DecodeSymbol( TREE *tree, BIT_FILE *input ); void UpdateModel( TREE *tree, int c ); void RebuildTree( TREE *tree ); void swap_nodes( TREE *tree, int i, int j ); void add_new_node( TREE *tree, int c ); void PrintTree( TREE *tree ); void print_codes( TREE *tree ); void print_code( TREE *tree, int c ); void calculate_rows( TREE *tree, int node, int level ); int calculate_columns( TREE *tree, int node, int starting_guess ); int find_minimum_column( TREE *tree, int node, int max_row ); void rescale_columns( int factor ); void print_tree( TREE *tree, int first_row, int last_row ); void print_connecting_lines( TREE *tree, int row ); void print_node_numbers( int row ); void print_weights( TREE *tree, int row ); void print_symbol( TREE *tree, int row ); #else void CompressFile(); void ExpandFile(); void InitializeTree(); void EncodeSymbol(); int DecodeSymbol(); void UpdateModel(); void RebuildTree(); void swap_nodes(); void add_new_node(); void PrintTree(); void print_codes(); void print_code();
void calculate_rows(); int calculate_columns(); void rescale_columns(); void print_tree(); void print_connecting_lines(); void print_node_numbers(); void print_weights(); void print_symbol(); #endif /* * The high level view of the compression routine is very simple. * First, we initialize the Huffman tree, with just the ESCAPE and * END_OF_STREAM symbols. Then, we sit in a loop, encoding symbols, * and adding them to the model. When there are no more characters * to send, the special END_OF_STREAM symbol is encoded. The decoder * will later be able to use this symbol to know when to quit. * * This routine will accept a single additional argument. If the user * passes a "-d" argument, the function will dump out the Huffman tree * to stdout when the program is complete. The code to accomplish this * is thrown in as a bonus., and not documented in the book. */ void CompressFile( input, output, argc, argv ) FILE *input; BIT_FILE *output; int argc; char *argv[]; { int c; InitializeTree( &Tree ); // intialize the tree... while ( ( c = getc( input ) ) != EOF ) { // if the symbol is not EOF proceed encoding the symbol EncodeSymbol( &Tree, c, output ); UpdateModel( &Tree, c ); } EncodeSymbol( &Tree, END_OF_STREAM, output ); while ( argc-- > 0 ) { // CHECK THIS if ( strcmp( *argv, "-d" ) == 0 ) PrintTree( &Tree );
else printf( "Unused argument: %s\n", *argv ); argv++; } } /* * The Expansion routine looks very much like the compression routine. * It first initializes the Huffman tree, using the same routine as * the compressor did. It then sits in a loop, decoding characters and * updating the model until it reads in an END_OF_STREAM symbol. At * that point, it is time to quit. * * This routine will accept a single additional argument. If the user * passes a "-d" argument, the function will dump out the Huffman tree * to stdout when the program is complete. */ void ExpandFile( input, output, argc, argv ) BIT_FILE *input; FILE *output; int argc; char *argv[]; { int c; InitializeTree( &Tree ); while ( ( c = DecodeSymbol( &Tree, input ) ) != END_OF_STREAM ) { if ( putc( c, output ) == EOF ) fatal_error( "Error writing character" ); UpdateModel( &Tree, c ); } while ( argc-- > 0 ) { if ( strcmp( *argv, "-d" ) == 0 ) PrintTree( &Tree ); else printf( "Unused argument: %s\n", *argv ); argv++; } } /*
* When performing adaptive compression, the Huffman tree starts out * very nearly empty. The only two symbols present initially are the * ESCAPE symbol and the END_OF_STREAM symbol. The ESCAPE symbol has to * be included so we can tell the expansion prog that we are transmitting a * previously unseen symbol. The END_OF_STREAM symbol is here because * it is greater than eight bits, and our ESCAPE sequence only allows for * eight bit symbols following the ESCAPE code. * * In addition to setting up the root node and its two children, this * routine also initializes the leaf array. The ESCAPE and END_OF_STREAM * leaf elements are the only ones initially defined, the rest of the leaf * elements are set to -1 to show that they aren't present in the * Huffman tree yet. */ void InitializeTree( tree ) TREE *tree; { int i; tree->nodes[ ROOT_NODE ].child = ROOT_NODE + 1; tree->nodes[ ROOT_NODE ].child_is_leaf = FALSE; tree->nodes[ ROOT_NODE ].weight = 2; tree->nodes[ ROOT_NODE ].parent = -1; tree->nodes[ ROOT_NODE + 1 ].child = END_OF_STREAM; tree->nodes[ ROOT_NODE + 1 ].child_is_leaf = TRUE; tree->nodes[ ROOT_NODE + 1 ].weight = 1; tree->nodes[ ROOT_NODE + 1 ].parent = ROOT_NODE; tree->leaf[ END_OF_STREAM ] = ROOT_NODE + 1; tree->nodes[ ROOT_NODE + 2 ].child = ESCAPE; tree->nodes[ ROOT_NODE + 2 ].child_is_leaf = TRUE; tree->nodes[ ROOT_NODE + 2 ].weight = 1; tree->nodes[ ROOT_NODE + 2 ].parent = ROOT_NODE; tree->leaf[ ESCAPE ] = ROOT_NODE + 2; tree->next_free_node
= ROOT_NODE + 3;
for ( i = 0 ; i < END_OF_STREAM ; i++ ) tree->leaf[ i ] = -1; }
/* * This routine is responsible for taking a symbol, and converting * it into the sequence of bits dictated by the Huffman tree. The * only complication is that we are working are way up from the leaf * to the root, and hence are getting the bits in reverse order. This * means we have to rack up the bits in an integer and then send them * out after they are all accumulated. In this version of the program, * we keep our codes in a long integer, so the maximum count is set * to an arbitray limit of 0x8000. It could be set as high as 65535 * if desired. */ void EncodeSymbol( tree, c, output ) TREE *tree; unsigned int c; BIT_FILE *output; { unsigned long code; unsigned long current_bit; int code_size; int current_node; code = 0; current_bit = 1; code_size = 0; current_node = tree->leaf[ c ]; if ( current_node == -1 ) current_node = tree->leaf[ ESCAPE ]; while ( current_node != ROOT_NODE ) { if ( ( current_node & 1 ) == 0 ) code |= current_bit; current_bit <<= 1; code_size++; current_node = tree->nodes[ current_node ].parent; }; OutputBits( output, code, code_size ); if ( tree->leaf[ c ] == -1 ) { OutputBits( output, (unsigned long) c, 8 ); add_new_node( tree, c ); } }
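For concreteness, here is a hand-worked trace (worked by hand from the routines above, not program output) of EncodeSymbol() on the very first input byte, using the tree produced by InitializeTree():

/*
 * Hand-worked example.  After InitializeTree() the tree holds three
 * nodes: node 0 (root, weight 2), node 1 (END_OF_STREAM leaf, weight 1)
 * and node 2 (ESCAPE leaf, weight 1).  Suppose the first input byte is
 * 'A' (0x41):
 *
 *   leaf[ 'A' ] == -1, so the ESCAPE leaf (node 2) is encoded instead.
 *   Node 2 is even, so the walk up to the root produces the single bit 1.
 *   Because leaf[ 'A' ] was -1, the raw 8-bit value 01000001 follows,
 *   and add_new_node() then inserts 'A' into the tree with weight 0.
 *
 * The compressed stream therefore starts with the 9 bits 1 01000001.
 */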
/* * Decoding symbols is easy. We start at the root node, then go down * the tree until we reach a leaf. At each node, we decide which * child to take based on the next input bit. After getting to the * leaf, we check to see if we read in the ESCAPE code. If we did, * it means that the next symbol is going to come through in the next * eight bits, unencoded. If that is the case, we read it in here, * and add the new symbol to the table. */ int DecodeSymbol( tree, input ) TREE *tree; BIT_FILE *input; { int current_node; int c; current_node = ROOT_NODE; while ( !tree->nodes[ current_node ].child_is_leaf ) { current_node = tree->nodes[ current_node ].child; current_node += InputBit( input ); } c = tree->nodes[ current_node ].child; if ( c == ESCAPE ) { c = (int) InputBits( input, 8 ); add_new_node( tree, c ); } return( c ); } /* * UpdateModel is called to increment the count for a given symbol. * After incrementing the symbol, this code has to work its way up * through the parent nodes, incrementing each one of them. That is * the easy part. The hard part is that after incrementing each * parent node, we have to check to see if it is now out of the proper * order. If it is, it has to be moved up the tree into its proper * place. */ void UpdateModel( tree, c ) TREE *tree;
int c; { int current_node; int new_node; if ( tree->nodes[ ROOT_NODE].weight == MAX_WEIGHT ) RebuildTree( tree ); current_node = tree->leaf[ c ]; while ( current_node != -1 ) { tree->nodes[ current_node ].weight++; for ( new_node = current_node ; new_node > ROOT_NODE ; new_node-- ) if ( tree->nodes[ new_node - 1 ].weight >= tree->nodes[ current_node ].weight ) break; if ( current_node != new_node ) { swap_nodes( tree, current_node, new_node ); current_node = new_node; } current_node = tree->nodes[ current_node ].parent; } } /* * Rebuilding the tree takes place when the counts have gone too * high. From a simple point of view, rebuilding the tree just means that * we divide every count by two. Unfortunately, due to truncation effects, * this means that the tree's shape might change. Some nodes might move * up due to cumulative increases, while others may move down. */ void RebuildTree( tree ) TREE *tree; { int i; int j; int k; unsigned int weight; /* * To start rebuilding the table, I collect all the leaves of the Huffman * tree and put them in the end of the tree. While I am doing that, I * scale the counts down by a factor of 2.
*/ printf( "R" ); j = tree->next_free_node - 1; for ( i = j ; i >= ROOT_NODE ; i-- ) { if ( tree->nodes[ i ].child_is_leaf ) { tree->nodes[ j ] = tree->nodes[ i ]; tree->nodes[ j ].weight = ( tree->nodes[ j ].weight + 1 ) / 2; j--; } } /* * At this point, j points to the first free node. I now have all the * leaves defined, and need to start building the higher nodes on the * tree. I will start adding the new internal nodes at j. Every time * I add a new internal node to the top of the tree, I have to check to * see where it really belongs in the tree. It might stay at the top, * but there is a good chance I might have to move it back down. If it * does have to go down, I use the memmove() function to scoot everyone * bigger up by one node. Note that memmove() may have to be change * to memcpy() on some UNIX systems. The parameters are unchanged, as * memmove and memcpy have the same set of parameters. */ for ( i = tree->next_free_node - 2 ; j >= ROOT_NODE ; i -= 2, j-- ) { k = i + 1; tree->nodes[ j ].weight = tree->nodes[ i ].weight + tree->nodes[ k ].weight; weight = tree->nodes[ j ].weight; tree->nodes[ j ].child_is_leaf = FALSE; for ( k = j + 1 ; weight < tree->nodes[ k ].weight ; k++ ) ; k--; memmove( &tree->nodes[ j ], &tree->nodes[ j + 1 ], ( k - j ) * sizeof( struct node ) ); tree->nodes[ k ].weight = weight; tree->nodes[ k ].child = i; tree->nodes[ k ].child_is_leaf = FALSE; } /* * The final step in tree reconstruction is to go through and set up * all of the leaf and parent members. This can be safely done now * that every node is in its final position in the tree.
*/ for ( i = tree->next_free_node - 1 ; i >= ROOT_NODE ; i-- ) { if ( tree->nodes[ i ].child_is_leaf ) { k = tree->nodes[ i ].child; tree->leaf[ k ] = i; } else { k = tree->nodes[ i ].child; tree->nodes[ k ].parent = tree->nodes[ k + 1 ].parent = i; } } } /* * Swapping nodes takes place when a node has grown too big for its * spot in the tree. When swapping nodes i and j, we rearrange the * tree by exchanging the children under i with the children under j. */ void swap_nodes( tree, i, j ) TREE *tree; int i; int j; { struct node temp; if ( tree->nodes[ i ].child_is_leaf ) tree->leaf[ tree->nodes[ i ].child ] = j; else { tree->nodes[ tree->nodes[ i ].child ].parent = j; tree->nodes[ tree->nodes[ i ].child + 1 ].parent = j; } if ( tree->nodes[ j ].child_is_leaf ) tree->leaf[ tree->nodes[ j ].child ] = i; else { tree->nodes[ tree->nodes[ j ].child ].parent = i; tree->nodes[ tree->nodes[ j ].child + 1 ].parent = i; } temp = tree->nodes[ i ]; tree->nodes[ i ] = tree->nodes[ j ]; tree->nodes[ i ].parent = temp.parent; temp.parent = tree->nodes[ j ].parent; tree->nodes[ j ] = temp;
} /* * Adding a new node to the tree is pretty simple. It is just a matter * of splitting the lightest-weight node in the tree, which is the highest * valued node. We split it off into two new nodes, one of which is the * one being added to the tree. We assign the new node a weight of 0, * so the tree doesn't have to be adjusted. It will be updated later when * the normal update process occurs. Note that this code assumes that * the lightest node has a leaf as a child. If this is not the case, * the tree would be broken. */ void add_new_node( tree, c ) TREE *tree; int c; { int lightest_node; int new_node; int zero_weight_node; lightest_node = tree->next_free_node - 1; new_node = tree->next_free_node; zero_weight_node = tree->next_free_node + 1; tree->next_free_node += 2; tree->nodes[ new_node ] = tree->nodes[ lightest_node ]; tree->nodes[ new_node ].parent = lightest_node; tree->leaf[ tree->nodes[ new_node ].child ] = new_node; tree->nodes[ lightest_node ].child = new_node; tree->nodes[ lightest_node ].child_is_leaf = FALSE; tree->nodes[ zero_weight_node ].child = c; tree->nodes[ zero_weight_node ].child_is_leaf = TRUE; tree->nodes[ zero_weight_node ].weight = 0; tree->nodes[ zero_weight_node ].parent = lightest_node; tree->leaf[ c ] = zero_weight_node; } /* * All the code from here down is concerned with printing the tree. * Printing the tree out is basically a process of walking down through
* all the nodes, with each new node to be printed getting nudged over * far enough to make room for everything that has come before. */ /* * This array is used to keep track of all the nodes that are in a given * row. The nodes are kept in a linked list. This array is used to keep * track of the first member. The subsequent members will be found in * a linked list in the positions[] array. */ struct row { int first_member; int count; } rows[ 32 ]; /* * The positions[] array is used to keep track of the row and column of each * node in the tree. The next_member element points to the next node * in the row for the given node. The column is calculated on the fly, * and represents the actual column that a given number is to be printed in. * Note that the column for a node is not an actual column on the page. For * purposes of analysis, it is assumed that each node takes up exactly one * column. So, if printing out the actual values in a node takes up for * spaces on the printed page, we might want to allocate five physical print * columns for each column in the array. */ struct location { int row; int next_member; int column; } positions[ NODE_TABLE_COUNT ]; /* * This is the main routine called to print out a Huffman tree. It first * calls the print_codes function, which prints out the binary codes * for each symbol. After that, it calculates the row and column that * each node will be printed in, then prints the tree out. This code * is not documented in the book, since it is essentially irrelevant to * the data compression process. However, it is nice to be able to * print out the tree.
*/ void PrintTree( tree ) TREE *tree; { int i; int min; print_codes( tree ); for ( i = 0 ; i < 32 ; i++ ) { rows[ i ].count = 0; rows[ i ].first_member = -1; } calculate_rows( tree, ROOT_NODE, 0 ); calculate_columns( tree, ROOT_NODE, 0 ); min = find_minimum_column( tree, ROOT_NODE, 31 ); rescale_columns( min ); print_tree( tree, 0, 31 ); } /* * This routine is called to print out the Huffman code for each symbol. * The real work is done by the print_code routine, which racks up the * bits and puts them out in the right order. */ void print_codes( tree ) TREE *tree; { int i; printf( "\n" ); for ( i = 0 ; i < SYMBOL_COUNT ; i++ ) if ( tree->leaf[ i ] != -1 ) { if ( isprint( i ) ) printf( "%5c: ", i ); else printf( "<%3d>: ", i ); printf( "%5u", tree->nodes[ tree->leaf[ i ] ].weight ); printf( " " ); print_code( tree, i ); printf( "\n" );
} } /* * print_code is a workhorse routine that prints out the Huffman code for * a given symbol. It ends up looking a lot like EncodeSymbol(), since * it more or less has to do the same work. The major difference is that * instead of calling OutputBit, this routine calls putc, with a character * argument. */ void print_code( tree, c ) TREE *tree; int c; { unsigned long code; unsigned long current_bit; int code_size; int current_node; int i; code = 0; current_bit = 1; code_size = 0; current_node = tree->leaf[ c ]; while ( current_node != ROOT_NODE ) { if ( current_node & 1 ) code |= current_bit; current_bit <<= 1; code_size++; current_node = tree->nodes[ current_node ].parent; }; for ( i = 0 ; i < code_size ; i++ ) { current_bit >>= 1; if ( code & current_bit ) putc( '1', stdout ); else putc( '0', stdout ); } } /*
* In order to print out the tree, I need to calculate the row and column * where each node will be printed. The rows are easier than the columns, * and I do them first. It is easy to keep track of what row a node is * in as I walk through the tree. As I walk through the tree, I also keep * track of the order the nodes appear in a given row, by adding them to * a linked list in the proper order. After calculate_rows() has been * recursively called all the way through the tree, I have a linked list of * nodes for each row. This same linked list is used later to calculate * which column each node appears in. */ void calculate_rows( tree, node, level ) TREE *tree; int node; int level; { if ( rows[ level ].first_member == -1 ) { rows[ level ].first_member = node; rows[ level ].count = 0; positions[ node ].row = level; positions[ node ].next_member = -1; } else { positions[ node ].row = level; positions[ node ].next_member = rows[ level ].first_member; rows[ level ].first_member = node; rows[ level ].count++; } if ( !tree->nodes[ node ].child_is_leaf ) { calculate_rows( tree, tree->nodes[ node ].child, level + 1 ); calculate_rows( tree, tree->nodes[ node ].child + 1, level + 1 ); } } /* * After I know which row each of the nodes is in, I can start the * hard work, which is calculating the columns. This routine gets * called recursively. It starts off with a starting guess for where * we want the node to go, and returns the actual result, which is * the column the node ended up in. For example, I might want my node * to print in column 0. After recursively evaluating everything under * the node, I may have been pushed over to node -10 ( the tree is * evaluated down the right side first ). I return that to whoever called
* this routine so it can use the nodes position to calculate where * the node in a higher row is to be placed. */ int calculate_columns( tree, node, starting_guess ) TREE *tree; int node; int starting_guess; { int next_node; int right_side; int left_side; /* * The first thing I check is to see if the node on my immediate right has * already been placed. If it has, I need to make sure that I am at least * 4 columns to the right of it. This allows me to print 3 characters plus * leave a blank space between us. */ next_node = positions[ node ].next_member; if ( next_node != -1 ) { if ( positions[ next_node ].column < ( starting_guess + 4 ) ) starting_guess = positions[ next_node ].column - 4; } if ( tree->nodes[ node ].child_is_leaf ) { positions[ node ].column = starting_guess; return( starting_guess ); } /* * After I have adjusted my starting guess, I calculate the actual position * of the right subtree of this node. I pass it a guess for a starting * node based on my starting guess. Naturally, what comes back may be * moved over quite a bit. */ right_side = calculate_columns( tree, tree->nodes[ node ].child, starting_guess + 2 ); /* * After figuring out where the right side lands, I do the same for the * left side. After doing the right side, I have a pretty good guess where * the starting column for the left side might go, so I can pass it a good * guess for a starting column. */ left_side = calculate_columns( tree, tree->nodes[ node ].child + 1, right_side - 4 );
/* * Once I know where the starting column for the left and right subtrees * are going to be for sure, I know where this node should go, which is * right in the middle between the two. I calcluate the column, store it, * then return the result to whoever called me. */ starting_guess = ( right_side + left_side ) / 2; positions[ node ].column = starting_guess; return( starting_guess ); } int find_minimum_column( tree, node, max_row ) TREE *tree; int node; int max_row; { int min_right; int min_left; if ( tree->nodes[ node ].child_is_leaf || max_row == 0 ) return( positions[ node ].column ); max_row--; min_right = find_minimum_column( tree, tree->nodes[ node ].child + 1, max_row ); min_left = find_minimum_column( tree, tree->nodes[ node ].child, max_row ); if ( min_right < min_left ) return( min_right ); else return( min_left ); } /* * Once the columns of each node have been calculated, I go back and rescale * the columns to be actual printer columns. In this particular program, * each node takes three characters to print, plus one space to keep nodes * separate. We take advantage of the fact that every node has at least one * logical column between it and the ajacent node, meaning that we can space * nodes only two physical columns apart. The spacing here consists of * rescaling each column so that the smallest column is at zero, then * multiplying by two to get a physical printer column. */ void rescale_columns( factor )
int factor; { int i; int node; /* * Once min is known, we can rescale the tree so that column min is * pushed over to column 0, and each logical column is set to be two * physical columns on the printer. */ for ( i = 0 ; i < 30 ; i++ ) { if ( rows[ i ].first_member == -1 ) break; node = rows[ i ].first_member; do { positions[ node ].column -= factor; node = positions[ node ].next_member; } while ( node != -1 ); } } /* * print_tree is called after the row and column of each node have been * calculated. It just calls the four workhorse routines that are * responsible for printing out the four elements that go on each row. * At the top of the row are the connecting lines hooking the tree * together. On the next line of the row are the node numbers. Below * them are the weights, and finally the symbol, if there is one. */ void print_tree( tree, first_row, last_row ) TREE *tree; int first_row; int last_row; { int row; for ( row = first_row ; row <= last_row ; row++ ) { if ( rows[ row ].first_member == -1 ) break; if ( row > first_row ) print_connecting_lines( tree, row );
print_node_numbers( row ); print_weights( tree, row ); print_symbol( tree, row ); } } /* * Printing the connecting lines means connecting each pair of nodes. * I use the IBM PC character set here. They can easily be replaced * with more standard alphanumerics. */ #ifndef ALPHANUMERIC #define LEFT_END 218 #define RIGHT_END 191 #define CENTER 193 #define LINE 196 #define VERTICAL 179 #else #define LEFT_END '+' #define RIGHT_END '+' #define CENTER '+' #define LINE '-' #define VERTICAL '|' #endif
void print_connecting_lines( tree, row ) TREE *tree; int row; { int current_col; int start_col; int end_col; int center_col; int node; int parent;
current_col = 0; node = rows[ row ].first_member; while ( node != -1 ) { start_col = positions[ node ].column + 2; node = positions[ node ].next_member; end_col = positions[ node ].column + 2; parent = tree->nodes[ node ].parent; center_col = positions[ parent ].column; center_col += 2; for ( ; current_col < start_col ; current_col++ ) putc( ' ', stdout ); putc( LEFT_END, stdout ); for ( current_col++ ; current_col < center_col ; current_col++ ) putc( LINE, stdout ); putc( CENTER, stdout ); for ( current_col++; current_col < end_col ; current_col++ ) putc( LINE, stdout ); putc( RIGHT_END, stdout ); current_col++; node = positions[ node ].next_member; } printf( "\n" ); } /* * Printing the node numbers is pretty easy. */ void print_node_numbers( row ) int row; { int current_col; int node; int print_col; current_col = 0; node = rows[ row ].first_member; while ( node != -1 ) { print_col = positions[ node ].column + 1; for ( ; current_col < print_col ; current_col++ ) putc( ' ', stdout ); printf( "%03d", node );
current_col += 3; node = positions[ node ].next_member; } printf( "\n" ); } /* * Printing the weight of each node is easy too. */ void print_weights( tree, row ) TREE *tree; int row; { int current_col; int print_col; int node; int print_size; int next_col; char buffer[ 10 ]; current_col = 0; node = rows[ row ].first_member; while ( node != -1 ) { print_col = positions[ node ].column + 1; sprintf( buffer, "%u", tree->nodes[ node ].weight ); if ( strlen( buffer ) < 3 ) sprintf( buffer, "%03u", tree->nodes[ node ].weight ); print_size = 3; if ( strlen( buffer ) > 3 ) { if ( positions[ node ].next_member == -1 ) print_size = strlen( buffer ); else { next_col = positions[ positions[ node ].next_member ].column; if ( ( next_col - print_col ) > 6 ) print_size = strlen( buffer ); else { strcpy( buffer, "---" ); print_size = 3; } } }
for ( ; current_col < print_col ; current_col++ ) putc( ' ', stdout ); printf( buffer ); current_col += print_size; node = positions[ node ].next_member; } printf( "\n" ); } /* * Printing the symbol values is a little more complicated. If it is a * printable symbol, I print it between simple quote characters. If * it isn't printable, I print a hex value, which also only takes up three * characters. If it is an internal node, it doesn't have a symbol, * which means I just print the vertical line. There is one complication * in this routine. In order to save space, I check first to see if * any of the nodes in this row have a symbol. If none of them have * symbols, we just skip this part, since we don't have to print the * row at all. */ void print_symbol( tree, row ) TREE *tree; int row; { int current_col; int print_col; int node; current_col = 0; node = rows[ row ].first_member; while ( node != -1 ) { if ( tree->nodes[ node ].child_is_leaf ) break; node = positions[ node ].next_member; } if ( node == -1 ) return; node = rows[ row ].first_member; while ( node != -1 ) { print_col = positions[ node ].column + 1; for ( ; current_col < print_col ; current_col++ )
putc( ' ', stdout ); if ( tree->nodes[ node ].child_is_leaf ) { if ( isprint( tree->nodes[ node ].child ) ) printf( "'%c'", tree->nodes[ node ].child ); else if ( tree->nodes[ node ].child == END_OF_STREAM ) printf( "EOF" ); else if ( tree->nodes[ node ].child == ESCAPE ) printf( "ESC" ); else printf( "%02XH", tree->nodes[ node ].child ); } else printf( " %c ", VERTICAL ); current_col += 3; node = positions[ node ].next_member; } printf( "\n" ); } APPENDIX 3 /************************** Start of LZW12.C ************************* * * This is 12 bit LZW program, which is discussed in the first part * of the chapter. It uses a fixed size code, and does not attempt * to flush the dictionary after it fills up. */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include "errhand.h" #include "bitio.h" /* * Constants used throughout the program. BITS defines how many bits * will be in a code. TABLE_SIZE defines the size of the dictionary * table. */ #define BITS 12 #define MAX_CODE ( ( 1 << BITS ) - 1 ) #define TABLE_SIZE 5021 #define END_OF_STREAM 256 #define FIRST_CODE 257 #define UNUSED -1
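Before the routines themselves, a small hand-worked trace (on an assumed toy input, not program output) helps make the dictionary growth concrete:

/*
 * Hand-worked example.  Encoding the string "ABABAB" with this 12-bit
 * coder proceeds as follows:
 *
 *   read 'A'                       string_code = 'A' (65)
 *   read 'B'  "AB" not in table:   add 257 = "AB",  output 65,  string_code = 'B'
 *   read 'A'  "BA" not in table:   add 258 = "BA",  output 66,  string_code = 'A'
 *   read 'B'  "AB" is code 257:                                 string_code = 257
 *   read 'A'  257+'A' not found:   add 259 = "ABA", output 257, string_code = 'A'
 *   read 'B'  "AB" is code 257:                                 string_code = 257
 *   EOF                            output 257, output END_OF_STREAM
 *
 * Five 12-bit codes (60 bits) replace six 8-bit symbols (48 bits), which
 * illustrates why very short inputs expand rather than compress.
 */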
/* * Local prototypes. */ #ifdef __STDC__ unsigned int find_child_node( int parent_code, int child_character ); unsigned int decode_string( unsigned int offset, unsigned int code ); #else unsigned int find_child_node(); unsigned int decode_string(); #endif char *CompressionName = "LZW 12 Bit Encoder"; char *Usage = "in-file out-file\n\n"; /* * This data structure defines the dictionary. Each entry in the dictionary * has a code value. This is the code emitted by the compressor. Each * code is actually made up of two pieces: a parent_code, and a * character. Code values of less than 256 are actually plain * text codes. */ struct dictionary { int code_value; int parent_code; char character; } dict[ TABLE_SIZE ]; char decode_stack[ TABLE_SIZE ]; /* * The compressor is short and simple. It reads in new symbols one * at a time from the input file. It then checks to see if the * combination of the current symbol and the current code are already * defined in the dictionary. If they are not, they are added to the * dictionary, and we start over with a new one symbol code. If they * are, the code for the combination of the code and character becomes * our new code. */
void CompressFile( input, output, argc, argv ) FILE *input; BIT_FILE *output; int argc; char *argv[]; { int next_code; int character; int string_code; unsigned int index; unsigned int i; next_code = FIRST_CODE; for ( i = 0 ; i < TABLE_SIZE ; i++ ) dict[ i ].code_value = UNUSED; if ( ( string_code = getc( input ) ) == EOF ) string_code = END_OF_STREAM; while ( ( character = getc( input ) ) != EOF ) { index = find_child_node( string_code, character ); if ( dict[ index ].code_value != -1 ) string_code = dict[ index ].code_value; else { if ( next_code <= MAX_CODE ) { dict[ index ].code_value = next_code++; dict[ index ].parent_code = string_code; dict[ index ].character = (char) character; } OutputBits( output, (unsigned long) string_code, BITS ); string_code = character; } } OutputBits( output, (unsigned long) string_code, BITS ); OutputBits( output, (unsigned long) END_OF_STREAM, BITS ); while ( argc-- > 0 ) printf( "Unknown argument: %s\n", *argv++ ); } /* * The file expander operates much like the encoder. It has to * read in codes, the convert the codes to a string of characters. * The only catch in the whole operation occurs when the encoder * encounters a CHAR+STRING+CHAR+STRING+CHAR sequence. When this
* occurs, the encoder outputs a code that is not presently defined * in the table. This is handled as an exception. */ void ExpandFile( input, output, argc, argv ) BIT_FILE *input; FILE *output; int argc; char *argv[]; { unsigned int next_code; unsigned int new_code; unsigned int old_code; int character; unsigned int count; next_code = FIRST_CODE; old_code = (unsigned int) InputBits( input, BITS ); if ( old_code == END_OF_STREAM ) return; character = old_code; putc( old_code, output ); while ( ( new_code = (unsigned int) InputBits( input, BITS ) ) != END_OF_STREAM ) { /* ** This code checks for the CHARACTER+STRING+CHARACTER+STRING+CHARACTER ** case which generates an undefined code. It handles it by decoding ** the last code, and adding a single character to the end of the decode string. */ if ( new_code >= next_code ) { decode_stack[ 0 ] = (char) character; count = decode_string( 1, old_code ); } else count = decode_string( 0, new_code ); character = decode_stack[ count - 1 ]; while ( count > 0 ) putc( decode_stack[ --count ], output ); if ( next_code <= MAX_CODE ) { dict[ next_code ].parent_code = old_code; dict[ next_code ].character = (char) character; next_code++;
} old_code = new_code; } while ( argc-- > 0 ) printf( "Unknown argument: %s\n", *argv++ ); } /* * This hashing routine is responsible for finding the table location * for a string/character combination. The table index is created * by using an exclusive OR combination of the prefix and character. * This code also has to check for collisions, and handles them by * jumping around in the table. */ unsigned int find_child_node( parent_code, child_character ) int parent_code; int child_character; { int index; int offset; index = ( child_character << ( BITS - 8 ) ) ^ parent_code; if ( index == 0 ) offset = 1; else offset = TABLE_SIZE - index; for ( ; ; ) { if ( dict[ index ].code_value == UNUSED ) return( index ); if ( dict[ index ].parent_code == parent_code && dict[ index ].character == (char) child_character ) return( index ); index -= offset; if ( index < 0 ) index += TABLE_SIZE; } } /* * This routine decodes a string from the dictionary, and stores it * in the decode_stack data structure. It returns a count to the * calling program of how many characters were placed in the stack.
*/ unsigned int decode_string( count, code ) unsigned int count; unsigned int code; { while ( code > 255 ) { decode_stack[ count++ ] = dict[ code ].character; code = dict[ code ].parent_code; } decode_stack[ count++ ] = (char) code; return( count ); } /************************** End of LZW12.C *************************/
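For completeness, the Usage strings in the three modules suggest command lines of roughly the following form once each module is linked against MAIN-C.C or MAIN-E.C together with BITIO.C and ERRHAND.C. The executable names and file names below are placeholders assumed here, not part of the assignment:

    huff-c  infile infile.huf -d     static order-0 Huffman compression, dumping the model
    huff-e  infile.huf outfile       Huffman expansion
    ahuff-c infile infile.ahuf       adaptive Huffman compression
    lzw12-c infile infile.lzw        12-bit LZW compression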