|
|
|
|
ARITHMETIC CODING
REVISITED: NEW FLOATING POINT ARITHMETIC TECHNIQUE Since the beginning of nineties a new statistical compression technique has emerged which follows shannon’s entropy formula closely. No. of bits = - log base 2(probability) In statistical methods, the probability model is important. Earlier it was Huffman coding which ruled the world. Huffman coding can not go beyond 1 bit. This means that if a symbol had a probability of 1/3 it can be encoded in say about 2 bits which is greater than the optimum value of 1.6 by 25 %. For more glaring example of the inefficiencies involved, take for example a character with 90 % probability. The optimal coding size would be 0.15 bits. The Huffman coding would give 1 bit. A difference of six times. In arithmetic coding, the codes follow shanon’s theoretical optimal value closely. If the probability distribution is right, the arithmetic coding gives better results by 1 bit per byte. In Huffman coding, it is very simple. First the symbol probabilities are found. In accordance to the symbol probability the new codes to the symbol are given. Smaller the symbol probability, larger the code. The code given is unique. No two codes are same. In technical terms, huffman trees are formed, which maintain uniqueness of the code. Arithmetic coding on the other hand gives the result in the form of a very long continuous number. It replaces a stream of input symbols with a single floating point output number. The output from an arithmetic coding process is a single number less than 1 and greater than or equal to 0. This single number can be uniques decoded to create back the original stream of characters. Arithmetic coding starts with finding probability distribution. Once character probabilities are known, individual symbols need to be assigned a range along a probability line. As new symbols are encoded, the possible range of the output numbers get restricted. Since most computers work on digits of 80 bit , it is not possible to encode millions of characters coming in the stream directly. Hence special techniques have been developed which take characters one by one and encode it. Till now integer arithmetic was used in almost all the arithmetic coding schemes. That involves under flow bits and their management. That made the arithmetic coding more complex and cumbersome. In the new method developed by me floating point arithmetic is used. That makes the method very straight forward. Of course probability distribution calculations too are done differently. As expected the new scheme is slower than Huffman coding but gives better results than Huffman coding. Efficiency wise there is little difference between my method and the present day arithmetic coding scheme.. Almost all the compressors available use 2 or 3 levels of compression. In the first level either it is LZ 77 or LZ 78 dictionary based method. It could be BWT transform also. Second stage comprises of a good statistical coding scheme. Any developer has to first search the patent of that method which he intends to use because since early eighties patents have been issued at a vigorous pace to these so-called methods. Many a battle ‘ROYALS’ have been fought by the parties over the patent rights. The new scheme developed by me for Arithmetic coding is equivalent to the patented scheme of IBM performance wise to the best of my testing. The major difference lies in that my scheme works on byte level. At no stage bit level operations are done. Since IBM has patented a good part of the present day arithmetic coding scheme, not many commercial packages can use arithmetic coding scheme of compression.. If applied with BWT transform my scheme can give very good compression ratios. A sample demonstration program is given below for test purpose. It employs a simple order 0, static dictionary model. Since the main aim was to demonstrate the technique, very little care has been taken to make the program efficient. Use of cache memory and assembly language routines can speed up the execution very significantly. To reduce the number of arithmetic operations in the coder, we can use string manipulation techniques to replace the last digits of the floating point number. The program employs rather lengthy procedure using power formula which is computational intensive. As expected decompression is slower than compression because of search operations. By employing binary search it can be made fast. Another routine which is computational intensive is getting the floating point number from compressed data. In any case, through proper mix of assembly language and optimized programming technique the decompression time can be reduced significantly. Another good
property of the method is that the compressed data can be obtained in any number system. Just by
changing the multiplicand the number base can be changed and the base of the number system is
changed. We can get the results in odd bases such as 7,
9 etc. In fact any value up to 65535. If the memory
width is 9 bits, we need not convert the data into first 8 bits and then take an extra bit from next
byte. A very minor theoretical limitation of the program could come if the probability of a symbol is less than 10e-6. This is not a difficult case to handle and slight modification in the scheme can solve the problem. The source code would be released after all the known or unknown bugs are removed. May I request the reader to give this technique a try and find if the method fails in any one case so that I can fix up the problem. To report any failure : Please contact anjain@icenet.net along with file in which the failure occurred. Download : arithn.exe |
|
|