Effective Arithmetic Coding

Effective Arithmetic Coding
By Sachin Garg ( schngrg at gmail.com , http://www.sachingarg.com )

Latest version of this page is available at
http://www.sachingarg.com/articles/sgarith.html

Introduction/References

Arithmetic coding has become a very important technique used in compression. Code by Witten, Neal, and Cleary, which was published in CACM [1987] has become the de facto standard. But it is slightly inefficient, and causes loss in compression ratio achieved.

Any improvements to the coder will yield either better compression efficiency, that is, a reduction in time/memory usage or yield improved compression effectiveness, that is, a decrease in the size of the encoded data. Here I am primarily interested in compression effectiveness.

If you are interested in efficiency, I suggest you read Arithmetic Coding Revisited by Alistair Moffat, Radford M. Neal and Ian H. Witten. Published in July 1998 issue of CACM.

The code here is based on Code published by Mark Nelson in February 1991 issue of Dr. Dobb's Journal, with his article Arithmetic Coding + Statistical Modeling = Data Compression.

Technicalities

Major problem with the CACM code was arithmetic overflow in steps involving calculations of new high and low.

If the machine supports w-bit arithmetic. (Until recently, most systems supported 32-bit arithmetic). Then, to avoid overflow, the no of bits used to store cumulative frequency counts ( f ) is restricted by w. It is required that w >= 2f + 1. Thus, (as w = 32) f can not exceed 15.And as a result, sum of frequency counts in any given context can not be more than 32727. This is a sever limitation that forces us to use approximations. (Usually, frequency counts are halved whenever they are near the limit)

Solution is to use 64-bit arithmetic. Most current compilers and systems allow use of 64-bit arithmetic. (Implemented as unsigned long long or __int64) Now, as w = 64, we can have f = 31. Thus sum of frequency counts can be as much as 2147483648, which is value big enough for most practical purposes. And reduces the loss to give results nearly same as the theoretical value of log2(1/P).

Results

Following are the test results obtained which clearly indicate the benefits of using the 64-bit implementation. Here I am using order-0 model.

32-bit implementation refers to Mark Nelson's original order-0 arithmetic coder. (comp-1.c). It uses 32-bit variables with max value of high=0xffff

64-bit implementation refers to my (Sachin Garg's) modified version of Mark Nelson's order-0 arithmetic coder. It uses 64-bit variables with max value of high=0xffffffff

Testing is done on text files taken from Large Canterbury Corpus.

File Name	bible.txt	e.coli	world192.txt
Original Size	4047392	4638690	2473400
Size after compression with 32-bit implementation	2202081	1176940	1545481
Size after compression with 64-bit implementation	2197537	1160066	1545743
Gain	0.20%	1.43%	-0.01%

Source Code

In this code, arithmetic coding is implemented using w = 64 and f = 30.

Click here to download zipped 64 bit implementation.
Click here to download zipped 32 bit implementation.

You can contact me at schngrg at gmail.com, and visit me at http://www.sachingarg.com
Suggestions, improvements, errors, feedback etc... are requested. Please report any broken links.