INFORMATION CONTENT AND COMPRESSION LIMIT FAQ

This FAQ is meant to answer questions regarding the limits of compression and the measure of information content in general terms. Implementation issues and specific compression algorithms are not addressed.

This document represents the individual research of the author, and not of any associated organization. Some of the conclusions drawn in this text may be contrary to those found in some accepted texts on the subject. The author has come to these conclusions through extensive research using as few preconceived assumptions as possible, and has had no luck whatsoever in determining the cause of the discrepancy with said accepted texts.

If you have any questions you would like to see added to the FAQ, or if you have any comments about the FAQ, please E-Mail the author.


Part 1) Is there an absolute measure of information content?

Part 2) What about Kolmogorov Complexity?

Part 2b) But what about the KC_A(x) <= KC_B(x) + C thing?

Part 2c) Are you sure? I thought I read that Kolmogorov Complexity was very important.

Part 3) What about Universal Turing Machines?

Part 4) Isn't it possible there's some undiscovered method of encoding that can do better than all known methods?

Part 5) Couldn't there be some undiscovered transform like DCT or DWT that would allow better potential compression?

Part 6) Couldn't compressors do better if they work with the whole file at once instead of a bit at a time?

Part 7) How relevant is the size of the decompressor?

Part 8) Can information be "hidden" in a piece of hardware?

Part 9) What about an information content FAQ for lossy compression?

Part 10) Couldn't there be some new compressor which evades the counting argument?

Part 11) How can we make better compressors?


Part 1) Is there an absolute measure of information content?

No.

There is no absolute measure of information content. Any measure of information content requires choosing a probability model.

Proof: Click here!
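
Here is a minimal sketch in Python (the two models below are made up purely for illustration) showing that the very same string is assigned a different ideal codelength, -Log2(P(S)) bits (see Part 4), under different probability models. Neither answer is any more "absolute" than the other.

    # Minimal sketch: the ideal codelength -log2(P(S)) of the same string
    # differs depending on which probability model is chosen.
    # Both models are hypothetical, picked only for illustration.
    from math import log2

    S = "AAAB"

    def model_uniform(s):
        # Every string over {A, B} of this length is equally likely.
        return 0.5 ** len(s)

    def model_biased(s):
        # A model that believes 'A' occurs with probability 0.9.
        p = 1.0
        for c in s:
            p *= 0.9 if c == "A" else 0.1
        return p

    for name, model in [("uniform", model_uniform), ("biased", model_biased)]:
        print(name, -log2(model(S)), "bits")
    # uniform: 4.0 bits; biased: about 3.78 bits -- no single "true" answer.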


Part 2) What about Kolmogorov Complexity?

Kolmogorov-Chaitin Complexity of a string S is defined as the size of the smallest program which will produce the string S. This is perfectly good for measuring the information in a string when given a specific language, but fails completely in the general case when all languages are to be considered. Click here for details.

Interestingly, contrary to popular belief, there exist languages for which the Kolmogorov Complexity is directly and easily computable. The trick here is that these languages are not universal languages, but it turns out that universal languages are inherently inefficient for representing information anyway. Read up on multiple representations and efficiency to find out why.

For an attempt to define information content, and thus rationalize the non-uniqueness of Kolmogorov Complexity, see here.
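
As one toy illustration of such a language (my own deliberately trivial example, not a full treatment): a non-universal language in which every program simply outputs itself verbatim. For this language, the Kolmogorov Complexity of any string is exactly its length, and is therefore directly computable:

    # A toy, non-universal language: every "program" just outputs itself
    # verbatim, so the shortest program producing S is S itself and
    # KC(S) is exactly len(S) -- directly computable, no search needed.
    def run_literal_language(program: str) -> str:
        return program

    def kc_literal(s: str) -> int:
        return len(s)

    print(kc_literal("hello world"))   # 11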


Part 2b) But what about the KC_A(x) <= KC_B(x) + C thing?

Kolmogorov's fundamental theorem (the invariance theorem) states that there exists a language A such that, for any language B and string x, KC_A(x) <= KC_B(x) + C(A,B), where the constant C depends only on A and B, not on x.

This is true. In fact, there is a large class of languages that satisfy this requirement for A. Unfortunately, since C is dependent on A and B, this tells us nothing about the properties of languages.

Why? Well, the inequality is quite obvious. For instance, I offer the following:

Let us consider bubble gum. We define KC_A(x) as the time it takes for gum A to lose its flavor when chewed by person x.

We find the amazing Fundamental Theorem of Bubble Gum:

For any bubble gums A and B and person x, KC_A(x) <= KC_B(x) + C(A,B).

This does not tell us anything about bubble gum or the chewing habits of people. Similarly, Kolmogorov's theorem tells us nothing of languages or strings. Again, click here for details.


Part 2c) Are you sure? I thought I read that Kolmogorov Complexity was very important.

Kolmogorov Complexity is very important. The concept allows us to formulate some important proofs in information theory. This document does not dispute that. The misconception lies in the meaning commonly associated with Kolmogorov Complexity. Kolmogorov Complexity is in no way a universal, approximable, meaningful quantity that can be associated with a given object or binary string. However, certain facts about Kolmogorov Complexity with respect to specific languages and the relationships between languages allow us to formulate said important proofs.


Part 3) What about Universal Turing Machines?

There is a common misconception that Universal Turing Machines avoid using any specific language. Somehow the use of a universal language is expected to avoid the bias of other languages.

This is false.

A Universal Turing Machine (UTM), like any other Turing Machine (TM), uses a tape. The tape on a TM contains a program and data. The tape on a UTM contains a program and data, and it also contains a description of a TM that can read the program.

The reasoning behind the misconception goes like this: since a UTM can emulate any possible TM, its program can be written in any language we like, so it could always be in the most appropriate language. This part is true.

However, the UTM also requires a description of the TM that can read the program. This description is encoded in some FIXED way, which must be decided when the UTM is created.

Turing machines react in different ways depending on which state they are currently in. For each state, the machine reacts differently depending on which symbol it encounters on the tape. For each state-symbol combination, it puts itself into a new state, writes a new symbol on the tape over the old symbol, and moves in a certain direction.

Say there are N states, S symbols and D directions. The description on the tape must contain N*S occurrences of NewState / NewSymbol / Direction triples.

We might use an expanding code like a Fibonacci code to record N, S and D. This lets us have arbitrarily large numbers. Then, we can use a simple fixed-length code for each occurrence of NewState / NewSymbol / Direction.

Or, we could use a Delta code instead of the Fibonacci code and use a fixed table Huffman code for the NewState / NewSymbol / Direction triples. Now, the description is a different length!

In other words, we can make several UTMs which need different lengths of tape to do exactly the same task! There is no one almighty Universal Turing Machine; there are many.
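
To make this concrete, here is a small sketch in Python. The two description formats below are arbitrary choices of my own (simpler than the Fibonacci/Delta schemes mentioned above), and the machine size is made up; the point is only that the same transition table costs a different number of bits under each format.

    # Sketch: the same Turing machine needs different amounts of tape
    # under different (arbitrarily chosen) description formats.
    from math import ceil, log2

    def fixed_length_bits(n_states, n_symbols, n_dirs):
        # Header: each of N, S, D written in unary (v ones then a zero),
        # followed by N*S triples, each field in a fixed-length binary code.
        header = (n_states + 1) + (n_symbols + 1) + (n_dirs + 1)
        triple = ceil(log2(n_states)) + ceil(log2(n_symbols)) + ceil(log2(n_dirs))
        return header + n_states * n_symbols * triple

    def byte_aligned_bits(n_states, n_symbols, n_dirs):
        # A clunkier format: every field is padded to a whole byte.
        header = 3 * 8
        triple = 3 * 8
        return header + n_states * n_symbols * triple

    N, S, D = 5, 3, 2   # a small hypothetical machine
    print(fixed_length_bits(N, S, D))   # 13 + 15*(3+2+1) = 103 bits
    print(byte_aligned_bits(N, S, D))   # 24 + 15*24      = 384 bits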

What if we used another UTM to emulate any of these UTMs? Would that let us always use the shortest tape? Well, yes, but no matter how many UTMs we stack up, the one on top would need its own tape with its own description format, too. And so on, forever.

Anyway, despite what most sources say, Universal Turing Machines are actually inherently inefficient for encoding purposes, and we can make machines which do better for compression. Read up on multiple representations and efficiency to find out why.


Part 4) Isn't it possible there's some undiscovered method of encoding that can do better than all known methods?

No.

By Kraft and Shannon and such, given the probability P(S) of a string S occurring, the optimal codelength for S is -Log2(P(S)) bits.

This is the length achieved by infinite precision AriCoding, and is indicative of the information content of S under the model P.

With Arithmetic Coding and an appropriate probability model, we can do equal to or better than ANY POSSIBLE ENCODER.

Why?

Given any encoder E and string S, there exists a probability model P such that AriCodeLength(P(S)) <= Length(E(S)).

Define P(S) = 2 ^ -Length(E(S)) for all strings S. (By the Kraft inequality, these values sum to at most 1 for any uniquely decodable encoder E, so P is a valid probability model.)

AriCodeLength(P(S)) = -Log2(2 ^ -Length(E(S)))

= -(-Length(E(S)))

= Length(E(S))
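
As a sanity check, here is the same identity in runnable Python. The zlib compressor below merely stands in for an arbitrary encoder E; any encoder would do.

    # Numerical check: build the model induced by an encoder and verify
    # that the ideal arithmetic codelength equals the encoder's output length.
    from math import log2
    import zlib

    def length_E(s: bytes) -> int:
        # Output length of the stand-in encoder E, in bits.
        return 8 * len(zlib.compress(s))

    def P(s: bytes) -> float:
        # The probability model induced by E, exactly as defined above.
        return 2.0 ** -length_E(s)

    def ari_codelength(p: float) -> float:
        # Ideal arithmetic codelength for probability p.
        return -log2(p)

    s = b"an example string, repeated " * 4
    print(length_E(s), ari_codelength(P(s)))   # the two lengths coincide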


Part 5) Couldn't there be some undiscovered transform like DCT or DWT that would allow better potential compression?

No. Probability modelling can do everything a transform can do.

Why?

Given any probability model P and reversible mapping M, there exists a probability model Q such that Q(M(S)) = P(S) for any string S.

M is reversible, thus for all strings S and T, S != T implies M(S) != M(T), and the inverse mapping M^-1 exists.

Then we can define Q by Q(T) = P(M^-1(T)) for all strings T, so that Q(M(S)) = P(M^-1(M(S))) = P(S) for all strings S.

So, reversible transforms (DCT, DWT, etc) are irrelevant to the potential compression, since the same results can be achieved using a probability model. (These transforms must be reversible, or you can't get your original file back.) However, transforms can make it easier to implement probability models, and they can also make it easier to measure closeness for lossy compression (see below).
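
Here is a small sketch of this construction in Python. The rotation transform and the order-0 model are toy choices, used only to show how Q is built from P and M:

    # Composing a probability model with a reversible transform.
    from math import log2

    def M(s: str) -> str:
        # A trivially reversible transform: rotate the string left by one.
        return s[1:] + s[:1]

    def M_inverse(t: str) -> str:
        # Undo the rotation.
        return t[-1:] + t[:-1]

    def P(s: str) -> float:
        # Toy order-0 model: 'a' has probability 0.8, anything else 0.2.
        p = 1.0
        for c in s:
            p *= 0.8 if c == "a" else 0.2
        return p

    def Q(t: str) -> float:
        # Q is defined so that Q(M(S)) = P(S) for every string S.
        return P(M_inverse(t))

    s = "aaab"
    print(-log2(P(s)), -log2(Q(M(s))))   # identical codelengths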


Part 6) Couldn't compressors do better if they work with the whole file at once instead of a bit at a time?

No.

If we assign a distinct probability to every possible file, and thus assign an optimal codelength, we can achieve that length exactly even if we transmit only one bit at a time.

Why?

Given any probability model P, there exists a serial probability model Ps such that P(S) = product(Ps(S,n), n = 1..N), where N is the number of symbols in S.

Define Sn = string of first n symbols of S

Define Ps(S,n) = P(Sn)/P(Sn-1) where P(S0) = 1

Then product(Ps(S,n),n = 1..N)

= P(S1)/P(S0) * P(S2)/P(S1) * ... * P(SN-1)/P(SN-2) * P(SN)/P(SN-1)

= P(SN)/P(S0)

= P(SN)

= P(S)
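
Here is the telescoping product checked numerically in Python. The whole-string model P below is just a toy order-0 model; any model would work the same way:

    # Check that the serial model reproduces the whole-string model.
    from math import prod

    def P(s: str) -> float:
        # Toy whole-string model: '1' has probability 0.7, '0' has 0.3.
        p = 1.0
        for c in s:
            p *= 0.7 if c == "1" else 0.3
        return p

    def Ps(s: str, n: int) -> float:
        # Conditional probability of the n-th symbol given the first n-1
        # symbols (P of the empty prefix is 1, matching P(S0) = 1 above).
        return P(s[:n]) / P(s[:n - 1])

    S = "1101001"
    serial = prod(Ps(S, n) for n in range(1, len(S) + 1))
    print(P(S), serial)   # equal (up to floating-point rounding):
                          # coding one symbol at a time loses nothing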


Part 7) How relevant is the size of the decompressor?

In the general sense where we're transferring files and stuff, the size of the decompressor is completely irrelevant.

The size of the decompressor is only relevant when the hardware it's run on is fixed, like in the Compression Challenge.

For example, just how big is UnZip? Well, it is different sizes on different hardware. You could even make a chip that can UnZip files, and then the decompressor size is zip!

But, some might ask, isn't that just moving the information to the hardware? See part 8...


Part 8) Can information be "hidden" in a piece of hardware?

No.

We can safely say that if we can transmit a complete, unambiguous description of the hardware, then we have transmitted all of the bits "hidden" inside it.

To write this description, we need to choose a language. Which one? Well, take your pick. Different languages will give you different results.

This is exactly the same problem as asking how many bits are "hidden" in a string. You have to pick a language, or else the quantity remains undefined.

The problem is recursive: describing the hardware just raises the same question all over again. You cannot break out of the loop without making an arbitrary decision, like choosing your favorite language as a final method of representation.

Finally, this language can be chosen such that the description of the particular piece of hardware is arbitrarily small, say 1 bit.


Part 9) What about an information content FAQ for lossy compression?

This FAQ in some ways covers lossy encoders as well as lossless encoders.

Lossy encoders simply quantize a string from one set to a string from a smaller set using some closeness measure, and encode the resulting string losslessly. In other words, lossy encoding = quantization + lossless coding. Finding the closest match from the original set to the quantized set is often, but not always, accomplished by transforming the string to a different basis and truncating (or rounding) the transformed string.
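
Here is a minimal sketch of that two-stage structure in Python. The uniform quantizer and the zlib back end are stand-ins; real lossy codecs use transforms like DCT or DWT and tuned entropy coders, but the shape is the same:

    # Lossy encoding = quantization (the lossy part) + lossless coding.
    import zlib

    def quantize(samples, step):
        # Map each sample to the nearest multiple of `step`.
        return [round(x / step) for x in samples]

    def dequantize(codes, step):
        return [c * step for c in codes]

    samples = [0.12, 0.13, 0.11, 0.52, 0.49, 0.51, 0.90, 0.88]
    codes = quantize(samples, step=0.1)       # [1, 1, 1, 5, 5, 5, 9, 9]
    payload = zlib.compress(bytes(codes))     # the lossless part
    # (For a toy input this small, zlib's header overhead dominates;
    #  the point is the structure, not the compression ratio.)
    print(codes, len(payload), dequantize(codes, step=0.1))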


Part 10) Couldn't there be some new compressor which evades the counting argument?

Short answer: NO!

Long answer: Okay, the most common claim made by people inventing "universal compressors" is that their "new" or "revolutionary" method somehow evades the counting argument, because the counting argument fails to consider their "revolutionary new" algorithm.

I think we need to review what it means to have a proof.

The counting argument does not and cannot fail to consider any algorithm. Why? It is based on a generalization: there are 2^n distinct strings of n bits, but only 2^n - 1 distinct strings of fewer than n bits, so no lossless compressor can shorten every possible input of a given length. The proof is completed without specifying a particular compressor. Thus, it holds true for all possible compressors.

As an analogy, it is provable that the inside angles of any simple n-sided polygon add up to 180*(n-2) degrees, without specifying a particular polygon. Now, the claims some of these people make would be like claiming that they have found some "revolutionary new" polygon that evades the proof due to its particular arrangement of angles and side lengths or something.

Basically, in a proof such as the counting argument, anything left undefined in the proof is completely irrelevant to the proof. That is the whole point of proofs.
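
For completeness, here is the counting argument checked exhaustively for a small string length in Python. Notice that no particular compressor appears anywhere in it:

    # The counting (pigeonhole) argument for tiny strings: there are 2**n
    # distinct strings of n bits, but only 2**n - 1 distinct strings of
    # fewer than n bits (even counting the empty string), so no lossless
    # compressor can map every n-bit input to a strictly shorter output.
    n = 8
    inputs = 2 ** n                                       # 256
    shorter_outputs = sum(2 ** k for k in range(0, n))    # 1 + 2 + ... + 128 = 255
    print(inputs, shorter_outputs)
    assert shorter_outputs < inputs   # at least one input cannot shrink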


Part 11) How can we make better compressors?

In this FAQ we've seen that Arithmetic Coding allows the best possible encoding for a given probability model, that probability modelling can meet or beat any other computable method, and that there is no universal probability model. So, what's the point of trying to make new compressors?

Nowhere in this FAQ does it say compression is finished. There is no universal compressor, but that doesn't mean there are no good compressors! And there are certainly better compressors than the ones we use today.

Similarly, we cannot make a prediction algorithm that closely models every possible file and achieves compression, but that doesn't mean we can't make a prediction model that closely models stuff down here on Earth.

In other words, probably the most useful thing we can do to improve compression is to research prediction algorithms which better model our world. This may involve new transforms which mirror real-world data correlations, or maybe new statistical models of phenomena and whatnot.

Another useful thing we can do is to make faster ways of encoding based on probabilities, similar to Arithmetic and Huffman Coding.

- GLYPH