INFORMATION CONTENT AND COMPRESSION LIMIT FAQ
This FAQ is meant to answer questions regarding
the limits of compression and the measure of information content
in general terms. Implementation issues and specific compression
algorithms are not addressed.
This document represents the individual research of the author,
and not the views of any associated organizations. Some of the conclusions
drawn in this text may be contrary to those found in some
accepted texts on the subject, but the author of this text has
come to these conclusions based on extensive research using as
few preconceived assumptions as possible, and has had no luck
whatsoever in determining the cause of the discrepancy with said
accepted texts.
If you have any questions you would like to see added to the FAQ,
or if you have any comments about the FAQ, please E-Mail the author.
Part 1)
Is there an absolute measure of information content?
Part 2)
What about Kolmogorov Complexity?
Part 2b)
But what about the KC_A(x) <= KC_B(x) + C thing?
Part 2c)
Are you sure? I thought I read that Kolmogorov
Complexity was very important.
Part 3)
What about Universal Turing Machines?
Part 4)
Isn't it possible there's some undiscovered method of
coding that can do better than all known methods?
Part 5)
Couldn't there be some undiscovered transform like DCT
or DWT that would allow better potential compression?
Part 6)
Couldn't compressors do better if they work with the
whole file at once instead of a bit at a time?
Part 7)
How relevant is the size of the decompressor?
Part 8)
Can information be "hidden" in a piece of
hardware?
Part 9)
What about an information content FAQ for lossy
compression?
Part 10)
Couldn't there be some new compressor which evades the
counting argument?
Part 11)
How can we make better compressors?
Part 1) Is there an absolute
measure of information content?
No.
There is no absolute measure of information content. Any measure
of information content requires choosing a probability model.
Part 2) What about Kolmogorov
Complexity?
Kolmogorov-Chaitin Complexity of a string S is defined as the
size of the smallest program which will produce the string S.
This is perfectly good for measuring the information in a string
when given a specific language, but fails completely in the
general case, when all languages are to be considered.
Interestingly, contrary to popular belief, there exist languages
for which the Kolmogorov Complexity is directly and easily
computable. The trick here is that these languages are not
universal languages, but it turns out that universal languages are
inherently inefficient for representing information anyway.
Read up on multiple representations and efficiency
to find out why.
Part 2b) But what about the
KC_A(x) <= KC_B(x) + C thing?
Kolmogorov's fundamental theorem states that there exists a
language A such that, for any language B and string x, KC_A(x)
<= KC_B(x) + C(A,B)
This is true. In fact, there is a large class of languages that
satisfy this requirement for A. Unfortunately, since C is
dependent on A and B, this tells us nothing about the
properties of languages.
Why? Well, the inequality is quite obvious. For instance, I offer
the following:
Let us consider bubble gum. We define KC_A(x) as the time it
takes for gum A to lose its flavor when chewed by person x.
We find the amazing Fundamental Theorem of Bubble Gum:
For any bubble gums A and B and person x, KC_A(x) <= KC_B(x) +
C(A,B).
This does not tell us anything about bubble gum or the chewing
habits of people. Similarly, Kolmogorov's theorem tells us
nothing of languages or strings.
Part 2c)
Are you sure? I thought I read that Kolmogorov
Complexity was very important.
Kolmogorov Complexity is
very important. The concept allows us to formulate some important
proofs in information theory. This document does not dispute
that. The misconception lies within the meaning commonly
associated with Kolmogorov Complexity. Kolmogorov Complexity is
in no way a universal, approximable, meaningful quantity that
can be associated with a given object or binary string. However,
certain facts about Kolmogorov Complexity with respect to
specific languages and the relationships between languages allow
us to formulate said important proofs.
Part 3) What about Universal
Turing Machines?
There is a common misconception that Universal Turing Machines
avoid using any specific language. Somehow the use of a universal
language is expected to avoid the bias of other languages.
This is false.
A Universal Turing Machine (UTM), like any other Turing Machine
(TM), uses a tape. The tape on a TM contains a program and data.
The tape on a UTM contains a program and data, and it also
contains a description of a TM that can read the program.
The misconception is here: Since a UTM can emulate any possible
TM, then its program can be in any language we like, so it could
always be in the most appropriate language. This is true.
However, the UTM also requires a description of the TM that can
read the program. This description is encoded in some FIXED way,
which must be decided when the UTM is created.
Turing machines react in different ways depending on which state
they are currently in. For each state, the machine reacts
differently depending on which symbol it encounters on the tape.
For each state-symbol combination, it puts itself into a new
state, writes a new symbol on the tape over the old symbol, and
moves in a certain direction.
Say there are N states, S symbols and D directions. The
description on the tape must contain N*S occurrences of NewState
/ NewSymbol / Direction triples.
We might use an expanding code like a Fibonacci code to record N,
S and D. This lets us have arbitrarily large numbers. Then, we
can use a simple fixed-length code for each occurrence of
NewState / NewSymbol / Direction.
Or, we could use a Delta code instead of the Fibonacci code and
use a fixed table Huffman code for the NewState / NewSymbol /
Direction triples. Now, the description is a different length!
In other words, we can make several UTMs which need different
lengths of tape to do exactly the same task! There is no one
almighty Universal Turing Machine; there are many.
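The point above can be illustrated with a sketch. The specific encodings and machine sizes below are assumptions chosen for illustration: the same TM (N states, S symbols, D directions, so N*S triples in the body) is described with two different header codes, a Fibonacci code and an Elias delta code, and the resulting tape lengths differ.

```python
import math

def fibonacci_code_len(n):
    # Length in bits of the Fibonacci (Zeckendorf) code for n >= 1:
    # one bit per Fibonacci number up to n, plus a terminating '1'.
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    k = max(i for i, f in enumerate(fibs) if f <= n)
    return k + 2  # k+1 data bits plus the terminator bit

def elias_delta_len(n):
    # Length in bits of the Elias delta code for n >= 1.
    lg = n.bit_length() - 1           # floor(log2 n)
    llg = (lg + 1).bit_length() - 1   # floor(log2 (lg + 1))
    return lg + 2 * llg + 1

def description_len(n_states, n_symbols, n_dirs, header_len):
    # Header records N, S, D; body holds N*S fixed-length triples.
    header = (header_len(n_states) + header_len(n_symbols)
              + header_len(n_dirs))
    triple = (math.ceil(math.log2(n_states))
              + math.ceil(math.log2(n_symbols))
              + math.ceil(math.log2(n_dirs)))
    return header + n_states * n_symbols * triple

a = description_len(6, 4, 2, fibonacci_code_len)  # 156 bits
b = description_len(6, 4, 2, elias_delta_len)     # 158 bits
assert a != b  # same TM, two UTM formats, two tape lengths
```

Neither format is "the" correct one; each UTM's fixed description format is an arbitrary choice.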
What if we used another UTM to emulate any of these UTMs? Would
that let us always use the shortest tape? Well, yes, but no
matter how many UTMs we stack up, the one on top would need its
own tape with its own description format, too. And so on,
forever.
Anyway, despite what most sources say, Universal Turing Machines
are actually inherently inefficient for encoding purposes, and we
can make machines which do better for compression. Read up on multiple
representations and efficiency to find out why.
Part 4) Isn't it possible
there's some undiscovered method of encoding that can do better
than all known methods?
No.
By Kraft and Shannon and such, given the probability P(S) of a
string S occurring, the optimal codelength for S is -Log2(P(S))
bits.
This is the length achieved by infinite precision AriCoding, and
is indicative of the information content of S under the model P.
With Arithmetic Coding and an appropriate probability model, we can do
equal to or better than ANY POSSIBLE ENCODER.
Why?
Given any encoder E and string S, there exists a probability
model P such that AriCodeLength(P(S)) <= Length(E(S)).
Define P(S) = 2 ^ -Length(E(S)) for all strings S
AriCodeLength(P(S)) = -Log2(2 ^ -Length(E(S)))
= -(-Length(E(S)))
= Length(E(S))
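The derivation above can be checked mechanically. In the sketch below, the encoder E is a hypothetical stand-in (a flat 8 bits per character), and AriCodeLength is the ideal infinite-precision arithmetic codelength -Log2(p):

```python
import math

def length_E(s):
    # Stand-in encoder E (an assumption for illustration):
    # a flat 8 bits per character.
    return 8 * len(s)

def P(s):
    # The probability model induced by E: P(S) = 2^-Length(E(S)).
    return 2.0 ** -length_E(s)

def ari_code_len(p):
    # Ideal arithmetic-coding length for an event of probability p.
    return -math.log2(p)

s = "hello"
# Arithmetic coding under P matches E's output length exactly.
assert ari_code_len(P(s)) == length_E(s)  # 40 bits
```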
Part 5) Couldn't there be some
undiscovered transform like DCT or DWT that would allow better
potential compression?
No. Probability modelling can do everything a transform can do.
Why?
Given any probability model P and reversible mapping M, there
exists a probability model Q such that Q(M(S)) = P(S) for any
string S.
M is reversible, thus for all strings S and T, S != T implies
M(S) != M(T)
Then we can define Q(S) such that for all strings S, Q(M(S)) =
P(S)
So, reversible transforms (DCT, DWT, etc) are irrelevant to the
potential compression, since the same results can be achieved
using a probability model. (These transforms must be reversible,
or you can't get your original file back.) However, transforms
can make it easier to implement probability models, and they can
also make it easier to measure closeness for lossy compression
(see below)
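The construction of Q from P and M above is simple enough to show directly. This sketch uses a toy three-string alphabet and an arbitrary reversible mapping (both assumptions for illustration):

```python
# P is an arbitrary probability model over three "files";
# M is a reversible mapping (here, a permutation of the set).
P = {"aa": 0.5, "ab": 0.3, "ba": 0.2}
M = {"aa": "ba", "ab": "aa", "ba": "ab"}

# Define Q so that Q(M(S)) = P(S) for every string S.
Q = {M[s]: p for s, p in P.items()}

# Coding M(S) under Q costs exactly what coding S under P costs,
# so the transform M gains nothing in potential compression.
for s in P:
    assert Q[M[s]] == P[s]
```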
Part 6) Can't compressors do
better if they work with the whole file at once instead of a bit
at a time?
No.
If we assign a distinct probability to every possible file, and
thus assign an optimal codelength, we can achieve that length
exactly even if we transmit only one bit at a time.
Why?
Given any probability model P there exists a serial probability
model Ps such that P(S) = product(Ps(S,n),n = 1..N, N = symbols
in S).
Define Sn = string of first n symbols of S
Define Ps(S,n) = P(Sn)/P(Sn-1) where P(S0) = 1
Then product(Ps(S,n),n = 1..N)
= P(S1)/P(S0) * P(S2)/P(S1) * ... P(SN-1)/P(SN-2) * P(SN)/P(SN-1)
= P(SN)/P(S0)
= P(SN)
= P(S)
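The telescoping product above can be verified on a toy model. Here P is an assumed joint model over all 2-bit files (and their prefixes), and Ps(S,n) = P(Sn)/P(Sn-1) recovers it one symbol at a time:

```python
from math import prod

# Joint model over 2-bit files, plus the prefix probabilities it implies.
P = {"": 1.0, "0": 0.6, "1": 0.4,
     "00": 0.35, "01": 0.25, "10": 0.3, "11": 0.1}

def serial_probs(s):
    # Conditional probability of each symbol given the prefix before it:
    # Ps(S,n) = P(S_n) / P(S_{n-1}), with P(S_0) = P("") = 1.
    return [P[s[:n]] / P[s[:n - 1]] for n in range(1, len(s) + 1)]

# The product of the per-symbol probabilities telescopes back to P(S).
for s in ("00", "01", "10", "11"):
    assert abs(prod(serial_probs(s)) - P[s]) < 1e-12
```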
Part 7) How relevant is the
size of the decompressor?
In the general sense where we're transferring files and stuff,
the size of the decompressor is completely irrelevant.
The size of the decompressor is only relevant when the hardware
it's run on is fixed, like in the Compression Challenge.
For example, just how big is UnZip? Well, it is different sizes
on different hardware. You could even make a chip that can UnZip
files, and then the decompressor size is zip!
But, some might ask, isn't that just moving the information to
the hardware? See part 8...
Part 8) Can information be
"hidden" in a piece of hardware?
No.
We can safely say that if we can transmit a complete, unambiguous
description of the hardware, then we have transmitted all of the
bits "hidden" inside it.
To do this description, we need to choose a language. Which one?
Well, pick your choice. Most languages will give you different
results.
This is exactly the same problem as asking how many bits are
"hidden" in a string. You have to pick a language,
or else the quantity remains undefined.
The problem is recursive and it's stuck in a loop. You cannot
break out of the loop without making arbitrary decisions, like
choosing your favorite language as a final method of
representation.
Finally, this language can be chosen such that the description of
the particular piece of hardware is arbitrarily small like, say,
1 bit.
Part 9) What about an
information content FAQ for lossy compression?
Much of this FAQ applies to lossy encoders as well as lossless encoders.
Lossy encoders simply quantize a string from one set to a string
from a smaller set using some closeness measure, and encode the
resulting string losslessly. In other words, lossy encoding =
quantization + lossless coding. Finding the closest-match from
original set to quantized set is often, but not always,
accomplished by transforming the string to a different basis and
truncating (or rounding) the transformed string.
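A minimal sketch of the quantization step (the step size and absolute-difference closeness measure are illustrative assumptions): samples are snapped to the nearest multiple of a step, and the resulting indices would then be handed to any lossless coder.

```python
def quantize(samples, step):
    # Map each sample to the nearest point of the smaller set,
    # represented by an integer index.
    return [round(x / step) for x in samples]

def dequantize(indices, step):
    # Reconstruct the quantized approximation.
    return [i * step for i in indices]

original = [0.12, 0.49, 0.51, 0.88]
idx = quantize(original, 0.25)      # these indices get coded losslessly
approx = dequantize(idx, 0.25)
# Reconstruction error is bounded by half the step size.
assert all(abs(a - b) <= 0.125 for a, b in zip(original, approx))
```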
Part 10) Couldn't there be
some new compressor which evades the counting argument?
Short answer: NO!
Long answer: Okay, the most common claim made by people inventing
"universal compressors" is that their "new"
or "revolutionary" method somehow evades the counting
argument, because the counting argument fails to consider their
"revolutionary new" algorithm.
I think we need to review what it means to have a proof.
The counting argument does not and cannot fail to
consider any algorithm. Why? It is based on a generalization.
The proof is completed without specifying a particular
compressor. Thus, it holds true for all possible
compressors.
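The counting argument itself fits in a few lines: a lossless compressor must be one-to-one, and there are simply fewer shorter outputs than N-bit inputs, so no compressor, however it works internally, can shrink them all.

```python
def strings_up_to(n):
    # Number of distinct bit strings of length 0..n: 2^(n+1) - 1.
    return 2 ** (n + 1) - 1

N = 20
inputs = 2 ** N                   # files of exactly N bits
outputs = strings_up_to(N - 1)    # all strictly shorter files
# Pigeonhole: at least one N-bit file cannot map to a shorter one.
assert outputs < inputs
```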
As an analogy, it is provable that the inside angles of an
n-sided polygon will add up to 180*(n-2) degrees, without specifying a
particular polygon. Now, the claims some of these people make
would be like claiming that they have found some "revolutionary new"
polygon that evades the proof due to its particular arrangement
of angles and side lengths or something.
Basically, in a proof such as the counting argument, anything
left undefined in the proof is completely irrelevant to the proof.
That is the whole point of proofs.
Part 11) How can we make
better compressors?
In this FAQ we've seen that Arithmetic Coding allows the best possible
encoding for a given probability model, that probability
modelling can meet or beat any other computable method, and that
there is no universal probability model. So, what's the point of
trying to make new compressors?
Nowhere in this FAQ does it say compression is finished. There is
no universal compressor, but that doesn't mean there are no good
compressors! And there are certainly better compressors than the
ones we use today.
Similarly, we cannot make a prediction algorithm that closely
models every possible file and achieves compression, but that
doesn't mean we can't make a prediction model that closely models
stuff down here on Earth.
In other words, probably the most useful thing we can do to
improve compression is to research prediction algorithms which
better model our world. This may involve new transforms which
mirror real-world data correlations, or maybe new statistical
models of phenomena and whatnot.
Another useful thing we can do is to make faster ways of
encoding based on probabilities, similar to Arithmetic and Huffman
Coding.
- GLYPH