Prediction

Okay, I've said that everything involves prediction. Now I'll run through some stuff and say why:

Related pages:
  • My FAQ on information content and Kolmogorov complexity
  • A discussion on decoder size and Kolmogorov complexity
  • A discussion on prediction and its arbitrary qualities
  • A discussion on multiple representations and efficiency
  • A proof that there is no absolute measure of information

Interpolation / Extrapolation

Okay, this one is mostly obvious so I'll keep it short. To interpolate or extrapolate data, you simply look at the data you have, and predict the data you don't have. That's prediction, plain and simple.
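To make that concrete, here's a minimal sketch (the function names and numbers are mine, purely for illustration): interpolation predicts a sample between known ones, and extrapolation predicts a sample beyond them.

    # Minimal sketch, in Python: interpolation and extrapolation are both
    # just predictions of samples we don't have, based on samples we do.

    def predict_between(y0, y1):
        # Interpolate: predict the missing sample halfway between two known ones.
        return (y0 + y1) / 2.0

    def predict_next(y0, y1):
        # Extrapolate: predict the next sample by continuing the local trend.
        return y1 + (y1 - y0)

    samples = [2.0, 4.0, 6.0]
    print(predict_between(samples[0], samples[1]))  # 3.0 -- the sample we didn't have
    print(predict_next(samples[1], samples[2]))     # 8.0 -- the sample that comes next
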
Compression

Well, actually, data compression is simply data representation with the goal being to have as small a representation as is feasible. See Representation below.
Algorithm Optimization

This one's not quite as easy to see.

Any algorithm or function, however arbitrary, can be represented as a simple mapping. The input parameters can be thought of as an input vector, which is mapped to an output vector. The inputs therefore live in a space with one dimension per input parameter; each point in that space is one input vector, and at each point sits the corresponding output vector. So it's just a big matrix.

To evaluate the algorithm or function, we need to look up the output vector corresponding to the current input vector. So, we need to represent the matrix somehow.
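Here's a small sketch of that view (the function and table are made up for illustration): the same mapping represented once as code that computes it, and once as the big matrix of stored outputs, where evaluation is just a lookup.

    # Sketch: one mapping, two representations.

    def slow_f(x, y):
        # "Algorithmic" representation: compute the output on demand.
        return (x * x + 3 * y) % 256

    # "Big matrix" representation: the output stored at every point of the
    # input space (here a 256 x 256 grid of input vectors).
    table = [[slow_f(x, y) for y in range(256)] for x in range(256)]

    def fast_f(x, y):
        # Evaluation is now just looking up the stored output.
        return table[x][y]

    assert fast_f(3, 7) == slow_f(3, 7)

Which representation is better depends on whether you are short of time or short of space.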

In other words, algorithm optimization is simply representation with the goal being to have as fast a representation as is feasible. See Representation below.
Representation

Hmm. This is the sticky one. First let's do transforms.

Transforms

A transform, the way I'm using the term, is any reversible thing you can do to information. Some examples are:
  • Discrete Cosine Transform
  • Discrete Wavelet Transform
  • Multiplying by 3
  • Move To Front and string matching transforms
  • Flipping everything upside-down
  • Writing a program which produces the data
  • Scanning the data in some ridiculously complex order
  • Etc.
I'll be crafting up my own definition of Information Content here, and I'll start by stating that after any reversible transform, the information content of the original signal under any measure can be recovered by simply reversing the transform.

Another way of putting it is this: The information content of a signal is invariant with respect to the basis of representation. The model can simply be changed to account for it.
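To make the "reversible" part concrete, here's a sketch of one transform from the list above, Move To Front, together with its exact inverse; since nothing is lost, any measure of information content survives the round trip.

    # Move To Front: a reversible transform that tends to turn locally
    # repetitive data into runs of small numbers.

    def mtf_encode(data, alphabet_size=256):
        table = list(range(alphabet_size))
        ranks = []
        for byte in data:
            rank = table.index(byte)
            ranks.append(rank)
            table.pop(rank)
            table.insert(0, byte)
        return ranks

    def mtf_decode(ranks, alphabet_size=256):
        # Exact inverse: no information is gained or lost.
        table = list(range(alphabet_size))
        out = []
        for rank in ranks:
            byte = table[rank]
            out.append(byte)
            table.pop(rank)
            table.insert(0, byte)
        return bytes(out)

    assert mtf_decode(mtf_encode(b"abracadabra")) == b"abracadabra"
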


Now let's do probability prediction.

Probability Prediction

A while back, Shannon, Kraft, and others worked out that the optimal length, in bits, of the code used for a particular symbol at a particular time is the negative base-2 logarithm of the probability of that symbol occurring at that time, i.e. L = -Log2(P[S]).

So, to compress stuff really well, all you have to do is predict the probabilities really well. This is often called Probability Modelling. Generating the actual codes is pretty easy, with Huffman and Arithmetic coding at our disposal.
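For instance (the probabilities here are made up for illustration), the formula hands you the ideal code lengths directly:

    import math

    # Illustrative symbol probabilities -- a toy model, not real data.
    probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    for symbol, p in probabilities.items():
        # Optimal code length in bits: L = -Log2(P[S])
        print(symbol, -math.log2(p))    # a: 1.0, b: 2.0, c: 3.0, d: 3.0

A Huffman code hits these lengths exactly when they come out as whole bits, as they do here; arithmetic coding gets arbitrarily close even when they don't.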

The reason many compressors and algorithms use transforms is to make it easier to find patterns and predict the probabilities. However, you can make a probability model that takes these transforms into account, and gives you exactly the same result without actually altering the data, so transforms are simply part of the probability model.

Once you have your probability model, you can directly measure the information content of a signal simply by compressing it. You can even leave out the encoding step and just compute its compressed size. This is just the sum of -Log2(P[S]) for each symbol S.

This size is the information content of the signal, using that particular probability model. This means that information content is completely dependent on the probability model used.
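Here's a sketch of that measurement with a deliberately crude model (order-0 symbol counts over the whole signal, my choice purely for illustration); a different model would give a different number, which is exactly the point.

    import math
    from collections import Counter

    def information_content(signal):
        # Probability model: order-0 symbol frequencies over the whole signal.
        # The "compressed size" is just the sum of -Log2(P[S]) for each
        # symbol S -- no actual encoding step needed.
        counts = Counter(signal)
        total = len(signal)
        return sum(-math.log2(counts[s] / total) for s in signal)

    print(information_content(b"aaaaaaaab"))   # small: this model predicts it well
    print(information_content(b"zqxjkvwpgm"))  # large: this model predicts it poorly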

This does not include the size of the decoder, but that is irrelevant in the general case.
See the discussion on decoder size and Kolmogorov complexity.

But, what if you don't want to spend all your time predicting? Well, you can make faster, less accurate predictors, or you can precompute some of the prediction steps and encode them along with the symbols. The extreme bound of this is to encode the raw data, but we can do much better while still being really fast.

Some examples:
  • LZP compression (by Charles Bloom)
  • Radix Sort
  • Wu's line algorithm
  • Look-up tables
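To make the precomputation idea concrete, here's a sketch (training text and names invented for illustration): build a fixed table of code lengths once, and from then on, scoring a signal is just table lookups, with no adaptive prediction at run time. A real coder would also have to transmit the table and reserve some probability for symbols it has never seen.

    import math
    from collections import Counter

    def build_static_model(training_data):
        # Precompute the prediction once: a fixed code length per symbol.
        counts = Counter(training_data)
        total = len(training_data)
        return {s: -math.log2(c / total) for s, c in counts.items()}

    def size_in_bits(signal, code_lengths, fallback_bits=8.0):
        # No prediction at run time -- just look up precomputed lengths.
        # Unseen symbols fall back to raw 8-bit codes (a simplification).
        return sum(code_lengths.get(s, fallback_bits) for s in signal)

    model = build_static_model(b"the quick brown fox jumps over the lazy dog")
    print(size_in_bits(b"the lazy fox", model))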

Now let's do lossy stuff.

Lossy Stuff

Lossy stuff is easy to do. All we do is quantize stuff so there are fewer possible things to represent, and bingo, it takes less time and space. This is at the cost, of course, of some error. This error is fine-tuned for the particular application, so it's okay.
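Here's a minimal sketch of that trade (the signal and step size are made up): quantizing shrinks the set of values you have to represent, and the error it introduces is bounded by the step size you choose.

    def quantize(values, step):
        # Lossy: snap each value to the nearest multiple of `step`.
        # Fewer possible values to represent, at the cost of some error.
        return [round(v / step) * step for v in values]

    signal = [0.12, 0.98, 1.03, 2.47, 2.51]
    coarse = quantize(signal, 0.5)                      # [0.0, 1.0, 1.0, 2.5, 2.5]
    worst = max(abs(a - b) for a, b in zip(signal, coarse))
    print(coarse, worst)                                # worst error is at most step / 2
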

Examples of Lossy Stuff:
  • JPEG and Wavelet compression
  • Polynomial approximation of viscosity forces in Physics
  • The method your math coprocessor uses to compute cosine


Conclusions

Well, that pretty much covers it. Everything can be reduced to prediction, more or less, so I think it's worth spending more time working on better methods of prediction; the rest will follow from it, in all fields of information theory.