Data Compression - TV Tropes
Sat Feb 22 2025
Data compression. That thing that magically makes data take up less space, somehow.
It might feel like it, and some very smart people came up with some very smart ways to do it, but it isn't magic.
At its heart, lossless data compression relies on reducing data redundancy. In other words, removing stuff that is repeated or not needed. Purely random data is incompressible because there is no redundancy to exploit.
Lossy compression relies on removing or simplifying data to reduce the storage requirements. Generally speaking, you can reduce the size far more than with lossless compression, at the expense of accuracy. A JPEG, for instance, which is lossy, can usually be made smaller than a PNG of the same image, which is lossless.
Needless to say, this is a huge simplification of a complicated topic.
Also see Texture Compression.
Let's imagine two people are playing a game of chess and we want to keep a record of the match. An easy way of fully describing a game state snapshot would be to list all 64 squares on the board, along with any piece contained on each (6 piece types, 2 colors; including an "empty" state this gives 13 possibilities), plus a turn indicator (black or white), which we could do in a single bit. If we want a complete record of the game, we would store a turn number instead: 2 bytes for a turn counter would be enough to handle up to ~65 thousand moves (the longest human match was 269 moves, so we could probably do this in 9 bits, covering up to 512 turns, if we want to squeeze every last bit of space).
Every single turn, we can store all this data.
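The arithmetic behind that estimate can be tallied up in a few lines of Python (the figures come straight from the description above; this is an illustration, not a real chess file format):

```python
import math

# Naive full-snapshot encoding, as described above.
SQUARES = 64
STATES_PER_SQUARE = 13   # 6 piece types x 2 colors + "empty"

bits_per_square = math.ceil(math.log2(STATES_PER_SQUARE))  # 4 bits
board_bits = SQUARES * bits_per_square                     # 256 bits
turn_counter_bits = 9                                      # up to 512 turns

total_bits = board_bits + turn_counter_bits
print(total_bits, "bits per turn =", math.ceil(total_bits / 8), "bytes")
# -> 265 bits per turn = 34 bytes
```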
You might see some benefits and some obvious drawbacks of this.
Let's start with the main benefit: there is redundancy, which means we can repair the data. Say the move for turn 25 gets lost in transmission or corrupted in storage. We know the game state on turns 24 and 26, so we can use those to work out what must have happened on turn 25. This ability is not endless; if we lost 3 turns in a row, we could end up in a state where one player might have moved piece A and then piece B, or piece B and then piece A, and we can't tell the difference. Given that even the best storage and transmission will occasionally have errors, this gives us some protection against loss. And even if we do end up losing a few turns of information, we can still pretty much tell how the game went. Awesome!
Another benefit is that we don't have to do anything to the data to convert it into a game state, so it would be quite quick to process; we would just have to read it into memory.
However, you've probably figured out the downside: this is a lot of information to store (relatively speaking), and it takes a relatively long time to transmit (though by today's standards it's very little data).
So let's look at ways to improve this and see where this takes us.
Optimisation
Chess only really cares about pieces, not squares. Instead of listing all 64 squares and their state each turn, we could describe exactly where the 32 pieces are (or whether they've been captured). We've now halved the amount of data stored each turn. This isn't compression as such; it's optimisation (we're not removing any redundancy, we're just not wasting time transmitting and storing data that doesn't give us useful information), but it still reduces the amount of data needed, and as a bonus it will probably take less time to process as well.
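As a back-of-the-envelope sketch (one byte per entry for simplicity; both formats are hypothetical):

```python
# Back-of-the-envelope comparison, one byte per entry for simplicity.
squares_scheme = 64 * 1   # naive: one entry per square, piece-or-empty code
pieces_scheme = 32 * 1    # optimised: one entry per piece, holding its
                          # square (0-63) or a spare "captured" value
print(pieces_scheme / squares_scheme)  # -> 0.5, half the data per snapshot
```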
What else can we do? We don't need to list every single piece every turn, especially since there will quickly be fewer than 32 pieces as the game goes on. We could instead do the equivalent of a radio call: "here is a list of all pieces still on the board, over". If we have an average of 8 pieces per turn, this saves a lot of space over the course of the game, with some overhead for the 'radio call' (since we can't just assume a message lists all 32 pieces and move on; we need to be told when the transmission is finished because it's not a fixed size, which itself takes information/space).
Differential compression
The part where we actually start to look at compression.
We can do a lot better than the above. Players can only move their own pieces, and only make one move per turn. We could just store what move is made each turn. We can infer which side is moving from the turn number, so we could do this with the piece type (6 values), the starting square (64 possibilities), and the destination square (64 possibilities - you could compress even further here and select from only the valid moves each piece can make, but that requires a lot more processing). We're very quickly moving towards the system of Chess notation.
As you can see, we have greatly compressed the amount of data we need to transmit and store, but at the expense of two other considerations: the processing required (we have to apply each change on our side, and if we want to know the game state after, say, 20 turns, we have no choice but to crunch the numbers and work it out), and what happens when data gets lost (the game record is completely corrupted after that point, because we can't fill in any gaps).
This technique of storing the differences in a set of data is differential compression.
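A minimal sketch of the idea in Python (the move format and piece names are made up for illustration): storing only the moves means any later position has to be reconstructed by replaying every move from the start.

```python
# Differential storage: keep only the change each turn, not the state.
# A move is (piece, from_square, to_square); squares are 0-63 indices.
# (Hypothetical mini-format -- real chess software uses richer encodings.)

def apply_move(position, move):
    """Apply one move to a {square: piece} dict, returning a new dict."""
    piece, src, dst = move
    new_pos = dict(position)
    del new_pos[src]        # the piece leaves its old square
    new_pos[dst] = piece    # ...and lands on (or captures at) the new one
    return new_pos

def state_after(start, moves, turn):
    """Reconstruct the position after `turn` moves: we must replay them all."""
    position = start
    for move in moves[:turn]:
        position = apply_move(position, move)
    return position

start = {12: "white pawn", 52: "black pawn"}
moves = [("white pawn", 12, 28), ("black pawn", 52, 36)]
print(state_after(start, moves, 2))
# -> {28: 'white pawn', 36: 'black pawn'}
```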
The drawbacks are pretty considerable though, so compression ends up being a real-world trade-off. One option may be to store the state of all pieces every 5 turns as a compromise (adding some redundancy back in). It will take up more space, but we won't lose the entire game if a turn gets corrupted. Video compression uses this technique (amongst others): it stores differences between frames, and every so many frames it provides a full frame (a key frame) before starting the process again.
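The keyframe compromise is easy to sketch with a plain stream of numbers standing in for frames or game states (a toy scheme; real video codecs are far more involved):

```python
# Keyframe compromise: store the full value every KEY_INTERVAL steps,
# and only the difference from the previous value in between.
KEY_INTERVAL = 5

def encode(values):
    out = []
    for i, v in enumerate(values):
        if i % KEY_INTERVAL == 0:
            out.append(("key", v))                  # full value: recovery point
        else:
            out.append(("delta", v - values[i - 1]))  # just the change
    return out

def decode(stream):
    values = []
    for kind, payload in stream:
        values.append(payload if kind == "key" else values[-1] + payload)
    return values

data = [10, 12, 13, 13, 9, 40, 41]
assert decode(encode(data)) == data
# If a delta gets corrupted, decoding only goes wrong until the next
# "key" entry, where the stream resynchronises.
```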
Differential compression by itself is neither lossy nor lossless; that depends on the underlying compression algorithm.
Lossy compression
The first major method is to throw some data away. This doesn't have to mean literally erasing it all from existence (though we might if it's not important enough), we can instead quantize it to allow us to describe it in fewer bits. Digital imaging and audio are all about this.
A simple example is a compass rose. There are an infinite number of possible directions. If we go with a 16-point compass rose, we are quantizing that infinity of directions into something we can store in 4 bits (2^4 = 16). The downside is we lose precision to quantization error; a bearing of 2 degrees would be represented as "North". We've compressed the information, but in a lossy way. When it comes to images and audio, we can do the same thing with colors or volume levels in a signal. Since humans can't see infinite colors and the ear doesn't have infinite dynamic range, within limits we can't tell the difference. If you do this excessively you end up with noise (literally, with audio): unwanted modification and distortion of the signal.
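That compass quantization fits in a few lines of Python (a minimal sketch):

```python
# Quantizing a compass bearing (0-360 degrees) to a 16-point rose: 4 bits.
POINTS = 16
STEP = 360 / POINTS   # 22.5 degrees per point

def quantize(degrees):
    """Map a bearing to the nearest of 16 compass points (0 = North)."""
    return round(degrees / STEP) % POINTS

def dequantize(index):
    """Recover an approximate bearing from the 4-bit index."""
    return index * STEP

bearing = 2.0                 # 2 degrees east of due north
index = quantize(bearing)     # -> 0, i.e. "North"
restored = dequantize(index)  # -> 0.0
error = bearing - restored    # quantization error: 2 degrees, lost forever
```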
If you've ever noticed color banding in an image or video file (Posterization), this is sometimes done deliberately, but it can also be an artifact of quantization: we don't have enough bits to store the differences in color needed to accurately replicate the original view.
Video encoding runs on this principle (along with a lot of other complicated processing), as does MP3 audio. By throwing away relatively unimportant information, we can use far less information to describe something, and if we do it carefully enough we can still end up with a very good result.
Lossless compression
The other method, lossless compression, as the name implies, does not throw any information away. Removing redundancy is not the same as removing information, and removing redundancy is how we actually achieve it. Our chess notation example is lossless compression, since we can perfectly reconstruct the game state; the way we've done it is by eliminating a lot of redundant information in how we describe that state.
Let's say we have an image made only of bright red (#FF0000) and bright blue (#0000FF) pixels, in an image file that uses 8-bit color depth per red/green/blue channel (~16.8 million possible values). Uncompressed, that's 3 bytes to describe each pixel's color. In this case, we don't need that much information: we can store each pixel as "color 1" or "color 2", and state in the file that "color 1 = red, color 2 = blue". The redundancy was reserving space for millions of color possibilities when we only use 2, and eliminating it is how we get our compression.
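That palette trick, often called indexed color (PNG has a palette mode that works this way), can be sketched with a toy 6-pixel "image":

```python
# Indexed color: when an image only uses a few colors, store a palette once
# and one small index per pixel instead of 3 bytes per pixel.
RED, BLUE = (255, 0, 0), (0, 0, 255)
pixels = [RED, BLUE, BLUE, RED, RED, RED]       # toy 6-pixel "image"

palette = sorted(set(pixels))                   # stored once in the header
indices = [palette.index(p) for p in pixels]    # 1 bit each for 2 colors

# Decompression perfectly reconstructs the original -- lossless.
restored = [palette[i] for i in indices]
assert restored == pixels
```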
Another way is to encode the data. Let's say half our image is red and the other half is blue. We could store something like "split the image in half vertically and make one side red, the other blue". This perfectly describes the image and takes up a lot less room, but we then have to decode (decompress) the data to actually get the image when we want to view it; in this case the computer would have to process these instructions to draw the image.
Or with numbers: if we have the number 10^1000, we can store it in this shorthand rather than storing all 1001 digits separately (we could round numbers and do the same thing, but that would be lossy compression instead). In this case, you're not so much storing the data as an instruction on how to calculate the data.
Another method is to substitute data by mapping it to something which can be represented in less space. The letter "z" is rarely used in English. If we have some text, we could replace a common word such as "the" with "z", saving space since we now represent 3 characters with 1. There's no such thing as a free lunch, though: if we actually want a "z" we now need more space to represent it (such as using escape characters). The compression comes from the fact that "the" is more common than "z", so overall we win out.
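A toy version of that substitution scheme, with a backslash escape for literal "z"s (an illustration, not a real codec):

```python
# Swap the common word "the" for the rare letter "z"; escape real "z"s.
def compress(text):
    return text.replace("z", "\\z").replace("the", "z")

def decompress(text):
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] == "z":
            out.append("z"); i += 2     # escaped literal "z" costs 2 chars
        elif text[i] == "z":
            out.append("the"); i += 1   # bare "z" stands for "the"
        else:
            out.append(text[i]); i += 1
    return "".join(out)

message = "the cat chased the lazy dog over the hill"
packed = compress(message)
assert decompress(packed) == message
assert len(packed) < len(message)   # "the" is commoner than "z", so we win
```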
Huffman coding is another means of encoding data, where you take some characteristic of your data, such as the characters used in some text, list them by frequency, build a tree, and then encode each character as the sequence of left or right forks taken to reach it. The most frequently used characters get the shortest codes, compressing the data.
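A compact sketch of Huffman code construction using the standard-library heap (the counter `next_id` only exists to keep heap comparisons from ever reaching the dicts):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build Huffman codes: frequent characters get shorter bit strings."""
    heap = [(count, i, {char: ""}) for i, (char, count)
            in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two rarest subtrees; left fork adds "0", right adds "1".
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
encoded_bits = sum(len(codes[ch]) for ch in "abracadabra")
# 'a' appears 5 times, so it gets the shortest code; the whole string
# fits in 23 bits instead of 88 (11 characters x 8 bits each).
```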
There are many ways to encode data, and it's also possible to include things such as check bits to catch some errors (adding some redundancy back in), depending on our needs.