[1.0] Introduction / Lossless Data Compression

* This chapter provides an introduction to data compression concepts, and outlines techniques for "lossless" data compression.
There are "lossless" and "lossy" forms of data compression. Lossless data compression is used when the data has to be uncompressed exactly as it was before compression. Text files are stored using lossless techniques, since losing a single character can in the worst case make the text dangerously misleading. Archival storage of master sources for images, video data, and audio data generally needs to be lossless as well. However, there are strict limits to the amount of compression that can be obtained with lossless compression. Lossless compression ratios are generally in the range of 2:1 to 8:1.

Lossy compression, in contrast, works on the assumption that the data doesn't have to be stored perfectly. Much information can be simply thrown away from images, video data, and audio data, and when uncompressed, the data will still be of acceptable quality. Compression ratios can be an order of magnitude greater than those available from lossless methods.

The question of which is "better", lossless or lossy techniques, is pointless. Each has its own uses, with lossless techniques better in some cases and lossy techniques better in others. In fact, as this document will show, lossless and lossy techniques are often used together to obtain the highest compression ratios.

Even given a specific type of file, the contents of the file, particularly the orderliness and redundancy of the data, can strongly influence the compression ratio. In some cases, using a particular data compression technique on a data file where there isn't a good match between the two can actually result in a bigger file.

* A few little comments on terminology before we proceed:
Data compression literature also often refers to data compression as
data "encoding", and of course that means data decompression
is often called "decoding". This document tends to use the two
sets of terms interchangeably.
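The earlier point that a poorly matched technique can make a file bigger is easy to check. The following is a minimal sketch, assuming Python's standard `zlib` module as a stand-in lossless compressor (it is not one of the schemes this chapter walks through step by step); the helper name `ratio` and the sample data are made up for illustration.

```python
import os
import zlib

def ratio(data: bytes) -> float:
    """Uncompressed size divided by compressed size (above 1 means it shrank)."""
    return len(data) / len(zlib.compress(data))

# Highly redundant text: the same sentence repeated many times.
repetitive = b"the rain in Spain falls mainly in the plain. " * 200

# Random bytes have no redundancy for a lossless coder to exploit.
random_bytes = os.urandom(len(repetitive))

print("repetitive text ratio:", round(ratio(repetitive), 1))
print("random bytes ratio:   ", round(ratio(random_bytes), 3))
```

The repetitive text shrinks dramatically, while the random data comes out slightly larger than it went in, because the compressor still has to add its own framing.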
Suppose you have a text file in which the same characters are often repeated, one after another. This redundancy provides an opportunity for compressing the file. Compression software can scan through the file, find these redundant strings of characters, and then store them using an escape character (ASCII 27), followed by the character and a binary count of the number of times it is repeated.

For example, the 50 character sequence:

   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX that's all, folks!

-- can be converted to three bytes (the escape character, an "X", and a one-byte count), followed by the rest of the text unchanged. This eliminates 28 characters, compressing the text by more than a factor of two.

Of course, the compression software must be smart enough not to compress strings of two or three repeated characters, since for three characters run length encoding would have no advantage, and for two it would actually increase the size of the output file.

A further problem is that a single byte cannot specify run lengths greater than 255. This difficulty can be dealt with by using multiple escape sequences to compress one very long string.

Run length encoding is actually not very useful for compressing text files, since a typical text file doesn't have a lot of long, repetitive character strings. It is very useful, however, for compressing bytes of a monochrome image file, which normally consists of solid black picture bits, or "pixels", in a sea of white pixels, or the reverse.

Run length encoding is also often used as a preprocessor for other compression algorithms. As the next chapter explains, for example, it is used as one of the many pieces of the JPEG image compression scheme.
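The escape-character scheme described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `rle_encode` and `rle_decode` are made up, runs longer than one count byte can hold are simply split into several escape sequences, and the input is assumed never to contain a literal escape byte, since handling that case is not covered here.

```python
ESC = 0x1B      # ASCII 27, the escape character named in the text
MIN_RUN = 4     # runs of two or three characters are copied unchanged
MAX_RUN = 255   # largest count that fits in a single byte

def rle_encode(data: bytes) -> bytes:
    """Replace runs of MIN_RUN or more equal bytes with ESC, byte, count."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        run = 1
        # Measure the run, stopping at the one-byte count limit.
        while i + run < len(data) and data[i + run] == byte and run < MAX_RUN:
            run += 1
        if run >= MIN_RUN:
            out += bytes([ESC, byte, run])
        else:
            out += data[i:i + run]
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Expand ESC, byte, count triples; copy everything else unchanged."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:
            out += bytes([data[i + 1]]) * data[i + 2]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

sample = b"X" * 31 + b" that's all, folks!"
packed = rle_encode(sample)
print(len(sample), "->", len(packed))
```

For the 50-character example, the run of 31 "X" characters collapses to three bytes, which is where the saving of 28 characters comes from.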
To see how Huffman coding works, assume that a text file is to be compressed, and that the characters in the file have the following frequencies:

   A: 29
   B: 64
   C: 32
   D: 12
   E: 9
   F: 66
   G: 23

In practice, we need the frequencies for all the characters used in the
text, including all letters, digits, and punctuation, but to keep the
example simple we'll just stick to the characters from A to G. Arranged in order of frequency, they are:

   F    B    C    A    G    D    E
   66   64   32   29   23   12   9

First, the two least-frequent characters are selected, logically grouped
together, and their frequencies added. In this example, the D and E characters
have a combined frequency of 21. This begins the construction of a "binary tree" structure.

We now again select the two elements with the lowest frequencies, regarding
the D-E combination as a single element. In this case, the two elements
selected are G and the D-E combination. We group them together and add
their frequencies. This new combination has a frequency of 44.

We continue in the same way to select the two elements with the lowest
frequency, group them together, and add their frequencies, until we run
out of elements. In the third iteration, the lowest frequencies are C
and A, which combine for a frequency of 61. The next iterations group the G-D-E combination (44) with the C-A combination (61) for a frequency of 105, then B (64) with F (66) for a frequency of 130, and finally join those two groups at the top of the tree for a total of 235.

The result is known as a "Huffman tree". To obtain the Huffman
code itself, each branch of the tree is labeled with a 1 or 0. It doesn't
matter how the 1s and 0s are assigned, though a consistent scheme obviously
is easier to deal with.

Tracing down the tree gives the "Huffman codes", with the shortest
codes assigned to the characters with the greatest frequency. In this example, F and B get two-bit codes, C, A, and G get three-bit codes, and D and E get four-bit codes.

A Huffman coder will go through the source text file, convert each character
into its appropriate binary Huffman code, and dump the resulting bits
to the output file. The Huffman codes won't get mixed up in decoding.
The best way to see that this is so is to envision the decoder cycling
through the tree structure, guided by the encoded bits it reads, moving
from top to bottom and then back to the top. As long as the bits constitute
legitimate Huffman codes, and a bit doesn't get scrambled or lost, the
decoder will never get lost, either.

Shannon-Fano coding achieves the same results as Huffman coding, but works from the top down, instead of the bottom up. Using the same set of example characters as above, we arrange them in order of frequency:

   F B C A G D E

Now we divide the set into two parts, as close to equal as possible.
The sum of the frequencies of F and B is 130, while the sum of the rest
of the characters is 105, and we can't break the set more closely than
that.

We continue breaking down each branch until they can't be broken down
further. Breaking down the branch with the F and B is easy, since there are
only two characters there. For the other branch, the frequencies of C
and A add up to 61, while the frequencies of the rest of the characters
in that branch add up to 44, which is again as close to equal as we can
get.

We can complete the tree using the same procedure, and then assign 0s
and 1s to the branches as before. The codes that result are exactly the same as would be obtained with
Huffman coding.

Fax machines, which transmit a graphics image, use a combination of RLE and Huffman compression to achieve compression ratios of about 10:1. Huffman coding techniques are also used in conjunction with other compression schemes to improve compression ratios, and are used in this fashion with the popular LHA and PKZIP archivers.

* As the average frequencies of letters in every language are distinctive and well known, a fact which is incidentally of great importance to codebreakers, it is possible to devise a Huffman code table for text files written in a specific language. A receiver can use the same "default" or "canonical" Huffman codes to decompress the message.

However, average frequencies are just that, averages, and individual messages will differ from that average to a greater or lesser degree. That means that better compression can be obtained by analyzing a specific file and building a "custom" Huffman code for that file. This requires that the custom table of Huffman codes be output to the file along with the compressed data stream, to allow the decoder to uncompress the data.

Another variation on this idea is known as "adaptive" Huffman coding. In this approach, both the coder and decoder start with a predefined Huffman table, but both keep statistics on the frequency of different characters in the compressed data stream, and modify the Huffman codes on a continuous basis by modifying the Huffman tree as needed. As both the coder and decoder use the same modification algorithm, the decoder will adjust its Huffman table in the same way as the coder, and the two will remain in sync.
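The bottom-up grouping used to build the Huffman tree earlier in this section can be sketched in a few lines. This is a minimal sketch rather than a complete coder: the function name `huffman_code_lengths` is made up for illustration, the frequencies for B through G are the ones implied by the sorted list and the sums quoted in the text (an assumption), and only the length of each code is reported, since the assignment of 1s and 0s to branches is arbitrary.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} by repeatedly merging the two
    least-frequent elements, as in the walkthrough."""
    tiebreak = count()  # unique sequence number so dicts are never compared
    # Each heap entry: (frequency, tiebreak, {symbol: depth below this node}).
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        # Grouping two elements pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**group1, **group2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Frequencies assumed from the example: A is given, the rest are implied.
FREQS = {"A": 29, "B": 64, "C": 32, "D": 12, "E": 9, "F": 66, "G": 23}

lengths = huffman_code_lengths(FREQS)
total_bits = sum(FREQS[sym] * lengths[sym] for sym in FREQS)
print(lengths)
print("Huffman bits:", total_bits, "fixed 3-bit codes:", 3 * sum(FREQS.values()))
```

With these frequencies the most common characters end up closest to the top of the tree, so they receive the shortest codes, and the total comes out smaller than a fixed three-bit code for seven characters would give.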
Suppose we have a message that only contains the characters A, B, and C, with the following frequencies, expressed as fractions:

   A: 0.5
   B: 0.2
   C: 0.3

To show how arithmetic compression works, we first set up a table, listing
characters with their probabilities along with the cumulative sum of those
probabilities. The cumulative sum defines "intervals", ranging
from the bottom value to less than, but not equal to, the top value. The
order in which characters are listed in the table does not seem to be
important, except to the extent that both the coder and decoder have to
know what the order is.

   character   probability   interval
   C:          0.3           0.0 : 0.3
   B:          0.2           0.3 : 0.5
   A:          0.5           0.5 : 1.0

Now each character can be coded by the shortest binary fraction whose
value falls in the character's probability interval:

   character   probability   interval     binary fraction
   C:          0.3           0.0 : 0.3    0
   B:          0.2           0.3 : 0.5    0.011 = 3/8 = 0.375
   A:          0.5           0.5 : 1.0    0.1 = 1/2 = 0.5

This shows how single characters can be assigned minimum-length binary
codes. However, arithmetic coding doesn't stop there and simply translate
the individual characters in a message as these binary codes. It takes
a subtler approach, assigning binary fractions to complete messages. For a two-character string:

   string   probability   interval       binary fraction
   CC:      0.09          0.00 : 0.09    0.0001 = 1/16 = 0.0625

The higher the probability of the string, in general the shorter the
binary fraction needed to represent it. For three-character strings:

   string   probability   interval         binary fraction
   CCC      0.027         0.000 : 0.027    0.000001 = 1/64 = 0.015625
   BCC      0.018         0.300 : 0.318    0.0101 = 5/16 = 0.3125
   ACC      0.045         0.500 : 0.545    0.1 = 1/2 = 0.5

Obviously, this same procedure can be followed for more characters, resulting
in a longer binary fractional value. What arithmetic coding does is find
the probability value of an entire message, and arrange it as part of
a numerical order that allows its unique identification.

To see how decoding works, consider the encoded value 0.21875. This value was obtained using the probability values and intervals defined earlier:

   character   probability   interval
   C:          0.3           0.0 : 0.3
   B:          0.2           0.3 : 0.5
   A:          0.5           0.5 : 1.0

The value 0.21875 clearly falls into the interval for "C",
so "C" must be the first character. We can then "zoom in"
on the characters that follow the "C" by subtracting the bottom
value of the interval for "C", which happens to be 0, and dividing
the result by the width of the probability interval for "C",
which is 0.3:

   (0.21875 - 0) / 0.3 = 0.72917

This is a simple shift and scaling operation. The result falls into the interval for "A", so "A" is the second character, and the same operation is repeated with the bottom value and width of the interval for "A":

   (0.72917 - 0.5) / 0.5 = 0.4583

This clearly falls into the probability interval for "B", and
so the string has been correctly uncompressed to "CAB".

One problem is the fact that the binary fraction that is output by the arithmetic coder is of indefinite length, and the decoder has no idea of where the string ends if it's not told. In practice, a length header can be sent to indicate how long the fraction is, or an end-of-transmission symbol of some sort can be used to tell the decoder where the end of the fraction is.

As with Huffman coding, arithmetic coding can also be performed using an adaptive algorithm, with the coder and decoder starting with a predetermined character probability interval table, tallying characters for their actual frequencies as they are encoded or decoded, and then adjusting the probability interval table accordingly.

* The neat thing about arithmetic coding is that by "smushing" a complete message into a single probability interval value, individual characters can be encoded with the equivalent of fractional values of bits. Huffman coding requires an integer number of bits for each character, and so this is one of the reasons that arithmetic coding is in general more efficient than Huffman coding.

Another reason for the efficiency of arithmetic coding is that more "probable" messages compress to shorter binary strings. By the way, at the risk of belaboring the obvious, the probability only depends on how many times each character appears in the uncompressed file. Their order is not important, since the result of a multiplication does not depend on the order of its factors.

The problem with arithmetic coding is that it is very computation-intensive and so it is slow.

Huffman and arithmetic coding are sometimes referred to as forms of "statistical coding" or "entropy coding". The term "VLW (Variable Length Word)" also seems to pop up on occasion when discussing such coding techniques, though its exact definition seems a bit unclear.
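The interval narrowing and the "shift and scale" decoding steps in this section can be checked with exact fractions. This is a minimal sketch under stated assumptions: the function names are made up, the interval table assumes B occupies 0.3 : 0.5 and A occupies 0.5 : 1.0 (which is what the BCC and ACC intervals imply), and the decoder is simply told how many characters to expect rather than using a length header or end symbol.

```python
from fractions import Fraction

# Interval table: symbol -> (bottom, top), with the top value excluded.
INTERVALS = {
    "C": (Fraction(0), Fraction(3, 10)),
    "B": (Fraction(3, 10), Fraction(1, 2)),
    "A": (Fraction(1, 2), Fraction(1)),
}

def message_interval(message):
    """Narrow the interval 0 : 1 one character at a time."""
    low, width = Fraction(0), Fraction(1)
    for ch in message:
        bottom, top = INTERVALS[ch]
        low += width * bottom      # shift into the character's sub-interval
        width *= top - bottom      # shrink by the character's probability
    return low, low + width

def decode(value, length):
    """Recover `length` characters from a value inside the message interval."""
    out = []
    for _ in range(length):
        for ch, (bottom, top) in INTERVALS.items():
            if bottom <= value < top:
                out.append(ch)
                # Subtract the bottom of the interval, divide by its width.
                value = (value - bottom) / (top - bottom)
                break
    return "".join(out)

low, high = message_interval("CAB")
print(float(low), float(high))        # interval that identifies "CAB"
print(decode(Fraction(7, 32), 3))     # 7/32 = 0.21875
```

Exact fractions avoid the rounding drift that floating-point arithmetic would introduce; practical coders use scaled integer arithmetic instead, which is outside the scope of this sketch.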
LZ-77 exploits the fact that words and phrases within a text file are likely to be repeated. When they do repeat, they can be encoded as a pointer to an earlier occurrence, with the pointer accompanied by the number of characters to be matched. Pointers and uncompressed characters are distinguished by a leading flag bit, with a "0" indicating a pointer and a "1" indicating an uncompressed character. This means that uncompressed characters are extended from 8 to 9 bits, working against compression a little.

Key to the operation of LZ-77 is a sliding history buffer, also known as a "sliding window", which stores the text most recently transmitted. When the buffer fills up, its oldest contents are discarded. The size of the buffer is important. If it is too small, finding string matches will be less likely. If it is too large, the pointers will be larger, working against compression.

For an example, consider the phrase:

   the_rain_in_Spain_falls_mainly_in_the_plain

-- where the underscores ("_") indicate spaces. This uncompressed
message is 43 bytes, or 344 bits, long.

The first nine characters have no earlier occurrence to point back to, so they are output uncompressed:

   the_rain_

The next chunk of the message:

   in_

-- has occurred earlier in the message, and can be represented as a pointer
back to that earlier text, along with a length field. This gives:

   the_rain_<3,3>

-- where the pointer syntax means "look back three characters and take three characters from that point."

There are two different binary formats for the pointer:

* An 8-bit pointer plus 4-bit length, which assumes a maximum offset of 255 and a maximum length of 15.

* A 12-bit pointer plus 6-bit length, which assumes a maximum offset of 4095, implying a 4 kilobyte buffer, and a maximum length of 63.

In the first format, the <3,3> pointer is encoded as:

   00 00000011 0011

The first two bits are the flag bits, indicating a pointer that is 8
bits long. The next 8 bits are the pointer value, while the last four
bits are the length value.

The next part of the message is:

   Sp

-- which has to be output uncompressed:

   the_rain_<3,3>Sp

However, the characters "ain_" have already been sent, so they
are encoded with a pointer:

   the_rain_<3,3>Sp<9,4>

Notice here, once again at the risk of belaboring the obvious, that the
pointers refer to offsets in the uncompressed message. As the decoder
receives the compressed data, it uncompresses it, so it has access to
the parts of the uncompressed message that the pointers reference.

The characters "falls_m" are output uncompressed, while the "ain" that follows is encoded with a pointer:

   the_rain_<3,3>Sp<9,4>falls_m<11,3>

Notice that this refers back to the "ain" in "Spain",
and not the earlier "rain". This ensures a smaller pointer.

The characters "ly_" are output uncompressed, while "in_" and "the_" are encoded with pointers:

   the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>

Finally, the characters "pl" are output uncompressed, followed
by another pointer to "ain". Our original message:

   the_rain_in_Spain_falls_mainly_in_the_plain

-- has now been compressed into this form:

   the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>pl<15,3>

This gives 23 uncompressed characters at 9 bits apiece, plus six 14-bit
pointers, for a total of 291 bits as compared to the uncompressed text
of 344 bits. This is not bad compression for such a short message, and
of course compression gets better as the buffer fills up, allowing more
matches.

There are several variations on LZ-77, the best known being "LZSS", which was published by Storer and Szymanski in 1982. The differences between the two are unclear from the sources I have access to.

More drastic modifications of LZ-77 include a second level of compression on the output of the LZ-77 coder. "LZH", for example, performs the second level of compression using Huffman coding, and is used in the popular LHA archiver. The "ZIP" algorithm, which is probably the standard archiver and is available in a number of implementations, uses Shannon-Fano coding for the second level of compression.
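The point that pointers refer to offsets in the uncompressed message is easiest to see from the decoder's side. The sketch below is a minimal illustration, with made-up names (`lz77_decode`, `encoded_bits`, `TOKENS`), that represents the compressed stream as a list of literal strings and (offset, length) pairs instead of packed bits. The final pointer for "ain" is assumed to be <15,3>, the nearest earlier occurrence, following the "smaller pointer" rule from the walkthrough.

```python
def lz77_decode(tokens):
    """Rebuild the message from literals and (offset, length) pointers."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            offset, length = tok
            start = len(out) - offset   # look back `offset` characters
            for k in range(length):
                out.append(out[start + k])
        else:
            out.extend(tok)             # literal characters pass straight through
    return "".join(out)

def encoded_bits(tokens, literal_bits=9, pointer_bits=14):
    """Bit cost with 9-bit literals and 14-bit (2 + 8 + 4) pointers."""
    bits = 0
    for tok in tokens:
        if isinstance(tok, tuple):
            bits += pointer_bits
        else:
            bits += literal_bits * len(tok)
    return bits

# Token stream from the walkthrough; the last pointer is an assumption.
TOKENS = ["the_rain_", (3, 3), "Sp", (9, 4), "falls_m", (11, 3),
          "ly_", (16, 3), (34, 4), "pl", (15, 3)]

message = lz77_decode(TOKENS)
print(message)
print(encoded_bits(TOKENS), "bits versus", 8 * len(message), "uncompressed")
```

Because the decoder only ever copies from text it has already reconstructed, it needs nothing beyond the token stream itself.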
As illustrated in the previous section, LZ-77 uses pointers to previous words or parts of words in a file to obtain compression. LZW takes that scheme one step further, actually constructing a "dictionary" of words or parts of words in a message, and then using pointers to the words in the dictionary.

Let's go back to the example message used in the previous section:

   the_rain_in_Spain_falls_mainly_in_the_plain

The LZW algorithm stores strings in a "dictionary" with entries
for 4,096 variable-length strings. The first 256 entries, 0 through 255, are used to contain
the values for individual bytes, so the actual first string index is 256.
As the string is compressed, the dictionary is built up to contain every
possible string combination that can be obtained from the message, starting
with two characters, then three characters, and so on. The first entries are two-character strings:

   256 -> th
   257 -> he
   258 -> e_
   259 -> _r
   260 -> ra
   261 -> ai
   262 -> in
   263 -> n_
   264 -> _i

The next two-character string in the message is "in", but this
has already been included in the dictionary in entry 262. This means we
now set up the three-character string "in_" as the next dictionary
entry, and then go back to adding two-character strings:

   265 -> in_
   266 -> _S
   267 -> Sp
   268 -> pa

The next two-character string is "ai", but that's already in
the dictionary at entry 261, so we now add an entry for the three-character
string "ain":

   269 -> ain

Since "n_" is already stored in dictionary entry 263, we now
add an entry for "n_f", and then go back to two-character strings:

   270 -> n_f
   271 -> fa
   272 -> al
   273 -> ll
   274 -> ls
   275 -> s_
   276 -> _m
   277 -> ma

Since "ain" is already stored in entry 269, we add an entry
for the four-character string "ainl", followed by two more two-character strings:

   278 -> ainl
   279 -> ly
   280 -> y_

Since the string "_i" is already stored in entry 264, we add
an entry for the string "_in":

   281 -> _in

Since "n_" is already stored in dictionary entry 263, we add
an entry for "n_t":

   282 -> n_t

Since "th" is already stored in dictionary entry 256, we add
an entry for "the":

   283 -> the

Since "e_" is already stored in dictionary entry 258, we add
an entry for "e_p", and then two final two-character strings:

   284 -> e_p
   285 -> pl
   286 -> la

The remaining characters form a string already contained in entry 269,
so there is no need to put it in the dictionary. The completed dictionary runs from entry 256 ("th") through entry 286 ("la").

Please remember the dictionary is a means to an end, not an end in itself.
The LZW coder simply uses it as a tool to generate a compressed output.
The coder does not output the dictionary to the compressed output file.
The decoder doesn't need it. While the coder is building up the dictionary,
it sends characters to the compressed data output until it hits a string
that's in the dictionary. It outputs an index into the dictionary for
that string, and then continues output of characters until it hits another
string in the dictionary, causing it to output another index, and so on.
That means that the compressed output for our example message looks like
this:

   the_rain_<262>_Sp<261><263>falls_m<269>ly<264><263><256><258>pl<269>

The decoder constructs the dictionary as it reads and uncompresses the
compressed data, building up dictionary entries from the uncompressed
characters and dictionary entries it has already established.

As this example of compressed output shows, as the message is compressed, the dictionary grows more complete, and the number of "hits" against it increases. Longer strings are also stored in the dictionary, and on the average the pointers substitute for longer strings. This means that up to a limit, the longer the message, the better the compression.

This limit is imposed in the original LZW implementation by the fact that once the 4K dictionary is complete, no more strings can be added. Defining a larger dictionary of course results in greater string capacity, but also longer pointers, reducing compression for messages that don't fill up the dictionary.

A variant of LZW known as "LZC" is used in the UN*X "compress" data compression program. LZC uses variable length pointers up to a certain maximum size. It also monitors the compression of the output stream, and if the compression ratio goes down, it flushes the dictionary and rebuilds it, on the assumption that the new dictionary will be better "tuned" to the current text.

Another refinement of LZW keeps track of string "hits" for each dictionary entry, and overwrites "least recently used" entries when the dictionary fills up. Refinements of LZW provide the core of GIF and TIFF image compression as well.

* LZW is also used in some modem communication schemes, such as the "V.42bis" protocol. V.42bis has an interesting flexibility. Since not all data files compress well given any particular encoding algorithm, the V.42bis protocol monitors the data to see how well it compresses. It sends it "as is" if it compresses poorly, but will switch to LZW compression using an escape code if it compresses well. Both the transmitter and receiver continue to build up an LZW dictionary even while the transmitter is sending the file uncompressed, and switch over transparently if the escape character is sent.
One of the interesting idiosyncrasies of V.42bis is that it uses a "rolling" escape character. The escape character starts out as a "null" (0) byte, but is incremented by 51 every time it is sent, "wrapping around" when it exceeds 255:

   0 51 102 153 204 255 --> 50 101 152 203 254 --> 49 100 ...

If a data byte that matches the escape character is sent, it is sent twice. The escape character is incremented to ensure that the protocol doesn't get bogged down if it runs into an extended string of bytes that match the escape character.
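The rolling escape sequence is just addition modulo 256, which a short sketch makes concrete. The function name `rolling_escapes` is made up for illustration, and this covers only the arithmetic of the escape value, not the rest of the protocol.

```python
def rolling_escapes(n, start=0, step=51):
    """Return the first n escape values: add `step` each time, wrap at 256."""
    value = start
    sequence = []
    for _ in range(n):
        sequence.append(value)
        value = (value + step) % 256   # keep the value within one byte
    return sequence

print(rolling_escapes(13))
```

Since 51 divides 255 evenly, the value lands on 255 after five steps and the next pass begins one lower, at 50, so each pass through the byte range visits a different set of values.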