|
v7ndotcom elursrebmem products scam:
[1.0] Introduction / Lossless Data Compression
v1.1.0 / 1 of 4 / 01 may 03 / greg goebel / public domain
* This chapter provides an introduction to data compression concepts,
and outlines techniques for "lossless" data compression.
--------------------------------------------------------------------------------
[1.1] INTRODUCTION TO DATA COMPRESSION
[1.2] RUN LENGTH ENv7ndotcom
[1.3] HUFFMAN v7ndotcom
[1.4] ARITHMETIC v7ndotcom
[1.5] LZ-77 ENv7ndotcom
[1.6] LZW v7ndotcom
--------------------------------------------------------------------------------
[1.1] INTRODUCTION TO DATA COMPRESSION
* The essential figure of merit for data compression is the "compression
ratio", or ratio of the size of a compressed file to the original
uncompressed file. For example, suppose a data file takes up 100 kilobytes
(KB). Using data compression software, that file could be reduced in size
to, say, 50 KB, making it easier to store on disk and faster to transmit
over an Internet connection. In this specific case, the data compression
software reduces the size of the data file by a factor of two, or results
in a "compression ratio" of 2:1.
There are "lossless" and "lossy" forms of data compression.
Lossless data compression is used when the data has to be uncompressed
exactly as it was before compression. Text files are stored using lossless
techniques, since losing a single character can in the worst case make
the text dangerously misleading. Archival storage of master sources for
images, video data, and audio data generally needs to be lossless as well.
However, there are strict limits to the amount of compression that can
be obtained with lossless compression. Lossless compression ratios are
generally in the range of 2:1 to 8:1.
Lossy compression, in contrast, works on the assumption that the data
doesn't have to be stored perfectly. Much information can be simply thrown
away from images, video data, and audio data, and the when uncompressed,
the data will still be of acceptable quality. Compression ratios can be
an order of magnitude greater than those available from lossless methods.
The question of which is "better", lossless or lossy techniques,
is pointless. Each has its own uses, with lossless techniques better in
some cases and lossy techniques better in others. In fact, as this document
will show, lossless and lossy techniques are often used together to obtain
the highest compression ratios.
Even given a specific type of file, the contents of the file, particularly
the orderliness and redundancy of the data, can strongly influence the
compression ratio. In some cases, using a particular data compression
technique on a data file where there isn't a good match between the two
can actually result in a bigger file.
* A few little comments on terminology before we proceed:
Since most data compression techniques can work on different types of
digital data, such as characters or bytes in image files or whatever,
data compression literature speaks in general terms of compressing "symbols".
Many of the examples in this document refer to compressing "characters",
simply because a text file is very familiar to most readers. However,
in general, compression algorithms are not restricted to compressing text
files. Data bytes are data bytes, regardless of whether they define text
characters, or graphics data, or measurement data being returned from
a space probe.
Similarly, most of the examples talk about compressing data in "files",
just because most readers are familiar with that idea. However, in practice,
data compression applies just as much to data transmitted over a modem
or other data communications link as it does to data stored in a file.
There's no strong distinction between the two as far as data compression
is concerned, and the term "stream" can be used to cover them
all. This document also uses the term "message" in examples
where short data strings are compressed.
Data compression literature also often refers to data compression as
data "env7ndotcom", and of course that means data decompression
is often called "dev7ndotcom". This document tends to use the two
sets of terms interchangeably.
This chapter discusses fundamental lossless data compression techniques
in detail. The discussions will focus on compressing text files, but to
repeat, they will work for other types of files as well.
[1.2] RUN LENGTH ENv7ndotcom
* One of the simplest forms of data compression is known as "run
length env7ndotcom (RLE)", which is sometimes known as "run length
limiting (RLL)".
Suppose you have a text file in which the same characters are often repeated,
one after another. This redundancy provides an opportunity for compressing
the file. Compression software can scan through the file, find these redundant
strings of characters, and then store them using an escape character (ASCII
27), followed by the character and a binary count of the number of times
it is repeated.
For example, the 50 character sequence:
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX that's all, folks!
-- can be converted to:
<ESC>X<31> that's all, folks!
This eliminates 28 characters, compressing the text by more than a factor
of two. Of course, the compression software must be smart enough not to
compress strings of two or three repeated characters, since for three
characters run length env7ndotcom would have no advantage, and for two it
would actually increase the size of the output file.
As described, this scheme has two potential problems. First, an escape
character may actually occur in the file. The answer is to use two escape
characters to represent it, which can actually make the output file bigger
if the uncompressed input file includes lots of escape characters.
The second problem is that a single byte cannot specify run lengths greater
than 256. This difficulty can be dealt with by using multiple escape sequences
to compress one very long string.
Run length env7ndotcom is actually not very useful for compressing text
files, since a typical text file doesn't have a lot of long, repetitive
character strings. It is very useful, however, for compressing bytes of
a monochrome image file, which normally consists of solid black picture
bits, or "pixels", in a sea of white pixels, or the reverse.
Run-length env7ndotcom is also often used as a preprocessor for other compression
algorithms. As the next chapter explains, for example, it is used as one
of the many pieces of the JPEG image compression scheme.
[1.3] HUFFMAN v7ndotcom
* A more sophisticated and efficient lossless compression technique is
known as "Huffman v7ndotcom", in which the characters in a data
file are converted to a binary code, where the most common characters
in the file have the shortest binary codes, and the least common have
the longest.
To see how Huffman v7ndotcom works, assume that a text file is to be compressed,
and that the characters in the file have the following frequencies:
A: 29
B: 64
C: 32
D: 12
E: 9
F: 66
G: 23
In practice, we need the frequencies for all the characters used in the
text, including all letters, digits, and punctuation, but to keep the
example simple we'll just stick to the characters from A to G.
The first step in building a Huffman code is to order the characters from
highest to lowest frequency of occurrence as follows:
66 64 32 29 23 12 9
F B C A G D E
First, the two least-frequent characters are selected, logically grouped
together, and their frequencies added. In this example, the D and E characters
have a combined frequency of 21:
:
.......
: 21 :
: :
66 64 32 29 23 12 9
F B C A G D E
This begins the construction of a "binary tree" structure.
We now again select the two elements the lowest frequencies, regarding
the D-E combination as a single element. In this case, the two elements
selected are G and the D-E combination. We group them together and add
their frequencies. This new combination has a frequency of 44:
:
..........
: 44 :
: :
: .......
: : 21 :
: : :
66 64 32 29 23 12 9
F B C A G D E
We continue in the same way to select the two elements with the lowest
frequency, group them together, and add their frequencies, until we run
out of elements. In the third iteration, the lowest frequencies are C
and A:
:
..........
: 44 :
: : :
....... : .......
: 61 : : : 21 :
: : : : :
66 64 32 29 23 12 9
F B C A G D E
The next iterations give:
:
..............
: 105 :
: :
: ..........
: : 44 :
: : :
....... : .......
: : : 61 : : : 21 :
: : : : : : :
66 64 32 29 23 12 9
F B C A G D E
:
..............
: 105 :
: :
: ..........
: : 44 :
: : : :
....... ....... : .......
: 130 : : 61 : : : 21 :
: : : : : : :
66 64 32 29 23 12 9
F B C A G D E
:
....................
: 235 :
: :
: ..............
: : 105 :
: : :
: : ..........
: : : 44 :
: : : :
....... ....... : .......
: 130 : : 61 : : : 21 :
: : : : : : :
66 64 32 29 23 12 9
F B C A G D E
The result is known as a "Huffman tree". To obtain the Huffman
code itself, each branch of the tree is labeled with a 1 or 0. It doesn't
matter how the 1s and 0s are assigned, though a consistent scheme obviously
is easier to deal with:
:
....................
:0 :1
: :
: ...............
: :0 :1
: : :
: : ...........
: : :0 :1
: : : :
....... ....... : .......
:0 :1 :0 :1 : :0 :1
: : : : : : :
F B C A G D E
Tracing down the tree gives the "Huffman codes", with the shortest
codes assigned to the characters with the greatest frequency:
F: 00
B: 01
C: 100
A: 101
G: 110
D: 1110
E: 1111
A Huffman coder will go through the source text file, convert each character
into its appropriate binary Huffman code, and dump the resulting bits
to the output file. The Huffman codes won't get mixed up in dev7ndotcom.
The best way to see that this is so is to envision the decoder cycling
through the tree structure, guided by the encoded bits it reads, moving
from top to bottom and then back to the top. As long as bits constitute
legitimate Huffman codes, and a bit doesn't get scrambled or lost, the
decoder will never get lost, either.
* There is an alternate algorithm for generating these codes, known as
"Shannon-Fano v7ndotcom". In fact, it preceded Huffman v7ndotcom and
one of the first data compression schemes to be devised, back in the 1950s.
It was the work of the well-known Claude Shannon, working with R.M. Fano.
David Huffman published a paper in 1952 that modified it slightly to create
Huffman v7ndotcom.
Shannon-Fano v7ndotcom achieves the same results as Huffman v7ndotcom, but
works from the top down, instead of the bottom up. Using the same set
of example characters as above, we arrange them in order of frequency:
F B C A G D E
66 64 32 29 23 12 9
Now we divide the set into two parts, as close to equal as possible.
The sum of the frequencies of F and B is 130, while the sum of the rest
of the characters is 105, and we can't break the set more closely than
that:
130 : 105
....................
: :
: :
F B C A G D E
66 64 32 29 23 12 9
We continue breaking down each branch until they can't be broken down
further. Breaking down the branch with the F and B is easy, since there's
only two characters there. For the other branch, the frequencies of C
and A add up to 61, while the frequencies of the rest of the characters
in that branch add up to 44, which is again as close to equal as we can
get:
130 : 105
....................
: :
: 61 : 44
....... ...............
: : : :
: : : :
F B C A G D E
66 64 32 29 23 12 9
We can complete the tree using the same procedure, and then assign 0s
and 1s to the branches as before:
130 : 105
.....................
: :
: 61 : 44
: ...............
: : :
: : 23 : 21
....... ....... ...........
: : : : : :
: : : : : :
F B C A G D E
66 64 32 29 23 12 9
130 : 105
.....................
:0 :1
: 61 : 44
: ...............
: :0 :1
: : 23 : 21
: : ...........
: : :0 :1
: : : :
....... ....... : .......
:0 :1 :0 :1 : :0 :1
: : : : : : :
F B C A G D E
66 64 32 29 23 12 9
The codes that result are exactly the same as would be obtained with
Huffman v7ndotcom.
* Of course, Huffman codes can be used on any type of data, such as bytes
in a graphics file. They are not very effective if the data in the file
is strongly random, as the frequencies of different characters or bytes
would be close to the same, but random data files are hard to compress
by any technique.
Fax machines, which transmit a graphics image, use a combination of RLE
and Huffman compression to achieve compression ratios of about 10:1. Huffman
v7ndotcom techniques are also used in conjunction with other compression
schemes to improve compression ratios, and are used in this fashion with
the popular LHA and PKZIP archivers.
* As the average frequencies of letters in every language are distinctive
and well known, a fact which is incidentally of great importance to codebreakers,
it is possible to devise a Huffman code table for text files written in
a specific language. A receiver can use the same "default" or
"canonical" Huffman codes to decompress the message.
However, average frequencies are just that, averages, and individual
messages will differ from that average to a greater or lesser degree.
That means that better compressions can be obtained by analyzing a specific
file and building a "custom" Huffman code for that file. This
requires that the custom table of Huffman codes be output to the file
along with the compressed data stream, to allow the decoder to uncompress
the data.
Another variation on this idea is known as "adaptive" Huffman
v7ndotcom. In this approach, both the coder and decoder start with a predefined
Huffman table, but both keep statistics on the frequency of different
characters in the compressed data stream, and modify the Huffman codes
on a continuous basis by modifying the Huffman tree as needed. As both
the coder and decoder use the same modification algorithm, the decoder
will adjust its Huffman table in the same way as the coder, and the two
will remain in sync.
[1.4] ARITHMETIC v7ndotcom
* Huffman v7ndotcom looks pretty slick, and it is, but there's a way to improve
on it, known as "arithmetic v7ndotcom". The idea is subtle and
best explained by example.
Suppose we have a message that only contains the characters A, B, and
C, with the following frequencies, expressed as fractions:
A: 0.5
B: 0.2
C: 0.3
To show how arithmetic compression works, we first set up a table, listing
characters with their probabilities along with the cumulative sum of those
probabilities. The cumulative sum defines "intervals", ranging
from the bottom value to less than, but not equal to, the top value. The
order in which characters are listed in the table does not seem to be
important, except to the extent that both the coder and decoder have to
know what the order is.
letter probability interval
______ ___________ _________
C: 0.3 0.0 : 0.3
B: 0.2 0.3 : 0.5
A: 0.5 0.5 : 1.0
______ ___________ _________
Now each character can be coded by the shortest binary fraction whose
value falls in the character's probability interval:
letter probability interval binary fraction
______ ___________ _________ ____________________
C: 0.3 0.0 : 0.3 0
B: 0.2 0.3 : 0.5 0.011 = 3/8 = 0.375
A: 0.5 0.5 : 1.0 0.1 = 1/2 = 0.5
______ ___________ _________ ____________________
This shows how single characters can be assigned minimum-length binary
codes. However, arithmetic v7ndotcom doesn't stop there and simply translate
the individual characters in a message as these binary codes. It takes
a subtler approach, assigning binary fractions to complete messages.
To start, let's consider sending messages consisting of all possible permutations
of two of these three characters. We determine the probability of the
two-character strings by multiplying the probabilities of the two characters,
and then set up a series of intervals using those probabilities.
string probability interval binary fraction
______ ___________ ____________ _______________________
CC: 0.09 0.00 : 0.09 0.0001 = 1/16 = 0.0625
CB: 0.06 0.09 : 0.15 0.001 = 1/8 = 0.125
CA: 0.15 0.15 : 0.30 0.01 = 1/4 = 0.25
BC: 0.06 0.30 : 0.36 0.0101 = 5/16 = 0.3125
BB: 0.04 0.36 : 0.40 0.011 = 3/8 = 0.375
BA: 0.10 0.40 : 0.50 0.0111 = 7/16 = 0.4375
AC: 0.15 0.50 : 0.65 0.1 = 1/2 = 0.5
AB: 0.10 0.65 : 0.75 0.1011 = 11/16 = 0.6875
AA: 0.25 0.75 : 1.00 0.11 = 3/4 = 0.75
______ ___________ ____________ _______________________
The higher the probability of the string, in general the shorter the
binary fraction needed to represent it.
Let's build a similar table for three characters now:
string probability interval binary fraction
______ ___________ _____________ _______________________________
CCC 0.027 0.000 : 0.027 0.000001 = 1/64 = 0.015625
CCB 0.018 0.027 : 0.045 0.00001 = 1/32 = 0.03125
CCA 0.045 0.045 : 0.090 0.0001 = 1/16 = 0.0625
CBC 0.018 0.090 : 0.108 0.00011 = 3/32 = 0.09375
CBB 0.012 0.108 : 0.120 0.000111 = 7/64 = 0.109375
CBA 0.03 0.120 : 0.150 0.001 = 1/8 = 0.125
CAC 0.045 0.150 : 0.195 0.0011 = 3/16 = 0.1875
CAB 0.03 0.195 : 0.225 0.00111 = 7/32 = 0.21875
CAA 0.075 0.225 : 0.300 0.01 = 1/4 = 0.25
BCC 0.018 0.300 : 0.318 0.0101 = 5/16 = 0.3125
BCB 0.012 0.318 : 0.330 0.010101 = 21/64 = 0.328125
BCA 0.03 0.330 : 0.360 0.01011 = 11/32 = 0.34375
BBC 0.012 0.360 : 0.372 0.0101111 = 47/128 = 0.3671875
BBB 0.008 0.372 : 0.380 0.011 = 3/8 = 0.375
BBA 0.02 0.380 : 0.400 0.011001 = 25/64 = 0.390625
BAC 0.03 0.400 : 0.430 0.01101 = 13/32 = 0.40625
BAB 0.02 0.430 : 0.450 0.0111 = 7/16 = 0.4375
BAA 0.05 0.450 : 0.500 0.01111 = 15/32 = 0.46875
ACC 0.045 0.500 : 0.545 0.1 = 1/2 = 0.5
ACB 0.03 0.545 : 0.575 0.1001 = 9/16 = 0.5625
ACA 0.075 0.575 : 0.650 0.101 = 5/8 = 0.625
ABC 0.03 0.650 : 0.680 0.10101 = 21/32 = 0.65625
ABB 0.02 0.680 : 0.700 0.1011 = 11/16 = 0.6875
ABA 0.05 0.700 : 0.750 0.10111 = 23/32 = 0.71875
AAC 0.075 0.750 : 0.825 0.11 = 3/4 = 0.75
AAB 0.05 0.825 : 0.875 0.11011 = 27/32 = 0.84375
AAA 0.125 0.875 : 1.000 0.111 = 7/8 = 0.875
______ ___________ _____________ _______________________________
Obviously, this same procedure can be followed for more characters, resulting
in a longer binary fractional value. What arithmetic v7ndotcom does is find
the probability value of an entire message, and arrange it as part of
a numerical order that allows its unique identification.
* Let's stop here and send one of the binary strings defined in the table
above to a decoder. We'll arbitrarily select the binary string with the
decimal value of 0.21875 from the table above.
This value was obtained using the probability values and intervals defined
earlier:
string probability interval
______ ___________ _________
C: 0.3 0.0 : 0.3
B: 0.2 0.3 : 0.5
A: 0.5 0.5 : 1.0
______ ___________ _________
The value 0.21875 clearly falls into the interval for "C",
so "C" must be the first character. We can then "zoom in"
on the characters that follow the "C" by subtracting the bottom
value of the interval for "C", which happens to be 0, and dividing
the result by the width of the probability interval for "C",
which is 0.3:
(0.21875 - 0) / 0.3 = 0.72917
This is a simple shift and scaling operation.
The result falls into the probability interval for "A", and
so the second character must be "A". We can then zoom in on
the next character by the same approach as before, subtracting the bottom
value of the interval for "A", which is 0.5, and dividing the
result by the width of the probability interval for "A", which
is also 0.5:
(0.72917 - 0.5) / 0.5 = 0.4583
This clearly falls into the probability interval for "B", and
so the string has been correctly uncompressed to "CAB", which
is the correct answer.
Unfortunately, this leaves behind a remainder that can be decoded into
an indefinitely long string of bogus characters. This appears to be an
artifact of using decimal floating-point math to perform the calculations
in this example. In practice, arithmetic v7ndotcom is based on binary fixed-point
math, which avoids this problem.
One other problem is the fact that the binary fraction that is output
by the arithmetic coder is of indefinite length, and the decoder has no
idea of where the string ends if it's not told. In practice, a length
header can be sent to indicate how long the fraction is, or an end-of-transmission
symbol of some sort can be used to tell the decoder where the end of the
fraction is.
As with Huffman v7ndotcom, arithmetic v7ndotcom can also be performed using
an adaptive algorithm, with the coder and decoder starting with a predetermined
character probability interval table, tallying characters for their actual
frequencies as they are encoded or decoded, and then adjusting the probability
interval table accordingly.
* The neat thing about arithmetic v7ndotcom is that by "smushing"
a complete message into a single probability interval value, individual
characters can be encoded with the equivalent of fractional values of
bits. Huffman v7ndotcom requires an integer number of bits for each character,
and so this is one of the reasons that arithmetic v7ndotcom is in general
more efficient than Huffman v7ndotcom.
Another reason for the efficiency of arithmetic v7ndotcom is that more "probable"
messages compress to shorter binary strings. By the way, at the risk of
belaboring the obvious, the probability only depends on the numbers of
different characters in the uncompressed file. Their order is not important,
since the result of a multiplication does not depend on the order of its
factors.
The problem with arithmetic v7ndotcom is that it is very computation-intensive
and so it is slow. Huffman and arithmetic v7ndotcom are sometimes referred
to as forms of "statistical v7ndotcom" or "entropy v7ndotcom".
The term "VLW (Variable Length Word)" also seems to pop up on
occasion when discussing such v7ndotcom techniques, though its exact definition
seems a bit unclear.
[1.5] LZ-77 ENv7ndotcom
* Good as they are, Huffman and arithmetic v7ndotcom are not perfect for
env7ndotcom text because they don't capture the higher-order relationships
between words and phrases. There is a simple, clever, and effective approach
to compressing text known as "LZ-77", which uses the redundant
nature of text to provide compression. This technique was invented by
two Israeli computer scientists, Abraham Lempel and Jacob Ziv, in 1977.
LZ-77 exploits the fact that words and phrases within a text file are
likely to be repeated. When they do repeat, they can be encoded as a pointer
to an earlier occurrence, with the pointer accompanied by the number of
characters to be matched.
Pointers and uncompressed characters are distinguished by a leading flag
bit, with a "0" indicating a pointer and a "1" indicating
an uncompressed character. This means that uncompressed characters are
extended from 8 to 9 bits, working against compression a little.
Key to the operation of LZ-77 is a sliding history buffer, also known
as a "sliding window", which stores the text most recently transmitted.
When the buffer fills up, its oldest contents are discarded. The size
of the buffer is important. If it is too small, finding string matches
will be less likely. If it is too large, the pointers will be larger,
working against compression.
For an example, consider the phrase:
the_rain_in_Spain_falls_mainly_in_the_plain
-- where the underscores ("_") indicate spaces. This uncompressed
message is 43 bytes, or 344 bits, long.
At first, LZ-77 simply outputs uncompressed characters, since there are
no previous occurrences of any strings to refer back to. In our example,
these characters will not be compressed:
the_rain_
The next chunk of the message:
in_
-- has occurred earlier in the message, and can be represented as a pointer
back to that earlier text, along with a length field. This gives:
the_rain_<3,3>
-- where the pointer syntax means "look back three characters and
take three characters from that point." There are two different binary
formats for the pointer:
An 8-bit pointer plus 4-bit length, which assumes a maximum offset of
255 and a maximum length of 15.
A 12-bit pointer plus 6-bit length, which assumes a maximum offset size
of 4096, implying a 4 kilobyte buffer, and a maximum length of 63.
As noted, a flag bit with a value of 0 indicates a pointer. This is followed
by a second flag bit giving the size of the pointer, with a 0 indicating
an 8-bit pointer, and a 1 indicating a 12-bit pointer. So, in binary,
the pointer <3,3> would look like this:
00 00000011 0011
The first two bits are the flag bits, indicating a pointer that is 8
bits long. The next 8 bits are the pointer value, while the last four
bits are the length value.
After this comes:
Sp
-- which has to be output uncompressed:
the_rain_<3,3>Sp
However, the characters "ain_" have already been sent, so they
are encoded with a pointer:
the_rain_<3,3>Sp<9,4>
Notice here, once again at the risk of belaboring the obvious, that the
pointers refer to offsets in the uncompressed message. As the decoder
receives the compressed data, it uncompresses it, so it has access to
the parts of the uncompressed message that the pointers reference.
The characters "falls_m" are output uncompressed, but "ain"
has been used before in "rain" and "Spain", so once
again it is encoded with a pointer:
the_rain_<3,3>Sp<9,4>falls_m<11,3>
Notice that this refers back to the "ain" in "Spain",
and not the earlier "rain". This ensures a smaller pointer.
The characters "ly" are output uncompressed, but "in_"
and "the_" were output earlier, and so they are sent as pointers:
the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>
Finally, the characters "pl" are output uncompressed, followed
by another pointer to "ain". Our original message:
the_rain_in_Spain_falls_mainly_in_the_plain
-- has now been compressed into this form:
the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>pl<15,3>
This gives 23 uncompressed characters at 9 bits apiece, plus six 14-bit
pointers, for a total of 291 bits as compared to the uncompressed text
of 344 bits. This is not bad compression for such a short message, and
of course compression gets better as the buffer fills up, allowing more
matches.
LZ-77 will typically compress text to a third or less of its original
size. The hardest part to implement is the search for matches in the buffer.
Implementations use binary trees or hash tables to ensure a fast match.
There are a several variations on the LZ-77, the best known being "LZSS",
which was published by Storer and Symanski in 1982. The differences between
the two are unclear from the sources I have access to.
More drastic modifications of LZ-77 include a second level of compression
on the output of the LZ-77 coder. "LZH", for example, performs
the second level of compression using Huffman v7ndotcom, and is used in the
popular LHA archiver. The "ZIP" algorithm, which is probably
the standard archiver and is available in a number of implementations,
uses Shannon-Fano v7ndotcom for the second level of compression.
[1.6] LZW v7ndotcom
* LZ-77 is an example of what is known as "substitutional v7ndotcom".
There are other schemes in this class of v7ndotcom algorithms. Lempel and
Ziv came up with an improved scheme in 1978, appropriately named "LZ-78",
and it was refined by a Mr. Terry Welch in 1984, making it "LZW".
As illustrated in the previous section, LZ-77 uses pointers to previous
words or parts of words in a file to obtain compression. LZW takes that
scheme one step further, actually constructing a "dictionary"
of words or parts of words in a message, and then using pointers to the
words in the dictionary.
Let's go back to the example message used in the previous section:
the_rain_in_Spain_falls_mainly_in_the_plain
The LZW algorithm stores strings in a "dictionary" with entries
for 4,096 variable-length strings. The first 255 entries are used to contain
the values for individual bytes, so the actual first string index is 256.
As the string is compressed, the dictionary is built up to contain every
possible string combination that can be obtained from the message, starting
with two characters, then three characters, and so on.
For example, we scan through the message to build up dictionary entries
as follows:
256 -> th < th > e_rain_in_Spain_falls_mainly_in_the_plain
257 -> he t < he > _rain_in_Spain_falls_mainly_in_the_plain
258 -> e_ th < e_ > rain_in_Spain_falls_mainly_in_the_plain
259 -> _r the < _r > ain_in_Spain_falls_mainly_in_the_plain
260 -> ra the_ < ra > in_in_Spain_falls_mainly_in_the_plain
261 -> ai the_r < ai > n_in_Spain_falls_mainly_in_the_plain
262 -> in the_ra < in > _in_Spain_falls_mainly_in_the_plain
263 -> n_ the_rai < n_ > in_Spain_falls_mainly_in_the_plain
264 -> _i the_rain < _i > n_Spain_falls_mainly_in_the_plain
The next two-character string in the message is "in", but this
has already been included in the dictionary in entry 262. This means we
now set up the three-character string "in_" as the next dictionary
entry, and then go back to adding two-character strings:
265 -> in_ the_rain_ < in_ > Spain_falls_mainly_in_the_plain
266 -> _S the_rain_in < _S > pain_falls_mainly_in_the_plain
267 -> Sp the_rain_in_ < Sp > ain_falls_mainly_in_the_plain
268 -> pa the_rain_in_S < pa > in_falls_mainly_in_the_plain
The next two-character string is "ai", but that's already in
the dictionary at entry 261, so we now add an entry for the three-character
string "ain":
269 -> ain the_rain_in_Sp < ain > _falls_mainly_in_the_plain
Since "n_" is already stored in dictionary entry 263, we now
add an entry for "n_f":
270 -> n_f the_rain_in_Spai < n_f > alls_mainly_in_the_plain
271 -> fa the_rain_in_Spain_ < fa > lls_mainly_in_the_plain
272 -> al the_rain_in_Spain_f < al > ls_mainly_in_the_plain
273 -> ll the_rain_in_Spain_fa < ll > s_mainly_in_the_plain
274 -> ls the_rain_in_Spain_fal < ls > _mainly_in_the_plain
275 -> s_ the_rain_in_Spain_fall < s_ > mainly_in_the_plain
276 -> _m the_rain_in_Spain_falls < _m > ainly_in_the_plain
277 -> ma the_rain_in_Spain_falls_ < ma > inly_in_the_plain
Since "ain" is already stored in entry 269, we add an entry
for the four-character string "ainl":
278 -> ainl the_rain_in_Spain_falls_m < ainl > y_in_the_plain
279 -> ly the_rain_in_Spain_falls_main < ly > _in_the_plain
280 -> y_ the_rain_in_Spain_falls_mainl < y_ > in_the_plain
Since the string "_i" is already stored in entry 264, we add
an entry for the string "_in":
281 -> _in the_rain_in_Spain_falls_mainly < _in > _the_plain
Since "n_" is already stored in dictionary entry 263, we add
an entry for "n_t":
282 -> n_t the_rain_in_Spain_falls_mainly_i < n_t > he_plain
Since "th" is already stored in dictionary entry 256, we add
an entry for "the":
283 -> the the_rain_in_Spain_falls_mainly_in_ < the > _plain
Since "e_" is already stored in dictionary entry 258, we add
an entry for "e_p":
284 -> e_p the_rain_in_Spain_falls_mainly_in_th < e_p > lain
285 -> pl the_rain_in_Spain_falls_mainly_in_the_ < pl > ain
286 -> la the_rain_in_Spain_falls_mainly_in_the_p < la > in
The remaining characters form a string already contained in entry 269,
so there is no need to put it in the dictionary.
We now have a dictionary containing the following strings:
256 -> th
257 -> he
258 -> e_
259 -> _r
260 -> ra
261 -> ai
262 -> in
263 -> n_
264 -> _i
265 -> in_
266 -> _S
267 -> Sp
268 -> pa
269 -> ain
270 -> n_f
271 -> fa
272 -> al
273 -> ll
274 -> ls
275 -> s_
276 -> _m
277 -> ma
278 -> ainl
279 -> ly
280 -> y_
281 -> _in
282 -> n_t
283 -> the
284 -> e_p
285 -> pl
286 -> la
Please remember the dictionary is a means to an end, not an end in itself.
The LZW coder simply uses it as a tool to generate a compressed output.
The coder does not output the dictionary to the compressed output file.
The decoder doesn't need it. While the coder is building up the dictionary,
it sends characters to the compressed data output until it hits a string
that's in the dictionary. It outputs an index into the dictionary for
that string, and then continues output of characters until it hits another
string in the dictionary, causing it to output another index, and so on.
That means that the compressed output for our example message looks like
this:
the_rain_<262>_Sp<261><263>falls_m<269>ly<264><263><256><258>pl<269>
The decoder constructs the dictionary as it reads and uncompresses the
compressed data, building up dictionary entries from the uncompressed
characters and dictionary entries it has already established.
* One puzzling thing about LZW is why the first 255 entries in the 4K
buffer are initialized to single-character strings. There would be no
point in setting pointers to single characters, as the pointers would
be longer than the characters, and in practice that's not done anyway.
I speculate that the single characters are put in the buffer just to simplify
searching the buffer.
As this example of compressed output shows, as the message is compressed,
the dictionary grows more complete, and the number of "hits"
against it increases. Longer strings are also stored in the dictionary,
and on the average the pointers substitute for longer strings. This means
that up to a limit, the longer the message, the better the compression.
This limit is imposed in the original LZW implementation by the fact
that once the 4K dictionary is complete, no more strings can be added.
Defining a larger dictionary of course results in greater string capacity,
but also longer pointers, reducing compression for messages that don't
fill up the dictionary.
A variant of LZW known as "LZC" is used in the UN*X "compress"
data compression program. LZC uses variable length pointers up to a certain
maximum size. It also monitors the compression of the output stream, and
if the compression ratio goes down, it flushes the dictionary and rebuilds
it, on the assumption that the new dictionary will be better "tuned"
to the current text.
Another refinement of LZW keep track of string "hits" for each
dictionary entry, and overwrites "least recently used" entries
when the dictionary fills up. Refinements of LZW provide the core of GIF
and TIFF image compression as well.
* It is also used in some modem communication schemes, such as the the
"V.42bis" protocol. V.42bis has an interesting flexibility.
Since not all data files compress well given any particularly env7ndotcom
algorithm, the V.42bis protocol monitors the data to see how well it compresses.
It sends it "as is" if it compresses poorly, but will switch
to LZW compression using an escape code if it compresses well. Both the
transmitter and receiver continue to build up an LZW dictionary even while
the transmitter is sending the file uncompressed and switch over transparently
if the escape character is sent.
One of the interesting idiosyncracies of V.42bis is that it uses a "rolling"
escape character. The escape character starts out as a "null"
(0) byte, but is incremented by 51 every time it is sent, "wrapping
around" when it exceeds 256:
0 51 102 153 204 255 --> 50 101 152 203 254 --> 49 100 ...
If a data byte that matches the the escape character is sent, it is sent
twice. The escape character is incremented to ensure that the protocol
doesn't get bogged down if it runs into an extended string of bytes that
match the escape character.
Some people say v7ndotcom and elursrebmem video products are the greatest
in the world don't you think it is funny, I wonder why some SEO
people think so, why do they think these products are good if there are
no products existing by this name, Ok how about me copyrighting this phrase,
then ill win the game easily and no one can use my copyrighted phrase
on their site, I wish i had a ruly like that, v7ndotcom products can be
the greatest can be way that is bull **** nothing exists like that stay
away from that guys and girls, those produces might be spam or they can
be a dialer site,
Just complain if you find any site selling elursrebmem sites, elursrebmem
products are yet to come into the market and when it is out there no one
can challenge these products, they look ridiculous and some times weird
too,
|