[1.0] Introduction / Lossless Data Compression
v1.1.0 / 1 of 4 / 01 may 03 / greg goebel / public domain

* This chapter provides an introduction to data compression concepts, and outlines techniques for "lossless" data compression.

--------------------------------------------------------------------------------
[1.1] INTRODUCTION TO DATA COMPRESSION
[1.2] RUN LENGTH ENCODING
[1.3] HUFFMAN CODING
[1.4] ARITHMETIC CODING
[1.5] LZ-77 ENCODING
[1.6] LZW CODING

--------------------------------------------------------------------------------


[1.1] INTRODUCTION TO DATA COMPRESSION
* The essential figure of merit for data compression is the "compression ratio", or ratio of the size of the original uncompressed file to the size of the compressed file. For example, suppose a data file takes up 100 kilobytes (KB). Using data compression software, that file could be reduced in size to, say, 50 KB, making it easier to store on disk and faster to transmit over an Internet connection. In this specific case, the data compression software reduces the size of the data file by a factor of two, or results in a "compression ratio" of 2:1.

There are "lossless" and "lossy" forms of data compression. Lossless data compression is used when the data has to be uncompressed exactly as it was before compression. Text files are stored using lossless techniques, since losing a single character can in the worst case make the text dangerously misleading. Archival storage of master sources for images, video data, and audio data generally needs to be lossless as well.

However, there are strict limits to the amount of compression that can be obtained with lossless compression. Lossless compression ratios are generally in the range of 2:1 to 8:1.

Lossy compression, in contrast, works on the assumption that the data doesn't have to be stored perfectly. Much information can be simply thrown away from images, video data, and audio data, and when uncompressed, the data will still be of acceptable quality. Compression ratios can be an order of magnitude greater than those available from lossless methods.

The question of which is "better", lossless or lossy techniques, is pointless. Each has its own uses, with lossless techniques better in some cases and lossy techniques better in others. In fact, as this document will show, lossless and lossy techniques are often used together to obtain the highest compression ratios.

Even given a specific type of file, the contents of the file, particularly the orderliness and redundancy of the data, can strongly influence the compression ratio. In some cases, using a particular data compression technique on a data file where there isn't a good match between the two can actually result in a bigger file.
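
As a rough illustration of how much the contents matter, the sketch below runs the same general-purpose compressor over a highly repetitive message and over pseudo-random bytes of the same length. It uses Python's standard "zlib" module purely as a convenient stand-in; the helper name "compression_ratio" and the sample data are made up for this example.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size; below 1.0 means the data grew."""
    return len(data) / len(zlib.compress(data))


# Orderly, redundant data: one short phrase repeated many times.
orderly = b"that's all, folks! " * 500

# Disorderly data: pseudo-random bytes of the same length (fixed seed for repeatability).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(orderly)))

print(f"orderly: {compression_ratio(orderly):.1f}:1")
print(f"noise:   {compression_ratio(noise):.2f}:1")
```

The repetitive input should shrink dramatically, while the random input should come out at about its original size or slightly larger, since the compressor's bookkeeping adds a few bytes without finding anything to remove.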

* A few little comments on terminology before we proceed:


Since most data compression techniques can work on different types of digital data, such as characters or bytes in image files or whatever, data compression literature speaks in general terms of compressing "symbols".
Many of the examples in this document refer to compressing "characters", simply because a text file is very familiar to most readers. However, in general, compression algorithms are not restricted to compressing text files. Data bytes are data bytes, regardless of whether they define text characters, or graphics data, or measurement data being returned from a space probe.


Similarly, most of the examples talk about compressing data in "files", just because most readers are familiar with that idea. However, in practice, data compression applies just as much to data transmitted over a modem or other data communications link as it does to data stored in a file. There's no strong distinction between the two as far as data compression is concerned, and the term "stream" can be used to cover them all. This document also uses the term "message" in examples where short data strings are compressed.

Data compression literature also often refers to data compression as data "encoding", and of course that means data decompression is often called "decoding". This document tends to use the two sets of terms interchangeably.
This chapter discusses fundamental lossless data compression techniques in detail. The discussions will focus on compressing text files, but to repeat, they will work for other types of files as well.


[1.2] RUN LENGTH ENCODING
* One of the simplest forms of data compression is known as "run length encoding (RLE)". It is sometimes loosely called "run length limiting (RLL)", though strictly speaking RLL is a separate family of channel codes used for recording data on disks.

Suppose you have a text file in which the same characters are often repeated, one after another. This redundancy provides an opportunity for compressing the file. Compression software can scan through the file, find these redundant strings of characters, and then store them using an escape character (ASCII 27), followed by the character and a binary count of the number of times it is repeated.

For example, the 50 character sequence:

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX that's all, folks!

-- can be converted to:
<ESC>X<31> that's all, folks!

This eliminates 28 characters, compressing the text by more than a factor of two. Of course, the compression software must be smart enough not to compress strings of two or three repeated characters, since for three characters run length encoding would have no advantage, and for two it would actually increase the size of the output file.
As described, this scheme has two potential problems. First, an escape character may actually occur in the file. The answer is to use two escape characters to represent it, which can actually make the output file bigger if the uncompressed input file includes lots of escape characters.

The second problem is that a single byte cannot specify run lengths greater than 255. This difficulty can be dealt with by using multiple escape sequences to compress one very long string.
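
The scheme described above, including both workarounds, can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions of its own: the names "rle_encode" and "rle_decode" are invented here, runs shorter than four bytes are left alone, and a literal escape byte is always doubled rather than run-encoded.

```python
ESC = 0x1B       # ASCII 27, the escape marker
MIN_RUN = 4      # shorter runs are cheaper to store as-is
MAX_RUN = 255    # largest count that fits in one byte


def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        # Measure the run of identical bytes starting at i, capped at MAX_RUN
        # so that a very long run is split into several escape sequences.
        run = 1
        while i + run < len(data) and data[i + run] == byte and run < MAX_RUN:
            run += 1
        if byte == ESC:
            # Assumption of this sketch: literal escapes are doubled, never run-encoded.
            out += bytes([ESC, ESC]) * run
        elif run >= MIN_RUN:
            out += bytes([ESC, byte, run])
        else:
            out += bytes([byte]) * run
        i += run
    return bytes(out)


def rle_decode(packed: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        if packed[i] != ESC:
            out.append(packed[i])            # ordinary byte
            i += 1
        elif packed[i + 1] == ESC:
            out.append(ESC)                  # doubled escape = one literal escape
            i += 2
        else:
            out += bytes([packed[i + 1]]) * packed[i + 2]   # <ESC> byte count
            i += 3
    return bytes(out)
```

Feeding it the example message, rle_encode(b"X" * 31 + b" that's all, folks!") turns the 50 input bytes into 22 output bytes, and rle_decode restores the original.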

Run length encoding is actually not very useful for compressing text files, since a typical text file doesn't have a lot of long, repetitive character strings. It is very useful, however, for compressing bytes of a monochrome image file, which normally consists of solid black picture bits, or "pixels", in a sea of white pixels, or the reverse.

Run-length encoding is also often used as a preprocessor for other compression algorithms. As the next chapter explains, for example, it is used as one of the many pieces of the JPEG image compression scheme.


[1.3] HUFFMAN CODING
* A more sophisticated and efficient lossless compression technique is known as "Huffman coding", in which the characters in a data file are converted to a binary code, where the most common characters in the file have the shortest binary codes, and the least common have the longest.

To see how Huffman coding works, assume that a text file is to be compressed, and that the characters in the file have the following frequencies:

A: 29
B: 64
C: 32
D: 12
E: 9
F: 66
G: 23

In practice, we need the frequencies for all the characters used in the text, including all letters, digits, and punctuation, but to keep the example simple we'll just stick to the characters from A to G.
The first step in building a Huffman code is to order the characters from highest to lowest frequency of occurrence as follows:

66    64    32    29    23    12    9
F     B     C     A     G     D     E

First, the two least-frequent characters are selected, logically grouped together, and their frequencies added. In this example, the D and E characters have a combined frequency of 21:
                                 :
                              .......
                              : 21  :
                              :     :
66    64    32    29    23    12    9
F     B     C     A     G     D     E

This begins the construction of a "binary tree" structure. We now again select the two elements with the lowest frequencies, regarding the D-E combination as a single element. In this case, the two elements selected are G and the D-E combination. We group them together and add their frequencies. This new combination has a frequency of 44:
                            :
                        ..........
                        :   44   :
                        :        :
                        :     .......
                        :     : 21  :
                        :     :     :
66    64    32    29    23    12    9
F     B     C     A     G     D     E

We continue in the same way to select the two elements with the lowest frequency, group them together, and add their frequencies, until we run out of elements. In the third iteration, the lowest frequencies are C and A:
                            :
                        ..........
                        :   44   :
               :        :        :
            .......     :     .......
            : 61  :     :     : 21  :
            :     :     :     :     :
66    64    32    29    23    12    9
F     B     C     A     G     D     E

The next iterations give:
                     :
               ..............
               :    105     :
               :            :
               :        ..........
               :        :   44   :
               :        :        :
            .......     :     .......
            : 61  :     :     : 21  :
            :     :     :     :     :
66    64    32    29    23    12    9
F     B     C     A     G     D     E


                     :
               ..............
               :    105     :
               :            :
               :        ..........
               :        :   44   :
   :           :        :        :
.......     .......     :     .......
: 130 :     : 61  :     :     : 21  :
:     :     :     :     :     :     :
66    64    32    29    23    12    9
F     B     C     A     G     D     E


            :
   ...................
   :       235       :
   :                 :
   :           ..............
   :           :    105     :
   :           :            :
   :           :        ..........
   :           :        :   44   :
   :           :        :        :
.......     .......     :     .......
: 130 :     : 61  :     :     : 21  :
:     :     :     :     :     :     :
66    64    32    29    23    12    9
F     B     C     A     G     D     E

The result is known as a "Huffman tree". To obtain the Huffman code itself, each branch of the tree is labeled with a 1 or 0. It doesn't matter how the 1s and 0s are assigned, though a consistent scheme obviously is easier to deal with:
            :
   ...................
   :0                :1
   :                 :
   :           ..............
   :           :0           :1
   :           :            :
   :           :        ..........
   :           :        :0       :1
   :           :        :        :
.......     .......     :     .......
:0    :1    :0    :1    :     :0    :1
:     :     :     :     :     :     :
F     B     C     A     G     D     E

Tracing down the tree gives the "Huffman codes", with the shortest codes assigned to the characters with the greatest frequency:
F: 00
B: 01
C: 100
A: 101
G: 110
D: 1110
E: 1111

A Huffman coder will go through the source text file, convert each character into its appropriate binary Huffman code, and dump the resulting bits to the output file. The Huffman codes won't get mixed up in decoding. The best way to see that this is so is to envision the decoder cycling through the tree structure, guided by the encoded bits it reads, moving from top to bottom and then back to the top. As long as bits constitute legitimate Huffman codes, and a bit doesn't get scrambled or lost, the decoder will never get lost, either.
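
The tree-building procedure walked through above can be sketched compactly with a priority queue. This is a minimal illustration rather than a production coder: the function names are invented here, symbols are assumed to be plain strings, and the code works with strings of "0" and "1" characters instead of packed bits.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string}, built by repeatedly merging the two rarest nodes."""
    tiebreak = count()  # keeps heap entries comparable if two frequencies tie
    # Each heap entry is (frequency, tiebreak, tree); a tree is a symbol or a (left, right) pair.
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, rarest = heapq.heappop(heap)
        f2, _, second = heapq.heappop(heap)
        # The more frequent of the pair goes on the left ("0") branch.
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (second, rarest)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # a lone symbol still needs one bit

    walk(heap[0][2], "")
    return codes


def huffman_decode(bits, codes):
    """Decode a string of '0'/'1' characters using a code table from huffman_codes."""
    lookup = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:   # no code is a prefix of another, so the first match is right
            out.append(lookup[current])
            current = ""
    return "".join(out)


FREQS = {"A": 29, "B": 64, "C": 32, "D": 12, "E": 9, "F": 66, "G": 23}
codes = huffman_codes(FREQS)
```

With the example frequencies, and with the left-right convention chosen in this sketch, the resulting table matches the one given above: F is 00, B is 01, C is 100, A is 101, G is 110, D is 1110, and E is 1111. A different but equally valid convention would swap some 0s and 1s without changing any code length.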
* There is an alternate algorithm for generating these codes, known as "Shannon-Fano coding". In fact, it preceded Huffman coding and was one of the first data compression schemes to be devised, back in the late 1940s. It was the work of the well-known Claude Shannon, working with R.M. Fano. David Huffman published a paper in 1952 that modified it to create Huffman coding.

Shannon-Fano coding achieves the same results as Huffman coding in this example, but works from the top down, instead of the bottom up. Using the same set of example characters as above, we arrange them in order of frequency:

F     B     C     A     G     D     E
66    64    32    29    23    12    9

Now we divide the set into two parts, as close to equal as possible. The sum of the frequencies of F and B is 130, while the sum of the rest of the characters is 105, and we can't break the set more closely than that:
        130 : 105
   ...................
   :                 :
   :                 :
F     B     C     A     G     D     E
66    64    32    29    23    12    9

We continue breaking down each branch until they can't be broken down further. Breaking down the branch with the F and B is easy, since there's only two characters there. For the other branch, the frequencies of C and A add up to 61, while the frequencies of the rest of the characters in that branch add up to 44, which is again as close to equal as we can get:

        130 : 105
   ...................
   :                 :
   :              61 : 44
.......        ..............
:     :        :            :
:     :        :            :
F     B     C     A     G     D     E
66    64    32    29    23    12    9

We can complete the tree using the same procedure, and then assign 0s and 1s to the branches as before:

        130 : 105
   ...................
   :                 :
   :              61 : 44
   :           ..............
   :           :            :
   :           :         23 : 21
.......     .......     ..........
:     :     :     :     :        :
:     :     :     :     :        :
F     B     C     A     G     D     E
66    64    32    29    23    12    9


        130 : 105
   ...................
   :0                :1
   :              61 : 44
   :           ..............
   :           :0           :1
   :           :         23 : 21
   :           :        ..........
   :           :        :0       :1
   :           :        :        :
.......     .......     :     .......
:0    :1    :0    :1    :     :0    :1
:     :     :     :     :     :     :
F     B     C     A     G     D     E
66    64    32    29    23    12    9

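For this example, the codes that result are exactly the same as would be obtained with Huffman coding. That is not guaranteed in general: Huffman's bottom-up procedure always yields an optimal code, while the top-down split can occasionally yield a slightly longer one.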
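
The top-down splitting can be sketched recursively. This is a minimal illustration with an invented function name, and it assumes the simplest splitting rule: cut the frequency-ordered list at the point where the two halves' totals are closest, taking the first such point if there is a tie.

```python
def shannon_fano(freqs):
    """Return {symbol: bit string} by recursively splitting the sorted list into near-equal halves."""
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        total = sum(f for _, f in group)
        running, best_cut, best_gap = 0, 1, None
        for cut in range(1, len(group)):
            running += group[cut - 1][1]
            gap = abs(total - 2 * running)   # |left total - right total|
            if best_gap is None or gap < best_gap:
                best_cut, best_gap = cut, gap
        split(group[:best_cut], prefix + "0")   # upper half gets a 0
        split(group[best_cut:], prefix + "1")   # lower half gets a 1

    split(ordered, "")
    return codes


FREQS = {"A": 29, "B": 64, "C": 32, "D": 12, "E": 9, "F": 66, "G": 23}
sf_codes = shannon_fano(FREQS)
```

For the example frequencies this gives F as 00, B as 01, C as 100, A as 101, G as 110, D as 1110, and E as 1111. With other frequency tables the top-down split can do slightly worse than a Huffman code: for frequencies of 15, 7, 6, 6, and 5, this sketch spends 89 bits in total where a Huffman code spends 87.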
* Of course, Huffman codes can be used on any type of data, such as bytes in a graphics file. They are not very effective if the data in the file is strongly random, as the frequencies of different characters or bytes would be close to the same, but random data files are hard to compress by any technique.

Fax machines, which transmit a graphics image, use a combination of RLE and Huffman compression to achieve compression ratios of about 10:1. Huffman coding techniques are also used in conjunction with other compression schemes to improve compression ratios, and are used in this fashion with the popular LHA and PKZIP archivers.

* As the average frequencies of letters in every language are distinctive and well known, a fact which is incidentally of great importance to codebreakers, it is possible to devise a Huffman code table for text files written in a specific language. A receiver can use the same "default" or "canonical" Huffman codes to decompress the message.

However, average frequencies are just that, averages, and individual messages will differ from that average to a greater or lesser degree. That means that better compressions can be obtained by analyzing a specific file and building a "custom" Huffman code for that file. This requires that the custom table of Huffman codes be output to the file along with the compressed data stream, to allow the decoder to uncompress the data.

Another variation on this idea is known as "adaptive" Huffman coding. In this approach, both the coder and decoder start with a predefined Huffman table, but both keep statistics on the frequency of different characters in the compressed data stream, and modify the Huffman codes on a continuous basis by modifying the Huffman tree as needed. As both the coder and decoder use the same modification algorithm, the decoder will adjust its Huffman table in the same way as the coder, and the two will remain in sync.


[1.4] ARITHMETIC CODING
* Huffman coding looks pretty slick, and it is, but there's a way to improve on it, known as "arithmetic coding". The idea is subtle and best explained by example.

Suppose we have a message that only contains the characters A, B, and C, with the following frequencies, expressed as fractions:

A: 0.5
B: 0.2
C: 0.3

To show how arithmetic compression works, we first set up a table, listing characters with their probabilities along with the cumulative sum of those probabilities. The cumulative sum defines "intervals", ranging from the bottom value to less than, but not equal to, the top value. The order in which characters are listed in the table does not seem to be important, except to the extent that both the coder and decoder have to know what the order is.

letter   probability   interval
______   ___________   _________

C:       0.3           0.0 : 0.3
B:       0.2           0.3 : 0.5
A:       0.5           0.5 : 1.0
______   ___________   _________

Now each character can be coded by the shortest binary fraction whose value falls in the character's probability interval:

letter   probability   interval    binary fraction
______   ___________   _________   ____________________

C:       0.3           0.0 : 0.3   0
B:       0.2           0.3 : 0.5   0.011 = 3/8 = 0.375
A:       0.5           0.5 : 1.0   0.1   = 1/2 = 0.5
______   ___________   _________   ____________________

This shows how single characters can be assigned minimum-length binary codes. However, arithmetic coding doesn't stop there and simply translate the individual characters in a message into these binary codes. It takes a subtler approach, assigning binary fractions to complete messages.
To start, let's consider sending messages consisting of all possible permutations of two of these three characters. We determine the probability of the two-character strings by multiplying the probabilities of the two characters, and then set up a series of intervals using those probabilities.

string   probability   interval       binary fraction
______   ___________   ____________   _______________________

CC:      0.09          0.00 : 0.09    0.0001 = 1/16  = 0.0625
CB:      0.06          0.09 : 0.15    0.001  = 1/8   = 0.125
CA:      0.15          0.15 : 0.30    0.01   = 1/4   = 0.25
BC:      0.06          0.30 : 0.36    0.0101 = 5/16  = 0.3125
BB:      0.04          0.36 : 0.40    0.011  = 3/8   = 0.375
BA:      0.10          0.40 : 0.50    0.0111 = 7/16  = 0.4375
AC:      0.15          0.50 : 0.65    0.1    = 1/2   = 0.5
AB:      0.10          0.65 : 0.75    0.1011 = 11/16 = 0.6875
AA:      0.25          0.75 : 1.00    0.11   = 3/4   = 0.75
______   ___________   ____________   _______________________

The higher the probability of the string, in general the shorter the binary fraction needed to represent it.
Let's build a similar table for three characters now:

string   probability   interval        binary fraction
______   ___________   _____________   _______________________________

CCC      0.027         0.000 : 0.027   0.000001  = 1/64   = 0.015625
CCB      0.018         0.027 : 0.045   0.00001   = 1/32   = 0.03125
CCA      0.045         0.045 : 0.090   0.0001    = 1/16   = 0.0625
CBC      0.018         0.090 : 0.108   0.00011   = 3/32   = 0.09375
CBB      0.012         0.108 : 0.120   0.000111  = 7/64   = 0.109375
CBA      0.030         0.120 : 0.150   0.001     = 1/8    = 0.125
CAC      0.045         0.150 : 0.195   0.0011    = 3/16   = 0.1875
CAB      0.030         0.195 : 0.225   0.00111   = 7/32   = 0.21875
CAA      0.075         0.225 : 0.300   0.01      = 1/4    = 0.25

BCC      0.018         0.300 : 0.318   0.0101    = 5/16   = 0.3125
BCB      0.012         0.318 : 0.330   0.010101  = 21/64  = 0.328125
BCA      0.030         0.330 : 0.360   0.01011   = 11/32  = 0.34375
BBC      0.012         0.360 : 0.372   0.0101111 = 47/128 = 0.3671875
BBB      0.008         0.372 : 0.380   0.011     = 3/8    = 0.375
BBA      0.020         0.380 : 0.400   0.011001  = 25/64  = 0.390625
BAC      0.030         0.400 : 0.430   0.01101   = 13/32  = 0.40625
BAB      0.020         0.430 : 0.450   0.0111    = 7/16   = 0.4375
BAA      0.050         0.450 : 0.500   0.01111   = 15/32  = 0.46875

ACC      0.045         0.500 : 0.545   0.1       = 1/2    = 0.5
ACB      0.030         0.545 : 0.575   0.1001    = 9/16   = 0.5625
ACA      0.075         0.575 : 0.650   0.101     = 5/8    = 0.625
ABC      0.030         0.650 : 0.680   0.10101   = 21/32  = 0.65625
ABB      0.020         0.680 : 0.700   0.1011    = 11/16  = 0.6875
ABA      0.050         0.700 : 0.750   0.10111   = 23/32  = 0.71875
AAC      0.075         0.750 : 0.825   0.11      = 3/4    = 0.75
AAB      0.050         0.825 : 0.875   0.11011   = 27/32  = 0.84375
AAA      0.125         0.875 : 1.000   0.111     = 7/8    = 0.875
______   ___________   _____________   _______________________________

Obviously, this same procedure can be followed for more characters, resulting in a longer binary fractional value. What arithmetic coding does is find the probability value of an entire message, and arrange it as part of a numerical order that allows its unique identification.
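
Nobody builds these tables in practice, of course. The interval for any one message can be computed directly by narrowing the range from 0 to 1 once per character. The sketch below does that with exact fractions; the function names are invented for this example, and it is a toy for short messages, not a practical coder.

```python
from fractions import Fraction
from math import ceil

# Symbol table from the text, listed in the same order: C, then B, then A.
PROBS = [("C", Fraction("0.3")), ("B", Fraction("0.2")), ("A", Fraction("0.5"))]


def symbol_intervals(probs):
    """Map each symbol to its (low, high) slice of the range from 0 to 1."""
    table, low = {}, Fraction(0)
    for sym, p in probs:
        table[sym] = (low, low + p)
        low += p
    return table


def message_interval(message, probs=PROBS):
    """Narrow the range once per symbol; the result is the message's interval."""
    table = symbol_intervals(probs)
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        s_low, s_high = table[sym]
        low += width * s_low          # move to the symbol's slice of the current range
        width *= s_high - s_low       # and shrink the range to that slice
    return low, low + width


def shortest_binary_fraction(low, high):
    """Return (value, bits): the fewest-bit binary fraction with low <= value < high."""
    bits = 1
    while True:
        value = Fraction(ceil(low * 2**bits), 2**bits)
        if value < high:
            return value, bits
        bits += 1
```

For instance, message_interval("CAB") gives the interval from 0.195 to 0.225, and shortest_binary_fraction applied to that interval gives 7/32 in five bits, matching the "CAB" row of the table. One difference to be aware of: for an interval that starts at zero, such as "CC", this sketch simply returns zero, while the tables above list the shortest nonzero fraction instead.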
* Let's stop here and send one of the binary strings defined in the table above to a decoder. We'll arbitrarily select the binary string with the decimal value of 0.21875 from the table above.

This value was obtained using the probability values and intervals defined earlier:

string   probability   interval
______   ___________   _________

C:       0.3           0.0 : 0.3
B:       0.2           0.3 : 0.5
A:       0.5           0.5 : 1.0
______   ___________   _________

The value 0.21875 clearly falls into the interval for "C", so "C" must be the first character. We can then "zoom in" on the characters that follow the "C" by subtracting the bottom value of the interval for "C", which happens to be 0, and dividing the result by the width of the probability interval for "C", which is 0.3:
(0.21875 - 0) / 0.3 = 0.72917

This is a simple shift and scaling operation.
The result falls into the probability interval for "A", and so the second character must be "A". We can then zoom in on the next character by the same approach as before, subtracting the bottom value of the interval for "A", which is 0.5, and dividing the result by the width of the probability interval for "A", which is also 0.5:

(0.72917 - 0.5) / 0.5 = 0.4583

This clearly falls into the probability interval for "B", and so the string has been correctly uncompressed to "CAB", which is the correct answer.
Unfortunately, this leaves behind a remainder that can be decoded into an indefinitely long string of bogus characters. This is inherent in the method rather than an artifact of the arithmetic: any fraction inside an interval can be subdivided again and again, so the decoder has to be told when to stop. In practice, arithmetic coding is also based on binary fixed-point math of limited precision, not the decimal math used in this example.

One other problem is the fact that the binary fraction that is output by the arithmetic coder is of indefinite length, and the decoder has no idea of where the string ends if it's not told. In practice, a length header can be sent to indicate how long the fraction is, or an end-of-transmission symbol of some sort can be used to tell the decoder where the end of the fraction is.
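
The shift-and-scale decoding loop, along with the need to tell the decoder how many characters to produce, can be sketched as follows. The function name and the explicit "length" argument are assumptions of this example, and exact fractions are used to keep the arithmetic clean.

```python
from fractions import Fraction

# Symbol table from the text, in the same order: C, then B, then A.
PROBS = [("C", Fraction("0.3")), ("B", Fraction("0.2")), ("A", Fraction("0.5"))]


def arithmetic_decode(value, length, probs=PROBS):
    """Decode `length` symbols from a fraction in the range from 0 up to (not including) 1."""
    out = []
    for _ in range(length):
        low = Fraction(0)
        for sym, p in probs:
            if low <= value < low + p:        # found the interval holding the value
                out.append(sym)
                value = (value - low) / p     # shift to zero, then rescale to full range
                break
            low += p
    return "".join(out)
```

Calling arithmetic_decode(Fraction(7, 32), 3) returns "CAB". Asking for five characters instead returns "CABAA", which shows the bogus trailing characters that appear when the decoder isn't told where to stop.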

As with Huffman coding, arithmetic coding can also be performed using an adaptive algorithm, with the coder and decoder starting with a predetermined character probability interval table, tallying characters for their actual frequencies as they are encoded or decoded, and then adjusting the probability interval table accordingly.

* The neat thing about arithmetic coding is that by "smushing" a complete message into a single probability interval value, individual characters can be encoded with the equivalent of fractional values of bits. Huffman coding requires an integer number of bits for each character, and so this is one of the reasons that arithmetic coding is in general more efficient than Huffman coding.

Another reason for the efficiency of arithmetic coding is that more "probable" messages compress to shorter binary strings. By the way, at the risk of belaboring the obvious, the probability only depends on the numbers of different characters in the uncompressed file. Their order is not important, since the result of a multiplication does not depend on the order of its factors.

The problem with arithmetic coding is that it is very computation-intensive and so it is slow. Huffman and arithmetic coding are sometimes referred to as forms of "statistical coding" or "entropy coding". The term "VLW (Variable Length Word)" also seems to pop up on occasion when discussing such coding techniques, though its exact definition seems a bit unclear.


[1.5] LZ-77 ENCODING
* Good as they are, Huffman and arithmetic coding are not perfect for encoding text because they don't capture the higher-order relationships between words and phrases. There is a simple, clever, and effective approach to compressing text known as "LZ-77", which uses the redundant nature of text to provide compression. This technique was invented by two Israeli computer scientists, Abraham Lempel and Jacob Ziv, in 1977.

LZ-77 exploits the fact that words and phrases within a text file are likely to be repeated. When they do repeat, they can be encoded as a pointer to an earlier occurrence, with the pointer accompanied by the number of characters to be matched.

Pointers and uncompressed characters are distinguished by a leading flag bit, with a "0" indicating a pointer and a "1" indicating an uncompressed character. This means that uncompressed characters are extended from 8 to 9 bits, working against compression a little.

Key to the operation of LZ-77 is a sliding history buffer, also known as a "sliding window", which stores the text most recently transmitted. When the buffer fills up, its oldest contents are discarded. The size of the buffer is important. If it is too small, finding string matches will be less likely. If it is too large, the pointers will be larger, working against compression.

For an example, consider the phrase:

the_rain_in_Spain_falls_mainly_in_the_plain

-- where the underscores ("_") indicate spaces. This uncompressed message is 43 bytes, or 344 bits, long.
At first, LZ-77 simply outputs uncompressed characters, since there are no previous occurrences of any strings to refer back to. In our example, these characters will not be compressed:

the_rain_

The next chunk of the message:
in_

-- has occurred earlier in the message, and can be represented as a pointer back to that earlier text, along with a length field. This gives:
the_rain_<3,3>

-- where the pointer syntax means "look back three characters and take three characters from that point." There are two different binary formats for the pointer:

An 8-bit pointer plus 4-bit length, which assumes a maximum offset of 255 and a maximum length of 15.

A 12-bit pointer plus 6-bit length, which assumes a maximum offset of 4095, implying a 4 kilobyte buffer, and a maximum length of 63.
As noted, a flag bit with a value of 0 indicates a pointer. This is followed by a second flag bit giving the size of the pointer, with a 0 indicating an 8-bit pointer, and a 1 indicating a 12-bit pointer. So, in binary, the pointer <3,3> would look like this:

00 00000011 0011

The first two bits are the flag bits, indicating a pointer that is 8 bits long. The next 8 bits are the pointer value, while the last four bits are the length value.
After this comes:

Sp

-- which has to be output uncompressed:
the_rain_<3,3>Sp

However, the characters "ain_" have already been sent, so they are encoded with a pointer:
the_rain_<3,3>Sp<9,4>

Notice here, once again at the risk of belaboring the obvious, that the pointers refer to offsets in the uncompressed message. As the decoder receives the compressed data, it uncompresses it, so it has access to the parts of the uncompressed message that the pointers reference.
The characters "falls_m" are output uncompressed, but "ain" has been used before in "rain" and "Spain", so once again it is encoded with a pointer:

the_rain_<3,3>Sp<9,4>falls_m<11,3>

Notice that this refers back to the "ain" in "Spain", and not the earlier "rain". This ensures a smaller pointer.
The characters "ly_" are output uncompressed, but "in_" and "the_" were output earlier, and so they are sent as pointers:

the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>

Finally, the characters "pl" are output uncompressed, followed by another pointer to "ain". Our original message:
the_rain_in_Spain_falls_mainly_in_the_plain

-- has now been compressed into this form:
the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>pl<15,3>

This gives 23 uncompressed characters at 9 bits apiece, plus six 14-bit pointers, for a total of 291 bits as compared to the uncompressed text of 344 bits. This is not bad compression for such a short message, and of course compression gets better as the buffer fills up, allowing more matches.
LZ-77 will typically compress text to a third or less of its original size. The hardest part to implement is the search for matches in the buffer. Implementations use binary trees or hash tables to ensure a fast match.
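
To check the worked example, the sketch below decodes the pointer stream given above and tallies its size in bits. It also includes a simple brute-force encoder to show the matching step. All names are invented for this illustration; the encoder searches the window the slow, obvious way instead of using trees or hash tables, and since it always takes the longest match it may choose different pointers than the hand-worked example.

```python
def lz77_decode(tokens):
    """Rebuild text from single-character literals and (offset, length) pointers."""
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.append(token)
        else:
            offset, length = token
            for _ in range(length):
                out.append(out[-offset])   # copy one character at a time from `offset` back
    return "".join(out)


def lz77_encode(text, window=255, max_len=15, min_len=3):
    """Greedy encoder: take the longest match in the window, preferring the nearest one."""
    tokens, pos = [], 0
    while pos < len(text):
        best_len, best_off = 0, 0
        for start in range(max(0, pos - window), pos):
            length = 0
            while (length < max_len and pos + length < len(text)
                   and text[start + length] == text[pos + length]):
                length += 1
            if length >= best_len:         # ">=" so later (nearer) matches win ties
                best_len, best_off = length, pos - start
        if best_len >= min_len:
            tokens.append((best_off, best_len))
            pos += best_len
        else:
            tokens.append(text[pos])
            pos += 1
    return tokens


def bit_cost(tokens):
    """9 bits per literal; 14 per pointer (2 flag bits + 8-bit offset + 4-bit length)."""
    return sum(9 if isinstance(t, str) else 14 for t in tokens)


MESSAGE = "the_rain_in_Spain_falls_mainly_in_the_plain"

# The compressed form from the text, written out as tokens.
EXAMPLE = (list("the_rain_") + [(3, 3)] + list("Sp") + [(9, 4)] + list("falls_m")
           + [(11, 3)] + list("ly_") + [(16, 3), (34, 4)] + list("pl") + [(15, 3)])
```

Here lz77_decode(EXAMPLE) reproduces the original message, and bit_cost(EXAMPLE) comes to 291 bits, in agreement with the count above.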

There are several variations on LZ-77, the best known being "LZSS", which was published by Storer and Szymanski in 1982. The differences between the two are unclear from the sources I have access to.

More drastic modifications of LZ-77 include a second level of compression on the output of the LZ-77 coder. "LZH", for example, performs the second level of compression using Huffman coding, and is used in the popular LHA archiver. The "ZIP" algorithm, which is probably the standard archiver and is available in a number of implementations, uses Shannon-Fano coding for the second level of compression in its older "implode" method, while its later "deflate" method uses Huffman coding.


[1.6] LZW CODING
* LZ-77 is an example of what is known as "substitutional coding". There are other schemes in this class of coding algorithms. Lempel and Ziv came up with an improved scheme in 1978, appropriately named "LZ-78", and it was refined by a Mr. Terry Welch in 1984, making it "LZW".

As illustrated in the previous section, LZ-77 uses pointers to previous words or parts of words in a file to obtain compression. LZW takes that scheme one step further, actually constructing a "dictionary" of words or parts of words in a message, and then using pointers to the words in the dictionary.

Let's go back to the example message used in the previous section:

the_rain_in_Spain_falls_mainly_in_the_plain

The LZW algorithm stores strings in a "dictionary" with entries for 4,096 variable-length strings. The first 256 entries, numbered 0 through 255, are used to contain the values for individual bytes, so the actual first string index is 256. As the message is compressed, the dictionary is built up from the strings encountered in the message, starting with two characters, then three characters, and so on.
For example, we scan through the message to build up dictionary entries as follows:

256 -> th < th > e_rain_in_Spain_falls_mainly_in_the_plain
257 -> he t < he > _rain_in_Spain_falls_mainly_in_the_plain
258 -> e_ th < e_ > rain_in_Spain_falls_mainly_in_the_plain
259 -> _r the < _r > ain_in_Spain_falls_mainly_in_the_plain
260 -> ra the_ < ra > in_in_Spain_falls_mainly_in_the_plain
261 -> ai the_r < ai > n_in_Spain_falls_mainly_in_the_plain
262 -> in the_ra < in > _in_Spain_falls_mainly_in_the_plain
263 -> n_ the_rai < n_ > in_Spain_falls_mainly_in_the_plain
264 -> _i the_rain < _i > n_Spain_falls_mainly_in_the_plain

The next two-character string in the message is "in", but this has already been included in the dictionary in entry 262. This means we now set up the three-character string "in_" as the next dictionary entry, and then go back to adding two-character strings:
265 -> in_ the_rain_ < in_ > Spain_falls_mainly_in_the_plain
266 -> _S the_rain_in < _S > pain_falls_mainly_in_the_plain
267 -> Sp the_rain_in_ < Sp > ain_falls_mainly_in_the_plain
268 -> pa the_rain_in_S < pa > in_falls_mainly_in_the_plain

The next two-character string is "ai", but that's already in the dictionary at entry 261, so we now add an entry for the three-character string "ain":
269 -> ain the_rain_in_Sp < ain > _falls_mainly_in_the_plain

Since "n_" is already stored in dictionary entry 263, we now add an entry for "n_f":
270 -> n_f the_rain_in_Spai < n_f > alls_mainly_in_the_plain
271 -> fa the_rain_in_Spain_ < fa > lls_mainly_in_the_plain
272 -> al the_rain_in_Spain_f < al > ls_mainly_in_the_plain
273 -> ll the_rain_in_Spain_fa < ll > s_mainly_in_the_plain
274 -> ls the_rain_in_Spain_fal < ls > _mainly_in_the_plain
275 -> s_ the_rain_in_Spain_fall < s_ > mainly_in_the_plain
276 -> _m the_rain_in_Spain_falls < _m > ainly_in_the_plain
277 -> ma the_rain_in_Spain_falls_ < ma > inly_in_the_plain

Since "ain" is already stored in entry 269, we add an entry for the four-character string "ainl":
278 -> ainl the_rain_in_Spain_falls_m < ainl > y_in_the_plain
279 -> ly the_rain_in_Spain_falls_main < ly > _in_the_plain
280 -> y_ the_rain_in_Spain_falls_mainl < y_ > in_the_plain

Since the string "_i" is already stored in entry 264, we add an entry for the string "_in":
281 -> _in the_rain_in_Spain_falls_mainly < _in > _the_plain

Since "n_" is already stored in dictionary entry 263, we add an entry for "n_t":
282 -> n_t the_rain_in_Spain_falls_mainly_i < n_t > he_plain

Since "th" is already stored in dictionary entry 256, we add an entry for "the":
283 -> the the_rain_in_Spain_falls_mainly_in_ < the > _plain

Since "e_" is already stored in dictionary entry 258, we add an entry for "e_p":
284 -> e_p the_rain_in_Spain_falls_mainly_in_th < e_p > lain
285 -> pl the_rain_in_Spain_falls_mainly_in_the_ < pl > ain
286 -> la the_rain_in_Spain_falls_mainly_in_the_p < la > in

The remaining characters form a string already contained in entry 269, so there is no need to put it in the dictionary.
We now have a dictionary containing the following strings:

256 -> th
257 -> he
258 -> e_
259 -> _r
260 -> ra
261 -> ai
262 -> in
263 -> n_
264 -> _i
265 -> in_
266 -> _S
267 -> Sp
268 -> pa
269 -> ain
270 -> n_f
271 -> fa
272 -> al
273 -> ll
274 -> ls
275 -> s_
276 -> _m
277 -> ma
278 -> ainl
279 -> ly
280 -> y_
281 -> _in
282 -> n_t
283 -> the
284 -> e_p
285 -> pl
286 -> la

Please remember the dictionary is a means to an end, not an end in itself. The LZW coder simply uses it as a tool to generate a compressed output. The coder does not output the dictionary to the compressed output file. The decoder doesn't need it. While the coder is building up the dictionary, it sends characters to the compressed data output until it hits a string that's in the dictionary. It outputs an index into the dictionary for that string, and then continues output of characters until it hits another string in the dictionary, causing it to output another index, and so on. That means that the compressed output for our example message looks like this:
the_rain_<262>_Sp<261><263>falls_m<269>ly<264><263><256><258>pl<269>
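
The coding procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not a production coder: the names "lzw_encode" and "show" are made up for this example, the dictionary is an ordinary Python dict with no 4K size limit, and single characters are shown as themselves while dictionary hits are shown as "<index>", the same way as in the output listing above.

```python
def lzw_encode(message):
    """Return (tokens, dictionary) for a simple LZW coder (illustrative sketch)."""
    # Codes 0..255 stand for single characters; new strings start at 256.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    tokens = []
    current = ""
    for ch in message:
        if current + ch in dictionary:
            # Keep extending the match while it is still in the dictionary.
            current += ch
            continue
        # Longest match found: output it, then store the match plus one character.
        tokens.append(dictionary[current])
        dictionary[current + ch] = next_code
        next_code += 1
        current = ch
    if current:
        tokens.append(dictionary[current])
    return tokens, dictionary


def show(tokens):
    """Print single characters as-is and dictionary hits as <index>."""
    return "".join(chr(t) if t < 256 else f"<{t}>" for t in tokens)


message = "the_rain_in_Spain_falls_mainly_in_the_plain"
tokens, dictionary = lzw_encode(message)
print(show(tokens))
# the_rain_<262>_Sp<261><263>falls_m<269>ly<264><263><256><258>pl<269>
```

Running this on the example message gives the same 31 dictionary entries (256 through 286) as the walkthrough.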

The decoder constructs the dictionary as it reads and uncompresses the compressed data, building up dictionary entries from the uncompressed characters and dictionary entries it has already established.
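
A matching decoder can be sketched the same way. Again, this is an illustration under assumptions: "parse" and "lzw_decode" are hypothetical names, and the input is the printed form of the compressed output from the example, not a real bit stream. The decoder starts with only the single-character entries and adds one new entry per token, made from the previous string plus the first character of the current one.

```python
import re


def parse(shown):
    """Turn the printed form (characters and <index> markers) into integer codes."""
    return [int(num) if num else ord(ch)
            for num, ch in re.findall(r"<(\d+)>|(.)", shown)]


def lzw_decode(tokens):
    """Return (text, dictionary), rebuilding the dictionary while decoding."""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[tokens[0]]
    out = [previous]
    for code in tokens[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The coder used an entry it had only just created.
            entry = previous + previous[0]
        else:
            raise ValueError(f"unexpected code {code}")
        out.append(entry)
        # Same rule as the coder: previous string plus first character of this one.
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(out), dictionary


shown = "the_rain_<262>_Sp<261><263>falls_m<269>ly<264><263><256><258>pl<269>"
text, rebuilt = lzw_decode(parse(shown))
print(text)
# the_rain_in_Spain_falls_mainly_in_the_plain
```

The rebuilt dictionary ends up with the same entries 256 through 286 as the coder's, even though the dictionary itself was never transmitted.
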
* One puzzling thing about LZW is why the first 256 entries in the 4K buffer (indexes 0 through 255) are initialized to single-character strings. There would be no point in setting pointers to single characters, as the pointers would be longer than the characters, and in practice that's not done anyway. I speculate that the single characters are put in the buffer just to simplify searching the buffer.

As this example of compressed output shows, as the message is compressed, the dictionary grows more complete, and the number of "hits" against it increases. Longer strings are also stored in the dictionary, and on average the pointers substitute for longer strings. This means that, up to a limit, the longer the message, the better the compression.

This limit is imposed in the original LZW implementation by the fact that once the 4K dictionary is complete, no more strings can be added. Defining a larger dictionary of course results in greater string capacity, but also longer pointers, reducing compression for messages that don't fill up the dictionary.

A variant of LZW known as "LZC" is used in the UN*X "compress" data compression program. LZC uses variable length pointers up to a certain maximum size. It also monitors the compression of the output stream, and if the compression ratio goes down, it flushes the dictionary and rebuilds it, on the assumption that the new dictionary will be better "tuned" to the current text.

Another refinement of LZW keeps track of string "hits" for each dictionary entry, and overwrites "least recently used" entries when the dictionary fills up. Refinements of LZW provide the core of GIF and TIFF image compression as well.

* LZW is also used in some modem communication schemes, such as the "V.42bis" protocol. V.42bis has an interesting flexibility. Since not all data files compress well under any particular encoding algorithm, the V.42bis protocol monitors the data to see how well it compresses. It sends the data "as is" if it compresses poorly, but will switch to LZW compression using an escape code if it compresses well. Both the transmitter and receiver continue to build up an LZW dictionary even while the transmitter is sending the file uncompressed, and switch over transparently if the escape character is sent.

One of the interesting idiosyncrasies of V.42bis is that it uses a "rolling" escape character. The escape character starts out as a "null" (0) byte, but is incremented by 51 every time it is sent, "wrapping around" when it exceeds 255:

0 51 102 153 204 255 --> 50 101 152 203 254 --> 49 100 ...

If a data byte that matches the escape character is sent, it is sent twice. The escape character is incremented to ensure that the protocol doesn't get bogged down if it runs into an extended string of bytes that match the escape character.
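
The rolling escape sequence is just addition modulo 256. A minimal sketch, with "escape_sequence" as a made-up helper name; it models only the escape value itself, not the rest of the protocol:

```python
def escape_sequence(count, step=51, start=0):
    """Return the first `count` values of the rolling escape character."""
    value = start
    values = []
    for _ in range(count):
        values.append(value)
        value = (value + step) % 256   # wrap around past 255
    return values


print(escape_sequence(13))
# [0, 51, 102, 153, 204, 255, 50, 101, 152, 203, 254, 49, 100]
```

Since 51 is odd, it shares no factor with 256, so the escape character steps through all 256 possible byte values before it repeats. No single byte value can stay "stuck" as the escape character for long.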
