LZWCOM - FILE COMPRESSOR UTILITY
by KENT WILLIAMS

INTRODUCTION

Most information processed by computers contains redundancy - repeated characters, words, phrases, etc. Since computers do not have infinite storage, or infinite time in which to transfer information, there is much interest in ways to remove redundancy from data. The LZWCOM program implements the Lempel/Ziv/Welch data compression algorithm, and allows files to be made smaller for transmission and storage.

Users of RCPMs (Remote CP/M computers) and other bulletin board facilities may be familiar with the SQUEEZE and UNSQUEEZE programs. These programs are the standard way to compress and uncompress information for transfer by modem. They use the Huffman encoding algorithm, which exploits the fact that certain characters are more likely to appear in a stream of data than others. The input data is translated so that frequently occurring characters are represented by short bit strings, and infrequently occurring characters by longer bit strings.

Huffman encoding performs very badly on data that does not conform to the assumptions made about character frequency. The SQUEEZE program gets around this by taking two passes through a file during compression: one to gather statistical information on the data, and a second to actually encode it. A table giving the coding scheme used is added to the front of the file to facilitate decoding. There are some problems with this method. It involves the overhead of making two passes through a file, a time penalty, and attaching a table to the resulting compressed file, a space penalty. It also performs poorly on data in which there is no clear pattern of character frequency, such as object code and floating point numbers. Object files will sometimes actually grow when passed through Huffman encoding.

THE LEMPEL/ZIV ALGORITHM

Lempel/Ziv encoding exploits redundancy on a different level. Instead of viewing data as a stream of characters, it views it as a sequence of strings. A table of strings is built during encoding, whose members have this property: if a string wC (a string w followed by a character C) is in the table, then w is in the table as well. Each element of the string table contains two items: a predecessor, which is the index of the prefix string, and a follower, which is the suffix character. The string signified by a particular element is the predecessor string plus the follower character. Therefore, if one starts out with a table initialized to one-character strings, one can represent each new string by the index of a string already in the table plus one character.

If a string already in the table is encountered, then the index of the string in the table can be written to output, rather than the string itself. This means that Lempel/Ziv encoding can yield better results than Huffman encoding on data that contains repeated character strings but a nearly even distribution of character frequency, such as object code produced by compilers. Its best-case compression potential is also markedly superior, as one code can theoretically represent a string nearly as long as the table is large, whereas Huffman compression's best case is representing one character by one bit, a ratio of 8:1. Worst-case performance is more difficult to determine, but I have never seen LZ compression expand data, whereas I have observed Huffman encoding doing just that.

The convention generally followed is to use a table allowing 4096 strings, so that a twelve-bit code can be used to represent each string. This seems to provide the best trade-off between table size and code length - a shorter code would unduly limit the number of possible strings, and a longer code would make the table prohibitively large. Also, 12-bit codes are relatively easy to pack and unpack with a microcomputer.
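To make the predecessor/follower representation concrete, here is a minimal sketch of a table entry and of recovering the text of the string an entry stands for. The declarations actually used by the programs appear in the listings at the end of this article; the recover_string() routine here is purely illustrative and is not part of LZWCOM.

    /* A minimal sketch of the string-table representation described above. */
    /* Illustrative only; the real declarations appear in COMMLZW.C.        */

    #define TABSIZE 4096
    #define NO_PRED (-1)

    struct entry {
        int  predecessor;   /* index of the prefix string, or NO_PRED */
        char follower;      /* last character of this string          */
    };

    struct entry table[TABSIZE];

    /* Recover the text of the string represented by 'index' by walking
       the predecessor chain backwards, then reversing.  Returns length. */
    int recover_string(int index, char *buf)
    {
        int len = 0, i;
        char tmp;

        while (index != NO_PRED) {
            buf[len++] = table[index].follower;
            index = table[index].predecessor;
        }
        for (i = 0; i < len / 2; i++) {   /* characters came out backwards */
            tmp = buf[i];
            buf[i] = buf[len - 1 - i];
            buf[len - 1 - i] = tmp;
        }
        return len;
    }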
COMPRESSION ALGORITHM

The compressor makes one pass through the source file, progressively biting off larger strings and dynamically building the string table. The decompressor is able to repeat the process in reverse, proceeding from the fact that it starts from the same table of one-character, or 'atomic', strings. The mechanics of the algorithm are actually more difficult to describe and comprehend than they are to implement. This pseudo-code should help to explain it.

Assume a table whose elements contain: a predecessor code, which is either NO_PREDECESSOR or the index of another element in the table (signifying a prefix string), and a follower character (the last character of the string). The function 'code' combines these two elements to produce a unique code (i.e. an index into the table) for the resulting string.

    initialize the table to 'atomic' strings, i.e. (NO_PREDECESSOR, character)
    read a character C from input
    Current_Code = code(NO_PREDECESSOR, C)
    WHILE getchar(C) <> end of file
        IF code(Current_Code, C) is in the table THEN
            Current_Code = code(Current_Code, C)
        ELSE
            write Current_Code to output
            IF there is still room in the string table THEN
                add code(Current_Code, C) to the string table
            ENDIF
            Current_Code = code(NO_PREDECESSOR, C)
        ENDIF
    ENDWHILE
    write Current_Code to output        ( always one left )

The first character to be translated is a special case: its code is looked up immediately and becomes the initial Current_Code. (As I have implemented the algorithm, the codes correspond directly to the index of an entry in the string table.) It then serves as a 'seed' for the rest of the processing. Whether it is the first or any following pass through the main loop, the last known code (Current_Code) is used as the starting point for parsing a string from input. It is combined with the character read from input to generate a new code. If that new code is in the table already, then it becomes Current_Code, and you go back to the top of the loop to get another character.

When code(Current_Code, C) isn't in the string table, Current_Code is written to output, the string code(Current_Code, C) is added to the string table, and code(NO_PREDECESSOR, C) becomes Current_Code. When end of file is reached, you fall out of the loop, write the last code to output, and exit.

The actual code to implement this process (outside of sundry housekeeping and support functions) takes up 30 lines. Though the concepts behind it may seem elusive, the algorithm has a certain elegance.
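For readers who would rather see C than pseudo-code, the loop above boils down to something like the following sketch. The helpers find_code(), add_code(), and put_code() are hypothetical stand-ins for the unhash(), upd_tab(), and putcode() routines in the listings that follow; table-full bookkeeping and error handling are omitted.

    /* A minimal sketch of the compression loop, assuming hypothetical helpers:
       find_code(pred, c)  - returns the table index of string pred+c, or -1
       add_code(pred, c)   - adds string pred+c to the table
       put_code(out, code) - writes one 12-bit code to the output stream      */

    #include <stdio.h>

    #define NO_PRED (-1)

    int  find_code(int pred, int c);
    void add_code(int pred, int c);
    void put_code(FILE *out, int code);

    void compress(FILE *in, FILE *out)
    {
        int c, current, candidate;

        if ((c = getc(in)) == EOF)
            return;                          /* empty input: nothing to do */
        current = find_code(NO_PRED, c);     /* atomic code for first char */

        while ((c = getc(in)) != EOF) {
            candidate = find_code(current, c);
            if (candidate != -1) {
                current = candidate;         /* keep extending the string   */
            } else {
                put_code(out, current);      /* longest known string ends   */
                add_code(current, c);        /* remember the new, longer one */
                current = find_code(NO_PRED, c);  /* restart from this char */
            }
        }
        put_code(out, current);              /* always one code left over   */
    }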
DECOMPRESSION ALGORITHM

Reversing the process is a little more complex, as I found out during debugging. For several days it seemed that I had achieved irreversible data compression. The pseudo-code:

    initialize table (same as compression)
    Current_Code = Old_Code = getcode()
    output String_table[Current_Code].follower       (first character of file)
    Final_Char = String_table[Current_Code].follower
    WHILE (Current_Code = getcode()) <> end of file
        Intermediate = Current_Code                  (save a copy for later)
        IF Current_Code isn't in the table THEN      (a special case)
            push Final_Char onto the stack           (it pops out last, as the final
                                                      character of the unknown string)
            Current_Code = Old_Code
            Intermediate = code(Old_Code, Final_Char)  (save for later)
        ENDIF
        WHILE String_table[Current_Code].predecessor <> NO_PREDECESSOR
                                    (while we haven't reached an 'atomic' code)
            push String_table[Current_Code].follower
                                    (decode the string backwards onto a stack)
            Current_Code = String_table[Current_Code].predecessor
                                    (walk backwards up the string)
        ENDWHILE
        output String_table[Current_Code].follower and assign it to Final_Char
                                    (first character in string)
        WHILE stack isn't empty
            pop characters and write them to output
        ENDWHILE
        add code(Old_Code, Final_Char) to the string table
        Old_Code = Intermediate     (this is the 'later' we were saving it for)
    ENDWHILE

Decompression hinges on the fact that, with one exception, the algorithm is never presented with a code that isn't either atomic or one that it has already added to the table itself. Decoding, therefore, is a matter of decoding strings in the reverse of the way they were encoded.

The one ugly exception happens when the compressor encounters a string of the form KwKwK, where K is any character, w is any (possibly empty) string, and Kw is already in the table. The problem is that the compressor writes the code for Kw to output, and then adds KwK to the string table. The decompressor will also have Kw in the table, but it will not have KwK, which is what the compressor will send it next. Luckily (and someone smarter than I has verified this), this is the only time the compressor will send an unknown code, and in this case the following relationships hold:

1. The uncompressor knows that the unknown code is an extension of the prior code.

2. The most recently translated character (Final_Char) is the first character of the prior code's string.

3. The final character of the unknown code is the same as the first character of the prior code's string.

Once any unknown codes are taken care of, it becomes simple to repeatedly pull off the last character of the code, push it on a stack, and set the code variable to its predecessor, until the predecessor is NO_PREDECESSOR. Once this is the case, the follower character of that code is sent to output. Then the characters on the stack, if any, are popped off and sent to output. The table is updated with the string produced by the prior code and the current final character, and you go back to the top of the loop.
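Again as a rough C rendering of the pseudo-code, the heart of the decompressor might look like the sketch below. The helpers get_code(), known(), output_string(), and add_code() are hypothetical; in the actual program the work is done by getcode(), the 'used' flag in string_tab, the stack routines, and upd_tab().

    /* A minimal sketch of the decompression loop, assuming hypothetical helpers:
       get_code(in)             - reads the next 12-bit code, or EOF
       known(code)              - nonzero if 'code' is already in the table
       output_string(out, code) - decodes 'code' (via a stack, as in the
                                  pseudo-code), writes it, and returns the
                                  FIRST character of the string
       add_code(pred, c)        - adds string pred+c to the table              */

    #include <stdio.h>

    int  get_code(FILE *in);
    int  known(int code);
    int  output_string(FILE *out, int code);
    void add_code(int pred, int c);

    void decompress(FILE *in, FILE *out)
    {
        int code, old_code, final_char;

        if ((code = get_code(in)) == EOF)
            return;                                  /* empty input */
        old_code = code;
        final_char = output_string(out, code);       /* first code is always atomic */

        while ((code = get_code(in)) != EOF) {
            if (!known(code)) {
                /* The KwKwK special case: the unknown code is the prior
                   string extended by that string's own first character.  */
                final_char = output_string(out, old_code);  /* the prior string, Kw   */
                putc(final_char, out);                      /* plus its first char, K */
            } else {
                final_char = output_string(out, code);
            }
            add_code(old_code, final_char);  /* the entry the compressor added last step */
            old_code = code;
        }
    }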
THE PROGRAM

The program will compile and run properly so long as the standard C environment is supported by the compiler, up to and including the UNIX C compiler. If you plan to use BDS C, or another non-standard compiler, the areas of incompatibility are isolated in the file I/O section at the end of COMMLZW.C. Other possible problem areas might be:

1. Command line arguments, which are often not supported by subset compilers.

2. The use of long integers in the function h() in COMMLZW.C. The function will work with 16-bit integers, but you lose several bits of precision, robbing the mid-square hash function of some of its randomness.

3. The '#ifdef' preprocessor directive, which is not supported by all subset compilers. These directives can be removed with no ill effect.

The '#ifdef DEBUG' statements have been left in the source code to allow you to view the compression of text files on the console. If the program is compiled with DEBUG defined, and you compress or decompress a text file, the codes being read or written, along with the characters that they stand for, will be sent to standard output. This will give you a better grasp of the operation of the algorithm.

The source code modules LZWCOM.C and LZWUNC.C are the compressor and uncompressor main programs respectively. Aside from housekeeping code at the top and bottom, they are very literal translations of the pseudo-code presented above. LZWUNC uses a pointer to the current entry extensively, to try to minimize the overhead of calculating offsets into arrays. This is (hopefully) dealt with in a fairly straightforward manner.

COMMLZW.C contains the routines that support maintenance of the string table, along with the peculiar I/O requirements of the program. Since the program spends 90% of its time putting things into and looking things up in the string table, I have tried to make this process more efficient. Each element of the string table contains a member called 'next', which points to the next element that hashed to the same location in the table (zero marks the end of the list). If a collision occurs on hashing, a linked list is built of all the elements that hashed to that location. This became necessary because, as the table filled up, it took longer and longer to determine whether a particular code was in the table, or to find an empty slot to use for a new entry. On my Z80 system, the program would 'go away' for minutes at a time while processing one sector. Now, if a particular string is not in a relatively short linked list, it isn't in the table, and when a slot is needed for a new string, searches through 'full' areas are minimized.
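As a simplified illustration of the lookup just described (the actual routine is unhash() in COMMLZW.C below), the search down a collision list might be sketched like this, using the same 'next' convention; mid_square_hash() is a hypothetical stand-in for the function h().

    /* A simplified sketch of searching a collision list for the string
       (pred, foll).  Illustrative only; the real routine is unhash().   */

    #define NOT_FND (-1)

    struct entry {
        char used;
        int  next;          /* next entry that hashed here; 0 = end of list */
        int  predecessor;
        char follower;
    };

    extern struct entry string_tab[];
    extern int mid_square_hash(int pred, int foll);   /* stands in for h() */

    int find_string(int pred, int foll)
    {
        int i = mid_square_hash(pred, foll);    /* head of the list */

        for (;;) {
            if (string_tab[i].predecessor == pred && string_tab[i].follower == foll)
                return i;                       /* found: i is the string's code */
            if (string_tab[i].next == 0)
                return NOT_FND;                 /* end of list: not in the table */
            i = string_tab[i].next;             /* follow the list */
        }
    }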
COMPILING

The files LZWCOM.C, LZWUNC.C, and COMMLZW.C are all compiled separately. LZWCOM and LZWUNC must then each be linked with COMMLZW and the standard library. The submit file I used with Aztec C is:

    cz -f -q lzwcom
    as -zap lzwcom
    cz -f -q lzwunc
    as -zap lzwunc
    cz -f -q commlzw
    as -zap commlzw
    ln lzwcom.o commlzw.o c.lib
    ln lzwunc.o commlzw.o c.lib

'-f' forces the compiler to do in-line frame allocation, and '-q' tells it to promote automatic variables to statics. These switches speed up execution somewhat, in theory. The '-zap' option on assembly deletes the assembly source file when the assembler is done. Different compilers, I am sure, require different switches, but the basic principle of separate compilation will hold.

RUNNING

LZWCOM and LZWUNC both take two command line arguments: a source file name and a destination file name. The compressor reads an uncompressed file as its source and outputs a compressed file as its destination. The decompressor, of course, operates in reverse.

EXAMPLE (on a CP/M system)

    A>LZWCOM COMPRESS.DOC COMPRESS.DQC
    .............................................................
    A>LZWUNC COMPRESS.DQC COMPRESS.DOC
    ................................

Both programs print a period for every sector read. The uncompressor suppresses the periods when its output file is the console, since they would interfere with viewing the file on the screen; sending the output of the uncompressor to the printer or other devices causes no problems.

On a 4 MHz Z-80 CP/M system, the compressor can process a sector about every 2 seconds, slowing down slightly when the string table begins to fill up. The uncompressor runs slightly faster than that. If you are using a more advanced system, your results should be better.

POSSIBLE ENHANCEMENTS

This method of compression will tend to lose some efficiency on very large files, since once the string table is full the compressor no longer adapts to the data. It might be a good idea in those cases to break longer files into blocks of 16K or so, and reinitialize the table between blocks.

The readc and writec routines could be changed to buffer more data at each disk access. As it stands, if you take input from one floppy disk and send output to another, the drive motors keep up a regular tango beat.

The source code, as it stands, is as optimized as I could make it. The program can, however, be sped up markedly by converting the hash and unhash routines to assembly code. The program spends most of its time putting things into and taking things out of the string table, so this can improve run time quite a bit. Just doing some simple hand 'peephole' optimization of the assembly source produced by the Aztec compiler yielded a noticeable improvement.

RESULTS/CONCLUSIONS

In my experience, these programs produce better compression ratios in about the same time as the public domain SQ17.COM program. Terry Welch (who I assume is the Welch in Lempel/Ziv/Welch), in his article on the subject (IEEE Computer, June 1984), did more extensive testing and came up with these results:

    DATA TYPE                       COMPRESSION RATIO
    _________________________________________________
    English Text                          1.8
    Cobol Files                           2 to 6
    Floating Point Arrays                 1.0
    Formatted Scientific Data             2.1
    System Log Data                       2.6
    Program Source Code                   2.3
    Object Code                           1.5

These programs should provide a good way to transfer and store data in a more compressed form. They are written to be portable to a wide variety of systems - virtually any system with a standard C compiler. They are also an implementation of an elegant algorithm - of interest in its own right.
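One detail of the listings that may be easier to see in isolation: the 12-bit codes are packed so that two codes occupy three bytes, which is what putcode() and getcode() in COMMLZW.C do a code at a time, on top of the sector buffering. The fragment below is only a simplified sketch of that packing for a pair of codes already in hand; it is not the code the programs use.

    /* A simplified sketch of 12-bit code packing: two codes -> three bytes.
       The real putcode()/getcode() in COMMLZW.C do this incrementally.      */

    void pack_pair(int code1, int code2, unsigned char out[3])
    {
        out[0] = (code1 >> 4) & 0xFF;                   /* high 8 bits of code1  */
        out[1] = ((code1 & 0x0F) << 4) | ((code2 >> 8) & 0x0F);
                                                        /* low nibble of code1,
                                                           high nibble of code2 */
        out[2] = code2 & 0xFF;                          /* low 8 bits of code2   */
    }

    void unpack_pair(unsigned char in[3], int *code1, int *code2)
    {
        *code1 = (in[0] << 4) | (in[1] >> 4);
        *code2 = ((in[1] & 0x0F) << 8) | in[2];
    }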
LZWCOM.C

/* LZWCOM - FILE COMPRESSOR UTILITY */

#include "sgtty.h"
#include "stdio.h"
#include "fcntl.h"

#define FALSE   0
#define TRUE    !FALSE
#define TABSIZE 4096
#define NO_PRED -1
#define EMPTY   -1
#define NOT_FND -1

struct entry {
    char used;
    int next;           /* index of next in collision list */
    int predecessor;    /* 12 bit code */
    char follower;
} string_tab[TABSIZE];

char is_a_con = FALSE;  /* flag to suppress 'dots' in writec */

/* routines common to compress and decompress, contained in CommLZW.c */
hash();
unhash();
getcode();
putcode();
init_tab();
upd_tab();

main(argc,argv)
int argc;
char *argv[];
{
    int c, code, localcode;
    int code_count = TABSIZE - 256;
    int infd, outfd;

    if (3 != argc) {
        printf("Usage : lzwcom oldfilename squeezefilename\n");
        exit(0);
    }
    if ( -1 == (infd = open( *++argv, O_RDONLY )) ) {
        printf("Cant open %s\n", *argv);
        exit(0);
    }
    if ( -1 == (outfd = creat(*++argv,0666)) ) {
        printf("Cant create %s\n",*argv);
        exit(0);
    }
    init_tab();                 /* initialize code table */
    c = readc(infd);
    code = unhash(NO_PRED,c);   /* initial code for table */
#ifdef DEBUG
    putchar(c);
    printf( "\n%x\n",code);
#endif
    while ( EOF != (c = readc(infd)) ) {
#ifdef DEBUG
        putchar(c);
#endif
        if ( NOT_FND != (localcode = unhash(code,c)) ) {
            code = localcode;
            continue;
        }
        /* when the above clause comes false, you have found the last known code */
        putcode(outfd,code);
#ifdef DEBUG
        printf( "\n%x\n",code);
#endif
        /* only update table if table isn't full */
        if ( code_count ) {
            upd_tab(code,c);
            --code_count;
        }
        /* start loop again with the char that didn't fit into last string */
        code = unhash(NO_PRED,c);
    }
    putcode(outfd,code);    /* once EOF reached, always one code left unsent */
    flushout(outfd);        /* make sure everything's written */
    exit(0);                /* make sure everything gets closed */
}

LZWUNC.C

/* LZWUNC - FILE UNCOMPRESSOR UTILITY */

#include "stdio.h"
#include "fcntl.h"
#include "sgtty.h"

#define FALSE     0
#define TRUE      !FALSE
#define TABSIZE   4096
#define STACKSIZE TABSIZE
#define NO_PRED   -1
#define EMPTY     -1
#define NOT_FND   -1
#define NO_UPDATE FALSE

struct entry {
    char used;
    int next;           /* index of next in collision list */
    int predecessor;    /* 12 bit code */
    char follower;
} string_tab[TABSIZE];

char is_a_con = FALSE;  /* suppresses printing dots if outfd is the console */

/* routines common to compress and decompress, contained in CommLZW.c */
hash();
unhash();
getcode();
putcode();
init_tab();
upd_tab();

main(argc,argv)
int argc;
char *argv[];
{
    int c, tempc, code, oldcode, incode, finchar;
    int code_count = TABSIZE - 256;
    int infd, outfd;
    register struct entry *ep;

    if (3 != argc) {
        printf("Usage : lzwunc squeezefilename newfilename\n");
        exit(0);
    }
    if ( -1 == (infd = open( *++argv, O_RDONLY )) ) {
        printf("Cant open %s\n", *argv);
        exit(0);
    }
    if ( -1 == (outfd = creat(*++argv,0666)) ) {
        printf("Cant create %s\n",*argv);
        exit(0);
    }
    is_a_con = isatty(outfd);       /* suppress dots if it is */
    init_tab();                     /* set up atomic code definitions */
    code = oldcode = getcode(infd);
    c = string_tab[code].follower;  /* first code always known */
    writec(outfd,c);
    finchar = c;

    while ( EOF != (code = incode = getcode(infd)) ) {
        ep = &string_tab[code];     /* initialize pointer */
        if ( !ep->used ) {          /* if code isn't known */
            push(finchar);          /* the unknown code ends with the first char */
                                    /* of the prior string; it is popped last */
            code = oldcode;         /* revert to last known code */
            incode = hash(oldcode,finchar,NO_UPDATE);
            ep = &string_tab[code]; /* re-initialize pointer */
        }
        while (NO_PRED != ep->predecessor) {
            push( ep->follower);    /* decode string backwards into stack */
            code = ep->predecessor; /* walk backwards up the string */
            ep = &string_tab[code];
        }
        /* above loop terminates, one way or another, with */
        /* string_tab[code].follower = first char in string */
        finchar = ep->follower;
        writec(outfd, finchar);
#ifdef DEBUG
        putchar(finchar);
#endif
        /* pop anything stacked during code parsing */
        while ( EMPTY != (tempc = pop()) ) {
            writec(outfd,tempc);
#ifdef DEBUG
            putchar(tempc);
#endif
        }
        if ( code_count ) {
            upd_tab(oldcode,finchar);
            --code_count;
        }
        oldcode = incode;
    }
    flushout(outfd);    /* write out fractional buffer, if any */
    exit(0);            /* close all files and quit */
}

char stack[STACKSIZE];  /* stack for pushing and popping characters */
int sp = 0;             /* current stack pointer */

push(c)
int c;
{
    stack[sp] = ((char) c);     /* coerce passed integer into a character */
    ++sp;
    if (sp >= STACKSIZE) {
        printf("Stack overflow, aborting\n");
        exit(1);                /* non-0 exit parameter terminates batch job */
    }
}

pop()
{
    if (sp > 0) {
        --sp;                           /* push leaves sp pointing to next empty slot */
        return ( (int) stack[sp] );     /* make sure pop returns char */
    }
    else
        return EMPTY;
}

COMMLZW.C

/* COMMLZW - ROUTINES COMMON TO LZWCOM AND LZWUNC */

#include "stdio.h"

#define FALSE    0
#define TRUE     !FALSE
#define TABSIZE  4096
#define NO_PRED  -1
#define EMPTY    -1
#define NOT_FND  -1
#define SECTSIZE 128

extern struct entry {
    char used;
    int next;           /* index of next in collision list */
    int predecessor;    /* 12 bit code */
    char follower;
} string_tab[TABSIZE];

extern char is_a_con;

/* h uses the 'mid-square' algorithm, i.e. for a hash value of n bits,  */
/* hash = middle binary digits of (key * key).  Upon collision, hash    */
/* searches down the linked list of keys that hashed there already.     */
/* It will NOT notice if the table is full; this must be handled        */
/* elsewhere.                                                           */

h(pred,foll)    /* returns the mid-square of pred + foll */
{
    long temp, local;   /* 32 bit receiving field for local^2 */

    local = (pred + foll) | 0x0800;
    temp = local * local;
    local = (temp >> 6) & 0x0FFF;
    return local;       /* middle 12 bits of result */
}

eolist(index)   /* return last element in a collision list */
int index;
{
    register int temp;

    while ( 0 != (temp = string_tab[index].next) )
        index = temp;
    return index;
}

hash(pred,foll,update)
int pred, foll, update;
{
    register int local, tempnext;
    register struct entry *ep;

    local = h(pred,foll);
    if ( !string_tab[local].used )
        return local;
    else {                              /* a collision has occurred */
        local = eolist(local);
        /* search for a free entry starting from local + 101 */
        tempnext = (local + 101) & 0x0FFF;
        ep = &string_tab[tempnext];     /* initialize pointer */
        while ( ep->used ) {
            ++tempnext;
            if ( tempnext == TABSIZE ) {
                tempnext = 0;           /* handle wrap to beginning of table */
                ep = string_tab;        /* address of first element of table */
            }
            else
                ++ep;                   /* point to next element in table */
        }
        /* put new tempnext into last element in collision list */
        if ( update )                   /* if update requested */
            string_tab[local].next = tempnext;
        return tempnext;
    }
}

/* unhash uses the 'next' field to go down the collision list to find  */
/* the entry corresponding to the passed key, and returns either the   */
/* matching entry # or NOT_FND                                          */

unhash(pred,foll)
int pred, foll;
{
    register int local, offset;
    register struct entry *ep;      /* pointer to current entry */

    local = h(pred,foll);           /* initial hash */
loop:
    ep = &string_tab[local];
    if ( (ep->predecessor == pred) && (ep->follower == foll) )
        return local;
    if ( ep->next == 0 )
        return NOT_FND;
    local = ep->next;
    goto loop;
}

init_tab()
{
    register int i;

    for (i = 0; i <= 255; i++) {
        upd_tab(NO_PRED,i);
    }
}

#define UPDATE TRUE

upd_tab(pred,foll)
int pred, foll;
{
    register struct entry *ep;      /* pointer to current entry */

    /* calculate offset just once */
    ep = &string_tab[ hash(pred,foll,UPDATE) ];
    ep->used = TRUE;
    ep->next = 0;
    ep->predecessor = pred;
    ep->follower = foll;
}

int inbuf = EMPTY;
int outbuf = EMPTY;

/* getcode and putcode 'gallop' through input and output - they either */
/* read or write two bytes or one, depending on the buffer contents.   */

getcode(fd)
int fd;
{
    register int localbuf, returnval;

    if (EMPTY == inbuf) {                       /* on code boundary */
        if (EOF == (localbuf = readc(fd)) ) {   /* H L1 byte - on code boundary */
            return EOF;
        }
        if (EOF == (inbuf = readc(fd)) ) {      /* L0 Hnext */
            return EOF;     /* the file should always end on a code boundary */
        }
        returnval = (localbuf << 4) + (inbuf >> 4);
        inbuf &= 0x000F;
    }
    else {                                      /* buffer contains nibble H */
        if (EOF == (localbuf = readc(fd)) )
            return EOF;
        returnval = localbuf + (inbuf << 8);
        inbuf = EMPTY;
    }
    return returnval;
}

putcode(fd,code)
int fd,code;
{
    register int localbuf;

    if (EMPTY == outbuf) {
        writec( fd,(code >> 4));                        /* H L1 */
        outbuf = code & 0x000F;                         /* L0 */
    }
    else {
        writec( fd, ( (outbuf << 4) + (code >> 8) ) );  /* L0prev H */
        writec( fd,(code & 0x00FF));                    /* L1 L0 */
        outbuf = EMPTY;
    }
}

char insector[SECTSIZE];
int current = 0;
int sector = 0;

readc(fd)
int fd;
{
    register int returnval;

    if (current == 0) {
        if ( 0 == read(fd,insector,SECTSIZE) ) {
            return (EOF);
        }
        else {
            if (!is_a_con) {
                if ( (++sector % 80) )
                    putchar('.');
                else {
                    putchar('.');
                    putchar('\n');
                }
            }
        }
    }
    returnval = (insector[current++]);
    if (current == SECTSIZE) {
        current = 0;
    }
    return returnval;
}

char outsector[SECTSIZE];
int outcurrent = 0;

writec(fd,c)
int fd,c;
{
    outsector[outcurrent++] = ( (char) c);
    if (outcurrent == SECTSIZE) {
        outcurrent = 0;
        write(fd,outsector,SECTSIZE);
    }
}

/* flushout makes sure the fractional output buffer gets written, along */
/* with any pending low nibble left over from a final, odd 12-bit code  */

flushout(fd)
int fd;
{
    if (EMPTY != outbuf) {          /* low nibble of the last code is pending */
        writec(fd, (outbuf << 4));  /* pad it out to a full byte */
        outbuf = EMPTY;
    }
    write(fd,outsector,outcurrent);
}