Turbo Pascal File Squeezer / UnSqueezer for CP/M and MsDos/PcDos Version 2.0 04/17/85 Bob Berry ----------- I originally downloaded these programs from the CompuServe Borland SIG. They were (rather brazenly, I thought) identified as being "CP/M compatible". There were, however, the following problems with CP/M compatibility: - Wouldn't compile. First the text was too large to allow compilation in 64k: Compile time error 99 - Compiler Overflow. Easily overcome by setting up $Include files. - Syntax error. The original version included "LongFileSize", which is strictly an MsDos turbo function. Replaced with FileSize for CP/M. - The input and output files were "File of Char;", which works fine with Dos. Since the MsDos directory identifies the size of a file in bytes, Dos Turbo does not need to do anything special to keep track of record count, etc. In CP/M, turbo writes two integers at the beginning of each "typed" file ("File of ...;") to identify the logical record length and record count. So using "File of Char" on input leads turbo to expect the first four bytes of the file to contain the two integers. Naturally this will not be true (unless the file was actually created BY A TURBO PROGRAM as a "File of ..."). For CP/M, the input and output files must be "untyped" files ("File;"), read and written with BlockRead and BlockWrite. - Recursive procedures and functions require the "A-" compiler directive to work under CP/M turbo. With these compatibility problems fixed, one other program bug appeared. At end-of-file, the original program immediately sent the SpEof character and terminated. Any characters which were repeated at end of file would be truncated. A CP/M file which ended with a string of fifty ^Z's (ASCII end-of file) would be unsqueezed to a single ^Z. A Dos file which ended with a string of eighty *'s would unsqueeze to a single *. (Dos files don't need a ^Z for EOF, as the directory keeps track of the length of the file in bytes.) Actually the Dos error would be more obvious than the CP/M error, as the "physical" CP/M file is almost always larger than the "logical" file (the tail end of the last sector is normally "garbage".) The original programs stored (upon squeezing) the file checksum, but (intentionally?) avoided comparing the output file to the checksum on unsqueezing. The above "end-of-file repeats" problem caused a checksum error, which I first discovered when testing these for "compatibility" with public domain (CP/M) SQ/USQ. The current version of the program(s) cause no checksum errors (USQZ does check!), and the programs are compatible with SQ/USQ. Also the "pre-squeeze" and "post-unsqueeze" files match exactly (a minimum requirement!) This has been verified for CP/M with CRC.COM and for DOS with COMP.COM. The Huffman algorithm is apparently developed differently from the (standard?) public domain SQ/USQ programs (the CRC's of the squeezed files don't match.) However, this is of no significance, as a file squeezed with the public domain program can be unsqueezed with the turbo program, and vice versa with no checksum errors, and with the output file exactly matching the original. Finally, with the programs working correctly, I set about to make them a little less cumbersome. The original programs would be loaded and then ask for the file name (no provision to pass the file name from the command line, as "A>SQZ SAMPLE.DOC".) The program would then squeeze or unsqueeze the file and boot you back to the system. To do another file, you would start from the "A>" prompt. I added: - Provision to specify the first file name on the command line. - Optional specification of a different output drive. - CP/M Re-Log to prevent BDOS R/O errors on the output drive. (You can swap disks!) - Verification of the file checksum on unsqueezing. - Statistics showing filesize and percentages of output to input files. - Repeat until a "blank" file name is entered (just ). - Separate (TESTED and WORKING) routines to accommodate the differences between CP/M and MsDos. Version 2.1 01/29/86 Bob Berry ----------- - Added WildCard file specifications. - Added Printer Echo. - Version 2.1 was prompted primarily by the following: For years, CP/M (you remember CP/M) has had public domain file compression programs available. Commonly called "SQUEEZE" and "UNSQUEEZE" these programs conformed to the following "squeezed" file definition: Byte 00 - 01: an integer, FF76H, the squeezed file ID, (stored LSB, MSB or 76 in 00, FF in 01) Byte 02 - 03: an integer, the file CheckSum, the integer total of all characters in the original file. Byte 04 - ??: the original file name with null (00) terminator (since the squeezed file has a "Q" as the second character of the file extension.) Followed By: an integer, the number of nodes in the tree built when the file was squeezed. Followed By: 2*Nodes integers: the tree. Followed By: The Squeezed file. Recently at least one new format of squeezed file has surfaced. Apparently in an attempt to preserve the Date and Time of last write which MsDos (PcDos) stores as part of a directory entry, a new format was devised: Byte 00 - 01: an integer FFFAH, the new squeezed file ID Byte 02 - ??: the original file name, null terminated. Followed By: a null terminated ASCII string with the file date. Followed By: a null (perhaps a null terminated string?) Followed By: an ASCII End-of-file character (^Z = 1AH) Followed By: the file CheckSum Followed By: an integer, the MsDos (FCB format) file date. Followed By: an integer, the MsDos (FCB format) file time. Followed By: an integer, the number of nodes. Followed By: 2*Nodes integers: the "tree". Followed By: The Squeezed File. This program will UnSqueeze files of either format, doing complete CheckSum verification, and will do so on either CP/M or DOS computers (by selecting the appropriate .INC module). Don't be surprised, however, if you discover yet another format which this program doesn't recognize. It seems there are new Squeeze/UnSqueeze programs popping up all the time, with authors who may not be too concerned about the compatibility of their programs with the rest of the universe. Bob Berry CompuServe 76555,167 The following documentation is included from the "original" programs. ======================================================================= From SQZ.PAS: { CP/M compatible file squeezer utility. This translation uses the Huffman algorithm to develop a binary tree representing the decoding information for a variable length bit string code for each input value. Each string's length is in inverse proportion to its frequency of appearance in the incoming data stream. The encoding table is derived from the decoding table. The range of valid values into the Huffman algorithm are the values of a byte stored in an integer plus the special endfile value chosen to be an adjacent value. Overall, 0-SPEOF. The algorithm develops the single element trees into a single binary tree by forming subtrees rooted in interior nodes having weights equal to the sum of weights of all their descendents and having depth counts indicating the depth of their longest paths. When all trees have been formed into a single tree satisfying the heap property (on weight, with depth as a tie breaker) then the binary code assigned to a leaf (value to be encoded) is then the series of left (0) and right (1) paths leading from the root to the leaf. Note that trees are removed from the heaped list by moving the last element over the top element and reheaping the shorter list. To further compress the output code stream, all bytes pass directly through except for: 1) DLE is encoded as (DLE, zero). 2) repeated byte values (count >= 3) are encoded as (value, DLE, count). In the original design it was believed that a Huffman code would fit in the same number of bits that will hold the sum of all the counts. That was disproven by a user's file and was a rare but infamous bug. This version attempts to choose among equally weighted subtrees according to their maximum depths to avoid unnecessarily long codes. In case that is not sufficient to guarantee codes <= 16 bits long, we initially scale the counts so the total fits in an unsigned integer, but if codes longer than 16 bits are generated the counts are rescaled to a lower ceiling and code generation is retried. The "node" array of structures contains the nodes of the binary tree. The first NUMVALS nodes are the leaves of the tree and represent the values of the data bytes being encoded and the special endfile, SPEOF. The remaining nodes become the internal nodes of the tree. Program states: NoHist don't consider previous input SentChar lastchar set, no lookahead yet SendNewC newchar set, previous sequence done SendCnt newchar set, DLE sent, send count next } (* NOTE: UnReferenced (debug) procedures and functions have been moved here from SQZ.PAS. Bob Berry *) (* Procedure PrintBits(len, number: integer); var i, j: integer; begin Write(' code '); For i:=len-1 DownTo 0 do begin j:=(number shr i) and $0001; write(j:1); end; writeln; end; *) (* Local to InitializeHuffman Procedure PrintFrequency; Var i, j: integer; begin j := 0; For i:=0 to pred(NumVals do If Node[i].Weight>0 then begin j:=succ(j); Writeln(Lst,'Node ',i:3,' Weight is ',Node[i].Weight:4:0); end; WriteLn(Lst); WriteLn(Lst,'Total Node count is ',j); end; Procedure PrintList; Var i: integer; str: string[10]; begin WriteLn(', waiting'); ReadLn(str); For i:=0 to pred(NumVals) do begin Write('Number ',i:3,' length ',CodeLen[i]:2, ' weight ',node[i].Weight:4:0); If CodeLen[i]>0 then PrintBits(CodeLen[i],Code[i]) else WriteLn; end; end; *) FROM USQZ.PAS: { This program unsqueezes a file which has been squeezed or compressed to reduce the space required to store it on disk. The program was converted from the original version written for CP/M in the C language. This program can be used to unsqueeze files which have been downloaded from RCP/M systems where almost all files are saved in this squeezed format. The technique used is the Huffman encoding technique which converts the most common characters in the input file to a compressed bit stream of data. This program unsqueezes such a Huffman encoded file. PUBLIC DOMAIN - Feel free to distribute this program. Do not distribute it by commercial means or make any charge for this pgm. Version 1.0 - 09/05/82 Scott Loftesness Version 1.1 - 01/06/83 Added capability to strip off parity bit if output file is text. Ernie LeMay 71435,730 Version 1.2 - 07/20/84 converted to Turbo Pascal. Steve Freeman }