UNARCZ

ZCPR3 Archive File Extraction Utility
Version 1.0

Modified for ZCPR3 by Gene Pizzetta
March 11, 1990

Copyright (C) 1986, 1987 by Robert A. Freed
All Rights Reserved

UNARCZ allows the listing, typeout, printing, checking, and extraction of member files contained in ARK and ARC "archive" library files. These are commonly used for compressed file storage on remote access bulletin boards.

UNARCZ requires a version of ZCPR3 for command line parsing and wheel byte detection, so it will not run properly on vanilla CP/M systems. As you might expect, a Z80 CPU or compatible is required. UNARCZ requires at least 30K of free memory (TPA) for full support of all archive file formats, but smaller systems may be able to use some of the program's capabilities.

USAGE:

     UNARCZ {dir:}arcfile{.typ} {dir:}{afn.aft} {{/}N|P|C}

If a DIR or DU specification is not given for the archive file, the current drive/user is assumed.

The second filename, which can be ambiguous, refers to a member file or files in the archive. If a DU or DIR specification is provided for the member file, it will be extracted to that directory. To extract to the current directory, only a colon is required. If a directory specification is given without a filename, all files (*.*) are assumed.

If no DU or DIR specification is given, UNARCZ acts differently depending on whether the member name is ambiguous or not. If the member name is unambiguous, and the filetype is not restricted, the file will be typed to the screen. If the member name is ambiguous, or if no member name is given at all, a directory of the ARK will be displayed.

If no filetype is given for the archive file, UNARCZ first tries ARK and then ARC.

An on-line help message will be displayed if UNARCZ is called with no command tail or if the command tail is "//".

OPTIONS:

The actions of UNARCZ are also affected by three available options, which may or may not be preceded by a slash, but which must be the third token (element) on the command line.
Only a single option may be used at a time.

     C    Check the validity of the archive and the given member files.

     N    Turn off console paging. This can be especially handy if you are extracting a large number of member files whose names fill more than a single screen. If you use this option, you will not get the "[more]" message or have to hit a key before UNARCZ continues. Its primary use, however, is when you want to scroll through an ARK'ed text file using ^S to pause at the points you're interested in.

     P    Send a member file to the printer (LST device). The member name cannot be ambiguous. The file will be printed continuously, without any sort of formatting or paging.

UNARCZ can be aborted at any time with ^C or ^K.

When UNARCZ pauses after 23 lines of console output, the listing may be resumed by typing any key other than ^S, ^C, or ^K. The space bar may be used to display one more line of console output (overwriting the "[more]" message), after which the program will pause again. For hard copy terminals, line feed may be used to prevent overprinting of the "[more]" line. Screen pauses can be turned off by using the "N" command line option (see above).

LISTING AN ARCHIVE DIRECTORY:

UNARCZ always produces a detailed console listing of all the member files of an archive, or of those members which match the second file specification, if one is given. If no member name is given, or if the member name is ambiguous, then UNARCZ only lists the directory, without doing anything else. (That is, unless the C option is included.)
A sample directory listing:

A0>UNARCZ CODES

Archive File = A0:CODES.ARK

Name           Length  Disk   Method  Ver  Stored Saved   Date     Time    CRC
============  =======  ====  ======== === ======= ===== ========= ======  ====
ABLE    .DOC    24320   24k  Crunched  8    11777   52% 30 Apr 86 10:50a  42C0
BRAVO   .COM    17152   17k  Squeezed  4    14750   14%  2 May 86  4:11p  8CBD
CHARLIE .TXT      234    1k   Packed   3       99   58%  2 May 86  4:11p  8927
        ====  =======  ====               =======   ===                   ====
Total      3    41706   42k                 26626   36%                   58A4

The listing is equivalent to the "verbose" listing of the MS-DOS ARC program, with the addition of the "Disk" and "Ver" fields, which are unique to UNARCZ and previous UNARC versions. The listing requires 78 columns of terminal width.

"Name" is the filename which will be generated if the file is extracted by UNARCZ. This is not necessarily the same as the name recorded in the archive file. Although CP/M and MS-DOS file naming conventions are identical, two conversions are made to guarantee filename validity: lower-case letters are converted to upper-case, and non-printing characters are converted to dollar signs ("$"). Archive entries are usually maintained, and so listed, in alphabetical order.

"Length" is the uncompressed file length, i.e., the number of bytes the file will occupy if extracted to disk, exclusive of any additional length imposed by the file system. MS-DOS permits files of arbitrary lengths, but CP/M restricts files to multiples of 128 bytes.

"Disk" is the actual amount of space required to extract the file to a CP/M disk, expressed as a multiple of 1K (1024) bytes. The number is dependent on the output drive's allocation block size, which can range from 1K to 16K bytes. Typically, 1K is used for single-density floppy disks, 2K for double-density floppies, and 4K for hard disks. In the absence of an explicit output drive, UNARCZ uses the block size of the currently logged drive.

"Method" is the compression method used: "Unpacked", "Packed", "Squeezed", "Crunched", "Squashed", or "Unknown!".
If the method "Unknown!" appears, it likely indicates a faulty archive file or a newer compression method not yet supported by UNARCZ.

"Ver" is the version of the compression method used. UNARCZ supports versions 1-9: unpacked files, versions 1 or 2; packed files, version 3; squeezed files, version 4; crunched files, versions 5-8; and squashed files, version 9.

"Stored" is the compressed file length, that is, the number of bytes occupied by the file in the archive, not including the directory information overhead, which adds 29 bytes to each member file.

"Saved" indicates the percentage of the original file length which was saved by compression. Higher values indicate better compression. The MS-DOS ARC documentation refers to this as the "stowage factor". The value shown in the totals applies to the archive as a whole, excluding directory overhead.

"Date" and "Time" give the last modification of the file as of the time it was added to the archive.

"CRC" is an internal 16-bit cyclic redundancy check value computed when a file is added to an archive, expressed in hexadecimal. UNARCZ checks file validity by recomputing this value when it extracts a file. The value is calculated by a different method than that used by either of the two popular public domain programs, CRCK and CHEK, but it is a valid and reliable error-detection mechanism. The value is given for completeness only.

The total in the last line is the 16-bit sum of the displayed CRC values and is useful for comparing entire archives. Since the CRC values are computed before compression, the total should be the same for all archives created from the same set of input files, without regard for variations in file order or compression methods.

The "Total" line is displayed only if more than one file appears in the listing.
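The arithmetic behind the "Disk", "Saved", and total "CRC" columns can be checked against the sample listing above. The following Python sketch is only an illustration, not UNARCZ's own code: the function names are hypothetical, and the round-to-nearest rule for "Saved" is an assumption inferred from the sample values rather than something the documentation states.

```python
def disk_kbytes(length, block_kb=1):
    """Space in K needed on a drive whose allocation block is block_kb K."""
    block = block_kb * 1024
    blocks = -(-length // block)        # ceiling division: partial blocks count
    return blocks * block_kb


def saved_percent(length, stored):
    """Percent of the original length saved (assumed rounded to nearest)."""
    if length == 0:
        return 0
    return (100 * (length - stored) + length // 2) // length


def crc_total(crcs):
    """16-bit sum of the member CRC values, as shown on the Total line."""
    return sum(crcs) & 0xFFFF


# Values taken from the CODES.ARK sample listing.
members = [(24320, 11777, 0x42C0), (17152, 14750, 0x8CBD), (234, 99, 0x8927)]

for length, stored, crc in members:
    print(length, disk_kbytes(length, 1), disk_kbytes(length, 2),
          saved_percent(length, stored))

print("%04X" % crc_total(crc for _, _, crc in members))
```

With 1K blocks the three members need 24k, 17k, and 1k; with 2K blocks they need 24k, 18k, and 2k, which is why the same archive shows different "Disk" figures on different drives.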

EXTRACTING FILES FROM AN ARCHIVE:

If the second command line parameter contains a DU or DIR specification, UNARCZ will extract the selected member file or files to the indicated disk directory. If the directory specification is given without a filename, all member files will be extracted to the indicated directory. If only a colon is given, the current drive/user will be assumed.

Below is a directory listing as might be generated during file extraction, along with some possible warning messages:

A0>UNARCZ CODES B1:

Archive File = A0:CODES.ARK
Output Drive = B1:

Name           Length  Disk   Method  Ver  Stored Saved   Date     Time    CRC
============  =======  ====  ======== === ======= ===== ========= ======  ====
ABLE    .DOC    24320   24k  Crunched  8    11777   52% 30 Apr 86 10:50a  42C0
Replace existing output file (y/n)? Y
BRAVO   .COM    17152   18k  Squeezed  4    14740   14%  2 May 86  4:11p  8CBD
Warning: Extracted file has incorrect CRC
Warning: Extracted file has incorrect length
Warning: Bad archive file header, bytes skipped = 10
CHARLIE .TXT      234    2k   Packed   3       99   58%  2 May 86  4:11p  8927
        ====  =======  ====               =======   ===                   ====
Total      3    41706   44k                 26616   36%                   58A4

"Replace existing output file (y/n)?" appears if a file of the same name exists in the output directory, requiring a "Y" or "N" response. Any response other than "Y" will be considered to be the same as "N".

The first two of the "Warning:" messages above indicate that either the cyclic redundancy check (CRC) value or the extracted file length does not match the value recorded in the archive header when the original file was added.

The third warning message is displayed if the proper format for the beginning of a new member is not detected, but UNARCZ recovered by skipping a certain number of bytes in the archive file. If a recovery attempt fails, UNARCZ aborts and issues a different message, "Invalid archive file format".

The appearance of any of these messages probably means the file data has been corrupted in some way.
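The CRC and length checks behind the first two warnings can be sketched in Python as follows. This is a hedged illustration, not the program's own code. It assumes the 16-bit CRC is the variant usually catalogued as CRC-16/ARC (reflected polynomial A001h, initial value 0); this documentation says only that the value is a 16-bit CRC computed differently from CRCK and CHEK. The names crc16_arc and check_member are hypothetical.

```python
def crc16_arc(data: bytes) -> int:
    """Bitwise CRC-16 with reflected polynomial 0xA001 and initial value 0.

    Assumption: this is the variant used in archive headers.
    """
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def check_member(data: bytes, header_crc: int, header_length: int) -> list:
    """Compare extracted data with the header values; return any warnings."""
    warnings = []
    if crc16_arc(data) != header_crc:
        warnings.append("Warning: Extracted file has incorrect CRC")
    if len(data) != header_length:
        warnings.append("Warning: Extracted file has incorrect length")
    return warnings


sample = b"123456789"
print("%04X" % crc16_arc(sample))
print(check_member(sample, crc16_arc(sample), len(sample)))
```

The checks are made on the uncompressed data, which is why the same member yields the same CRC regardless of the compression method used to store it.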

If the original MS-DOS file length was not an exact multiple of 128 bytes, the final record of the extracted file will be padded with 1Ah characters (ASCII ^Z).

Disk space in the listing will be correct for the specified output directory. In the two examples above, drive A has 1K allocation blocks while drive B has 2K blocks, which accounts for the differences in the two listings. To determine the exact disk space requirements before extracting files, log into the desired output drive and take an UNARCZ directory listing of the ARK file.

If a file extraction is aborted with ^C, any partial output file will have to be deleted manually.

TYPING MEMBER FILES:

Typing the contents of a member file in an archive to the console may be requested by giving a non-ambiguous filename and no output disk directory as the second command line parameter. For example:

A0>UNARCZ CODES ABLE.DOC

Archive File = A0:CODES.ARK

Name           Length  Disk   Method  Ver  Stored Saved   Date     Time    CRC
============  =======  ====  ======== === ======= ===== ========= ======  ====
ABLE    .DOC    24320   24k  Crunched  8    11777   52% 30 Apr 86 10:50a  42C0
-------------------------------------------------------------------------------
This is file ABLE.DOC, contained within the archive CODES.ARK.
Typeout will proceed until the end of this file, so you'd better
be patient. For somebody who has nothing to say, I've written an
awfully big file here. If you don't want to read all 24K of it,
you can type ^C ....

The specified file is assumed to contain valid ASCII text data. All bytes are masked to seven bits and all control characters are ignored except horizontal tabs (which are expanded to blanks with stops at every eighth column), and line feeds, vertical tabs, and form feeds, all of which generate a new line. SUB (^Z) is interpreted as the end of the file. Backspaces and carriage returns are ignored, so text will not be obscured.
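The filtering rules above can be sketched in Python as follows. This is an illustration of the rules as described, not the program's actual routine: the function name typeout is hypothetical, and treating DEL (7Fh) as an ignored control character is an assumption.

```python
def typeout(data: bytes, tab_width: int = 8) -> str:
    """Filter raw member data the way the typeout rules describe."""
    lines, line = [], []
    for byte in data:
        ch = byte & 0x7F                    # mask to seven bits
        if ch == 0x1A:                      # SUB (^Z): end of file
            break
        if ch == 0x09:                      # tab: pad to next multiple of 8
            line.extend(" " * (tab_width - len(line) % tab_width))
        elif ch in (0x0A, 0x0B, 0x0C):      # LF, VT, FF: start a new line
            lines.append("".join(line))
            line = []
        elif 0x20 <= ch < 0x7F:             # printable ASCII passes through
            line.append(chr(ch))
        # anything else (CR, BS, other controls, DEL) is ignored
    if line:
        lines.append("".join(line))
    return "\n".join(lines)


print(typeout(b"A\tB\r\nC\x08D\x1aignored"))
```

Because carriage returns and backspaces are dropped rather than acted on, a line can never be overprinted or erased on the console.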

UNARCZ will refuse to type files whose filetype indicates that they are not ASCII text files, including COM, EXE, OBJ, REL, LBR, and ?Z?. If one of these or other restricted types is given, directory information only is listed.

CRC and file length checking are not performed when a file is typed to the screen.

PRINTING MEMBER FILES:

A single member file may be sent to the printer (CP/M LST device) with the "P" option as the third parameter on the command line, with or without a preceding slash. In addition, the member name must be non-ambiguous and must not be preceded by a drive or user specification. For example:

     A0>UNARCZ CODES CHARLIE.TXT P
or
     A0>UNARCZ CODES CHARLIE.TXT /P

The contents of the specified file are passed directly to the printer without alteration, additional formatting, or even paging. The user should make sure the file contains data suitable for printer output. This unfiltered operation is particularly well-suited for the output of binary graphics images to dot-matrix printers. These files can be extremely large, but compress quite well, often to less than 5% of their original size.

The same filetypes excluded from typing are also excluded from printing.

Printing may be paused or aborted with ^S and ^C respectively.

CHECKING MEMBER FILES:

With the "C" option UNARCZ can be directed to extract one or more member files from an archive, without actually storing these as disk files. This operation performs file CRC and length checking, so it is useful for verifying correct modem data transmission of an archive.

As with all options, the "C" must be the third command line parameter. The member name may be ambiguous, but it cannot be preceded by a disk directory specification. For example:

     A0>UNARCZ CODES *.* C

PROGRAM CONFIGURATION OPTIONS:

Several configuration patch points may be used to tailor the program for specific requirements, particularly to guarantee security on RCP/M (Remote CP/M) systems.
The secure version of UNARCZ can be used by remote callers for archive directory listing or for member file typeout, but not for file extraction.

Other patch points are provided for specialized non-standard systems and need not concern the majority of users running ZCPR3, NZ-COM, or Z3PLUS.

Additional patch options allow adjustment of a few user preferences, such as the number of screen lines between console output pauses or the list of restricted filetypes for typing and printing.

Patching options are thoroughly described in UNARCZOV.ASM, an assembler source file that can be edited, assembled, and then overlaid onto UNARCZ.COM. The default options in the distributed program files, however, are suitable for most users.

ABOUT ARC/ARK FILES:

The files which UNARCZ processes utilize a format that was introduced by the ARC shareware utility program, which executes on 16-bit computers running the MS-DOS (or PC-DOS) operating system. This format has achieved widespread popularity since the ARC program first appeared in March 1985, and it has become the de facto standard for file storage on remote access systems catering to 16-bit computer users.

More recently this file format has achieved increased popularity on RCP/M (Remote CP/M) systems. Most RCP/M system operators have adopted the convention of naming CP/M archive files with the filetype ARK. This differentiates these from MS-DOS archive files, which use the filetype ARC. This is a naming convention only; there is no difference in format, and UNARC will accept files of either type interchangeably.

An archive is a group of files collected together into a single file in such a way that the individual files may be recovered intact. In this respect, archives are similar in function to libraries (LBR files), which have been commonplace on CP/M systems since 1982, when the original LU library utility program was introduced by Gary P. Novosielski. (The two file formats, however, are not compatible.)

The distinguishing characteristic of an ARC archive is that its component files are automatically compressed when they are added to the archive, so that the resulting file occupies a minimum amount of disk space. Of course, file compression techniques have also been commonplace in the CP/M world since 1981, when the public domain SQ and USQ "squeeze and unsqueeze" programs were introduced by Richard Greenlaw.

The SQ/USQ programs and their numerous popular descendants utilize a well-known general-purpose form of data compression (Huffman coding). This technique, which is also utilized in ARC files, performs well for many text files but often produces poor compression of binary files (e.g., object program COM files).

The ARC program also provides an advanced data compression method, which it terms "crunching." This method (which is based on the Lempel-Ziv-Welch or "LZW" algorithm) performs better than squeezing in most cases, often achieving 50% or better compression of ASCII text files, 15-40% compression of binary object files, and as much as 95% compression of bit-mapped graphics image files.

Five different methods are actually employed for storing files in an archive. The method chosen for a particular file is the one which results in the best compression for that file:

     1.  No compression ("unpacked"). The file is stored in its original form.

     2.  Run-length encoding ("packed"). Repeated sequences of 3-255 identical bytes are compressed into a three-byte sequence.

     3.  Huffman coding ("squeezed"). Each 8-bit byte (after run-length encoding) is encoded by a variable number of bits, with bit length (approximately) inversely proportional to the frequency of occurrence of the corresponding byte.

     4.  LZW compression ("crunched"). Variable-length strings of bytes (in theory, up to nearly 4000 bytes in length) are represented by a single (maximum) 12-bit code (after run-length encoding).

     5.  LZW compression ("squashed").
         This is a variation of crunching which uses (maximum) 13-bit codes (and no run-length encoding).

Since one of the five methods involves no compression at all, the resulting archive entry will never be larger than the original file.

The most recent release of the MS-DOS ARC program (version 5.20) has eliminated squeezing as a compression technique. However, UNARC continues to process squeezed files for compatibility with archives created by earlier versions of ARC and by other MS-DOS archiving programs (notably PKARC).

The squashed compression method was recently introduced by the MS-DOS programs PKARC and PKXARC. UNARC can process files which use this method, although it is not universally accepted by other MS-DOS archive extraction programs (including ARC).

During its lifetime, the ARC program has undergone numerous revisions which have employed different variations on some of the above methods, particularly LZW compression. In order to retain compatibility with archives created by earlier program revisions, ARC stores a "version" indicator with each file in an archive. Based on this indicator, the latest release of the ARC program can always extract files created by older releases (although it will only use the latest data compression versions when adding new files to an archive).

The current release of UNARC supports archive file versions generated by all releases of the following MS-DOS programs through (at least) the indicated program versions:

     ARC 5.20 (24 Oct 86), by System Enhancement Associates, Inc.
     ARCA 1.22 (13 Sep 86), by Wayne Chin and Vernon Buerg
     ARCH 5.38 (26 Jun 86), by Les Satenstein
     PKARC 2.0 (15 Dec 86), by Phil Katz (PKWARE, Inc.)

UNARC does not recognize, but is unaffected by, the non-standard archive and file commenting feature of PKARC.

Although the above discussion has emphasized the origin of archive files for the MS-DOS operating system, their use has recently spread to many other systems.
Programs compatible with MS-DOS ARC have appeared for UNIX, Atari 68000, VAX/VMS, and TOPS-20 systems. A CP/M utility for building archive files is also available.

For additional information about archive files and the MS-DOS ARC utility, refer to the documentation file, ARC.DOC, which is available from most remote access systems which utilize archive files.

For additional information about the LZW algorithm (and data compression methods in general), refer to the article "A Technique for High-Performance Data Compression", by Terry A. Welch, in IEEE Computer magazine, Vol. 17, No. 6, June 1984.

FUTURE ENHANCEMENTS:

I can see the desirability of a few future enhancements. UNARCZ should get its own name from the external FCB. It should automatically restrict directory access using the information from the environment descriptor. Most importantly, it should transfer the file dates from the archive to DateStamper and CP/M Plus date stamps. Anyone interested?