SPELL V2.0 DOCUMENTATION

Michael C. Adler
December 22, 1982

(C) 1982 Michael C. Adler

This program has been released into the public domain by the
author. It may neither be sold for profit nor included in a sold
software package without permission of the author.

The first SPELL using this dictionary was probably written by Ralph
Gorin at Stanford. It was transported to MIT by Wayne Mattson. Both
the program at MIT and the dictionary were most recently revised by
William Ackerman at MIT. Section 5 of this document was copied from
portions of Mr. Ackerman's documentation. Thanks to all for the
effort spent designing the dictionary!

SPELL is a program, written for Z80 processors running CP/M,
designed to detect misspellings in a document.

1. USING SPELL

The minimum configuration of SPELL requires the files SPELL.COM and
DICT.DIC (the main dictionary). At the time of execution, DICT.DIC
must be on either the default drive or drive A:. The name of the
file to be corrected must be included on the command line that is
used to invoke SPELL. If a drive name is specified as a second file
name, output is directed to the specified drive. Thus,

     SPELL useless.doc

will check the file "useless.doc" and direct output to the default
drive, and

     SPELL b:useless.doc c:

will check the file "b:useless.doc" and direct output to drive C:.

SPELL checks the input file for errors by comparing each word in
the file to the dictionary. If a word is not found, a null (ASCII 0)
is placed before the word. To change this marking character, see
section 4, PATCHING SPELL. If a backup version (.BAK file type) of
the input file exists, it will be deleted. The input file will be
renamed to a backup file and the checked file will replace the
input file.

2. USER DICTIONARIES

A user dictionary is a list of correct words that can be loaded by
SPELL to augment the main dictionary. Words such as proper nouns
can be placed in user dictionaries to inhibit error marking.
User dictionary files may be formatted in any way that the user
desires, as long as words are delimited by non-alphabetic
characters.

SPELL will automatically search for the user dictionary SPELL.DIC
on the default drive and, if it is not found there, on drive A:.
Its contents are then loaded and temporarily added to the
dictionary. It must be loaded again to be included in subsequent
executions of SPELL.

SPELL will also automatically search for d:file.UDC, where file is
the name of the file being corrected and d: is the drive on which
file is found. If found, it is also loaded and temporarily augments
the dictionary. Thus, users may create separate dictionaries for
each text file being corrected.

After locating d:file.UDC, SPELL will search for the file
d:file.ADD. This file is created by WordStar's ^QL command (see
section 3) and is not an ASCII file. d:file.ADD contains commands
generated by WordStar to include specific words in the user
dictionary associated with d:file. SPELL will temporarily place all
of the words in it in the dictionary and will also save the words
by copying them into d:file.UDC.

It is possible to load additional user dictionaries by specifying
them on the SPELL command line. A list of user dictionaries must be
preceded by a dollar sign. A dictionary is specified by a file name
and an optional drive name. If no drive is specified, the default
drive is searched and then drive A: is checked. Extensions are
ignored and default to .DIC. Thus,

     SPELL useless.doc b: $dict1 c:dict2 dict3.fun

would correct useless.doc and direct output to drive B:. User
dictionary DICT1.DIC would be loaded from the default drive or
drive A:, dictionary DICT2.DIC would be loaded from drive C:, and
DICT3.DIC would be loaded from the default drive or drive A:.
Notice that the extension .fun was ignored.

3. WordStar's ^QL COMMAND

Files checked by SPELL can be corrected using WordStar. In response
to ^QL, the user is asked which portions of the file should be
searched.
WordStar will then position the cursor on the first marked word and
print a menu offering F (Fix word), B (Bypass word), I (Ignore
word), D (Add to dictionary), and S (Add to supplemental
dictionary).

The F option deletes the error marker and returns to the WordStar
main menu, allowing the user to correct the word. B leaves the
marker in place and searches for the next marked word. In this
implementation of SPELL, the I, D and S options all perform the
same function (although I is easier to use because no question is
asked by WordStar). If any of these options (I, D, S) is chosen,
the mark will be removed and the word will be added to file.ADD.
Thus, choosing these options informs SPELL that the word is correct
and should not be marked again. The D and S options do not add the
word to SPELL's main dictionary because the compression method used
to store the dictionary is too complicated to allow such
modification efficiently.

After any option except F is chosen, WordStar will automatically
search for the next marked word.

4. PATCHING SPELL

It is not necessary to recompile SPELL to change the character that
marks misspelled words. The byte at 0103H contains the marking
character. The byte at 0104H contains the "default disk" [1 for A:,
2 for B:, etc.]. In the distribution version of SPELL, these bytes
are 0 and 1 [NULL and A:]. To change either setting, patch the
bytes at 0103H and 0104H. Hex 23 ('#') is a tolerable marking
character for FinalWord.

5. PROGRAM AND DICTIONARY CHARACTERISTICS

5.1 Word identification algorithm

A word is any uninterrupted sequence of letters and apostrophes
that does not begin or end with an apostrophe. Any punctuation,
digit, or control character separates words. Any word consisting of
a single letter, or any word more than 40 letters long, is
considered to be correctly spelled.
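
The word rules of section 5.1 can be illustrated with a short
sketch. This is not SPELL's own code (SPELL is written in Z80
assembler); it is a Python illustration, and the names split_words
and needs_lookup are made up for this example. Two points are
assumptions, not statements from the documentation: leading and
trailing apostrophes are simply stripped from a run, and only
letters (not apostrophes) count toward the 40-letter limit.

```python
import re

MAX_CHECKED_LETTERS = 40  # longer words are treated as correct (5.1)


def split_words(text):
    """Split text into words: runs of letters and apostrophes.

    Assumption: apostrophes at either end of a run are dropped, so
    that no word begins or ends with an apostrophe.
    """
    words = []
    for run in re.findall(r"[A-Za-z']+", text):
        word = run.strip("'")
        if word:  # a run made only of apostrophes is not a word
            words.append(word)
    return words


def needs_lookup(word):
    """Return True if the word must be looked up in the dictionary.

    Single-letter words and words of more than 40 letters are
    considered correctly spelled. Assumption: apostrophes are not
    counted as letters.
    """
    letters = sum(1 for ch in word if ch.isalpha())
    return 1 < letters <= MAX_CHECKED_LETTERS
```

For example, split_words("Don't stop, 'quoted' x2y!") separates the
text at the comma, the digit, and the quoting apostrophes, and
needs_lookup rejects the single letters that result.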

5.2 Dictionary policy

It is the policy of this program to contain only one spelling of a
word, even if ordinary dictionaries show two or more "acceptable"
spellings. Hence, the dictionary contains LABELED and LABELING, but
not LABELLED or LABELLING, even though all four are actually
acceptable. The intention is to enforce uniformity within each
document. The author apologizes for the restriction on creativity
and diversity that this necessitates, but believes that it is the
best policy for this program.

The dictionary contains many technical and computer terms such as
MICROPROGRAM and DEBUGGER, but does not contain extreme jargon
words such as CONTROLIFY or VALRET.

The dictionary contains no proper names other than names of
countries and states of the United States. The reason is that it
would be virtually impossible to contain all of the proper names
that commonly arise in normal use. Users should keep proper names
(and other correctly spelled words) that arise in their own work in
private dictionaries to avoid having to repeatedly tell SPELL to
accept them.

The dictionary is significantly smaller than that found in other
spelling checkers, such as the DEC TOPS-20 program. The author
believes that the larger dictionary would not reduce the number of
false misspelling indications by very much.

[Note: I believe that this dictionary is actually MUCH larger than
any dictionaries currently available for microcomputers. -Michael]

5.3 Dictionary flags

Words in SPELL's main dictionary (but not the other dictionaries)
may have flags associated with them to indicate the legality of
suffixes without the need to keep the full suffixed words in the
dictionary. The flags have "names" consisting of single letters.
Their meaning is as follows:

Let # and @ be "variables" that can stand for any letter. Upper
case letters are constants. "..."
stands for any string of zero or more letters, but note that no
word may exist in the dictionary which is not at least 2 letters
long, so, for example, FLY may not be produced by placing the "Y"
flag on "F". Also, no flag is effective unless the word that it
creates is at least 4 letters long, so, for example, WED may not be
produced by placing the "D" flag on "WE".

"V" flag:
        ...E --> ...IVE  as in CREATE --> CREATIVE
        if # .ne. E, ...# --> ...#IVE  as in PREVENT --> PREVENTIVE

"N" flag:
        ...E --> ...ION  as in CREATE --> CREATION
        ...Y --> ...ICATION  as in MULTIPLY --> MULTIPLICATION
        if # .ne. E or Y, ...# --> ...#EN  as in FALL --> FALLEN

"X" flag:
        ...E --> ...IONS  as in CREATE --> CREATIONS
        ...Y --> ...ICATIONS  as in MULTIPLY --> MULTIPLICATIONS
        if # .ne. E or Y, ...# --> ...#ENS  as in WEAK --> WEAKENS

"H" flag:
        ...Y --> ...IETH  as in TWENTY --> TWENTIETH
        if # .ne. Y, ...# --> ...#TH  as in HUNDRED --> HUNDREDTH

"Y" flag:
        ... --> ...LY  as in QUICK --> QUICKLY

"G" flag:
        ...E --> ...ING  as in FILE --> FILING
        if # .ne. E, ...# --> ...#ING  as in CROSS --> CROSSING

"J" flag:
        ...E --> ...INGS  as in FILE --> FILINGS
        if # .ne. E, ...# --> ...#INGS  as in CROSS --> CROSSINGS

"D" flag:
        ...E --> ...ED  as in CREATE --> CREATED
        if @ .ne. A, E, I, O, or U,
                ...@Y --> ...@IED  as in IMPLY --> IMPLIED
        if # .ne. E or Y, or (# = Y and @ = A, E, I, O, or U)
                ...@# --> ...@#ED  as in CROSS --> CROSSED
                                or CONVEY --> CONVEYED

"T" flag:
        ...E --> ...EST  as in LATE --> LATEST
        if @ .ne. A, E, I, O, or U,
                ...@Y --> ...@IEST  as in DIRTY --> DIRTIEST
        if # .ne. E or Y, or (# = Y and @ = A, E, I, O, or U)
                ...@# --> ...@#EST  as in SMALL --> SMALLEST
                                or GRAY --> GRAYEST

"R" flag:
        ...E --> ...ER  as in SKATE --> SKATER
        if @ .ne. A, E, I, O, or U,
                ...@Y --> ...@IER  as in MULTIPLY --> MULTIPLIER
        if # .ne. E or Y, or (# = Y and @ = A, E, I, O, or U)
                ...@# --> ...@#ER  as in BUILD --> BUILDER
                                or CONVEY --> CONVEYER

"Z" flag:
        ...E --> ...ERS  as in SKATE --> SKATERS
        if @ .ne. A, E, I, O, or U,
                ...@Y --> ...@IERS  as in MULTIPLY --> MULTIPLIERS
        if # .ne. E or Y, or (# = Y and @ = A, E, I, O, or U)
                ...@# --> ...@#ERS  as in BUILD --> BUILDERS
                                or SLAY --> SLAYERS

"S" flag:
        if @ .ne. A, E, I, O, or U,
                ...@Y --> ...@IES  as in IMPLY --> IMPLIES
        if # .eq. S, X, Z, or H,
                ...# --> ...#ES  as in FIX --> FIXES
        if # .ne. S, X, Z, H, or Y, or (# = Y and @ = A, E, I, O, or U)
                ...# --> ...#S  as in BAT --> BATS
                                or CONVEY --> CONVEYS

"P" flag:
        if @ .ne. A, E, I, O, or U,
                ...@Y --> ...@INESS  as in CLOUDY --> CLOUDINESS
        if # .ne. Y, or @ = A, E, I, O, or U,
                ...@# --> ...@#NESS  as in LATE --> LATENESS
                                or GRAY --> GRAYNESS

"M" flag:
        ... --> ...'S  as in DOG --> DOG'S

Note: The existence of a flag on a root word in the dictionary is
not by itself sufficient to cause SPELL to recognize the indicated
word ending. If there is more than one root for which a flag will
indicate a given word, only one of the roots is the correct one for
which the flag is effective; generally it is the longest root. For
example, the "D" rule implies that either PASS or PASSE, with a "D"
flag, will yield PASSED. The flag must be on PASSE; it will be
ineffective on PASS. This is because, when SPELL encounters the
word PASSED and fails to find it in its dictionary, it strips off
the "D" and looks up PASSE. Upon finding PASSE, it then accepts
PASSED if and only if PASSE has the "D" flag. Only if the word
PASSE is not in the main dictionary at all does the program strip
off the "E" and search for PASS.

Furthermore, some combinations of flags are forbidden to allow for
dense flag encoding to save space. For example, only one of the
"P", "J", or "V" flags may be on in any one word.

6. SPELL INTERNALS

SPELL uses a number of temporary files during execution. The file
file.D$$ is the union of file.UDC and file.ADD. At the end of
execution, file.UDC and file.ADD are deleted and file.D$$ is
renamed to file.UDC. The file file.$$$ is the output file.
At the end of execution, file.BAK is deleted, the input file is
renamed to file.BAK, and file.$$$ is renamed to the input file
name. Warning: if you do not have room on your disk for file.BAK,
file.DOC and file.$$$ at the same time, either use two drives or
delete file.BAK before you start.

SPELL checks files in two passes over the input file. On the first
pass, the words in the file are sorted alphabetically and duplicate
words are eliminated. An attempt is then made to search for the
words in the dictionary. Words that are found are marked. On the
second pass of the input file, SPELL determines whether each word
was found by locating it in memory. This method makes the operation
of SPELL more efficient because common words must be looked up only
once and because the dictionary can be searched sequentially,
minimizing disk head travel.

If the whole file does not fit in memory on the first pass, the
input file is partitioned into sections small enough to fit into
memory and is then checked in a series of two-pass operations until
the entire file has been checked. On large systems, even large text
files are unlikely to fill memory, since 3000 distinct words should
fit easily.

7. DICTIONARY INTERNALS

The dictionary has been compressed significantly to save space.
Dictionary records are all 256 bytes long and each record contains
as many words as will fit. Individual words are stored in the
following code:

4 bits     -- Number of characters to copy from the previous word.
              Because the dictionary is stored in alphabetical
              order, this saves a large number of characters. This
              field is 0 at the beginning of each record.

x * 5 bits -- Characters are stored in 5 bit code. There may be any
              number of 5 bit characters. A character string is
              terminated by the following field.

3 bits     -- Set to 111 binary to indicate the end of the word.
              Since 11100 binary is greater than 26, all alphabetic
              characters can be stored without using this
              combination.

4 bits     -- Number of bits of flag data following the word. The
              bit positions of the flags have been ordered so that
              the flags most frequently used are earliest. Flags
              not stored are assumed to be off.

x bits     -- Flag data. x is determined by the previous field.
              Each bit represents one of the 14 suffix flags.

8. MODIFYING THE MAIN DICTIONARY

The source for the main dictionary can currently be found in the
file "[MIT-XX]SRC:SPELL.DCT". In order to make it compatible with
SPELL, all of the "/" characters that delimit flags must be
converted to "%" characters so that flags will be considered
earlier in the alphabet than apostrophes (DOG%S should be before
DOG'S). The file must then be sorted alphabetically. No utilities
are provided with SPELL to accomplish either of these tasks.
Without high capacity disk drives, you may find it necessary to
perform the above steps on a larger computer.

Once a copy of the main dictionary has been placed on the
microcomputer, use the program DICCRE to create a dictionary.
Include the name of the source file on the DICCRE command line.
DICCRE will create the files DICT.DIC (compressed dictionary) and
SPELL0.MAC (pointer file to dictionary) ON THE DEFAULT DISK DRIVE.
When it has finished converting the input file to the dictionary
file, it will execute a warm boot if the output file is on the same
drive as the input file. However, if the output file is not on the
same disk, it will ask whether another input file exists. This
feature allows the user to split the source file across two disks
in case it does not fit on one. DICCRE will combine them into one
dictionary file. If no more files exist, answer N to the question.
If another file does exist, put the disk with the new file in the
input drive and type Y.

After the dictionary file has been created, it is necessary to
recompile SPELL with the new pointer file, SPELL0.MAC.
If your assembler does not support the INCLUDE statement, you will
have to replace the line

     INCLUDE SPELL0.MAC

in the file SPELL.MAC with the contents of SPELL0.MAC. After SPELL
is recompiled, be sure to use the correct copy of DICT.DIC with it
or you will obtain unpredictable results.

For more information about dictionaries, see the file:

     [MIT-XX]SS:DICT.LETTER

Good luck and happy hacking!

Michael Adler (MADLER@MIT-ML)
3 Sunny Knoll Terrace
Lexington, MA 02173
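
Note on preparing the dictionary source (section 8): SPELL ships no
utility for converting "/" to "%" or for sorting. On a larger
computer the two steps might be sketched as below. This is only an
illustration, not part of SPELL. The function name prepare_source
is made up, and the sketch assumes one entry per line and that
plain ASCII ordering is the alphabetical order DICCRE expects;
neither point is confirmed by the documentation.

```python
import sys


def prepare_source(lines):
    """Convert flag delimiters and sort dictionary source lines.

    Assumptions (not confirmed by the SPELL documentation): each
    line holds one entry, and DICCRE expects plain ASCII ordering.
    """
    # Replace the "/" flag delimiter with "%" and drop blank lines.
    converted = [
        line.strip().replace("/", "%") for line in lines if line.strip()
    ]
    # Python orders ASCII strings by character code, so "%" (hex 25)
    # sorts before "'" (hex 27): DOG%S comes before DOG'S.
    return sorted(converted)


if __name__ == "__main__":
    # Filter usage: read the source on stdin, write the result to stdout.
    sys.stdout.write("\n".join(prepare_source(sys.stdin)) + "\n")
```

Run as a filter, it reads the dictionary source on standard input
and writes the converted, sorted text to standard output.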