CP/M Assembly Language
Part VII: Filter Programs
by Eric Meyer

Last time we put together the basic subroutines needed to read and write text files. Now we'll use these to construct "filter programs": programs that read text in, process it in some way, then write the result back out. One use for such a program is to convert text back and forth from WordStar document (henceforth "DOC") to non-document (plain ASCII) form.

1. AND and/or OR

First we need to introduce a family of 8080 instructions we avoided until now: the logical operations ANA (and), ORA (or), and XRA (exclusive or), and their immediate cousins ANI, ORI, XRI. Each operates with the accumulator and another 8-bit value, combining them one bit at a time ("bitwise") to produce a result.

Logical AND, e.g., produces a 1 if both its arguments are 1, and a 0 otherwise. That is, 1 and 1 is 1, and anything else (1 and 0, or 0 and 0) is 0. When applied bitwise, you will find for example that 61h AND 5Fh is 41h:

        61h =  01100001 binary
    AND 5Fh =  01011111
    ---------------------
        41h =  01000001

Why is this interesting? Well, 61h is the ASCII code for the character a, and 41h is A. (If you don't have a nice ASCII table with hex/decimal/binary values, make or find one.) That is, by ANDing it with 5Fh, we have uppercased the letter a. Because of the way the ASCII codes are assigned, upper and lower case letters all differ only by a single bit (the 6th from the right, "bit 5" in assembler speak), and the same trick works for all letters.

Note, of course, that the operation ANI 5FH changes (zeros) not only bit 5, but also bit 7, the high (parity) bit. (You can also zero just bit 7, by using ANI 7FH, since 7Fh = 01111111b.) Remember that much of the difference between a WordStar and a plain ASCII file is that WordStar sets the high bit on many characters; so undoing this is part of the task of converting between these two formats.

Logical OR produces a 1 if either argument is 1, and a 0 otherwise.
So 1 or 1 and 1 or 0 are both 1, while 0 or 0 is 0. Thus you can turn on certain bits by ORing with a suitable value. For example, we can undo what we did above:

        41h =  01000001 binary
     OR 20h =  00100000
    ---------------------
        61h =  01100001

Thus the operation ORI 20H will lowercase a letter.

Logical XOR produces a 1 if either argument, but not both, is 1, and a 0 otherwise. We won't have any immediate use for this now, but you might note that programmers commonly use XRA A to zero the accumulator, since any value XORed with itself gives 0.

Let's quickly embody this case business in two routines which you may find useful: UCASE and LCASE. These respectively convert the ASCII value in the accumulator to upper or lower case.

    UCASE:  CPI     'a'     ;below 'a'?
            RC              ;yes, not lower case: leave it
            CPI     'z'+1   ;above 'z'?
            RNC             ;yes, not lower case: leave it
            ANI     5FH     ;clear bit 5 (and bit 7)
            RET
    ;
    LCASE:  CPI     'A'     ;below 'A'?
            RC              ;yes, not upper case: leave it
            CPI     'Z'+1   ;above 'Z'?
            RNC             ;yes, not upper case: leave it
            ORI     20H     ;set bit 5
            RET

Note that before applying the ANI or ORI operation, we first check to make sure the character in A is in fact a letter! For example, in UCASE, we simply return if the value is less than a or greater than z (not less than z+1). This is because ANDing other characters with 5FH could change them in undesirable ways. (It would convert - to ^M.)

The logical operations affect the flags, too: the Z flag will be set if the result of the operation is 0, otherwise cleared. The C flag will always be cleared. (You will often see something like ORA A used just to clear Carry, instead of STC, CMC.)

2. The Filter Program

The basic "filter" program reads a byte of text from an input file, processes it in some way, then writes it to an output file.
It would look something like this:

    ;*** FILTER.ASM
    ;*** General Filter Program
    ;
    BDOS    EQU     0005H   ;basic equates
    FCB1    EQU     005CH
    FCB2    EQU     006CH
    ;
            ORG     0100H   ;programs start here
    ;
    START:  LXI     D,FCB1  ;point to 1st FCB (source file)
            CALL    GCOPEN  ;open it for reading
            JC      IOERR   ;complain if error
            LXI     D,FCB2  ;point to 2nd FCB (destination)
            CALL    PCOPEN  ;open it for writing
            JC      IOERR   ;complain if error
    ;
    LOOP:   CALL    GETCH   ;get a character
            JC      IOERR   ;complain if error
            CPI     1AH     ;EOF?
            JZ      DONE    ;quit if at end of file
            CALL    FILTER  ;process it in some way
            JMP     LOOP    ;keep going
    ;
    DONE:   CALL    PCLOSE  ;close the output file
            JC      IOERR   ;error?
            RET             ;all finished
    ;
    IOERR:  RET             ;error? just quit, for now
    ;
    ;Here is the processing routine
    FILTER: CALL    PUTCH   ;just write it out, for now
            RET
    ;
    ;*** Be sure to include here the following disk
    ;*** file subroutines from our previous column:
    ;*** GETCH, PUTCH, GCOPEN, PCOPEN, PCLOSE
    ;
            END

If you assemble this as written here, you will have a program called FILTER.COM that will simply make a copy of a disk file; i.e., if you say

    A>filter oldfile newfile

FILTER will read OLDFILE and construct an identical copy NEWFILE. If you want, you can spruce it up a bit by adding a signon message like "FILTER 1.0 (8/19/86)" at START, or an error message like "I/O ERROR" at the IOERR routine. (Use BDOS function 9 or the SPMSG routine, described in earlier columns.)

Of course, what we really want is to do something to the text en route. As you can see, you can put any further code you want at the location FILTER, which now just writes the character out as is. For example, you can call the UCASE routine above just before PUTCH, and you will have a program that makes an uppercase copy of a file.

3. The WordStar To ASCII Filter

To get the FILTER program to convert a WordStar DOC to a plain ASCII file, you have to know what's in a DOC file.
We've already said that a lot of characters have their high bits set (such as "soft" spaces and returns), so the first thing we want to do to them is ANI 7FH to strip that off. But there's more than that! For example, there are hyphens. WordStar has "soft hyphens", which are represented by 1Eh (when not in use) or 1Fh (when in use). Thus you want to ignore 1Eh, and translate 1Fh to a real hyphen. Adding this also to our FILTER routine would produce:

    FILTER: ANI     7FH     ;strip parity bit
            CPI     1EH     ;is it dead soft hyphen?
            RZ              ;if so, quit (ignore it)
            CPI     1FH     ;is it live soft hyphen?
            JNZ     FLT1    ;if not, skip following
            MVI     A,'-'   ;if so, replace with '-'
    FLT1:   CALL    PUTCH   ;okay, now write it out
            RNC             ;return if all clear
            POP     H       ;ERROR, kill return to LOOP
            JMP     IOERR   ;and go here instead

If you use the code above, you will have a FILTER program that does a pretty credible job of converting WordStar DOC to ASCII files. FILTER.COM will take up only 1K on disk, and will be quite fast, and much easier to use than the equivalent program in, say, MBASIC.

Of course, you will eventually want to add more processing, to suit your taste. For example you may decide you want to ignore all the funny control codes like ^S that WordStar uses for printer functions, or instead, translate them to the actual control codes your printer will need to perform those functions. It's your program; you are in control.

4. Buffering Characters

Now let's consider how you might write another filter program to go the other way. How often have you encountered files you'd like to edit (and reformat) with WordStar, but they're full of hard returns, so you can't?

This is a slightly harder problem. You don't just want to turn all hard returns into soft ones, because there are places where you want them left hard (like the end of a paragraph). How can we tell when this is the case? No routine will do this perfectly.
However, if you can assume that paragraphs are always indented (always good practice), you can use the following pretty good rule: a return is the end of a paragraph, and should be left hard, if

    (1) the next line is blank; or
    (2) the next line begins with a space.

In terms of character values, this means that the next character, after this CR and LF, is (1) another CR, or (2) a space.

Notice that what we do with the current character (in this case a CR) depends on the value of the character after next! How can we cope with this? We must be able to look ahead and see what's coming, without affecting our position in the file: to read characters from the source file, but then save them to read again later. This can be done by storing them in a special little buffer, and modifying our GETCH routine to see if there are any characters in this buffer before going to look in the file again.

Here's the new UNGETC routine, which will "unget" a character:

    ;Routine to UNGET a character, saving it for GETCH
    UNGETC: PUSH    H       ;save registers here
            PUSH    D       ;(if you don't do this,
            PUSH    B       ; UNGETC will be a hassle to use)
            PUSH    PSW     ;save the character last
            LDA     BUFCNT  ;fetch buffer count
            CPI     5       ;already maximal?
            JNC     UNG0    ;yes, leave it
            INR     A       ;no, increase it
            STA     BUFCNT  ;and put it back
    UNG0:   LXI     H,UGBUF+3 ;point from next-to-last
            LXI     D,UGBUF+4 ;to last position
            MVI     B,4     ;prepare to move 4 bytes
    UNGLP:  MOV     A,M     ;get a byte
            STAX    D       ;move it up ahead
            DCX     H       ;back up
            DCX     D       ;to previous
            DCR     B       ;count down on B
            JNZ     UNGLP   ;loop if more to go
            POP     PSW     ;recover new character
            STA     UGBUF   ;put it at front of buffer
            POP     B       ;restore
            POP     D       ; the
            POP     H       ; registers
            RET
    BUFCNT: DB      0       ;count chars in UGBUF
    UGBUF:  DS      5       ;room for 5 characters

UNGETC maintains a list of characters read, and put back for future use, at UGBUF. The most recently read one is first, the oldest last; BUFCNT holds the count.
To unget a character, we increment the count, move the existing ones ahead to make room, and then put in the new one. (Don't try to unget more than the maximum of 5 characters, or the earlier ones will disappear into the bit bucket. Of course, you could make this value larger if you want, adjusting the constants in UNGETC to match.)

Now what does GETCH have to do?

    ;Modified GETCH routine for use with UNGETC
    GETCH:  LDA     BUFCNT  ;check UNGETC buffer
            CPI     0       ;is it empty?
            JZ      FGETCH  ;if so go read file
            DCR     A       ;decrease count
            STA     BUFCNT  ;and put it back
            MOV     E,A     ;put count (less 1) in E
            MVI     D,0     ;now DE is the 16-bit version
            LXI     H,UGBUF ;point to buffer
            DAD     D       ;now HL points to eldest character
            MOV     A,M     ;get it
            STC
            CMC             ;clear C flag
            RET             ;and return
    FGETCH: ....            ;put the old GETCH here

If there are characters in the UGBUF buffer, we decrement the count, then fetch the oldest one and return with it; if the buffer is empty, we just go ahead and do the usual read from the file. Note that the buffered path uses D, E, H and L.

5. The ASCII to WordStar Filter

If you add UNGETC and make the above changes to GETCH, we can now get the FILTER program to "soften" CRs more or less properly. The processing routine will look like this:

    FILTER: CPI     0DH     ;is it a CR?
            JNZ     FLT1    ;no, just go on
            CALL    GETCH   ;get the next char (LF?)
            JC      FLTERR  ;error?
            MOV     D,A     ;and save it
            CALL    GETCH   ;once more: this is the one we want
            JC      FLTERR  ;error?
            MOV     E,A     ;save it too
            MOV     A,D     ;recover the first
            CALL    UNGETC  ;unget it
            MOV     A,E     ;now the second
            CALL    UNGETC  ;unget it too
            MOV     A,E     ;okay, here it is
            CPI     0DH     ;here goes: is it a CR?
            JZ      FLTH    ;yes, make current CR HARD
            CPI     ' '     ;or a space?
            JZ      FLTH    ;yes, HARD again
    FLTS:   MVI     A,8DH   ;no, use a SOFT CR here
            JMP     FLT1
    FLTH:   MVI     A,0DH   ;use a HARD CR
    FLT1:   CALL    PUTCH   ;write the char out
            RNC             ;return if all clear
    FLTERR: POP     H       ;ERROR, kill return to LOOP
            JMP     IOERR   ;and go here, instead

If we've read a CR, and the character after next (the second look ahead) is a space or CR, we write a hard CR; otherwise, it gets softened. Other characters go through unaffected.
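
If you would rather not depend on which registers GETCH happens to preserve, the look-ahead can be pulled out into a subroutine of its own. What follows is only a sketch, not part of the listing above: the names PEEK2 and PK2X are made up for illustration, and it assumes that GETCH returns with Carry set on error (as the JC tests above imply) and that UNGETC is the version given earlier, which returns with A and the flags as they were on entry.

```asm
;PEEK2 (hypothetical name): return in A the character after
; next, leaving both look-ahead characters to be read again.
; Returns Carry set if GETCH reported an error.
PEEK2:  PUSH    B       ;preserve caller's BC
        CALL    GETCH   ;first character ahead (LF?)
        JC      PK2X    ;error? leave with Carry set
        MOV     B,A     ;keep it in B
        PUSH    B       ;protect it in case GETCH uses BC
        CALL    GETCH   ;second character ahead
        POP     B       ;first one back in B (flags untouched)
        JC      PK2X    ;error? leave with Carry set
        MOV     C,A     ;keep the second in C
        MOV     A,B     ;unget in the order read:
        CALL    UNGETC  ; first character first,
        MOV     A,C
        CALL    UNGETC  ; then the second; A still holds it
PK2X:   POP     B       ;restore caller's BC (flags untouched)
        RET             ;Carry clear if all went well
```

With this in place, the part of FILTER between JNZ FLT1 and the second CPI 0DH would reduce to CALL PEEK2 followed by JC FLTERR. The characters are ungotten in the order they were read because the modified GETCH hands back the oldest buffered character first.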
Softening CRs this way is the central task in creating DOC files from ASCII files. Of course you can do as much more as you want: e.g., soften hyphens if they occur at the end of a line (before a CR). It's all up to you.

6. Other Applications

You probably will be able to think of other filtering tasks as well. One possibility is communication with various mainframe computers, which have differing requirements for text formats. Another is encrypting and decrypting text, using anything from a simple substitution cipher on up. And if you eliminate the output file routines, you can turn the FILTER program into a simple SEARCH program that just reads through a disk file: perhaps counting words, or looking for a particular string and printing out every line that contains it.

You will find that the resulting program is remarkably compact and fast. If you want to make it even more efficient, you can try your hand at increasing the buffering of the GETCH and PUTCH routines. As they stand they use a simple 128-byte DMA, which means your computer will have to alternately read data from the source, and write to the destination, in small pieces (the BDOS does its own buffering, in units of "blocks", usually from 1K to 4K in size). You can speed all this up if you use buffers larger than this; 16K apiece would be a good choice.

This would require increasing the GCDMA and PCDMA buffers from 128 bytes to 16*1024 bytes, and modifying the read/write code in GETCH and PUTCH to transfer the whole 16K one record at a time, stepping the DMA address along in 128-byte increments. (An exercise for the stout-hearted reader.)

7. Coming Up

Next time we'll learn how to input and output numbers.