                      CP/M Assembly Language
                    Part VII: Filter Programs
                          by Eric Meyer

     Last time we put together the basic subroutines needed to
read and write text files.
     Now we'll use these to construct "filter programs": programs
that read text in, process it in some way, then write the result
back out.
     One use for such a program is to convert text back and forth
between WordStar document (henceforth "DOC") and non-document
(plain ASCII) form.


1. AND and/or OR
     First we need to introduce a family of 8080 instructions we
avoided until now: the logical operations ANA (and), ORA (or),
and XRA (exclusive or), and their immediate cousins ANI, ORI,
XRI.
     Each operates with the accumulator and another 8-bit value,
combining them one bit at a time ("bitwise") to produce a result.
     Logical AND, e.g., produces a 1 if both its arguments are 1,
and a 0 otherwise. That is, 1 and 1 is 1, and anything else (1
and 0, or 0 and 0) is 0.
     When applied bitwise, you will find for example that 61h AND
5Fh is 41h:

               61h =  01100001 binary
          AND  5Fh =  01011111
               ---------------
               41h =  01000001

     Why is this interesting?
     Well, 61h is the ASCII code for the character 'a', and 41h
is 'A'. (If you don't have a nice ASCII table with
hex/decimal/binary values, make or find one.)
     That is, by ANDing it with 5Fh, we have uppercased the
letter 'a'. Because of the way the ASCII codes are assigned, each
lower case letter differs from its upper case twin only by a
single bit (the 6th from the right, "bit 5" in assembler speak),
so the same trick works for all letters.
     Note, of course, that the operation ANI 5FH changes (zeros)
not only bit 5, but also bit 7, the high (parity) bit.
     (You can also zero just bit 7, by using ANI 7FH, since 7Fh =
01111111b.)  Remember that much of the difference between a
WordStar and a plain ASCII file is that WordStar sets the high
bit on many characters; so undoing this is part of the task of
converting between these two formats.
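     As a worked example of the same kind, take 8Dh, the value
WordStar uses for a "soft" carriage return, and strip its high
bit:

               8Dh =  10001101 binary
          AND  7Fh =  01111111
               ---------------
               0Dh =  00001101

     The result, 0Dh, is the ordinary ASCII carriage return. In a
program, with the character already in the accumulator, the whole
job is the single instruction ANI 7FH.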
     Logical OR produces a 1 if either argument is 1, and a 0
otherwise. So 1 or 1, and 1 or 0, are both 1, while 0 or 0 is 0.
     Thus you can turn on certain bits, by ORing with a certain
value. For example, we can undo what we did above:

               41h =  01000001b
           OR  20h =  00100000
               ---------------
               61h =  01100001

     Thus the operation ORI 20h will lowercase a letter.
     Logical XOR produces a 1 if either argument, but not both,
is 1, and a 0 otherwise.
     We won't have any immediate use for this now, but you might
note that programmers commonly use XRA A to zero the accumulator,
since any value XORed with itself gives 0.
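     To see why, try it with any value you like, say 5Ah:

               5Ah =  01011010 binary
          XOR  5Ah =  01011010
               ---------------
               00h =  00000000

     In code, the two ways of zeroing the accumulator look like
this, side by side (a sketch for comparison only; you would use
one or the other):

        XRA  A         ;A = 0 (one byte of code)
        MVI  A,0       ;A = 0 (two bytes of code)

     Besides saving a byte, XRA A changes the flags, as all the
logical operations do, while MVI leaves them alone. Keep that in
mind if a later instruction depends on them.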
     Let's quickly embody this case business in two routines
which you may find useful: UCASE and LCASE. These respectively
convert the ASCII value in the accumulator to upper or lower
case.

UCASE:    CPI  'a'         LCASE:    CPI  'A'
          RC                         RC
          CPI  'z'+1                 CPI  'Z'+1
          RNC                        RNC
          ANI  5FH                   ORI  20H
          RET                        RET

     Note that before applying the ANI or ORI operation, we first
check to make sure the character in A is in fact a letter!
     For example, in UCASE, we simply return if the value is less
than 'a' or greater than 'z' (that is, not less than 'z'+1). This
is because ANDing other characters with 5FH could change them in
undesirable ways. (It would convert '-', 2Dh, to ^M, 0Dh.)
     The logical operations affect the flags, too: the Z flag
will be set if the result of the operation is 0, otherwise
cleared. The C flag will always be cleared. (You will often see
something like ORA A used just to clear Carry, instead of STC,
CMC.)
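     Here is a sketch of that idiom at work. The routine FETCH is
made up for the illustration; it assumes the same convention our
file routines use, where Carry set on return means an error.

;Hypothetical routine: fetch the byte HL points
; to, and report "no error" (C clear)
FETCH:  MOV  A,M       ;get the byte
        ORA  A         ;A unchanged; C cleared,
                       ; Z set if the byte is 0
        RET

     ORing a value with itself never changes it, so A comes back
intact. You also get the Z flag set to match the byte for free,
which STC, CMC would not do for you.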


2. The Filter Program
     The basic "filter" program reads a byte of text from an
input file, processes it in some way, then writes it to an output
file. It would look something like this:

;*** FILTER.ASM
;*** General Filter Program
;
BDOS    EQU  0005H   ;basic equates
FCB1    EQU  005CH
FCB2    EQU  006CH
;
        ORG  0100H   ;programs start here
;
START:  LXI  D,FCB1  ;point to 1st FCB
                     ; (source file)
        CALL GCOPEN  ;open it for reading
        JC   IOERR   ;complain if error
        LXI  D,FCB2  ;point to 2nd FCB
                     ; (destination)
        CALL PCOPEN  ;open it for writing
        JC   IOERR   ;complain if error
;
LOOP:   CALL GETCH   ;get a character
        JC   IOERR   ;complain if error
        CPI  1AH     ;EOF?
        JZ   DONE    ;quit if at end of file
        CALL FILTER  ;process it in some way
        JC   IOERR   ;complain if write error
        JMP  LOOP    ;keep going
;
DONE:   CALL PCLOSE  ;close the output file
        JC   IOERR   ;error?
        RET          ;all finished
;
IOERR:  RET          ;error? just quit, for now
;
;Here is the processing routine
FILTER: CALL PUTCH   ;just write it out, for now
        RET
;
;*** Be sure to include here the following disk
;*** file subroutines from our previous column:
;*** GETCH, PUTCH, GCOPEN, PCOPEN, PCLOSE
;
        END

     If you assemble this as written here, you will have a
program called FILTER.COM, that will simply make a copy of a disk
file; i.e., if you say

A>filter oldfile newfile<cr>

FILTER will read OLDFILE and construct an identical copy,
NEWFILE.
     If you want, you can spruce it up a bit, by adding a signon
message like FILTER 1.0 (8/19/86) at the START, or an error
message like I/O ERROR at the IOERR routine. (Use BDOS function 9
or the SPMSG routine, described in earlier columns.)
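     A minimal sketch of the error message, using BDOS function
9 (print string), which expects DE to point to a string ending
in '$'. The label ERRMSG is our own invention.

IOERR:  LXI  D,ERRMSG ;point to the message
        MVI  C,9      ;BDOS function 9: print string
        CALL BDOS     ;print it
        RET           ;and quit
ERRMSG: DB   'I/O ERROR',0DH,0AH,'$'

     The signon message works the same way: put the same three
instructions, pointing to a different string, at START ahead of
the LXI D,FCB1.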
     Of course, what we really want is to do something to the
text en route. As you can see, you can put any further code you
want at the location FILTER, which now just writes the character
out as is. For example, you can add the UCASE routine above, and
you will have a program that makes an uppercase copy of a file.
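     Assuming you have added UCASE to the file, a sketch of that
version of the processing routine takes just three lines:

FILTER: CALL UCASE   ;uppercase it if a letter
        CALL PUTCH   ;then write it out
        RET

     UCASE passes anything that is not a lower case letter
through unchanged, so digits, punctuation, and control characters
like CR and LF are copied as they are.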


3. The WordStar To ASCII Filter
     To get the FILTER program to convert a WordStar DOC to a
plain ASCII file, you have to know what's in a DOC file.
     We've already said that a lot of characters have their high
bits set (such as "soft" spaces and returns), so the first thing
we want to do to them is ANI 7FH to strip that off.
     But there's more than that!
     For example, there are hyphens. WordStar has "soft hyphens",
which are represented by 1Eh (when not in use) or 1Fh (when in
use).
     Thus you want to ignore 1Eh, and translate 1Fh to a real
hyphen. Adding this also to our FILTER routine would produce:

FILTER:   ANI  7FH     ;strip parity bit
          CPI  1EH     ;is it dead soft hyphen?
          RZ           ;if so, quit (ignore it)
          CPI  1FH     ;is it live soft hyphen?
          JNZ  FLT1    ;if not, skip following
          MVI  A,'-'   ;if so, replace with '-'
FLT1:     CALL PUTCH   ;okay, now write it out
          RNC          ;return if all clear
          POP  H       ;ERROR, kill return to
                       ; LOOP
          JMP  IOERR   ;and go here instead

     If you use this code above, you will have a FILTER program
that does a pretty credible job of converting WordStar DOC to
ASCII files.
     FILTER.COM will take up only 1k on disk, and will be quite
fast, and much easier to use than the equivalent program in, say,
MBASIC.
     Of course, you will eventually want to add more processing,
to suit your taste. For example you may decide you want to ignore
all the funny control codes like ^S that WordStar uses for
printer functions, or instead, translate them to the actual
control codes your printer will need to perform those functions.
     It's your program; you are in control.
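     Here is a sketch of the first choice, ignoring two of the
print codes. It assumes ^S (13h) and ^B (02h) are the underline
and boldface toggles in your version of WordStar; check your
manual, and add a CPI/RZ pair for each further code you want
dropped.

FILTER:   ANI  7FH     ;strip parity bit
          CPI  1EH     ;is it dead soft hyphen?
          RZ           ;if so, quit (ignore it)
          CPI  13H     ;is it ^S (underline)?
          RZ           ;if so, ignore it too
          CPI  02H     ;is it ^B (boldface)?
          RZ           ;if so, ignore it too
          CPI  1FH     ;is it live soft hyphen?
          JNZ  FLT1    ;if not, skip following
          MVI  A,'-'   ;if so, replace with '-'
FLT1:     CALL PUTCH   ;okay, now write it out
          RNC          ;return if all clear
          POP  H       ;ERROR, kill return to
                       ; LOOP
          JMP  IOERR   ;and go here instead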


4. Buffering Characters
     Now let's consider how you might write another filter
program to go the other way.
     How often have you encountered files you'd like to edit (and
reformat) with WordStar, but they're full of hard returns, so you
can't?
     This is a slightly harder problem.
     You don't just want to turn all hard returns into soft ones,
because there are places where you want them left hard (like the
end of a paragraph).
     How can we tell when this is the case?
     No routine will do this perfectly. However, if you can
assume that paragraphs are always indented (always good
practice), you can use the following pretty good rule:

     A return is the end of a paragraph, and should be left hard,
if:

     (1) the next line is blank; or
     (2) the next line begins with a space.

     In terms of character values, this means that the next
character, after this CR and LF, is (1) another CR, or (2) a
space.
     Notice that what we do with the current character (in this
case a CR) depends on the value of the character after next!
How can we cope with this?
     We must be able to look ahead and see what's coming, without
affecting our position in the file: to read characters from the
source file, but then save them to read again later.
     This can be done by storing them in a special little buffer,
and modifying our GETCH routine to see if there are any
characters in this buffer before going to look in the file again.
     Here's the new UNGETC routine, which will "unget" a
character:

;Routine to UNGET a character, saving
; it for GETCH
UNGETC: PUSH H         ;save registers here
        PUSH D         ;(if you don't do this,
        PUSH B         ; UNGETC will be a
                       ; hassle to use)
        PUSH PSW       ;save the character last
        LDA  BUFCNT    ;fetch buffer count
        CPI  5         ;already maximal?
        JNC  UNG0      ;yes, leave it
        INR  A         ;no, increase it
        STA  BUFCNT    ;and put it back
UNG0:   LXI  H,UGBUF+3 ;point from next-last
        LXI  D,UGBUF+4 ;to last position
        MVI  B,4       ;prepare to move 4 bytes
UNGLP:  MOV  A,M       ;get a byte
        STAX D         ;move it up ahead
        DCX  H         ;back up
        DCX  D         ;to previous
        DCR  B         ;count down on B
        JNZ  UNGLP     ;loop if more to go
        POP  PSW       ;recover new character
        STA  UGBUF     ;put it at front of
                       ; buffer
        POP  B         ;restore
        POP  D         ;  the
        POP  H         ;    registers
        RET
BUFCNT: DB   0         ;count chars in UGBUF
UGBUF:  DS   5         ;room for 5 characters

     UNGETC maintains a list of characters read, and put back for
future use, at UGBUF. The most recently read one is first, the
oldest last -- BUFCNT holds the count.
     To unget a character, we increment the count, move the
existing ones ahead to make room, and then put in the new one.
(Don't try to unget more than the maximum of 5 characters, or the
earliest ones will disappear into the bit bucket. Of course, you
could make this value larger if you want.)
     Now what does GETCH have to do?

;Modified GETCH routine for use with UNGETC
GETCH:  LDA  BUFCNT    ;check UNGETC buffer
        CPI  0         ;is it empty?
        JZ   FGETCH    ;if so go read file
        DCR  A         ;decrease count
        STA  BUFCNT    ;and put it back
        MOV  E,A       ;put count (less 1) in E
        MVI  D,0       ;now D-E is 16-bit
                       ; version
        LXI  H,UGBUF   ;point to buffer
        DAD  D         ;now HL points to eldest
                       ; character
        MOV  A,M       ;get it
        STC
        CMC            ;clear C flag
        RET            ;and return
FGETCH: ....           ;put the old GETCH here

     If there are characters in the UGBUF buffer, we decrement
the count, then fetch the oldest one and return with it; if the
buffer is empty, we just go ahead and do the usual read from the
file.
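     To see the two routines working together, here is a trace
(our own illustration) of ungetting 'X' and then 'Y' into an
empty buffer, then calling GETCH twice. A dot marks a byte whose
contents don't matter.

                          BUFCNT   UGBUF
     (start)                 0     . . . . .
     UNGETC with 'X'         1     X . . . .
     UNGETC with 'Y'         2     Y X . . .
     GETCH returns 'X'       1     Y X . . .
     GETCH returns 'Y'       0     Y X . . .

     GETCH only lowers the count; it does not bother to erase
anything. Notice that the characters come back in the order they
were put in, so if you read several characters ahead, unget them
in the same order you read them.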


5. The ASCII to WordStar Filter
     If you will add UNGETC, and make the above changes to GETCH,
we can now get the FILTER program to "soften" CRs more or less
properly. The processing routine will look like this:

FILTER: CPI  0DH     ;is it a CR?
        JNZ  FLT1    ;no, just go on
        CALL GETCH   ;get the next char (LF?)
        JC   FLTERR  ;error?
        MOV  D,A     ;and save it
        CALL GETCH   ;once more: we want this one
        JC   FLTERR  ;error?
        MOV  E,A     ;save it too
        MOV  A,D     ;recover the first
        CALL UNGETC  ;unget it
        MOV  A,E     ;now the second
        CALL UNGETC  ;unget it too
        MOV  A,E     ;okay, here it is
        CPI  0DH     ;here goes: is it a CR?
        JZ   FLTH    ;yes, make current CR HARD
        CPI  ' '     ;or a space?
        JZ   FLTH    ;yes, HARD again
FLTS:   MVI  A,8DH   ;no, use a SOFT CR here
        JMP  FLT1
FLTH:   MVI  A,0DH   ;use a HARD CR
FLT1:   CALL PUTCH   ;write the char out
        RNC          ;return if all clear
FLTERR: POP  H       ;ERROR, kill return to LOOP
        JMP  IOERR   ;and go here, instead

     If we've read a CR, and the character after next (the second
one we look ahead to) is a space or CR, we write a hard CR;
otherwise, it gets softened. Other characters go through
unaffected.
     This is the central task in creating DOC files from ASCII
files. Of course you can do as much more as you want: e.g.,
soften hyphens if they occur at the end of a line (before a CR).
It's all up to you.
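
     As a sketch of that hyphen idea, these lines could go at the
front of the routine. The label FLTCR is our own; it belongs on
the CPI 0DH line that used to carry the label FILTER. We assume
1Fh, the "in use" soft hyphen value from section 3, is what
WordStar wants here.

FILTER: CPI  '-'     ;is it a hyphen?
        JNZ  FLTCR   ;no, go test for a CR
        CALL GETCH   ;yes, look at the next char
        JC   FLTERR  ;error?
        CALL UNGETC  ;put it right back
        CPI  0DH     ;does a CR follow?
        MVI  A,'-'   ;assume not: plain hyphen
        JNZ  FLT1    ;right, write it as is
        MVI  A,1FH   ;wrong, use a soft hyphen
        JMP  FLT1
FLTCR:  CPI  0DH     ;is it a CR?

     This works because UNGETC hands back A and the flags as it
found them, and MVI does not touch the flags, so the result of
the CPI is still good when we reach the JNZ. Note that it will
soften every hyphen at the end of a line, including any that
really belong to a hyphenated word.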


6. Other Applications
     You probably will be able to think of other filtering tasks
as well.
     One possibility is communication with various mainframe
computers, which have differing requirements for text formats.
Another is encrypting and decrypting text, using anything from a
simple substitution cipher on up.
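     As a sketch of about the simplest such cipher, here is a
FILTER routine that rotates each letter 13 places through the
alphabet and leaves everything else alone. The labels FLT0 and
FLTUP are our own. Run the output through the same program again
and you get the original text back.

FILTER: MOV  B,A     ;keep the original in B
        ORI  20H     ;make a lower case copy
        CPI  'a'     ;below 'a'?
        JC   FLT0    ;yes, not a letter
        CPI  'z'+1   ;above 'z'?
        JNC  FLT0    ;yes, not a letter
        CPI  'n'     ;first or second half?
        MOV  A,B     ;recover the original
        JC   FLTUP   ;a-m: move up 13
        SUI  13      ;n-z: move down 13
        JMP  FLT1
FLTUP:  ADI  13
        JMP  FLT1
FLT0:   MOV  A,B     ;not a letter, leave alone
FLT1:   CALL PUTCH   ;write the char out
        RNC          ;return if all clear
        POP  H       ;ERROR, kill return to LOOP
        JMP  IOERR   ;and go here instead

     The lower case copy is used only for testing; the add or
subtract is done on the original, so upper case stays upper and
lower stays lower.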
     And if you eliminate the output file routines, you can turn
the FILTER program into a simple SEARCH program that just reads
through a disk file: perhaps counting words, or looking for a
particular string and printing out every line that contains it.
     You will find that the resulting program is remarkably
compact and fast.
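     For instance, here is a sketch of a main loop that counts
the lines in a file by counting CRs. It assumes START now opens
only the source file (the two PCOPEN lines are dropped), and the
name LINES is our own.

LOOP:   CALL GETCH   ;get a character
        JC   IOERR   ;complain if error
        CPI  1AH     ;EOF?
        JZ   DONE    ;quit if at end of file
        CPI  0DH     ;end of a line?
        JNZ  LOOP    ;no, keep going
        LHLD LINES   ;yes, fetch the count
        INX  H       ;add one
        SHLD LINES   ;and put it back
        JMP  LOOP    ;keep going
;
DONE:   RET          ;count is in LINES
;
LINES:  DW   0       ;16-bit line count

     Displaying the count on the screen takes a routine to output
numbers, which we don't have yet.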
     If you want to make it even more efficient, you can try your
hand at increasing the buffering of the GETCH and PUTCH routines.
     As they stand they use a simple 128-byte DMA buffer, which
means your computer will have to alternately read data from the
source, and write to the destination, in small pieces (the BDOS
does its own buffering, in units of "blocks", usually from 1K to
4K in size).
     You can speed all this up if you use buffers larger than
this; 16K apiece would be a good choice. This would require
increasing the GCDMA and PCDMA buffers from 128 bytes to 16*1024
bytes, and modifying the read/write code in GETCH and PUTCH to
transfer the whole 16K one record at a time, stepping the DMA
address along in 128-byte increments. (An exercise for the
stout-hearted reader.)


7. Coming Up
     Next time we'll learn how to input and output numbers.