CP/M Assembly Language
Part I: Assembler Basics
by Eric Meyer

I first discovered assembly language about two years ago, when I needed to modify the source code for a public domain modem program for an unusual application. Since then, I've gone on to write a number of programs in assembler, ranging from some simple public domain utilities to the memory resident utility PRESTO!. For many such applications, assembler is the language of choice: it's very compact and fast; it's the most efficient way to do simple tasks that deal with moving around bytes of data, such as copying and modifying files; and it allows the most sophisticated interfacing with the CP/M operating system, which is itself written in assembler.

Another nice thing is that you already have all the tools that you need to learn and use assembly language: nothing more to buy, unless your needs grow to be very sophisticated. CP/M 2.2 includes the ASM assembler; CP/M 3.0 comes with MAC and RMAC. All you lack is instructions.

Let me quickly mention two good books on the subject: CP/M Assembly Language Programming, and The Soul of CP/M. Both, while not complete language references, put a lot of emphasis on programming in the CP/M environment, which will have you doing truly useful things (like manipulating disk files) in short order. Both are far more comprehensive than I can attempt to be here; I will just present an introduction, and explain some basic concepts for those who would like to become literate in assembler.

Numbers play an important role in all that follows. Basically, everything in the computer is (or is represented as) numbers -- such as the instructions that make up a program, or the operating system itself; characters of text and other data that you may be manipulating; addresses in memory where various data or subroutines can be found; and so on. Only the context determines whether a particular value is to be interpreted as a number, an ASCII character, part of an address, or a machine instruction.
This can be very powerful, but it's also potentially very confusing. (Pascal aficionados may need a strong drink before proceeding.)

All numbers in what follows are decimal, unless followed by an "H" (for Hexadecimal, base 16) or "B" (for Binary, base 2). Hexadecimal is commonly used in assembly language programming, as it's the most natural representation for the numbers from 0 to 255 (or 65535) that your computer manipulates on the most fundamental level. If you're unfamiliar with these base systems, you may want to find or make a conversion chart for reference.

1. The CPU

The CPU (central processing unit) is the integrated circuit at the heart of your computer. It fetches your instructions, executes them, and keeps track in the meantime (via "interrupts") of all the other tasks your computer needs to have done. Most CP/M computers today use the Z80 CPU, though some still use the 8080 (or 8085), which are very similar but don't have quite as many instructions.

These "8-bit" CPUs deal primarily with "bytes", numeric values from 0 to 255 (11111111B, or FFH); though two bytes together can also be used as a 16-bit "word", a value from 0 to 65535 (FFFFH). In this manner, up to 64K (64 times 1024, or 65536, bytes) of memory can be addressed. Part of this memory will be holding the CP/M operating system; part will contain the transient program that is actually running at the moment; and part will remain available as data storage space for that program.

2. Assembly Language

The CPU has a moderate number of "instructions", each of which performs some simple but useful task: adding two values, fetching a byte of data from memory, and so on. Each instruction is "coded" by one (or possibly several) bytes, according to an arbitrary system. For example, C9H (201) is the "return" instruction, which marks the end of a subroutine.
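
If you have no conversion chart handy, the notation used so far can be checked with a few lines of a modern scripting language. The following is only a sketch in Python (not something you would run under CP/M), and the helper names parse and show are made up for this illustration. It reads numbers written in this article's style (trailing "H" or "B") and prints them back in all three bases.

```python
# Sketch only: a stand-in for a paper conversion chart.
# parse() and show() are hypothetical helper names, not part of any CP/M tool.

def parse(text):
    """Read a number in the article's notation: 0FFH, 11001001B, or plain decimal."""
    text = text.strip().upper()
    if text.endswith("H"):          # hexadecimal, base 16
        return int(text[:-1], 16)
    if text.endswith("B"):          # binary, base 2
        return int(text[:-1], 2)
    return int(text)                # decimal

def show(value, bits=8):
    """Spell a value in decimal, hex and binary, padded to the given width."""
    hex_digits = bits // 4
    return f"{value} = {value:0{hex_digits}X}H = {value:0{bits}b}B"

print(show(parse("C9H")))           # the "return" instruction byte
print(show(parse("FFH")))           # largest byte value
print(show(parse("FFFFH"), 16))     # largest 16-bit word value
print(64 * 1024)                    # bytes addressable with a 16-bit address
```

The same byte prints as 201, C9H, or 11001001B; nothing about the value itself says which reading is intended.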
On the earliest microcomputers, programs were entered as a series of such numbers, often with a row of eight mechanical switches: thus the sequence "on, on, off, off, on, off, off, on" would represent 11001001B, or C9H. This was incredibly tedious. Today, having plenty of memory available to work with, you can write assembly language like any other language, using an editor to create a text file; a special program, the assembler, will translate the statements you write (e.g., the mnemonic "RET" for return) into the appropriate machine code.

The assembler functions very much like a compiler for a higher-level language. The difference is that a language compiler will incorporate prewritten library routines to perform many common tasks, and allows you to do very complex things with just a few statements. Thus when you write something like:

    100 INPUT "DIAMETER:",D
    110 PRINT "CIRCUMFERENCE IS:",3.14159*D

you are actually invoking a whole set of routines (part of your BASIC interpreter or compiler) that print messages on the screen, get input from the keyboard, store and retrieve data values in memory, perform floating point arithmetic, and so on.

When you program in assembler, you have to write every single CPU instruction yourself. This can be a lot of work, since the CPU can basically do two things: move a byte from one place to another; and add, subtract, and do logical operations like "and" and "or" with byte values from 0 to 255.

Are you wondering how you would do floating point multiplication (C=3.14159*D) using an instruction set so primitive that it can only add and subtract integers from 0 to 255? The answer is that if you are sane, you wouldn't. There are tasks well suited to assembly language, and others best done in higher level languages. (Somebody has already written the floating point code that's part of your BASIC interpreter; take advantage of it.)
In assembler, stick to fundamentally lower level tasks, such as talking to your computer hardware (like memory and I/O ports), and manipulating disk files with the CP/M BDOS calls. For these purposes there is no better "language".

3. The Assembler

There are several common assemblers, but they all work in similar ways. CP/M 2.2's ASM is a good example of a basic 8080 assembler. MAC is a macro assembler, meaning that it lets you designate frequently-used blocks of code as "macros", and invoke them with a single name, much as you would a function call in another language -- this is just a convenience. RMAC is a relocatable macro assembler, meaning that it can produce output in a format that can be installed to run in different parts of memory as circumstances require; the usual assembler output is code intended to run only at address 0100H, the beginning of the TPA (transient program area) under CP/M. (This is not something you are going to need to worry about at first.)

Many commercial assemblers are also available, such as Microsoft's M80. Generally these are even more powerful, and frequently they can also take advantage of the expanded instruction set of the Z80 CPU. My personal favorites are SLR Systems' SLRMAC (8080) and Z80ASM, both of which are incredibly fast relocatable assemblers, and can also generate COM files directly. But unless you get as heavily involved in assembly language as I have recently, it won't much matter which you use.

The common procedure is:

    1) Write the source code with your favorite text editor.
    2) Run the assembler, typically producing a HEX output file.
    3) Generate an executable (COM) file from the HEX file.

The first step will require learning the assembler instruction set. The second is usually as easy as typing A>ASM PROG; see your computer documentation for (probably minimal) instructions on assembler usage.
The third is done using the HEXCOM utility under CP/M 3.0, or LOAD and SAVE under CP/M 2.2 (though a fine public domain utility called MLOAD is much easier than this combination).

4. Practical Tasks

Before we get into real assembler programming, it's worthwhile to note that frequently, what you need to do is not actually to write a program from scratch, but simply to get an existing program running the way you want. Good public domain utilities, for example, often allow a number of features to be changed, to allow proper operation on different computers, or just to conform to different tastes.

At the simplest level, the program's DOC file may just give a list of patching addresses. For example, the instructions for the (imaginary) XYZED text editor might include this information:

    ADDRESS   VALUE
    0130H     create BAKup files? (00=no, FF=yes)
    0131H     copy buffer size in bytes (0...3000H)

This indicates, for example, that you can get XYZED to create backup files or not, as you like, by changing a particular byte in the COM file. The easiest way to do this is to edit XYZED.COM with a utility like EDFILE, PATCH, or DU; find the value at address 0130H; and change it, if necessary, to what you want. That's all you have to do; XYZED must be designed to check the value it has at 0130H, and adjust its behavior accordingly.

Sometimes the installation process can be more complex. Modem programs, for example, typically have to have very different basic routines to talk to the I/O hardware of different computers. Here there will often be a whole "overlay": an assembler source file containing an actual listing of portions of the program. You will have to edit this file, then assemble it and merge it with the rest of the COM file. This can require knowledge of some basic assembly language, but sometimes it can also be as simple as changing data values.

Let's begin by considering a handful of simple assembler directives.
These are not actually CPU instructions at all; they are merely instructions to the assembler, regarding where to put code, and the insertion of data values. You will see these used frequently in overlay files.

5. Assembler Directives

ORG (origin): tells the assembler the address in memory at which the following code, or data, should be put. Most programs, e.g., begin with "ORG 0100H", since transient CP/M programs load in at address 0100H, the beginning of the TPA.

END: marks the end of an assembler source file.

EQU (equate): assigns a numerical value to a label. This isn't a "variable", as its value cannot change, and it generates no output code; it's merely a convenience.

DB, DW (define byte, define word): like the "DATA" statement in BASIC, instructs the assembler simply to insert the following numerical values at the current address in memory. Presumably the program is going to refer to them as data at some point.

Consider the XYZED program again. Instead of merely giving a table of patch information to go by, as described above, it might have provided you with an overlay file XYZEDOV.ASM which would include the following instructions:

    ;XYZEDOV.ASM installation overlay
    YES     EQU     0FFH
    NO      EQU     0
            ORG     0130H
    BAKFLG: DB      YES     ;create BAK files?
                            ; (yes or no)
    BUFSIZ: DW      0800H   ;copy buffer size,
                            ; in bytes
            END

The semicolon ";", like REM in BASIC, indicates that the rest of the line is simply a comment, to be ignored by the assembler.

The two EQUates tell the assembler to substitute the number FFH (255) everywhere "YES" occurs in what follows, and 0 for "NO". Not only is this convenient; it also makes the code more understandable, by making it clear that a value is logical (yes/no), rather than just an arbitrary number (like 255). This kind of thing always helps in assembly language, which is prone to be very confusing otherwise.

The ORG statement tells the assembler that the following code or data is to be put starting at address 0130H in memory.
In this case, XYZED.COM expects to find these data items at this address.

The labels "BAKFLG:" and "BUFSIZ:" are just for the purpose of identification here, though in an actual program, labels can function as names for variables or subroutines, as we'll see later.

The "DB YES" inserts one byte of data (in this case "YES", or FFH) at the current address (in this case 0130H, set by the ORG statement). The "DW 0800H" inserts a word (two bytes) of data at the current address (now 0131H, since the previous byte went at 0130H). In fact, two-byte values are stored "backwards" or low byte first, so the assembler is actually going to put the 00H at address 0131H, and then the 08H at 0132H. So this file has instructed the assembler to set up the following sequence of three data bytes:

    ADDRESS   DATA
    0130H     FFH
    0131H     00H
    0132H     08H

If you now assemble this file, with a command like

    A> asm xyzedov

you will get an output file XYZEDOV.HEX which contains the HEX version of this code, a compact (though still ASCII text) format frequently used as an intermediary between source code and (unreadable) machine code. If you looked at the HEX file, you would see something like this:

    :03013000FF0008C5

which can be read as "three bytes, starting at address 0130, as follows: FF, 00, 08". (The last value on the line is just a checksum byte for safety.)

You can then use a utility like MLOAD to merge this HEX file with the program XYZED.COM itself:

    A> mload xyzed.com=xyzed.com,xyzedov.hex

and you will have a new copy of the XYZED program, with the values changed accordingly.

6. Coming Up. . .

In future installments we'll learn about the 8080 CPU and its instruction set, and explain how to use CP/M BDOS calls.
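
A footnote on the HEX record from section 5, for readers who want to check it by hand. The sketch below is in Python rather than assembler, and it assumes the usual Intel-style record layout (byte count, two address bytes, a record type byte, the data, then a checksum chosen so that all the bytes of the record add up to zero modulo 256). For the three data bytes FF, 00, 08 at address 0130 that rule gives a checksum of C5. The function names, and the 64-byte stand-in for XYZED.COM, are made up for illustration; this is not how MLOAD itself is written.

```python
# Sketch only: decode one HEX data record and merge it into a COM image.
# decode_record(), checksum() and patch() are hypothetical names.

TPA = 0x0100   # COM files load at 0100H, so file offset = address - 0100H

def checksum(body):
    """Checksum byte that makes the whole record sum to 0 modulo 256."""
    return (-sum(body)) % 256

def decode_record(line):
    """Return (address, data) for one record; the record type byte is ignored."""
    if not line.startswith(":"):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:].strip())
    count = raw[0]
    if len(raw) != count + 5:           # count, 2 address, type, data, checksum
        raise ValueError("wrong record length")
    if sum(raw) % 256 != 0:
        raise ValueError("checksum mismatch")
    address = (raw[1] << 8) | raw[2]    # the address field is high byte first
    return address, raw[4:4 + count]

def patch(image, address, data):
    """Overwrite bytes of a COM image (a bytearray) at the given load address."""
    offset = address - TPA
    image[offset:offset + len(data)] = data

address, data = decode_record(":03013000FF0008C5")
image = bytearray(0x40)                 # stand-in for the start of XYZED.COM
patch(image, address, data)

print(f"{address:04X}H:", " ".join(f"{b:02X}" for b in data))
# The DW value is stored low byte first, so read it back as little-endian.
print(f"BUFSIZ = {int.from_bytes(image[0x31:0x33], 'little'):04X}H")
```

Note that address 0130H lands at offset 30H in the file, which is also where a patching utility would show it if it displays file offsets rather than load addresses.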