Spreadsheets As An Intermediate Step In Data Analysis

by Willie Lockeretz

When I first started using a Kaypro, I didn't expect to have much use for a spreadsheet. The little I knew about this type of program was gleaned from advertisements, which always featured things like financial planning, depreciation schedules, and profit and loss projections. As someone who knows nothing -- and cares even less -- about Presentation Graphics, Bottom Lines, or Corner Offices, I didn't see that there could be much in it for me. It seemed that the main shtick was that this or that spreadsheet could put me on the Fast Track, in turn letting me get Up the Corporate Ladder, at the top of which, presumably, I would find the Key to the Executive Bathroom.

Well, I was wrong. Nowadays I probably get more use out of spreadsheets than any other single category of software. But first I needed some consciousness-raising. For starters, this meant acknowledging that no matter what the numbers represent, a tabular array of numbers is a tabular array of numbers, and a program that lets you manipulate such an array quickly and conveniently could be very handy even if the numbers didn't happen to have dollar signs in front of them. Second, I realized that the manipulations you might perform with such a program could be very different from the operations the dressed-for-success types in the ads were doing. In fact, spreadsheets could be useful even if you didn't do any manipulations with them at all. That was the real breakthrough.

Spreadsheets might be great for calculating net present value or internal rate of return, but in my work I do statistical computations that would be impossible on a spreadsheet, things like multiple regression and factor analysis. For this I need a scientific statistics package. But rather than ruling out spreadsheets, I started using them to prepare the data for a statistics program that would do the real analysis. In other words, the fact that a spreadsheet can't do the kinds of calculations I need is not important -- I don't depend on it for calculations. But as an intermediary between raw data and statistical analysis, a spreadsheet has simplified my life immensely.

I happen to favor SuperCalc2, which works very nicely on a CP/M Kaypro, but the basic idea would hold for any reasonably powerful spreadsheet. I often work with moderately large arrays of demographic, economic, and agricultural data -- say, 12 numbers for each of 150 counties. I use SuperCalc2 for three purposes. First, it's a convenient way to enter the data. Second, it can help spot large errors (like an extra zero after a number). Third, it's a convenient way to compute additional variables from the raw data. After it has done all that, the statistics program is ready to take over.

A statistics package could have done those three tasks, too, but most don't do them as conveniently. And this added convenience should not entail any new problems: any good statistics program and any good spreadsheet should have at least one file format in common that permits them to be linked this way. That is, the spreadsheet should be able to write a file that the statistics program can use as input.

I begin an analysis with a blank spreadsheet that has the right column and row labels, but with a dashed line where each number is to go. Printed out, this is a very efficient form on which to copy numbers from a source that you can't take back to the office, such as the Census. (A laptop would eliminate the need for writing the numbers down at all, of course.) When you enter the data into the computer, the spreadsheet on the screen corresponds in every visual detail to the sheet from which you are reading the numbers -- the rows and columns are in the right order, the headings are identical, the column widths are the same, and so forth. This helps eliminate a very common error -- your eye skipping to the wrong line.

Checking for order-of-magnitude errors is easily done by asking the spreadsheet to report the maximum and minimum value in a row or column. For some kinds of data, values that are way out of line are obviously wrong. The average age in a county is not likely to be 272 -- 27.2 is more like it. But for some variables an extreme value is not necessarily an error. In that case, the trick is to construct a new variable whose range is more restricted than the original one. For example, the population of towns and cities might range from a few hundred to several million. But if you are using population data from two different census years, the spreadsheet can quickly compute the percentage change. If you see a change of +923% in a decade, there is a pretty good chance that a digit was repeated or a decimal misplaced -- exactly the kind of error that is most likely to occur with manual data entry.

If you are clever in concocting new variables you should be able to detect nearly every order-of-magnitude error. It's not a substitute for comparing each number to the source, of course. But that can be very monotonous and wearying, which means an error can slip past. It's nice to have a second line of defense.

Finally, most data analysis will require you to construct new variables from the raw data (I'm talking now about quantities that are of interest in their own right, not just as a way of detecting errors). I prefer doing this on the spreadsheet rather than with the statistics package because I can see the results before they get irrevocably stored with all the other data. A statistics package is a black box: you put in the data at one end, and you get results out the other. But you don't see what goes on in between. I like to keep an eye on the data as much as possible. With a spreadsheet you can inspect the newly created variables, get familiar with them, fondle them, do whatever you want with them (in private, and with their consent, of course). I prefer to relinquish control to the statistics package only after the spreadsheet has done everything it can.
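If you prefer to see the checking idea written out as a program rather than as spreadsheet formulas, here is a rough sketch of both checks in Python, assuming the entered data has been saved as a comma-separated file with one header row. The file name counties.csv and the column names are made up for illustration; on my Kaypro, all of this happens inside SuperCalc2 itself.

    # Sketch of the two range checks described above. "counties.csv",
    # "county", "pop1970", and "pop1980" are hypothetical names.
    import csv

    with open("counties.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Check 1: report each numeric column's minimum and maximum -- the
    # quick scan for order-of-magnitude errors (an average age of 272
    # would stand out immediately).
    for column in rows[0]:
        try:
            values = [float(r[column]) for r in rows]
        except ValueError:
            continue  # skip non-numeric columns such as county names
        print(f"{column}: min={min(values)}  max={max(values)}")

    # Check 2: derive a variable with a restricted range (percentage
    # change between two census years) and flag implausible values.
    for r in rows:
        pop70, pop80 = float(r["pop1970"]), float(r["pop1980"])
        change = 100.0 * (pop80 - pop70) / pop70
        if abs(change) > 50:  # a +923% decade is almost surely a typo
            print(f"check {r['county']}: population change {change:+.1f}%")

The 50% threshold is only a judgment call; the point is that the derived variable has a range narrow enough to make blunders stand out.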
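The same goes for the last two steps: computing the variables you actually want to analyze, looking them over, and only then writing the file the statistics package will read. A minimal sketch, again with hypothetical names, and with comma-separated files standing in for whatever common format your spreadsheet and statistics package happen to share:

    # Derive an analysis variable, inspect it, then write the hand-off
    # file for the statistics package. All names are hypothetical.
    import csv

    with open("counties.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # A new variable of interest in its own right, e.g. persons per acre.
    for r in rows:
        r["density"] = f"{float(r['pop1980']) / float(r['acres']):.3f}"

    # Look the new column over before committing it with everything else.
    for r in rows[:5]:
        print(r["county"], r["density"])

    # Hand off in a format both programs understand; the statistics
    # package takes over from here.
    with open("for_stats.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)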
In short, I use a statistics package for the things that only a statistics package can do. But a spreadsheet is a much more practical and pleasant way to do the tedious but necessary preliminaries: to enter the raw numbers, check them, and carry out first-level processing.

As soon as I started using SuperCalc2 this way, I found that it was a true anomaly in the world of commercial personal computer packages -- a program suitable for an even wider range of applications than its advertising claimed. It is anomalous in another way as well, having earned the highest tribute I have ever bestowed on anything in my overflowing box of non-bundled software: I own it legally.

-- from The Boston Kugel, published by The Boston Kaypro Users Group (BOSKUG) of the Boston Computer Society.