MULTIVARIATE ANALYSIS PACKAGE 1.6

Copyright 1985,86,87
Douglas L. Anderton
Department of Sociology
University of Chicago
1126 E. 59th Street
Chicago, IL 60637

These programs are released for distribution so long as 1) any charges
involved do not exceed the costs of media and mailing, and 2) no portion
of the programs is used for commercial resale.

Revision History:

01/12/87 - Added codebooks for variable names and missing values; see
           usage documentation below.
01/10/87 - Major optimization in the FACTOR eigen subroutines cut
           iterations by about 40% and gave the user control of tolerance.
01/09/87 - Minor revisions to DESCRPT, PLOT, CORREL, PARTIAL, CLUSTER,
           HYPOTHS and MANOVA.
01/08/87 - Fixed rollover bug in grand totals in CROSSTAB, and minor
           optimization.
01/07/87 - Substantially optimized TRANSFRM for a 16% speed increase.
01/05/87 - Fixed bug in REGRESS mean squares; converted to a Gaussian
           and LU solution.  Modified the GETCOR subroutine to get names.
08/25/86 - Modified TRANSFRM to allow leading minus signs on number
           entry and numbers up to 11 characters long.
06/25/86 - Fixed bug in group option and histograms in DESCRPT.
05/27/86 - New release.  Buffering added to TRANSFRM; added MANOVA
           program, simple 2-dimensional PLOT, and Kmeans CLUSTERing
           program.
04/21/86 - Added Spicer algorithm and weighted data to CORREL.
04/19/86 - Added (improved accuracy) Spicer algorithm, weighted data and
           'by' group computations to DESCRPT.
03/23/86 - Fixed IFS bug in TRANSFRM and sped it up considerably.
09/27/85 - Fixed critical bugs in CORREL with missing values.
09/24/85 - New release.  Transformations package, partial correlations,
           factor analysis and hypothesis tests.
09/13/85 - Fixed bug which dropped the sign of negative correlations
           from CORREL when read into REGRESS.
06/28/85 - Fixed bug in CROSSTAB (unidimensional addressing).
06/26/85 - Fixed bug in CROSSTAB (init row and col totals).
06/22/85 - New release.  CROSSTAB.
06/15/85 - First release.  DESCRPT, CORREL, REGRESS.

INTRODUCTION:

Mapstat is a very serious multivariate statistical analysis package
capable of meeting 90% or more of most users' analytical needs.  The
routines have, at this point, been well tested and provide the most
frequently used procedures of the relatively expensive statistical
packages without cost.  Source code is included for modifications and
elaborations at your own risk.  Eleven programs are included in this
sixth release of MAP:

 1) DESCRPT  - descriptive statistics and frequency histograms.
 2) CORREL   - correlation and covariance matrices.
 3) REGRESS  - multiple linear regression.
 4) CROSSTAB - n-way crosstabulation and association tests.
 5) TRANSFRM - data transformations.
 6) HYPOTHS  - simple hypothesis tests on means and variances.
 7) PARTIAL  - partial correlation coefficients.
 8) FACTOR   - principal axis factoring with rotations.
 9) CLUSTER  - kmeans clustering program.
10) PLOT     - simple 2-dimensional plots.
11) MANOVA   - multiple dependent variable analysis of variance.

Users are encouraged to REPORT BUGS and make REQUESTS for future
versions.  Do not release your own versions or modifications using the
copyrighted MAP or MAPSTAT logos, and abide by the above copyright
notice.

HARDWARE REQUIREMENTS:

MAP is written in version 2 (or 3) of Turbo Pascal ((c) Borland Intl).
It has been written to compile in less than 56k of TPA for those running
ZCPR3 or an alternative OS on 8-bit machines.  Only a few statements
must be altered to run the programs on MSDOS machines: change the
BDOS(0) calls to EXIT and try to compile.
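For example (a sketch only, since the surrounding code differs from
program to program), a CP/M exit written as

     Bdos(0);    { CP/M BDOS function 0: system reset, i.e. program exit }

would simply become

     Exit;       { leaves the main block under MSDOS Turbo Pascal }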
As I recall, only two or three other lines need to be changed out of all
the code herein for MSDOS version 3 Turbo.  PLOT contains printer control
codes for the EPSON MX80 in procedure Openfiles; modify these codes to
suit your printer.

DESIGN PHILOSOPHY:

First, MAP is written as a sequential case processor to avoid
memory-resident storage and achieve the greatest speed possible.  This
has several consequences: 1) the package contains powerful statistical
analysis programs without horrendous memory requirements; 2) however,
the cost is that for redundant functions such as histograms, regression
residuals, etc., the package currently requires multiple passes at the
data.  Even for large data sets the programs are sufficiently fast to
make such passes reasonable.

INPUT DATA REQUIREMENTS:

MAP expects to find your data in a free format with at least one blank
separating each variable and a newline at the end of each line.  All
variables for each case must be on a single line, i.e. newlines separate
records.  It will not accept alphanumeric data.  Programs assume all
data transformation has been performed (e.g. CROSSTAB expects a finite
number of values, not necessarily integer values).  These are the only
data requirements.  Codebook files containing variable names and missing
values are also allowed; see 'Running the Programs' below.

COMPILING THE PROGRAMS:

Use your Turbo Pascal ((c) Borland Intl) compiler to compile the
programs with the options set to a .COM file for MAPSTAT and a .CHN file
for all others.  Rename all except MAPSTAT to the names given in the
file MAPSTAT.PAS.  If you plan to run these programs under ZEX control
(highly recommended) then be sure to compile them under ZEX.  This is
done by putting all the distribution files along with your Turbo
compiler in a common access area and running the MINSTALL.ZEX file
included.  Alternatively, to compile one at a time, but under ZEX, just
enter:

     >ZEX :TURBO :

and then proceed as you normally would.

RUNNING THE PROGRAMS:

1. Data Input and Output Files - After invoking the programs they will
ask for the name of an input data file (or a file created from a prior
MAP run - for example, the output of CORREL is used by REGRESS), and the
name of an output file.  For printer output specify the filename as LST:
and for screen output specify CON:.  An exception is TRANSFRM, which
uses buffered output routines; it will accept LST: and CON: but will
send the output to the LIST.TMP and CONSOLE.TMP disk files respectively.

2. Codebook Variable Description Files - If the input to the program is
raw data (i.e. it is not one of the procedures which input a prior
CORREL matrix), then the program will ask for a codebook file.  The
codebook file contains three items of input for each variable in the
data file: (1) the column number, (2) a variable name of eight
characters, and (3) a missing value code.  Again, I repeat: one line
must be provided for each variable in the data file (whether it is used
in this particular analysis or not).  All three items must be provided
for each variable on a new line and separated by blanks.  For example,

     1 THISIS1  -9
     2 HERESTWO -1E37
     (etc.)

Note that eight spaces must be allowed for variable names; leave blanks
if necessary to fill out the string.  Note also that a missing value
code must be given for every variable.  The example above used MAPSTAT's
default value of -1E37 for missing data; this or another equally
implausible value may be given in the codebook.
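To illustrate (the values here are made up), a raw data file matching
the two-variable codebook above might begin:

     4.5      17
     -9       23
     6.2      -1E37

Each line is one case.  The -9 on the second line and the -1E37 on the
third would be read as missing observations of THISIS1 and HERESTWO
respectively.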
Alternatively, if the user specifies 'none' in answer to the codebook
file query, variable names will default to variable numbers and the
default missing value will be assumed.  This is not a recommended option
if you will return to your output sometime in the future.

3. Variable Column Identification - After the file names, the programs
will typically request the number of variables in the data file and then
the number of variables to be used in the present run.  For example, a
CORREL run might be made on a file containing lines for 500 cases, each
with 12 variables, only 4 of which are to be intercorrelated in the
present run.  The total number of variables would then be entered as 12
and the number for the present run as 4.  For each variable to be used
the program will request the column number of the variable (e.g. 1 for
the first variable, 2 for the second, etc.).  These are column numbers
in the raw file, not within the subset to be used.  In the above
example, if the first, third, sixth, and eleventh of the 12 variables
were to be used, the user would enter 1, 3, 6 and 11.

4. Specification of Groups, Weights and Special Variables - Occasionally
the programs will ask you to identify one of the variables for use in
weighting data, grouping data, as a dependent variable, etc.  Again,
reference is by the original column number in the input data set.  For
example, if the correlations in the example above were to be weighted by
population, which is contained as the sixth variable, you would identify
the weight as column 6, its position in the raw data file.  All of the
variables used as weights, groups, etc., must have been included in the
original number of variables to use and in the selection of columns for
the analysis.  That is, it would not be possible to specify, for
example, column 4 as a weight, since it was not included in the variable
list above.

5. Hints on Further Documentation - All other information necessary is
prompted for with what I hope are explicit prompts.  If you have
problems with the input queries, or with the interpretation of output,
refer to a statistics book.  Some of the multivariate routines are
recognizably influenced by the Fortran routines of Cooley and Lohnes in
their Multivariate Data Analysis book.  The Kmeans clustering routine is
found in almost any book on cluster analysis.  Some routines lifted from
numerical methods books, etc., have references in the source code.  The
transformation options are relatively well elaborated if you initially
specify CON: as the file from which transformations are to be input.
Once you become familiar with the program you can input transformations
from files.  Finally, I am eminently reachable for the near present at
the BBS number at the end of this file.  If you have any questions
regarding interpretation, etc., feel free to drop me a line.

6. Hints on Power Usage - There are a number of features which the
design philosophy of Mapstat precludes.  However, most of these features
are readily derivable by coupling TRANSFRM with the other programs.  For
example, many regression packages output residuals from the regression
and plots of the standardized residuals, etc.  Mapstat does not force
such a second pass through the data, since it is designed for large data
sets without retention of the data in memory.  If the user desires such
an analysis, the residuals can be readily computed using TRANSFRM and
then plotted with PLOT (a short sketch follows below).  Similarly,
FACTOR produces score coefficients which could be used to generate
factor scores for further analysis, etc.
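The residual computation just mentioned amounts to the following
arithmetic, shown here as a small Pascal sketch.  The variable names and
coefficient values are hypothetical; in practice you would reproduce the
same arithmetic with TRANSFRM, using the coefficients printed by
REGRESS, and then feed the resulting variable to PLOT.

     program ResidDemo;
     { Sketch only: residual arithmetic for one case, two predictors. }
     var
       Y, X1, X2, B0, B1, B2, YHat, Resid : real;
     begin
       B0 := 1.25;  B1 := 0.40;  B2 := -2.10;  { coefficients from a REGRESS run }
       Y  := 3.00;  X1 := 2.00;  X2 := 0.50;   { one case from the data file }
       YHat  := B0 + B1*X1 + B2*X2;            { predicted value }
       Resid := Y - YHat;                      { residual = observed - predicted }
       writeln('residual = ', Resid:8:3)
     end.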
Dummy variables can be coded through the recoding facilities in TRANSFRM
and used to compute complicated general linear model analyses of
variance (GLM/ANOVAs) through REGRESS (a small illustration appears at
the end of this file).  The list goes on, and on, and on.  The more you
know about statistics and what you are doing, the more you will find
these programs of use.  At the same time, if you are a basic user you
will probably not require more than the basic output provided by the
routines.

PROGRAM LIMITATIONS:

The addition of codebooks and transformation files makes these routines
roughly competitive with other micro statistics packages.  Given that
you have received them free of cost and, "omigosh," with the source
code, they are extremely flexible and useful tools for data analysis.

Both DESCRPT and CORREL now allow weighted data to be entered.  While
the Spicer algorithm provides good accuracy on computations in both
these programs, it is not as robust with weighted data.  The results are
sufficient for most purposes, but exercise caution with heavily weighted
data.

At this stage, with humble documentation, it is up to the user to look
at the type and variable declarations at the beginning of each program
to see what its limitations are on the number of variables, etc.  I
think if you are doing any REAL data analysis you will find the
provisions ample.

I have relied almost exclusively upon these routines in several analyses
published over the last couple of years, and they have been scrutinized
by a number of graduate students and colleagues.  While I can't
guarantee that a revision won't create some obscure bug, I can assure
you there are no subtle bugs of any significance for regular data
analysis.  As with all statistical software, you should avoid absurd or
extreme input values.

Leave messages on the LILLIPUTE ZNODE (312-649-1730).
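A small illustration of the dummy coding mentioned under 'Hints on Power
Usage' (the variable name and values are hypothetical): a three-category
variable GROUP coded 1, 2, 3 can be recoded with TRANSFRM into two 0/1
indicators, say D2 and D3,

     GROUP   D2   D3
       1      0    0
       2      1    0
       3      0    1

and a REGRESS run with D2 and D3 as the predictors then reproduces a
one-way analysis of variance on GROUP, with category 1 as the reference
group.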