DataLoad

Author:

D. Baird
AgResearch
PO Box 60
Lincoln, New Zealand
EMail: BairdD@AgResearch.CRI.NZ

Program Specs

DataLoad - Translates Spreadsheet/Database files into Genstat/Gauss format.
Copyright D. Baird, AgResearch, PO Box 60, Lincoln, New Zealand 1998-2001.

DataLoad - Translates Spreadsheet/Database files into InStat/GenStat format.
Copyright  D. Baird, AgResearch, PO Box 60, Lincoln, New Zealand 1998-2001.
           David.Baird@AgResearch.CO.NZ

DataLoad is provided in the public domain as freeware and carries no 
warranties for its usage. Use it at your own risk.

For any bugs/comments email David Baird at David.Baird@AgResearch.CO.NZ.

DataLoad can be used with the non Windows version of GenStat by using the GEN
output format which creates file containing GenStat directives which any 
version of GenStat can run.

The default version of DataLoad runs only under Windows 95/NT and can handle
up to 32,000,000 data items.

In spreadsheet files, the data extracted are labels, numerical values
and the results from formulae. A label of * in otherwise numerical column
is taken as a missing value. Empty cells are taken as missing values.
Empty rows at the start and end of a block are trimmed, but empty rows 
in the middle of a block are kept. Empty columns are also ignored. 
A single row of labels as the first non-empty row in the block will be
taken as column names. Empty cells in this row will have a default name
C1, C2 etc assigned. A column name ending with an exclamation mark ! 
will be converted to a factor in InStat/GenStat. In the GenStat procedure,
the type of a column can be forced by providing a !,# or $ character
on the end of the text items provided in the column names.

Supported File Types: Excel 2-5,95,97,2000, Lotus WK1, Quattro WQ1,WB*, dBase 2-5,
Paradox 3-5, GenStat GSH, SAS PC 6.03-12, Minitab 8-13, Systat, MStat, Instat,
Epi-Info, SPSS/Win, Gauss Data/Matrix (PC/Win/Unix), MatLab, S+ (PC/Unix/Text)
Stata 4-7, SAS Transport, Logbook Excel format, ArcView/Info Shapefiles, 
MapInfo Interchange (MIF), CSV/Tab/space delimited text files, 
Windows BMP bit mapped graphics files and Windows WAV sound files.

The file type is worked out from the contents, so the usual extension need not
be used with the exception of the following file types which do not contain a 
unique signature: Epi-Info (.REC), S+ Text (.SDD) and Paradox (.DB).
Any files not containing a unique file signature, but ending in these
extensions classified as above.

DataLoad can save the output file in various formats. The format is taken
from the output file extension:

.GSH - a GenStat spreadsheet file.
.GEN - a GenStat command file.
.XLS - an Excel 95 spreadsheet file.
.WQ1 - a Quattro spreadsheet file.
.DBF - a dBase III worksheet file.
.WOR - an Instat+ worksheet file.
.FMT - a Gauss 86 matrix file.
.MAT - a Matlab 4 matrix file.
.SDD - a SPlus data frame file.
.TPT - a SAS Transport file.
.TXT - a space delimited text file.
.CSV - a comma delimited text file.
.TAB - a tab delimited text file.
.RTF - a rich text format file (Word).
.HTM - a hypertext markup format file (Internet viewable).
.HTML - a hypertext markup format file (Internet viewable).

Usage: DataLoad Inputfile/options Outputfile/options  

Options must each be preceded with a / (i.e. /iv should be /i/v) and must
not be separated from the filename by spaces (i.e. file.xls/w not file.xls /w).
 
   Inputfile  - name of file to extract data from.
   
         /f     An option of /f can be added to Instat filenames to not output
                factors with more than 223 levels. Otherwise the if the factor has 
                labels, levels greater than 233 will be set to missing, or else the
                factor will be written as a variate.

         /i     An option of /i can be added to SPSS/Win filenames to ignore 
                SPSS missing value codes.
   
         /l     An option of /l on an Excel file specifies that the Logbook
                conventions should be used in reading the data. The column_row
                range is found and used to extract the data. If multiple
                column_row ranges are contained in the spreadsheet, the data
                can be extracted from subsequent ranges by adding the range
                number, i.e. /l=, eg /l=3 for the third column_row range.

         /m=    An option of /m can be added to filenames to specify the three
                Instat missing value codes, which are supplied in a comma
                delimited list. A asterisk can be used for the generic missing
                value indicator. E.g. /m=*,-99,-9999 specifies the three
                Instat missing values will be set as -1e309, -99, -999 if
                a Instat work sheet is saved, and if an Instat worksheet is
                being read, that the three Instat missing values are converted
                to these values

         /n     No column name row
         
         /p=    For .GEN output a /p= can be added to create
                a pointer with the given name pointing to all columns read
                in. /v and /p= are mutually exclusive, but /i can be used 
                with either.
                
         /t=    Strictness of text conversions to numerical values:
                1 = Strict (label must be an exact representation of a number)
                2 = Single (a single letter commonly entered as a number will
                            be converted to a digit (o,O = 0, i,I,l,L = 1, 
                            z,Z = 2 s,S = 5, , = .) 
                3 = Common (any of the letters commonly entered as a number will
                            be converted to a digit (as above) 
                4 = Standard a numerical value, after making common substitutions
                    will be read from the label up to the first non numerical character.
                5 = Lax - Any numerical values will be read from the string
                          so that 'A1B2' would be read as 12.
               
                A file TxtToNum.ERR will be produced that list all text to numerical
                conversions made. 
               
         /u     For an Excel or Quattro file, the option /u produces a unsorted 
                factor levels or labels. The factor labels/levels are put into
                the order that they are first met in the data. 

         /v     An option of /v can be added to take variable names from a 
                .VAR file, i.e. for test.xls, a file test.var will contain the
                variable names, one per line. The column type can be forced
                by providing a single character code (!,#,$ for factor, 
                variate and text respectively) at the end of the supplied name.
                If this file does not exist it will be created.
   
         /w     For an Excel or Quattro file, the option /w produces a print out 
                of the sheet names and named ranges and stops, producing no output 
                file.

         /x=    An option of /x=string can be added to interpret the string as
                as missing value in Excel. The default string is *. For example
                /x=M will interpret a label of M in a cell as a missing value.
      
   Outputfile - name of file to write data to - extension determines file type:
                .GSH - GenStat spreadsheet file loaded with SPLOAD in a batch
                       file or File>Open in the Windows interface.
                .GEN - GenStat command file loaded with INPUT.
                .FMT - Gauss Matrix File (no text columns output).
                .XLS - Excel 5 Spreadsheet file or Logbook file.
                .WQ1 - Quattro Spreadsheet file.
                .DBF - dBase 3 database file.
                .MAT - Matlab matrix file.
                .WOR - Instat+ worksheet.
                .SDD - SPlus transport file.
                .TPT - SAS transport file.
                .TXT - ASCII Text file (space delimited).
                .CSV - ASCII CSV comma delimited text file.
                .TAB - ASCII tab delimited text file.
                .RTF - a rich text format file (Word).
                .HTM - a hypertext markup format file (Internet viewable).
                .HTML - a hypertext markup format file (Internet viewable).
                
         /*     For a GEN output file, the option /* stops the standard RETURN
                statement being added at the end of the file.

         /b     The option /b produces an ENDBREAK statement at the end of a GEN
                output file rather than the standard RETURN statement.
                Note filenames containing the text G42D are reserved for temporary
                file type in conjunction with the GenStat procedure DATALOAD.
                
         /c     The option /c produces combines colours from an BMP file into
                a single RGB value (red + 256*green + 256*256*blue) or 
                stereo channels in a WAV sound file into a average mono channel.

         /l=    The number of layout factors in the logbook format. 
                Eg /l=3 specifies the first two factors in the data are layout
                factors. Subsequent factors will be marked as treatments.
                
         /p     The option /p is reserved also for use by the GenStat procedure.
         
         /q     No data summaries produced in running *.GEN file

         /s     An option of /s can be added to produce suffixed identifiers
                i.e. C[1] C[2] etc where columns are not named.
                
         /v=    An option of /v=1 or /v=2 or /v=3 can be added to define
                the version of GSH spreadsheet file produced. (Default version 3).
                
         /-     Don't quote text in CSV file.


   Sheet      - optional name of worksheet or named range within a spreadsheet
                file to take data from (enclose names containing spaces in
                quotes eg "Sheet One" (Windows 95/NT only I think).
                Alternatively, a sheet number can be used i.e. 1,2,3...
                The default is the first worksheet in the file.
                
   Range      - optional range of cells within a worksheet to take data from
                eg B3:D12 or AB2:AZ1201. If you do not know the last cell
                just enter an extreme value for the address, eg, IV64000.
                Just the column names can be entered to select all data in
                a column i.e. B:D will extract all data in columns B, C & D.
                The default is all valid data cells on the worksheet.
                The cell addresses in the range can leave out the row or
                column components, in which case the maximal range will be 
                used, eg: B:Z, 3:102 are valid ranges equivalent to 
                B1:Z64000 and A3:IV102 respectively. Also just the starting
                cell can be provided (eg D48 which is equivalent to D48:IV64000)
                
Examples: DataLoad Test.DBF Test.GSH/s/v=1      (suffixed identifiers C[1] in V1 format)
          DataLoad Test.XLS Test.GEN Sheet1
          DataLoad Test.XLS Test.GEN 2          (i.e. take the second sheet)
          DataLoad Test.WB2 Test.GSH "Sheet One" B2:AA1001
          DataLoad Test.XLS Test.GEN DataRange  (a named range)
          DataLoad Test.WB2 Test.TXT A C1:K100
          DataLoad Test.SAV/i Test.GEN          (SPSS Missing codes ignored)
          DataLoad Test.SAV/v Test.GEN          (Take column names from TEST.VAR)
          DataLoad Test.SAV/p=Data Test.GEN     (Create pointer called Data)
          DataLoad Test.SAV Test.XLS            (Create an Excel file)
          DataLoad Test.SAV Test.DBF            (Create an dBase file)
          DataLoad Test.SAV Test.FMT            (Create an Gauss matrix file)
          DataLoad Test.SAV Test.WOR            (Create an Instat file)
          DataLoad Test.SAV Test.SDD            (Create an SPlus file)
          DataLoad Test.XLS/l Test.GSH          (Read data in Logbook format)
          DataLoad Test.XLS Test.TAB            (produce a tab delimited file)
          DataLoad Test.XLS Test.CSV            (produce a comma delimited file)
          DataLoad Test.MST Test.XLS/l=2        (Create Excel file in logbook format
                                                 with two layout factors)

DataLoad.EXE is a Windows 95/NT command line utility. To run it start a
command window (using the MSDOS icon) and enter the DataLoad command
as above. If running DataLoad, the input files can be prefixed with
their directories, if they are in another directory. eg
c:\gen5ed\bin\dataload c:\excel\test.xls c:\data\test.gsh

DataLoad.UNI is a UNIX Motif executable. It needs to be renamed to
DataLoad.EXE to be used on a UNIX system.

An icon on the desktop can be created for DataLoad by right clicking
the desktop, and using the New>Shortcut menu item. Click the browse
button and select DataLoad.EXE and click OK. Give it a name DataLoad
and click OK. Now select the new icon, right click it and choose the
Properties menu item. In the Target line add the files and options
as above, and in the Working directory field enter the directory
where the files are located, then Click OK. Now double clicking this
icon will transfer the data between the two specified files.

____________________________________________________________________
Dr David Baird, Biometrician  EMail: David.Baird@AgResearch.CO.NZ
Mail: AgResearch, PO Box 60, Gerald St, Lincoln, NEW ZEALAND
Phone: +64 3 983 3975   Fax: +64 3 983 3946