Home > Informatics >

Format Guidelines for Tabular Data

Data Table Formatting

Data tables can be submitted as ASCII text files or spreadsheet files in Microsoft Excel or Lotus 1-2-3 format.  The files should be given a descriptive name, such as the investigator's name, study description, and date submitted, with the appropriate extension for the file format (e.g. Sheldon_weather_020101a.txt, Sheldon_hydro_020101.xls).

General Guidelines

  • The data table must be laid out as a rectangular "matrix" with an equal number of data values in each row.

  • Any missing observations should be represented by a missing value code, such as NaN for numerical columns and NaN or a user-defined code for text columns.

  • Each column in the data set should only contain one distinct type of information, so that columns represent variables and rows represent individual observations for each variable.
  • 25-Apr-2008 each row should be separated by a single tab, single comma, or one or more spaces as field delimiters.  The same delimiter should be used throughout the data table, and combinations of spaces and tabs or commas are not allowed (e.g. use 32.3,32.5 rather than 32.3,  32.5).

  • When column format headers are included (see below) different delimiters can be used on each row, but do not use combinations of delimiters within a single row to avoid parsing errors.  Also note that 'Descriptions' must be delimited by tabs, commas, or semicolons to preserve spaces between words.

  • Individual text values should not contain any internal spaces or commas to prevent data parsing errors, although other characters, such as underscores, are allowed (if necessary, placeholder characters can be substituted for commas and spaces temporarily, and then replaced after data processing to restore the original values).

  • The data file should not contain any header or footer information, except for optional column format headers as described below.

  • If a word processor is used to generate the file, make sure that word wrap is disabled and that line break characters are included on every row.

  • Use a file extension of .txt or .asc to distinguish the file as ASCII and prevent confusion.

Spreadsheets

  • When submitting spreadsheets make sure that columns are formatted consistently and numerical precisions are set appropriately, because data will be processed as formatted (i.e. non-displayed digits will be lost).

  • It is a good practice to recalculate the spreadsheet then convert all formula cells to values to prevent update anomalies caused by complex formula logic or linked cells (e.g. copy cells, then paste values and formats only).

  • If date/time information is included in the table, make sure to inform the Data Manager and specify how this data should be processed.  Spreadsheet programs often "coerce" date and time information into proprietary numerical date formats during data entry.  Numerical date standards differ among programs and platforms so this is a potential source of data contamination.  To avoid this problem, manually format date/time information in one of the following formats:
  1. As individual columns containing integer date components (e.g. 02-Feb-2001 06:00PM as columns Year, Month, Day, Time with values 2001, 2, 2, 1800)
  2. As fractional Julian days based on Jan. 1, 0000 as 1 (e.g. 730884.750)
  3. As integer year (e.g. 2001) and Julian date/Year date (i.e. days since Jan. 1 of the same year, e.g. 32.75)
  • Documentation and column format information can be included in the same workbook as the data table, but generally avoid including extraneous worksheets, plots, graphical elements, or comments interspersed with data when submitting spreadsheets.

Data Column Format Information

Specific format information for each column in the data set must be provided by the contributor at the time of submission.  This information is needed to complete the Data Structural Descriptors portion of the metadata, and to support online data analysis and dynamic document generation.  Note that some format selections require corresponding information in other sections of the data documentation, as indicated.

The format information can be provided separately in list format, or elements can be included as header rows at the top of each data column with values delimited like the data values.  When including header rows, precede the first value in each row with the attribute name and a colon (e.g. Names:Salinity,Temperature,Depth,...), and do not use more than one delimiter between values (e.g. do not include spaces when delimiting by commas).  Note that individual column descriptions must be delimited by tabs, commas, or semicolons in order to preserve spaces between words.  All attributes are required, except for Descriptions (Names will be substituted if omitted).

Attribute Description and Formatting Examples
Names See Site standards
(no internal spaces, symbols are discouraged,
mixed-case text and underscores are supported)
Salinity,Stem_Count,Marsh_Zone
Descriptions Description of each column in the table Pore-water salinity,Number of Spartina stems in the plot,Marsh zone code
Units Measurement units for data in each column
(no internal spaces, symbols are allowed)
PSU,count,none
DataTypes Physical data type of each column
   F = floating-point number
   I = integer number
   S = alphanumeric string
F,I,S
VariableTypes Logical data type of each column
   Data (measured data values)
   Calculation (calculated values)1
   Nominal (categorical values)2
   DateTime (date and/or time values)3
   Logical (Boolean values, e.g. 1/0, True/False)
   Description (textual observation)
   Code (strings matching items in a list)1
Data,Data,Code
NumberTypes Numerical data type of each column
   Continuous (continuous numbers)
   Discrete (discontinuous numbers)
   Angular (angular numbers, in degrees)
   None or N/A (not applicable, e.g. strings)
Continuous,Discrete,None
Precisions Decimal places to display for each column
(i.e. indicating decimal places to use when 
exporting values as text files, HTML tables)
2,0,0
FlagCriteria Flag criteria expression for each column4 x<0='?';x>36='H',x<0='?',none
 
  1. Equations, code lists, and flag character definitions are required in the data documentation.  This information will be stored in the following section of the Metadata:

          Metadata Category IV: Data Structural Descriptors
               Equations in:  Data_Calculations
               Code lists and flag character definitions in:  Data_ValueCodes
               Flag criteria in:  Data_FlagCriteria

  2. Categorical values can be numbers or strings that identify some fixed property of the observation, such as site or plot, but aren't data values in themselves and won't be used for calculations or statistics (other than as grouping columns).

  3. Date and time values can be expressed as numerical components (e.g. year, month, day, hour columns) or formatted strings (e.g. 09-Feb-2001) as appropriate for the data set - by default, the datetime variable type designation will prevent these columns from being included in statistical analyses regardless of data type.

  4. Flag criteria are text expressions containing one or more equations per column defining flag characters to associate with numerical values that match the specified criteria.  Flagging questionable data values is a powerful quality control technique, so criteria should be defined whenever possible.  (Note: value flags are maintained separately from data in the MATLAB data structure, so flagged values are not removed from the data set by default.  Criteria can be edited and reevaluated any time, and flags can be suppressed or displayed in various formats when the data set is saved as text or HTML).

    Flag criteria syntax is as follows (or contact the Data Manager for assistance):
              x[conditional][value]='[flag character]' , where:
                        [conditional] is <, >, <=, >=, ~= (or <>), == (or =)
                        [value] is a numerical data value
                        [flag character] is any single text character, symbol, or digit

    Multiple equations can be specified per column by using a semicolon to separate each one (overlapping criteria are supported, allowing flag characters to be combined when more than one criteria is matched)

    For example:  x<0='L';x>30='H' generates 'L' flags for negative values, 'H' flags for values over 30

    (Note: contact the data manager for assistance assigning criteria for flagging coded values)

Appendix: Data types

Physical data type specifies how the values are represented and stored in the data set, whereas logical data type specifies the information content of the column (i.e. variable type or domain) and numerical data type specifies the numerical characteristics of the column.  These three column attributes are used by GCE-LTER Data Structure tools to display and analyze values appropriately (e.g. statistics requiring ratio data, such as mean and standard error, will only be calculated on Floating-point numbers with logical types of Data or Calculation, whereas columns with any combination of data types can be enumerated or used for querying, sorting, or aggregating rows).

Supported combinations of these column attributes are listed in the following table, along with descriptions and common usage examples.  Note that some logical data types can be represented by several physical data types, depending on the form of the data and requirements of the investigator.

Data Type Category Description and Examples
Physical Logical Numerical
Floating-point number Data Continuous measurements of continuous variables
(e.g. salinity, temperature, elapsed time)
Angular measurements of directional parameters
(e.g. wind direction)
Calculation* Continuous calculations on continuous data
(e.g. fractions, percentages, rates)
Angular calculations on directional parameters
(e.g. wind direction)
DateTime Continuous fractional numerical date values
(e.g. fractional Julian day)
Integer number Data Discrete count or abundance data
Calculation* Discrete calculations on count data
(e.g. min, max, sum, median)
DateTime Discrete integer date components
(e.g. year, month, day, hour)
Nominal** Discrete digits representing fixed categories
(e.g. sites, plots, zones)
Logical** Discrete binary or numerical Boolean values
(e.g. 1 = present, 0 = absent)
Text string DateTime NA or
blank
formatted date or time strings
(e.g. 01-Feb-2001)
Logical** NA or
blank
Textual Boolean values
(e.g. True/False, Present/Absent)
Description NA or
blank
textual observation, taxonomic names
(e.g. color, sex, appearance)
Code** NA or
blank
alphanumeric code strings
(e.g. species code, description code) 

* equations must be included in the data documentation (Category IV: Data_Calculations)
** value definitions or code lists must be included in the data documentation (Category IV: Data_ValueCodes)

   Data Formatting
LTER
NSF

This material is based upon work supported by the National Science Foundation under grants OCE-9982133, OCE-0620959 and OCE-1237140. Any opinions, findings, conclusions, or recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.