|
|
Format Guidelines for Tabular Data
Data Table Formatting
Data tables can be submitted as ASCII text files or spreadsheet files in Microsoft Excel or Lotus 1-2-3
format. The files should be given a descriptive name, such as the
investigator's name, study description, and date submitted, with the
appropriate extension for the file format (e.g. Sheldon_weather_020101a.txt,
Sheldon_hydro_020101.xls).
General Guidelines
- The data table must be laid out as a rectangular "matrix"
with an equal number of data values in each row.
- Any missing observations should be represented by a missing value code,
such as NaN for numerical columns and NaN or a user-defined code for
text columns.
- Each column in the data set should only contain one distinct type of
information, so that columns represent variables and rows represent
individual observations for each variable.
- 25-Apr-2008 each row should be separated by a single tab, single comma,
or one or more spaces as field delimiters. The same delimiter
should be used throughout the data table, and combinations of spaces
and tabs or commas are not allowed (e.g. use 32.3,32.5 rather than
32.3, 32.5).
- When column format headers are included (see
below) different delimiters can be used on each row, but do not
use combinations of delimiters within a single row to avoid parsing
errors. Also note that 'Descriptions' must be delimited by tabs,
commas, or semicolons to preserve spaces between words.
- Individual text values should not contain any internal spaces or
commas to prevent data parsing errors, although other characters, such
as underscores, are allowed (if necessary, placeholder characters can
be substituted for commas and spaces temporarily, and then replaced
after data processing to restore the original values).
- The data file should not contain any header or footer information,
except for optional column format headers as described
below.
- If a word processor is used to generate the file, make sure that word
wrap is disabled and that line break characters are included on every
row.
- Use a file extension of .txt or .asc to distinguish the file as ASCII
and prevent confusion.
Spreadsheets
- When submitting spreadsheets make sure that columns are formatted
consistently and numerical precisions are set appropriately, because
data will be processed as formatted (i.e. non-displayed digits will be
lost).
- It is a good practice to recalculate the spreadsheet then convert all
formula cells to values to prevent update anomalies caused by complex
formula logic or linked cells (e.g. copy cells, then paste values and
formats only).
- If date/time information is included in the table, make sure to
inform the Data
Manager and specify how this data should be processed.
Spreadsheet programs often "coerce" date and time
information into proprietary numerical date formats during data
entry. Numerical date standards differ among programs and
platforms so this is a potential source of data contamination.
To avoid this problem, manually format date/time information in one of
the following formats:
- As individual columns containing integer date components (e.g.
02-Feb-2001 06:00PM as columns Year, Month, Day, Time with values
2001, 2, 2, 1800)
- As fractional Julian days based on Jan. 1, 0000 as 1 (e.g.
730884.750)
- As integer year (e.g. 2001) and Julian date/Year date (i.e. days
since Jan. 1 of the same year, e.g. 32.75)
- Documentation and column format information can be included in the
same workbook as the data table, but generally avoid including
extraneous worksheets, plots, graphical elements, or comments
interspersed with data when submitting spreadsheets.
Data Column Format Information
Specific format information
for each column in the data set must be provided by the contributor at the
time of submission. This information is needed to complete the Data
Structural Descriptors portion of the metadata, and to support online data
analysis and dynamic document generation. Note that some format
selections require corresponding information in other sections of the data
documentation, as indicated.
The format information can be
provided separately in list format, or elements can be included as header rows
at the top of each data column with values delimited like the data
values. When including header rows, precede the first value in each row
with the attribute name and a colon (e.g. Names:Salinity,Temperature,Depth,...),
and do not use more than one delimiter between values (e.g. do not include
spaces when delimiting by commas). Note that individual column
descriptions must be delimited by tabs, commas, or semicolons in order to
preserve spaces between words. All attributes are required, except for
Descriptions (Names will be substituted if omitted).
| Attribute |
Description
and Formatting |
Examples |
| Names |
See Site standards
(no internal spaces, symbols are discouraged,
mixed-case text and underscores are supported) |
Salinity,Stem_Count,Marsh_Zone |
| Descriptions |
Description of each column in the
table |
Pore-water salinity,Number of Spartina
stems in the plot,Marsh zone code |
| Units |
Measurement units for data in each
column
(no internal spaces, symbols are allowed) |
PSU,count,none |
| DataTypes |
Physical data
type of each column
F = floating-point number
I = integer number
S = alphanumeric string |
F,I,S |
| VariableTypes |
Logical data
type of each column
Data (measured data values)
Calculation (calculated values)1
Nominal (categorical values)2
DateTime (date and/or time values)3
Logical (Boolean values, e.g. 1/0, True/False)
Description (textual observation)
Code (strings matching items in a list)1 |
Data,Data,Code |
| NumberTypes |
Numerical data type of each column
Continuous (continuous numbers)
Discrete (discontinuous numbers)
Angular (angular numbers, in degrees)
None or N/A (not applicable, e.g. strings) |
Continuous,Discrete,None |
| Precisions |
Decimal places to display for each
column
(i.e. indicating decimal places to use when
exporting values as text files, HTML tables) |
2,0,0 |
| FlagCriteria |
Flag criteria expression for each
column4 |
x<0='?';x>36='H',x<0='?',none |
-
Equations, code lists, and flag character definitions are required in
the data documentation. This information will be stored in the
following section of the Metadata:
Metadata Category IV: Data
Structural Descriptors
Equations
in: Data_Calculations
Code lists
and flag character definitions in: Data_ValueCodes
Flag
criteria in: Data_FlagCriteria
-
Categorical values can be numbers or strings that identify some fixed
property of the observation, such as site or plot, but aren't data
values in themselves and won't be used for calculations or statistics
(other than as grouping columns).
-
Date and time values can be expressed as numerical components (e.g.
year, month, day, hour columns) or formatted strings (e.g. 09-Feb-2001)
as appropriate for the data set - by default, the datetime variable type
designation will prevent these columns from being included in
statistical analyses regardless of data type.
-
Flag criteria are text expressions containing one or more equations per
column defining flag characters to associate with numerical values that
match the specified criteria. Flagging questionable data values is
a powerful quality control technique, so criteria should be defined
whenever possible. (Note: value flags are maintained separately
from data in the MATLAB data structure, so flagged values are not
removed from the data set by default. Criteria can be edited and
reevaluated any time, and flags can be suppressed or displayed in
various formats when the data set is saved as text or HTML).
Flag criteria syntax is as follows (or contact the Data
Manager for assistance):
x[conditional][value]='[flag character]' , where:
[conditional] is <, >, <=, >=, ~= (or <>), == (or =)
[value] is a numerical data value
[flag character] is any single text character, symbol, or digit
Multiple equations can be specified per column by using a semicolon to
separate each one (overlapping criteria are supported, allowing flag
characters to be combined when more than one criteria is matched)
For example: x<0='L';x>30='H' generates 'L' flags for
negative values, 'H' flags for values over 30
(Note: contact the data manager for assistance assigning criteria for
flagging coded values)
Appendix: Data types
Physical data type specifies
how the values are represented and stored in the data set, whereas logical
data type specifies the information content of the column (i.e. variable type
or domain) and numerical data type specifies the numerical characteristics of
the column. These three column attributes are used by GCE-LTER Data
Structure tools to display and analyze values appropriately (e.g. statistics
requiring ratio data, such as mean and standard error, will only be calculated
on Floating-point numbers with logical types of Data or Calculation, whereas
columns with any combination of data types can be enumerated or used for
querying, sorting, or aggregating rows).
Supported combinations of
these column attributes are listed in the following table, along with
descriptions and common usage examples. Note that some logical data
types can be represented by several physical data types, depending on the form
of the data and requirements of the investigator.
| Data Type Category |
Description and
Examples |
| Physical |
Logical |
Numerical |
| Floating-point
number |
Data |
Continuous |
measurements of continuous variables
(e.g. salinity, temperature, elapsed time) |
| Angular |
measurements of directional parameters
(e.g. wind direction) |
| Calculation* |
Continuous |
calculations on continuous data
(e.g. fractions, percentages, rates) |
| Angular |
calculations on directional parameters
(e.g. wind direction) |
| DateTime |
Continuous |
fractional numerical date values
(e.g. fractional Julian day) |
| Integer number |
Data |
Discrete |
count or abundance data |
| Calculation* |
Discrete |
calculations on count data
(e.g. min, max, sum, median) |
| DateTime |
Discrete |
integer date components
(e.g. year, month, day, hour) |
| Nominal** |
Discrete |
digits representing fixed categories
(e.g. sites, plots, zones) |
| Logical** |
Discrete |
binary or numerical Boolean values
(e.g. 1 = present, 0 = absent) |
| Text string |
DateTime |
NA or
blank |
formatted date or time strings
(e.g. 01-Feb-2001) |
| Logical** |
NA or
blank |
Textual Boolean values
(e.g. True/False, Present/Absent) |
| Description |
NA or
blank |
textual observation, taxonomic names
(e.g. color, sex, appearance) |
| Code** |
NA or
blank |
alphanumeric code strings
(e.g. species code, description code) |
* equations must be included in the data documentation (Category
IV: Data_Calculations)
** value definitions or code lists must be included in the data
documentation (Category IV:
Data_ValueCodes)
|