GCE Data Toolbox for MATLAB®
The GCE Data Toolbox is a comprehensive software framework for metadata-based analysis, quality control, transformation and management of ecological data sets. The toolbox is a free add-on library to the MATLAB® technical computing language, based on a MATLAB data model for storing tabular data along with all information required to interpret the data and generate formatted metadata (documentation). The various metadata fields in the structure are queried by toolbox functions for all operations, allowing functions to process and format values appropriately based on the type of information they represent. This 'semantic processing' approach supports highly automated and intelligent data analysis and ensures data set validity throughout all processing steps.
All GCE-LTER data products are distributed in data structure format, and data can also be imported from a wide variety of local sources (e.g. environmental data loggers, delimited text files, database queries, standard MATLAB files) and online databases (e.g. LTER ClimDB, USGS NWIS, NOAA NCDC, NOAA HADS, LTER NIS). Users can also add their own import filters and metadata templates to extend the toolbox to other data types and workflows. Interactive GUI forms are provided, along with a function library for building custom workflows for unattended processing.
Some common end-user tasks that can be performed using the GCE Data Toolbox include:
The toolbox and specification were developed using the MATLAB® programming language (The MathWorks), and require MATLAB 6.5 (R13) or higher to run. However, a complete suite of graphical user interface programs is provided to augment the command-line functions, allowing users with no prior MATLAB experience to use the toolbox with minimal instruction. MATLAB is compatible with all major computer operating systems, including Microsoft Windows®, Unix/Linux, Sun Solaris®, and Apple Mac OS X®.
Data Structure Format
Data values are stored in structures in a virtual table format, with data organized as a series of single-column arrays, each containing one type of information (i.e. a single variable). All arrays contain the same number of rows, and each row represents one record or observation across the data columns, as in a relational database table. The major attributes of each column (i.e. data descriptor metadata, such as column names, units, data types, precisions) are stored as matching arrays in individual structure fields.
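The layout described above can be illustrated with a small, language-neutral sketch. It is written in Python rather than MATLAB and is not the toolbox's actual structure specification: the class name, field names, data type labels, and sample values are all assumptions made for illustration. It shows only the general idea of parallel attribute arrays (one entry per column) and equal-length column arrays.

```python
from dataclasses import dataclass


@dataclass
class DataTable:
    """Hypothetical column-oriented table; not the GCE Data Structure spec."""
    names: list       # one column name per column
    units: list       # one unit string per column
    datatypes: list   # one (made-up) data type label per column
    precisions: list  # one display precision per column
    values: list      # one list of row values per column

    def validate(self):
        """Check that attributes match columns and that columns match in length."""
        n_cols = len(self.names)
        attrs = [self.units, self.datatypes, self.precisions, self.values]
        if any(len(a) != n_cols for a in attrs):
            raise ValueError("every attribute array needs one entry per column")
        if len({len(col) for col in self.values}) > 1:
            raise ValueError("all columns must have the same number of rows")
        return True


# Sample values are invented for the example.
table = DataTable(
    names=["Date", "Temp_Water", "Salinity"],
    units=["YYYY-MM-DD", "degC", "PSU"],
    datatypes=["string", "float", "float"],
    precisions=[0, 2, 2],
    values=[["2004-06-01", "2004-06-02"], [27.15, 27.40], [31.20, 30.85]],
)
table.validate()
```

Keeping each attribute in its own array means that adding, deleting, or reordering a column is the same positional operation on every field, which makes consistency straightforward to check.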
Functions in the GCE Data Toolbox rigorously maintain the consistency of column attributes and correspondence of rows in data structures to preserve the validity of the data from operation to operation. All operations that are performed on a data structure are also written to a history field by toolbox functions, allowing the complete processing history (i.e. lineage) to be viewed at any time and included in the data set metadata.
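As a rough illustration of lineage tracking, the following Python sketch applies one operation to every column and appends a timestamped entry to a history list. The function name, dictionary keys, and log wording are hypothetical, not the toolbox's own functions or history format.

```python
from datetime import datetime


def delete_rows(table, row_indices):
    """Return a copy of `table` without the given rows, logging the change.

    The same rows are removed from every column so that records stay aligned.
    """
    drop = set(row_indices)
    new_values = [
        [v for i, v in enumerate(col) if i not in drop]
        for col in table["values"]
    ]
    entry = (
        datetime.now().isoformat(timespec="seconds"),
        f"deleted {len(drop)} row(s) from all {len(new_values)} columns",
    )
    # Build a new history list so the input table is left unchanged.
    return {**table, "values": new_values, "history": table["history"] + [entry]}


# Sample values are invented for the example.
table = {
    "names": ["Depth", "Temp_Water"],
    "values": [[0.5, 1.0, 1.5], [27.1, 26.8, 26.2]],
    "history": [],
}
trimmed = delete_rows(table, [1])
```

Because each operation returns its own log entry alongside the modified data, the history travels with the data set and can later be written into its metadata.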
General documentation information is stored in data structures as a parseable array of categories, fields, and values (i.e. two-tiered hierarchy). Metadata is automatically updated to reflect changes to the structure, and can be manually edited in a GUI application. This parseable storage format also permits documentation to be meshed when two structures are merged together, preserving all the information from both structures without unnecessary duplication.
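One way to picture the two-tiered storage and the merging behavior is the sketch below. It is an assumption-laden Python illustration, not the toolbox's merge logic: metadata is modeled as a list of (category, field, value) triples, and the function name, the "; " separator, and the sample entries are invented.

```python
def merge_metadata(meta_a, meta_b):
    """Combine two lists of (category, field, value) triples.

    Values for the same category and field are kept once each, in order of
    first appearance, so shared content is not duplicated.
    """
    merged = {}
    for category, fld, value in meta_a + meta_b:
        values = merged.setdefault((category, fld), [])
        if value not in values:
            values.append(value)
    return [(c, f, "; ".join(v)) for (c, f), v in merged.items()]


# Sample entries are invented for the example.
meta_a = [("Dataset", "Title", "Survey 2003"), ("Study", "Site", "Site A")]
meta_b = [("Dataset", "Title", "Survey 2004"), ("Study", "Site", "Site A")]
merged = merge_metadata(meta_a, meta_b)
```

In this example the two titles are both retained under a single Dataset/Title entry, while the identical site description appears only once.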
A flexible formatting language was developed to convert metadata to printable documentation in various styles. Prototype tools to convert metadata to hierarchical element-based XML format are also provided.
Quality Control Framework
GCE Data Structures also employ a dynamic, extensible QA/QC framework in which a table of data quality 'flags' is maintained in synchrony with the data table. The separation of data values and QA/QC flags obviates the need to delete questionable values from data sets, and permits flexible handling and display of QA/QC information during analysis and data export. For example:
QA/QC flags can be automatically assigned based on 'flagging' expressions defined for each data column (i.e. sets of expressions that return character flags for values meeting the specified criteria), assigned manually in a spreadsheet-like data editor, or assigned graphically by selecting data points with the mouse. Flagging expressions can also contain references to other data columns, and composite flags from multiple columns can be manually propagated to dependent columns, allowing users to perform data flagging based on complex, multi-column dependency relationships (e.g. flagging of all measured values when a hydrographic instrument is out of the water, based on depth or pressure values).
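The idea of rule-based flagging with a flag table kept in step with the data can be sketched as follows. This is a minimal Python illustration under stated assumptions: the rule format (target column, test function, flag character), the flag codes "Q" and "I", the thresholds, and the sample values are all hypothetical and do not reproduce the toolbox's flagging expression syntax.

```python
def assign_flags(columns, rules):
    """Build a flag table parallel to `columns` without altering the data.

    Each rule is (target_column, test, flag); `test` receives a whole row as
    a dict, so a rule for one column may refer to values in other columns.
    """
    n_rows = len(next(iter(columns.values())))
    flags = {name: [""] * n_rows for name in columns}
    for target, test, flag in rules:
        for i in range(n_rows):
            row = {name: col[i] for name, col in columns.items()}
            if test(row) and flag not in flags[target][i]:
                flags[target][i] += flag
    return flags


# Sample values, thresholds, and flag codes are invented for the example.
columns = {
    "Depth": [1.2, 0.0, 1.1],
    "Temp_Water": [26.5, 31.0, 45.0],
}
rules = [
    ("Temp_Water", lambda r: r["Temp_Water"] > 40, "Q"),  # out-of-range value
    ("Temp_Water", lambda r: r["Depth"] <= 0, "I"),       # sensor out of water
]
flags = assign_flags(columns, rules)

# Flagged values stay in the data set and can be filtered at export time.
unflagged_temps = [
    v for v, f in zip(columns["Temp_Water"], flags["Temp_Water"]) if not f
]
```

Here the second rule flags a temperature reading based on the depth column, which mirrors the out-of-water example above, and the original values remain available for users who choose to handle flagged records differently.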
Structure metadata fields are queried by toolbox functions for all data management, analysis, and display operations, allowing functions to process and format values appropriately based on the type of information they represent. This semantic processing approach maintains the validity of data and calculated parameters, and supports intelligent automation, such as:
Data and documentation can be imported from various sources to create GCE Data Structures, including existing data structures, delimited ASCII files, MATLAB files containing both vectors and matrices, and SQL databases (requires the MATLAB Database Toolbox). Metadata can be imported along with the data (e.g. headers on ASCII files), imported from existing data structures as metadata templates, or entered manually.
A number of specialized import filters have also been developed to directly parse data and documentation from specific types of data sources, including SBE MicroCAT data logger files, USGS tab-delimited files, Campbell Scientific array-based data loggers, and NOAA NCDC files.
Data, documentation, and statistical reports can also be exported in a wide variety of delimited ASCII text (including CSV) and MATLAB formats to support external programs or for archival purposes. Structures and selected variables can also be transferred to the base MATLAB workspace from the GUI editor application at any time to support mixed GUI and command-line processing.
This material is based upon work supported by the National Science Foundation under grants OCE-9982133, OCE-0620959 and OCE-1237140. Any opinions, findings, conclusions, or recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.