GCE Data Toolbox Overview

Introduction

The GCE Data Toolbox is a comprehensive software framework for metadata-based analysis, quality control, transformation and management of environmental data sets. The toolbox is a free add-on library to the MATLAB technical computing language, based on a MATLAB data model for storing tabular data along with all information required to interpret the data and generate formatted metadata (documentation). The various metadata fields in the structure are queried by toolbox functions for all operations, allowing functions to process and format values appropriately based on the type of information they represent. This 'semantic processing' approach supports highly automated and intelligent data analysis and ensures data set validity throughout all processing steps.

Data Structure Format

The GCE Data Toolbox data model is implemented as a scalar MATLAB structure (i.e. a 'struct' variable). Data values are stored in the structure in a virtual table arrangement, organized as a series of single-column arrays, each containing one type of information (i.e. a single variable). Every array has the same number of rows, which represent records or observations for the corresponding data column, as in a relational database table. The major attributes of each column (i.e. data descriptor metadata, such as column names, units, data types and precisions) are stored as matching arrays in individual structure fields. A dedicated array of qualifier flags is also paired with each data array (see Quality Control Framework below).
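The layout described above can be illustrated with a short, language-neutral sketch. It is written in Python rather than MATLAB and is not toolbox code: the class name, field names and data type codes are assumptions chosen for illustration. It only shows the idea of parallel per-column attribute arrays, equal-length data columns and a matching flag array for every column.

```python
from dataclasses import dataclass


@dataclass
class DataTable:
    """Column-oriented table with parallel attribute lists (hypothetical names)."""
    name: list       # one column name per column
    units: list      # one unit string per column
    datatype: list   # illustrative codes: "f" float, "d" integer, "s" string
    precision: list  # decimal places used when formatting each column
    values: list     # one list of row values per column
    flags: list      # one list of flag strings per column, same shape as values

    def validate(self):
        """Check attribute and row correspondence; return (rows, columns)."""
        n_cols = len(self.name)
        for attr in (self.units, self.datatype, self.precision,
                     self.values, self.flags):
            if len(attr) != n_cols:
                raise ValueError("every attribute needs one entry per column")
        n_rows = len(self.values[0]) if n_cols else 0
        for col_values, col_flags in zip(self.values, self.flags):
            if len(col_values) != n_rows or len(col_flags) != n_rows:
                raise ValueError("all data and flag columns need equal row counts")
        return n_rows, n_cols


table = DataTable(
    name=["Site", "Salinity"],
    units=["none", "PSU"],
    datatype=["s", "f"],
    precision=[0, 1],
    values=[["A", "B", "C"], [30.1, 31.4, 29.8]],
    flags=[["", "", ""], ["", "", "Q"]],
)
print(table.validate())
```

Keeping attributes in parallel arrays means a single consistency check covers the whole table, which is the property the toolbox functions are described as maintaining.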

Functions in the GCE Data Toolbox rigorously maintain the consistency of column attributes and correspondence of rows in data structures to preserve the validity of the data from operation to operation. All operations that are performed on a data structure are also written to a history field by toolbox functions, allowing the complete processing history (i.e. lineage) to be viewed at any time and included in the data set metadata.
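As a rough illustration of this bookkeeping, the Python sketch below applies one operation to data and flag columns together and records it in a history list. The dictionary keys, function names and message format are hypothetical and do not reproduce the toolbox's MATLAB functions.

```python
from datetime import datetime


def log_history(table, message):
    """Append a timestamped entry to the table's processing history."""
    stamp = datetime.now().isoformat(timespec="seconds")
    table["history"].append((stamp, message))


def delete_rows(table, row_indices):
    """Remove the same rows from every data and flag column, then log it."""
    drop = set(row_indices)
    for key in ("values", "flags"):
        table[key] = [
            [item for i, item in enumerate(column) if i not in drop]
            for column in table[key]
        ]
    log_history(table, f"deleted {len(drop)} row(s): {sorted(drop)}")
    return table


table = {
    "values": [[1, 2, 3], [10.0, 12.0, 99.0]],
    "flags": [["", "", ""], ["", "", "Q"]],
    "history": [],
}
delete_rows(table, [2])
for stamp, message in table["history"]:
    print(stamp, message)
```

Because the deletion and the history entry happen inside one function, a caller cannot change the data without leaving a lineage record.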

General documentation information is stored in data structures as a parseable array of categories, fields, and values (i.e. two-tiered hierarchy). Metadata is automatically updated to reflect changes to the structure, and can be manually edited in a GUI application. This parseable storage format also permits documentation to be meshed when two structures are merged together, preserving all the information from both structures without unnecessary duplication.
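One way to picture the two-tiered storage and the meshing step is the Python sketch below. It treats documentation as a list of (category, field, value) entries and merges two such lists. The merge policy shown (keep one copy of identical values, join differing values with "; ") is an assumption made for illustration, not the toolbox's documented behavior, and the example categories and fields are invented.

```python
def merge_metadata(meta_a, meta_b):
    """Combine two lists of (category, field, value) entries without duplicates."""
    merged = {}
    for category, field_name, value in list(meta_a) + list(meta_b):
        values = merged.setdefault((category, field_name), [])
        if value not in values:  # skip values already recorded for this field
            values.append(value)
    # dicts preserve insertion order, so fields stay in first-seen order
    return [
        (category, field_name, "; ".join(values))
        for (category, field_name), values in merged.items()
    ]


meta_a = [("Dataset", "Title", "Marsh survey"), ("Study", "Site", "GCE1")]
meta_b = [("Dataset", "Title", "Marsh survey"), ("Study", "Site", "GCE2")]
print(merge_metadata(meta_a, meta_b))
# [('Dataset', 'Title', 'Marsh survey'), ('Study', 'Site', 'GCE1; GCE2')]
```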

A flexible formatting language was developed to convert metadata to printable documentation in various user-editable styles. Metadata can also be exported in XML text format, either generic element-based XML or XML compliant with the Ecological Metadata Language version 2.2 schema.

Quality Control Framework

The GCE Data Toolbox provides a dynamic, extensible, rule-based framework for quality assurance and quality control (QA/QC) of environmental data. A table of data qualifier 'flags' is maintained in synchrony with the main data table, so qualifiers shadow data values throughout processing. The separation of data value and QA/QC flag storage obviates the need to delete suspect or invalid values from data sets, and permits flexible handling and display of QA/QC information during analysis and data export.

For example:

  • value flags can be displayed in the data editor and above the corresponding data values in plots
  • flagged values can be included or excluded in statistical reports
  • flags can be converted to data columns and displayed alongside the data values
  • flagged values or rows containing flagged values can be omitted from exported data sets
  • flagged values can be selectively deleted from data sets, with deletions logged to the structure history and data anomalies metadata field
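The options listed above follow from storing flags separately from values. A minimal Python sketch, with hypothetical function names and an empty string standing for "no flag", shows how the same data can be summarized with or without flagged values and how flagged rows can be left out of an export without deleting anything from the source columns.

```python
from statistics import mean


def column_mean(values, flags, include_flagged=False):
    """Mean of a column, optionally skipping values that carry a flag."""
    kept = [v for v, f in zip(values, flags) if include_flagged or not f]
    return mean(kept) if kept else None


def rows_without_flags(columns, flag_columns):
    """Return copies of the columns restricted to rows with no flag in any column."""
    n_rows = len(columns[0]) if columns else 0
    keep = [i for i in range(n_rows) if not any(fc[i] for fc in flag_columns)]
    return [[col[i] for i in keep] for col in columns]


salinity = [10.0, 12.0, 99.0]
salinity_flags = ["", "", "Q"]
print(column_mean(salinity, salinity_flags))                        # 11.0
print(column_mean(salinity, salinity_flags, include_flagged=True))
print(rows_without_flags([[1, 2, 3], salinity],
                         [["", "", ""], salinity_flags]))            # [[1, 2], [10.0, 12.0]]
```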

QA/QC flags can be automatically assigned based on criteria expressions defined for each data column (i.e. sets of expressions that return character flags for values meeting the specified criteria), assigned manually in a spreadsheet-like data editor, or assigned graphically by selecting data points with the mouse. Flagging expressions can contain references to other data columns, and composite flags from multiple columns can be manually propagated to dependent columns, allowing users to perform data flagging based on complex, multi-column dependency relationships (e.g. flagging all measured values when a hydrographic instrument is out of the water, based on the depth or pressure value).
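The rule-based approach, including a rule that refers to another column, can be sketched as follows. This is a Python illustration only: rules are written here as (predicate, flag) pairs rather than in the toolbox's own criteria expression syntax, and the column names, thresholds and flag characters are assumptions.

```python
def apply_rules(columns, rules):
    """Evaluate per-column rules row by row and return a flag list per column.

    columns: dict mapping column name -> list of values (equal lengths)
    rules:   dict mapping target column name -> list of (predicate, flag),
             where predicate receives a dict holding the whole row, so a rule
             for one column may test values from any other column
    """
    n_rows = len(next(iter(columns.values()), []))
    flags = {name: [""] * n_rows for name in columns}
    for i in range(n_rows):
        row = {name: col[i] for name, col in columns.items()}
        for target, target_rules in rules.items():
            for predicate, flag in target_rules:
                if predicate(row) and flag not in flags[target][i]:
                    flags[target][i] += flag
    return flags


columns = {
    "depth": [1.5, 0.05, 2.0],
    "salinity": [30.1, 0.2, 55.0],
}
rules = {
    "salinity": [
        (lambda row: row["salinity"] > 40, "Q"),  # questionable: out of range
        (lambda row: row["depth"] < 0.1, "I"),    # invalid: sensor out of water
    ],
}
print(apply_rules(columns, rules)["salinity"])  # ['', 'I', 'Q']
```

Because the rules are re-evaluated against the current values, flags can be recomputed whenever the data or the rules change, which is what makes the framework dynamic.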

QA/QC rules can be defined interactively, or managed along with other attribute and documentation metadata in a metadata template database for repeated use. Rules can even reference custom MATLAB functions, including functions that retrieve data from external sources, run complex models, or utilize code written in other programming languages, so the toolbox QA/QC framework is highly extensible.

Metadata-Driven Analysis

Structure metadata fields are queried by toolbox functions for all data management, analysis, and display operations, allowing functions to process and format values appropriately based on the type of information they represent. This semantic processing approach maintains the validity of data and calculated parameters, and supports intelligent automation, such as:

  • automatic statistical report generation with appropriate statistics computed based on the data type, numerical characteristics, and variable category of each column
  • automatic unit conversions and calculation of related information, e.g. geographic coordinate system inter-conversions, date/time format inter-conversions
  • validation of column selections used for relational joins and unions (i.e. merging multiple data sets) based on variable category and unit compatibility
  • intelligent plotting of data, e.g. automatic recognition of date/time axes and encoding of text columns to allow plotting as serial integers with text displayed as labels
  • automatic validation of entries in the data editor application based on column data type, numerical type and precision
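As a small illustration of the first item in this list, the Python sketch below picks summary statistics from a column's descriptors instead of its raw contents. The data type codes ("s", "f", "d") and variable categories ("data", "datetime", "nominal") are hypothetical stand-ins, and the choice of statistics per category is an assumption rather than the toolbox's actual logic.

```python
from statistics import mean, median


def summarize(values, datatype, variabletype):
    """Choose summary statistics from column metadata rather than raw values."""
    present = [v for v in values if v is not None]
    if not present:
        return {"count": 0}
    if datatype == "s":
        # text columns: counting is meaningful, averaging is not
        return {"count": len(present), "unique": len(set(present))}
    stats = {"count": len(present), "min": min(present), "max": max(present)}
    if variabletype == "data":
        # only measured quantities get central-tendency statistics
        stats["mean"] = mean(present)
        stats["median"] = median(present)
    return stats


print(summarize(["A", "B", "A"], "s", "nominal"))  # {'count': 3, 'unique': 2}
print(summarize([1.0, 2.0, 6.0], "f", "data"))
print(summarize([2001, 2002], "d", "datetime"))    # {'count': 2, 'min': 2001, 'max': 2002}
```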

Import/Export Capabilities

Data can be imported into the toolbox from a wide variety of local data sources (e.g. environmental data loggers, delimited text files, database queries, standard MATLAB files), online databases (e.g. LTER ClimDB, USGS NWIS, NOAA NCDC, NOAA HADS, LTER NIS) and other frameworks (e.g. Data Turbine). Additional import filters and metadata templates can be added to the toolbox to extend support to other data types and workflows. Interactive GUI forms are provided, along with a function library for building custom workflows for unattended processing.

Data, metadata (documentation) and statistical reports can be exported in delimited ASCII (including CSV), HTML and XML format, as well as standard MATLAB arrays, matrices and structs, to support external programs or for archival purposes. GCE data structures and selected variables can also be transferred to the base MATLAB workspace from the GUI editor application at any time to support mixed GUI and command-line processing.
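A delimited-text export that carries QA/QC flags along as extra columns might look like the Python sketch below. It is an illustration under stated assumptions, not the toolbox's export routine: the function name and the "Flag_" column prefix are hypothetical, and a flag column is written only for data columns that contain at least one flag.

```python
import csv
import io


def to_csv(names, values, flags, include_flags=True):
    """Write columns as CSV text, adding a flag column after flagged columns."""
    header, columns = [], []
    for name, col_values, col_flags in zip(names, values, flags):
        header.append(name)
        columns.append(col_values)
        if include_flags and any(col_flags):
            header.append("Flag_" + name)  # hypothetical naming convention
            columns.append(col_flags)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(zip(*columns))  # transpose columns into rows
    return buffer.getvalue()


print(to_csv(
    ["Site", "Salinity"],
    [["A", "B"], [30.1, 55.0]],
    [["", ""], ["", "Q"]],
))
```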

Beginning with version 3.9, data and metadata can also be exported as complete Ecological Metadata Language (EML)-described data packages suitable for archiving in a long-term data repository, such as the LTER Data Portal or ESA Data Registry.

Additional Information

For additional information about the GCE Data Toolbox, please see the Documentation, Function List and FAQ pages.