GCE Data Toolbox for MATLAB®

Selected Screen Shots

[Screen shots: startup screen, editor, data search engine, data editor, data plot, map plot, data merge dialog, metadata viewer]

Overview

The GCE Data Toolbox is a comprehensive software framework for metadata-based analysis, quality control, transformation and management of ecological data sets. The toolbox is a free add-on library to the MATLAB® technical computing language, based on a MATLAB data model for storing tabular data along with all information required to interpret the data and generate formatted metadata (documentation). The various metadata fields in the structure are queried by toolbox functions for all operations, allowing functions to process and format values appropriately based on the type of information they represent. This 'semantic processing' approach supports highly automated and intelligent data analysis and ensures data set validity throughout all processing steps.

All GCE-LTER data products are distributed in data structure format, and data can also be imported from a wide variety of local sources (e.g. environmental data loggers, delimited text files, database queries, standard MATLAB files) and online databases (e.g. LTER ClimDB, USGS NWIS, NOAA NCDC, NOAA HADS, LTER NIS). Users can also add their own import filters and metadata templates to extend support to other data types and workflows. Interactive GUI forms are provided, along with a function library for building custom workflows for unattended processing.

Some common end-user tasks that can be performed using the GCE Data Toolbox include the following (see the scripted sketch after this list):

  • Unit inter-conversions
  • Sub-sampling data sets by removing unneeded columns or rows
  • Filtering data based on values in one or more columns or mathematical expressions
  • Performing data quality control using rule-based and interactive flagging tools
  • Visualizing data using frequency histograms, line/scatter plots and map plots
  • Summarizing data sets by aggregation, binning, and date/time re-sampling
  • Re-factoring data sets by combining similar data in separate columns (normalizing) or splitting compound data series into separate columns (de-normalizing)
  • Joining and merging multiple structures to create integrated data sets
  • Exporting data and/or metadata in various ASCII and MATLAB formats for analysis in other programs
  • Searching local and web-based data sets using thematic, temporal and spatial search criteria
  • Importing (mining) data from the LTER ClimDB, USGS NWIS and NOAA NCDC databases over the Internet
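
For example, several of these tasks might be combined in a short script like the one below. This is only a sketch: the function names and argument lists are assumptions for illustration, based on the toolbox's general design, and may not match the actual toolbox API.

  % Sketch of a scripted workflow (function names and signatures are
  % assumed for illustration and may not match the actual toolbox API)
  [s, msg] = imp_ascii('weather.csv', 'comma');      % import a comma-delimited text file
  s = unit_convert(s, 'Precipitation', 'cm');        % convert a column to different units
  s = querydata(s, 'Precipitation >= 0');            % filter rows using a column expression
  s = deletecols(s, {'Notes'});                      % drop an unneeded column
  msg = exp_ascii(s, 'csv', 'weather_clean.csv');    % export the data and metadata as CSV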

The toolbox and specification were developed using the MATLAB programming language (The MathWorks), and require MATLAB 6.5 (R13) or higher to run. However, a complete suite of graphical user interface programs is provided to augment the command-line functions, allowing users with no prior MATLAB experience to use the toolbox with minimal instruction. MATLAB is compatible with all major computer operating systems, including Microsoft Windows, Unix/Linux, Sun Solaris, and Apple OS X.

Data Structure Format

Data values are stored in structures in a virtual table format, organized as a series of single-column arrays, each containing one type of information (i.e. a single variable). Each array contains an equal number of rows, representing records or observations for the corresponding data column, as in a relational database table. The major attributes of each column (i.e. data descriptor metadata, such as column names, units, data types, and precision) are stored as matching arrays in individual structure fields.
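
As a rough illustration, this layout can be pictured as a MATLAB structure of parallel arrays like the one below. The field names and values shown are simplified for illustration; the actual GCE Data Structure specification defines additional fields.

  % Simplified sketch of the virtual-table layout (field names are
  % illustrative; the actual specification defines additional fields)
  s.name         = {'Date', 'Salinity', 'Temperature'};     % column names
  s.units        = {'MM/DD/YYYY', 'PSU', 'degrees C'};      % column units
  s.datatype     = {'s', 'f', 'f'};                         % storage type codes
  s.variabletype = {'datetime', 'data', 'data'};            % variable categories
  s.precision    = [0 2 2];                                 % display precision per column
  s.values       = {{'01/15/2012'; '01/16/2012'}, ...       % one equal-length array per column,
                    [30.12; 30.45], ...                     % as in a relational database table
                    [14.2; 14.8]};
  s.history      = {datestr(now), 'structure created'};     % processing history (lineage)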

Functions in the GCE Data Toolbox rigorously maintain the consistency of column attributes and correspondence of rows in data structures to preserve the validity of the data from operation to operation. All operations that are performed on a data structure are also written to a history field by toolbox functions, allowing the complete processing history (i.e. lineage) to be viewed at any time and included in the data set metadata.

General documentation information is stored in data structures as a parseable array of categories, fields, and values (i.e. a two-tiered hierarchy). Metadata is automatically updated to reflect changes to the structure, and can be manually edited in a GUI application. This parseable storage format also permits documentation from two structures to be combined when they are merged, preserving all the information from both without unnecessary duplication.
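
Conceptually, this documentation can be pictured as rows of category/field/value entries, as in the sketch below. The content and layout are illustrative only, not the exact storage format.

  % Illustrative category/field/value metadata array (two-tiered hierarchy)
  s.metadata = { ...
     'Dataset', 'Title',     'Example monitoring data set'; ...
     'Dataset', 'Abstract',  'Example abstract text';       ...
     'Study',   'BeginDate', '01/15/2012';                  ...
     'Study',   'EndDate',   '01/16/2012'};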

A flexible formatting language was developed to convert metadata to printable documentation in various styles. Prototype tools to convert metadata to hierarchical element-based XML format are also provided.

Quality Control Framework

GCE Data Structures also employ a dynamic, extensible QA/QC framework in which a table of data quality 'flags' is maintained in synchrony with the data table. The separation of data values and QA/QC flags obviates the need to delete questionable values from data sets, and permits flexible handling and display of QA/QC information during analysis and data export.  For example:

  • value flags can be displayed in the data editor and above the corresponding data values in plots
  • flagged values can be included or excluded in statistical reports
  • flags can be converted to data columns and displayed alongside the data values
  • flagged values or rows containing flagged values can be omitted from exported data sets
  • flagged values can be selectively deleted from data sets, with deletions logged to the structure history and data anomalies metadata field

QA/QC flags can be automatically assigned based on 'flagging' expressions defined for each data column (i.e. sets of expressions that return a character flag for values meeting the specified criteria), assigned manually in a spreadsheet-like data editor, or assigned graphically by selecting data points with the mouse. Flagging expressions can also contain references to other data columns, and composite flags from multiple columns can be manually propagated to dependent columns, allowing users to perform data flagging based on complex, multi-column dependency relationships (e.g. flagging all measured values when a hydrographic instrument is out of the water, based on depth or pressure values).
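
The sketch below illustrates the general idea of rule-based flagging, including a multi-column rule. It is a conceptual example written in plain MATLAB, not the toolbox's actual criteria syntax or flagging code.

  % Conceptual illustration of rule-based flagging (plain MATLAB, not the
  % toolbox's criteria syntax): evaluate rules and record a flag per value
  x     = [12.1; 45.3; -2.0; 13.4];      % measured values (e.g. water temperature)
  depth = [ 1.2;  0.9;  1.1;  0.0];      % companion column (sensor depth)
  flags = repmat(' ', length(x), 1);     % one flag character per row (blank = unflagged)
  flags(x < 0)      = 'I';               % 'I' = invalid value
  flags(x > 40)     = 'Q';               % 'Q' = questionable value
  flags(depth <= 0) = 'I';               % multi-column rule: sensor out of the water
  disp([num2str(x), repmat('   ', length(x), 1), flags])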

Metadata-Driven Analysis

Structure metadata fields are queried by toolbox functions for all data management, analysis, and display operations, allowing functions to process and format values appropriately based on the type of information they represent. This semantic processing approach maintains the validity of data and calculated parameters, and supports intelligent automation, such as the following (see the sketch after this list):

  • automatic statistical report generation with appropriate statistics computed based on the data type, numerical characteristics, and variable category of each column
  • automatic unit conversions and calculation of related information, e.g. geographic coordinate system inter-conversions, date/time format inter-conversions
  • validation of column selections used for relational joins and unions (i.e. merging multiple data sets) based on variable category and unit compatibility
  • intelligent plotting of data, e.g. automatic recognition of date/time axes and encoding of text columns to allow plotting as serial integers with text displayed as labels
  • automatic validation of entries in the data editor application based on column data type, numerical type and precision
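
As a conceptual example of this kind of metadata-driven validation, a check like the following could be used to confirm that two columns are compatible join keys before merging. The function and the attribute fields it consults are illustrative assumptions, not the toolbox's own validation code.

  function ok = compatible_keys(s1, col1, s2, col2)
  %COMPATIBLE_KEYS  Conceptual check of column-attribute compatibility
  %  before a relational join (illustrative, not toolbox code)
  ok = strcmp(s1.units{col1},        s2.units{col2}) && ...
       strcmp(s1.variabletype{col1}, s2.variabletype{col2}) && ...
       strcmp(s1.datatype{col1},     s2.datatype{col2});
  end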

Import/Export Capabilities

Data and documentation can be imported from various sources to create GCE Data Structures, including existing data structures, delimited ASCII files, MATLAB files containing both vectors and matrices, and SQL databases (requires the MATLAB Database Toolbox). Metadata can be imported along with the data (e.g. headers on ASCII files), imported from existing data structures as metadata templates, or entered manually.
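
A minimal sketch of the import concept is shown below: a comma-delimited file with a one-line header is parsed into column arrays and attribute fields. This is plain MATLAB for illustration only; the toolbox's own import filters handle this work, along with format detection and metadata templates.

  % Conceptual sketch of importing a delimited ASCII file into the
  % parallel-array layout (not the toolbox's actual import code)
  fid    = fopen('station_data.csv', 'r');             % file with a one-line header row
  header = strsplit(fgetl(fid), ',');                  % column names from the header
  vals   = textscan(fid, '%s%f%f', 'Delimiter', ',');  % parse the remaining rows
  fclose(fid);
  s.name   = header;                                   % column names
  s.values = vals;                                     % one array per column
  s.units  = {'', '', ''};                             % units typically filled from a metadata template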

A number of specialized import filters have also been developed to directly parse data and documentation from specific types of data sources, including SBE MicroCAT data logger files, USGS tab-delimited files, Campbell Scientific array-based data logger files, and NOAA NCDC files.

Data, documentation, and statistical reports can also be exported in a wide variety of delimited ASCII text (including CSV) and MATLAB formats to support external programs or for archival purposes. Structures and selected variables can also be transferred to the base MATLAB workspace from the GUI editor application at any time to support mixed GUI and command-line processing.
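
For illustration, the sketch below writes two data columns to a CSV file with their flags as companion columns. This is plain MATLAB showing the concept, not the toolbox's export code.

  % Conceptual sketch of exporting data with flags as companion columns
  sal  = [30.12; 30.45];   sal_flags  = [' '; 'Q'];    % example values and flags
  temp = [14.20; 14.80];   temp_flags = [' '; ' '];
  fid = fopen('export.csv', 'w');
  fprintf(fid, 'Salinity,Flag_Salinity,Temperature,Flag_Temperature\n');
  for i = 1:length(sal)
     fprintf(fid, '%.2f,%c,%.2f,%c\n', sal(i), sal_flags(i), temp(i), temp_flags(i));
  end
  fclose(fid);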

Background

This material is based upon work supported by the National Science Foundation under grants OCE-9982133, OCE-0620959 and OCE-1237140. Any opinions, findings, conclusions, or recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.