GCE Data Toolbox Background
The GCE Data Toolbox is a major component of the overall GCE Information System. Data from various sources, including spreadsheet and MATLAB files contributed by investigators, ASCII text from instrument data loggers, and result sets from SQL database queries, are converted to data structures for validation, quality control/quality assurance analysis, and post-processing such as unit conversions and calculation of derived parameters. Metadata (i.e., documentation) is added from standard templates or existing data structures, derived from the data source itself via custom filter programs, or imported directly from the GCE Metabase relational database management system. All processing steps and changes to the data and metadata are automatically logged by date to the structure's history field, providing a complete record of all data processing for each structure.
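The idea of a structure that records its own processing history can be illustrated with a minimal sketch. This is written in Python for illustration only and is not the toolbox's MATLAB implementation; the `DataStructure` class, its field names, and the `convert_units` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DataStructure:
    """Hypothetical stand-in for a data structure with a history field."""
    columns: dict                                   # column name -> list of values
    metadata: dict = field(default_factory=dict)    # descriptive metadata
    history: list = field(default_factory=list)     # (timestamp, message) entries

    def log(self, message):
        # Append a dated entry so every change leaves a record.
        stamp = datetime.now().isoformat(timespec="seconds")
        self.history.append((stamp, message))


def convert_units(ds, column, factor, new_units):
    """Scale a column, update its units metadata, and log the step."""
    ds.columns[column] = [v * factor for v in ds.columns[column]]
    ds.metadata[column + "_units"] = new_units
    ds.log(f"converted {column} to {new_units} (factor {factor})")
    return ds


# Example: convert a depth column from feet to meters.
ds = DataStructure(columns={"depth": [10.0, 20.0]}, metadata={"depth_units": "ft"})
convert_units(ds, "depth", 0.3048, "m")
```

Because the processing function writes the log entry itself, the history cannot drift out of step with the data, which is the property the paragraph above describes.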
Analysis and Distribution
After data sets are packaged as data structures, toolbox functions are used to generate distributable data products and documentation in various standard formats, including delimited ASCII text, comma-separated value, and MATLAB binary files containing standard arrays and matrices. Data sets are also distributed in native data structure format, allowing MATLAB users who have downloaded the GCE Data Toolbox to customize the data set to meet their own format requirements, aggregate data and perform statistical analyses, subset data using queries, and visualize data using histogram and scatter plots or by plotting data values on raster and vector maps.
Online analysis programs are also being developed using the MATLAB® Web Server to provide basic data set customization, visualization and analysis services to researchers without access to MATLAB.
Support for Automated Data Harvesting
The primary GCE Data Toolbox functions can be run from the MATLAB command line and return consistent output optimized for serial batch processing. This capability, combined with the self-documenting nature of GCE Data Structures (i.e., automatic metadata synchronization and processing history generation), rule-based QA/QC flagging, and metadata-based auto-parameterization of many functions, makes these tools well suited to unattended data harvesting scenarios. Recent enhancements to the MATLAB environment in Release 13 (version 6.5), such as new functions for WWW data access, timed function execution, and XML/XSLT support, open even more possibilities.
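Rule-based flagging of the kind mentioned above can be sketched as follows. This is an illustrative Python example under stated assumptions, not the toolbox's own rule syntax: the flag codes `"I"` and `"Q"`, the thresholds, and the `flag_values` function are hypothetical.

```python
def flag_values(values, rules):
    """Return one flag string per value by applying every rule in order.

    rules is a list of (flag_code, predicate) pairs; a value collects the
    code of each rule whose predicate it satisfies.
    """
    flags = []
    for v in values:
        flags.append("".join(code for code, test in rules if test(v)))
    return flags


# Hypothetical rules for a water temperature column in degrees Celsius:
# "I" marks physically invalid values, "Q" marks questionable ones.
rules = [
    ("I", lambda v: v < 0),
    ("Q", lambda v: v > 40),
]

flags = flag_values([12.5, -1.0, 45.2], rules)
```

Keeping the rules as data rather than hard-coded checks is what allows the same flagging step to run unattended on any column whose rules are supplied with it.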
A recent application of this technology is the USGS Data Harvesting Service for HydroDB, developed in partnership with the Andrews LTER in 2003. Real-time and finalized data are automatically harvested on a weekly basis for 52 streamflow stations near 11 LTER and USFS sites (as of February 2004). The data are fully processed, quality-checked, formatted, and submitted to AND-LTER for immediate inclusion in the LTER All-Site Hydrological database. As new finalized data are released by USGS for each station, provisional real-time values are overwritten to ensure that HydroDB contains both up-to-date and accurate data for LTER synthesis projects. (This service is described in detail on the USGS Data Harvesting page.)
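The overwrite step described above, in which finalized values replace provisional ones, can be sketched in a few lines. This Python example is only an illustration of the logic; the record layout, the station identifier, and the `merge_finalized` function are hypothetical and do not reflect the actual harvesting service or HydroDB schema.

```python
def merge_finalized(stored, finalized):
    """Overlay finalized values on stored records.

    stored maps (station, date) -> {"value": ..., "status": ...};
    finalized maps (station, date) -> value. Finalized values always win,
    while provisional records with no finalized counterpart are kept.
    """
    merged = dict(stored)
    for key, value in finalized.items():
        merged[key] = {"value": value, "status": "final"}
    return merged


# Hypothetical station ID and discharge values.
stored = {
    ("STN001", "2004-02-01"): {"value": 101.0, "status": "provisional"},
    ("STN001", "2004-02-02"): {"value": 98.0, "status": "provisional"},
}
finalized = {("STN001", "2004-02-01"): 103.5}

merged = merge_finalized(stored, finalized)
```

Here the record for 2004-02-01 takes the finalized value, while the 2004-02-02 record stays provisional until a finalized value arrives.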
Support for Synthesis
The GCE Data Toolbox will play a central role in data synthesis as the GCE-LTER database matures and modeling projects get underway. Data sets stored in data structure format can easily be meshed to create composite data sets using flexible merge and join functions. High resolution time-series data sets can also be scaled to other temporal resolutions by statistical aggregation on date/time columns and subset by date or time using parametric queries. Generic ASCII and MATLAB import filters with automatic data descriptor assignment and automatic unit conversion functions will also allow GCE investigators to analyze data from non-GCE databases alongside GCE data sets to gain a broader perspective.
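Scaling a high-resolution time series to a coarser resolution by statistical aggregation can be sketched as below. This is a minimal Python illustration using only the standard library, not a toolbox function; the `daily_means` name and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_means(records):
    """Aggregate (timestamp, value) pairs to one mean value per calendar day."""
    groups = defaultdict(list)
    for timestamp, value in records:
        groups[timestamp.date()].append(value)
    # Sort by day so the result is in chronological order.
    return {day: mean(vals) for day, vals in sorted(groups.items())}


# Hypothetical sub-daily readings spanning two days.
records = [
    (datetime(2004, 2, 1, 0, 0), 10.0),
    (datetime(2004, 2, 1, 12, 0), 14.0),
    (datetime(2004, 2, 2, 0, 0), 9.0),
]

daily = daily_means(records)
```

The same grouping pattern extends to other resolutions (hourly, monthly) by changing the grouping key, and to other statistics by swapping `mean` for another function.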
In addition, automatic logging of data processing history and dynamic synchronization of column metadata descriptors with general metadata ensure that documentation remains accurate through all processing steps, which is crucial for complex synthetic research conducted by multiple investigators.
This material is based upon work supported by the National Science Foundation under grants OCE-9982133, OCE-0620959 and OCE-1237140. Any opinions, findings, conclusions, or recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.