Data Harvesting

Introduction

The GCE Data Toolbox can be used to develop comprehensive data harvesting workflows that include metadata generation, calculation of derived variables, automatic QA/QC checks, and statistical resampling. Harvested data, metadata, and other products (e.g. plots) can then be exported in various formats for distribution. All GCE Data Toolbox functions automatically log changes and operations to the data set metadata, documenting every workflow step and greatly simplifying metadata generation for derived products compared to other analytical software and workflow systems.

Data harvesting workflows can be run interactively from the command line, added to the Dataset Editor menus, or set to run unattended on a schedule using MATLAB timer objects. Demonstration workflows and utilities for data distribution and timed execution are included with the toolbox distributions, and are described briefly below.

Data Harvesting Workflows

A typical data harvesting workflow includes steps for importing or loading data, applying a metadata template containing documentation metadata and QA/QC rules, post-processing, and generation of data products. A wide variety of high-level functions are available in the GCE Data Toolbox for performing these actions. See List of Functions in the toolbox help, or open contents.html in a web browser to view descriptions of functions and their syntax, grouped by category.

Various generalized workflows are included in the /workflows directory of toolbox distributions, and sample data harvester workflows are included in /demo for study and customization, along with test data sets:

/demo/data_harvester.m -- comprehensive workflow for harvesting Campbell Scientific logger data and generating formatted data sets and plots along with web index pages

/demo/data_harvester_sql.m -- comprehensive workflow for harvesting data from a relational database (via JDBC/ODBC using the MATLAB Database Toolbox) and generating formatted data sets and plots along with web index pages

As in these examples, a workflow is typically implemented as a MATLAB function that accepts various input arguments (e.g. file names, template names, output paths, other options) and returns a status message. Coding all processing steps in a function file makes the workflow easier to run, and also allows workflows to be versioned and archived along with the data products, and even shared with other toolbox users.
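
The general shape of such a function can be sketched as follows. This is a minimal skeleton, not code from the toolbox: the function name my_harvester and its arguments are hypothetical, and the four numbered steps are left as comments because the actual import, template and export calls depend on the data source (see /demo/data_harvester.m for a complete example).

```matlab
function msg = my_harvester(fn_source, template, pn_output)
%MY_HARVESTER Hypothetical skeleton of a data harvesting workflow function.
%  fn_source = data file to harvest
%  template  = name of the metadata template to apply
%  pn_output = directory for output products
%  msg       = status message (character array)

msg = '';

% validate inputs before doing any work
if nargin < 3
    msg = 'insufficient arguments for function';
    return
end
if exist(fn_source, 'file') ~= 2
    msg = ['source file not found: ', fn_source];
    return
end
if exist(pn_output, 'dir') ~= 7
    msg = ['output directory not found: ', pn_output];
    return
end

try
    % 1. import or load the data (toolbox import function goes here)
    % 2. apply the metadata template named in 'template'
    % 3. post-process (derived variables, QA/QC checks, resampling)
    % 4. export data products to pn_output
    msg = sprintf('harvested %s using template %s', fn_source, template);
catch err
    % report failures in the message rather than throwing an error,
    % so that unattended runs can continue
    msg = ['harvest failed: ', err.message];
end
```

Returning a character array in every code path, including failures, keeps the function usable both interactively and from a timer.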

Timed Workflow Execution

MATLAB includes built-in support for running commands on a timed basis in the background, without interfering with commands or GUI applications running interactively. This functionality is implemented using Java timer objects (see 'help timer' for more information). Any number of timer objects can run at the same time; when conflicts occur, events are queued and run when CPU time is available. Note that timers only operate while the MATLAB instance is running, and shutting down the MATLAB session shuts down the timers as well. However, multiple instances of MATLAB can be run simultaneously to prevent conflicts between automated data harvesters and interactive use of the GCE Data Toolbox or other MATLAB programs.
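
For reference, a timer can be created directly with the built-in timer function, as in the minimal sketch below. The timer name and the callback are placeholders chosen for illustration. Note that the built-in Period and StartDelay properties are specified in seconds.

```matlab
% Create a timer that runs a command every 30 minutes in the background.
t = timer( ...
    'Name', 'demo_harvester', ...      % hypothetical name
    'ExecutionMode', 'fixedRate', ...
    'Period', 30 * 60, ...             % seconds between runs
    'StartDelay', 10, ...              % seconds to wait before the first run
    'TimerFcn', @(obj, evt) disp(['harvest triggered at ', datestr(now)]));

start(t);     % returns immediately; callbacks run in the background

% ... later, when the timer is no longer needed:
stop(t);
delete(t);    % remove the timer object from memory
```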

Timers can be created and configured manually from the command line, but the GCE Data Toolbox includes several helper functions that greatly simplify setting up and managing timed execution of workflows:

/core/start_harvesters.m -- creates timer objects based on entries in the GCE Data Structure /demo/harvest_timers.mat or a custom harvest_timers.mat in /userdata or /settings (see below)

/core/stop_harvesters.m -- stops all or specified harvest timers and clears them from memory

/core/list_harvesters.m -- lists the names and status of all timer objects in memory (for use with stop_harvesters.m)
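
Timers in memory can also be inspected with built-in MATLAB functions alone, which can help when troubleshooting. The sketch below does not show how the helper functions above are implemented; it uses only timerfind, get, stop and delete, and the timer name is hypothetical.

```matlab
% List every timer object currently in memory.
timers = timerfind;                 % returns [] when no timers exist
for n = 1:length(timers)
    fprintf('%s: running = %s\n', ...
        get(timers(n), 'Name'), get(timers(n), 'Running'));
end

% Stop and delete a timer by name ('demo_harvester' is a hypothetical name).
t = timerfind('Name', 'demo_harvester');
if ~isempty(t)
    stop(t);
    delete(t);
end
```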

The harvest_timers.mat file is a GCE Data Structure that can be loaded into the Dataset Editor application and revised using Edit > View/Edit Data. Each row in this data structure with Period > 0 will be instantiated as a timer when start_harvesters.m is run, using timer configuration options defined in the following columns:

Name: name of the harvester (displayed by list_harvesters.m and used by stop_harvesters.m)

ExecutionMode: timer execution mode:

'singleShot' = run once (not generally used for harvesting)

'fixedDelay' = run repeatedly, with period measured from when execution starts

'fixedRate' = run repeatedly, with period measured from the designated start time (default)

'fixedSpacing' = run repeatedly, with period measured from when execution ends

Period: timer period in minutes (i.e. interval)

TimerFcn: MATLAB statement to execute (i.e. m-file name plus arguments, e.g. data_harvester(...)); note that the function or statement must return a character array (message) as the first output argument or an error will result

StartTime: starting time (hh:mm:ss) of initial harvest on the designated day

StartDay: numeric day of the week to start for long-period harvests (0 = auto, 1 = Sunday, 7 = Saturday)

Examples:

  • 30-minute harvests, 15 and 45 minutes past the hour:
    • Period = 30
    • StartTime = 00:15:00
    • StartDay = 0
  • 24-hour harvests at 2:30 AM:
    • Period = 1440
    • StartTime = 02:30:00
    • StartDay = 0
  • Weekly harvests on Friday at 7:00 AM:
    • Period = 10080
    • StartTime = 07:00:00
    • StartDay = 6
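
To show how Period, StartTime and StartDay combine, the function below computes the first run time after a given moment. It is a hypothetical helper written for illustration, not the logic used by start_harvesters.m. It assumes that a StartTime later in the current day is used as is, and that a StartTime already past is advanced by whole periods.

```matlab
function t_next = next_start(t_now, start_time, start_day, period_min)
%NEXT_START Hypothetical helper: first scheduled run time after t_now.
%  t_now      = current time (datetime)
%  start_time = 'hh:mm:ss' character array
%  start_day  = 0 for any day, or 1 (Sunday) to 7 (Saturday)
%  period_min = period in minutes

% parse hh:mm:ss and anchor it to midnight of the current day
hms = sscanf(start_time, '%d:%d:%d');
t_next = dateshift(t_now, 'start', 'day') + duration(hms(1), hms(2), hms(3));

% move forward to the requested weekday (weekday() also uses 1 = Sunday)
if start_day > 0
    t_next = t_next + days(mod(start_day - weekday(t_next), 7));
end

% if that time has already passed, skip ahead by whole periods
if t_next <= t_now
    elapsed = minutes(t_now - t_next);
    t_next = t_next + minutes((floor(elapsed / period_min) + 1) * period_min);
end
```

Called at 10:20 on Wednesday 3 January 2024, the three examples above give 10:45 the same day (Period = 30), 02:30 on 4 January (Period = 1440), and 07:00 on Friday 5 January (Period = 10080, StartDay = 6).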

Note that beginning with version 3.7 of the toolbox, the Dataset Editor application includes menu items under Tools > Data Harvesters for managing timer objects. Timers can be started, stopped and listed, and the harvest_timers.mat database can be opened for editing using these menu items. Harvest timer events are also automatically logged to /settings/harvest_logs.mat, which is a GCE Data Structure containing columns Date, Harvester and Entry. Harvest logs can therefore be viewed, filtered and exported or used in custom dashboard applications.

Adding Workflows to the Dataset Editor Menus

The Dataset Editor GUI application (/gui/ui_editor.m) dynamically generates menu items for metadata templates in /userdata/imp_templates.mat and import filters in /userdata/imp_filters.mat, as well as content in other reference databases (e.g. geographic references in /settings/thalweg_ref.mat). These menu items are also updated dynamically as metadata templates, import filters and other content are edited using the corresponding management applications.

In addition, the Dataset Editor can be customized by adding entries to /extensions/extensions.m. This m-file can be opened and edited using MATLAB or a text editor, and example code and instructions are included as code comments. The first step is to add a 'uimenu' command defining the menu item label and callback for inclusion in the appropriate place in the hierarchy. A code block is then added to handle the callback events when the menu item is selected. The existing code in /extensions/extensions.m provided in toolbox distributions can be used to guide implementing custom menu options.
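
The two steps (defining a menu item, then handling its callback) follow the standard MATLAB uimenu pattern, illustrated in the standalone sketch below. This is not the code in /extensions/extensions.m, which has its own conventions for menu placement and callback handling. The figure, labels and callback here are placeholders.

```matlab
% Standalone illustration: a figure with one custom menu item.
fig = figure('MenuBar', 'none', 'Name', 'Menu demo', 'Visible', 'off');

% top-level menu, then an item beneath it in the hierarchy
h_menu = uimenu(fig, 'Label', 'Harvesters');
h_item = uimenu(h_menu, ...
    'Label', 'Run demo harvester', ...
    'Callback', @(src, evt) disp('demo harvester selected'));

set(fig, 'Visible', 'on');
```

In a real extension, the callback would call the workflow function rather than display a message.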