USGS Data Harvesting Service
In January 2003, Wade Sheldon at GCE developed an automated system for harvesting streamflow data from any real-time USGS gauging station and processing it for submission to HydroDB, the LTER All-site hydrological database at Andrews LTER. In collaboration with Suzanne Remillard and Don Henshaw at AND-LTER, this system was generalized and offered as a service to the broader LTER and USFS communities in June 2003.
Recent provisional data are harvested on a weekly basis from one or more stations requested by each participating site. The data are converted to units compatible with HydroDB and undergo several levels of quality control analysis and flagging to identify questionable values. Values flagged as invalid (e.g. negative precipitation, provisional discharge values >3x historic maxima) are removed from data sets prior to submission to HydroDB. Also, any updates to provisional data or Q/C flags assigned by USGS are automatically synchronized with the database each week, and provisional values are overwritten with finalized data as soon as they are released.
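The conversion and screening step described above can be illustrated with a minimal Python sketch. This is not the service's actual code (the harvester is built on the GCE Data Toolbox for MATLAB). The record layout, the function name `convert_and_screen`, and the target units (liters per second and millimeters) are assumptions made for illustration only. The two screening rules are the examples given in the text.

```python
# Conversion factors: 1 cubic foot = 28.316846592 liters, 1 inch = 25.4 mm.
CFS_TO_LPS = 28.316846592
INCH_TO_MM = 25.4


def convert_and_screen(record, historic_max_cfs):
    """Convert one record to metric units, dropping values flagged invalid.

    `record` is a hypothetical dict with keys 'date', 'discharge_cfs',
    'precip_in' and 'provisional'. Removed values are returned as None.
    """
    q = record.get("discharge_cfs")
    p = record.get("precip_in")

    # Rule 1: negative precipitation is invalid.
    if p is not None and p < 0:
        p = None

    # Rule 2: provisional discharge above 3x the historic maximum is invalid.
    if q is not None and record.get("provisional") and q > 3 * historic_max_cfs:
        q = None

    return {
        "date": record["date"],
        "discharge_lps": None if q is None else q * CFS_TO_LPS,
        "precip_mm": None if p is None else p * INCH_TO_MM,
    }
```

In this sketch, finalized values are never removed by the 3x rule, since the text applies that check to provisional discharge only.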
This harvesting service provides several important benefits to the LTER and broader scientific community. USGS has made great strides in providing timely access to national monitoring data via the WWW, but the vast size of this monitoring network (over 1.5 million sites, with over 5500 streamflow stations) makes finding data relevant to LTER sites a significant task. Data are also not provided in standard metric units, and provisional data are often not subjected to any quality checks prior to initial web posting. Harvesting, transforming, and quality-checking data from stations near to or within LTER sites on a regular basis and providing access through a single web interface greatly enhances the usability of these data and therefore facilitates synthesis.
This service also provides a useful demonstration of how metadata-based data processing technology (see below), well-defined data interchange standards, and web-based communications protocols can ease the application of information technology developed at individual LTER sites to network-level problems, providing a significant research benefit with almost no added cost.
The USGS data harvester is based on the generalized metadata-based data processing and analysis software developed by the GCE LTER project (the GCE Data Toolbox for MATLAB). An import filter was written to parse tab-delimited data files acquired from the USGS National Water Information System (NWIS), and a USGS metadata template was created to add basic data set documentation and assign complete metadata descriptors to each data column by matching column names to a list of standard USGS parameter codes.
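As a rough illustration of the import and column-matching idea, the Python sketch below parses a small tab-delimited file and looks up parameter codes embedded in the column names. It is not the MATLAB import filter itself. The sample text, the station number, and the function names `parse_rdb` and `describe_columns` are hypothetical, and the file layout (comment lines starting with `#`, a header row, then a column-format row) is assumed from the general NWIS tab-delimited format. The three parameter codes shown are standard USGS codes.

```python
import csv

# A few standard USGS parameter codes with their native units.
USGS_PARAMETERS = {
    "00060": ("Discharge", "ft3/s"),
    "00065": ("Gage height", "ft"),
    "00045": ("Precipitation", "in"),
}

# Illustrative sample only; not real station data.
SAMPLE_RDB = (
    "# Illustrative sample only, not real station data\n"
    "agency_cd\tsite_no\tdatetime\t01_00060\t01_00060_cd\n"
    "5s\t15s\t20d\t14n\t10s\n"
    "USGS\t00000000\t2011-03-01 00:00\t120\tP\n"
    "USGS\t00000000\t2011-03-01 00:15\t118\tP\n"
)


def parse_rdb(text):
    """Return (header, rows) from tab-delimited text, skipping '#' comments."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    next(reader)  # skip the column-format row (e.g. "5s", "14n")
    rows = [dict(zip(header, row)) for row in reader]
    return header, rows


def describe_columns(header):
    """Attach basic descriptors to columns whose names contain a known code."""
    described = {}
    for name in header:
        tokens = name.split("_")
        is_flag = tokens[-1] == "cd"  # companion qualifier column
        for code, (label, units) in USGS_PARAMETERS.items():
            if code in tokens:
                described[name] = {
                    "parameter": label,
                    "units": None if is_flag else units,
                    "role": "qualifier" if is_flag else "value",
                }
    return described
```

Columns without a recognized parameter code, such as `site_no`, are simply left undescribed here. A full metadata template would carry considerably more information per column.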
In addition to units, descriptions and data type information, the metadata templates contain detailed quality control criteria for each parameter. This permits rule-based quality flagging to be performed automatically as the template is applied to each downloaded data set. More specific Q/C criteria, based on historic minima and maxima of finalized data for each station, are also applied to flag extreme provisional values. Additionally, limit check metadata stored in the HydroDB metadata tables are used to alert sites to the presence of values outside of expected ranges in harvested data. Excessive warnings cause the harvest to be halted until the data are reviewed and corrective action is taken, minimizing the potential for incorporation of invalid data.
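The limit-check logic can be sketched in Python as follows. The source does not state how many warnings count as "excessive", so the `max_warning_fraction` threshold, its default value, and the function names are hypothetical placeholders rather than the service's actual settings.

```python
def limit_check(values, lower, upper):
    """Return the indices of values falling outside [lower, upper]."""
    return [
        i for i, v in enumerate(values)
        if v is not None and not (lower <= v <= upper)
    ]


def review_harvest(values, lower, upper, max_warning_fraction=0.05):
    """Summarize limit-check warnings and decide whether to halt.

    The harvest is halted when the share of checked values that raise a
    warning exceeds max_warning_fraction (an assumed, illustrative rule).
    """
    warnings = limit_check(values, lower, upper)
    checked = sum(v is not None for v in values)
    fraction = len(warnings) / checked if checked else 0.0
    return {
        "warnings": warnings,
        "fraction": fraction,
        "halt": fraction > max_warning_fraction,
    }
```

For example, `review_harvest([10, 20, 5000, None, 30], lower=0, upper=1000)` reports one warning among four checked values, which exceeds the assumed 5% threshold and would halt the harvest pending review.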
Harvests are conducted for all stations on a weekly basis. The entire process is fully automated, from initial data retrieval through incorporation in the ClimDB/HydroDB database, and is initiated as a timed job on a single GCE workstation.
As of March 2011, streamflow data from 81 USGS gauging stations are being harvested for 13 LTER sites (AND, BES, CAP, CWT, FCE, GCE, KBS, KNZ, LUQ, NTL, PIE, SBC, SEV), 1 USFS site (Neversink Valley, Delaware River Basin) and 1 Washington DNR site (Olympic Experimental State Forest) on a weekly basis. Precipitation data are also harvested from stations equipped with rain gauges.
Finalized data with USGS-assigned quality flags are harvested for the full period of record at each station, and recent provisional data are harvested from the end of finalized data to the present. Changes to provisional data and newly-released finalized data are automatically synchronized to HydroDB on an ongoing basis to provide the best available data to the LTER community at all times.
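The synchronization rule can be expressed as a small merge, sketched here in Python under stated assumptions: records are keyed by date, each stored record is a `(value, status)` pair, and the function name `synchronize` is hypothetical. The actual database update mechanism is not described in the source.

```python
def synchronize(stored, finalized, provisional):
    """Merge newly harvested data into stored records.

    stored:      {date: (value, status)} with status 'final' or 'provisional'
    finalized:   {date: value} harvested finalized data
    provisional: {date: value} harvested provisional data
    """
    merged = dict(stored)

    # Updated provisional values replace older provisional values,
    # but never overwrite a record that is already final.
    for date, value in provisional.items():
        if merged.get(date, (None, ""))[1] != "final":
            merged[date] = (value, "provisional")

    # Finalized values always win, replacing any provisional record.
    for date, value in finalized.items():
        merged[date] = (value, "final")

    return merged
```

Applying finalized values last reflects the stated behavior that provisional values are overwritten as soon as finalized data are released.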
How to Participate
Additional LTER or USFS sites that are interested in participating in this data harvesting service can contact the ClimDB/HydroDB database administrator at the LTER Network Office. The site will be asked to provide a list of USGS streamflow stations they would like to harvest, and will be required to enter and maintain basic metadata (i.e. documentation) about their site, stations, and station variables in the ClimDB/HydroDB database using the provided web forms.
After the metadata have been entered for each station, the GCE Information Manager will be contacted and the stations added to the harvest list. An initial harvest will be performed to retrieve all available finalized data and recent provisional data from each new station. The designated data contact for the site will receive an automatic email message detailing the status of the initial data harvest with links to the harvested and processed data. If excessive Q/C limit check warnings cause the harvest to fail, sites can review and update limit check metadata and request a new harvest themselves via the HydroDB web page, or contact the database administrator for assistance.
Weekly harvests and reporting will then commence automatically, and no additional effort is required to continue participation (other than occasional metadata updates to reflect changes in personnel contact information or station Q/C criteria). Participation can also be stopped at any time on request.
Custom End-User Harvesting
The standard public distribution of the GCE Data Toolbox for MATLAB now includes end-user tools for performing custom harvests of real-time, daily or finalized data from any USGS station accessible on the USGS NWIS server (requires MATLAB 6.5 or higher on any supported platform). A graphical user interface dialog (illustrated on the right) and table of USGS stations are included, allowing users to locate stations by state and name and specify user-customized metadata templates to apply. Example source code to construct custom harvesting applications based on this toolbox is also available on request.
Note that a similar dialog is also available for retrieving data directly from ClimDB/HydroDB, allowing participants to download and plot USGS data harvested for their site to review the data and evaluate the effectiveness of Q/C limits stored in HydroDB.
Development of this harvesting service was made possible by NSF, through supplemental funding provided by Henry Gholz to LTER sites in 2002 to improve ClimDB and HydroDB participation.
This material is based upon work supported by the National Science Foundation under grants OCE-9982133, OCE-0620959 and OCE-1237140. Any opinions, findings, conclusions, or recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.