GCE Data Search Engine

[Figure: Search Engine GUI]

[Figure: Bounding Box selection map]


The GCE Data Search Engine is a GUI application for performing metadata-based searches to identify GCE Data Structures that meet various thematic, temporal, and geospatial criteria. Multiple search criteria can be defined by filling in text boxes, selecting items from lists or dragging rectangular bounds on maps, and then data sets matching the criteria are added to a cumulative search results list. Data sets in the results list can then be examined, loaded into various data analysis and plotting tools, exported in user-specified formats or integrated to form composite data sets.

Data Indexing

To support searching, data files are first analyzed using a combination of metadata and data mining techniques to generate an optimized search index. GCE Data Structures stored in MATLAB files in any number of local directories can be indexed, and the generated indices can be saved and then reloaded in subsequent search sessions for immediate start-up. Indices can also be refreshed at any time to remove entries for deleted files, update entries for changed files, and index any new files in previously indexed directories.
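The three refresh actions described above (remove, update, add) can be sketched as a comparison between a stored index and the current directory contents. This is a minimal illustration in Python, not the application's MATLAB implementation. The function names, the `.mat` suffix filter, and the use of file modification times to detect changes are all assumptions made for the example, and a real index stores far more metadata than a timestamp.

```python
import os


def scan_directories(directories, suffix=".mat"):
    """Map each matching file path to its last-modified time (hypothetical helper)."""
    found = {}
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            if name.lower().endswith(suffix):
                path = os.path.join(directory, name)
                found[path] = os.path.getmtime(path)
    return found


def plan_refresh(index, on_disk):
    """Compare a stored index with the files currently on disk.

    Both arguments map file path -> modification time (an assumed, simplified
    stand-in for a real index entry). Returns the paths to remove, update, and add.
    """
    remove = sorted(path for path in index if path not in on_disk)
    add = sorted(path for path in on_disk if path not in index)
    update = sorted(
        path for path in on_disk
        if path in index and on_disk[path] != index[path]
    )
    return {"remove": remove, "update": update, "add": add}
```

In this sketch, a refresh would call `scan_directories` on the previously indexed directories and pass the result to `plan_refresh`, then re-analyze only the files listed under `update` and `add`.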

With MATLAB 6.5 or higher, pre-generated indices of public data sets in the GCE Data Catalog and GCE Data Portal can also be downloaded and merged with local indices to support simultaneous searches of local and web-based data sets. When data files residing on the GCE web server are selected for analysis, the corresponding data structure is automatically retrieved and cached locally. This application therefore functions as a remote GCE data access client in addition to an end-user data management tool.

Data Searching

Various data set metadata fields can be searched, including title, keywords, abstract, methods, study descriptors, author, and taxonomic names. Searches can also be performed on study dates (either by date range or contained date), parameter names, study sites, and geographic bounding boxes. Negative criteria can be specified for textual search terms (e.g. -PAR) to exclude unwanted matches, and positive and negative criteria can be mixed and matched in fields accepting multiple terms (e.g. keywords) to fine-tune results.
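The following Python sketch illustrates how criteria of this kind might be evaluated. It is not the application's code, and several behaviors are assumptions made for the example: terms are separated by whitespace, all positive terms must match, matching is case-insensitive on whole words, a date-range search matches any overlapping study period, and a bounding box search matches any intersecting box (with no handling of boxes that cross the 180° meridian). All function names are hypothetical.

```python
import re
from datetime import date


def parse_terms(query):
    """Split a query into (required, excluded) lower-case terms; '-' marks exclusion."""
    required, excluded = [], []
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
        else:
            required.append(token.lower())
    return required, excluded


def text_matches(field_text, query):
    """True if every required term and no excluded term occurs as a whole word."""
    words = set(re.findall(r"[a-z0-9_]+", field_text.lower()))
    required, excluded = parse_terms(query)
    return all(t in words for t in required) and not any(t in words for t in excluded)


def date_range_overlaps(study_start, study_end, query_start, query_end):
    """True when the study period overlaps the queried date range."""
    return study_start <= query_end and query_start <= study_end


def contains_date(study_start, study_end, day):
    """True when a single date falls within the study period."""
    return study_start <= day <= study_end


def boxes_intersect(a, b):
    """Boxes are (west, south, east, north) in decimal degrees."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

For example, under these assumptions the query `salinity -PAR` matches a keyword field containing "salinity" unless the field also contains the word "PAR".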

Queries can optionally be saved to a query history list and then reloaded at any time for editing and re-execution. This feature allows users to build up standard queries which can be run against new or updated search indices. The query history window can also be hidden to make more room on screen for search results.

Working with Search Results

After every successful query, new data sets matching the specified criteria are added to a cumulative search result list. All information necessary to retrieve the corresponding data set is stored along with each entry, so search results are completely independent from search indices. In fact, result sets can be generated over multiple sessions using any number of different index files.
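A cumulative result list of this kind can be sketched in Python as follows. The entry layout (a dictionary with a `location` key identifying the file, plus whatever else is needed to retrieve it) and the function name are assumptions for illustration only. The key points are that only data sets not already listed are added, and that each entry is copied so it no longer depends on the index it came from.

```python
def add_results(results, matches):
    """Append matches not already in the result list; return how many were added.

    Each entry is a dict with a 'location' key (an assumed identifier such as a
    file path or URL). Entries are copied so the result list stays usable even
    if the originating index is later closed or replaced.
    """
    seen = {entry["location"] for entry in results}
    added = 0
    for entry in matches:
        if entry["location"] not in seen:
            results.append(dict(entry))
            seen.add(entry["location"])
            added += 1
    return added
```

Because duplicates are skipped by location, the same function can be called after queries against different index files without listing a data set twice.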

Double-clicking any entry loads the corresponding data set and displays its metadata in one of several user-selectable styles. If the data set resides on the GCE web server, the user registration information entered on initial program startup is used to retrieve the corresponding file from the GCE server, logging the user access and change notification preference. Web-based files are cached locally the first time they are retrieved, and the cached copy is then used for all subsequent analyses to minimize network file access.
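The cache-on-first-retrieval behavior can be illustrated with a short Python sketch. This is an assumption-laden example rather than the application's implementation: the function name, the use of the URL's file name as the cache key, and the injected `download` callable are all hypothetical, the URL is a placeholder, and user registration and access logging are not modeled.

```python
import os
import tempfile


def fetch_cached(url, cache_dir, download):
    """Return the local path for url, calling download(url) only on first retrieval.

    `download` is any callable returning the file contents as bytes
    (a hypothetical stand-in for the real server request).
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        with open(local_path, "wb") as handle:
            handle.write(download(url))
    return local_path


# Usage: a fake downloader records how often the "network" is actually hit.
calls = []


def fake_download(url):
    calls.append(url)
    return b"demo contents"


with tempfile.TemporaryDirectory() as cache_dir:
    first = fetch_cached("https://example.org/data/demo.mat", cache_dir, fake_download)
    second = fetch_cached("https://example.org/data/demo.mat", cache_dir, fake_download)
```

In the usage example, the second call finds the cached copy and returns the same path without invoking the downloader again.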

Data sets in the search result list can also be opened in the Data Editor application for detailed examination and analysis (e.g. statistical analysis, re-sampling, sub-sampling, value filtering, and unit conversions), or in various data plotting and summarization tools. Multiple data sets can also be selected and simultaneously copied, exported in various text and MATLAB formats, or merged to create composite data sets, with user-specified QA/QC flag handling and metadata format options. This application therefore provides users with convenient batch-processing capabilities that would otherwise require MATLAB scripting.

This material is based upon work supported by the National Science Foundation under grants OCE-9982133, OCE-0620959 and OCE-1237140. Any opinions, findings, conclusions, or recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.