Quality Assurance/Quality Control Flagging Reference
Introduction
The GCE Data Toolbox for MATLAB provides a comprehensive framework for Quality Assurance/Quality Control (QA/QC) flagging and analysis. In GCE Data Structures, the native storage format used by the toolbox, arrays of data quality "flags" (qualifiers) are created automatically whenever attributes (columns) are added to the structure. These flags are transparently maintained in synchrony with the data they describe throughout all processing steps and analyses. This separation of data values and QA/QC flags obviates the need to delete questionable values from data sets, permitting subsequent re-analysis and flexible handling and display of QA/QC information during analysis and data export.
Flags can be assigned automatically based on QA/QC criteria expressions (i.e. rules) defined for each data column, assigned manually in a spreadsheet-like data editor, or assigned graphically by selecting data points with the mouse. Criteria expressions can include simple conditionals, mathematical formulae and references to built-in or custom MATLAB functions in any combination. Criteria can also include cross-references to other data columns, and flags from multiple columns can be combined and propagated to dependent columns allowing users to perform QA/QC based on complex, multi-column dependency relationships (e.g. flagging of all measured values when a hydrographic instrument is out of the water, based on depth reading).
Flagging of invalid or questionable values in data sets is an important aspect of data processing and management, so QA/QC criteria should be defined whenever practical.
Automatic QA/QC Flagging
Flags can be assigned automatically to values in data columns by defining specific QA/QC criteria (i.e. rules) in the corresponding attribute metadata field (i.e. "criteria"). QA/QC criteria are MATLAB expressions that define alphanumerical flag characters to associate with column values that match the conditions specified. A Graphical User Interface (GUI) form is provided with the toolbox to simplify creating and editing QA/QC criteria expressions, which is opened by clicking the "Edit" button next to the criteria field (see figure). Simple criteria can be defined using drop-down menus, and more complex criteria based on function calls or multi-column relationships can be defined using additional GUI dialogs opened by clicking the respective buttons at the bottom of the editor form.
Many common Q/C rules can be defined using the GUI dialogs, but familiarity with the underlying syntax is still recommended for fine-tuning criteria or creating compound rules for specialized use cases. Basic QA/QC criteria (e.g. range or limit checks) can be defined using simple conditional statements, such as "x<0" or "x>=10", where x is a placeholder for the column values. Criteria can also reference any MATLAB statement or built-in function that returns a logical index of zeros and ones (i.e. zero for no flag, one for flag) or numerical index specifying flags to assign by array position (examples below). Custom QA/QC functions can also be referenced to assign flags based on advanced computations (e.g. statistical analysis, signal processing, time-series analysis), as long as a single logical or numerical index is returned from the function as the first output parameter. A variety of specialized QA/QC functions are provided with the GCE Data Toolbox distribution, and additional functions can be added at any time and referenced in criteria.
Criteria expressions can also include cross references to other data columns, both in conditional statements and function calls, allowing complex dependency-based criteria to be defined. Column references are indicated by prefacing the respective column name with "col_" (e.g. "col_Salinity" to reference "Salinity"). The "col_" prefix can be used in place of "x" for the primary data column reference, if desired, to improve readability of criteria expressions in metadata. Note that missing values in any dependent column will cause the criteria expression to return 0 (no flag) for that value, and incorrect column name spellings or deletion of a referenced column will cause the entire expression to be skipped; however, changes to column names and units performed in the Data Structure Editor (ui_editor) will automatically be propagated to all flag criteria expressions in the data set to maintain validity of QA/QC criteria and dependencies.
Note that QA/QC criteria defined in metadata templates are evaluated automatically whenever the template is applied to a dataset (e.g. on data import). Defining criteria in templates is therefore a powerful mechanism for providing automatic QA/QC for newly acquired or harvested raw data. Criteria are also re-evaluated automatically whenever criteria or data values are updated using GCE Data Toolbox programs, unless flags are locked by insertion of the "manual" token (see below).
QA/QC Criteria Syntax
Flag criteria expressions follow the pattern [condition]=[flag code], where [condition] is any MATLAB expression (or function call) that returns a logical or numerical index, and [flag code] is a corresponding alphanumeric flag code to assign when the condition is met. A GUI criteria editor is provided in the GCE Data Toolbox to simplify defining, editing and re-ordering Q/C criteria expressions. This editor can be invoked by pressing the "Edit" button next to the criteria field on the Data Editor window.
Specific syntax and examples are listed below:
1) Numeric conditionals (e.g. limit/range checks):
Syntax: x[operator][value]='[flag]' , where:
x (or col_[column name]) is an alias for values in the current data column
[operator] is ==, <, >, <=, >=, ~= (or <>)
[value] is a numeric value (scalar or array the same size as "x")
[flag] is any one text character, symbol, or digit enclosed in single quotes
Examples:
x<0='I' -- generates 'I' flags for negative values
x>=30='Q' -- generates 'Q' flags for values 30 or higher
x~=1='Q' -- generates 'Q' flags for values other than 1
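As a rough illustration of how a numeric conditional maps to flags, the following MATLAB sketch evaluates two of the example conditions against a sample vector and fills a character array of flags. The variable names and the use of a blank character for "no flag" are assumptions made for this example, not the toolbox's internal representation.

```matlab
% Sample column values (hypothetical data)
x = [-2; 5; 12; 30; 41];

% One flag character per value; a blank means "no flag" in this sketch
flags = repmat(' ', numel(x), 1);

% Each condition evaluates to a logical index the same size as x
idx_invalid = x < 0;         % corresponds to the rule x<0='I'
idx_questionable = x >= 30;  % corresponds to the rule x>=30='Q'

% Assign the flag character wherever the condition is met
flags(idx_invalid) = 'I';
flags(idx_questionable) = 'Q';

disp(flags')                 % show the flags as a single row
```

The key point is that the condition itself only produces an index; the flag character after the equals sign determines what is recorded at the indexed positions.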
2) Column cross-references (e.g. dependency checks):
Examples:
col_Depth<0='I' (in column Salinity) -- generates 'I' flags for salinity
values when values in Depth are negative (out of the water)
col_Dry_Weight>col_Wet_Weight='I' (in column Dry_Weight) -- generates 'I'
flags for dry weights that exceed the total wet weight for a sample
3) Basic mathematical expressions (e.g. multi-column dependency checks):
Example:
col_Wet_Weight>(col_Dry_Weight+col_Water_Weight)='Q' (in column Wet_Weight) --
generates 'Q' flags for wet weights that exceed dry weight plus
water weight (note that parentheses can be used to control order
of operations in expressions)
4) Built-in MATLAB numeric functions (e.g. statistical checks):
Examples:
isnan(x)='M' -- generates 'M' flags for any missing numerical values (NaN)
x<(mean(x)-3.*std(x))='Q' -- generates 'Q' flags for any values more than 3 standard
deviations below the column mean (assumes no missing values)
x<(mean(x(~isnan(x)))-3.*std(x(~isnan(x))))='Q' -- same as above, allowing for
missing values
std([col_Temp1,col_Temp2,col_Temp3,col_Temp4],0,2)>0.2='Q' -- checks for
excessive standard deviation of replicate sensor readings (rule would
be repeated in columns Temp1, Temp2, Temp3 and Temp4). Note that the
optional weighting and dimension arguments are used for the std() function
(0 selects the default N-1 normalization, 2 calculates the standard deviation
across rows of the matrix of column values).
abs(x-mean([col_Temp1,col_Temp2,col_Temp3,col_Temp4],2))>0.5='Q' -- checks for
excessive deviation from the mean of 4 redundant sensors (note that the
optional dimension argument is used for the mean() function to calculate
means for rows of the matrix of column values from Temp1, Temp2, etc.)
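The statistical checks above can be tried outside the toolbox with plain MATLAB arrays. The sketch below uses made-up sensor readings (all values and variable names are assumptions for illustration) to show the row-wise std() and mean() calls and the effect of excluding NaN values before computing column statistics.

```matlab
% Hypothetical replicate sensor readings (one row per observation)
Temp1 = [20.1; 20.3; 25.0];
Temp2 = [20.0; 20.2; 20.1];
Temp3 = [20.2; 20.4; 20.0];
Temp4 = [20.1; 20.3; 20.2];
T = [Temp1, Temp2, Temp3, Temp4];

% Standard deviation across each row (0 = default N-1 normalization, 2 = along rows)
idx_std = std(T, 0, 2) > 0.2;

% Deviation of the first sensor from the row mean of all four sensors
idx_dev = abs(Temp1 - mean(T, 2)) > 0.5;

% Column with one missing value and one low outlier (hypothetical data)
x = [10 .* ones(20, 1); NaN; -20];

% mean() and std() return NaN if any input is NaN, and comparisons with NaN
% are false, so this version flags nothing
idx_naive = x < (mean(x) - 3 .* std(x));

% Excluding NaN values first restores the intended check
valid = ~isnan(x);
idx_low = x < (mean(x(valid)) - 3 .* std(x(valid)));
```

In this example only the third observation is flagged by the two replicate checks, and only the final value is flagged by the NaN-tolerant limit check.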
5) Built-in MATLAB string functions (e.g. code checks):
Examples:
strcmp(x,'none')='M' -- generates 'M' flags for strings matching 'none'
~strcmp(x,'missing')='G' -- generates 'G' flags for strings not matching 'missing'
strncmp(x,'Spartina',8)='G' -- generates 'G' flags for strings with the first 8
characters matching 'Spartina'
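For text columns, the same string functions return one logical value per element when given a cell array of strings. The sketch below assumes that string column values are available as a cell array named x; the sample values are hypothetical.

```matlab
% Hypothetical text column values held in a cell array of strings
x = {'Spartina alterniflora'; 'none'; 'Juncus roemerianus'; 'Spartina patens'};

% Exact match against a code value, as in strcmp(x,'none')='M'
idx_none = strcmp(x, 'none');

% Match on the first 8 characters only, as in strncmp(x,'Spartina',8)='G'
idx_spartina = strncmp(x, 'Spartina', 8);
```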
6) Custom MATLAB functions (single column criteria):
Any MATLAB function that accepts column values as input and returns a logical or numeric index as its first output variable can be used in criteria expressions. Note that a function call editor with syntax help is available from the 'Q/C Flag Criteria Editor' tool.
Examples:
flag_notinlist(x,'Spartina,Juncus,Batis')='Q' -- generates 'Q' flags for any string
values not in the controlled vocabulary of genus names (or use flag_inlist to
check for values that are in a list, e.g. an error code)
flag_notinarray(x,[0,1,2,3])='Q' -- generates 'Q' flags for any numeric values that
are not present in the specified array (or use flag_inarray to check for values
that are in an array, e.g. numeric error codes)
flag_valuechange(x,5,10,3)='Q' -- generates 'Q' flags for any values that are more
than 5 below or 10 above the mean of the preceding 3 values, in the native units
of measurement for the column (note: input parameters are 'value','lowlimit',
'highlimit' and 'framesize', resp.)
~flag_valuechange(x,0.1,0.1,3)='Q' -- generates 'Q' flags for any values that are NOT
at least 0.1 below or above the mean of the preceding 3 values, indicating
that the sensor may be stuck (note: input parameters are 'value','lowlimit',
'highlimit' and 'framesize', resp.)
flag_percentchange(x,20,20,3)='Q' -- generates 'Q' flags for any values that vary
by more than 20% below or above the mean of the preceding 3 values (note:
input parameters are 'value','lowlimit','highlimit' and 'framesize', resp.)
flag_nsigma(x,3,3,5)='Q' -- generates 'Q' flags for any values that are more than
3 standard deviations below or above the mean of the preceding 5 values (note:
input parameters are 'value','lowlimit','highlimit' and 'framesize', resp.)
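To make the "mean of the preceding values" idea concrete, the following sketch implements a simple moving-frame change check in plain MATLAB. It is a hypothetical illustration of the kind of logic a custom QA/QC function could contain, not the actual implementation of flag_valuechange; the sample data and the treatment of the first few values (left unflagged) are assumptions.

```matlab
% Hypothetical time series with one high spike and one low spike
x = [10; 10.5; 9.8; 10.2; 25; 10.1; 10.3; 2; 10.0];

lowlimit = 5;     % maximum allowed drop below the frame mean
highlimit = 10;   % maximum allowed rise above the frame mean
framesize = 3;    % number of preceding values used for the mean

% Logical index returned to the criteria expression (true = assign flag)
idx = false(size(x));

% The first 'framesize' values have no complete frame and are left unflagged
for n = framesize+1:numel(x)
    frame_mean = mean(x(n-framesize:n-1));
    delta = x(n) - frame_mean;
    idx(n) = delta < -lowlimit || delta > highlimit;
end
```

Note that a spike also shifts the frame mean for the values that follow it, so limits should be chosen with that carry-over in mind.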
7) Custom MATLAB functions (multiple-column criteria):
Same as single-column custom function syntax, except additional column values are entered as function arguments, using the column reference format: col_[column name].
Examples:
flag_o2saturation(col_Oxygen,col_Temperature,col_Salinity,110,50)='Q' -- generates
'Q' flags for any oxygen values that are above 110% saturation or below 50%
saturation based on the oxygen saturation calculated as a function of oxygen
concentration, temperature and salinity.
flag_locationnames(col_Site,'sensitive')='I' -- generates 'I' flags for any location
names in column 'Site' that do not match registered location names in the
geographic database file 'geo_locations.mat' using case-sensitive matching
flag_locationcoords(col_Site,col_Longitude,col_Latitude,0.2,'gce_locations.mat')='Q' --
generates 'Q' flags for any location names in 'Site' with longitude and latitude
values that deviate more than 0.2km from the coordinates registered in
'gce_locations.mat' by dead reckoning (i.e. flags geo-referencing errors)
8) Compound criteria:
Multiple criteria can be specified for each column by using a semicolon to separate each expression. Overlapping criteria are supported, resulting in multiple flag assignments when more than one criterion is matched. Note that certain operations (e.g. encoding flags as unique integers, which is automatic for MATLAB file export) will only retain the first-assigned flag, so order of precedence should be considered when assigning multiple criteria (e.g. list rules that assign 'invalid' flags before rules that assign 'questionable' flags).
Example:
x<0='I';col_Depth<0.1='I';x>36='Q';flag_percentchange(x,20,20,3)='Q' (in "Salinity") --
generates 'I' flags for negative values, 'I' flags for values recorded when
Depth was < 0.1, 'Q' flags for values > 36 and 'Q' flags for values that are
more than 20% above or below the mean of the three preceding values.
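The way overlapping compound criteria accumulate flags in rule order can be sketched as follows. This is a simplified illustration and not the toolbox's own criteria parser: splitting on semicolons, matching rules with a regular expression and evaluating conditions with eval are all assumptions made for the example, and only built-in conditionals are used.

```matlab
% Hypothetical column values
x = [-1; 40; 38; 35];                 % values in the current column (Salinity)
col_Depth = [1.0; 0.05; 2.0; 1.5];    % values in the referenced column Depth

% Compound criteria string: rules separated by semicolons
criteria = 'x<0=''I'';col_Depth<0.1=''I'';x>36=''Q''';
rules = strsplit(criteria, ';');

% One (possibly empty) string of flags per value
flags = cell(numel(x), 1);
flags(:) = {''};

for r = 1:numel(rules)
    % Split each rule into [condition] and a one-character [flag code]
    tok = regexp(strtrim(rules{r}), '^(.*)=''(.)''$', 'tokens', 'once');
    if isempty(tok)
        continue   % skip anything that does not match [condition]='[flag]'
    end
    idx = logical(eval(tok{1}));       % evaluate the condition as MATLAB code
    for n = find(idx)'
        flags{n} = [flags{n}, tok{2}]; % append flags in rule order
    end
end
```

Here the second value matches both the depth rule and the upper limit rule, so it accumulates 'I' followed by 'Q'; an operation that keeps only the first-assigned flag would retain 'I', which is why the 'invalid' rules are listed first.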
Manual QA/QC Flagging
Flags can also be assigned manually using various GCE Data Toolbox programs and utilities. For example, data values displayed on line/scatter plots can be flagged (or unflagged) visually with the mouse using the "Visual Q/C Tool" available in plot figure menus. The user just selects a column name and flag to assign, then clicks on individual values or drags a rectangle over a range of values with the mouse. Whenever flags are manually assigned or cleared, the term "manual" is appended to the criteria field for the respective data column(s) to lock the flags and prevent automatic recalculation. Automatic flagging can be reinstated by removing the "manual" token from the criteria string or by using the "Unlock Q/C Flags" option under the "Edit > Q/C Flag Functions" menu in the Data Editor window.
Similarly, flags assigned prior to importing data into the GCE Data Toolbox (e.g. flags assigned by a data provider, such as USGS, NOAA, or LTER ClimDB/HydroDB) can also be converted to flag arrays and merged with (or used to replace) existing flags assigned by QA/QC criteria or manual editing. Predefined flag fields should be text columns that are named according to the convention "Flag_[column name]", e.g. "Flag_Salinity" for column "Salinity".
Flag Codes and Metadata
QA/QC flag codes should be documented in the metadata (i.e. 'Data' category, 'Codes' field) using the following format: "Q = questionable value, I = invalid value, M = missing", etc. This ensures that the flag codes are properly displayed in standard and XML metadata, and also allows column value codes to be automatically generated when flags are optionally converted to encoded integer columns during ASCII or MATLAB export operations or manually in the structure editor. A GUI flag definition editor is provided with the GCE Data Toolbox, which can be opened using the 'View/Edit Q/C Flag Definitions' option on the 'Edit > Q/C Flag Functions' menu.
Suggested flag codes are listed below:
I = invalid value (out of range) -- use for out-of-range/impossible values (e.g. negative mass)
Q = questionable value -- use for values outside of expected range (e.g. below detection limit,
well outside of historical value range, pattern indicating data contamination)
E = estimated value -- use for values that were estimated by interpolation or other means
S = spike/noise -- use for sharp discontinuities/spikes indicating data contamination
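The "code = definition" format used for the 'Codes' metadata field is straightforward to split into flag characters and definitions. The sketch below shows one way to do this in plain MATLAB; it is an illustration of the format only, and the variable names are assumptions rather than toolbox fields.

```matlab
% Flag code documentation string in the recommended format
codes = 'Q = questionable value, I = invalid value, M = missing';

% Split into individual "code = definition" entries
entries = strtrim(strsplit(codes, ','));

flag_chars = cell(size(entries));
flag_defs = cell(size(entries));
for n = 1:numel(entries)
    parts = strtrim(strsplit(entries{n}, '='));
    flag_chars{n} = parts{1};   % e.g. 'Q'
    flag_defs{n} = parts{2};    % e.g. 'questionable value'
end
```

Because commas and equals signs act as delimiters in this format, definitions that avoid those characters are the safest to parse this way.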
Automated QA/QC in Scripted Batch-mode Scenarios
The GCE Data Toolbox is well suited to use in scripted batch-mode data processing scenarios. Dataset metadata are used to automatically parameterize toolbox functions, so simple high-level commands can be used to carry out complex multi-step processing and analysis. All operations performed using GUI forms can be accomplished using a corresponding command line statement in a script, including propagation of flags to dependent columns, selective removal of flagged values, and automatic flagging of derived data sets (e.g. aggregated, temporally-resampled and binned data) based on number or percentage of flagged and/or missing values in primary data.
The key to performing automated QA/QC in unattended batch mode is to create a metadata template for the data source, containing appropriate QA/QC criteria (rules) for each attribute. When the template is applied to the raw data after loading or importing, QA/QC flags are automatically assigned to each attribute based on these criteria. The full suite of QA/QC-related functions can then be used to manage the display of flags in exported data products and plots, or to remove values assigned particular flags or perform other operations. Note that a GUI editor is provided with the GCE Data Toolbox for defining, managing and editing metadata templates.
Once a suitable metadata template is defined, simple functions or scripts can be used to fully process raw data files, for example:
[s,msg] = imp_ascii('weather.txt','d:\data\met','Weather Data','weather_template');
[s,msg] = nullflags(s,'I');
msg = exp_ascii(s,'tab','weather_qc.txt','d:\data\met','Weather Data','ST','M','FLED');
This script would perform the following operations:
- import and parse a raw ASCII data file (d:\data\met\weather.txt), automatically applying the 'weather_template' metadata template and assigning QA/QC flags after import
- remove values assigned 'I' flags, converting to NaN, retaining other flagged values
- export the processed data in tab-delimited ASCII format, with column titles, separate metadata file (in ESA FLED style), and text flag columns following the corresponding data columns
Additional commands could also be included to fill in missing records to create monotonic time series, add derived parameters based on equations referencing data columns (each with their own QA/QC criteria), and resample or filter the data to produce derived data products that can be further manipulated and exported along with the primary data. Specialized import filters can also be defined to perform an entire prescribed workflow using a single command. Such filters are included with the GCE Data Toolbox distribution for USGS NWIS data, NOAA NCDC climate data, LTER ClimDB/HydroDB data, NOAA HADS data and other sources.
Recent versions of MATLAB also include support for timed program execution, network data access (via HTTP, FTP and UNC paths), and a SOAP web services client, allowing the GCE Data Toolbox to be used for automated remote data acquisition and QA/QC processing. At GCE, fully automated data harvesters have been developed for NOAA HADS data, USGS NWIS data, and LTER ClimDB/HydroDB data (i.e. the USGS data harvesting service for HydroDB).
QA/QC Flag Handling in Post Processing
QA/QC flags are a constitutive component of GCE Data Structures, so most GCE Data Toolbox GUI dialogs and functions provide explicit options for handling flagged values in data sets during post processing and analysis. For example, flags can be displayed, ignored or removed when data are plotted. Summary statistics displays and reports can be generated with flagged values, without them, or both, and the number of flagged values is summarized for each attribute. Data export functions also provide various options for formatting flags in delimited ASCII and MATLAB files to support other programs and standards. Data integration tools (e.g. merge/union and join) also provide options for "locking" QA/QC flags to prevent inappropriate application of criteria after multiple data sets are combined.
Data aggregation, date/time re-sampling, and binning tools offer particularly fine-grained control over QA/QC flags. Values assigned specific flags can be removed prior to analysis, and QA/QC criteria can be defined automatically for derived data columns based on the number or percentage of flagged and/or missing values in each respective group, date/time interval or bin. Attributes listing the number (and percentage) of flagged and missing values are also included in derived data sets. Information on the quality and completeness of primary data can therefore be documented and preserved in derived data to guide usage and interpretation.
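The idea of flagging derived values based on the percentage of flagged or missing primary values can be sketched with plain MATLAB arrays. The example below is an illustration only and does not use the toolbox's aggregation functions; the sample data, the blank "no flag" character and the 25% thresholds are all assumptions.

```matlab
% Hypothetical primary data: four observations per day for three days
day   = [1;1;1;1; 2;2;2;2; 3;3;3;3];
x     = [5;6;NaN;7; 5;5;6;6; NaN;NaN;9;8];
flags = ['Q';' ';' ';' '; ' ';' ';' ';' '; 'I';' ';'Q';' '];

is_flagged = flags ~= ' ';   % any non-blank flag character
is_missing = isnan(x);

% Counts per day
n_total   = accumarray(day, 1);
n_flagged = accumarray(day, double(is_flagged));
n_missing = accumarray(day, double(is_missing));

% Percentages that could be stored alongside the derived values
pct_flagged = 100 .* n_flagged ./ n_total;
pct_missing = 100 .* n_missing ./ n_total;

% Daily means computed from non-missing values only
x_valid = x;
x_valid(is_missing) = 0;
daily_mean = accumarray(day, x_valid) ./ (n_total - n_missing);

% Flag derived values when more than 25% of inputs were flagged or missing
derived_flags = repmat(' ', numel(daily_mean), 1);
derived_flags(pct_flagged > 25 | pct_missing > 25) = 'Q';
```

In this example only the third day exceeds the thresholds, so its mean is the only derived value that receives a 'Q' flag, while the counts and percentages for all three days remain available to document data quality.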



