
V. ANOMALY DETECTION IN EDDY COVARIANCE DATA EXPERIMENT

A. EC Data Description

A series of experiments was conducted to determine whether SDVe can be used to evaluate DaProS-specified data properties to find anomalies in sensor data. Upon further analysis of the data properties obtained from the literature survey in [8], it was determined that three data property types account for approximately 72.5% of the total number of data properties specified by scientists in the literature survey: datum properties (32.5%), i.e., properties that capture the behavior of a single data sensor reading; datum relationship properties (30.8%), i.e., properties that capture a relationship between two or more sensors; and datum dependent instrument properties (9.2%), i.e., properties that capture the effect of environmental data behavior on the data collection instruments. Based on these findings, an experiment was designed to determine whether SDVe can be used to identify anomalies for the most frequently specified data properties: datum, datum relationship, and datum dependent instrument.
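As a concrete illustration of the three property types, each can be thought of as a predicate of different arity over sensor readings. The sketch below is illustrative only; the sensor names and thresholds are hypothetical placeholders, not properties taken from the literature survey or from the scientists' specifications.

```python
# Datum property: constrains a single sensor reading.
def datum_ok(co2_density: float) -> bool:
    return 0.0 <= co2_density <= 1000.0           # placeholder valid range

# Datum relationship property: relates readings from two or more sensors.
def datum_relationship_ok(air_temp_c: float, sonic_temp_c: float) -> bool:
    return abs(air_temp_c - sonic_temp_c) <= 5.0  # placeholder tolerance

# Datum dependent instrument property: captures how environmental conditions
# affect a data collection instrument (e.g., precipitation on an open-path
# gas analyzer); the trigger condition is a placeholder.
def instrument_reading_trustworthy(precipitation_mm: float) -> bool:
    return precipitation_mm == 0.0
```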

A data property specification expert collaborated with expert scientists working with eddy covariance and biomesonet towers’ CO2 data to develop a set of data properties of interest. The contributing scientists were working on building their first eddy covariance tower and were interested in capturing data properties extracted from sensor reference manuals [12][13], climate and climatological variations in the research site literature [2], eddy covariance towers post-field data quality control literature [14], and their own expertise.

B. Experimental Setup

An initial experiment was conducted to validate the ability of the SDVe tool to detect anomalies. An error-free eddy covariance data file was randomly selected from a scientist-provided repository and seeded with a number of anomalies, based on a 95% confidence level calculation given the number of readings in the file, to evaluate a group of data properties of type datum, datum relationship, and datum dependent instrument. The experiment identified all seeded anomalies, and all events marked as anomalies were actually anomalies.
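The paper does not state how the 95% confidence level calculation was performed. One plausible reading is a standard sample-size computation over the readings in the file; the sketch below is an assumption introduced for illustration (the function name, the use of a Cochran-style formula with a finite-population correction, and the 36,000-reading file size are not taken from the paper).

```python
import math

def seed_count(num_readings: int, z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Illustrative guess at 'a 95% confidence level calculation given the
    number of readings in the file': a sample-size formula with a
    finite-population correction for the file size."""
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)        # infinite-population sample size
    return math.ceil(n0 / (1 + (n0 - 1) / num_readings))  # finite-population correction

# Example: a 1-hour file of 10 Hz readings (~36,000 rows, an assumed figure)
print(seed_count(36000))  # -> 381 anomalies to seed
```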

A second experiment was conducted to determine if SDVe can identify anomalies in eddy covariance sensor data and to illustrate how such anomalies can be identified and documented by cross-referencing the results obtained from SDVe with existing metadata recorded during the data collection process. The experiment does not quantify the improvement in the overall quality of the data.

The scientists developed a matrix of all the relationships between sensors in the eddy covariance tower of interest for the specific site. The matrix included raw sensor measurements and derived data aggregated at different temporal resolutions and as part of the combination of measurements from two or more sensors. The sensor relationship matrix consisted of approximately 118 sensor readings along with their associated relationships. In collaboration with the scientists, some of the sensor relationships from the matrix were used to create 23 data properties to be evaluated over the eddy covariance datasets; the scientists specified properties of interest of type datum, instrument, and datum relationship. The properties were intended to capture anomalies in raw data at collection time. The sensor readings of interest were selected based on their relationships to other sensor readings and on the derivation of aggregated values from them. In collaboration with the scientists, data properties of interest were specified, refined, and validated using DaProS. The numeric thresholds used in the data properties were defined following the scientific community's algorithms and protocols.
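The paper does not show the concrete form of the relationship matrix or of the selection criterion. As an illustration only, the sketch below represents a fragment of such a matrix as an adjacency structure and selects readings of interest by how many relationships they participate in; the sensor names, relationships, and the threshold of two relationships are hypothetical placeholders, not the scientists' actual 118-reading matrix.

```python
# Hypothetical fragment of a sensor relationship matrix: each raw or derived
# reading is listed with the readings it is related to or derived from.
sensor_relationships = {
    "CO2":      ["H2O", "atm_press", "Ts"],   # raw reading used in flux derivations
    "H2O":      ["atm_press", "Ts"],
    "Fc_30min": ["CO2", "Ts", "atm_press"],   # derived 30-minute CO2 flux
}

def readings_of_interest(matrix: dict, min_relations: int = 2) -> list:
    """Select readings participating in at least `min_relations` relationships,
    mirroring the idea of choosing readings by their relationships to other
    readings and by the aggregated values derived from them."""
    return [name for name, related in matrix.items() if len(related) >= min_relations]

print(readings_of_interest(sensor_relationships))  # ['CO2', 'H2O', 'Fc_30min']
```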

The Eddy Covariance (EC) data verified using SDVe were collected from July 06, 2010 to July 13, 2010 to capture EC summer behavior and from February 09, 2010 to February 16, 2010 to capture EC winter behavior. The sensors at the tower collected the EC data continuously, and a scientist manually split the data into 1-hour interval files to ease the verification process. A total of 349 data files were evaluated for this work.

Two data property specification files were created, one for each season, and used to automatically evaluate individual data files according to the season to which they belonged. The data files and specification files were automatically input to SDVe to be evaluated. For each data file, the sensor data streams were extracted into separate data scopes according to the data property specifications. Then, the data scopes were evaluated by applying the specification's Boolean statement and data pattern to every individual sensor reading in the scope. If an individual sensor reading in the scope did not satisfy the data pattern and Boolean statement, a flag was raised and stored in the verification file. Once the verification process had concluded, a verification summary file was generated aggregating the number of violations, i.e., anomalies, identified by each data property along with the aggregated processing times.
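The paper describes this evaluation flow in prose only. The following Python sketch mirrors the described flow under stated assumptions: the names (DataProperty, verify_file), the data structures, and the example property are all introduced for illustration and are not SDVe's actual implementation or API, and the per-flag verification file is omitted, keeping only the aggregated violation counts of the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DataProperty:
    """Hypothetical stand-in for a DaProS-specified data property."""
    name: str
    sensor: str                          # sensor stream the data scope is drawn from
    predicate: Callable[[float], bool]   # Boolean statement a reading must satisfy

@dataclass
class VerificationSummary:
    violations: Dict[str, int] = field(default_factory=dict)  # anomalies per property

def verify_file(readings: Dict[str, List[float]],
                properties: List[DataProperty]) -> VerificationSummary:
    """Evaluate every property over its data scope, flagging violating readings."""
    summary = VerificationSummary()
    for prop in properties:
        scope = readings.get(prop.sensor, [])        # data scope for this property
        flags = [i for i, value in enumerate(scope)  # readings violating the property
                 if not prop.predicate(value)]
        summary.violations[prop.name] = len(flags)
    return summary

# Example: a datum property on CO2 density with a placeholder valid range
props = [DataProperty("co2_in_range", "CO2", lambda v: 0.0 <= v <= 1000.0)]
print(verify_file({"CO2": [400.2, -5.0, 412.9]}, props).violations)  # {'co2_in_range': 1}
```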

C. Results

SDVe performed a total of 219,800,854 evaluation calls, of which 50,857,351 (approximately 23%) were captured as anomalies. The evaluation process took approximately 20 hours to complete, of which approximately 1 hour was spent loading the files into the system and 19 hours were spent verifying the data. Assuming a data file takes 15 minutes on average to be manually processed and evaluated by a scientist, manually processing the 349 files would take approximately 87 hours. SDVe automatically evaluates the data in roughly one fourth of the time that it would take a scientist to manually evaluate the same amount of data.
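For readers who want to trace these figures, a short back-of-the-envelope check of the reported numbers (the 15-minute manual effort per file is the assumption stated in the text):

```python
# Figures taken from the reported results.
total_calls, anomalies = 219_800_854, 50_857_351
print(anomalies / total_calls)        # ~0.231 -> "approximately 23%"

files, minutes_per_file = 349, 15     # stated assumption: 15 min of manual work per file
manual_hours = files * minutes_per_file / 60
print(manual_hours)                   # 87.25  -> "approximately 87 hours"
print(20 / manual_hours)              # ~0.23  -> roughly one fourth of the manual time
```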

For the summer data, 124,958,406 evaluation calls took place, of which 24,417,791 (approximately 20%) were captured as anomalies.

Datum properties identified the most anomalies (21,429,802), followed by datum relationship (2,639,985), instrument (348,004), and datum dependent instrument (0) properties. The sensor datasets with the most anomalies included water vapor mass density (H2O), atmospheric pressure (atm_press), carbon dioxide (CO2), and temperature (Ts).
