28.02.2013 Views

Bio-medical Ontologies Maintenance and Change Management


Extraction of Constraints from Biological Data

called itemset. We are interested in finding frequent itemsets, which are itemsets with support higher than a specific threshold.

A well-known technique for frequent itemset extraction is the Apriori algorithm [1]. It relies on a fundamental property of frequent itemsets, called the apriori property: every subset of a frequent itemset must also be a frequent itemset. The algorithm proceeds iteratively, first identifying the frequent itemsets containing a single item. In each subsequent iteration, the frequent itemsets with n items identified in the previous iteration are combined to obtain candidate itemsets with n+1 items. A single scan of the database after each iteration suffices to determine which of the generated candidates are frequent itemsets.
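
The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the chapter: the function name `apriori`, the toy transactions, and the choice to keep itemsets whose support is greater than or equal to the threshold are all assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # One scan of the database counts all candidates of the current size.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # First iteration: frequent itemsets containing a single item.
    level = frequent_among({frozenset([i]) for t in transactions for i in t})
    frequent = dict(level)
    k = 1
    while level:
        prev = set(level)
        # Combine frequent k-itemsets into candidates with k+1 items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Apriori property: drop candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        level = frequent_among(candidates)
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical toy database of five transactions.
toy_transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_sets = apriori(toy_transactions, min_support=0.5)
for itemset, support in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With a threshold of 0.5, the three single items and the three pairs are frequent, while the full triple (present in two of five transactions) is discarded after the last scan.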

Once the largest frequent itemsets are identified, each of them can be subdivided into smaller itemsets to find association rules. For every largest frequent itemset s, all non-empty proper subsets a are computed. For every such subset, a rule a → (s-a) is generated and its confidence is computed. If the rule confidence exceeds a specified minimum threshold, the rule is included in the result set.
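
As a rough sketch of this rule-generation step, the snippet below enumerates the non-empty proper subsets a of each given itemset s and keeps the rules a → (s-a) that meet the confidence threshold. The function name `generate_rules` and the example counts are hypothetical. The confidence is taken as the standard ratio count(s) / count(a), and the comparison uses "greater than or equal to" so that a minimum confidence of 1 still returns exact rules; both points are assumptions of this example rather than details stated in the text.

```python
from itertools import combinations

def generate_rules(itemsets, counts, min_confidence):
    """Generate rules a -> (s - a) from each itemset s.

    counts maps every frequent itemset (as a frozenset) to the number of
    records containing it; by the apriori property it also holds all subsets.
    """
    rules = []
    for s in itemsets:
        s = frozenset(s)
        for size in range(1, len(s)):           # non-empty proper subsets only
            for a in map(frozenset, combinations(s, size)):
                confidence = counts[s] / counts[a]
                if confidence >= min_confidence:
                    rules.append((a, s - a, confidence))
    return rules

# Hypothetical occurrence counts for a two-item frequent itemset.
example_counts = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 4,
    frozenset({"a", "b"}): 3,
}
example_rules = generate_rules([{"a", "b"}], example_counts, min_confidence=0.7)
for antecedent, consequent, confidence in example_rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

Using absolute counts instead of relative supports gives the same ratio and avoids rounding noise in the comparison.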

5.3 Quasi Tuple Constraints

For tuple constraints, the minimum confidence must be equal to 1. The minimum support value allows us to concentrate on the most frequent constraints. If the support is set to the inverse of the total number of records, then all the constraints are considered (this support corresponds to rules contained in a single data entry). We defined the tolerance as the complement of the confidence value (e.g., if confidence is 0.95, then tolerance is 0.05).
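
To make the roles of support, confidence, and tolerance concrete, here is a small sketch that extracts constraints between two attributes of a record set. It is an illustration under stated assumptions, not the authors' method: the function name `quasi_tuple_constraints`, the attribute names `fold` and `class`, and the toy records are hypothetical. The tolerance test is written in terms of violation counts, which is equivalent to requiring 1 - confidence <= tolerance.

```python
from collections import Counter

def quasi_tuple_constraints(records, lhs, rhs, tolerance=0.0, min_support=None):
    """Return (lhs_value, rhs_value, support, confidence) for each rule
    lhs_value -> rhs_value whose tolerance (1 - confidence) is within bounds."""
    n = len(records)
    if min_support is None:
        # Inverse of the number of records: a rule seen once is still kept.
        min_support = 1 / n
    pair_counts = Counter((r[lhs], r[rhs]) for r in records)
    lhs_counts = Counter(r[lhs] for r in records)
    found = []
    for (a, b), k in pair_counts.items():
        violations = lhs_counts[a] - k          # records with a but not b
        if k / n >= min_support and violations <= tolerance * lhs_counts[a]:
            found.append((a, b, k / n, k / lhs_counts[a]))
    return found

# Hypothetical records; the attribute names are placeholders.
toy_records = (
    [{"fold": "f1", "class": "alpha"}] * 3
    + [{"fold": "f1", "class": "beta"}]
    + [{"fold": "f2", "class": "beta"}] * 2
)
exact = quasi_tuple_constraints(toy_records, "fold", "class", tolerance=0.0)
relaxed = quasi_tuple_constraints(toy_records, "fold", "class", tolerance=0.25)
print(exact)
print(relaxed)
```

With tolerance 0.0 only the rule f2 → beta survives (confidence 1). Raising the tolerance to 0.25 also admits f1 → alpha, which holds in three of the four records with fold f1.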

By setting a tolerance value of 0.0, our method detected all and only the tuple constraints contained in the SCOP and CATH databases. Furthermore, we performed experiments with tolerance values ranging from 0.001 to 0.100 (corresponding to confidence values from 0.999 to 0.900), while the support value was set to the inverse of the total number of records. Results show a considerable number of quasi tuple constraints, which grows as the tolerance value increases (see Fig. 3).

By investigating tuples that violate such constraints, anomalies can be identified. Such analysis allows domain experts to focus their study on a small set of data in order to highlight biological exceptions or inconsistencies in the data. As the results in Fig. 3 show, the lower the tolerance, the stronger the tuple constraint between the two values and the fewer rules are extracted. By tuning the tolerance, a domain expert can concentrate on the desired subset of the most probable anomalies present in the data. As an example, on the SCOP database a tolerance value of 0.05 extracts fewer than 200 rules to be further investigated by domain experts, out of the overall 30000 protein entries.
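
Once a quasi tuple constraint has been selected, collecting the records that violate it is a simple filter. The sketch below assumes the constraints are available as a mapping from a left-hand value to the right-hand value it is expected to imply; the function name `violating_records`, the attribute names, and the sample records are hypothetical.

```python
def violating_records(records, lhs, rhs, constraints):
    """Return the records whose lhs value has a constraint but whose rhs
    value differs from the one the constraint predicts."""
    return [r for r in records
            if r[lhs] in constraints and r[rhs] != constraints[r[lhs]]]

# Hypothetical records and one quasi tuple constraint, f1 -> alpha.
sample_records = [
    {"id": 1, "fold": "f1", "class": "alpha"},
    {"id": 2, "fold": "f1", "class": "alpha"},
    {"id": 3, "fold": "f1", "class": "beta"},   # candidate anomaly
    {"id": 4, "fold": "f2", "class": "beta"},
]
anomalies = violating_records(sample_records, "fold", "class", {"f1": "alpha"})
print(anomalies)
```

The returned records are only candidates: whether each one is a biological exception or a data inconsistency still has to be decided by further analysis.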

To distinguish between biological exceptions and inconsistencies, a further analysis of the discovered anomalies can be performed by means of three approaches:
