28.02.2013 Views

Bio-medical Ontologies Maintenance and Change Management


Extraction of Constraints from Biological Data

Errors and anomalies may be distinguished by analyzing the confidence value of the rules. The confidence of the second rule (0.3%) is at least one order of magnitude smaller than the average confidence values of the other rules. Hence, we can conclude that this is probably an error.
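The order-of-magnitude comparison described above can be sketched as follows. This is only an illustration, not the authors' tool: the function name `flag_probable_errors`, the default factor of 10, and every sample confidence except 0.3% are assumptions introduced here.

```python
from statistics import mean


def flag_probable_errors(confidences, factor=10.0):
    """Return the indices of rules whose confidence is at least `factor`
    times smaller than the mean confidence of the remaining rules."""
    flagged = []
    for i, c in enumerate(confidences):
        others = confidences[:i] + confidences[i + 1:]
        # Compare each rule against the average of all the other rules.
        if others and c * factor <= mean(others):
            flagged.append(i)
    return flagged


# Hypothetical confidences expressed as fractions.
# Only 0.003 (0.3%) is taken from the text; the others are made up.
sample_confidences = [0.04, 0.003, 0.05, 0.06]
probable_errors = flag_probable_errors(sample_confidences)
print(probable_errors)
```

Rules that are not flagged this way would remain candidates for genuine anomalies rather than errors, and would still need inspection by a domain expert.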

5 Experiments on Biological Data

For about thirty years, biological data have been generated by a variety of biomedical devices and stored at an increasing rate in public repositories. As a result, the constraints on these data are often unknown, and it is difficult to recognize the structure of the data and the relationships among them.

In this section we present experiments that apply constraint extraction to biological databases. Two kinds of experiments are reported. The first validates the constraint extraction method by applying the proposed framework to a database whose constraints and dependencies are known in advance. The second extracts and analyzes quasi constraints to discover semantic anomalies and inconsistencies due to possible violations.

5.1 Biological Databases

To verify the accuracy of our method, we selected two databases whose structure and data dependencies were known. We considered the SCOP (Structural Classification Of Proteins, version 1.71, http://scop.berkeley.edu) and CATH (Class, Architecture, Topology and Homologous superfamily, version 3.0.0, http://www.cathdb.info) databases, which, like many biological data sources, have a hierarchical structure. However, the method does not depend on the hierarchical structure, since the Apriori algorithm and our analysis can be applied to different data models, ranging from the relational model to XML.

The SCOP database classifies about 30,000 proteins in a hierarchical tree of seven levels. From the root to the leaves they are: class, fold, superfamily, family, protein, species, as shown in Fig. 2. This tree structure is particularly suitable for verifying the tuple constraints and functional dependencies that can be extracted by our tool. In this database, tuple constraints are represented by tree edges, while functional dependencies are represented by the tree hierarchy.

For example we expect to find the following association rules:<br />

• (Superfamily=alpha helical ferredoxin)→(Fold=globin-like) with confidence of 100%

• (Superfamily=globin-like)→(Fold=globin-like) with confidence of 100%

• $\sum_{i \in \text{Superfamily} \rightarrow \text{Fold}} s_i = 1$

In this way we can deduce that (NOT(Superfamily=alpha helical ferredoxin) OR (Fold=globin-like)) and (NOT(Superfamily=globin-like) OR (Fold=globin-like)) are tuple constraints and Superfamily→Fold is a functional dependency.
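The deduction above can be illustrated with a small sketch. It assumes that s_i denotes the support of rule i and that the sum ranges over the Superfamily→Fold rules with 100% confidence; the text shown here does not define either point, so treat both as assumptions. The function names and the three toy rows are hypothetical and are not taken from SCOP itself. The sketch counts rules directly rather than running the Apriori algorithm.

```python
import math
from collections import Counter


def extract_rules(rows, lhs, rhs):
    """List every observed rule (lhs=a) -> (rhs=b) with its support
    (fraction of all rows) and confidence (fraction of rows with lhs=a)."""
    n = len(rows)
    pair_counts = Counter((row[lhs], row[rhs]) for row in rows)
    lhs_counts = Counter(row[lhs] for row in rows)
    return [
        {"lhs": a, "rhs": b, "support": k / n, "confidence": k / lhs_counts[a]}
        for (a, b), k in pair_counts.items()
    ]


def is_functional_dependency(rows, lhs, rhs):
    """Assumed reading of the criterion: the supports of the rules with
    100% confidence must sum to 1."""
    exact = [r for r in extract_rules(rows, lhs, rhs) if r["confidence"] == 1.0]
    return math.isclose(sum(r["support"] for r in exact), 1.0)


def satisfies_tuple_constraint(row, lhs, a, rhs, b):
    """Check NOT(lhs=a) OR (rhs=b), the logical form of (lhs=a) -> (rhs=b)."""
    return row[lhs] != a or row[rhs] == b


# Hypothetical toy table with two superfamilies under one fold.
rows = [
    {"Superfamily": "alpha helical ferredoxin", "Fold": "globin-like"},
    {"Superfamily": "globin-like", "Fold": "globin-like"},
    {"Superfamily": "globin-like", "Fold": "globin-like"},
]

forward_holds = is_functional_dependency(rows, "Superfamily", "Fold")
reverse_holds = is_functional_dependency(rows, "Fold", "Superfamily")
print(forward_holds, reverse_holds)
```

On this toy table the check succeeds for Superfamily→Fold and fails in the reverse direction, because one fold contains more than one superfamily.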
