28.02.2013 Views

Bio-medical Ontologies Maintenance and Change Management


170 D. Apiletti et al.

of possible schema constraints. However, since they are inferred from data, they highlight possible schema properties, but do not represent actual schema properties. An instance constraint may become a schema property when it is validated by an application domain expert.

Since our aim is to infer unknown constraints from data, we will focus on instance constraints. We will denote instance constraints with the term constraint in the remainder of this chapter. For example, a constraint may be inferred by analyzing the values of the Age attribute. If it always has positive values, we can infer the (instance) domain constraint Age>0. Similarly, by analyzing a mark database, we may detect a tuple constraint between the values of the attributes Mark and Honours. If the Honours attribute is true, Mark should take the highest value. Equivalently, if Mark does not take the highest value, Honours is false. If all the values of two attributes are linked by tuple constraints, then there is a functional dependency between the attributes. A functional dependency states that if in a relation two rows agree on the value of a set of attributes X, then they also agree on the value of a set of attributes Y. The dependency is written as X → Y. For example, in a relation such as Buyers (Name, Address, City, Nation, Age, Product), there is a functional dependency City → Nation, because for each row the value of the attribute City identifies the value of the attribute Nation (i.e., a city always belongs to the same nation).
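
The two kinds of inference described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the chapter's method: the helper names `holds_fd` and `infer_lower_bound` and the sample `buyers` rows are assumptions made up for the example, and a dependency that holds on the sample is still only an instance constraint until a domain expert validates it.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are lists of attribute names.
    """
    seen = {}  # maps each observed lhs value to the rhs value first seen with it
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores val on first sight, otherwise returns the stored value
        if seen.setdefault(key, val) != val:
            return False  # two rows agree on lhs but disagree on rhs
    return True


def infer_lower_bound(rows, attr):
    """Return the smallest observed value of attr (a candidate domain constraint)."""
    return min(row[attr] for row in rows)


# Hypothetical sample of the Buyers relation (values invented for illustration).
buyers = [
    {"Name": "Anna", "City": "Turin", "Nation": "Italy", "Age": 34},
    {"Name": "Marc", "City": "Lyon", "Nation": "France", "Age": 51},
    {"Name": "Luca", "City": "Turin", "Nation": "Italy", "Age": 27},
]

city_determines_nation = holds_fd(buyers, ["City"], ["Nation"])
nation_determines_name = holds_fd(buyers, ["Nation"], ["Name"])
age_is_positive = infer_lower_bound(buyers, "Age") > 0
```

On this sample, City → Nation holds and Age>0 can be proposed as a domain constraint, while Nation → Name fails because two rows share a nation but differ in name.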

Unfortunately, real datasets are affected by errors. Errors can be divided into two categories: syntactic and semantic. Among syntactic errors there are incompleteness (due to the lack of attribute values), inaccuracy (due to the presence of errors and outliers), lexical errors, domain format errors and irregularities. Among semantic errors there are discrepancies, due to a conflict between some attribute values (e.g. age and date of birth); ambiguities, due to the presence of synonyms, homonyms or abbreviations; redundancies, due to the presence of duplicate information; inconsistencies, due to an integrity constraint violation (e.g. the attribute age must be a value greater than 0) or a functional constraint violation (e.g. if the attribute married is false, the attribute wife must be null); and invalidities, due to the presence of tuples that do not display anomalies of the classes defined above but still do not represent valid entities [12].
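
As a rough illustration of how some of these error classes could be detected mechanically, the sketch below checks three of the examples given above: a missing value (incompleteness), the integrity constraint on age, and the functional constraint between married and wife. The function name `find_violations`, the record layout, and the sample `people` rows are assumptions introduced for this example only.

```python
def find_violations(rows):
    """Return a list of (row_index, error_class) pairs for the checks below."""
    violations = []
    for i, row in enumerate(rows):
        if row["age"] is None:
            # Incompleteness: the attribute value is missing.
            violations.append((i, "incompleteness"))
        elif row["age"] <= 0:
            # Integrity constraint violation: age must be greater than 0.
            violations.append((i, "integrity"))
        if row["married"] is False and row["wife"] is not None:
            # Functional constraint violation: wife must be null if not married.
            violations.append((i, "functional"))
    return violations


# Hypothetical records (values invented for illustration).
people = [
    {"age": 42, "married": True, "wife": "Maria"},
    {"age": -3, "married": False, "wife": None},
    {"age": 30, "married": False, "wife": "Eve"},
    {"age": None, "married": False, "wife": None},
]

violations = find_violations(people)
```

Other classes listed above, such as ambiguities or invalidities, generally need domain knowledge and are not captured by simple per-row rules like these.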

Public genomic and proteomic databases may be affected by the aforementioned errors, mainly because they grew some years ago, under the pressing need to store the large amount of genetic information available at the time, with neither a standard method to collect it nor a standard format to represent it. Due to these problems it is difficult to clearly describe the actual relationships among data. Recently, a significant effort has been devoted to the integration of distributed heterogeneous databases, where researchers continuously store their new experimental results. However, the existence of erroneous or poor-quality data may harmfully affect any further elaboration or application.

Errors can occur in different phases of data production: experiment (unnoticed experimental setup failure or systematic errors), analysis (misinterpretation of information), transformation (from one representation into another), propagation
