08.01.2013 Views

Back Room Front Room 2


This paper describes the data sampling techniques developed as a basis for the data analysis component of a data quality software system based on formal methods (Ranito et al., 1998; Neves, Oliveira et al., 1999).

2 FUZZY CLUSTERING FOR SAMPLING

Several methods can be used to approach sampling in databases (Olken, 1993) but, in particular, weighted and stratified sampling algorithms appear to produce the best results in data quality auditing. Whenever a process deals with small amounts of data exhibiting similar behaviour, exceptions and problems emerge more quickly.

Fuzzy clustering is an interesting technique for producing weights and partitions for the sampling algorithms. The creation of partitions is not a static and disjoint process: records have a chance of belonging to more than one partition. This reduces the potential sampling error, since a relevant record¹ can still be selected during the sampling of subsequent partitions, even when it was not selected in the partition whose values are most similar to its own. The same probabilities can also be used to produce universal weighted samples.

The Partition Around Method, better known as Partitioning Around Medoids or PAM (Kaufman and Rousseeuw, 1990), is a popular partitioning algorithm in which the k-partitions method is used to choose the central (representative) element of each partition, whereupon the partitions' limits are established by neighbourhood affinity. The fuzziness introduced in the algorithm comes from making the probability of inclusion in a partition depend on the Euclidean distance between elements, not only to the nearest neighbour but also to the other partitions' representatives.

2.1 K-partitions Method<br />

For a given population, each record is fully characterized whenever the value of every one of its $p$ attributes is known. Let $x_{it}$ represent the value of record $i$ in attribute $t$, $1 \le t \le p$. The Euclidean distance $d(i,j)$ between two records, $i$ and $j$,² is given by:

$$
d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + \dots + (x_{ip} - x_{jp})^2}
$$

¹ Critical record in terms of data quality.

² $d(i,j) = d(j,i)$

RELATIONAL SAMPLING FOR DATA QUALITY AUDITING AND DECISION SUPPORT

If necessary, some or all of the attributes must be normalized, so that attributes with different domains do not weigh disproportionately on the cluster definition. The common treatment is to calculate the mean value of an attribute and its standard deviation (or, alternatively, its mean absolute deviation) and to rescale the attribute's values accordingly.
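The normalization and the distance above can be illustrated with a short Python sketch. It is not taken from the paper: it assumes that "normalizing" means subtracting the attribute mean and dividing by the chosen spread measure (a common reading, not confirmed by the text), and the names `standardize`, `euclidean` and `use_mean_abs_dev`, as well as the sample records, are hypothetical.

```python
import math
from statistics import mean


def standardize(records, use_mean_abs_dev=False):
    """Rescale each attribute (column) to zero mean and unit spread.

    Assumption: the spread is the population standard deviation, or the
    mean absolute deviation when use_mean_abs_dev is True.
    """
    scaled_columns = []
    for col in zip(*records):
        m = mean(col)
        if use_mean_abs_dev:
            spread = mean(abs(v - m) for v in col)
        else:
            spread = math.sqrt(mean((v - m) ** 2 for v in col))
        if spread == 0:
            spread = 1.0  # constant attribute: keep it centred at zero
        scaled_columns.append([(v - m) / spread for v in col])
    # Transpose back so that each inner list is one record again.
    return [list(row) for row in zip(*scaled_columns)]


def euclidean(a, b):
    """Euclidean distance d(i, j) between two records of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical records: the second attribute has a much wider domain.
records = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
z = standardize(records)
print(euclidean(records[0], records[1]))  # dominated by the second attribute
print(euclidean(z[0], z[1]))              # both attributes now contribute
```

Without rescaling, the second attribute dominates the distance simply because its values are larger, which is the effect the normalization is meant to avoid.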

The k-partitions method takes as the first partition's representative the element that minimizes the sum of the Euclidean distances to all the elements in the population. The other representatives are selected according to the following steps:

1. Consider an element $i$ not yet selected as a partition's representative.

2. Consider an element $j$ not yet selected and denote by $D_j$ its distance to the nearest partition representative already selected. As mentioned above, $d(j,i)$ denotes its distance to element $i$.

3. If $j$ is closer to $i$ than to its nearest representative, then $j$ contributes to the possible selection of $i$ as a representative. The contribution of $j$ to the selection of $i$ is expressed by the following gain function:

$$
C_{ji} = \max(D_j - d(j,i), 0)
$$

4. The potential of selection of element $i$ as a representative is then given by:

$$
\sum_j C_{ji}
$$

5. Element $i$ will be selected as a representative if it maximizes the potential of selection:

$$
\max_i \sum_j C_{ji}
$$
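The selection steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_representatives` and the sample records are hypothetical, the records are assumed to be already normalized, and the sum over $j$ is assumed to include the candidate $i$ itself (in which case its contribution is $D_i$), since the text does not exclude that case.

```python
import math


def euclidean(a, b):
    """Euclidean distance d(i, j) between two records of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_representatives(records, k):
    """Return the indices of k partition representatives."""
    n = len(records)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of records")
    dist = [[euclidean(records[i], records[j]) for j in range(n)]
            for i in range(n)]

    # First representative: minimizes the sum of distances to all elements.
    reps = [min(range(n), key=lambda i: sum(dist[i]))]

    while len(reps) < k:
        best_i, best_potential = None, -1.0
        for i in range(n):
            if i in reps:
                continue  # step 1: i must not be a representative yet
            potential = 0.0
            for j in range(n):
                if j in reps:
                    continue  # step 2: j must not be a representative yet
                d_j = min(dist[j][r] for r in reps)      # D_j
                potential += max(d_j - dist[j][i], 0.0)  # C_ji (steps 3, 4)
            if potential > best_potential:               # step 5
                best_i, best_potential = i, potential
        reps.append(best_i)
    return reps


# Hypothetical data: two well-separated groups of three records each.
records = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
reps = select_representatives(records, 2)
print(reps)
```

On this data the first representative comes from one group and the second from the other, because elements far from the first representative contribute the largest gains.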

2.2 Fuzzy Clustering Probabilities<br />

After defining the partitions' representatives, it is possible to set a probability of inclusion of each element in each of the $k$ partitions established, based on the Euclidean distance between elements.

A representativeness factor $f_r$ is set according to the relevance of an element in the context of each cluster. Empirical tests indicate that this factor should be 0.9 for a partition's representative and 0.7 for all other elements (Cortes, 2002). The algorithm's definition is described below³:

³ For a full description, see Cortes (2002).

