Back Room Front Room 2
This paper describes the data sampling techniques developed as a basis for the data analysis component of a data quality software system based on formal methods (Ranito et al., 1998; Neves, Oliveira et al., 1999).
2 FUZZY CLUSTERING FOR SAMPLING
Several methods can be used to approach sampling in databases (Olken, 1993), but weighted and stratified sampling algorithms in particular appear to produce the best results in data quality auditing. When a process deals with small amounts of data that exhibit similar behaviour, exceptions and problems emerge faster.
Fuzzy clustering is a useful technique for producing the weights and partitions that these sampling algorithms need. The partitions it creates are neither static nor disjoint: a record may belong to more than one partition. This reduces the potential sampling error, since a relevant record¹ can still be selected while sampling a subsequent partition, even if it was not selected in the partition whose values are most similar to its own. The same probabilities can also be used to produce universal weighted samples.
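As an illustration of how overlapping partitions give a record several chances of being selected, the following minimal sketch draws a weighted sample from each partition. The membership values, the record identifiers, and the function name `weighted_sample` are hypothetical and chosen only for the example. The selection rule used here (the Efraimidis-Spirakis key method for weighted sampling without replacement) is an assumption of this sketch, not the sampling algorithm of the system described in this paper.

```python
import random


def weighted_sample(weights, size, rng):
    """Draw up to `size` distinct record ids, favouring larger weights.

    Each record with weight w > 0 receives the key u ** (1 / w), with u
    uniform in [0, 1), and the records with the largest keys are kept.
    Records with zero weight are never drawn.
    """
    keyed = [
        (rng.random() ** (1.0 / w), rec)
        for rec, w in weights.items()
        if w > 0
    ]
    keyed.sort(reverse=True)
    return [rec for _, rec in keyed[:size]]


# Hypothetical inclusion probabilities of four records in two partitions.
membership = {
    "r1": [0.8, 0.2],
    "r2": [0.6, 0.4],
    "r3": [0.3, 0.7],
    "r4": [0.0, 1.0],
}

rng = random.Random(42)  # fixed seed so the run is repeatable
for partition in range(2):
    weights = {rec: probs[partition] for rec, probs in membership.items()}
    print(partition, weighted_sample(weights, 2, rng))
```

A record such as `r2`, which has a non-zero weight in both partitions, may be missed when the first partition is sampled and still be picked up in the second.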
Partitioning Around Medoids (Kaufman and Rousseeuw, 1990) is a popular partitioning algorithm in which the k-partitions method is used to choose the central (representative) element of each partition, whereupon the partitions’ limits are established by neighbourhood affinity. The fuzziness introduced in the algorithm comes from making the probability of inclusion in a partition depend on the Euclidean distance between elements, taking into account not only the nearest neighbour but also the representatives of the other partitions.
2.1 K-partitions Method<br />
For a given population, a record is fully characterized whenever its value is known for all p attributes. Let $x_{it}$ represent the value of record i in attribute t, $1 \le t \le p$. The Euclidean distance d(i,j) between two records, i and j, is given by²:
$$d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + \dots + (x_{ip} - x_{jp})^2}$$
¹ Critical record in terms of data quality.
² d(i,j) = d(j,i)
RELATIONAL SAMPLING FOR DATA QUALITY AUDITING AND DECISION SUPPORT 83
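The Euclidean distance d(i,j) defined in this section can be sketched in a few lines. The function name `euclidean_distance` and the representation of a record as a list of numeric attribute values are assumptions made for the example.

```python
import math


def euclidean_distance(record_i, record_j):
    """Euclidean distance d(i, j) between two records with p numeric attributes."""
    if len(record_i) != len(record_j):
        raise ValueError("records must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(record_i, record_j)))


# Two records with p = 2 attributes forming a 3-4-5 right triangle.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Swapping the two arguments returns the same value, in line with d(i,j) = d(j,i).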
If necessary, some or all of the attributes must be normalized so that attributes with larger domains do not weigh more heavily on the cluster definition. The common treatment is to compute the mean value of each attribute and its standard deviation (or, alternatively, its mean absolute deviation) and to rescale the attribute’s values accordingly.
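A minimal sketch of this normalization for a single attribute follows. The function name `standardize`, the use of the population standard deviation, and the handling of constant attributes are assumptions, since the text does not fix these details.

```python
import statistics


def standardize(values, use_mean_abs_dev=False):
    """Rescale one attribute to zero mean and unit spread.

    The spread is the population standard deviation by default, or the
    mean absolute deviation when `use_mean_abs_dev` is True.
    """
    mean = statistics.fmean(values)
    if use_mean_abs_dev:
        spread = sum(abs(v - mean) for v in values) / len(values)
    else:
        spread = statistics.pstdev(values)
    if spread == 0:
        # Constant attribute: it cannot separate records, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


# A hypothetical attribute with a much wider domain than the others.
print(standardize([1000.0, 2000.0, 3000.0, 4000.0, 5000.0]))
```

After this step every attribute contributes to d(i,j) on a comparable scale, whatever its original units.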
The k-partitions method takes as the first partition’s representative the element that minimizes the sum of the Euclidean distances to all the elements in the population. The other representatives are selected according to the following steps:
1. Consider an element i not yet selected as a partition’s representative.
2. Consider an element j not yet selected and denote by $D_j$ its distance to the nearest representative already selected. As mentioned above, d(j,i) denotes its distance to element i.
3. If j is closer to i than to its nearest representative, then j contributes to the possible selection of i as a representative. The contribution of j to the selection of i is expressed by the following gain function:
$$C_{ji} = \max(D_j - d(j,i),\, 0)$$
4. The selection potential of element i as a representative is then given by:
$$\sum_j C_{ji}$$
5. Element i is selected as a representative if it maximizes the selection potential:
$$\max_i \sum_j C_{ji}$$
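The selection of representatives described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the system: the function names are hypothetical, records are lists of numeric attributes, j ranges over all elements not yet selected (including i itself), and ties are broken in favour of the lowest index.

```python
import math


def distance(a, b):
    """Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_representatives(records, k):
    """Return the indices of k representatives chosen by the greedy steps."""
    n = len(records)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of records")
    d = [[distance(records[i], records[j]) for j in range(n)] for i in range(n)]

    # First representative: minimizes the sum of distances to all elements.
    first = min(range(n), key=lambda i: sum(d[i]))
    representatives = [first]
    # nearest[j] plays the role of D_j: the distance from j to its
    # nearest representative selected so far.
    nearest = list(d[first])

    while len(representatives) < k:
        candidates = [i for i in range(n) if i not in representatives]

        def potential(i):
            # Sum over j of C_ji = max(D_j - d(j, i), 0).
            return sum(max(nearest[j] - d[j][i], 0.0) for j in candidates)

        best = max(candidates, key=potential)
        representatives.append(best)
        nearest = [min(nearest[j], d[j][best]) for j in range(n)]
    return representatives


# Seven one-attribute records forming two groups: 0..3 and 10..12.
records = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
print(select_representatives(records, 2))  # [3, 5]
```

In this example the first representative is the record with value 3, which has the smallest total distance (30) to all records, and the second is the record with value 11, whose selection potential (22) exceeds that of every other candidate.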
2.2 Fuzzy Clustering Probabilities<br />
After the partitions’ representatives have been defined, a probability of inclusion in each of the k partitions can be set for every element, based on the Euclidean distance between elements.
A representativeness factor fr is set according to the relevance of an element in the context of each cluster. Empirical tests indicate a value of 0.9 for this factor when dealing with a partition’s representative and 0.7 for all other elements (Cortes, 2002). The algorithm’s definition is described below³:
³ For a full description, see Cortes (2002).
C<br />
ji