TESI DOCTORAL - La Salle
Chapter 2. Cluster ensembles and consensus clustering

2.1 Related work on cluster ensembles
Our aim in this section is to review the strategies applied in the literature for the construction of cluster ensembles, given their influence on the results of the consensus clustering process. Two alternative approaches have traditionally been followed in this context, differing in the number of distinct clustering algorithms used for generating the individual partitions in the ensemble.
The first cluster ensemble creation strategy consists of compiling the outcomes of multiple runs of a single clustering algorithm, which gives rise to what is known as a homogeneous cluster ensemble (Hadjitodorov, Kuncheva, and Todorova, 2006). In this case, the diversity of the ensemble components can be induced by several means, often in a combined manner:

– application of a stochastic clustering algorithm: this strategy relies on the fact that the outcome of a stochastic clustering algorithm depends on how its parameters are adjusted. For instance, diverse clustering solutions can be obtained by the random initialization of the starting centroids of k-means (Fred, 2001; Fred and Jain, 2002a; Fred and Jain, 2003; Dimitriadou, Weingessel, and Hornik, 2001; Greene et al., 2004; Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006; Kuncheva, Hadjitodorov, and Todorova, 2006; Li, Ding, and Jordan, 2007; Nguyen and Caruana, 2007; Ayad and Kamel, 2008; Fern and Lin, 2008) or fuzzy c-means (Dimitriadou, Weingessel, and Hornik, 2002), or the initial settings of EM clustering (Punera and Ghosh, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b).
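The first strategy above can be sketched as follows: the same algorithm is run several times, and only the random seed governing centroid initialization changes between runs. The minimal one-dimensional k-means below is an illustrative toy, not any of the cited implementations, and the data set and number of runs are arbitrary choices.

```python
import random

def kmeans(points, k, seed, iters=20):
    """Minimal 1-D k-means with randomly initialized centroids (a sketch
    for illustration only, not a cited implementation)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(p - centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels

# Toy data set with two well-separated groups.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]

# Homogeneous ensemble: one algorithm, several randomly seeded runs.
ensemble = [kmeans(data, k=2, seed=s) for s in range(4)]
```

Each element of `ensemble` is one individual partition; the label vectors may permute cluster identifiers across runs, which is precisely the label correspondence problem consensus functions must handle.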
– random number of clusters: in this case, at each run of the clustering algorithm, the number of clusters to be found is set randomly (Fred and Jain, 2002b; Fred and Jain, 2005; Topchy, Jain, and Punch, 2004; Kuncheva, Hadjitodorov, and Todorova, 2006; Hadjitodorov and Kuncheva, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b; Ayad and Kamel, 2008). In general terms, this number of clusters is usually set to be much larger than the expected number of categories in the data set (Dimitriadou, Weingessel, and Hornik, 2001; Fred and Jain, 2002a), and it is often selected at random from a predefined interval (Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006).
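The second strategy amounts to drawing, for each run, the number of clusters from a predefined interval whose lower bound is at least the expected number of categories. The interval bounds and the multiplicative factor below are illustrative assumptions, not values taken from the cited papers.

```python
import random

def random_k(expected_k, rng, factor=3):
    """Draw the number of clusters for one ensemble run from a predefined
    interval [expected_k, factor * expected_k]; the factor is an
    illustrative choice, not taken from any cited paper."""
    return rng.randint(expected_k, factor * expected_k)

rng = random.Random(0)
# One random k per clustering run, each at or above the expected number
# of categories (here assumed to be 4).
ks = [random_k(4, rng) for _ in range(10)]
```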
– distinct object representations: another source of diversity lies in the way objects are represented. Indeed, as we showed in section 1.4, running the same clustering algorithm on distinct representations of the same data set often leads to markedly diverse clustering solutions. Exploiting this fact, cluster ensembles have been created by running a single clustering algorithm on different data representations generated by random feature selection (Agogino and Tumer, 2006; Hadjitodorov and Kuncheva, 2007; Fern and Lin, 2008), random feature extraction (Greene et al., 2004; Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006; Hadjitodorov and Kuncheva, 2007; Fern and Lin, 2008) or deterministic feature extraction (Sevillano et al., 2006a; Sevillano et al., 2006b; Sevillano et al., 2007a; Sevillano, Alías, and Socoró, 2007b).
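In the random feature selection variant, each ensemble component is obtained by clustering the data set restricted to a randomly drawn subset of its features. The sketch below only generates the subsets; the feature count, subset size, and number of runs are illustrative assumptions.

```python
import random

def random_feature_subsets(n_features, subset_size, n_runs, seed=0):
    """Draw one random feature subset per clustering run (a sketch of
    random feature selection; all sizes are illustrative choices)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_runs)]

# Each ensemble component would then be produced by running the base
# clustering algorithm on the data set restricted to one subset.
subsets = random_feature_subsets(n_features=20, subset_size=8, n_runs=5)
```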
– data subsampling: the creation of multiple clustering solutions upon distinct random subsamples of the data set has been applied as a means for generating diverse cluster ensembles (Fischer and Buhmann, 2003; Dudoit and Fridlyand, 2003; Minaei-Bidgoli, Topchy, and Punch, 2004; Kuncheva, Hadjitodorov, and Todorova, 2006; Punera and Ghosh, 2007).
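The subsampling strategy can be sketched as drawing, for each ensemble component, a random portion of the object set to be clustered. The sampling fraction and the number of components below are illustrative assumptions; the cited works differ in these choices and in whether sampling is done with or without replacement.

```python
import random

def subsample(objects, fraction, rng):
    """Draw a random subsample (without replacement) of the object set;
    the sampling fraction is an illustrative choice."""
    size = int(len(objects) * fraction)
    return rng.sample(objects, size)

rng = random.Random(0)
objects = list(range(100))  # object indices of the data set
# One subsample per ensemble component: each run clusters a different
# random portion of the data, inducing diversity among the partitions.
samples = [subsample(objects, 0.8, rng) for _ in range(5)]
```

Note that, unlike the previous strategies, each resulting partition only labels the objects present in its subsample, so the consensus function must cope with partial partitions.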