Reviews in Computational Chemistry Volume 18
Clustering Algorithms
are equal. Assignment of each compound to the closest cluster centroid is the expectation step; recalculation of the cluster centroids (the model parameters) after assignment is the maximization step.
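As a sketch, the expectation/maximization view of k-means described above might look as follows in Python; the function name, parameters, and initialization strategy are illustrative choices, not prescribed by the text:

```python
import numpy as np

def kmeans(data, k, n_iter=20, seed=0):
    """Minimal k-means sketch: the assignment pass is the expectation
    step, the centroid recalculation the maximization step."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Expectation step: assign each compound (row) to its closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Maximization step: recalculate each centroid (model parameter)
        # as the mean of the compounds assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids
```

On two well-separated groups of points, the assignments stabilize after a few expectation/maximization passes.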
Topographic
Topographic clustering methods attempt to preserve the proximities between clusters, thus facilitating visualization of the clustering results. For k-means clustering the cost function is invariant to a relabeling of the clusters, whereas in topographic clustering it is not: a predefined neighborhood is imposed on the clusters to preserve the proximities between them. The Kohonen, or self-organizing, map,37,38 apart from being one of the most commonly used types of neural network, is also a topographic clustering method. A Kohonen network uses an unsupervised learning technique to map the higher-dimensional space of a data set down to, typically, two or three dimensions (2D or 3D), so that clusters can be identified from the neurons' coordinates (topological positions); the values of the output are ignored. Initially, the neurons are assigned weight vectors with random values (weights). During the self-organization process, the weight vector of the neuron most similar to each data vector, together with the weight vectors of its immediately adjacent neurons, is updated iteratively to move it closer to that data vector. The Kohonen mapping thus proceeds as follows:
1. Initialize each neuron's weight vector with random values.
2. Assign the next data vector to the neuron having the most similar weight vector.
3. Update the weight vector of the neuron of step 2 to bring it closer to the data vector.
4. Update neighboring weight vectors using a given updating function.
5. Repeat steps 2–4 until all data vectors have been processed.
6. Start again with the first data vector, and repeat steps 2–5 for a given number of cycles.
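The six steps above can be sketched in Python as shown below. The grid size, the decaying learning-rate and radius schedules, and the Gaussian neighborhood function are illustrative assumptions; the text only requires that some updating function adjust the winner and its neighbors:

```python
import numpy as np

def kohonen_map(data, grid=(4, 4), n_cycles=30, lr=0.5, seed=0):
    """Illustrative Kohonen (self-organizing) map on a 2D neuron grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Step 1: initialize each neuron's weight vector with random values.
    weights = rng.random((rows * cols, data.shape[1]))
    # Topological (grid) coordinates of every neuron.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for cycle in range(n_cycles):      # step 6: repeat for a given number of cycles
        # Shrink the learning rate and neighborhood radius as training proceeds
        # (an assumed schedule; many variants exist).
        alpha = lr * (1 - cycle / n_cycles)
        radius = max(rows, cols) / 2 * (1 - cycle / n_cycles) + 0.5
        for x in data:                 # step 5: process every data vector
            # Step 2: find the neuron with the most similar weight vector.
            winner = np.linalg.norm(weights - x, axis=1).argmin()
            # Steps 3-4: pull the winner and its grid neighbors toward x,
            # weighted by a Gaussian of grid distance (the updating function).
            grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
            h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights += alpha * h[:, None] * (x - weights)
    return weights, coords
```

After training, each data vector's cluster is read off from the grid coordinates of its best-matching neuron; well-separated data groups end up at different neurons.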
The iterative adjustment of weight vectors is similar to the iterative refinement of k-means clustering to derive cluster centroids. The main difference is that the adjustment affects neighboring weight vectors at the same time. Kohonen mapping requires O(Nmn) time and O(N) space, where m is the number of cycles and n the number of neurons.
Other Nonhierarchical Methods<br />
We have delineated above the main categories of clustering methods applicable to chemical problems, and we have provided one basic algorithm as an example of each. Researchers in other disciplines sometimes use variants of these main categories. Categories used by those researchers but omitted here include density-based and graph-based clustering techniques; these will be mentioned briefly in the next section.