Reviews in Computational Chemistry Volume 18

Clustering Algorithms

are equal. Assignment of each compound to the closest cluster centroid is the expectation step; recalculation of the cluster centroids (the model parameters) after assignment is the maximization step.
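As a concrete illustration of these two steps, the following minimal NumPy sketch alternates the assignment (expectation) and centroid-recalculation (maximization) steps; the function name kmeans_em, its parameters, and the handling of empty clusters are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

def kmeans_em(data, k, n_iter=100, seed=0):
    """Minimal k-means loop phrased as alternating E and M steps (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen compounds as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Expectation step: assign each compound to its closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Maximization step: recompute each centroid (the model parameters)
        # as the mean of the compounds assigned to it; an empty cluster
        # simply keeps its previous centroid.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids
```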

Topographic

Topographic clustering methods attempt to preserve the proximities between clusters, thus facilitating visualization of the clustering results. For k-means clustering, the cost function is invariant to how the clusters are arranged relative to one another, whereas in topographic clustering it is not: a predefined neighborhood is imposed on the clusters to preserve the proximities between them. The Kohonen, or self-organizing, map,37,38 apart from being one of the most commonly used types of neural network, is also a topographic clustering method. A Kohonen network uses an unsupervised learning technique to map the higher-dimensional space of a data set down to, typically, two or three dimensions (2D or 3D), so that clusters can be identified from the neurons' coordinates (topological positions); the values of the output are ignored. Initially, the neurons are assigned weight vectors with random values (weights). During the self-organization process, the weight vector of the neuron most similar to each data vector, together with those of its immediately adjacent neurons, is updated iteratively to place them closer to the data vector. The Kohonen mapping thus proceeds as follows (a minimal code sketch follows the list):

1. Initialize each neuron's weight vector with random values.
2. Assign the next data vector to the neuron having the most similar weight vector.
3. Update the weight vector of the neuron of step 2 to bring it closer to the data vector.
4. Update neighboring weight vectors using a given updating function.
5. Repeat steps 2–4 until all data vectors have been processed.
6. Start again with the first data vector, and repeat steps 2–5 for a given number of cycles.
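The sketch below follows steps 1–6 in NumPy. The Gaussian neighborhood and the linearly decaying learning rate and neighborhood width stand in for the unspecified "updating function", and all names, the grid size, and the parameter values are illustrative assumptions.

```python
import numpy as np

def train_som(data, grid_shape=(10, 10), n_cycles=50, lr=0.5, sigma=2.0, seed=0):
    """Minimal self-organizing map following steps 1-6 above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    n_neurons, dim = rows * cols, data.shape[1]
    # Step 1: initialize each neuron's weight vector with random values.
    weights = rng.random((n_neurons, dim))
    # 2D topological coordinates of each neuron on the map grid.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

    for cycle in range(n_cycles):          # step 6: repeat for a given number of cycles
        frac = cycle / n_cycles
        lr_t = lr * (1 - frac)              # assumed decay schedules, not from the text
        sigma_t = sigma * (1 - frac) + 0.5
        for x in data:                      # step 5: process all data vectors
            # Step 2: find the neuron with the most similar weight vector.
            winner = np.linalg.norm(weights - x, axis=1).argmin()
            # Steps 3-4: pull the winner and its neighbors toward the data
            # vector, weighted by their distance from the winner on the map.
            grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
            influence = np.exp(-(grid_dist ** 2) / (2 * sigma_t ** 2))
            weights += lr_t * influence[:, None] * (x - weights)

    return weights.reshape(rows, cols, dim), coords.reshape(rows, cols, 2)
```

After training, each data vector can be assigned to its best-matching neuron, and clusters read off from the neurons' grid coordinates, as described above.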

The iterative adjustment of weight vectors is similar to the iterative refinement of k-means clustering to derive cluster centroids. The main difference is that the adjustment affects neighboring weight vectors at the same time. Kohonen mapping requires O(Nmn) time and O(N) space, where m is the number of cycles and n the number of neurons.

Other Nonhierarchical Methods

We have delineated above the main categories of clustering methods applicable to chemical problems, and we have provided one basic algorithm as an example of each. Researchers in other disciplines sometimes use variants of these main categories; the categories used by those researchers but omitted here include density-based and graph-based clustering techniques, which will be mentioned briefly in the next section.
