10.11.2016 Views

Learning Data Mining with Python


Clustering News Articles

In graph theory, a spanning tree is a set of edges on a graph that connects all of the nodes together without forming a cycle. The Minimum Spanning Tree (MST) is simply the spanning tree with the lowest total weight. For our application, the nodes in our graph are samples from our dataset, and the edge weights are the number of times those two samples were clustered together, that is, the value from our coassociation matrix.

In the following figure, an MST on a graph of six nodes is shown. A node can be the endpoint of more than one edge in the MST. The criteria for a spanning tree are that all nodes are connected together and that the edges form no cycles.
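As a rough illustration of these properties, the following sketch builds a small six-node graph and checks the resulting tree. The edge weights here are made up for the example and are not the graph from the figure; the only SciPy calls assumed are minimum_spanning_tree and connected_components from scipy.sparse.csgraph.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

# Hypothetical six-node weighted graph, stored as an upper-triangular matrix.
# A zero entry means "no edge" between the two nodes.
weights = np.array([
    [0, 1, 4, 0, 0, 0],
    [0, 0, 2, 7, 0, 0],
    [0, 0, 0, 3, 8, 0],
    [0, 0, 0, 0, 5, 6],
    [0, 0, 0, 0, 0, 9],
    [0, 0, 0, 0, 0, 0],
], dtype=float)
graph = csr_matrix(weights)

# The result is a sparse matrix holding only the edges kept in the tree.
tree = minimum_spanning_tree(graph)

# A spanning tree on n nodes has exactly n - 1 edges.
n_edges = tree.nnz

# All nodes end up in a single connected component.
n_components, labels = connected_components(tree, directed=False)

# Count how many tree edges touch each node; a node may appear in several.
rows, cols = tree.nonzero()
degree = np.bincount(np.concatenate([rows, cols]), minlength=graph.shape[0])

print(n_edges, n_components, degree.tolist(), tree.sum())
```

In this made-up graph, node 3 is an endpoint of three tree edges, which is what it means for a node to be used more than once.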

To compute the MST, we use SciPy's minimum_spanning_tree function, which is found in the scipy.sparse.csgraph module:

from scipy.sparse.csgraph import minimum_spanning_tree

The minimum_spanning_tree function can be called directly on the sparse matrix returned by our coassociation function:

mst = minimum_spanning_tree(C)

However, in our coassociation matrix C, higher values indicate samples that are clustered together more often, which makes each entry a similarity value. In contrast, minimum_spanning_tree treats the input as a distance, with higher scores penalized. For this reason, we compute the minimum spanning tree on the negation of the coassociation matrix instead:

mst = minimum_spanning_tree(-C)
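To see the effect of the negation, here is a minimal sketch on a toy matrix. The four-sample matrix C below is hypothetical and stands in for the output of our coassociation function, and edge_set is a small helper introduced only for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical coassociation counts for four samples (upper triangle only).
# C[i, j] is the number of times samples i and j were clustered together.
C = csr_matrix(np.array([
    [0, 5, 1, 0],
    [0, 0, 2, 3],
    [0, 0, 0, 4],
    [0, 0, 0, 0],
], dtype=float))


def edge_set(tree):
    """Return the (row, column) pairs of the edges kept in a sparse tree."""
    rows, cols = tree.nonzero()
    return sorted(zip(rows.tolist(), cols.tolist()))


# Without negation, the tree keeps the pairs that were rarely clustered together.
weak_links = edge_set(minimum_spanning_tree(C))

# With negation, the tree keeps the pairs that were clustered together most often.
strong_mst = minimum_spanning_tree(-C)
strong_links = edge_set(strong_mst)

# The stored weights are negative; flip the sign to recover the counts.
total_strength = -strong_mst.sum()

print(weak_links)
print(strong_links, total_strength)
```

Note that the weights stored in the negated tree are themselves negative, so any later thresholding needs to account for the sign.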
