
Chapter 3

How to do it

More robust than edit distance is the so-called bag-of-words approach. It uses simple word counts as its basis. For each word in the post, its occurrence is counted and noted in a vector. Not surprisingly, this step is also called vectorization. The vector is typically huge, as it contains as many elements as there are words occurring in the whole dataset. Take, for instance, two example posts with the following word counts:

Word        Occurrences in Post 1   Occurrences in Post 2
disk                1                       1
format              1                       1
how                 1                       0
hard                1                       1
my                  1                       0
problems            0                       1
to                  1                       0
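A minimal sketch of this vectorization step in Python, assuming two posts reconstructed from the counts in the table above ("how to format my hard disk" and "hard disk format problems"); the alphabetical vocabulary ordering is chosen here purely for readability:

```python
from collections import Counter

post_1 = "how to format my hard disk"
post_2 = "hard disk format problems"

# Count word occurrences per post.
counts_1 = Counter(post_1.split())
counts_2 = Counter(post_2.split())

# The shared vocabulary spans all words seen in the dataset.
vocabulary = sorted(set(counts_1) | set(counts_2))

# One count vector per post, with one element per vocabulary word
# (Counter returns 0 for words that do not occur in a post).
vector_1 = [counts_1[word] for word in vocabulary]
vector_2 = [counts_2[word] for word in vocabulary]

print(vocabulary)  # ['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']
print(vector_1)    # [1, 1, 1, 1, 1, 0, 1]
print(vector_2)    # [1, 1, 1, 0, 0, 1, 0]
```

The two printed vectors correspond to the Post 1 and Post 2 columns of the table; on a real dataset the vocabulary, and therefore each vector, would be much larger and mostly zeros.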

The columns Post 1 and Post 2 can now be treated as simple vectors. We could simply calculate the Euclidean distance between the vectors of all posts and take the nearest one, but that is too slow, as we have just found out (a quick sketch of this distance calculation follows the list below). Instead, we will use them later as feature vectors in the following clustering steps:

1. Extract the salient features from each post and store them as a vector per post.

2. Compute clustering on the vectors.

3. Determine the cluster for the post in question.

4. From this cluster, fetch a handful of posts that are different from the post in question. This will increase diversity.
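For illustration only, here is how the Euclidean distance between the two count vectors from the table could be computed with NumPy (a sketch, not the book's own helper code; the ordering of the elements does not affect the result):

```python
import numpy as np

# Count vectors for the two example posts, in the table's row order:
# [disk, format, how, hard, my, problems, to]
post_1_vec = np.array([1, 1, 1, 1, 1, 0, 1])
post_2_vec = np.array([1, 1, 0, 1, 0, 1, 0])

# Euclidean distance between the raw count vectors.
distance = np.linalg.norm(post_1_vec - post_2_vec)
print(distance)  # 2.0
```

Repeating this comparison against every post in the dataset for each query is exactly the brute-force nearest-neighbor search that proved too slow, which is why the clustering steps above are used instead.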

However, there is some more work to be done before we get there, and before we can do that work, we need some data to work on.

Preprocessing – similarity measured as similar number of common words

As we have seen previously, the bag-of-words approach is both fast and robust. However, it is not without challenges. Let's dive directly into them.

