29.06.2013 Views

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...

Improving Twitter Search by Removing Spam and Ranking Results

Steven Connolly & Colm O’Riordan

CIRG, Information Technology, NUIG

s.connolly13@nuigalway.ie, colm.oriordan@nuigalway.ie

Abstract

With the ability to post real-time status updates, Twitter is an excellent source of first-hand news. Twitter search is widely used for retrieving real-time information. However, because spam is so common on Twitter, search results often contain multiple spam Tweets. With this in mind, we aim to identify and remove spam and to rank the remaining Tweets so that high-quality content and novelty are promoted in the answer set. We analyse Twitter data to identify a set of features for detecting spam, and we aim to improve the quality of a search result set by ranking Tweets according to a number of identified heuristics.

Introduction

Twitter is a real-time information network. Since its launch in July 2006, Twitter has acquired 175 million registered users, who post 95 million Tweets and perform 900,000 search queries per day. Twitter posts, or Tweets, are short text messages of at most 140 characters.

Twitter Spam

Spam is the electronic transmission of messages, in large volume, to people who do not choose to receive them. Spam accounts exhibit several behaviors, including posting harmful links to malware and phishing web sites, repeatedly posting duplicate Tweets, posting misleading links, and following and un-following accounts to draw attention [1]. Twitter provides limited spam removal. Spam accounts can be reported to Twitter; reported accounts are reviewed and investigated for abuse, and any account showing spamming behavior is permanently suspended. However, this method does not prevent spam Tweets from appearing in search results, and spam is getting worse as Twitter becomes more widely used [2].

Approach

Our research focuses on correctly identifying spam Tweets. Our approach begins with gathering a data set of Twitter records. A sample of 11,000 records was gathered, based on various keywords, using the Twitter Streaming API. Each record consists of information about the Tweet along with the user account details needed for our research.

Heuristics

Based on the literature and our own preliminary analysis, we identified a set of heuristics which we use for spam detection and data ranking. These include Follower Count; Account Age; Hash Tag Count; URL Count, the number of URLs present in a Tweet; Status Count, the number of Tweets posted from an account; List Count, the number of lists or groups an account belongs to; and Duplicate Tweets, the duplicate records present in a record set.
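As an illustration of how the per-record heuristics listed above could be derived from a raw record, the following is a minimal sketch. It assumes records shaped like the classic Twitter JSON payload (fields such as `user.followers_count`, `user.statuses_count`, `user.listed_count`, `entities.hashtags` and `entities.urls`); the function name `extract_heuristics` and the output keys are hypothetical and are not taken from the paper.

```python
from datetime import datetime

# Timestamp layout of the classic Twitter JSON payload (assumed for this sketch).
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_heuristics(record: dict) -> dict:
    """Map one raw Tweet record to the per-record heuristic values."""
    user = record["user"]
    entities = record.get("entities", {})
    posted = datetime.strptime(record["created_at"], TWITTER_TIME_FORMAT)
    joined = datetime.strptime(user["created_at"], TWITTER_TIME_FORMAT)
    return {
        "follower_count": user["followers_count"],
        # Account age is measured at the moment the Tweet was posted.
        "account_age_days": (posted - joined).total_seconds() / 86400,
        "hashtag_count": len(entities.get("hashtags", [])),
        "url_count": len(entities.get("urls", [])),
        "status_count": user["statuses_count"],
        "list_count": user["listed_count"],
    }
```

Duplicate Tweets is left out of this sketch because it is a property of the record set rather than of a single record.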

Following analysis of these heuristics, we identified the techniques that proved most successful for spam detection. Some of our results are shown below:

Heuristic          Results   Spam   % Spam
0 Followers        93        65     65%
Acc Age < 1 day    145       55     63%
Hash Tags > 8      23        22     95%

Initial analysis showed that individual heuristics can be used to detect spam. Further analysis showed that combinations of particular heuristics detected spam more accurately; for example, accounts created within the previous 24 hours that have 40 friends returned only spam Tweets.
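The single and combined rules described above can be sketched as simple predicates. This is only an illustration: the thresholds for followers, account age and hash tags come from the results table, while the class `TweetFeatures`, the function names, and the reading of "40 friends" as "at least 40 friends" are assumptions of this sketch rather than details stated in the paper.

```python
from dataclasses import dataclass


@dataclass
class TweetFeatures:
    """Hypothetical container for the values the rules below need."""
    follower_count: int
    friend_count: int
    account_age_days: float
    hashtag_count: int


def single_heuristic_flags(t: TweetFeatures) -> dict:
    """Evaluate each single-heuristic rule from the results table."""
    return {
        "zero_followers": t.follower_count == 0,
        "new_account": t.account_age_days < 1,
        "many_hashtags": t.hashtag_count > 8,
    }


def combined_rule(t: TweetFeatures, friend_threshold: int = 40) -> bool:
    """Account under 24 hours old AND at least `friend_threshold` friends.

    Treating "40 friends" as a lower bound is an assumption of this sketch.
    """
    return t.account_age_days < 1 and t.friend_count >= friend_threshold
```

Keeping each rule as a separate flag makes it possible to count how many records every heuristic matches, and how many of those are spam, before rules are combined.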

Analysis confirmed that one of the main characteristics of Twitter spam is repetition. Spam accounts repeatedly post the same Tweet in an effort to be seen. We used the Levenshtein distance to measure the similarity between Tweets and removed very similar Tweets. The results show that 20% of our dataset consists of duplicate or near-duplicate Tweets.
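A minimal sketch of near-duplicate removal with the Levenshtein distance follows. The paper does not say how "very similar" was defined, so the normalised similarity score and the 0.9 threshold are assumptions made for illustration, as are the function names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def remove_near_duplicates(tweets, threshold=0.9):
    """Keep the first Tweet of every group of very similar Tweets."""
    kept = []
    for text in tweets:
        if all(similarity(text, seen) < threshold for seen in kept):
            kept.append(text)
    return kept
```

Comparing every Tweet with every retained Tweet is quadratic in the worst case, which is workable for a sample of about 11,000 records but would need blocking or hashing on larger collections.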

Future Work

To date we have been concerned with identifying heuristics that give high precision. We are now working towards measuring the recall of these heuristics.
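For reference, precision and recall for a spam heuristic can be computed from the set of Tweets it flags and the set of Tweets known to be spam. The sketch below assumes a hand-labelled set of spam Tweet identifiers is available; the function name and the identifiers in the usage example are hypothetical.

```python
def precision_recall(predicted_spam: set, actual_spam: set) -> tuple:
    """Return (precision, recall) for one heuristic.

    precision = flagged Tweets that really are spam / all flagged Tweets
    recall    = flagged Tweets that really are spam / all spam Tweets
    """
    true_positives = len(predicted_spam & actual_spam)
    precision = true_positives / len(predicted_spam) if predicted_spam else 0.0
    recall = true_positives / len(actual_spam) if actual_spam else 0.0
    return precision, recall


# Usage with made-up Tweet ids: 4 flagged, 6 truly spam, 3 in common.
flagged = {1, 2, 3, 4}
labelled_spam = {2, 3, 4, 5, 6, 7}
print(precision_recall(flagged, labelled_spam))  # (0.75, 0.5)
```

A heuristic can score high precision while missing most spam, which is why recall has to be measured separately.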

Following this, we aim to develop a ranking algorithm that assigns a score to each remaining Tweet and ranks the Tweets accurately according to time and query relevance.

References

[1] Ryan Kelly, Aug 2009, Twitter Study, Pear Analytics.

[2] Danny Sullivan, Jun 2009, Twitter’s Real Time Spam Problem.
