NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
Improving Twitter Search by Removing Spam and Ranking Results<br />
Steven Connolly & Colm O’Riordan,<br />
CIRG, Information Technology, NUIG,<br />
s.connolly13@nuigalway.ie, colm.oriordan@nuigalway.ie<br />
Abstract<br />
With the ability to post real-time status updates, Twitter<br />
is an excellent source of first hand news. Twitter search<br />
is widely used for retrieving real-time information.<br />
However, with such high spam content on Twitter,<br />
search results often return multiple spam Tweets. With<br />
this in mind we aim to identify and remove spam and to<br />
rank remaining Tweets promoting high Tweet content<br />
and novelty in the answer set. We analyse Twitter data<br />
to identify a set of features to detect spam. We aim to<br />
improve the quality of a search result set by ranking<br />
Tweets based on a number of identified heuristics.<br />
Introduction<br />
Twitter is a real-time information network. Since its<br />
launch in July 2006, Twitter has acquired 175 million<br />
registered users posting 95 million tweets and<br />
performing 900,000 search queries per day. Twitter<br />
posts, or Tweets, are short text messages with a maximum<br />
length of 140 characters.<br />
Twitter Spam<br />
Spam is the electronic transmission of messages, in<br />
large volume, to people who do not choose to receive<br />
them. Spam accounts exhibit several behaviors, including<br />
posting harmful links to malware and phishing web<br />
sites, repeatedly posting duplicate Tweets, posting<br />
misleading links, and following and unfollowing accounts<br />
to draw attention [1]. Twitter provides limited spam<br />
removal. Spam accounts can be reported to Twitter.<br />
These accounts are reviewed and investigated for abuse<br />
and any account showing spamming behavior is<br />
permanently suspended. However, this method does not<br />
prevent spam Tweets from appearing in search results, and<br />
spam is getting worse as Twitter becomes more<br />
prevalent [2].<br />
Approach<br />
Our research focuses on correctly identifying spam<br />
Tweets. Our approach involves gathering a data set of<br />
Twitter records. A sample of 11,000 records was<br />
gathered based on various keywords. The Twitter<br />
Streaming API was used to gather the data. Each record<br />
consisted of information about the Tweet along with the<br />
user account details necessary for our research.<br />
Heuristics<br />
Based on literature and our own preliminary<br />
analysis, we identified a set of heuristics which we use<br />
for spam detection and data ranking. Some of these<br />
are Follower Count, Account Age, Hash Tag Count,<br />
URL Count (the number of URLs present in a Tweet),<br />
Status Count (the number of Tweets posted from an<br />
account), List Count (the number of lists/groups an<br />
account belongs to) and Duplicate Tweets (duplicate<br />
records present in a record set).<br />
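The two text-derived heuristics, Hash Tag Count and URL Count, could be computed along the following lines. This is only a minimal sketch: the function name `text_features` and the regular expressions are illustrative assumptions, not the extraction method used in this work, and the patterns are deliberately simple (for example, a `#fragment` inside a URL would be counted as a hash tag).

```python
import re

# Assumed, simplified patterns; real Tweets may need more careful tokenising.
HASH_TAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")


def text_features(tweet_text: str) -> dict:
    """Count hash tags and URLs in the text of a single Tweet."""
    return {
        "hash_tag_count": len(HASH_TAG_RE.findall(tweet_text)),
        "url_count": len(URL_RE.findall(tweet_text)),
    }


if __name__ == "__main__":
    sample = "Win a #free #phone now http://a.example/x http://b.example"
    print(text_features(sample))
```

Account-level heuristics such as Follower Count, Status Count and List Count would come from the user account details stored with each record rather than from the Tweet text.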
Following analysis of these heuristics we were able<br />
to identify the techniques proving most successful for<br />
spam detection. Some of our results can be seen below:<br />
Heuristic | Results | Spam | % Spam<br />
0 Followers | 93 | 65 | 65%<br />
Acc Age < 1 day | 145 | 55 | 63%<br />
Hash Tags > 8 | 23 | 22 | 95%<br />
Initial analysis showed that individual heuristics can be<br />
used to detect spam. Further analysis showed that<br />
combinations of particular heuristics detected spam more<br />
accurately; for example, accounts created within the previous<br />
24 hours with 40 friends returned only spam Tweets.<br />
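One way to combine individual heuristics is to raise a flag per heuristic and treat a record as probable spam when several flags fire together. The sketch below is an assumption-laden illustration: the `Record` fields, the flag names and the `min_flags` combination rule are hypothetical, and only the three thresholds (0 followers, account age under 1 day, more than 8 hash tags) are taken from the results reported here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical subset of the fields stored for each Tweet record."""
    follower_count: int
    account_age_days: float
    hash_tag_count: int


def spam_flags(record: Record) -> List[str]:
    """Return the names of the individual heuristics that fire."""
    flags = []
    if record.follower_count == 0:
        flags.append("zero_followers")
    if record.account_age_days < 1:
        flags.append("new_account")
    if record.hash_tag_count > 8:
        flags.append("many_hash_tags")
    return flags


def is_probable_spam(record: Record, min_flags: int = 2) -> bool:
    """Assumed combination rule: at least `min_flags` heuristics must fire."""
    return len(spam_flags(record)) >= min_flags


if __name__ == "__main__":
    suspicious = Record(follower_count=0, account_age_days=0.5, hash_tag_count=10)
    ordinary = Record(follower_count=500, account_age_days=400, hash_tag_count=1)
    print(spam_flags(suspicious), is_probable_spam(suspicious))
    print(spam_flags(ordinary), is_probable_spam(ordinary))
```

Requiring more than one flag trades some recall for precision, which matches the emphasis on high-precision heuristics; the right value of `min_flags` would need to be checked against labelled data.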
Analysis confirmed that one of the main<br />
characteristics of Twitter spam is repetition. Spam<br />
accounts repeatedly post the same Tweet in an effort to<br />
be seen. We used the Levenshtein distance to measure<br />
the similarity between Tweets and removed very similar<br />
Tweets. The results show that 20% of our dataset<br />
consists of duplicate or near-duplicate Tweets.<br />
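Near-duplicate removal with the Levenshtein distance can be sketched as follows. The distance function is the standard dynamic-programming formulation; the normalised `similarity` score, the 0.9 `threshold` and the keep-first-occurrence policy in `remove_near_duplicates` are assumptions made for illustration, since the exact cut-off used in this work is not stated.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def remove_near_duplicates(tweets: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a Tweet only if it is not too similar to any Tweet already kept."""
    kept: List[str] = []
    for tweet in tweets:
        if all(similarity(tweet, other) < threshold for other in kept):
            kept.append(tweet)
    return kept


if __name__ == "__main__":
    tweets = ["Win a free phone now!",
              "Win a free phone now!!",
              "Rain expected in Galway today"]
    print(remove_near_duplicates(tweets))
```

This pairwise comparison is quadratic in the number of Tweets, which is workable for a sample of about 11,000 records but would need blocking or hashing for much larger sets.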
Future Work<br />
To date we have been concerned with identifying<br />
heuristics that give high precision. We are now working<br />
towards measuring the recall of these heuristics.<br />
Following this, we aim to develop a ranking<br />
algorithm which will give a score to the remaining<br />
Tweets and rank them according to recency<br />
and query relevance.<br />
References<br />
[1] Ryan Kelly, Aug 2009, Twitter Study, Pear<br />
Analytics.<br />
[2] Danny Sullivan, Jun 2009, Twitter’s Real Time<br />
Spam Problem.