Study of Trend-Stuffing on Twitter through Text Classification

Danesh Irani, Steve Webb, Calton Pu
College of Computing
Georgia Institute of Technology
266 Ferst Drive, Atlanta, Georgia
{danesh, webb, calton}@cc.gatech.edu

Kang Li
Department of Computer Science
University of Georgia
415 Boyd GSRC, Athens, Georgia
kangli@cs.uga.edu

CEAS 2010 - Seventh annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, July 13-14, 2010, Redmond, Washington, US

ABSTRACT

Twitter has become an important mechanism for users to keep up with friends as well as the latest popular topics, reaching over 20 million unique visitors monthly and generating over 1.2 billion tweets a month. To make popular topics easily accessible, Twitter lists the currently most tweeted topics on its homepage as well as on most user pages. This provides a one-click shortcut to tweets related to the most popular or trending topics. Due to the increased visibility of tweets associated with a trending topic, miscreants and spammers have started exploiting them by posting unrelated tweets to such topics, a practice we call trend-stuffing. We study the use of text classification over 600 trends consisting of 1.3 million tweets and their associated web pages to identify tweets that are closely related to a trend as well as unrelated tweets. Using Information Gain, we reduce the original set of over 12,000 features for tweets and over 500,000 features for the associated web pages by over 91% and 99%, respectively, showing that any additional features would have a low Information Gain of less than 10^-4 and 0.016 bits, respectively. We compare the use of naïve Bayes, C4.5 decision trees, and Decision Stumps over the individual sets of features. Then, we combine classifier predictions for the tweet text and the associated web page text. Although we find that the C4.5 decision tree classifier achieves the highest average F1-measure of 0.79 on the tweet text and 0.9 on the associated web page content, we recommend the use of the naïve Bayes classifier. The naïve Bayes classifier performs slightly worse, with an average F1-measure of 0.77 on the tweet text and 0.74 on the associated web page content, but its required training time is significantly smaller.

1. INTRODUCTION
Twitter is growing in popularity as a micro-blogging service for users to share a short message (called a tweet) of at most 140 characters with friends and others. It grew to over 7 million unique visitors [17] and 50 million tweets per day [29] in 2009. Estimates as of February 2010 [4] put Twitter at 20 million unique visitors monthly and over 1.2 billion tweets a month. Technically, Twitter is a multicast service where followers subscribe to postings from users and topics [23] of interest. Twitter's homepage contains listings of trending topics, which are associated with news related to current or upcoming events, under the headings "currently popular," "popular this hour," and "popular today." An abbreviated list of these topics is also found alongside most user pages.

Trending topics help users receive the tweets they are interested in and allow visitors to catch up with the latest topics. Unfortunately, the keywords found in trending topics are increasingly misused by spammers and other miscreants to promote their own interests, a practice we call trend stuffing. Such a practice can be classified under the Web Spam Taxonomy [9] as a form of boosting with dumping. Typical scenarios of misuse include tweets that are unrelated to a trending topic's meaning but include the necessary keywords to be associated with the topic anyway. These deceptive associations can be achieved easily by adding the hashtag (a word or tag prefixed with a '#' character to identify topics) for a currently popular topic to the content of a spammy tweet. When unsuspecting followers of a targeted trending topic receive spammy tweets, they unknowingly click on spammy links, exposing themselves to unwanted and potentially dangerous web content. The practice of trend stuffing is against the rules outlined in the "Spam and Abuse" section of Twitter's fair use policy [3], and tools have been introduced to combat such tweets [2, 8].
However, despite these efforts, trend stuffing continues to occur frequently on Twitter.

Automated identification of tweets that contain trend-stuffing is a significant challenge due to the limited amount of information contained within their 140-character limit. The first contribution of this paper is an approach to automatically identify trend-stuffing in tweets. Our approach consists of three important components. First, for each trending topic, we build a model that describes its associated tweet content. The goal of these models is to distinguish trend-stuffing tweets from legitimate tweets based solely on the limited content information provided by a tweet. Second, when a tweet contains a URL, we use the URL to build a meta-model of its web page content. This meta-model is effective since most spammy tweets aim to promote a website. Finally, we combine the tweet content models with the web page content meta-models to increase their discriminating power.

The second contribution of the paper is a quantitative evaluation of the performance of each model and their combination using a data set of over 1.3 million tweets (collected between November 2009 and February 2010).


This evaluation data set includes the collection of web pages pointed to by the tweets, which consists of 1.3 million web pages after following any redirect chains that may have been present. Additionally, we also verified the longevity of the accounts responsible for sending those tweets. If an account has been suspended, we assume that it has been classified as a spammer or miscreant by Twitter. This information is used to sharpen our evaluation (e.g., to verify whether tweets classified as "not belonging to a trend" are actually trend-stuffed tweets). When using individual single-trend classifiers, the C4.5 decision tree classifier achieves the highest average F1-measure of 0.79 on the tweet text and 0.9 on the associated web page content. After combining the classification predictions from the tweet text and the associated web page content, we find that the C4.5 decision trees perform the best when combined with the 'OR' operator, displaying a 0.9 F1-measure for combined classification. Combining the classifiers using simple 'OR' and 'AND' operators allows us to take advantage of the performance of the respective classifiers while "short-circuiting" the combined classification with a positive or negative prediction, respectively.

2. PROBLEM DESCRIPTION

Before we discuss the problem we are trying to solve, we briefly provide an overview of tweets and trending topics. Readers familiar with this subject matter can skip to Section 2.2.

2.1 Overview of Tweets and Trending Topics

Tweets (or statuses) are short, 140-character messages that are posted by users in response to the question "What are you doing?". A user's tweets are delivered to all the friends following the user, and the tweets are also available on the user's account. As the evolution of Twitter into an avenue for content-creation has steadily progressed, Twitter has changed the question posed to users (when posting a message) to "What's happening?" [23].

Trending topics on Twitter are popular topics that are being tweeted about the most frequently at a particular point in time. As the question being posed to users is now "What's happening?", popular topics are usually associated with current events. In fact, some trending topics have played a significant role in providing news for breaking stories and allowing users to provide opinions on current events (e.g., Haiti disaster relief).
Trending topics can be associated with global trends or local trends (based on local geographic events), but since global trending topics are more frequently targeted by spammers, we focus on them in this work. One final observation about trending topics is that they are often identified by hashtags: words or tags prefixed with a '#' character.

2.2 Trend Stuffing

Currently, trending topics are listed on the Twitter homepage as well as alongside most other pages on Twitter. Due to their high visibility, trending topics attract tweets that may not be directly related to the topic, a practice we call trend stuffing. These deceptive tweets may arise from spammers, marketers, or users who want to promote a particular message (e.g., "Happy Birthday Tim #worldcup"). Twitter's Rules [3] forbid this behavior, stating that permanent suspension will result from "post[ing] multiple unrelated updates to a trending or popular topic."

Twitter has taken steps to reduce the amount of noise in trending topics [2, 8], and even though such measures have had an effect on the amount of obvious spam in the trending topics, they have not eliminated noise or spam completely. The exact details behind these measures have not been released, but when details have been released, spammers have found a way around them. For example, to avoid the detection mechanisms that take into account the history of a user's tweets, spammers delete historic tweets (having only a few "non-deleted" tweets at a particular time). We further discuss some details of known measures used by Twitter in Section 6.

3. OUR APPROACH

Our approach to trend stuffing can be split into three main steps. First, we study the use of text classification based only on the 140 characters (or fewer) of the tweets. This analysis helps determine how useful the text of the tweets themselves is for building a classification model for a trend. Additionally, as we will discuss in Section 4, using over 12,000 features for the classification model is not feasible and requires an intermediary step to reduce the number of features considered for each trend.

Second, we investigate text classification based on the web pages associated with tweets. The intuition is that the content of the web pages associated with tweets can also be used to determine whether or not a tweet belongs to a trend.
Further, this approach would likely be effective because spammers often promote websites by including URLs in their tweets. Once again, the number of website content features is too large (over 500,000) to be feasible for classification, and as a result, we employ an intermediary step of feature selection (discussed in Section 4) to reduce the number of features considered for each trend.

The third step involves combining the text-classification results from the tweet text and the associated web page content. In particular, we compare the performance of using the 'AND' and 'OR' operators for combining the predictions from the tweet text classifier and the associated web page content classifier to determine whether the tweet (and its associated web page) belongs to a given trend.

Before discussing the statistical methods used in our study, we describe the data set used for our experiments.

3.1 Dataset collection

We gathered more than 1.3 million tweets related to 2,067 trends over a 3-month period between November 22, 2009 and February 21, 2010. The top 10 most popular trending topics were queried every hour using the Twitter API [1], and the tweets associated with those top 10 trends were subsequently fetched every minute.

We also crawled all web pages linked by a tweet because tweets may contain little to no text, with only a link promoting a site. In fact, since we are particularly interested in finding spammers or users who try to promote sites unrelated to a particular trend, we focus on tweets that contain links. We follow each link and its corresponding redirection chain to fetch the final web page belonging to the link. This collection process generated over 1.3 million web pages, not including intermediate web pages responsible for redirection. Further details on how we handle redirect chains as well as JavaScript redirection can be found in our previous work [25, 26].
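As an illustration of this link-resolution step, the sketch below follows a tweet's URL through its HTTP redirect chain to the final landing page using the Python requests library. It is a minimal example rather than the crawler used in the study, it does not handle the JavaScript redirection mentioned above, and the shortened URL shown is hypothetical.

import requests

def resolve_final_page(url, timeout=10):
    # allow_redirects=True makes requests follow 3xx responses automatically;
    # the intermediate hops of the redirect chain are kept in response.history.
    response = requests.get(url, timeout=timeout, allow_redirects=True)
    chain = [hop.url for hop in response.history] + [response.url]
    return response.url, response.text, chain

if __name__ == "__main__":
    # Hypothetical shortened link of the kind commonly embedded in tweets.
    final_url, html, chain = resolve_final_page("http://bit.ly/example")
    print("Redirect chain:", " -> ".join(chain))
    print("Final page length:", len(html), "characters")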


The average size of a tweet in our data set is 105 characters, with a standard deviation of 9 characters. The average size of a web page linked to by a tweet is 55kB, with a standard deviation of 41kB. Although there are a number of other characteristics to explore, previous papers have already focused on characterizing Twitter [11, 13, 14, 15]. In this paper, we focus on the features used in classification and the actual classification itself.

We perform text classification on both the tweet text and the content of the final web page pointed to by the tweet. In both cases, we tokenize the text and use each word as a feature after performing Porter's stemming [18], removing stop words, and removing tokens with fewer than 50 occurrences throughout the dataset. The tweet text also undergoes additional pre-processing to remove any text that is directly associated with the trending topic (e.g., trend hashtags or words found to be part of the topic). Finally, we count the number of occurrences of each token and use this count as a numeric feature. After performing these steps, the tweet text generated over 12,000 features, and the web page content generated over 520,000 features.
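To make the pre-processing concrete, the sketch below turns raw text into term-count features along the lines described above: tokenize, stem with a Porter stemmer, drop stop words, discard tokens with fewer than 50 occurrences across the corpus, and keep per-document counts. It uses NLTK's PorterStemmer and a small illustrative stop-word list rather than the exact tokenizer and stop-word list used in the study, and the trend_terms argument (for removing hashtags and topic words) is a hypothetical stand-in.

import re
from collections import Counter

from nltk.stem import PorterStemmer  # Porter's stemming algorithm [18]

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "rt"}  # illustrative subset
MIN_CORPUS_COUNT = 50  # tokens rarer than this across the dataset are dropped

stemmer = PorterStemmer()

def tokenize(text, trend_terms=frozenset()):
    # Lowercase, split into word-like tokens, drop words tied to the trend
    # itself (e.g., its hashtag), drop stop words, and stem what remains.
    raw = re.findall(r"[a-z0-9#]+", text.lower())
    words = [w.lstrip("#") for w in raw]
    words = [w for w in words if w and w not in trend_terms and w not in STOP_WORDS]
    return [stemmer.stem(w) for w in words]

def build_count_features(documents, trend_terms=frozenset()):
    # One dictionary of term counts per document, keeping only terms that
    # occur at least MIN_CORPUS_COUNT times throughout the whole corpus.
    per_doc = [Counter(tokenize(d, trend_terms)) for d in documents]
    corpus_counts = Counter()
    for counts in per_doc:
        corpus_counts.update(counts)
    keep = {t for t, c in corpus_counts.items() if c >= MIN_CORPUS_COUNT}
    return [{t: c for t, c in counts.items() if t in keep} for counts in per_doc]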
3.2 Statistical Methods used in our Study

We use a standard set of statistical classification methods in our study. We outline some important basic information about each statistical classification method as well as differences between the methods. Then, we detail the manner in which the classifiers are run, along with our evaluation criteria. Readers familiar with standard classifiers can skip to Section 3.2.2.

3.2.1 Statistical Classification Methods

We use a standard implementation of supervised machine learning algorithms from Weka 3.6.1 [30]. Table 1 lists the Weka classifiers used as well as the standard name of the algorithm implemented by each Weka classifier. We briefly discuss the classifiers below:

1. Naïve Bayes is a standard classifier used for text classification. Due to its strong assumption of independence between features, the classifier is fairly robust, even with a large set of features, because it can independently assess each feature with respect to the class label. Further, the naïve Bayes classifier can be easily trained as a passive or active learning classifier.

2. C4.5 decision tree is a standard decision tree classifier. The algorithm uses a small set of features, which most effectively split the class labels at each branch of the decision tree. The model created by a decision tree is easily readable and provides useful insight when analyzing results.

3. Decision Stump is a weak learner and can be thought of as a single-level "binary" decision tree. We use this to see how easily the class can be partitioned using a single feature.

Table 1: List of classifiers used with categorical features.

Weka Classifier   Algorithm Type
Naïve Bayes       Naïve Bayes
J48               C4.5 Decision Tree
DecisionStump     Decision Stump

All classifiers are used with their default settings and are run with 10-fold cross-validation to avoid over-fitting. We provide a more thorough discussion of the benefits and drawbacks of using particular classifiers along with the results in Section 5.

We use supervised learning because our initial experiments with an unsupervised learning algorithm (clustering using k-means) generated unsatisfactory results. Specifically, we found that the large number of features and sometimes irrelevant features caused the clustering algorithm to perform poorly. On the other hand, the supervised learning algorithms were able to quickly and accurately identify tweets not belonging to a given trend, even in the presence of noisy labels.

3.2.2 Statistical Classification Setup

We build a classifier model for each of the 670 trends with over 200 tweets, using one-to-many classification with all the tweets belonging to the trend in one class and a stratified sampling of tweets from the other trends (with at minimum one tweet per trend) in the other class. As an example, if a particular trend has 500 tweets, we pick 669 tweets from the other trends, resulting in classification over 1,169 tweets.
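The one-to-many setup can be sketched as follows: all tweets of the target trend form the positive class, and the negative class is drawn by taking at least one tweet from every other trend and, if needed, topping up at random until it is at least as large as the positive class. This is an illustrative reconstruction (sampling with replacement for brevity, not the study's exact sampling code); tweets_by_trend, a mapping from trend name to its tweets, is an assumed input.

import random

def build_one_to_many(tweets_by_trend, target_trend, seed=0):
    # Positive class: every tweet of the target trend.
    # Negative class: a stratified sample over the other trends, with at
    # minimum one tweet per trend (e.g., 669 negatives when there are 670
    # trends, matching the 500-positive example above).
    rng = random.Random(seed)
    positives = list(tweets_by_trend[target_trend])
    others = [v for t, v in tweets_by_trend.items() if t != target_trend and v]

    negatives = [rng.choice(v) for v in others]          # one per other trend
    pool = [tw for v in others for tw in v]
    while len(negatives) < len(positives) and pool:      # top up to match size
        negatives.append(rng.choice(pool))

    labels = [1] * len(positives) + [0] * len(negatives)
    return positives + negatives, labels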
We do not build one single multi-class classifier for multiple reasons: 1) retraining would be required for new classes (trends) and over time, as topic drift occurs; 2) smaller classifiers would be quicker and more robust; and 3) more aggressive feature selection can be performed to reduce the feature set to only those terms relevant to the particular trend.

As we are using supervised learning, we need to provide labels for training the classifier (or building the model). We use a label associated with each unique trend text (e.g., "welcome2010" or "tiger woods") as assigned by Twitter. When assigning labels, we do not remove tweets from suspended accounts from the data set because, in practice, these algorithms will be trained on somewhat noisy data, as tweets are only identified as "bad" after a certain time-lag. The classifiers will need to rely on the confidence bounds as well as cross-validation to only learn features of tweets related to the trend and ignore the noise. Although this decision might result in trends containing a large number of noisy labels, such that the noise is included in the model of tweets belonging to the trend, this methodology represents a realistic scenario that we wish to analyze.

3.2.3 Evaluation Criteria of Statistical Methods on Data Sets

We use the F1-measure to evaluate and compare the results of the classifiers. The F-measure is a weighted harmonic mean between precision and recall. To define this metric, we first need to review the possible outcomes of a single classification using a confusion matrix:

Actual Label   Predicted: Trend       Predicted: Other
Trend          True-Positive (TP)     False-Negative (FN)
Other          False-Positive (FP)    True-Negative (TN)

In our one-to-many experimental setup, positive tweets are tweets that belong to a given trend, and negative tweets are tweets that belong to other trends for the particular experiment.


In our evaluation, we take into account the number of suspended tweets classified as positive. These tweets are incorrectly classified because, as mentioned earlier, suspended tweets are likely to not belong to any trend.

Precision is a measure of the fraction of actual positive tweets found by the classifier in relation to all positive tweets found by the classifier, given by Precision = TP / (TP + FP). Recall is a measure of the fraction of positive tweets found by the classifier in relation to all actual positive tweets, given by Recall = TP / (TP + FN). Finally, the F-measure is given by

F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall).

The F1-measure is the F-measure with the value of β set to 1. As when calculating precision and recall, we take into account the number of suspended tweets and adjust the F1-measure accordingly.

3.2.4 Data Set Label Refinement

To identify tweets associated with trend stuffing, we re-queried the Twitter accounts of users who had tweets in our data set to check whether their Twitter account had been suspended. We found over 26,000 Twitter accounts associated with tweets that had been suspended (approximately 3.5% of all accounts collected), which were responsible for over 130,000 tweets (approximately 10% of all tweets collected). We do submit that a very small percentage of these accounts may have been suspended for purposes other than spam and, in addition, that there are Twitter accounts that have been used for spam but have not yet been suspended by Twitter.

[Figure 1: Percentage of suspended tweets vs. number of tweets in a trend.]

Figure 1 shows the number of tweets in a trend on the x-axis and the percentage of tweets in a trend which come from suspended accounts on the y-axis. Each point represents a unique trend.
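Tying Sections 3.2.1 through 3.2.3 together, the following sketch runs 10-fold cross-validation and reports the average F1-measure for scikit-learn stand-ins for the three Weka classifiers (MultinomialNB for naïve Bayes, DecisionTreeClassifier for C4.5/J48, and a depth-one tree as a Decision Stump). These are approximations rather than the exact Weka implementations, and the suspended-tweet adjustment to the F1-measure described above is omitted.

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

def evaluate_trend_classifiers(feature_dicts, labels):
    # feature_dicts: per-tweet term-count dictionaries (Section 3.1);
    # labels: 1 = belongs to the trend, 0 = sampled from other trends.
    X = DictVectorizer().fit_transform(feature_dicts)
    classifiers = {
        "naive Bayes": MultinomialNB(),
        "C4.5-style decision tree": DecisionTreeClassifier(),
        "decision stump": DecisionTreeClassifier(max_depth=1),
    }
    scores = {}
    for name, clf in classifiers.items():
        folds = cross_val_score(clf, X, labels, cv=10, scoring="f1")
        scores[name] = folds.mean()
    return scores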
4. REDUCTION OF FEATURE SETS

Classification with over 12,000 features for the tweet text and over 520,000 features for the web page content is not practical. Most classifiers will perform poorly, either due to the curse of dimensionality [6] or due to the computational burden of trying to analyze patterns associated with all the features (e.g., in neural networks, learning weights for thousands of input nodes).

We chose Information Gain to reduce the number of features to a tractable size. Information Gain, from an information-theoretic perspective, is a measure of the expected reduction in entropy if one were to use the feature to partition tweets belonging to a trend from tweets not belonging to the trend. After calculating the amount of Information Gain for each feature, we pick the strongest features. Previous work on feature selection in text categorization [31] found that Information Gain was effective in drastically reducing (by up to 98%) the number of features used for classification while not losing categorization accuracy. As we perform the feature selection separately for each trend when doing one-to-many classification, we believe an even more drastic reduction can be performed.

We decided to evaluate our classification with a more tractable set of 100 and 1,000 features for the tweet text (0.8% and 8% of the total number of tweet text features, respectively) and a set of 100, 1,000, and 5,000 features for the web page content (0.02%, 0.2%, and 1% of the total number of web page content features, respectively).

Table 2: Comparison of the number of features considered, when using Information Gain to reduce the feature set size.

(a) Average lowest Information Gain

Feature category     100 features   1000 features   5000 features
Tweet Text           0.004          < 10^-4         -
Web Page Content     0.096          0.016           0.002

(b) Average total feature set Information Gain

Feature category     100 features   1000 features   5000 features
Tweet Text           2.1            2.5             -
Web Page Content     14.6           46.9            70.9
Table 2showsthe average lowest Informati<strong>on</strong> Gain and the average totalInformati<strong>on</strong> Gain. The average lowest Informati<strong>on</strong> Gain isthe average Informati<strong>on</strong> Gain <str<strong>on</strong>g>of</str<strong>on</strong>g> the weakest feature in thefeature set, and it indicates that any features not addedinto the set will have an Informati<strong>on</strong> Gain value lower thanit. The average total Informati<strong>on</strong> Gain can intuitively bethought <str<strong>on</strong>g>of</str<strong>on</strong>g> as the average total discriminative power <str<strong>on</strong>g>of</str<strong>on</strong>g> allthe features used to distinguish the trend from others.For the tweet text from the Table 2(a), we see that after100 features, the average Informati<strong>on</strong> Gain <str<strong>on</strong>g>of</str<strong>on</strong>g> the weakestfeature drops to 0.004, which is indicative <str<strong>on</strong>g>of</str<strong>on</strong>g> most <str<strong>on</strong>g>of</str<strong>on</strong>g> thediscriminative power beingcapturedinthefirst100features.This is further shown in Table 2(b) where the average totalInformati<strong>on</strong> Gain increases by <strong>on</strong>ly 0.4 when c<strong>on</strong>sidering anadditi<strong>on</strong>al 900 tweet text features. In fact, we find that <strong>on</strong>average, after 317 features, the Informati<strong>on</strong> Gain drops tozero. For evaluati<strong>on</strong> purposes, we still c<strong>on</strong>sider tweet textclassificati<strong>on</strong> both using 100 and 1000 features.For the web page c<strong>on</strong>tent from Table 2(a), the averageInformati<strong>on</strong> Gain <str<strong>on</strong>g>of</str<strong>on</strong>g> the weakest feature when c<strong>on</strong>sidering100, 1000, and 5000 features is 0.096, 0.016, and 0.002, respectively.C<strong>on</strong>sidering the average total Informati<strong>on</strong> Gainshown in Table 2(b), we see a 300% increase from 100 to1000 features and a further 50% increase from 1000 to 5000features. The large increase in entropy when c<strong>on</strong>sideringweb page c<strong>on</strong>tent features (especially when going from 100to 1000 features) makes it likely that using 100 features forclassificati<strong>on</strong> for web page c<strong>on</strong>tent may not be enough. Wewill evaluate web page c<strong>on</strong>tent classificati<strong>on</strong> using 100, 1000,and 5000 features.


5. EXPERIMENTAL CLASSIFICATION RESULTS

We separately evaluate the effectiveness of classification on the tweet text and on the web page content. Then, we evaluate combining the predictions of both classifiers. As previously mentioned, we use tweets from suspended accounts to validate the false-negatives. Validating the true-positives requires a larger manual effort, and we explore this approach for a subset of tweets in Section 5.4.

As a baseline for our classification results, we use the majority classifier, which classifies all tweets into the larger class. The class representing a sample of all tweets outside the trend is constructed to have the same size as the opposing class, so both classes have an equal size. The majority classifier would therefore have a 50% accuracy, with an F1-measure of 0.66 (assuming the positive class is assigned as the majority class) or 0 (assuming all tweets are majority-classified into the negative class); for our purposes, we use a baseline F1-measure of 0.66.
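As a sanity check on this baseline, with N positive and N negative tweets (the equal-class-size assumption stated above) and every tweet labeled as belonging to the trend, the definitions from Section 3.2.3 give:

Precision = TP / (TP + FP) = N / (N + N) = 0.5
Recall    = TP / (TP + FN) = N / (N + 0) = 1
F1        = (2 · Precision · Recall) / (Precision + Recall) = (2 · 0.5 · 1) / (0.5 + 1) ≈ 0.66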
All experiments were run on Redhat Enterprise Linux using a 3 GHz Intel Core 2 Quad CPU (Q9650) with 8 GB RAM.

5.1 Classification using Tweet Text

We first look at the results of the classifiers that use only features based on tokens in the tweet text. The results of classification using these features are shown in Figure 2. Each set of bars represents a different classifier, and each bar within the set represents the average performance in terms of F1-measure on a different number of features.

[Figure 2: Results of classification on tweet text using different numbers of features. Error bars represent one standard deviation of the F1-measure.]

We see that naïve Bayes and C4.5 decision trees perform well, with C4.5 decision trees performing the best by a small margin. Decision Stumps, on the other hand, perform even worse than the baseline majority classifier.

C4.5 decision trees and naïve Bayes have the highest F1-measure as they have a lower number of false-negatives (tweets classified as not being in the trend). Decision Stumps are only able to use a single feature to decide whether a tweet is in the trend or not, and in the absence of a "hashtag" feature, they are not able to perform very well. The poor performance of Decision Stumps was due to the low expressive power of a single Decision Stump and the lack of a single feature that could, without noise, capture the essence of a trend. Compared to a sample of C4.5 decision trees, we found the root node to be the same, but the C4.5 decision tree had a number of branches under the root to filter out false-positives and false-negatives.

[Figure 3: F1-measure of the naïve Bayes classifier using 100 tweet text features.]

Figure 3 shows the F1-measure results of naïve Bayes classification using 100 features. The number of tweets belonging to the trend being classified (the positive class) is shown on the x-axis, with the F1-measure on the y-axis. We see that as the number of tweets in the positive class increases, the amount of variation in the F1-measure reduces significantly, with the scores tending to be higher. The reason for this is that some of the positive classes may contain a large percentage of noise (suspended tweets); this is especially true for smaller trends. If this is the case, the classifier learns the noise as part of the trend. In calculating the F1-measure, since we consider suspended tweets not part of the trend, this can reduce the resulting F1-measure. As the number of tweets in the positive class increases, the likelihood of noisy tweets being classified as part of the trend decreases, resulting in fewer false-positives (i.e., tweets that were suspended, and thus not part of the trend, being classified as positive).
The reason for the small difference between the classification using 100 features and 1,000 features of tweet text is likely that the Information Gain per feature is very low after selecting the strongest 100 features (in some cases even zero).

One of the other important factors in the performance of a classifier is the amount of time it takes to build the model and the time it takes to classify a tweet. Table 3 shows the average model build time (training time) and the average classification time (test time) for a single tweet, for both 100 and 1,000 features, using the different classifiers.

Table 3: Average time in ms to train and test the classifiers over the tweet text.

Training Time (ms)     100    1000
Naïve Bayes            0.21   0.79
C4.5 Decision Tree     0.77   7.80
DecisionStump          0.22   0.86

Test Time (ms)         100    1000
Naïve Bayes            0.26   1.12
C4.5 Decision Tree     0.08   0.11
DecisionStump          0.07   0.11


We see that Decision Stumps take the least time both for building the model and for classifying a tweet, but this is due to the very simplistic single-branch model. Naïve Bayes is faster at building a model than C4.5 decision trees because in naïve Bayes, each feature is treated independently and thus can be looked at once. On the other hand, when building a C4.5 decision tree, at every node an attribute is picked based on the highest Information Gain of the available features, followed by a recursion on the smaller subtrees. C4.5 decision trees are much faster than naïve Bayes at classifying a tweet once the model is built because only a relatively small decision tree needs to be traversed to determine a label for the tweet. Conversely, naïve Bayes needs to calculate the combined sum of probabilities for all the features. This requirement is the reason there is an over 400% increase in the time taken when using naïve Bayes to classify a tweet with 1,000 features versus 100 features.

5.2 Classification using Web Page Content

We now explore the classification of tweets based on the text features from the web page content associated with the tweets. Web page content associated with tweets should also be related to the trend or be separable in such a way. The results of classification using these features are shown in Figure 4. Similar to Figure 2, each set of bars represents a different classifier, and each bar within a set represents the average performance in terms of F1-measure using a different number of features.

[Figure 4: Results of classification on web page content using different numbers of features. Error bars represent one standard deviation of the F1-measure.]
The C4.5 decision trees perform the best, followed closely by Decision Stumps. Naïve Bayes performs the worst, but still better than the baseline classifier.

C4.5 decision trees perform the best as they have the smallest number of false-negatives, followed by the Decision Stump classifier. The tree-based classifiers (C4.5 decision trees and Decision Stumps) perform well in this case as there are a few strong features which distinguish the trends. The explanation for this is that web pages related to the trend contained words strongly related to the topic. For example, the strongest features for the hashtag "#helphaiti" were "haiti", "earthquake", "donate", "disaster", and "relief". The Decision Stump, due to its single-level tree, cannot reduce the number of false-positives using further decisions.

[Figure 5: F1-measure of the C4.5 decision tree classifier using 100 web page content features.]

Figure 5 shows the F1-measure results of using the C4.5 decision tree classifier with 100 features on the web page content. The number of web pages belonging to the trend being classified (the positive class) is shown on the x-axis, with the F1-measure on the y-axis. Similar to classification using tweet text, we find that the variance is lower and the F1-measure higher as the number of positive web pages increases. The reason for this is that as the trend size increases, the percentage of suspended tweets that make up a trend decreases, causing the classifier to be less likely to learn the suspended tweets as part of the trend.

We also look at the time (in ms) taken to build the classification models and to classify individual web pages. Table 4 shows the average time to build the model (training time) and the time to classify (test time) a single web page using the different classifiers and numbers of features.

Table 4: Average time in ms to train and test the classifiers over the web page content.

Training Time (ms)     100    1000    5000
Naïve Bayes            0.17   1.19    7.06
C4.5 Decision Tree     0.62   9.46    781.90
DecisionStump          0.17   1.25    7.53

Test Time (ms)         100    1000    5000
Naïve Bayes            0.17   0.90    5.96
C4.5 Decision Tree     0.05   0.11    0.19
DecisionStump          0.05   0.13    0.39

We see that the training times for naïve Bayes and Decision Stumps are the lowest, and C4.5 decision trees take the most time. In terms of classifying a web page once a model is built, both C4.5 decision trees and Decision Stumps are the fastest, while naïve Bayes is relatively slow.
The time naïve Bayes takes to classify a single web page associated with a tweet increases by over 600% when increasing the number of features from 1,000 to 5,000.

5.3 Classification using Tweet Text and Web Page Content Combined

After studying the performance of the classifiers on the individual sets of features, a natural next step is to combine the predictions from the classifiers of the tweets and the associated web pages into a combined prediction. There are two main reasons for doing so:


1. Tweets might not contain enough text to make an accurate classification all the time. Some tweets might only contain a URL and a hashtag associating them with a trend. In this case, classifying the web page content associated with the tweet might help make an accurate classification of whether or not the tweet actually does belong to the trend.

2. It might be easy for a spammer to camouflage a tweet to look like it belongs to a trend, but it would take more effort (on the part of the spammer) to also camouflage the web page linked in the tweet to look like it belonged to the trend. Even if the spammer were to do so, they would have to customize their web page for each trend that they wish to spam, and this would likely be prohibitive.

We compare two methods of combining the classification results from the tweets and from the associated web pages, outlining possible improvements to this method in Section 5.5. The 'AND' and 'OR' operators are commonly used to check whether the results of two statements agree or disagree, and in the case of combining classification results, we use them to check whether the prediction over a tweet matches the prediction over the associated web page content. In the case of the 'AND' operator, we mark a tweet as belonging to a trend if the prediction for the tweet 'AND' the prediction for the web page associated with the tweet both predict that the tweet belongs to the trend. In the case of the 'OR' operator, we mark a tweet as belonging to a trend if either the prediction for the tweet 'OR' the prediction for the web page associated with the tweet predicts that the tweet belongs to the trend.
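A minimal sketch of this combination step is shown below. The two trained classifiers are assumed to exist, and the web-page features are produced lazily so that, as discussed later in this section, the more expensive web-page classification is only performed when the tweet-text prediction does not already decide the boolean expression. This is illustrative, not the implementation used in the study.

def combined_prediction(tweet_clf, page_clf, tweet_features, page_features_fn, operator="OR"):
    # Combine the tweet-text prediction and the web-page prediction with
    # 'AND' or 'OR'. page_features_fn is only called when needed, so the
    # web page is fetched, tokenized, and reduced lazily.
    tweet_in_trend = bool(tweet_clf.predict([tweet_features])[0])

    if operator == "OR" and tweet_in_trend:
        return True       # short-circuit: one positive prediction suffices
    if operator == "AND" and not tweet_in_trend:
        return False      # short-circuit: one negative prediction suffices

    page_in_trend = bool(page_clf.predict([page_features_fn()])[0])
    if operator == "OR":
        return tweet_in_trend or page_in_trend
    return tweet_in_trend and page_in_trend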
Intuitively, by using the 'AND' operator, we ensure that the predictions on the tweet text and on the associated web page both agree before classifying the tweet into the trend, causing an increase in the number of false-negatives and true-negatives. The 'OR' operator, on the other hand, causes an increase in the number of true-positives and false-positives.

Figure 6 and Figure 7 show the results of combined classification using the 'OR' operator and the 'AND' operator, respectively. The x-axis displays the classifier used for the tweet text, whereas each bar within a cluster denotes the classifier used to predict the label for the associated web page content. All classifiers were trained and tested using 100 features. The y-axis shows the value of the F1-measure achieved by the combined tweet text and web page content classifier.

[Figure 6: Results of F1-measure based on predictions from the tweet text and web page content classifiers combined using the 'OR' operator.]

[Figure 7: Results of F1-measure based on predictions from the tweet text and web page content classifiers combined using the 'AND' operator.]

We find that the combined classifier using the 'OR' operator performs better on average than the combined classifier using the 'AND' operator. As previously mentioned, this result is due to the 'OR' operator leading to more true-positives (and perhaps even a few more false-positives), and it performs better than the best individual 100-feature classifier used. One of the possible reasons for not achieving results closer to 1 is likely trends which are largely made up of suspended tweets. We are currently investigating this further.

[Figure 8: F1-measure of the combined classifier using naïve Bayes with 100 features on the tweet text and C4.5 with 100 web page content features.]

In the case of combining classifiers using the 'AND' and 'OR' operators, both classifiers would have to be trained on the trend, but not necessarily both would need to be probed at the time of determining whether the tweet and its associated web page belong to the trend. This is similar to the common practice of short-circuiting boolean expressions. In the case of the 'AND' operator, a single negative prediction results in the final label being negative, whereas in the case of the 'OR' operator, a single positive prediction results in the final label being positive. With respect to the tweets, the tweet text would be easier to obtain a classification label for, especially since obtaining a classification for the web page involves fetching the destination web page, followed by tokenizing and reducing the features of all the web page content.


5.4 Manual Verification of a Subset of Results

In our manual vetting of the results, we reviewed a few hundred tweets and associated web pages. We make a number of interesting observations based on what we found:

1. Some spammy tweets had legitimate-looking tweet text (they were "camouflage tweets"), but the link pointed to a web page that was unrelated to the trend. An example of this is the tweet "NFL Sunday Night Betting Preview - Brett Favre vs. Kurt Warner (maybe) [URL removed]", which was related to the trend "NFL" while linking to web page content for the dating site "Friend Jungle."

2. Many of the false-negatives were tweets properly classified as not belonging to the trend. The reason they were counted as false-negatives is that they belonged to Twitter accounts that had not yet been suspended. For example, the tweet "My Top 3 Weekly - Play Free Online Games [URL removed]" and its associated web page were both classified as not belonging to the "My Top 3 Weekly" trend, but the Twitter account associated with this tweet had not been suspended yet.

3. Some false-positives belonged to a trend but were posted by suspended accounts. These were tweets and web pages pointing to legitimate content related to a topic but using inappropriate or obscene language. An example is the tweet "Check this - Nicole Parker-Best of Britney Spears Part 2 [URL removed] Plz ReTweet! #musicmonday", related to the trend "#musicmonday", which links to a parody video of Britney Spears. It is likely that the account posting this tweet was suspended for the sexually suggestive content of the parody.

5.5 Overall Discussion

C4.5 decision trees perform the best and would likely be resilient to a certain amount of adversarial interference, but the time taken to build a model for a C4.5 decision tree is a hindering factor in using such classifiers. The Decision Stump classifier performs quite well and can be trained very quickly, but it can be easily defeated by adversaries. Naïve Bayes is the most robust, but it takes the longest to classify each tweet. An advantage of the naïve Bayes classifier is its suitability for online learning; namely, it can be updated based on user feedback without a large re-training cost. Therefore, we believe that using naïve Bayes with 100 features (or perhaps fewer, depending on the maturity of the trend) would be ideal for classifying tweets and their associated web pages.
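To illustrate why such online updates are cheap for naïve Bayes, the sketch below (a minimal illustration under our own assumptions, not the Weka-based setup used in our experiments) shows that folding a newly labelled tweet into the model only requires incrementing per-class token counts:

    import math
    from collections import defaultdict

    # Minimal multinomial naive Bayes that supports incremental updates:
    # incorporating user feedback only touches a handful of counters.

    class IncrementalNaiveBayes:
        def __init__(self):
            self.class_counts = defaultdict(int)   # labelled documents per class
            self.token_counts = defaultdict(lambda: defaultdict(int))  # class -> token -> count
            self.class_totals = defaultdict(int)   # total tokens seen per class
            self.vocab = set()

        def update(self, tokens, label):
            # O(len(tokens)) update; no retraining over the full corpus.
            self.class_counts[label] += 1
            for t in tokens:
                self.token_counts[label][t] += 1
                self.class_totals[label] += 1
                self.vocab.add(t)

        def predict(self, tokens):
            total_docs = sum(self.class_counts.values())
            best_label, best_score = None, float("-inf")
            for label in self.class_counts:
                # log prior plus Laplace-smoothed log likelihoods
                score = math.log(self.class_counts[label] / total_docs)
                denom = self.class_totals[label] + len(self.vocab)
                for t in tokens:
                    score += math.log((self.token_counts[label][t] + 1) / denom)
                if score > best_score:
                    best_label, best_score = label, score
            return best_label

For example, nb.update(["nfl", "betting", "preview"], "in-trend") folds a single corrected label into the model immediately, whereas a C4.5 model would typically have to be rebuilt from the full training set.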
We plan to study the applicability of varying feature set sizes based on the number of tweets belonging to a trend, as well as the effect of classifying recurring or multi-day trends.

Working with 140 characters of text or less has been both beneficial and challenging. For a particular trend, we find that a small selection of features is adequate to accurately capture tweets belonging to the trend. On the flip side, because tweets contain pointers for gathering further information, it is very easy for a spammer or miscreant to camouflage a tweet (i.e., make the tweet text look legitimate while making the link point to a spammy web page). It is due to this possibility that we also crawl the web pages associated with tweets and use their content in our classification.

6. RELATED WORK

Twitter has largely been studied for its network characteristics and structure. The largest such measurement to date has been done by Kwak et al. [15], who study over 106 million tweets and 41.7 million user profiles. They investigate network characteristics ranging from basic follower/following relationships to homophily (the tendency for similar people to associate with one another) to PageRank. They also study trends in relation to their similarity to topics on CNN and Google Trends, further characterizing most trends as being active. Huberman et al. [11] and Krishnamurthy et al. [14] also study the characteristics of Twitter, with the latter identifying users and their behaviors as well as geographic growth patterns. Java et al. [13] performed an earlier study of the Twitter network, a year into its growth, and used the network structure to categorize users into three main categories: information sources, friends, and information seekers. Information sources, they note, were "found to be automated tools posting news and other useful information on Twitter".

In terms of spam detection on Twitter, Yardi et al. [32] looked at the life-cycle and evolution of a trend, with a focus on spam. Although they do identify some behavioral characteristics that can be used to identify spammers, they concede that such behavioral patterns come with false positives. We plan to investigate whether we can combine such techniques to arrive at a robust technique for keeping spam messages out of trending topics.
Other attempts at spam detection on Twitter include Spamdetector [24], a tool that applies "some heuristics to detect spam accounts". Details on this project are unavailable, but the effort is continuing based on users' feedback to bots on Twitter.

Twitter itself has taken a proactive approach to detecting spam on its network. As early as July 2008 [21] and August 2008 [22], Twitter acknowledged an ongoing battle with spammers and responded by placing limits on followers and deleting spam accounts, using the "number of users who blocked a user" as feedback. Further steps were taken in November 2009 [8] and March 2010 [10]: first, reducing the amount of clutter posted to trends (although details on the approach are limited), and second, introducing a URL filter to protect against phishing scams.

There is a large body of related work [20] on using text-based classification to label e-mails as spam or non-spam. Such techniques could be applied directly to Twitter, but due to the short messages and the camouflage used to get users to click links in such posts, we believe they would have limited success. Our previous work [12] on the classification of social profiles in MySpace is a precursor to this work, as it looks at zero-minute classification of social profiles using static profile information (features such as Age, Interests, and Relationship Status). As Twitter gathers very little information from a user during account creation, that technique would likely need to be applied in conjunction with other techniques, such as the one proposed in this paper, to increase accuracy.

Finally, over the past few years, we have performed an extensive amount of research on the general security challenges facing the social Web. We began by enumerating the various
threats and attacks facing social networking environments and their users [28]. Then, we turned our attention to one of the most pressing threats against the social Web: deceptive profiles. In [27], we presented a novel technique for collecting deceptive profiles (social honeypots), and we performed a first-of-its-kind characterization of deceptive profile features and behaviors. Next, using the profiles we collected with our honeypots, we extended our analysis and incorporated automated classification techniques to distinguish between legitimate and deceptive profiles [12, 16]. In this paper, we have complemented these previous efforts and helped to improve the quality of information in another important social environment.

7. CONCLUSION

In this paper, we study the non-trivial problem of text classification in a restricted environment of only 140 characters, such as that presented by trend-stuffing on Twitter. Our approach to the problem first modeled tweets that belong to a trend using text classification on the text of the tweet itself. We followed this by building a meta-model for the web pages linked in tweets that belong to a trend. Finally, we combined the tweet text model and the web page content model to increase the performance of classifying tweets that belong to trends. In each of these steps, we used additional information about suspended tweets to validate our results. To make text classification on the tweet text and the web page content feasible, we first used Information Gain to reduce the number of features from over 12,000 and over 500,000 to fewer than 1,000 and fewer than 5,000 features, respectively. For the tweet text features and the web page content features, this accounted for a reduction of over 91% and 99%, respectively.
We found that, even with reductions this large, the Information Gain of additional features was on the order of a thousandth of a bit.

C4.5 decision trees achieve the highest F1-measure on the classification of both the tweet text and the associated web page content, with F1-measures of 0.79 and 0.9, respectively. Naïve Bayes performs well on the tweet text, with an F1-measure of 0.77, but not very well on the web page content. Decision Stumps, on the other hand, perform well on the web page content, with an F1-measure of 0.81, but not very well on the tweet text. We find that by combining predictions from classifiers on the two sets of features, we perform well for most combinations of classifiers when using the ‘OR’ operator. In most cases, we do not find a significant difference between classification using 100 features and classification using over 1,000 features.

In addition, we compare the time taken to build a classification model and to test tweets or web pages against each classifier. We find that C4.5 decision trees take the longest to build a classification model but are very quick at testing an instance. Naïve Bayes takes a much shorter time to build a classification model but takes longer to test each instance.

8. FUTURE WORK

The current performance of the combined classifiers could be improved by using a classifier combination similar to the Bayesian combination technique used in our previous work [7]. The idea is essentially to weight each classifier by its confidence, or by the amount of information (text or content) available to it. The root of the combination technique is represented by the formula:

    P(ω | x) = Σ_i P(ω | Z_i, x) · P(Z_i | x)

Here P(ω | x) is the probability that tweet x belongs to trend ω, obtained by summing over the tweet text and web page classifiers, which are distinguished by the subscript i. P(Z_i | x) is the classifier's confidence given tweet x, which can be a simple measure of how much text is available to that classifier, or it can be used to ignore a classifier whose feature set becomes unavailable. For example, a tweet might contain only a link and no text; this would result in P(Z_tweet-text | x) being zero or close to zero, whereas if the web page contained a lot of text, P(Z_webpage-content | x) would be close to 1. P(ω | Z_i, x) is the posterior probability of ω given a single classifier and x. Further details of this method of combination, and of applying it to disjoint sets of features, can be found in Section 4.2 of that work.
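As a rough illustration of this weighting (a sketch under our own simplifying assumption that confidence is proportional to the amount of text available to each classifier; this is not the combination code from [7]):

    # Sketch of confidence-weighted combination:
    #   P(trend | x) = sum_i P(trend | Z_i, x) * P(Z_i | x)
    # Here the confidence P(Z_i | x) is approximated by each classifier's
    # share of the available text -- an assumption made for illustration only.

    def combined_probability(p_trend_given_tweet_text, p_trend_given_web_page,
                             tweet_text_len, web_page_text_len):
        total = tweet_text_len + web_page_text_len
        if total == 0:
            return 0.0  # no text available to either classifier
        conf_tweet = tweet_text_len / total       # P(Z_tweet-text | x)
        conf_page = web_page_text_len / total     # P(Z_webpage-content | x)
        return (p_trend_given_tweet_text * conf_tweet +
                p_trend_given_web_page * conf_page)

    # A tweet with no text but a content-rich landing page relies entirely
    # on the web page classifier: the result below is 0.9.
    print(combined_probability(0.5, 0.9, tweet_text_len=0, web_page_text_len=800))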
In terms of scalability, one of the issues already addressed is the training and testing time of the classifiers. Due to the large volume of tweets and the potentially large number of models that will need to be stored, the question of further enhancements to classifier performance does arise. With regard to the amount of space required for such classifiers, using "Mini"-filters [19] could be investigated, as could discarding models after a set period of time. The hashing trick [5] might alternatively be useful in reducing the number of features considered, by hashing features onto a fixed-size hash space (possibly eliminating the need to use Information Gain for attribute selection).
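A minimal sketch of the idea (our own illustration, not the scheme of [5]; the bucket count is an arbitrary choice): each token is hashed into one of a fixed number of buckets, so the feature vector keeps a bounded size no matter how large the vocabulary grows.

    import hashlib
    from collections import defaultdict

    # Map tokens onto a fixed number of feature buckets. The bucket count
    # (2**16 here) is illustrative, not a value used in our experiments.

    def hashed_features(tokens, n_buckets=2**16):
        vector = defaultdict(int)  # sparse representation: bucket index -> count
        for t in tokens:
            # A stable hash ensures the same token always lands in the same bucket.
            idx = int(hashlib.md5(t.encode("utf-8")).hexdigest(), 16) % n_buckets
            vector[idx] += 1
        return vector

    # Example: the vector always has at most n_buckets dimensions, even as
    # the vocabulary across trends keeps growing.
    print(hashed_features(["nfl", "sunday", "night", "betting", "nfl"]))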
Currently, we use all the tweets belonging to a trend for training followed by classification, relying on cross-validation to partition the data. Further study is required to determine a threshold for the number of tweets required to build a classification model, followed by a time interval for updating the model to account for topic drift.

To validate our experiments further, as previously mentioned, we would like to manually label a large set of the tweets in a one-to-many setting, with each tweet and web page receiving labels from at least two human verifiers. We plan to use Amazon's Mechanical Turk to provide human labels for the tweets and the web pages, followed by checking these labels against the list of tweets from suspended Twitter accounts. Lastly, combining all this information, we would like to run our classification against the manual labels to validate our classification performance.

9. REFERENCES

[1] Twitter data API. http://apiwiki.twitter.com/Twitter-API-Documentation.
[2] Twitter help – Keeping search relevant. http://help.twitter.com/forums/10713/entries/42646.
[3] Twitter help – Twitter rules. http://help.twitter.com/forums/26257/entries/18311.
[4] Twitter: Now more than 1 billion tweets per month. http://royal.pingdom.com/2010/02/10/twitter-now-more-than-1-billion-tweets-per-month/.
[5] J. Attenberg, K. Weinberger, A. Dasgupta, A. Smola, and M. Zinkevich. Collaborative Email-Spam Filtering with the Hashing-Trick. In Proceedings of the Sixth Conference on Email and Anti-Spam (CEAS 2009), 2009.


[6] R. Bellman. Adaptive control processes: a guided tour. 1961.
[7] B. Byun, C.-H. Lee, S. Webb, D. Irani, and C. Pu. An anti-spam filter combination framework for text-and-image emails through incremental learning. In Proceedings of the Sixth Conference on Email and Anti-Spam (CEAS 2009), 2009.
[8] J. Dawn. Twitter blog – Get to the point: Twitter trends. http://blog.twitter.com/2009/11/get-to-point-twitter-trends.html.
[9] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. Adversarial Information Retrieval on the Web, 2005.
[10] D. Harvey. Twitter blog – Trust and safety. http://blog.twitter.com/2010/03/trust-and-safety.html.
[11] B. Huberman, D. Romero, and F. Wu. Social networks that matter: Twitter under the microscope. First Monday, 14(1-5), 2009.
[12] D. Irani, S. Webb, and C. Pu. Study of static classification of social spam profiles in MySpace. In International Conference on Weblogs & Social Media, 2010.
[13] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 56–65, New York, NY, USA, 2007. ACM.
[14] B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about Twitter. In Proceedings of the First Workshop on Online Social Networks, pages 19–24. ACM, 2008.
[15] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, 2010.
[16] K. Lee, J. Caverlee, and S. Webb. Uncovering Social Spammers: Social Honeypots + Machine Learning. In Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), 2010.
[17] M. McGiboney. Twitter's tweet smell of success. http://blog.nielsen.com/nielsenwire/online_mobile/twitters-tweet-smell-of-success/.
[18] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[19] D. Sculley and G. Cormack. Going Mini: Extreme Lightweight Spam Filters. In Proceedings of the Sixth Conference on Email and Anti-Spam (CEAS 2009), 2009.
[20] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[21] B. Stone. Twitter blog – An ongoing battle. http://blog.twitter.com/2008/07/ongoing-battle.html.
[22] B. Stone. Twitter blog – Making progress on spam. http://blog.twitter.com/2008/08/making-progress-on-spam.html.
[23] B. Stone. Twitter blog – What's happening. http://blog.twitter.com/2009/11/whats-happening.html.
[24] G. Stringhini. Spamdetector. http://www.cs.ucsb.edu/~gianluca/spamdetector.html.
[25] S. Webb, J. Caverlee, and C. Pu. Introducing the Webb Spam Corpus: Using email spam to identify web spam automatically. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS 2006), 2006.
[26] S. Webb, J. Caverlee, and C. Pu. Characterizing Web Spam Using Content and HTTP Session Analysis. In Proceedings of the Fourth Conference on Email and Anti-Spam (CEAS 2007), 2007.
[27] S. Webb, J. Caverlee, and C. Pu. Social Honeypots: Making Friends with a Spammer Near You. In Proceedings of the Fifth Conference on Email and Anti-Spam (CEAS 2008), 2008.
[28] S. Webb, J. Caverlee, and C. Pu. Granular Computing System Vulnerabilities: Exploring the Dark Side of Social Networking Communities. In Encyclopedia of Complexity and Systems Science, 2009.
[29] K. Weil. Twitter blog – Measuring tweets. http://blog.twitter.com/2010/02/measuring-tweets.html.
[30] I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. Cunningham. Weka: Practical machine learning tools and techniques with Java implementations. In ICONIP/ANZIIS/ANNES, volume 99, pages 192–196, 1999.
[31] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 412–420. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[32] S. Yardi, D. Romero, G. Schoenebeck, and d. boyd. Detecting spam in a twitter network. First Monday, 15(1), 2010.
