NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...

Improving Twitter Search by Removing Spam and Ranking Results

Steven Connolly & Colm O’Riordan, CIRG, Information Technology, NUIG
s.connolly13@nuigalway.ie, colm.oriordan@nuigalway.ie

Abstract

With the ability to post real-time status updates, Twitter is an excellent source of first-hand news. Twitter search is widely used for retrieving real-time information. However, because of the high volume of spam on Twitter, search results often contain multiple spam Tweets. With this in mind, we aim to identify and remove spam and to rank the remaining Tweets, promoting high Tweet content and novelty in the answer set. We analyse Twitter data to identify a set of features to detect spam. We aim to improve the quality of a search result set by ranking Tweets based on a number of identified heuristics.

Introduction

Twitter is a real-time information network. Since its launch in July 2006, Twitter has acquired 175 million registered users posting 95 million tweets and performing 900,000 search queries per day. Twitter posts, or Tweets, are short text messages with a maximum length of 140 characters.

Twitter Spam

Spam is the electronic transmission of messages, in large volume, to people who do not choose to receive them. Spam accounts exhibit several behaviors, including posting harmful links to malware and phishing web sites, repeatedly posting duplicate Tweets, posting misleading links, and following and un-following accounts to draw attention [1]. Twitter provides limited spam removal. Spam accounts can be reported to Twitter. These accounts are reviewed and investigated for abuse, and any account showing spamming behavior is permanently suspended. However, this method will not prevent spam Tweets from appearing in search results, and spam is getting worse as Twitter becomes more prevalent [2].

Approach

Our research focuses on correctly identifying spam Tweets. Our approach involves gathering a data set of Twitter records.
A sample of 11,000 records was gathered based on various keywords. The Twitter Streaming API was used to gather the data. Each record consisted of information related to the Tweet along with the user account details necessary to our research.

Heuristics

Based on the literature and our own preliminary analysis, we identified a set of heuristics which we use for spam detection and data ranking. These include Follower Count; Account Age; Hash Tag Count; URL Count, which refers to the number of URLs present in a Tweet; Status Count, which refers to the number of Tweets posted from an account; List Count, which refers to the number of lists/groups an account belongs to; and Duplicate Tweets, which refers to duplicate records present in a record set. Following analysis of these heuristics, we were able to identify the techniques that proved most successful for spam detection. Some of our results can be seen below:

Heuristic          Results   Spam   % Spam
0 Followers             93     65      65%
Acc Age < 1 day        145     55      63%
Hash Tags > 8           23     22      95%

Initial analysis showed that individual heuristics can be used to detect spam. Further analysis showed that combinations of particular heuristics detected spam more accurately, e.g. an account created within 24 hours with 40 friends returns only spam Tweets. Analysis confirmed that one of the main characteristics of Twitter spam is repetition. Spam accounts repeatedly post the same Tweet in an effort to be seen. We used the Levenshtein distance to measure the similarity between Tweets and removed very similar Tweets. The results show that 20% of our dataset consists of duplicate or near-duplicate Tweets.

Future Work

To date we have been concerned with identifying heuristics that give high precision. We are now working towards measuring the recall of these heuristics. Following this, we aim to develop a ranking algorithm which will give a score to the remaining Tweets and rank them accurately according to time and query relevance.
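The heuristic checks and the Levenshtein-based removal of very similar Tweets described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Tweet` record, the function names, the length-normalised similarity, and the 0.9 threshold are all assumptions made for the example. Only the three cut-offs (0 followers, account age under one day, more than 8 hash tags) come from the results table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tweet:
    # Hypothetical record layout; the abstract does not specify field names.
    text: str
    follower_count: int
    account_age_days: float
    hashtag_count: int


def heuristic_flags(tweet: Tweet) -> List[str]:
    """Return which of the three example heuristics from the table fire."""
    flags = []
    if tweet.follower_count == 0:
        flags.append("0 Followers")
    if tweet.account_age_days < 1:
        flags.append("Acc Age < 1 day")
    if tweet.hashtag_count > 8:
        flags.append("Hash Tags > 8")
    return flags


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for disjoint."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def remove_near_duplicates(texts: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[str] = []
    for text in texts:
        if all(similarity(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "Win a free phone now http://x.co/1",
        "Win a free phone now http://x.co/2",
        "Galway weather is lovely today",
    ]
    print(remove_near_duplicates(sample))
    print(heuristic_flags(Tweet("spam", 0, 0.5, 9)))
```

The pairwise comparison is quadratic in the number of Tweets, which is acceptable for a sample of about 11,000 records but would need blocking or hashing on a larger stream.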

References

[1] Ryan Kelly, Aug 2009, Twitter Study, Pear Analytics.
[2] Danny Sullivan, Jun 2009, Twitter’s Real Time Spam Problem.

Using Linked Data to Build Open, Collaborative Recommender Systems *

Benjamin Heitmann, Conor Hayes
Digital Enterprise Research Institute, National University of Ireland, Galway
benjamin.heitmann@deri.org, conor.hayes@deri.org

Abstract

While recommender systems can greatly enhance the user experience, the entry barriers in terms of data acquisition are very high, making it hard for new service providers to compete with existing recommendation services. We propose to build open recommender systems, which can utilise Linked Data to mitigate the new-user, new-item, and sparsity problems of collaborative recommender systems. To demonstrate the validity of our approach, we augment the data from a closed collaborative music recommender system with Linked Data, and significantly improve its precision and recall.

1. Problem statement

Most real-world recommender systems employ collaborative filtering [1], which aggregates user ratings for items and uses statistical methods to discover similarities between items. The high entry barriers of providing good recommendations can be characterised by the data acquisition problem [4]: providing recommendations for (a) new items or for (b) new users is a challenge if no data about the item or user is available. If the number of ratings is low compared to the number of items, then (c) sparsity of the data will lead to ineffective recommendations. We propose an alternative to building closed recommender systems: by utilising open data sources from the Linking Open Data (LOD) community project, it is possible to build open recommender systems, which can mitigate the challenges introduced by the data acquisition problem.

2. Background: Linked Data

Linked Data refers to a set of best practices for publishing and connecting structured data on the Web [2], by making semantic information about things and concepts available via RDF and HTTP.
These best practices have been adopted by a steadily growing number of data providers which form the LOD cloud, e.g. DBpedia provides data from Wikipedia pages, and both the US and UK governments have converted data sets to RDF. Social Web sites provide data that is modeled on the principle of object-centered sociality: it connects individuals not just directly into communities, but also indirectly via objects of a social focus, such as a music act. Sites that use the Friend-of-a-Friend (FOAF) vocabulary to publish such data include MySpace and LiveJournal.

Figure 1: Applying collaborative filtering to Linked Data

3. Methodology

Figure 1 shows the steps of processing Linked Data for collaborative recommendations:
(1) Integrating the data about user-item connections from different sources into a common vocabulary.
(2) Transforming the representation of the data from an RDF graph to a user-item matrix.
(3) Applying a specific collaborative filtering algorithm to the user-item matrix.
This approach allows us to “fill in the gaps” in local data by using data with user-item connections from external sources, thus mitigating the data acquisition problem.

4. Evaluation

We have augmented the data from the closed Smart Radio streaming recommendation service (190 users, 330 musicians) with Linked Data from MySpace, adding 11,000 users and 25,000 new connections. We evaluated a binary cosine similarity for the CF algorithm, using Last.fm as a “gold standard” [3]. The result of adding external data was an improvement of precision from 2% to 14%, and of recall from 7% to 33%.

5. References

[1] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions”, IEEE Transactions on Knowledge and Data Engineering, 2005.
[2] C. Bizer, T. Heath and T. Berners-Lee, “Linked data-the story so far”, Journal on Semantic Web and Information Systems, 2009.
[3] J. Herlocker, J.
Konstan et al., “Evaluating collaborative filtering recommender systems”, ACM Transactions on Information Systems, 2004.
[4] A. I. Schein, A. Popescul et al., “Methods and metrics for cold-start recommendations”, Conference on Research and Development in Information Retrieval, 2002.

* This extended abstract is based on B. Heitmann and C. Hayes, “Using Linked Data to Build Open, Collaborative Recommender Systems”, AAAI Spring Symposia, 2010, and funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Líon-2).
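
The three methodology steps of the Linked Data abstract above (integrate to a common vocabulary, transform to a user-item matrix, apply collaborative filtering with binary cosine similarity) can be sketched as follows. This is an illustration under stated assumptions, not the authors' system: triples are modelled as plain Python tuples rather than parsed RDF, the predicate names `local:listenedTo` and `ext:topFriend` are hypothetical placeholders, and a user-based neighbourhood scheme is assumed because the abstract does not say which CF variant was used.

```python
from collections import defaultdict
from math import sqrt

# Step 1: map source-specific predicates onto one common predicate.
# All predicate names here are hypothetical placeholders.
COMMON_PREDICATE = "likes"
PREDICATE_MAP = {
    "local:listenedTo": COMMON_PREDICATE,
    "ext:topFriend": COMMON_PREDICATE,
}


def integrate(triples):
    """Keep only user-item triples and rewrite them to the common vocabulary."""
    return [(s, PREDICATE_MAP[p], o) for s, p, o in triples if p in PREDICATE_MAP]


def to_user_item_sets(triples):
    """Step 2: a binary user-item matrix stored sparsely as user -> set of items."""
    items_by_user = defaultdict(set)
    for user, _, item in triples:
        items_by_user[user].add(item)
    return dict(items_by_user)


def binary_cosine(a, b):
    """Cosine similarity of two binary vectors given as sets of item ids."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(target, items_by_user, top_n=3):
    """Step 3: score unseen items by the similarity of the users who hold them."""
    own = items_by_user[target]
    scores = defaultdict(float)
    for other, items in items_by_user.items():
        if other == target:
            continue
        sim = binary_cosine(own, items)
        if sim == 0.0:
            continue
        for item in items - own:
            scores[item] += sim
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


if __name__ == "__main__":
    local = [("u1", "local:listenedTo", "ActA"), ("u1", "local:listenedTo", "ActB")]
    external = [
        ("u2", "ext:topFriend", "ActA"),
        ("u2", "ext:topFriend", "ActB"),
        ("u2", "ext:topFriend", "ActC"),
        ("u3", "ext:topFriend", "ActD"),
    ]
    matrix = to_user_item_sets(integrate(local + external))
    print(recommend("u1", matrix))
```

In the toy data, the local user `u1` has no neighbour until the external connections are merged in, which is the "fill in the gaps" effect the methodology describes.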
