NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...

String Matching In Large Data Sets
Liam Lynch & Colm O’Riordan
L.Lynch4@nuigalway.ie

Abstract
The goal of this research is to create a search engine that correctly ranks search results in terms of phonetic, semantic and orthographic string similarity. Existing and custom matching algorithms are combined and then measured to find the combination that gives the best performance and accuracy.

1. Introduction & Commercial Application
This project is being undertaken with contributions from both NUI Galway and a privately owned company, Enterprise Registry Solutions Ltd (ERS). ERS is primarily involved in building electronic registries for government agencies worldwide, such as the Companies Register Office in Ireland. One common concern with company name registration is ensuring that similar names are not registered to operate in the same jurisdiction. A failure to enforce distinctiveness among registered business names within the same jurisdiction can lead to a number of problems: identity theft, complexity when tracking cross-border mergers or groups, and deliberate misrepresentation in order to gain market share and damage competitors.

With the growth of the EU it is common for businesses to register and trade in multiple regions. In response to this, and to ensure transparency, there have been calls for a Europe-wide Central Companies Register [1]. The recent activity in this area highlights the need for a scalable and accurate search engine. The proposed system is labelled the Registered Organisation Search Engine (ROSE).

2. String Searching Background
Fundamentally, this project is an examination of the performance of data retrieval methods within a specific problem domain. String similarity measures such as Hamming Distance, Levenshtein Edit Distance and Jaro Winkler all return similarity scores integral to the ultimate ranking of results. The Hamming distance between two strings of equal length is the number of characters that must be substituted to transform one string into the other. Levenshtein edit distance is similar but also allows characters to be inserted and deleted as well as substituted; the Damerau variant additionally allows transposition of adjacent characters. The Jaro Winkler distance gives additional weighting to terms with matching leading substrings. The Jaro distance ($d_j$) of two given strings $s_1$ and $s_2$ can be computed as follows:

$$d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$$

where $m$ is the number of matching characters and $t$ is the number of transpositions.

3. Problem & Approach
The ROSE system will attempt to unify some common algorithms and apply them in a unique way to improve on the current state of the art in company name searching. In addition to traditional string matching approaches (based on syntax), the system adopts other approaches based on the semantic similarity of company names. Given a query term (a proposed company name), we can rank similar existing company names based on spelling, meaning and other factors. String matching algorithms will run sequentially with phonetic (double metaphone) and semantic (synonym substitution) methods to compute an overall similarity score between terms. By applying all of these techniques, the accuracy of the system can be improved over traditional approaches. A test data set has been collected and is used to measure the effectiveness of each algorithm. Furthermore, by separating the main matching engines into distributed services, performance and scalability can be ensured.

4. Architecture & Design
System performance and scalability are considered a high priority for the success of the ROSE system, and the architecture has been designed to maximise them in two ways. First, the .NET Framework's CLR integration is used to compile complex C# code as managed code within the database. Second, the most CPU-intensive procedures have been designed as separate services that can be run independently of each other and on separate machines, which greatly increases scalability.
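The string similarity measures described in Section 2 can be illustrated with a minimal Python sketch. This is not the ROSE implementation (which the text describes as C# code); the function names are assumptions made for this example, and the matching window used in `jaro_similarity` follows the usual definition of the Jaro measure rather than anything stated in the text.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro score (m/|s1| + m/|s2| + (m - t)/m) / 3; 0.0 when nothing matches."""
    if not s1 or not s2:
        return 0.0  # assumption: empty input is treated as no match
    # Characters only count as matching within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matched1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched1.append(c)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [c for c, flag in zip(s2, used) if flag]
    # t is half the number of matched characters that appear in a different order.
    t = sum(x != y for x, y in zip(matched1, matched2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3
```

For two hypothetical company names such as "Acme Ltd" and "Acne Ltd", both `hamming_distance` and `levenshtein_distance` return 1, since a single substitution separates them. A Jaro Winkler score would additionally boost the Jaro score for names that share a leading substring; that step is left out of this sketch.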
[Figure: system components: ROSE Interface, Results Weighting Service, Semantic Search Service, Phonetic Search Service, Orthographic Search Service, Database]

5. Conclusion
To date, competitive performance (accuracy) and efficiency have been achieved on several of the individual components. Methods to combine evidence from the multiple approaches are currently being evaluated.

6. References
[1] Amending Directives 89/666/EEC, 2005/56/EC and 2009/101/EC as regards the interconnection of central, commercial and companies’ registers.
[2] Europa Press release: IP/11/221
Reconstruction of Threads in Internet Forums
Erik Aumayr, Jeffrey Chan and Conor Hayes
Digital Enterprise Research Institute, NUI Galway, Ireland
{erik.aumayr, jkc.chan, conor.hayes}@deri.org

Abstract
Online discussion boards, or Internet forums, are a significant part of the Internet. People use Internet forums to post questions, provide advice and participate in discussions. These online conversations are represented as threads, and the conversation trees within these threads are important in understanding the behaviour of online users. Unfortunately, the reply structures of these threads are generally not publicly accessible or not maintained. Hence, we introduce an efficient and simple approach to reconstruct the reply structure in threaded conversations. We contrast its accuracy against an existing and a baseline algorithm.

1. Introduction
Internet forums are an important part of the web, where questions are asked and answered and where all types of topics are discussed in public. In forums, conversations are represented as sequences of posts, or threads, where the posts are replies to one or more earlier posts. A link exists between two posts if one is a direct reply to the other. However, the reply structure of threads is not always available. For instance, the structure may not be maintained by the provider, or may have been lost. We propose a new method to reconstruct the reply structure of posts in forums. It uses a set of simple features and a decision tree classifier to reconstruct the reply structure of threads. We evaluate the accuracy of the algorithm against an existing approach and a heuristic baseline approach.

2. Methodology

Definitions
A post in a thread provides us with the following basic information: creation date, name of author, quoting (name of the quoted author) and content. The creation date of posts establishes a chronological order. From that ordering we can compute the distance of one post to another. Distance means how far a reply is from the post it answers.
If there is no other post between a post and its reply, then they have a post distance of 1. If there is one other post in between, then the distance is 2, and so forth. Note that the data we use stores reply interactions such that each post replies to exactly one other post. Although a user can reply to several posts at once, and our approach is able to return more than one reply candidate, we limit replies to one target post in our evaluation.

Baseline approaches
In our data, we found that 79.7% of the replies have a post distance of 1, i.e. they directly follow the post they refer to. Hence, our first baseline approach, called “1-Distance Linking”, is to link each post to its immediate predecessor. Wang et al. 2008 [1] introduced a thread reconstruction approach that relies on content similarity and post distance. That serves as our second baseline approach.

Features
Based on the information a pair of posts provides, we extract the following features for our classification task: reply distance, posting-time difference, author quoted and cosine similarity. The cosine similarity compares the contents of two posts and returns a similarity score from 0 to 1, where 0 means not similar and 1 means exactly equal.

Classifier
As a classifier, we investigate the widely used C4.5 decision tree algorithm. Owing to its relative simplicity, it handles large amounts of data very efficiently, which is important for our goal of presenting a fast and efficient way of reconstructing threads.

3. Evaluation
For the evaluation we use a subset of our Boards.ie dataset, namely 13,100 threads consisting of 133,200 posts in total. To compare the classification results of the approaches, we use precision, recall and F-score, where F-score is the harmonic mean of precision and recall. For training the classifier, we applied 10-fold cross-validation to minimise bias.
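The “1-Distance Linking” baseline and the evaluation measures can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the post identifiers and the five-post thread are hypothetical, and counting precision and recall over individual reply links is an assumption about how the measures are applied.

```python
def one_distance_linking(thread: list[str]) -> dict[str, str]:
    """Baseline: link every post (except the first) to its immediate predecessor.

    `thread` is a list of post ids in chronological order.
    """
    return {post: prev for prev, post in zip(thread, thread[1:])}


def precision_recall_f(predicted: dict[str, str], actual: dict[str, str]):
    """Compare predicted reply links (post -> parent) with the true ones."""
    correct = sum(1 for post, parent in predicted.items()
                  if actual.get(post) == parent)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical five-post thread; p4 replies to p2 (post distance 2),
# every other reply has post distance 1.
thread = ["p1", "p2", "p3", "p4", "p5"]
actual = {"p2": "p1", "p3": "p2", "p4": "p2", "p5": "p4"}

predicted = one_distance_linking(thread)
precision, recall, f_score = precision_recall_f(predicted, actual)
```

In this toy thread the baseline gets three of the four reply links right, so precision, recall and F-score are all 0.75. When every reply receives exactly one predicted link, as here, the three measures coincide.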
Table 1 shows the comparison between our classification algorithm “ThreadRecon” and the two baseline approaches.

Approach              F-score
Wang et al. 2008      44.40%
1-Distance Linking    79.70%
ThreadRecon           85.70%

Table 1: F-score comparison between ThreadRecon and baseline approaches

An extended version of this work will be published in “Reconstruction of Threaded Conversations in Online Discussion Forums”, International Conference on Weblogs and Social Media 2011.

4. References
[1] Wang, Joshi, Cohen and Rosé. 2008. Recovering implicit thread structure in newsgroup style conversations. In Proceedings of the 2nd International Conference on Weblogs and Social Media (ICWSM II), 152–160.