NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Abstract<br />
The goal of this research is to create a search engine<br />
that correctly ranks search results in terms of phonetic,<br />
semantic and orthographic string similarity. Existing<br />
and custom matching algorithms are combined and<br />
then measured to find the greatest level of performance<br />
and accuracy.<br />
1. Introduction& Commercial Application<br />
This project is being undertaken with contributions<br />
from both <strong>NUI</strong> <strong>Galway</strong> and a privately owned<br />
company, Enterprise Registry Solutions Ltd (ERS).<br />
ERS are primarily involved with building electronic<br />
registries for government agencies worldwide such as<br />
the Companies Register Office in Ireland. One common<br />
concern with company name registration is trying to<br />
ensure similar names are not registered to operate in the<br />
same jurisdiction.<br />
A failure to enforce distinctiveness among registered<br />
business names within the same jurisdiction can lead to<br />
a number of problems: identity theft, complexity when<br />
tracking cross border mergers or groups, and deliberate<br />
misrepresentation in order to gain market share and<br />
damage competitors.<br />
With the growth of the EU it is common for<br />
businesses to register and trade in multiple regions. In<br />
response to this and to ensure transparency there have<br />
been calls for a European wide Central Companies<br />
Register [1]. The recent activity in this area highlights<br />
the need for a scalable and accurate search engine. The<br />
proposed system is labelled the Registered Organisation<br />
Search Engine (ROSE).<br />
2. String Searching Background<br />
Fundamentally this project is an examination of the<br />
performance of data retrieval methods within a specific<br />
problem domain. The string similarity measures such as<br />
Hamming Distance, Levenshtein Edit Distance and Jaro<br />
Winkler all return similarity scores integral to the<br />
ultimate ranking of results.<br />
The hamming distance between two strings<br />
can be computed by determining how many characters<br />
must be substituted to transform one string to match the<br />
other. Levenshtein Edit distances are similar but allow<br />
addition, subtraction, substitution and transposition<br />
between strings.<br />
The Jaro Winkler Distance gives additional<br />
weighting to terms with matching leading substrings.<br />
The distance (dj) of two given strings s1 and s2can be<br />
computed as follows<br />
(<br />
where m is the number of matching characters and t is<br />
the number of transpositions.<br />
String Matching In Large Data Sets<br />
Liam Lynch & Colm O’Riordan<br />
L.Lynch4@nuigalway.ie<br />
)<br />
79<br />
3. Problem & Approach<br />
The ROSE system will attempt to unify some<br />
common algorithms and apply them in a unique way to<br />
improve the current state of the art used for company<br />
name searching.<br />
In addition to traditional string matching approaches<br />
(based on syntax) the system adopts other approaches<br />
based on semantic similarity of company names. Given<br />
a query term (proposed company name), we can rank<br />
similar existing company names based on spelling,<br />
meaning and other factors. Matching algorithms will<br />
run sequentially with phonetic (double metaphone) and<br />
semantic (synonym substitution) methods to compute<br />
an overall similarity score between terms. By applying<br />
all of these techniques the accuracy of the system can<br />
be improved over traditional approaches. A test data set<br />
has been collected and is used to measure the<br />
effectiveness of each algorithm. Furthermore, by<br />
separating the main matching engines into distributed<br />
services performance and scalability can be ensured.<br />
4. Architecture &Design<br />
System performance and scalability are considered<br />
of high priority in the success of the ROSE system and<br />
the architecture has been designed to maximize this by<br />
Utilising the .Net frameworks CLR integration<br />
to compile complex C# code as managed code<br />
within the database<br />
The most CPU intensive procedures have been<br />
designed as separate services that can be run<br />
independently of each other and on separate<br />
machines. This greatly increases scalability.<br />
ROSE Interface<br />
Results Weighting<br />
Service<br />
Semantic Search<br />
Service<br />
Phonetic Search<br />
Service<br />
Orthographic Search<br />
Service<br />
Database<br />
5. Conclusion<br />
To date, competitive performance (accuracy) and<br />
efficiency has been achieved on several of the<br />
individual components. Methods to combine evidence<br />
from the multiple approaches are currently being<br />
evaluated.<br />
6. References<br />
[1] Amending Directives 89/666/EEC, 2005/56/EC and<br />
2009/101/EC as regards the interconnection of central,<br />
commercial and companies’ registers.<br />
[2] Europa Press release: IP/11/221