29.06.2013 Views

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Abstract<br />

The goal of this research is to create a search engine<br />

that correctly ranks search results in terms of phonetic,<br />

semantic and orthographic string similarity. Existing<br />

and custom matching algorithms are combined and<br />

then measured to find the greatest level of performance<br />

and accuracy.<br />

1. Introduction& Commercial Application<br />

This project is being undertaken with contributions<br />

from both <strong>NUI</strong> <strong>Galway</strong> and a privately owned<br />

company, Enterprise Registry Solutions Ltd (ERS).<br />

ERS are primarily involved with building electronic<br />

registries for government agencies worldwide such as<br />

the Companies Register Office in Ireland. One common<br />

concern with company name registration is trying to<br />

ensure similar names are not registered to operate in the<br />

same jurisdiction.<br />

A failure to enforce distinctiveness among registered<br />

business names within the same jurisdiction can lead to<br />

a number of problems: identity theft, complexity when<br />

tracking cross border mergers or groups, and deliberate<br />

misrepresentation in order to gain market share and<br />

damage competitors.<br />

With the growth of the EU it is common for<br />

businesses to register and trade in multiple regions. In<br />

response to this and to ensure transparency there have<br />

been calls for a European wide Central Companies<br />

Register [1]. The recent activity in this area highlights<br />

the need for a scalable and accurate search engine. The<br />

proposed system is labelled the Registered Organisation<br />

Search Engine (ROSE).<br />

2. String Searching Background<br />

Fundamentally this project is an examination of the<br />

performance of data retrieval methods within a specific<br />

problem domain. The string similarity measures such as<br />

Hamming Distance, Levenshtein Edit Distance and Jaro<br />

Winkler all return similarity scores integral to the<br />

ultimate ranking of results.<br />

The hamming distance between two strings<br />

can be computed by determining how many characters<br />

must be substituted to transform one string to match the<br />

other. Levenshtein Edit distances are similar but allow<br />

addition, subtraction, substitution and transposition<br />

between strings.<br />

The Jaro Winkler Distance gives additional<br />

weighting to terms with matching leading substrings.<br />

The distance (dj) of two given strings s1 and s2can be<br />

computed as follows<br />

(<br />

where m is the number of matching characters and t is<br />

the number of transpositions.<br />

String Matching In Large Data Sets<br />

Liam Lynch & Colm O’Riordan<br />

L.Lynch4@nuigalway.ie<br />

)<br />

79<br />

3. Problem & Approach<br />

The ROSE system will attempt to unify some<br />

common algorithms and apply them in a unique way to<br />

improve the current state of the art used for company<br />

name searching.<br />

In addition to traditional string matching approaches<br />

(based on syntax) the system adopts other approaches<br />

based on semantic similarity of company names. Given<br />

a query term (proposed company name), we can rank<br />

similar existing company names based on spelling,<br />

meaning and other factors. Matching algorithms will<br />

run sequentially with phonetic (double metaphone) and<br />

semantic (synonym substitution) methods to compute<br />

an overall similarity score between terms. By applying<br />

all of these techniques the accuracy of the system can<br />

be improved over traditional approaches. A test data set<br />

has been collected and is used to measure the<br />

effectiveness of each algorithm. Furthermore, by<br />

separating the main matching engines into distributed<br />

services performance and scalability can be ensured.<br />

4. Architecture &Design<br />

System performance and scalability are considered<br />

of high priority in the success of the ROSE system and<br />

the architecture has been designed to maximize this by<br />

Utilising the .Net frameworks CLR integration<br />

to compile complex C# code as managed code<br />

within the database<br />

The most CPU intensive procedures have been<br />

designed as separate services that can be run<br />

independently of each other and on separate<br />

machines. This greatly increases scalability.<br />

ROSE Interface<br />

Results Weighting<br />

Service<br />

Semantic Search<br />

Service<br />

Phonetic Search<br />

Service<br />

Orthographic Search<br />

Service<br />

Database<br />

5. Conclusion<br />

To date, competitive performance (accuracy) and<br />

efficiency has been achieved on several of the<br />

individual components. Methods to combine evidence<br />

from the multiple approaches are currently being<br />

evaluated.<br />

6. References<br />

[1] Amending Directives 89/666/EEC, 2005/56/EC and<br />

2009/101/EC as regards the interconnection of central,<br />

commercial and companies’ registers.<br />

[2] Europa Press release: IP/11/221

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!