29.06.2013 Views

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...


Curated Entities for Enterprise

Umair ul Hassan, Edward Curry, Seán O’Riain

Digital Enterprise Research Institute

National University of Ireland, Galway

umair.ul.hassan@deri.org, ed.curry@deri.org, sean.oriain@deri.org

Abstract

We propose an entity consolidation system with user collaboration. First, source data is converted into an entity-attribute-value data model. The system then finds equivalence relationships at the entity, attribute, and value levels. The confidence of these relationships and any data conflicts are kept as intermediate results. Users provide feedback on the results iteratively. Finally, the corrected data, together with lineage and provenance information, is written to a curated database of entities.

1. Introduction

The amount of data generated and stored in organizations is increasing as more business processes are automated. Often this data relates to specific entities of business interest, such as people, products, and customers. Users collect, integrate, and standardize this data for analysis. Teams of analysts and skilled IT staff spend a significant amount of time and effort bringing it all together in one place.

2. Problem Statement

Integrating data from disparate sources generates uncertain results [1]. For example, if an analyst integrates data about the iPod from two sources, the following types of uncertainty can occur for its price:

Absence: no price is given

Conflicts: the price is 150€ and also 160€

Vagueness: the price is given as High

Non-specificity: the price is between 150€ and 160€
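The four uncertainty types listed above can be told apart mechanically once the reported values are in hand. The following is a minimal sketch, not part of the proposed system: the `Uncertainty` enum, the `classify` function, and the convention of encoding a missing value as `None`, a vague value as a string, and a range as a `(low, high)` tuple are all assumptions made for illustration.

```python
from enum import Enum


class Uncertainty(Enum):
    ABSENCE = "absence"
    CONFLICT = "conflict"
    VAGUENESS = "vagueness"
    NON_SPECIFICITY = "non-specificity"


def classify(values):
    """Return the uncertainty types present in the values reported for one attribute.

    Assumed encoding (hypothetical): None = missing, number = exact value,
    string = qualitative label, (low, high) tuple = range.
    """
    found = set()
    if not values or any(v is None for v in values):
        found.add(Uncertainty.ABSENCE)
    # Two or more distinct exact values contradict each other.
    exact = {v for v in values if isinstance(v, (int, float))}
    if len(exact) > 1:
        found.add(Uncertainty.CONFLICT)
    if any(isinstance(v, str) for v in values):
        found.add(Uncertainty.VAGUENESS)
    if any(isinstance(v, tuple) for v in values):
        found.add(Uncertainty.NON_SPECIFICITY)
    return found
```

For instance, `classify([150, 160])` reports a conflict, while `classify([(150, 160)])` reports non-specificity.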

3. Proposal

We propose to develop an entity consolidation system that supports iterative cleaning of uncertain data with user feedback [2]. Figure 1 illustrates the major process flow steps of our prototype.

Figure 1: Process flow of entity consolidation with management of uncertain data using iterative user feedback


3.1. Entity Consolidation

The process starts by converting the source data into a common entity-attribute-value format [3]. This is followed by three associated tasks: mapping schema attributes between sources, comparing individual entities for equivalence, and merging attribute values for equivalent entities.
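As an illustration of the conversion step, the sketch below flattens row-oriented records into entity-attribute-value tuples. It assumes each source delivers its records as dictionaries with one key field; the function name `to_eav`, the source names, and the sample records are hypothetical and not taken from the prototype.

```python
def to_eav(source, rows, key):
    """Flatten row-oriented records into (source, entity, attribute, value) tuples."""
    triples = []
    for row in rows:
        entity = row[key]
        for attribute, value in row.items():
            if attribute == key or value is None:
                # The key field names the entity; missing values yield no tuple.
                continue
            triples.append((source, entity, attribute, value))
    return triples


# Two hypothetical sources describing the same product with different schemas.
shop_a = [{"sku": "ipod-1", "name": "iPod", "price": 150}]
shop_b = [{"id": "B7", "title": "iPod", "cost": 160, "colour": None}]

eav = to_eav("shop_a", shop_a, "sku") + to_eav("shop_b", shop_b, "id")
```

Once both sources share this shape, attribute mapping (for example `price` to `cost`) and entity comparison can operate on plain tuples rather than on source-specific schemas.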

3.2. Uncertain Data

Automated entity consolidation generates results with confidence scores for the equivalence between entities. Conflicts between data values also exist among matched entities. All of this information is stored in a temporary database for resolution.
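The text does not say how confidence scores are computed. One simple possibility, shown here purely as an assumption, is to score a pair of entities by the string similarity of their names using `difflib.SequenceMatcher` from the Python standard library, and to record every shared attribute whose values differ as a conflict. The function names and sample entities are hypothetical, and the attributes are assumed to have been mapped to common names already.

```python
from difflib import SequenceMatcher


def match_confidence(a, b, name_attr="name"):
    """Score entity equivalence in [0, 1] from the similarity of the two names."""
    return SequenceMatcher(None, a[name_attr].lower(), b[name_attr].lower()).ratio()


def find_conflicts(a, b):
    """Return the attributes both entities define but with different values."""
    shared = a.keys() & b.keys()
    return {attr: (a[attr], b[attr]) for attr in shared if a[attr] != b[attr]}


# Hypothetical entities after attribute mapping.
a = {"name": "iPod", "price": 150, "colour": "white"}
b = {"name": "iPod", "price": 160}

# Intermediate result kept for later resolution by users.
pending = {
    "confidence": match_confidence(a, b),
    "conflicts": find_conflicts(a, b),
}
```

A record such as `pending` is the kind of intermediate result that would be held in the temporary database until a user resolves it.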

3.3. User Feedback

Users provide feedback on uncertain data in two forms: either by validating possible choices or by providing generic rules for repairs. Having people with domain expertise collaborate to improve the quality of the integration result adds value to the overall process. This is similar to the curation of reference works and dictionaries by domain experts [4].
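The two forms of feedback can be sketched as follows. This is an illustrative assumption about how they might be represented, not the prototype's interface: `validate` models a user picking one of the candidate values, and `prefer_source` models one kind of generic repair rule (always trust a named source). The layout of the `conflict` dictionary is also hypothetical.

```python
def validate(conflict, chosen):
    """A user validates one of the candidate values for a conflicting attribute."""
    if chosen not in conflict["by_source"].values():
        raise ValueError("chosen value is not one of the candidates")
    return chosen


def prefer_source(trusted):
    """Build a generic repair rule that always takes the value from `trusted`."""
    def rule(conflict):
        # Returns None when the trusted source reported no value.
        return conflict["by_source"].get(trusted)
    return rule


# Hypothetical conflict record for one attribute of one entity.
conflict = {"attribute": "price", "by_source": {"shop_a": 150, "shop_b": 160}}

validated_price = validate(conflict, 160)
ruled_price = prefer_source("shop_a")(conflict)
```

A validation settles a single conflict, whereas a rule can be reapplied to every later conflict of the same kind, which is what makes the feedback iterative.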

3.4. Provenance

Provenance information about data sources, entities, and user feedback is stored to track the lineage of the data. This information serves as an indicator of trust for consumers of the entity database, and it can be further utilized to support data cleaning tasks automatically.
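A provenance record of this kind could be attached to each curated value. The sketch below is one possible shape, with every class and field name (`Provenance`, `CuratedValue`, `resolved_by`, `method`) assumed for illustration rather than drawn from the prototype.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    sources: tuple      # data sources that contributed candidate values
    resolved_by: str    # user (or rule) that produced the final value
    method: str         # "validation" or "rule"
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class CuratedValue:
    entity: str
    attribute: str
    value: object
    provenance: Provenance


def lineage(v):
    """Summarize where a curated value came from and how it was resolved."""
    p = v.provenance
    return (
        f"{v.entity}.{v.attribute}={v.value!r} "
        f"from {', '.join(p.sources)} via {p.method} by {p.resolved_by}"
    )


price = CuratedValue(
    "iPod", "price", 160,
    Provenance(("shop_a", "shop_b"), "analyst_1", "validation"),
)
```

Calling `lineage(price)` gives consumers a one-line account of the sources and the feedback behind the stored value, which is the basis for treating provenance as a trust indicator.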

4. References

[1] M. Magnani and D. Montesi, “A Survey on Uncertainty Management in Data Integration,” Journal of Data and Information Quality, vol. 2, Jul. 2010, p. 33.

[2] M.J. Franklin, “Dataspaces: Progress and Prospects,” Dataspace: The Final Frontier, A.P. Sexton, ed., Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 1-3.

[3] P.M. Nadkarni, L. Marenco, R. Chen, E. Skoufos, G. Shepherd, and P. Miller, “Organization of Heterogeneous Scientific Data Using the EAV / CR Representation,” Journal of the American Medical Informatics Association, vol. 6, 1999, pp. 478-493.

[4] P. Buneman, J. Cheney, W.-C. Tan, and S. Vansummeren, “Curated databases,” Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS ’08, New York, New York, USA: ACM Press, 2008.
