NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
Curated Entities for Enterprise<br />
Umair ul Hassan, Edward Curry, Seán O’Riain<br />
Digital Enterprise Research Institute<br />
National University of Ireland, Galway<br />
umair.ul.hassan@deri.org, ed.curry@deri.org, sean.oriain@deri.org<br />
Abstract<br />
We propose an entity consolidation system that relies on<br />
user collaboration. First, source data is converted into an<br />
entity-attribute-value data model. The system then finds<br />
equivalence relationships at the entity, attribute, and value<br />
levels. The confidence of these relationships and any data<br />
conflicts are kept as intermediate results. Users provide<br />
feedback on the results iteratively. Finally, the corrected data,<br />
together with lineage and provenance information, is written<br />
to a curated database of entities.<br />
1. Introduction<br />
The amount of data generated and stored in<br />
organizations is increasing as more business processes<br />
are automated. Often this data relates to specific<br />
entities of business interest such as people, products, and<br />
customers. Users collect, integrate, and standardize this<br />
data for analysis. Teams of analysts and skilled IT staff<br />
spend a significant amount of time and effort bringing this<br />
data together in one place.<br />
2. Problem Statement<br />
Integrating data from disparate sources generates<br />
uncertain results [1]. For example, if an analyst<br />
integrates data about the iPod from two sources, the following<br />
types of uncertainty can occur for its price:<br />
Absence: no price<br />
Conflicts: price is 150€ and also 160€<br />
Vagueness: price is given as High<br />
Non-specificity: price is between 150€ and 160€<br />
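
The four cases above can be told apart mechanically once the reported values for an attribute are collected. The following is a minimal sketch, not part of the proposed system: the `Uncertainty` enum, the `classify` function, the value encoding (None for missing, a string for a vague label, a tuple for a range), and the order of the checks are all assumptions made for illustration.

```python
from enum import Enum


class Uncertainty(Enum):
    NONE = "none"
    ABSENCE = "absence"
    CONFLICT = "conflict"
    VAGUENESS = "vagueness"
    NON_SPECIFICITY = "non-specificity"


def classify(values):
    """Classify the values reported for one attribute of one entity.

    Assumed encoding: None = missing, str = vague label,
    (low, high) tuple = range, number = exact value.
    """
    present = [v for v in values if v is not None]
    if not present:
        return Uncertainty.ABSENCE
    if any(isinstance(v, str) for v in present):
        return Uncertainty.VAGUENESS
    if any(isinstance(v, tuple) for v in present):
        return Uncertainty.NON_SPECIFICITY
    if len(set(present)) > 1:
        return Uncertainty.CONFLICT
    return Uncertainty.NONE


# The iPod price example from the list above.
print(classify([None, None]))    # Uncertainty.ABSENCE
print(classify([150, 160]))      # Uncertainty.CONFLICT
print(classify(["High"]))        # Uncertainty.VAGUENESS
print(classify([(150, 160)]))    # Uncertainty.NON_SPECIFICITY
```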
3. Proposal<br />
We propose to develop an entity consolidation<br />
system that supports iterative cleaning of uncertain<br />
data with user feedback [2]. Figure 1 illustrates the major<br />
process flow steps of our prototype.<br />
Figure 1: Process flow of entity consolidation with management of<br />
uncertain data using iterative user feedback<br />
3.1. Entity Consolidation<br />
The process starts by converting source data into a<br />
common entity-attribute-value format [3]. This is followed by<br />
three associated tasks: mapping schema attributes<br />
between sources, comparing individual entities for<br />
equivalence, and merging attribute values for equivalent<br />
entities.<br />
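
As a rough illustration of these steps, the sketch below flattens two row-style sources into entity-attribute-value tuples, applies an attribute mapping, and merges values for entities judged equivalent. Everything named here is hypothetical: the `ATTRIBUTE_MAP` table, the `normalize_entity` key (simple case and whitespace folding), and the sample shop records are assumptions, and a real matcher would be considerably more involved.

```python
from collections import defaultdict


def to_eav(source, records, key):
    """Flatten row-style records into (source, entity, attribute, value) tuples."""
    return [
        (source, rec[key], attr, val)
        for rec in records
        for attr, val in rec.items()
        if attr != key
    ]


# Hypothetical mapping of source-specific attribute names to a shared name.
ATTRIBUTE_MAP = {"cost": "price", "title": "name"}


def normalize_entity(entity_id):
    """Naive equivalence key (an assumption): ignore case and whitespace."""
    return "".join(str(entity_id).lower().split())


def consolidate(tuples):
    """Collect every value seen for each (equivalent entity, mapped attribute)."""
    merged = defaultdict(lambda: defaultdict(set))
    for _source, entity, attr, val in tuples:
        merged[normalize_entity(entity)][ATTRIBUTE_MAP.get(attr, attr)].add(val)
    return merged


shop_a = to_eav("A", [{"id": "iPod", "price": 150}], key="id")
shop_b = to_eav("B", [{"id": "ipod", "cost": 160}], key="id")
result = consolidate(shop_a + shop_b)
print(sorted(result["ipod"]["price"]))  # [150, 160]
```

Keeping all values in a set, rather than picking one, leaves the conflict visible for later resolution.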
3.2. Uncertain Data<br />
Automated entity consolidation generates results<br />
with confidence scores for equivalence between<br />
entities. Conflicts between data values can also exist among<br />
matched entities. All of this information is stored in a<br />
temporary database for resolution.<br />
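
One way to picture such an intermediate result is a record that pairs a match confidence with the conflicting values. The sketch below is an assumption-laden example: it uses string similarity from Python's standard `difflib` module as a stand-in confidence score, which the source does not specify, and the `candidate` record layout is hypothetical.

```python
from difflib import SequenceMatcher


def match_confidence(name_a, name_b):
    """Stand-in confidence score: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def find_conflicts(values_a, values_b):
    """Return attributes present in both entities whose values differ."""
    return {
        attr: (values_a[attr], values_b[attr])
        for attr in values_a.keys() & values_b.keys()
        if values_a[attr] != values_b[attr]
    }


# Hypothetical intermediate record kept for later resolution.
candidate = {
    "pair": ("A:iPod", "B:ipod"),
    "confidence": match_confidence("iPod", "ipod"),
    "conflicts": find_conflicts(
        {"price": 150, "brand": "Apple"},
        {"price": 160, "brand": "Apple"},
    ),
}
print(candidate["confidence"])  # 1.0
print(candidate["conflicts"])   # {'price': (150, 160)}
```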
3.3. User Feedback<br />
Users provide feedback on uncertain data in two<br />
forms: by validating possible choices or by<br />
providing generic rules for repairs. Having people with<br />
domain expertise collaborate to improve the quality of the<br />
integration result adds value to the overall process. This is<br />
similar to the curation of reference works and<br />
dictionaries by domain experts [4].<br />
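
The two forms of feedback could look roughly as follows. This is only a sketch under assumed names: the conflict record layout and the functions `apply_validation` and `apply_rule` are hypothetical, and the "lowest price wins" rule is an arbitrary example of a generic repair rule.

```python
def apply_validation(conflict, chosen):
    """Form 1: a user confirms one of the candidate values."""
    if chosen not in conflict["candidates"]:
        raise ValueError("chosen value is not among the candidates")
    return {**conflict, "resolved": chosen, "resolved_by": "validation"}


def apply_rule(conflict, rule):
    """Form 2: a user supplies a generic rule that picks a value."""
    return {**conflict, "resolved": rule(conflict["candidates"]), "resolved_by": "rule"}


conflict = {"entity": "ipod", "attribute": "price", "candidates": [150, 160]}

validated = apply_validation(conflict, 160)  # the user picks 160
by_rule = apply_rule(conflict, min)          # example rule: lowest price wins
print(validated["resolved"], by_rule["resolved"])  # 160 150
```

A rule is reusable across many conflicts, whereas a validation settles only the single conflict it was given for.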
3.4. Provenance<br />
Provenance information about data sources, entities,<br />
and user feedback is stored to track the lineage of data.<br />
This information serves as an indicator of trust for consumers<br />
of the entity database, and it can be further utilized to<br />
support data cleaning tasks automatically.<br />
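
A provenance entry of this kind might be modelled as an immutable record attached to each curated value. The sketch below is an assumption, not the system's schema: the `ProvenanceRecord` fields, the `lineage` helper, and the sample user name are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance entry for one curated attribute value."""
    entity: str
    attribute: str
    value: object
    sources: tuple    # which data sources contributed
    feedback_by: str  # which user validated or repaired the value
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def lineage(records, entity, attribute):
    """Return the provenance history of one attribute, in stored order."""
    return [r for r in records if r.entity == entity and r.attribute == attribute]


log = [
    ProvenanceRecord("ipod", "price", 160, ("A", "B"), "analyst1"),
    ProvenanceRecord("ipod", "brand", "Apple", ("A",), "analyst1"),
]
history = lineage(log, "ipod", "price")
print(history[0].value, history[0].sources)  # 160 ('A', 'B')
```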
4. References<br />
[1] M. Magnani and D. Montesi, “A Survey on<br />
Uncertainty Management in Data Integration,” Journal<br />
of Data and Information Quality, vol. 2, Jul. 2010, p.<br />
33.<br />
[2] M.J. Franklin, “Dataspaces: Progress and<br />
Prospects,” Dataspace: The Final Frontier, A.P.<br />
Sexton, ed., Berlin, Heidelberg: Springer Berlin<br />
Heidelberg, 2009, pp. 1-3.<br />
[3] P.M. Nadkarni, L. Marenco, R. Chen, E. Skoufos,<br />
G. Shepherd, and P. Miller, “Organization of<br />
Heterogeneous Scientific Data Using the EAV / CR<br />
Representation,” Journal of the American Medical<br />
Informatics Association, vol. 6, 1999, pp. 478-493.<br />
[4] P. Buneman, J. Cheney, W.-C. Tan, and S.<br />
Vansummeren, “Curated databases,” Proceedings of the<br />
twenty-seventh ACM SIGMOD-SIGACT-SIGART<br />
symposium on Principles of database systems - PODS<br />
’08, New York, New York, USA: ACM Press, 2008.