Query Driven Entity Resolution in Data Lakes

  • Giorgos Alexiou
  • George Papastefanatos
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1197)


Entity Resolution (ER) is a core data integration task that aims to match different representations of the same entity coming from various sources. Because of its quadratic complexity, it typically scales to large datasets through approximate methods, i.e., blocking: similar entities are clustered into blocks and pairwise comparisons are executed only between entities that co-occur in a block, at the cost of some missed matches. In traditional settings, ER is part of the data integration process, i.e., a preprocessing step performed before “clean” data is made available for analysis. With the increasing demand for real-time analytical applications, recent research has begun to consider new approaches for integrating Entity Resolution with Query Processing. In this work, we explore the problem of query-driven Entity Resolution and propose a method for efficiently applying blocking and meta-blocking techniques during query processing. The aim of our approach is to answer SQL-like queries issued on top of dirty data effectively and efficiently. The experimental evaluation of the proposed solution demonstrates its significant advantages over other techniques in the given problem settings.
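
To make the blocking idea concrete, the following is a minimal sketch of schema-agnostic token blocking, one common way of forming blocks. It is an illustration only and not the method proposed in the paper: the function names `token_blocking` and `candidate_pairs`, the toy records, and their attribute names are all assumptions introduced here.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Group record ids into blocks keyed by each token of any attribute value."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for value in record.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # A block holding a single record yields no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct record pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records; ids and attribute names are illustrative only.
records = {
    1: {"name": "Jon Smith", "city": "Athens"},
    2: {"name": "John Smith", "town": "Athens"},
    3: {"name": "Maria Papadopoulou", "city": "Patras"},
    4: {"name": "M. Papadopoulou", "city": "Thessaloniki"},
}

blocks = token_blocking(records)
pairs = candidate_pairs(blocks)

# Exhaustive ER would compare every pair of records: n * (n - 1) / 2.
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

In this toy input only the pairs sharing a token are compared, which is where the savings over the quadratic baseline come from. The same mechanism also explains the missed matches mentioned above: two representations that share no token never land in a common block and are therefore never compared.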


Keywords: Entity Resolution · Entity matching · Data lakes



This research is funded by the project VisualFacts (#1614) - 1st Call of the Hellenic Foundation for Research and Innovation Research Projects for the support of post-doctoral researchers.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
  2. Information Management Systems Institute, ATHENA Research Center, Marousi, Greece
