Using Similarity-Based Operations for Resolving Data-Level Conflicts

  • Eike Schallehn
  • Kai-Uwe Sattler
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2712)


Dealing with discrepancies in data is still a major challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Although SQL operations such as grouping and join seem to be a viable solution, they fail if the attribute values of potential duplicates or related tuples are not equal but only similar according to certain criteria. As a solution to this problem, we present similarity-based variants of the grouping and join operators. The extended grouping operator produces groups of similar tuples; the extended join combines tuples that satisfy a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for edit distance similarity, and present evaluation results. Finally, we give examples of how the operators can be used in given application scenarios.
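To make the two operators concrete, the following is a minimal Python sketch, not the implementation described in the paper. It assumes that similarity means "edit distance at most k" on a single string attribute, and that groups are formed by the transitive closure of that relation. The function names (`edit_distance`, `similarity_join`, `similarity_group`) and the dictionary-based rows are hypothetical, and the quadratic pairwise comparison stands in for whatever index support an efficient implementation would use.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity_join(left, right, key_left, key_right, k):
    """Naive nested-loop join: pair rows whose key values are within distance k."""
    return [(l, r) for l, r in product(left, right)
            if edit_distance(l[key_left], r[key_right]) <= k]


def similarity_group(rows, key, k):
    """Group rows by the transitive closure of 'distance <= k' (assumed semantics)."""
    parent = list(range(len(rows)))  # union-find forest over row positions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if edit_distance(rows[i][key], rows[j][key]) <= k:
                parent[find(j)] = find(i)

    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(find(i), []).append(row)
    return list(groups.values())
```

With a threshold of 1, the names "Monet", "Monnet" and "Manet" would fall into one group under this sketch, because "Monet" is within distance 1 of both others even though "Monnet" and "Manet" are at distance 2. That is a consequence of the transitive-closure assumption and shows why the choice of threshold and grouping semantics matters.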


Keywords: Similarity Measure, Edit Distance, Table Function, Aggregate Function, Index Support
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Eike Schallehn (1)
  • Kai-Uwe Sattler (1)
  1. Department of Computer Science, University of Magdeburg, Magdeburg, Germany
