Identification of Real-world Objects in Multiple Databases

  • Mattis Neiling
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


Object identification is an important issue when integrating data from different sources. The task is complicated if the sources share no global, consistent identifier. In that case, object identification can only be performed using the identifying information that the objects' data itself provides. Unfortunately, real-world data is dirty, so identification mechanisms such as natural keys mostly fail; we have to account for variations and errors in the data. Consequently, object identification can no longer be guaranteed to be fault-free. Several methods tackle the object identification problem, e.g. Record Linkage or the Sorted Neighborhood Method.

Based on a novel object identification framework, we assessed data quality and evaluated different methods on real data. One main result is that scalability is determined by the applied preselection technique and the use of efficient data structures. Another result is that Decision Tree Induction achieves better correctness and is more robust than Record Linkage.
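
To illustrate why preselection governs scalability, the following minimal sketch shows window-based candidate selection in the spirit of the Sorted Neighborhood Method. It is not the implementation evaluated in the paper: the function name `sorted_neighborhood_pairs`, the sample records, and the three-letter sorting key are all assumptions made for illustration.

```python
def sorted_neighborhood_pairs(records, key, window=2):
    """Return candidate pairs (i, j), i < j, of record indices.

    Records are sorted by `key`; each record is paired only with the
    next `window - 1` records in that order, so the number of candidate
    pairs grows roughly linearly instead of quadratically.
    """
    # Sort indices rather than records so pairs refer to the input order.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data: two likely duplicates and one unrelated record.
records = [
    {"name": "Smith"},
    {"name": "Smyth"},
    {"name": "Miller"},
    {"name": "Mueller"},
    {"name": "Adams"},
]

# Assumed sorting key: the first three letters of the name, lower-cased.
pairs = sorted_neighborhood_pairs(records, key=lambda r: r["name"][:3].lower())
```

With five records there are ten possible pairs, but a window of two keeps only the four pairs that are adjacent in sort order. A classifier (for example a record linkage model or a decision tree) would then be applied to these candidates only; a poorly chosen key can push true matches out of the window, which is the usual trade-off of this technique.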


Keywords: Association Rule · Matched Pair · Record Linkage · Random Pair · Multiple Database





Copyright information

© Springer Berlin · Heidelberg 2006

Authors and Affiliations

  • Mattis Neiling (1)
  1. Technische Universität Berlin, Berlin
