iASA: Learning to Annotate the Semantic Web

  • Jie Tang
  • Juanzi Li
  • Hongjun Lu
  • Bangyong Liang
  • Xiaotong Huang
  • Kehong Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3730)


With the advent of the Semantic Web, there is a great need to upgrade existing web content to semantic web content. This can be accomplished through semantic annotations. Unfortunately, manual annotation is tedious, time-consuming, and error-prone. In this paper, we propose a tool, called iASA, that learns to automatically annotate web documents according to an ontology. iASA is based on the combination of information extraction (specifically, the Similarity-based Rule Learner, SRL) and machine learning techniques. Using linguistic knowledge and an optimal dynamic window size, SRL produces annotation rules of better quality than comparable semantic annotation systems. Similarity-based learning efficiently reduces the search space by avoiding pseudo rule generalization. In the annotation phase, iASA exploits ontology knowledge to refine the annotations it proposes. Moreover, our annotation algorithm exploits machine learning methods to correctly select instances and to predict missing instances. Finally, iASA provides an explanation component that explains the nature of the learner and annotator to the user. Explanations can greatly help users understand the rule induction and annotation process, so that they can focus on correcting rules and annotations quickly. Experimental results show that iASA can reach high accuracy quickly.
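
The abstract does not spell out how SRL works, so the following is only an illustrative sketch of the general idea of similarity-gated rule generalization, not the authors' algorithm. It assumes (hypothetically) that a rule is a pair of left and right context token windows, that similarity is the average Jaccard overlap of those windows, and that only the most similar pair of rules is generalized at each step. The names `rule_similarity`, `generalize`, `learn`, and the `threshold` parameter are all invented for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def rule_similarity(r1, r2):
    """Average similarity of the left and right context windows (assumed measure)."""
    left = jaccard(r1["left"], r2["left"])
    right = jaccard(r1["right"], r2["right"])
    return (left + right) / 2


def generalize(r1, r2):
    """Keep only the context tokens both rules share, in r1's order."""
    return {
        "left": [t for t in r1["left"] if t in r2["left"]],
        "right": [t for t in r1["right"] if t in r2["right"]],
    }


def most_similar_pair(rules):
    """Indices of the two rules with the highest similarity."""
    return max(
        combinations(range(len(rules)), 2),
        key=lambda p: rule_similarity(rules[p[0]], rules[p[1]]),
    )


def learn(rules, threshold=0.5):
    """Merge the most similar pair repeatedly; stop when no pair is similar enough.

    Skipping dissimilar pairs is what keeps the search space small in this sketch.
    """
    rules = list(rules)
    while len(rules) > 1:
        i, j = most_similar_pair(rules)
        if rule_similarity(rules[i], rules[j]) < threshold:
            break
        merged = generalize(rules[i], rules[j])
        rules = [r for k, r in enumerate(rules) if k not in (i, j)] + [merged]
    return rules


# Hypothetical seed rules built from the context around annotated phone numbers
# and birthplaces; the tokens are made up for illustration.
seeds = [
    {"left": ["phone", ":"], "right": ["ext"]},
    {"left": ["tel", ":"], "right": ["ext"]},
    {"left": ["born", "in"], "right": ["."]},
]
result = learn(seeds)
print(result)
```

In this toy run the two phone rules are merged into one more general rule, while the unrelated birthplace rule is never paired with them, which is the sense in which a similarity gate avoids generalizing over rules that have nothing in common.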


Keywords: Natural Language Processing; Resource Description Framework; Information Extraction; Semantic Annotation; Rule Induction
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jie Tang
    • 1
  • Juanzi Li
    • 1
  • Hongjun Lu
    • 1
  • Bangyong Liang
    • 1
  • Xiaotong Huang
    • 1
  • Kehong Wang
    • 1
  1. Department of Computer Science, Tsinghua University, Beijing, P.R. China
