Massive Biomedical Term Discovery

  • Joachim Wermter
  • Udo Hahn
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3735)


Most technical and scientific terms are complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. Specific terminology can be distinguished from common, non-specific noun phrases based on the observation that terms show a much lower degree of distributional variation than non-specific noun phrases. We formalize this limited paradigmatic modifiability of terms and then test the corresponding algorithm on bigram, trigram and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing, community-curated biomedical terminology as an evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system.
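The abstract does not spell out the formalization, so the following is only a minimal sketch of one way limited paradigmatic modifiability could be scored, not necessarily the authors' exact measure. It assumes that each token slot of an n-gram can be replaced by a wildcard and that the score is the product, over all wildcard patterns, of the n-gram's frequency divided by the frequency of everything matching the pattern. All function names and the toy counts are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import prod

WILDCARD = "*"  # hypothetical placeholder for "any token in this slot"


def slot_patterns(ngram, k):
    """Yield copies of `ngram` with exactly k token slots replaced by a wildcard."""
    for slots in combinations(range(len(ngram)), k):
        yield tuple(WILDCARD if i in slots else tok for i, tok in enumerate(ngram))


def pattern_counts(ngram_counts):
    """Sum the corpus frequencies of all n-grams that match each wildcard pattern."""
    counts = Counter()
    for ngram, freq in ngram_counts.items():
        for k in range(1, len(ngram) + 1):
            for pattern in slot_patterns(ngram, k):
                counts[pattern] += freq
    return counts


def modifiability_score(ngram, ngram_counts, patterns):
    """Assumed score: product over all wildcard patterns of f(ngram) / f(pattern).

    A value near 1 means the slots are rarely filled by other tokens
    (low modifiability, term-like); a value near 0 means they vary freely.
    """
    freq = ngram_counts[ngram]
    return prod(
        freq / patterns[pattern]
        for k in range(1, len(ngram) + 1)
        for pattern in slot_patterns(ngram, k)
    )


# Toy bigram frequencies (invented for illustration, not taken from the paper).
toy_counts = Counter({
    ("zinc", "finger"): 6,
    ("long", "period"): 2,
    ("short", "period"): 2,
    ("long", "time"): 2,
})
toy_patterns = pattern_counts(toy_counts)

# Rank candidates from least to most modifiable.
ranking = sorted(
    toy_counts,
    key=lambda ng: modifiability_score(ng, toy_counts, toy_patterns),
    reverse=True,
)
```

In this toy data, "zinc finger" never varies in either slot and therefore ranks above "long period", whose slots are shared with "short period" and "long time". Because the all-wildcard pattern contributes the n-gram's relative frequency, this assumed score also favors frequent candidates.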


Keywords: Noun Phrase; Biomedical Literature; Training Corpus; Term Candidate; Porphyria Cutanea Tarda
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Joachim Wermter (1)
  • Udo Hahn (1)
  1. Language and Information Engineering (Julie) Lab, Jena University, Jena, Germany
