Towards Structure-sensitive Hypertext Categorization

  • Alexander Mehler
  • Rüdiger Gleim
  • Matthias Dehmer
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

Hypertext categorization is the task of automatically assigning category labels to hypertext units. Like text categorization, it falls within the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. To outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.
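As a rough illustration of the bag-of-features setting the abstract refers to, the following minimal sketch maps each page to a bag of word counts and assigns the category whose summed training bag is most similar. It is not the authors' method: the tokenizer, the cosine-based nearest-centroid rule, the function names, and the toy training pages are all assumptions made for this example.

```python
from collections import Counter
import math
import re


def bag_of_features(text):
    """Map a text to a bag (multiset) of lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_pages):
    """Sum the bags of all training pages per category (one centroid each)."""
    centroids = {}
    for text, label in labelled_pages:
        centroids.setdefault(label, Counter()).update(bag_of_features(text))
    return centroids


def categorize(text, centroids):
    """Assign the category whose centroid is most similar to the page."""
    bag = bag_of_features(text)
    return max(centroids, key=lambda label: cosine(bag, centroids[label]))


# Toy training data, invented for illustration only (hypothetical pages).
TRAINING = [
    ("call for papers submission deadline program committee", "conference"),
    ("keynote speakers registration venue program", "conference"),
    ("curriculum vitae publications teaching research interests", "homepage"),
    ("office hours contact publications students", "homepage"),
]

CENTROIDS = train(TRAINING)
```

Calling `categorize("submission deadline and program committee", CENTROIDS)` selects the label with the highest cosine score. The flat bag keeps no trace of how content is distributed over linked pages or document trees, which is the kind of structural information the abstract says this classical setup fails to capture.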

Keywords

Text Categorization · Function Learning · Category Assignment · Instance Base Learning · Document Object Model Tree
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. AMITAY, E. and CARMEL, D. and DARLOW, A. and LEMPEL, R. and SOFFER, A. (2003): The connectivity sonar. Proc. of the 14th ACM Conference on Hypertext, 28–47.
  2. BOCK, H.H. (1974): Automatische Klassifikation. Vandenhoeck & Ruprecht, Göttingen.
  3. BUNKE, H. and GÜNTER, S. and JIANG, X. (2001): Towards bridging the gap between statistical and structural pattern recognition. Proc. of the 2nd Int. Conf. on Advances in Pattern Recognition, Berlin, Springer, 1–11.
  4. CHAKRABARTI, S. and DOM, B. and INDYK, P. (1998): Enhanced hypertext categorization using hyperlinks. Proc. of ACM SIGMOD, International Conf. on Management of Data, ACM Press, 307–318.
  5. DEHMER, M. and MEHLER, A. (2004): A new method of similarity measuring for a specific class of directed graphs. Submitted to Tatra Mountain Journal, Slovakia.
  6. FÜRNKRANZ, J. (2002): Hyperlink ensembles: a case study in hypertext classification. Information Fusion, 3(4), 299–312.
  7. GIBSON, D. and KLEINBERG, J. and RAGHAVAN, P. (1998): Inferring web communities from link topology. Proc. of the 9th ACM Conf. on Hypertext, 225–234.
  8. GLEIM, R. (2005): Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertexte. Proc. of GLDV ’05, 42–53.
  9. HSU, C.-W. and CHANG, C.-C. and LIN, C.-J. (2003): A practical guide to SVM classification. Technical report, Department of Computer Science and Information Technology, National Taiwan University.
  10. JOACHIMS, T. (2002): Learning to classify text using support vector machines. Kluwer, Boston.
  11. JOACHIMS, T. and CRISTIANINI, N. and SHAWE-TAYLOR, J. (2001): Composite kernels for hypertext categorisation. Proc. of the 11th ICML, 250–257.
  12. KLEINBERG, J. (1999): Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
  13. KOSALA, R. and BLOCKEEL, H. (2000): Web mining research: A survey. SIGKDD Explorations, 2(1), 1–15.
  14. MEHLER, A. and DEHMER, M. and GLEIM, R. (2004): Towards logical hypertext structure: a graph-theoretic perspective. Proc. of I2CS ’04, Berlin, Springer.
  15. MIZUUCHI, Y. and TAJIMA, K. (1999): Finding context paths for web pages. Proc. of the 10th ACM Conference on Hypertext and Hypermedia, 13–22.
  16. REHM, G. (2002): Towards automatic web genre identification. Proc. of the Hawai’i Int. Conf. on System Sciences.
  17. RIEGER, B. (1989): Unscharfe Semantik. Peter Lang, Frankfurt a.M.
  18. YANG, Y. (1999): An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1(1/2), 67–88.
  19. YANG, Y. and SLATTERY, S. and GHANI, R. (2002): A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2–3), 219–241.
  20. YOSHIOKA, T. and HERMAN, G. (2000): Coordinating information using genres. Technical report, Massachusetts Institute of Technology.

Copyright information

© Springer Berlin · Heidelberg 2006

Authors and Affiliations

  • Alexander Mehler (1)
  • Rüdiger Gleim (1)
  • Matthias Dehmer (2)
  1. Universität Bielefeld, Bielefeld, Germany
  2. Technische Universität Darmstadt, Darmstadt, Germany
