Advertisement

Finding Significant Web Pages with Lower Ranks by Pseudo-Clique Search

  • Yoshiaki Okubo
  • Makoto Haraguchi
  • Bin Shi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3735)

Abstract

In this paper, we discuss a method of finding useful clusters of web pages which are significant in the sense that their contents are similar or closely related to ones of higher-ranked pages. Since we are usually careless of pages with lower ranks, they are unconditionally discarded even if their contents are similar to some pages with high ranks. We try to extract such hidden pages together with significant higher-ranked pages as a cluster.

In order to obtain such clusters, we first extract semantic correlations among terms by applying Singular Value Decomposition(SVD) to the term-document matrix generated from a corpus w.r.t. a specific topic. Based on the correlations, we can evaluate potential similarities among web pages from which we try to obtain clusters. The set of web pages is represented as a weighted graph G based on the similarities and their ranks. Our clusters can be found as pseudo-cliques in G. We present an algorithm for finding Top-N weighted pseudo-cliques. Our experimental result shows that quite valuable clusters can be actually extracted according to our method.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web (1999), http://dbpubs.stanford.edu/pub/1999-66
  2. 2.
    Vakali, A., Pokorný, J., Dalamagas, T.: An Overview of Web Data Clustering Practices. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds.) EDBT 2004. LNCS, vol. 3268, pp. 597–606. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  3. 3.
    Kita, K., Tsuda, K., Shishibori, M.: Information Retrieval Algorithms. Kyoritsu Shuppan (2002) (in Japanese)Google Scholar
  4. 4.
    Tomita, E., Seki, T.: An Efficient Branch-and-Bound Algorithm for Finding a Maximum Clique. In: Calude, C.S., Dinneen, M.J., Vajnovszki, V. (eds.) DMTCS 2003. LNCS, vol. 2731, pp. 278–289. Springer, Heidelberg (2003)CrossRefGoogle Scholar
  5. 5.
    Fahle, T.: Simple and Fast: Improving a Branch-and-Bound Algorithm for Maximum Clique. In: Möhring, R.H., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 485–498. Springer, Heidelberg (2002)CrossRefGoogle Scholar
  6. 6.
    Satoh, K.: A Method for Generating Data Abstraction Based on Optimal Clique Search, Master’s Thesis, Graduate School of Eng., Hokkaido Univ. (March 2003) (in Japanese)Google Scholar
  7. 7.
    Masuda, S.: Analysis of Ascidian Gene Expression Data by Clique Search, Master’s Thesis, Graduate School of Eng., Hokkaido Univ. (March 2005) (in Japanese)Google Scholar
  8. 8.
    Shi, B.: Top-N Clique Search of Web Pages, Master’s Thesis, Graduate School of Eng., Hokkaido Univ. (March 2005) (in Japanese)Google Scholar
  9. 9.
    Okubo, Y., Haraguchi, M.: Creating Abstract Concepts for Classification by Finding Top-N Maximal Weighted Cliques. In: Grieser, G., Tanaka, Y., Yamamoto, A. (eds.) DS 2003. LNCS (LNAI), vol. 2843, pp. 418–425. Springer, Heidelberg (2003)CrossRefGoogle Scholar
  10. 10.
    Okubo, Y., Haraguchi, M.: Finding Top-N Pseudo-Cliques in Simple Graph. In: Proceedings of the 9th World Multiconference on Systemics, Cybernetics and Informatics - WMSCI 2005, vol. III, pp. 215–220 (2005)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Yoshiaki Okubo
    • 1
  • Makoto Haraguchi
    • 1
  • Bin Shi
    • 1
  1. 1.Division of Computer Science, Graduate School of Information Science and TechnologyHokkaido UniversitySapporoJapan

Personalised recommendations