Advertisement

The q-Gram Distance for Ordered Unlabeled Trees

  • Nobuhito Ohkura
  • Kouichi Hirata
  • Tetsuji Kuboyama
  • Masateru Harao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3735)

Abstract

In this paper, we investigate the q -gram distance for ordered unlabeled trees (trees, for short). First, we formulate a q-gram as simply a tree with q nodes isomorphic to a line graph, and the q-gram distance between two trees as similar as one between two strings. Then, by using the depth sequence based on postorder, we design the algorithm EnumGram to enumerate all q-grams in a tree T with n nodes which runs in O(n 2) time and in O(q) space. Furthermore, we improve it to the algorithm LinearEnumGram which runs in O(qn) time and in O(qd) space, where d is the depth of T. Hence, we can evaluate the q-gram distance D q (T 1,T 2) between T 1 and T 2 in O(q maxn 1, n 2) time and in O(q maxd 1, d 2) space, where n i and d i are the number of nodes in T i and the depth of T i , respectively. Finally, we show the relationship between the q-gram distance D q (T 1,T 2) and the edit distanceE(T 1,T 2) that D q (T 1,T 2)≤ (gl+1)E(T 1,T 2), where g = max{g 1, g 2}, l = max{l 1, l 2}, g i is the degree of T i and l i is the number of leaves in T i . In particular, for the top-down tree edit distanceF(T 1,T 2), this result implies that \(D_{q}(T_{1}, T_{2}) \leq {\rm min}\{g^{q-2}, l - 1\}\{F(T_{1}, T_{2})\}\).

Keywords

Internal Node Line Graph Depth Sequence Edit Distance Edit Operation 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Asai, T., Abe, K., Kawazoe, S., Arimura, H., Sakamoto, H., Arikawa, S.: Efficient substructure discovery from large semi-structured data. In: Proc. SDM 2002 (2002)Google Scholar
  2. 2.
    Asai, T., Arimura, H., Nakano, S., Uno, T.: Discovering frequent substructures in large unordered trees. In: Grieser, G., Tanaka, Y., Yamamoto, A. (eds.) DS 2003. LNCS (LNAI), vol. 2843, pp. 47–61. Springer, Heidelberg (2003)CrossRefGoogle Scholar
  3. 3.
    Bille, P.: A survey on tree edit distance and related problems. Theor. Comput. Sci. 337, 217–239 (2005)CrossRefMathSciNetzbMATHGoogle Scholar
  4. 4.
    Burkhardt, S., Karkkainen, J.: Better filtering with gapped q-grams. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 73–85. Springer, Heidelberg (2001)CrossRefGoogle Scholar
  5. 5.
    Furukawa, K., Uchida, T., Yamada, K., Miyahara, T., Shoudai, T., Nakamura, Y.: Extracting characteristic structures among words in semistructured documents. In: Chen, M.-S., Yu, P.S., Liu, B. (eds.) PAKDD 2002. LNCS (LNAI), vol. 2336, pp. 351–360. Springer, Heidelberg (2002)Google Scholar
  6. 6.
    Garofalakis, M., Kumar, A.: Correlating XML data streams using tree-edit distance embeddings. In: Proc. PODS 2003, pp. 143–154 (2003)Google Scholar
  7. 7.
    Ikeda, D., Yamada, Y., Hirokawa, S.: Eliminating useless parts in semi-structured documents using alternation counts. In: Jantke, K.P., Shinohara, A. (eds.) DS 2001. LNCS (LNAI), vol. 2226, pp. 113–127. Springer, Heidelberg (2001)CrossRefGoogle Scholar
  8. 8.
    Jokinen, P., Ukkonen, E.: Two algorithms for approximate string matching in static texts. In: Tarlecki, A. (ed.) MFCS 1991. LNCS, vol. 520, pp. 240–248. Springer, Heidelberg (1991)Google Scholar
  9. 9.
    Kuboyama, T., Shin, K., Miyahara, T., Yasuda, H.: A theoretical analysis of alignment and edit problems for trees. In: Coppo, M., Lodi, E., Pinna, G.M. (eds.) ICTCS 2005. LNCS, vol. 3701, pp. 323–337. Springer, Heidelberg (2005)CrossRefGoogle Scholar
  10. 10.
    Nakano, S., Uno, T.: Efficient generation of rooted trees. National Institute of Informatics Technical Report NII-2003-005E (2003)Google Scholar
  11. 11.
    Nakano, S., Uno, T.: Constant time generation of trees with specified diameter. In: Hromkovič, J., Nagl, M., Westfechtel, B. (eds.) WG 2004. LNCS, vol. 3353, pp. 33–45. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  12. 12.
    Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 350–363. Springer, Heidelberg (2000)CrossRefGoogle Scholar
  13. 13.
    Selkow, S.M.: The tree-to-tree editing problem. Inform. Proc. Let. 6, 184–186 (1997)CrossRefMathSciNetGoogle Scholar
  14. 14.
    Shasha, D., Zhang, K.: Fast algorithms for the unit cost edit distance between trees. J. Algo. 11, 581–621 (1990)CrossRefMathSciNetzbMATHGoogle Scholar
  15. 15.
    Uchida, T., Mogawa, T., Nakamura, Y.: Finding frequent structural features among words in tree-structured documents. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 351–360. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  16. 16.
    Ukkonen, E.: Approximate string-matching with q-grams and maximal matches. Theor. Comput. Sci. 92, 191–211 (1993)CrossRefMathSciNetGoogle Scholar
  17. 17.
    Yang, W.: Identifying syntactic differences between two programs. Software–Practice and Experience 21, 739–755 (1991)CrossRefGoogle Scholar
  18. 18.
    Zaki, M.J.: Efficiently mining frequent trees in a forest. In: Proc. SIGKDD 2002, pp. 71–80 (2002)Google Scholar
  19. 19.
    Zhang, K., Shasha, D.: Tree pattern matching. In: Apostolico, A., Galil, Z. (eds.) Pattern matching algorithms, pp. 341–371 (1997)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Nobuhito Ohkura
    • 1
  • Kouichi Hirata
    • 2
  • Tetsuji Kuboyama
    • 3
  • Masateru Harao
    • 2
  1. 1.Graduate School of Computer Science and Systems EngineeringKyushu Institute of TechnologyJapan
  2. 2.Department of Artificial IntelligenceKyushu Institute of TechnologyJapan
  3. 3.Center for Collaborative ResearchUniversity of TokyoJapan

Personalised recommendations