DS 2005: Discovery Science pp 189-202

# The q-Gram Distance for Ordered Unlabeled Trees

• Nobuhito Ohkura
• Kouichi Hirata
• Tetsuji Kuboyama
• Masateru Harao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3735)

## Abstract

In this paper, we investigate the q -gram distance for ordered unlabeled trees (trees, for short). First, we formulate a q-gram as simply a tree with q nodes isomorphic to a line graph, and the q-gram distance between two trees as similar as one between two strings. Then, by using the depth sequence based on postorder, we design the algorithm EnumGram to enumerate all q-grams in a tree T with n nodes which runs in O(n 2) time and in O(q) space. Furthermore, we improve it to the algorithm LinearEnumGram which runs in O(qn) time and in O(qd) space, where d is the depth of T. Hence, we can evaluate the q-gram distance D q (T 1,T 2) between T 1 and T 2 in O(q maxn 1, n 2) time and in O(q maxd 1, d 2) space, where n i and d i are the number of nodes in T i and the depth of T i , respectively. Finally, we show the relationship between the q-gram distance D q (T 1,T 2) and the edit distanceE(T 1,T 2) that D q (T 1,T 2)≤ (gl+1)E(T 1,T 2), where g = max{g 1, g 2}, l = max{l 1, l 2}, g i is the degree of T i and l i is the number of leaves in T i . In particular, for the top-down tree edit distanceF(T 1,T 2), this result implies that $$D_{q}(T_{1}, T_{2}) \leq {\rm min}\{g^{q-2}, l - 1\}\{F(T_{1}, T_{2})\}$$.

## Keywords

Internal Node Line Graph Depth Sequence Edit Distance Edit Operation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

## References

1. 1.
Asai, T., Abe, K., Kawazoe, S., Arimura, H., Sakamoto, H., Arikawa, S.: Efficient substructure discovery from large semi-structured data. In: Proc. SDM 2002 (2002)Google Scholar
2. 2.
Asai, T., Arimura, H., Nakano, S., Uno, T.: Discovering frequent substructures in large unordered trees. In: Grieser, G., Tanaka, Y., Yamamoto, A. (eds.) DS 2003. LNCS (LNAI), vol. 2843, pp. 47–61. Springer, Heidelberg (2003)
3. 3.
Bille, P.: A survey on tree edit distance and related problems. Theor. Comput. Sci. 337, 217–239 (2005)
4. 4.
Burkhardt, S., Karkkainen, J.: Better filtering with gapped q-grams. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 73–85. Springer, Heidelberg (2001)
5. 5.
Furukawa, K., Uchida, T., Yamada, K., Miyahara, T., Shoudai, T., Nakamura, Y.: Extracting characteristic structures among words in semistructured documents. In: Chen, M.-S., Yu, P.S., Liu, B. (eds.) PAKDD 2002. LNCS (LNAI), vol. 2336, pp. 351–360. Springer, Heidelberg (2002)Google Scholar
6. 6.
Garofalakis, M., Kumar, A.: Correlating XML data streams using tree-edit distance embeddings. In: Proc. PODS 2003, pp. 143–154 (2003)Google Scholar
7. 7.
Ikeda, D., Yamada, Y., Hirokawa, S.: Eliminating useless parts in semi-structured documents using alternation counts. In: Jantke, K.P., Shinohara, A. (eds.) DS 2001. LNCS (LNAI), vol. 2226, pp. 113–127. Springer, Heidelberg (2001)
8. 8.
Jokinen, P., Ukkonen, E.: Two algorithms for approximate string matching in static texts. In: Tarlecki, A. (ed.) MFCS 1991. LNCS, vol. 520, pp. 240–248. Springer, Heidelberg (1991)Google Scholar
9. 9.
Kuboyama, T., Shin, K., Miyahara, T., Yasuda, H.: A theoretical analysis of alignment and edit problems for trees. In: Coppo, M., Lodi, E., Pinna, G.M. (eds.) ICTCS 2005. LNCS, vol. 3701, pp. 323–337. Springer, Heidelberg (2005)
10. 10.
Nakano, S., Uno, T.: Efficient generation of rooted trees. National Institute of Informatics Technical Report NII-2003-005E (2003)Google Scholar
11. 11.
Nakano, S., Uno, T.: Constant time generation of trees with specified diameter. In: Hromkovič, J., Nagl, M., Westfechtel, B. (eds.) WG 2004. LNCS, vol. 3353, pp. 33–45. Springer, Heidelberg (2004)
12. 12.
Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 350–363. Springer, Heidelberg (2000)
13. 13.
Selkow, S.M.: The tree-to-tree editing problem. Inform. Proc. Let. 6, 184–186 (1997)
14. 14.
Shasha, D., Zhang, K.: Fast algorithms for the unit cost edit distance between trees. J. Algo. 11, 581–621 (1990)
15. 15.
Uchida, T., Mogawa, T., Nakamura, Y.: Finding frequent structural features among words in tree-structured documents. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 351–360. Springer, Heidelberg (2004)
16. 16.
Ukkonen, E.: Approximate string-matching with q-grams and maximal matches. Theor. Comput. Sci. 92, 191–211 (1993)
17. 17.
Yang, W.: Identifying syntactic differences between two programs. Software–Practice and Experience 21, 739–755 (1991)
18. 18.
Zaki, M.J.: Efficiently mining frequent trees in a forest. In: Proc. SIGKDD 2002, pp. 71–80 (2002)Google Scholar
19. 19.
Zhang, K., Shasha, D.: Tree pattern matching. In: Apostolico, A., Galil, Z. (eds.) Pattern matching algorithms, pp. 341–371 (1997)Google Scholar

## Authors and Affiliations

• Nobuhito Ohkura
• 1
• Kouichi Hirata
• 2
• Tetsuji Kuboyama
• 3
• Masateru Harao
• 2
1. 1.Graduate School of Computer Science and Systems EngineeringKyushu Institute of TechnologyJapan
2. 2.Department of Artificial IntelligenceKyushu Institute of TechnologyJapan
3. 3.Center for Collaborative ResearchUniversity of TokyoJapan