Measuring Over-Generalization in the Minimal Multiple Generalizations of Biosequences

  • Yen Kaow Ng
  • Hirotaka Ono
  • Takeshi Shinohara
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3735)


We consider the problem of finding a set of patterns that best characterizes a set of strings. To this end, Arimura et. al. [3] considered the use of minimal multiple generalizations (mmg) for such characterizations. Given any sample set, the mmgs are, roughly speaking, the most (syntactically) specific set of languages containing the sample within a given class of languages. Takae et. al. [17] found the mmgs of the class of pattern languages [1] which includes so-called sort symbols to be fairly accurate as predictors for signal peptides. We first reproduce their results using updated data. Then, by using a measure for estimating the level of over-generalizations made by the mmgs, we show results that explain the high level of accuracies resulting from the use of sort symbols, and discuss how better results can be obtained. The measure that we suggests here can also be applied to other types of patterns, e.g. the PROSITE patterns [4].


Positive Accuracy Regular Pattern Pattern Language Coverage 2x10 Kyushu Institute 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Angluin, D.: Finding patterns common to a set of strings. Journal of Computer and System Sciences 21, 46–62 (1980)CrossRefMathSciNetzbMATHGoogle Scholar
  2. 2.
    Arimura, H., Fujino, R., Shinohara, T., Arikawa, S.: Protein motif discovery from positive examples by Minimal Multiple Generalization over regular patterns. In: Proceedings of the Genome Informatics Workshop, pp. 39–48 (1994)Google Scholar
  3. 3.
    Arimura, H., Shinohara, T., Otsuki, S.: Finding minimal generalizations for unions of pattern languages and its application to inductive inference from positive data. In: Enjalbert, P., Mayr, E.W., Wagner, K.W. (eds.) STACS 1994. LNCS, vol. 775, pp. 649–660. Springer, Heidelberg (1994)Google Scholar
  4. 4.
    Bairoch, A.: PROSITE: A dictionary of sites and patterns in proteins. Nucl. Acids Res. 25(19), 2241–2245 (1991)Google Scholar
  5. 5.
    Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: Genbank: update. Nucl. Acids Res. 32(Database-Issue), 23–26 (2004)CrossRefGoogle Scholar
  6. 6.
    Brāzma, A., Jonassen, I., Eidhammer, I., Gilbert, D.: Approaches to the automatic discovery of patterns in biosequences. J. Comp. Biol. 5(2), 277–304 (1998)Google Scholar
  7. 7.
    Brejova, B., Vinar, T., Li, M.: Pattern Discovery: Methods and Software, Ch. 29, pp. 491–522. Humana Press (2003)Google Scholar
  8. 8.
    Case, J., Jain, S., Reischuk, R., Stephan, F., Zeugmann, T.: Learning a subclass of regular patterns in polynomial time. In: Gavaldá, R., Jantke, K.P., Takimoto, E. (eds.) ALT 2003. LNCS (LNAI), vol. 2842, pp. 234–246. Springer, Heidelberg (2003)CrossRefGoogle Scholar
  9. 9.
    Chan, C., Garofalakis, M., Rastogi, R.: RE-tree: an efficient index structure for regular expressions. The VLDB Journal 12(2), 102–119 (2003)CrossRefGoogle Scholar
  10. 10.
    Hopcroft, J., Ullman, J.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading (1979)zbMATHGoogle Scholar
  11. 11.
    Kannan, S., Sweedyk, Z., Mahaney, S.: Counting and random generation of strings in regular languages. In: Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, pp. 551–557 (1995)Google Scholar
  12. 12.
    Ng, Y.K., Shinohara, T.: Inferring unions of the pattern languages by the most fitting covers. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 269–282. Springer, Heidelberg (2005)CrossRefGoogle Scholar
  13. 13.
    Ono, H., Ng, Y.K.: Best fitting fixed-length substring patterns for a set of strings. In: Proceedings of The Eleventh International Computing and Combinatorics Conference (COCOON 2005) (2005) (to appear)Google Scholar
  14. 14.
    Shinohara, A.: String pattern discovery. In: Ben-David, S., Case, J., Maruoka, A. (eds.) ALT 2004. LNCS (LNAI), vol. 3244, pp. 1–13. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  15. 15.
    Shinohara, T.: Polynomial time inference of extended regular pattern languages. In: Goto, E., Nakajima, R., Yonezawa, A., Nakata, I., Furukawa, K. (eds.) RIMS 1982. LNCS, vol. 147, pp. 115–127. Springer, Heidelberg (1983)Google Scholar
  16. 16.
    Shinohara, T., Ng, Y.K.: Strong biases for the minimal multiple generalization algorithm on samples of very small sizes. In: The Proceedings of the 57th Meeting of SIG-FPAI, The Japanese Society of Artificial Intelligence (November 2004)Google Scholar
  17. 17.
    Takae, T., Kasai, T., Arimura, H., Shinohara, T.: Knowledge discovery in biosequences using sort regular patterns. In: Workshop on Applied Learning Theory (1998)Google Scholar
  18. 18.
    Uemura, J., Sato, M.: Compactness and learning of classes of unions of erasing regular pattern languages. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds.) ALT 2002. LNCS (LNAI), vol. 2533, pp. 293–307. Springer, Heidelberg (2002)CrossRefGoogle Scholar
  19. 19.
    Yamaguchi, M., Shimozono, S., Shinohara, T.: Finding minimal multiple generalization over regular patterns with alphabet indexing. In: Proceedings of the Seventh Workshop on Genome Informatics, vol. 7, pp. 51–60. Universal Academy Press (1996)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Yen Kaow Ng
    • 1
  • Hirotaka Ono
    • 2
  • Takeshi Shinohara
    • 3
  1. 1.Graduate School of Computer Science and SystemsKyushu Institute of TechnologyIizukaJapan
  2. 2.Department of Computer Science and Communication EngineeringKyushu UniversityFukuokaJapan
  3. 3.Department of Artificial IntelligenceKyushu Institute of TechnologyIizukaJapan

Personalised recommendations