Latent Forests to Model Genetical Data for the Purpose of Multilocus Genome-Wide Association Studies. Which Clustering Should Be Chosen?

  • Duc-Thanh Phan
  • Philippe Leray
  • Christine Sinoquet (corresponding author)
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 574)


The aim of genetic association studies, and in particular of genome-wide association studies (GWASs), is to unravel the genetics of complex diseases. In this domain, machine learning offers an attractive alternative to classical statistical approaches. The seminal work of Mourad et al. [1] led to the proposal of a novel class of probabilistic graphical models, the forest of latent trees (FLTM). The design of this model was motivated by the need to model genetical data at the genome scale, prior to a multilocus GWAS. A multilocus GWAS fully exploits information about the complex dependences existing within genetical data, to help detect the loci associated with the studied pathology. The FLTM framework also allows data dimension reduction. The FLTM model is a hierarchical Bayesian network with latent variables. Central to the FLTM construction is the recursive clustering of variables, in a bottom-up subsuming process. This article analyzes the impact of the clustering method chosen for the FLTM learning algorithm, in a GWAS context. We rely on a real GWAS data set describing 41400 variables for each of 3004 controls and 2005 cases affected by Crohn’s disease, and compare three clustering methods. We compare statistics about data dimension reduction as well as trends concerning the ability to split or group putative causal SNPs in agreement with the underlying biological reality. To assess the risk of missing significant association results due to subsumption, we also compare the clustering methods through the corresponding FLTM-based GWASs. In the GWAS context and in this framework, the choice of the clustering method does not affect the satisfactory performance of the GWAS.
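
The bottom-up subsuming process mentioned above can be illustrated with a toy sketch. It is not the FLTM learning algorithm and it does not learn a Bayesian network. Two pieces are assumptions made purely for illustration: `cluster_adjacent` (a threshold on mutual information between neighbouring variables, which is not claimed to be one of the three clustering methods compared in the paper) and `summarize` (a rounded mean that stands in for a latent variable). All function names, the threshold, and the example data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def cluster_adjacent(variables, threshold):
    """Toy clustering (assumption): chain consecutive variables whose
    mutual information reaches the threshold."""
    if not variables:
        return []
    clusters, current = [], [0]
    for i in range(1, len(variables)):
        if mutual_information(variables[i - 1], variables[i]) >= threshold:
            current.append(i)
        else:
            clusters.append(current)
            current = [i]
    clusters.append(current)
    return clusters


def summarize(columns):
    """Stand-in for a latent variable (assumption): rounded mean of the
    child variables, computed per individual."""
    return [round(sum(values) / len(values)) for values in zip(*columns)]


def build_layers(variables, threshold, max_layers=10):
    """Cluster, subsume each multi-variable cluster into one summary
    variable, and repeat on the reduced set of variables."""
    layers, current = [], list(variables)
    for _ in range(max_layers):
        clusters = cluster_adjacent(current, threshold)
        if all(len(c) == 1 for c in clusters):
            break  # nothing left to subsume
        layers.append(clusters)
        # Singletons are carried up unchanged (a choice made for this sketch).
        current = [
            summarize([current[i] for i in c]) if len(c) > 1 else current[c[0]]
            for c in clusters
        ]
    return layers


# Hypothetical genotypes (0/1/2) for 3 SNPs over 8 individuals.
snps = [
    [0, 1, 2, 0, 1, 2, 0, 1],
    [0, 1, 2, 0, 1, 2, 0, 1],  # identical to the first SNP
    [1, 1, 0, 0, 2, 2, 1, 0],
]
layers = build_layers(snps, threshold=0.5)
print(layers)  # [[[0, 1], [2]]]
```

In this example the first two SNPs are subsumed into one summary variable and the third stays isolated, so three variables are reduced to two after a single layer. Swapping `cluster_adjacent` for another function with the same signature is the point at which the choice of clustering method would enter such a procedure.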


Keywords: Genome-wide association study, Multilocus association study, Linkage disequilibrium, Probabilistic graphical model, Latent variable, Data dimension reduction, Bayesian network


References

  1. Mourad, R., Sinoquet, C., Leray, P.: A hierarchical Bayesian network approach for linkage disequilibrium modeling and data-dimensionality reduction prior to genome-wide association studies. BMC Bioinformatics 12, 16 (2011)
  2. International Monetary Fund: Macro-fiscal Implications of Health Care Reform in Advanced and Emerging Economies. IMF Policy Paper, Washington (2010)
  3. Hechter, E.: On Genetic Variants Underlying Common Disease. Ph.D. thesis, University of Oxford (2011)
  4. Gibbs, R.A., Belmont, J.W., Hardenbol, P., et al.: The international HapMap project. Nature 426(6968), 789–796 (2003)
  5. The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061–1073 (2010)
  6. Balding, D.J.: A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 7(10), 781–791 (2006)
  7. Pritchard, J.K., Przeworski, M.: Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69(1), 1–14 (2001)
  8. Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., et al.: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294(5547), 1719–1723 (2001)
  9. Abel, H.J., Thomas, A.: Accuracy and computational efficiency of a graphical modeling approach to linkage disequilibrium estimation. Stat. Appl. Genet. Mol. Biol. 10(1), Article 5 (2011)
  10. Verzilli, C.J., Stallard, N., Whittaker, J.C.: Bayesian graphical models for genome-wide association studies. Am. J. Hum. Genet. 79, 100–112 (2006)
  11. Browning, B.L., Browning, S.R.: Efficient multilocus association testing for whole genome association studies using localized haplotype clustering. Genet. Epidemiol. 31, 365–375 (2007)
  12. Ackerman, M., Ben-David, S.: Clusterability: a theoretical study. In: 12th International Conference on Artificial Intelligence and Statistics, vol. 5, pp. 1–8 (2009). J. Mach. Learn. Res.
  13. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
  14. Robinson, R.W.: Counting unlabeled acyclic digraphs. In: Little, C.H.C. (ed.) Combinatorial Mathematics V. Lecture Notes in Mathematics, vol. 622, pp. 28–43. Springer, New York (1977)
  15. Zhang, N.L.: Hierarchical latent class models for cluster analysis. J. Mach. Learn. Res. 5(6), 697–723 (2004)
  16. Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering gene expression patterns. In: 3rd Annual International Conference on Computational Molecular Biology, pp. 33–42 (1999)
  17. Cahill, J.: Error-Tolerant Clustering of Gene Microarray Data. Bachelor’s Honors thesis, Boston College, Massachusetts (2002)
  18. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231. AAAI Press (1996)
  19. Meila, M.: Comparing clusterings: an axiomatic view. In: 22nd International Conference on Machine Learning, pp. 577–584 (2005)
  20. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
  21. Mirkin, B.: Mathematical classification and clustering: from how to what and why. J. Classif. 2(1), 193–218 (1998)
  22. Fowlkes, E.B., Mallows, C.L.: A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78(383), 553–569 (1983)
  23. Purcell, S., Neale, B., Todd-Brown, K., et al.: PLINK: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet. 81(3), 559–575 (2007)
  24. Gabriel, S.B., Schaffner, S.F., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S.N., Rotimi, C., Adeyemo, A., Cooper, R., Ward, R., Lander, E.S., Daly, M.J., Altshuler, D.: The structure of haplotype blocks in the human genome. Science 296(5576), 2225–2229 (2002)
  25. Wang, N., Akey, J.M., Zhang, K., Chakraborty, R., Jin, L.: Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am. J. Hum. Genet. 71(5), 1227–1234 (2002)
  26. Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145), 661–678 (2007)
  27. Barrett, J.C., Hansoul, S., Nicolae, D.L., et al.: Genome-wide association defines more than 30 distinct susceptibility loci for Crohn’s disease. Nat. Genet. 40(8), 955–962 (2008)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Duc-Thanh Phan (1)
  • Philippe Leray (1)
  • Christine Sinoquet (2)

  1. LINA, UMR CNRS 6241, POLYTECH, University of Nantes, Nantes, France
  2. LINA, UMR CNRS 6241, Faculty of Sciences, University of Nantes, Nantes, France
