Processing Large-Scale, High-Dimension Genetic and Gene Expression Data

  • Cliona Molony
  • Solveig K. Sieberts
  • Eric E. Schadt


The now-routine generation of large-scale, high-throughput data in multiple dimensions (genotype, gene expression, and so on) poses a significant challenge to researchers who wish to integrate data across these dimensions in the hope of painting a more comprehensive picture of complex system behavior. This type of integration promises to elucidate the networks that drive traits associated with common human diseases such as obesity, diabetes, and atherosclerosis. Carrying out this type of research effectively, however, requires not only the generation of large-scale genotype and molecular profiling data but also the development and application of methods and software, together with a computing infrastructure capable of processing the large-scale data sets. Mastery of these methods and tools, and access to an appropriate computing environment, will be critical to maintaining a competitive advantage, because future successes in biomedical research will likely demand a more comprehensive view of the complex array of interactions in biological systems and of how such interactions are influenced by genetic background, infection, environmental states, lifestyle choices, and social structures more generally. In this chapter, we detail the methodological and computing issues associated with carrying out large-scale genome-wide association studies on tens of thousands of phenotypes, where the aim is to identify those phenotypes that are intermediate between DNA variations and disease phenotypes. This type of analysis can provide insights into the molecular networks that are perturbed by DNA and environmental variations and that, as a result, induce changes in disease-associated traits, providing a path to interpreting genome-wide association study data as well as uncovering the networks that drive disease processes.


Keywords: Message Passing Interface; Expression Trait; Parallel Virtual Machine; Common Human Disease; Surrogate Variable Analysis





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Cliona Molony (1), corresponding author
  • Solveig K. Sieberts (1)
  • Eric E. Schadt (1)
  1. Rosetta Inpharmatics, LLC, a wholly owned subsidiary of Merck & Co., Inc., Seattle, USA
