e s o u r c eRational association of genes with traits using agenome-scale gene network for Arabidopsis thalianaInsuk Lee 1,2,5 , Bindu Ambaru 4,5 , Pranjali Thakkar 4 , Edward M Marcotte 2,3 & Seung Y Rhee 4© 2010 Nature America, Inc. All rights reserved.We introduce a rational approach for associating genes with plant traits by combined use of a genome-scale functional networkand targeted reverse genetic screening. We present a probabilistic network (AraNet) of functional associations among 19,647(73%) genes of the reference flowering plant Arabidopsis thaliana. AraNet associations are predictive for diverse biologicalpathways, and outperform predictions derived only from literature-based protein interactions, achieving 21% precision for 55%of genes. AraNet prioritizes genes for limited-scale functional screening, resulting in a hit-rate tenfold greater than screens ofrandom insertional mutants, when applied to early seedling development as a test case. By interrogating network neighborhoods,we identify AT1G80710 (now DROUGHT SENSITIVE 1; DRS1) and AT3G05090 (now LATERAL ROOT STIMULATOR 1; LRS1) asregulators of drought sensitivity and lateral root development, respectively. AraNet (http://www.functionalnet.org/aranet/) providesa resource for plant gene function identification and genetic dissection of plant traits.Manipulating plant traits that affect the production of food, fiberand renewable energy has important agricultural and economic consequences.What is the best approach for identifying genes for importantplant traits? Forward genetics is limited as mutations in manygenes may generate only moderate or weak phenotypes. Similarly,although reverse genetics allows for directed assay of gene perturbations1 , saturated phenotyping for many plant traits is impractical. Apragmatic near-term solution is the computational identification oflikely candidate genes for desired traits, allowing for focused, efficientuse of reverse genetics. This solution is not unlike rational drug designin which computer-assisted and expert knowledge are combined withtargeted screening for the desired drug, or in this case, trait.One emerging approach for prioritizing candidate genes is networkguidedguilt by association. In this approach, functional associationsare first determined between genes in a genome on the basis of extensiveexperimental data sets, encompassing millions of individual observations.Such a map of functional associations is often represented as agraph model and referred to as a functional gene network 2 .Probabilistic functional gene networks integrate heterogeneousbiological data into a single model, enhancing both model accuracyand coverage. Once a suitable network is generated, new candidategenes are proposed for traits based upon network associations withgenes previously linked to these traits. Such network-guided screeninghas been successfully applied to unicellular organisms 3,4 andCaenorhabditis elegans 5,6 , and is a proposed strategy for identifyinghuman disease genes 4,5,7–10 .We demonstrate here that this approach successfully identifies genesaffecting specified traits for a reference flowering plant, A. thaliana,and we introduce a genome-wide, functional gene network forArabidopsis suitable for prioritizing candidate genes for a wide varietyof traits of economic and agricultural importance.RESULTSReconstruction of an Arabidopsis gene networkWe integrated diverse functional genomics, proteomics and comparativegenomics data sets into a genome-wide functional gene network,using data integration and benchmarking methods customized forArabidopsis genes (Supplementary Methods). The data sets includedmRNA co-expression patterns measured from DNA microarray datasets (Supplementary Table 1), known Arabidopsis protein-proteininteractions 11–14 , protein sequence features including sharing ofprotein domains, similarity of phylogenetic profiles 15–17 or genomiccontext of bacterial or archaebacterial homologs 18–20 , and diversegene-gene associations (mRNA co-expression, physical proteininteractions, multiprotein complexes, genetic interactions, literaturemining) transferred from yeast 21 , fly 11,22–24 , worm 5 and humangenes based on orthology 25 (Supplementary Table 2). In total,24 distinct types of gene-gene associations, encompassing >50 millionindividual experimental or computational observations, were scoredfor their ability to correctly reconstruct shared membership inArabidopsis biological processes. Then, these were incorporated intoa single integrated network model, dubbed AraNet. AraNet contains1,062,222 functional linkages among 19,647 genes (~73% of the totalArabidopsis genes), with each linkage weighted by the log likelihood ofthe linked genes to participate in the same biological processes.Integrating data improves network coverage and accuracy,as tested by recovery of known functional associations (Fig. 1aand Supplementary Fig. 1). AraNet extends substantially beyond1 Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seodaemun-gu, Seoul, Korea. 2 Center for Systems and Synthetic Biology,and 3 Department of Chemistry and Biochemistry, Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, Texas, USA. 4 Department of PlantBiology, Carnegie Institution for Science, Stanford, California, USA. 5 These authors contributed equally to this work. Correspondence should be addressed to I.L.(insuklee@yonsei.ac.kr) or E.M.M. (marcotte@icmb.utexas.edu) or S.Y.R. (rhee@acoma.stanford.edu).Received 10 June 2009; accepted 23 December 2009; published online 31 January 2010; doi:10.1038/nbt.1603nature biotechnology VOLUME 28 NUMBER 2 FEBRUARY 2010 149
e s o u r c eFigure 1 Construction, accuracy and coverage of AraNet, a functionalgene network for Arabidopsis. (a) Pairwise gene linkages derived from 24diverse functional genomics and proteomics data sets, representing >50million experimental or computational observations, were integrated intoa composite network with higher accuracy and genome coverage than anyindividual data set. The integrated network (AraNet) contains 1,062,222functional linkages among 19,647 (73%) of the 27,029 protein-codingArabidopsis genes. The plot x axis indicates the log-scale percentage ofthe 27,029 protein-coding genes 13 covered by functional linkages derivedfrom the indicated data sets (curves); the y axis indicates predictivequality of the data sets, measured as the cumulative likelihood ratio oflinked genes to share GO biological process term annotations, tested using0.632 bootstrapping and plotted for successive bins of 1,000 linkageseach (symbols). Data sets are named as XX-YY, where XX indicates speciesof data origin (AT, A. thaliana; CE, C. elegans; DM, D. melanogaster;HS, H. sapiens; SC, S. cerevisiae) and YY indicates data type (CC,aLikelihood ratio for linked genes toparticipate in same biological process(cumulative likelihood ratio, log scale)AraNet148AT-CXAT-DCAT-GNAT-LC90AT-PGCE-CCCE-CX55CE-GTCE-LCCE-YH33DM-PIHS-CXHS-DC20HS-LCHS-MSHS-YH12SC-CCSC-CXSC-DCSC-GT7SC-LCSC-MSSC-TS4SC-YH1 10100Coverage of 27,029 protein-coding genes (%)bCoverage of 27,029 protein-codinggenes (%)806040200AT-LCaccuracylevelReliableannotation onlyAll annotationAraNetco-citation; CX, mRNA co-expression; DC, domain co-occurrence; GN, gene neighbor; GT, genetic interaction; LC, literature-curated protein interactions;MS, affinity purification/mass spectrometry; PG, phylogenetic profiles; PI, fly protein interactions; TS, tertiary structure; YH, yeast two-hybrid. Relativecontribution of each data type and combining different evidences for inferring function is discussed in Supplementary Discussion (SupplementaryFigs. 13–15). (b) AraNet spans ~73% of the protein-coding genes, far in excess of current GO biological process annotations for Arabidopsis, for which~12.2% of genes are annotated by reliable experimental evidence (GO evidence codes IDA, IMP, IGI, IPI, IEP) or traceable author statements (GOevidence code TAS), or ~45% annotated by any evidence including computational inferences or sequence homology. The subset of AraNet linkagesstronger than the likelihood ratio for literature-curated protein interactions (AT-LC, corresponding to a likelihood ratio of 35:1) covers 55% of the genes.© 2010 Nature America, Inc. All rights reserved.well-characterized Arabidopsis genes (Fig. 1b): 23,720 Arabidopsis genesare unannotated with Gene <strong>Ontology</strong> (GO) biological process annotationsby reliable experimental evidence 13 . AraNet includes more thanhalf (7,465) of genes lacking even sequence homology–based annotations(14,847 genes). These genes’ functions can now be hypothesizedbased upon their network neighborhoods. AraNet implicates specificprocesses for 4,479 uncharacterized genes using guilt by association.There are 2,986 uncharacterized genes associated only with otheruncharacterized genes in AraNet, suggesting many still-uncharacterizedcellular processes in plants.Evaluating the accuracy of AraNetTo verify the reliability of functional associations in AraNet, wetested their consistency with known Arabidopsis gene annotationsby applying guilt by association in AraNet to identify genes associatedwith specific biological processes. Each gene in the genome wasscored for association with a particular process by summing networkedge weights connecting that gene to known genes in that process.A gene’s resulting score corresponds to the naive Bayes estimate forthe gene to belong to that process given network evidence (Fig. 2a).Performing cross-validation of this test allows us to assess predictivepower with a receiver-operator characteristic (ROC) curve, measuringthe true-positive prediction rate versus false-positive predictionrate as a function of prediction score. We use the area under the ROCcurve (AUC) to summarize performance. AUC values of ~0.5 and 1indicate random and perfect performance, respectively.Using cross-validation, we tested AraNet’s ability to correctlyassociate genes with each GO biological process, observing significantlybetter than random predictability for the majority of biologicalprocesses (P < 10 −53 ; Wilcoxon signed rank test unless notedotherwise) (Fig. 2b). AraNet incorporates data from other organisms;we correspondingly observed higher predictability for evolutionarilyconserved processes than for GO processes annotated onlywith plant genes (P < 10 −24 , Wilcoxon rank sum test) (Figs. 2c,d).Nonetheless, genes were correctly associated with plant-specificprocesses at significantly higher rates than expected by chance (P
- Page 3 and 4: volume 28 number 2 february 2010COM
- Page 5 and 6: in this issue© 2010 Nature America
- Page 7 and 8: © 2010 Nature America, Inc. All ri
- Page 10 and 11: NEWS© 2010 Nature America, Inc. Al
- Page 12 and 13: NEWS© 2010 Nature America, Inc. Al
- Page 14 and 15: NEWS© 2010 Nature America, Inc. Al
- Page 16 and 17: © 2010 Nature America, Inc. All ri
- Page 18 and 19: © 2010 Nature America, Inc. All ri
- Page 20 and 21: © 2010 Nature America, Inc. All ri
- Page 22 and 23: NEWS feature© 2010 Nature America,
- Page 24 and 25: uilding a businessComing to termsDa
- Page 26 and 27: uilding a business© 2010 Nature Am
- Page 28 and 29: correspondence© 2010 Nature Americ
- Page 30 and 31: correspondence© 2010 Nature Americ
- Page 32 and 33: correspondence© 2010 Nature Americ
- Page 34 and 35: correspondence© 2010 Nature Americ
- Page 36 and 37: case studyNever againcommentaryChri
- Page 38 and 39: COMMENTARY© 2010 Nature America, I
- Page 40 and 41: COMMENTARY© 2010 Nature America, I
- Page 42 and 43: patents© 2010 Nature America, Inc.
- Page 44 and 45: patents© 2010 Nature America, Inc.
- Page 46 and 47: news and viewsChIPs and regulatory
- Page 48 and 49: news and viewsFrom genomics to crop
- Page 50 and 51: news and views© 2010 Nature Americ
- Page 52 and 53: news and views© 2010 Nature Americ
- Page 56 and 57: e s o u r c e© 2010 Nature America
- Page 58 and 59: e s o u r c e© 2010 Nature America
- Page 60 and 61: e s o u r c e© 2010 Nature America
- Page 62 and 63: © 2010 Nature America, Inc. All ri
- Page 64 and 65: B r i e f c o m m u n i c at i o n
- Page 66 and 67: i e f c o m m u n i c at i o n sAUT
- Page 68 and 69: lettersa1.5 kb hVPrIntron 112.5 kbA
- Page 70 and 71: letters© 2010 Nature America, Inc.
- Page 72 and 73: letters© 2010 Nature America, Inc.
- Page 74 and 75: l e t t e r sReal-time imaging of h
- Page 76 and 77: l e t t e r sFigure 2 Time-lapse li
- Page 78 and 79: l e t t e r s© 2010 Nature America
- Page 80 and 81: l e t t e r sRational design of cat
- Page 82 and 83: l e t t e r s© 2010 Nature America
- Page 84 and 85: l e t t e r s© 2010 Nature America
- Page 86 and 87: sample fluorescence was measured as
- Page 88 and 89: careers and recruitmentFourth quart