13.07.2015 Views

sildes - salsahpc

sildes - salsahpc

sildes - salsahpc

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

VISUALIZING THE PROTEINSEQUENCE UNIVERSEL.STANBERRY 1 , R.HIGDON 1 , W.HAYNES 1 ,N.KOLKER 1 , W.BROOMALL 1 , S.EKANAYAKE 2 ,A.HUGHES 2 , Y.RUAN 2 , J.QIU 2 , E.KOLKER 1 , G.FOX 21SEATTLE CHILDREN’S, 2 INDIANA UNIVERSITYECMLS 2012, 18 June 2012


Outline2A 4th paradigm problem in biology Assigning functions (annotating) proteinsChallengeOur goalPSU: methods & initial resultsConclusionsVisualizing PSUECMLS 2012


Grand Challenge of Functional Genomics3New technologies produce peta- and exabytes ofdataProtein Sequence Universe (PSU), the proteinsequence space, expand exponentially EMP, i5K, iPlant, NEON 30% of existing sequenced proteins unannotatedExisting resources overwhelmed, many unsupported:COG, Systers, ClusTr, eggNOG.Visualizing PSUECMLS 2012


Ultimate Goal: Annotate All Proteins4Our approach: Revitalize, expand & enhance protein annotationresources. Develop sustainable software framework. Use HPC and most powerful CI – grids & clouds. Provide rigorous and reliable tools to annotateprotein sequences.Visualizing PSUECMLS 2012


COG: Clusters of Orthologous Groups5COG database was developed by NCBI. Proteins classified into groups with common functionencoded in complete genomes. Prokaryotes (COG): 66 genomes, 200K proteins, 5Kclusters. Eukaryotes (KOG): 7 genomes, 113K proteins, 5Kclusters. Valuable scientific resource: 5K citations. Last updated: 2006.Visualizing PSUECMLS 2012


Clustering 10 million UniRef1006UniRef100: 10M proteins including 5.3M bacterial &archaealBLAST - common sequence alignment approachAll vs. All alignment on Azure 475 eight-core virtual machines produced 3+billion filtered records in 6 daysVisualizing PSUECMLS 2012


Clustering 10 million UniRef1007Use prokaryotic COG as a starting point.Expand COGs ~20 fold (3.5 million proteins).Cluster 2M proteins into 500K functional groups Single linkage clustering with MapReduceframework on HadoopVisualizing PSUECMLS 2012


Promise and Challenge of Annotation8 Clustering facilitates mass annotationBUT Takes considerable efforts and expertise Multiple cloud systems and compute solutionsVisualizing PSUECMLS 2012


Public Resources9 Struggle to cope with the influx of data Provide limited interactive and analytic capabilities Many no longer supported (SYSTERS, CluSTr, COG) Biological community needs scalable, sustainableand efficient approach to visualize, explore andannotate new data.Visualizing PSUECMLS 2012


Protein Sequence Universe10PSU Goal: Enhance annotation resources withanalytic and visualization (browser) tools.Project sequence data into 3D usingmultidimensional scaling (MDS).MDS interpolation allows expanding the universewithout time consuming all vs all O(N 2 )3D map allows much faster interpolationVisualizing PSUECMLS 2012


Multi-Dimensional Scaling (MDS)11Sammon‘s objective function ijHnijf( ij) d(xi,xis dissimilarity measure between sequences i and jd is Euclidean distance between projections x i and x jDenominator: larger contribution from smallerdissimilaritiesf is monotone transformation of dissimilarity measurechosen “artistically”f( ij)j)2Visualizing PSUECMLS 2012


Typical Metagenomic MDS12Visualizing PSUECMLS 2012


MDS Details13 f chosen heuristically to increase the ratio of standarddeviation to mean for f ( ij) and to increase the range ofdissimilarity measures. O(n 2 ) complexity to map n sequences into 3D.MDS can be solved using EM (SMACOF – fastest butlimited) or directly by Newton's method (it’s just 2 )Used robust implementation of nonlinear 2 minimizationwith Levenberg-Marquardt3D projections visualized in PlotVizVisualizing PSUECMLS 2012


MDS Details14Input Data: 100K sequences from well-characterizedprokaryotic COGs.Proximity measure: sequence alignment scoresScores calculated using Needleman-WunschScores “sqrt 4D” transformed and fed into MDSAnalytic form for transformation to 4D ijndecreases dimension n > 1; increases n < 1“sqrt 4D” reduced dimension of distance data from244 for ij to14 for f( ij )Hence more uniform coverage of Euclidean spaceVisualizing PSUECMLS 2012


3D View of 100K COG Sequences15Visualizing PSUECMLS 2012


Implementation16 NW computed in parallel on 100 node 8-coresystem. Used Twister (IU) in the Reduce phase of MapReduce MDS Calculations performed on 768 core MS HPCcluster (32 nodes) Scaling, parallel MPI with threading intranode Parallel efficiency of the code approximately 70% Lost efficiency due memory bandwidth saturation NW required 1 day, MDS job - 3 days.Visualizing PSUECMLS 2012


Cluster Annotation17COG Annotation Uniref100COG1131 ABC-type multidrug transport system, ATPase component 14406COG1136ABC-type antimicrobial peptide transport system, ATPasecomponent 7306COG1126 ABC-type polar amino acid transport system, ATPase component 4061COG3839 ABC-type sugar transport systems, ATPase component 4121COG0444ABC-type dipeptide/oligopeptide/nickel transport system ATPasecomp 3520COG4608 ABC-type oligopeptide transport system, ATPase component 3074COG3842 ABC-type spermidine/putrescine transport systems, ATPase comp 3665COG0333 Ribosomal protein L32 1148COG0454 Histone acetyltransferase HPA2 and Related acetyltransferases 14085COG0477 Permeases of the major facilitator superfamily 48590COG1028 Dehydrogenases with different specificities 37461Visualizing PSUECMLS 2012


Heatmap of NW vs Euclidean Distances18Visualizing PSUECMLS 2012


Dendrogram of Cluster Centroids19Visualizing PSUECMLS 2012


Selected Clusters20Visualizing PSUECMLS 2012


Heatmap for Selected Clusters21Visualizing PSUECMLS 2012


Future Steps22Comparison Needleman-Wunsch v. Blast v. PSIBlast NW easier as complete; Blast has missing distancesDifferent Transformations distance monotonic function(distance)to reduce formal starting dimension (increase sigma/mean)Automate cluster consensus finding as sequence that minimizesmaximum distance to other sequencesImprove O(N 2 ) to O(N) complexity by interpolating new sequencesto original set and only doing small regions with O(N 2 ) Successful in metagenomics Can use Oct-tree from 3D mapping or set of consensus vectors Some clusters diffuse?Visualizing PSUECMLS 2012


Blast 623Visualizing PSUECMLS 2012


24Full DataBlast 6Originalrun has0.96 cutVisualizing PSUECMLS 2012


25ClusterDataBlast 6Originalrun has0.96 cutVisualizing PSUECMLS 2012


Use Barnes Hut OctTreeoriginally developed tomake O(N 2 ) astrophysicsO(NlogN)26


OctTree for 100Ksample of Fungi27We use OctTree forlogarithmicinterpolation


28440K Interpolated


Conclusions29Data → Knowledge: protein annotationOverwhelming influx of new sequencesAnnotation is an immense challenge.HPC and advanced analytics needed.PSU as tool to facilitate annotation:Interactive visualization and explorationIntegrates info on function, pathways, structure, and environmentMDS preserves grouping structure of protein spaceMDS can use different proximities and biological dataParallel MDS handles large-scale dataMDS interpolation quickly maps new sequences into existing spaceVisualizing PSUECMLS 2012


DELSA: Data → Knowledge → Action30Data-Enabled Life Sciences Alliance InternationalCollective innovation to tackle modern biologicalchallenges through best computational practices andadvanced cyberinfrastructure.Harness expertise and resources across disciplinesPromote accurate, sustainable, scalable approachesFacilitate translation of data influx into tangibleinnovations and groundbreaking discoveriesVisualizing PSUECMLS 2012


References and Resources31COG data is available at the NCBI siteftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COG0303/MDS results are available athttp://manxcatcogblog.blogspot.com/All software used to analyze and visualize thedata is an open source.DELSA: http://www.delsaglobal.org Protein Global Atlas and Data Accessibility ProjectsVisualizing PSUECMLS 2012


Acknowledgements32Grant support NSF: under DBI: 0969929 (EK) and 0910818 (GF) NIH: 5 RC2 HG 005806- 02 (GF); NIGMS grantR01 GM-076680-04 (EK); NIDDK grants U01-DK-089571 and U01-DK-072473 (EK)Visualizing PSUECMLS 2012


25Thank you for your attentionVisualizing PSUECMLS 2012

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!