10.07.2015 Views

Data Analysis Using the R Project for Statistical Computing - NERSC


Data Analysis using the R Project for Statistical Computing
Daniela Ushizima
NERSC Analytics
Lawrence Berkeley National Laboratory


Packages, data visualization and examples
R-PROGRAMMING


R is a language and environment for statistical computing and graphics, a GNU project. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
Download: http://www.r-project.org
Recommended tutorial: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf


1. Why use R?
• Open-source, multiplatform, extensible;
• Easy for users familiar with S/S+, Matlab, Python or IDL;
• Active and growing community:
  – Google, Pfizer, Merck, Bank of America, Boeing, the InterContinental Hotels Group and Shell.


2. R in the scientific community


2.1. You R with NERSC
• Get started with R on DaVinci:
> module load R
> R
> help()
> demo()
> help.start()
> source('your_function.R')
> library(package_name)


3. Extensible
• Add-on packages:
  – Data input/output: hdf5, Rnetcdf, DICOM, etc.
  – Graphics: trellis, gplot, RGL, fields, etc.
  – Multivariate analysis: MASS, mclust, ape, etc.
  – Other languages: Rcpp, Rpy, R.matlab, etc.


4. Statistical analysis and graphs
• Histogram
• Density
• Boxplot
• Multivariate plot
• Conditioning plot
• Contour plot
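A minimal sketch of several of the plot types listed above, using base R only. The built-in `airquality` and `volcano` datasets are stand-ins chosen for this illustration; they are not the data used in the slides.

```r
# Built-in datasets stand in for real data (an assumption of this sketch).
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

par(mfrow = c(2, 2))                                  # 2 x 2 grid of panels
h <- hist(ozone, main = "Histogram", xlab = "ozone")  # histogram
d <- density(ozone)                                   # kernel density estimate
plot(d, main = "Density")
b <- boxplot(Ozone ~ Month, data = airquality,        # one box per month
             main = "Boxplot", xlab = "month", ylab = "ozone")
contour(volcano, main = "Contour plot")               # contour lines of a matrix
```

Multivariate and conditioning plots are covered by `pairs()` and `coplot()` in the following slides.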


4.1. Multivariate plots
Ex: Explanatory variables: solar radiation, temperature and wind; response variable: ozone.
- use of pairs() with data frames to check for dependencies between the variables.
> data = read.table('ozone.data.txt', header=TRUE)
> names(data)
[1] "rad" "temp" "wind" "ozone"
> pairs(data, panel=panel.smooth) # panel.smooth = locally-weighted polynomial regression
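The same idea can be tried without the `ozone.data.txt` file. The sketch below assumes the built-in `airquality` data frame is an acceptable stand-in (it holds comparable ozone, solar radiation, wind and temperature measurements) and renames its columns to match the slide. The correlation matrix is an addition of this sketch, not part of the original example.

```r
# airquality ships with R; it is used here as a stand-in for 'ozone.data.txt'
data <- na.omit(airquality[, c("Solar.R", "Temp", "Wind", "Ozone")])
names(data) <- c("rad", "temp", "wind", "ozone")

# scatterplot matrix with a smoothed curve drawn in every panel
pairs(data, panel = panel.smooth)

# numeric companion to the plot: pairwise correlations
cors <- round(cor(data), 2)
print(cors)
```

Note that the smoother must be passed by name (`panel = ...`), since the second positional argument of `pairs()` for data frames is `labels`.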


4.2. Conditional plots
• Check the relation between the two explanatory variables wind and temp and the response variable ozone;
> coplot(ozone ~ wind | temp, data = data, panel = panel.smooth)


4.3. Package RGL for 3D visualization
• OpenGL
  - rgl.demo.lsystem()
  - kernel density estimation
Use VisIt: https://wci.llnl.gov/codes/visit/


5. Profiling
• Where does your program spend more time?
(Code figure annotations: variable number of arguments; several.times)
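The code from this slide did not survive, so the following is only a sketch of how the question can be answered with base R tools: `system.time()` for a coarse total and `Rprof()` / `summaryRprof()` for a per-function breakdown. The function `several_fits` and its workload are hypothetical and are not taken from the slides.

```r
# Hypothetical workload (not from the slides): repeated small regressions.
several_fits <- function(reps = 500, n = 200) {
  slopes <- numeric(reps)
  for (i in seq_len(reps)) {
    x <- rnorm(n)
    y <- 2 * x + rnorm(n)
    slopes[i] <- coef(lm(y ~ x))[2]   # keep only the fitted slope
  }
  slopes
}

# Coarse view: total time spent in one call
timing <- system.time(slopes <- several_fits())
print(timing)

# Finer view: Rprof() samples the call stack, summaryRprof() aggregates it
prof_file <- tempfile()
Rprof(prof_file, interval = 0.01)
invisible(several_fits())
Rprof(NULL)                           # stop profiling
print(head(summaryRprof(prof_file)$by.total))
```

The `by.total` table lists the functions in which most samples were taken, which is usually the first place to look for optimizations.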


Basics and beyond
EXPLORATORY DATA ANALYSIS


1. Statistical analysis
• Statistical modeling: check for variations in the response variable given explanatory variables;
  – Linear regression
• Multivariate statistics: look for structure in the data;
  – Clustering:
    • Hierarchical
      – Dendrograms
    • Partitioning
      – Kmeans (stats)
      – Mixture-models (mclust)


2. Linear regression
• Ex: Find the equation that best fits the data, given the decay of radioactive emission over a 50-day period
• Linear regression: variables expected to be linearly related;
• Maximum likelihood estimates of parameters = least squares;


2.1. Linear regression
data = read.table('sapdecay.txt', header=T)
attach(data)
par(mfrow=c(1,3))
plot(x, y, main='Decay of radioactive emission over a 50-day period', xlab='days')
# the log(y) gives a rough idea of the decay constant, a, for these data by linear regression of log(y) against x
mylm = lm(log(y) ~ x)
print(mylm$coefficients)
# sum of squares of the difference between the observed yv and predicted yp values of y, given a specific value of parameter a
sumsq
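The slide code is cut off at `sumsq`, so the sketch below is one possible way to carry the idea through, not the original code. It assumes synthetic data in place of `sapdecay.txt` and a decay model of the form y = exp(-a x). The body of `sumsq` and the use of `optimize()` are assumptions of this illustration.

```r
set.seed(1)
# Synthetic stand-in for 'sapdecay.txt': exponential decay with noise
x <- 0:50
a_true <- 0.05
y <- exp(-a_true * x) * exp(rnorm(length(x), sd = 0.05))

# Rough estimate of the decay constant from the slope of log(y) against x
fit <- lm(log(y) ~ x)
a_rough <- -unname(coef(fit)["x"])

# Sum of squared differences between observed yv and predicted yp,
# for a given value of the parameter a (assumed model: yp = exp(-a * xv))
sumsq <- function(a, xv = x, yv = y) {
  yp <- exp(-a * xv)
  sum((yv - yp)^2)
}

# Least-squares estimate: minimize sumsq over a plausible interval
a_ls <- optimize(sumsq, interval = c(0.01, 0.1))$minimum
print(c(rough = a_rough, least_squares = a_ls))
```

Both estimates should land close to the value used to simulate the data, which is a quick check that the two approaches agree.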


3. Cluster analysis
• Hierarchical
  – dendrogram (stats)
• Partitioning
  – kmeans (stats)
• Mixture-models:
  – Mclust (mclust)
Iris dataset: 150 samples of Iris flowers described in terms of their petal and sepal length and width


3.1. Hierarchical clustering
• Analysis of a set of dissimilarities, combined with an agglomeration method:
• Dissimilarities: Euclidean, Manhattan, …
• Methods:
  – ward, single, complete, average, mcquitty, median or centroid.
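As an illustration of the two choices above (dissimilarity and agglomeration method), here is a short sketch on the built-in `iris` data. Euclidean distance, average linkage and a cut into three groups are arbitrary choices made for this example.

```r
x <- iris[, 1:4]                        # the four numeric measurements
d <- dist(x, method = "euclidean")      # dissimilarities; "manhattan" also works
hc <- hclust(d, method = "average")     # agglomeration method

plot(hc, labels = FALSE, main = "Iris: average linkage")  # dendrogram

groups <- cutree(hc, k = 3)             # cut the tree into 3 clusters
tab <- table(groups, iris$Species)      # compare clusters with species
print(tab)
```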


3.2. K-means
• Split n observations into k clusters;
  – each observation belongs to the cluster with the nearest mean.

    setosa versicolor virginica
  1      0         48        14
  2      0          2        36
  3     50          0         0
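A table like the one above can be produced along these lines. The seed and `nstart` value are choices of this sketch, and because cluster labels are arbitrary the rows may come out in a different order.

```r
set.seed(42)                                  # k-means starts from random centers
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

tab <- table(km$cluster, iris$Species)        # clusters (rows) vs species (columns)
print(tab)
print(km$centers)                             # the three cluster means
```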


3.3. Model-based clustering
• Mixture Models
  – Each cluster is mathematically represented by a parametric distribution;
  – Set of k distributions is called a mixture, and the overall model is a finite mixture model;
  – Each probability distribution gives the probability of an instance being in a given cluster.
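In symbols, the finite mixture model described above can be written as follows. The notation is standard but is introduced here, not taken from the slides: k components with weights \pi_j, component densities f_j with parameters \theta_j, and \tau_{ij} for the probability that observation x_i belongs to cluster j.

```latex
f(x) = \sum_{j=1}^{k} \pi_j \, f_j(x \mid \theta_j),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{k} \pi_j = 1,

\tau_{ij} = \frac{\pi_j \, f_j(x_i \mid \theta_j)}
                 {\sum_{l=1}^{k} \pi_l \, f_l(x_i \mid \theta_l)} .
```

Assigning each observation to the cluster with the largest \tau_{ij} gives a hard partition of the data.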


Case study
Accelerated laser-wakefield particles
http://www.lbl.gov/publicinfo/newscenter/features/2008/apr/af-bella.html


Knowledge discovery in LWFA science via machine learning
• PI: C. Geddes (LBNL) in SciDAC COMPASS project, Incite.
• Accomplishments:
  – Described compact electron clouds using minimum enclosing ellipsoids;
  – Developed algorithms to adapt mixture model clustering to large datasets;
• Science Impact:
  – Automated detection and analysis of compact electron clouds;
  – Derived dispersion features of electron clouds;
  – Extensible algorithms to other science problems;
• Collaborators:
  – Tech-X
  – Math Group, LBNL
  – UC Davis, University of Kaiserslautern


Framework
• Goal: automate the analysis of electron bunches by detecting compact groups of particles, subjected to similar momentum and spatio-temporal coherence.


B1. Select relevant particles
• Beams of interest are characterized by high density of high-energy particles:
Representation of particle momentum in one time step: spline interpolation onto a grid for visualization of irregularly spaced input data.
1. Elimination of low energy particles (px


B2. Kernel-based estimation
• Kernel density estimators are less sensitive than histograms to the placement of the bin edges;
• Goal: retrieve a dense group of particles with similar spatial and momentum characteristics: argmax f(x,y,px), Neighborhood: 2 µm
Packages: misc3d, rgl, fields
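A one-dimensional sketch of the argmax idea, using base R's `density()` on synthetic values. The slides work with a three-dimensional density f(x, y, px) and real simulation output; the data, the single dimension and the default bandwidth here are simplifying assumptions of this example.

```r
set.seed(7)
# Synthetic 1-D "momentum" values: diffuse background plus a dense bunch near 10
px <- c(runif(2000, min = 0, max = 20), rnorm(500, mean = 10, sd = 0.2))

kde <- density(px)                     # Gaussian kernel, default bandwidth
px_peak <- kde$x[which.max(kde$y)]     # location of the density maximum (argmax)
print(px_peak)

plot(kde, main = "Kernel density estimate of px")
abline(v = px_peak, lty = 2)           # mark the densest region
```

Unlike a histogram, the estimate does not change when bin edges are shifted, only when the bandwidth is changed.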


B3. Identify beam candidates
• Detection of compact groups of particles independent of being a maximum in one of the variables;


B4. Cluster using mixture models
• Model and number of clusters can be selected at runtime (mclust);
• Partition of multidimensional space;
• Assume that the functional form of the underlying probability density follows a mixture of normal distributions;
Packages: mclust, rgl


B5. Evaluation of compactness
• Bunches of interest move at speed ≈ c, hence are nearly stationary in the moving simulation window;
• Moving averages smooth out short-term fluctuations and highlight longer-term trends.
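A small sketch of the moving-average step, using `stats::filter()` from base R. The signal, noise level and window width `k` are made up for this example and do not come from the LWFA analysis.

```r
set.seed(3)
step <- 1:200                                     # time steps
trend <- sin(2 * pi * step / 100)                 # slow, longer-term trend
noisy <- trend + rnorm(length(step), sd = 0.3)    # add short-term fluctuations

k <- 9                                            # window width (odd, centered)
ma <- stats::filter(noisy, rep(1 / k, k), sides = 2)

# The centered window is undefined at the edges, so compare where it exists
ok <- !is.na(ma)
err_raw <- mean((noisy[ok] - trend[ok])^2)
err_ma <- mean((ma[ok] - trend[ok])^2)
print(c(raw = err_raw, smoothed = err_ma))
```

A wider window smooths more aggressively but also flattens genuine short-lived features, so `k` is a trade-off.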


Packages, challenges and new businesses
HIGH PERFORMANCE COMPUTING


1. Improve performance/reusability
• Good coding: avoid loops, use vectorization;
• Extend R using compiled code:
  – packages: Rcpp, inline
• Recycle your Python codes:
  – Package: Rpython
• Parallelism:
  – Explicit: packages Rmpi, Rpvm, nws
  – Implicit: packages pnmath, pnmath0 for multithreaded math functions
• Use out-of-memory processing with
  – packages bigmemory and ff
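To make the first bullet concrete, the sketch below standardizes a vector twice, once with an explicit loop and once with vectorized arithmetic. The function names are hypothetical, and timings will vary by machine.

```r
x <- runif(1e5)

# Loop version: one interpreted iteration per element
loop_scale <- function(x) {
  out <- numeric(length(x))           # preallocate the result
  m <- mean(x)
  s <- sd(x)
  for (i in seq_along(x)) out[i] <- (x[i] - m) / s
  out
}

# Vectorized version: the arithmetic runs over the whole vector at once
vec_scale <- function(x) (x - mean(x)) / sd(x)

t_loop <- system.time(z1 <- loop_scale(x))
t_vec <- system.time(z2 <- vec_scale(x))
print(rbind(loop = t_loop, vectorized = t_vec)[, 1:3])
```

Both versions return the same values; the vectorized form is shorter and typically faster.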


2. What is going on with HPC in R?
• Parallelism:
  – Multicore: multicore, pnmath, …
  – Computer cluster: snow, Rmpi, rpvm, …
  – Grid computing: GRIDR, …
• GPU:
  – gputools: parallel algorithms using CUDA + CUBLAS
• Extremely large data:
  – ff: memory mapped pages of binary flat files.
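For the multicore case, a minimal sketch using the `parallel` package, which ships with current R and provides the functionality of the older multicore and snow packages. It assumes a Unix-like system, since `mclapply()` relies on forking and cannot use more than one core on Windows. The function `slow_square` is a hypothetical stand-in for real work.

```r
library(parallel)

# Hypothetical task: pretend each item takes a little while to process
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

n_cores <- 2                                       # assumed number of workers
res_serial <- lapply(1:8, slow_square)             # one item after another
res_par <- mclapply(1:8, slow_square, mc.cores = n_cores)  # forked workers

print(unlist(res_par))
```

The parallel call returns its results in the same order as `lapply()`, so it can often be dropped in with no other changes.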


3. Nothing is perfect…
• Limits on individual objects: in R versions before 3.0.0, the maximum number of elements of a vector is 2^31 – 1 (later 64-bit builds support longer vectors);
• R will take all the RAM it can get (Linux only);
• More information, type:
> help('Memory-limits')
> gc() # garbage collector
> object.size(your_obj) # size of your object
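The commands above can be tried on a throwaway object. In this sketch `your_obj` is just a placeholder vector of one million doubles.

```r
max_len <- .Machine$integer.max       # 2^31 - 1, the classic vector length limit
print(max_len)

your_obj <- numeric(1e6)              # placeholder object: one million doubles
size_bytes <- object.size(your_obj)   # memory used by this object
print(size_bytes, units = "Mb")

rm(your_obj)                          # drop the reference ...
mem <- gc()                           # ... and let the garbage collector reclaim it
print(mem)
```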


Take home
Source: http://www.nettakeaway.com/tp/R/129/understanding-r


References
• Michael J. Crawley. Statistics: An Introduction using R. Wiley, 2005. ISBN 0-470-02297-3.
  – data: http://www.bio.ic.ac.uk/research/mjcraw/therbook/
• Robert H. Shumway and David S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, New York, 2006. ISBN 978-0-387-29317-2
• Basics
  – http://cran.r-project.org/doc/contrib/Short-refcard.pdf (cheat sheets)
  – http://cran.r-project.org/doc/contrib/refcard.pdf
  – http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
  – http://www.manning.com/kabacoff/Kabacoff_MEAPCH1.pdf
• Intermediate
  – http://math.acadiau.ca/ACMMaC/Rmpi/basics.html
  – User-lists


Acknowledgements
http://www.sciviews.org/Tinn-R/
