25.11.2014 Views

Statistical Methods in Bioinformatics Biology as Information Science ...


Biostat 226 (TICR Biostat V) Winter 2007, Special Topic II

Statistical Methods in Bioinformatics

Ru-Fang Yeh, PhD, UCSF Biostatistics

Biology as Information Science

• Lecture II.1: An Introduction to Bioinformatics
- Genomics & bioinformatics in clinical research
- Introduction to high-throughput technology & statistical issues
• Lecture II.2: Statistical methods for microarray data analysis
- Finding association between expression and phenotypes
- Multiple testing
- Sample size calculation
• Lecture II.3: Statistical data mining
- Clustering, classification & prediction

Levels of biological information: sequence, expression, structure, function, system

• DNA: 4 bases (A:T, C:G)
• RNA: 4 bases (A:U, C:G)
• Protein: 20 amino acids, ~10000 domains, ~300 folds

TICR Biostat V 2007, Topic II -- Statistical Methods in Bioinformatics

Biology as Information Science: in Large Scale!

• sequence: DNA, Genome
- Human Genome Project: define parts list, a periodic table of human biology
• expression: RNA, Transcriptome
• structure, function: Protein, Proteome
• system: Systems biology (Regulome, Epigenome, Metabolome, Physiome...)


A brief history of the Human Genome Project

• 1859 Darwin’s “The origin of species” - natural selection & evolution
• 1865 Mendel’s peas – “genes”
• 1910 Morgan’s flies – “chromosomes”
• 1944 genes made of DNA
• 1953 Watson & Crick – DNA double helix with base pairing
• 1972 Berg - first recombinant DNA
• 1977 Sanger, Maxam & Gilbert – DNA sequencing methods
- First genome (bacteriophage virus) sequenced
• 1985 Mullis – PCR
• 1986 Hood – automated sequencer
- Idea of a large genome sequencing effort
• 1987 Olson - YAC
- First genetic map published
• 1988 Human Genome Project initiated: $3 billion, 15 year plan 1990-2005
• 1990 HGP officially began
- Venter – shotgun sequencing a gene
- Lipman et al - BLAST
• 1992 First physical map of chromosomes
• 1995 H. influenzae genome sequenced
• 1997 S. cerevisiae genome sequenced
• 1998 C. elegans genome sequenced
- Celera founded and entered the sequencing race
• 1999 Chromosome 22 completed
• 2000 D. melanogaster genome sequenced jointly by Celera & BDGP
• 2001 “Draft” human genome published; A. thaliana genome completed
• 2002 Draft mouse genome published; Plasmodium & mosquito genomes
• 2003 Completion of the human genome

How to sequence a genome

Whole-genome shotgun

Technology Advances

• Moore’s law
• HGP began: ~ $1/bp
• HGP completed: ~ $0.01/bp

From Shendure et al. 2004. Nat Rev Genet 5: 335-344. Advanced sequencing technologies: methods and goals.

Theoretical progress in shotgun sequencing

• Requires 8-10x sequence redundancy for coverage > 99.9%
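The 8-10x redundancy figure quoted for shotgun sequencing can be sanity-checked with a simple Poisson argument. The slides do not name a model, so the sketch below assumes the idealized Lander-Waterman-style setting: reads land uniformly at random, and the expected covered fraction at redundancy c is 1 - exp(-c). The function names are illustrative only.

```python
import math


def expected_coverage(redundancy: float) -> float:
    """Expected fraction of bases covered by at least one read.

    Assumption: reads are placed uniformly at random, so the number of
    reads covering a base is roughly Poisson with mean `redundancy`.
    """
    return 1.0 - math.exp(-redundancy)


def redundancy_for(target_fraction: float) -> float:
    """Redundancy c that solves 1 - exp(-c) = target_fraction."""
    return -math.log(1.0 - target_fraction)


if __name__ == "__main__":
    for c in (1, 4, 8, 10):
        print(f"{c:>2}x redundancy -> {expected_coverage(c):.5f} of bases covered")
    needed = redundancy_for(0.999)
    print(f"99.9% coverage needs about {needed:.1f}x in this idealized model")
```

In this idealized model, 99.9% coverage is reached at roughly 7x, so the quoted 8-10x leaves some headroom above the theoretical minimum.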



The Last Decade of Progress: Genome Sequencing

[Timeline figure, 1995-2005. Genomes shown: H. influenzae (1.8 Mb), S. cerevisiae (12 Mb), E. coli (5 Mb), C. elegans (100 Mb), D. melanogaster (122 Mb), A. thaliana (115 Mb), human genome draft, completed human genome (3000 Mb) with individually completed human chromosomes (starting with chromosome 22 in 1999), and draft model organisms (mouse, Fugu rubripes, rat, chimpanzee, dog). Cost labels: $3 billion and $10 million.]

From Collins et al. 2003. Nature 422: 835-847. A vision for the future of genomics research.

What’s Next?

• ENCyclopedia Of DNA Elements (ENCODE)
- Functional elements/units of DNA
• The International HapMap Project
- Genetic variation (SNPs) among individuals
• Chemical Genomics
- Libraries of small-molecule chemical compounds for screening
• Genomes to Life
- Understand how single-cell microbes function
• Structural Genomics Consortium
- 3-dimensional structures of proteins
• Cancer Genome Anatomy/Atlas Project
- Molecular profiling of normal, pre-cancer and cancer cells

Implications of Genomics in Clinical Research

• Understand disease processes
- Genetic determinants
- Molecular diagnosis
- Guiding choice of therapy
- Develop molecular target drugs
- Develop gene therapy
• Determine genetic predisposition to diseases
- Individualized therapy
- Prevention
- Redefine ethnicity


Finding Genetic Factors in Complex Diseases

Example: Age-Related Macular Degeneration

• Leading cause of blindness in the developed world
• Characterized by progressive destruction of the retina’s central region (macula: cone cells), causing central field visual loss
• Complex disease -- known risk factors: age, smoking, diet (lipid intake), family history
• Linkage: 1q31 and other regions
• 3 independent case-control association studies in non-Hispanic white populations all found the strongest association with the complement factor H gene (CFH), then used haplotype analyses to identify the specific risk polymorphism Tyr402His. (Klein et al, Edwards et al, Haines et al, Science 2005.)
• The histidine allele accounts for 20-50% of the overall risk.

ARMD (cont.)

[Figures from Haines JL et al Science 2005 and Klein RJ et al Science 2005]

ARMD (cont.)

Biology makes sense too --

• CFH modulates the complement cascade of the immune response, and the known risk factors (diet, smoking, age) and the clinical hallmarks (multiple drusen, geographic atrophy, choroidal neovascularization) all correlate with complement activity.
• The risk allele reduces binding to heparin and C-reactive protein (CRP), which is known to have elevated serum levels in ARMD patients.

Good, now what?

• Evaluate prospective risks for individuals carrying the histidine allele, and in other ethnic groups.
• Complement activity is responsive to drug therapy and life-style changes -- as potential treatment/prevention.

Molecular Target Drug

Example I: Gleevec (imatinib mesylate)

• First oncology drug developed with rational drug design
• Chronic Myeloid Leukemia:
- 1-2 cases per 100,000 per year
- < 20% of patients cured with the main treatment option (stem-cell transplantation)
- 95% of patients have the Philadelphia chromosome t(9;22)(q34;q11) and the resulting Bcr-Abl fusion protein, which functions as a tyrosine kinase.
• Animal experiments established Bcr-Abl as the cause of CML
• Drug target: tyrosine kinase inhibitor

Eureka!!!


Imatinib (cont.)

• Imatinib also inhibits other tyrosine kinases:
- the type III transmembrane receptor KIT: gastrointestinal stromal tumors
- platelet-derived growth factor receptor (PDGFR) β: chronic myeloproliferative diseases
• Patients with idiopathic hypereosinophilic syndrome responded to imatinib due to a novel fusion tyrosine kinase FIP1L1-PDGFRα (Cools et al NEJM 2003)

Molecular Target Drug

Example II: Herceptin (trastuzumab)

• The HER-2/neu (erbB-2) gene, encoding the human epidermal growth factor receptor 2, is amplified in up to 30% of breast cancers. Overexpression of HER2 protein leads to cell proliferation and affects tumor progression and the response of tumors to chemotherapy.
• Herceptin: a monoclonal antibody to HER2 protein.
• FDA-approved for HER2+ metastatic breast tumors; also showing absolute benefits in the early results of a Phase III trial as adjuvant therapy for HER2+ early stage breast cancer (Piccart-Gebhart et al NEJM 2005).
• Test HER2 by immunohistochemistry (for protein expression) or FISH (for gene amplification).

Reverse vaccinology

Example: serogroup B Neisseria meningitidis (MenB)

Fraser CM. 2004. Nat Rev Genet. 5(1):23-33.

Bioterrorism/New Epidemic

Example: Severe Acute Respiratory Syndrome

• SARS outbreak began in Nov 2002 in China’s Guangdong province, then spread rapidly around the globe through close person-to-person contact.
• 3/29/2003, Dr. Carlo Urbani, the doctor who first identified SARS, dies of the illness. By then there were 1400+ cases worldwide, including 50+ deaths.
• In late March, UCSF’s Joe DeRisi used his “virus chip”, a microarray that contains exemplar sequences of ~1000 known viruses, to determine that a new form of coronavirus was present in the SARS patient samples.
• By mid-April, the genome of the SARS coronavirus was sequenced.
• 2004, SARS vaccine human trials.


Molecular Prognosis

Example: breast cancer

Van de Vijver MJ, He YD et al. NEJM 2002.

Genome Resource Portals

• NCBI Genbank http://www.ncbi.nlm.nih.gov
• Ensembl http://www.ensembl.org
• UCSC Genome Browser http://www.genome.ucsc.edu

Additional Resource

• dbSNP: http://www.ncbi.nlm.nih.gov/SNP/
  HapMap: http://hapmap.org/
• OMIM: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
  Genes and disease: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=gnd.preface.91
• Nature Genome Gateway: http://www.nature.com/genomics/
  A User’s Guide to the Human Genome II: http://www.nature.com/ng/journal/v35/n1s/index.html
  Nature Genetics Supplements: http://www.nature.com/ng/supplements/index.html

UCSF Resources

• Functional Genomics Microarray Cores:
  http://arrays.ucsf.edu
  http://derisilab.ucsf.edu/core/
  http://cancer.ucsf.edu/array/
• Center for Bioinformatics & Molecular Biostatistics:
  http://www.biostat.ucsf.edu/cbmb
• Cancer Center Biostatistics Core:
  http://cancer.ucsf.edu/biostat


Reading/References

• Guttmacher AE, Collins FS. Genomic medicine - a primer. NEJM 2002 347(19):1512-20.
• Guttmacher AE, Collins FS. Realizing the promise of genomics in biomedical research. JAMA. 2005 Sep 21;294(11):1399-402.
• Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature. 2003 Apr 24;422(6934):835-47.

Course Aims

[Flowchart: Biological Question, Experimental Design, High-throughput Data Generation, Data Preprocessing & QC, Data Analysis, Data Interpretation, Question answered? (YES: Papers in Nature/JAMA/NEJM…; NO: return to an earlier step). Additional boxes: Statistical Question, Statistical Methods.]

Different High-throughput Technology

• Spotted array
• Nylon membrane
• GeneChip Affymetrix
• Illumina Bead Array
• Agilent: Long oligo Ink Jet
• SAGE
• Sequencer
• Mass Spectrometry


Microarray Applications

Array | Probes on the array | Targets to be hybridized | Large-scale analysis of…
Gene Expression | DNA (cDNA, oligos: gene representatives) | mRNA/cDNA | transcriptional alterations
CGH | DNA (clones, oligos) | DNA | Genomic changes in cancers
SNP | DNA (oligos) | DNA | Genotyping; Genomic changes
Methylation | DNA (CpG island) | DNA (bisulfite-treated) | Methylation-status in genes
Promoter | DNA (promoter ~1kb) | DNA (ChIP-enriched) | Transcription factor binding sites; histone modifications
Tiling | DNA | All of the above | All of the above; sequencing; gene annotation
Protein | antibody | protein | Protein expression (ELISA)
Tissue | tissues | proteins | Histology; protein expression (immunohistochemistry)

Statistical Issues

Technology dependent:
• Preprocessing & QC

Biological question dependent:
• Hypothesis testing & Multiplicity
• Classification, prediction
• Clustering

Both technology & biological question dependent:
• Experimental design
• Meta-data integration

Differential Expression Using Spotted Arrays

Example Question:
Identifying σE-dependent genes in E. coli (Rhodius et al 2006 PLoS Biol.)

Experiment:
Transcript profiling using spotted arrays in the wild type E. coli K-12 strain (low σE) versus a strain over-expressing σE.

Statistical Issues

• Experimental design:
- Which array? How many samples? Which to co-hybridize?
• Data preprocessing:
- Quality assessment, image analysis, normalization
• Combining replicates for differential expression:
- Effect estimate, statistical significance, multiple testing
• Annotation:
- Shared characteristics (functional groups, motifs) among DE genes?
• Data archive:
- Data submission to public database

Experimental Design

• Probe design: which sequence to print on the array in which platform
• Target design: allocation of mRNA samples to the slides

Two-Color Spotted Arrays

[Workflow figure. Array side: Arrayed Library of cDNA/oligos, PCR amplification, Printing, Coupling, Denaturing. Sample side: Cell/Tissue, Total RNA, Labeled cDNA. Then: Hybridization, Scanning (Laser 1 + Laser 2) & Image Analysis.]

Experimental Design

Considerations

• Scientific Aims
• Practical considerations
- Types and amount of mRNA samples
- Number of slides/chips available
• Other information: prior experiments, controls planned, etc.
• Statistical principles: randomization, replication, local control.


Two-color array specific design issues

• Direct vs Indirect Comparison?
- Direct (A and B hybridized on the same array): average( log(A/B) )
- Indirect (A and B each hybridized against Ref): log(A/Ref) - log(B/Ref)
• Time Course: T1 T2 T3 T4 T5 T6 T7

Two-color Array Design (cont.)

• K > 2 Conditions
(i) All pairs (e.g. A1, A2, A3, A4 each compared directly with every other): r K(K-1)/2 arrays
(ii) Common reference (A1, A2, A3, A4 each compared with Ref): r K arrays
#arrays needed for the same precision for all pair comparisons

Reference: Yang YH and Speed TP. 2002. Nat. Rev. Genetics. Design Issues for cDNA Microarray Experiments.
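The direct versus indirect trade-off and the array counts for K > 2 conditions can be made concrete with a little arithmetic. The sketch below assumes every array yields an independent log-ratio with the same variance sigma2. That is a simplifying assumption, not something the slides state, and the function names are illustrative only.

```python
def var_direct(n_arrays: int, sigma2: float = 1.0) -> float:
    """Variance of average(log(A/B)) over n_arrays direct A-vs-B arrays."""
    return sigma2 / n_arrays


def var_indirect(n_arrays: int, sigma2: float = 1.0) -> float:
    """Variance of log(A/Ref) - log(B/Ref) when n_arrays are split evenly
    between A-vs-Ref and B-vs-Ref hybridizations."""
    if n_arrays % 2:
        raise ValueError("n_arrays must be even to split between A and B")
    per_arm = n_arrays // 2
    # Two independent averages are subtracted, so their variances add.
    return sigma2 / per_arm + sigma2 / per_arm


def arrays_all_pairs(k: int, r: int) -> int:
    """Arrays for r replicates of every pairwise comparison of k conditions."""
    return r * k * (k - 1) // 2


def arrays_common_reference(k: int, r: int) -> int:
    """Arrays for r replicates of each of k conditions against a reference."""
    return r * k


if __name__ == "__main__":
    print("2 arrays, direct:  ", var_direct(2))
    print("2 arrays, indirect:", var_indirect(2))
    for k in (3, 4, 6):
        print(k, arrays_all_pairs(k, r=2), arrays_common_reference(k, r=2))
```

Under this assumption, two arrays used directly give a quarter of the variance of the same two arrays used through a reference. The all-pairs design grows quadratically in K, while the common-reference design grows linearly.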

Preprocessing

• Image Analysis: To identify & extract signal & background
- GenePix files (*.gpr): Cy5 Fg, Bg; Cy3 Fg, Bg
• Within-Slide Normalization: To identify and remove systematic effects not due to real biological signal.
• Quality Assessment: To identify bad quality arrays/spots.

Type of “Bias”

• Signal-dependent
• Spatial
• Print-tip/Plate

Plot axes: $M = \log_2(\mathrm{Cy5}/\mathrm{Cy3})$ against $A = \tfrac{1}{2}\left(\log_2 \mathrm{Cy5} + \log_2 \mathrm{Cy3}\right)$


Use Log (base 2) Intensity for Analysis

Boxplots

• Box: 1st quartile (25%), median, 3rd quartile (75%)
• Inter-quartile range (IQR): a robust measure of the variability
• Whiskers: extend up to 1.5 x IQR beyond the quartiles
• Points beyond the whiskers are drawn as outliers
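The boxplot quantities named above (quartiles, IQR, the 1.5 x IQR rule for outliers) can be computed directly. This is a minimal sketch on made-up log2 intensities. The quartile convention (`method="inclusive"`) is one common choice and an assumption here, since plotting software differs in how it interpolates quartiles.

```python
import statistics


def boxplot_summary(values):
    """Return quartiles, IQR, fences and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Hypothetical log2 intensities for one array; 15.5 is an extreme spot.
    log2_intensity = [8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 15.5]
    for key, value in boxplot_summary(log2_intensity).items():
        print(key, value)
```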

MA plot (log Ratio vs log Intensity)

• M (minus): $M = \log_2(\mathrm{Cy5}/\mathrm{Cy3})$, the log ratio
• A (add): $A = \tfrac{1}{2}\log_2(\mathrm{Cy5}\cdot\mathrm{Cy3})$, the average log intensity
• Corresponds to rotating the scatter plot of log2(Cy5) against log2(Cy3) by 45 degrees

Spatial plots/Heatmaps
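As a small worked example of the M and A definitions, the sketch below converts a pair of background-corrected channel intensities into (M, A). The function name and the input values are hypothetical.

```python
import math


def ma_values(cy5: float, cy3: float) -> tuple[float, float]:
    """Return (M, A) for one spot from positive Cy5 and Cy3 intensities.

    M = log2(Cy5 / Cy3)              (log ratio, "minus")
    A = (log2(Cy5) + log2(Cy3)) / 2  (average log intensity, "add")
    """
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive to take logarithms")
    m = math.log2(cy5 / cy3)
    a = 0.5 * (math.log2(cy5) + math.log2(cy3))
    return m, a


if __name__ == "__main__":
    # A spot four times brighter in Cy5 than in Cy3.
    print(ma_values(1024.0, 256.0))
    # Equal intensities in both channels give M = 0.
    print(ma_values(512.0, 512.0))
```

Working with (M, A) instead of the raw channels puts the quantity of interest, the log ratio, on its own axis, so intensity-dependent trends in M are easy to see.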

Normalization

Which genes to use

• All genes on the array.
• Constantly expressed genes.
• Spiked controls (e.g. plant genes).
• Genomic DNA titration series.
• Rank invariant set.

Normalization methods

• Ratios [two channels], within-slide
-- Median
-- Loess
-- Print-tip / pins
• Intensities [separate channel], within- and between-slide
-- ANOVA
-- Quantile normalization
-- VSN

Normalization (Cont.)

• Global: Median
• Loess (or other smoother)

Assumption: No correlation between M and A, or M and layout.
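The simplest of the within-slide methods listed, global median normalization, just recentres the log ratios so that their median is zero. The sketch below shows that, plus a per-print-tip variant that uses a median in place of a loess fit. That per-tip median is a deliberate simplification for illustration, not the print-tip loess method itself, and all names and numbers are hypothetical.

```python
import statistics
from collections import defaultdict


def median_normalize(m_values):
    """Global median normalization: subtract the array-wide median of M."""
    center = statistics.median(m_values)
    return [m - center for m in m_values]


def print_tip_median_normalize(m_values, tip_ids):
    """Subtract the median of M within each print-tip group.

    Simplified stand-in for print-tip loess: it removes a constant offset
    per tip but ignores any dependence of M on intensity A.
    """
    groups = defaultdict(list)
    for m, tip in zip(m_values, tip_ids):
        groups[tip].append(m)
    centers = {tip: statistics.median(ms) for tip, ms in groups.items()}
    return [m - centers[tip] for m, tip in zip(m_values, tip_ids)]


if __name__ == "__main__":
    m = [1.0, 1.5, 2.0, -0.5, 0.0, 0.5]
    tips = [1, 1, 1, 2, 2, 2]
    print(median_normalize(m))
    print(print_tip_median_normalize(m, tips))
```

Both variants rely on the assumption that most genes are not differentially expressed, so the bulk of the M values should centre on zero.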

Normalization (Cont.)

[Figure: fitted and normalized values under successive corrections: Print-tip, + 2D-loess, + “Local” (Wilson et al 2003)]

Single-channel Normalization

• Necessary for analysis methods that model log-intensities (e.g. Wolfinger et al, Kerr et al)

Two stages:
a) within slide: use two-channel methods to remove spatial bias
b) between all single channels: use quantile normalization or other Affy normalization methods
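Quantile normalization, named as the between-channel step, forces every channel to share the same empirical distribution: each value is replaced by the mean of the values holding the same rank across channels. The sketch below is a minimal pure-Python version. It assumes equal-length channels and breaks ties by position rather than averaging them, which a production implementation may handle differently.

```python
def quantile_normalize(channels):
    """Quantile-normalize a list of equal-length log-intensity vectors.

    Each value is replaced by the mean, across channels, of the values
    that share its rank. Ties are broken by original position (assumption).
    """
    n = len(channels[0])
    if any(len(ch) != n for ch in channels):
        raise ValueError("all channels must have the same number of spots")

    # Reference distribution: mean of the k-th smallest value per channel.
    sorted_channels = [sorted(ch) for ch in channels]
    reference = [sum(col) / len(channels) for col in zip(*sorted_channels)]

    normalized = []
    for ch in channels:
        order = sorted(range(n), key=lambda i: ch[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    channels = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
    for original, result in zip(channels, quantile_normalize(channels)):
        print(original, "->", result)
```

After this step, every channel contains exactly the same set of values, and only the assignment of values to spots differs.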

Quality Assessment

Tools:

• Red/Green overlay images
• Diagnostic plots, especially on sets of positive and negative controls
• Quantitative statistics: signal to noise ratio, mean/median, variances…

Software: R/Bioconductor

> library(arrayQuality)        # load the Bioconductor arrayQuality package
> gpQuality(organism="Mm")     # diagnostic plots for GenePix files; "Mm" = mouse

gpQuality() diagnostic plot

[Figure panels: Bad: high bg; Good: low bg; M vs A; Normalized M vs A; Pos/Neg Controls]

Software for Preprocess<strong>in</strong>g<br />

• R / Bioconductor:<br />

marray maNorm<br />

limma normalizeWith<strong>in</strong>Arrays, normalizeBetweenArrays<br />

arrayQuality gpQuality<br />

vsn<br />

• GUI:<br />

limmaGUI http://bio<strong>in</strong>fo.wehi.edu.au/limmaGUI/<br />

• Most commercial and free gene expression analysis<br />

packages offer standard<br />

- Global normalization<br />

- (Pr<strong>in</strong>t-tip) loess normalization.<br />

Preprocessing Reference

Bioconductor book: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Edited by R Gentleman, V Carey et al. Springer-Verlag New York. 2005.

Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, Vol. 30, No. 4, e15.

W Huber, A von Heydebreck, H Sueltmann, A Poustka, M Vingron (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18, Suppl. 1, S96-S104.

Y.H. Yang and N. Thorne (2003). Normalization for Two-color cDNA Microarray Data. Science and Statistics: A Festschrift for Terry Speed, D. Goldstein (ed.), IMS Lecture Notes, Monograph Series, Vol 40, pp. 403-418.



Alternative DNA Microarray Platform: Affymetrix High Density Short Oligo Arrays

For one gene (probe set): 11 probe pairs/gene for Human U133plus2

mRNA reference sequence (5’ to 3’):
  …TCGTCTGTATCACAGACACAAAGTTGACTG…

Probe sequences:
  PM (Perfect Match): CAGACATAGTGTCTGTGTTTCAACT
  MM (MisMatch):      CAGACATAGTGTGTGTGTTTCAACT

[Figure: mask design and in situ synthesis by lithography; fluorescent PM and MM probe intensities on a 1.28 cm chip.]

Workflow: Cell/Tissue → Total RNA → Biotin-labeled cRNA → Hybridization → Scanning & Image Analysis
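The relationship between the reference sequence and the PM/MM probe pair shown on this slide can be checked with a few lines of base R. This is only an illustrative sketch using the three sequences printed above; the helper name `complement` is introduced here and is not part of any package.

```r
ref <- "TCGTCTGTATCACAGACACAAAGTTGACTG"   # mRNA reference fragment from the slide
pm  <- "CAGACATAGTGTCTGTGTTTCAACT"        # perfect match probe (25-mer)
mm  <- "CAGACATAGTGTGTGTGTTTCAACT"        # mismatch probe (25-mer)

# Base-by-base complement of a DNA string (hypothetical helper).
complement <- function(s) chartr("ACGT", "TGCA", s)

# Where the complement of the PM probe aligns within the reference fragment.
pm_start <- as.integer(regexpr(complement(pm), ref, fixed = TRUE))

# Position(s) at which PM and MM differ.
diff_pos <- which(strsplit(pm, "")[[1]] != strsplit(mm, "")[[1]])
```

With these sequences the PM probe is complementary to the reference starting at its third base, and the MM probe differs from the PM probe only at the middle (13th) position, where the base is replaced by its complement.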


Preprocessing of Affymetrix data

Computing expression measures as a three-step procedure:
• Background subtraction (B)
• Normalization (N) to facilitate between-array comparison
• Summarization of 11-20 probe pair (PM/MM) intensities to one probe set value (S).

Let X be CEL file data from multiple arrays; then
Expression values = S(N(B(X)))

Data: a matrix of k genes by n samples

[Figure: Affymetrix data flow. Hybridization + Scanning → DAT File [Image, pixel intensities] → Image analysis → CEL File [Probe Cell Intensity] + CDF [Chip Description File] → Pre-processing (MAS, GCOS, dChip, RMA) → workable raw data: CHP File (Intensity value, Absent / Present call), Excel File (Intensity Value), Text File (RMA Value, Log 2 (Intensity)).]
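The composition S(N(B(X))) can be made concrete with toy stand-ins for each step. The sketch below is not MAS5, dChip, or RMA: the three functions are deliberately simple placeholders (minimum-based background shift, median scaling, median of log2 probe values), and the simulated intensities and all names are assumptions for illustration.

```r
set.seed(2)
n_sets <- 4; n_probes <- 11; n_arrays <- 3

# Probe-level intensities: rows = probes, columns = arrays (simulated).
probe_set <- rep(paste0("gene", seq_len(n_sets)), each = n_probes)
X <- matrix(100 + rexp(n_sets * n_probes * n_arrays, rate = 1 / 500),
            ncol = n_arrays,
            dimnames = list(NULL, paste0("array", seq_len(n_arrays))))

# B: placeholder background subtraction, shifting each array so its minimum is 1.
B <- function(x) sweep(x, 2, apply(x, 2, min) - 1)

# N: placeholder normalization, scaling each array to a common median.
N <- function(x) sweep(x, 2, apply(x, 2, median) / median(x), FUN = "/")

# S: placeholder summarization, median of log2 intensities within each probe set.
S <- function(x, sets) apply(log2(x), 2, function(col) tapply(col, sets, median))

# Expression values = S(N(B(X))): a matrix of k genes by n samples.
expr <- S(N(B(X)), probe_set)
```

Swapping in a different B, N, or S changes the method while leaving the overall pipeline shape unchanged, which is how the preprocessing methods on the following slides differ from one another.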



Affymetrix Data Preprocessing Methods

• Affymetrix: MAS v5.1
  $\text{Signal} = \text{TukeyBiweight}\{\log(PM_j - MM_j^*)\}$
• Robust multi-array analysis (RMA, Irizarry et al 2003): use median polish or robust regression fit (IRLS) and estimate background from PMs
  $\log_2 (PM - BG)^*_{ij} = \text{chip}_i + \text{probe}_j + \varepsilon_{ij}$
• dChip (Li, Schadt & Wong 2001): Maximum likelihood estimate.
  $PM_{ij} - MM_{ij} = \text{chip}_i \, \text{probe}_j + \varepsilon_{ij}$

[Figure panel: RMA log expression values.]
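The two robust summaries named above, a Tukey biweight average (MAS5) and a median polish fit of the additive chip + probe model (RMA), can be sketched in base R. This is an illustration on toy numbers, not the affy package code: the tuning constants `c = 5` and `eps = 1e-4`, the simulated effects, and all variable names are assumptions of this sketch. `medpolish` is the base R (stats) median polish function.

```r
# One-step Tukey biweight location estimate: observations far from the
# median (in units of the median absolute deviation) get weight zero.
tukey_biweight <- function(x, c = 5, eps = 1e-4) {
  m <- median(x)
  s <- median(abs(x - m))
  u <- (x - m) / (c * s + eps)
  w <- ifelse(abs(u) <= 1, (1 - u^2)^2, 0)
  sum(w * x) / sum(w)
}

log_diffs <- c(1, 2, 3, 4, 100)   # toy log(PM - MM*) values with one outlier
signal <- tukey_biweight(log_diffs)

# Additive model on the log2 scale: rows = probes j, columns = chips i.
set.seed(3)
chip  <- c(8, 9, 10)
probe <- c(-1, -0.5, 0, 0.5, 1)
Y <- outer(probe, chip, "+") + rnorm(15, sd = 0.05)
Y[2, 3] <- Y[2, 3] + 4            # one aberrant probe on the third chip

# Median polish; the chip-level expression estimate is overall + column effect.
fit <- medpolish(Y, trace.iter = FALSE)
rma_like <- fit$overall + fit$col
```

In both cases a single aberrant probe has little influence on the summary, which is the motivation for using robust estimators at the probe level.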

Between-Array Normalization

• MAS5: N/A. Single-chip method. Global scaling suggested for probe set data.
• dChip: Use piecewise linear normalization of rank-invariant sets (estimated non-DE genes) between each array and a baseline array.
• RMA: Quantile normalization (B Bolstad et al 2003).

[Figure panel: dChip expression values.]
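Quantile normalization forces every array to share the same empirical distribution: each value is replaced by the average, across arrays, of the values holding the same rank. The function below is a minimal base R sketch of that idea, not the Bioconductor implementation; the function name, the tie-handling choice (`ties.method = "first"`), and the simulated arrays are assumptions.

```r
quantile_normalize <- function(x) {
  # Rank of each value within its own array (column).
  ranks  <- apply(x, 2, rank, ties.method = "first")
  # Target distribution: mean of the k-th smallest values across arrays.
  target <- rowMeans(apply(x, 2, sort))
  # Replace each value by the target value of the same rank.
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

# Three simulated arrays with different scale and spread.
set.seed(5)
X <- cbind(array1 = rlnorm(200, 7, 1),
           array2 = 2 * rlnorm(200, 7, 1),
           array3 = rlnorm(200, 8, 0.5))
Xn <- quantile_normalize(log2(X))
```

After normalization the sorted values of every column are identical, while the ordering of probes within each array is preserved.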


Affymetrix array Quality Assessment using weights from affyPLM

Image Gallery: http://stat-www.berkeley.edu/users/bolstad/PLMImageGallery/index.html


Affymetrix Preprocessing Software

• RMA:
  - R/Bioconductor packages & functions
      affy: rma, justRMA
      affyPLM: fitPLM, image, boxplot
      gcrma: justGCRMA
  - Standalone GUI:
      affylmGUI: http://bioinf.wehi.edu.au/affylmGUI/
      RMAExpress: http://rmaexpress.bmbolstad.com/
• dChip: http://www.dchip.org
• Affymetrix own software: MAS5, PLIER

Preprocessing Method Comparison: AffyComp
http://affycomp.biostat.jhsph.edu/



Affymetrix Preprocessing Reference

Bioconductor book

RMA: Irizarry, R. A., B. Hobbs, et al. (2003). "Exploration, normalization, and summaries of high density oligonucleotide array probe level data." Biostatistics 4(2): 249-64.

GCRMA: Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. (2004). A Model Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association 99, 909–917.

Quantile Normalization: Bolstad, B.M., Irizarry R. A., Astrand, M., and Speed, T.P. (2003). A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2): 185-193.

dChip: Li, C. and W. H. Wong (2001). "Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection." Proc Natl Acad Sci U S A 98(1): 31-6.

Method comparison

Irizarry RA, Wu Z, Jaffee HA. Comparison of Affymetrix GeneChip expression measures. Bioinformatics. 2006 Apr 1;22(7):789-94.

Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol. 2005;6(2):R16.

Other Statistical Analysis Questions


B-cell lymphoma: Subtype discovery by Clustering and Class Validation (Alizadeh et al Nature 2000)

Finding Groups of Co-Regulated Genes with similar time profile in Yeast Cell Cycle using SOM Clustering (Tamayo et al, PNAS 1999, Data from Cho et al, Mol. Cell 1998)



Prognosis Prediction with Gene Expression: Classification & Validation (van’t Veer LJ et al, Nature 2002; further validated in NEJM 2002).

Transcript (Exon) Detection Using Tiling Arrays: Estimation, Testing, Meta-data Integration (Kapranov et al. Genome Res 2005.)

Chromosomal Aberrations by array CGH: Segmentation, Meta-data Integration (Selzer RR et al. Genes Chromosomes Cancer. 2005)


Genotyping and Copy Number Estimation Using SNP Arrays: Clustering, classification, estimation

[Figure: SNP array genotype clusters BB, AB, AA.]

Mass Spectrometry Proteomic Profiling: Surface Enhanced Laser Desorption/Ionization

Steps: Denoising, Baseline subtraction, Peak identification, Peak alignment

Data matrix: n samples x p M/Z peaks

[Figure: mass spectrum over M/Z.]
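The first three mass spectrometry preprocessing steps (denoising, baseline subtraction, peak identification) can be sketched for a single simulated spectrum in base R. Everything here is an assumption for illustration: the simulated spectrum, the moving-window helper `rolling`, the window widths, and the intensity threshold. Real SELDI pipelines use more refined methods, and peak alignment across samples, which would turn per-spectrum peaks into the n x p data matrix, is not shown.

```r
# Simulated spectrum: decaying baseline + two peaks + noise (hypothetical).
set.seed(4)
mz <- seq(1000, 2000, by = 1)
baseline <- 50 * exp(-(mz - 1000) / 400)
signal <- 80 * exp(-((mz - 1300) / 5)^2 / 2) +
          60 * exp(-((mz - 1700) / 5)^2 / 2)
spectrum <- baseline + signal + rnorm(length(mz), sd = 1)

# Apply f over a centered window of half-width `half` (truncated at the ends).
rolling <- function(x, half, f) {
  n <- length(x)
  vapply(seq_len(n),
         function(i) f(x[max(1, i - half):min(n, i + half)]),
         numeric(1))
}

# Denoising: moving average over 7 points.
smoothed <- rolling(spectrum, 3, mean)

# Baseline subtraction: remove a running minimum over a wide window.
corrected <- smoothed - rolling(smoothed, 50, min)

# Peak identification: local maxima within +/- 25 points above a threshold.
is_peak <- corrected == rolling(corrected, 25, max) & corrected > 20
peak_mz <- mz[is_peak]
```

On this simulated spectrum the two detected peak locations fall close to the true M/Z positions of 1300 and 1700.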



(Light) Reading/References

• Quackenbush J. 2006. NEJM 354(23): 2463-72. Microarray analysis and tumor classification.
• Nature Genetics Supplement: The Chipping Forecast II (2002), III (2005).
• Gibson G, Muse SV. 2004. A primer of genome science. Sinauer Associates.

Project

• Choose a paper/project that fits your own (clinical/scientific/technology) interest and has an analytic component
• Proposal due Feb 26 (Monday before Lecture 3)
• Written report due Mar 19 Monday by 5pm via email: at least two pages, summarizing what you’ve learned from the paper/critique of the article
• Oral presentation:
  - Mar 6: Manjushree Gautam, Nerissa Ko, Andy Choi, Dan Raz
  - Mar 13: Ari Green, Zian Tseng, Beth Cohen
