DNA copy numbers and the Circular Binary Segmentation Algorithm


Venkatraman E. Seshan 

Department of Biostatistics / HICCC 

Columbia University 

June 15, 2009 

Statistical Genomics 

Institute of Mathematical Sciences 

National University of Singapore 

Joint work with Adam B. Olshen, Richard A. Olshen

CBS 

Background 

• The DNA sequence copy number at a locus in a genome is the number of copies of genomic DNA at that locus. 

• The normal copy number is two for human autosomes. 

• Copy number alterations are gains and losses of DNA 

* they modify the function and/or expression of genes 

* they are common in cancer, where copy number is 

- increased at sites of oncogenes; 

- decreased at sites of tumor suppressor genes. 


Background 

Comparative genomic hybridization 

• first technique to scan the entire genome for copy number variation 

Kallioniemi et al. (1992), du Manoir et al. (1993) 

• metaphase chromosomes from test and reference samples 

• imaged separately using fluorescence microscope 

• quantified using image analysis software 

• smooth signal – “continuous curve”, low resolution 

http://amba.charite.de/cgh/cgh01.html 


Background 

Array based methods 

• DNA extracted, fragmented, labeled and hybridized 

• different probes 

* Bacterial artificial chromosomes (BACs) (1-3k orig, 31k tiling) 

* Long oligo arrays: Agilent (44k, 244k), Nimblegen (350k), ROMA (85k+) 

* Short oligos – Affymetrix SNP arrays (100k, 500k, 1.8m) 

• higher resolutions – noisier 

• Review: Pinkel & Albertson (2005, Nature Genetics) 


Example 

Lung cancer BAC array with 3160 clones 

[Figure: log-ratio (−2 to 2) plotted against genomic position]


Example 

Breast cancer ROMA array with 9820 probes 

[Figure: log-ratio plot (−2 to 2)]



Analysis 

Overall plan 

Within each sample, identify regions of abnormal copy number 

Then find region(s) repeatedly deleted or amplified across samples 

How to identify locations of aberrations? 

Thresholds: 3 SDs above 0 is a gain, 3 below is a loss 

SDs estimated from probe level data of normal/normal experiments 

Alteration calls can oscillate due to data overlap 

Smoothing techniques (lowess, quantreg, wavelets etc.) 
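
The 3 SD threshold rule on this slide can be sketched in a few lines. This is only an illustration: the function name `call_by_threshold`, the example log-ratios and the value `sd = 0.1` are hypothetical and not taken from the talk.

```python
def call_by_threshold(log_ratios, sd, n_sd=3.0):
    """Label each probe 'gain', 'loss' or 'normal' using a cutoff of n_sd * sd around 0."""
    cutoff = n_sd * sd
    calls = []
    for x in log_ratios:
        if x > cutoff:
            calls.append("gain")
        elif x < -cutoff:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls


# Hypothetical log-ratios; sd = 0.1 stands in for an SD estimated from
# normal/normal experiments.
example_calls = call_by_threshold([0.05, 0.45, -0.02, -0.38, 0.29], sd=0.1)
```

Each probe is called on its own, so neighbouring probes close to the cutoff can flip between calls, which is one motivation for the segmentation methods that follow.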


Analysis 

Copy number alterations are discrete events 

affecting contiguous regions of the genome 

• Hodgson et al. (2001) - 3-component Gaussian mixture 

• Autio et al. (2003) - CGH-Plotter: 3-means clustering 

combined with dynamic programming 

• Fridlyand et al. (2004) - Gaussian hidden Markov model 

• Wang et al. (2005) - Clustering along chromosomes 

• Tibshirani and Wang (2008) - cghFlasso 

Our algorithm based on a change-point model 

Olshen, Venkatraman, Lucito & Wigler (2004) 

Change-points (Methods) 

Let $Z_1, Z_2, \ldots, Z_n$ be the data ordered by an indexing set 

If $Z_1, \ldots, Z_\nu \sim F_0$ and $Z_{\nu+1}, \ldots, Z_n \sim F_1$, 

then $\nu$ is a change-point (Page, 1954). 

For our problem 

• the data are the log-ratio measurements 

• ordered by the location of a probe on a chromosome 

• a change-point corresponds to where 

the copy number changed on a chromosome 

• There may be multiple changes. 

Change-points (Methods) 

Binary segmentation (Sen and Srivastava, 1975; Vostrikova, 1981) 

1. Partial sums: $S_i = Z_1 + \cdots + Z_i$, $i = 1, \ldots, n$ 

2. Test the hypothesis $H_0$: no change against $H_1$: $\nu = i$ 

Statistic if $Z$s are normal is $\frac{S_i}{i} - \frac{S_n - S_i}{n - i}$ 

3. Unknown $\nu$ - maximize the t-statistic over all $i$ 

It is a recursive procedure - segments the data at a 

change-point and tests for change-points in the segments 

Tail probability approximation by Siegmund (1986) 
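
A minimal sketch of one binary segmentation step, written from the partial-sum formula above. The standardization is an assumption: the slide gives only the difference in means, so the overall sample SD and the factor $\sqrt{1/i + 1/(n-i)}$ are used here to turn it into a t-like statistic. The function name `max_t_statistic` is hypothetical.

```python
import math
import statistics


def max_t_statistic(z):
    """Return (max |t|, i), where i is the size of the left segment (1 <= i < n)."""
    n = len(z)
    s = statistics.stdev(z)  # assumed scale estimate: overall sample SD
    total = sum(z)           # S_n
    best_t, best_i = 0.0, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += z[i - 1]  # partial sum S_i
        diff = left_sum / i - (total - left_sum) / (n - i)
        t = diff / (s * math.sqrt(1.0 / i + 1.0 / (n - i)))
        if abs(t) > best_t:
            best_t, best_i = abs(t), i
    return best_t, best_i


# Noise-free toy data with one change in mean after the 10th observation.
z = [0.0] * 10 + [1.0] * 10
t_max, split = max_t_statistic(z)
```

In the recursive procedure the same function would be applied again to `z[:split]` and `z[split:]` whenever the split is judged significant.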

Binary Segmentation Fails (Methods) 

Chromosome 12 Max t-statistic: 4.242/3.462, p-value: 0.221/0.254 

[Figure: log-ratio plot for chromosome 12, positions 1940 to 2080]


Circular Binary Segmentation (Methods) 

View the data as if on a circle and segment into two arcs. 

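Partial sums: $S_i = Z_1 + \cdots + Z_i$, $i = 1, \ldots, n$ 

Test statistic: $T = \max_{1 \le i < j \le n} |T_{ij}|$, where 

$$T_{ij} = \frac{\overset{\frown}{Z}_{ij} - \overset{\smile}{Z}_{ij}}{s \sqrt{(j - i)^{-1} + (i + n - j)^{-1}}},$$

$$\overset{\frown}{Z}_{ij} = \frac{S_j - S_i}{j - i} \quad \text{and} \quad \overset{\smile}{Z}_{ij} = \frac{S_i + (S_n - S_j)}{i + n - j}$$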

Hence named circular binary segmentation (CBS). 

Either one change-point (j = n) or two (j < n). 

Levin and Kline (1985): Similar statistic for epidemic alternative 
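
The CBS statistic can be computed directly from the partial sums by looping over all arcs. The sketch below is a plain $O(n^2)$ illustration, not the DNAcopy implementation; the name `cbs_statistic` is hypothetical, and taking $s$ to be the overall sample SD is an assumption, since the slide does not say how $s$ is estimated.

```python
import math
import statistics


def cbs_statistic(z, s=None):
    """Return (max |T_ij|, (i, j)) over all arcs with 1 <= i < j <= n."""
    n = len(z)
    if s is None:
        s = statistics.stdev(z)  # assumed scale estimate
    S = [0.0]
    for x in z:
        S.append(S[-1] + x)      # S[i] = Z_1 + ... + Z_i
    best_t, best_arc = 0.0, None
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            arc_mean = (S[j] - S[i]) / (j - i)
            rest_mean = (S[i] + (S[n] - S[j])) / (i + n - j)
            scale = s * math.sqrt(1.0 / (j - i) + 1.0 / (i + n - j))
            t = (arc_mean - rest_mean) / scale
            if abs(t) > best_t:
                best_t, best_arc = abs(t), (i, j)
    return best_t, best_arc


# Toy data: a short gain in the middle of the sequence (observations 9 to 12).
z = [0.0] * 8 + [2.0] * 4 + [0.0] * 8
t_max, arc = cbs_statistic(z)
```

For this toy input the maximizing arc is `(8, 12)`, that is, the arc covering observations 9 through 12; both ends of the gain are found in one step, which a single binary split cannot do.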

More on CBS (Methods) 

• Recursively split the data if p-value below threshold 

• Tail probability approximation for Normal data – Siegmund (1988); Yao (1989) 

• Calculate reference distribution by permuting the data 

• If the split is ternary, the first and third segments are tested for a binary split with the middle segment (edge effect) 

• Algorithmic tweaks: 

* Maximize difference in means for fixed segment length 

* Stop permutations once p-value exceeds threshold 

* Moving windows for large data sets (deprecated) 
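
A permutation reference distribution for the maximal statistic can be sketched as follows. This is a bare-bones illustration with no algorithmic tweaks; the names `max_arc_stat` and `permutation_p_value`, the default of 200 permutations and the fixed seed are all assumptions made for the example.

```python
import math
import random
import statistics


def max_arc_stat(z, s):
    """Maximum |T_ij| over all arcs 1 <= i < j <= n for a fixed scale s."""
    n = len(z)
    S = [0.0]
    for x in z:
        S.append(S[-1] + x)
    best = 0.0
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            arc = (S[j] - S[i]) / (j - i)
            rest = (S[i] + S[n] - S[j]) / (i + n - j)
            t = abs(arc - rest) / (s * math.sqrt(1.0 / (j - i) + 1.0 / (i + n - j)))
            best = max(best, t)
    return best


def permutation_p_value(z, n_perm=200, seed=0):
    """Fraction of permuted statistics that exceed the observed one."""
    rng = random.Random(seed)
    s = statistics.stdev(z)  # the overall SD is unchanged by permutation
    t_obs = max_arc_stat(z, s)
    perm = list(z)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if max_arc_stat(perm, s) > t_obs:
            exceed += 1
    return exceed / n_perm


# Toy data with a clear change in mean; the p-value should be very small.
p = permutation_p_value([0.0] * 15 + [3.0] * 15)
```

A segment would be split when this p-value falls below the chosen threshold, and the procedure would then recurse on the resulting pieces.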

More on CBS (Example) 

[Figure: two log-ratio panels (−2 to 2) plotted against genomic position; raw axis labels xx2$log2R and xx2$genomic.pos]


Faster CBS (Example) 

CBS performed consistently well but was slow (Lai et al., 2005) 

and has superior performance (Willenbrock and Fridlyand, 2005) 

Why is CBS slow? 

• Test statistic: requires $n(n - 1)/2$ computations 

• Reference distribution is based on permutation 

• Not a problem for 3k BAC arrays but larger arrays? 

Especially 350k, 500k arrays (let alone 1.8m or 4m) 

Improvements: hybrid p-value & a stopping rule for change 

Venkatraman & Olshen (2007) 

Hybrid p-value (Faster CBS) 

Test statistic $T = \max\{T_1, T_2\}$ where 

$T_1 = \max_{A_1} |T_{ij}|$ and $T_2 = \max_{A_2} |T_{ij}|$ 

$A_1 = \{i, j : j - i \le k \text{ or } j - i > n - k\}$ (small arcs) 

$A_2 = \{i, j : k + 1 \le j - i \le n - k\}$ (non-small arcs) 

Choose $k$ to call an arc small 

Split data if $P(T > T_{obs}) < \alpha$ 

Bound $P(T > T_{obs}) \le P(T_1 > T_{obs}) + P(T_2 > T_{obs})$ 

Use permutation for $P(T_1 > T_{obs})$ and 

approximate $P(T_2 > T_{obs})$ (Siegmund (1988), Yao (1993)) 
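
The bound is the usual union bound. Writing it out for a generic threshold $t$ (a symbol introduced here only for illustration):

$$\{T > t\} = \{\max(T_1, T_2) > t\} = \{T_1 > t\} \cup \{T_2 > t\} \quad \Longrightarrow \quad P(T > t) \le P(T_1 > t) + P(T_2 > t).$$

Because the sum overstates $P(T > t)$, splitting only when the sum is below $\alpha$ is conservative: it can miss a split that the exact p-value would allow, but it does not add false ones.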

p-value Approximation (Faster CBS) 

Let $\tilde{T}_2 = \max_{A_2} T_{ij}$. If data are normal, then 

$$P(\tilde{T}_2 > b) = \frac{1}{4} b^3 \phi(b) \int_{1/2}^{1-\delta} \frac{\nu^2\left(b/[n t (1 - t)]^{1/2}\right)}{t^2 (1 - t)^2} \, dt,$$

where $\delta = k/n$, $\phi$ is the normal density and $\nu$ is 

$$\nu(x) = 2 x^{-2} \exp\left\{ -2 \sum_{r=1}^{\infty} r^{-1} \Phi\left(-\tfrac{1}{2} x r^{1/2}\right) \right\}.$$

By symmetry $P(T_2 > b) \approx 2 P(\tilde{T}_2 > b)$. 

If data are “regular” and $k$ sufficiently large, 

random field approximately Gaussian 
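
The function $\nu$ has no closed form, but its series can be summed numerically. The sketch below assumes that $\Phi$ is the standard normal distribution function and truncates the sum once the terms become negligible; the names `normal_cdf` and `nu` and the tolerance are choices made for this example.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def nu(x, tol=1e-12, max_terms=1_000_000):
    """Evaluate nu(x) = 2 x^-2 exp(-2 * sum_{r>=1} Phi(-x sqrt(r) / 2) / r) for x > 0."""
    total = 0.0
    for r in range(1, max_terms + 1):
        term = normal_cdf(-0.5 * x * math.sqrt(r)) / r
        total += term
        if term < tol:  # terms decrease in r, so the remaining tail is negligible
            break
    return 2.0 / (x * x) * math.exp(-2.0 * total)


nu_at_1 = nu(1.0)
nu_at_2 = nu(2.0)
```

The series converges more slowly as `x` gets smaller, so very small arguments need many more terms than the values used here.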

Stopping rule (Faster CBS) 

• data segmented if p-value ≤ α 

• p-value computed from P permuted statistics 

• all P permutations conducted when change detected 

• what if none of the first 1000 $T_{perm} > T_{obs}$? 

[Figure: number of $T_{perm} > T_{obs}$ (0 to 100) against number of permutations (0 to 10000)]

Stopping rule (Faster CBS) 

Let $e_1, \ldots, e_P$ be the indicators of $T_{perm} > T_{obs}$ 

Let $r(i) = \sum_{j=1}^{i} e_j$. Declare no change if $r(i) = \alpha P = m$ 

• Stopping rule to declare change is a boundary $b_1, \ldots, b_m$ 

• Stop and declare change detected if $r(b_i) < i$ 

• Choosing the boundary? – Repeated significance test 

• Control Prob{r(b i ) 

• $b_i$ is the smallest $j$ such that $\mathrm{Prob}\{r(j) < i\} = \eta^\star$ 

• obtained using the hypergeometric distribution 
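
The "declare no change" half of the rule is easy to sketch: once $r(i)$ reaches $m = \alpha P$, the remaining permutations cannot bring the p-value back below $\alpha$. The code below illustrates only that half. The boundary $b_1, \ldots, b_m$ for declaring a change early is not implemented, because its construction is not fully given on the slide. The function name and the example indicator sequence are hypothetical.

```python
from itertools import islice


def scan_for_no_change(indicators, alpha, n_perm):
    """Scan e_1, e_2, ... and stop as soon as r(i) reaches m = alpha * n_perm.

    Returns ("no change", i) on an early stop, otherwise ("change", number scanned).
    """
    m = max(1, round(alpha * n_perm))
    r = 0
    used = 0
    for e in islice(indicators, n_perm):
        used += 1
        r += e  # running count r(i) of permuted statistics exceeding T_obs
        if r >= m:
            return "no change", used
    return "change", used


# Hypothetical indicator sequence: the fifth exceedance occurs at permutation 9.
example = [0, 0, 1, 0, 1, 1, 0, 1, 1] + [0] * 91
decision = scan_for_no_change(example, alpha=0.05, n_perm=100)
```

With `alpha=0.05` and `n_perm=100` the threshold is `m = 5`, so this example stops after 9 permutations instead of 100.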

Performance gain (Faster CBS) 

100k SNP array data for 3 breast cancer cell lines 

data from http://pevsnerlab.kennedykrieger.org/snpscan.htm 

             Cell line                Time in
             MCF7    SKBR3   ZR75     minutes
P            220     242     270      7085.8
H            217     242     271      100.2
H+ES         216     243     269      25.0
Identical    213     242     263

Lung project (Example) 

• Two broad categories of lung cancer: small cell (SCLC) 

and non-small cell (NSCLC) 

• Adenocarcinoma is a histological subtype of NSCLC 

• Recent advance: EGFR mutation and effective TKI drugs 

• Prognosis is in general poor, especially with KRAS mutation 

Can we learn more by profiling DNA copy number? 

≈ 250 samples using Agilent 44k arrays. 

Also gene expression using Affymetrix U133 arrays 



• No obvious focal gain or loss stood out 

• Clusters had weak association with EGFR but not KRAS 

• Incorporated information from U133 expression arrays 

• Potentially interesting gene DUSP4 

Acknowledgements 

Marc Ladanyi, William Gerald, Valerie Rusch, Stephen Broderick, 

Cameron Brennan, Dhananjay Chitale, Bhuvanesh Singh and others. 

Copy Number Variation (TCGA) 

Multi platform data (wCBS) 

• Method to merge data from multiple platforms 

• Platform specific probe noise - weights $w_i$ 

• Partial sums: $S_i = w_1 Z_1 + \cdots + w_i Z_i$, $W_i = w_1 + \cdots + w_i$, $i = 1, \ldots, n$ 

• Test statistic: $T = \max_{1 \le i < j \le n} |T_{ij}|$, where 

$$T_{ij} = \frac{\overset{\frown}{Z}_{ij} - \overset{\smile}{Z}_{ij}}{s \sqrt{(W_j - W_i)^{-1} + (W_i + W_n - W_j)^{-1}}},$$

$$\overset{\frown}{Z}_{ij} = \frac{S_j - S_i}{W_j - W_i} \quad \text{and} \quad \overset{\smile}{Z}_{ij} = \frac{S_i + (S_n - S_j)}{W_i + W_n - W_j}$$
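
The weighted statistic differs from the unweighted one only in using weighted partial sums and cumulative weights in place of counts. A minimal sketch, assuming strictly positive weights and treating the scale $s$ as a user-supplied constant (how $s$ is estimated in the weighted case is not stated on the slide); the function name is hypothetical.

```python
import math


def weighted_cbs_statistic(z, w, s=1.0):
    """Return (max |T_ij|, (i, j)) for data z with strictly positive weights w."""
    n = len(z)
    S, W = [0.0], [0.0]
    for x, wi in zip(z, w):
        S.append(S[-1] + wi * x)  # S_i = w_1 Z_1 + ... + w_i Z_i
        W.append(W[-1] + wi)      # W_i = w_1 + ... + w_i
    best_t, best_arc = 0.0, None
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            w_arc = W[j] - W[i]
            w_rest = W[i] + W[n] - W[j]
            arc_mean = (S[j] - S[i]) / w_arc
            rest_mean = (S[i] + (S[n] - S[j])) / w_rest
            t = (arc_mean - rest_mean) / (s * math.sqrt(1.0 / w_arc + 1.0 / w_rest))
            if abs(t) > best_t:
                best_t, best_arc = abs(t), (i, j)
    return best_t, best_arc


# With all weights equal to 1 this reduces to the unweighted statistic (for s = 1).
z = [0.0] * 8 + [2.0] * 4 + [0.0] * 8
t_max, arc = weighted_cbs_statistic(z, [1.0] * len(z))
```

Giving noisier platforms smaller weights shrinks their contribution to both the segment means and the effective segment lengths.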

Multi platform data (wCBS) 

Weighted CBS uses the same hybrid approach. 

Bengtsson, et al (2009) 

HaarSeg is a Haar wavelet based segmentation algorithm 


ASCN 

Allele Specific Copy Number 

• Traditional methods measure the sum of copy numbers of 

the parental chromosomes. 

• SNP arrays can be used for allele specific copy numbers. 

• Total copy number of 2 can be from AA, AB or BB. 

• Data from Molecular Inversion Probes (MIPs) SNP genotyping/copy 

number technology from Affymetrix. 

• Intensities for the two alleles converted to copy numbers. 

MIPs Example (ASCN) 

[Figure: three panels (allele A copy number, allele B copy number, sum of copy numbers; each 0 to 6) plotted against genomic position (Mb)]


Analysis plan (ASCN) 

• How to estimate parental chromosome copy numbers? 

• Phase ambiguity causes a problem: the parental origin of heterozygous SNPs is unknown. 

Proposed plan: 

Begin by segmenting total copy number using CBS. 

Change-points in the total are present in one of the parents. 

Analyze allele specific CN within each segment. 

Allele A vs B (ASCN) 

Homozygotes are non informative for parental CN 

Phase ambiguity can lead to 2 heterozygous blobs 

[Figure: allele A and allele B copy numbers (0 to 5) plotted against each other, panels for chromosome 1p and chromosome 1q]


Step 2: Determine the homozygous locations. 

[Figure: minimum of A & B (−0.5 to 3.0) along the genome]


Step 3: Test for additional change-points 

Why? The same total copy number can hide different parental copy numbers: TCN = 4 = 3 + 1 = 2 + 2 

Step 4: Estimate difference in parental copy numbers 

mean absolute difference of heterozygous ASCNs 

Step 5: Combine parental copy numbers 

adjacent segment copy number transitions 

copy numbers 1 + 2 → 2 + 3 
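
One way to read Steps 4 and 5 is that the total copy number of a segment and the estimated difference between the parental copy numbers together determine both parental values. The helper below is an assumption made for illustration, not a formula stated in the talk: it solves high + low = total and high − low = difference.

```python
def parental_copy_numbers(total_cn, abs_difference):
    """Solve high + low = total_cn and high - low = abs_difference for (high, low)."""
    high = (total_cn + abs_difference) / 2.0
    low = (total_cn - abs_difference) / 2.0
    return high, low


# The two decompositions of TCN = 4 mentioned above.
unbalanced = parental_copy_numbers(4, 2)  # (3.0, 1.0)
balanced = parental_copy_numbers(4, 0)    # (2.0, 2.0)
```

Which parent carries the higher copy number is still unresolved at this point, which is why adjacent segments have to be combined in Step 5.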

Reconstruction (ASCN) 

[Figure: copy numbers (0 to 6) along the genome, with series Total, High copy and Low copy]

Reconstruction (ASCN) 

[Figure: three panels (allele A copy number, allele B copy number, sum of copy numbers; each 0 to 6) along the genome]


Equality of parental CN (ASCN) 

[Figure: scatter and density panels for chromosomes 1p, 1q and 6p; one density panel is labelled N = 678, bandwidth = 0.09884]


Equality of parental CN (ASCN) 

[Figure: scatter and density panels for chromosomes 6q, 14 and 15]



Summary 

• New technologies for studying biological processes present interesting applications of statistical techniques 

• The CBS algorithm is a practical solution which is widely used to study DNA copy number data 

• Improvements make it highly attractive for larger arrays 

• Software is available as an open source R (www.r-project.org) package called DNAcopy that is part of Bioconductor (www.bioconductor.org). 


References 

1. Autio, R. et al. (2003). CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19 1714-1715. 

2. du Manoir, S. et al. (1993). Detection of complete and partial chromosome gains and losses by comparative genomic in situ hybridization. Hum. Genet. 90, 590-610. 

3. Fridlyand, J. et al. (2004). Understanding Array CGH Data. JMVA 90 132-153. 

4. Hodgson, G. et al. (2001). Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature 

Genet. 29 459-464. 

5. Kallioniemi, A. et al. (1992). Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 258, 818-821. 

6. Lai et al. (2005) Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 

21, 3763-3770. 

7. Levin, B. and Kline, J. (1985). The CUSUM test of homogeneity with an application in spontaneous abortion epidemiology. 

Statistics in Medicine 4 469-488. 

8. Linn, S. et al. (2003). Gene expression patterns and copy number changes in dermafibrosarcoma Am J Pathol 163 2383-2395. 

9. Menard, S. et al. (2002). Role of HER2 gene overexpression in breast carcinoma. J Cell Physiol 182 150-162. 

10. Olshen, A., Venkatraman, E., Lucito, R. and Wigler, M. (2004). Circular Binary Segmentation for the analysis of 

array-based DNA copy number data. Biostatistics 5 557-572. 

11. Page, E. S. (1954). Continuous inspection schemes. Biometrika. 41 100-115. 

12. Picard, F. et al. (2005) A statistical approach for array CGH data analysis. BMC Bioinformatics 6, 27. 

13. Pinkel, D. and Albertson, D. (2005) Array comparative genomic hybridization and its application in cancer, Nature Genetics, 

37, S11-S17, Suppl. S. 

14. Pollack, J. R. et al. (2002). Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional 

program of human breast tumors Proc. Natl. Acad. Sci. 99 12963-12968. 

15. Sen, A. and Srivastava, M. S. (1975). On tests for detecting a change in mean. Ann Statist. 3 98-108. 

16. Siegmund, D. (1986). Boundary crossing probabilities and statistical applications. Ann Statist. 14 361-404. 

17. Siegmund, D.O. (1988) Approximate tail probabilities for the maxima of some random fields, Annals of Probability, 16, 487-501. 

18. Snijders, A. M. et al. (2001). Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genet. 29 

263-264. 

19. Tibshirani, R. and Wang, P. (2007) Spatial smoothing and hot spot detection for CGH data using the Fused Lasso. Biostatistics 

20. Venkatraman, E. S. and Olshen, A. B. (2007) A faster circular binary segmentation algorithm for the analysis of array CGH 

data. Bioinformatics. 


21. Vostrikova, L. J. (1981) Detecting “disorder” in multidimensional random processes. Soviet Mathematics Doklady 24 55-59. 

22. Wang, P. et al. (2005). A method for calling gains and losses in array CGH data. Biostatistics 6 45-58. 

23. Willenbrock, H. and Fridlyand, J. (2005) A comparison study: applying segmentation to array CGH data for downstream 

analyses. Bioinformatics, 21, 4084-4091. 

24. Yao, Q. (1989) Large deviations for boundary crossing probabilities of some random fields. J. Math. Res. Exposition, 9, 181-192. 

25. Yao, Q. (1993) Tests for change-points with epidemic alternatives, Biometrika, 80, 179-191. 
