28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib


2.7. ROBUST DETECTION OF CHIP-SEQ ENRICHMENT

therefore conservative in the sense that the true number of non-duplicated reads per position will be underestimated. In contrast to imposing a static limit on the number of reads accepted at each position, the adaptive heuristic can still discriminate between different levels of extreme enrichment and still permits peak calling in the presence of deep background coverage.

2.7.3 Detection Phase

The ChIP-Seq procedure enriches for short DNA fragments with above-average affinity to the protein of interest. One or both ends of these fragments are subsequently sequenced, allowing detection of their origin on the reference genome assembly. In the following, the enriched DNA fragments subjected to sequencing will be referred to as inserts, whereas the sequenced ends of the fragments will be referred to as reads. The number of inserts putatively overlapping a position will be termed insert depth, and the number of reads overlapping the position read depth.
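The distinction between read depth and insert depth can be illustrated with a minimal Python sketch. It assumes single-end reads given as half-open `(start, end, strand)` tuples and a fixed, hypothetical insert size to which each read is extended in its 3′ direction; the function names and the toy coordinates are illustrative only and are not taken from the described software.

```python
def depth_profile(intervals, length):
    """Per-position coverage of half-open [start, end) intervals (difference array)."""
    diff = [0] * (length + 1)
    for start, end in intervals:
        start, end = max(start, 0), min(end, length)  # clip to the sequence
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def extend_read(start, end, strand, insert_size):
    """Extend a read in its 3' direction to the assumed insert size."""
    if strand == "+":
        return start, max(end, start + insert_size)
    # Reverse-strand reads extend towards lower reference coordinates.
    return min(start, end - insert_size), end


# Hypothetical toy data: two 3 bp reads on opposite strands.
reads = [(2, 5, "+"), (10, 13, "-")]
insert_size = 6       # assumed average insert size
genome_length = 15    # assumed reference length

read_depth = depth_profile([(s, e) for s, e, _ in reads], genome_length)
insert_depth = depth_profile(
    [extend_read(s, e, strand, insert_size) for s, e, strand in reads],
    genome_length,
)
```

In this toy example the two reads do not overlap, yet their putative inserts both cover position 7, so insert depth reveals a shared site that read depth alone would miss.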

The primary peak detection phase, resembling that of MACS [104] (section 1.5.6), is realized as a sliding window analysis of insert depth. A window of fixed, user-selectable width is shifted along the reference sequence in single-base steps. In each step, the algorithm calculates the average depth of insert coverage over the current window. In the case of paired-end sequencing, insert depth is calculated from the ranges defined by read pairs. For single-end sequencing, each read is extended in the 3′ direction to match the estimated average insert size. To exclude regions not accessible to the sequencing experiment, positions with sequencing depth zero are ignored, i.e. the average depth is calculated as

$$\bar{d}'_W = \frac{\sum_{x \in W} d(x)}{\left|\{x \in W : d(x) > 0\}\right|}$$

where $W$ is the set of reference sequence positions included by the sliding window and $d(x)$ signifies the depth at a position $x$. Since the sliding window may or may not contain enriched sites, $\bar{d}'_W$ may be regarded as a conservative estimate of the minimum average background signal over the window. $\bar{d}'_W$ is used as the average coverage depth parameter to evaluate the potential enrichment at the central position $x_c$ of the sliding window by a one-sided Poisson test. The test calculates the p-value for the respective coverage depth from the cumulative distribution function for the upper tail of the Poisson distribution

$$p(x_c) = 1 - e^{-\bar{d}'_W} \cdot \sum_{i=0}^{\lfloor d(x_c) \rfloor} \frac{(\bar{d}'_W)^i}{i!}$$
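The two formulas of the detection phase can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the described software: the window is assumed to be centred on the tested position and truncated at the sequence ends, the function names are hypothetical, and the handling of windows without any coverage is left to the caller.

```python
import math


def window_mean_nonzero(depth, center, width):
    """Average depth over the window around `center`, ignoring zero-depth positions."""
    half = width // 2
    lo = max(0, center - half)
    hi = min(len(depth), center + half + 1)
    window = depth[lo:hi]
    covered = sum(1 for d in window if d > 0)
    if covered == 0:
        return 0.0  # no accessible position in the window
    return sum(window) / covered


def poisson_upper_tail(depth, mean):
    """p = 1 - exp(-mean) * sum_{i=0}^{floor(depth)} mean**i / i!"""
    if mean <= 0:
        raise ValueError("mean must be positive; uncovered windows need separate handling")
    term = math.exp(-mean)  # Poisson probability of i = 0
    cdf = 0.0
    for i in range(int(math.floor(depth)) + 1):
        cdf += term
        term *= mean / (i + 1)  # recurrence: P(i + 1) = P(i) * mean / (i + 1)
    return max(0.0, 1.0 - cdf)  # guard against rounding slightly below zero


# Hypothetical toy depth profile with an enriched position at index 4.
insert_depth = [0, 3, 2, 0, 12, 4, 3, 0, 2]
center = 4
background = window_mean_nonzero(insert_depth, center, width=9)
p_center = poisson_upper_tail(insert_depth[center], background)
```

Computing the tail as one minus a sum loses precision for very small p-values, and `math.exp(-mean)` underflows for very deep coverage; a survival function from a statistics library, such as `scipy.stats.poisson.sf`, would be the more robust choice outside of a sketch.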

Consecutive positions whose p-value falls below a significance threshold, set to 0.05 by default, are joined to form a candidate region.
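Joining consecutive significant positions amounts to a single pass over the per-position p-values. The following Python sketch assumes the p-values are already available as a list indexed by reference position and reports regions as half-open `(start, end)` pairs; both conventions and the function name are assumptions made for illustration.

```python
def candidate_regions(p_values, threshold=0.05):
    """Join consecutive positions with p < threshold into half-open (start, end) regions."""
    regions = []
    start = None
    for pos, p in enumerate(p_values):
        if p < threshold:
            if start is None:
                start = pos  # a new region opens here
        elif start is not None:
            regions.append((start, pos))  # region closes before this position
            start = None
    if start is not None:
        regions.append((start, len(p_values)))  # region runs to the sequence end
    return regions


# Hypothetical p-values for seven consecutive positions.
p_values = [0.5, 0.01, 0.02, 0.3, 0.04, 0.04, 0.001]
regions = candidate_regions(p_values)
```

For these toy values, positions 1 to 2 and positions 4 to 6 form two separate candidate regions, since position 3 does not pass the threshold.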

Detection can be fine-tuned using two auxiliary significance thresholds. A relaxed probing significance threshold can be supplied to automatically join candidate regions separated only by short drops in depth of coverage into a contiguous region. Furthermore, candidate regions may be pre-filtered using an acceptance threshold, discarding all regions that do not include at least one position that satisfies this more stringent significance criterion.
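One possible reading of the two auxiliary thresholds is sketched below in Python. It assumes that two candidate regions are joined when every position in the gap between them passes the probing threshold, and that a region is kept only if its smallest p-value passes the acceptance threshold. The default values for `probe` and `accept` are hypothetical, and the function names are illustrative; only the detection default of 0.05 is taken from the text.

```python
def find_runs(p_values, threshold):
    """Half-open runs of consecutive positions with p < threshold."""
    runs, start = [], None
    for pos, p in enumerate(p_values):
        if p < threshold and start is None:
            start = pos
        elif p >= threshold and start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(p_values)))
    return runs


def refine_regions(p_values, detect=0.05, probe=0.2, accept=0.001):
    """Merge candidate regions across relaxed gaps, then apply the acceptance filter."""
    merged = []
    for start, end in find_runs(p_values, detect):
        gap = p_values[merged[-1][1]:start] if merged else []
        if merged and all(p < probe for p in gap):
            merged[-1] = (merged[-1][0], end)  # gap passes the probing threshold
        else:
            merged.append((start, end))
    # Keep regions with at least one position below the acceptance threshold.
    return [(s, e) for s, e in merged if min(p_values[s:e]) < accept]


# Hypothetical p-values: a short dip at position 2, a weak region at positions 5-6.
p_values = [0.5, 0.01, 0.1, 0.0005, 0.5, 0.03, 0.04, 0.5]
accepted = refine_regions(p_values)
```

With these toy values, the regions at positions 1 and 3 are joined across the mild drop at position 2, while the region at positions 5 to 6 is discarded because none of its positions reaches the acceptance threshold.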

2.7.4 Recognition of Read Mapping Artifacts

Final significance scores for enrichment are calculated based on the number of read mappings providing evidence for a respective site (section 2.7.5). For multiple reasons, the significance of enrichment alone is of limited value for singling out relevant protein binding sites (section 2.7.6). Peak significance is therefore primarily used as a means of imposing a ranking on the detected loci.
