28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib


2.7 Robust Detection of ChIP-Seq Enrichment

Investigation of protein-DNA interaction by ChIP-Seq can help to further our understanding of
cellular regulatory networks (section 1.3.2). Primary data analysis for ChIP-Seq experiments
requires software tools that can pinpoint hundreds to thousands of putative protein binding sites
from data sets of high-throughput sequencing read alignments (section 1.5.6).

This section discusses the peak calling pipeline SHORE peak, which is integrated into the SHORE
analysis suite. The utility discriminates true enrichment from common artifact patterns using
various robust heuristics, calculates false discovery rates (FDR) by interrogation of control
experiments, and implements a unique "joint detection, separate evaluation" approach to handling
replicate experiments. The program outputs locations of possible binding regions together with
their FDR and various additional features potentially useful for ranking, filtering or fine-tuning
the detection. It further enables users to generate graphical representations of peak regions,
to put detected loci into the context of available genome annotation, or to extract binding region
sequences for subsequent motif sampling and analysis.

Our pipeline is pre-configured for straightforward detection of defined loci of ChIP-Seq enrichment
as produced by transcription factor (TF) binding sites. Settings may, however, be adjusted
to support the detection of clusters of enrichment that occur, e.g., in data representing states
of histone modification. Individual algorithms are implemented as pluggable modules and have
been carried over to the multi-purpose tools SHORE coverage and SHORE count, allowing fine-grained
control and flexible applicability of enrichment detection in diverse scenarios such as MNase-Seq
assays or assessment of copy number variation (CNV).

The SHORE peak pipeline has been outlined [58, 137] and further employed [138–141] in multiple
biological publications.

2.7.1 Overview

Were sequencing reads randomly sampled from the genome, sequencing depth would follow a
Poisson distribution whose parameter equals the average depth [142].
Although high-throughput sequencing is known to be highly non-random, with biases that bring
about considerable deviation from the Poisson model (section 1.4), the distribution can still
serve as a powerful tool for the detection of irregularities in sequencing depth at high sensitivity.
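To make the Poisson argument concrete, the following minimal sketch computes the probability of observing at least a given depth when depth is Poisson distributed around the average. The function name `poisson_sf` and the example values are illustrative assumptions and are not taken from SHORE.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    mean_depth = 5.0  # assumed average depth, for illustration only
    for depth in (5, 10, 20):
        # The tail probability shrinks quickly as depth rises above the mean.
        print(depth, poisson_sf(depth, mean_depth))
```

Positions whose tail probability falls below a chosen cutoff would be flagged as irregular under this idealized model; real data deviate from it, as noted above.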

Our peak calling algorithm consists of two major steps. The first pass is a detection phase that
applies a relaxed Poisson assumption to the read alignment data generated through the
ChIP-Seq experiment to identify regions of putative enrichment at high sensitivity. In the subsequent
second pass, implemented to improve specificity of the prediction, these candidate regions are
subjected to several robust empirical rules for artifact removal and further tested against control
experiment data to assess significance of the enrichment.
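The two-pass structure can be sketched as follows. This is only a toy illustration under stated assumptions, not the SHORE peak implementation: the first pass flags runs of positions whose depth is unlikely under a Poisson model with the track mean as parameter, and the second pass applies a single hypothetical rule (a minimum ChIP-to-control ratio with a pseudocount) in place of the actual artifact heuristics and significance test. All function names, thresholds and data are invented for the example.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def detect_candidates(depth, p_cutoff):
    """First pass: half-open (start, end) runs with improbably high depth."""
    lam = sum(depth) / len(depth)  # simplification: mean over the whole track
    regions, start = [], None
    for pos, d in enumerate(depth):
        hit = poisson_sf(d, lam) < p_cutoff
        if hit and start is None:
            start = pos  # a run of enriched positions begins
        elif not hit and start is not None:
            regions.append((start, pos))  # the run ended before pos
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


def filter_candidates(regions, chip, control, min_ratio):
    """Second pass: keep regions where ChIP depth clearly exceeds control."""
    kept = []
    for start, end in regions:
        chip_sum = sum(chip[start:end])
        ctrl_sum = sum(control[start:end])
        # The pseudocount of 1 keeps the rule defined for empty control regions.
        if chip_sum >= min_ratio * (ctrl_sum + 1):
            kept.append((start, end))
    return kept


# Invented per-base depth tracks: two ChIP pile-ups, one mirrored in the control.
chip = [1, 2, 1, 1, 12, 14, 12, 1, 2, 1, 11, 13, 11, 1, 1, 2, 1, 1, 2, 1]
control = [1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 12, 12, 12, 1, 1, 1, 1, 1, 1, 1]

candidates = detect_candidates(chip, p_cutoff=0.05)
peaks = filter_candidates(candidates, chip, control, min_ratio=3)
print(candidates, peaks)
```

In this toy data set both pile-ups pass the sensitive first pass, while the one that also appears in the control track is discarded in the second pass.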

2.7.2 Correcting for Duplicated Sequences

The ChIP-Seq procedure yields quantities of DNA that are typically orders of magnitude below
the amounts retrieved by other methods of DNA sampling. The excess PCR rounds required
to obtain a compliant sequencing library often bring about significant amounts of duplicated
sequences, reducing library complexity. PCR may entail significant sequence bias (section 1.4)
with erratic rates of fragment duplication.
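As a rough illustration of how the extent of duplication in a library might be summarized, the sketch below counts reads that share the same alignment coordinates. Treating reads with identical chromosome, start position and strand as duplicates is a common convention that is assumed here; the text does not state how SHORE identifies duplicates, and the function name and read tuples are hypothetical.

```python
from collections import Counter


def duplication_summary(reads):
    """Summarize duplication for an iterable of (chrom, start, strand) tuples."""
    counts = Counter(reads)  # number of copies per distinct alignment
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        # Share of reads that are extra copies of an already seen alignment.
        "duplicate_fraction": (total - distinct) / total if total else 0.0,
        "max_copies": max(counts.values(), default=0),
    }


# Invented alignments: three copies at one position, the rest unique.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "-"),
    ("chr1", 250, "+"),
    ("chr2", 100, "+"),
]
print(duplication_summary(reads))
```

A high duplicate fraction or a large maximum copy number would hint at low library complexity of the kind described above.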

Duplicate sequences represent the same fragment of DNA and therefore violate the assumption
that each sequencing read is sampled independently from the set of sequences under examination.
During detection of sequence enrichment, the presence of duplicate sequences must thus be
explicitly accounted for by the predictive model, or these sequences must be pruned from the
