An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
2.7 Robust Detection of ChIP-Seq Enrichment<br />
Investigation of protein-DNA interaction by ChIP-Seq can help to further our understanding of<br />
cellular regulatory networks (section 1.3.2). Primary data analysis for ChIP-Seq experiments<br />
requires software tools that can pinpoint hundreds to thousands of putative protein binding sites<br />
from data sets of high-throughput sequencing read alignments (section 1.5.6).<br />
This section discusses the peak calling pipeline SHORE peak integrated into the SHORE analysis<br />
suite. The utility discriminates true enrichment from common artifact patterns using<br />
various robust heuristics, calculates false discovery rates (FDR) by interrogation of control experiments<br />
and implements a unique joint-detection, separate-evaluation approach to the handling of<br />
replicate experiments. The program outputs locations of possible binding regions together with<br />
their FDR and various additional features potentially useful for ranking, filtering or fine-tuning<br />
the detection. It further enables users to generate graphical representations of peak regions,<br />
to put detected loci into the context of available genome annotation or to extract binding region<br />
sequences for subsequent motif sampling and analysis.<br />
Our pipeline is pre-configured for straightforward detection of defined loci of ChIP-Seq enrichment<br />
as produced by transcription factor (TF) binding sites. Settings may however be adjusted<br />
to support the detection of clusters of enrichment that occur e.g. in data representing states<br />
of histone modification. Individual algorithms are implemented as pluggable modules and have<br />
been brought over to multi-purpose tools SHORE coverage and SHORE count, allowing fine-grained<br />
control and flexible applicability of enrichment detection in diverse scenarios such as MNase-Seq<br />
assays or assessment of copy number variation (CNV).<br />
The SHORE peak pipeline has been outlined [58, 137] and further employed [138–141] in multiple<br />
biological publications.<br />
2.7.1 Overview<br />
Were sequencing reads randomly sampled from the genome, sequencing depth should be distributed<br />
according to the Poisson distribution whose parameter equals the average depth [142].<br />
Although high-throughput sequencing is known to be highly non-random, with biases that bring<br />
about considerable deviation from the Poisson model (section 1.4), the distribution can still<br />
present a powerful tool for the detection of irregularities in sequencing depth at high sensitivity.<br />
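As a minimal illustration of the Poisson argument above, the following Python sketch computes the probability of observing a depth of at least k when the average depth is lam, and derives the smallest depth that would count as irregular at a chosen significance level. The function names and the parameter alpha are hypothetical and serve illustration only; this is not the SHORE peak implementation.

```python
import math


def poisson_pmf(i, lam):
    """P(X = i) for X ~ Poisson(lam), evaluated in log space for stability."""
    return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam); lam is the average sequencing depth."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    if k <= lam:
        # Below the mean the lower tail is short, so take its complement.
        return max(0.0, 1.0 - sum(poisson_pmf(i, lam) for i in range(k)))
    # Above the mean, sum the upper tail directly until terms stop mattering.
    total = 0.0
    term = poisson_pmf(k, lam)
    i = k
    while term > 0.0 and total + term != total:
        total += term
        i += 1
        term *= lam / i
    return total


def depth_threshold(lam, alpha):
    """Smallest depth k with P(X >= k) <= alpha (alpha is an assumed cutoff)."""
    k = 0
    while poisson_upper_tail(k, lam) > alpha:
        k += 1
    return k
```

A larger alpha lowers the threshold and thereby raises sensitivity at the cost of specificity, which is the trade-off a relaxed Poisson assumption accepts in a first detection pass.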
Our peak calling algorithm consists of two major steps. The first pass is a detection phase that<br />
applies a relaxed Poisson assumption to the read alignment data generated through the ChIP-<br />
Seq experiment to identify regions of putative enrichment at high sensitivity. In the subsequent<br />
second pass, implemented to improve specificity of the prediction, these candidate regions are<br />
subjected to several robust empirical rules for artifact removal and further tested against control<br />
experiment data to assess significance of the enrichment.<br />
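The two-pass structure described above can be sketched as follows. The first function collects runs of positions whose depth reaches a permissive threshold; the second applies stand-in filters. The minimum width and the ChIP-to-control ratio used here are hypothetical placeholders for the empirical rules and the control test, which this overview does not specify, and all names and default values are assumptions.

```python
def find_candidates(depth, min_depth):
    """Pass 1: half-open (start, end) runs of positions with depth >= min_depth."""
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth:
            if start is None:
                start = pos          # open a new candidate region
        elif start is not None:
            regions.append((start, pos))
            start = None             # close the current region
    if start is not None:
        regions.append((start, len(depth)))
    return regions


def filter_candidates(regions, chip, control, min_width=3, min_ratio=2.0):
    """Pass 2 (placeholder rules): drop narrow regions and regions that are
    not clearly enriched over the control track."""
    kept = []
    for start, end in regions:
        if end - start < min_width:
            continue                 # assumed artifact rule: too narrow
        chip_sum = sum(chip[start:end])
        ctrl_sum = sum(control[start:end])
        # Pseudocount of 1 avoids rewarding regions with zero control depth.
        if chip_sum >= min_ratio * (ctrl_sum + 1):
            kept.append((start, end))
    return kept


chip = [0, 1, 5, 6, 7, 1, 0, 9, 0, 4, 4, 4, 0]
control = [0, 0, 1, 1, 1, 0, 0, 0, 0, 4, 4, 4, 0]
candidates = find_candidates(chip, min_depth=4)
peaks = filter_candidates(candidates, chip, control)
```

In this toy input the single-position spike is discarded for its width and the region that is equally deep in the control is discarded for lack of enrichment, leaving one candidate.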
2.7.2 Correcting for Duplicated Sequences<br />
The ChIP-Seq procedure yields quantities of DNA that are typically orders of magnitude below<br />
the amounts retrieved by other methods of DNA sampling. The excess PCR rounds required<br />
to obtain a compliant sequencing library often bring about significant amounts of duplicated<br />
sequences, reducing library complexity. PCR may entail significant sequence bias (section 1.4)<br />
with erratic rates of fragment duplication.<br />
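To make the notion of reduced library complexity concrete, the sketch below estimates a duplication rate from a list of read alignments. It assumes, as a common convention rather than anything stated here, that two reads are duplicates when they share chromosome, 5' position and strand; the tuple layout and the function name are hypothetical.

```python
from collections import Counter


def duplication_summary(alignments):
    """Return (unique, total, duplication_rate) for an iterable of
    (chromosome, five_prime_position, strand) tuples.

    Assumption: identical tuples are treated as copies of one fragment."""
    counts = Counter(alignments)
    total = sum(counts.values())
    unique = len(counts)
    rate = 1.0 - unique / total if total else 0.0
    return unique, total, rate


reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),   # same position and strand: counted as a duplicate
    ("chr1", 100, "-"),   # opposite strand: counted as a distinct fragment
    ("chr2", 5, "+"),
]
unique, total, rate = duplication_summary(reads)
```

A rate close to zero suggests a complex library, whereas a high rate indicates that many reads are copies of the same fragments.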
Duplicate sequences represent the same fragment of DNA and therefore violate the assumption<br />
that each sequencing read is sampled independently from the set of sequences under examination.<br />
During detection of sequence enrichment, the presence of duplicate sequences must thus be<br />
explicitly accounted for by the predictive model, or these sequences must be pruned from the