An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
1 Introduction
Next-Generation Sequencing has evolved into a powerful tool for many areas of biological
science, but has also introduced many new challenges related to data analysis
and computing infrastructure into the field. Following a brief recapitulation of high-throughput
DNA sequencing technology, this chapter highlights important areas of
application while discussing previous and related work. We outline general sequencing
data analysis methodology and conclude by establishing the motivation for this
work.
1.1 High-Throughput DNA Sequencing<br />
Although parallel next-generation sequencing technology has been widely deployed for less than a decade,
its impact on genetic research has already been considerable. The novel high-throughput sequencing
approaches are characterized by massive parallelism implemented in a single device. The key advantage
of this design is a significantly reduced cost per sequenced base, which has
allowed them to quickly displace the previously predominant Sanger sequencing method in many areas
of application.<br />
Sanger sequencing, the prevalent sequencing method for most of the time since its introduction
in 1977 [2], needed to rely on extensive automation in dedicated sequencing centers to
achieve time- and cost-effective readout of large amounts of sequence. Most prominently, this was
put into effect during the effort to generate the first draft sequence of the human genome [36].
By contrast, high-throughput sequencing instruments are designed to analyze thousands to millions
of DNA molecules simultaneously, and thus enable even smaller institutions to produce
large quantities of sequence data on site.
Beyond elucidating DNA primary structure, sequencing used as a random sampling
device delivers quantitative information on sample composition. In this capacity, it has been
used for diverse purposes such as inference of DNA methylation levels [7] or the analysis of environmental
samples [8]. Together with the economic advantage over the dideoxynucleotide
chain-termination method, the wide accessibility of parallel DNA sequencing to researchers
has promoted such deep sequencing approaches, which open up entirely new areas of application
beyond the domain previously occupied by Sanger sequencing.
As an alternative to microarray technologies, for example in gene expression profiling or
chromatin immunoprecipitation assays, deep sequencing overcomes fundamental technological
restrictions such as probe resolution and probe saturation. Furthermore, because it does not
require a priori knowledge of the sequences to be detected or quantified, novel
application protocols have been developed that transform high-throughput sequencing methods
into a versatile tool for research.