28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib


1 Introduction

Next-Generation Sequencing has evolved into a powerful tool for many areas of biological science, but has also introduced many new challenges related to data analysis and computing infrastructure into the field. Following a brief recapitulation of high-throughput DNA sequencing technology, this chapter highlights important areas of application while discussing previous and related work. We outline general sequencing data analysis methodology and conclude by establishing the motivation for this work.

1.1 High-Throughput DNA Sequencing<br />

Widely deployed for less than a decade, parallel next-generation sequencing technology has already had a considerable impact on genetic research. The novel high-throughput sequencing approaches are characterized by massive parallelism implemented in a single device. The key advantage of this design is a significantly reduced cost per sequenced base, which has allowed these approaches to quickly displace the previously predominant Sanger sequencing method in many areas of application.

Sanger sequencing, the prevalent sequencing method for most of the time since its introduction in 1977 [2], had to rely on extensive automation in dedicated sequencing centers to achieve time- and cost-effective readout of large amounts of sequence. Most prominently, this was put into effect during the effort to generate the first draft sequence of the human genome [36]. By contrast, high-throughput sequencing instruments are designed to analyze thousands to millions of DNA molecules simultaneously, and thus enable even smaller institutions to produce large quantities of sequence data on site.

Beyond elucidating DNA primary structure, sequencing used as a random sampling device delivers quantitative clues about sample composition. In this capacity, it has been used for diverse purposes such as inference of DNA methylation levels [7] or the analysis of environmental samples [8]. Together with its economic advantage over the dideoxynucleotide chain-termination method, the wide accessibility of parallel DNA sequencing to researchers has promoted such deep sequencing approaches, which open up whole new areas of application beyond the domain previously occupied by Sanger sequencing.

As an alternative to microarray technologies, for example in gene expression profiling or chromatin immunoprecipitation assays, deep sequencing overcomes fundamental technological restrictions such as probe resolution and probe saturation. Furthermore, because it does not inherently require a priori knowledge of the sequences to be detected or quantified, novel application protocols have been developed that transform high-throughput sequencing methods into a versatile tool for research.
