An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
1.5. SEQUENCING DATA ANALYSIS 11<br />
the most likely combination of alleles to give rise to the observed data, along with a measure of<br />
condence for the provided solution.<br />
Identication of this most likely combination of alleles at each reference genome position is<br />
therefore the main problem to be solved by analysis algorithms. A major issue to be overcome<br />
by these algorithms is to achieve an integration of models for sequencing errors, read mapping<br />
<strong>and</strong> base alignment errors. While the probability of sequencing errors may be systematically<br />
inuenced by the local sequence context, individual error events in dierent reads may be considered<br />
as largely independent. Sources of mapping <strong>and</strong> base alignment error are however mostly<br />
systematic in nature. Reads <strong>and</strong> aligned bases therefore do not represent independent evidence<br />
for a certain allele conguration, preventing these types of error from being as readily captured<br />
by statistical modeling.<br />
The majority of consensus <strong>and</strong> genotype calling algorithms commence by examining the<br />
pileup of base calls <strong>and</strong> base qualities provided by the mapped reads overlapping a position.<br />
Given sucient depth of coverage, eventual sequencing errors can be thereby identied at high<br />
condence.<br />
However, sequencing error may not only cause invalid readout of single positions, but also<br />
spurious cross-mappings to loci with similar sequence to that of the true origin of a read. The<br />
likelihood of such spurious mappings may be assessed using measures like the mapping quality<br />
provided by some alignment tools (section 1.5.3). Furthermore, heuristic parts of the mapping<br />
algorithms used to achieve improved mapping performance may also cause systematic crossmappings<br />
that can be indistinguishable from true variants or give rise to invalid multi-allelic<br />
calls. Such pseudo multi-allelic regions may further arise due to copy number variants, underrepresentation<br />
of repetitive sequence in the reference assembly or sample contamination.<br />
<strong>An</strong>other source of systematic miscalls presents base alignment error. Base alignment error<br />
is introduced by reads, albeit mapped to the locus on the reference genome from which they<br />
originate, whose alignment provided by the read mapping tool does not suggest the correct set<br />
of base pairings. This type of error mainly occurs at transition points to insertions, deletions,<br />
copy number variants or other types of rearrangement or at exon-intron boundaries in mRNA<br />
sequencing. Reads overlapping these transition points only with a small number of positions do<br />
not contain enough information to support the correct type of variant, causing spurious alignment<br />
of terminal nucleotides.<br />
For these reasons, current consensus <strong>and</strong> genotype call algorithms employ combinations of<br />
statistical models as well as alignment correction algorithms <strong>and</strong> heuristics for removing or<br />
penalizing read alignment issues.<br />
The Genome <strong>An</strong>alysis Toolkit (GATK) [97] combines Bayesian heterozygous genotype calling<br />
based on quality score reca<strong>lib</strong>ration with indel realignment <strong>and</strong> various heuristics for penalizing<br />
mapping artifacts [98].<br />
SHORE [18] implements a exibly ca<strong>lib</strong>ratable scoring matrix approach [1, 99] that calculates<br />
SNP, indel <strong>and</strong> reference call scores based on a user-denable weighting of various properties of<br />
the read alignments overlapping a position. The method allows to impose a penalty on, or to<br />
ignore a number of terminal nucleotides towards the ends of each alignment, ruling out many<br />
cases of base alignment error at the cost of an eective loss of sequencing depth.<br />
The SAMtools package [100] implements a prole HMM for estimation of a Base Alignment<br />
Quality (BAQ) for the assessment of potential mapping artifacts [101].<br />
1.5.5 Quantication by Deep Sequencing<br />
In principle, deep sequencing constitutes statistical sampling of the population of molecules that<br />
make up the sequencing <strong>lib</strong>rary. Therefore, obtained sequencing data may be transformed into