28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib


1.5. SEQUENCING DATA ANALYSIS

the most likely combination of alleles to give rise to the observed data, along with a measure of confidence for the provided solution.

Identification of this most likely combination of alleles at each reference genome position is therefore the main problem to be solved by analysis algorithms. A major challenge for these algorithms is integrating models for sequencing errors, read mapping errors and base alignment errors. While the probability of sequencing errors may be systematically influenced by the local sequence context, individual error events in different reads may be considered largely independent. Sources of mapping and base alignment error, however, are mostly systematic in nature. Reads and aligned bases affected by them therefore do not represent independent evidence for a certain allele configuration, which prevents these types of error from being as readily captured by statistical modeling.

The majority of consensus and genotype calling algorithms commence by examining the pileup of base calls and base qualities provided by the mapped reads overlapping a position. Given sufficient depth of coverage, any sequencing errors can thereby be identified with high confidence.
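
As an illustration of how a pileup can be turned into a call with an attached confidence, the following is a minimal sketch that assumes independent sequencing errors, Phred-scaled base qualities, and an erroneous base call that is equally likely to be any of the three other bases. The function names and the confidence measure (a Phred-like gap between the best and second-best genotype) are illustrative assumptions, not the method of any particular tool.

```python
import math
from itertools import combinations_with_replacement


def phred_to_error(q):
    """Convert a Phred-scaled quality into an error probability."""
    return 10.0 ** (-q / 10.0)


def genotype_log_likelihoods(pileup):
    """Return log10 likelihoods of all ten diploid genotypes at one position.

    pileup: list of (base, phred_quality) tuples, one per overlapping read.
    Assumption: errors are independent between reads.
    """
    result = {}
    for a1, a2 in combinations_with_replacement("ACGT", 2):
        total = 0.0
        for base, qual in pileup:
            e = phred_to_error(qual)
            # Probability of observing `base` given the read stems from a1 or a2.
            p1 = 1.0 - e if base == a1 else e / 3.0
            p2 = 1.0 - e if base == a2 else e / 3.0
            # Each allele is sampled with probability 1/2.
            total += math.log10(0.5 * p1 + 0.5 * p2)
        result[a1 + a2] = total
    return result


def call_genotype(pileup):
    """Return the most likely genotype and a Phred-like confidence value."""
    ranked = sorted(genotype_log_likelihoods(pileup).items(),
                    key=lambda kv: kv[1], reverse=True)
    (best, best_ll), (_, second_ll) = ranked[0], ranked[1]
    return best, 10.0 * (best_ll - second_ll)


if __name__ == "__main__":
    # Twenty reads support A, a single read shows C at the same quality.
    print(call_genotype([("A", 30)] * 20 + [("C", 30)]))
```

With twenty concordant reads, the single discordant base is far better explained as a sequencing error than as a second allele, which is the effect of coverage depth described above.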

However, sequencing errors may cause not only invalid readout of single positions, but also spurious cross-mappings to loci whose sequence is similar to that of the true origin of a read. The likelihood of such spurious mappings may be assessed using measures like the mapping quality provided by some alignment tools (section 1.5.3). Furthermore, heuristic parts of the mapping algorithms used to achieve improved mapping performance may also cause systematic cross-mappings that can be indistinguishable from true variants or give rise to invalid multi-allelic calls. Such pseudo multi-allelic regions may further arise due to copy number variants, underrepresentation of repetitive sequence in the reference assembly or sample contamination.
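
One simple way to take mapping quality into account is sketched below. It assumes a Phred-scaled mapping quality as used in the SAM format, discards reads below a threshold, and caps each base quality at the mapping quality of its read. Both the threshold of 20 and the capping rule are illustrative heuristics chosen for this sketch, and the function names are hypothetical.

```python
def mapq_to_error(mapq):
    """Probability that a read is mapped to the wrong locus (Phred-scaled MAPQ)."""
    return 10.0 ** (-mapq / 10.0)


def filter_pileup(pileup, min_mapq=20):
    """Drop poorly mapped reads and cap base quality by mapping quality.

    pileup: list of (base, base_quality, mapping_quality) tuples.
    Returns a list of (base, effective_quality) tuples.
    Assumption: a base cannot be more reliable than the placement of its read.
    """
    return [(base, min(qual, mapq))
            for base, qual, mapq in pileup
            if mapq >= min_mapq]


if __name__ == "__main__":
    pileup = [("A", 30, 60), ("C", 30, 3), ("A", 40, 25)]
    # The C call stems from a read that is most likely cross-mapped.
    print(filter_pileup(pileup))
```

Note that filtering of this kind only addresses cross-mappings that the aligner itself recognizes as ambiguous; systematic cross-mappings with high reported mapping quality are not caught.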

Another source of systematic miscalls is base alignment error. Base alignment error is introduced by reads that, although mapped to the locus on the reference genome from which they originate, have been given an alignment by the read mapping tool that does not suggest the correct set of base pairings. This type of error mainly occurs at transition points to insertions, deletions, copy number variants or other types of rearrangement, or at exon-intron boundaries in mRNA sequencing. Reads overlapping these transition points by only a small number of positions do not contain enough information to support the correct type of variant, causing spurious alignment of terminal nucleotides.

For these reasons, current consensus and genotype calling algorithms employ combinations of statistical models as well as alignment correction algorithms and heuristics for removing or penalizing read alignment issues.

The Genome Analysis Toolkit (GATK) [97] combines Bayesian heterozygous genotype calling based on quality score recalibration with indel realignment and various heuristics for penalizing mapping artifacts [98].
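
To make the Bayesian step concrete, the sketch below combines genotype likelihoods for a biallelic site with a prior and normalizes the result. It is a simplified illustration and not GATK's implementation: the genotype labels (`RR`, `RA`, `AA` for homozygous reference, heterozygous and homozygous alternate), the heterozygosity parameter `theta` and the form of the prior are assumptions made for this example.

```python
import math


def genotype_posteriors(log10_likelihoods, theta=0.001):
    """Return posterior probabilities for the genotypes RR, RA and AA.

    log10_likelihoods: dict mapping "RR", "RA", "AA" to log10 P(data | genotype).
    theta: assumed prior probability of a heterozygous site (illustrative value).
    """
    priors = {"RR": 1.0 - 1.5 * theta, "RA": theta, "AA": 0.5 * theta}
    joint = {g: log10_likelihoods[g] + math.log10(priors[g]) for g in priors}
    # Subtract the maximum before exponentiating to avoid numeric underflow.
    top = max(joint.values())
    unnormalized = {g: 10.0 ** (v - top) for g, v in joint.items()}
    norm = sum(unnormalized.values())
    return {g: u / norm for g, u in unnormalized.items()}


if __name__ == "__main__":
    # Weak evidence for a heterozygous site is outweighed by the prior.
    print(genotype_posteriors({"RR": -1.0, "RA": -0.5, "AA": -3.0}))
    # Strong evidence overrides the prior.
    print(genotype_posteriors({"RR": -34.8, "RA": -6.0, "AA": -34.8}))
```

The prior encodes the expectation that most positions match the reference, so a variant is only called once the read evidence is strong enough to overcome it.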

SHORE [18] implements a flexibly calibratable scoring matrix approach [1, 99] that calculates SNP, indel and reference call scores based on a user-definable weighting of various properties of the read alignments overlapping a position. The method allows imposing a penalty on, or ignoring, a number of terminal nucleotides towards the ends of each alignment, ruling out many cases of base alignment error at the cost of an effective loss of sequencing depth.
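
The trade-off between ignoring terminal nucleotides and retaining depth can be illustrated with a small sketch. It does not reproduce SHORE's scoring matrix; it only counts how many alignments still contribute to a position once a fixed number of bases at each alignment end is ignored. Alignments are assumed to be ungapped and given as inclusive reference coordinates, and the function and parameter names are hypothetical.

```python
def effective_depth(alignments, position, ignore_terminal=0):
    """Count alignments covering `position` outside their terminal bases.

    alignments: list of (start, end) reference coordinates, both inclusive.
    ignore_terminal: number of bases ignored at each end of every alignment.
    Assumption: alignments are ungapped, so read offsets map 1:1 to the reference.
    """
    depth = 0
    for start, end in alignments:
        if start + ignore_terminal <= position <= end - ignore_terminal:
            depth += 1
    return depth


if __name__ == "__main__":
    alignments = [(100, 149), (120, 169), (146, 195), (60, 150)]
    print(effective_depth(alignments, 148))                     # all four reads count
    print(effective_depth(alignments, 148, ignore_terminal=5))  # only one read remains
```

In this example three of the four reads cover position 148 only with their last few bases, so ignoring five terminal nucleotides removes exactly the observations most prone to base alignment error, while reducing the usable depth from four to one.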

The SAMtools package [100] implements a profile HMM for estimation of a Base Alignment Quality (BAQ) for the assessment of potential mapping artifacts [101].

1.5.5 Quantification by Deep Sequencing

In principle, deep sequencing constitutes statistical sampling of the population of molecules that make up the sequencing library. Therefore, obtained sequencing data may be transformed into
