29.01.2013 Views

Applied Biosystems SOLiD™ 4 System SETS Software User Guide ...


Appendix B Advanced Topic: Data Analysis Overview

Data analysis considerations

SNP error rates

A SNP must produce two adjacent color changes. A single-color mismatch is not evidence of a SNP. This feature allows the SOLiD software to distinguish measurement errors from true SNPs.

When you use 2-base encoding, any SNP in the original sequence is represented as two adjacent mismatches in color space. Only three of the nine possible adjacent mismatches can correspond to a real SNP.
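The "three of nine" count can be checked by enumeration. The sketch below is illustrative only: it assumes the commonly described 2-base mapping in which A, C, G, and T are coded 0 to 3 and each color is the XOR of the two adjacent base codes (this section does not spell the mapping out), and the function names are hypothetical.

```python
from itertools import product

# Assumed 2-bit base codes; a color call is the XOR of two adjacent base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Return the color calls (0-3) for each adjacent base pair in seq."""
    codes = [BASE_CODE[base] for base in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def valid_snp_color_pairs(left, ref, right):
    """Color pairs obtained by substituting the middle base of left-ref-right."""
    return {tuple(encode(left + alt + right)) for alt in "ACGT" if alt != ref}


def all_double_mismatches(left, ref, right):
    """Every observed color pair that differs from the reference at both positions."""
    c1, c2 = encode(left + ref + right)
    return {
        (o1, o2)
        for o1, o2 in product(range(4), repeat=2)
        if o1 != c1 and o2 != c2
    }


valid = valid_snp_color_pairs("A", "C", "G")
candidates = all_double_mismatches("A", "C", "G")
print(f"{len(valid)} of {len(candidates)} adjacent double mismatches fit a SNP")
```

Under this mapping, a substitution shifts both flanking colors by the same XOR value, which is why only the pairs with equal shifts are consistent with a real SNP.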

Suppose the raw sequencing error rate is 1% per base and the SNP rate of the sequenced species is about 0.1%. For a sequencing project of M total bases, there are about 0.001M real SNP occurrences in the data set. Among these SNPs, 0.001M × 0.02 cases appear as a single mismatch or as two invalid mismatches because a sequencing error occurs at one of the two color positions that encode the SNP. That is, only 2% of the real SNPs fail to appear as two adjacent, valid mismatches. If only two adjacent, valid mismatches are treated as candidates for SNP detection, the false negative rate is 2%.

In any data set, sequencing errors can also produce two adjacent, valid mismatches. However, for a total of M bp sequenced, only about 0.00003M such occurrences are expected from sequencing errors, compared with about 0.001M adjacent, valid mismatches from real SNPs. Of all the adjacent, valid mismatches observed in a particular data set, therefore, about 97% are from real SNPs and only about 3% may be from sequencing errors. This is a 97% true discovery rate for that data set.

The data set contains about 0.01M total sequencing errors. Because 2-base encoding allows the software to remove all single mismatches, the number of errors that can be mistaken for SNPs is reduced from 0.01M to 0.00003M. This is a reduction of about 300 times, making the effective error rate about 0.003%. This calculation illustrates the power of 2-base encoding in resequencing and finding SNPs.
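The arithmetic in this example can be reproduced with a short script. This is a sketch under stated assumptions, not part of the SOLiD software: all quantities are expressed per sequenced base (M = 1), the 0.00003M figure is assumed to come from two independent errors (1% × 1%) multiplied by the 3-in-9 chance that the pair is valid, and the function and key names are hypothetical.

```python
def snp_error_budget(error_rate=0.01, snp_rate=0.001, valid_fraction=3 / 9):
    """Reproduce the worked example; every count is per sequenced base (M = 1)."""
    real_snps = snp_rate

    # A SNP is missed when either of its two color calls carries an error.
    missed_snps = real_snps * 2 * error_rate
    false_negative_rate = missed_snps / real_snps

    # Assumption: two independent adjacent errors, of which 3 in 9 look valid.
    false_candidates = error_rate ** 2 * valid_fraction

    # Share of adjacent, valid mismatches that come from real SNPs.
    true_discovery_rate = real_snps / (real_snps + false_candidates)

    # How much smaller the SNP-like error count is than the raw error count.
    reduction = error_rate / false_candidates

    return {
        "false_negative_rate": false_negative_rate,
        "false_candidates": false_candidates,
        "true_discovery_rate": true_discovery_rate,
        "reduction": reduction,
    }


budget = snp_error_budget()
for name, value in budget.items():
    print(f"{name}: {value:.6g}")
```

Changing `error_rate` or `snp_rate` shows how sensitive the true discovery rate is to the raw error rate, since the false candidate count scales with its square.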

