Applied Biosystems SOLiD™ 4 System SETS Software User Guide ...
Appendix B Advanced Topic: Data Analysis Overview<br />
Data analysis considerations<br />
SNP error rates<br />
A SNP must have two adjacent color changes. Single-color<br />
mismatches are not evidence of a SNP. This feature allows SOLiD <br />
software to distinguish errors in measurement from true SNPs.<br />
When you use 2-base encoding, any SNP in the original sequence is<br />
represented as two adjacent mismatches in color-space. Only three of<br />
nine possible adjacent mismatches can correspond to a real SNP.<br />
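The two-adjacent-mismatch rule above can be illustrated with a short sketch. It assumes the commonly described SOLiD color code, in which each color is the XOR of the 2-bit codes of two neighboring bases (A=0, C=1, G=2, T=3); this guide section does not spell out that mapping, and the function names and example sequences are hypothetical.<br />

```python
from itertools import product

# Assumed 2-bit base codes; the color of a dinucleotide is the XOR of its two bases.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(seq):
    """Return the color call (0-3) for every pair of adjacent bases in seq."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def is_valid_adjacent_mismatch(ref_pair, read_pair):
    """True if both colors differ and the change is consistent with one base change.

    A single-base change leaves the flanking bases untouched, so the XOR of
    the two affected colors must be the same in the reference and the read.
    """
    (r1, r2), (o1, o2) = ref_pair, read_pair
    return r1 != o1 and r2 != o2 and (r1 ^ r2) == (o1 ^ o2)


# Hypothetical example: one base substitution (G -> T) at index 3.
reference = "ATGGCA"
variant = "ATGTCA"

ref_colors = encode_colors(reference)
var_colors = encode_colors(variant)

# Positions in color space where the two sequences disagree.
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]

# Enumerate every way both colors of the affected pair could mismatch.
ref_pair = (ref_colors[2], ref_colors[3])
candidates = [
    pair for pair in product(range(4), repeat=2)
    if pair[0] != ref_pair[0] and pair[1] != ref_pair[1]
]
valid = [pair for pair in candidates if is_valid_adjacent_mismatch(ref_pair, pair)]

print("changed color positions:", changed)
print("adjacent mismatch pairs:", len(candidates), "valid:", len(valid))
```

Under this assumed encoding, the single substitution changes exactly two neighboring colors, and three of the nine possible double mismatches pass the validity check.<br />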
Suppose the raw sequencing error rate is 1% per base and the SNP rate<br />
for the sequenced species is about 0.1%. A sequencing project of M<br />
total bases then contains about 0.001M real SNPs. In about<br />
0.001M×0.02 of these cases, a sequencing error at one of the two<br />
color calls that encode the SNP makes it appear as a single mismatch<br />
or as two invalid mismatches. Only 2% of the real SNPs therefore fail<br />
to appear as two adjacent, valid mismatches. If only two adjacent,<br />
valid mismatches are treated as candidates for SNP detection, the<br />
false negative rate is 2%.<br />
Sequencing errors can also produce two adjacent, valid mismatches.<br />
For a total of M bases sequenced, however, only about 0.00003M such<br />
occurrences are expected (about 0.01×0.01×M adjacent mismatches, of<br />
which three in nine are valid), compared with about 0.001M adjacent,<br />
valid mismatches from real SNPs. Of all the adjacent, valid<br />
mismatches observed in the data set, 97% are from real SNPs and only<br />
3% may be from sequencing errors. This is a 97% true discovery rate<br />
for that data set.<br />
The data set contains about 0.01M total sequencing errors. Because<br />
2-base encoding allows the software to discard all single-color<br />
mismatches and invalid adjacent mismatches, the number of errors that<br />
can be mistaken for SNPs falls from 0.01M to 0.00003M. This is a<br />
300-fold reduction, making the effective error rate 0.003%. This<br />
calculation illustrates the power of 2-base encoding in resequencing<br />
and finding SNPs.<br />
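The arithmetic in this example can be reproduced with a minimal sketch. It assumes that errors at neighboring positions are independent, so that the rate of adjacent, valid error pairs is the per-base error rate squared times the three-in-nine valid fraction; that derivation is inferred from the figures in the text rather than stated by it. The function name and its parameters are hypothetical.<br />

```python
def snp_error_estimates(error_rate=0.01, snp_rate=0.001, valid_fraction=3 / 9):
    """Per-base estimates for SNP calling from two adjacent, valid mismatches.

    All returned values are rates per sequenced base; multiply by M for counts.
    Assumes independent errors at adjacent positions (an illustration only).
    """
    # A real SNP is missed if either of its two color calls is read wrongly.
    false_negative_rate = 2 * error_rate

    # Two adjacent errors that happen to form a valid mismatch pair.
    error_pair_rate = error_rate ** 2 * valid_fraction

    # Share of observed adjacent, valid mismatches that come from real SNPs.
    true_discovery_rate = snp_rate / (snp_rate + error_pair_rate)

    # Factor by which SNP-like errors drop relative to the raw error rate.
    reduction_factor = error_rate / error_pair_rate

    return {
        "false_negative_rate": false_negative_rate,
        "error_pair_rate": error_pair_rate,
        "true_discovery_rate": true_discovery_rate,
        "reduction_factor": reduction_factor,
    }


estimates = snp_error_estimates()
for name, value in estimates.items():
    print(f"{name}: {value:.6g}")
```

With the default values from the text (1% error rate, 0.1% SNP rate), the sketch gives a 2% false negative rate, a true discovery rate of about 97%, and a reduction factor of about 300.<br />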