14.06.2013 Views

Databases and Systems



fragment. In some cases, the same gene has been sequenced independently by different groups, or has been sequenced several times from different individuals to study polymorphism. In principle, a database should contain only one entry for each gene and, if the gene is polymorphic, its allelic variations should be described in the annotations. In practice, however, all redundant sequences are entered in databases, and partially overlapping sequences are not merged.

This redundancy is very problematic, not only because it gives a confused view of the status of these redundant sequences (are they identical? splicing or allelic variants of the same gene? paralogous genes?), but also because it can considerably bias the results of statistical analyses.

In HOVERGEN, we systematically compare all CDSs with each other (using BLASTN [13]) to try to identify those that correspond to the same gene. As previously discussed [6], the problem with this approach is that two redundant CDSs may show some differences due to polymorphism, sequencing errors, or annotation errors. Taking into account published estimates of sequence polymorphism [15] and sequencing error rates [16-18], we decided to consider as redundant all CDSs that share more than 99% identity (at the DNA level) and differ in length by less than 3 bases. Using these criteria, we detected 21% redundancy among the 63,520 vertebrate CDSs available in GenBank (release 101, June 1997). This level of redundancy is remarkably high, and it is thus necessary to take it into account when doing statistical analyses on sequence databases.
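The redundancy criterion above can be illustrated with a minimal Python sketch. The two thresholds come from the text; everything else is an assumption made for illustration: the function names, the input layout (a table of CDS lengths plus pairwise identity values such as those reported by a BLASTN comparison), and the choice to group pairs transitively with a union-find structure. The text does not say how pairwise matches are combined into sets, so this should not be read as the HOVERGEN implementation.

```python
IDENTITY_THRESHOLD = 99.0  # percent identity at the DNA level (from the text)
MAX_LENGTH_DIFF = 3        # bases; the text requires a difference of less than 3


def is_redundant(identity_pct, len_a, len_b):
    """Return True if two CDSs meet both redundancy criteria from the text."""
    return identity_pct > IDENTITY_THRESHOLD and abs(len_a - len_b) < MAX_LENGTH_DIFF


def redundancy_sets(lengths, hits):
    """Group CDSs into redundancy sets.

    lengths: dict mapping a CDS identifier to its length in nucleotides.
    hits:    iterable of (id_a, id_b, identity_pct) pairwise comparisons.
    Assumption: pairs are merged transitively (single linkage).
    """
    parent = {cds: cds for cds in lengths}

    def find(x):
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity_pct in hits:
        if is_redundant(identity_pct, lengths[a], lengths[b]):
            parent[find(a)] = find(b)

    groups = {}
    for cds in lengths:
        groups.setdefault(find(cds), set()).add(cds)
    # Only sets with at least two members are redundant.
    return [members for members in groups.values() if len(members) > 1]
```

With hypothetical identifiers, `redundancy_sets({"A": 900, "B": 901, "D": 450}, [("A", "B", 99.6), ("A", "D", 99.9)])` groups A with B but leaves D out, because D fails the length criterion despite its high identity.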

Redundancy is not eliminated from HOVERGEN because each entry may be associated with useful information. Rather, redundancy is explicitly declared, using a new qualifier ('redundancy_ref') that is included in sequence annotations. The value of this qualifier is unique to each set of redundant CDSs. Thus, this information can easily be used to eliminate redundancy when required.
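As an illustration of how such a qualifier could be used to eliminate redundancy, the sketch below keeps one CDS per redundancy set. The record layout (a list of dictionaries with an optional `redundancy_ref` key) and the function name `drop_redundant` are hypothetical, and keeping the first entry encountered in each set is an arbitrary choice; the text does not specify which member of a set should be retained.

```python
def drop_redundant(entries):
    """Keep every non-redundant CDS and one representative per redundancy set.

    entries: list of dicts; a CDS that belongs to a redundancy set carries a
    'redundancy_ref' value shared by all members of that set (hypothetical layout).
    """
    seen_refs = set()
    kept = []
    for entry in entries:
        ref = entry.get("redundancy_ref")
        if ref is None:
            # No declared redundancy: always keep the entry.
            kept.append(entry)
        elif ref not in seen_refs:
            # First member of this set: keep it as the representative.
            seen_refs.add(ref)
            kept.append(entry)
    return kept


# Usage with made-up identifiers:
cds_entries = [
    {"id": "cds1", "redundancy_ref": "R1"},
    {"id": "cds2", "redundancy_ref": "R1"},
    {"id": "cds3"},
    {"id": "cds4", "redundancy_ref": "R2"},
]
non_redundant = drop_redundant(cds_entries)
```

Because the redundant entries stay in the database, an analysis of recent evolutionary events can simply skip this filtering step.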

It is important to note that two homologous genes resulting from a recent duplication, speciation or conversion may be more than 99% identical (e.g. human α-1 and α-2 globin genes). Thus, the declaration of redundancy in HOVERGEN should not be taken into account when one wants to study recent evolutionary events (< 4 million years) [6].

Classification of sequences into families of homologous genes<br />

Sequence selection and similarity search. All available CDSs are classified, except partial CDSs shorter than 300 nt (about 25% of all CDSs). When several redundant CDSs are available, only one is analyzed for the classification. CDSs are translated into proteins and compared with each other using BLASTP [13] with the PAM 120 score matrix. The threshold score to report similarity (S parameter in BLASTP) is set according to protein length (L): S=150 for L 170 aa, S=L-20 for L
