12.07.2015 Views

View - ResearchGate

102 Crabtree et al.

3. For the sake of simplicity it is assumed that no protein is a member of more than one cluster in the same protein clustering analysis. However, the system is routinely used to compare protein clusters generated using either different algorithms or the same algorithm with different parameter settings.

4. Sybil reads annotation and comparative data from a Sybase or PostgreSQL (see www.postgresql.org/about/history for details) relational database using the chado (19) schema, the official database schema of the Generic Model Organism Database (GMOD) project (20). However, the system is based on a three-tier architecture that largely isolates the various display and query tools from the specific implementation details of the database server and schema.

5. WashU-BLASTP 2.0 (produced/licensed by the Washington University in St. Louis School of Medicine. See http://blast.wustle.edu/ for details) is used for the all-vs-all BLASTP search. The parameters are configurable but by default the following options are used: “-E 1e-5 -matrix BLOSUM62 -wordmask none -B 150 -V 150 -gspmax 5 -shortqueryok -novalidctxok -cpus 1”.

6. The current system uses bioinformatic sequence markup language (BSML) (21) to store the intermediate BLASTP results (which are also eventually loaded into the chado comparative database). BSML is an XML-based data exchange format for sequence-related data. In subsequent steps of the analysis the BLASTP matches are read from BSML flat files using a custom Perl API.

7. It should be emphasized that no additional conditions are placed on the BLASTP matches used to create the JACs, other than the E-value score and percent identity thresholds. In particular, there is no requirement that the BLASTP matches must cover a minimum percentage of either sequence, which means that a relatively short match, if of sufficiently high identity and statistical significance, is often enough to group polypeptides into the same Jaccard cluster.
In early comparative databases this lack of stringency was found to be more of a help than a hindrance, particularly when one or more of the input genomes has relatively low-quality automated annotation. Gene models that incorrectly lack one or more exons (and thus have artificially abridged polypeptide sequences) are nonetheless incorporated into the same cluster as the (correct) full-length versions of those genes. When an expert curator examines these clusters, possible annotation errors can be rapidly identified and tagged for correction in a future data release. However, in more recent comparative databases that contain more genomes and larger protein families, this lack of stringency, in conjunction with the subsequent single-linkage connected component analysis, has led to some pathological cases, in which a single well-conserved domain results in artificially large clusters of otherwise unrelated polypeptides. It is hoped that using a more stringent linkage criterion to compute the connected components will address this issue.

8. The default Jaccard clustering thresholds (80% identity for the BLASTP matches and 0.6 for the Jaccard coefficient threshold) were chosen by running the algorithm on a single representative comparative database using a range of different parameter values. The resulting matrix of Jaccard cluster sets was evaluated by an expert curator and default parameter values were chosen that satisfied the following conditions:
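
Notes 7 and 8 describe the Jaccard clustering step only in prose. A minimal Python sketch of that step, under stated assumptions, might look like the following. It is not the Perl code used by the system: the function name `jaccard_clusters`, the tuple layout of `matches`, the choice to count each protein as a member of its own neighbor set, and the default E-value cutoff of 1e-5 (borrowed from the BLASTP options in note 5) are all assumptions made for illustration. Only the 80% identity and 0.6 Jaccard coefficient defaults come from the text.

```python
from collections import defaultdict


def jaccard_clusters(matches, min_identity=80.0, max_evalue=1e-5, min_jaccard=0.6):
    """Group proteins into clusters from pairwise BLASTP matches.

    matches: iterable of (query, subject, percent_identity, evalue) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # Step 1: keep matches that pass the identity and E-value thresholds.
    # No alignment-coverage condition is applied, as in note 7.
    neighbors = defaultdict(set)
    for query, subject, identity, evalue in matches:
        if identity >= min_identity and evalue <= max_evalue:
            neighbors[query].add(subject)
            neighbors[subject].add(query)

    # Assumption: every protein belongs to its own neighbor set.
    for protein in list(neighbors):
        neighbors[protein].add(protein)

    # Union-find structure used for the single-linkage step.
    parent = {protein: protein for protein in neighbors}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    # Step 2: link two matched proteins when the Jaccard coefficient of
    # their neighbor sets reaches the threshold.
    for a in neighbors:
        for b in neighbors[a]:
            if a < b:
                shared = len(neighbors[a] & neighbors[b])
                total = len(neighbors[a] | neighbors[b])
                if shared / total >= min_jaccard:
                    parent[find(a)] = find(b)

    # Step 3: connected components of the linked pairs are the clusters.
    clusters = defaultdict(set)
    for protein in neighbors:
        clusters[find(protein)].add(protein)
    return sorted(sorted(members) for members in clusters.values())


# Toy input: A, B and C match each other; D matches only C;
# the D-E match fails the 80% identity threshold and is dropped.
matches = [
    ("A", "B", 95.0, 1e-50),
    ("A", "C", 90.0, 1e-40),
    ("B", "C", 92.0, 1e-45),
    ("C", "D", 85.0, 1e-20),
    ("D", "E", 70.0, 1e-30),
]
print(jaccard_clusters(matches))  # [['A', 'B', 'C'], ['D']]
```

In the toy input, C and D share 2 of 4 neighbors (coefficient 0.5), so they are not linked at the default threshold of 0.6. Lowering `min_jaccard` to 0.5 merges all four proteins into one cluster, a small-scale version of the chaining effect that note 7 attributes to single-linkage analysis.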
