12.07.2015 Views

From Protein Structure to Function with Bioinformatics.pdf


2 Fold Recognition

statistical measures to calculate whether a given score is significantly above this noise and to what degree.

Currently there is no generic analytical description of the shape of the distribution of threading or fold recognition scores across different models and sequences, though it is well understood that the distribution of optimal scores is not normal. For gapped local alignment of two sequences, or a sequence and a sequence profile, the distribution of optimal alignment scores can be approximated by an extreme value distribution. Systems such as BLAST, PSI-BLAST, Hidden Markov Models and many sequence-profile and profile-profile methods fit their output score distributions to an extreme value distribution from which it is then possible to calculate a p-value or e-value.

Some profile-based methods approximate the distribution of scores by a normal distribution and calculate Z-scores. The Z-scores are calculated with the mean and standard deviation of the scores of a query sequence with the library of all structural models. Similarly, many threading approaches use the optimal raw score as the primary measure of structure and sequence compatibility and estimate the statistical significance of the score assuming a normal distribution of the sequence scores threaded to a library of available models. The Gibbs-sampling threading approach (Bryant 1996) estimates the significance of the optimal score by comparison to the distribution of scores generated by threading a shuffled query sequence to the same structural model. The distribution of shuffled scores is assumed to be normal.
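
As a rough illustration of the two kinds of significance estimate described above, the following Python sketch computes a Z-score against a background of scores and, for comparison, a p-value from an extreme value (Gumbel) distribution fitted to the same background. Everything named here is an assumption made for illustration: the function names, the toy scoring function `toy_score`, and the method-of-moments Gumbel fit are hypothetical and are not taken from BLAST, the Gibbs-sampling threader, or any other program mentioned in the text, each of which uses its own scoring and fitting procedure.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
HYDROPHOBIC = set("AVILMFWC")


def z_score(score, background):
    """Z-score of `score` against a background sample (normality assumed)."""
    mean = statistics.fmean(background)
    sd = statistics.stdev(background)
    return (score - mean) / sd


def normal_p_value(z):
    """One-sided P(Z >= z) under a standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def gumbel_p_value(score, background):
    """P(S >= score) under a Gumbel distribution fitted by the method of moments.

    This simple fit is an assumption of the sketch, not the procedure
    used by any particular alignment or threading program.
    """
    mean = statistics.fmean(background)
    sd = statistics.stdev(background)
    beta = sd * math.sqrt(6) / math.pi   # scale parameter
    mu = mean - EULER_GAMMA * beta       # location parameter
    # 1 - exp(-exp(-x)), written with expm1 for accuracy at small p-values
    return -math.expm1(-math.exp(-(score - mu) / beta))


def shuffled_background(sequence, score_fn, n_shuffles=200, seed=0):
    """Score `n_shuffles` random permutations of `sequence` with `score_fn`."""
    rng = random.Random(seed)
    residues = list(sequence)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(score_fn("".join(residues)))
    return scores


def toy_score(sequence, buried_positions=frozenset(range(5))):
    """Hypothetical compatibility score: hydrophobic residues at buried sites."""
    return sum(1 for i in buried_positions if sequence[i] in HYDROPHOBIC)


if __name__ == "__main__":
    query = "LLLLLKKKKK"  # made-up sequence, not real data
    background = shuffled_background(query, toy_score)
    raw = toy_score(query)
    z = z_score(raw, background)
    print(f"Z-score against shuffled background: {z:.2f}")
    print(f"p-value (normal assumption): {normal_p_value(z):.3g}")
    print(f"p-value (Gumbel assumption): {gumbel_p_value(raw, background):.3g}")
```

The same `z_score` function applies whether the background is a shuffled query threaded onto one model or one query scored against a whole library of models; only the source of the background scores changes. Because the upper tail of a Gumbel distribution is heavier than that of a normal distribution, a high score will generally receive a less extreme p-value under the extreme value assumption, which is one reason the choice of distribution matters.
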
More recently, many fold recognition systems forgo any explicit statistical calculation and instead rely on machine learning approaches such as neural networks and support vector machines, trained on a benchmark set of known relationships, to predict an estimate of the accuracy.

However, the most cutting-edge structure prediction systems, which attempt to probe extremely remote homologous relationships, are frequently highly empirical and in general do not have robust statistical measures of likely error. It is important for the reader to realise that protein structure prediction is a very inexact science and, as a result, caution must be used when interpreting results. The most valuable tool in interpreting the results of structure prediction is invariably biological knowledge of the gene or system under study.

2.5 Tools for Fold Recognition on the Web

A large number of fold recognition systems are freely available on the web for academic use, a sample of which is listed in Table 2.2. In the most recent CASP7 competition, I-TASSER, HHpred, Robetta and Pcons all performed strongly. Pcons, Bioinfo, and Genesilico are all meta- or consensus servers that gather results from standalone servers and process the models returned using structural clustering or machine learning techniques, generally outperforming any individual server. The Robetta server from David Baker’s lab is not limited to fold recognition but can handle the entire spectrum of protein structure prediction from comparative modelling to ab initio. The more recently
