12.07.2015

From Protein Structure to Function with Bioinformatics.pdf


2 Fold Recognition

query sequence or template sequence, they are used for both and compared to one another (Fig. 2.4e). Each position in a sequence can be considered as a vector of probabilities. In the case of simple profiles, one has a 20-dimensional probability vector (1 dimension per amino acid type). A position in the query sequence is similar to a position in a template structure if they are under similar evolutionary pressures, which would be reflected in them having similar probability vectors. Many different techniques have recently been devised to compare such vectors (the simplest being a dot product), almost all of which surpass the simpler sequence-profile scoring approaches (e.g. Rychlewski et al. 2000; Ohlsen et al. 2004; Soeding 2005; Bennett-Lovsey et al. 2008).

In the light of the success of profile-profile methods, many groups generalised the process to include secondary structure profiles, where instead of a simple 3-state prediction of alpha helix, beta strand or coil, a probability was calculated for each state and treated as a vector, with some evidence of improved performance (Tang et al. 2003; Bennett-Lovsey et al. 2008). This is shown schematically in Fig. 2.4f.

However, as the power of profiles grew due to improved sequence databases, more careful profile construction and more intelligent profile-profile matching algorithms, the value of the additional predicted structural information seemed to diminish relative to its initially critical role in the early techniques of Bowie et al. (1991).
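The simplest profile-profile comparison mentioned in this section, a dot product between per-position probability vectors, can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring scheme of any of the cited methods: the function names, the toy profiles and the `ss_weight` parameter that mixes in a 3-state secondary structure vector are all hypothetical.

```python
# Hypothetical sketch: dot-product scoring between profile positions.
# All names and the toy profiles below are illustrative assumptions.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acid types


def dot(u, v):
    """Dot product of two equal-length probability vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def position_score(q_aa, t_aa, q_ss=None, t_ss=None, ss_weight=0.5):
    """Score one query position against one template position.

    q_aa, t_aa: 20-dimensional amino acid probability vectors.
    q_ss, t_ss: optional 3-state (helix, strand, coil) probability vectors.
    ss_weight:  assumed relative weight of the secondary structure term.
    """
    score = dot(q_aa, t_aa)
    if q_ss is not None and t_ss is not None:
        score += ss_weight * dot(q_ss, t_ss)
    return score


def score_matrix(query_profile, template_profile):
    """All-against-all position scores; rows are query positions."""
    return [[dot(q, t) for t in template_profile] for q in query_profile]


def one_hot(residue):
    """Profile column for a position that is always the same residue."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]


def uniform():
    """Profile column for a position with no residue preference."""
    return [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)


# Toy profiles: two positions each.
query = [one_hot("A"), uniform()]
template = [one_hot("A"), one_hot("W")]
scores = score_matrix(query, template)
```

A matrix like `scores` would typically be fed to a dynamic programming alignment step. The methods cited above use more elaborate column-comparison functions than a raw dot product.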
The most successful techniques for predicting secondary structure tend to rely on a machine learning algorithm such as artificial neural networks or support vector machines, trained on windows of sequence profiles that have been generated by PSI-BLAST. The reason for only marginal gains from using this information probably stems from a lack of novel or ‘orthogonal’ data. The source data used to make the secondary structure prediction is often the same profile used in the sequence matching. Thus it may be speculated that much of the information in the secondary structure prediction is probably already encoded in the profile from which it was derived.

2.3.3 Fold Classification and Support Vector Machines

Fold recognition is a classification problem. It can be cast as a series of questions regarding whether a given sequence adopts one or other of a variety of folds. As such, it is a problem amenable to the techniques developed in the machine learning field. Given known features of the query sequence, s, such as its amino acid composition, its sequence relatives, secondary structure prediction and so on, we wish to determine the most likely fold that s belongs to out of some set of folds F. Such classifiers can be broadly classified as generative or discriminative. A typical generative classifier is a Naïve Bayes classifier. The idea here is to determine the relative importance of each feature (the parameters of the model) in predicting the fold by examining the frequencies with which each feature is associated with each class in some training set.
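The frequency-counting idea behind a Naïve Bayes classifier can be sketched as below. This is a minimal illustration, assuming discrete features and add-one (Laplace) smoothing, neither of which is specified in the text. The feature values, the fold labels `fold_A` and `fold_B`, and the training set are invented toy data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count how often each feature value co-occurs with each fold.

    examples: list of (features, fold) pairs, where features is a tuple
    of discrete values (an assumption of this sketch).
    """
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (fold, feature index) -> value counts
    feature_values = defaultdict(set)      # feature index -> values seen in training
    for features, fold in examples:
        class_counts[fold] += 1
        for i, value in enumerate(features):
            feature_counts[(fold, i)][value] += 1
            feature_values[i].add(value)
    return class_counts, feature_counts, feature_values


def log_posteriors(model, features, alpha=1.0):
    """Unnormalised log P(fold | features), assuming independent features.

    alpha is the smoothing pseudo-count (assumed, not from the source).
    """
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for fold, n in class_counts.items():
        log_p = math.log(n / total)  # prior: frequency of the fold in training
        for i, value in enumerate(features):
            count = feature_counts[(fold, i)][value]
            n_values = len(feature_values[i])
            log_p += math.log((count + alpha) / (n + alpha * n_values))
        scores[fold] = log_p
    return scores


def predict(model, features):
    """Return the fold with the highest posterior score."""
    scores = log_posteriors(model, features)
    return max(scores, key=scores.get)


# Invented toy training set: (secondary structure bias, composition) -> fold.
training = [
    (("helix-rich", "hydrophobic"), "fold_A"),
    (("helix-rich", "polar"), "fold_A"),
    (("strand-rich", "polar"), "fold_B"),
    (("strand-rich", "hydrophobic"), "fold_B"),
    (("strand-rich", "polar"), "fold_B"),
]
model = train_naive_bayes(training)
best_fold = predict(model, ("helix-rich", "polar"))
```

Working in log space avoids numerical underflow when many features are multiplied together. The smoothing term keeps a feature value that was never seen with a given fold from forcing that fold's probability to zero.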
