12.07.2015 Views

From Protein Structure to Function with Bioinformatics.pdf



104 T. Nugent and D.T. Jones

Fig. 4.6 Eleven proteomes were filtered using the MEMSAT3 TM/globular discriminator. Those predicted to be TM proteins were subject to full TM topology prediction. X-axis: TM helix count. Y-axis: Number of proteins

colicin used to form pores in the outer membranes of competing bacteria. Furthermore, errors in databases are not infrequent and add an element of noise. While such noise is often well tolerated by machine learning methods, the problem is more significant in smaller data sets.

Another issue that needs to be addressed is homology in the data, with most data sets being reduced at a level of 30–40% sequence identity. Since structural TM protein data is at a premium, this level is perhaps slightly higher than that which would be applied to globular protein data sets. Although there is an increased risk of overfitting, this is necessary to ensure training sets are of sufficient size. All machine learning methods have multiple free parameters and thus have the potential to overfit. That is, rather than identifying a pattern in a sequence, an example may be learned 'by heart', including any noise that the sequence may contain. A method that has been overfitted is typically able to reproduce its training examples accurately, but will perform poorly on examples that it has not seen before. It is important that, when assessing the accuracy of a prediction method, homology in both training and test data sets is reduced in order to avoid overfitting.

In all cases, it is important that stringent cross-validation is performed. Cross-validation is the statistical practice of partitioning a data set into subsets such that a single subset is validated on a model trained using the remaining subsets, and the
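The partitioning idea can be illustrated with a minimal sketch. It is not the procedure used by MEMSAT3 or any other published method. It assumes the proteins have already been grouped into homology clusters by some external tool, and it places whole clusters into folds so that homologous sequences never appear in both the training and the test subset. The protein IDs, cluster labels, and the function names `cluster_folds` and `cross_validation_splits` are hypothetical.

```python
from collections import defaultdict


def cluster_folds(cluster_of, k):
    """Assign whole homology clusters to k folds, keeping fold sizes balanced."""
    members = defaultdict(list)
    for protein, cluster in cluster_of.items():
        members[cluster].append(protein)

    folds = [[] for _ in range(k)]
    # Greedy balancing: largest cluster first, each into the currently smallest fold.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), c)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(sorted(members[cluster]))
    return folds


def cross_validation_splits(folds):
    """Yield (train, test) pairs; each fold is held out exactly once."""
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


# Hypothetical protein IDs mapped to hypothetical homology cluster labels.
cluster_of = {
    "P1": "A", "P2": "A", "P3": "B",
    "P4": "C", "P5": "C", "P6": "D",
}

folds = cluster_folds(cluster_of, k=3)
for train, test in cross_validation_splits(folds):
    train_clusters = {cluster_of[p] for p in train}
    test_clusters = {cluster_of[p] for p in test}
    # No homology cluster appears on both sides of a split.
    assert train_clusters.isdisjoint(test_clusters)
    print("test:", test, "train:", train)
```

Splitting by cluster rather than by individual sequence is the design choice that matters here: a random per-sequence split could place close homologues on both sides, and the measured accuracy would then partly reflect memorised examples rather than generalisation.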
