12.07.2015 Views

View - ResearchGate


118 Date

BLAST score from the match against the P. falciparum genome. Sequence matches with BLAST E-values greater than $10^{-5}$ are typically discarded (see Note 5). Besides E-values, other attributes of the HSP can also be used to decide the quality of the match. For instance, a user might reject a match wherein the length of the HSP (or all HSPs combined) is not greater than 50% of the query length, or a match might be rejected based on a cutoff derived from the number of shared identical amino acids, thereby assuming absence of the query protein in the particular genome. These choices are reflected in the profile vector, and will ultimately affect the quality of the final results.

Using this method, phylogenetic profiles are constructed for each amino acid sequence included in the input file. The query set can be extended to include all known proteins from the given genome, whereby profiles can be generated on a genome-wide scale.

3.2.1.2. MEASURING PROFILE SIMILARITY FOR FUNCTION INFERENCE

Similarity between phylogenetic profiles is indicative of functional linkage between the corresponding proteins and can be measured in a number of different ways. Besides commonly used metrics such as Euclidean distance or Pearson correlation, advanced measures such as mutual information, Hamming distance, Jaccard coefficient, or the chance co-occurrence probability distribution can also be used (12). Mutual information (13–15) is the metric of choice for this protocol, as it has the ability to capture inverse and nonlinear relationships in the data, in addition to detecting direct and linear relationships. However, users are free to use other metrics, if they seem to perform better.

Mutual information is an information theoretic measure, which is the greatest when there is complete covariation between two sets of observations, and tends to zero as the sets diverge.
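As a rough illustration of the two steps described so far, the sketch below builds a profile vector from per-genome best-hit E-values and then scores two profiles with mutual information. It is a minimal sketch under stated assumptions, not the protocol's implementation: profiles are reduced to binary presence/absence (binned BLAST scores could be substituted), the function names `build_profile`, `entropy`, and `mutual_information` are hypothetical, and mutual information is computed with the standard identity MI(X,Y) = H(X) + H(Y) − H(X,Y) using natural logarithms.

```python
import math
from collections import Counter

# Cutoff quoted in the text: matches with E-values above 1e-5 are discarded.
E_VALUE_CUTOFF = 1e-5


def build_profile(best_evalues, cutoff=E_VALUE_CUTOFF):
    """Convert per-genome best-hit E-values into a 0/1 profile vector.

    Assumption: one E-value per reference genome, with None meaning that
    no hit was reported. A genome scores 1 only if its hit passes the cutoff.
    """
    return [1 if (e is not None and e <= cutoff) else 0 for e in best_evalues]


def entropy(symbols):
    """Shannon entropy (in nats) of the empirical distribution of symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def mutual_information(x, y):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) for two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same set of genomes")
    joint = list(zip(x, y))  # each genome contributes one (x, y) pair
    return entropy(x) + entropy(y) - entropy(joint)


if __name__ == "__main__":
    # Illustrative values only; these are not real BLAST results.
    protein_a = build_profile([1e-40, 3e-12, 0.2, None])
    protein_b = build_profile([5e-9, 1e-7, None, 4.0])
    print(protein_a, protein_b, mutual_information(protein_a, protein_b))
```

With binary profiles over four genomes, two identical profiles such as `[1, 1, 0, 0]` and `[1, 1, 0, 0]` give ln 2 (about 0.693), a perfectly inverted pair gives the same value, and a pair with no covariation such as `[1, 1, 0, 0]` and `[1, 0, 1, 0]` gives zero. This matches the behavior described above: the measure captures inverse as well as direct relationships and tends to zero as the profiles diverge.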
For the profile vectors of two proteins X and Y, mutual information (MI) can be calculated as follows:

$$MI(X,Y) = H(X) + H(Y) - H(X,Y)$$

In this equation, $H(X) = -\sum p(x) \ln p(x)$ represents the marginal entropy of the probability distribution $p(x)$ of gene X in each genome included in the database, summed over intervals in the probability distribution, whereas $H(X,Y) = -\sum\sum p(x,y) \ln p(x,y)$ represents the intrinsic entropy of the joint probability distribution of genes X and Y. Date and Marcotte (5) have described in detail the application of mutual information for measuring profile similarity. Users are directed to this paper for more information about the implementation of the method.

Mutual information can be measured in a pairwise manner for proteins in the query set. Naturally, not all mutual information values are biologically meaningful,
