Practical Course on Multiple Sequence Alignment - CNB - Protein ...


precise meaning of equivalence is generally context dependent: for the phylogeneticist, equivalent residues have common evolutionary ancestry; for the structural biologist, equivalent residues correspond to analogous positions belonging to homologous folds in a set of proteins; for the molecular biologist, equivalent residues play similar functional roles in their corresponding proteins. In each case, an alignment provides a bird's-eye view of the underlying evolutionary, structural, or functional constraints characterizing a protein family in a concise, visually intuitive format.

Many bioinformatic methods rely on MSAs, and the success of these methods depends on the quality of the alignment. Multiple sequence alignment is a difficult problem: although current automatic approaches are good, human intervention can usually improve their output, yielding better alignments and hence better analysis. Doing this well is difficult to teach, because it depends on experience. This course aims to provide a starting point for gaining that experience.

A typical multiple sequence alignment workflow will be:

1. Clearly formulate the question you want to answer. E.g. "What is the predicted secondary structure of a given protein sequence?"
2. Collect a set of sequences to address this question. E.g. using a BLAST database search.
3. Create and examine an initial automatic MSA. E.g. align using Clustal and examine in JalView.
4. Adjust the set of sequences to obtain an optimal set. E.g. remove unrelated sequences, add sequences from additional searches.
5. Manually adjust the alignment to correct automatically introduced errors.

Some notes on protein evolution

The evolutionary variations of a protein provide much information about the protein itself. Evolutionary divergence into different species has resulted in many variants of the same protein, all with essentially the same biological function but different amino acid sequences.
The differences and similarities of the amino acid sequences of these variants reflect the constraints of structure and function for that protein.

The number of possible RNA, DNA, or protein sequences is so great that it is implausible that similar long sequences could have arisen by any mechanism other than evolutionary divergence from the same ancestor. Nucleic acids and proteins that have evolved from a common ancestor are said to be homologous. The sequences of homologous genes and proteins were identical at the time they originated by replication of a single gene. Subsequently the two genes accumulated mutational changes, so the sequences of homologous genes and proteins can be identical, similar to varying degrees, or unrecognizably dissimilar because of extensive mutation. Proteins and genes are either homologous or not, because they either did or did not descend from a common ancestor.

The only explanation other than divergence for similarities among sequences is convergence, in which two or more unrelated sequences have become similar under selective pressure for similar functions. Convergence is a fairly common evolutionary phenomenon at the macroscopic level and is even encountered at the level of protein three-dimensional structure. There are, however, no instances in which nucleic acid or protein sequences have been shown conclusively to have become substantially similar by convergence. On the other hand, proving evolutionary convergence only from extant sequences is inherently difficult, so the possibility cannot be dismissed completely.

Mutations

Mutations are both the raw material and the driving force of evolution, whereas natural selection modulates the rate of divergence. A variety of mutations occur in DNA and RNA, by numerous and complex mechanisms. The simplest and most frequent is the replacement of one nucleotide by another; insertions and deletions of one or more nucleotides are also common. More complex rearrangements of DNA are less common but of greater consequence.

Variation among species

The more closely related the organisms, the more similar the sequences of their genes and proteins are found to be. Closely related proteins generally differ only by replacement of one amino acid by another at a few positions in the polypeptide chain. Less frequent are differences in the total number of residues, which are due to the deletion or insertion of residues within or at either end of the polypeptide chain. In more distant relationships, the numbers and natures of sequence differences can increase greatly.

In general, the amino acid replacements that occur during protein divergence are non-random, both in the extent to which various residues change and in which amino acids replace one another. The most prevalent replacements occur between amino acids with similar side chains. This bias in amino acid replacements presumably reflects the role of selection: only those mutations that do not disrupt the function of the protein survive.
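The kinds of differences described above (identical positions, replacements between similar side chains, other replacements, and insertions or deletions) can be counted directly from a pairwise alignment. The following Python sketch is only an illustration: the side-chain grouping in SIDE_CHAIN_CLASSES is one common textbook-style classification chosen here as an assumption, not a scheme taken from this course, and the function name classify_columns and the example sequences are hypothetical.

```python
# Assumed grouping of amino acids by side-chain character (illustrative only;
# other groupings or a substitution matrix could be used instead).
SIDE_CHAIN_CLASSES = {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "polar": "STNQ",
    "positive": "KRH",
    "negative": "DE",
    "special": "CGP",
}

# Reverse lookup: one-letter residue code -> class name.
CLASS_OF = {
    residue: name
    for name, members in SIDE_CHAIN_CLASSES.items()
    for residue in members
}


def classify_columns(seq_a, seq_b, gap="-"):
    """Count column types in two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    counts = {"identical": 0, "conservative": 0, "non_conservative": 0, "indel": 0}
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # column is empty in both sequences, nothing to compare
        if a == gap or b == gap:
            counts["indel"] += 1  # insertion or deletion in one sequence
        elif a == b:
            counts["identical"] += 1
        elif a in CLASS_OF and CLASS_OF[a] == CLASS_OF.get(b):
            counts["conservative"] += 1  # replacement within the same class
        else:
            counts["non_conservative"] += 1
    return counts


# Hypothetical aligned fragments, invented for this example.
result = classify_columns("MKV-LDE", "MRIALDG")
print(result)
```

With these invented fragments, K/R and V/I count as replacements between similar side chains, E/G as a dissimilar replacement, and the gapped column as an insertion or deletion. In real analyses a substitution matrix is the more usual way to express how acceptable each replacement is.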
Variation within species

DNA is known to be a very dynamic molecule that undergoes a wide variety of alterations and modifications, and gene duplications occur naturally and frequently. With two copies of a gene available in a genome, one copy could provide the necessary original function while the other accumulated mutations that altered its function. If this altered copy evolved eventually to serve a new function, it would tend to be retained in the genome and passed on to later generations.

Many genes and proteins of an organism are homologous and are obviously the products of gene duplication. The genes of such homologous proteins in a genome are said to comprise a gene family. Genes that occupy the same gene locus in different species, and protein products of such genes, are said to be orthologous, whereas genes at different loci that are related by gene duplication are designated paralogous (see Figure 1). The phylogenies of species can be reconstructed only by comparing orthologous genes. The differences between paralogous genes or proteins from different species are not related to the time since divergence of the species, but to the time since gene duplication.
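The definitions of orthologous and paralogous genes given above can be restated as a small decision rule. The Python sketch below is a toy illustration under stated assumptions: the Gene record, its fields, the relationship function, and the locus labels are all hypothetical, and real orthology assignment relies on phylogenetic analysis, since family membership and locus are not simply given as labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical record describing one gene for this toy example."""
    name: str
    species: str
    family: str  # gene family, i.e. a set of homologous genes
    locus: str   # label for the locus the gene occupies


def relationship(g1, g2):
    """Apply the definitions from the text to a pair of genes."""
    if g1.family != g2.family:
        return "not homologous"
    if g1.locus != g2.locus:
        # Different loci related by gene duplication, in the same
        # species or in different species.
        return "paralogous"
    if g1.species != g2.species:
        # Same locus in different species.
        return "orthologous"
    return "same gene"


# Toy data: globin genes with invented locus labels.
human_alpha = Gene("HBA", "human", "globin", "alpha-globin locus")
human_beta = Gene("HBB", "human", "globin", "beta-globin locus")
mouse_alpha = Gene("Hba", "mouse", "globin", "alpha-globin locus")
mouse_beta = Gene("Hbb", "mouse", "globin", "beta-globin locus")

print(relationship(human_alpha, mouse_alpha))
print(relationship(human_alpha, human_beta))
print(relationship(human_alpha, mouse_beta))
```

The last comparison shows why only orthologous genes are suitable for reconstructing species phylogenies: the human alpha and mouse beta genes come from different species, yet their differences date back to the duplication that separated the two loci, not to the divergence of the species.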
