Lecture 12 10/13/10 pdf - Department of Computer Science


Principles of Bioinformatics

BIO540/STA569/CSI660

Fall 2010


Lecture 12

Multiple Sequence Alignment 2


Administrivia


Administrivia

• The midterm examination will be on Monday, October 18th, in class.

– Closed book and notes.

– More details soon.

Fall 2010 BIO540/STA569/CSI660 4


Administrivia

• I’ve put up a set of exercises on pairwise

sequence alignment on the class web page.

– They are not a formal homework, but rather a

resource to help you study for the midterm

exam.



Today’s Content…


Readings

• MSA Basics

– 4.5, 6.4

• Blocks, Motifs and Patterns

– 4.8, 4.9, 6.1, 6.3

• HMMs

– 6.2

• Alternatives

– 4.10, 6.5, 6.6





Examining and Manipulating

MSA Results


MSA with ClustalW

• Some things to keep in mind when aligning

with ClustalW.

– If ClustalW is not giving you the exact

alignment you want, do not tweak parameters

to change the alignment.

– Instead, use a manual alignment editor to

change the alignment.



Viewing and Editing Alignments

• Jalview

– Available at http://www.jalview.org.

– Views an alignment created by other programs

• Reads input in a variety of formats (e.g. FASTA,

ALN).

– The alignments can be adjusted manually in the

program, and the result stored.





Viewing and Editing Alignments

• Use a program like Jalview to actually

examine the MSA you are getting.

• For example, look to see if there are

sequences that do not align well with the

others.

– Find out what the sequences are and determine

what action needs to be taken.

• E.g. if they should be removed from the set to be

aligned.
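One way to spot sequences that do not align well is to compute, for each sequence, its average percent identity to all the others. The sketch below is only an illustration: the toy alignment, the function names, and the 30% cutoff are assumptions, not part of Jalview or ClustalW.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def mean_identity(msa):
    """Average percent identity of each sequence against all the others."""
    totals = {name: 0.0 for name in msa}
    for n1, n2 in combinations(msa, 2):
        pid = percent_identity(msa[n1], msa[n2])
        totals[n1] += pid
        totals[n2] += pid
    return {name: total / (len(msa) - 1) for name, total in totals.items()}

def flag_outliers(msa, threshold=30.0):
    """Names of sequences whose mean identity is below the (assumed) threshold."""
    return [name for name, m in mean_identity(msa).items() if m < threshold]

# Hypothetical toy alignment: three related sequences and one unrelated one.
toy_msa = {
    "seq1": "MKV-LSTA",
    "seq2": "MKVALSTA",
    "seq3": "MRV-LSTA",
    "odd1": "WWPG-HHC",
}
print(flag_outliers(toy_msa))
```

A flagged sequence is only a candidate for removal; it still needs to be examined before deciding what to do with it.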



Viewing and Editing Alignments

• Even the best MSA program may not get

the alignment exactly right.

• When you examine the result, look for

places where it is wrong.

• You can use a program like Jalview to

manually adjust the alignment.



Viewing and Editing Alignments

• There is nothing wrong with manually

adjusting an alignment.

– No MSA program is infallible.

– So long as you have a reasonable justification,

it is acceptable to change the MSA output

manually.



MSA Editors and Formatters

• There are many programs in addition to

Jalview for editing and formatting MSAs.



TCoffee


Tcoffee

• Tcoffee is another freely available MSA

program.

• In general, it

– Builds better alignments than ClustalW, but

– Runs more slowly.



Tcoffee

• Tcoffee is a progressive alignment program,

roughly similar to ClustalW.

• However, it has some important differences:

– In addition to using pairwise alignment

information, Tcoffee uses information about the

common segments in multiple sequences.

– It uses sequence identity, not similarity, in its

calculations.



Tcoffee

• Tcoffee has some nice features:

– Like ClustalW, it can use structural information

to bias the alignment.

• In this case by using 3D structure (PDB) files.

– It can read in existing alignments,

• Evaluate their quality,

• Show where different alignments agree, and

• Combine them into a single, new one.





PILEUP

• PILEUP is another progressive alignment

program.

• It is a part of the GCG suite of

bioinformatics programs.

• It is becoming less widely used as more

researchers use programs available over the

web or freely downloadable to run on their

own machines.



Doing MSA in Practice


Use Multiple MSA Programs

• It is worthwhile to do analyses:

– with multiple MSA algorithms, and

– using multiple runs, each using different

parameters.

• For example, with ClustalW and Tcoffee.

• This maximizes the chances that you will

find good alignments.
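When you have alignments of the same sequences from two programs or two parameter settings, one simple way to compare them is to ask which residue pairs both alignments place in the same column. The sketch below assumes each alignment is a dictionary of equal-length gapped strings; the agreement measure (shared pairs divided by all pairs) is an illustrative choice, not the measure any particular program reports.

```python
def aligned_pairs(msa):
    """Set of residue pairs that the alignment places in the same column.

    Each residue is identified as (sequence name, index in the ungapped sequence).
    """
    names = sorted(msa)
    length = len(msa[names[0]])
    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, next_index[name]))
                next_index[name] += 1
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs

def agreement(msa_a, msa_b):
    """Fraction of aligned residue pairs shared by the two alignments."""
    pa, pb = aligned_pairs(msa_a), aligned_pairs(msa_b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 1.0

# Hypothetical results of two runs on the same two sequences.
run_a = {"s1": "ACG-T", "s2": "ACGAT"}
run_b = {"s1": "ACGT-", "s2": "ACGAT"}
print(agreement(run_a, run_b))
```

Columns where the two runs disagree are the places worth inspecting by eye in an alignment viewer.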



Problems with Progressive

Alignment

• One author (Mount) contends that progressive

algorithms are good for efficiently aligning more

closely related sequences.

– When sequence identity is > ~50%

• For more distantly related sequences (~25%), he

proposes using alternative methods:

– Bayesian algorithms, or

– Hidden Markov Models (HMMs; we discuss these later).



Iterative MSA Methods


Iterative Methods of MSA

• Other methods have been devised in order

to get around the order sensitivity weakness

of progressive alignments.

• These iterative methods repeatedly

– realign subgroups of the sequences,

then

– Align these subgroups into an overall

alignment.



Iterative Methods of MSA

• These methods include:

– MultiAlin,

– PRRP, and

– DIALIGN and CHAOS

• It may be worthwhile using these as an

– alternative, or

– sanity check

for ClustalW or Tcoffee results.



Genetic Algorithms


Genetic Algorithms

• Genetic Algorithms (GAs) are a type of machine learning algorithm from computer science, used for problem solving and parameter learning.

– They are a way of trying large numbers of different combinations of gaps and alignments.

– Despite the name, they are not a model of real genetics or evolution; they only borrow terms such as mutation and selection.



Genetic Algorithms

• Several different GA-based methods have

been used to do MSA, among them:

– SAGA,

– Zhang and Wong’s method.



Genetic Algorithms

• MSASA is a related method.

• It uses simulated annealing, not a GA, but has

roughly the same effect.



Genetic Algorithms

• The GA-based methods generate and

evaluate a large number of alignments.

• This helps increase the chances of finding

good MSAs.

• However, this also makes GA-based

methods very slow, compared to the

methods of MSA we’ve been discussing.
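To give a feel for the "evaluate" half of generate-and-evaluate, here is a minimal sketch that scores candidate alignments with a sum-of-pairs objective and keeps the best one. The objective and the match, mismatch, and gap values are assumptions chosen for illustration; programs such as SAGA define their own objective functions.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment by summing a simple score over every pair in every column."""
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # a gap aligned with a gap is not scored
            if x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score

# Two hypothetical candidate alignments of the same three sequences.
candidates = [
    ["ACGT", "A-GT", "ACGT"],
    ["ACGT", "AG-T", "ACGT"],
]
best = max(candidates, key=sum_of_pairs)
print(best, sum_of_pairs(best))
```

A GA would generate many such candidates by mutating and recombining earlier ones, which is why it needs so many evaluations and runs slowly.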



Hidden Markov Models


Hidden Markov Models

• Hidden Markov Models (HMMs) are statistical models used in machine learning.

• We’ll discuss these in more detail later.



Other Methods


Other Methods for MSA

• There are many algorithms to do MSA.

• Many of them are variations on the ideas we have already seen (e.g. Tcoffee is similar to ClustalW).

• Some are novel, such as the methods of Vingron and Argos and of Boguski et al., which create pairwise alignment dot matrices and use those for MSA.



Other Methods for MSA

• The POA (Partial Order Alignment) algorithm uses graphs to represent the MSA.

• It stores information about all of the

sequences used to create the alignment.

• Because of this, it does not suffer from the

loss of information that can affect

progressive alignment algorithms.



Comparing MSA Algorithms


MSA Algorithms Compared

• Comparing MSA algorithms is a source of

lively debate.

• While results differ,

– Tcoffee is considered very good for sequences

with high similarity.

– POA is considered good for sequences with low

similarity.



MSA Algorithms Compared

• Of course, it makes sense to try several different methods, and to use a viewer/editor like Jalview to evaluate and compare the results you get.



Localized Methods


Localized Alignments

• The MSA methods we have seen so far give

global alignments.

• Often, however, we are interested in only

highly conserved portions of the alignments.



Localized Alignments

• Local regions of interest can be identified in

different ways:

– Identifying the highly conserved portions of an

MSA, and characterizing them using profiles,

– Finding gapless regions of an alignment

(blocks), and

– Analyzing sequences using statistical or pattern

matching methods.



Profile Analysis


Profile Analysis

• Typical steps for a profile analysis:

1. A profile is created by first doing an MSA,

and then extracting the highly conserved

regions.

2. This creates a smaller MSA containing only

these regions.

3. A scoring matrix on this second MSA is

called a profile.
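Steps 1 and 2 can be sketched as follows, assuming that "highly conserved" means the most common residue fills at least a chosen fraction of a column. The 0.9 cutoff, the toy alignment, and the function names are illustrative assumptions, not what ProfileMake or any other program does.

```python
from collections import Counter

def column_conservation(column):
    """Fraction of the column occupied by its most common residue (gaps count against it)."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    return Counter(residues).most_common(1)[0][1] / len(column)

def conserved_columns(msa, min_fraction=0.9):
    """Indices of MSA columns whose conservation reaches the cutoff."""
    return [i for i, col in enumerate(zip(*msa))
            if column_conservation(col) >= min_fraction]

def extract_columns(msa, columns):
    """Build the smaller MSA that contains only the chosen columns."""
    return ["".join(seq[i] for i in columns) for seq in msa]

# Hypothetical toy MSA.
toy_msa = ["MKVLSA", "MKILSA", "MRVL-A", "MKVLTA"]
cols = conserved_columns(toy_msa)
print(cols, extract_columns(toy_msa, cols))
```

The scoring matrix of step 3 would then be computed on the smaller alignment.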



Profile Analysis

• One program to do profile analysis is

ProfileMake.



Profile Analysis

[Figure: profile table. Each row is a profile position (an MSA column). The columns are Cons, the amino acids A C D E F G H I K L M N P Q R S T V W Y, Unk, GOP and GEP.]


Profile Analysis

[Figure: profile table with columns Cons, A through Y, Unk, GOP and GEP. Annotation: the Cons column holds the consensus (most frequent) symbol at each profile position.]


Profile Analysis

[Figure: profile table with columns Cons, A through Y, Unk, GOP and GEP. Annotation: the amino acid columns hold the log odds score for each amino acid at each profile position.]


Profile Analysis

[Figure: profile table with columns Cons, A through Y, Unk, GOP and GEP. Annotation: the Unk column holds the log odds score for an unknown amino acid at each profile position.]


Profile Analysis

[Figure: profile table with columns Cons, A through Y, Unk, GOP and GEP. Annotation: the GOP and GEP columns hold the affine gap penalty parameters at each profile position.]


Profile Analysis

• Two major methods of creating profiles are

used:

– The average method, and

– The evolutionary method.



The Average method

• For a given MSA column, first find the

proportions of occurrence of each amino

acid.



The Average method

• Then, for each amino acid column in the profile row for that MSA column (each MSA column becomes a profile row), weight the substitution (log odds) scores for that amino acid by the proportions found in the MSA column.

\[ \mathrm{Score}_A = S_{A \to A} X_A + S_{A \to C} X_C + S_{A \to D} X_D + \dots + S_{A \to W} X_W + S_{A \to Y} X_Y \]

where $S_{W \to Y}$ is the substitution score from W to Y in the matrix used, and $X_L$ is the proportion of occurrence of amino acid L in the MSA column.
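The average method for one MSA column can be sketched in a few lines. To keep the example small it uses a hypothetical three-letter alphabet and made-up substitution scores (TOY_MATRIX); a real profile would use all 20 amino acids and a matrix such as PAM or BLOSUM.

```python
from collections import Counter

# Hypothetical substitution scores for a toy alphabet (not a real matrix).
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "C"): 0, ("A", "D"): -2,
    ("C", "C"): 9, ("C", "D"): -3,
    ("D", "D"): 6,
}

def sub_score(a, b):
    """Look up the symmetric substitution score for a pair of residues."""
    return TOY_MATRIX[(a, b)] if (a, b) in TOY_MATRIX else TOY_MATRIX[(b, a)]

def column_proportions(column, alphabet):
    """Proportion of each residue in one MSA column, ignoring gaps."""
    residues = [c for c in column if c in alphabet]
    if not residues:
        raise ValueError("column contains no residues")
    counts = Counter(residues)
    return {a: counts[a] / len(residues) for a in alphabet}

def average_profile_row(column, alphabet="ACD"):
    """One profile row: each score is the substitution scores weighted by the proportions."""
    x = column_proportions(column, alphabet)
    return {a: sum(sub_score(a, b) * x[b] for b in alphabet) for a in alphabet}

row = average_profile_row("AACD")
print(row)
```

For the column "AACD" the proportions are A = 0.5, C = 0.25 and D = 0.25, so the score for A is 4(0.5) + 0(0.25) + (-2)(0.25) = 1.5.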



Profile Analysis

[Figure: profile table with columns Cons, A through Y, Unk, GOP and GEP.]


The Evolutionary Method

• As the name implies, the evolutionary method

uses a model of evolution to create the profile.

• Each position is considered to be evolving at an

independent rate.

– A debatable assumption, one made for simplicity.

• Scores are calculated similarly to the average

method, but are weighted by the “amount” of

evolution calculated to have taken place to get the

substitutions seen in the MSA.



Profile Analysis

• The profile can now be used to evaluate

other sequences to see if they contain the

same pattern as the profile.

• This can be done as

– Database searches (e.g. the ProfileSearch

program),

– Aligning sequences using the profile as a

scoring matrix (e.g. the Profilegap program).



Profile Analysis

• If a protein family contains several profiles, the more of those profiles a putative member sequence matches, the more likely it is to be a member of the family.



Profile Analysis

• In general, evolutionary methods are

considered to create better profiles than

average methods.



Profile Analysis

• When doing a profile analysis, it is

important that the original construction is

done using a good dataset.

• The sequences used should reflect the full

– Range,

and

– Distribution

of the phenomena the profile should model.



Profile Analysis

• Range in a dataset means that the sequences

contain the full set of amino acids needed to make

the profile complete.

– That is, no substitutable amino acids are missing.

• Distribution in a dataset means that the sequences

contain amino acids in their true proportions.

– That is, the sequences do not over- or under-represent

any of the amino acids.



Profile Analysis

• Otherwise, the algorithm will create a poor

profile from the data in the set of sequences.

• Not having good, representative data can mean the resulting profile will

– Have incorrect proportions of allowed amino acids,

and/or

– Be missing some allowed symbols altogether.



Profile Analysis

• Depending on its use, a poor profile can

– Misrepresent the phenomena you are trying to discover or represent,

and/or

– Erroneously match or discard new sequences that are compared to the profile.



Block Analysis


Block Analysis

• Blocks are similar to profiles, but do not

contain gaps.

• The most popular programs to extract

blocks are

– BLOCKS, by Henikoff and Henikoff, and

– eMOTIFs, by Nevill-Manning et al.



Extraction of Blocks

• Both BLOCKS and eMOTIFs extract

blocks from global MSAs.
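As a rough illustration of what "gapless region" means, the sketch below scans a global MSA for runs of columns in which no sequence has a gap. This is a deliberate simplification with an assumed minimum width; it is not the procedure BLOCKS or eMOTIFs actually use.

```python
def gapless_blocks(msa, min_width=3):
    """Return (start, end) column ranges, end exclusive, where no sequence has a gap."""
    blocks = []
    start = None
    length = len(msa[0])
    for i in range(length + 1):
        # Treat the position just past the last column as a gap, to close any open run.
        gap_free = i < length and all(seq[i] != "-" for seq in msa)
        if gap_free and start is None:
            start = i
        elif not gap_free and start is not None:
            if i - start >= min_width:
                blocks.append((start, i))
            start = None
    return blocks

# Hypothetical toy MSA.
toy_msa = ["MKV-LSTAG-", "MKVALSTA-W", "MRV-LSTAGW"]
for start, end in gapless_blocks(toy_msa):
    print(start, end, [seq[start:end] for seq in toy_msa])
```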



Position-specific Scoring

Matrices


Position-Specific Scoring

Matrices

• A position-specific scoring matrix (PSSM)

is a way of characterizing a block.

• For each position in the block, the PSSM

indicates how likely each amino acid is to

occur.

– This is frequently done by

\[ \log \left( X_a / X_e \right) \]

where $X_a$ is the actual count of amino acid X, and $X_e$ is the expected count of amino acid X.
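A minimal sketch of building a PSSM from a block with this log ratio follows. Several details are assumptions made for the example: a uniform background frequency for the expected counts, a pseudocount so that unseen amino acids do not give log(0), and base-2 logarithms.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(block, background=None, pseudocount=1.0):
    """Return one {amino acid: log2(actual / expected)} dictionary per block position."""
    if background is None:
        # Assumption: every amino acid is equally likely in the background.
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    n = len(block)
    pssm = []
    for column in zip(*block):
        counts = Counter(column)
        row = {}
        for a in AMINO_ACIDS:
            actual = counts[a] + pseudocount * background[a]
            expected = (n + pseudocount) * background[a]
            row[a] = math.log2(actual / expected)
        pssm.append(row)
    return pssm

# Hypothetical gapless block of four sequences, three positions wide.
toy_block = ["GPK", "GPR", "GAK", "GPK"]
pssm = build_pssm(toy_block)
print(round(pssm[0]["G"], 2), round(pssm[0]["A"], 2))
```

Positive scores mark amino acids seen more often than expected at a position; negative scores mark those seen less often.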



Position-Specific Scoring

Matrices

• PSSMs can be used to see if a sequence has

a likely match for the block represented by

the PSSM.
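Searching a sequence for a match can be sketched as sliding the PSSM along the sequence and summing the position scores in each window. The three-position PSSM, its scores, and the default score for unlisted residues are all hypothetical values chosen for the example.

```python
# Hypothetical 3-position PSSM; residues not listed at a position score DEFAULT.
DEFAULT = -2.0
TOY_PSSM = [
    {"G": 4.0},
    {"P": 3.0, "A": 1.0},
    {"K": 3.0, "R": 2.0},
]

def window_score(pssm, window):
    """Sum the position-specific scores for one window of the sequence."""
    return sum(row.get(res, DEFAULT) for row, res in zip(pssm, window))

def best_match(pssm, sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    scored = [(window_score(pssm, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    score, start = max(scored)
    return start, score

print(best_match(TOY_PSSM, "MSTGPKLLA"))
```

Whether the best score counts as a likely match depends on a cutoff, which in practice has to be calibrated against sequences known not to contain the block.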



Statistical Methods


Statistical Methods for Aiding

Alignment

• When patterns are not as well conserved as

those that can be picked up by the previous

methods, statistical methods can be used.

• The general idea is that initial patterns are

chosen, and then refined by the statistics.

• They produce a scoring matrix that can be

used to identify other sequences with the

same pattern.



The Expectation Maximization

Algorithm

• In the Expectation Maximization (EM)

algorithm, the program alternates between

two steps

– Expectation (E) step - locate a pattern in the

sequences using the current alignment, and

– Maximization (M) step - refine the alignment to

better match the pattern.



The Expectation Maximization

Algorithm

• The EM algorithm makes an initial guess

about an alignment.

• It then alternates E and M steps, evaluating the score of the alignment after each round, until the score stops improving.
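The loop above can be sketched for a simplified setting: each sequence is assumed to contain exactly one ungapped occurrence of a fixed-width pattern, and the background is assumed uniform. Here the E step estimates where the pattern sits in each sequence, and the M step re-estimates the pattern from those estimates. The seeding strategy (trying every word of the first sequence as the initial guess), the pseudocount, and all names are illustrative assumptions, not MEME's actual implementation.

```python
import math

def run_em(seqs, seed, alphabet, max_iter=100, tol=1e-6, pseudo=0.1):
    """Refine a motif model starting from one seed word; returns (starts, pwm, score)."""
    width = len(seed)
    bg = 1.0 / len(alphabet)  # assumed uniform background
    other = 0.3 / (len(alphabet) - 1)
    # Initial guess: a position weight matrix that favours the seed word.
    pwm = [{a: (0.7 if a == ch else other) for a in alphabet} for ch in seed]
    prev_score = -math.inf
    for _ in range(max_iter):
        # E step: probability of each start position in each sequence.
        weights, score = [], 0.0
        for s in seqs:
            likes = [math.prod(pwm[k][s[i + k]] / bg for k in range(width))
                     for i in range(len(s) - width + 1)]
            z = sum(likes)
            score += math.log(z)
            weights.append([x / z for x in likes])
        if score - prev_score < tol:  # stop when the score stops improving
            break
        prev_score = score
        # M step: rebuild the matrix from the weighted residue counts.
        counts = [{a: pseudo for a in alphabet} for _ in range(width)]
        for s, w in zip(seqs, weights):
            for i, wt in enumerate(w):
                for k in range(width):
                    counts[k][s[i + k]] += wt
        pwm = [{a: c / sum(col.values()) for a, c in col.items()} for col in counts]
    starts = [max(range(len(w)), key=w.__getitem__) for w in weights]
    return starts, pwm, score

def em_motif(seqs, width):
    """Try each word of the first sequence as a seed and keep the best-scoring run."""
    alphabet = sorted(set("".join(seqs)))
    first = seqs[0]
    seeds = sorted({first[i:i + width] for i in range(len(first) - width + 1)})
    return max((run_em(seqs, seed, alphabet) for seed in seeds), key=lambda r: r[2])

# Hypothetical toy sequences, each containing the word TACGT.
toy_seqs = ["GGTACGTTAG", "CCATACGTCC", "TACGTGGAGA"]
starts, pwm, score = em_motif(toy_seqs, 5)
print(starts)
```

Trying several initial guesses matters because EM only climbs to the nearest local optimum from wherever it starts.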



MEME

• The Multiple EM for Motif Elicitation (MEME) program is an example of a program that uses the EM algorithm.

• It is billed as a “motif discovery tool.”

• It is available as a web server at

http://meme.sdsc.edu/meme/meme-intro.html





The Gibbs Sampler

• The Gibbs Sampler algorithm is similar in overall approach to EM.

– Iteratively make and refine alignments.

• But the details are different.

• Gibbs sampler approaches were developed by Chip Lawrence, formerly of the NYS Department of Health Wadsworth Labs.
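One of the differing details is that a Gibbs sampler picks motif positions at random, in proportion to how well they fit, rather than updating all positions deterministically. The sketch below is a simplified site sampler under stated assumptions: one motif occurrence per sequence, a fixed width, a uniform background, and a fixed number of sweeps. The parameter values and names are illustrative only.

```python
import random

def gibbs_motif(seqs, width, n_iter=200, pseudo=0.5, seed=0):
    """Sample one motif start position per sequence; returns the final positions."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    bg = 1.0 / len(alphabet)  # assumed uniform background
    # Initial guess: a random start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        for held in range(len(seqs)):
            # Build residue counts from every sequence except the held-out one.
            counts = [{a: pseudo for a in alphabet} for _ in range(width)]
            for j, (other_seq, st) in enumerate(zip(seqs, starts)):
                if j != held:
                    for k in range(width):
                        counts[k][other_seq[st + k]] += 1
            totals = [sum(col.values()) for col in counts]
            # Weight each candidate start in the held-out sequence by its fit.
            target = seqs[held]
            weights = []
            for i in range(len(target) - width + 1):
                w = 1.0
                for k in range(width):
                    w *= (counts[k][target[i + k]] / totals[k]) / bg
                weights.append(w)
            # Sample the new start in proportion to the weights.
            starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical toy sequences, each containing the word TACGT.
toy_seqs = ["GGTACGTTAG", "CCATACGTCC", "TACGTGGAGA"]
positions = gibbs_motif(toy_seqs, 5)
print(positions)
```

Because the positions are sampled, different random seeds can give different answers; in practice the sampler is run several times and the results are compared.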



Hidden Markov Models

• Two other MSA methods are based on

machine learning/statistical techniques

called Hidden Markov Models (HMMs).

• The two are

– HMMER (Eddy, 98), and

– SAM (Krogh et al., 94).

• If there are at least 20 sequences to be

aligned, HMMs are worth using.



[Figure from Krogh et al. 1994]


Motif-based HMMs

• The program Meta-MEME uses HMMER

to find conserved sequence domains (motifs)

in a set of proteins.



Sequence Logos


Sequence Logos

• A sequence logo is a graphic way of

showing the relative occurrence of amino

acids at each position in a block.
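In the common logo convention, the total height of a position is its information content in bits, and each letter's height is its share of that total. The sketch below assumes that convention, a 20-letter alphabet, and no small-sample correction; the toy block and function names are hypothetical.

```python
import math
from collections import Counter

def logo_heights(block, alphabet_size=20):
    """Per position, return {residue: letter height in bits}."""
    heights = []
    for column in zip(*block):
        residues = [c for c in column if c != "-"]
        if not residues:
            heights.append({})
            continue
        total = len(residues)
        freqs = {a: n / total for a, n in Counter(residues).items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        information = math.log2(alphabet_size) - entropy
        # Each letter gets a share of the column's information equal to its frequency.
        heights.append({a: f * information for a, f in freqs.items()})
    return heights

# Hypothetical gapless block of four sequences, three positions wide.
toy_block = ["GPK", "GPR", "GAK", "GPK"]
for position, letters in enumerate(logo_heights(toy_block)):
    print(position, {a: round(h, 2) for a, h in letters.items()})
```

A fully conserved position reaches the maximum height of log2(20), about 4.32 bits, while a position where every amino acid is equally common has height zero.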


