21.05.2015 Views

EMBnet course Proteomics using Bioinformatics tools


EMBnet course<br />

Proteomics using<br />

Bioinformatics tools<br />

MS identification tools<br />

Patricia M. Palagi<br />

PIG, SIB, Geneva<br />

PMP


The data: list of m/z values<br />

MS: peptide mass values and intensities<br />

840.6950 13.75<br />
1676.9606 26.1<br />
1498.8283 128.9<br />
1045.564 845.2<br />
2171.9670 2.56<br />
861.1073 371.2<br />
842.51458 53.7<br />
1456.7274 12.9<br />
863.268365 3.1<br />

MS/MS: parent mass value and charge<br />

1163.7008 2<br />

MS/MS: fragment mass values and fragment intensities<br />

86.1105 220.1429<br />
86.1738 13.7619<br />
102.0752 4.3810<br />
147.1329 57.3333<br />
185.1851 649.0953<br />
185.3589 5.3810<br />
186.1876 81.4286<br />
213.0791 1.4286<br />



The tools


One direct access to all: ExPASy<br />

http://www.expasy.org/tools/<br />



Automatic protein identification<br />

- Peptide mass fingerprinting – PMF<br />

- MS/MS sequence search<br />

- MS/MS spectra library search<br />

- MS/MS prospective analysis (tag, open mod, de novo sequencing)<br />



Peptide mass fingerprinting = PMF<br />

MS database matching<br />

Experimental side: Protein(s) → enzymatic digestion → peptides → mass spectra → peaklist<br />

Peaklist: 840.695086, 1676.96063, 1498.8283, 1045.564, 2171.967066, 861.107346, 842.51458, 1456.727405, 863.268365<br />

Database side: sequence database entry → in-silico digestion → theoretical proteolytic peptides → theoretical peaklist<br />

Sequence database entry:<br />
…MAIILAGGHSVRFGPKAF<br />
AEVNGETFYSRVITLESTNM<br />
FNEIIISTNAQLATQFKYPN<br />
VVIDDENHNDKGPLAGIYTI<br />
MKQHPEEELFFVVSVDTPM<br />
ITGKAVSTLYQFLV …<br />

Theoretical proteolytic peptides:<br />
- MAIILAGGHSVR<br />
- FGPK<br />
- AFAEVNGETFYSR<br />
- VITLESTNMFNEIIISTNAQLATQFK<br />
- YPNVVIDDENHNDK<br />
…<br />

Theoretical peaklist: 861.107346, 838.695086, 1676.96063, 1498.8283, 1045.564, 2171.967066, 842.51458, 1457.827405, 863.268453<br />

Match: the peaklist is compared with the theoretical peaklist<br />

Result: ranked list of protein candidates<br />
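The in-silico half of this workflow can be sketched in a few lines of Python. This is only an illustration, not the implementation of any tool named in this course. It assumes the usual trypsin rule (cleave after K or R, but not before P), monoisotopic residue masses, and singly protonated [M+H]+ peptides. The function names `tryptic_digest` and `mh_mass` are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of the charge-carrying proton


def tryptic_digest(sequence, missed_cleavages=0):
    """Return tryptic peptides, allowing up to `missed_cleavages` skipped sites."""
    # Cleavage positions: after K or R unless the next residue is P.
    sites = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    peptides = []
    for start in range(len(sites) - 1):
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end < len(sites):
                peptides.append(sequence[sites[start]:sites[end]])
    return peptides


def mh_mass(peptide):
    """Monoisotopic [M+H]+ mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


if __name__ == "__main__":
    for pep in tryptic_digest("MAIILAGGHSVRFGPKAFAEVNGETFYSR"):
        print(pep, round(mh_mass(pep), 4))
```

Setting `missed_cleavages=1` additionally yields peptides that span one uncut site, which is how a search tool can account for incomplete digestion.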



Peptide mass fingerprinting<br />

What you have:<br />

- Set of peptide mass values<br />

- Information about the protein: molecular weight, pI, species.<br />

- Information about the experimental conditions: mass spectrometer<br />

precision, calibration used, possibility of missed-cleavages, possible<br />

modifications<br />

- Biological characteristics: post-translational modifications, fragments<br />

What the tool will do:<br />

- Match this information against a protein sequence database<br />

What you will get:<br />

- a list of probable identified proteins<br />



What is the expected information in a<br />

submission form?<br />

• Place to upload a spectrum (many spectra)<br />

• Description of the sample process used<br />

– Chemical process such as alkylation/reduction,<br />

– Cleavage properties (enzyme),<br />

– Mass tolerance (m/z tolerance)<br />

• Search space<br />

– Sequence databank,<br />

– taxonomy restriction<br />

– Mw, pI restriction<br />

• Scoring criteria and filters<br />



One example of parameter effects on the search<br />

• Accepted mass tolerance<br />

‣ needed because of imprecise measurements and calibration problems<br />

Source: Introduction to proteomics: tools for the new biology. Daniel C. Liebler. Humana Press. 2002
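To make the effect of the mass tolerance concrete, here is a small sketch that matches experimental peaks to theoretical peaks within a relative tolerance expressed in ppm. The function names are hypothetical, the four mass values are taken from the PMF peaklists shown earlier in this course, and the tolerance values are arbitrary examples.

```python
def within_tolerance(experimental, theoretical, tol_ppm):
    """True if the two masses differ by at most tol_ppm parts per million."""
    return abs(experimental - theoretical) <= theoretical * tol_ppm / 1e6


def match_peaks(experimental_peaks, theoretical_peaks, tol_ppm):
    """Return all (experimental, theoretical) pairs that agree within tol_ppm."""
    matches = []
    for e in experimental_peaks:
        for t in theoretical_peaks:
            if within_tolerance(e, t, tol_ppm):
                matches.append((e, t))
    return matches


if __name__ == "__main__":
    experimental = [863.268365, 1456.727405]
    theoretical = [863.268453, 1457.827405]
    # A tight tolerance keeps only the close pair.
    print(match_peaks(experimental, theoretical, tol_ppm=100))
    # A loose tolerance also accepts the pair that is 1.1 Da apart.
    print(match_peaks(experimental, theoretical, tol_ppm=1000))
```

A tolerance that is too tight misses true matches from a poorly calibrated instrument, while a tolerance that is too loose lets in random matches.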


Summary of PMF tools (non-exhaustive list)<br />

Tool | Source website<br />
Aldente | www.expasy.org/cgi-bin/aldente<br />
Mascot | www.matrixscience.com/<br />
MS-Fit | prospector.ucsf.edu/<br />
ProFound | prowl.rockefeller.edu/profound_bin/WebProFound.exe<br />
PepMAPPER | wolf.bms.umist.ac.uk/mapper/<br />
PeptideSearch | www.mann.embl-heidelberg.de/GroupPages/PageLink/peptidesearchpage.html<br />
PepFrag | prowl.rockefeller.edu/prowl/pepfragch.html<br />


Scoring systems<br />

• Essential for the identification! A score gives a confidence value to each matched protein<br />

• Three types of scores<br />

– Shared peaks count (SPC): simply counts the number of matched mass values (peaks)<br />

– Probabilistic scores: the confidence value depends on probabilistic models or statistical knowledge used during the match (obtained from the databases)<br />

– Statistical learning: knowledge extracted from the influence of different properties used to match the proteins (obtained from the databases)<br />
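The simplest of these scores, the shared peaks count, can be sketched as follows. The absolute tolerance of 0.2 Da, the candidate names, and the function names are assumptions made for the illustration. The two peaklists are the ones from the PMF example earlier in this course.

```python
def shared_peaks_count(experimental, theoretical, tol_da=0.2):
    """Count theoretical masses that have an experimental peak within tol_da."""
    matched = 0
    for t in theoretical:
        if any(abs(e - t) <= tol_da for e in experimental):
            matched += 1
    return matched


def rank_candidates(experimental, candidates, tol_da=0.2):
    """Rank candidate proteins (name -> theoretical peaklist) by decreasing SPC."""
    scores = {
        name: shared_peaks_count(experimental, peaks, tol_da)
        for name, peaks in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    experimental = [840.695086, 1676.96063, 1498.8283, 1045.564, 2171.967066,
                    861.107346, 842.51458, 1456.727405, 863.268365]
    candidates = {
        "candidate_A": [861.107346, 838.695086, 1676.96063, 1498.8283, 1045.564,
                        2171.967066, 842.51458, 1457.827405, 863.268453],
        "candidate_B": [500.0, 1045.564],  # hypothetical second entry
    }
    print(rank_candidates(experimental, candidates))
```

A plain count like this favours long proteins with many theoretical peptides, which is one reason probabilistic scores are used in practice.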



Mascot<br />

http://www.matrixscience.com/<br />

• Free online version at the above website (commercial versions are also available)<br />

• Choice of several databases.<br />

• Considers multiple chemical modifications.<br />

• 0 to 9 missed cleavages.<br />

• Score based on a combination of probabilistic and statistical approaches (based on the MOWSE score).<br />

• Considers Swiss-Prot annotations for splice variants (in locally installed versions).<br />



Mascot - principles<br />

• Probability-based scoring<br />

• Computes the probability P that a match is<br />

random<br />

• Significance threshold p < 0.05 (accepting that the probability of the observed event occurring by chance is less than 5%)<br />

• The significance of that result depends on the size of the database being searched.<br />

• Mascot shades the insignificant hits in green<br />

• Score: $-10 \log_{10}(P)$<br />
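The score formula is easy to evaluate by hand, and the short sketch below does so. The first function is a direct transcription of the formula on this slide. The second function is an assumption added for illustration: it divides the 0.05 level by the number of candidates considered, which is one way a threshold can be made to depend on the database size. Both function names are hypothetical.

```python
import math


def probability_score(p):
    """Score = -10 * log10(P), where P is the probability that the match is random."""
    return -10.0 * math.log10(p)


def score_threshold(n_candidates, alpha=0.05):
    """Assumed database-size correction: score needed so that alpha holds over n candidates."""
    return -10.0 * math.log10(alpha / n_candidates)


if __name__ == "__main__":
    print(round(probability_score(0.05), 2))   # score of a single match at P = 0.05
    print(round(probability_score(1e-5), 2))   # a much less likely random match
    print(round(score_threshold(500000), 2))   # threshold for 500000 candidates
```

Because of the logarithm, each tenfold decrease in P adds 10 to the score, so a larger database raises the score a hit needs to be significant.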



Mascot - Input<br />

[Screenshot: Mascot input form]

Mascot - Output<br />

[Screenshot: decoy option and hints about the significance of the score]

Mascot - Output<br />

[Screenshot: sequence coverage, peptides matched, and error function]


Aldente<br />

• Swiss-Prot/TrEMBL database, indexed masses (trypsin and many others).<br />

• Considers chemical modifications and user-specified modifications.<br />

• Considers biological modifications (Swiss-Prot annotations).<br />

• 0 or 1 missed cleavages.<br />

• Uses a robust alignment method (Hough transform):<br />

– Determines the deviation function of the spectrometer<br />

– Resolves ambiguities<br />

– Less sensitive to noise<br />



Aldente – summary<br />

[Figure: experimental masses / peaks aligned with theoretical masses / peptides, showing the spectrometer calibration error and the spectrometer internal error]

• The Hough transform estimates, from the experimental data, the deviation function of the mass spectrometer (the calibration error function).<br />

• The program optimizes the set of best matches, excluding noise and outliers, to find the best alignment.<br />
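The voting idea behind a Hough transform can be illustrated with a deliberately simplified, one-parameter sketch. It is not Aldente's algorithm, whose details are not given here. It assumes that the calibration error is a constant relative offset in ppm, lets every plausible experimental/theoretical pair vote for the offset it implies, and keeps the most voted bin. The function name, bin width, and search window are all hypothetical.

```python
from collections import Counter


def estimate_ppm_offset(experimental, theoretical, max_ppm=200.0, bin_ppm=10.0):
    """Estimate a constant calibration offset (ppm) by voting over candidate pairs."""
    votes = Counter()
    for e in experimental:
        for t in theoretical:
            error_ppm = (e - t) / t * 1e6
            if abs(error_ppm) <= max_ppm:
                # Each plausible pair votes for the bin containing its error.
                votes[round(error_ppm / bin_ppm)] += 1
    if not votes:
        return None
    best_bin, _ = votes.most_common(1)[0]
    return best_bin * bin_ppm


if __name__ == "__main__":
    theoretical = [840.5, 1045.564, 1498.8283, 1676.9606, 2171.967]
    # Simulated spectrum: every true peak is shifted by +50 ppm, plus two noise peaks.
    experimental = [t * (1 + 50e-6) for t in theoretical] + [900.123, 1300.777]
    print(estimate_ppm_offset(experimental, theoretical))
```

Noise peaks and outliers scatter their votes over many bins, whereas true matches agree on one bin, which is why a voting scheme of this kind is less sensitive to noise.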




Aldente - Input




Aldente - Output<br />

[Screenshot: hints about the significance of the score]


Information from Swiss-Prot annotation. Processed protein (signal peptide is cleaved).



BioGraph


What is the expected information<br />

in an identification result?<br />

• A summary of the search parameters<br />

• A list of potentially identified proteins (AC numbers) with scores and other evidence<br />

• A detailed list of potentially identified peptides (whether or not they are associated with the potentially identified proteins) with scores<br />

• Possibilities to validate/invalidate the provided results (info on<br />

the data processing, on the statistics, links to external<br />

resources, etc.)<br />

• Possibilities to export the (validated) data in various formats<br />



Protein characterization with PMF data<br />

1 protein entry does not represent 1 unique molecule:<br />

- Exact primary structure<br />

- Splicing variants<br />

- Sequence conflicts<br />

- PTMs<br />

Characterization tools at ExPASy using peptide mass fingerprinting data<br />

http://www.expasy.org/tools/<br />

Prediction tools:<br />

• FindMod: PTMs and AA substitutions<br />

• GlycoMod: oligosaccharide structures<br />

• FindPept: unspecific cleavages<br />



SWISS-PROT feature table:<br />

active protein is more than just translation of<br />

gene sequence (example: P20366)


Detection of PTMs in MS<br />

[Figure: two spectra on an m/z axis from 600 to 2200]

Unmodified tryptic masses: 624.3, 769.8, 893.4, 994.5, 1056.1, 1326.7, 1501.9, 1759.8, 1923.4, 2100.6<br />

Tryptic masses of a modified protein: 624.3, 769.8, 893.4, 994.5, 1070.1, 1326.7, 1501.9, 1759.8, 1923.4, 2100.6<br />

Δ m/z (here 1056.1 → 1070.1) => PTM<br />
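The comparison shown on this slide can be sketched as a search for peaks that have no counterpart in the other list, followed by a mass difference. The function name and the 0.05 Da tolerance are hypothetical, and the mass lists are the peak labels from the figure. Interpreting the resulting shift (a difference of about +14 Da would, for instance, be consistent with a methylation) is an assumption that goes beyond the slide.

```python
def find_mass_shifts(theoretical, observed, tol=0.05):
    """Pair unmatched theoretical and observed masses and report their differences."""
    unmatched_theoretical = [
        t for t in theoretical if not any(abs(t - o) <= tol for o in observed)
    ]
    unmatched_observed = [
        o for o in observed if not any(abs(t - o) <= tol for t in theoretical)
    ]
    # Every unmatched pair is a candidate (unmodified mass, modified mass, delta).
    return [
        (t, o, round(o - t, 4))
        for t in unmatched_theoretical
        for o in unmatched_observed
    ]


if __name__ == "__main__":
    unmodified = [624.3, 769.8, 893.4, 994.5, 1056.1,
                  1326.7, 1501.9, 1759.8, 1923.4, 2100.6]
    modified = [624.3, 769.8, 893.4, 994.5, 1070.1,
                1326.7, 1501.9, 1759.8, 1923.4, 2100.6]
    print(find_mass_shifts(unmodified, modified))
```

With several unmatched peaks the function returns every pairing, so the candidate shifts still have to be checked against known modification masses.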


FindMod http://www.expasy.org/tools/findmod/<br />

[Screenshot: input form with fields for AA modifications, DB entry, experimental options, and experimental masses]


FindMod Output<br />

• Unmodified peptides, modified peptides known in SWISS-PROT, and chemically modified peptides<br />

• Putatively modified peptides predicted by mass differences<br />

+ putative AA substitutions<br />


Modification rules can be defined from SWISS-PROT, PROSITE and the literature<br />

Some examples:<br />

modification | amino acid rule | exceptions<br />
farnesylation | Cys | -<br />
palmitoylation | Cys | Ser, Thr<br />
O-GlcNAc | Ser, Thr | Asn<br />
amidation | Xaa (C-term) where Gly followed Xaa |<br />
pyrrolidone carboxylic acid | Gln (N-term) | -<br />
phosphorylation | in eukaryotes: Ser, Thr, Asp, His, Tyr; in prokaryotes: Ser, Thr, Asp, His, Cys | -<br />
sulfation | in eukaryotes: Tyr, PROSITE PDOC00003 |<br />


FindMod Output - Application of Rules<br />

- potentially modified peptides that agree with rules are listed<br />

- amino acids that potentially carry modifications are shown<br />

- peptides potentially modified only by mass difference<br />

- predictions can be tested by MS-MS peptide fragmentation


FindPept<br />

http://www.expasy.org/tools/findpept.html<br />

• From MS (peptide mass fingerprint) data, detection of:<br />

– Peptides matching unspecific cleavage<br />

– Masses resulting from possible contaminants<br />

– Peptides matching specific cleavage (16 different enzymes)<br />

– Peptides resulting from protease autolysis<br />



FindPept




MS/MS based identification tools<br />

• Tag search: tools that search peptides based on an MS/MS sequence tag<br />

– MS-Tag and MS-Seq, PeptideSearch<br />

• Ion search or PFF: tools that match MS/MS experimental spectra with “theoretical spectra” obtained via in-silico fragmentation of peptides generated from a sequence database<br />

– Phenyx, Mascot, Sequest, X!Tandem, OMSSA, ProID, …<br />

• De novo sequencing: tools that directly interpret MS/MS spectra and try to deduce a sequence<br />

– Convolution/alignment (PEDANTA)<br />

– De novo sequencing followed by sequence matching (Peaks, Lutefisk, Sherenga, PeptideSearch)<br />

– Guided sequencing (Popitam)<br />

In all cases, the output is a peptide structure per MS/MS spectrum<br />



Peptide fragmentation fingerprinting = PFF = ion search<br />

MS/MS database matching<br />

Experimental side: Protein(s) → enzymatic digestion → peptides → MS/MS spectra of peptides → ions peaklists<br />

Ions peaklist: 340.695086, 676.96063, 498.8283, 545.564, 1171.967066, 261.107346, 342.51458, 456.727405, 363.268365<br />

Database side: sequence database entry → in-silico digestion → theoretical proteolytic peptides → in-silico fragmentation → theoretical fragmented peptides → theoretical peaklist<br />

Sequence database entry:<br />
…MAIILAGGHSVRFGPKAF<br />
AEVNGETFYSRVITLESTNM<br />
FNEIIISTNAQLATQFKYPN<br />
VVIDDENHNDKGPLAGIYTI<br />
MKQHPEEELFFVVSVDTPM<br />
ITGKAVSTLYQFLV …<br />

Theoretical proteolytic peptides:<br />
- MAIILAGGHSVR<br />
- FGPK<br />
- AFAEVNGETFYSR<br />
- VITLESTNMFNEIIISTNAQLATQFK<br />
- YPNVVIDDENHNDK<br />
…<br />

Theoretical fragmented peptides:<br />
- MAIILAG, MAIILA, MAIIL, MAII, MAI, MA, M, AIILAG<br />

Theoretical peaklist: 361.107346, 338.695086, 676.96063, 498.8283, 1045.564, 1171.967066, 342.51458, 457.827405, 263.268453<br />

Match: the ions peaklists are compared with the theoretical peaklist<br />

Result: ranked list of peptide and protein candidates


Ion-types<br />

[Table: ion types and their mass offsets for N-terminal (P' nterm) and C-terminal (P' cterm) fragments; the ion-type labels were not preserved. Offsets as listed: -28, -45, -46, 0, -17, -18, +17, +28, +2, -15, -16, -15]

It is very important to know the ionic series produced by a spectrometer; otherwise potential matches will be missed.<br />

On the other hand, if an ion type that is not present in the original spectrum is taken into account, it will contribute to false-positive matches.<br />

[N] is the mass of the N-term group<br />

[M] is the sum of the neutral amino acid residue masses<br />
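As an illustration of how ion series follow from residue masses, the sketch below computes singly charged b and y ions for an unmodified peptide, and derives a ions from b ions by the loss of CO (about 28 Da, which corresponds to the -28 offset listed above). It uses the standard monoisotopic conventions rather than the exact offset table of this slide, and the function name is hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728
CO = 27.99491  # carbon monoxide, lost when going from a b ion to an a ion


def fragment_ions(peptide):
    """Return singly charged a, b and y ion m/z lists for an unmodified peptide."""
    b_ions = []
    prefix = 0.0
    for aa in peptide[:-1]:            # N-terminal fragments b1 .. b(n-1)
        prefix += RESIDUE_MASS[aa]
        b_ions.append(prefix + PROTON)

    y_ions = []
    suffix = 0.0
    for aa in reversed(peptide[1:]):   # C-terminal fragments y1 .. y(n-1)
        suffix += RESIDUE_MASS[aa]
        y_ions.append(suffix + WATER + PROTON)

    a_ions = [b - CO for b in b_ions]
    return a_ions, b_ions, y_ions


if __name__ == "__main__":
    a, b, y = fragment_ions("FGPK")
    print("a:", [round(m, 4) for m in a])
    print("b:", [round(m, 4) for m in b])
    print("y:", [round(m, 4) for m in y])
```

Restricting the computed series to those the instrument actually produces is what keeps the theoretical spectrum from generating spurious matches.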



Some PFF tools (non-exhaustive list)<br />

Same principle as PMF, but using MS/MS spectra<br />

Software | Source website<br />
InsPecT | peptide.ucsd.edu/inspect.py<br />
Mascot | www.matrixscience.com/search_form_select.html<br />
MS-Tag and MS-Seq | prospector.ucsf.edu<br />
PepFrag | prowl.rockefeller.edu/prowl/pepfragch.html<br />
Phenyx | phenyx.vital-it.ch<br />
Popitam | www.expasy.org/tools/popitam<br />
ProID (download) | sashimi.sourceforge.net/software_mi.html<br />
Sequest* | fields.scripps.edu/sequest/index.html<br />
Sonar | 65.219.84.5/service/prowl/sonar.html<br />
SpectrumMill* | www.home.agilent.com<br />
VEMS | www.bio.aau.dk/en/biotechnology/vems.htm<br />
X!Tandem (download) | www.thegpm.org/TANDEM<br />

*Commercialized<br />





Phenyx


The Phenyx Web Interface: one result, multiple views<br />

[Screenshot labels: Submission, Desktop, Results views, Results comparison, Management console, Excel, xml and text exports]

P.A. Binz


The Proteins overview<br />

[Screenshot labels: list of identified proteins, protein group description, corresponding list of identified peptides]


The Proteins overview<br />

[Screenshot: hints about the significance of the score]

Better when high-intensity peaks are matched and ion series are extended, without too many or too large gaps


The scoring system in Phenyx<br />

• The score is the sum of up to 12 basic scores such as:<br />

– presence of a, b, y, y++, b-H2O…; co-occurrence of ion series (using HMMs), peak intensities, residue modifications (PTM or chemical), …<br />

• True probabilistic approach for each peptide match:<br />

$$\text{score} = \log \frac{\text{likelihood of being correct (search in a query database)}}{\text{likelihood of being random (search in a randomized set of peptides)}}$$

• Function of instruments and molecular types<br />

– Esquire 3000+, LCQ; iTRAQ vs. unmodified peptides<br />

• Scores are normalised into z-scores<br />



X!Tandem<br />


www.thegpm.org


X!Tandem - output<br />

[Screenshot with three numbered callouts (1, 2, 3)]


The two-round search<br />

Mascot, Phenyx and X!Tandem<br />

The identification process may be launched in two rounds<br />

• Each round is defined by a set of search criteria<br />

– The first round searches the selected database(s) with stringent parameters,<br />

– The second round searches only the proteins that passed the first round, with relaxed parameters:<br />

⇒ Accelerates the job when looking for many variable modifications or unspecific cleavages<br />

⇒ Appropriate when the first round defines stringent criteria to capture a protein ID, and the second round looks for looser peptide identifications<br />



Example: 2nd round<br />

1st round, only 3 fixed mods: 131 valid, 75% cov.<br />

2nd round, add variable mods: 205 valid, 84% cov.<br />

2nd round, with all mods and half cleaved: 348 valid, 90% cov.<br />



Sources of errors in assigning peptides<br />

• Scores not adapted<br />

• Parameters are too stringent or too loose<br />

• Low MS/MS spectrum quality (many noise peaks, low signal-to-noise ratio, missing fragment ions, contaminants)<br />

• Homologous proteins<br />

• Incorrectly assigned charge state<br />

• Pre-selection of the 2nd isotope (the parent mass is shifted by 1 Da; one solution is to widen the parent mass tolerance, but this may also drown the correct peptide among more candidates)<br />

• Novel peptide or variant<br />



With MS: hints to know when the identification is correct<br />

• Good sequence coverage: the larger the sub-sequences and the higher the sequence coverage value, the better<br />

• Consider the length of the protein versus the number of matched theoretical peptides<br />

• Better when high-intensity peaks have been used in the identification<br />

• Scores: the higher, the better. The further from the 2nd hit, the better<br />

• Filter on the correct species if you know it (reduces the search space, time, and errors)<br />

• Better when the errors are more or less constant among all peptides found.<br />

• If you have time, try many tools and compare the results
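Sequence coverage, the first hint in this list, is simple to compute once the matched peptides are known. The following is a minimal sketch with a hypothetical function name. It assumes the matched peptides are exact substrings of the protein sequence and counts every residue covered by at least one peptide.

```python
def sequence_coverage(protein, peptides):
    """Percentage of protein residues covered by at least one matched peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)
        while start != -1:
            # Mark every residue of this occurrence as covered.
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


if __name__ == "__main__":
    protein = "MAIILAGGHSVRFGPKAFAEVNGETFYSR"
    matched = ["MAIILAGGHSVR", "FGPK"]
    print(round(sequence_coverage(protein, matched), 1))
```

Overlapping peptides, for example one with a missed cleavage and its two fully cleaved parts, are counted only once because coverage is tracked per residue.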


With MS/MS: hints to know when the identification is correct<br />

• The higher the number of peptides identified per protein, the better<br />

• Sequence coverage: the larger the sub-sequences and the higher the sequence coverage value, the better<br />

• Depends on the sample complexity and experiment workflow<br />

• Scores: the higher, the better.<br />

• Filter on the correct species if you know it (reduces the search space, time, and errors)<br />

• Better when high-intensity peaks are matched and ion series are extended, without too many or too large gaps.<br />

• Better when the errors are more or less constant among all ions.<br />

• If you have time, try many tools and compare the results<br />



E-values<br />

• For a given score S, the e-value is the number of matches that are expected to occur by chance in a database with a score at least equal to S.<br />

• The e-value takes into account the size of the database that was searched. As a consequence, its maximum is the number of sequences in the database.<br />

• The lower the e-value, the more significant the score.<br />

• An e-value depends on the calculation of the p-value.
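One simple relation that is consistent with these statements is to multiply a per-sequence p-value by the number of sequences searched. This is an assumption made for illustration and not necessarily the formula used by any particular tool. The function name is hypothetical.

```python
def e_value(p_value, n_sequences):
    """Expected number of chance matches, assuming E = p * N."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    return p_value * n_sequences


if __name__ == "__main__":
    # The same per-sequence p-value is less convincing in a larger database.
    for n in (1000, 500000):
        print(n, e_value(1e-7, n))
    # With p = 1 the e-value reaches its maximum, the number of sequences.
    print(e_value(1.0, 500000))
```

Under this assumption the maximum e-value is reached at p = 1 and equals the database size, as stated in the second bullet.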


p-value<br />

• A p-value is the probability of obtaining a match at least as good as the observed one when the null hypothesis (the match is random) is true. If the p-value is $10^{-5}$, a random match would reach this score with a probability of $10^{-5}$.<br />

• A p-value has a maximum of 1.0.<br />

• The larger the search space, the higher the p-value, since the chance of a peptide being a random match increases.<br />

• The lower the p-value, the more significant the match.<br />

Source: Lisacek, Practical Proteomics, 2006 Sep;6 Suppl 2:22-32


Z-score<br />

• The z-score is a dimensionless quantity derived by subtracting the population mean from an individual (raw) score and then dividing the difference by the population standard deviation:<br />

$$z = \frac{x - \mu}{\sigma}$$

• The z-score reveals how many units of the standard deviation a case is above or below the mean.<br />
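The z-score definition translates directly into code. This sketch uses Python's standard `statistics` module, treats the given scores as the whole population, and uses a hypothetical function name and made-up raw scores.

```python
import statistics


def z_scores(raw_scores):
    """Convert raw scores to z-scores using the population mean and standard deviation."""
    mu = statistics.fmean(raw_scores)
    sigma = statistics.pstdev(raw_scores)
    if sigma == 0:
        raise ValueError("all scores are identical, so z-scores are undefined")
    return [(x - mu) / sigma for x in raw_scores]


if __name__ == "__main__":
    raw = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical raw peptide scores
    for score, z in zip(raw, z_scores(raw)):
        print(score, round(z, 2))
```

A z-score of 2 therefore means that the raw score lies two standard deviations above the mean of the scores it is compared with.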

Source: Wikipedia


So what?<br />

• For small (significant) p-values, p and e are approximately equal, so the choice of one or the other is often equivalent. It is therefore reasonable to treat low p-values in Phenyx as e-values. X!Tandem simply reports e-values as log values to remove the powers of 10<br />

• For a single search (or set of sampled peptides), you can compare z-scores. However, when two or more searches are performed on search spaces of different sizes, you first need to look at the p-values before comparing z-scores.<br />
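The near-equality of small p-values and e-values can be checked numerically. The sketch assumes that chance matches follow a Poisson distribution, so that the probability of at least one chance match is p = 1 - exp(-e). That is a common convention in database search statistics, but it is not stated in this course, and the function names are hypothetical.

```python
import math


def p_from_e(e_value):
    """Probability of at least one chance match, assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)   # equals 1 - exp(-e), accurate for small e


def log_e(e_value):
    """Base-10 logarithm of an e-value, which removes the powers of 10."""
    return math.log10(e_value)


if __name__ == "__main__":
    for e in (1e-5, 1e-2, 1.0, 10.0):
        print(e, p_from_e(e), log_e(e))
```

For small values the two quantities agree closely, while for large e-values p saturates at 1, so the two are only interchangeable in the significant range.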

Source: Lisacek, Practical Proteomics, 2006 Sep;6 Suppl 2:22-32
