EMBnet course Proteomics using Bioinformatics tools


MS identification tools 

Patricia M. Palagi 

PIG, SIB, Geneva 

PMP

The data: list of m/z values

MS: peptide mass values and intensities

840.6950 13.75
1676.9606 26.1
1498.8283 128.9
1045.564 845.2
2171.9670 2.56
861.1073 371.2
842.51458 53.7
1456.7274 12.9
863.268365 3.1

MS/MS: parent mass value and charge on the first line, then fragment mass values and fragment intensities

1163.7008 2
86.1105 220.1429
86.1738 13.7619
102.0752 4.3810
147.1329 57.3333
185.1851 649.0953
185.3589 5.3810
186.1876 81.4286
213.0791 1.4286
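As an illustration of how such a peak list can be read, here is a minimal Python sketch. It assumes the two-column layout shown above (parent m/z and charge on the first line, then fragment m/z and intensity). The function names are hypothetical, and the neutral-mass formula assumes a protonated ion [M + zH]z+.

```python
PROTON = 1.007276  # mass of a proton in Da

def parse_peaks(text):
    """Parse whitespace-separated 'value value' lines into (float, float) pairs."""
    peaks = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) == 2:
            peaks.append((float(fields[0]), float(fields[1])))
    return peaks

def neutral_mass(mz, charge):
    """Neutral mass M of an ion observed as [M + zH]z+ at the given m/z."""
    return mz * charge - charge * PROTON

# First lines of the MS/MS example from the slide
msms_text = """
1163.7008 2
86.1105 220.1429
86.1738 13.7619
102.0752 4.3810
147.1329 57.3333
"""

rows = parse_peaks(msms_text)
parent_mz, parent_charge = rows[0][0], int(rows[0][1])
fragments = rows[1:]  # (fragment m/z, intensity) pairs
parent_neutral = neutral_mass(parent_mz, parent_charge)
print(parent_mz, parent_charge, round(parent_neutral, 4), len(fragments))
```

Real peak-list formats (for example mgf or dta files) carry additional header fields, so this parser is only a starting point.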


The tools

One direct access to all: ExPASy

http://www.expasy.org/tools/ 


Automatic protein identification 

- Peptide mass fingerprinting – PMF 

- MS/MS sequence search 

- MS/MS spectra library search 

- MS/MS prospective analysis (tag, open modification search, de novo sequencing)


Peptide mass fingerprinting = PMF 

MS database matching

Experimental side: Protein(s) → enzymatic digestion → peptides → mass spectra → peaklist

Peaklist: 840.695086, 1676.96063, 1498.8283, 1045.564, 2171.967066, 861.107346, 842.51458, 1456.727405, 863.268365

Database side: sequence database entry → in-silico digestion → theoretical proteolytic peptides → theoretical peaklist

Sequence database entry:

…MAIILAGGHSVRFGPKAFAEVNGETFYSRVITLESTNMFNEIIISTNAQLATQFKYPNVVIDDENHNDKGPLAGIYTIMKQHPEEELFFVVSVDTPMITGKAVSTLYQFLV…

Theoretical proteolytic peptides:

- MAIILAGGHSVR
- FGPK
- AFAEVNGETFYSR
- VITLESTNMFNEIIISTNAQLATQFK
- YPNVVIDDENHNDK
- …

Theoretical peaklist: 861.107346, 838.695086, 1676.96063, 1498.8283, 1045.564, 2171.967066, 842.51458, 1457.827405, 863.268453

Match: the experimental peaklist is compared with the theoretical peaklist

Result: ranked list of protein candidates
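The in-silico digestion step can be sketched in a few lines of Python. This is a minimal illustration, not the code of any of the tools discussed here. It assumes the usual trypsin rule (cleave after K or R, except when the next residue is P). The function name and the `missed_cleavages` parameter are hypothetical. The sequence is the fragment shown on the slide, so its first and last peptides may be incomplete.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Cleave after K or R (not before P); optionally allow missed cleavages."""
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal peptide

    # A peptide with n missed cleavages is n + 1 consecutive fragments joined.
    peptides = []
    for n in range(missed_cleavages + 1):
        for j in range(len(fragments) - n):
            peptides.append("".join(fragments[j:j + n + 1]))
    return peptides

# Sequence fragment from the slide
sequence = (
    "MAIILAGGHSVRFGPKAF"
    "AEVNGETFYSRVITLESTNM"
    "FNEIIISTNAQLATQFKYPN"
    "VVIDDENHNDKGPLAGIYTI"
    "MKQHPEEELFFVVSVDTPM"
    "ITGKAVSTLYQFLV"
)

for peptide in tryptic_digest(sequence):
    print(peptide)
```

Calling `tryptic_digest(sequence, missed_cleavages=1)` additionally returns peptides such as MAIILAGGHSVRFGPK, which is how a missed-cleavage setting enlarges the set of theoretical masses.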


Peptide mass fingerprinting 

What you have: 

- Set of peptide mass values 

- Information about the protein: molecular weight, pI, species. 

- Information about the experimental conditions: mass spectrometer 

precision, calibration used, possibility of missed-cleavages, possible 

modifications 

- Biological characteristics: post-translational modifications, fragments 

What the tool will do:

- Match between this information and a protein sequence database 

What you will get:

- a list of probable identified proteins 


What is the expected information in a 

submission form? 

• Place to upload a spectrum (many spectra) 

• Description of the sample process used 

– Chemical process such as alkylation/reduction, 

– Cleavage properties (enzyme), 

– Mass tolerance (m/z tolerance) 

• Search space 

– Sequence databank, 

– taxonomy restriction 

– Mw, pI restriction 

• Scoring criteria and filters 


One example of parameter 

effects on the search 

• Accepted mass tolerance 

‣ due to imprecise measurements and calibration problems

Source: Daniel C. Liebler, Introduction to Proteomics: Tools for the New Biology. Humana Press, 2002
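To make the effect of the accepted mass tolerance concrete, here is a small Python sketch. It uses the experimental and theoretical peaklists from the PMF matching slide of this course. The function names are hypothetical and the tolerance values are chosen only for illustration.

```python
def within_tolerance(experimental, theoretical, tolerance, unit="Da"):
    """True if two masses agree within an absolute (Da) or relative (ppm) tolerance."""
    delta = abs(experimental - theoretical)
    if unit == "ppm":
        return delta / theoretical * 1e6 <= tolerance
    return delta <= tolerance

def matched_peaks(experimental_masses, theoretical_masses, tolerance, unit="Da"):
    """Experimental masses that have at least one theoretical mass within tolerance."""
    return [
        e for e in experimental_masses
        if any(within_tolerance(e, t, tolerance, unit) for t in theoretical_masses)
    ]

experimental = [840.695086, 1676.96063, 1498.8283, 1045.564, 2171.967066,
                861.107346, 842.51458, 1456.727405, 863.268365]
theoretical = [861.107346, 838.695086, 1676.96063, 1498.8283, 1045.564,
               2171.967066, 842.51458, 1457.827405, 863.268453]

strict = matched_peaks(experimental, theoretical, 0.5)            # 0.5 Da
loose = matched_peaks(experimental, theoretical, 2.5)             # 2.5 Da
relative = matched_peaks(experimental, theoretical, 50, "ppm")    # 50 ppm

print(len(strict), len(loose), len(relative))
```

A wider tolerance recovers peaks that are off because of calibration problems, but it also lets unrelated masses match, so the number of false candidates grows with it.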

Summary of PMF tools 

Tool            Source website
Aldente         www.expasy.org/cgi-bin/aldente
Mascot          www.matrixscience.com/
MS-Fit          prospector.ucsf.edu/
ProFound        prowl.rockefeller.edu/profound_bin/WebProFound.exe
PepMAPPER       wolf.bms.umist.ac.uk/mapper/
PeptideSearch   www.mann.emblheidelberg.de/GroupPages/PageLink/peptidesearchpage.html
PepFrag         prowl.rockefeller.edu/prowl/pepfragch.html

Non-exhaustive list!

Scoring systems 

• Essential for the identification! Gives a confidence value to each matched protein

• Three types of scores:

– Shared peaks count (SPC): simply counts the number of matched mass values (peaks)

– Probabilistic scores: the confidence value depends on probabilistic models or statistical knowledge used during the match (obtained from the databases)

– Statistical learning: knowledge extraction from the influence of different properties used to match the proteins (obtained from the databases)
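The simplest of these scores, the shared peaks count, can be sketched as follows. This is an illustrative Python example only. The protein names and mass lists in `database` are invented, and the function names are hypothetical.

```python
def shared_peaks_count(experimental, theoretical, tolerance=0.5):
    """Number of experimental masses matched by at least one theoretical mass."""
    return sum(
        1 for e in experimental
        if any(abs(e - t) <= tolerance for t in theoretical)
    )

def rank_candidates(experimental, database, tolerance=0.5):
    """Rank database entries by decreasing shared peaks count."""
    scores = {
        name: shared_peaks_count(experimental, masses, tolerance)
        for name, masses in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

experimental = [840.70, 1045.56, 1498.83, 1676.96]

# Hypothetical theoretical peptide masses for three database entries
database = {
    "protein_A": [1045.56, 1498.83, 1676.96, 2171.97],
    "protein_B": [840.70, 990.10, 1200.45],
    "protein_C": [700.20, 950.33],
}

ranking = rank_candidates(experimental, database)
for name, score in ranking:
    print(name, score)
```

SPC favours large proteins, which produce many theoretical peptides and therefore more chance matches. This is one reason why probabilistic scores are preferred in practice.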


Mascot 

http://www.matrixscience.com/ 

• Free online version on the above website (commercial versions are also available)

• Choice of several databases.

• Considers multiple chemical modifications.

• 0 to 9 missed cleavages.

• Score based on a combination of probabilistic and statistical approaches (based on the Mowse score).

• Considers Swiss-Prot annotations for splice variants (in locally installed versions).


Mascot - principles 

• Probability-based scoring 

• Computes the probability P that a match is 

random 

• Significance threshold p< 0.05 (accepting that 

the probability of the observed event occurring 

by chance is less than 5%) 

• The significance of that result depends on the 

size of the database being searched. 

• Mascot shades the insignificant hits in green

• Score: $-10 \log_{10}(P)$
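The relation between the probability P and the reported score can be checked with a short Python sketch. The function names are hypothetical. The sketch applies only the formula and the 5% threshold given above, and it ignores how the threshold depends on the size of the database being searched.

```python
import math

def probability_score(p):
    """Score = -10 * log10(P), where P is the probability that the match is random."""
    return -10 * math.log10(p)

def is_significant(p, alpha=0.05):
    """A match is called significant when P is below the chosen threshold."""
    return p < alpha

for p in (0.05, 1e-3, 1e-5):
    print(p, round(probability_score(p), 2), is_significant(p))
```

A tenfold drop in P adds 10 to the score, so P = 0.05 corresponds to a score of about 13 and P = 1e-5 to a score of 50.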


Mascot: input and output

[Screenshots of the Mascot input form and output pages. Labels shown: Decoy; Hints about the significance of the score; Sequence coverage; Peptides matched; Error function]

Aldente 

• Swiss-Prot/TrEMBL database, indexed masses (trypsin and many other enzymes).

• Considers chemical modifications and user-specified modifications.

• Considers biological modifications (Swiss-Prot annotations).

• 0 or 1 missed cleavages.

• Use of a robust alignment method (Hough transform):

– Determines the deviation function of the spectrometer

– Resolves ambiguities

– Less sensitive to noise


Aldente – summary 

• The Hough Transform estimates from the experimental data the deviation function of the mass spectrometer (the calibration error function).

• The program optimizes the set of best matches, excluding noise and outliers, to find the best alignment.

[Figure labels: Experimental masses / peaks; Theoretical masses / peptides; Spectrometer calibration error; Spectrometer internal error]
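The voting idea behind a Hough transform can be illustrated with a deliberately simplified Python sketch. It is not Aldente's algorithm: it assumes the calibration error is a single constant offset, whereas a real deviation function generally depends on the mass. The function name, the bin width and the example spectrum are hypothetical.

```python
from collections import Counter

def estimate_offset(experimental, theoretical, window=1.0, bin_width=0.05):
    """Vote on (experimental - theoretical) differences; the fullest bin wins."""
    votes = Counter()
    for e in experimental:
        for t in theoretical:
            delta = e - t
            if abs(delta) <= window:
                votes[round(delta / bin_width)] += 1
    if not votes:
        return None
    best_bin, _ = votes.most_common(1)[0]
    return best_bin * bin_width

theoretical = [842.51, 1045.56, 1498.83, 1676.96, 2171.97]

# Hypothetical spectrum: every true peak shifted by +0.30 Da, plus two noise peaks
experimental = [t + 0.30 for t in theoretical] + [1046.40, 1500.00]

offset = estimate_offset(experimental, theoretical)
print(round(offset, 2))
```

Each true peak votes for the same offset while noise peaks scatter their votes, which is why the approach is less sensitive to noise than fitting all pairs at once.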

Aldente - Input

[Screenshots of the Aldente input form]

Output

[Screenshot label: Hints about the significance of the score]

Information from Swiss-Prot 

annotation. Processed protein (signal 

peptide is cleaved).


BioGraph

What is the expected information 

in an identification result? 

• A summary of the search parameters 

• A list of potentially identified proteins (AC numbers) with scores and other evidence

• A detailed list of potentially identified peptides (associated or 

not to the potentially identified proteins) with scores 

• Possibilities to validate/invalidate the provided results (info on 

the data processing, on the statistics, links to external 

resources, etc.) 

• Possibilities to export the (validated) data in various formats 


Protein characterization with PMF data

1 protein entry does not represent 1 unique molecule:

- Exact primary structure
- Splicing variants
- Sequence conflicts
- PTMs

Characterization tools at ExPASy using peptide mass fingerprinting data

http://www.expasy.org/tools/

Prediction tools:

• FindMod: PTMs and AA substitutions
• GlycoMod: oligosaccharide structures
• FindPept: unspecific cleavages


SWISS-PROT feature table: 

active protein is more than just translation of 

gene sequence (example: P20366)


Detection of PTMs in MS 

[Figure: two spectra on an m/z axis from 600 to 2200]

Unmodified tryptic masses: 624.3, 769.8, 893.4, 994.5, 1056.1, 1326.7, 1501.9, 1759.8, 1923.4, 2100.6

Tryptic masses of a modified protein: 624.3, 769.8, 893.4, 994.5, 1070.1, 1326.7, 1501.9, 1759.8, 1923.4, 2100.6

Δ m/z => PTM (the peak expected at 1056.1 is observed at 1070.1)
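The mass-difference reasoning on this slide can be sketched in Python. The two mass lists are those of the figure. The table of mass shifts holds standard monoisotopic values for a few common modifications. The function name and the tolerance are hypothetical, and naming the shift as a methylation is only what this small table suggests, not a statement from the slide.

```python
# Monoisotopic mass shifts (Da) for a few common modifications
KNOWN_SHIFTS = {
    "methylation": 14.01565,
    "oxidation": 15.99491,
    "acetylation": 42.01057,
    "phosphorylation": 79.96633,
}

def find_ptm_candidates(unmodified, observed, tolerance=0.1):
    """Pair unmatched theoretical and observed masses whose difference fits a known shift."""
    missing = [m for m in unmodified
               if not any(abs(m - o) <= tolerance for o in observed)]
    extra = [o for o in observed
             if not any(abs(o - m) <= tolerance for m in unmodified)]
    candidates = []
    for o in extra:
        for m in missing:
            for name, shift in KNOWN_SHIFTS.items():
                if abs((o - m) - shift) <= tolerance:
                    candidates.append((m, o, name))
    return candidates

unmodified = [624.3, 769.8, 893.4, 994.5, 1056.1,
              1326.7, 1501.9, 1759.8, 1923.4, 2100.6]
observed = [624.3, 769.8, 893.4, 994.5, 1070.1,
            1326.7, 1501.9, 1759.8, 1923.4, 2100.6]

candidates = find_ptm_candidates(unmodified, observed)
print(candidates)
```

With masses given to one decimal place, several modifications with similar shifts cannot be told apart, so such a prediction is a hypothesis to confirm, for example by MS/MS.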


FindMod http://www.expasy.org/tools/findmod/

[Screenshot: FindMod input form. Labels: AA modifications; DB entry; experimental options; experimental masses]

FindMod Output

[Screenshot: FindMod output, in two groups]

- unmodified peptides, modified peptides known in SWISS-PROT and chemically modified peptides
- putatively modified peptides predicted by mass differences, plus putative AA substitutions

Modification rules can be defined from SWISS-PROT, PROSITE and the literature

Some examples:

modification                  amino acid rule                              exceptions
farnesylation                 Cys                                          -
palmitoylation                Cys                                          Ser, Thr
O-GlcNAc                      Ser, Thr                                     Asn
amidation                     Xaa (C-term) where Gly followed Xaa
pyrrolidone carboxylic acid   Gln (N-term)                                 -
phosphorylation               in eukaryotes: Ser, Thr, Asp, His, Tyr       -
                              in prokaryotes: Ser, Thr, Asp, His, Cys      -
sulfation                     in eukaryotes: Tyr, PROSITE PDOC00003

FindMod Output - Application of Rules 

- potentially modified peptides that agree with rules are listed 

- amino acids that potentially carry modifications are shown 

- peptides potentially modified only by mass difference 


- predictions can be tested by MS-MS peptide fragmentation

FindPept 

http://www.expasy.org/tools/findpept.html 

• From MS (peptide mass fingerprint) data, detection of:

– Matching peptides for unspecific cleavage 

– Masses resulting from possible contaminants 

– Matching peptides for specific cleavage (16 different 

enzymes) 

– Peptides resulting from protease autolysis 

[Screenshots: FindPept]

MS/MS based identification tools 

• Tag search: tools that search peptides based on an MS/MS sequence tag

– MS-Tag and MS-Seq, PeptideSearch

• Ion search or PFF: tools that match MS/MS experimental spectra with “theoretical spectra” obtained via in-silico fragmentation of peptides generated from a sequence database

– Phenyx, Mascot, Sequest, X!Tandem, OMSSA, ProID, …

• De novo sequencing: tools that directly interpret MS/MS spectra and try to deduce a sequence

– Convolution/alignment (PEDENTA)

– De novo sequencing followed by sequence matching (Peaks, Lutefisk, Sherenga, PeptideSearch)

– Guided sequencing (Popitam)

In all cases, the output is a peptide structure per MS/MS spectrum 


Peptide fragmentation fingerprinting = PFF = ion search

MS/MS database matching

Experimental side: Protein(s) → enzymatic digestion → peptides → MS/MS spectra of peptides → ions peaklists

Ions peaklist: 340.695086, 676.96063, 498.8283, 545.564, 1171.967066, 261.107346, 342.51458, 456.727405, 363.268365

Database side: sequence database entry → in-silico digestion → theoretical proteolytic peptides → in-silico fragmentation → theoretical fragmented peptides → theoretical peaklist

Sequence database entry:

…MAIILAGGHSVRFGPKAFAEVNGETFYSRVITLESTNMFNEIIISTNAQLATQFKYPNVVIDDENHNDKGPLAGIYTIMKQHPEEELFFVVSVDTPMITGKAVSTLYQFLV…

Theoretical proteolytic peptides:

- MAIILAGGHSVR
- FGPK
- AFAEVNGETFYSR
- VITLESTNMFNEIIISTNAQLATQFK
- YPNVVIDDENHNDK
- …

Theoretical fragmented peptides:

- MAIILAG
- MAIILA
- MAIIL
- MAII
- MAI
- M
- AIILAG

Theoretical peaklist: 361.107346, 338.695086, 676.96063, 498.8283, 1045.564, 1171.967066, 342.51458, 457.827405, 263.268453

Match: the ions peaklist is compared with the theoretical peaklist

Result: ranked list of peptide and protein candidates

Ion-types

[Table: ion types and their mass offsets. Offsets listed: -28, -45, -46, 0, -17, -18, +17, +28, +2, -15, -16, -15. Rows: P' nterm, P' cterm]

It is very important to know the ionic series produced by a spectrometer, otherwise potential matches will be missed.

On the other hand, if an ion type not present in the original spectrum is taken into account, it will contribute to false positive matches.

[N] is the mass of the N-term group

[M] is the sum of the neutral amino acid residue masses
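As an illustration of how theoretical fragment masses are derived from a peptide, here is a minimal Python sketch that computes singly charged b and y ions from monoisotopic residue masses. It covers only these two series and no neutral losses. The function names are hypothetical, and FGPK is one of the tryptic peptides from the earlier slides.

```python
# Monoisotopic residue masses in Da
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def b_ions(peptide):
    """Singly charged b ions: N-terminal prefixes, residue sum plus one proton."""
    masses, total = [], 0.0
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        masses.append(total + PROTON)
    return masses

def y_ions(peptide):
    """Singly charged y ions: C-terminal suffixes, residue sum plus water plus one proton."""
    masses, total = [], 0.0
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        masses.append(total + WATER + PROTON)
    return masses

peptide = "FGPK"
b_series = b_ions(peptide)
y_series = y_ions(peptide)
print([round(m, 4) for m in b_series])
print([round(m, 4) for m in y_series])
```

Other series are obtained by adding a fixed offset to these values. For instance an a ion is the corresponding b ion minus CO, about 28 Da lighter.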

Some PFF tools 

Same principle as PMF, but using MS/MS spectra

Software               Source website
InsPecT                peptide.ucsd.edu/inspect.py
Mascot                 www.matrixscience.com/search_form_select.html
MS-Tag and MS-Seq      prospector.ucsf.edu
PepFrag                prowl.rockefeller.edu/prowl/pepfragch.html
Phenyx                 phenyx.vital-it.ch
Popitam                www.expasy.org/tools/popitam
ProID (download)       sashimi.sourceforge.net/software_mi.html
Sequest*               fields.scripps.edu/sequest/index.html
Sonar                  65.219.84.5/service/prowl/sonar.html
SpectrumMill*          www.home.agilent.com
VEMS                   www.bio.aau.dk/en/biotechnology/vems.htm
X!Tandem (download)    www.thegpm.org/TANDEM

*Commercialized

Non-exhaustive list!

Phenyx

Submission

[Screenshot: Phenyx submission form]

The Phenyx Web Interface: One result, multiple views

[Screenshot labels: Desktop; Results views; Results comparison; Management console; Excel, XML and text exports]

Source: P.A. Binz

The Proteins overview

[Screenshot labels: List of identified proteins; Protein group description; Corresponding list of identified peptides]

Source: P.A. Binz

The Proteins overview

[Screenshot label: Hints about the significance of the score]

Better when high intensity peaks are matched and ion series are extended, without too many or too large gaps

The scoring system in Phenyx 

• The score is the sum of up to 12 basic scores such as:

– presence of a, b, y, y++, b-H2O…; co-occurrence of ion series (using HMMs), peak intensities, residue modifications (PTM or chemical), …

• True probabilistic approach for each peptide match:

$$\text{score} = \log \frac{\text{likelihood of being correct}}{\text{likelihood of being random}}$$

The numerator is annotated "search in a query database" and the denominator "search in a randomized set of peptides".

• Function of instruments and molecular types

– Esquire 3000+, LCQ; iTRAQ vs. unmodified peptides

• Scores are normalised into z-scores

X!Tandem 

www.thegpm.org

X!Tandem - output

[Screenshot: X!Tandem output, with callouts 1, 2 and 3]

The two-rounds search 

Mascot, Phenyx and X!Tandem 

The identification process may be launched in two rounds

• Each round is defined with a set of search criteria

– The first round searches the selected database(s) with stringent parameters,

– The second round searches only the proteins that have passed the first round, with relaxed parameters:

⇒ Speeds up the job when looking for many variable modifications or unspecific cleavages

⇒ Appropriate when the first round defines stringent criteria to capture a protein ID, and the second round looks for looser peptide identifications

Example 2nd round

1st round, only 3 fixed mods: 131 valid, 75% cov.

2nd round, add variable mods: 205 valid, 84% cov.

2nd round, with all mods and half cleaved: 348 valid, 90% cov.

Source of errors in assigning 

peptides 

• Scores not adapted 

• Parameters are too stringent or too loose 

• Low MS/MS spectrum quality (many noise peaks, low 

signal to noise ratio, missing fragment ions, contaminants) 

• Homologous proteins 

• Incorrectly assigned charge state 

• Pre-selection of the 2nd isotope (the parent mass is shifted by 1 Da; one solution is to widen the parent mass tolerance, but this may drown the correct peptide among other candidates)…

• Novel peptide or variant 


Hints to know when the identification is correct

With MS

• Good sequence coverage: the larger the sub-sequences and the higher the sequence coverage value, the better

• Consider the length of the protein versus the number of matched theoretical peptides

• Better when high intensity peaks have been used in the identification

• Scores: the higher, the better. The further from the 2nd hit, the better

• Filter on the correct species if you know it (reduces the search space, time, and errors)

• Better when the errors are more or less constant among all peptides found.

• If you have time, try many tools and compare the results

Hints to know when the identification is correct

With MS/MS

• The higher the number of peptides identified per protein, the better

• Sequence coverage: the larger the sub-sequences and the higher the sequence coverage value, the better

• Depends on the sample complexity and experiment workflow

• Scores: the higher, the better.

• Filter on the correct species if you know it (reduces the search space, time, and errors)

• Better when high intensity peaks are matched and ion series are extended, without too many or too large gaps.

• Better when the errors are more or less constant among all ions.

• If you have time, try many tools and compare the results

E-values 

• For a given score S, the e-value indicates the number of matches that are expected to occur by chance in a database with a score at least equal to S.

• The e-value takes into account the size of the database that was searched. As a consequence, its maximum is the number of sequences in the database.

• The lower the e-value, the more significant the score is.

• An e-value depends on the calculation of the p-value.
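One common way to relate the two quantities is sketched below in Python. It assumes that the e-value is the per-comparison p-value multiplied by the number of database entries, and that chance matches follow a Poisson distribution. Both are stated here as assumptions, since individual tools may compute their e-values differently. The function names are hypothetical.

```python
import math

def expected_chance_matches(p_single, database_size):
    """E-value under the assumption E = p * N (p: chance of one random match)."""
    return p_single * database_size

def prob_at_least_one(e_value):
    """Probability of one or more chance matches, assuming a Poisson model."""
    return -math.expm1(-e_value)  # 1 - exp(-E), accurate for small E

e = expected_chance_matches(1e-5, 1000)
print(e, prob_at_least_one(e))

# With p = 1 every entry matches by chance: the e-value reaches the database size
print(expected_chance_matches(1.0, 1000))
```

For small values the probability of at least one chance match is nearly equal to the e-value itself, which is why the two are often used interchangeably for significant hits.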

p-value 

• A p-value is the probability of obtaining a match at least as good as the one observed when the null hypothesis (the match is random) is true. If the p-value is $10^{-5}$, a match this good arises by chance alone with a probability of $10^{-5}$.

• A p-value has a maximum of 1.0.

• The larger the search space, the higher the p-value, since the chance of a peptide being a random match increases.

• The lower the p-value, the more significant the match.

Source: Lisacek, Practical Proteomics, 2006 Sep;6 Suppl 2:22-32

Z-score 

• The z-score is a dimensionless quantity derived by subtracting the population mean from an individual (raw) score and then dividing the difference by the population standard deviation.

$$z = \frac{x - \mu}{\sigma}$$

• The z-score reveals how many units of the standard deviation a case is above or below the mean.

Source: Wikipedia
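The definition translates directly into a few lines of Python using the standard library. The raw scores below are invented for illustration and the function name is hypothetical.

```python
import statistics

def z_scores(raw_scores):
    """z = (x - mu) / sigma, with mu and sigma taken over the whole population."""
    mu = statistics.mean(raw_scores)
    sigma = statistics.pstdev(raw_scores)  # population standard deviation
    return [(x - mu) / sigma for x in raw_scores]

# Hypothetical raw scores (mean 5, population standard deviation 2)
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_scores(scores)
print(z)
```

The best raw score, 9.0, lies two standard deviations above the mean, so its z-score is 2.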

So what? 

• For small (significant) p-values, p and e are approximately equal, so the choice of one or the other is often equivalent. It is therefore reasonable to treat low p-values in Phenyx as e-values. X!Tandem simply converts e-values to log values to remove the powers of 10.

• For a single search (or set of sampled peptides), you can compare z-scores. However, when two or more searches are performed on search spaces of different sizes, you first need to look at the p-values before comparing z-scores.

Source: Lisacek, Practical Proteomics, 2006 Sep;6 Suppl 2:22-32
