01.04.2015 Views

Gene Cloning


Bioinformatics

These two regular expressions, describing the motifs for the serine and histidine active sites, are diagnostic of trypsin-like serine proteases. If sequences matching both are present, then the protein can be unambiguously assigned to the family. In situations like this, where several motifs are required to define a protein family, the motifs can be grouped together into what are known as fingerprints. The fingerprint provides a signature for the protein family. The PRINTS protein fingerprint database is a collection of these fingerprints, which can be searched in a variety of ways. The PRINTS entry for the trypsin-like serine proteases consists of three elements: the two Prosite patterns that we have already discussed and a third motif representing the active site aspartate.
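
The idea that a family is assigned only when every motif is present can be sketched in a few lines of Python. This is a minimal illustration, not the PRINTS search method: the two patterns below are simplified, hypothetical stand-ins rather than the curated Prosite patterns, and the names `MOTIFS` and `match_fingerprint` are invented for this example.

```python
import re

# Hypothetical, simplified motifs for illustration only; they are NOT the
# curated Prosite patterns for trypsin-like serine proteases.
MOTIFS = {
    "histidine_site": re.compile(r"[LIVM][ST]A[STAG]HC"),
    "serine_site": re.compile(r"G[DE]SG[GS]"),
}


def match_fingerprint(sequence, motifs=MOTIFS):
    """Return {motif_name: start_index} if every motif occurs, else None."""
    hits = {}
    for name, pattern in motifs.items():
        match = pattern.search(sequence)
        if match is None:
            # One missing motif is enough to reject the assignment.
            return None
        hits[name] = match.start()
    return hits


if __name__ == "__main__":
    toy_sequence = "AAAVSAAHCKKKKGDSGGPLL"  # made-up sequence, not a real protein
    print(match_fingerprint(toy_sequence))
```

Requiring all motifs at once is what makes the fingerprint more selective than any single pattern: a sequence that happens to match one short motif by chance is unlikely to match the others as well.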

Both of the above methods of representing the characteristic patterns of amino acids typical of specific protein families are a compromise between a very restricted expression, which is likely to miss some of the more divergent examples, and a very fuzzy expression, which may result in false identifications of family members. These expressions are refined by an iterative process of scanning the primary protein databases and evaluating the families identified. An alternative solution is to preserve the full alignment used to identify the family in the form of a matrix in which the frequency of each amino acid at each position is recorded. Such matrices are known as profiles and are found, in addition to the patterns, in the Prosite database.
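
A position-by-position frequency matrix of the kind described above can be sketched as follows. This is an assumption-laden toy: it records raw frequencies from a small, gap-free alignment, whereas real profiles typically add weighting and scoring that are not covered here. The function name `build_profile` and the example alignment are invented for illustration.

```python
from collections import Counter


def build_profile(alignment):
    """Return a list with one {residue: frequency} dict per alignment column.

    Assumes all sequences are already aligned and of equal length.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    # zip(*alignment) yields the residues of each column in turn.
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


if __name__ == "__main__":
    toy_alignment = ["GDSGG", "GDSGS", "GESGG", "GDSGG"]  # made-up sequences
    for position, frequencies in enumerate(build_profile(toy_alignment), start=1):
        print(position, frequencies)
```

Unlike a regular expression, which either allows a residue at a position or does not, the matrix keeps the information that one residue is common and another is rare at the same position.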

The real power of these secondary or pattern databases is that they offer a fast track to identifying important structural and functional regions of proteins and, in doing so, provide a link to a wealth of information about the biological function of proteins. These patterns, which are derived from extensive alignments of many sequences, may in some cases be able to identify more distant relationships, and more divergent members of families, than similarity searching. As the primary databases expand at ever increasing rates, secondary databases can also help to overcome some of the problems of “noise” created by the occurrence of many very similar sequences. We have already come across an example of this when looking at BLAST searches (Section 8.10). A search of the conserved domain database (CDD) (Figure 8.13d) is run by default at the same time as a protein BLAST at NCBI. This makes it possible to study the domain architecture of the query protein even before the BLAST search is complete.

8.16 Investigating the Three-dimensional Structures of Biological Molecules

The logical extension of studying the sequence of DNA and protein molecules is to understand the three-dimensional structures that they adopt. These three-dimensional structures must be determined experimentally by techniques such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. This is a much more involved process than determining sequence and is beyond the scope of this book. However,
