Principles of Plant Genetics and Breeding
sequence analysis. It is an application of information
science to biology. It uses supercomputers and sophisticated
software to search and analyze databases accumulated
from genome sequencing projects and other similar
efforts. Bioinformatics allows scientists to make predictions
based on previous experiences with biological
reality. In one application, the biological information
bank is searched to find sequences with known function
that resemble the unknown sequence and thereby predict
the function of the unknown sequence. Databases
are critical to bioinformatics and hence repositories exist
in various parts of the world for genetic sequence data.
Types of bioinformatics databases
Information used in bioinformatics research may be
grouped into two categories:
1 Primary databases. These databases consist of original
biological data such as raw DNA sequences and
protein structure information from crystallography.
2 Secondary databases. These databases contain original
data that have been processed to suit certain
specific applications.
To be useful, a good database should have two critical
parts: (i) the original sequence; and (ii) an annotation
describing the biological context of the data. It
is critical that each entry be accompanied by a detailed
and complete annotation, without which a bioinformatics
search becomes an exercise in futility, since it would
be difficult to assign valid meaning to any relationships
discovered. Some databases include taxonomic information
such as the structural and biochemical characteristics
of organisms.
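The two-part entry described above can be sketched as a small Python record. This is only an illustration: the class name, the field names, the choice of "organism" and "function" as required annotation keys, and the sample accessions are all hypothetical, and real repositories use far richer entry formats.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceEntry:
    """Hypothetical database entry: a sequence plus its annotation."""
    accession: str                                   # illustrative identifier, not a real one
    sequence: str                                    # part (i): the original sequence
    annotation: dict = field(default_factory=dict)   # part (ii): biological context

    def is_interpretable(self) -> bool:
        # Assumption for this sketch: a match is only meaningful if the entry
        # has a sequence and states at least an organism and a function.
        required = ("organism", "function")
        return bool(self.sequence) and all(self.annotation.get(k) for k in required)


annotated = SequenceEntry(
    accession="DEMO0001",
    sequence="ATGGCTTCTTAA",
    annotation={"organism": "Zea mays", "function": "made-up example protein"},
)
bare = SequenceEntry(accession="DEMO0002", sequence="ATGGCTTCTTAA")

print(annotated.is_interpretable())
print(bare.is_interpretable())
```

A hit against the second entry would reveal a similarity but give no basis for interpreting it, which is the situation the text warns about.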
The bulk of the data in repositories consists of primary
data. Three major entities are collaboratively responsible
for maintaining gene sequence databases. These entities
are the European Molecular Biology Laboratory (EMBL) of
Cambridge, UK, GenBank of the National Center
for Biotechnology Information (NCBI), which is affiliated
with the National Institutes of Health, USA, and the
DNA Data Bank of Japan (DDBJ).
Databases for both protein sequences and structures
are being maintained. The Department of Medical
Biochemistry (University of Geneva) and the European
Bioinformatics Institute collaboratively maintain properly
annotated translations of sequences in the EMBL
databases. This database is called SWISS-PROT. TREMBL
(translated EMBL) is another protein database, consisting
solely of the protein-coding regions of the EMBL
database. The NCBI also maintains a
database of the translations of GenBank. Another
kind of protein database, consisting of experimentally
derived 3D structures of proteins, is kept at the Protein
Data Bank (PDB); these structures are determined by X-ray
diffraction and nuclear magnetic resonance.
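Several of the protein databases above are built from translations of nucleotide entries. As a rough sketch of what translation means computationally, the following Python function reads a coding sequence codon by codon using the standard genetic code. The function name and the sample sequence are assumptions made for illustration; production pipelines also handle reading frames, introns, alternative genetic codes, and annotation, none of which is attempted here.

```python
from itertools import product

# Standard genetic code, listed with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence from its first base, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognized codon
        if amino_acid == "*":                             # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTTCTTAA"))
```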
General steps in a bioinformatics project
One purpose of searching a bioinformatics database is to
determine if the researcher’s unknown sequence, DNA
or protein, matches any sequence in the database in
terms of structure or function. This requires the proper
choice and skillful use of software to align the unknown
sequence with the known. Gene-seeking programs vary
in capability and ease of use, but they have certain properties
in common: (i) algorithms for pattern recognition
use statistical probability analysis to determine the similarity
between two sequences; (ii) data tables contain
information on consensus sequences for various genetic
elements; (iii) taxonomic differences are included, to
facilitate analysis and minimize errors, because consensus
sequences vary between taxonomic classes; and (iv)
specific instructions describe how the algorithms should
be applied in an analysis and how the results should be
interpreted.
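To make properties (i) and (ii) concrete, the sketch below scans a DNA sequence for windows that resemble a consensus motif held in a small table. It is a simplified stand-in, not the method of any particular program: the table contents, the function name, and the use of a plain identity fraction in place of a real statistical score are all assumptions made for illustration.

```python
# Standard IUPAC ambiguity codes, limited to the few needed here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "AT", "R": "AG", "Y": "CT", "N": "ACGT"}

# Hypothetical consensus table; a real program would keep one per taxonomic class.
CONSENSUS = {"TATA-like element": "TATAWAW"}


def scan(sequence, motif, min_identity=1.0):
    """Return (start, identity) for every window that matches the motif well enough."""
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        # Count positions where the base is allowed by the consensus code.
        matches = sum(base in IUPAC[code] for base, code in zip(window, motif))
        identity = matches / len(motif)
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


for name, motif in CONSENSUS.items():
    print(name, scan("GGCTATAAATGG", motif))
```

Lowering `min_identity` admits imperfect matches, which is where a proper probability model would be needed to separate real elements from chance resemblance.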
A search involves two key activities:
1 Sequence alignment scoring matrices. The unknown
sequence is aligned with those in the bank. Scores
are assigned on the basis of the sequence homology
detected. It is most useful to align sequences such
that the largest scores are assigned to the most
biologically significant matches.
2 Comparing sequences against a database. One
of the most common searches of bioinformatics
databases is to compare an unknown sequence
against those in the database to discover similarities.
Typical homology search algorithms are used in this
activity. Two of the most widely used programs for
this search are the Basic Local Alignment Search Tool
(BLAST) and FASTA. BLAST uses a strategy based
on matching short sequence fragments between the unknown
sequence and those in the database. In its original
form it was designed to match only continuous sequences
(no gaps from deletion or insertion mutations are
taken into account).
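The two activities above can be sketched together in a few lines of Python: short exact word matches locate candidate positions, each match is extended in both directions without gaps, and database entries are ranked by their best score. This is a teaching sketch in the spirit of the strategy described, not BLAST's actual implementation. The function names, the word size, the +1/-1 scoring scheme (standing in for a real scoring matrix), and the toy database are all assumptions, and no statistical significance is computed.

```python
def best_gain(pairs, match, mismatch):
    """Largest cumulative score reachable by extending over aligned pairs in order."""
    score = best = 0
    for a, b in pairs:
        score += match if a == b else mismatch
        best = max(best, score)
    return best


def search(query, database, word_size=4, match=1, mismatch=-1):
    """Rank database entries by their best ungapped, word-anchored alignment score."""
    results = {}
    for name, subject in database.items():
        # Index every word of the database sequence by its start positions.
        index = {}
        for s in range(len(subject) - word_size + 1):
            index.setdefault(subject[s:s + word_size], []).append(s)
        top = 0
        for q in range(len(query) - word_size + 1):
            for s in index.get(query[q:q + word_size], []):
                # Extend the exact word hit to the right and to the left, no gaps.
                right = zip(query[q + word_size:], subject[s + word_size:])
                left = zip(reversed(query[:q]), reversed(subject[:s]))
                score = (word_size * match
                         + best_gain(right, match, mismatch)
                         + best_gain(left, match, mismatch))
                top = max(top, score)
        if top:
            results[name] = top
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


toy_database = {"entry_A": "TTGACCGTAAGCTT", "entry_B": "GGGGCCCCAAAA"}
print(search("ACCGTTAGC", toy_database))
```

Entries that share no word with the query are never extended at all, which is what makes this kind of search fast on large databases.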
DNA microarrays in plant breeding
Genes are not only variably expressed, but the level of
expression varies during a physiological change, some