09.12.2012 Views

Principles of Plant Genetics and Breeding


sequence analysis. It is an application of information science to biology. It uses supercomputers and sophisticated software to search and analyze databases accumulated from genome sequencing projects and other similar efforts. Bioinformatics allows scientists to make predictions based on previous experience with biological systems. In one application, a biological databank is searched for sequences of known function that resemble an unknown sequence, and the function of the unknown sequence is predicted from them. Databases are critical to bioinformatics, and repositories for genetic sequence data therefore exist in various parts of the world.
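The prediction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two "known" sequences and their functions are made-up toy data, the helper name `predict_function` and the 0.8 threshold are hypothetical, and the similarity measure is the generic string ratio from the standard library's `difflib`, not a biological alignment score.

```python
from difflib import SequenceMatcher

# Hypothetical mini databank: sequence -> annotated function (toy data, not real genes).
KNOWN = {
    "ATGGCGTACGTTAGC": "toy kinase",
    "ATGCCCTTTGGGAAA": "toy transporter",
}


def predict_function(unknown, known=KNOWN, min_similarity=0.8):
    """Return (function, similarity) for the most similar known sequence.

    The function is None when even the best match falls below the
    (arbitrary, assumed) similarity threshold.
    """
    # Score the unknown sequence against every annotated sequence.
    scores = {seq: SequenceMatcher(None, unknown, seq).ratio() for seq in known}
    best_seq = max(scores, key=scores.get)
    if scores[best_seq] < min_similarity:
        return None, scores[best_seq]
    return known[best_seq], scores[best_seq]


if __name__ == "__main__":
    print(predict_function("ATGGCGTACGTTAGG"))
```

The threshold matters: without it, every unknown sequence would inherit the annotation of its nearest neighbor, however distant.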

Types of bioinformatics databases

Information used in bioinformatics research may be grouped into two categories:

1 Primary databases. These databases consist of original biological data, such as raw DNA sequences and protein structure information from crystallography.

2 Secondary databases. These databases contain original data that have been processed to suit specific applications.

To be useful, a good database should have two critical parts: (i) the original sequence; and (ii) an annotation describing the biological context of the data. It is critical that each entry be accompanied by a detailed and complete annotation; without one, a bioinformatics search becomes an exercise in futility, since it would be difficult to assign valid meaning to any relationships discovered. Some databases include taxonomic information, such as the structural and biochemical characteristics of organisms.
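As a rough illustration of the two-part requirement, the following Python sketch models a database entry that cannot exist without both a sequence and an annotation. The class name `Entry`, its field names, the accession format, and the `summarize` helper are all assumptions made for this example and do not correspond to the schema of any real repository. The `summarize` function hints at how a secondary database might hold values processed from primary data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A hypothetical primary-database record: raw sequence plus its annotation."""
    accession: str       # made-up identifier format
    sequence: str        # the original sequence
    annotation: str      # description of the biological context
    taxonomy: str = ""   # optional taxonomic information

    def __post_init__(self):
        # Reject entries that lack either of the two critical parts.
        if not self.sequence:
            raise ValueError("an entry needs its original sequence")
        if not self.annotation.strip():
            raise ValueError("an entry needs an annotation")


def summarize(entry):
    """Derive processed values from a primary entry (a secondary-style record)."""
    seq = entry.sequence.upper()
    gc = sum(base in "GC" for base in seq) / len(seq)
    return {"accession": entry.accession, "length": len(seq), "gc_fraction": gc}


if __name__ == "__main__":
    print(summarize(Entry("TOY0001", "ATGC", "toy gene, hypothetical")))
```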

The bulk of the data in repositories consists of primary data. Three major entities are collaboratively responsible for maintaining gene sequence databases: the European Molecular Biology Laboratory (EMBL) of Cambridge, UK; GenBank of the National Center for Biotechnology Information (NCBI), which is affiliated with the National Institutes of Health, USA; and the DNA Data Bank of Japan (DDBJ).

Databases of both protein sequences and protein structures are also maintained. The Department of Medical Biochemistry (University of Geneva) and the European Bioinformatics Institute collaboratively maintain properly annotated translations of sequences in the EMBL databases; this database is called SWISS-PROT. TrEMBL (translated EMBL) is another protein database, consisting solely of translations of the protein-coding regions of the EMBL database. The NCBI also maintains a database of translations of GenBank sequences. Another kind of protein database, consisting of experimentally derived 3D structures of proteins, is kept at the Protein Data Bank (PDB); these structures are determined by X-ray diffraction and nuclear magnetic resonance.
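Translated protein databases rest on a simple operation: reading a protein-coding DNA region three bases at a time and mapping each codon to an amino acid. The Python sketch below shows that operation using the standard genetic code. It is a minimal illustration, not the pipeline any of the databases above actually use; the function name `translate` is hypothetical, and the sketch assumes the input starts in the correct reading frame and ignores alternative genetic codes.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding DNA sequence, stopping at the first stop codon.

    Codons containing characters other than T, C, A, G become 'X'.
    A trailing partial codon is ignored.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTGGTAA"))
```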

General steps in a bioinformatics project

One purpose of searching a bioinformatics database is to determine whether the researcher’s unknown sequence, DNA or protein, matches any sequence in the database in terms of structure or function. This requires the proper choice and skillful use of software to align the unknown sequence with the known ones. Gene-seeking programs vary in capability and ease of use, but they have certain properties in common: (i) algorithms for pattern recognition use statistical probability analysis to determine the similarity between two sequences; (ii) data tables contain information on consensus sequences for various genetic elements; (iii) taxonomic differences are included, because consensus sequences vary between taxonomic classes, to facilitate analysis and minimize errors; and (iv) specific instructions describe how the algorithms should be applied in an analysis and how the results should be interpreted.
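Properties (ii) and (iii) can be pictured as a lookup table of consensus sequences keyed by taxonomic group, which a program slides along the query sequence. The Python sketch below is a deliberately simplified illustration: the table holds only two textbook promoter motifs in abbreviated form, the function names are hypothetical, and sites are ranked by a plain count of matching positions rather than by the statistical probability analysis that real gene-finding software applies.

```python
# Simplified consensus table keyed by taxonomic group (illustrative, not curated).
CONSENSUS = {
    "eukaryote": {"TATA box": "TATAAA"},
    "bacteria": {"-10 box": "TATAAT"},
}


def best_site(sequence, consensus):
    """Slide the consensus along the sequence.

    Returns (position, matches) for the window with the most matching
    positions; the first such window wins ties.
    """
    k = len(consensus)
    best_pos, best_matches = -1, -1
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        matches = sum(a == b for a, b in zip(window, consensus))
        if matches > best_matches:
            best_pos, best_matches = i, matches
    return best_pos, best_matches


def scan(sequence, taxon):
    """Score every element listed for the given taxonomic group."""
    return {name: best_site(sequence, motif)
            for name, motif in CONSENSUS[taxon].items()}


if __name__ == "__main__":
    print(scan("GGCTATAAAGGC", "eukaryote"))
    print(scan("GGCTATAAAGGC", "bacteria"))
```

Running the same sequence against both groups shows why the taxonomic key matters: the window that matches the eukaryotic motif perfectly is only a partial match to the bacterial one.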

A search involves two key activities:

1 Sequence alignment scoring matrices. The unknown sequence is aligned with those in the bank, and scores are assigned on the basis of the sequence similarity detected. It is most useful to align sequences such that the largest scores are assigned to the most biologically significant matches.

2 Comparing sequences against a database. One of the most common searches of bioinformatics databases compares an unknown sequence against those in the database to discover similarities. Typical homology search algorithms are used in this activity. Two of the most widely used programs for this search are the Basic Local Alignment Search Tool (BLAST) and FASTA. BLAST uses a strategy based on matching short sequence fragments between the unknown sequence and those in the database. In its original form it was designed to match only continuous sequences (no gaps from deletion or insertion mutations are taken into account); later versions of BLAST also produce gapped alignments.
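The two activities above can be combined in a small Python sketch: short exact fragments ("words") locate candidate positions, and each candidate is then extended in both directions without gaps while a match/mismatch score is accumulated. This is a toy in the spirit of the strategy described, not the BLAST implementation. The function names, the word length of 4, the +1/-1 scores, and the drop-off limit are all assumptions chosen for readability.

```python
from collections import defaultdict


def ungapped_extend(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=3):
    """Extend an exact word hit in both directions without gaps.

    Extension in a direction stops once the running score has fallen
    more than x_drop below the best score seen. Returns the best score.
    """
    score = best = w * match
    # Extend to the right of the word hit.
    i, j = qi + w, si + w
    while i < len(query) and j < len(subject) and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        best = max(best, score)
        i, j = i + 1, j + 1
    # Extend to the left, starting from the best right-hand score.
    score = best
    i, j = qi - 1, si - 1
    while i >= 0 and j >= 0 and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        best = max(best, score)
        i, j = i - 1, j - 1
    return best


def search(query, database, w=4):
    """Return (name, best ungapped score) pairs, highest score first."""
    results = {}
    for name, subject in database.items():
        # Index every word of length w in the database sequence.
        index = defaultdict(list)
        for s in range(len(subject) - w + 1):
            index[subject[s:s + w]].append(s)
        best = 0
        for q in range(len(query) - w + 1):
            for s in index.get(query[q:q + w], ()):
                best = max(best, ungapped_extend(query, subject, q, s, w))
        if best:
            results[name] = best
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    toy_db = {"a": "GGGACGTACGTTTT", "b": "CCCCCCCCCC"}  # made-up sequences
    print(search("ACGTACGT", toy_db))
```

Because the extension never inserts a gap, a query that differs from a database sequence by a single inserted or deleted base scores lower than its overall similarity would suggest, which is the limitation that gapped alignment addresses.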

DNA microarrays in plant breeding

Genes are not only variably expressed, but the level of expression varies during a physiological change, some
