14 PARASITE GENOMICS

loss of many of the most critical phenotypes of the parasite as experienced in the wild. In addition, much research is carried out on model parasites, often natural parasites of model laboratory hosts. A conflict can then arise between the desire to know the genotype of the disease organism and the utility of knowing the genotype of the well-tested model. Parasite genome projects have struggled with this conflict, and different outcomes have been reached by balancing competing needs. For filarial nematodes, the choice of human parasite was simple: only one can readily be maintained in laboratory culture, and one ‘strain’ of this species is almost universally used in research; thus the TRS strain of B. malayi was chosen. For leishmaniasis, the many species causing different disease syndromes suggested that it might be necessary to examine multiple genomes, and even multiple strains within each. A compromise was reached once it became evident that Leishmania species share extensive synteny, and thus a well-studied L. major strain was chosen. Users should be aware of this necessary simplification when considering and using genome data: the genotype determined may not reflect the highly variable genotypes of real-world populations.

BIOINFORMATICS AND THE ANALYSIS OF GENOME DATASETS

The completion of a genome sequence is not the final product. The wealth of data encoded in the millions of contiguous bases must be interpreted and linked to biology. The size of genome datasets has required the development of a new way of analysing data in biology, generally grouped under the term bioinformatics. The linear sequence of the genome DNA encodes (in response to the environment) the four-dimensional organism. Bioinformatics aims to ‘compute’ the organism given the DNA sequence.

All of the parasite genome projects include a dedicated bioinformatic component, and learning the concepts and skills of bioinformatics is essential if parasitologists are to exploit the data in their research. The informatics of interpreting linear strings of data, in terms of pattern recognition and ‘emergent’ higher-order properties, has been well developed outside biology, but biology-driven informatics is now a dynamic and fruitful field. The genome sequence can be likened to a book, where each character has been painstakingly determined, but where we have little idea of the language, syntax and grammar in which it was written. Bioinformatics aims to derive language, grammar and syntax from the data. The project starts from some universals, such as the genetic code, and a general view of structural features such as exons and introns, but the rest has to be derived from the data. For example, given a conflicting set of exon predictions (open reading frame segments flanked by valid splice sites), is it possible to predict with accuracy the correct gene model, and thus the correct encoded protein sequence? Questions such as these are yielding to ever-improving gene-finding and predictive algorithms. Features such as overall sequence complexity, the presence of local or global repeats, the pattern of di- or tri-nucleotide sequences, the relationship between repeats and predicted exons, and the pattern of exonic versus intronic or extragenic DNA can now be computed and used to annotate a genome.
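Two of the ideas in the preceding paragraph can be sketched in a few lines of Python: computing di- or tri-nucleotide patterns and a crude measure of sequence complexity, and enumerating candidate exons as open reading frame segments flanked by splice sites. This is a minimal illustration, not the method of any particular gene finder. All function names are hypothetical. It assumes the standard genetic code's three stop codons, and it reduces ‘valid splice sites’ to the canonical AG (acceptor) and GT (donor) dinucleotides. Shannon entropy of base composition stands in for ‘sequence complexity’.

```python
from collections import Counter
from math import log2

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, k=2):
    """Relative frequency of each overlapping k-mer (k=2 gives dinucleotides)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def shannon_entropy(seq):
    """Base-composition entropy in bits: 0 for one repeated base, 2 for uniform ACGT."""
    return -sum(p * log2(p) for p in kmer_frequencies(seq, k=1).values())


def open_frames(segment):
    """Return the reading frames (0, 1, 2) of segment that contain no stop codon."""
    frames = []
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames


def candidate_internal_exons(seq, min_len=6):
    """List (start, end, open_frames) for segments seq[start:end] that are
    preceded by 'AG', followed by 'GT', and open in at least one frame.

    Assumption: splice sites are recognised by the bare AG/GT dinucleotides,
    which greatly over-predicts compared with a trained splice-site model.
    """
    seq = seq.upper()
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    exons = []
    for start in acceptors:
        for end in donors:
            if end - start >= min_len:
                frames = open_frames(seq[start:end])
                if frames:
                    exons.append((start, end, frames))
    return exons


if __name__ == "__main__":
    toy = "CCAGATGAAAGTCC"  # made-up sequence, for illustration only
    print(kmer_frequencies(toy, k=2))
    print(shannon_entropy(toy))
    print(candidate_internal_exons(toy))
```

On a real genome the exon enumeration returns many overlapping and mutually incompatible candidates. That is the ‘conflicting set of exon predictions’ from which a gene finder has to select a single consistent gene model.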

Many of the tools used are based on probabilistic methods, and thus need a training set of ‘known’ genes from the organism of interest in order to start extracting information. As a genome project advances, the training set

MOLECULAR BIOLOGY
