<strong>EMBL</strong> Research at a Glance 2009<br />
Vertebrate genomics<br />
Paul Flicek<br />
DSc 2004, Washington<br />
University, St. Louis,<br />
Missouri.<br />
At <strong>EMBL</strong>-EBI since 2005.<br />
Team leader at <strong>EMBL</strong>-EBI<br />
since 2008.<br />
Previous and current research<br />
The Vertebrate Genomics team is a combined service and research group that creates and manages<br />
data resources focussing on genome annotation and human variation. The major service projects<br />
of the Vertebrate Genomics team are Ensembl, the European Genotype Archive, and the Data<br />
Coordination Centre for the 1000 Genomes Project. In support of these projects, we are developing<br />
the specialised, large-scale bioinformatics infrastructure required for each analysis. The team’s<br />
research is on computational genome annotation with a particular focus on the integration of diverse<br />
data types such as extensive comparative sequencing, DNA–protein interactions, epigenetic<br />
modifications, and the DNA sequence itself.<br />
Ensembl (www.ensembl.org) is a comprehensive genome information system featuring an integrated<br />
set of tools for genome annotation, data mining and visualisation of chordate genomes. As<br />
such, it is one of the fundamental database resources used to address questions in medical research<br />
and molecular biology. As of August 2008, there were 39 fully-supported genomes in Ensembl<br />
including human, mouse, chicken, five species of fish, a nematode, and several other<br />
mammalian, chordate and insect species.<br />
The European Genotype Archive (EGA) database provides a permanent archive for all types of personally identifiable genetic data including<br />
genotypes, genome sequence and associated phenotype data. The EGA contains both data from individuals whose consent agreements<br />
restrict release to specific approved research uses or to bona fide researchers, and data approved for full public release.<br />
The 1000 Genomes Project (www.1000genomes.org) aims to create a comprehensive and public catalogue of common human genetic variation<br />
in three populations by using next-generation sequencing technology. During 2008, the project conducted three pilot projects to assess<br />
the feasibility of creating a deep and accurate catalogue and to develop the tools needed to manage and analyse the data. The pilot projects included<br />
the sequencing of 180 individuals to 2x coverage; sequencing two trios consisting of a child and both parents to 20x coverage; and targeted<br />
sequencing of 1,000 genes in 1,000 individuals.<br />
In collaboration with the NCBI, the Vertebrate Genomics team is one half of the 1000 Genomes Project Data Coordination Centre (DCC)<br />
and has co-leadership of the project’s data flow group. Over the course of the year the project produced approximately 2 terabases of sequence<br />
(equivalent to 8.5 times the number of nucleotides in the <strong>EMBL</strong>-Bank sequence archive) at a rate approaching 30 gigabases per day. This data<br />
is collected by the DCC and made available to the 1000 Genomes Project analysis group and interested researchers worldwide.<br />
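The figures quoted above can be cross-checked with some back-of-envelope arithmetic. The short Python sketch below is illustrative only: the 2 terabase total, the 8.5x ratio, the 30 gigabase per day rate and the pilot designs are taken from the text, while the 3 gigabase human genome size is an assumption introduced here, and the function names are hypothetical. The pilot estimates reflect the stated designs, not the sequence actually produced, and the targeted gene pilot is left out because the text gives no target size.

```python
# Back-of-envelope check of the sequencing figures quoted in the text.
TOTAL_BASES = 2e12         # ~2 terabases produced over the year (from the text)
ARCHIVE_RATIO = 8.5        # total is ~8.5x the EMBL-Bank nucleotide count (from the text)
RATE_PER_DAY = 30e9        # production rate approaching 30 gigabases/day (from the text)
GENOME_SIZE = 3e9          # ASSUMPTION: a human genome of roughly 3 gigabases


def implied_archive_size(total_bases, ratio):
    """Archive size implied by 'total is ratio times the archive'."""
    return total_bases / ratio


def days_at_rate(total_bases, rate_per_day):
    """Days needed to produce total_bases at a constant daily rate."""
    return total_bases / rate_per_day


def pilot_bases(individuals, coverage, genome_size=GENOME_SIZE):
    """Raw bases needed to sequence each individual to the given coverage."""
    return individuals * coverage * genome_size


archive = implied_archive_size(TOTAL_BASES, ARCHIVE_RATIO)
days = days_at_rate(TOTAL_BASES, RATE_PER_DAY)
low_coverage = pilot_bases(individuals=180, coverage=2)
trios = pilot_bases(individuals=2 * 3, coverage=20)  # two trios, three people each

print(f"Implied EMBL-Bank size: {archive / 1e9:.0f} gigabases")
print(f"Days to reach 2 terabases at 30 gigabases/day: {days:.0f}")
print(f"Low-coverage pilot design: {low_coverage / 1e12:.2f} terabases")
print(f"Trio pilot design: {trios / 1e12:.2f} terabases")
```

Under these assumptions the two whole-genome pilot designs together call for about 1.4 terabases, the same order of magnitude as the approximately 2 terabases reported for the year.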
Future projects and goals<br />
Next-generation sequencing methods are having a profound impact.<br />
For example, we have been investigating ways to use short read<br />
transcriptome data in our automatic annotation to support the substantial<br />
amounts of data we expect in the future. The availability of<br />
an increasing number of genome sequences is challenging the comparative<br />
genomics aspects of the team’s work both in terms of scale<br />
and complexity. ENCODE and the 1000 Genomes Project will provide<br />
significant new data for the functional genomics and variation<br />
resources, respectively. Future developments for the EGA include<br />
a suite of customised data mining tools, an analysis pipeline infrastructure<br />
supporting uniform analysis of the data in the archive,<br />
and the development (in collaboration with international partners)<br />
of standards for the exchange of genotype data including whole<br />
genome sequences.<br />
An example GenomeView from the European Genotype Archive showing<br />
genomic regions that are significantly associated with type I diabetes.<br />
Selected references<br />
Flicek, P. et al. (2008). Ensembl 2008. Nucleic Acids Res., 36<br />
(Database issue): D707-D714<br />
Johnson, D.S. et al. (2008). Systematic evaluation of variability in<br />
ChIP-chip experiments using predefined DNA targets. Genome Res.,<br />
18, 393-403<br />
Saar, K. et al. (2008). SNP and haplotype mapping for genetic<br />
analysis in the rat. Nat. Genet., 40, 560-566<br />
Warren, W.C. et al. (2008). Genome analysis of the platypus reveals<br />
unique signatures of evolution. Nature, 453, 175-183<br />
Flicek, P. (2007). Gene prediction: compare and CONTRAST.<br />
Genome Biol., 8, 233<br />