You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>EMBL</strong> Research at a Glance 2009<br />
Weimin Zhu<br />
MSc, 1993, University of<br />
Toronto.<br />
Project Manager, GDB<br />
Project, Toronto, until 2000.<br />
Head of Bioinformatics,<br />
Synax Pharmar, Toronto,<br />
Canada, until 2002.<br />
Team leader at <strong>EMBL</strong>-EBI<br />
since 2002.<br />
Database Research and Development Group<br />
activities<br />
Previous and current research<br />
In February 2008, the Database Applications team was reorganised to form the Database Research<br />
and Development group due to the creation of the PANDA group.<br />
As the Database Application team, the team’s main functions were to provide software/tool development<br />
and maintenance support for the <strong>EMBL</strong> Nucleotide Sequence Database, and Oracle<br />
database support for all the Oracle databases in the PANDA group. The team also coordinated<br />
hardware resources with the EBI’s Systems and Networking team to ensure the hardware requirements<br />
for the projects within PANDA were met.<br />
The group’s new mandate is to conduct research and development to find new technologies and<br />
solutions to meet challenges related to very large databases (VLDB), which includes data distribution<br />
problems when network speed is a bottleneck, and the solutions required to manage and<br />
query VLDBs efficiently.<br />
The size of bioinformatics databases has been increasing exponentially over the last ten years.<br />
Some core resources are approaching, or have already reached, multi-terabytes in size. This trend<br />
of growth has accelerated in recent years by the introduction of new data types and high-throughput<br />
data producing technologies. Today, we are facing all the challenges a VLDB brings, such as<br />
those in data operational management, data access performance, and data mirroring and distribution. Our current infrastructure in these areas<br />
thus require upgrading in order to realise the full potential of data-rich resources, and optimise the usage of our human and hardware resources.<br />
This year, our main focus has been to analyse the uniqueness of bioinformatics datasets, and conduct research into possible solutions to large<br />
dataset distribution problems, especially over slow networks.<br />
Future projects and goals<br />
The development of a new delta compression algorithm will be continued. The goal is to develop prototype software, along with a data model<br />
and repository to store the pre-computed metadata for the algorithm, to test the key features of this new distribution system. The later stage<br />
of this work will be undertaken in collaboration with the Systems and Networking team and database projects requiring the distribution of<br />
large data files, as well as with external partners in countries with slow network connections to the EBI. This will allow us to benchmark the<br />
network saving for distributing large datasets using delta compression.<br />
The data distribution problem is not unique to the bioinformatics community. It is also a problem for other domains, such as particle physics<br />
and earth science. A workshop is planned for 2009 to share knowledge among scientists across specialisations. Another focus of this workshop<br />
is to bring together Chinese bioinformaticians and network backbone administrators to explore the network optimisation possibilities<br />
between China and the EBI.<br />
We will continue to seek external funding for the project’s long term goal – to have a stable and well-maintained distribution system for the<br />
EBI’s large databases.<br />
Selected references<br />
Cochrane, G. et al. (2008). Priorities for nucleotide trace, sequence<br />
and annotation data capture at the Ensembl Trace Archive and the<br />
<strong>EMBL</strong> Nucleotide Sequence Database. Nucleic Acids Res., 36, D5-<br />
D12<br />
86