EMBL-EBI
Literature resource development
Previous and current research
The major goal of literature resource development is to integrate the scientific literature with the
data in biological databases and to provide public services that exploit this integration. We achieve this by
combining a state-of-the-art search engine, flexible web access, novel biomedical text mining methods,
and ontologies such as the Gene Ontology (GO) and the Unified Medical Language System (UMLS).
We develop literature resources for use in-house and in public services. A local copy of PubMed
is maintained under lease from the US National Library of Medicine (NLM), supplemented by
bibliographic data from other sources such as AGRICOLA (USDA-NAL) and Chinese Biology
Abstracts (CAS-SICLS). Biological patent abstracts are captured from the European Patent Office
(EPO) and the US Patent and Trademark Office (USPTO).
CiteXplore has been developed as a tool for querying the scientific literature, showcasing text mining
methods, and linking to biological databases. UMLS, GO, the NCBI taxonomy and gene synonyms
from UniProt are used as thesauri. Text mining methods from the research community,
such as the ‘Whatizit’ methods of the Rebholz-Schuhmann research group at EMBL-EBI
(www.ebi.ac.uk/Rebholz-srv/whatizit/form.jsp), provide several filters for enriching text with
annotation. Gene and protein names from UniProt and GO terms are examples of entities that are identified
in text and hyperlinked to the underlying data resources.
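
As a rough illustration of thesaurus-based annotation of this kind, the sketch below finds known terms in a passage and wraps each one in a hyperlink. It is not the Whatizit or CiteXplore implementation: the function name `annotate`, the exact-case whole-word matching, and the `example.org` URLs are all assumptions made for the example.

```python
import html
import re


def annotate(text, thesaurus):
    """Return `text` as HTML with every thesaurus term hyperlinked.

    `thesaurus` maps a term (e.g. a gene name or GO term) to the URL of
    its database record. Matching is exact-case and whole-word only,
    which is a simplifying assumption for this sketch.
    """
    if not thesaurus:
        return html.escape(text)

    # Try longer terms first so that a multi-word term wins over a
    # shorter term contained inside it.
    terms = sorted(thesaurus, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
    )

    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches, then emit the link.
        parts.append(html.escape(text[last:match.start()]))
        term = match.group(0)
        parts.append(
            '<a href="{}">{}</a>'.format(
                html.escape(thesaurus[term]), html.escape(term)
            )
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


# Hypothetical thesaurus entries; the URLs are placeholders.
example_thesaurus = {
    "TP53": "https://example.org/protein/TP53",
    "apoptosis": "https://example.org/go/apoptosis",
}
print(annotate("TP53 regulates apoptosis.", example_thesaurus))
```

A production filter would also need to handle synonyms, spelling variants and ambiguous names, which a plain dictionary lookup like this does not attempt.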
Peter Stoehr<br />
MSc Applied Genetics 1978,<br />
Birmingham University.<br />
Statistician in Agriculture<br />
Faculty, Newcastle<br />
University.<br />
Analyst/Programmer with<br />
AFRC, Cambridge and<br />
Harpenden.<br />
Team leader at EMBL-EBI
since 1996.<br />
Future projects and goals<br />
Text mining functionality and enhanced CiteXplore: We
plan to accelerate the incorporation of recent methodology into CiteXplore
in areas such as GO and MeSH annotations, related<br />
articles, inclusion of semantic information in the indexing, and<br />
sentence/paragraph retrieval in addition to whole documents.<br />
We will explore the use of BioLexicon, a terminological resource<br />
generated from EBI databases, to facilitate interoperability<br />
of literature with the databases.<br />
Additional bibliographic data: In collaboration with the<br />
British Library we will identify additional content for UKPMC,<br />
for example UK National Health Service publications and<br />
NICE guidelines, among others, and the bibliographic data for<br />
these will be exposed via CiteXplore.<br />
Citation networks: We will begin to process UKPMC articles<br />
that are available only as scanned images, using the results of<br />
OCR and citation extraction from NCBI. A substantial task is
to complete the pipeline for harvesting relevant scientific articles
from the web and accurately extracting citations and their
context from them. As the citation network fills, we will include
citation counts in the CiteXplore indexing and ranking<br />
function, in the same fashion as the Google PageRank method,<br />
to enable highly-cited articles to appear more prominently in<br />
search results.<br />
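
To make the PageRank-style idea concrete, the following is a minimal sketch of how a score could be computed over a citation network by power iteration. It is an illustration under stated assumptions, not the CiteXplore ranking function: the name `citation_rank`, the input format, the damping factor of 0.85 (the value conventionally used for PageRank) and the fixed iteration count are all choices made for the example.

```python
def citation_rank(citations, damping=0.85, iterations=50):
    """Score articles in a citation network, PageRank-style.

    `citations` maps each article ID to the list of article IDs it
    cites. Returns a dict of article ID -> score; the scores sum to 1.
    """
    # Collect every article that cites or is cited.
    articles = set(citations)
    for cited in citations.values():
        articles.update(cited)
    n = len(articles)
    if n == 0:
        return {}

    rank = {article: 1.0 / n for article in articles}
    for _ in range(iterations):
        # Every article receives a small baseline score.
        new_rank = {article: (1.0 - damping) / n for article in articles}
        for article in articles:
            cited = citations.get(article, [])
            if cited:
                # An article passes its score on to the articles it cites.
                share = damping * rank[article] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # An article with no outgoing citations spreads its
                # score evenly over the whole network.
                share = damping * rank[article] / n
                for target in articles:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-article network: A and B both cite C.
scores = citation_rank({"A": ["C"], "B": ["C"], "C": []})
for article, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(article, round(score, 3))
```

In a search setting, a score like this would typically be combined with the text-relevance score rather than replace it, so that a highly cited article is promoted only among results that match the query.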
Screenshot of a PubMed record in CiteXplore, showing mark-up of proteins<br />
and organisms in the text, links to the complete article (a free PDF at UKPMC),<br />
article citations and cross-references to EMBL nucleotide sequence databases.
Web services: Further web services interfaces to all of the CiteXplore functionality will be developed to make the bibliographic data aggregated<br />
at the EBI available to third party information systems, workflows and research tools.<br />
From June 2009, the Literature Services team will be led by Jo McIntyre.<br />
Selected references<br />
Nikolov, N. & Stoehr, P. (2008). Integrating biomedical publications<br />
with existing metadata. In ‘Proceedings - IEEE International<br />
Symposium on Computer-Based Medical Systems’, 653-655<br />
Nikolov, N. et al. (2007). CiteXtract: Extracting citation data from<br />
biomedical literature. In ‘Proceedings - IEEE International<br />
Symposium on Computer-Based Medical Systems’, 12-1<br />
Rebholz-Schuhmann, D. et al. (2007). EBIMed - text crunching to<br />
gather facts for proteins from Medline. Bioinformatics, 23, e237-2<br />