21.11.2014 Views

Layout 1 - EMBL Grenoble


EMBL-EBI

Literature resource development

Previous and current research

The major goal of literature resource development is to integrate the scientific literature with the data in biological databases and to provide public services that exploit this integration. We achieve this by implementing a state-of-the-art search engine, flexible web access, novel biomedical text mining methods, and ontologies such as Gene Ontology (GO) and Unified Medical Language System (UMLS).

We develop literature resources for use in-house and in public services. A local copy of PubMed is maintained under lease from the US National Library of Medicine (NLM), supplemented by bibliographic data from other sources such as AGRICOLA (USDA-NAL) and Chinese Biology Abstracts (CAS-SICLS). Biological patent abstracts are captured from the European Patent Office (EPO) and the US Patent and Trademark Office (USPTO).

CiteXplore has been developed as a tool for querying the scientific literature, showcasing text mining methods, and linking to biological databases. UMLS, GO, the NCBI taxonomy and gene synonyms from UniProt are used as thesauri. Text mining methods from the research community, such as the ‘Whatizit’ methods of the Rebholz-Schuhmann research group at EMBL-EBI (www.ebi.ac.uk/Rebholz-srv/whatizit/form.jsp), provide several filters for enriching text with annotation. Gene and protein names from UniProt and GO terms are examples of entities that are identified in text and hyperlinked to the underlying data resources.
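The thesaurus-based annotation described above can be illustrated with a minimal sketch. It is not the Whatizit or CiteXplore implementation; the tiny `THESAURUS` table, its placeholder identifiers and the `find_entities` function are all assumptions made for illustration. It shows only the basic idea of matching known terms in text and attaching an identifier that a front end could turn into a hyperlink.

```python
import re

# Hypothetical mini-thesaurus: surface form -> (entity type, identifier).
# Real thesauri (UniProt, GO, NCBI taxonomy) are far larger, and the
# identifiers below are illustrative placeholders, not real accessions.
THESAURUS = {
    "p53": ("protein", "EXAMPLE:0001"),
    "apoptosis": ("go_term", "EXAMPLE:0002"),
    "Homo sapiens": ("organism", "EXAMPLE:0003"),
}


def find_entities(text, thesaurus=THESAURUS):
    """Return (start, end, term, entity_type, identifier) for each match."""
    # Try longer terms first so multi-word names win over their substrings.
    terms = sorted(thesaurus, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    )
    hits = []
    for match in pattern.finditer(text):
        term = match.group(0)
        entity_type, identifier = thesaurus[term]
        hits.append((match.start(), match.end(), term, entity_type, identifier))
    return hits


if __name__ == "__main__":
    sentence = "p53 regulates apoptosis in Homo sapiens."
    for hit in find_entities(sentence):
        print(hit)
```

Matching here is exact and case-sensitive. A production tagger would also need to handle spelling variants, ambiguous gene names and overlapping terms, none of which this sketch attempts.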

Peter Stoehr

MSc Applied Genetics 1978, Birmingham University. Statistician in Agriculture Faculty, Newcastle University. Analyst/Programmer with AFRC, Cambridge and Harpenden. Team leader at EMBL-EBI since 1996.

Future projects and goals

Text mining functionality and enhanced CiteXplore: We plan to accelerate the incorporation of recent methodology into CiteXplore in areas such as GO and MeSH annotations, related articles, inclusion of semantic information in the indexing, and retrieval of sentences and paragraphs in addition to whole documents. We will explore the use of BioLexicon, a terminological resource generated from EBI databases, to facilitate interoperability of the literature with the databases.

Additional bibliographic data: In collaboration with the British Library we will identify additional content for UKPMC, such as UK National Health Service publications and NICE guidelines, and the bibliographic data for this content will be exposed via CiteXplore.

Citation networks: We will begin to process UKPMC articles that are available only as scanned images, using the results of OCR and citation extraction from NCBI. An extensive task is to complete the pipeline for harvesting relevant scientific articles from the web and accurately extracting citations and their context from them. As the citation network fills, we will include citation counts in the CiteXplore indexing and ranking function, in the same fashion as the Google PageRank method, so that highly cited articles appear more prominently in search results.
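As a rough illustration of how citation counts might feed into ranking, the sketch below boosts a text-relevance score by a damped function of each article's citation count. This is an assumption-laden toy, not the CiteXplore ranking function: the `rank_results` name, the `weight` parameter and the logarithmic damping are all hypothetical choices. It is also simpler than PageRank proper, which propagates scores through the link graph rather than counting incoming links.

```python
import math


def rank_results(results, citation_counts, weight=0.5):
    """Order (doc_id, text_score) pairs by a citation-boosted score.

    results         -- list of (doc_id, text_score) from a text search
    citation_counts -- dict mapping doc_id to number of citing articles
    weight          -- hypothetical knob: 0 disables the citation boost
    """
    def combined(item):
        doc_id, text_score = item
        # log1p damps the boost so a few very highly cited articles
        # cannot completely swamp textual relevance.
        boost = math.log1p(citation_counts.get(doc_id, 0))
        return text_score * (1.0 + weight * boost)

    return sorted(results, key=combined, reverse=True)


if __name__ == "__main__":
    hits = [("A", 1.0), ("B", 0.9), ("C", 0.5)]
    counts = {"B": 50, "C": 2}  # "A" has no recorded citations
    print(rank_results(hits, counts))
```

With `weight=0` the original text-score ordering is returned unchanged, which makes it easy to compare rankings with and without the citation signal.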

Screenshot of a PubMed record in CiteXplore, showing mark-up of proteins and organisms in the text, links to the complete article (a free PDF at UKPMC), article citations and cross-references to EMBL nucleotide sequence databases.

Web services: Further web service interfaces to all of the CiteXplore functionality will be developed to make the bibliographic data aggregated at the EBI available to third-party information systems, workflows and research tools.

From June 2009, the Literature Services team will be led by Jo McIntyre.<br />

Selected references

Nikolov, N. & Stoehr, P. (2008). Integrating biomedical publications with existing metadata. In ‘Proceedings - IEEE International Symposium on Computer-Based Medical Systems’, 653-655

Nikolov, N. et al. (2007). CiteXtract: Extracting citation data from biomedical literature. In ‘Proceedings - IEEE International Symposium on Computer-Based Medical Systems’, 12-1

Rebholz-Schuhmann, D. et al. (2007). EBIMed - text crunching to gather facts for proteins from Medline. Bioinformatics, 23, e237-2
