UMLS: The Graph Behind the Forest - Medical Ontology Research
Institute for Discrete Sciences
Workshop on Associating Semantics with Graphs
Rutgers University
April 16, 2007
Unified Medical Language System
The graph behind the forest
Olivier Bodenreider
Lister Hill National Center
for Biomedical Communications
Bethesda, Maryland - USA
Biomedical trees
http://www.tolweb.org/tree/
http://www.ncbi.nlm.nih.gov/Taxonomy/
Medical Subject Headings
http://www.nlm.nih.gov/mesh/2007/MBrowser.html
Gene Ontology
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
SNOMED Clinical Terms
http://www.clininfo.co.uk/clue5/clue.htm
Biomedical trees revisited
Medical Subject Headings
Amino Acids, Peptides, and Proteins
Proteins
Cytoskeletal Proteins
Contractile Proteins
Muscle Proteins
Membrane Proteins
Dystrophin
http://www.nlm.nih.gov/mesh/2007/MBrowser.html
Gene Ontology
biological process
biological regulation
metabolic process
regulation of biological process
primary metabolic process
regulation of metabolic process
lipid metabolic process
regulation of lipid metabolic process
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
SNOMED Clinical Terms
disorder of trunk
disorder of thorax
neoplasm of trunk
disorder of breast
neoplasm of thorax
neoplasm of breast
http://www.clininfo.co.uk/clue5/clue.htm
Terminology integration
Unified Medical Language System
Addison’s disease in medical vocabularies
Synonyms
• Addisonian syndrome
• Bronzed disease
• Addison melanoderma
• Asthenia pigmentosa
• Primary adrenal deficiency
• Primary adrenal insufficiency
• Primary adrenocortical insufficiency
• Chronic adrenocortical insufficiency
eponym
symptoms
clinical variants
Organize terms
Synonymous terms clustered into a concept
Preferred term: Addison's disease
Unique identifier (CUI): C0001403
Addison Disease MeSH D000224
Primary hypoadrenalism MedDRA 10036696
Primary adrenocortical insufficiency ICD-10 E27.1
Addison's disease (disorder) SNOMED CT 363732003
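The clustering of synonymous terms under one concept identifier can be sketched in a few lines of Python. This is only an illustration built from the four rows on the slide; the tuple layout and the helper names `cluster_by_cui` and `lookup_cui` are assumptions made for this example and do not reflect the UMLS distribution format or any UMLS API.

```python
from collections import defaultdict

# (term, source vocabulary, source code, CUI) rows copied from the slide.
# The tuple layout is a simplification for illustration only.
ATOMS = [
    ("Addison Disease", "MeSH", "D000224", "C0001403"),
    ("Primary hypoadrenalism", "MedDRA", "10036696", "C0001403"),
    ("Primary adrenocortical insufficiency", "ICD-10", "E27.1", "C0001403"),
    ("Addison's disease (disorder)", "SNOMED CT", "363732003", "C0001403"),
]


def cluster_by_cui(atoms):
    """Group source terms under the concept identifier they share."""
    concepts = defaultdict(list)
    for term, source, code, cui in atoms:
        concepts[cui].append((source, code, term))
    return dict(concepts)


def lookup_cui(atoms, source, code):
    """Return the CUI for a code in a given source vocabulary, or None."""
    for _term, src, src_code, cui in atoms:
        if src == source and src_code == code:
            return cui
    return None


if __name__ == "__main__":
    for cui, members in cluster_by_cui(ATOMS).items():
        print(cui)
        for source, code, term in members:
            print(f"  {source} {code}: {term}")
```

With such a table, a code from one vocabulary (for instance ICD-10 E27.1) can be translated to codes in the other vocabularies by going through the shared CUI.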
SNOMED International:
Diseases/Diagnoses
Diseases of the endocrine system
Diseases of the Adrenal Glands
Addison’s Disease
MeSH:
Diseases
Endocrine Diseases
Adrenal Gland Diseases
Adrenal Gland Hypofunction
Addison’s Disease
AOD:
Endocrine disorder
Adrenal disorder
Adrenal cortical disorder
Adrenal cortical hypofunction
Addison’s Disease
Read Codes:
Endocrine disorder
Disorder of adrenal gland
Hypoadrenalism
Adrenal Hypofunction
Corticoadrenal insufficiency
Addison’s Disease
ICD-10:
Disorders of other endocrine gland
Other disorders of adrenal gland
Primary adrenocortical insufficiency
Organize concepts
Inter-concept relationships: hierarchies from the source vocabularies
Redundancy: multiple paths
One graph instead of multiple trees (multiple inheritance)
[Figure: separate trees over nodes A to H, merged into a single graph in which shared nodes have several parents]
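As a rough sketch of how several source trees collapse into one graph, the snippet below merges four of the Addison's disease paths listed earlier into a single child-to-parents map. It matches nodes by identical name only, which is a simplifying assumption: in the UMLS, nodes are merged when they belong to the same concept, so synonymous names such as "Endocrine Diseases" and "Endocrine disorder" would also be unified. The function names are hypothetical.

```python
from collections import defaultdict

# Root-to-leaf paths copied from the source-vocabulary hierarchies above.
PATHS = {
    "SNOMED International": [
        "Diseases/Diagnoses",
        "Diseases of the endocrine system",
        "Diseases of the Adrenal Glands",
        "Addison's Disease",
    ],
    "MeSH": [
        "Diseases",
        "Endocrine Diseases",
        "Adrenal Gland Diseases",
        "Adrenal Gland Hypofunction",
        "Addison's Disease",
    ],
    "AOD": [
        "Endocrine disorder",
        "Adrenal disorder",
        "Adrenal cortical disorder",
        "Adrenal cortical hypofunction",
        "Addison's Disease",
    ],
    "Read Codes": [
        "Endocrine disorder",
        "Disorder of adrenal gland",
        "Hypoadrenalism",
        "Adrenal Hypofunction",
        "Corticoadrenal insufficiency",
        "Addison's Disease",
    ],
}


def merge_paths(paths):
    """Union the parent-child edges of every path into one graph.

    Nodes are matched by exact name (an assumption made for this sketch).
    Returns a dict mapping each child to the set of its parents.
    """
    parents = defaultdict(set)
    for path in paths.values():
        for parent, child in zip(path, path[1:]):
            parents[child].add(parent)
    return dict(parents)


def multiple_inheritance(parents):
    """List the nodes that have more than one parent in the merged graph."""
    return sorted(node for node, ps in parents.items() if len(ps) > 1)


if __name__ == "__main__":
    graph = merge_paths(PATHS)
    for node in multiple_inheritance(graph):
        print(node, "<-", sorted(graph[node]))
```

Each source on its own is a tree, but the merged structure gives "Addison's Disease" one parent per source, which is the multiple inheritance the slide refers to.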
Organize concepts
Endocrine Diseases
Adrenal Cortex Diseases
Adrenal Gland Diseases
Hypoadrenalism
Adrenal Gland Hypofunction
Adrenal cortical hypofunction
Addison’s Disease
Legend: SNOMED, MeSH, AOD, Read Codes, UMLS
Endocrine System
Endocrine Glands
Abdominal organ
Diseases
Endocrine Diseases
Adrenal Glands
Adrenal Dysfunction
Adrenal Gland Diseases
Adrenal Cortex Diseases
Disorders of other endocrine gland
Adrenal Cortex
Adrenal Cortex Dysfunction
Hypoadrenalism
Adrenal Gland Hypofunction
Other disorders of adrenal gland
Adrenal cortical hypofunction
Secondary hypocortisolism
Addison’s Disease
Addison’s disease due to autoimmunity
Source Vocabularies (2007AA)
139 source vocabularies
• 17 languages
Broad coverage of biomedicine
• 5.5M names
• 1.4M concepts
• 16M relations
Common presentation
Semantic Types
Semantic Network
• Anatomical Structure
• Fully Formed Anatomical Structure
• Body Part, Organ or Organ Component
• Embryonic Structure
• Pharmacologic Substance
• Disease or Syndrome
• Population Group
Metathesaurus concepts
• Esophagus
• Left Phrenic Nerve
• Mediastinum
• Heart
• Heart Valves
• Fetal Heart
• Saccular Viscus
• Angina Pectoris
• Cardiotonic Agents
• Tissue Donors
Biomedical forest vs. graph
UMLS Knowledge Source Server
http://umlsks.nlm.nih.gov/
Addison’s disease in UMLSKS (1)
Addison’s disease in UMLSKS (2)
Addison’s disease in UMLSKS (3)
Addison’s disease in UMLSKS (4)
Addison’s disease in UMLSKS (5)
UMLS Semantic Navigator
http://mor.nlm.nih.gov/perl/semnav.pl
AmiGO
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
GenNav
http://mor.nlm.nih.gov/perl/gennav.pl
Semantics of the UMLS graph
Issues and challenges
Visualization of large graphs
Acyclicity
“Back edge” from a child concept to a parent concept
[Figure: example cycles over nodes A, B, D, E, G, H]
• Reflexive: 13,000
• Direct: 1,800
• Indirect: 120
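One way to find such back edges is sketched below. The slide does not define the three categories, so the reading used here is an assumption: "reflexive" means a concept listed as its own parent, "direct" means two concepts that are each other's parent, and "indirect" means a cycle that closes through a longer chain of parents. The graph and function names are made up for the example.

```python
def reaches(parents, start, target):
    """Return True if target is reachable from start by following parent links."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent == target:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


def classify_cyclic_edges(parents):
    """Classify every child->parent edge that takes part in a cycle.

    parents maps each concept to the set of its parents. The three labels
    follow the assumed reading of the slide described in the lead-in.
    """
    found = {"reflexive": [], "direct": [], "indirect": []}
    for child in sorted(parents):
        for parent in sorted(parents[child]):
            if child == parent:
                found["reflexive"].append((child, parent))
            elif child in parents.get(parent, ()):
                found["direct"].append((child, parent))
            elif reaches(parents, parent, child):
                found["indirect"].append((child, parent))
    return found


# Hypothetical toy hierarchy (child -> parents) containing one cycle of each kind.
TOY = {
    "A": {"A"},   # A is its own parent
    "B": {"C"},   # B and C are each other's parent
    "C": {"B"},
    "D": {"E"},   # D -> E -> F -> D closes through a longer path
    "E": {"F"},
    "F": {"D"},
    "G": {"D"},   # G hangs off the cycle but is not part of it
}

if __name__ == "__main__":
    for kind, edges in classify_cyclic_edges(TOY).items():
        print(kind, edges)
```

On a graph the size of the Metathesaurus, repeating a reachability search per edge would be slow; a single depth-first pass or a strongly connected components algorithm would be the more usual choice. The per-edge version is kept here because it mirrors the three categories directly.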
Underspecification of relationships
Relationship “attribute” not always present
Relations used to create hierarchies vs. hierarchical relations
Which tasks?
Information integration
Mapping
Depending on the degree of human involvement
• Hypothesis generation / validation
• Knowledge discovery
• Automated reasoning
Knowledge standardization
• Common format
• Common semantics
Which formalisms?
SKOS – Thesaurus
• Simple Knowledge Organization System
RDF – Concept-Relationship-Concept triples
• Resource Description Framework
Description Logics / Frames
• OWL Web Ontology Language
• Protégé (frames / OWL)
• OBO Open Biomedical Ontologies
Rule languages
Formal logic
Which identifiers?
For concepts
• Namespaces, ontologies, knowledge bases
• OBO – Open Biomedical Ontologies
• UMLS – Unified Medical Language System
• NCBI Entrez (Entrez Gene, GenBank, UniGene, …)
• Mappings across information sources
For relationships
Conclusions
Integrating subdomains
[Figure: UMLS at the center, linking subdomains through their vocabularies]
• Clinical repositories: SNOMED
• Genetic knowledge bases: OMIM
• Other subdomains: …
• Model organisms: NCBI Taxonomy
• Biomedical literature: MeSH
• Anatomy: UWDA
• Genome annotations: GO
From glycosyltransferase to congenital muscular dystrophy
Gene Ontology terms linked by isa: GO:0016757 (glycosyltransferase), GO:0008194, GO:0016758, GO:0008375 (acetylglucosaminyltransferase)
Entrez Gene: LARGE (EG:9215)
LARGE has_molecular_function GO:0008375 (acetylglucosaminyltransferase)
LARGE has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D)
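The path from a molecular function to a disease can be followed over a handful of concept-relationship-concept triples. The sketch below stores the identifiers from the slide as plain tuples and walks them; it is not the RDF tooling used in the cited work. The direction and order of the isa links between the four GO terms are an assumption, since the slide only shows that they are connected, and the function names are hypothetical.

```python
# (subject, relationship, object) triples built from the identifiers on the slide.
# ASSUMPTION: the order of the intermediate GO terms in the isa chain.
TRIPLES = [
    ("GO:0008375", "isa", "GO:0008194"),
    ("GO:0008194", "isa", "GO:0016758"),
    ("GO:0016758", "isa", "GO:0016757"),
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
]


def with_descendants(triples, root):
    """Return root plus every term that reaches it through isa links."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for subject, relation, obj in triples:
            if relation == "isa" and obj in found and subject not in found:
                found.add(subject)
                changed = True
    return found


def phenotypes_for_function(triples, function_id):
    """Find (gene, phenotype) pairs for genes annotated with the function
    or with any of its isa descendants."""
    functions = with_descendants(triples, function_id)
    genes = {
        s for s, r, o in triples
        if r == "has_molecular_function" and o in functions
    }
    return sorted(
        (s, o) for s, r, o in triples
        if r == "has_associated_phenotype" and s in genes
    )


if __name__ == "__main__":
    # Start from glycosyltransferase (GO:0016757) and end at the phenotype.
    print(phenotypes_for_function(TRIPLES, "GO:0016757"))
```

Starting from the general term GO:0016757, the traversal descends the isa links to GO:0008375, picks up the gene annotated with that function, and returns the phenotype associated with the gene.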
Medical Ontology Research
Contact: olivier@nlm.nih.gov
Web: mor.nlm.nih.gov
Olivier Bodenreider
Lister Hill National Center
for Biomedical Communications
Bethesda, Maryland - USA
UMLS References
UMLS
• umlsinfo.nlm.nih.gov
UMLS browsers (free, but UMLS license required)
• Knowledge Source Server: umlsks.nlm.nih.gov
• Semantic Navigator: http://mor.nlm.nih.gov/perl/semnav.pl
• RRF browser (standalone application distributed with the UMLS)
UMLS References
Gentle introduction
• Bodenreider O. (2004). The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research; D267-D270.
http://mor.nlm.nih.gov/pubs/pdf/2004-nar-ob.pdf
Seminal paper
• Lindberg, D. A., Humphreys, B. L., & McCray, A. T. (1993). The Unified Medical Language System. Methods Inf Med, 32(4), 281-91.
Biomedical information integration through RDF
Biomedical perspective
• Sahoo S, Zeng K, Bodenreider O, Sheth AP. (2007). From “glycosyltransferase” to “congenital muscular dystrophy”: Integrating knowledge from NCBI Entrez Gene and the Gene Ontology. Proceedings of Medinfo (in press).
http://mor.nlm.nih.gov/pubs/pdf/2007-medinfo-ss.pdf
Semantic Web perspective
• Sahoo S, Zeng K, Bodenreider O, Sheth AP. (2007). An experiment in integrating large biomedical knowledge resources with RDF: Application to associating genotype and phenotype information. Proceedings of the workshop on Health Care and Life Sciences Data Integration for the Semantic Web at the 16th International World Wide Web Conference (WWW2007) (in press).
http://mor.nlm.nih.gov/pubs/pdf/2007-www_hcls-ss.pdf