R / Bioconductor packages for gene and genome annotation
Martin Morgan
Bioconductor / Fred Hutchinson Cancer Research Center
Seattle, WA, USA
15-19 June 2009
Annotations
Scenario
◮ Differential expression analysis complete, probesets for further investigation identified
Desire: understand genes that have been identified
◮ Gene name?
◮ Chromosome location?
◮ Gene ontologies, pathways?
◮ . . .
Major options
AnnotationDbi packages
◮ Chip, e.g., hgu95av2.db
◮ Organism, e.g., org.Hs.eg.db
◮ Pathways / Ontologies, e.g., GO.db
◮ Other, e.g., HapMap SNPs, http://bioconductor.org/packages/release/AnnotationData.html
biomaRt
◮ Query web-based ‘biomart’ resources for genes, sequence, SNPs, homologs, . . .
Manufacturer-provided annotations, e.g., in two-color gpr files
Pro and con
AnnotationDbi
◮ Consistent across analyses
◮ Reliably accessible; obtain with biocLite
◮ Careful secondary curation at Bioconductor
biomaRt
◮ Updated continuously
◮ Careful secondary curation at EBI
Using AnnotationDbi packages
> library(org.Hs.eg.db)
> org.Hs.eg()
> # Quality control information for org.Hs.eg:
> #
> # This package has the following mappings:
> # ...
> # org.Hs.egGENENAME has 40596 mapped keys (of 40596 keys)
> # org.Hs.egGO has 17593 mapped keys (of 40596 keys)
> # ...
> ls("package:org.Hs.eg.db")
> # [1] "org.Hs.eg" "org.Hs.egACCNUM"
> # [3] "org.Hs.egACCNUM2EG" "org.Hs.egALIAS2EG"
> # ...
> org.Hs.eg_dbInfo()
Basic structure: bi-maps with Lkeys and Rkeys
Bi-maps, e.g., from ENTREZ id to GENENAME (and reverse)
> org.Hs.egGENENAME
GENENAME map for Human (object of class "AnnDbBimap")
> map <- org.Hs.egGENENAME
> head(Lkeys(map))
[1] "1" "10" "100"
[4] "1000" "10000" "100008586"
> map[["1000"]]
[1] "cadherin 2, type 1, N-cadherin (neuronal)"
Manipulating maps
Subset (numeric or character)
> submap <- map[c("1", "100")]
> submap
GENENAME submap for Human (object of class "AnnDbBimap")
As data frame or list
> toTable(submap)
  gene_id              gene_name
1       1 alpha-1-B glycoprotein
2     100    adenosine deaminase
Reverse (!)
> revmap(map)[["adenosine deaminase"]]
[1] "100"
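To make the bi-map idea concrete without installing an annotation package, here is a minimal base-R sketch. It is not how AnnotationDbi implements `AnnDbBimap` objects; the names `toy_map`, `lookup`, and `reverse` are hypothetical, and the three rows are simply the gene names shown in the slide output.

```r
# Toy bi-map: a two-column table of left keys (ENTREZ ids) and right keys
# (gene names). Hypothetical stand-in for an AnnDbBimap; data from the slides.
toy_map <- data.frame(
  gene_id   = c("1", "100", "1000"),
  gene_name = c("alpha-1-B glycoprotein",
                "adenosine deaminase",
                "cadherin 2, type 1, N-cadherin (neuronal)"),
  stringsAsFactors = FALSE
)

# Forward lookup: match on the first column, return the second.
# Roughly analogous to map[["1000"]].
lookup <- function(map, key) map[[2]][map[[1]] == key]

# Reversing a bi-map amounts to swapping the two columns.
# Roughly analogous to revmap(map).
reverse <- function(map) map[, 2:1]

left_keys  <- toy_map$gene_id    # analogous to Lkeys(map)
right_keys <- toy_map$gene_name  # analogous to Rkeys(map)

forward_hit <- lookup(toy_map, "1000")
reverse_hit <- lookup(reverse(toy_map), "adenosine deaminase")
print(forward_hit)
print(reverse_hit)
```

Because both directions are just views of the same table, the forward and reverse lookups always stay consistent with each other.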
Common issues
Not all maps are precisely as described above
◮ Chip packages: ‘Lkey’ is probeset id; pathway packages: ‘Lkey’ is pathway id
◮ Some maps are not reversible, e.g., org.Hs.egCHRLOC
Symbol to ENTREZ
> org.Hs.egSYMBOL[["10316"]]
[1] "NMUR1"
> revmap(org.Hs.egALIAS2EG)[["10316"]]
[1] "(FM-3)" "FM-3" "FM3" "GPC-R" "GPR66"
[6] "NMU1R" "NMUR1"
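The alias example shows that a map need not be one-to-one: a gene has one official symbol but several aliases. The base-R sketch below illustrates that shape with the aliases listed on the slide. The object names `aliases`, `alias2eg`, and `eg2alias` are hypothetical, and this toy table holds a single gene, whereas a real alias map covers many genes and an alias is not guaranteed to be unique to one gene.

```r
# Alias table for ENTREZ id 10316, using the aliases shown on the slide.
aliases <- data.frame(
  gene_id = "10316",
  alias   = c("(FM-3)", "FM-3", "FM3", "GPC-R", "GPR66", "NMU1R", "NMUR1"),
  stringsAsFactors = FALSE
)

# alias -> gene id: a named vector, one entry per alias
alias2eg <- setNames(aliases$gene_id, aliases$alias)

# gene id -> aliases: reversing groups every alias under its gene id
eg2alias <- split(aliases$alias, aliases$gene_id)

print(alias2eg[["GPR66"]])
print(eg2alias[["10316"]])
```

When starting from user-supplied gene names, looking them up as aliases rather than as official symbols is the more forgiving choice, at the cost of possible ambiguity.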
Advanced examples I
Between-table joins
◮ ENTREZ id 1 has 3 GO terms associated with it
◮ The second term has GO id GO:0005576
◮ This term is described in detail in the GO.db ontology annotation package
> go <- org.Hs.egGO[["1"]]
> length(go)
[1] 3
> (goid <- go[[2]][["GOID"]])
[1] "GO:0005576"
Advanced examples II
> library(GO.db)
> GOTERM[[goid]]
GOID: GO:0005576
Term: extracellular region
Ontology: CC
Definition: The space external to the
outermost structure of a cell. For cells
without external protective or external
encapsulating structures this refers to
space outside of the plasma membrane.
This term covers the host cell
environment outside an intracellular
parasite.
Synonym: extracellular
Advanced examples III
Direct SQL queries
◮ AnnotationDbi stores information as SQLite tables
◮ Use org.Hs.eg_dbschema() to discover table structure
◮ Get the database connection with org.Hs.eg_dbconn()
◮ Compose and evaluate a SQL statement with the interface provided by the DBI package
Advanced examples IV
> conn <- org.Hs.eg_dbconn()
> dbGetQuery(conn, "SELECT * FROM gene_info LIMIT 3;")
  _id                        gene_name symbol
1   1           alpha-1-B glycoprotein   A1BG
2   2            alpha-2-macroglobulin    A2M
3   3 alpha-2-macroglobulin pseudogene   A2MP
> ## join
> sql <- "..."
> dbGetQuery(conn, sql)
  gene_id path_id
1       2   04610
2       9   00232
3       9   00983
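A join through DBI can be tried out without the annotation database. The sketch below builds an in-memory SQLite database, assuming the DBI and RSQLite packages are installed. The table names `genes` and `kegg`, their layout, and the `_id` values are assumptions made for illustration; the actual schema is whatever org.Hs.eg_dbschema() reports. Only the gene and pathway ids are taken from the slide output.

```r
library(DBI)

# In-memory SQLite database with two hypothetical tables linked by an
# internal `_id` key. The `_id` values are invented for this sketch.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

genes <- data.frame(`_id` = c(1L, 2L),
                    gene_id = c("2", "9"),
                    check.names = FALSE, stringsAsFactors = FALSE)
kegg  <- data.frame(`_id` = c(1L, 2L, 2L),
                    path_id = c("04610", "00232", "00983"),
                    check.names = FALSE, stringsAsFactors = FALSE)
dbWriteTable(conn, "genes", genes)
dbWriteTable(conn, "kegg", kegg)

# Join the two tables on the shared internal key; ORDER BY makes the
# row order deterministic.
sql <- "SELECT genes.gene_id, kegg.path_id
        FROM genes INNER JOIN kegg ON genes._id = kegg._id
        ORDER BY genes._id, kegg.path_id"
res <- dbGetQuery(conn, sql)
dbDisconnect(conn)
print(res)
```

Storing the pathway ids as character keeps their leading zeros, which would be lost if they were stored as integers.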
Advanced examples V
Custom annotation packages are ‘easy’ to create
◮ See the SQLForge vignette in AnnotationDbi
biomaRt I
Basic work flow
◮ Discover and select a ‘mart’, e.g., ensembl
◮ Discover and select a ‘dataset’, e.g., hsapiens_gene_ensembl
◮ Discover and select a ‘filter’, e.g., entrezgene
◮ Compose a query
biomaRt II
> library(biomaRt)
> listMarts()
> mart0 <- useMart("ensembl")
> listDatasets(mart0)
> mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> listFilters(mart)
> getGene(id = "100", type = "entrezgene",
+ mart = mart)
> getGene(id = "1939_at", type = "affy_hg_u133_plus_2",
+ mart = mart)
Intermediate and advanced biomaRt
Complex queries
> getBM(
+ attributes=c("affy_hg_u95av2", "hgnc_symbol",
+ "chromosome_name", "band"),
+ filters="affy_hg_u95av2",
+ values=c("1939_at", "1503_at", "1454_at"),
+ mart=mart)
Marts can be installed locally
◮ Non-public data
◮ Reliable connectivity
Additional packages
◮ GenomeGraphs for (very) pretty display of annotation and expression data
◮ rtracklayer for exporting expression and genomic coordinates for visualization in web browsers
Summary
AnnotationDbi
◮ Curated, reliable organismal, chip, and pathway annotations
◮ Accessible on the desktop
◮ Advanced users can query with SQL, and create their own databases.
biomaRt
◮ Curated, diverse annotations
◮ Accessible via the internet
◮ Advanced users can install their own mart for private and reliable access.
Other packages
◮ GenomeGraphs for visualization, rtracklayer for export to web browsers