
R / Bioconductor packages for gene and genome annotation

Martin Morgan
Bioconductor / Fred Hutchinson Cancer Research Center
Seattle, WA, USA

15-19 June 2009


Annotations

Scenario
◮ Differential expression analysis complete, probesets for further investigation identified

Desire: understand genes that have been identified
◮ Gene name?
◮ Chromosome location?
◮ Gene ontologies, pathways?
◮ . . .


Major options

AnnotationDbi packages
◮ Chip, e.g., hgu95av2.db
◮ Organism, e.g., org.Hs.eg.db
◮ Pathways / Ontologies, e.g., GO.db
◮ Other, e.g., HapMap SNPs, http://bioconductor.org/packages/release/AnnotationData.html

biomaRt
◮ Query web-based ‘biomart’ resources for genes, sequence, SNPs, homologs, . . .

Manufacturer-provided annotations, e.g., in two-color gpr files


Pro and con

AnnotationDbi
◮ Consistent across analyses
◮ Reliably accessible; obtain with biocLite
◮ Careful secondary curation at Bioconductor

biomaRt
◮ Updated continuously
◮ Careful secondary curation at EBI


Using AnnotationDbi packages

> library(org.Hs.eg.db)
> org.Hs.eg()
> # Quality control information for org.Hs.eg:
> #
> # This package has the following mappings:
> # ...
> # org.Hs.egGENENAME has 40596 mapped keys (of 40596 keys)
> # org.Hs.egGO has 17593 mapped keys (of 40596 keys)
> # ...
> ls("package:org.Hs.eg.db")
> # [1] "org.Hs.eg" "org.Hs.egACCNUM"
> # [3] "org.Hs.egACCNUM2EG" "org.Hs.egALIAS2EG"
> # ...
> org.Hs.eg_dbInfo()


Basic structure: bi-maps with Lkeys and Rkeys

Bi-maps, e.g., from ENTREZ id to GENENAME (and reverse)

> org.Hs.egGENENAME
GENENAME map for Human (object of class "AnnDbBimap")
> map <- org.Hs.egGENENAME
> head(Lkeys(map))
[1] "1" "10" "100"
[4] "1000" "10000" "100008586"
> map[["1000"]]
[1] "cadherin 2, type 1, N-cadherin (neuronal)"


Manipulating maps

Subset (numeric or character)

> submap <- map[c("1", "100")]
> submap
GENENAME submap for Human (object of class "AnnDbBimap")

As data frame or list

> toTable(submap)
  gene_id              gene_name
1       1 alpha-1-B glycoprotein
2     100    adenosine deaminase

Reverse (!)

> revmap(map)[["adenosine deaminase"]]
[1] "100"
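The slide shows the data frame form only. The following is a minimal sketch of both forms side by side, assuming org.Hs.eg.db is installed; the variable names `ids`, `as_list`, and `as_table` are illustrative and do not come from the slides, and the gene names returned depend on the installed annotation release.

```r
library(org.Hs.eg.db)

ids <- c("1", "100")                 # Entrez ids shown in the toTable() output
map <- org.Hs.egGENENAME

# mget() returns a named list with one element per requested key;
# ifnotfound = NA avoids an error for ids that are absent from the map
as_list <- mget(ids, map, ifnotfound = NA)

# toTable() returns the same information as a data frame
as_table <- toTable(map[ids])

str(as_list)
print(as_table)
```

The list form is convenient when one key can map to several values; the data frame form is easier to merge with other tables.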


Common issues

Not all maps are precisely as described above
◮ Chip packages: ‘Lkey’ is probeset id; pathway packages: ‘Lkey’ is pathway id
◮ Some maps are not reversible, e.g., org.Hs.egCHRLOC

Symbol to ENTREZ

> org.Hs.egSYMBOL[["10316"]]
[1] "NMUR1"
> revmap(org.Hs.egALIAS2EG)[["10316"]]
[1] "(FM-3)" "FM-3" "FM3" "GPC-R" "GPR66"
[6] "NMU1R" "NMUR1"
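The listing above goes from an ENTREZ id to its symbol and aliases. A sketch of the opposite direction, from alias to ENTREZ id, might look as follows. It assumes org.Hs.eg.db is installed; the variable names are illustrative, and the ids returned depend on the annotation release (an alias can legitimately map to more than one gene).

```r
library(org.Hs.eg.db)

aliases <- c("NMUR1", "GPR66", "FM3")    # aliases taken from the slide output

# org.Hs.egALIAS2EG maps alias -> Entrez id; each list element is a
# character vector because one alias may belong to several genes
entrez <- mget(aliases, org.Hs.egALIAS2EG, ifnotfound = NA)
print(entrez)

# Official symbol(s) for the ids found for the first alias
hits <- entrez[["NMUR1"]]
if (!all(is.na(hits))) {
  print(unlist(mget(hits, org.Hs.egSYMBOL, ifnotfound = NA)))
}
```

Checking `lengths(entrez)` before using the result is a cheap way to spot ambiguous aliases.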


Advanced examples I

Between-table joins
◮ ENTREZ id 1 has 3 GO terms associated with it
◮ The second term has GO id GO:0005576
◮ This term is described in detail in the GO.db ontology annotation package

> go <- org.Hs.egGO[["1"]]
> length(go)
[1] 3
> (goid <- go[[2]][["GOID"]])
[1] "GO:0005576"


Advanced examples II

> library(GO.db)
> GOTERM[[goid]]
GOID: GO:0005576
Term: extracellular region
Ontology: CC
Definition: The space external to the
    outermost structure of a cell. For cells
    without external protective or external
    encapsulating structures this refers to
    space outside of the plasma membrane.
    This term covers the host cell
    environment outside an intracellular
    parasite.
Synonym: extracellular
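The lookup above handles a single GO id. A sketch that repeats it for every GO annotation of ENTREZ id 1 is given below, assuming org.Hs.eg.db and GO.db are installed from the same Bioconductor release. The variable names are illustrative, and the number of terms returned will likely differ from the 3 reported in 2009.

```r
library(org.Hs.eg.db)
library(GO.db)

go <- org.Hs.egGO[["1"]]             # list of GO annotations for Entrez id 1

# Each element is itself a list holding GOID, Evidence and Ontology
goids <- unique(vapply(go, function(x) x[["GOID"]], character(1)))

# Fetch the GO.db records; ids missing from GO.db come back as NA
records <- mget(goids, GOTERM, ifnotfound = NA)

# Term() extracts the term name from a GOTerms object
terms <- vapply(records, function(r) {
  if (is(r, "GOTerms")) Term(r) else NA_character_
}, character(1))

print(data.frame(goid = goids, term = terms, row.names = NULL))
```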


Advanced examples III

Direct SQL queries
◮ AnnotationDbi stores information as SQLite tables
◮ Use org.Hs.eg_dbschema() to discover table structure
◮ Get the database connection with org.Hs.eg_dbconn()
◮ Compose and evaluate a SQL statement with the interface provided by the DBI package
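The discovery steps listed above can be sketched with generic DBI calls. This assumes org.Hs.eg.db and DBI are installed; `gene_info` is the table queried elsewhere in these slides, and the check that it exists is a precaution in case the schema has changed since 2009.

```r
library(org.Hs.eg.db)
library(DBI)

conn <- org.Hs.eg_dbconn()           # connection owned by the annotation package

tables <- dbListTables(conn)         # names of all tables in the SQLite file
print(head(tables))

# Column names of one table, if it is present in this release
if ("gene_info" %in% tables) {
  print(dbListFields(conn, "gene_info"))
}

# Do not call dbDisconnect(conn): the package manages this connection
```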


Advanced examples IV

> conn <- org.Hs.eg_dbconn()
> dbGetQuery(conn, "SELECT * FROM gene_info LIMIT 3;")
  _id                        gene_name symbol
1   1           alpha-1-B glycoprotein   A1BG
2   2            alpha-2-macroglobulin    A2M
3   3 alpha-2-macroglobulin pseudogene   A2MP
> ## join
> sql <- ...
> dbGetQuery(conn, sql)
  gene_id path_id
1       2   04610
2       9   00232
3       9   00983
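The text of the join statement is not legible in the slide listing. The sketch below illustrates the general technique on a toy in-memory SQLite database rather than on the real annotation file. The table names `genes` and `kegg`, their columns, and the shared `_id` key are assumptions made for illustration; the three result rows simply mirror the output shown above. It requires the DBI and RSQLite packages.

```r
library(DBI)

# Toy database; table and column names are hypothetical stand-ins
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE genes (_id INTEGER PRIMARY KEY, gene_id TEXT)")
dbExecute(con, "CREATE TABLE kegg (_id INTEGER, path_id TEXT)")
dbExecute(con, "INSERT INTO genes VALUES (1, '1'), (2, '2'), (3, '9')")
dbExecute(con, "INSERT INTO kegg VALUES (2, '04610'), (3, '00232'), (3, '00983')")

# Inner join on the internal key: genes without a pathway drop out,
# genes with several pathways appear once per pathway
sql <- "SELECT genes.gene_id, kegg.path_id
        FROM genes INNER JOIN kegg ON genes._id = kegg._id
        ORDER BY genes._id, kegg.path_id"
res <- dbGetQuery(con, sql)
print(res)

dbDisconnect(con)                    # safe here: this connection is our own
```

Storing `path_id` as text keeps the leading zeros of the pathway identifiers.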


Advanced examples V

Custom annotation packages are ‘easy’ to create
◮ See the SQLForge vignette in AnnotationDbi


biomaRt I

Basic work flow
◮ Discover and select a ‘mart’, e.g., ensembl
◮ Discover and select a ‘dataset’, e.g., hsapiens_gene_ensembl
◮ Discover and select a ‘filter’, e.g., entrezgene
◮ Compose a query


biomaRt II

> library(biomaRt)
> listMarts()
> mart0 <- useMart("ensembl")
> listDatasets(mart0)
> mart <- useDataset("hsapiens_gene_ensembl", mart = mart0)
> listFilters(mart)
> getGene(id = "100", type = "entrezgene",
+     mart = mart)
> getGene(id = "1939_at", type = "affy_hg_u133_plus_2",
+     mart = mart)


Intermediate and advanced biomaRt

Complex queries

> getBM(
+     attributes=c("affy_hg_u95av2", "hgnc_symbol",
+         "chromosome_name", "band"),
+     filters="affy_hg_u95av2",
+     values=c("1939_at", "1503_at", "1454_at"),
+     mart=mart)

Marts can be installed locally
◮ Non-public data
◮ Reliable connectivity


Additional packages

◮ GenomeGraphs for (very) pretty display of annotation and expression data
◮ rtracklayer for exporting expression and genomic coordinates for visualization in web browsers


Summary

AnnotationDbi
◮ Curated, reliable organismal, chip, and pathway annotations
◮ Accessible on the desktop
◮ Advanced users can query with SQL, and create their own databases.

biomaRt
◮ Curated, diverse annotations
◮ Accessible via the internet
◮ Advanced users can install their own mart for private and reliable access.

Other packages
◮ GenomeGraphs for visualization, rtracklayer for export to web browsers
