There are a few methods of importing annotation data into R. I will attempt some of them in preparation for loading them into the L.major RNASeq data.
AnnotationHub is a newer service and has promise to be an excellent top-level resource for gathering annotation data.
tmp <- sm(library(AnnotationHub))
ah = sm(AnnotationHub())
orgdbs <- sm(query(ah, "OrgDb"))
lm_orgdb <- sm(query(ah, c("OrgDB", "Leishmania")))
lm_orgdb
## AnnotationHub with 4 records
## # snapshotDate(): 2017-10-27
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Leishmania donovani, Leishmania major_strain_Friedlin, Leis...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH59612"]]'
##
## title
## AH59612 | org.Leishmania_major_strain_Friedlin.eg.sqlite
## AH59675 | org.Leishmania_mexicana_MHOM|GT|2001|U1103.eg.sqlite
## AH59676 | org.Leishmania_donovani.eg.sqlite
## AH59684 | org.Leishmania_panamensis.eg.sqlite
lm_orgdb <- ah[["AH59612"]]
## loading from cache '/home/trey//.AnnotationHub/66358'
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:AnnotationHub':
##
## cache
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
lm_orgdb
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Leishmania major_strain_Friedlin
## | SPECIES: Leishmania major_strain_Friedlin
## | CENTRALID: GID
## | Taxonomy ID: 347515
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
##
## Please see: help('select') for usage information
## Holy crap it worked!
lm_annotv1 <- load_orgdb_annotations(lm_orgdb,
keytype="entrezid",
fields=c("entrezid", "alias", "genename", "refseq", "symbol"))
## Unable to find TYPE in the db, removing it.
## Unable to find TXSTRAND in the db, removing it.
## Unable to find TXSTART in the db, removing it.
## Unable to find TXEND in the db, removing it.
## Extracted all gene ids.
## 'select()' returned 1:many mapping between keys and columns
summary(lm_annotv1)
## Length Class Mode
## genes 6 data.frame list
## transcripts 0 -none- NULL
lm_annotv1 <- lm_annotv1[["genes"]]
head(lm_annotv1)
## entrezid genename chr alias refseq
## X12980396 12980396 ncRNA 20 LMJF_20_snoRNA0211 XR_002460236.1
## X12980396.1 12980396 ncRNA 20 LMJF_20_snoRNA0211 XR_002460237.1
## X12980396.2 12980396 ncRNA 20 LM20Cs1C1.1 XR_002460236.1
## X12980396.3 12980396 ncRNA 20 LM20Cs1C1.1 XR_002460237.1
## X12980397 12980397 ncRNA 20 LMJF_20_snoRNA0220 XR_002460308.1
## X12980397.1 12980397 ncRNA 20 LMJF_20_snoRNA0220 XR_002460309.1
## symbol
## X12980396 LM20Cs1C1.1
## X12980396.1 LM20Cs1C1.1
## X12980396.2 LM20Cs1C1.1
## X12980396.3 LM20Cs1C1.1
## X12980397 LM20Cs1C1.10
## X12980397.1 LM20Cs1C1.10
A completely separate and competing annotation source is biomart.
lm_annotv2 <- sm(load_biomart_annotations(species="lmajor", host="protists.ensembl.org"))$annotation
head(lm_annotv2)
## transcriptID geneID
## LmjF.01.0010.mRNA LmjF.01.0010:mRNA LmjF.01.0010
## LmjF.01.0020.mRNA LmjF.01.0020:mRNA LmjF.01.0020
## LmjF.01.0030.mRNA LmjF.01.0030:mRNA LmjF.01.0030
## LmjF.01.0040.mRNA LmjF.01.0040:mRNA LmjF.01.0040
## LmjF.01.0050.mRNA LmjF.01.0050:mRNA LmjF.01.0050
## LmjF.01.0060.mRNA LmjF.01.0060:mRNA LmjF.01.0060
## Description
## LmjF.01.0010.mRNA hypothetical protein, unknown function.[Source:GeneDB;Acc:LmjF.01.0010]
## LmjF.01.0020.mRNA hypothetical protein, conserved.[Source:GeneDB;Acc:LmjF.01.0020]
## LmjF.01.0030.mRNA MCAK-like kinesin, putative.[Source:GeneDB;Acc:LmjF.01.0030]
## LmjF.01.0040.mRNA hypothetical protein, unknown function.[Source:GeneDB;Acc:LmjF.01.0040]
## LmjF.01.0050.mRNA carboxylase, putative.[Source:GeneDB;Acc:LmjF.01.0050]
## LmjF.01.0060.mRNA hypothetical protein, conserved.[Source:GeneDB;Acc:LmjF.01.0060]
## Type length chromosome strand start end
## LmjF.01.0010.mRNA protein_coding 999 1 -1 3704 4702
## LmjF.01.0020.mRNA protein_coding 1650 1 -1 5790 7439
## LmjF.01.0030.mRNA protein_coding 2007 1 -1 9061 11067
## LmjF.01.0040.mRNA protein_coding 570 1 -1 12073 12642
## LmjF.01.0050.mRNA protein_coding 1998 1 -1 15025 17022
## LmjF.01.0060.mRNA protein_coding 750 1 -1 18137 18886
lm_ontology <- sm(load_biomart_go("lmajor", host="protists.ensembl.org"))
The hpgltools package has some improved methods for collecting annotation information directly from the eupathdb webservices api. It does this by first downloading all the available data for a given species and then creating a sqlite orgdb instance from them.
The creation of orgdbs takes a long time, so here is an example invocation.
testing_major <- make_eupath_organismdbi("major")
Assuming the above packages got created, we may load them and extract the annotation data.
major_names <- get_eupath_pkgnames("major")
## Starting metadata download.
## Finished metadata download.
## Found the following hits: Leishmania major strain Friedlin, Leishmania major strain LV39c5, Leishmania major strain SD 75.1, choosing the first.
major_names$orgdb
## [1] "org.Lmajor.Friedlin.v36.eg.db"
wanted_fields <- c("cds_length", "chromosome", "entrez_gene_id" , "gene_name_or_symbol",
"gene_strand", "gid", "go_go_id", "go_go_term_name", "go_ontology",
"interpro_description" ,"interpro_e_value", "type_gene_type")
lm_org <- load_orgdb_annotations("org.Lmajor.Friedlin.v36.eg.db", keytype="gid", fields=wanted_fields)$genes
##
## Unable to find GENENAME, setting it to GENE_NAME_OR_SYMBOL.
## Unable to find TYPE in the db, removing it.
## Unable to find CHR in the db, removing it.
## Unable to find TXSTRAND in the db, removing it.
## Unable to find TXSTART in the db, removing it.
## Unable to find TXEND in the db, removing it.
## Extracted all gene ids.
## 'select()' returned 1:many mapping between keys and columns
knitr::kable(head(lm_org))
gid | gene_name_or_symbol | cds_length | chromosome | entrez_gene_id | gene_strand | go_go_id | go_go_term_name | go_ontology | interpro_description | interpro_e_value | type_gene_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LmjF.01.0010 | LmjF.01.0010 | 999.0 | 1 | reverse | NA | NA | NA | Protein of unknown function (DUF2946) | 7.7e-18 | protein coding | ||
LmjF.01.0010.1 | LmjF.01.0010 | 999.0 | 1 | reverse | NA | NA | NA | Prokaryotic membrane lipoprotein lipid attachment site profile | 5.0 | protein coding | ||
LmjF.01.0020 | LmjF.01.0020 | 1650.0 | 1 | reverse | NA | NA | NA | Endonuclease/Exonuclease/phosphatase family | 1.3e-10 | protein coding | ||
LmjF.01.0020.1 | LmjF.01.0020 | 1650.0 | 1 | reverse | NA | NA | NA | DNase I-like | 7.9e-21 | protein coding | ||
LmjF.01.0030 | LmjF.01.0030 | KIN13-1 | 2007.0 | 1 | reverse | GO:0007018 | microtubule-based movement | Biological Process | Kinesin motor domain | 1.2e-94 | protein coding | |
LmjF.01.0030.1 | LmjF.01.0030 | KIN13-1 | 2007.0 | 1 | reverse | GO:0007018 | microtubule-based movement | Biological Process | Kinesin motor domain profile | 43.0 | protein coding |
In contrast, it is possible to load most annotations of interest directly from the gff files used in the alignments.
## The old way of getting genome/annotation data
lm_gff <- "reference/lmajor.gff.gz"
lm_gff_annotations <- load_gff_annotations(lm_gff, type="gene")
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Returning a df with 24 columns and 9379 rows.
rownames(lm_gff_annotations) <- make.names(lm_gff_annotations$Name, unique=TRUE)
head(lm_gff_annotations)
## seqnames start end width strand source type score phase
## LmjF.01.0010 LmjF.01 3704 4702 999 - TriTrypDB gene NA NA
## LmjF.01.0020 LmjF.01 5790 7439 1650 - TriTrypDB gene NA NA
## LmjF.01.0030 LmjF.01 9061 11067 2007 - TriTrypDB gene NA NA
## LmjF.01.0040 LmjF.01 12073 12642 570 - TriTrypDB gene NA NA
## LmjF.01.0050 LmjF.01 15025 17022 1998 - TriTrypDB gene NA NA
## LmjF.01.0060 LmjF.01 18137 18886 750 - TriTrypDB gene NA NA
## ID Name
## LmjF.01.0010 LmjF.01.0010 LmjF.01.0010
## LmjF.01.0020 LmjF.01.0020 LmjF.01.0020
## LmjF.01.0030 LmjF.01.0030 LmjF.01.0030
## LmjF.01.0040 LmjF.01.0040 LmjF.01.0040
## LmjF.01.0050 LmjF.01.0050 LmjF.01.0050
## LmjF.01.0060 LmjF.01.0060 LmjF.01.0060
## description size web_id
## LmjF.01.0010 hypothetical+protein,+unknown+function 999 LmjF.01.0010
## LmjF.01.0020 hypothetical+protein,+conserved 1650 LmjF.01.0020
## LmjF.01.0030 Kinesin-13+1,+putative+(KIN13-1) 2007 LmjF.01.0030
## LmjF.01.0040 hypothetical+protein,+unknown+function 570 LmjF.01.0040
## LmjF.01.0050 carboxylase,+putative 1998 LmjF.01.0050
## LmjF.01.0060 hypothetical+protein,+conserved 750 LmjF.01.0060
## molecule_type organism_name translation_table topology
## LmjF.01.0010 <NA> <NA> <NA> <NA>
## LmjF.01.0020 <NA> <NA> <NA> <NA>
## LmjF.01.0030 <NA> <NA> <NA> <NA>
## LmjF.01.0040 <NA> <NA> <NA> <NA>
## LmjF.01.0050 <NA> <NA> <NA> <NA>
## LmjF.01.0060 <NA> <NA> <NA> <NA>
## localization Dbxref locus_tag
## LmjF.01.0010 <NA> LmjF.01.0010
## LmjF.01.0020 <NA> LmjF.01.0020
## LmjF.01.0030 <NA> LmjF.01.0030
## LmjF.01.0040 <NA> LmjF.01.0040
## LmjF.01.0050 <NA> LmjF.01.0050
## LmjF.01.0060 <NA> LmjF.01.0060
## Alias
## LmjF.01.0010 321438052, 389592307, LmjF1.0010, LmjF01.0010, LmjF.01.0010, LmjF01.0010:pep, LmjF01.0010:mRNA
## LmjF.01.0020 321438053, 389592309, LmjF1.0020, LmjF01.0020, LmjF.01.0020, LmjF01.0020:pep, LmjF01.0020:mRNA
## LmjF.01.0030 KIN13-1, Kif-13-1, 321438054, 389592311, LmjF1.0030, LmjF01.0030, LmjF.01.0030, LmjF01.0030:pep, LmjF01.0030:mRNA
## LmjF.01.0040 321438055, 389592313, LmjF1.0040, LmjF01.0040, LmjF.01.0040, LmjF01.0040:pep, LmjF01.0040:mRNA
## LmjF.01.0050 321438056, 389592315, LmjF1.0050, LmjF01.0050, LmjF.01.0050, LmjF01.0050:pep, LmjF01.0050:mRNA
## LmjF.01.0060 321438057, 389592317, LmjF1.0060, LmjF01.0060, LmjF.01.0060, LmjF01.0060:pep, LmjF01.0060:mRNA
## Parent Ontology_term
## LmjF.01.0010
## LmjF.01.0020
## LmjF.01.0030
## LmjF.01.0040
## LmjF.01.0050
## LmjF.01.0060
In the following block we create an expressionset using the sample sheet and the annotations.
Annoyingly, the gff annotations are keyed in a peculiar fashion. Therefore I need to do a little work to merge them.
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 7de4503f6bb5724c28cce24af5dbee22bb1c0cae
## R> packrat::restore()
## This is hpgltools commit: Thu Apr 12 22:08:53 2018 -0400: 7de4503f6bb5724c28cce24af5dbee22bb1c0cae
## Saving to 01_annotation_lmajor_201804-v20180410.rda.xz