There are a few methods of importing annotation data into R. The following are two attempts, the second is currently being used in these analyses.
AnnotationHub is a newer service and has promise to be an excellent top-level resource for gathering annotation data.
tt <- sm(library(AnnotationHub))
ah <- AnnotationHub()
orgdbs <- query(ah, "OrgDB")
annot_lb <- query(ah, c("OrgDB", "Leishmania"))
annot_lb
## It appears there are no braziliensis samples in AnnotationHub.
AnnotationHub is the new and fancier version of what OrganismDb does. Keith already made these for the parasites though, lets try and use one of those.
The OrganismDb packages are installable via Keith’s builder: https://github.com/elsayed-lab/eupathdb-organismdb
I did a git pull of it, changed a couple small things and ran ‘make lbraziliensis’. This pulls the annotations down from a mix of Najib’s directory on the cluster and the TriTrypdb. As a result, it currently gets the version 27 of the TriTrypDB data. It may be useful for me to download a newer revision (but I am reasonably certain not much (or anything) has changed for braziliensis in a while).
After 5 or so minutes a brand new package ‘Leishmania.braziliensis’ appeared in my R environment.
Since the original writing of this document, I have improved my ability to import/use data from the eupathdb. With that in mind, relatively recent versions of the hpgltools provide functionality to create relatively complete orgdb/txdb/organismdbi/bsgenome packages from most (any?) species in the eupathdb heirarchy.
The creation of an orgdb instance for a given eupath species is below, but it takes a long time and I already did it for most of the Leishmania, so I am including the function here, but not running it.
testing_braziliensis <- make_eupath_organismdbi("2904")
braziliensis_names <- get_eupath_pkgnames("2904")
## Starting metadata download.
## Finished metadata download.
## Found the following hits: Leishmania braziliensis MHOM/BR/75/M2904, choosing the first.
braziliensis_names$orgdb
## [1] "org.Lbraziliensis.MHOMBR75M2904.v36.eg.db"
lb_org <- load_orgdb_annotations(braziliensis_names$orgdb, keytype="gid", fields=wanted_fields)$genes
## Unable to find GENENAME, setting it to GENE_NAME_OR_SYMBOL.
## Unable to find TYPE in the db, removing it.
## Unable to find CHR in the db, removing it.
## Unable to find TXSTRAND in the db, removing it.
## Unable to find TXSTART in the db, removing it.
## Unable to find TXEND in the db, removing it.
## Extracted all gene ids.
## 'select()' returned 1:many mapping between keys and columns
In contrast, it is possible to load most annotations of interest directly from the gff files used in the alignments. More in-depth information for the human transcriptome may be extracted from biomart.
## The old way of getting genome/annotation data
lb_gff <- "reference/lbraziliensis.gff"
lb_fasta <- "reference/lbraziliensis.fasta.xz"
lb_annotations <- sm(load_gff_annotations(lb_gff, type="gene"))
rownames(lb_annotations) <- paste0("exon_", lb_annotations$web_id, ".1")
## While testing, I called this desc, that will need to change.
## lb_tooltips <- make_tooltips(lb_annotations)
lb_lengths <- lb_annotations[, c("ID", "width")]
lb_goids <- read.csv(file="reference/lbraz_go.txt.xz", sep="\t", header=FALSE)
colnames(lb_goids) <- c("ID","GO","ont","name","source","tag")
The macrophage experiment has samples across 2 contexts, the host and parasite. The following block sets up one experiment for each. If you open the all_samples-species.xlsx files, you will note immediately that a few different attempts were made at ascertaining the most likely experimental factors that contributed to the readily apparent batch effects.
Question: Are there any human reads in this data? I assume no, but will happily map human if so.
parasite_expt <- sm(create_expt("sample_sheet/all_samples.xlsx", gene_info=lb_annotations))
parasite_expt <- exclude_genes_expt(parasite_expt, column="web_id", patterns=c("rRNA", "tRNA"))
## Before removal, there were 8706 entries.
## Now there are 8633 entries.
## Percent kept: 99.537, 99.482, 98.554, 99.212, 99.368, 99.170, 99.522, 99.551, 99.650
## Percent removed: 0.463, 0.518, 1.446, 0.788, 0.632, 0.830, 0.478, 0.449, 0.350
chosen_colors <- list(
"ama" = "#AA3939",
"meta" = "#067300",
"pro" = "#482AB1")
parasite_expt <- set_expt_colors(expt=parasite_expt, colors=chosen_colors)
At this point, we should have everything necessary to perform the various analyses of the 4 sub-experiments. So save the current data for reuse elsewhere.
knitr::kable(parasite_expt$design)
sample | sampleid | strain | condition | stage | batch | altbatch | file | |
---|---|---|---|---|---|---|---|---|
axa1 | axa1 | axa1 | MHOM_BR_74_M2903 | ama | amastigote | a | a | preprocessing/AXA1/processed/outputs/bowtie2_lbraziliensis/axa1-trimmed.count.xz |
axa2 | axa2 | axa2 | MHOM_BR_74_M2903 | ama | amastigote | b | a | preprocessing/AXA2/processed/outputs/bowtie2_lbraziliensis/axa2-trimmed.count.xz |
axa3 | axa3 | axa3 | MHOM_BR_74_M2903 | ama | amastigote | c | a | preprocessing/AXA3/processed/outputs/bowtie2_lbraziliensis/axa3-trimmed.count.xz |
meta1 | meta1 | meta1 | MHOM_BR_74_M2903 | meta | metacyclic | a | a | preprocessing/META1/processed/outputs/bowtie2_lbraziliensis/meta1-trimmed.count.xz |
meta2 | meta2 | meta2 | MHOM_BR_74_M2903 | meta | metacyclic | b | a | preprocessing/META2/processed/outputs/bowtie2_lbraziliensis/meta2-trimmed.count.xz |
meta3 | meta3 | meta3 | MHOM_BR_74_M2903 | meta | metacyclic | c | a | preprocessing/META3/processed/outputs/bowtie2_lbraziliensis/meta3-trimmed.count.xz |
pro1 | pro1 | pro1 | MHOM_BR_74_M2903 | pro | procyclic | a | a | preprocessing/PRO1/processed/outputs/bowtie2_lbraziliensis/pro1-trimmed.count.xz |
pro2 | pro2 | pro2 | MHOM_BR_74_M2903 | pro | procyclic | b | a | preprocessing/PRO2/processed/outputs/bowtie2_lbraziliensis/pro2-trimmed.count.xz |
pro3 | pro3 | pro3 | MHOM_BR_74_M2903 | pro | procyclic | c | a | preprocessing/PRO3/processed/outputs/bowtie2_lbraziliensis/pro3-trimmed.count.xz |
** Explicitly excluded rRNA genes/features, currently only those. The rest of the ncRNA may be removed by adding “snoRNA”, “ncRNA”, “snRNA” to ‘patterns’ above.
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 2a0661d6e37f8a3d8831eb3bbd6347c0d9c4b3b7
## R> packrat::restore()
## This is hpgltools commit: Thu Mar 29 16:59:07 2018 -0400: 2a0661d6e37f8a3d8831eb3bbd6347c0d9c4b3b7
## Saving to 01_annotation-v20170201.rda.xz