There are a few methods of importing annotation data into R. The following are two attempts, the second is currently being used in these analyses.
AnnotationHub is a newer service and has promise to be an excellent top-level resource for gathering annotation data.
## This doesn't quite work yet
tt <- sm(library(AnnotationHub))
ah <- AnnotationHub()
orgdbs <- query(ah, "OrgDb")
leishmanias <- query(ah, c("OrgDB", "Leishmania"))
AnnotationHub is the new and fancier version of what OrganismDb does. Keith already made these for the parasites though, lets try and use one of those.
The OrganismDb packages are installable via Keith’s builder: https://github.com/elsayed-lab/eupathdb-organismdb
I did a git pull of it, changed a couple small things and ran ‘make lbraziliensis’. This pulls the annotations down from a mix of Najib’s directory on the cluster and the TriTrypdb. As a result, it currently gets the version 27 of the TriTrypDB data. It may be useful for me to download a newer revision (but I am reasonably certain not much (or anything) has changed for braziliensis in a while).
After 5 or so minutes a brand new package ‘Leishmania.braziliensis’ appeared in my R environment.
testing_panamensis <- make_tritrypdb_organismdbi("lpanamensis")
testing_braziliensis <- make_tritrypdb_organismdbi("lbraziliensis")
testing_donovani <- make_tritrypdb_organismdbi("ldonovani")
testing_mexicana <- make_tritrypdb_organismdbi("lmexicana")
testing_major <- make_tritrypdb_organismdbi("lmajor")
testing_donovani_eu <- make_eupath_organismdbi("donovani")
tmp <- sm(try(library("Leishmania.major.Friedlin")))
if (class(tmp) == "try-error") {
tmp <- make_tritrypdb_organismdbi("lmajor")
}
tmp <- sm(library("org.Lmajor.friedlin.eg.db"))
tmp <- sm(library("TxDb.Lmajor.friedlin.TriTrypDB28"))
Leishmania.major.Friedlin
## OrganismDb Object:
## # Includes GODb Object: GO.db
## # With data about: Gene Ontology
## # Includes OrgDb Object: org.Lmajor.friedlin.eg.db
## # Gene data about: Leishmania major.friedlin
## # Taxonomy Id:
## # Includes TxDb Object: TxDb.Lmajor.friedlin.TriTrypDB28
## # Transcriptome data about: Leishmania major
## # Based on genome: NA
## # The OrgDb gene id GID is mapped to the TxDb gene id GENEID .
lm_org <- sm(load_orgdb_annotations(Leishmania.major.Friedlin, keytype="geneid"))
lm_annotations <- lm_org[["genes"]]
In this instance, it might be smarter to actually read the annotations from a gff file, primarily because I modified the gff file to include an entry for the LmjF.26.0890 3’ UTR.
lm_gff_annotations <- load_gff_annotations(gff="reference/lmajor_modified.gff", type="gene")
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Warning in as.data.frame(mcols(x), ...): Arguments in '...' ignored
## Warning in as.data.frame(mcols(x), ...): Arguments in '...' ignored
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Warning in as.data.frame(mcols(x), ...): Arguments in '...' ignored
## Returning a df with 24 columns and 9379 rows.
head(lm_gff_annotations)
## seqnames start end width strand source type score phase ID Name
## 37 LmjF.01 3704 4702 999 - TriTrypDB gene NA NA LmjF.01.0010 LmjF.01.0010
## 41 LmjF.01 5790 7439 1650 - TriTrypDB gene NA NA LmjF.01.0020 LmjF.01.0020
## 45 LmjF.01 9061 11067 2007 - TriTrypDB gene NA NA LmjF.01.0030 LmjF.01.0030
## 49 LmjF.01 12073 12642 570 - TriTrypDB gene NA NA LmjF.01.0040 LmjF.01.0040
## 53 LmjF.01 15025 17022 1998 - TriTrypDB gene NA NA LmjF.01.0050 LmjF.01.0050
## 57 LmjF.01 18137 18886 750 - TriTrypDB gene NA NA LmjF.01.0060 LmjF.01.0060
## description size web_id molecule_type organism_name
## 37 hypothetical+protein,+unknown+function 999 LmjF.01.0010 <NA> <NA>
## 41 hypothetical+protein,+conserved 1650 LmjF.01.0020 <NA> <NA>
## 45 Kinesin-13+1,+putative+(KIN13-1) 2007 LmjF.01.0030 <NA> <NA>
## 49 hypothetical+protein,+unknown+function 570 LmjF.01.0040 <NA> <NA>
## 53 carboxylase,+putative 1998 LmjF.01.0050 <NA> <NA>
## 57 hypothetical+protein,+conserved 750 LmjF.01.0060 <NA> <NA>
## translation_table topology localization Dbxref locus_tag Alias Parent
## 37 <NA> <NA> <NA> LmjF.01.0010 32143805....
## 41 <NA> <NA> <NA> LmjF.01.0020 32143805....
## 45 <NA> <NA> <NA> LmjF.01.0030 KIN13-1,....
## 49 <NA> <NA> <NA> LmjF.01.0040 32143805....
## 53 <NA> <NA> <NA> LmjF.01.0050 32143805....
## 57 <NA> <NA> <NA> LmjF.01.0060 32143805....
## Ontology_term
## 37
## 41
## 45
## 49
## 53
## 57
rownames(lm_gff_annotations) <- paste0("exon_", make.names(lm_gff_annotations[["Name"]], unique=TRUE), ".1")
In contrast, it is possible to load most annotations of interest directly from the gff files used in the alignments. More in-depth information for the human transcriptome may be extracted from biomart.
## While testing, I called this desc, that will need to change.
lm_tooltips <- make_tooltips(annotations=lm_gff_annotations, desc_col="description")
##rownames(lm_annotations) <- paste0("exon_", rownames(lm_annotations), ".1")
##lm_lengths <- lm_annotations[, c("geneid", "width")]
##colnames(lm_lengths) <- c("ID", "width")
lm_goids <- as.data.frame(load_orgdb_go(Leishmania.major.Friedlin, keytype="TXID",
columns=c("TXNAME","GO")))
## This is an organismdbi, that should be ok.
lm_goids <- lm_goids[, c("TXNAME", "GO", "ONTOLOGY", "TERM")]
colnames(lm_goids) <- c("ID", "GO", "ont", "name")
new_ids <- gsub(pattern="-1$", replacement="\\.1", x=lm_goids[["ID"]])
new_ids <- paste0("exon_", new_ids)
lm_goids[["ID"]] <- new_ids
The macrophage experiment has samples across 2 contexts, the host and parasite. The following block sets up one experiment for each. If you open the all_samples-species.xlsx files, you will note immediately that a few different attempts were made at ascertaining the most likely experimental factors that contributed to the readily apparent batch effects.
Question: Are there any human reads in this data? I assume no, but will happily map human if so.
parasite_expt <- create_expt("sample_sheet/all_samples.xlsx", gene_info=lm_gff_annotations)
## wt_1/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows.
## wt_2/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## wt_3/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## odd3_1/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## odd3_2/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## odd3_3/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
chosen_colors <- list(
"wt" = "#AA3939",
"odd3" = "#067300")
parasite_expt <- set_expt_colors(expt=parasite_expt, colors=chosen_colors)
At this point, we should have everything necessary to perform some analyses.
knitr::kable(parasite_expt$design)
sampleid | condition | batch | file | |
---|---|---|---|---|
wt_1 | wt_1 | wt | null | wt_1/processed/outputs/tophat_lmajor/accepted_paired.count.xz |
wt_2 | wt_2 | wt | null | wt_2/processed/outputs/tophat_lmajor/accepted_paired.count.xz |
wt_3 | wt_3 | wt | null | wt_3/processed/outputs/tophat_lmajor/accepted_paired.count.xz |
odd_1 | odd_1 | odd3 | null | odd3_1/processed/outputs/tophat_lmajor/accepted_paired.count.xz |
odd_2 | odd_2 | odd3 | null | odd3_2/processed/outputs/tophat_lmajor/accepted_paired.count.xz |
odd_3 | odd_3 | odd3 | null | odd3_3/processed/outputs/tophat_lmajor/accepted_paired.count.xz |
pander::pander(sessionInfo())
R version 3.4.1 (2017-06-30)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: hpgltools(v.2017.01), TxDb.Lmajor.friedlin.TriTrypDB28(v.28.0), org.Lmajor.friedlin.eg.db(v.28.0), Leishmania.major.Friedlin(v.28.0), GO.db(v.3.4.1), OrganismDbi(v.1.18.0), GenomicFeatures(v.1.28.4), GenomicRanges(v.1.28.4), GenomeInfoDb(v.1.12.2), AnnotationDbi(v.1.38.2), IRanges(v.2.10.3), S4Vectors(v.0.14.3), Biobase(v.2.36.2) and BiocGenerics(v.0.22.0)
loaded via a namespace (and not attached): bit64(v.0.9-7), foreach(v.1.4.3), assertthat(v.0.2.0), highr(v.0.6), pander(v.0.6.1), RBGL(v.1.52.0), blob(v.1.1.0), GenomeInfoDbData(v.0.99.0), Rsamtools(v.1.28.0), yaml(v.2.1.14), backports(v.1.1.0), RSQLite(v.2.0), lattice(v.0.20-35), glue(v.1.1.1), digest(v.0.6.12), RColorBrewer(v.1.1-2), XVector(v.0.16.0), colorspace(v.1.3-2), htmltools(v.0.3.6), Matrix(v.1.2-11), plyr(v.1.8.4), XML(v.3.98-1.9), pkgconfig(v.2.0.1), devtools(v.1.13.3), biomaRt(v.2.32.1), zlibbioc(v.1.22.0), scales(v.0.5.0), openxlsx(v.4.0.17), BiocParallel(v.1.10.1), tibble(v.1.3.4), ggplot2(v.2.2.1), withr(v.2.0.0), SummarizedExperiment(v.1.6.3), lazyeval(v.0.2.0), magrittr(v.1.5), crayon(v.1.3.2), evaluate(v.0.10.1), memoise(v.1.1.0), xml2(v.1.1.1), graph(v.1.54.0), BiocInstaller(v.1.26.1), tools(v.3.4.1), data.table(v.1.10.4), matrixStats(v.0.52.2), stringr(v.1.2.0), munsell(v.0.4.3), DelayedArray(v.0.2.7), bindrcpp(v.0.2), Biostrings(v.2.44.2), compiler(v.3.4.1), rlang(v.0.1.2), grid(v.3.4.1), RCurl(v.1.95-4.8), iterators(v.1.0.8), base64enc(v.0.1-3), rmarkdown(v.1.6), bitops(v.1.0-6), testthat(v.1.0.2), gtable(v.0.2.0), codetools(v.0.2-15), DBI(v.0.7), roxygen2(v.6.0.1), R6(v.2.2.2), GenomicAlignments(v.1.12.2), knitr(v.1.17), dplyr(v.0.7.2), rtracklayer(v.1.36.4), bit(v.1.1-12), rprojroot(v.1.2), bindr(v.0.1), commonmark(v.1.4), stringi(v.1.1.5) and Rcpp(v.0.12.12)
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 739e4a10345a89efb17b37e7de07c5811491ccea
## R> packrat::restore()
## This is hpgltools commit: Thu Aug 31 11:24:06 2017 -0400: 739e4a10345a89efb17b37e7de07c5811491ccea
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to: ", this_save))
## Saving to: 01_annotation-v20170703.rda.xz
tmp <- sm(saveme(filename=this_save))