index.html preprocessing.html

1 Annotation version: 20170810

1.1 Genome annotation with OrgDb/TxDb/OrganismDbi

The function make_organismdbi() is responsible for collating the orgDb/TxDb instances for the esmeraldo, nonesmeraldo, and clbrener haplotypes/strains. Along the way it downloads the tritrypDb text files to organismdbi/ and saves rdata files for the orgdb information.

## These functions take _forever_ the first time around.
esmer <- make_tritrypdb_organismdbi(id="tcruzi_esmeraldo", kegg=TRUE)
nonesmer <- make_tritrypdb_organismdbi(id="tcruzi_nonesmeraldo", kegg=TRUE)
unassigned <- make_tritrypdb_organismdbi(id="tcruzi_unassigned", kegg=TRUE)
sylvio <- make_tritrypdb_organismdbi(id="tcruzi_sylvio")

It is worth noting that since doing this the first time, I have learned how to use OrganismDbi/OrgDb/TxDb and that they are superior methods of extracting this information.

Upon completion of the above, there should be 3 new loadable packages:

  1. Trypanosoma.cruzi.CLBrener.Esmeraldo: This contains the gene information/GO/KEGG/etc data for the esmeraldo haplotype, and combines the libraries “org.TcCLB.esmer.tritryp.db” and “TxDb.TcruziCLBrenerEsmer.tritryp27.genes”.
  2. Trypanosoma.cruzi.CLBrener.NonEsmeraldo: This contains the gene information/GO/KEGG/etc data for the non-esmeraldo haplotype; including the libraries “org.TcCLB.nonesmer.tritryp.db” and “TxDb.TcruziCLBrenerNonEsmer.tritryp27.genes”.
  3. Trypanosoma.cruzi.CLBrener.Unassigned: This contains the gene information/GO/KEGG/etc data for the unassigned genes; including the libraries “org.TcCLB.unassigned.tritryp.db” and “TxDb.TcruziCLBrenerUnassigned.tritryp27.genes”.
## If the libraries are not available, uncomment and run the previous lines
##  Start with the orgDb instances
tt <- sm(library(org.Tcruzi.CLBrenerEsmeraldo.eg.db))
tt <- sm(library(org.Tcruzi.CLBrenerNonEsmeraldo.eg.db))
tt <- sm(library(org.Tcruzi.CLBrener.eg.db))
## Follow up with the TxDb instances
tt <- sm(library(TxDb.Tcruzi.CLBrenerEsmeraldo.TriTrypDB28))
tt <- sm(library(TxDb.Tcruzi.CLBrenerNonEsmeraldo.TriTrypDB28))
tt <- sm(library(TxDb.Tcruzi.CLBrener.TriTrypDB28))
## Finish with the OrganismDbi instances, noting that one really only needs to
## load these last 3.
tt <- sm(library(Trypanosoma.cruzi.CLBrener.Esmeraldo))
tt <- sm(library(Trypanosoma.cruzi.CLBrener.NonEsmeraldo))
tt <- sm(library(Trypanosoma.cruzi.CLBrener.Unassigned))

## Just to save on typing
esmer_db <- Trypanosoma.cruzi.CLBrener.Esmeraldo
nonesmer_db <- Trypanosoma.cruzi.CLBrener.NonEsmeraldo
unassigned_db <- Trypanosoma.cruzi.CLBrener.Unassigned

1.2 Extract data from the OrganismDbi objects

The esmer/nonesmer/unassigned databases created above provide hopefully complete sets of material including the gene annotations, ontology data, etc. Lets pull from them a subset of interesting data. The most common things I like to see are the gene descriptions, chromosome assignment, start/end/strand, type, and size (for goseq)

## Some notes:
## The possible fields from the OrganismDbi may be found with:
## keytypes(esmer_db)  In addition, if you feed it a gibberish key, it will come
## back with a listing of all possible keys.  You will notice that I do not
## capitalize the keys when I send them, it will handle that for me.

wanted_fields <- c("genedescription", "txchrom", "txtype", "txstart", "txend", "txstrand",
                   "txtype","genesize")
esmer_genes <- sm(load_orgdb_annotations(esmer_db,
                                            keytype="geneid",
                                            fields=wanted_fields)$genes)
nonesmer_genes <- sm(load_orgdb_annotations(nonesmer_db,
                                               keytype="geneid",
                                               fields=wanted_fields)$genes)
unassigned_genes <- sm(load_orgdb_annotations(unassigned_db,
                                                 keytype="geneid",
                                                 fields=wanted_fields)$genes)
## The annotation data as provided by load_parasite_annotations comes as a set
## of genes and transcripts.  Most of the time we are only interested in the
## genes. Thus in the previous 3 lines we asked only for the 'genes' slot.

## Now pull down the ontology data for these haplotypes
go_fields <- c("GOALL")
esmer_go <- as.data.frame(sm(load_orgdb_go(esmer_db)))
nonesmer_go <- as.data.frame(sm(load_orgdb_go(nonesmer_db)))
unassigned_go <- as.data.frame(sm(load_orgdb_go(unassigned_db)))
all_go <- rbind(esmer_go, nonesmer_go)
all_go <- rbind(all_go, unassigned_go)

## Finally, get the KEGG data, however this still sometimes has some problems.
##kegg_fields <- c("KEGG","KEGG_CLASS","KEGG_DESCRIPTION","KEGG_NAME")
##esmer_kegg <- load_parasite_annotations(esmer_db, keytype="geneid",
##                                        fields=kegg_fields)
## nonesmer_kegg <- load_parasite_annotations(nonesmer_db, keytype="geneid",
##                                            fields=kegg_fields)$genes
##unassigned_kegg <- load_parasite_annotations(unassigned_db, keytype="geneid",
##                                             fields=kegg_fields)$genes

## Well, that should not be too much of a problem, I can call the KEGG gatherer
## separately. This is annoying though as this is the function I use to fill in
## the KEGG data for the organismDbi.
kegg_annotations <- get_kegg_genes(species="Trypanosoma cruzi")
## The abbreviation detected was: tcr
## Reading from the savefile, delete kegg_Trypanosoma_cruzi.rda.xz to regenerate.
kegg_mapping <- kegg_annotations[, c("GID","KEGG", "KEGG_NAME")]

1.3 Previous code to load annotations from a gff file.

The following is the previous method of doing a subset of the above. I think I am going to stop using it now. I am leaving it here though as a reminder.

gff_file <- "reference/gff/clbrener_8.1_complete_genes.gff.gz"
gff_annotations <- sm(load_gff_annotations(gff_file, type="gene"))
tooltip_data <- make_tooltips(annotations=gff_annotations)

Given the esmer/nonesmer/unassigned annotations above, we may now pull from them some useful information including gene lengths (for goseq later).

esmer_lengths <- esmer_genes[, c("geneid", "genesize")]
nonesmer_lengths <- nonesmer_genes[, c("geneid", "genesize")]
unassigned_lengths <- unassigned_genes[, c("geneid", "genesize")]
all_lengths <- rbind(esmer_lengths, nonesmer_lengths)
all_lengths <- rbind(all_lengths, unassigned_lengths)

## gene_lengths <- gene_annotations[,c("ID","width")]
## goids <- read.csv(file="reference/go/tcruzi_all_go.tab.gz", sep="\t", header=FALSE)
## colnames(goids) <- c("ID","GOID","ontology","name","evidence","code")
## goids <- goids[, c("ID","GOID")]
## colnames(goids) <- c("ID","GO")

## The following lines are a series of gene mappings between the haplotypes.
esmer_hap <- read.csv("reference/Esmeraldo_orthologs.txt", sep="\t", header=FALSE)
colnames(esmer_hap) <- c("CLBrener", "esmer_number", "esmer_id",
                         "haplotype_esmer", "esmer_gene_name", "esmer_comment",
                         "esmer_comment2")
esmer_hap <- esmer_hap[,c("CLBrener","esmer_number","esmer_id","esmer_gene_name")]
nonesmer_hap <- read.csv("reference/NonEsmeraldo_orthologs.txt", sep="\t", header=FALSE)
colnames(nonesmer_hap) <- c("CLBrener", "nonesmer_number", "nonesmer_id",
                            "haplotype_nonesmer", "nonesmer_gene_name",
                            "nonesmer_comment", "nonesmer_comment2")
nonesmer_hap <- nonesmer_hap[,c("CLBrener", "nonesmer_number", "nonesmer_id",
                                "nonesmer_gene_name")]
haplotype_xref <- merge(esmer_hap, nonesmer_hap, by.x="CLBrener",
                        by.y="CLBrener", all.x=TRUE, all.y=TRUE)

2 Create expt object containing the counts and annotation information

We should be ready to play now. We have on hand in the sample_sheets/ directory the descriptions of each sample. You might note that there are in fact 3 separate sample sheets, one for esmeraldo, nonesmeraldo, and everything. This is because we did 3 separate against the three haplotypes separately. We therefore want to bring the esmeraldo haplotype mapping data together with the esmeraldo annotation data, nonesmer to nonesmer, and all to all. For that reason, the first couple of lines just combine the annotation data from esmer/nonesmer/unassigned into a larger annotation set called ‘all’.

We then step through and create 3 expt objects, one for the 2 haplotypes and one for all.

colnames(esmer_genes) <- gsub(pattern="genesize", replacement="width",
                              x=colnames(esmer_genes))
colnames(nonesmer_genes) <- gsub(pattern="genesize", replacement="width",
                                 x=colnames(nonesmer_genes))
colnames(unassigned_genes) <- gsub(pattern="genesize", replacement="width",
                                   x=colnames(unassigned_genes))
all_genes <- rbind(esmer_genes, nonesmer_genes)
all_genes <- rbind(all_genes, unassigned_genes)

kept_columns <- c("geneid", "txtype", "txchrom", "txstrand", "width", "genedescription")
esmer_genes <- esmer_genes[, kept_columns]
nonesmer_genes <- nonesmer_genes[, kept_columns]
unassigned_genes <- unassigned_genes[, kept_columns]
all_genes <- all_genes[, kept_columns]

esmer_expt <- sm(create_expt("sample_sheets/cl14clbr_samples_combined.xlsx",
                             file_column="esmerfile", gene_info=esmer_genes))
esmer_expt <- exclude_genes_expt(esmer_expt)
## Before removal, there were 10597 entries.
## Now there are 10339 entries.
## Percent kept: 99.8505977035531, 99.8490740564207, 99.8736885394381, 99.6822284200219, 99.9119604883336, 99.9330530382904, 99.9136058806954, 99.6824761559551, 99.6482128038312, 99.9229899348608, 99.9291214114528, 99.8682130311147, 99.8315576877854, 99.8379545310834, 99.78783155441, 99.9203732023632, 99.9169010500345, 99.9389195680523
## Percent removed: 0.14940229644692, 0.150925943579282, 0.126311460561939, 0.317771579978078, 0.0880395116664014, 0.0669469617095973, 0.086394119304594, 0.317523844044861, 0.351787196168787, 0.0770100651391866, 0.0708785885472439, 0.131786968885313, 0.168442312214621, 0.162045468916571, 0.212168445590004, 0.0796267976368062, 0.0830989499654831, 0.0610804319476842
chosen_colors <- list(
##    "CL14.Epi" = "#FFEC38",  ## Light Orange, also we removed the epi samples
##    "CLBr.Epi" = "#FF8B00",  ## Dark Orange
    "CL14.Tryp" = "#FE6E72",  ## Light Red
    "CLBr.Tryp" = "#AA3939",  ## Dark Red
    "CL14.A96" = "#6E9B34",  ## Light Green
    "CLBr.A96" = "#067300",  ## Dark Green
    "CL14.A60" = "#8790E0",  ## Light Blue
    "CLBr.A60" = "#482AB1")  ## Dark Blue
newnames <- paste0(esmer_expt$design$condition, ".", esmer_expt$design$replicate)
esmer_expt <- set_expt_samplenames(esmer_expt, newnames)
esmer_expt <- set_expt_colors(expt=esmer_expt, colors=chosen_colors)

nonesmer_expt <- sm(create_expt(metadata="sample_sheets/cl14clbr_samples_combined.xlsx",
                                gene_info=nonesmer_genes, file_column="nonesmerfile"))
nonesmer_expt <- exclude_genes_expt(nonesmer_expt)
## Before removal, there were 11106 entries.
## Now there are 10831 entries.
## Percent kept: 99.8918598277929, 99.893698258758, 99.9129104416669, 99.7950217728841, 99.9309182689105, 99.9450331189277, 99.927674197287, 99.7875627351139, 99.764345473372, 99.9507873644009, 99.9507601697579, 99.9072257339634, 99.8854714846495, 99.8961681106644, 99.8752381792426, 99.9345328853916, 99.9327497975459, 99.9485650200899
## Percent removed: 0.10814017220706, 0.106301741241984, 0.0870895583331572, 0.204978227115876, 0.0690817310894929, 0.054966881072305, 0.0723258027129468, 0.21243726488613, 0.235654526627954, 0.0492126355991289, 0.0492398302420998, 0.0927742660365738, 0.114528515350464, 0.103831889335598, 0.124761820757434, 0.0654671146083818, 0.0672502024541323, 0.051434979910102
nonesmer_expt <- set_expt_colors(nonesmer_expt, colors=chosen_colors)
nonesmer_expt <- set_expt_samplenames(nonesmer_expt, newnames)

all_expt <- sm(create_expt(metadata="sample_sheets/cl14clbr_samples_combined.xlsx",
                           file_column="clbrenerfile", gene_info=all_genes))
all_expt <- exclude_genes_expt(all_expt)
## Before removal, there were 25100 entries.
## Now there are 23305 entries.
## Percent kept: 99.3312333217399, 99.3061664533641, 99.4570685757597, 98.435323575405, 99.6534395733928, 99.7755579868177, 99.7609502843079, 98.7204083398046, 98.5426791077987, 99.7809256297039, 99.8152710935081, 99.4491231184001, 99.3214938793745, 99.3946631668043, 99.1571379632071, 99.8143299083649, 99.8176608844132, 99.8856471362546
## Percent removed: 0.668766678260111, 0.693833546635864, 0.542931424240318, 1.56467642459504, 0.346560426607171, 0.22444201318232, 0.23904971569214, 1.27959166019542, 1.45732089220129, 0.21907437029615, 0.184728906491865, 0.550876881599932, 0.678506120625489, 0.605336833195692, 0.842862036792925, 0.185670091635106, 0.182339115586813, 0.114352863745435
all_expt <- set_expt_colors(all_expt, colors=chosen_colors)
all_expt <- set_expt_samplenames(all_expt, newnames)

3 Load Human annotation data

The human data has been much more thoroughly examined and is therefore easier to load

tt <- sm(library(Homo.sapiens))
## Note that the counts taken for the human data were using the gene annotations (ENSG...)

hs_genes <- sm(load_orgdb_annotations(Homo.sapiens, fields=c("ensembl","genename")))
## Annoyingly, the Homo.sapiens Dbi returns many lines / ENSG entry
dim(hs_genes$genes)
## [1] 28920     2
head(hs_genes$genes)
##                         ensembl                           genename
## ENSG00000121410 ENSG00000121410             alpha-1-B glycoprotein
## ENSG00000175899 ENSG00000175899              alpha-2-macroglobulin
## ENSG00000256069 ENSG00000256069 alpha-2-macroglobulin pseudogene 1
## ENSG00000171428 ENSG00000171428              N-acetyltransferase 1
## ENSG00000156006 ENSG00000156006              N-acetyltransferase 2
## ENSG00000196136 ENSG00000196136           serpin family A member 3
## Ahh but if we keep a minimal set, then it doesn't, I see because some entries
## are repeated.  This may prove important to remember.
hs_expt <- sm(create_expt(metadata="sample_sheets/cl14clbr_samples_combined.xlsx",
                          file_column="humanfile", gene_info=hs_genes$genes))

biomart_hs <- load_biomart_annotations(species="hsapiens")
## The biomart annotations file already exists, loading from it.
host = "dec2015.archive.ensembl.org"
trymart = "ENSEMBL_MART_ENSEMBL"
allmarts <- biomaRt::listMarts(host=host)
mart <- biomaRt::useMart(biomart=trymart, host=host)
dataset <- "hsapiens_gene_ensembl"
ensembl <- biomaRt::useDataset(dataset, mart=mart)
attribs <- biomaRt::listAttributes(ensembl)

index.html

pander::pander(sessionInfo())

R version 3.4.1 (2017-06-30)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: Homo.sapiens(v.1.3.1), TxDb.Hsapiens.UCSC.hg19.knownGene(v.3.2.2), org.Hs.eg.db(v.3.4.1), Trypanosoma.cruzi.CLBrener.Unassigned(v.28.0), Trypanosoma.cruzi.CLBrener.NonEsmeraldo(v.28.0), Trypanosoma.cruzi.CLBrener.Esmeraldo(v.28.0), GO.db(v.3.4.1), OrganismDbi(v.1.18.0), TxDb.Tcruzi.CLBrener.TriTrypDB28(v.28.0), TxDb.Tcruzi.CLBrenerNonEsmeraldo.TriTrypDB28(v.28.0), TxDb.Tcruzi.CLBrenerEsmeraldo.TriTrypDB28(v.28.0), GenomicFeatures(v.1.28.4), GenomicRanges(v.1.28.4), GenomeInfoDb(v.1.12.2), org.Tcruzi.CLBrener.eg.db(v.28.0), org.Tcruzi.CLBrenerNonEsmeraldo.eg.db(v.27.0), org.Tcruzi.CLBrenerEsmeraldo.eg.db(v.28.0), AnnotationDbi(v.1.38.2), IRanges(v.2.10.2), S4Vectors(v.0.14.3), Biobase(v.2.36.2), BiocGenerics(v.0.22.0) and hpgltools(v.2017.01)

loaded via a namespace (and not attached): bit64(v.0.9-7), foreach(v.1.4.3), assertthat(v.0.2.0), pander(v.0.6.1), RBGL(v.1.52.0), blob(v.1.1.0), GenomeInfoDbData(v.0.99.0), Rsamtools(v.1.28.0), yaml(v.2.1.14), RSQLite(v.2.0), backports(v.1.1.0), lattice(v.0.20-35), glue(v.1.1.1), digest(v.0.6.12), RColorBrewer(v.1.1-2), XVector(v.0.16.0), colorspace(v.1.3-2), htmltools(v.0.3.6), Matrix(v.1.2-11), plyr(v.1.8.4), XML(v.3.98-1.9), pkgconfig(v.2.0.1), biomaRt(v.2.32.1), zlibbioc(v.1.22.0), scales(v.0.4.1), openxlsx(v.4.0.17), BiocParallel(v.1.10.1), tibble(v.1.3.3), ggplot2(v.2.2.1), SummarizedExperiment(v.1.6.3), fortunes(v.1.5-4), lazyeval(v.0.2.0), magrittr(v.1.5), memoise(v.1.1.0), evaluate(v.0.10.1), graph(v.1.54.0), BiocInstaller(v.1.26.0), tools(v.3.4.1), data.table(v.1.10.4), matrixStats(v.0.52.2), stringr(v.1.2.0), munsell(v.0.4.3), DelayedArray(v.0.2.7), bindrcpp(v.0.2), Biostrings(v.2.44.2), compiler(v.3.4.1), rlang(v.0.1.2), grid(v.3.4.1), RCurl(v.1.95-4.8), iterators(v.1.0.8), bitops(v.1.0-6), rmarkdown(v.1.6), gtable(v.0.2.0), codetools(v.0.2-15), DBI(v.0.7), R6(v.2.2.2), GenomicAlignments(v.1.12.1), knitr(v.1.17), dplyr(v.0.7.2), rtracklayer(v.1.36.4), bit(v.1.1-12), bindr(v.0.1), rprojroot(v.1.2), stringi(v.1.1.5) and Rcpp(v.0.12.12)

message(paste0("This is hpgltools commit: ", get_git_commit()))
## This is hpgltools commit: Fri Aug 18 12:25:48 2017 -0400: 9916e2ba9bba08c3efbbcd1bfa9b2aa63c49aa6d
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 01_annotation-v20170810.rda.xz
tmp <- sm(saveme(filename=this_save))