This document seeks to explore the differences in samples which did and did-not respond to drug treatment during infection.
Extract the antimonial samples from the experiment.
## Using a subset expression.
## There were 42, now there are 12 samples.
ant$colors <- gsub(pattern="CD960C", replacement="005500", x=ant$colors)
ant$colors <- gsub(pattern="AA7A1A", replacement="880000", x=ant$colors)
ant$colors <- gsub(pattern="886E3E", replacement="0000EE", x=ant$colors)
ant$colors <- gsub(pattern="666666", replacement="777777", x=ant$colors)
Lets take a look at the data before normalization.
Perform an initial sample normalization and see how they look.
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 6998 low-count genes (11849 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.
## Printing a color to condition legend.
Do these samples segregate in ways which make sense?
## Blacks are ok -- responds, infected
## Blue are supposed to be fail, infected
## Green are supposed to be respond, uninfected
## This appears to be a formidable batch effect.
anorm_met$corheat
The batch effect is pretty severe, what happens if we try combat on it?
## This function will replace the expt$expressionset slot with:
## combat(quant(cbcb(data)))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Step 1: performing count filter with option: cbcb
## Removing 6998 low-count genes (11849 remaining).
## Step 2: normalizing the data with quant.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: doing batch correction with combat.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 142037 entries are x>1: 99.9%.
## batch_counts: Before batch/surrogate estimation, 2 entries are x==0: 0.00141%.
## batch_counts: Before batch/surrogate estimation, 108 entries are 0<x<1: 0.0760%.
## The be method chose 2 surrogate variable(s).
## batch_counts: Using combat with a prior, no scaling, and a null model.
## There are 389 (0.274%) elements which are < 0 after batch correction.
## This function will replace the expt$expressionset slot with:
## combat_scale(quant(cbcb(data)))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Step 1: performing count filter with option: cbcb
## Removing 6998 low-count genes (11849 remaining).
## Step 2: normalizing the data with quant.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: doing batch correction with combat_scale.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 142037 entries are x>1: 99.9%.
## batch_counts: Before batch/surrogate estimation, 2 entries are x==0: 0.00141%.
## batch_counts: Before batch/surrogate estimation, 108 entries are 0<x<1: 0.0760%.
## The be method chose 2 surrogate variable(s).
## batch_counts: Using combat with a prior and with scaling.
## There are 1929 (1.36%) elements which are < 0 after batch correction.
## Wow, that is encouraging
write.csv(file="excel/antimonial_pca_batch_table.csv", abatch_pca$table)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some data are negative. We are on log scale, setting them to 0.
## Changed 1929 negative features.
## Some entries are 0. We are on log scale, adding 1 to the data.
## Changed 1929 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some data are negative. We are on log scale, setting them to 0.5.
## Changed 1929 negative features.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.
## Printing a color to condition legend.
The PCA plot of the combat adjusted data suggests that it might actually work out ok. Let us look at the other metrics.
anti_host_combat_de$comparison$heat
limma_table <- anti_host_combat_de$limma$all_tables$respond_inf_vs_fail_inf
de_plots <- extract_de_plots(anti_host_combat_de, table="respond_inf_vs_fail_inf")
de_plots$ma$plot
antimony_contrasts <- list(
"fail_respond_inf" = c("fail_inf", "respond_inf"),
"fail_respond_nil" = c("fail_nil", "respond_nil"),
"fail" = c("fail_inf", "fail_nil"),
"respond" = c("respond_inf", "respond_nil"))
antimony_tables <- sm(combine_de_tables(
anti_host_combat_de,
excel=glue::glue("excel/antimony_combat_tables-v{ver}.xlsx"),
keepers=antimony_contrasts, annot_df=hs_annotations))
##antimony_tables$plots[["fail"]]
antimony_sig_tables <- sm(extract_significant_genes(
antimony_tables,
excel=glue::glue("excel/antimony_significant_genes-v{ver}.xlsx"),
according_to="all"))
anti_host_sva <- sm(all_pairwise(ant, model_batch="svaseq", filter=TRUE))
anti_host_tables <- sm(combine_de_tables(
anti_host_sva,
excel=glue::glue("excel/antimony_sva_tables-v{ver}.xlsx"),
keepers=antimony_contrasts))
anti_host_plots <- extract_de_plots(anti_host_tables, table="fail", type="deseq",
label=20)
anti_comparison <- compare_de_results(antimony_tables, anti_host_tables)
## Testing method: limma.
## Adding method: limma to the set.
## Testing method: deseq.
## Adding method: deseq to the set.
## Testing method: edger.
## Adding method: edger to the set.
## Testing method: ebseq.
## Adding method: ebseq to the set.
## Testing method: basic.
## Adding method: basic to the set.
## Warning in cor(x = fs[["x"]], y = fs[["y"]], method = cor_method): the
## standard deviation is zero
## Warning in cor(x = fs[["x"]], y = fs[["y"]], method = cor_method): the
## standard deviation is zero
## fail_respond_inf fail_respond_nil fail respond
## 0.2371 0.1971 0.6620 0.6337
## Reread the parasite sample sheet to play with other covariants
## Wow, from the parasite perspective, it seems they cluster much more tightly by strain
parasite_expt <- sm(create_expt(
"sample_sheets/old/all_samples-lpanamensis.xlsx",
gene_info=lp_annotations))
antipara_expt <- subset_expt(parasite_expt, subset="condition=='ant_good'|condition=='ant_bad'")
## Using a subset expression.
## There were 33, now there are 6 samples.
antipara_expt <- set_expt_batches(antipara_expt, fact="strain")
antipara_metrics <- sm(graph_metrics(antipara_expt))
antipara_norm <- normalize_expt(antipara_expt, transform="log2",
convert="cpm", filter=TRUE,
norm="quant")
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (8841 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 93 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## This function will replace the expt$expressionset slot with:
## log2(svaseq(simple(data)))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: simple
## Removing 114 low-count genes (8727 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: transforming the data with log2.
## transform_counts: Found 7670 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 40937 entries are x>1: 78.2%.
## batch_counts: Before batch/surrogate estimation, 7670 entries are x==0: 14.6%.
## The be method chose 1 surrogate variable(s).
## Attempting svaseq estimation with 1 surrogates.
## There are 15939 (30.4%) elements which are < 0 after batch correction.
human_expt <- create_expt(file="all_samples-hsapiens.xlsx")
antimonial_expt <- expt_subset(human_expt, subset="condition=='respond_nil'|condition=='respond_inf'|condition=='fail_nil'|condition=='fail_inf'")
antimonial_expt$colors <- c("blue","blue","red","red","black","black","green","green","red","green","blue","black")
antimonial_metrics <- graph_metrics(antimonial_expt)
antimonial_metrics$libsize
antimonial_norm <- normalize_expt(antimonial_expt, transform="log2", convert="cpm", filter_low=TRUE, norm="quant")
antimonialnorm_metrics <- graph_metrics(antimonial_norm)
antimonialnorm_metrics$pcaplot
antimonialnorm_metrics$corheat
antimonialnorm_metrics$disheat
antimonial_nb <- normalize_expt(antimonial_expt, transform="log2", convert="cpm", filter_low=TRUE, norm="quant", batch="combat")
antimonialnb_metrics <- graph_metrics(antimonial_nb)
antimonialnb_metrics$pcaplot
pp("images/antimonial_batch_corheat.png")
antimonialnb_metrics$corheat
dev.off()
## Reread the parasite sample sheet to play with other covariants
## Wow, from the parasite perspective, it seems they cluster much more tightly by strain
parasite_expt <- create_expt(file="all_samples-lpanamensis_patientbatch.xlsx")
antipara_expt <- expt_subset(parasite_expt, subset="condition=='ant_good'|condition=='ant_bad'")
antipara_metrics <- graph_metrics(antipara_expt)
antipara_metrics$pcaplot
antipara_norm <- normalize_expt(antipara_expt, transform="log2", convert="cpm", filter_low=TRUE, norm="quant", batch="combat")
antiparanorm_metrics <- graph_metrics(antipara_norm)
antiparanorm_metrics$pcaplot
antiparanorm_metrics$corheat
library(Biobase)
library(Heatplus)
infection_norm <- sm(normalize_expt(parasite_expt, transform="raw",
convert="cpm", filter="pofa",
norm="quant", batch="sva"))
infect_data <- exprs(infection_norm$expressionset)
correlations <- hpgl_cor(infect_data)
distances <- as.matrix(dist(t(infect_data), method="mink"))
design <- as.data.frame(infection_norm$design)
rownames(correlations) <- paste0(rownames(correlations), "_", as.character(design$condition))
colnames(correlations) <- paste0(colnames(correlations), "_", as.character(design$batch))
col_data <- as.data.frame(pData(infection_norm)[, c("strain", "condition")])
row_data <- design[, c("batch","condition")]
colnames(col_data) <- c("strain", "selfheal")
colnames(row_data) <- c("volunteer","selfheal")
row_data$selfheal <- as.numeric(row_data$selfheal)
row_data$selfheal <- ifelse(row_data$selfheal == 2, TRUE, FALSE)
myannot <- list(Col=list(data=col_data), Row=list(data=row_data))
mylabels <- list(Row=list(nrow=10), Col=list(nrow=10))
map1 <- annHeatmap2(distances,
cluster=list(cuth=7000),
ann=myannot,
labels=mylabels)
## 3 numbers, dendrogram, plot, annotation width/height
plot(map1)
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset c2881f6d97e1ec981fd1481cf46d6bc875fac423
## This is hpgltools commit: Tue Oct 22 10:22:30 2019 -0400: c2881f6d97e1ec981fd1481cf46d6bc875fac423
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_antimonial_estimation_20190914-v20190914.rda.xz
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: Heatplus(v.2.31.0), foreach(v.1.4.7), ruv(v.0.9.7.1), hpgltools(v.1.0), Biobase(v.2.45.1) and BiocGenerics(v.0.31.6)
loaded via a namespace (and not attached): Rtsne(v.0.15), colorspace(v.1.4-1), htmlTable(v.1.13.2), corpcor(v.1.6.9), XVector(v.0.25.0), GenomicRanges(v.1.37.17), base64enc(v.0.1-3), Vennerable(v.3.1.0.9000), rstudioapi(v.0.10), ggrepel(v.0.8.1), bit64(v.0.9-7), AnnotationDbi(v.1.47.1), codetools(v.0.2-16), splines(v.3.6.1), doParallel(v.1.0.15), robustbase(v.0.93-5), geneplotter(v.1.63.0), knitr(v.1.25), zeallot(v.0.1.0), Formula(v.1.2-3), Rsamtools(v.2.1.7), annotate(v.1.63.0), cluster(v.2.1.0), dbplyr(v.1.4.2), graph(v.1.63.0), BiocManager(v.1.30.8), compiler(v.3.6.1), httr(v.1.4.1), backports(v.1.1.5), assertthat(v.0.2.1), Matrix(v.1.2-17), lazyeval(v.0.2.2), limma(v.3.41.18), acepack(v.1.4.1), htmltools(v.0.4.0), prettyunits(v.1.0.2), tools(v.3.6.1), gtable(v.0.3.0), glue(v.1.3.1), GenomeInfoDbData(v.1.2.1), reshape2(v.1.4.3), dplyr(v.0.8.3), rappdirs(v.0.3.1), Rcpp(v.1.0.2), vctrs(v.0.2.0), Biostrings(v.2.53.2), gdata(v.2.18.0), preprocessCore(v.1.47.1), nlme(v.3.1-141), rtracklayer(v.1.45.6), iterators(v.1.0.12), xfun(v.0.10), fastcluster(v.1.1.25), stringr(v.1.4.0), openxlsx(v.4.1.0.1), gtools(v.3.8.1), XML(v.3.98-1.20), DEoptimR(v.1.0-8), edgeR(v.3.27.13), directlabels(v.2018.05.22), zlibbioc(v.1.31.0), scales(v.1.0.0), doSNOW(v.1.0.18), hms(v.0.5.1), SummarizedExperiment(v.1.15.9), RBGL(v.1.61.0), RColorBrewer(v.1.1-2), yaml(v.2.2.0), curl(v.4.2), memoise(v.1.1.0), gridExtra(v.2.3), pander(v.0.6.3), ggplot2(v.3.2.1), rpart(v.4.1-15), biomaRt(v.2.41.9), latticeExtra(v.0.6-28), stringi(v.1.4.3), RSQLite(v.2.1.2), genefilter(v.1.67.1), S4Vectors(v.0.23.25), checkmate(v.1.9.4), GenomicFeatures(v.1.37.4), caTools(v.1.17.1.2), zip(v.2.0.4), BiocParallel(v.1.19.4), GenomeInfoDb(v.1.21.2), rlang(v.0.4.0), pkgconfig(v.2.0.3), matrixStats(v.0.55.0), bitops(v.1.0-6), evaluate(v.0.14), lattice(v.0.20-38), purrr(v.0.3.3), htmlwidgets(v.1.5.1), GenomicAlignments(v.1.21.7), labeling(v.0.3), bit(v.1.1-14), tidyselect(v.0.2.5), plyr(v.1.8.4), magrittr(v.1.5), DESeq2(v.1.25.17), R6(v.2.4.0), snow(v.0.4-3), IRanges(v.2.19.17), gplots(v.3.0.1.1), Hmisc(v.4.2-0), DelayedArray(v.0.11.8), DBI(v.1.0.0), pillar(v.1.4.2), foreign(v.0.8-72), mgcv(v.1.8-29), survival(v.2.44-1.1), RCurl(v.1.95-4.12), nnet(v.7.3-12), tibble(v.2.1.3), crayon(v.1.3.4), KernSmooth(v.2.23-16), OrganismDbi(v.1.27.1), BiocFileCache(v.1.9.1), rmarkdown(v.1.16), progress(v.1.2.2), locfit(v.1.5-9.1), grid(v.3.6.1), sva(v.3.33.1), data.table(v.1.12.6), blob(v.1.2.0), digest(v.0.6.22), xtable(v.1.8-4), openssl(v.1.4.1), stats4(v.3.6.1), munsell(v.0.5.0), askpass(v.1.1) and quadprog(v.1.5-7)