This document is concerned with analyzing TNSeq from S.pyogenes, primarily from the perspective of changing Zn concentration in the media. This was done with samples where Cu concentrations were changed; those are not included in the final analyses, but are important for the variance calculations and so are kept here.
There were some other samples in this series of runs which were not relevant; so lets subset the data to include only the relevant samples.
## There were 48, now there are 42 samples.
I am doing this fresh after a long time; I don’t really remember what these plots look like, but I am betting there is a huge batch-like effect which depends on the library precursor. But before that, let us look at their saturations and other metrics.
## So the saddest library has 3,922,047 reads, that is not terrible I think.
## A few samples might be a problem: hpgl0898, hpgl0879; but I am guessing a factor
## of <4 between the highest and lowest samples should not be too big of a problem.
rpmi_metrics$nonzero## Lower y-axis samples are more likely to be troublesome (looking at you hpgl0866, 898, 878, 869).
rpmi_metrics$corheat## But at least the coefficients of variance are similar. wait, we only have 2 thy samples?
rpmi_metrics$densityI just copied/pasted the following from the previous worksheet, but I am seeing that I did not actually use these plots. I think I will leave them here for now.
rpmi_filt <- sm(normalize_expt(rpmi_expt, filter=TRUE))
## This will likely be our data for DE analyses.
rpmi_norm <- sm(normalize_expt(rpmi_expt, filter=TRUE, convert="cpm", norm="quant",
transform="log2"))
## This, and a sva version of it, will be used for visualizing.
rpmi_batch <- sm(normalize_expt(rpmi_filt, filter=TRUE, convert="cpm",
batch="svaseq", transform="log2"))
rpmi_norm_metrics <- sm(graph_metrics(rpmi_norm))After log2 and such, PCA becomes informative. Let us therefore look first at the clustering before any surrogate estimation.
So I think once again, the primary surrogate variable is not batch per se, but stems from how the libraries diverge after their selection but before collection/library prepration.
Given the wretched clustering observed, I figure I should try a couple tools from ruv/sva and see if they help. It looks like I prefered fsva in my earlier iteration of this visualization. I have since come to the feeling that fsva is a bit weird; and so I think I will use svaseq instead.
rpmi_batch1 <- sm(normalize_expt(rpmi_expt, transform="log2", convert="cpm",
filter=TRUE, batch="fsva"))
plot_pca(rpmi_batch1)$plot## This looks a bit more encouraging.
rpmi_batch2 <- sm(normalize_expt(rpmi_expt, transform="log2", convert="cpm",
filter=TRUE, batch="svaseq"))
plot_pca(rpmi_batch2)$plot## as does this.
rpmi_batch_written <- sm(
write_expt(expt=rpmi_expt, transform="log2", convert="cpm",
filter=TRUE, batch="svaseq", violin=TRUE,
excel=paste0("excel/", rundate, "_rpmi_svaseq-v", ver, ".xlsx")))varpart_test <- varpart(expt=rpmi_filt, predictor=NULL,
factors=c("coverage", "replicate", "time", "cuzn", "medium"))## Warning in varpart(expt = rpmi_filt, predictor = NULL, factors =
## c("coverage", : Renaming this soon to 'simple_varpart().'
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ (1|coverage) + (1|replicate) + (1|time) + (1|cuzn) + (1|medium)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.2 min
## Placing factor: coverage at the beginning of the model.
## There is 1 batch in the data, fitting condition+batch will fail.
## batch_counts: Before batch/surrogate estimation, 274 entries are x<=0.
## The be method chose 4 surrogate variable(s).
## Attempting pca surrogate estimation with 4 surrogates.
## batch_counts: Before batch/surrogate estimation, 274 entries are x<=0.
## The be method chose 4 surrogate variable(s).
## Attempting sva supervised surrogate estimation with 4 surrogates.
## batch_counts: Before batch/surrogate estimation, 274 entries are x<=0.
## The be method chose 4 surrogate variable(s).
## Attempting sva unsupervised surrogate estimation with 4 surrogates.
## batch_counts: Before batch/surrogate estimation, 274 entries are x<=0.
## The be method chose 4 surrogate variable(s).
## Attempting ruvseq supervised surrogate estimation with 4 surrogates.
## batch_counts: Before batch/surrogate estimation, 274 entries are x<=0.
## The be method chose 4 surrogate variable(s).
## Attempting ruvseq residual surrogate estimation with 4 surrogates.
## batch_counts: Before batch/surrogate estimation, 274 entries are x<=0.
## The be method chose 4 surrogate variable(s).
## Attempting ruvseq empirical surrogate estimation with 4 surrogates.
## Warning in cor(first_svs): the standard deviation is zero
## A friendly reminder that there is only 1 batch in the data.
## 2/7: Performing lmFit(data) etc. with pca in the model.
## 3/7: Performing lmFit(data) etc. with sva_sup in the model.
## 4/7: Performing lmFit(data) etc. with sva_unsup in the model.
## 5/7: Performing lmFit(data) etc. with ruv_sup in the model.
## 6/7: Performing lmFit(data) etc. with ruv_resid in the model.
## 7/7: Performing lmFit(data) etc. with ruv_emp in the model.
## Given the contribution of coverage in the variancePartition results above, one might
## assume that the library sizes will correspond to the surrogates detected by sva and friends.
## This appears to not be the case.
surrogate_test$plot## it looks like the various surrogate estimators mostly agree on this data --
## especially the sva-based ones.## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset c89fbb1a6badaccdac9bfbc109bd1fe2d673b639
## This is hpgltools commit: Sun Jan 13 20:46:37 2019 -0500: c89fbb1a6badaccdac9bfbc109bd1fe2d673b639
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: variancePartition(v.1.12.1), ruv(v.0.9.7), bindrcpp(v.0.2.2), hpgltools(v.2018.11), Biobase(v.2.42.0) and BiocGenerics(v.0.28.0)
loaded via a namespace (and not attached): R.utils(v.2.7.0), tidyselect(v.0.2.5), lme4(v.1.1-19), RSQLite(v.2.1.1), AnnotationDbi(v.1.44.0), grid(v.3.5.2), BiocParallel(v.1.16.5), Rtsne(v.0.15), devtools(v.2.0.1), DESeq(v.1.34.1), munsell(v.0.5.0), codetools(v.0.2-16), preprocessCore(v.1.45.0), units(v.0.6-2), statmod(v.1.4.30), withr(v.2.1.2), colorspace(v.1.3-2), GOSemSim(v.2.8.0), knitr(v.1.21), rstudioapi(v.0.9.0), stats4(v.3.5.2), DOSE(v.3.8.0), labeling(v.0.3), urltools(v.1.7.1), GenomeInfoDbData(v.1.2.0), hwriter(v.1.3.2), bit64(v.0.9-7), farver(v.1.1.0), rprojroot(v.1.3-2), xfun(v.0.4), EDASeq(v.2.16.3), R6(v.2.3.0), doParallel(v.1.0.14), GenomeInfoDb(v.1.18.1), locfit(v.1.5-9.1), bitops(v.1.0-6), fgsea(v.1.8.0), gridGraphics(v.0.3-0), DelayedArray(v.0.8.0), assertthat(v.0.2.0), scales(v.1.0.0), ggraph(v.1.0.2), enrichplot(v.1.2.0), gtable(v.0.2.0), RUVSeq(v.1.16.1), sva(v.3.30.1), processx(v.3.2.1), rlang(v.0.3.1), genefilter(v.1.64.0), splines(v.3.5.2), rtracklayer(v.1.42.1), lazyeval(v.0.2.1), europepmc(v.0.3), yaml(v.2.2.0), reshape2(v.1.4.3), GenomicFeatures(v.1.34.1), backports(v.1.1.3), qvalue(v.2.14.1), clusterProfiler(v.3.10.1), tools(v.3.5.2), usethis(v.1.4.0), ggplotify(v.0.0.3), ggplot2(v.3.1.0), gplots(v.3.0.1), RColorBrewer(v.1.1-2), sessioninfo(v.1.1.1), ggridges(v.0.5.1), Rcpp(v.1.0.0), plyr(v.1.8.4), base64enc(v.0.1-3), progress(v.1.2.0), zlibbioc(v.1.28.0), purrr(v.0.2.5), RCurl(v.1.95-4.11), ps(v.1.3.0), prettyunits(v.1.0.2), viridis(v.0.5.1), cowplot(v.0.9.4), S4Vectors(v.0.20.1), SummarizedExperiment(v.1.12.0), ggrepel(v.0.8.0), colorRamps(v.2.3), fs(v.1.2.6), magrittr(v.1.5), data.table(v.1.11.8), DO.db(v.2.9), openxlsx(v.4.1.0), triebeard(v.0.3.0), packrat(v.0.5.0), matrixStats(v.0.54.0), pkgload(v.1.0.2), aroma.light(v.3.12.0), hms(v.0.4.2), evaluate(v.0.12), xtable(v.1.8-3), pbkrtest(v.0.4-7), XML(v.3.98-1.16), IRanges(v.2.16.0), gridExtra(v.2.3), testthat(v.2.0.1), compiler(v.3.5.2), biomaRt(v.2.38.0), tibble(v.2.0.0), KernSmooth(v.2.23-15), crayon(v.1.3.4), minqa(v.1.2.4), R.oo(v.1.22.0), htmltools(v.0.3.6), mgcv(v.1.8-26), corpcor(v.1.6.9), tidyr(v.0.8.2), geneplotter(v.1.60.0), DBI(v.1.0.0), corrplot(v.0.84), tweenr(v.1.0.1), MASS(v.7.3-51.1), ShortRead(v.1.40.0), Matrix(v.1.2-15), cli(v.1.0.1), quadprog(v.1.5-5), R.methodsS3(v.1.7.1), gdata(v.2.18.0), bindr(v.0.1.1), igraph(v.1.2.2), GenomicRanges(v.1.34.0), pkgconfig(v.2.0.2), rvcheck(v.0.1.3), GenomicAlignments(v.1.18.1), xml2(v.1.2.0), foreach(v.1.4.4), annotate(v.1.60.0), XVector(v.0.22.0), stringr(v.1.3.1), callr(v.3.1.1), digest(v.0.6.18), Biostrings(v.2.50.2), rmarkdown(v.1.11), fastmatch(v.1.1-0), edgeR(v.3.24.3), directlabels(v.2018.05.22), Rsamtools(v.1.34.0), gtools(v.3.8.1), nloptr(v.1.2.1), nlme(v.3.1-137), jsonlite(v.1.6), desc(v.1.2.0), viridisLite(v.0.3.0), limma(v.3.38.3), pillar(v.1.3.1), lattice(v.0.20-38), httr(v.1.4.0), pkgbuild(v.1.0.2), survival(v.2.43-3), GO.db(v.3.7.0), glue(v.1.3.0), remotes(v.2.0.2), zip(v.1.0.0), UpSetR(v.1.3.3), iterators(v.1.0.10), pander(v.0.6.3), bit(v.1.1-14), ggforce(v.0.1.3), stringi(v.1.2.4), blob(v.1.1.1), latticeExtra(v.0.6-28), caTools(v.1.17.1.1), memoise(v.1.1.0) and dplyr(v.0.7.8)
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))## Saving to 20190115_02_sample_estimation-v20190115.rda.xz