This document should make clear the suitability of our yeast data for differential expression analyses. It should also give some ideas about the depth and distribution of the data.
There are lots of toys we have learned to use to play with raw data and explore things like batch effects, non-canonical distributions, or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below, and many more, can be generated with the wrapper ‘graph_metrics()’, but using it takes away the chance to explain the graphs as I generate them.
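For reference, the all-in-one route would look something like the following sketch (not evaluated here; it assumes the sc2_expt object created in the previous document):
all_metrics <- sm(graph_metrics(sc2_expt))  ## generate the full set of metric plots in one call
all_metrics$pcaplot                         ## individual plots are then pulled out by name
Instead, the next few blocks build the individual plots one at a time.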
sc_filt <- sm(normalize_expt(sc2_expt, filter=TRUE))
sc2_libsize <- plot_libsize(sc2_expt)
sc2_libsize
## The range is from 22-36 million reads, which seems quite reasonable to me.
sc2_nonzero <- plot_nonzero(sc_filt)
sc2_nonzero
## Wow, these samples are super-well represented. The 'worst' samples have only 3 more genes with
## ~0 reads than the 'best'. That is amazing.
sc2_density <- sm(plot_density(sc_filt))
sc2_density
## These look quite good for non-normalized data.
sc2_boxplot <- sm(plot_boxplot(sc_filt))
sc2_boxplot
## This is the same as the density plot, looks good to me.
Here we will apply the ‘default’ normalization preferred by Najib and then re-graph the data to look at the distributions again.
sc_norm <- sm(normalize_expt(sc2_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
sc_metrics <- sm(graph_metrics(sc_norm))
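For anyone unfamiliar with what those options mean, here is a rough hand-rolled equivalent on a plain genes-by-samples count matrix. The matrix name ‘counts’ and the filter threshold are made up for illustration, and the exact ordering and implementation inside normalize_expt() may differ; treat this as a sketch only.
library(edgeR)   ## provides cpm()
library(limma)   ## provides normalizeQuantiles()
kept <- counts[rowSums(counts) > 2, ]   ## filter=TRUE: drop very low-count rows (threshold is illustrative)
cpms <- cpm(kept)                       ## convert="cpm": counts per million
quant <- normalizeQuantiles(cpms)       ## norm="quant": quantile normalize across samples
l2 <- log2(quant + 1)                   ## transform="log2" with a pseudocount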
The data should now be normalized; let us view some metrics after the fact.
sc_metrics$corheat
## There is definitely something odd going on here.
## It looks like many of the samples are clustering in pairs where one group of pairs
## consists of the first 8 samples and the second group consists of the second 8 samples.
## This is analogous to some observations we made in IGV and somewhat worrisome.
sc_metrics$smc
## The samples are very well behaved; none fall below the red line.
sc_metrics$pcaplot
## This shows us that peculiar split in another way. That is really strange.
sc_metrics$tsneplot
In the following stanza I will apply a few batch/surrogate estimates in an attempt to get a handle on what is going on with these strange data splits.
In each case I will plot a quick pca to get some idea of what happened.
For these estimates, I will leave off the cpm/quantile normalizations in order to isolate the effects of the surrogate estimates alone.
Start with the default unsupervised sva estimate.
sc2_sva <- sm(normalize_expt(sc2_expt, transform="log2",
filter=TRUE, batch="sva", low_to_zero=TRUE))
sc2_sva_metrics <- sm(graph_metrics(sc2_sva))
sc2_sva_metrics$pcaplot
## Hmm, better I suppose, but hpgl0774 is 100% kooky.
sc2_sva_metrics$tsneplot
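For reference, this is roughly what an unsupervised sva estimate looks like when called directly from the sva package. The objects ‘l2mat’ (a filtered, log2 genes-by-samples matrix) and ‘meta’ (the sample sheet with a condition column) are stand-ins, and normalize_expt() may handle the details differently.
library(sva)
mod <- model.matrix(~ condition, data=meta)     ## full model: the variable of interest
mod0 <- model.matrix(~ 1, data=meta)            ## null model: intercept only
n_sv <- num.sv(l2mat, mod, method="be")         ## estimate how many surrogates to fit
sva_result <- sva(l2mat, mod, mod0, n.sv=n_sv)  ## unsupervised surrogate variable estimation
head(sva_result$sv)                             ## one column of per-sample weights per surrogate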
Sometimes ComBat provides a nice way to see what is going on; let us try that.
sc2_combat <- sm(normalize_expt(sc2_expt, transform="log2",
filter=TRUE, batch="combat_scale", low_to_zero=TRUE))
sc2_combat_metrics <- sm(graph_metrics(sc2_combat))
sc2_combat_metrics$pcaplot
## Well, that did exactly the opposite of what I wanted.
## But on the other hand, the peculiar batch effects are now super-visible, I guess that is useful.
sc2_combat_metrics$tsneplot
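For comparison, a direct ComBat call from the sva package looks roughly like the following. Again ‘l2mat’ and ‘meta’ are stand-ins, and I am guessing that the ‘combat_scale’ option corresponds to adjusting both location and scale (mean.only=FALSE).
library(sva)
mod <- model.matrix(~ condition, data=meta)     ## protect the condition of interest
combat_mat <- ComBat(dat=l2mat, batch=meta$batch, mod=mod,
                     par.prior=TRUE, mean.only=FALSE)  ## adjust both mean and variance per batch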
This is a modification of the sva defaults; here is a quote from the fsva documentation:
“This function performs frozen surrogate variable analysis as described in Parker, Corrada Bravo and Leek 2013. The approach uses a training database to create surrogate variables which are then used to remove batch effects both from the training database and a new data set for prediction purposes.”
sc2_fsva <- sm(normalize_expt(sc2_expt, transform="log2",
filter=TRUE, batch="fsva"))
sc2_fsva_metrics <- sm(graph_metrics(sc2_fsva))
fsva seems like it might be useful, but it is also potentially too large a modification of the data.
sc2_fsva_metrics$pcaplot
sc2_fsva_metrics$tsneplot
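The documented fsva interface takes a training ("database") matrix plus an sva fit, and optionally a new data set to adjust with the frozen surrogates. A sketch with hypothetical matrices ‘train_mat’ and ‘new_mat’, reusing the ‘mod’ and ‘sva_result’ objects from the sva sketch above, would be:
library(sva)
fsva_result <- fsva(dbdat=train_mat, mod=mod, sv=sva_result,
                    newdat=new_mat, method="exact")
fsva_result$db    ## the training data with surrogate effects removed
fsva_result$new   ## the new data adjusted using the frozen surrogates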
I want to apply a few tools to try to get a handle on where this strange variance is coming from.
varhunt <- pca_information(expt_data=sc2_expt,
expt_factors=c("condition","batch","originalbatch", "incubationtime"))
## Shallower curves in these plots suggest more genes contributing to a given principal component.
varhunt$anova_fstat_heatmap
## Both batch estimates are very strong on PC1 and PC2, incubation time comes out on the 3rd PC.
## The condition, unfortunately, looks like it is stuck in with them.
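The same question can be asked by hand with a quick PCA and a per-component ANOVA; this sketch again assumes the ‘l2mat’ and ‘meta’ stand-ins from above.
pcs <- prcomp(t(l2mat))                           ## samples in rows for prcomp
summary(aov(pcs$x[, 1] ~ meta$batch))             ## does batch explain PC1?
summary(aov(pcs$x[, 3] ~ meta$incubationtime))    ## does incubation time explain PC3?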
sc2_varpart <- varpart(expt=sc2_expt, predictor=NULL,
factors=c("condition","batch","originalbatch", "incubationtime"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch) + (1|originalbatch) + (1|incubationtime)
## Fitting the expressionset to the model, this is slow.
## (E.g. take the projected run time, multiply by 3-6, and round up.)
## Projected run time: ~ 2 min
## Placing factor: condition at the beginning of the model.
sc2_varpart$partition_plot
## Well here at least, it looks like batch kicks ass.
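The varpart() wrapper appears to sit on top of the variancePartition package (it is in the session info below); calling that package directly looks roughly like this, with the same ‘l2mat’/‘meta’ stand-ins and the model printed above.
library(variancePartition)
form <- ~ (1|condition) + (1|batch) + (1|originalbatch) + (1|incubationtime)
var_fractions <- fitExtractVarPartModel(l2mat, form, meta)  ## per-gene variance fractions
plotVarPart(sortCols(var_fractions))                        ## violin plot, one column per factor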
Let us apply the same set of normalizations/surrogate estimates and write out the resulting tables.
fun <- write_expt(sc2_expt, excel=paste0("excel/samples_written-v", ver, ".xlsx"),
filter=TRUE, norm="raw", batch="fsva", transform="log2", violin=TRUE)
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## (E.g. take the projected run time, multiply by 3-6, and round up.)
## Projected run time: ~ 2 min
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## (E.g. take the projected run time, multiply by 3-6, and round up.)
## Projected run time: ~ 2 min
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.
## The factor mtc_mtu has 4 rows.
## The factor mtc_wtu has 4 rows.
## The factor wtc_mtu has 4 rows.
## The factor wtc_wtu has 4 rows.
pander::pander(sessionInfo())
R version 3.4.1 (2017-06-30)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: hpgltools(v.2017.01) and ruv(v.0.9.6)
loaded via a namespace (and not attached): nlme(v.3.1-131), pbkrtest(v.0.4-7), bitops(v.1.0-6), matrixStats(v.0.52.2), devtools(v.1.13.3), bit64(v.0.9-7), doParallel(v.1.0.10), RColorBrewer(v.1.1-2), rprojroot(v.1.2), GenomeInfoDb(v.1.12.2), backports(v.1.1.0), tools(v.3.4.1), R6(v.2.2.2), KernSmooth(v.2.23-15), DBI(v.0.7), lazyeval(v.0.2.0), BiocGenerics(v.0.22.0), mgcv(v.1.8-20), colorspace(v.1.3-2), withr(v.2.0.0), bit(v.1.1-12), compiler(v.3.4.1), preprocessCore(v.1.38.1), Biobase(v.2.36.2), xml2(v.1.1.1), DelayedArray(v.0.2.7), rtracklayer(v.1.36.4), labeling(v.0.3), caTools(v.1.17.1), scales(v.0.5.0), genefilter(v.1.58.1), quadprog(v.1.5-5), commonmark(v.1.4), stringr(v.1.2.0), digest(v.0.6.12), Rsamtools(v.1.28.0), minqa(v.1.2.4), rmarkdown(v.1.6), variancePartition(v.1.6.0), colorRamps(v.2.3), XVector(v.0.16.0), base64enc(v.0.1-3), htmltools(v.0.3.6), lme4(v.1.1-13), limma(v.3.32.6), rlang(v.0.1.2), RSQLite(v.2.0), BiocParallel(v.1.10.1), gtools(v.3.5.0), RCurl(v.1.95-4.8), magrittr(v.1.5), GenomeInfoDbData(v.0.99.0), Matrix(v.1.2-11), Rcpp(v.0.12.12), munsell(v.0.4.3), S4Vectors(v.0.14.4), yaml(v.2.1.14), stringi(v.1.1.5), edgeR(v.3.18.1), MASS(v.7.3-47), SummarizedExperiment(v.1.6.4), zlibbioc(v.1.22.0), gplots(v.3.0.1), Rtsne(v.0.13), plyr(v.1.8.4), grid(v.3.4.1), blob(v.1.1.0), parallel(v.3.4.1), gdata(v.2.18.0), ggrepel(v.0.6.5), crayon(v.1.3.4), lattice(v.0.20-35), Biostrings(v.2.44.2), splines(v.3.4.1), pander(v.0.6.1), GenomicFeatures(v.1.28.4), annotate(v.1.54.0), locfit(v.1.5-9.1), knitr(v.1.17), GenomicRanges(v.1.28.5), corpcor(v.1.6.9), reshape2(v.1.4.2), codetools(v.0.2-15), biomaRt(v.2.32.1), stats4(v.3.4.1), XML(v.3.98-1.9), evaluate(v.0.10.1), data.table(v.1.10.4), nloptr(v.1.0.4), foreach(v.1.4.3), testthat(v.1.0.2), gtable(v.0.2.0), ggplot2(v.2.2.1), openxlsx(v.4.0.17), xtable(v.1.8-2), roxygen2(v.6.0.1), survival(v.2.41-3), tibble(v.1.3.4), iterators(v.1.0.8), GenomicAlignments(v.1.12.2), AnnotationDbi(v.1.38.2), memoise(v.1.1.0), IRanges(v.2.10.3), sva(v.3.24.4) and directlabels(v.2017.03.31)
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 6cdd1a6804aefc60944ec9c5ff08ab7e97f5b17c
## R> packrat::restore()
## This is hpgltools commit: Wed Sep 6 18:39:00 2017 -0400: 6cdd1a6804aefc60944ec9c5ff08ab7e97f5b17c
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_sample_estimation-v20170915.rda.xz
tmp <- sm(saveme(filename=this_save))