This document should make clear the suitability of our yeast data for differential expression analyses. It should also give some ideas about the depth and distribution of the data.
There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.
sc_filt <- sm(normalize_expt(sc2_expt, filter=TRUE))
sc2_libsize <- plot_libsize(sc2_expt)
sc2_libsize
## The range is from 22-36 million reads, that seems quite reasonable to me.
sc2_nonzero <- plot_nonzero(sc_filt)
sc2_nonzero
## Wow, these samples are super-well represented. The 'worst' samples have only 3 more genes with
## ~0 reads than the 'best'. That is amazing.
sc2_density <- sm(plot_density(sc_filt))
sc2_density
## These look quite good for non-normalized data.
sc2_boxplot <- sm(plot_boxplot(sc_filt))
sc2_boxplot
## This is the same as the density plot, looks good to me.
Here we will apply the ‘default’ normalization preferred by Najib and then re-graph the data to look at the distributions again.
sc_norm <- sm(normalize_expt(sc2_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
sc_metrics <- sm(graph_metrics(sc_norm))
The data should now be normalized, lets view some metrics post-facto.
sc_metrics$corheat
## There is definitely something odd going on here.
## It looks like many of the samples are clustering in pairs where one group of pairs
## consists of the first 8 samples and the second group consists of the second 8 samples.
## This is analagous to some observations we made in IGV and somewhat worrisome.
sc_metrics$smc
## The samples are very well behaved, none fall below the red line.
sc_metrics$pcaplot
## This shows us that peculiar split in another way. That is really strange.
In the following stanza I will apply a few batch/surrogate estimates in an attempt to get a handle on what is going on with these strange data splits.
In each case I will plot a quick pca to get some idea of what happened.
For these estimates, I will leave off cpm/quantile normalizations to try to isolate the effects of only the surrogate estimates on the data.
Start with the default unsupervised sva estimate.
sc2_sva <- sm(normalize_expt(sc2_expt, transform="log2",
filter=TRUE, batch="sva", low_to_zero=TRUE))
sc2_sva_metrics <- sm(graph_metrics(sc2_sva))
sc2_sva_metrics$pcaplot
## hmm better I suppose, but hpgl0774 is 100% kooky.
Sometimes combat provides a nice way to see what is going on, let us try that.
sc2_combat <- sm(normalize_expt(sc2_expt, transform="log2",
filter=TRUE, batch="combat_scale", low_to_zero=TRUE))
sc2_combat_metrics <- sm(graph_metrics(sc2_combat))
sc2_combat_metrics$pcaplot
## Well, that did exactly the opposite of what I wanted.
## But on the other hand, the peculiar batch effects are now super-visible, I guess that is useful.
This is a modification of the sva defaults, here is a quote from the documentation:
“This function performs frozen surrogate variable analysis as described in Parker, Corrada Bravo and Leek 2013. The approach uses a training database to create surrogate variables which are then used to remove batch effects both from the training database and a new data set for prediction purposes.”
sc2_fsva <- sm(normalize_expt(sc2_expt, transform="log2",
filter=TRUE, batch="fsva", low_to_zero=TRUE))
sc2_fsva_metrics <- sm(graph_metrics(sc2_fsva))
fsva seems like it might be useful, but also potentially a too-large modification of the data.
sc2_fsva_metrics$pcaplot
## Well, it does look pretty.
I want to apply a few tools to try to get a handle on where this strange variance is coming from.
varhunt <- pca_information(expt_data=sc2_expt,
expt_factors=c("condition","batch","originalbatch", "incubationtime"))
## More shallow curves in these plots suggest more genes in this principle component.
## condition
## batch
## originalbatch
## incubationtime
varhunt$anova_fstat_heatmap
## Both batch estimates are very strong on PC1 and PC2, incubation time comes out on the 3rd PC.
## The condition looks like it is stuck with with, unfortunately.
sc2_varpart <- varpart(expt=sc2_expt, predictor=NULL,
factors=c("condition","batch","originalbatch", "incubationtime"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch) + (1|originalbatch) + (1|incubationtime)
## Fitting the expressionset to the model, this is slow.
## (Eg. Take the projected run time and mulitply by 3-6 and round up.)
## Projected run time: ~ 2 min
sc2_varpart$partition_plot
## Well here at least, it looks like batch kicks ass.
Let us apply the same set of normalizations/surrogate estimates and write out the resulting tables.
fun <- write_expt(sc2_expt, excel=paste0("excel/samples_written-v", ver, ".xlsx"),
filter=TRUE, norm="raw", batch="fsva", transform="log2", violin=TRUE)
## Writing the legend.
## The sheet: legend is in legend.
## Writing the raw reads.
## Graphing the raw reads.
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## (Eg. Take the projected run time and mulitply by 3-6 and round up.)
## Projected run time: ~ 1 min
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## Writing the normalized reads.
## Graphing the normalized reads.
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## (Eg. Take the projected run time and mulitply by 3-6 and round up.)
## Projected run time: ~ 2 min
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## Writing the median reads by factor.
## The factor mtc_mtu has 4 rows.
## The factor mtc_wtu has 4 rows.
## The factor wtc_mtu has 4 rows.
## The factor wtc_wtu has 4 rows.
library("pander")
pander(sessionInfo())
R version 3.3.3 (2017-03-06)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: pander(v.0.6.0), hpgltools(v.2017.01) and ruv(v.0.9.6)
loaded via a namespace (and not attached): nlme(v.3.1-131), bitops(v.1.0-6), matrixStats(v.0.52.2), pbkrtest(v.0.4-7), devtools(v.1.13.1), doParallel(v.1.0.10), RColorBrewer(v.1.1-2), rprojroot(v.1.2), GenomeInfoDb(v.1.10.3), backports(v.1.0.5), tools(v.3.3.3), R6(v.2.2.1), KernSmooth(v.2.23-15), DBI(v.0.6-1), lazyeval(v.0.2.0), BiocGenerics(v.0.20.0), mgcv(v.1.8-17), colorspace(v.1.3-2), withr(v.1.0.2), compiler(v.3.3.3), preprocessCore(v.1.36.0), Biobase(v.2.34.0), xml2(v.1.1.1), rtracklayer(v.1.34.2), labeling(v.0.3), caTools(v.1.17.1), scales(v.0.4.1), genefilter(v.1.56.0), commonmark(v.1.2), stringr(v.1.2.0), digest(v.0.6.12), Rsamtools(v.1.26.2), minqa(v.1.2.4), variancePartition(v.1.4.2), rmarkdown(v.1.5), colorRamps(v.2.3), XVector(v.0.14.1), base64enc(v.0.1-3), htmltools(v.0.3.6), lme4(v.1.1-13), limma(v.3.30.13), rlang(v.0.1.1), RSQLite(v.1.1-2), BiocParallel(v.1.8.2), gtools(v.3.5.0), RCurl(v.1.95-4.8), magrittr(v.1.5), Matrix(v.1.2-10), Rcpp(v.0.12.11), munsell(v.0.4.3), S4Vectors(v.0.12.2), yaml(v.2.1.14), stringi(v.1.1.5), edgeR(v.3.16.5), MASS(v.7.3-47), SummarizedExperiment(v.1.4.0), zlibbioc(v.1.20.0), gplots(v.3.0.1), plyr(v.1.8.4), grid(v.3.3.3), parallel(v.3.3.3), gdata(v.2.17.0), ggrepel(v.0.6.5), crayon(v.1.3.2), lattice(v.0.20-35), Biostrings(v.2.42.1), splines(v.3.3.3), GenomicFeatures(v.1.26.4), annotate(v.1.52.1), locfit(v.1.5-9.1), knitr(v.1.16), GenomicRanges(v.1.26.4), corpcor(v.1.6.9), reshape2(v.1.4.2), codetools(v.0.2-15), biomaRt(v.2.30.0), stats4(v.3.3.3), XML(v.3.98-1.7), evaluate(v.0.10), data.table(v.1.10.4), nloptr(v.1.0.4), foreach(v.1.4.3), testthat(v.1.0.2), gtable(v.0.2.0), ggplot2(v.2.2.1), openxlsx(v.4.0.17), xtable(v.1.8-2), roxygen2(v.6.0.1), survival(v.2.41-3), tibble(v.1.3.1), iterators(v.1.0.8), GenomicAlignments(v.1.10.1), AnnotationDbi(v.1.36.2), memoise(v.1.1.0), IRanges(v.2.8.2) and sva(v.3.22.0)