There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.
mm_expt <- expt_exclude_genes(expt=mm_expt, column="Type", method="keep", patterns=c("protein_coding"))
## Before removal, there were 88198 entries.
## Now there are 78166 entries.
## Percent kept: 94.6918356393496, 95.6014030466883, 96.0192325721982, 94.1560724825351, 95.4277875550635, 95.5467362467663, 94.8747441594825, 95.3755486209204, 94.5476892156688, 94.7540455861558, 93.5291291731876, 93.0865081043525, 94.1866870128134, 94.3584950782656, 92.7195382093603, 94.3755866812252
## Percent removed: 5.30816436065035, 4.39859695331166, 3.98076742780181, 5.84392751746488, 4.57221244493648, 4.45326375323367, 5.12525584051753, 4.62445137907965, 5.45231078433121, 5.24595441384418, 6.47087082681239, 6.9134918956475, 5.81331298718665, 5.64150492173442, 7.2804617906397, 5.62441331877485
mm_filt <- normalize_expt(mm_expt, low_to_zero=TRUE)
## This function will replace the expt$expressionset slot with:
## raw(data)
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Filter is false, this should likely be set to something, good
## choices include cbcb, kofa, pofa (anything but FALSE). If you want this to
## stay FALSE, keep in mind that if other normalizations are performed, then the
## resulting libsizes are likely to be strange (potentially negative!)
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
mm_filt <- normalize_expt(mm_filt, filter="cbcb")
## This function will replace the expt$expressionset slot with:
## cbcb(data)
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 40835 low-count genes (37331 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
mm_libsize <- plot_libsize(mm_filt)
## The scale difference between the smallest and largest
## libraries is > 10. Assuming a log10 scale is better, set scale=FALSE if not.
mm_libsize
mm_nonzero <- plot_nonzero(mm_filt)
mm_nonzero
mm_density <- plot_density(mm_filt) + ggplot2::scale_x_continuous(limits=c(0, 5))
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 280477 zero count features.
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the
## existing scale.
mm_density
mm_boxplot <- plot_boxplot(mm_filt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 280477 zero count features.
mm_boxplot
mm_norm <- sm(normalize_expt(mm_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
mm_metrics <- sm(graph_metrics(mm_norm))
mm_metrics$pcaplot
##mm_varpart <- varpart(expt=mm_expt, predictor=NULL, factors=c("condition", "batch"))
##mm_varpart$partition_plot
The data should now be normalized, lets view some metrics post-facto.
mm_metrics$corheat
mm_metrics$smc
mm_metrics$pcaplot
mm_sva <- sm(normalize_expt(mm_expt, transform="log2", norm="quant",
filter=TRUE, batch="ssva", low_to_zero=TRUE))
mm_sva_metrics <- graph_metrics(mm_sva)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 111759 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 111759 zero count features.
## Printing a color to condition legend.
mm_sva_metrics$pcaplot
mm_fsva <- sm(normalize_expt(mm_expt, transform="log2", norm="quant",
filter=TRUE, batch="fsva", low_to_zero=TRUE))
mm_fsva_metrics <- graph_metrics(mm_fsva)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Plotting a density plot.
## Printing a color to condition legend.
mm_fsva_metrics$pcaplot
mm_first <- write_expt(mm_expt, excel=paste0("excel/mm_samples_written_default-v", ver, ".xlsx"),
filter=TRUE, norm="quant", convert="cpm", batch="raw", transform="log2")
## Writing the legend.
## The sheet: legend is in legend.
## Writing the raw reads.
## Graphing the raw reads.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## Writing the normalized reads.
## Graphing the normalized reads.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## Writing the median reads by factor.
## The factor pmn_inf_12hr has 4 rows.
## The factor pmn_inf_14d has 4 rows.
## The factor pmn_uninf_12hr has 4 rows.
## The factor pmn_uninf_14d has 4 rows.
mm_second <- write_expt(mm_expt, excel=paste0("excel/mm_samples_written_ssva-v", ver, ".xlsx"),
filter=TRUE, norm="quant", convert="raw", batch="ssva", transform="log2")
## Writing the legend.
## The sheet: legend is in legend.
## Writing the raw reads.
## Graphing the raw reads.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## Writing the normalized reads.
## Graphing the normalized reads.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## Writing the median reads by factor.
## The factor pmn_inf_12hr has 4 rows.
## The factor pmn_inf_14d has 4 rows.
## The factor pmn_uninf_12hr has 4 rows.
## The factor pmn_uninf_14d has 4 rows.
mm_third <- write_expt(mm_expt, excel=paste0("excel/mm_samples_written_fsva-v", ver, ".xlsx"),
filter=TRUE, norm="quant", convert="raw", batch="fsva", transform="log2")
## Writing the legend.
## The sheet: legend is in legend.
## Writing the raw reads.
## Graphing the raw reads.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## Writing the normalized reads.
## Graphing the normalized reads.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## Writing the median reads by factor.
## The factor pmn_inf_12hr has 4 rows.
## The factor pmn_inf_14d has 4 rows.
## The factor pmn_uninf_12hr has 4 rows.
## The factor pmn_uninf_14d has 4 rows.
pander::pander(sessionInfo())
R version 3.3.3 (2017-03-06)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: ruv(v.0.9.6) and hpgltools(v.2017.01)
loaded via a namespace (and not attached): ggrepel(v.0.6.5), Rcpp(v.0.12.11), locfit(v.1.5-9.1), lattice(v.0.20-35), corpcor(v.1.6.9), gtools(v.3.5.0), rprojroot(v.1.2), digest(v.0.6.12), foreach(v.1.4.3), R6(v.2.2.2), plyr(v.1.8.4), backports(v.1.1.0), stats4(v.3.3.3), RSQLite(v.2.0), evaluate(v.0.10.1), sva(v.3.22.0), ggplot2(v.2.2.1), rlang(v.0.1.1), lazyeval(v.0.2.0), data.table(v.1.10.4), annotate(v.1.52.1), blob(v.1.1.0), S4Vectors(v.0.12.2), Matrix(v.1.2-10), preprocessCore(v.1.36.0), rmarkdown(v.1.6), splines(v.3.3.3), labeling(v.0.3), devtools(v.1.13.2), pander(v.0.6.0), stringr(v.1.2.0), RCurl(v.1.95-4.8), bit(v.1.1-12), munsell(v.0.4.3), BiocGenerics(v.0.20.0), base64enc(v.0.1-3), mgcv(v.1.8-17), htmltools(v.0.3.6), tibble(v.1.3.3), roxygen2(v.6.0.1), edgeR(v.3.16.5), IRanges(v.2.8.2), codetools(v.0.2-15), matrixStats(v.0.52.2), XML(v.3.98-1.9), crayon(v.1.3.2), withr(v.1.0.2), bitops(v.1.0-6), commonmark(v.1.2), grid(v.3.3.3), nlme(v.3.1-131), xtable(v.1.8-2), gtable(v.0.2.0), DBI(v.0.7), magrittr(v.1.5), scales(v.0.4.1), stringi(v.1.1.5), reshape2(v.1.4.2), genefilter(v.1.56.0), testthat(v.1.0.2), limma(v.3.30.13), xml2(v.1.1.1), openxlsx(v.4.0.17), RColorBrewer(v.1.1-2), iterators(v.1.0.8), tools(v.3.3.3), bit64(v.0.9-7), Biobase(v.2.34.0), parallel(v.3.3.3), survival(v.2.41-3), yaml(v.2.1.14), AnnotationDbi(v.1.36.2), colorspace(v.1.3-2), memoise(v.1.1.0) and knitr(v.1.16)
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_sample_estimation_mmusculus-v20170706.rda.xz
tt <- sm(saveme(filename=this_save))