There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.
lm_expt <- exclude_genes_expt(expt=lm_expt, column="Type", method="keep", patterns=c("protein_coding"))
## Before removal, there were 9465 entries.
## Now there are 8298 entries.
## Percent kept: 4.863, 4.607, 4.913, 4.743, 5.304, 3.973, 5.594, 4.616, 9.772, 7.150, 13.356, 8.280, 7.956, 7.363, 14.370, 9.896, 6.071, 5.333, 6.298, 4.676, 6.156, 5.918, 6.538, 5.418
## Percent removed: 95.137, 95.393, 95.087, 95.257, 94.696, 96.027, 94.406, 95.384, 90.228, 92.850, 86.644, 91.720, 92.044, 92.637, 85.630, 90.104, 93.929, 94.667, 93.702, 95.324, 93.844, 94.082, 93.462, 94.582
plot_pct_kept(lm_expt)
lm_filt <- sm(normalize_expt(lm_expt, filter=TRUE))
lm_libsize <- plot_libsize(lm_expt)
## The scale difference between the smallest and largest
## libraries is > 10. Assuming a log10 scale is better, set scale=FALSE if not.
lm_libsize$plot
lm_nonzero <- plot_nonzero(lm_filt)
lm_nonzero$plot
lm_density <- plot_density(lm_filt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 45266 zero count features.
lm_density$plot
lm_boxplot <- plot_boxplot(lm_filt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, adding 1 to the data.
## Changed 45266 zero count features.
lm_boxplot
Before going any further, look at the library size plot. I think it pretty much guarantees that this data is not usable.
However, I will continue in the hopes that I can lay down a framework for future data and maybe diagnose what is going on in this data.
lm_norm <- sm(normalize_expt(lm_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
lm_metrics <- sm(graph_metrics(lm_norm))
lm_metrics$pcaplot
lm_surrogates <- pca_information(expt_data=lm_expt, expt_factors=c("condition", "averagesize"), plot_pcas=TRUE)
## More shallow curves in these plots suggest more genes in this principle component.
## The minimum size is: 2 and the maximum size is: 2
##lm_varpart <- varpart(expt=lm_expt, predictor=NULL, factors=c("condition"))
##lm_varpart$partition_plot
The data should now be normalized, lets view some metrics post-facto.
lm_metrics$corheat
lm_metrics$smc
lm_metrics$pcaplot
lm_sva <- sm(normalize_expt(lm_expt, transform="log2", norm="quant",
filter=TRUE, batch="ssva", low_to_zero=TRUE))
lm_sva_metrics <- graph_metrics(lm_sva)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, adding 1 to the data.
## Changed 19588 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## The minimum size is: 2 and the maximum size is: 2
## Graphing a T-SNE plot.
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 19588 zero count features.
## Plotting the representation of the top-n genes.
## Printing a color to condition legend.
lm_sva_metrics$pcaplot
lm_fsva <- sm(normalize_expt(lm_expt, transform="log2", norm="quant",
filter=TRUE, batch="fsva", low_to_zero=TRUE))
lm_fsva_metrics <- graph_metrics(lm_fsva)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## The minimum size is: 2 and the maximum size is: 2
## Graphing a T-SNE plot.
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Plotting a density plot.
## Plotting the representation of the top-n genes.
## Printing a color to condition legend.
lm_fsva_metrics$pcaplot
lm_first <- write_expt(lm_expt, excel=paste0("excel/lm_samples_written_default-v", ver, ".xlsx"),
filter=TRUE, norm="quant", convert="cpm", batch="raw", transform="log2")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## The minimum size is: 2 and the maximum size is: 2
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor amastigote has 4 rows.
## The factor pmn_inf_12hr has 4 rows.
## The factor pmn_inf_14d has 4 rows.
## The factor pmn_uninf_12hr has 4 rows.
## The factor pmn_uninf_14d has 4 rows.
## The factor promastigote has 4 rows.
lm_second <- write_expt(lm_expt, excel=paste0("excel/lm_samples_written_ssva-v", ver, ".xlsx"),
filter=TRUE, norm="quant", convert="raw", batch="ssva", transform="log2")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## The minimum size is: 2 and the maximum size is: 2
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor amastigote has 4 rows.
## The factor pmn_inf_12hr has 4 rows.
## The factor pmn_inf_14d has 4 rows.
## The factor pmn_uninf_12hr has 4 rows.
## The factor pmn_uninf_14d has 4 rows.
## The factor promastigote has 4 rows.
lm_third <- write_expt(lm_expt, excel=paste0("excel/lm_samples_written_fsva-v", ver, ".xlsx"),
filter=TRUE, norm="quant", convert="raw", batch="fsva", transform="log2")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## The minimum size is: 2 and the maximum size is: 2
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor amastigote has 4 rows.
## The factor pmn_inf_12hr has 4 rows.
## The factor pmn_inf_14d has 4 rows.
## The factor pmn_uninf_12hr has 4 rows.
## The factor pmn_uninf_14d has 4 rows.
## The factor promastigote has 4 rows.
pander::pander(sessionInfo())
R version 3.4.4 (2018-03-15)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: hpgltools(v.2018.03) and ruv(v.0.9.7)
loaded via a namespace (and not attached): Biobase(v.2.38.0), edgeR(v.3.20.9), bit64(v.0.9-7), splines(v.3.4.4), foreach(v.1.4.4), gtools(v.3.5.0), stats4(v.3.4.4), pander(v.0.6.1), blob(v.1.1.0), yaml(v.2.1.18), ggrepel(v.0.7.0), pillar(v.1.2.1), RSQLite(v.2.0), backports(v.1.1.2), lattice(v.0.20-35), limma(v.3.34.9), quadprog(v.1.5-5), digest(v.0.6.15), RColorBrewer(v.1.1-2), colorspace(v.1.3-2), htmltools(v.0.3.6), preprocessCore(v.1.40.0), Matrix(v.1.2-12), plyr(v.1.8.4), XML(v.3.98-1.10), devtools(v.1.13.5), genefilter(v.1.60.0), xtable(v.1.8-2), corpcor(v.1.6.9), scales(v.0.5.0), openxlsx(v.4.0.17), Rtsne(v.0.13), BiocParallel(v.1.12.0), tibble(v.1.4.2), annotate(v.1.56.1), mgcv(v.1.8-23), IRanges(v.2.12.0), ggplot2(v.2.2.1), withr(v.2.1.2), BiocGenerics(v.0.24.0), lazyeval(v.0.2.1), survival(v.2.41-3), magrittr(v.1.5), evaluate(v.0.10.1), memoise(v.1.1.0), nlme(v.3.1-131.1), MASS(v.7.3-49), xml2(v.1.2.0), tools(v.3.4.4), directlabels(v.2017.03.31), data.table(v.1.10.4-3), matrixStats(v.0.53.1), stringr(v.1.3.0), S4Vectors(v.0.16.0), munsell(v.0.4.3), locfit(v.1.5-9.1), AnnotationDbi(v.1.40.0), compiler(v.3.4.4), rlang(v.0.2.0), grid(v.3.4.4), RCurl(v.1.95-4.10), iterators(v.1.0.9), base64enc(v.0.1-3), bitops(v.1.0-6), labeling(v.0.3), rmarkdown(v.1.9), gtable(v.0.2.0), codetools(v.0.2-15), DBI(v.0.8), roxygen2(v.6.0.1), reshape2(v.1.4.3), R6(v.2.2.2), gridExtra(v.2.3), knitr(v.1.20), bit(v.1.1-12), commonmark(v.1.4), rprojroot(v.1.3-2), KernSmooth(v.2.23-15), stringi(v.1.1.7), parallel(v.3.4.4), sva(v.3.26.0) and Rcpp(v.0.12.16)
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_sample_estimation_lmajor-v20170706.rda.xz
tt <- sm(saveme(filename=this_save))