1 Sample Estimation: 20170706

2 L. major sample estimation

3 Visualizing raw data

There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.

lm_expt <- exclude_genes_expt(expt=lm_expt, column="Type", method="keep", patterns=c("protein_coding"))
## Before removal, there were 9465 entries.
## Now there are 8298 entries.
## Percent kept: 4.863, 4.607, 4.913, 4.743, 5.304, 3.973, 5.594, 4.616, 9.772, 7.150, 13.356, 8.280, 7.956, 7.363, 14.370, 9.896, 6.071, 5.333, 6.298, 4.676, 6.156, 5.918, 6.538, 5.418
## Percent removed: 95.137, 95.393, 95.087, 95.257, 94.696, 96.027, 94.406, 95.384, 90.228, 92.850, 86.644, 91.720, 92.044, 92.637, 85.630, 90.104, 93.929, 94.667, 93.702, 95.324, 93.844, 94.082, 93.462, 94.582
plot_pct_kept(lm_expt)

lm_filt <- sm(normalize_expt(lm_expt, filter=TRUE))

lm_libsize <- plot_libsize(lm_expt)
## The scale difference between the smallest and largest
## libraries is > 10. Assuming a log10 scale is better, set scale=FALSE if not.
lm_libsize$plot

lm_nonzero <- plot_nonzero(lm_filt)
lm_nonzero$plot

lm_density <- plot_density(lm_filt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 45266 zero count features.
lm_density$plot

lm_boxplot <- plot_boxplot(lm_filt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, adding 1 to the data.
## Changed 45266 zero count features.
lm_boxplot

Before going any further, look at the library size plot. I think it pretty much guarantees that this data is not usable.

However, I will continue in the hopes that I can lay down a framework for future data and maybe diagnose what is going on in this data.

4 Normalize and visualize

lm_norm <- sm(normalize_expt(lm_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
lm_metrics <- sm(graph_metrics(lm_norm))
lm_metrics$pcaplot
lm_surrogates <- pca_information(expt_data=lm_expt, expt_factors=c("condition", "averagesize"), plot_pcas=TRUE)
## More shallow curves in these plots suggest more genes in this principle component.
## The minimum size is: 2 and the maximum size is: 2
##lm_varpart <- varpart(expt=lm_expt, predictor=NULL, factors=c("condition"))
##lm_varpart$partition_plot

The data should now be normalized, lets view some metrics post-facto.

lm_metrics$corheat

lm_metrics$smc

lm_metrics$pcaplot

5 Try a batch estimation

lm_sva <- sm(normalize_expt(lm_expt, transform="log2", norm="quant",
                            filter=TRUE, batch="ssva", low_to_zero=TRUE))
lm_sva_metrics <- graph_metrics(lm_sva)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, adding 1 to the data.
## Changed 19588 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## The minimum size is: 2 and the maximum size is: 2
## Graphing a T-SNE plot.
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 19588 zero count features.
## Plotting the representation of the top-n genes.
## Printing a color to condition legend.

lm_sva_metrics$pcaplot

lm_fsva <- sm(normalize_expt(lm_expt, transform="log2", norm="quant",
                             filter=TRUE, batch="fsva", low_to_zero=TRUE))
lm_fsva_metrics <- graph_metrics(lm_fsva)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## The minimum size is: 2 and the maximum size is: 2
## Graphing a T-SNE plot.
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Plotting a density plot.
## Plotting the representation of the top-n genes.
## Printing a color to condition legend.

lm_fsva_metrics$pcaplot

6 Write the expt

lm_first <- write_expt(lm_expt, excel=paste0("excel/lm_samples_written_default-v", ver, ".xlsx"),
                       filter=TRUE, norm="quant", convert="cpm", batch="raw", transform="log2")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## The minimum size is: 2 and the maximum size is: 2
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor amastigote has 4 rows.
## The factor pmn_inf_12hr has 4 rows.
## The factor pmn_inf_14d has 4 rows.
## The factor pmn_uninf_12hr has 4 rows.
## The factor pmn_uninf_14d has 4 rows.
## The factor promastigote has 4 rows.
lm_second <- write_expt(lm_expt, excel=paste0("excel/lm_samples_written_ssva-v", ver, ".xlsx"),
                        filter=TRUE, norm="quant", convert="raw", batch="ssva", transform="log2")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## The minimum size is: 2 and the maximum size is: 2
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor amastigote has 4 rows.
## The factor pmn_inf_12hr has 4 rows.
## The factor pmn_inf_14d has 4 rows.
## The factor pmn_uninf_12hr has 4 rows.
## The factor pmn_uninf_14d has 4 rows.
## The factor promastigote has 4 rows.
lm_third <- write_expt(lm_expt, excel=paste0("excel/lm_samples_written_fsva-v", ver, ".xlsx"),
                       filter=TRUE, norm="quant", convert="raw", batch="fsva", transform="log2")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## The minimum size is: 2 and the maximum size is: 2
## The minimum size is: 2 and the maximum size is: 2
## TESTME WHAT IS: 16
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor amastigote has 4 rows.
## The factor pmn_inf_12hr has 4 rows.
## The factor pmn_inf_14d has 4 rows.
## The factor pmn_uninf_12hr has 4 rows.
## The factor pmn_uninf_14d has 4 rows.
## The factor promastigote has 4 rows.
pander::pander(sessionInfo())

R version 3.4.4 (2018-03-15)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: hpgltools(v.2018.03) and ruv(v.0.9.7)

loaded via a namespace (and not attached): Biobase(v.2.38.0), edgeR(v.3.20.9), bit64(v.0.9-7), splines(v.3.4.4), foreach(v.1.4.4), gtools(v.3.5.0), stats4(v.3.4.4), pander(v.0.6.1), blob(v.1.1.0), yaml(v.2.1.18), ggrepel(v.0.7.0), pillar(v.1.2.1), RSQLite(v.2.0), backports(v.1.1.2), lattice(v.0.20-35), limma(v.3.34.9), quadprog(v.1.5-5), digest(v.0.6.15), RColorBrewer(v.1.1-2), colorspace(v.1.3-2), htmltools(v.0.3.6), preprocessCore(v.1.40.0), Matrix(v.1.2-12), plyr(v.1.8.4), XML(v.3.98-1.10), devtools(v.1.13.5), genefilter(v.1.60.0), xtable(v.1.8-2), corpcor(v.1.6.9), scales(v.0.5.0), openxlsx(v.4.0.17), Rtsne(v.0.13), BiocParallel(v.1.12.0), tibble(v.1.4.2), annotate(v.1.56.1), mgcv(v.1.8-23), IRanges(v.2.12.0), ggplot2(v.2.2.1), withr(v.2.1.2), BiocGenerics(v.0.24.0), lazyeval(v.0.2.1), survival(v.2.41-3), magrittr(v.1.5), evaluate(v.0.10.1), memoise(v.1.1.0), nlme(v.3.1-131.1), MASS(v.7.3-49), xml2(v.1.2.0), tools(v.3.4.4), directlabels(v.2017.03.31), data.table(v.1.10.4-3), matrixStats(v.0.53.1), stringr(v.1.3.0), S4Vectors(v.0.16.0), munsell(v.0.4.3), locfit(v.1.5-9.1), AnnotationDbi(v.1.40.0), compiler(v.3.4.4), rlang(v.0.2.0), grid(v.3.4.4), RCurl(v.1.95-4.10), iterators(v.1.0.9), base64enc(v.0.1-3), bitops(v.1.0-6), labeling(v.0.3), rmarkdown(v.1.9), gtable(v.0.2.0), codetools(v.0.2-15), DBI(v.0.8), roxygen2(v.6.0.1), reshape2(v.1.4.3), R6(v.2.2.2), gridExtra(v.2.3), knitr(v.1.20), bit(v.1.1-12), commonmark(v.1.4), rprojroot(v.1.3-2), KernSmooth(v.2.23-15), stringi(v.1.1.7), parallel(v.3.4.4), sva(v.3.26.0) and Rcpp(v.0.12.16)

this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_sample_estimation_lmajor-v20170706.rda.xz
tt <- sm(saveme(filename=this_save))
LS0tCnRpdGxlOiAiTC4gbWFqb3IgMjAxNzogc2FtcGxlIGVzdGltYXRpb24gb2YgTGVpc2htYW5pYSBzYW1wbGVzLiIKYXV0aG9yOiAiYXRiIGFiZWxld0BnbWFpbC5jb20iCmRhdGU6ICJgciBTeXMuRGF0ZSgpYCIKb3V0cHV0OgogaHRtbF9kb2N1bWVudDoKICBjb2RlX2Rvd25sb2FkOiB0cnVlCiAgY29kZV9mb2xkaW5nOiBzaG93CiAgZmlnX2NhcHRpb246IHRydWUKICBmaWdfaGVpZ2h0OiA3CiAgZmlnX3dpZHRoOiA3CiAgaGlnaGxpZ2h0OiBkZWZhdWx0CiAga2VlcF9tZDogZmFsc2UKICBtb2RlOiBzZWxmY29udGFpbmVkCiAgbnVtYmVyX3NlY3Rpb25zOiB0cnVlCiAgc2VsZl9jb250YWluZWQ6IHRydWUKICB0aGVtZTogcmVhZGFibGUKICB0b2M6IHRydWUKICB0b2NfZmxvYXQ6CiAgICBjb2xsYXBzZWQ6IGZhbHNlCiAgICBzbW9vdGhfc2Nyb2xsOiBmYWxzZQotLS0KCjxzdHlsZT4KICBib2R5IC5tYWluLWNvbnRhaW5lciB7CiAgICBtYXgtd2lkdGg6IDE2MDBweDsKICB9Cjwvc3R5bGU+CgpgYGB7ciBvcHRpb25zLCBpbmNsdWRlPUZBTFNFfQppZiAoIWlzVFJVRShnZXQwKCJza2lwX2xvYWQiKSkpIHsKICBsaWJyYXJ5KGhwZ2x0b29scykKICB0dCA8LSBkZXZ0b29sczo6bG9hZF9hbGwoIn4vaHBnbHRvb2xzIikKICBrbml0cjo6b3B0c19rbml0JHNldChwcm9ncmVzcz1UUlVFLAogICAgICAgICAgICAgICAgICAgICAgIHZlcmJvc2U9VFJVRSwKICAgICAgICAgICAgICAgICAgICAgICB3aWR0aD05MCwKICAgICAgICAgICAgICAgICAgICAgICBlY2hvPVRSVUUpCiAga25pdHI6Om9wdHNfY2h1bmskc2V0KGVycm9yPVRSVUUsCiAgICAgICAgICAgICAgICAgICAgICAgIGZpZy53aWR0aD04LAogICAgICAgICAgICAgICAgICAgICAgICBmaWcuaGVpZ2h0PTgsCiAgICAgICAgICAgICAgICAgICAgICAgIGRwaT05NikKICBvbGRfb3B0aW9ucyA8LSBvcHRpb25zKGRpZ2l0cz00LAogICAgICAgICAgICAgICAgICAgICAgICAgc3RyaW5nc0FzRmFjdG9ycz1GQUxTRSwKICAgICAgICAgICAgICAgICAgICAgICAgIGtuaXRyLmR1cGxpY2F0ZS5sYWJlbD0iYWxsb3ciKQogIGdncGxvdDI6OnRoZW1lX3NldChnZ3Bsb3QyOjp0aGVtZV9idyhiYXNlX3NpemU9MTApKQogIHZlciA8LSAiMjAxNzA3MDYiCiAgcHJldmlvdXNfZmlsZSA8LSAiMDFfYW5ub3RhdGlvbl9sbWFqb3IuUm1kIgoKICB0bXAgPC0gdHJ5KHNtKGxvYWRtZShmaWxlbmFtZT1wYXN0ZTAoZ3N1YihwYXR0ZXJuPSJcXC5SbWQiLCByZXBsYWNlPSIiLCB4PXByZXZpb3VzX2ZpbGUpLCAiLXYiLCB2ZXIsICIucmRhLnh6IikpKSkKICBybWRfZmlsZSA8LSAiMDJfc2FtcGxlX2VzdGltYXRpb25fbG1ham9yLlJtZCIKfQpgYGAKCiMgU2FtcGxlIEVzdGltYXRpb246IGByIHZlcmAKCkwuIG1ham9yIHNhbXBsZSBlc3RpbWF0aW9uCj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0KCiMgVmlzdWFsaXppbmcgcmF3IGRhdGEKClRoZXJlIGFyZSBsb3RzIG9mIHRveXMgd2UgaGF2ZSBsZWFybmVkIHRvIHVzZSB0byBwbGF5IHdpdGggd2l0aCByYXcgZGF0YSBhbmQgZXhwbG9yZSBzdHVmZiBsaWtlCmJhdGNoIGVmZmVjdHMgb3Igbm9uLWNhbm9uaWNhbCBkaXN0cmlidXRpb25zIG9yIHNrZXdlZCBjb3VudHMuICBocGdsdG9vbHMgcHJvdmlkZXMgc29tZSBmdW5jdGlvbmFsaXR5CnRvIG1ha2UgdGhpcyBwcm9jZXNzIGVhc2llci4gIFRoZSBncmFwaHMgc2hvd24gYmVsb3cgYW5kIG1hbnkgbW9yZSBhcmUgZ2VuZXJhdGVkIHdpdGggdGhlIHdyYXBwZXIKJ2dyYXBoX21ldHJpY3MoKScgYnV0IHRoYXQgdGFrZXMgYXdheSB0aGUgY2hhbmNlIHRvIGV4cGxhaW4gdGhlIGdyYXBocyBhcyBJIGdlbmVyYXRlIHRoZW0uCgpgYGB7ciByYXdfZXhwbG9yZX0KbG1fZXhwdCA8LSBleGNsdWRlX2dlbmVzX2V4cHQoZXhwdD1sbV9leHB0LCBjb2x1bW49IlR5cGUiLCBtZXRob2Q9ImtlZXAiLCBwYXR0ZXJucz1jKCJwcm90ZWluX2NvZGluZyIpKQpwbG90X3BjdF9rZXB0KGxtX2V4cHQpCmxtX2ZpbHQgPC0gc20obm9ybWFsaXplX2V4cHQobG1fZXhwdCwgZmlsdGVyPVRSVUUpKQoKbG1fbGlic2l6ZSA8LSBwbG90X2xpYnNpemUobG1fZXhwdCkKbG1fbGlic2l6ZSRwbG90CgpsbV9ub256ZXJvIDwtIHBsb3Rfbm9uemVybyhsbV9maWx0KQpsbV9ub256ZXJvJHBsb3QKCmxtX2RlbnNpdHkgPC0gcGxvdF9kZW5zaXR5KGxtX2ZpbHQpCmxtX2RlbnNpdHkkcGxvdAoKbG1fYm94cGxvdCA8LSBwbG90X2JveHBsb3QobG1fZmlsdCkKbG1fYm94cGxvdApgYGAKCkJlZm9yZSBnb2luZyBhbnkgZnVydGhlciwgbG9vayBhdCB0aGUgbGlicmFyeSBzaXplIHBsb3QuICBJIHRoaW5rIGl0IHByZXR0eSBtdWNoCmd1YXJhbnRlZXMgdGhhdCB0aGlzIGRhdGEgaXMgbm90IHVzYWJsZS4KCkhvd2V2ZXIsIEkgd2lsbCBjb250aW51ZSBpbiB0aGUgaG9wZXMgdGhhdCBJIGNhbiBsYXkgZG93biBhIGZyYW1ld29yayBmb3IgZnV0dXJlCmRhdGEgYW5kIG1heWJlIGRpYWdub3NlIHdoYXQgaXMgZ29pbmcgb24gaW4gdGhpcyBkYXRhLgoKIyBOb3JtYWxpemUgYW5kIHZpc3VhbGl6ZQoKYGBge3Igbm9ybWFsaXplLCBmaWcuc2hvdz0iaGlkZSJ9CmxtX25vcm0gPC0gc20obm9ybWFsaXplX2V4cHQobG1fZXhwdCwgdHJhbnNmb3JtPSJsb2cyIiwgbm9ybT0icXVhbnQiLCBjb252ZXJ0PSJjcG0iLCBmaWx0ZXI9VFJVRSkpCmxtX21ldHJpY3MgPC0gc20oZ3JhcGhfbWV0cmljcyhsbV9ub3JtKSkKbG1fbWV0cmljcyRwY2FwbG90CgpsbV9zdXJyb2dhdGVzIDwtIHBjYV9pbmZvcm1hdGlvbihleHB0X2RhdGE9bG1fZXhwdCwgZXhwdF9mYWN0b3JzPWMoImNvbmRpdGlvbiIsICJhdmVyYWdlc2l6ZSIpLCBwbG90X3BjYXM9VFJVRSkKIyNsbV92YXJwYXJ0IDwtIHZhcnBhcnQoZXhwdD1sbV9leHB0LCBwcmVkaWN0b3I9TlVMTCwgZmFjdG9ycz1jKCJjb25kaXRpb24iKSkKIyNsbV92YXJwYXJ0JHBhcnRpdGlvbl9wbG90CmBgYAoKVGhlIGRhdGEgc2hvdWxkIG5vdyBiZSBub3JtYWxpemVkLCBsZXRzIHZpZXcgc29tZSBtZXRyaWNzIHBvc3QtZmFjdG8uCgpgYGB7ciBub3Jtdml6fQpsbV9tZXRyaWNzJGNvcmhlYXQKCmxtX21ldHJpY3Mkc21jCgpsbV9tZXRyaWNzJHBjYXBsb3QKYGBgCgojIFRyeSBhIGJhdGNoIGVzdGltYXRpb24KCmBgYHtyIGJhdGNoX2V4b30KbG1fc3ZhIDwtIHNtKG5vcm1hbGl6ZV9leHB0KGxtX2V4cHQsIHRyYW5zZm9ybT0ibG9nMiIsIG5vcm09InF1YW50IiwKICAgICAgICAgICAgICAgICAgICAgICAgICAgIGZpbHRlcj1UUlVFLCBiYXRjaD0ic3N2YSIsIGxvd190b196ZXJvPVRSVUUpKQpsbV9zdmFfbWV0cmljcyA8LSBncmFwaF9tZXRyaWNzKGxtX3N2YSkKbG1fc3ZhX21ldHJpY3MkcGNhcGxvdAoKbG1fZnN2YSA8LSBzbShub3JtYWxpemVfZXhwdChsbV9leHB0LCB0cmFuc2Zvcm09ImxvZzIiLCBub3JtPSJxdWFudCIsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZmlsdGVyPVRSVUUsIGJhdGNoPSJmc3ZhIiwgbG93X3RvX3plcm89VFJVRSkpCmxtX2ZzdmFfbWV0cmljcyA8LSBncmFwaF9tZXRyaWNzKGxtX2ZzdmEpCmxtX2ZzdmFfbWV0cmljcyRwY2FwbG90CmBgYAoKIyBXcml0ZSB0aGUgZXhwdAoKYGBge3IgZnN2YV93cml0dGVuLCBmaWcuc2hvdz0iaGlkZSJ9CmxtX2ZpcnN0IDwtIHdyaXRlX2V4cHQobG1fZXhwdCwgZXhjZWw9cGFzdGUwKCJleGNlbC9sbV9zYW1wbGVzX3dyaXR0ZW5fZGVmYXVsdC12IiwgdmVyLCAiLnhsc3giKSwKICAgICAgICAgICAgICAgICAgICAgICBmaWx0ZXI9VFJVRSwgbm9ybT0icXVhbnQiLCBjb252ZXJ0PSJjcG0iLCBiYXRjaD0icmF3IiwgdHJhbnNmb3JtPSJsb2cyIikKbG1fc2Vjb25kIDwtIHdyaXRlX2V4cHQobG1fZXhwdCwgZXhjZWw9cGFzdGUwKCJleGNlbC9sbV9zYW1wbGVzX3dyaXR0ZW5fc3N2YS12IiwgdmVyLCAiLnhsc3giKSwKICAgICAgICAgICAgICAgICAgICAgICAgZmlsdGVyPVRSVUUsIG5vcm09InF1YW50IiwgY29udmVydD0icmF3IiwgYmF0Y2g9InNzdmEiLCB0cmFuc2Zvcm09ImxvZzIiKQpsbV90aGlyZCA8LSB3cml0ZV9leHB0KGxtX2V4cHQsIGV4Y2VsPXBhc3RlMCgiZXhjZWwvbG1fc2FtcGxlc193cml0dGVuX2ZzdmEtdiIsIHZlciwgIi54bHN4IiksCiAgICAgICAgICAgICAgICAgICAgICAgZmlsdGVyPVRSVUUsIG5vcm09InF1YW50IiwgY29udmVydD0icmF3IiwgYmF0Y2g9ImZzdmEiLCB0cmFuc2Zvcm09ImxvZzIiKQpgYGAKCmBgYHtyIHNhdmVtZX0KcGFuZGVyOjpwYW5kZXIoc2Vzc2lvbkluZm8oKSkKdGhpc19zYXZlIDwtIHBhc3RlMChnc3ViKHBhdHRlcm49IlxcLlJtZCIsIHJlcGxhY2U9IiIsIHg9cm1kX2ZpbGUpLCAiLXYiLCB2ZXIsICIucmRhLnh6IikKbWVzc2FnZShwYXN0ZTAoIlNhdmluZyB0byAiLCB0aGlzX3NhdmUpKQp0dCA8LSBzbShzYXZlbWUoZmlsZW5hbWU9dGhpc19zYXZlKSkKYGBgCg==