index.html preprocessing.html

1 Annotation version: 20170905

In this document, I will attempt to visualize the relationships among the various samples in this experiment. In the process, I will examine the samples to (hopefully) ensure that they are consistent and that there is not a fatal batch effect or other weakness in the data.

To start, I will print some simple metrics of the entire data set, then split it into more tractable pieces and examine them more thoroughly.

2 Raw global metrics

With the above in mind, here are a couple global plots showing the state of the data.

2.1 mi vs tx

When processing the data, I did 1 alignment of the reads from all samples against a database of miRNAs and then separately against a database of all transcripts. All the samples of library type ‘small’ are really only intended to be compared against the miRNA library while those of type ‘large’ are intended only to search against the transcripts. This is primarily because of the different ways the RNA samples were treated at the beginning of the library preparation. Despite this, I perform both types of alignments for all samples so that I may now look and see that the distributions are different. If everything worked as planned, then the large RNA libraries should have a very different distribution of reads in the miRNA databases and vice-versa. If this is not true, then something might be very wrong.

With that in mind, the following block is going to do some of the graphs and normalize the data. In the following blocks I will print some of the results and consider what they mean.

The following block will serve to generate all the likely subsets of the data.

Lets lay down a couple rules of thumb now:

  1. In all following analyses, do the miRNA first, then transcripts.
  2. All miRNA datasets will start with ‘mmmi’ and transcripts will be ‘mmtx’
## Graph the raw metrics of all samples mapped against the transcripts and miRNAs
##mm_mir <- expt_subset(mm_mir, subset="genotype!='apop'")
##mm_txr <- expt_subset(mm_txr, subset="genotype!='apop'")
mm_mi_metrics <- sm(graph_metrics(mm_mir))
mm_tx_metrics <- sm(graph_metrics(mm_txr))
mmmi_small_metrics <- sm(graph_metrics(mmmi_small))
mmtx_large_metrics <- sm(graph_metrics(mmtx_large))

2.2 A Legend!

First, do not forget to print a legend showing the colors used and what they mean:

mm_mi_metrics$legend$plot

## This should be the same for the mm_mi and mm_tx objects.

2.3 Start with some global metrics

mm_mi_metrics$libsize

The first 8 libraries are small RNA libraries of which the first 4 are exosomes. The important thing to remember is that these are mapped against only the 1978 miRNA features in the mouse Ensembl genome, as such we see that the first 4 are highly enriched in miRNAs, which is excellent.

Therefore, as one would expect, the small exosome RNA libraries have fewer reads than the total cell small RNA libraries, which comprise the set from 5-8. I of course focused immediately upon the last 8 without thinking. These are all the polyA RNA samples, of which the miRNA features are only a small proportion and therefore we see them as <10% the depth of the small RNA samples. I would therefore expect something like the opposite for the transcriptome samples, which are next.

mm_tx_metrics$libsize

Ahh excellent. Once again, the exosome samples have fewer reads than the cells and have many fewer for the small RNA samples than transcript samples, which is good.

3 Sample distributions

From here on, I think I will examine only the small RNA samples and large RNA samples as 2 separate groups.

mmmi_small_metrics$density

mmmi_small_metrics$boxplot

## The sample densities among the small RNA samples look reasonable to me.
mmtx_large_metrics$density

mmtx_large_metrics$boxplot

## This is more problematic from the perspective of the data distributions; but not suprising and
## encouraging from a biological perspective.  The large RNA exosomes have many more genes with 0
## counts.

Considering the heterologous sources of these samples, I think these actually make sense, lets move on and see what else pops up.

3.1 Normalized Metrics

##mmmi_small_norm <- normalize_expt(mmmi_small, transform="log2", norm="quant", convert="cpm", batch="svaseq", filter=TRUE)
mmmi_small_norm <- sm(normalize_expt(mmmi_small, filter=TRUE, norm="quant", convert="cpm", batch="svaseq"))
mmmi_small_norm_metrics <- sm(graph_metrics(mmmi_small_norm))

mmmi_small_norm_metrics$corheat

mmmi_small_norm_metrics$disheat

mmmi_small_norm_metrics$smc

mmmi_small_norm_metrics$pcaplot

mmmi_small_noapop <- subset_expt(mmmi_small, subset="genotype!='apop'")
mmmi_small_noapop_metrics <- sm(graph_metrics(mmmi_small_noapop))

mmmi_small_noapop_metrics$pcaplot

mmmi_small_sva <- sm(normalize_expt(mmmi_small, transform="log2",
                                    norm="quant", convert="raw",
                                    filter=TRUE, batch="fsva"))
plot_pca(mmmi_small_sva)$plot

## The exosome and cell samples split nicely, and the wt/mutants split nicely among them!
## holy asscrackers that is nice!
## newer_data <- normalize_expt(mmmi_small, filter=TRUE, transform="log2", batch="svaseq", norm="quant")
printed_expt <- sm(write_expt(mmmi_small, violin=TRUE,
                              excel=paste0("excel/mmmi_state_withapop-", ver, ".xlsx"),
                              filter=TRUE, transform="log2", norm="quant", convert="raw", batch="fsva"))

It appears that the cellular mRNA samples are a little more problematic; shockingly to me at least, the problematic sample proved to be one of the cellular RNA samples and not an exosome sample! I almost can’t believe that. I think that, given this, I will try an sva/combat adjustment of the data and see what happens, if for nothing else, then to satisfy my curiosity.

tx_colors <- c("#AA0000", "#AA8888", "#0000AA", "#8888AA")
names(tx_colors) <- c("exo_polya_mut", "exo_polya_wt", "cell_polya_mut", "cell_polya_wt")
mmtx_new <- set_expt_colors(mmtx_large, colors=tx_colors)
##mmtx_new <- expt_subset(mmtx_new, subset="sampleid!='HPGL0688'")

mmtx_large_sva <- sm(normalize_expt(mmtx_new, transform="log2",
                                    norm="raw", convert="cpm",
                                    filter=TRUE, batch="fsva"))
plot_pca(mmtx_large_sva)$plot

printed_expt <- sm(write_expt(mmtx_new, excel=paste0("excel/mmtx_state-", ver, ".xlsx"),
                              batch="fsva", norm="raw", convert="cpm", filter=TRUE, violin=TRUE))

## wow, ummm, ok, so it appears that sva sees a significant surrogate variable.
mmtx_new_combat <- sm(normalize_expt(mmtx_new, transform="log2",
                                     norm="quant", convert="cpm",
                                     filter=TRUE, batch="combat_scale"))

plot_pca(mmtx_new_combat)$plot

## But according to this plot, the surrogate varaible is at least not entirely 'batch'
## oh wait, 98.62% of the remaining variance is now on the x-axis!

It appears that there is a significant but not crippling batch effect notable primarily in the large RNA libraries. I suspect that this will be sufficiently ameliorated by just adding batch into the experimental model.

pander::pander(sessionInfo())

R version 3.4.1 (2017-06-30)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: hpgltools(v.2017.01), Biobase(v.2.36.2) and BiocGenerics(v.0.22.0)

loaded via a namespace (and not attached): edgeR(v.3.18.1), bit64(v.0.9-7), splines(v.3.4.1), foreach(v.1.4.3), gtools(v.3.5.0), stats4(v.3.4.1), pander(v.0.6.1), blob(v.1.1.0), yaml(v.2.1.14), ggrepel(v.0.6.5), RSQLite(v.2.0), backports(v.1.1.0), lattice(v.0.20-35), limma(v.3.32.5), quadprog(v.1.5-5), digest(v.0.6.12), RColorBrewer(v.1.1-2), minqa(v.1.2.4), colorspace(v.1.3-2), htmltools(v.0.3.6), preprocessCore(v.1.38.1), Matrix(v.1.2-11), plyr(v.1.8.4), XML(v.3.98-1.9), devtools(v.1.13.3), genefilter(v.1.58.1), xtable(v.1.8-2), corpcor(v.1.6.9), scales(v.0.5.0), gdata(v.2.18.0), openxlsx(v.4.0.17), Rtsne(v.0.13), lme4(v.1.1-13), BiocParallel(v.1.10.1), tibble(v.1.3.4), annotate(v.1.54.0), mgcv(v.1.8-19), IRanges(v.2.10.3), ggplot2(v.2.2.1), withr(v.2.0.0), lazyeval(v.0.2.0), pbkrtest(v.0.4-7), survival(v.2.41-3), magrittr(v.1.5), crayon(v.1.3.2), memoise(v.1.1.0), evaluate(v.0.10.1), MASS(v.7.3-47), gplots(v.3.0.1), doParallel(v.1.0.10), nlme(v.3.1-131), xml2(v.1.1.1), tools(v.3.4.1), directlabels(v.2017.03.31), data.table(v.1.10.4), matrixStats(v.0.52.2), stringr(v.1.2.0), S4Vectors(v.0.14.3), munsell(v.0.4.3), locfit(v.1.5-9.1), AnnotationDbi(v.1.38.2), colorRamps(v.2.3), compiler(v.3.4.1), caTools(v.1.17.1), rlang(v.0.1.2), nloptr(v.1.0.4), grid(v.3.4.1), RCurl(v.1.95-4.8), iterators(v.1.0.8), variancePartition(v.1.6.0), bitops(v.1.0-6), base64enc(v.0.1-3), labeling(v.0.3), rmarkdown(v.1.6), testthat(v.1.0.2), gtable(v.0.2.0), codetools(v.0.2-15), DBI(v.0.7), roxygen2(v.6.0.1), reshape2(v.1.4.2), R6(v.2.2.2), knitr(v.1.17), bit(v.1.1-12), commonmark(v.1.4), rprojroot(v.1.2), KernSmooth(v.2.23-15), stringi(v.1.1.5), sva(v.3.24.4) and Rcpp(v.0.12.12)

message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 739e4a10345a89efb17b37e7de07c5811491ccea
## R> packrat::restore()
## This is hpgltools commit: Thu Aug 31 11:24:06 2017 -0400: 739e4a10345a89efb17b37e7de07c5811491ccea
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_sample_estimation-v20170905.rda.xz
tmp <- sm(saveme(filename=this_save))
---
title: "Sample Estimation of M.musculus samples."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: tango
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: cosmo
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
}
</style>

```{r options, include=FALSE}
library(hpgltools)
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=90,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      fig.width=8,
                      fig.height=8,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
set.seed(1)
ver <- "20170905"
previous_file <- "01_annotation.Rmd"

tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))

rmd_file <- "02_sample_estimation.Rmd"
```

```{r rendering, include=FALSE, eval=FALSE}
rmarkdown::render(rmd_file)

rmarkdown::render(rmd_file, output_format="pdf_document")
```

[index.html](index.html) [preprocessing.html](preprocessing.html)

# Annotation version: `r ver`

In this document, I will attempt to visualize the relationships among the various samples in this
experiment.  In the process, I will examine the samples to (hopefully) ensure that they are
consistent and that there is not a fatal batch effect or other weakness in the data.

To start, I will print some simple metrics of the entire data set, then split it into more tractable
pieces and examine them more thoroughly.

# Raw global metrics

With the above in mind, here are a couple global plots showing the state of the data.

## mi vs tx

When processing the data, I did 1 alignment of the reads from all samples against a database of
miRNAs and then separately against a database of all transcripts.  All the samples of library type
'small' are really only intended to be compared against the miRNA library while those of type
'large' are intended only to search against the transcripts.  This is primarily because of the
different ways the RNA samples were treated at the beginning of the library preparation.  Despite
this, I perform both types of alignments for all samples so that I may now look and see that the
distributions are different.  If everything worked as planned, then the large RNA libraries should
have a very different distribution of reads in the miRNA databases and vice-versa.  If this is not
true, then something might be very wrong.

With that in mind, the following block is going to do some of the graphs and normalize the data.  In
the following blocks I will print some of the results and consider what they mean.

The following block will serve to generate all the likely subsets of the data.

Lets lay down a couple rules of thumb now:

1.  In all following analyses, do the miRNA first, then transcripts.
2.  All miRNA datasets will start with 'mmmi' and transcripts will be 'mmtx'

```{r base_metrics, fig.show="hide"}
## Graph the raw metrics of all samples mapped against the transcripts and miRNAs
##mm_mir <- expt_subset(mm_mir, subset="genotype!='apop'")
##mm_txr <- expt_subset(mm_txr, subset="genotype!='apop'")
mm_mi_metrics <- sm(graph_metrics(mm_mir))
mm_tx_metrics <- sm(graph_metrics(mm_txr))
mmmi_small_metrics <- sm(graph_metrics(mmmi_small))
mmtx_large_metrics <- sm(graph_metrics(mmtx_large))
```

## A Legend!

First, do not forget to print a legend showing the colors used and what they mean:

```{r legend}
mm_mi_metrics$legend$plot
## This should be the same for the mm_mi and mm_tx objects.
```

## Start with some global metrics

```{r libsize_mi}
mm_mi_metrics$libsize
```

The first 8 libraries are small RNA libraries of which the first 4 are exosomes.
The important thing to remember is that these are mapped against only the 1978 miRNA features in
the mouse Ensembl genome, as such we see that the first 4 are highly enriched in miRNAs, which is
excellent.

Therefore, as one would expect, the small exosome RNA libraries have fewer reads than the total
cell small RNA libraries, which comprise the set from 5-8.
I of course focused immediately upon the last 8 without thinking.  These are all the polyA RNA
samples, of which the miRNA features are only a small proportion and therefore we see them as
<10% the depth of the small RNA samples.
I would therefore expect something like the opposite for the transcriptome samples, which are next.

```{r libsize_tx}
mm_tx_metrics$libsize
```

Ahh excellent.  Once again, the exosome samples have fewer reads than the cells and have many
fewer for the small RNA samples than transcript samples, which is good.

# Sample distributions

From here on, I think I will examine only the small RNA samples and large RNA samples as 2 separate
groups.

```{r densities}
mmmi_small_metrics$density
mmmi_small_metrics$boxplot
## The sample densities among the small RNA samples look reasonable to me.
mmtx_large_metrics$density
mmtx_large_metrics$boxplot
## This is more problematic from the perspective of the data distributions; but not suprising and
## encouraging from a biological perspective.  The large RNA exosomes have many more genes with 0
## counts.
```

Considering the heterologous sources of these samples, I think these actually make sense, lets move
on and see what else pops up.


## Normalized Metrics

```{r view_norm_metrics}
##mmmi_small_norm <- normalize_expt(mmmi_small, transform="log2", norm="quant", convert="cpm", batch="svaseq", filter=TRUE)
mmmi_small_norm <- sm(normalize_expt(mmmi_small, filter=TRUE, norm="quant", convert="cpm", batch="svaseq"))
mmmi_small_norm_metrics <- sm(graph_metrics(mmmi_small_norm))
mmmi_small_norm_metrics$corheat
mmmi_small_norm_metrics$disheat
mmmi_small_norm_metrics$smc
mmmi_small_norm_metrics$pcaplot

mmmi_small_noapop <- subset_expt(mmmi_small, subset="genotype!='apop'")
mmmi_small_noapop_metrics <- sm(graph_metrics(mmmi_small_noapop))
mmmi_small_noapop_metrics$pcaplot

mmmi_small_sva <- sm(normalize_expt(mmmi_small, transform="log2",
                                    norm="quant", convert="raw",
                                    filter=TRUE, batch="fsva"))
plot_pca(mmmi_small_sva)$plot
## The exosome and cell samples split nicely, and the wt/mutants split nicely among them!
## holy asscrackers that is nice!
## newer_data <- normalize_expt(mmmi_small, filter=TRUE, transform="log2", batch="svaseq", norm="quant")
printed_expt <- sm(write_expt(mmmi_small, violin=TRUE,
                              excel=paste0("excel/mmmi_state_withapop-", ver, ".xlsx"),
                              filter=TRUE, transform="log2", norm="quant", convert="raw", batch="fsva"))
```

It appears that the cellular mRNA samples are a little more problematic; shockingly to me at least,
the problematic sample proved to be one of the cellular RNA samples and not an exosome sample!  I
almost can't believe that.  I think that, given this, I will try an sva/combat adjustment of the data and
see what happens, if for nothing else, then to satisfy my curiosity.

```{r sva_rna}
tx_colors <- c("#AA0000", "#AA8888", "#0000AA", "#8888AA")
names(tx_colors) <- c("exo_polya_mut", "exo_polya_wt", "cell_polya_mut", "cell_polya_wt")
mmtx_new <- set_expt_colors(mmtx_large, colors=tx_colors)
##mmtx_new <- expt_subset(mmtx_new, subset="sampleid!='HPGL0688'")

mmtx_large_sva <- sm(normalize_expt(mmtx_new, transform="log2",
                                    norm="raw", convert="cpm",
                                    filter=TRUE, batch="fsva"))
plot_pca(mmtx_large_sva)$plot

printed_expt <- sm(write_expt(mmtx_new, excel=paste0("excel/mmtx_state-", ver, ".xlsx"),
                              batch="fsva", norm="raw", convert="cpm", filter=TRUE, violin=TRUE))
## wow, ummm, ok, so it appears that sva sees a significant surrogate variable.
mmtx_new_combat <- sm(normalize_expt(mmtx_new, transform="log2",
                                     norm="quant", convert="cpm",
                                     filter=TRUE, batch="combat_scale"))
plot_pca(mmtx_new_combat)$plot
## But according to this plot, the surrogate varaible is at least not entirely 'batch'
## oh wait, 98.62% of the remaining variance is now on the x-axis!
```

It appears that there is a significant but not crippling batch effect notable primarily in the large
RNA libraries.  I suspect that this will be sufficiently ameliorated by just adding batch into the
experimental model.

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
```
