1 Sample Estimation version: 20180212

This document should make clear the suitability of our yeast data for differential expression analyses. It should also give some ideas about the depth and distribution of the data.

1.1 Visualizing raw data

There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.

sc_filt <- sm(normalize_expt(sc2_expt, filter=TRUE))
sc2_libsize <- plot_libsize(sc2_expt)
pp(file="images/sc2_libsize.png", image=sc2_libsize$plot)
## Wrote the image to: images/sc2_libsize.png

## The range is from 22-36 million reads, that seems quite reasonable to me.

sc2_nonzero <- plot_nonzero(sc_filt)
pp(file="images/sc2_nonzero.png", image=sc2_nonzero$plot)
## Wrote the image to: images/sc2_nonzero.png

## Wow, these samples are super-well represented.  The 'worst' samples have only 3 more genes with
## ~0 reads than the 'best'.  That is amazing.

sc2_density <- sm(plot_density(sc_filt))
pp(file="images/sc2_density.png", image=sc2_density$plot)
## Wrote the image to: images/sc2_density.png

## These look quite good for non-normalized data.

sc2_boxplot <- sm(plot_boxplot(sc_filt))
pp(file="images/sc2_boxplot.png", image=sc2_boxplot)
## Wrote the image to: images/sc2_boxplot.png

## This is the same as the density plot, looks good to me.

2 Normalize and visualize

Here we will apply the ‘default’ normalization preferred by Najib and then re-graph the data to look at the distributions again.

sc_norm <- sm(normalize_expt(sc2_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
sc_metrics <- sm(graph_metrics(sc_norm))

The data should now be normalized, lets view some metrics post-facto.

pp(file="illustrator_input/legend.pdf", image=sc_metrics$legend)
## Wrote the image to: illustrator_input/legend.pdf

pp(file="illustrator_input/norm_corheat.pdf", image=sc_metrics$corheat)
## Wrote the image to: illustrator_input/norm_corheat.pdf

pp(file="illustrator_input/norm_disheat.pdf", image=sc_metrics$disheat)
## Wrote the image to: illustrator_input/norm_disheat.pdf

## There is definitely something odd going on here.
## It looks like many of the samples are clustering in pairs where one group of pairs
## consists of the first 8 samples and the second group consists of the second 8 samples.
## This is analagous to some observations we made in IGV and somewhat worrisome.

pp(file="illustrator_input/norm_smc.pdf", image=sc_metrics$smc)
## Wrote the image to: illustrator_input/norm_smc.pdf

## The samples are very well behaved, none fall below the red line.

pp(file="illustrator_input/norm_pca.pdf", image=sc_metrics$pcaplot)
## Wrote the image to: illustrator_input/norm_pca.pdf

## This shows us that peculiar split in another way.  That is really strange.

pp(file="illustrator_input/norm_tsne.pdf", image=sc_metrics$tsneplot)
## Wrote the image to: illustrator_input/norm_tsne.pdf

3 Try a batch estimation

In the following stanza I will apply a few batch/surrogate estimates in an attempt to get a handle on what is going on with these strange data splits.

In each case I will plot a quick pca to get some idea of what happened.

For these estimates, I will leave off cpm/quantile normalizations to try to isolate the effects of only the surrogate estimates on the data.

3.1 sva defaults

Start with the default unsupervised sva estimate.

sc2_sva <- sm(normalize_expt(sc2_expt, transform="log2",
                             filter=TRUE, batch="sva", low_to_zero=TRUE))
sc2_sva_metrics <- sm(graph_metrics(sc2_sva))
pp(file="images/sc2_sva_pca.png", image=sc2_sva_metrics$pcaplot)
## Wrote the image to: images/sc2_sva_pca.png

## hmm better I suppose, but hpgl0774 is 100% kooky.

pp(file="images/sc2_sva_tsne.png", image=sc2_sva_metrics$tsneplot)
## Wrote the image to: images/sc2_sva_tsne.png

3.2 fsva

This is a modification of the sva defaults, here is a quote from the documentation:

“This function performs frozen surrogate variable analysis as described in Parker, Corrada Bravo and Leek 2013. The approach uses a training database to create surrogate variables which are then used to remove batch effects both from the training database and a new data set for prediction purposes.”

sc2_fsva <- sm(normalize_expt(sc2_expt, transform="log2",
                              filter=TRUE, batch="fsva"))
sc2_fsva_metrics <- sm(graph_metrics(sc2_fsva))

fsva seems like it might be useful, but also potentially a too-large modification of the data.

pp(file="sc2_fsva_pca.png", image=sc2_fsva_metrics$pcaplot)
## Wrote the image to: sc2_fsva_pca.png

pp(file="sc2_fsvs_tsne.png", image=sc2_fsva_metrics$tsneplot)
## Wrote the image to: sc2_fsvs_tsne.png

4 Variance hunting

I want to apply a few tools to try to get a handle on where this strange variance is coming from.

varhunt <- pca_information(expt_data=sc2_expt,
                           expt_factors=c("condition","batch","originalbatch", "incubationtime"))
varhunt$anova_fstat_heatmap
## Both batch estimates are very strong on PC1 and PC2, incubation time comes out on the 3rd PC.
## The condition looks like it is stuck with with, unfortunately.

sc2_varpart <- varpart(expt=sc2_expt, predictor=NULL,
                       factors=c("condition","batch","originalbatch", "incubationtime"))
sc2_varpart$partition_plot
## Well here at least, it looks like batch kicks ass.

5 Write the expt

Let us apply the same set of normalizations/surrogate estimates and write out the resulting tables.

fun <- write_expt(sc2_expt, excel=paste0("excel/samples_written-v", ver, ".xlsx"),
                  filter=TRUE, norm="raw", batch="fsva", transform="log2", violin=TRUE)
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  tmp <- sm(saveme(filename=this_save))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc
## R> packrat::restore()
## This is hpgltools commit: Wed Mar 21 15:55:32 2018 -0400: 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc
---
title: "S. cerevisiae 2017: sample estimation."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
if (!isTRUE(get0("skip_load"))) {
  library("hpgltools")
  tt <- devtools::load_all("~/hpgltools")
  knitr::opts_knit$set(progress=TRUE,
                       verbose=TRUE,
                       width=90,
                       echo=TRUE)
  knitr::opts_chunk$set(error=TRUE,
                        fig.width=8,
                        fig.height=8,
                        dpi=96)
  old_options <- options(digits=4,
                         stringsAsFactors=FALSE,
                         knitr.duplicate.label="allow")
  ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
  set.seed(1)
  ver <- "20180212"
  previous_file <- "01_annotation.Rmd"

  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))
  rmd_file <- "02_sample_estimation.Rmd"
}
```

# Sample Estimation version: `r ver`

This document should make clear the suitability of our yeast data for differential expression
analyses.  It should also give some ideas about the depth and distribution of
the data.

## Visualizing raw data

There are lots of toys we have learned to use to play with with raw data and explore stuff like
batch effects or non-canonical distributions or skewed counts.  hpgltools provides some functionality
to make this process easier.  The graphs shown below and many more are generated with the wrapper
'graph_metrics()' but that takes away the chance to explain the graphs as I generate them.

```{r raw_explore}
sc_filt <- sm(normalize_expt(sc2_expt, filter=TRUE))
sc2_libsize <- plot_libsize(sc2_expt)
pp(file="images/sc2_libsize.png", image=sc2_libsize$plot)
## The range is from 22-36 million reads, that seems quite reasonable to me.

sc2_nonzero <- plot_nonzero(sc_filt)
pp(file="images/sc2_nonzero.png", image=sc2_nonzero$plot)
## Wow, these samples are super-well represented.  The 'worst' samples have only 3 more genes with
## ~0 reads than the 'best'.  That is amazing.

sc2_density <- sm(plot_density(sc_filt))
pp(file="images/sc2_density.png", image=sc2_density$plot)
## These look quite good for non-normalized data.

sc2_boxplot <- sm(plot_boxplot(sc_filt))
pp(file="images/sc2_boxplot.png", image=sc2_boxplot)
## This is the same as the density plot, looks good to me.
```

# Normalize and visualize

Here we will apply the 'default' normalization preferred by Najib and then
re-graph the data to look at the distributions again.

```{r normalize, fig.show="hide"}
sc_norm <- sm(normalize_expt(sc2_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
sc_metrics <- sm(graph_metrics(sc_norm))
```

The data should now be normalized, lets view some metrics post-facto.

```{r normviz}
pp(file="illustrator_input/legend.pdf", image=sc_metrics$legend)
pp(file="illustrator_input/norm_corheat.pdf", image=sc_metrics$corheat)
pp(file="illustrator_input/norm_disheat.pdf", image=sc_metrics$disheat)
## There is definitely something odd going on here.
## It looks like many of the samples are clustering in pairs where one group of pairs
## consists of the first 8 samples and the second group consists of the second 8 samples.
## This is analagous to some observations we made in IGV and somewhat worrisome.

pp(file="illustrator_input/norm_smc.pdf", image=sc_metrics$smc)
## The samples are very well behaved, none fall below the red line.

pp(file="illustrator_input/norm_pca.pdf", image=sc_metrics$pcaplot)
## This shows us that peculiar split in another way.  That is really strange.

pp(file="illustrator_input/norm_tsne.pdf", image=sc_metrics$tsneplot)
```

# Try a batch estimation

In the following stanza I will apply a few batch/surrogate estimates in an
attempt to get a handle on what is going on with these strange data splits.

In each case I will plot a quick pca to get some idea of what happened.

For these estimates, I will leave off cpm/quantile normalizations to try to
isolate the effects of only the surrogate estimates on the data.

## sva defaults

Start with the default unsupervised sva estimate.

```{r batch_sva, fig.show="hide"}
sc2_sva <- sm(normalize_expt(sc2_expt, transform="log2",
                             filter=TRUE, batch="sva", low_to_zero=TRUE))
sc2_sva_metrics <- sm(graph_metrics(sc2_sva))
```

```{r batch_sva_pca}
pp(file="images/sc2_sva_pca.png", image=sc2_sva_metrics$pcaplot)
## hmm better I suppose, but hpgl0774 is 100% kooky.

pp(file="images/sc2_sva_tsne.png", image=sc2_sva_metrics$tsneplot)
```

## fsva

This is a modification of the sva defaults, here is a quote from the
documentation:

"This function performs frozen surrogate variable analysis as described in
Parker, Corrada Bravo and Leek 2013. The approach uses a training database to
create surrogate variables which are then used to remove batch effects both from
the training database and a new data set for prediction purposes."

```{r batch_fsva, fig.show="hide"}
sc2_fsva <- sm(normalize_expt(sc2_expt, transform="log2",
                              filter=TRUE, batch="fsva"))
sc2_fsva_metrics <- sm(graph_metrics(sc2_fsva))
```

fsva seems like it might be useful, but also potentially a too-large
modification of the data.

```{r batch_fsva_pca}
pp(file="sc2_fsva_pca.png", image=sc2_fsva_metrics$pcaplot)
pp(file="sc2_fsvs_tsne.png", image=sc2_fsva_metrics$tsneplot)
```

# Variance hunting

I want to apply a few tools to try to get a handle on where this strange
variance is coming from.

```{r variance_hunt, eval=FALSE}
varhunt <- pca_information(expt_data=sc2_expt,
                           expt_factors=c("condition","batch","originalbatch", "incubationtime"))
varhunt$anova_fstat_heatmap
## Both batch estimates are very strong on PC1 and PC2, incubation time comes out on the 3rd PC.
## The condition looks like it is stuck with with, unfortunately.

sc2_varpart <- varpart(expt=sc2_expt, predictor=NULL,
                       factors=c("condition","batch","originalbatch", "incubationtime"))
sc2_varpart$partition_plot
## Well here at least, it looks like batch kicks ass.
```

# Write the expt

Let us apply the same set of normalizations/surrogate estimates and write out
the resulting tables.

```{r fsva_written, eval=FALSE}
fun <- write_expt(sc2_expt, excel=paste0("excel/samples_written-v", ver, ".xlsx"),
                  filter=TRUE, norm="raw", batch="fsva", transform="log2", violin=TRUE)
```

```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  tmp <- sm(saveme(filename=this_save))
}
```
