1 S. cerevisiae sample estimation

This document should make clear the suitability of our yeast data for differential expression analyses. It should also give some ideas about the depth and distribution of the data.

2 Visualizing raw data

There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.

sc_filt <- normalize_expt(sc1_expt, filter=TRUE)
## This function will replace the expt$expressionset slot with:
## cbcb(data)
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
##  some metrics are easier to see when the data is log2 transformed, but
##  EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted.  It is often advisable to cpm/rpkm
##  the data to normalize for sampling differences, keep in mind though that rpkm
##  has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
##  will try to detect this).
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 716 low-count genes (5975 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
sc1_libsize <- plot_libsize(sc1_expt)
sc1_libsize

sc1_nonzero <- plot_nonzero(sc_filt)
sc1_nonzero

sc1_density <- plot_density(sc_filt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 337 zero count features.
sc1_density

sc1_boxplot <- plot_boxplot(sc_filt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 337 zero count features.
sc1_boxplot

3 Normalize and visualize

sc_norm <- sm(normalize_expt(sc1_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
sc_metrics <- sm(graph_metrics(sc_norm))

The data should now be normalized, lets view some metrics post-facto.

sc_metrics$corheat

sc_metrics$smc

sc_metrics$pcaplot

4 Try a batch estimation

sc1_sva <- sm(normalize_expt(sc1_expt, transform="log2",
                             filter=TRUE, batch="sva", low_to_zero=TRUE))
sc1_sva_metrics <- graph_metrics(sc1_sva)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 349 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 349 zero count features.
## Printing a color to condition legend.

sc1_sva_metrics$pcaplot

sc1_combat <- sm(normalize_expt(sc1_expt, transform="log2",
                                   filter=TRUE, batch="combat_scale", low_to_zero=TRUE))

sc1_combat_metrics <- graph_metrics(sc1_combat)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Plotting a density plot.
## Printing a color to condition legend.

sc1_combat_metrics$pcaplot

sc1_fsva <- sm(normalize_expt(sc1_expt, transform="log2",
                              filter=TRUE, batch="fsva", low_to_zero=TRUE))
sc1_fsva_metrics <- graph_metrics(sc1_fsva)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Plotting a density plot.
## Printing a color to condition legend.

sc1_fsva_metrics$pcaplot

5 Write the expt

fun <- write_expt(sc1_expt, excel=paste0("excel/samples_written-v", ver, ".xlsx"),
                  filter=TRUE, norm="raw", batch="fsva", transform="log2")
## Writing the legend.
## The sheet: legend is in legend.
## Warning in max(nchar(as.character(data[[data_col]])), na.rm = TRUE): no
## non-missing arguments to max; returning -Inf

## Warning in max(nchar(as.character(data[[data_col]])), na.rm = TRUE): no
## non-missing arguments to max; returning -Inf
## Writing the raw reads.
## Graphing the raw reads.

## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## Writing the normalized reads.
## Graphing the normalized reads.

## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## Writing the median reads by factor.
## The factor mtc_wtu has 4 rows.
## The factor wtc_mtu has 3 rows.
## The factor wtc_wtu has 5 rows.

6 Repeat and drop the external data set

sc_expt <- expt_subset(expt=sc1_expt, subset="batch=='y'")
sc_norm <- sm(normalize_expt(sc_expt, transform="log2", filter=TRUE, norm="quant", convert="cpm"))
sc_metrics <- graph_metrics(sc_norm)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## There is just one batch in this data.
## Plotting a density plot.
## Printing a color to condition legend.

sc_metrics$pcaplot

library("pander")
pander(sessionInfo())

R version 3.3.3 (2017-03-06)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets and base

other attached packages: pander(v.0.6.0), ruv(v.0.9.6), hpgltools(v.2017.01) and rmarkdown(v.1.5)

loaded via a namespace (and not attached): ggrepel(v.0.6.5), Rcpp(v.0.12.11), locfit(v.1.5-9.1), lattice(v.0.20-35), corpcor(v.1.6.9), gtools(v.3.5.0), rprojroot(v.1.2), digest(v.0.6.12), foreach(v.1.4.3), R6(v.2.2.1), plyr(v.1.8.4), backports(v.1.0.5), stats4(v.3.3.3), RSQLite(v.1.1-2), evaluate(v.0.10), sva(v.3.22.0), ggplot2(v.2.2.1), rlang(v.0.1.1), lazyeval(v.0.2.0), data.table(v.1.10.4), annotate(v.1.52.1), S4Vectors(v.0.12.2), Matrix(v.1.2-10), preprocessCore(v.1.36.0), labeling(v.0.3), devtools(v.1.13.1), splines(v.3.3.3), stringr(v.1.2.0), RCurl(v.1.95-4.8), munsell(v.0.4.3), BiocGenerics(v.0.20.0), base64enc(v.0.1-3), mgcv(v.1.8-17), htmltools(v.0.3.6), tibble(v.1.3.1), roxygen2(v.6.0.1), edgeR(v.3.16.5), IRanges(v.2.8.2), codetools(v.0.2-15), matrixStats(v.0.52.2), XML(v.3.98-1.7), crayon(v.1.3.2), withr(v.1.0.2), bitops(v.1.0-6), commonmark(v.1.2), grid(v.3.3.3), nlme(v.3.1-131), xtable(v.1.8-2), gtable(v.0.2.0), DBI(v.0.6-1), magrittr(v.1.5), scales(v.0.4.1), stringi(v.1.1.5), reshape2(v.1.4.2), genefilter(v.1.56.0), testthat(v.1.0.2), limma(v.3.30.13), xml2(v.1.1.1), openxlsx(v.4.0.17), RColorBrewer(v.1.1-2), iterators(v.1.0.8), tools(v.3.3.3), Biobase(v.2.34.0), parallel(v.3.3.3), survival(v.2.41-3), yaml(v.2.1.14), AnnotationDbi(v.1.36.2), colorspace(v.1.3-2), memoise(v.1.1.0), knitr(v.1.16) and methods(v.3.3.3)

LS0tCnRpdGxlOiAiUy4gY2VyZXZpc2lhZSBzYW1wbGUgZXN0aW1hdGlvbi4iCmF1dGhvcjogImF0YiBhYmVsZXdAZ21haWwuY29tIgpkYXRlOiAiYHIgU3lzLkRhdGUoKWAiCm91dHB1dDoKIGh0bWxfZG9jdW1lbnQ6CiAgY29kZV9kb3dubG9hZDogdHJ1ZQogIGNvZGVfZm9sZGluZzogc2hvdwogIGZpZ19jYXB0aW9uOiB0cnVlCiAgZmlnX2hlaWdodDogNwogIGZpZ193aWR0aDogNwogIGhpZ2hsaWdodDogZGVmYXVsdAogIGtlZXBfbWQ6IGZhbHNlCiAgbW9kZTogc2VsZmNvbnRhaW5lZAogIG51bWJlcl9zZWN0aW9uczogdHJ1ZQogIHNlbGZfY29udGFpbmVkOiB0cnVlCiAgdGhlbWU6IHJlYWRhYmxlCiAgdG9jOiB0cnVlCiAgdG9jX2Zsb2F0OgogICAgY29sbGFwc2VkOiBmYWxzZQogICAgc21vb3RoX3Njcm9sbDogZmFsc2UKLS0tCgo8c3R5bGU+CiAgPCEtLSBEb2N1bWVudCBwcmVsdWRlIHJldmlzaW9uIDIwMTctMDIgLS0+CiAgYm9keSAubWFpbi1jb250YWluZXIgewogICAgbWF4LXdpZHRoOiAxNjAwcHg7Cn0KPC9zdHlsZT4KCmBgYHtyIG9wdGlvbnMsIGluY2x1ZGU9RkFMU0V9CiMjIFRoZXNlIGFyZSB0aGUgb3B0aW9ucyBJIHRlbmQgdG8gZmF2b3IKbGlicmFyeSgiaHBnbHRvb2xzIikKdHQgPC0gZGV2dG9vbHM6OmxvYWRfYWxsKCJ+L2hwZ2x0b29scyIpCmtuaXRyOjpvcHRzX2tuaXQkc2V0KAogICAgcHJvZ3Jlc3MgPSBUUlVFLAogICAgdmVyYm9zZSA9IFRSVUUsCiAgICB3aWR0aCA9IDkwLAogICAgZWNobyA9IFRSVUUpCmtuaXRyOjpvcHRzX2NodW5rJHNldCgKICAgIGVycm9yID0gVFJVRSwKICAgIGZpZy53aWR0aCA9IDgsCiAgICBmaWcuaGVpZ2h0ID0gOCwKICAgIGRwaSA9IDk2KQpvcHRpb25zKAogICAgZGlnaXRzID0gNCwKICAgIHN0cmluZ3NBc0ZhY3RvcnMgPSBGQUxTRSwKICAgIGtuaXRyLmR1cGxpY2F0ZS5sYWJlbCA9ICJhbGxvdyIpCmdncGxvdDI6OnRoZW1lX3NldChnZ3Bsb3QyOjp0aGVtZV9idyhiYXNlX3NpemU9MTApKQpzZXQuc2VlZCgxKQp2ZXIgPC0gIjIwMTcwNTE1IgpwcmV2aW91c19maWxlIDwtICJhbm5vdGF0aW9uLlJtZCIKYGBgCgpgYGB7ciBsb2FkbWUsIGluY2x1ZGU9RkFMU0V9CnRtcCA8LSB0cnkoc20obG9hZG1lKGZpbGVuYW1lPXBhc3RlMChnc3ViKHBhdHRlcm49IlxcLlJtZCIsIHJlcGxhY2U9IiIsIHg9cHJldmlvdXNfZmlsZSksICItdiIsIHZlciwgIi5yZGEueHoiKSkpKQpgYGAKCmBgYHtyIHJlbmRlciwgZXZhbD1GQUxTRSwgaW5jbHVkZT1GQUxTRX0Kcm1kX2ZpbGUgPC0gInNhbXBsZV9lc3RpbWF0aW9uLlJtZCIKdGhpc19zYXZlIDwtIHBhc3RlMChnc3ViKHBhdHRlcm49IlxcLlJtZCIsIHJlcGxhY2U9IiIsIHg9cm1kX2ZpbGUpLCAiLXYiLCB2ZXIsICIucmRhLnh6IikKCnJtYXJrZG93bjo6cmVuZGVyKHJtZF9maWxlKQoKcm1hcmtkb3duOjpyZW5kZXIocm1kX2ZpbGUsIG91dHB1dF9mb3JtYXQ9InBkZl9kb2N1bWVudCIsIG91dHB1dF9vcHRpb25zPWMoInNraXBfaHRtbCIpKQpgYGAKClMuIGNlcmV2aXNpYWUgc2FtcGxlIGVzdGltYXRpb24KPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQoKVGhpcyBkb2N1bWVudCBzaG91bGQgbWFrZSBjbGVhciB0aGUgc3VpdGFiaWxpdHkgb2Ygb3VyIHllYXN0IGRhdGEgZm9yIGRpZmZlcmVudGlhbCBleHByZXNzaW9uCmFuYWx5c2VzLiAgSXQgc2hvdWxkIGFsc28gZ2l2ZSBzb21lIGlkZWFzIGFib3V0IHRoZSBkZXB0aCBhbmQgZGlzdHJpYnV0aW9uIG9mCnRoZSBkYXRhLgoKIyBWaXN1YWxpemluZyByYXcgZGF0YQoKVGhlcmUgYXJlIGxvdHMgb2YgdG95cyB3ZSBoYXZlIGxlYXJuZWQgdG8gdXNlIHRvIHBsYXkgd2l0aCB3aXRoIHJhdyBkYXRhIGFuZCBleHBsb3JlIHN0dWZmIGxpa2UKYmF0Y2ggZWZmZWN0cyBvciBub24tY2Fub25pY2FsIGRpc3RyaWJ1dGlvbnMgb3Igc2tld2VkIGNvdW50cy4gIGhwZ2x0b29scyBwcm92aWRlcyBzb21lIGZ1bmN0aW9uYWxpdHkKdG8gbWFrZSB0aGlzIHByb2Nlc3MgZWFzaWVyLiAgVGhlIGdyYXBocyBzaG93biBiZWxvdyBhbmQgbWFueSBtb3JlIGFyZSBnZW5lcmF0ZWQgd2l0aCB0aGUgd3JhcHBlcgonZ3JhcGhfbWV0cmljcygpJyBidXQgdGhhdCB0YWtlcyBhd2F5IHRoZSBjaGFuY2UgdG8gZXhwbGFpbiB0aGUgZ3JhcGhzIGFzIEkgZ2VuZXJhdGUgdGhlbS4KCmBgYHtyIHJhd19leHBsb3JlfQpzY19maWx0IDwtIG5vcm1hbGl6ZV9leHB0KHNjMV9leHB0LCBmaWx0ZXI9VFJVRSkKc2MxX2xpYnNpemUgPC0gcGxvdF9saWJzaXplKHNjMV9leHB0KQpzYzFfbGlic2l6ZQoKc2MxX25vbnplcm8gPC0gcGxvdF9ub256ZXJvKHNjX2ZpbHQpCnNjMV9ub256ZXJvCgpzYzFfZGVuc2l0eSA8LSBwbG90X2RlbnNpdHkoc2NfZmlsdCkKc2MxX2RlbnNpdHkKCnNjMV9ib3hwbG90IDwtIHBsb3RfYm94cGxvdChzY19maWx0KQpzYzFfYm94cGxvdApgYGAKCiMgTm9ybWFsaXplIGFuZCB2aXN1YWxpemUKCmBgYHtyIG5vcm1hbGl6ZSwgZmlnLnNob3c9ImhpZGUifQpzY19ub3JtIDwtIHNtKG5vcm1hbGl6ZV9leHB0KHNjMV9leHB0LCB0cmFuc2Zvcm09ImxvZzIiLCBub3JtPSJxdWFudCIsIGNvbnZlcnQ9ImNwbSIsIGZpbHRlcj1UUlVFKSkKc2NfbWV0cmljcyA8LSBzbShncmFwaF9tZXRyaWNzKHNjX25vcm0pKQpgYGAKClRoZSBkYXRhIHNob3VsZCBub3cgYmUgbm9ybWFsaXplZCwgbGV0cyB2aWV3IHNvbWUgbWV0cmljcyBwb3N0LWZhY3RvLgoKYGBge3Igbm9ybXZpen0Kc2NfbWV0cmljcyRjb3JoZWF0CgpzY19tZXRyaWNzJHNtYwoKc2NfbWV0cmljcyRwY2FwbG90CmBgYAoKIyBUcnkgYSBiYXRjaCBlc3RpbWF0aW9uCgpgYGB7ciBiYXRjaF9leG99CnNjMV9zdmEgPC0gc20obm9ybWFsaXplX2V4cHQoc2MxX2V4cHQsIHRyYW5zZm9ybT0ibG9nMiIsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZmlsdGVyPVRSVUUsIGJhdGNoPSJzdmEiLCBsb3dfdG9femVybz1UUlVFKSkKc2MxX3N2YV9tZXRyaWNzIDwtIGdyYXBoX21ldHJpY3Moc2MxX3N2YSkKc2MxX3N2YV9tZXRyaWNzJHBjYXBsb3QKCnNjMV9jb21iYXQgPC0gc20obm9ybWFsaXplX2V4cHQoc2MxX2V4cHQsIHRyYW5zZm9ybT0ibG9nMiIsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZmlsdGVyPVRSVUUsIGJhdGNoPSJjb21iYXRfc2NhbGUiLCBsb3dfdG9femVybz1UUlVFKSkKc2MxX2NvbWJhdF9tZXRyaWNzIDwtIGdyYXBoX21ldHJpY3Moc2MxX2NvbWJhdCkKc2MxX2NvbWJhdF9tZXRyaWNzJHBjYXBsb3QKCnNjMV9mc3ZhIDwtIHNtKG5vcm1hbGl6ZV9leHB0KHNjMV9leHB0LCB0cmFuc2Zvcm09ImxvZzIiLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICBmaWx0ZXI9VFJVRSwgYmF0Y2g9ImZzdmEiLCBsb3dfdG9femVybz1UUlVFKSkKc2MxX2ZzdmFfbWV0cmljcyA8LSBncmFwaF9tZXRyaWNzKHNjMV9mc3ZhKQpzYzFfZnN2YV9tZXRyaWNzJHBjYXBsb3QKYGBgCgojIFdyaXRlIHRoZSBleHB0CgpgYGB7ciBmc3ZhX3dyaXR0ZW59CmZ1biA8LSB3cml0ZV9leHB0KHNjMV9leHB0LCBleGNlbD1wYXN0ZTAoImV4Y2VsL3NhbXBsZXNfd3JpdHRlbi12IiwgdmVyLCAiLnhsc3giKSwKICAgICAgICAgICAgICAgICAgZmlsdGVyPVRSVUUsIG5vcm09InJhdyIsIGJhdGNoPSJmc3ZhIiwgdHJhbnNmb3JtPSJsb2cyIikKYGBgCgojIFJlcGVhdCBhbmQgZHJvcCB0aGUgZXh0ZXJuYWwgZGF0YSBzZXQKCmBgYHtyIGRyb3BwZWRfZXhvZ2Vub3VzfQpzY19leHB0IDwtIGV4cHRfc3Vic2V0KGV4cHQ9c2MxX2V4cHQsIHN1YnNldD0iYmF0Y2g9PSd5JyIpCnNjX25vcm0gPC0gc20obm9ybWFsaXplX2V4cHQoc2NfZXhwdCwgdHJhbnNmb3JtPSJsb2cyIiwgZmlsdGVyPVRSVUUsIG5vcm09InF1YW50IiwgY29udmVydD0iY3BtIikpCnNjX21ldHJpY3MgPC0gZ3JhcGhfbWV0cmljcyhzY19ub3JtKQpzY19tZXRyaWNzJHBjYXBsb3QKYGBgCgpgYGB7ciBzYXZlbWUsIGluY2x1ZGU9RkFMU0V9CnRtcCA8LSBzbShzYXZlbWUoZmlsZW5hbWU9dGhpc19zYXZlKSkKYGBgCgpgYGB7ciBwYW5kZXJ9CmxpYnJhcnkoInBhbmRlciIpCnBhbmRlcihzZXNzaW9uSW5mbygpKQpgYGAK