All of my analyses rely on a spreadsheet to handle keeping track of stuff like indexes, batches, conditions, etc. I have a small copy of this sitting in the root directory. all_samples.csv
The columns Batch and Condition are used for defining the models used by limma/DESeq2/edgeR, batch correction(not applicable in this case), etc.
This document should make clear the suitability of our yeast data for differential expression analyses. It should also give some ideas about the depth and distribution of the data.
There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.
## First make a bar plot of the library sizes in the experiment.
## Notice that the colors were auto-chosen by create_expt() and they should
## be maintained throughout this process
fis_libsize <- plot_libsize(sc_expt)
fis_libsize## $plot## 
## $table
##           id      sum condition  colors
##  1: hpgl0564 24090576        wt #1B9E77
##  2: hpgl0565 15640237        wt #1B9E77
##  3: hpgl0566  7744253        wt #1B9E77
##  4: hpgl0567 20750638        wt #1B9E77
##  5: hpgl0568 20233725       mut #D95F02
##  6: hpgl0569  9309552       mut #D95F02
##  7: hpgl0570 25558770       mut #D95F02
##  8: hpgl0571  8826828       mut #D95F02
##  9:       wt 13442858       wt2 #7570B3
## 10:     upf1 11677282      upf1 #E7298A
## 11:     upf2 12145075      upf2 #66A61E
## 12:     upf3 11201490      upf3 #E6AB02
## 
## $summary
##    condition      min      1st   median     mean      3rd      max
## 1:        wt  7744253 13666241 18195438 17056426 21585622 24090576
## 2:       mut  8826828  9188871 14771638 15982219 21564986 25558770
## 3:       wt2 13442858 13442858 13442858 13442858 13442858 13442858
## 4:      upf1 11677282 11677282 11677282 11677282 11677282 11677282
## 5:      upf2 12145075 12145075 12145075 12145075 12145075 12145075
## 6:      upf3 11201490 11201490 11201490 11201490 11201490 11201490## I am including in this the external data sets, wt/upf1/upf2/upf3
fis_nonzero <- plot_nonzero(sc_expt, title="nonzero vs. cpm")
fis_nonzero## $plot## 
## $table
##                id nonzero_genes    cpm condition batch   color
## hpgl0564 hpgl0564          6394 24.091        wt     1 #1B9E77
## hpgl0565 hpgl0565          6358 15.640        wt     1 #1B9E77
## hpgl0566 hpgl0566          6330  7.744        wt     1 #1B9E77
## hpgl0567 hpgl0567          6385 20.751        wt     1 #1B9E77
## hpgl0568 hpgl0568          6383 20.234       mut     1 #D95F02
## hpgl0569 hpgl0569          6315  9.310       mut     1 #D95F02
## hpgl0570 hpgl0570          6406 25.559       mut     1 #D95F02
## hpgl0571 hpgl0571          6318  8.827       mut     1 #D95F02
## wt             wt          6385 13.443       wt2     2 #7570B3
## upf1         upf1          6400 11.677      upf1     2 #E7298A
## upf2         upf2          6392 12.145      upf2     2 #66A61E
## upf3         upf3          6404 11.201      upf3     2 #E6AB02## The heterogeneous data set has a very different profile of gene coverage, that is interesting.
fis_density <- sm(plot_density(sc_expt))
fis_density## $plot## 
## $condition_summary
##    condition min   1st median mean  3rd    max
## 1:        wt 0.5 153.8    507 2549 1333 522951
## 2:       mut 0.5 138.0    494 2389 1373 752118
## 3:       wt2 0.5 176.5    562 2009 1404 323725
## 4:      upf1 0.5 251.0    665 1745 1410 223662
## 5:      upf2 0.5 257.0    676 1815 1441 249224
## 6:      upf3 0.5 244.0    635 1674 1358 223631
## 
## $batch_summary
##    batch min   1st median mean  3rd    max
## 1:     1 0.5 146.0    501 2469 1353 752118
## 2:     2 0.5 228.8    636 1811 1408 323725
## 
## $sample_summary
##       sample min   1st median mean    3rd    max
##  1: hpgl0564 0.5 266.0    778 3600 1896.5 483526
##  2: hpgl0565 0.5 164.0    480 2338 1177.0 372351
##  3: hpgl0566 0.5  93.0    285 1157  717.0 177661
##  4: hpgl0567 0.5 201.0    623 3101 1580.5 522951
##  5: hpgl0568 0.5 219.5    737 3024 1887.5 606290
##  6: hpgl0569 0.5 109.0    348 1391  853.0 266373
##  7: hpgl0570 0.5 279.5    907 3820 2283.0 752118
##  8: hpgl0571 0.5  88.0    301 1319  765.5 274040
##  9:       wt 0.5 176.5    562 2009 1403.5 323725
## 10:     upf1 0.5 251.0    665 1745 1409.5 223662
## 11:     upf2 0.5 257.0    676 1815 1441.0 249224
## 12:     upf3 0.5 244.0    635 1674 1358.0 223631
## 
## $table
##           id   sample counts condition batch
##     1:  AAC1 hpgl0564    141        wt     1
##     2:  AAC3 hpgl0564    236        wt     1
##     3: AAD10 hpgl0564    189        wt     1
##     4: AAD14 hpgl0564    389        wt     1
##     5: AAD15 hpgl0564     94        wt     1
##    ---                                      
## 80288:  ZRT2     upf3   4420      upf3     2
## 80289:  ZRT3     upf3   1446      upf3     2
## 80290:  ZTA1     upf3    433      upf3     2
## 80291:  ZUO1     upf3   7530      upf3     2
## 80292:  ZWF1     upf3   3969      upf3     2## Reasonably consistent, that should be recapitulated in a boxplot
fis_boxplot <- sm(plot_boxplot(sc_expt))
fis_boxplot## I can see there there is a larg number of low-count genes shared among all samples, they will need to be removed.
fis_corheat <- plot_disheat(sc_expt)## Non-normalized data still separated reasonably well, but you can easily see that at least some normalization will need to happen.The metrics of the raw data were encouraging, lets look again after performing a series of ‘stock’ normalizations: log2(quantile(cpm(filter(data)))): Remove low-count genes, cpm the libraries, quantile normalize, and log2 the result, then replot.
sc_norm <- sm(normalize_expt(sc_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
sc_metrics <- sm(graph_metrics(sc_norm))The data should now be normalized, lets view some metrics post-facto.
sc_metrics$corheat## It appears that just the normalization is sufficient to split the samples completely by type and deeply separate them from the heterologous samples
sc_metrics$smc## The samples are very well behaved, none fall below the red line.
sc_metrics$pcaplot## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse## The homogeneous wt/mutant are nicely separated, and what is more, the exogeneous samples also split wt/mutant, that might prove to be quite useful.our_expt <- subset_expt(sc_expt, subset="batch=='1'")
our_norm <- sm(normalize_expt(our_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE))
plot_disheat(our_norm)$plot## The euclidean distances between samples are pretty small, this is primarily because we did a log2 of the raw data
plot_corheat(our_norm)$plot## The non-rank correlations are exceedingly high among the samples, but definitely lower from wt<->mutant.
our_pca <- sm(plot_pca(our_norm))
our_pca$res##   propVar cumPropVar cond.R2
## 1   73.84      73.84   94.34
## 2   15.30      89.14    3.61
## 3    4.47      93.61    1.83
## 4    2.38      95.99    0.19
## 5    1.69      97.68    0.02
## 6    1.21      98.89    0.01
## 7    1.10      99.99    0.00our_pca$plot## The split is super obvious, now.
## I think this is the most nicely behaved data I have used so far by a pretty long shot.Given how well behaved this data is, perhaps we can make it more interesting by shoe-horning in the exogeneous data with a batch correction. I am not going to tell the computer to shut up for this part so that I can see what happens at each step, it will therefore be a bit longer and chattier.
exo_norm <- sm(normalize_expt(exo_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE, batch="sva", low_to_zero=TRUE))
## ok, that looks ok.
exo_metrics <- graph_metrics(exo_norm)## Graphing number of non-zero genes with respect to CPM by library.## Graphing library sizes.## Graphing a boxplot.## This data will benefit from being displayed on the log scale.## If this is not desired, set scale='raw'## Some entries are 0.  We are on log scale, adding 1 to the data.## Changed 32 zero count features.## Graphing a correlation heatmap.## Graphing a standard median correlation.## Performing correlation.## Graphing a distance heatmap.## Graphing a standard median distance.## Performing distance.## Graphing a PCA plot.## Graphing a T-SNE plot.## Plotting a density plot.## This data will benefit from being displayed on the log scale.## If this is not desired, set scale='raw'## Some entries are 0.  We are on log scale, setting them to 0.5.## Changed 32 zero count features.## Plotting the representation of the top-n genes.## Printing a color to condition legend.## Too few points to calculate an ellipse
## Too few points to calculate an ellipseexo_metrics$pcaplot## Too few points to calculate an ellipse
## Too few points to calculate an ellipse## It looks like it successfully merged these data sets??  Interesting.##v2_expt <- create_expt(metadata="sample_sheets/all_samples.xlsx", file_column="bowtiefile",
##                       gene_info=sc_all_annotations)
v2_expt <- sm(create_expt(metadata="sample_sheets/all_samples.xlsx", file_column="bt2file",
                          gene_info=sc_all_annotations))
v2_filt <- normalize_expt(v2_expt, filter=TRUE)## This function will replace the expt$expressionset slot with:## hpgl(data)## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize## Leaving the data in its current base format, keep in mind that
##  some metrics are easier to see when the data is log2 transformed, but
##  EdgeR/DESeq do not accept transformed data.## Leaving the data unconverted.  It is often advisable to cpm/rpkm
##  the data to normalize for sampling differences, keep in mind though that rpkm
##  has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
##  will try to detect this).## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.## Step 1: performing count filter with option: hpgl## Removing 1158 low-count genes (5968 remaining).## Step 2: not normalizing the data.## Step 3: not converting the data.## Step 4: not transforming the data.## Step 5: not doing batch correction.v2_norm <- sm(normalize_expt(v2_expt, convert="cpm", norm="quant", filter=TRUE, transform="log2"))
v2_metrics <- graph_metrics(v2_norm)## Graphing number of non-zero genes with respect to CPM by library.## Graphing library sizes.## Graphing a boxplot.## Graphing a correlation heatmap.## Graphing a standard median correlation.## Performing correlation.## Graphing a distance heatmap.## Graphing a standard median distance.## Performing distance.## Graphing a PCA plot.## Graphing a T-SNE plot.## Plotting a density plot.## Plotting the representation of the top-n genes.## Printing a color to condition legend.v2_metrics$pcaplotv2_plotting <- normalize_expt(v2_expt, filter=TRUE, transform="log2", convert="cpm", batch="fsva")## This function will replace the expt$expressionset slot with:## log2(fsva(cpm(hpgl(data))))## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.## Step 1: performing count filter with option: hpgl## Removing 1158 low-count genes (5968 remaining).## Step 2: not normalizing the data.## Step 3: converting the data with cpm.## Step 4: transforming the data with log2.## transform_counts: Found 14 values equal to 0, adding 1 to the matrix.## Step 5: doing batch correction with fsva.## Note to self:  If you get an error like 'x contains missing values'; I think this
##  means that the data has too many 0's and needs to have a better low-count filter applied.## batch_counts: Before batch correction, 1576 entries 0<x<1.## batch_counts: Before batch correction, 14 entries are >= 0.## batch_counts: Using sva::fsva for batch correction.## The number of elements which are < 0 after batch correction is: 12## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSEtt <- graph_metrics(v2_plotting)## Graphing number of non-zero genes with respect to CPM by library.## Graphing library sizes.## Graphing a boxplot.## Graphing a correlation heatmap.## Graphing a standard median correlation.## Performing correlation.## Graphing a distance heatmap.## Graphing a standard median distance.## Performing distance.## Graphing a PCA plot.## Graphing a T-SNE plot.## Plotting a density plot.## Plotting the representation of the top-n genes.## Printing a color to condition legend.tt$pcaplottesting_pca <- pca_information(v2_norm, plot_pcas=TRUE,
                               expt_factors=c("condition","batch","incubationtime",
                                              "strain","tube", "cbf5igv", "upf1igv"))## More shallow curves in these plots suggest more genes in this principle component.testing_pca$pca_plots$PC3_PC4$plot## <environment: 0x55e941558ff0>fun <- write_expt(v2_filt, excel=paste0("excel/new_samples-v", ver, ".xlsx"),
                  filter="cbcb", norm="raw", convert="cpm", batch="svaseq", transform="log2")## Writing the legend.## Writing the raw reads.## Graphing the raw reads.## Writing the normalized reads.## Graphing the normalized reads.## Writing the median reads by factor.## The factor wtc_wtu has 4 rows.## The factor mtc_wtu has 4 rows.## The factor wtc_mtu has 4 rows.## The factor mtc_mtu has 4 rows.if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}## If you wish to reproduce this exact build of hpgltools, invoke the following:## > git clone http://github.com/abelew/hpgltools.git## > git reset 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc## R> packrat::restore()## This is hpgltools commit: Wed Mar 21 15:55:32 2018 -0400: 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc## Saving to annotation-v20151102.rda.xz