This document should make clear the suitability of our yeast data for differential expression analyses. It should also give some ideas about the depth and distribution of the data.
Later, we will likely choose to exclude one of the three experimental batches in the data. Therefore, I am making three expressionsets, one with all data, and one the various subsets.
In response to some ideas around 20180606, we will extract all the samples which are wt/cbf5 and not excessively batch-ridden (eg. some of our wt samples)
all_expt <- sm(create_expt(metadata="sample_sheets/all_samples.xlsx",
gene_info=sc_all_annotations,
file_column="allfile"))
## Don't forget, I need to change the condition names.
E1_expt <- subset_expt(all_expt, subset="batch=='E1'")
## There were 24, now there are 8 samples.
E2_expt <- subset_expt(all_expt, subset="batch!='E1'")
## There were 24, now there are 16 samples.
E1E2B2_expt <- subset_expt(all_expt, subset="batch!='E2B1'")
## There were 24, now there are 16 samples.
E1E2B1_expt <- subset_expt(all_expt, subset="batch!='E2B2'")
## There were 24, now there are 16 samples.
E1E2B2_cbf5wt <- subset_expt(E1E2B2_expt, subset="condition=='WT'|condition=='cbf5_D95A'")
## There were 16, now there are 12 samples.
There are lots of methods we have to examine raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’, when invoked it provides a list of plots including: library sizes, the number of non-zero genes, density/box plots by sample, distance/correlation heatmaps, standard median correlation/distance, PCA/TSNE clustering, top-n genes, a legend describing the colors/symbols used, and some tables describing the data. Optionally, it can also perform quantile-quantile plots showing the distribution of each sample vs. the median of all samples and MA plots of the same.
Caveat: some plots do not work well with gene IDs that are all-0, thus I first filter the data to remove them.
I also added a neat function ‘plot_libsize_prepost()’ which shows how many genes are poorly represented before/after filtering the data.
The plots printed here are all metrics which are useful when considering raw data.
all_filt <- sm(normalize_expt(all_expt, filter=TRUE))
all_metrics <- sm(graph_metrics(all_expt))
Now print some plots of interest of the raw data.
pp(file="illustrator_input/01_legend.pdf", image=all_metrics$legend)
## Writing the image to: illustrator_input/01_legend.pdf and calling dev.off().
pp(file="illustrator_input/02_raw_libsize.pdf", image=all_metrics$libsize)
## Writing the image to: illustrator_input/02_raw_libsize.pdf and calling dev.off().
prepost <- plot_libsize_prepost(all_expt)
pp(file="illustrator_input/03_libsize_changed_lowgenes.pdf", image=prepost$lowgene_plot)
## Writing the image to: illustrator_input/03_libsize_changed_lowgenes.pdf and calling dev.off().
pp(file="illustrator_input/04_nonzero_genes.pdf", image=all_metrics$nonzero)
## Writing the image to: illustrator_input/04_nonzero_genes.pdf and calling dev.off().
pp(file="illustrator_input/05_raw_boxplot.pdf", image=all_metrics$boxplot)
## Writing the image to: illustrator_input/05_raw_boxplot.pdf and calling dev.off().
pp(file="illustrator_input/06_raw_density.pdf", image=all_metrics$density)
## Writing the image to: illustrator_input/06_raw_density.pdf and calling dev.off().
pp(file="illustrator_input/07_raw_boxplot.pdf", image=all_metrics$boxplot)
## Writing the image to: illustrator_input/07_raw_boxplot.pdf and calling dev.off().
pp(file="illustrator_input/08_topn.pdf", image=all_metrics$topnplot)
## Writing the image to: illustrator_input/08_topn.pdf and calling dev.off().
Other metrics are more useful when used with data on the log scale and normalized by number of reads/library and/or by quantile.
all_exptm <- normalize_expt(all_expt, transform="log2", convert="cpm", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(cpm(hpgl(data)))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: hpgl
## Removing 1030 low-count genes (6095 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1152 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
all_metrics <- graph_metrics(all_exptm)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## Plotting the representation of the top-n genes.
## Printing a color to condition legend.
The data should now be normalized, lets view some metrics post-facto.
pp(file="illustrator_input/09_norm_corheat.pdf", image=all_metrics$corheat)
## Writing the image to: illustrator_input/09_norm_corheat.pdf and calling dev.off().
pp(file="illustrator_input/10_norm_disheat.pdf", image=all_metrics$disheat)
## Writing the image to: illustrator_input/10_norm_disheat.pdf and calling dev.off().
## It appears that just the normalization is sufficient to split the samples completely by type and deeply separate them from the heterologous samples
pp(file="illustrator_input/11_norm_smc.pdf", image=all_metrics$smc)
## Writing the image to: illustrator_input/11_norm_smc.pdf and calling dev.off().
pp(file="illustrator_input/12_norm_smd.pdf", image=all_metrics$smd)
## Writing the image to: illustrator_input/12_norm_smd.pdf and calling dev.off().
## The samples are very well behaved, none fall below the red line.
pp(file="illustrator_input/13_norm_pca.pdf", image=all_metrics$pcaplot)
## Writing the image to: illustrator_input/13_norm_pca.pdf and calling dev.off().
pp(file="illustrator_input/14_norm_tsne.pdf", image=all_metrics$tsneplot)
## Writing the image to: illustrator_input/14_norm_tsne.pdf and calling dev.off().
## The homogeneous wt/mutant are nicely separated, and what is more, the exogeneous samples also split wt/mutant, that might prove to be quite useful.
Above, we split out the wt/cbf5 samples, now lets normalize and make a few plots of potential interest. Let us use the same batch method (fsva).
We will do this for both only the old samples and also for the old+new samples.
E1_norm <- normalize_expt(E1_expt, filter=TRUE, norm="quant", convert="cpm", transform="log")
## This function will replace the expt$expressionset slot with:
## log(cpm(quant(hpgl(data))))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: hpgl
## Removing 1280 low-count genes (5845 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log.
## Step 5: not doing batch correction.
cbf5_norm <- sm(normalize_expt(E1E2B2_cbf5wt, filter=TRUE, norm="quant", convert="cpm",
transform="log"))
cbf5_batch <- sm(normalize_expt(E1E2B2_cbf5wt, filter=TRUE, norm="quant", convert="cpm",
transform="log", batch="fsva"))
E1_raw_metrics <- sm(graph_metrics(E1_expt))
E1_norm_metrics <- sm(graph_metrics(E1_norm))
cbf5_raw_metrics <- sm(graph_metrics(E1E2B2_cbf5wt))
cbf5_norm_metrics <- sm(graph_metrics(cbf5_norm))
cbf5_batch_metrics <- sm(graph_metrics(cbf5_batch))
Now print out some of the interesting plots.
pp(file="illustrator_input/32_E1_norm_corheat.pdf", image=E1_norm_metrics$corheat)
## Writing the image to: illustrator_input/32_E1_norm_corheat.pdf and calling dev.off().
pp(file="illustrator_input/33_E1_norm_pca.pdf", image=E1_norm_metrics$pcaplot)
## Writing the image to: illustrator_input/33_E1_norm_pca.pdf and calling dev.off().
pp(file="illustrator_input/34_E1_norm_disheat.pdf", image=E1_norm_metrics$disheat)
## Writing the image to: illustrator_input/34_E1_norm_disheat.pdf and calling dev.off().
This is an interesting aside which came up last week for some other data. It might be good to include the % variance correlated with ‘condition’ as a column in the annotation data for the expressionset. Thus, when we do the differential expression analysis later, we can look and see if a ‘significant’ gene has variance which is actually correlated with condition.
Since writing these analyses, I implemented this idea and so am including it here.
vp <- varpart(all_expt, predictor=NULL, factors=c("condition", "batch"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.8 min
## Placing factor: condition at the beginning of the model.
pp(file="illustrator_input/15_merged_varpart_partition.pdf", image=vp$partition_plot)
## Writing the image to: illustrator_input/15_merged_varpart_partition.pdf and calling dev.off().
## Check out the last two columns!
head(fData(vp$modified_expt))
## transcriptID geneID
## AAC1 YMR056C YMR056C
## AAC3 YBR085W YBR085W
## AAD10 YJR155W YJR155W
## AAD14 YNL331C YNL331C
## AAD15 YOL165C YOL165C
## AAD16 YFL057C YFL057C
## Description
## AAC1 Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; phosphorylated; Aac1p is a minor isoform while Pet9p is the major ADP/ATP translocator; relocalizes from mitochondrion to cytoplasm upon DNA replication stress [Source:SGD;Acc:S000004660]
## AAC3 Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; expressed under anaerobic conditions; similar to Aac1p; has roles in maintenance of viability and in respiration; AAC3 has a paralog, PET9, that arose from the whole genome duplication [Source:SGD;Acc:S000000289]
## AAD10 Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000003916]
## AAD14 Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000005275]
## AAD15 Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; AAD15 has a paralog, AAD3, that arose from a segmental duplication; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000005525]
## AAD16 Putative aryl alcohol dehydrogenase; similar to Phanerochaete chrysosporium aryl alcohol dehydrogenase; ORFs AAD6/YFL056C and AAD16/YFL057C are displaced from one another by -1 frameshift; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000001837]
## Type length chromosome strand start end tss_id
## AAC1 protein_coding 930 XIII -1 387315 388244 TSS5132
## AAC3 protein_coding 924 II 1 415983 416906 TSS1609
## AAD10 protein_coding 867 X 1 727405 728271 TSS5024
## AAD14 protein_coding 1131 XIV -1 16118 17248 TSS6941
## AAD15 protein_coding 432 XV -1 1647 2078 TSS108
## AAD16 protein_coding 459 VI -1 14305 14763 TSS2145
## location variance_batch variance_condition
## AAC1 XIII_387315_388244 0.60247 0.0000
## AAC3 II_415983_416906 0.01024 0.7903
## AAD10 X_727405_728271 0.44833 0.4727
## AAD14 XIV_16118_17248 0.18558 0.6522
## AAD15 XV_1647_2078 0.02458 0.8443
## AAD16 VI_14305_14763 0.06128 0.8080
## variance_Residuals
## AAC1 0.39753
## AAC3 0.19950
## AAD10 0.07901
## AAD14 0.16217
## AAD15 0.13115
## AAD16 0.13069
merged_sva <- sm(normalize_expt(all_expt, transform="log2",
filter=TRUE, batch="sva", low_to_zero=TRUE))
vpsva <- varpart(merged_sva, predictor=NULL, factors=c("condition", "batch"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.4 min
## Placing factor: condition at the beginning of the model.
pp(file="illustrator_input/16_merged_varpart_sva_partition.pdf", image=vpsva$partition_plot)
## Writing the image to: illustrator_input/16_merged_varpart_sva_partition.pdf and calling dev.off().
head(fData(vpsva$modified_expt))
## transcriptID geneID
## AAC1 YMR056C YMR056C
## AAC3 YBR085W YBR085W
## AAD10 YJR155W YJR155W
## AAD14 YNL331C YNL331C
## AAD15 YOL165C YOL165C
## AAD16 YFL057C YFL057C
## Description
## AAC1 Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; phosphorylated; Aac1p is a minor isoform while Pet9p is the major ADP/ATP translocator; relocalizes from mitochondrion to cytoplasm upon DNA replication stress [Source:SGD;Acc:S000004660]
## AAC3 Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; expressed under anaerobic conditions; similar to Aac1p; has roles in maintenance of viability and in respiration; AAC3 has a paralog, PET9, that arose from the whole genome duplication [Source:SGD;Acc:S000000289]
## AAD10 Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000003916]
## AAD14 Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000005275]
## AAD15 Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; AAD15 has a paralog, AAD3, that arose from a segmental duplication; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000005525]
## AAD16 Putative aryl alcohol dehydrogenase; similar to Phanerochaete chrysosporium aryl alcohol dehydrogenase; ORFs AAD6/YFL056C and AAD16/YFL057C are displaced from one another by -1 frameshift; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000001837]
## Type length chromosome strand start end tss_id
## AAC1 protein_coding 930 XIII -1 387315 388244 TSS5132
## AAC3 protein_coding 924 II 1 415983 416906 TSS1609
## AAD10 protein_coding 867 X 1 727405 728271 TSS5024
## AAD14 protein_coding 1131 XIV -1 16118 17248 TSS6941
## AAD15 protein_coding 432 XV -1 1647 2078 TSS108
## AAD16 protein_coding 459 VI -1 14305 14763 TSS2145
## location variance_condition variance_batch
## AAC1 XIII_387315_388244 0.2121 0
## AAC3 II_415983_416906 0.8471 0
## AAD10 X_727405_728271 0.9067 0
## AAD14 XIV_16118_17248 0.8387 0
## AAD15 XV_1647_2078 0.9147 0
## AAD16 VI_14305_14763 0.9024 0
## variance_Residuals
## AAC1 0.78786
## AAC3 0.15286
## AAD10 0.09335
## AAD14 0.16130
## AAD15 0.08532
## AAD16 0.09764
E1E2B2_norm <- sm(normalize_expt(E1E2B2_expt, transform="log2", convert="cpm", norm="quant", filter=TRUE))
E1E2B2_plots <- sm(graph_metrics(E1E2B2_norm))
E1E2B2_batchnorm <- sm(normalize_expt(E1E2B2_expt, transform="log2", convert="cpm", norm="quant", filter=TRUE,
batch="fsva"))
E1E2B2_batchplots <- sm(graph_metrics(E1E2B2_batchnorm))
## Before fsva, correlation heatmap without 'r'
pp(file="illustrator_input/17_E1E2B2_corheat.pdf", image=E1E2B2_plots$corheat)
## Writing the image to: illustrator_input/17_E1E2B2_corheat.pdf and calling dev.off().
## Before fsva, pca without 'r'
pp(file="illustrator_input/18_E1E2B2_pca.pdf", image=E1E2B2_plots$pcaplot)
## Writing the image to: illustrator_input/18_E1E2B2_pca.pdf and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## After fsva, correlation heatmap without 'r'
pp(file="illustrator_input/19_E1E2B2_fsva_corheat.pdf", image=E1E2B2_batchplots$corheat)
## Writing the image to: illustrator_input/19_E1E2B2_fsva_corheat.pdf and calling dev.off().
## After fsva, pca without 'r'
pp(file="illustrator_input/20_E1E2B2_fsva_pca.pdf", image=E1E2B2_batchplots$pcaplot)
## Writing the image to: illustrator_input/20_E1E2B2_fsva_pca.pdf and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## hmm I was going to silence this line, but looking at the report it seems to be missing some pictures.
all_written <- write_expt(all_expt, excel=paste0("excel/samples_written_all_expt-v", ver, ".xlsx"),
filter=TRUE, norm="raw", convert="cpm", batch="fsva", transform="log2")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor WT has 8 rows.
## The factor cbf5_D95A has 12 rows.
## The factor upf1d has 8 rows.
## The factor cbf5_D95A upf1d has 4 rows.
E1_written <- write_expt(all_expt, excel=paste0("excel/samples_written_E1_expt-v", ver, ".xlsx"),
filter=TRUE, norm="raw", convert="cpm", transform="log2")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor WT has 8 rows.
## The factor cbf5_D95A has 12 rows.
## The factor upf1d has 8 rows.
## The factor cbf5_D95A upf1d has 4 rows.
E1E2B2_written <- write_expt(E1E2B2_expt, excel=paste0("excel/samples_written_E1E2B2_expt-v", ver, ".xlsx"),
filter=TRUE, norm="raw", convert="cpm", batch="fsva", transform="log2")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Writing the normalized reads.
## Graphing the normalized reads.
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Writing the median reads by factor.
## The factor WT has 6 rows.
## The factor cbf5_D95A has 8 rows.
## The factor upf1d has 4 rows.
## The factor cbf5_D95A upf1d has 2 rows.