In this document, I will attempt to visualize the relationships among the various samples in this experiment. In the process, I will examine the samples to (hopefully) ensure that they are consistent and that there is not a fatal batch effect or other weakness in the data.
To start, I will print some simple metrics of the entire data set, then split it into more tractable pieces and examine them more thoroughly.
With the above in mind, here are a couple global plots showing the state of the data.
When processing the data, I did 1 alignment of the reads from all samples against a database of miRNAs and then separately against a database of all transcripts. All the samples of library type ‘small’ are really only intended to be compared against the miRNA library while those of type ‘large’ are intended only to search against the transcripts. This is primarily because of the different ways the RNA samples were treated at the beginning of the library preparation. Despite this, I perform both types of alignments for all samples so that I may now look and see that the distributions are different. If everything worked as planned, then the large RNA libraries should have a very different distribution of reads in the miRNA databases and vice-versa. If this is not true, then something might be very wrong.
With that in mind, the following block is going to do some of the graphs and normalize the data. In the following blocks I will print some of the results and consider what they mean.
The following block will serve to generate all the likely subsets of the data.
Lets lay down a couple rules of thumb now:
## Graph the raw metrics of all samples mapped against the transcripts and miRNAs
##mm_mir <- expt_subset(mm_mir, subset="genotype!='apop'")
##mm_txr <- expt_subset(mm_txr, subset="genotype!='apop'")
mm_mi_metrics <- sm(graph_metrics(mm_mir))
mm_tx_metrics <- sm(graph_metrics(mm_txr))
mmmi_small_metrics <- sm(graph_metrics(mmmi_small))
mmtx_large_metrics <- sm(graph_metrics(mmtx_large))
First, do not forget to print a legend showing the colors used and what they mean:
mm_mi_metrics$legend$plot
## This should be the same for the mm_mi and mm_tx objects.
mm_mi_metrics$libsize
Hahah I am a doofus. For a moment I thought this was a total disaster. The first 8 libraries are small RNA libraries of which the first 4 are exosomes. The important thing to remember is that these are mapped against only the 1978 miRNA features in the mouse Ensembl genome, as such we see that the first 4 are highly enriched in miRNAs, which is excellent.
Therefore, as one would expect, the small exosome RNA libraries have fewer reads than the total cell small RNA libraries, which comprise the set from 5-8. I of course focused immediately upon the last 8 without thinking. These are all the polyA RNA samples, of which the miRNA features are only a small proportion and therefore we see them as <10% the depth of the small RNA samples. I would therefore expect something like the opposite for the transcriptome samples, which are next.
mm_tx_metrics$libsize
Ahh excellent. Once again, the exosome samples have fewer reads than the cells and have many fewer for the small RNA samples than transcript samples, which is good.
From here on, I think I will examine only the small RNA samples and large RNA samples as 2 separate groups.
mmmi_small_metrics$density
mmmi_small_metrics$boxplot
## The sample densities among the small RNA samples look reasonable to me.
mmtx_large_metrics$density
mmtx_large_metrics$boxplot
## This is more problematic from the perspective of the data distributions; but not suprising and
## encouraging from a biological perspective. The large RNA exosomes have many more genes with 0
## counts.
Considering the heterologous sources of these samples, I think these actually make sense, lets move on and see what else pops up.
tt <- sm(library(Biobase))
older_data <- sm(normalize_expt(mmmi_small, filter="cbcb", batch="svaseq", transform="log2", convert="cpm", norm="quant"))
tt <- read.csv("current_norm_counts.csv")
rownames(tt) <- tt$row.names_1
tt <- tt[, -1]
colnames(tt) <- colnames(exprs(older_data$expressionset))
tt <- as.matrix(tt)
exprs(older_data$expressionset) <- tt
plot_pca(older_data)$plot
newer_data <- sm(normalize_expt(mmmi_small, filter=TRUE))
newer_data <- sm(normalize_expt(newer_data, transform="log2"))
newer_data <- sm(normalize_expt(newer_data, batch="svaseq"))
plot_pca(newer_data)$plot
head(exprs(newer_data$expressionset))
## HPGL0673 HPGL0674 HPGL0675 HPGL0676 HPGL0677 HPGL0678 HPGL0679
## chr10_ENSMUSG00000065406 13.68362 14.356512 14.22133 13.05992 13.2824 11.747 12.579
## chr10_ENSMUSG00000065430 15.40259 14.460674 15.44751 14.88549 14.8038 13.664 13.783
## chr10_ENSMUSG00000065607 3.84900 3.304276 3.41267 4.52450 5.0937 5.153 5.573
## chr10_ENSMUSG00000072837 -0.03828 -0.004397 -0.02086 0.05697 0.1767 0.200 1.160
## chr10_ENSMUSG00000076282 3.90750 3.540230 4.19327 4.52894 3.3235 3.272 3.771
## chr10_ENSMUSG00000076377 3.24159 3.169939 1.63295 2.87267 0.5068 3.424 2.424
## HPGL0680 HPGL0523 HPGL0524 HPGL0525 HPGL0555 HPGL0556 HPGL0557
## chr10_ENSMUSG00000065406 12.609 14.0610 12.6006 11.5508 13.2688 14.90448 12.921425
## chr10_ENSMUSG00000065430 14.548 15.0392 14.3152 14.5733 14.9921 14.96946 14.933262
## chr10_ENSMUSG00000065607 4.770 3.7294 6.2637 6.4215 4.1248 6.26584 6.601161
## chr10_ENSMUSG00000072837 1.185 0.8603 -0.1125 -0.1865 -0.1375 0.07877 -0.006403
## chr10_ENSMUSG00000076282 3.215 3.4830 2.6182 4.3736 4.4009 1.91238 2.989641
## chr10_ENSMUSG00000076377 2.326 1.9453 0.7421 0.7456 2.2778 1.43283 1.833683
## HPGL0558 HPGL0559 HPGL0560
## chr10_ENSMUSG00000065406 10.8910 11.58183 12.2112
## chr10_ENSMUSG00000065430 15.0328 16.09648 16.5378
## chr10_ENSMUSG00000065607 6.2635 5.70732 4.3671
## chr10_ENSMUSG00000072837 0.9327 -0.04588 2.2239
## chr10_ENSMUSG00000076282 3.1016 2.82909 4.6733
## chr10_ENSMUSG00000076377 2.4565 -0.27303 0.2501
newer_data <- sm(normalize_expt(mmmi_small, filter=TRUE, transform="log2",
batch="svaseq", norm="quant"))
plot_pca(newer_data)$plot
head(exprs(newer_data$expressionset))
## HPGL0673 HPGL0674 HPGL0675 HPGL0676 HPGL0677 HPGL0678 HPGL0679
## chr10_ENSMUSG00000065406 14.3892 15.183 14.7172 14.6155 14.1600 13.0178 13.9433
## chr10_ENSMUSG00000065430 17.5851 15.481 16.8497 16.9851 15.6866 15.8460 15.0628
## chr10_ENSMUSG00000065607 5.5268 5.044 5.8579 5.3218 4.4953 4.1794 5.4221
## chr10_ENSMUSG00000072837 0.7435 1.150 0.9354 0.7889 0.5968 0.6424 0.9229
## chr10_ENSMUSG00000076282 3.6269 2.908 3.4527 3.5106 4.7548 4.0961 4.8982
## chr10_ENSMUSG00000076377 3.1827 3.205 1.8553 2.0704 0.8876 2.7625 2.1457
## HPGL0680 HPGL0523 HPGL0524 HPGL0525 HPGL0555 HPGL0556 HPGL0557
## chr10_ENSMUSG00000065406 13.264 14.428 13.4813 14.5400 14.506 14.8595 13.8445
## chr10_ENSMUSG00000065430 16.309 15.292 15.1815 17.4366 16.299 16.5851 16.4371
## chr10_ENSMUSG00000065607 3.974 3.994 7.2001 6.4671 4.748 5.8259 6.8290
## chr10_ENSMUSG00000072837 1.294 0.984 -0.2465 0.6387 1.506 0.2668 -0.7884
## chr10_ENSMUSG00000076282 4.278 4.376 2.8431 3.7180 3.353 4.5176 5.2178
## chr10_ENSMUSG00000076377 1.928 1.680 0.7907 1.9127 2.443 0.7774 2.0949
## HPGL0558 HPGL0559 HPGL0560
## chr10_ENSMUSG00000065406 14.1788 14.335 14.376
## chr10_ENSMUSG00000065430 15.9038 17.190 16.380
## chr10_ENSMUSG00000065607 7.9194 6.574 5.493
## chr10_ENSMUSG00000072837 -0.6324 -1.601 1.611
## chr10_ENSMUSG00000076282 2.6632 2.037 3.662
## chr10_ENSMUSG00000076377 3.9966 0.872 1.249
newer_data <- sm(normalize_expt(mmmi_small, filter=TRUE))
newer_data <- sm(normalize_expt(newer_data, batch="svaseq", low_to_zero=TRUE))
newer_data <- sm(normalize_expt(newer_data, transform="log2"))
plot_pca(newer_data)$plot
head(exprs(newer_data$expressionset))
## HPGL0673 HPGL0674 HPGL0675 HPGL0676 HPGL0677 HPGL0678 HPGL0679
## chr10_ENSMUSG00000065406 14.9122 14.97679 14.70103 16.4748 16.080 14.878 15.2253
## chr10_ENSMUSG00000065430 15.6170 14.74830 15.45856 17.9794 19.187 18.555 17.8190
## chr10_ENSMUSG00000065607 4.1440 3.63404 3.26691 6.1509 7.965 8.398 8.1909
## chr10_ENSMUSG00000072837 0.1169 0.04338 0.01136 0.2666 0.000 0.000 0.9651
## chr10_ENSMUSG00000076282 4.4091 3.69143 4.48934 5.8866 4.485 4.520 4.7684
## chr10_ENSMUSG00000076377 3.9625 3.39092 1.95152 3.2642 1.054 3.038 2.2624
## HPGL0680 HPGL0523 HPGL0524 HPGL0525 HPGL0555 HPGL0556 HPGL0557
## chr10_ENSMUSG00000065406 15.0695 12.616 8.0299 0.0000 13.177 15.704 12.893
## chr10_ENSMUSG00000065430 18.7755 10.237 0.0000 0.0000 13.856 16.658 14.626
## chr10_ENSMUSG00000065607 7.6941 3.095 5.0732 4.7389 0.000 7.413 6.290
## chr10_ENSMUSG00000072837 0.9356 1.050 0.1663 0.2612 0.000 0.000 0.000
## chr10_ENSMUSG00000076282 4.3184 2.637 0.0000 0.5565 4.156 3.609 3.561
## chr10_ENSMUSG00000076377 2.1567 2.543 0.0000 0.0000 2.358 2.004 2.145
## HPGL0558 HPGL0559 HPGL0560
## chr10_ENSMUSG00000065406 6.656 9.91391 11.5147
## chr10_ENSMUSG00000065430 13.427 15.11901 13.6559
## chr10_ENSMUSG00000065607 5.300 5.13317 0.1647
## chr10_ENSMUSG00000072837 1.022 0.05534 2.3040
## chr10_ENSMUSG00000076282 2.571 2.11567 3.9264
## chr10_ENSMUSG00000076377 2.699 0.00000 1.0219
old <- exprs(older_data$expressionset)
new <- exprs(newer_data$expressionset)
##mmmi_small_norm <- normalize_expt(mmmi_small, transform="log2", norm="quant", convert="cpm", batch="svaseq", filter=TRUE)
mmmi_small_norm <- sm(normalize_expt(mmmi_small, norm="quant", convert="cpm", batch="svaseq"))
mmmi_small_norm_metrics <- graph_metrics(mmmi_small_norm)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 21087 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 21087 zero count features.
## Printing a color to condition legend.
mmmi_small_norm_metrics$corheat
mmmi_small_norm_metrics$disheat
mmmi_small_norm_metrics$smc
mmmi_small_norm_metrics$pcaplot
mmmi_small_noapop <- subset_expt(mmmi_small, subset="genotype!='apop'")
mmmi_small_noapop_metrics <- graph_metrics(mmmi_small_noapop)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## The scale difference between the smallest and largest
## libraries is > 10. Assuming a log10 scale is better, set scale=FALSE if not.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 17300 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 17300 zero count features.
## Printing a color to condition legend.
mmmi_small_noapop_metrics$pcaplot
mmmi_small_sva <- sm(normalize_expt(mmmi_small, transform="log2",
norm="quant", convert="raw",
filter=TRUE, batch="fsva"))
plot_pca(mmmi_small_sva)$plot
## The exosome and cell samples split nicely, and the wt/mutants split nicely among them!
## holy asscrackers that is nice!
## newer_data <- normalize_expt(mmmi_small, filter=TRUE, transform="log2", batch="svaseq", norm="quant")
printed_expt <- write_expt(mmmi_small, violin=TRUE,
excel=paste0("excel/mmmi_state_withapop-", ver, ".xlsx"),
filter=TRUE, transform="log2", norm="quant", convert="raw", batch="fsva")
## Writing the legend.
## The sheet: legend is in legend.
## Writing the raw reads.
## Graphing the raw reads.
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.07 min
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## Writing the normalized reads.
## Graphing the normalized reads.
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.07 min
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## Writing the median reads by factor.
## The factor cell_mirna_apop has 3 rows.
## The factor cell_mirna_mut has 5 rows.
## The factor cell_mirna_wt has 2 rows.
## The factor exo_mirna_mut has 5 rows.
## The factor exo_mirna_wt has 2 rows.
It appears that the cellular mRNA samples are a little more problematic; shockingly to me at least, the problematic sample proved to be one of the cellular RNA samples and not an exosome sample! I almost can’t believe that. I think that, given this, I will try an sva/combat adjustment of the data and see what happens, if for nothing else, then to satisfy my curiosity.
tx_colors <- c("#AA0000", "#AA8888", "#0000AA", "#8888AA")
names(tx_colors) <- c("exo_polya_mut", "exo_polya_wt", "cell_polya_mut", "cell_polya_wt")
mmtx_new <- set_expt_colors(mmtx_large, colors=tx_colors)
##mmtx_new <- expt_subset(mmtx_new, subset="sampleid!='HPGL0688'")
mmtx_large_sva <- sm(normalize_expt(mmtx_new, transform="log2",
norm="raw", convert="cpm",
filter=TRUE, batch="fsva"))
plot_pca(mmtx_large_sva)$plot
printed_expt <- write_expt(mmtx_new, excel=paste0("excel/mmtx_state-", ver, ".xlsx"),
batch="fsva", norm="raw", convert="cpm", filter=TRUE, violin=TRUE)
## Writing the legend.
## The sheet: legend is in legend.
## Warning in max(nchar(as.character(data[[data_col]])), na.rm = TRUE): no non-missing
## arguments to max; returning -Inf
## Writing the raw reads.
## Graphing the raw reads.
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 3 min
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## Writing the normalized reads.
## Graphing the normalized reads.
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 3 min
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## Writing the median reads by factor.
## The factor cell_polya_mut has 3 rows.
## The factor cell_polya_wt has 2 rows.
## The factor exo_polya_mut has 4 rows.
## The factor exo_polya_wt has 2 rows.
## wow, ummm, ok, so it appears that sva sees a significant surrogate variable.
mmtx_new_combat <- sm(normalize_expt(mmtx_new, transform="log2",
norm="quant", convert="cpm",
filter=TRUE, batch="combat_scale"))
plot_pca(mmtx_new_combat)$plot
## But according to this plot, the surrogate varaible is at least not entirely 'batch'
## oh wait, 98.62% of the remaining variance is now on the x-axis!
It appears that there is a significant but not crippling batch effect notable primarily in the large RNA libraries. I suspect that this will be sufficiently ameliorated by just adding batch into the experimental model.