index.html preprocessing.html 01_annotation.html
Since I used kallisto for this work, I want to figure out tximport.
In 01_annotation.Rmd, I decided to open up a few lines of inquiry:
Start with the full data set. This will take a rather long time to perform all the graphs, and since we are not likely to use it, I will limit myself to just plotting 1-2 graphs.
This data set is unlikely to be very helpful for differential expression, but may be useful for diagnosing other concerns in the data.
k10_libsize <- plot_libsize(k10_raw)
k10_libsize$plot
k10_density <- sm(plot_density(k10_raw))
sm(k10_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k10_boxplot <- sm(plot_boxplot(k10_raw))
k10_boxplot
k10_corheat <- plot_corheat(k10_raw)
k10_tsne <- plot_tsne(k10_raw)
k10_tsne$plot
Most notably, it makes obvious some concerns with samples 1003, 1007, and 1008. We chose to remove 1003 and 1008, but will keep an eye on 1007.
The metrics of the data when we remove those two samples should generally be quite similar.
k8_libsize <- plot_libsize(k8_raw)
k8_libsize$plot
k8_density <- sm(plot_density(k8_raw))
sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k8_boxplot <- sm(plot_boxplot(k8_raw))
k8_boxplot
k8_corheat <- plot_corheat(k8_raw)
k8_tsne <- plot_tsne(k8_raw)
k8_tsne$plot
Here is the same density plot, but looking from 0 to 10. In it we can see that there is a tremendous number of ‘genes’ which have fractional counts. I am certain that this is important, but I am not sure what to do with it. I have a feeling that when we do a count-based filtering, this tremendous jump by sample 1009 will fall away…
sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(0,10)))
Can we ask the data to give us relatively reliable estimates of which genes/transcripts are in all 4 embryogenic samples and in none of the non-embryogenic samples?
Yesterday when I went to do this, my rpkm function failed because I neglected to include a sequence length column in the annotation data. That has been remedied, as a result these cutoffs probably should be adjust to something more appropriate to rpkm-scale data. Though I am not entirely certain where they should be adjusted, perhaps <1 and 5?
k8_raw_rpkm <- normalize_expt(k8_count, convert="rpkm", norm="quant")
## This function will replace the expt$expressionset slot with:
## rpkm(quant(data))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Filter is false, this should likely be set to something, good
## choices include cbcb, kofa, pofa (anything but FALSE). If you want this to
## stay FALSE, keep in mind that if other normalizations are performed, then the
## resulting libsizes are likely to be strange (potentially negative!)
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: normalizing the data with quant.
## Step 3: converting the data with rpkm.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
low_cutoff <- 3
high_cutoff <- 10
k8_embryo_low_nonembryo_high <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) <= low_cutoff &
rowMeans(exprs(k8_raw_rpkm)[, 5:8]) >= high_cutoff
checkme_first <- rownames(exprs(k8_raw_rpkm))[k8_embryo_low_nonembryo_high]
dim(exprs(k8_raw_rpkm)[checkme_first, ])
## [1] 228 8
low_embryo_high_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_first, ], fData(k8_raw_rpkm)[checkme_first, ], by="row.names")
write.csv(file="low_embryo_high_nonembryo.csv", low_embryo_high_nonembryo)
k8_embryo_high_nonembryo_low <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) >= high_cutoff & rowMeans(exprs(k8_raw_rpkm)[, 5:8]) <= low_cutoff
checkme_second <- rownames(exprs(k8_raw_rpkm))[k8_embryo_high_nonembryo_low]
dim(exprs(k8_raw_rpkm)[checkme_second, ])
## [1] 43 8
high_embryo_low_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_second, ], fData(k8_raw_rpkm)[checkme_second, ], by="row.names")
write.csv(file="high_embryo_low_nonembryo.csv", high_embryo_low_nonembryo)
When we get to the count filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.
For these data, per orm some extra analyses including a PC loading, PC vs actor analysis, and variance partition.
k10_count_raw_plots <- sm(graph_metrics(k10_count))
k10_count_norm <- sm(normalize_expt(k10_count, transform="log2", norm="quant", convert="cpm"))
k10_count_norm_plots <- sm(graph_metrics(k10_count_norm))
k10_count_normbatch <- sm(normalize_expt(k10_count, batch="svaseq", transform="log2", convert="cpm"))
k10_count_normbatch_plots <- sm(graph_metrics(k10_count_normbatch))
pca_info <- pca_information(k10_count, plot_pcas=TRUE, num_components=4,
expt_factors=c("condition", "batch", "rnaconcentration"))
## More shallow curves in these plots suggest more genes in this principle component.
pca_info$pca_plots
## $PC1_PC2
##
## $PC1_PC3
##
## $PC1_PC4
##
## $PC2_PC3
##
## $PC2_PC4
##
## $PC3_PC4
k10_varpart <- varpart(k10_count, predictor=NULL, factors=c("condition","batch"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 6 min
## Placing factor: condition at the beginning of the model.
k10_varpart$percent_plot
k10_varpart$partition_plot
write.csv(k10_varpart$sorted_df, file="excel/varpart_sorted.csv")
high_scores <- pca_highscores(k10_count, n=100)
## Top 100 Strongest genes vs. component 1.
high_scores$highest[, "Comp.1"]
## [1] "14.62:TRINITY_DN11326_c0_g3" "14.33:TRINITY_DN7305_c1_g7"
## [3] "13.14:TRINITY_DN20010_c1_g5" "12.84:TRINITY_DN16546_c0_g6"
## [5] "12.76:TRINITY_DN18956_c2_g1" "12.6:TRINITY_DN15106_c4_g3"
## [7] "12.54:TRINITY_DN10836_c0_g3" "12.36:TRINITY_DN11934_c0_g7"
## [9] "12.07:TRINITY_DN15734_c1_g2" "11.87:TRINITY_DN13204_c0_g1"
## [11] "11.84:TRINITY_DN15897_c1_g3" "11.63:TRINITY_DN7489_c0_g7"
## [13] "11.6:TRINITY_DN16838_c0_g1" "10.87:TRINITY_DN13516_c1_g1"
## [15] "10.81:TRINITY_DN14567_c0_g2" "10.77:TRINITY_DN11842_c4_g1"
## [17] "10.72:TRINITY_DN11322_c1_g5" "10.56:TRINITY_DN7489_c0_g3"
## [19] "10.17:TRINITY_DN8985_c1_g8" "10.09:TRINITY_DN14291_c0_g2"
## [21] "10.05:TRINITY_DN14132_c0_g1" "10.04:TRINITY_DN15333_c1_g5"
## [23] "10.03:TRINITY_DN14230_c2_g3" "9.791:TRINITY_DN8078_c1_g3"
## [25] "9.785:TRINITY_DN16231_c1_g6" "9.73:TRINITY_DN11126_c0_g1"
## [27] "9.698:TRINITY_DN14513_c0_g4" "9.692:TRINITY_DN17690_c0_g4"
## [29] "9.585:TRINITY_DN11628_c1_g1" "9.578:TRINITY_DN14199_c1_g4"
## [31] "9.572:TRINITY_DN8086_c3_g1" "9.507:TRINITY_DN11811_c1_g4"
## [33] "9.475:TRINITY_DN14545_c1_g1" "9.364:TRINITY_DN16511_c0_g8"
## [35] "9.325:TRINITY_DN18952_c1_g4" "9.29:TRINITY_DN16578_c3_g2"
## [37] "9.26:TRINITY_DN12894_c1_g4" "9.253:TRINITY_DN8744_c0_g1"
## [39] "9.224:TRINITY_DN13058_c0_g1" "9.186:TRINITY_DN11505_c0_g1"
## [41] "9.168:TRINITY_DN9459_c1_g3" "9.122:TRINITY_DN15734_c1_g4"
## [43] "9.07:TRINITY_DN7883_c1_g1" "9.055:TRINITY_DN16729_c0_g3"
## [45] "9.017:TRINITY_DN13759_c0_g3" "9.011:TRINITY_DN9824_c4_g2"
## [47] "8.97:TRINITY_DN14732_c0_g3" "8.837:TRINITY_DN6957_c2_g5"
## [49] "8.794:TRINITY_DN8411_c0_g5" "8.773:TRINITY_DN7877_c0_g1"
## [51] "8.749:TRINITY_DN11615_c1_g1" "8.64:TRINITY_DN16221_c1_g2"
## [53] "8.595:TRINITY_DN13936_c0_g2" "8.591:TRINITY_DN18874_c0_g4"
## [55] "8.574:TRINITY_DN13514_c0_g3" "8.476:TRINITY_DN9025_c2_g5"
## [57] "8.454:TRINITY_DN8979_c2_g6" "8.451:TRINITY_DN10143_c1_g6"
## [59] "8.419:TRINITY_DN10513_c0_g3" "8.417:TRINITY_DN12936_c0_g4"
## [61] "8.39:TRINITY_DN11923_c0_g1" "8.389:TRINITY_DN11803_c0_g1"
## [63] "8.354:TRINITY_DN15545_c0_g3" "8.343:TRINITY_DN11028_c0_g5"
## [65] "8.342:TRINITY_DN18336_c1_g2" "8.333:TRINITY_DN18472_c1_g3"
## [67] "8.324:TRINITY_DN16152_c0_g6" "8.217:TRINITY_DN9532_c1_g2"
## [69] "8.209:TRINITY_DN10485_c0_g2" "8.184:TRINITY_DN9528_c0_g4"
## [71] "8.167:TRINITY_DN17291_c1_g3" "8.151:TRINITY_DN18524_c2_g3"
## [73] "8.137:TRINITY_DN19276_c4_g2" "8.102:TRINITY_DN12962_c0_g2"
## [75] "8.008:TRINITY_DN14612_c1_g1" "8.001:TRINITY_DN17042_c1_g4"
## [77] "7.948:TRINITY_DN17472_c1_g2" "7.945:TRINITY_DN9025_c2_g1"
## [79] "7.917:TRINITY_DN7877_c0_g2" "7.889:TRINITY_DN491_c0_g1"
## [81] "7.879:TRINITY_DN16566_c0_g1" "7.842:TRINITY_DN12634_c0_g6"
## [83] "7.773:TRINITY_DN17712_c1_g3" "7.749:TRINITY_DN12908_c0_g3"
## [85] "7.745:TRINITY_DN9663_c5_g2" "7.717:TRINITY_DN20047_c3_g1"
## [87] "7.672:TRINITY_DN16384_c0_g2" "7.642:TRINITY_DN17629_c1_g9"
## [89] "7.597:TRINITY_DN14207_c0_g4" "7.587:TRINITY_DN14470_c0_g2"
## [91] "7.555:TRINITY_DN15573_c0_g3" "7.548:TRINITY_DN15055_c3_g4"
## [93] "7.54:TRINITY_DN9864_c0_g5" "7.515:TRINITY_DN17399_c1_g1"
## [95] "7.482:TRINITY_DN14477_c5_g5" "7.482:TRINITY_DN19416_c3_g2"
## [97] "7.469:TRINITY_DN14568_c1_g3" "7.452:TRINITY_DN7179_c3_g1"
## [99] "7.449:TRINITY_DN19759_c4_g2" "7.412:TRINITY_DN15133_c2_g8"
## Theoretically that list should agree well with the top of the variance partition table vs. Condition
## assuming condition fits component 1.
## In contrast, since component 2 seems to be the best for splitting the data, here is component 2.
high_scores$highest[, "Comp.2"]
## [1] "7.085:TRINITY_DN10230_c1_g1" "6.32:TRINITY_DN8292_c1_g4"
## [3] "5.798:TRINITY_DN9670_c0_g2" "5.794:TRINITY_DN13638_c1_g4"
## [5] "5.618:TRINITY_DN9498_c2_g8" "5.42:TRINITY_DN17921_c0_g4"
## [7] "5.384:TRINITY_DN10525_c1_g1" "5.383:TRINITY_DN11758_c0_g3"
## [9] "5.348:TRINITY_DN13148_c0_g5" "5.291:TRINITY_DN17201_c1_g1"
## [11] "5.209:TRINITY_DN9222_c1_g5" "5.194:TRINITY_DN12708_c2_g1"
## [13] "5.18:TRINITY_DN9635_c0_g1" "5.172:TRINITY_DN19904_c1_g1"
## [15] "5.059:TRINITY_DN12430_c0_g1" "4.91:TRINITY_DN9443_c2_g1"
## [17] "4.857:TRINITY_DN11538_c3_g1" "4.77:TRINITY_DN8827_c1_g4"
## [19] "4.754:TRINITY_DN16558_c1_g2" "4.753:TRINITY_DN11981_c0_g4"
## [21] "4.706:TRINITY_DN16501_c2_g7" "4.671:TRINITY_DN16423_c0_g4"
## [23] "4.658:TRINITY_DN15487_c0_g2" "4.632:TRINITY_DN11981_c0_g2"
## [25] "4.621:TRINITY_DN16252_c3_g3" "4.601:TRINITY_DN8627_c2_g6"
## [27] "4.6:TRINITY_DN9469_c1_g1" "4.597:TRINITY_DN17255_c2_g2"
## [29] "4.544:TRINITY_DN17280_c0_g2" "4.514:TRINITY_DN13390_c0_g4"
## [31] "4.458:TRINITY_DN9322_c0_g1" "4.432:TRINITY_DN9494_c1_g3"
## [33] "4.406:TRINITY_DN11981_c0_g3" "4.403:TRINITY_DN9748_c0_g7"
## [35] "4.402:TRINITY_DN14526_c1_g1" "4.401:TRINITY_DN13038_c0_g2"
## [37] "4.377:TRINITY_DN16497_c0_g1" "4.36:TRINITY_DN15335_c0_g1"
## [39] "4.356:TRINITY_DN8800_c0_g3" "4.349:TRINITY_DN8104_c2_g2"
## [41] "4.288:TRINITY_DN5674_c0_g1" "4.287:TRINITY_DN9640_c0_g6"
## [43] "4.287:TRINITY_DN15257_c2_g2" "4.284:TRINITY_DN9966_c0_g1"
## [45] "4.254:TRINITY_DN9249_c2_g4" "4.242:TRINITY_DN16876_c0_g2"
## [47] "4.241:TRINITY_DN17605_c1_g3" "4.23:TRINITY_DN13052_c1_g1"
## [49] "4.192:TRINITY_DN13390_c0_g3" "4.179:TRINITY_DN7384_c1_g1"
## [51] "4.177:TRINITY_DN7087_c1_g6" "4.165:TRINITY_DN18181_c0_g4"
## [53] "4.158:TRINITY_DN14567_c0_g6" "4.122:TRINITY_DN17663_c1_g3"
## [55] "4.115:TRINITY_DN15701_c0_g1" "4.102:TRINITY_DN11081_c0_g1"
## [57] "4.085:TRINITY_DN17711_c1_g2" "4.047:TRINITY_DN9929_c0_g1"
## [59] "4.04:TRINITY_DN17060_c0_g3" "4.032:TRINITY_DN13362_c0_g1"
## [61] "4.011:TRINITY_DN11420_c0_g6" "4.007:TRINITY_DN18620_c0_g1"
## [63] "3.987:TRINITY_DN18993_c0_g3" "3.977:TRINITY_DN11224_c0_g1"
## [65] "3.963:TRINITY_DN15631_c1_g3" "3.962:TRINITY_DN11448_c1_g5"
## [67] "3.962:TRINITY_DN7492_c0_g1" "3.943:TRINITY_DN6883_c4_g2"
## [69] "3.935:TRINITY_DN15372_c0_g3" "3.926:TRINITY_DN9397_c0_g1"
## [71] "3.907:TRINITY_DN10528_c1_g7" "3.893:TRINITY_DN12147_c2_g6"
## [73] "3.833:TRINITY_DN12844_c2_g3" "3.823:TRINITY_DN10093_c4_g4"
## [75] "3.818:TRINITY_DN15193_c1_g1" "3.81:TRINITY_DN14610_c5_g1"
## [77] "3.8:TRINITY_DN9913_c1_g1" "3.788:TRINITY_DN17295_c0_g3"
## [79] "3.788:TRINITY_DN9235_c6_g2" "3.778:TRINITY_DN9427_c0_g3"
## [81] "3.776:TRINITY_DN6793_c2_g3" "3.773:TRINITY_DN7677_c1_g5"
## [83] "3.772:TRINITY_DN19023_c2_g3" "3.756:TRINITY_DN12147_c2_g9"
## [85] "3.752:TRINITY_DN13706_c1_g4" "3.751:TRINITY_DN17297_c0_g3"
## [87] "3.729:TRINITY_DN7794_c3_g1" "3.725:TRINITY_DN11557_c0_g3"
## [89] "3.721:TRINITY_DN7013_c2_g1" "3.721:TRINITY_DN18815_c1_g2"
## [91] "3.717:TRINITY_DN8085_c1_g3" "3.705:TRINITY_DN19866_c0_g5"
## [93] "3.702:TRINITY_DN14614_c2_g3" "3.694:TRINITY_DN11328_c2_g5"
## [95] "3.69:TRINITY_DN19362_c0_g1" "3.675:TRINITY_DN17450_c1_g9"
## [97] "3.667:TRINITY_DN12227_c0_g5" "3.666:TRINITY_DN8097_c2_g1"
## [99] "3.665:TRINITY_DN12924_c0_g2" "3.661:TRINITY_DN10328_c1_g2"
Let us see how the k10 count data appears
k10_count_raw_plots$libsize
sm(k10_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,200)))
k10_count_raw_plots$boxplot
k10_count_norm_plots$corheat
k10_count_norm_plots$smc
k10_count_norm_plots$disheat
k10_count_norm_plots$smd
k10_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k10_count_normbatch_plots$pcaplot
sm(k10_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(0,10)))
It looks like I was mostly correct above, this makes it clearer that sample 1003 has to go, and since 1008 is in the same batch it is likely to get the axe still. Though judging from this plot, maybe 1009 is a candidate for removal too.
k8_count_raw_plots <- sm(graph_metrics(k8_count))
k8_count_norm <- sm(normalize_expt(k8_count, transform="log2", norm="quant", convert="cpm"))
k8_count_norm_plots <- sm(graph_metrics(k8_count_norm))
k8_count_normbatch <- sm(normalize_expt(k8_count, batch="svaseq", transform="log2", convert="cpm"))
k8_count_normbatch_plots <- sm(graph_metrics(k8_count_normbatch))
Let us see how the k8 count data appears
k8_count_raw_plots$libsize
sm(k8_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,200)))
k8_count_raw_plots$boxplot
k8_count_norm_plots$corheat
k8_count_norm_plots$smc
k8_count_norm_plots$disheat
k8_count_norm_plots$smd
k8_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k8_count_normbatch_plots$pcaplot
k8_count_normbatch_plots$tsneplot
When we get to the conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.
k10_conf_raw_plots <- sm(graph_metrics(k10_conf))
k10_conf_norm <- sm(normalize_expt(k10_conf, transform="log2", norm="quant", convert="cpm"))
k10_conf_norm_plots <- sm(graph_metrics(k10_conf_norm))
k10_conf_normbatch <- sm(normalize_expt(k10_conf, batch="svaseq", transform="log2",
convert="cpm", filter=TRUE))
k10_conf_normbatch_plots <- sm(graph_metrics(k10_conf_normbatch))
Let us see how the k10 conf data appears
k10_conf_raw_plots$libsize
sm(k10_conf_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k10_conf_raw_plots$boxplot
k10_conf_norm_plots$corheat
## This plot is a particularly nice example of why 1003/1008 are annoying.
k10_conf_norm_plots$smc
k10_conf_norm_plots$disheat
k10_conf_norm_plots$smd
k10_conf_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k10_conf_normbatch_plots$pcaplot
k8_conf_raw_plots <- sm(graph_metrics(k8_conf))
k8_conf_norm <- sm(normalize_expt(k8_conf, transform="log2", norm="quant", convert="cpm"))
k8_conf_norm_plots <- sm(graph_metrics(k8_conf_norm))
k8_conf_normbatch <- sm(normalize_expt(k8_conf, batch="svaseq", transform="log2",
convert="cpm", filter=TRUE))
k8_conf_normbatch_plots <- sm(graph_metrics(k8_conf_normbatch))
Let us see how the k8 conf data appears
k8_conf_raw_plots$libsize
sm(k8_conf_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k8_conf_raw_plots$boxplot
k8_conf_norm_plots$corheat
k8_conf_norm_plots$smc
k8_conf_norm_plots$disheat
k8_conf_norm_plots$smd
k8_conf_norm_plots$pcaplot
k8_conf_norm_plots$tsneplot
## See if the additional information in the 'batch' factor made a difference
k8_conf_normbatch_plots$pcaplot
k8_conf_normbatch_plots$tsneplot
When we get to the count_conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.
k10_conf_count_raw_plots <- sm(graph_metrics(k10_conf_count))
k10_conf_count_norm <- sm(normalize_expt(k10_conf_count, transform="log2",
norm="quant", convert="cpm"))
k10_conf_count_norm_plots <- sm(graph_metrics(k10_conf_count_norm))
k10_conf_count_normbatch <- sm(normalize_expt(k10_conf_count, batch="svaseq", transform="log2",
convert="cpm", filter=TRUE))
k10_conf_count_normbatch_plots <- sm(graph_metrics(k10_conf_count_normbatch))
Let us see how the k10 count_conf data appears
k10_conf_count_raw_plots$libsize
sm(k10_conf_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k10_conf_count_raw_plots$boxplot
k10_conf_count_norm_plots$corheat
k10_conf_count_norm_plots$smc
k10_conf_count_norm_plots$disheat
k10_conf_count_norm_plots$smd
k10_conf_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k10_conf_count_normbatch_plots$pcaplot
k10_conf_count_normbatch_plots$tsneplot
k8_conf_count_raw_plots <- sm(graph_metrics(k8_conf_count))
k8_conf_count_norm <- sm(normalize_expt(k8_conf_count, transform="log2",
norm="quant", convert="cpm"))
k8_conf_count_norm_plots <- sm(graph_metrics(k8_conf_count_norm))
k8_conf_count_normbatch <- sm(normalize_expt(k8_conf_count, batch="svaseq",
transform="log2", convert="cpm"))
k8_conf_count_normbatch_plots <- sm(graph_metrics(k8_conf_count_normbatch))
Let us see how the k8 count_conf data appears
k8_conf_count_raw_plots$libsize
sm(k8_conf_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k8_conf_count_raw_plots$boxplot
k8_conf_count_norm_plots$corheat
k8_conf_count_norm_plots$smc
k8_conf_count_norm_plots$disheat
k8_conf_count_norm_plots$smd
k8_conf_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k8_conf_count_normbatch_plots$pcaplot
k8_conf_count_normbatch_plots$tsneplot
Variance Partition is a neat tool to see what experimental factors have the most variance associated with them. It is slow, so I will use it on the smallest data set we have created (the confidence/count filtered).
k8_conf_count_varpart <- varpart(expt=k8_conf_count, predictor=NULL,
factors=c("condition", "batch"),
modify_expt=TRUE)
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.3 min
## Placing factor: condition at the beginning of the model.
k8_conf_count <- k8_conf_count_varpart[["modified_expt"]]
k8_conf_count_varpart$percent_plot
## These genes are the most likely to be interesting for the difference between non/embryogenic
k8_conf_count_varpart$partition_plot
## Another fun thing to use is pca_information()
testing <- pca_information(expt_data=k8_conf_count)
## More shallow curves in these plots suggest more genes in this principle component.
library(ggplot2)
maximum <- plot_libsize(k8_raw, text=FALSE)
maximum_alpha <- maximum$table
maximum_alpha$alpha <- alpha(maximum_alpha$colors, 0.4)
conf <- plot_libsize(k8_conf, text=FALSE)
conf_alpha <- conf$table
conf_alpha$alpha <- alpha(conf_alpha$colors, 0.6)
count <- plot_libsize(k8_count, text=FALSE)
count_alpha <- count$table
count_alpha$alpha <- alpha(count_alpha$colors, 0.8)
conf_count <- plot_libsize(k8_conf_count, text=FALSE)
conf_count_alpha <- conf_count$table
conf_count_alpha$alpha <- alpha(conf_count_alpha$colors, 1.0)
all <- rbind(maximum_alpha, conf_alpha,
count_alpha, conf_count_alpha)
all <- as.data.frame(all)
## The following will stack the library sizes from each subset one on top of the other.
## This is likely not desired, but it does provide sufficient room to print the library
## sizes if one wishes.
stacked <- ggplot(all, aes(x=id)) +
geom_bar(aes(weight=sum, alpha=alpha, fill=colors), color="black") +
scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
theme(axis.text=ggplot2::element_text(size=10, colour="black"),
axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
legend.position="none")
stacked
## In this instance, they are stacked on top of 0, which I suspect is the goal.
columns <- ggplot(all, aes(x=id, y=sum)) +
geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
theme(axis.text=ggplot2::element_text(size=10, colour="black"),
axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
legend.position="none")
columns
Let us assume for a moment that we will choose a couple of the above for looking at more closely. I suspect k8_count_conf, k10_count_conf, and k8_count and k8_conf are good candidates.
output <- paste0("excel/k8_conf_count_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
k8_conf_count_written <- sm(write_expt(k8_conf_count, excel=output, violin=TRUE))
}
output <- paste0("excel/k10_conf_count_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
k10_conf_count_written <- sm(write_expt(k10_conf_count, excel=output, violin=TRUE))
}
output <- paste0("excel/k8_count_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
k8_count_written <- sm(write_expt(k8_count, excel=output, violin=TRUE))
}
output <- paste0("excel/k8_conf_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
k8_conf_written <- sm(write_expt(k8_conf, excel=output, violin=TRUE))
}
I decided to not save the state of this file, because it is huge. The differential expression analyses do not need this, but need only the result of the annotation worksheet.
pander::pander(sessionInfo())
R version 3.4.3 (2017-11-30)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: hpgltools(v.2017.10) and ggplot2(v.2.2.1)
loaded via a namespace (and not attached): Biobase(v.2.38.0), edgeR(v.3.20.2), bit64(v.0.9-7), splines(v.3.4.3), foreach(v.1.4.4), gtools(v.3.5.0), stats4(v.3.4.3), pander(v.0.6.1), blob(v.1.1.0), yaml(v.2.1.16), ggrepel(v.0.7.0), RSQLite(v.2.0), backports(v.1.1.2), lattice(v.0.20-35), limma(v.3.34.4), quadprog(v.1.5-5), digest(v.0.6.13), RColorBrewer(v.1.1-2), minqa(v.1.2.4), colorspace(v.1.3-2), htmltools(v.0.3.6), preprocessCore(v.1.40.0), Matrix(v.1.2-12), plyr(v.1.8.4), XML(v.3.98-1.9), devtools(v.1.13.4), genefilter(v.1.60.0), xtable(v.1.8-2), corpcor(v.1.6.9), scales(v.0.5.0), gdata(v.2.18.0), Rtsne(v.0.13), lme4(v.1.1-15), BiocParallel(v.1.12.0), tibble(v.1.3.4), annotate(v.1.56.1), mgcv(v.1.8-22), IRanges(v.2.12.0), withr(v.2.1.1), BiocGenerics(v.0.24.0), lazyeval(v.0.2.1), pbkrtest(v.0.4-7), survival(v.2.41-3), magrittr(v.1.5), memoise(v.1.1.0), evaluate(v.0.10.1), gplots(v.3.0.1), doParallel(v.1.0.11), nlme(v.3.1-131), MASS(v.7.3-47), xml2(v.1.1.1), tools(v.3.4.3), directlabels(v.2017.03.31), data.table(v.1.10.4-3), matrixStats(v.0.52.2), stringr(v.1.2.0), S4Vectors(v.0.16.0), munsell(v.0.4.3), locfit(v.1.5-9.1), AnnotationDbi(v.1.40.0), colorRamps(v.2.3), compiler(v.3.4.3), caTools(v.1.17.1), rlang(v.0.1.6), nloptr(v.1.0.4), grid(v.3.4.3), RCurl(v.1.95-4.8), iterators(v.1.0.9), variancePartition(v.1.8.0), bitops(v.1.0-6), base64enc(v.0.1-3), labeling(v.0.3), rmarkdown(v.1.8), gtable(v.0.2.0), codetools(v.0.2-15), DBI(v.0.7), roxygen2(v.6.0.1), reshape2(v.1.4.3), R6(v.2.2.2), knitr(v.1.17), bit(v.1.1-12), commonmark(v.1.4), rprojroot(v.1.3-1), KernSmooth(v.2.23-15), stringi(v.1.1.6), parallel(v.3.4.3), sva(v.3.26.0) and Rcpp(v.0.12.14)
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset b1e1b19ec57ffc7b90f4f16b7b1c0cd71a321e04
## R> packrat::restore()
## This is hpgltools commit: Mon Jan 15 11:38:07 2018 -0500: b1e1b19ec57ffc7b90f4f16b7b1c0cd71a321e04
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_sample_estimation-v20171102.rda.xz
## tmp <- sm(saveme(filename=this_save))