index.html preprocessing.html 01_annotation.html

1 Sample Estimation version: 20171102

Since I used kallisto for this work, I want to figure out tximport.

In 01_annotation.Rmd, I decided to open up a few lines of inquiry:

  • k10_raw: All samples using the full set of putative annotations.
  • k8_raw: Samples excluding hpgl1003 with the full set of annotations.
  • k10_count: All samples, removing those with < ‘threshold (5)’ counts/gene.
  • k8_count: 8 samples, ibid.
  • k10_conf: All samples, but removing transcripts with annotations less confident than 1e-10 blastx e-value and less identity than 60 percent to some other known gene.
  • k8_conf: 8 samples, ibid.
  • k10_count_conf: A combination of the count filter and confidence filter.
  • k8_count_conf: ibid, but with the 8 samples.

2 Examine these data sets

2.1 Raw data

Start with the full data set. This will take a rather long time to perform all the graphs, and since we are not likely to use it, I will limit myself to just plotting 1-2 graphs.

2.1.1 k10 raw

This data set is unlikely to be very helpful for differential expression, but may be useful for diagnosing other concerns in the data.

k10_libsize <- plot_libsize(k10_raw)
k10_libsize$plot

k10_density <- sm(plot_density(k10_raw))
sm(k10_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))

k10_boxplot <- sm(plot_boxplot(k10_raw))
k10_boxplot

k10_corheat <- plot_corheat(k10_raw)

k10_tsne <- plot_tsne(k10_raw)
k10_tsne$plot

Most notably, it makes obvious some concerns with samples 1003, 1007, and 1008. We chose to remove 1003 and 1008, but will keep an eye on 1007.

2.1.2 k8 raw

The metrics of the data when we remove those two samples should generally be quite similar.

k8_libsize <- plot_libsize(k8_raw)
k8_libsize$plot

k8_density <- sm(plot_density(k8_raw))
sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))

k8_boxplot <- sm(plot_boxplot(k8_raw))
k8_boxplot

k8_corheat <- plot_corheat(k8_raw)

k8_tsne <- plot_tsne(k8_raw)
k8_tsne$plot

2.1.2.1 A peculiarity of kallisto

Here is the same density plot, but looking from 0 to 10. In it we can see that there is a tremendous number of ‘genes’ which have fractional counts. I am certain that this is important, but I am not sure what to do with it. I have a feeling that when we do a count-based filtering, this tremendous jump by sample 1009 will fall away…

sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(0,10)))

2.2 Quick side question

Can we ask the data to give us relatively reliable estimates of which genes/transcripts are in all 4 embryogenic samples and in none of the non-embryogenic samples?

Yesterday when I went to do this, my rpkm function failed because I neglected to include a sequence length column in the annotation data. That has been remedied, as a result these cutoffs probably should be adjust to something more appropriate to rpkm-scale data. Though I am not entirely certain where they should be adjusted, perhaps <1 and 5?

k8_raw_rpkm <- normalize_expt(k8_count, convert="rpkm", norm="quant")
## This function will replace the expt$expressionset slot with:
## rpkm(quant(data))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Filter is false, this should likely be set to something, good
##  choices include cbcb, kofa, pofa (anything but FALSE).  If you want this to
##  stay FALSE, keep in mind that if other normalizations are performed, then the
##  resulting libsizes are likely to be strange (potentially negative!)
## Leaving the data in its current base format, keep in mind that
##  some metrics are easier to see when the data is log2 transformed, but
##  EdgeR/DESeq do not accept transformed data.
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: normalizing the data with quant.
## Step 3: converting the data with rpkm.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
low_cutoff <- 3
high_cutoff <- 10
k8_embryo_low_nonembryo_high <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) <= low_cutoff &
  rowMeans(exprs(k8_raw_rpkm)[, 5:8]) >= high_cutoff
checkme_first <- rownames(exprs(k8_raw_rpkm))[k8_embryo_low_nonembryo_high]
dim(exprs(k8_raw_rpkm)[checkme_first, ])
## [1] 228   8
low_embryo_high_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_first, ], fData(k8_raw_rpkm)[checkme_first, ], by="row.names")
write.csv(file="low_embryo_high_nonembryo.csv", low_embryo_high_nonembryo)

k8_embryo_high_nonembryo_low <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) >= high_cutoff & rowMeans(exprs(k8_raw_rpkm)[, 5:8]) <= low_cutoff
checkme_second <- rownames(exprs(k8_raw_rpkm))[k8_embryo_high_nonembryo_low]
dim(exprs(k8_raw_rpkm)[checkme_second, ])
## [1] 43  8
high_embryo_low_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_second, ], fData(k8_raw_rpkm)[checkme_second, ], by="row.names")
write.csv(file="high_embryo_low_nonembryo.csv", high_embryo_low_nonembryo)

2.3 Count filtered data

When we get to the count filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.

2.3.1 k10 count filtering

For these data, per orm some extra analyses including a PC loading, PC vs actor analysis, and variance partition.

k10_count_raw_plots <- sm(graph_metrics(k10_count))
k10_count_norm <- sm(normalize_expt(k10_count, transform="log2", norm="quant", convert="cpm"))
k10_count_norm_plots <- sm(graph_metrics(k10_count_norm))
k10_count_normbatch <- sm(normalize_expt(k10_count, batch="svaseq", transform="log2", convert="cpm"))
k10_count_normbatch_plots <- sm(graph_metrics(k10_count_normbatch))
pca_info <- pca_information(k10_count, plot_pcas=TRUE, num_components=4,
                            expt_factors=c("condition", "batch", "rnaconcentration"))
## More shallow curves in these plots suggest more genes in this principle component.

pca_info$pca_plots
## $PC1_PC2

## 
## $PC1_PC3

## 
## $PC1_PC4

## 
## $PC2_PC3

## 
## $PC2_PC4

## 
## $PC3_PC4

k10_varpart <- varpart(k10_count, predictor=NULL, factors=c("condition","batch"))
## Attempting mixed linear model with: ~  (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 6 min
## Placing factor: condition at the beginning of the model.
k10_varpart$percent_plot

k10_varpart$partition_plot

write.csv(k10_varpart$sorted_df, file="excel/varpart_sorted.csv")
high_scores <- pca_highscores(k10_count, n=100)

## Top 100 Strongest genes vs. component 1.
high_scores$highest[, "Comp.1"]
##   [1] "14.62:TRINITY_DN11326_c0_g3" "14.33:TRINITY_DN7305_c1_g7" 
##   [3] "13.14:TRINITY_DN20010_c1_g5" "12.84:TRINITY_DN16546_c0_g6"
##   [5] "12.76:TRINITY_DN18956_c2_g1" "12.6:TRINITY_DN15106_c4_g3" 
##   [7] "12.54:TRINITY_DN10836_c0_g3" "12.36:TRINITY_DN11934_c0_g7"
##   [9] "12.07:TRINITY_DN15734_c1_g2" "11.87:TRINITY_DN13204_c0_g1"
##  [11] "11.84:TRINITY_DN15897_c1_g3" "11.63:TRINITY_DN7489_c0_g7" 
##  [13] "11.6:TRINITY_DN16838_c0_g1"  "10.87:TRINITY_DN13516_c1_g1"
##  [15] "10.81:TRINITY_DN14567_c0_g2" "10.77:TRINITY_DN11842_c4_g1"
##  [17] "10.72:TRINITY_DN11322_c1_g5" "10.56:TRINITY_DN7489_c0_g3" 
##  [19] "10.17:TRINITY_DN8985_c1_g8"  "10.09:TRINITY_DN14291_c0_g2"
##  [21] "10.05:TRINITY_DN14132_c0_g1" "10.04:TRINITY_DN15333_c1_g5"
##  [23] "10.03:TRINITY_DN14230_c2_g3" "9.791:TRINITY_DN8078_c1_g3" 
##  [25] "9.785:TRINITY_DN16231_c1_g6" "9.73:TRINITY_DN11126_c0_g1" 
##  [27] "9.698:TRINITY_DN14513_c0_g4" "9.692:TRINITY_DN17690_c0_g4"
##  [29] "9.585:TRINITY_DN11628_c1_g1" "9.578:TRINITY_DN14199_c1_g4"
##  [31] "9.572:TRINITY_DN8086_c3_g1"  "9.507:TRINITY_DN11811_c1_g4"
##  [33] "9.475:TRINITY_DN14545_c1_g1" "9.364:TRINITY_DN16511_c0_g8"
##  [35] "9.325:TRINITY_DN18952_c1_g4" "9.29:TRINITY_DN16578_c3_g2" 
##  [37] "9.26:TRINITY_DN12894_c1_g4"  "9.253:TRINITY_DN8744_c0_g1" 
##  [39] "9.224:TRINITY_DN13058_c0_g1" "9.186:TRINITY_DN11505_c0_g1"
##  [41] "9.168:TRINITY_DN9459_c1_g3"  "9.122:TRINITY_DN15734_c1_g4"
##  [43] "9.07:TRINITY_DN7883_c1_g1"   "9.055:TRINITY_DN16729_c0_g3"
##  [45] "9.017:TRINITY_DN13759_c0_g3" "9.011:TRINITY_DN9824_c4_g2" 
##  [47] "8.97:TRINITY_DN14732_c0_g3"  "8.837:TRINITY_DN6957_c2_g5" 
##  [49] "8.794:TRINITY_DN8411_c0_g5"  "8.773:TRINITY_DN7877_c0_g1" 
##  [51] "8.749:TRINITY_DN11615_c1_g1" "8.64:TRINITY_DN16221_c1_g2" 
##  [53] "8.595:TRINITY_DN13936_c0_g2" "8.591:TRINITY_DN18874_c0_g4"
##  [55] "8.574:TRINITY_DN13514_c0_g3" "8.476:TRINITY_DN9025_c2_g5" 
##  [57] "8.454:TRINITY_DN8979_c2_g6"  "8.451:TRINITY_DN10143_c1_g6"
##  [59] "8.419:TRINITY_DN10513_c0_g3" "8.417:TRINITY_DN12936_c0_g4"
##  [61] "8.39:TRINITY_DN11923_c0_g1"  "8.389:TRINITY_DN11803_c0_g1"
##  [63] "8.354:TRINITY_DN15545_c0_g3" "8.343:TRINITY_DN11028_c0_g5"
##  [65] "8.342:TRINITY_DN18336_c1_g2" "8.333:TRINITY_DN18472_c1_g3"
##  [67] "8.324:TRINITY_DN16152_c0_g6" "8.217:TRINITY_DN9532_c1_g2" 
##  [69] "8.209:TRINITY_DN10485_c0_g2" "8.184:TRINITY_DN9528_c0_g4" 
##  [71] "8.167:TRINITY_DN17291_c1_g3" "8.151:TRINITY_DN18524_c2_g3"
##  [73] "8.137:TRINITY_DN19276_c4_g2" "8.102:TRINITY_DN12962_c0_g2"
##  [75] "8.008:TRINITY_DN14612_c1_g1" "8.001:TRINITY_DN17042_c1_g4"
##  [77] "7.948:TRINITY_DN17472_c1_g2" "7.945:TRINITY_DN9025_c2_g1" 
##  [79] "7.917:TRINITY_DN7877_c0_g2"  "7.889:TRINITY_DN491_c0_g1"  
##  [81] "7.879:TRINITY_DN16566_c0_g1" "7.842:TRINITY_DN12634_c0_g6"
##  [83] "7.773:TRINITY_DN17712_c1_g3" "7.749:TRINITY_DN12908_c0_g3"
##  [85] "7.745:TRINITY_DN9663_c5_g2"  "7.717:TRINITY_DN20047_c3_g1"
##  [87] "7.672:TRINITY_DN16384_c0_g2" "7.642:TRINITY_DN17629_c1_g9"
##  [89] "7.597:TRINITY_DN14207_c0_g4" "7.587:TRINITY_DN14470_c0_g2"
##  [91] "7.555:TRINITY_DN15573_c0_g3" "7.548:TRINITY_DN15055_c3_g4"
##  [93] "7.54:TRINITY_DN9864_c0_g5"   "7.515:TRINITY_DN17399_c1_g1"
##  [95] "7.482:TRINITY_DN14477_c5_g5" "7.482:TRINITY_DN19416_c3_g2"
##  [97] "7.469:TRINITY_DN14568_c1_g3" "7.452:TRINITY_DN7179_c3_g1" 
##  [99] "7.449:TRINITY_DN19759_c4_g2" "7.412:TRINITY_DN15133_c2_g8"
## Theoretically that list should agree well with the top of the variance partition table vs. Condition
## assuming condition fits component 1.

## In contrast, since component 2 seems to be the best for splitting the data, here is component 2.
high_scores$highest[, "Comp.2"]
##   [1] "7.085:TRINITY_DN10230_c1_g1" "6.32:TRINITY_DN8292_c1_g4"  
##   [3] "5.798:TRINITY_DN9670_c0_g2"  "5.794:TRINITY_DN13638_c1_g4"
##   [5] "5.618:TRINITY_DN9498_c2_g8"  "5.42:TRINITY_DN17921_c0_g4" 
##   [7] "5.384:TRINITY_DN10525_c1_g1" "5.383:TRINITY_DN11758_c0_g3"
##   [9] "5.348:TRINITY_DN13148_c0_g5" "5.291:TRINITY_DN17201_c1_g1"
##  [11] "5.209:TRINITY_DN9222_c1_g5"  "5.194:TRINITY_DN12708_c2_g1"
##  [13] "5.18:TRINITY_DN9635_c0_g1"   "5.172:TRINITY_DN19904_c1_g1"
##  [15] "5.059:TRINITY_DN12430_c0_g1" "4.91:TRINITY_DN9443_c2_g1"  
##  [17] "4.857:TRINITY_DN11538_c3_g1" "4.77:TRINITY_DN8827_c1_g4"  
##  [19] "4.754:TRINITY_DN16558_c1_g2" "4.753:TRINITY_DN11981_c0_g4"
##  [21] "4.706:TRINITY_DN16501_c2_g7" "4.671:TRINITY_DN16423_c0_g4"
##  [23] "4.658:TRINITY_DN15487_c0_g2" "4.632:TRINITY_DN11981_c0_g2"
##  [25] "4.621:TRINITY_DN16252_c3_g3" "4.601:TRINITY_DN8627_c2_g6" 
##  [27] "4.6:TRINITY_DN9469_c1_g1"    "4.597:TRINITY_DN17255_c2_g2"
##  [29] "4.544:TRINITY_DN17280_c0_g2" "4.514:TRINITY_DN13390_c0_g4"
##  [31] "4.458:TRINITY_DN9322_c0_g1"  "4.432:TRINITY_DN9494_c1_g3" 
##  [33] "4.406:TRINITY_DN11981_c0_g3" "4.403:TRINITY_DN9748_c0_g7" 
##  [35] "4.402:TRINITY_DN14526_c1_g1" "4.401:TRINITY_DN13038_c0_g2"
##  [37] "4.377:TRINITY_DN16497_c0_g1" "4.36:TRINITY_DN15335_c0_g1" 
##  [39] "4.356:TRINITY_DN8800_c0_g3"  "4.349:TRINITY_DN8104_c2_g2" 
##  [41] "4.288:TRINITY_DN5674_c0_g1"  "4.287:TRINITY_DN9640_c0_g6" 
##  [43] "4.287:TRINITY_DN15257_c2_g2" "4.284:TRINITY_DN9966_c0_g1" 
##  [45] "4.254:TRINITY_DN9249_c2_g4"  "4.242:TRINITY_DN16876_c0_g2"
##  [47] "4.241:TRINITY_DN17605_c1_g3" "4.23:TRINITY_DN13052_c1_g1" 
##  [49] "4.192:TRINITY_DN13390_c0_g3" "4.179:TRINITY_DN7384_c1_g1" 
##  [51] "4.177:TRINITY_DN7087_c1_g6"  "4.165:TRINITY_DN18181_c0_g4"
##  [53] "4.158:TRINITY_DN14567_c0_g6" "4.122:TRINITY_DN17663_c1_g3"
##  [55] "4.115:TRINITY_DN15701_c0_g1" "4.102:TRINITY_DN11081_c0_g1"
##  [57] "4.085:TRINITY_DN17711_c1_g2" "4.047:TRINITY_DN9929_c0_g1" 
##  [59] "4.04:TRINITY_DN17060_c0_g3"  "4.032:TRINITY_DN13362_c0_g1"
##  [61] "4.011:TRINITY_DN11420_c0_g6" "4.007:TRINITY_DN18620_c0_g1"
##  [63] "3.987:TRINITY_DN18993_c0_g3" "3.977:TRINITY_DN11224_c0_g1"
##  [65] "3.963:TRINITY_DN15631_c1_g3" "3.962:TRINITY_DN11448_c1_g5"
##  [67] "3.962:TRINITY_DN7492_c0_g1"  "3.943:TRINITY_DN6883_c4_g2" 
##  [69] "3.935:TRINITY_DN15372_c0_g3" "3.926:TRINITY_DN9397_c0_g1" 
##  [71] "3.907:TRINITY_DN10528_c1_g7" "3.893:TRINITY_DN12147_c2_g6"
##  [73] "3.833:TRINITY_DN12844_c2_g3" "3.823:TRINITY_DN10093_c4_g4"
##  [75] "3.818:TRINITY_DN15193_c1_g1" "3.81:TRINITY_DN14610_c5_g1" 
##  [77] "3.8:TRINITY_DN9913_c1_g1"    "3.788:TRINITY_DN17295_c0_g3"
##  [79] "3.788:TRINITY_DN9235_c6_g2"  "3.778:TRINITY_DN9427_c0_g3" 
##  [81] "3.776:TRINITY_DN6793_c2_g3"  "3.773:TRINITY_DN7677_c1_g5" 
##  [83] "3.772:TRINITY_DN19023_c2_g3" "3.756:TRINITY_DN12147_c2_g9"
##  [85] "3.752:TRINITY_DN13706_c1_g4" "3.751:TRINITY_DN17297_c0_g3"
##  [87] "3.729:TRINITY_DN7794_c3_g1"  "3.725:TRINITY_DN11557_c0_g3"
##  [89] "3.721:TRINITY_DN7013_c2_g1"  "3.721:TRINITY_DN18815_c1_g2"
##  [91] "3.717:TRINITY_DN8085_c1_g3"  "3.705:TRINITY_DN19866_c0_g5"
##  [93] "3.702:TRINITY_DN14614_c2_g3" "3.694:TRINITY_DN11328_c2_g5"
##  [95] "3.69:TRINITY_DN19362_c0_g1"  "3.675:TRINITY_DN17450_c1_g9"
##  [97] "3.667:TRINITY_DN12227_c0_g5" "3.666:TRINITY_DN8097_c2_g1" 
##  [99] "3.665:TRINITY_DN12924_c0_g2" "3.661:TRINITY_DN10328_c1_g2"

2.3.1.1 Look at some plots

Let us see how the k10 count data appears

k10_count_raw_plots$libsize

sm(k10_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,200)))

k10_count_raw_plots$boxplot

k10_count_norm_plots$corheat

k10_count_norm_plots$smc

k10_count_norm_plots$disheat

k10_count_norm_plots$smd

k10_count_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k10_count_normbatch_plots$pcaplot

2.3.1.2 Odd densities again

sm(k10_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(0,10)))

It looks like I was mostly correct above, this makes it clearer that sample 1003 has to go, and since 1008 is in the same batch it is likely to get the axe still. Though judging from this plot, maybe 1009 is a candidate for removal too.

2.3.2 k8 count filtering

k8_count_raw_plots <- sm(graph_metrics(k8_count))
k8_count_norm <- sm(normalize_expt(k8_count, transform="log2", norm="quant", convert="cpm"))
k8_count_norm_plots <- sm(graph_metrics(k8_count_norm))
k8_count_normbatch <- sm(normalize_expt(k8_count, batch="svaseq", transform="log2", convert="cpm"))
k8_count_normbatch_plots <- sm(graph_metrics(k8_count_normbatch))

2.3.2.1 Look at some plots

Let us see how the k8 count data appears

k8_count_raw_plots$libsize

sm(k8_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,200)))

k8_count_raw_plots$boxplot

k8_count_norm_plots$corheat

k8_count_norm_plots$smc

k8_count_norm_plots$disheat

k8_count_norm_plots$smd

k8_count_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k8_count_normbatch_plots$pcaplot

k8_count_normbatch_plots$tsneplot

2.4 Conf filtered data

When we get to the conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.

2.4.1 k10 confidence filtering

k10_conf_raw_plots <- sm(graph_metrics(k10_conf))
k10_conf_norm <- sm(normalize_expt(k10_conf, transform="log2", norm="quant", convert="cpm"))
k10_conf_norm_plots <- sm(graph_metrics(k10_conf_norm))
k10_conf_normbatch <- sm(normalize_expt(k10_conf, batch="svaseq", transform="log2",
                                        convert="cpm", filter=TRUE))
k10_conf_normbatch_plots <- sm(graph_metrics(k10_conf_normbatch))

2.4.1.1 Look at some plots

Let us see how the k10 conf data appears

k10_conf_raw_plots$libsize

sm(k10_conf_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))

k10_conf_raw_plots$boxplot

k10_conf_norm_plots$corheat

## This plot is a particularly nice example of why 1003/1008 are annoying.
k10_conf_norm_plots$smc

k10_conf_norm_plots$disheat

k10_conf_norm_plots$smd

k10_conf_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k10_conf_normbatch_plots$pcaplot

2.4.2 k8 confidence filtering

k8_conf_raw_plots <- sm(graph_metrics(k8_conf))
k8_conf_norm <- sm(normalize_expt(k8_conf, transform="log2", norm="quant", convert="cpm"))
k8_conf_norm_plots <- sm(graph_metrics(k8_conf_norm))
k8_conf_normbatch <- sm(normalize_expt(k8_conf, batch="svaseq", transform="log2",
                                       convert="cpm", filter=TRUE))
k8_conf_normbatch_plots <- sm(graph_metrics(k8_conf_normbatch))

2.4.2.1 Look at some plots

Let us see how the k8 conf data appears

k8_conf_raw_plots$libsize

sm(k8_conf_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))

k8_conf_raw_plots$boxplot

k8_conf_norm_plots$corheat

k8_conf_norm_plots$smc

k8_conf_norm_plots$disheat

k8_conf_norm_plots$smd

k8_conf_norm_plots$pcaplot

k8_conf_norm_plots$tsneplot

## See if the additional information in the 'batch' factor made a difference
k8_conf_normbatch_plots$pcaplot

k8_conf_normbatch_plots$tsneplot

2.5 Count and Conf filtered data

When we get to the count_conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.

2.5.1 k10 both filters

k10_conf_count_raw_plots <- sm(graph_metrics(k10_conf_count))
k10_conf_count_norm <- sm(normalize_expt(k10_conf_count, transform="log2",
                                         norm="quant", convert="cpm"))
k10_conf_count_norm_plots <- sm(graph_metrics(k10_conf_count_norm))
k10_conf_count_normbatch <- sm(normalize_expt(k10_conf_count, batch="svaseq", transform="log2",
                                              convert="cpm", filter=TRUE))
k10_conf_count_normbatch_plots <- sm(graph_metrics(k10_conf_count_normbatch))

2.5.1.1 Look at some plots

Let us see how the k10 count_conf data appears

k10_conf_count_raw_plots$libsize

sm(k10_conf_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))

k10_conf_count_raw_plots$boxplot

k10_conf_count_norm_plots$corheat

k10_conf_count_norm_plots$smc

k10_conf_count_norm_plots$disheat

k10_conf_count_norm_plots$smd

k10_conf_count_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k10_conf_count_normbatch_plots$pcaplot

k10_conf_count_normbatch_plots$tsneplot

2.5.2 k8 both filters

k8_conf_count_raw_plots <- sm(graph_metrics(k8_conf_count))
k8_conf_count_norm <- sm(normalize_expt(k8_conf_count, transform="log2",
                                        norm="quant", convert="cpm"))
k8_conf_count_norm_plots <- sm(graph_metrics(k8_conf_count_norm))
k8_conf_count_normbatch <- sm(normalize_expt(k8_conf_count, batch="svaseq",
                                             transform="log2", convert="cpm"))
k8_conf_count_normbatch_plots <- sm(graph_metrics(k8_conf_count_normbatch))

2.5.2.1 Look at some plots

Let us see how the k8 count_conf data appears

k8_conf_count_raw_plots$libsize

sm(k8_conf_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))

k8_conf_count_raw_plots$boxplot

k8_conf_count_norm_plots$corheat

k8_conf_count_norm_plots$smc

k8_conf_count_norm_plots$disheat

k8_conf_count_norm_plots$smd

k8_conf_count_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k8_conf_count_normbatch_plots$pcaplot

k8_conf_count_normbatch_plots$tsneplot

2.5.3 Both filters, variance partition

Variance Partition is a neat tool to see what experimental factors have the most variance associated with them. It is slow, so I will use it on the smallest data set we have created (the confidence/count filtered).

k8_conf_count_varpart <- varpart(expt=k8_conf_count, predictor=NULL,
                                 factors=c("condition", "batch"),
                                 modify_expt=TRUE)
## Attempting mixed linear model with: ~  (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.3 min
## Placing factor: condition at the beginning of the model.
k8_conf_count <- k8_conf_count_varpart[["modified_expt"]]
k8_conf_count_varpart$percent_plot

## These genes are the most likely to be interesting for the difference between non/embryogenic
k8_conf_count_varpart$partition_plot

## Another fun thing to use is pca_information()
testing <- pca_information(expt_data=k8_conf_count)
## More shallow curves in these plots suggest more genes in this principle component.

3 Make a nice stacked libsize barplot

library(ggplot2)
maximum <- plot_libsize(k8_raw, text=FALSE)
maximum_alpha <- maximum$table
maximum_alpha$alpha <- alpha(maximum_alpha$colors, 0.4)
conf <- plot_libsize(k8_conf, text=FALSE)
conf_alpha <- conf$table
conf_alpha$alpha <- alpha(conf_alpha$colors, 0.6)
count <- plot_libsize(k8_count, text=FALSE)
count_alpha <- count$table
count_alpha$alpha <- alpha(count_alpha$colors, 0.8)
conf_count <- plot_libsize(k8_conf_count, text=FALSE)
conf_count_alpha <- conf_count$table
conf_count_alpha$alpha <- alpha(conf_count_alpha$colors, 1.0)

all <- rbind(maximum_alpha, conf_alpha,
             count_alpha, conf_count_alpha)
all <- as.data.frame(all)

## The following will stack the library sizes from each subset one on top of the other.
## This is likely not desired, but it does provide sufficient room to print the library
## sizes if one wishes.
stacked <- ggplot(all, aes(x=id)) +
  geom_bar(aes(weight=sum, alpha=alpha, fill=colors), color="black") +
  scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")
stacked

## In this instance, they are stacked on top of 0, which I suspect is the goal.
columns <- ggplot(all, aes(x=id, y=sum)) +
  geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
  scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")
columns

4 Write up the favorites

Let us assume for a moment that we will choose a couple of the above for looking at more closely. I suspect k8_count_conf, k10_count_conf, and k8_count and k8_conf are good candidates.

output <- paste0("excel/k8_conf_count_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
  k8_conf_count_written <- sm(write_expt(k8_conf_count, excel=output, violin=TRUE))
}
output <- paste0("excel/k10_conf_count_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
  k10_conf_count_written <- sm(write_expt(k10_conf_count, excel=output, violin=TRUE))
}
output <- paste0("excel/k8_count_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
  k8_count_written <- sm(write_expt(k8_count, excel=output, violin=TRUE))
}
output <- paste0("excel/k8_conf_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
  k8_conf_written <- sm(write_expt(k8_conf, excel=output, violin=TRUE))
}

5 Note to self

I decided to not save the state of this file, because it is huge. The differential expression analyses do not need this, but need only the result of the annotation worksheet.

pander::pander(sessionInfo())

R version 3.4.3 (2017-11-30)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: hpgltools(v.2017.10) and ggplot2(v.2.2.1)

loaded via a namespace (and not attached): Biobase(v.2.38.0), edgeR(v.3.20.2), bit64(v.0.9-7), splines(v.3.4.3), foreach(v.1.4.4), gtools(v.3.5.0), stats4(v.3.4.3), pander(v.0.6.1), blob(v.1.1.0), yaml(v.2.1.16), ggrepel(v.0.7.0), RSQLite(v.2.0), backports(v.1.1.2), lattice(v.0.20-35), limma(v.3.34.4), quadprog(v.1.5-5), digest(v.0.6.13), RColorBrewer(v.1.1-2), minqa(v.1.2.4), colorspace(v.1.3-2), htmltools(v.0.3.6), preprocessCore(v.1.40.0), Matrix(v.1.2-12), plyr(v.1.8.4), XML(v.3.98-1.9), devtools(v.1.13.4), genefilter(v.1.60.0), xtable(v.1.8-2), corpcor(v.1.6.9), scales(v.0.5.0), gdata(v.2.18.0), Rtsne(v.0.13), lme4(v.1.1-15), BiocParallel(v.1.12.0), tibble(v.1.3.4), annotate(v.1.56.1), mgcv(v.1.8-22), IRanges(v.2.12.0), withr(v.2.1.1), BiocGenerics(v.0.24.0), lazyeval(v.0.2.1), pbkrtest(v.0.4-7), survival(v.2.41-3), magrittr(v.1.5), memoise(v.1.1.0), evaluate(v.0.10.1), gplots(v.3.0.1), doParallel(v.1.0.11), nlme(v.3.1-131), MASS(v.7.3-47), xml2(v.1.1.1), tools(v.3.4.3), directlabels(v.2017.03.31), data.table(v.1.10.4-3), matrixStats(v.0.52.2), stringr(v.1.2.0), S4Vectors(v.0.16.0), munsell(v.0.4.3), locfit(v.1.5-9.1), AnnotationDbi(v.1.40.0), colorRamps(v.2.3), compiler(v.3.4.3), caTools(v.1.17.1), rlang(v.0.1.6), nloptr(v.1.0.4), grid(v.3.4.3), RCurl(v.1.95-4.8), iterators(v.1.0.9), variancePartition(v.1.8.0), bitops(v.1.0-6), base64enc(v.0.1-3), labeling(v.0.3), rmarkdown(v.1.8), gtable(v.0.2.0), codetools(v.0.2-15), DBI(v.0.7), roxygen2(v.6.0.1), reshape2(v.1.4.3), R6(v.2.2.2), knitr(v.1.17), bit(v.1.1-12), commonmark(v.1.4), rprojroot(v.1.3-1), KernSmooth(v.2.23-15), stringi(v.1.1.6), parallel(v.3.4.3), sva(v.3.26.0) and Rcpp(v.0.12.14)

message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset b1e1b19ec57ffc7b90f4f16b7b1c0cd71a321e04
## R> packrat::restore()
## This is hpgltools commit: Mon Jan 15 11:38:07 2018 -0500: b1e1b19ec57ffc7b90f4f16b7b1c0cd71a321e04
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_sample_estimation-v20171102.rda.xz
## tmp <- sm(saveme(filename=this_save))
---
title: "Solanum betaceum sample estimation."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
library("hpgltools")
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=90,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      fig.width=8,
                      fig.height=8,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
set.seed(1)
ver <- "20171102"
previous_file <- "01_annotation.Rmd"

tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))

rmd_file <- "02_sample_estimation.Rmd"
```

[index.html](index.html)
[preprocessing.html](preprocessing.html)
[01_annotation.html](01_annotation.html)

# Sample Estimation version: `r ver`

Since I used kallisto for this work, I want to figure out tximport.

In 01_annotation.Rmd, I decided to open up a few lines of inquiry:

* k10_raw:   All samples using the full set of putative annotations.
* k8_raw:  Samples excluding hpgl1003 with the full set of annotations.
* k10_count: All samples, removing those with < 'threshold (5)' counts/gene.
* k8_count:  8 samples, ibid.
* k10_conf:  All samples, but removing transcripts with annotations less confident than
    1e-10 blastx e-value and less identity than 60 percent to some other known gene.
* k8_conf:  8 samples, ibid.
* k10_count_conf:  A combination of the count filter and confidence filter.
* k8_count_conf:  ibid, but with the 8 samples.

# Examine these data sets

## Raw data

Start with the full data set.  This will take a rather long time to perform all the graphs, 
and since we are not likely to use it, I will limit myself to just plotting 1-2 graphs.

### k10 raw

This data set is unlikely to be very helpful for differential expression, but
may be useful for diagnosing other concerns in the data.

```{r k10_raw}
k10_libsize <- plot_libsize(k10_raw)
k10_libsize$plot
k10_density <- sm(plot_density(k10_raw))
sm(k10_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k10_boxplot <- sm(plot_boxplot(k10_raw))
k10_boxplot
k10_corheat <- plot_corheat(k10_raw)
k10_tsne <- plot_tsne(k10_raw)
k10_tsne$plot
```

Most notably, it makes obvious some concerns with samples 1003, 1007, and 1008.
We chose to remove 1003 and 1008, but will keep an eye on 1007.

### k8 raw

The metrics of the data when we remove those two samples should generally be
quite similar.

```{r k8_raw}
k8_libsize <- plot_libsize(k8_raw)
k8_libsize$plot
k8_density <- sm(plot_density(k8_raw))
sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k8_boxplot <- sm(plot_boxplot(k8_raw))
k8_boxplot
k8_corheat <- plot_corheat(k8_raw)
k8_tsne <- plot_tsne(k8_raw)
k8_tsne$plot
```

#### A peculiarity of kallisto

Here is the same density plot, but looking from 0 to 10.  In it we can see that
there is a tremendous number of 'genes' which have fractional counts.  I am
certain that this is important, but I am not sure what to do with it.  I have a
feeling that when we do a count-based filtering, this tremendous jump by sample
1009 will fall away...

```{r peculiar}
sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(0,10)))
```

## Quick side question

Can we ask the data to give us relatively reliable estimates of which genes/transcripts are in
all 4 embryogenic samples and in none of the non-embryogenic samples?

Yesterday when I went to do this, my rpkm function failed because I neglected to
include a sequence length column in the annotation data.  That has been
remedied, as a result these cutoffs probably should be adjust to something more
appropriate to rpkm-scale data.  Though I am not entirely certain where they
should be adjusted, perhaps <1 and 5?

```{r side_quest}
k8_raw_rpkm <- normalize_expt(k8_count, convert="rpkm", norm="quant")
low_cutoff <- 3
high_cutoff <- 10
k8_embryo_low_nonembryo_high <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) <= low_cutoff &
  rowMeans(exprs(k8_raw_rpkm)[, 5:8]) >= high_cutoff
checkme_first <- rownames(exprs(k8_raw_rpkm))[k8_embryo_low_nonembryo_high]
dim(exprs(k8_raw_rpkm)[checkme_first, ])
low_embryo_high_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_first, ], fData(k8_raw_rpkm)[checkme_first, ], by="row.names")
write.csv(file="low_embryo_high_nonembryo.csv", low_embryo_high_nonembryo)

k8_embryo_high_nonembryo_low <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) >= high_cutoff & rowMeans(exprs(k8_raw_rpkm)[, 5:8]) <= low_cutoff
checkme_second <- rownames(exprs(k8_raw_rpkm))[k8_embryo_high_nonembryo_low]
dim(exprs(k8_raw_rpkm)[checkme_second, ])
high_embryo_low_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_second, ], fData(k8_raw_rpkm)[checkme_second, ], by="row.names")
write.csv(file="high_embryo_low_nonembryo.csv", high_embryo_low_nonembryo)
```

## Count filtered data

When we get to the count filtered data, I suspect things will be more interesting.
Therefore let us look at the full spectrum of plots before/after normalization.

### k10 count filtering

For these data, per orm some extra analyses including a PC loading, PC vs actor
analysis, and variance partition.

```{r k10_count, fig.show="hide"}
k10_count_raw_plots <- sm(graph_metrics(k10_count))
k10_count_norm <- sm(normalize_expt(k10_count, transform="log2", norm="quant", convert="cpm"))
k10_count_norm_plots <- sm(graph_metrics(k10_count_norm))
k10_count_normbatch <- sm(normalize_expt(k10_count, batch="svaseq", transform="log2", convert="cpm"))
k10_count_normbatch_plots <- sm(graph_metrics(k10_count_normbatch))
```

```{r k10_count_pca_fun}
pca_info <- pca_information(k10_count, plot_pcas=TRUE, num_components=4,
                            expt_factors=c("condition", "batch", "rnaconcentration"))
pca_info$pca_plots
k10_varpart <- varpart(k10_count, predictor=NULL, factors=c("condition","batch"))
k10_varpart$percent_plot
k10_varpart$partition_plot
write.csv(k10_varpart$sorted_df, file="excel/varpart_sorted.csv")
high_scores <- pca_highscores(k10_count, n=100)
## Top 100 Strongest genes vs. component 1.
high_scores$highest[, "Comp.1"]
## Theoretically that list should agree well with the top of the variance partition table vs. Condition
## assuming condition fits component 1.

## In contrast, since component 2 seems to be the best for splitting the data, here is component 2.
high_scores$highest[, "Comp.2"]

```

#### Look at some plots

Let us see how the k10 count data appears

```{r show_k10_count}
k10_count_raw_plots$libsize
sm(k10_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,200)))
k10_count_raw_plots$boxplot
k10_count_norm_plots$corheat
k10_count_norm_plots$smc
k10_count_norm_plots$disheat
k10_count_norm_plots$smd
k10_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k10_count_normbatch_plots$pcaplot
```

#### Odd densities again

```{r densities_again}
sm(k10_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(0,10)))
```

It looks like I was mostly correct above, this makes it clearer that sample 1003
has to go, and since 1008 is in the same batch it is likely to get the axe
still.  Though judging from this plot, maybe 1009 is a candidate for removal too.

### k8 count filtering

```{r k8_count, fig.show="hide"}
k8_count_raw_plots <- sm(graph_metrics(k8_count))
k8_count_norm <- sm(normalize_expt(k8_count, transform="log2", norm="quant", convert="cpm"))
k8_count_norm_plots <- sm(graph_metrics(k8_count_norm))
k8_count_normbatch <- sm(normalize_expt(k8_count, batch="svaseq", transform="log2", convert="cpm"))
k8_count_normbatch_plots <- sm(graph_metrics(k8_count_normbatch))
```

#### Look at some plots

Let us see how the k8 count data appears

```{r show_k8_count}
k8_count_raw_plots$libsize
sm(k8_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,200)))
k8_count_raw_plots$boxplot
k8_count_norm_plots$corheat
k8_count_norm_plots$smc
k8_count_norm_plots$disheat
k8_count_norm_plots$smd
k8_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k8_count_normbatch_plots$pcaplot
k8_count_normbatch_plots$tsneplot
```

## Conf filtered data

When we get to the conf filtered data, I suspect things will be more interesting.
Therefore let us look at the full spectrum of plots before/after normalization.

### k10 confidence filtering

```{r k10_conf, fig.show="hide"}
k10_conf_raw_plots <- sm(graph_metrics(k10_conf))
k10_conf_norm <- sm(normalize_expt(k10_conf, transform="log2", norm="quant", convert="cpm"))
k10_conf_norm_plots <- sm(graph_metrics(k10_conf_norm))
k10_conf_normbatch <- sm(normalize_expt(k10_conf, batch="svaseq", transform="log2",
                                        convert="cpm", filter=TRUE))
k10_conf_normbatch_plots <- sm(graph_metrics(k10_conf_normbatch))
```

#### Look at some plots

Let us see how the k10 conf data appears

```{r show_k10_conf}
k10_conf_raw_plots$libsize
sm(k10_conf_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k10_conf_raw_plots$boxplot
k10_conf_norm_plots$corheat
## This plot is a particularly nice example of why 1003/1008 are annoying.
k10_conf_norm_plots$smc
k10_conf_norm_plots$disheat
k10_conf_norm_plots$smd
k10_conf_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k10_conf_normbatch_plots$pcaplot
```

### k8 confidence filtering

```{r k8_conf, fig.show="hide"}
k8_conf_raw_plots <- sm(graph_metrics(k8_conf))
k8_conf_norm <- sm(normalize_expt(k8_conf, transform="log2", norm="quant", convert="cpm"))
k8_conf_norm_plots <- sm(graph_metrics(k8_conf_norm))
k8_conf_normbatch <- sm(normalize_expt(k8_conf, batch="svaseq", transform="log2",
                                       convert="cpm", filter=TRUE))
k8_conf_normbatch_plots <- sm(graph_metrics(k8_conf_normbatch))
```

#### Look at some plots

Let us see how the k8 conf data appears

```{r show_k8_conf}
k8_conf_raw_plots$libsize
sm(k8_conf_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k8_conf_raw_plots$boxplot
k8_conf_norm_plots$corheat
k8_conf_norm_plots$smc
k8_conf_norm_plots$disheat
k8_conf_norm_plots$smd
k8_conf_norm_plots$pcaplot
k8_conf_norm_plots$tsneplot
## See if the additional information in the 'batch' factor made a difference
k8_conf_normbatch_plots$pcaplot
k8_conf_normbatch_plots$tsneplot
```

## Count and Conf filtered data

When we get to the count_conf filtered data, I suspect things will be more interesting.
Therefore let us look at the full spectrum of plots before/after normalization.

### k10 both filters

```{r k10_count_conf, fig.show="hide"}
k10_conf_count_raw_plots <- sm(graph_metrics(k10_conf_count))
k10_conf_count_norm <- sm(normalize_expt(k10_conf_count, transform="log2",
                                         norm="quant", convert="cpm"))
k10_conf_count_norm_plots <- sm(graph_metrics(k10_conf_count_norm))
k10_conf_count_normbatch <- sm(normalize_expt(k10_conf_count, batch="svaseq", transform="log2",
                                              convert="cpm", filter=TRUE))
k10_conf_count_normbatch_plots <- sm(graph_metrics(k10_conf_count_normbatch))
```

#### Look at some plots

Let us see how the k10 count_conf data appears

```{r show_k10_count_conf}
k10_conf_count_raw_plots$libsize
sm(k10_conf_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k10_conf_count_raw_plots$boxplot
k10_conf_count_norm_plots$corheat
k10_conf_count_norm_plots$smc
k10_conf_count_norm_plots$disheat
k10_conf_count_norm_plots$smd
k10_conf_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k10_conf_count_normbatch_plots$pcaplot
k10_conf_count_normbatch_plots$tsneplot
```

### k8 both filters

```{r k8_count_conf, fig.show="hide"}
k8_conf_count_raw_plots <- sm(graph_metrics(k8_conf_count))
k8_conf_count_norm <- sm(normalize_expt(k8_conf_count, transform="log2",
                                        norm="quant", convert="cpm"))
k8_conf_count_norm_plots <- sm(graph_metrics(k8_conf_count_norm))
k8_conf_count_normbatch <- sm(normalize_expt(k8_conf_count, batch="svaseq",
                                             transform="log2", convert="cpm"))
k8_conf_count_normbatch_plots <- sm(graph_metrics(k8_conf_count_normbatch))
```

#### Look at some plots

Let us see how the k8 count_conf data appears

```{r show_k8_count_conf}
k8_conf_count_raw_plots$libsize
sm(k8_conf_count_raw_plots$density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k8_conf_count_raw_plots$boxplot
k8_conf_count_norm_plots$corheat
k8_conf_count_norm_plots$smc
k8_conf_count_norm_plots$disheat
k8_conf_count_norm_plots$smd
k8_conf_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k8_conf_count_normbatch_plots$pcaplot
k8_conf_count_normbatch_plots$tsneplot
```

### Both filters, variance partition

Variance Partition is a neat tool to see what experimental factors have the most
variance associated with them.  It is slow, so I will use it on the smallest
data set we have created (the confidence/count filtered).

```{r varpart}
k8_conf_count_varpart <- varpart(expt=k8_conf_count, predictor=NULL,
                                 factors=c("condition", "batch"),
                                 modify_expt=TRUE)
k8_conf_count <- k8_conf_count_varpart[["modified_expt"]]
k8_conf_count_varpart$percent_plot
## These genes are the most likely to be interesting for the difference between non/embryogenic
k8_conf_count_varpart$partition_plot
## Another fun thing to use is pca_information()
testing <- pca_information(expt_data=k8_conf_count)
```

# Make a nice stacked libsize barplot

```{r test_barplot}
library(ggplot2)
maximum <- plot_libsize(k8_raw, text=FALSE)
maximum_alpha <- maximum$table
maximum_alpha$alpha <- alpha(maximum_alpha$colors, 0.4)
conf <- plot_libsize(k8_conf, text=FALSE)
conf_alpha <- conf$table
conf_alpha$alpha <- alpha(conf_alpha$colors, 0.6)
count <- plot_libsize(k8_count, text=FALSE)
count_alpha <- count$table
count_alpha$alpha <- alpha(count_alpha$colors, 0.8)
conf_count <- plot_libsize(k8_conf_count, text=FALSE)
conf_count_alpha <- conf_count$table
conf_count_alpha$alpha <- alpha(conf_count_alpha$colors, 1.0)

all <- rbind(maximum_alpha, conf_alpha,
             count_alpha, conf_count_alpha)
all <- as.data.frame(all)

## The following will stack the library sizes from each subset one on top of the other.
## This is likely not desired, but it does provide sufficient room to print the library
## sizes if one wishes.
stacked <- ggplot(all, aes(x=id)) +
  geom_bar(aes(weight=sum, alpha=alpha, fill=colors), color="black") +
  scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")
stacked

## In this instance, they are stacked on top of 0, which I suspect is the goal.
columns <- ggplot(all, aes(x=id, y=sum)) +
  geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
  scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")
columns
```

# Write up the favorites

Let us assume for a moment that we will choose a couple of the above for looking at more
closely.  I suspect k8_count_conf, k10_count_conf, and k8_count and k8_conf are good candidates.

```{r write_some, fig.show="hide"}
output <- paste0("excel/k8_conf_count_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
  k8_conf_count_written <- sm(write_expt(k8_conf_count, excel=output, violin=TRUE))
}
output <- paste0("excel/k10_conf_count_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
  k10_conf_count_written <- sm(write_expt(k10_conf_count, excel=output, violin=TRUE))
}
output <- paste0("excel/k8_count_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
  k8_count_written <- sm(write_expt(k8_count, excel=output, violin=TRUE))
}
output <- paste0("excel/k8_conf_by_gene-v", ver, ".xlsx")
if (!file.exists(output)) {
  k8_conf_written <- sm(write_expt(k8_conf, excel=output, violin=TRUE))
}
```

# Note to self

I decided to not save the state of this file, because it is huge.
The differential expression analyses do not need this, but need only the result of the
annotation worksheet.

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## tmp <- sm(saveme(filename=this_save))
```
