1 Gene specific data

The following 3 sections are gene specific. Later, they will be repeated for the transcripts.

2 Annotation version: 20180310

Any annotation data will need to come from our trinotate result. An interesting caveat, kallisto uses a transcript database. tximport is the library used to read the counts/transcript into R, it is able to take a set of mappings between gene/transcript in order to provide gene counts rather than transcript.

An important aspect of using trinity for this creation of our annotation data: there are many putative transcripts for each putative gene, many of which are probably not real, but instead are artifacts of the assembler. Therefore we need to do something to try to distinguish between the reals and fakers. The following code will attempt to address this problem.

2.1 Load all trinotate data

putative_annotations <- sm(load_trinotate_annotations("reference/trinotate.csv"))
annot_by_gene <- putative_annotations
## Let us make one set of annotations by transcript ID and one by gene ID
rownames(annot_by_gene) <- make.names(putative_annotations[["gene_id"]], unique=TRUE)
annot_by_tx <- putative_annotations
rownames(annot_by_tx) <- make.names(putative_annotations[["transcript_id"]], unique=TRUE)

2.2 Create the expressionsets

In order to make functional expressionsets of this data, we will need to use the mappings from gene <-> transcripts, that is the final argument to create_expt().

## I split the annotations into one set keyed by gene IDs and one by transcript IDs.
## This way it is possible to make an expressionset of transcripts or genes by
## removing/keeping the tx_gene_map argument.

## Why would you want to do this? you might ask...  Well, sadly the annotation database
## Does not always have filled-in data for the first transcript for a given gene ID,
## therefore the annotation information for those genes will be lost when we merge
## the count tables / annotation tables by gene ID.
## This is of course also true for transcript IDs, but for a vastly smaller number of them.
transcript_gene_map <- putative_annotations[, c("transcript_id", "gene_id")]
k10_raw <- create_expt(metadata="sample_sheets/all_samples.xlsx",
                       gene_info=annot_by_gene,
                       tx_gene_map=transcript_gene_map)

## Reading the sample metadata.

## The sample definitions comprises: 10, 7 rows, columns.

## Reading count tables.

## Using the transcript<->gene mapping.

## Finished reading count tables.

## Matched 58486 annotations and counts.

## Bringing together the count matrix and gene information.

putative_annotations <- fData(k10_raw)  ## A cheap way to get back the unique gene IDs.
## On further examination, we decided to remove both samples 1003 and 1008
## which Sandra pointed out are in fact from the same flow cytometry isolation.
k8_raw <- create_expt(metadata="sample_sheets/kept_samples.xlsx",
                      gene_info=annot_by_gene,
                      tx_gene_map=transcript_gene_map)

## Reading the sample metadata.

## The sample definitions comprises: 8, 7 rows, columns.

## Reading count tables.

## Using the transcript<->gene mapping.

## Finished reading count tables.

## Matched 58486 annotations and counts.

## Bringing together the count matrix and gene information.

## If we do not use the tx_gene_map information, we get 234,330 transcripts in the data set.
## Instead, we get 58,486 genes.

2.2.1 How many putative features in the raw data?

dim(exprs(k10_raw))

## [1] 58486    10

dim(exprs(k8_raw))

## [1] 58486     8

2.3 Threshold cutoff

Start by dropping all features with less than 4 counts across all samples. This should get rid of a significant number of the false positive transcripts. While I am at it, I will drop the features annotated as rRNA.

## Perform a count/gene cutoff here
threshold <- 4
method <- "cbcb"  ## other methods include 'genefilter' and 'simple'
k10_count <- sm(normalize_expt(k10_raw, filter=method, thresh=threshold))
k8_count <- sm(normalize_expt(k8_raw, filter=method, thresh=threshold))

## First get rid of the rRNA
sb_rrna <- putative_annotations[["rrna_subunit"]] != ""
rrna_droppers <- rownames(putative_annotations)[sb_rrna]
k10_norrna <- exclude_genes_expt(expt=k10_raw, ids=rrna_droppers)

## Before removal, there were 58486 entries.

## Now there are 58481 entries.

## Percent kept: 99.994, 99.994, 99.994, 99.996, 99.996, 99.993, 99.991, 99.994, 99.996, 99.995

## Percent removed: 0.006, 0.006, 0.006, 0.004, 0.004, 0.007, 0.009, 0.006, 0.004, 0.005

k8_norrna <- exclude_genes_expt(expt=k8_raw, ids=rrna_droppers)

## Before removal, there were 58486 entries.

## Now there are 58481 entries.

## Percent kept: 99.994, 99.994, 99.994, 99.996, 99.993, 99.991, 99.994, 99.995

## Percent removed: 0.006, 0.006, 0.006, 0.004, 0.007, 0.009, 0.006, 0.005

2.3.1 How many high-count features are in the data?

dim(exprs(k10_count))

## [1] 24132    10

dim(exprs(k8_count))

## [1] 23211     8

2.4 Confidence cutoff

Now lets drop those features for which we have relatively lower confidence in the BLAST results from trinotate.

## Keep only the blastx hits with a low e-value and high identity.
filtered_keepers <- putative_annotations[["blastx_evalue"]] <= 1e-10 &
  putative_annotations[["blastx_identity"]] >= 60
keepers <- rownames(putative_annotations)[filtered_keepers]
## This excludes all but 33,451 transcripts.

## Now keep only those transcript IDs which fall into the above categories.
k10_conf <- exclude_genes_expt(expt=k10_norrna, ids=keepers, method="keep")

## Before removal, there were 58481 entries.

## Now there are 4888 entries.

## Percent kept: 23.611, 22.922, 24.308, 23.931, 24.798, 24.603, 25.471, 22.574, 23.289, 25.368

## Percent removed: 76.389, 77.078, 75.692, 76.069, 75.202, 75.397, 74.529, 77.426, 76.711, 74.632

k8_conf <- exclude_genes_expt(expt=k8_norrna, ids=keepers, method="keep")

## Before removal, there were 58481 entries.

## Now there are 4888 entries.

## Percent kept: 23.611, 22.922, 24.308, 24.798, 24.603, 25.471, 22.574, 25.368

## Percent removed: 76.389, 77.078, 75.692, 75.202, 75.397, 74.529, 77.426, 74.632

2.5 How many high-confidence features are in the data?

dim(exprs(k10_conf))

## [1] 4888   10

dim(exprs(k8_conf))

## [1] 4888    8

2.6 Both cutoffs

Finally, lets be super-strict and remove both.

## Perform both the confidence and count filtration
k10_conf_count <- sm(normalize_expt(k10_conf, filter=method, thresh=threshold))
k8_conf_count <- sm(normalize_expt(k8_conf, filter=method, thresh=threshold))

rrna_expt <- exclude_genes_expt(expt=k10_raw, ids=sb_rrna, method="keep")

## Before removal, there were 58486 entries.

## Now there are 5 entries.

## Percent kept: 0.006, 0.006, 0.006, 0.004, 0.004, 0.007, 0.009, 0.006, 0.004, 0.005

## Percent removed: 99.994, 99.994, 99.994, 99.996, 99.996, 99.993, 99.991, 99.994, 99.996, 99.995

## This has only 21 putative rRNA transcripts.

2.6.1 How many high-count and high-confidence features?

dim(exprs(k10_conf_count))

## [1] 4392   10

dim(exprs(k8_conf_count))

## [1] 4328    8

3 Plot changes in number of putative genes

There are a few ways to visualize the changes wrought in the data by these operations. Here are two bar plots which attempt to show what happened, the first stacks the relatively-sized library sizes atop one another, the second stacks them on the x-axis.

genes_by_subset <- data.frame(
  samples = c(rep("ten", 4), rep("eight", 4)),
  colors = c(rep("#1B9E77", 4), rep("#7570B3", 4)),
  alpha = c("#1B9E77FF", "#1B9E77CC", "#1B0E77AA",
            "#1B0E7777", "#7570B3FF", "#7570B3CC", "#7570B3AA", "#7570B377"),
  type = rep(c("raw", "count", "conf", "conf_count"), 2),
  genes = c(nrow(exprs(k10_raw)), nrow(exprs(k10_count)),
            nrow(exprs(k10_conf)), nrow(exprs(k10_conf_count)),
            nrow(exprs(k8_raw)), nrow(exprs(k8_count)),
            nrow(exprs(k8_conf)), nrow(exprs(k8_conf_count))))

library(ggplot2)
columns <- ggplot(genes_by_subset, aes(x=samples, y=genes)) +
  geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
  scale_fill_manual(values=c(levels(as.factor(genes_by_subset$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")
c1 <- genes_by_subset[["samples"]] == "ten"
c2 <- genes_by_subset[["samples"]] == "eight"
tmp <- cbind(genes_by_subset[c1, "genes"], genes_by_subset[c2, "genes"])
colnames(tmp) <- c("ten", "eight")
rownames(tmp) <- c("raw", "count", "conf", "conf_count")
tmp

##              ten eight
## raw        58486 58486
## count      24132 23211
## conf        4888  4888
## conf_count  4392  4328

pp(file="images/genes_by_columns.png", image=columns)

## Wrote the image to: images/genes_by_columns.png

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}

4 Gene Sample Estimation version: 20180310

Since I used kallisto for this work, I want to figure out tximport.

In 01_annotation.Rmd, I decided to open up a few lines of inquiry:

k10_raw: All samples using the full set of putative annotations.
k8_raw: Samples excluding hpgl1003 with the full set of annotations.
k10_count: All samples, removing those with < ‘threshold (5)’ counts/gene.
k8_count: 8 samples, ibid.
k10_conf: All samples, but removing transcripts with annotations less confident than 1e-10 blastx e-value and less identity than 60 percent to some other known gene.
k8_conf: 8 samples, ibid.
k10_count_conf: A combination of the count filter and confidence filter.
k8_count_conf: ibid, but with the 8 samples.

5 Examine these data sets

5.1 Raw data

Start with the full data set. This will take a rather long time to perform all the graphs, and since we are not likely to use it, I will limit myself to just plotting 1-2 graphs.

5.1.1 k10 raw

This data set is unlikely to be very helpful for differential expression, but may be useful for diagnosing other concerns in the data.

k10_libsize <- plot_libsize(k10_raw)
pp(file="images/01_k10_libsize.pdf", image=k10_libsize$plot)

## Wrote the image to: images/01_k10_libsize.pdf

k10_density <- sm(plot_density(k10_raw))

k10_density <- sm(k10_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
pp(file="images/02_k10_density.pdf", image=k10_density)

## Wrote the image to: images/02_k10_density.pdf

k10_boxplot <- sm(plot_boxplot(k10_raw))
pp(file="images/03_k10_boxplot.pdf", image=k10_boxplot)

## Wrote the image to: images/03_k10_boxplot.pdf

k10_corheat <- plot_corheat(k10_raw)

pp(file="images/04_k10_corheat.pdf", image=k10_corheat)

## Wrote the image to: images/04_k10_corheat.pdf

k10_tsne <- plot_tsne(k10_raw)
pp(file="images/05_k10_tsne.pdf", image=k10_tsne)

## Wrote the image to: images/05_k10_tsne.pdf

Most notably, it makes obvious some concerns with samples 1003, 1007, and 1008. We chose to remove 1003 and 1008, but will keep an eye on 1007.

5.1.2 k8 raw

The metrics of the data when we remove those two samples should generally be quite similar.

k8_libsize <- plot_libsize(k8_raw)
pp(file="images/06_k8_libsize.pdf", image=k8_libsize$plot)

## Wrote the image to: images/06_k8_libsize.pdf

k8_density <- sm(plot_density(k8_raw))

k8_density <- sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
pp(file="images/07_k8_density.pdf", image=k8_density)

## Wrote the image to: images/07_k8_density.pdf

k8_boxplot <- sm(plot_boxplot(k8_raw))
pp(file="images/08_k8_boxplot.pdf", image=k8_boxplot)

## Wrote the image to: images/08_k8_boxplot.pdf

k8_corheat <- plot_corheat(k8_raw)

pp(file="images/09_k8_corheat.pdf", image=k8_corheat)

## Wrote the image to: images/09_k8_corheat.pdf

k8_tsne <- plot_tsne(k8_raw)
pp(file="images/10_k8_tsne.pdf", image=k8_tsne)

## Wrote the image to: images/10_k8_tsne.pdf

5.1.2.1 A peculiarity of kallisto

Here is the same density plot, but looking from 0 to 10. In it we can see that there is a tremendous number of ‘genes’ which have fractional counts. I am certain that this is important, but I am not sure what to do with it. I have a feeling that when we do a count-based filtering, this tremendous jump by sample 1009 will fall away…

odd <- sm(k8_density + ggplot2::scale_x_continuous(limits=c(0,10)))
pp(file="images/11_k8_odd_density.pdf", image=odd)

## Wrote the image to: images/11_k8_odd_density.pdf

5.2 Quick side question

Can we ask the data to give us relatively reliable estimates of which genes/transcripts are in all 4 embryogenic samples and in none of the non-embryogenic samples?

Yesterday when I went to do this, my rpkm function failed because I neglected to include a sequence length column in the annotation data. That has been remedied, as a result these cutoffs probably should be adjust to something more appropriate to rpkm-scale data. Though I am not entirely certain where they should be adjusted, perhaps <1 and 5?

k8_raw_rpkm <- normalize_expt(k8_count, convert="rpkm", norm="quant")

## This function will replace the expt$expressionset slot with:

## rpkm(quant(data))

## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize

## Filter is false, this should likely be set to something, good
##  choices include cbcb, kofa, pofa (anything but FALSE).  If you want this to
##  stay FALSE, keep in mind that if other normalizations are performed, then the
##  resulting libsizes are likely to be strange (potentially negative!)

## Leaving the data in its current base format, keep in mind that
##  some metrics are easier to see when the data is log2 transformed, but
##  EdgeR/DESeq do not accept transformed data.

## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.

## Step 1: not doing count filtering.

## Step 2: normalizing the data with quant.

## Step 3: converting the data with rpkm.

## Step 4: not transforming the data.

## Step 5: not doing batch correction.

low_cutoff <- 3
high_cutoff <- 10
k8_embryo_low_nonembryo_high <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) <= low_cutoff &
  rowMeans(exprs(k8_raw_rpkm)[, 5:8]) >= high_cutoff
checkme_first <- rownames(exprs(k8_raw_rpkm))[k8_embryo_low_nonembryo_high]
dim(exprs(k8_raw_rpkm)[checkme_first, ])

## [1] 228   8

low_embryo_high_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_first, ], fData(k8_raw_rpkm)[checkme_first, ], by="row.names")
write.csv(file="low_embryo_high_nonembryo.csv", low_embryo_high_nonembryo)

k8_embryo_high_nonembryo_low <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) >= high_cutoff & rowMeans(exprs(k8_raw_rpkm)[, 5:8]) <= low_cutoff
checkme_second <- rownames(exprs(k8_raw_rpkm))[k8_embryo_high_nonembryo_low]
dim(exprs(k8_raw_rpkm)[checkme_second, ])

## [1] 43  8

high_embryo_low_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_second, ], fData(k8_raw_rpkm)[checkme_second, ], by="row.names")
write.csv(file="high_embryo_low_nonembryo.csv", high_embryo_low_nonembryo)

5.3 Count filtered data

When we get to the count filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.

5.3.1 k10 count filtering

For these data, per orm some extra analyses including a PC loading, PC vs actor analysis, and variance partition.

k10_count_raw_plots <- sm(graph_metrics(k10_count))

k10_count_norm <- sm(normalize_expt(k10_count, transform="log2", norm="quant", convert="cpm"))
k10_count_norm_plots <- sm(graph_metrics(k10_count_norm))

k10_count_normbatch <- sm(normalize_expt(k10_count, batch="svaseq", transform="log2", convert="cpm"))
k10_count_normbatch_plots <- sm(graph_metrics(k10_count_normbatch))

5.3.2 Some variance partition plots

pp(file="images/12_anova_fstat_heatmap.pdf", image=pca_info$anova_fstat_heatmap)

## Wrote the image to: images/12_anova_fstat_heatmap.pdf

pp(file="images/13_k10_varpart_pct.pdf", image=k10_varpart$percent_plot)

## Wrote the image to: images/13_k10_varpart_pct.pdf

pp(file="images/14_k10_varpart_partition.pdf", image=k10_varpart$partition_plot)

## Wrote the image to: images/14_k10_varpart_partition.pdf

5.3.2.1 Look at some plots

Let us see how the k10 count data appears

pp(file="images/15_k10count_libsize.pdf", image=k10_count_raw_plots$libsize)

## Wrote the image to: images/15_k10count_libsize.pdf

k10_count_density <- sm(k10_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,200)))

pp(file="images/16_k10count_density.pdf", image=k10_count_density)

## Wrote the image to: images/16_k10count_density.pdf

pp(file="images/17_k10count_boxplot.pdf", image=k10_count_raw_plots$boxplot)

## Wrote the image to: images/17_k10count_boxplot.pdf

pp(file="images/18_k10count_corheat.pdf", image=k10_count_norm_plots$corheat)

## Wrote the image to: images/18_k10count_corheat.pdf

pp(file="images/19_k10count_smc.pdf", image=k10_count_norm_plots$smc)

## Wrote the image to: images/19_k10count_smc.pdf

pp(file="images/20_k10count_disheat.pdf", image=k10_count_norm_plots$disheat)

## Wrote the image to: images/20_k10count_disheat.pdf

pp(file="images/21_k10count_smd.pdf", image=k10_count_norm_plots$smd)

## Wrote the image to: images/21_k10count_smd.pdf

pp(file="images/22_k10count_pca.pdf", image=k10_count_norm_plots$pcaplot)

## Wrote the image to: images/22_k10count_pca.pdf

## See if the additional information in the 'batch' factor made a difference
pp(file="images/23_k10count_normbatch_pca.pdf", image=k10_count_normbatch_plots$pcaplot)

## Wrote the image to: images/23_k10count_normbatch_pca.pdf

5.3.2.2 Odd densities again

sm(k10_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(0,10)))

It looks like I was mostly correct above, this makes it clearer that sample 1003 has to go, and since 1008 is in the same batch it is likely to get the axe still. Though judging from this plot, maybe 1009 is a candidate for removal too.

5.3.3 k8 count filtering

These are the data primarily used in the following analyses (differential expression/ontology/etc).

k8_count_raw_plots <- sm(graph_metrics(k8_count))

k8_count_norm <- sm(normalize_expt(k8_count, transform="log2", norm="quant", convert="cpm"))
k8_count_norm_plots <- sm(graph_metrics(k8_count_norm))

k8_count_normbatch <- sm(normalize_expt(k8_count, batch="svaseq", transform="log2", convert="cpm"))
k8_count_normbatch_plots <- sm(graph_metrics(k8_count_normbatch))

5.3.3.1 Look at some plots

Let us see how the k8 count data appears

pp(file="images/20_k8count_libsize.pdf", image=k8_count_raw_plots$libsize)

## Wrote the image to: images/20_k8count_libsize.pdf

k8_count_density <- sm(k8_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,200)))

pp(file="images/21_k8count_density.pdf", image=k8_count_density)

## Wrote the image to: images/21_k8count_density.pdf

pp(file="images/22_k8count_boxplot.pdf", image=k8_count_raw_plots$boxplot)

## Wrote the image to: images/22_k8count_boxplot.pdf

pp(file="images/23_k8count_corheat.pdf", image=k8_count_norm_plots$corheat)

## Wrote the image to: images/23_k8count_corheat.pdf

pp(file="images/24_k8count_smc.pdf", image=k8_count_norm_plots$smc)

## Wrote the image to: images/24_k8count_smc.pdf

pp(file="images/25_k8count_disheat.pdf", image=k8_count_norm_plots$disheat)

## Wrote the image to: images/25_k8count_disheat.pdf

pp(file="images/26_k8count_smd.pdf", image=k8_count_norm_plots$smd)

## Wrote the image to: images/26_k8count_smd.pdf

pp(file="images/27_k8count_pca.pdf", image=k8_count_norm_plots$pcaplot)

## Wrote the image to: images/27_k8count_pca.pdf

## See if the additional information in the 'batch' factor made a difference
pp(file="images/28_k8count_normbatch_pca.pdf", image=k8_count_normbatch_plots$pcaplot)

## Wrote the image to: images/28_k8count_normbatch_pca.pdf

pp(file="images/29_k8count_tsne.pdf", image=k8_count_normbatch_plots$tsneplot)

## Wrote the image to: images/29_k8count_tsne.pdf

5.3.4 Look more closely at variance

We can use variancePartition to look more closely at where the variance in the data is coming from.

pca_info <- pca_information(k8_count, plot_pcas=TRUE, num_components=4,
                            expt_factors=c("condition", "batch", "rnaconcentration"))

## More shallow curves in these plots suggest more genes in this principle component.

k8_varpart <- varpart(k8_count, predictor=NULL, factors=c("condition","batch"))

## Attempting mixed linear model with: ~  (1|condition) + (1|batch)

## Fitting the expressionset to the model, this is slow.

## Projected run time: ~ 2 min

## Placing factor: condition at the beginning of the model.

write.csv(k8_varpart$sorted_df, file="excel/varpart_sorted.csv")
high_scores <- pca_highscores(k8_count, n=100)

## Top 100 Strongest genes vs. component 1.
high_scores$highest[, "Comp.1"]

##   [1] "14.53:TRINITY_DN11326_c0_g3"  "13.33:TRINITY_DN7305_c1_g7"  
##   [3] "13.19:TRINITY_DN18956_c2_g1"  "13.13:TRINITY_DN11934_c0_g7" 
##   [5] "12.63:TRINITY_DN15106_c4_g3"  "12.56:TRINITY_DN16546_c0_g6" 
##   [7] "12.34:TRINITY_DN20010_c1_g5"  "12.15:TRINITY_DN15897_c1_g3" 
##   [9] "12.15:TRINITY_DN10836_c0_g3"  "11.9:TRINITY_DN15055_c3_g4"  
##  [11] "11.75:TRINITY_DN15734_c1_g2"  "11.59:TRINITY_DN7489_c0_g3"  
##  [13] "11.55:TRINITY_DN13204_c0_g1"  "11.47:TRINITY_DN16838_c0_g1" 
##  [15] "11.37:TRINITY_DN14567_c0_g2"  "11.27:TRINITY_DN7489_c0_g7"  
##  [17] "11.27:TRINITY_DN16152_c0_g6"  "10.84:TRINITY_DN8744_c0_g1"  
##  [19] "10.71:TRINITY_DN11842_c4_g1"  "10.62:TRINITY_DN13516_c1_g1" 
##  [21] "10.57:TRINITY_DN8078_c1_g3"   "10.45:TRINITY_DN15545_c0_g3" 
##  [23] "10.41:TRINITY_DN11322_c1_g5"  "10.41:TRINITY_DN17690_c0_g4" 
##  [25] "10.21:TRINITY_DN9459_c1_g3"   "10.14:TRINITY_DN13058_c0_g1" 
##  [27] "10.07:TRINITY_DN15333_c1_g5"  "9.923:TRINITY_DN16729_c0_g3" 
##  [29] "9.871:TRINITY_DN8985_c1_g8"   "9.83:TRINITY_DN14132_c0_g1"  
##  [31] "9.825:TRINITY_DN12928_c0_g13" "9.767:TRINITY_DN14230_c2_g3" 
##  [33] "9.687:TRINITY_DN18952_c1_g4"  "9.679:TRINITY_DN18760_c1_g4" 
##  [35] "9.651:TRINITY_DN8086_c3_g1"   "9.651:TRINITY_DN14199_c1_g4" 
##  [37] "9.626:TRINITY_DN6957_c2_g5"   "9.61:TRINITY_DN15734_c1_g4"  
##  [39] "9.592:TRINITY_DN14545_c1_g1"  "9.589:TRINITY_DN14291_c0_g2" 
##  [41] "9.532:TRINITY_DN19416_c3_g2"  "9.464:TRINITY_DN11811_c1_g4" 
##  [43] "9.438:TRINITY_DN8979_c2_g6"   "9.344:TRINITY_DN8411_c0_g5"  
##  [45] "9.315:TRINITY_DN10513_c0_g3"  "9.309:TRINITY_DN16231_c1_g6" 
##  [47] "9.302:TRINITY_DN12894_c1_g4"  "9.29:TRINITY_DN11126_c0_g1"  
##  [49] "9.205:TRINITY_DN9864_c0_g5"   "9.145:TRINITY_DN11615_c1_g1" 
##  [51] "9.09:TRINITY_DN16511_c0_g8"   "9.046:TRINITY_DN17042_c1_g4" 
##  [53] "9.043:TRINITY_DN13936_c0_g2"  "9.03:TRINITY_DN12936_c0_g4"  
##  [55] "8.978:TRINITY_DN16578_c3_g2"  "8.912:TRINITY_DN7883_c1_g1"  
##  [57] "8.901:TRINITY_DN11505_c0_g1"  "8.888:TRINITY_DN11628_c1_g1" 
##  [59] "8.809:TRINITY_DN7877_c0_g1"   "8.806:TRINITY_DN11976_c1_g2" 
##  [61] "8.797:TRINITY_DN16152_c0_g8"  "8.762:TRINITY_DN14513_c0_g4" 
##  [63] "8.741:TRINITY_DN15573_c0_g3"  "8.715:TRINITY_DN14860_c2_g4" 
##  [65] "8.712:TRINITY_DN14732_c0_g3"  "8.698:TRINITY_DN17291_c1_g3" 
##  [67] "8.63:TRINITY_DN8013_c0_g4"    "8.612:TRINITY_DN17213_c0_g1" 
##  [69] "8.608:TRINITY_DN9824_c4_g2"   "8.527:TRINITY_DN13759_c0_g3" 
##  [71] "8.526:TRINITY_DN11028_c0_g5"  "8.521:TRINITY_DN18874_c0_g4" 
##  [73] "8.454:TRINITY_DN9025_c2_g5"   "8.452:TRINITY_DN16578_c3_g3" 
##  [75] "8.352:TRINITY_DN10485_c0_g2"  "8.333:TRINITY_DN12962_c0_g2" 
##  [77] "8.323:TRINITY_DN15701_c0_g3"  "8.321:TRINITY_DN9025_c2_g1"  
##  [79] "8.292:TRINITY_DN16566_c0_g1"  "8.292:TRINITY_DN12634_c0_g6" 
##  [81] "8.29:TRINITY_DN11495_c1_g5"   "8.253:TRINITY_DN7877_c0_g2"  
##  [83] "8.223:TRINITY_DN11375_c0_g4"  "8.175:TRINITY_DN16221_c1_g2" 
##  [85] "8.148:TRINITY_DN11923_c0_g1"  "8.11:TRINITY_DN18336_c1_g2"  
##  [87] "8.07:TRINITY_DN9532_c1_g2"    "8.026:TRINITY_DN14568_c1_g3" 
##  [89] "8.024:TRINITY_DN491_c0_g1"    "7.997:TRINITY_DN17712_c1_g3" 
##  [91] "7.99:TRINITY_DN17472_c1_g2"   "7.974:TRINITY_DN9528_c0_g4"  
##  [93] "7.926:TRINITY_DN18524_c2_g3"  "7.9:TRINITY_DN13728_c0_g6"   
##  [95] "7.878:TRINITY_DN17799_c1_g5"  "7.823:TRINITY_DN19276_c4_g2" 
##  [97] "7.791:TRINITY_DN17399_c1_g1"  "7.774:TRINITY_DN11918_c0_g1" 
##  [99] "7.765:TRINITY_DN14890_c0_g3"  "7.764:TRINITY_DN10143_c1_g6"

## Theoretically that list should agree well with the top of the variance partition table vs. Condition
## assuming condition fits component 1.

## In contrast, since component 2 seems to be the best for splitting the data, here is component 2.
most_by_batch <- high_scores$highest[, "Comp.2"]
most_by_batch <- gsub(pattern="^.*:(.*$)", replacement="\\1", x=most_by_batch)
most_by_batch

##   [1] "TRINITY_DN18570_c1_g2" "TRINITY_DN7305_c1_g7" 
##   [3] "TRINITY_DN11697_c1_g1" "TRINITY_DN16210_c1_g5"
##   [5] "TRINITY_DN9078_c2_g1"  "TRINITY_DN11256_c0_g1"
##   [7] "TRINITY_DN10320_c1_g2" "TRINITY_DN9633_c1_g2" 
##   [9] "TRINITY_DN9548_c1_g4"  "TRINITY_DN17153_c1_g2"
##  [11] "TRINITY_DN12103_c0_g1" "TRINITY_DN18792_c1_g4"
##  [13] "TRINITY_DN10496_c1_g1" "TRINITY_DN14604_c1_g4"
##  [15] "TRINITY_DN18209_c0_g3" "TRINITY_DN7869_c0_g3" 
##  [17] "TRINITY_DN8181_c1_g1"  "TRINITY_DN7021_c0_g1" 
##  [19] "TRINITY_DN18394_c1_g2" "TRINITY_DN17811_c0_g5"
##  [21] "TRINITY_DN11151_c0_g3" "TRINITY_DN15959_c1_g2"
##  [23] "TRINITY_DN14297_c2_g1" "TRINITY_DN18698_c1_g1"
##  [25] "TRINITY_DN9582_c0_g3"  "TRINITY_DN12083_c1_g6"
##  [27] "TRINITY_DN15470_c0_g5" "TRINITY_DN8612_c1_g5" 
##  [29] "TRINITY_DN11811_c1_g1" "TRINITY_DN15599_c0_g8"
##  [31] "TRINITY_DN6713_c0_g4"  "TRINITY_DN17511_c1_g1"
##  [33] "TRINITY_DN17349_c0_g5" "TRINITY_DN12356_c0_g1"
##  [35] "TRINITY_DN18879_c2_g2" "TRINITY_DN11111_c0_g1"
##  [37] "TRINITY_DN9599_c0_g2"  "TRINITY_DN11427_c0_g5"
##  [39] "TRINITY_DN19771_c2_g1" "TRINITY_DN18486_c1_g2"
##  [41] "TRINITY_DN11934_c0_g7" "TRINITY_DN11001_c0_g7"
##  [43] "TRINITY_DN7548_c2_g1"  "TRINITY_DN7838_c0_g4" 
##  [45] "TRINITY_DN7838_c0_g9"  "TRINITY_DN6621_c2_g2" 
##  [47] "TRINITY_DN12218_c1_g8" "TRINITY_DN8201_c4_g3" 
##  [49] "TRINITY_DN14030_c0_g6" "TRINITY_DN7283_c1_g2" 
##  [51] "TRINITY_DN6823_c1_g5"  "TRINITY_DN9223_c1_g4" 
##  [53] "TRINITY_DN8181_c1_g2"  "TRINITY_DN14409_c1_g1"
##  [55] "TRINITY_DN11030_c0_g2" "TRINITY_DN10215_c1_g7"
##  [57] "TRINITY_DN15447_c1_g2" "TRINITY_DN16292_c0_g6"
##  [59] "TRINITY_DN16343_c0_g4" "TRINITY_DN9854_c1_g1" 
##  [61] "TRINITY_DN9128_c0_g5"  "TRINITY_DN12828_c0_g3"
##  [63] "TRINITY_DN11641_c0_g1" "TRINITY_DN9448_c2_g4" 
##  [65] "TRINITY_DN10128_c0_g1" "TRINITY_DN9396_c0_g6" 
##  [67] "TRINITY_DN10699_c0_g1" "TRINITY_DN12162_c0_g7"
##  [69] "TRINITY_DN8196_c1_g1"  "TRINITY_DN6666_c2_g1" 
##  [71] "TRINITY_DN15779_c0_g8" "TRINITY_DN13628_c0_g2"
##  [73] "TRINITY_DN11639_c0_g1" "TRINITY_DN8514_c0_g5" 
##  [75] "TRINITY_DN12923_c0_g2" "TRINITY_DN9187_c0_g2" 
##  [77] "TRINITY_DN12753_c2_g1" "TRINITY_DN10100_c0_g1"
##  [79] "TRINITY_DN6493_c3_g4"  "TRINITY_DN8669_c0_g5" 
##  [81] "TRINITY_DN6372_c0_g3"  "TRINITY_DN13516_c1_g6"
##  [83] "TRINITY_DN7705_c2_g6"  "TRINITY_DN18960_c1_g8"
##  [85] "TRINITY_DN8597_c0_g1"  "TRINITY_DN13972_c8_g1"
##  [87] "TRINITY_DN11641_c1_g1" "TRINITY_DN12766_c0_g2"
##  [89] "TRINITY_DN9367_c2_g7"  "TRINITY_DN13428_c1_g5"
##  [91] "TRINITY_DN10648_c1_g3" "TRINITY_DN9128_c0_g4" 
##  [93] "TRINITY_DN16438_c1_g9" "TRINITY_DN16292_c0_g5"
##  [95] "TRINITY_DN10054_c0_g5" "TRINITY_DN18739_c1_g6"
##  [97] "TRINITY_DN15354_c0_g3" "TRINITY_DN9001_c2_g5" 
##  [99] "TRINITY_DN6364_c3_g1"  "TRINITY_DN17213_c0_g1"

5.4 Conf filtered data

When we get to the conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.

5.4.1 k10 confidence filtering

k10_conf_raw_plots <- sm(graph_metrics(k10_conf))

k10_conf_norm <- sm(normalize_expt(k10_conf, transform="log2", norm="quant", convert="cpm"))
k10_conf_norm_plots <- sm(graph_metrics(k10_conf_norm))

k10_conf_normbatch <- sm(normalize_expt(k10_conf, batch="svaseq", transform="log2",
                                        convert="cpm", filter=TRUE))
k10_conf_normbatch_plots <- sm(graph_metrics(k10_conf_normbatch))

5.4.1.1 Look at some plots

Let us see how the k10 conf data appears

pp(file="images/30_k10conf_libsize.pdf", image=k10_conf_raw_plots$libsize)

## Wrote the image to: images/30_k10conf_libsize.pdf

k10conf_density <- sm(k10_conf_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))

pp(file="images/31_k10conf_density.pdf", image=k10conf_density)

## Wrote the image to: images/31_k10conf_density.pdf

pp(file="images/32_k10conf_boxplot.pdf", image=k10_conf_raw_plots$boxplot)

## Wrote the image to: images/32_k10conf_boxplot.pdf

pp(file="images/33_k10conf_corheat.pdf", image=k10_conf_norm_plots$corheat)

## Wrote the image to: images/33_k10conf_corheat.pdf

## This plot is a particularly nice example of why 1003/1008 are annoying.

pp(file="images/34_k10conf_smc.pdf", image=k10_conf_norm_plots$smc)

## Wrote the image to: images/34_k10conf_smc.pdf

pp(file="images/35_k10conf_disheat.pdf", image=k10_conf_norm_plots$disheat)

## Wrote the image to: images/35_k10conf_disheat.pdf

pp(file="images/36_k10conf_smd.pdf", image=k10_conf_norm_plots$smd)

## Wrote the image to: images/36_k10conf_smd.pdf

pp(file="images/37_k10conf_pca.pdf", image=k10_conf_norm_plots$pcaplot)

## Wrote the image to: images/37_k10conf_pca.pdf

## See if the additional information in the 'batch' factor made a difference

pp(file="images/38_k10conf_normbatch_pca.pdf", image=k10_conf_normbatch_plots$pcaplot)

## Wrote the image to: images/38_k10conf_normbatch_pca.pdf

5.4.2 k8 confidence filtering

k8_conf_raw_plots <- sm(graph_metrics(k8_conf))

k8_conf_norm <- sm(normalize_expt(k8_conf, transform="log2", norm="quant", convert="cpm"))
k8_conf_norm_plots <- sm(graph_metrics(k8_conf_norm))

k8_conf_normbatch <- sm(normalize_expt(k8_conf, batch="svaseq", transform="log2",
                                       convert="cpm", filter=TRUE))
k8_conf_normbatch_plots <- sm(graph_metrics(k8_conf_normbatch))

5.4.2.1 Look at some plots

Let us see how the k8 conf data appears

pp(file="images/39_k8conf_libsize.pdf", image=k8_conf_raw_plots$libsize)

## Wrote the image to: images/39_k8conf_libsize.pdf

k8_conf_density <- sm(k8_conf_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))

pp(file="images/40_k8conf_density.pdf", image=k8_conf_density)

## Wrote the image to: images/40_k8conf_density.pdf

pp(file="images/41_k8conf_boxplot.pdf", image=k8_conf_raw_plots$boxplot)

## Wrote the image to: images/41_k8conf_boxplot.pdf

pp(file="images/42_k8conf_corheat.pdf", image=k8_conf_norm_plots$corheat)

## Wrote the image to: images/42_k8conf_corheat.pdf

pp(file="images/43_k8conf_smc.pdf", image=k8_conf_norm_plots$smc)

## Wrote the image to: images/43_k8conf_smc.pdf

pp(file="images/44_k8conf_disheat.pdf", image=k8_conf_norm_plots$disheat)

## Wrote the image to: images/44_k8conf_disheat.pdf

pp(file="images/45_k8conf_smd.pdf", image=k8_conf_norm_plots$smd)

## Wrote the image to: images/45_k8conf_smd.pdf

pp(file="images/46_k8conf_pca.pdf", image=k8_conf_norm_plots$pcaplot)

## Wrote the image to: images/46_k8conf_pca.pdf

pp(file="images/47_k8conf_tsne.pdf", image=k8_conf_norm_plots$tsneplot)

## Wrote the image to: images/47_k8conf_tsne.pdf

## See if the additional information in the 'batch' factor made a difference

pp(file="images/48_k8conf_normbatch_pca.pdf", image=k8_conf_normbatch_plots$pcaplot)

## Wrote the image to: images/48_k8conf_normbatch_pca.pdf

pp(file="images/49_k8conf_normbatch_tsne.pdf", image=k8_conf_normbatch_plots$tsneplot)

## Wrote the image to: images/49_k8conf_normbatch_tsne.pdf

5.5 Count and Conf filtered data

When we get to the count_conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.

5.5.1 k10 both filters

k10_conf_count_raw_plots <- sm(graph_metrics(k10_conf_count))

k10_conf_count_norm <- sm(normalize_expt(k10_conf_count, transform="log2",
                                         norm="quant", convert="cpm"))
k10_conf_count_norm_plots <- sm(graph_metrics(k10_conf_count_norm))

k10_conf_count_normbatch <- sm(normalize_expt(k10_conf_count, batch="svaseq", transform="log2",
                                              convert="cpm", filter=TRUE))
k10_conf_count_normbatch_plots <- sm(graph_metrics(k10_conf_count_normbatch))

5.5.1.1 Look at some plots

Let us see how the k10 count_conf data appears

pp(file="images/50_k10confcount_libsize.pdf", image=k10_conf_count_raw_plots$libsize)

## Wrote the image to: images/50_k10confcount_libsize.pdf

k10confcount_density <- sm(k10_conf_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))

pp(file="images/51_k10confcount_density.pdf", image=k10confcount_density)

## Wrote the image to: images/51_k10confcount_density.pdf

pp(file="images/52_k10confcount_boxplot.pdf", image=k10_conf_count_raw_plots$boxplot)

## Wrote the image to: images/52_k10confcount_boxplot.pdf

pp(file="images/53_k10confcount_corheat.pdf", image=k10_conf_count_norm_plots$corheat)

## Wrote the image to: images/53_k10confcount_corheat.pdf

pp(file="images/54_k10confcount_smc.pdf", image=k10_conf_count_norm_plots$smc)

## Wrote the image to: images/54_k10confcount_smc.pdf

pp(file="images/55_k10confcount_disheat.pdf", image=k10_conf_count_norm_plots$disheat)

## Wrote the image to: images/55_k10confcount_disheat.pdf

pp(file="images/56_k10confcount_smd.pdf", image=k10_conf_count_norm_plots$smd)

## Wrote the image to: images/56_k10confcount_smd.pdf

pp(file="images/57_k10confcount_pca.pdf", image=k10_conf_count_norm_plots$pcaplot)

## Wrote the image to: images/57_k10confcount_pca.pdf

## See if the additional information in the 'batch' factor made a difference

pp(file="images/58_k10confcount_normbatch_pca.pdf", image=k10_conf_count_normbatch_plots$pcaplot)

## Wrote the image to: images/58_k10confcount_normbatch_pca.pdf

pp(file="images/59_k10confcount_tsne.pdf", image=k10_conf_count_normbatch_plots$tsneplot)

## Wrote the image to: images/59_k10confcount_tsne.pdf

5.5.2 k8 both filters

k8_conf_count_raw_plots <- sm(graph_metrics(k8_conf_count))

k8_conf_count_norm <- sm(normalize_expt(k8_conf_count, transform="log2",
                                        norm="quant", convert="cpm"))
k8_conf_count_norm_plots <- sm(graph_metrics(k8_conf_count_norm))

k8_conf_count_normbatch <- sm(normalize_expt(k8_conf_count, batch="svaseq",
                                             transform="log2", convert="cpm"))
k8_conf_count_normbatch_plots <- sm(graph_metrics(k8_conf_count_normbatch))

5.5.2.1 Look at some plots

Let us see how the k8 count_conf data appears

pp(file="images/60_k8confcount_libsize.pdf", image=k8_conf_count_raw_plots$libsize)

## Wrote the image to: images/60_k8confcount_libsize.pdf

k8confcount_density <- sm(k8_conf_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))

pp(file="images/61_k8confcount_density.pdf", image=k8confcount_density)

## Wrote the image to: images/61_k8confcount_density.pdf

pp(file="images/62_k8confcount_boxplot.pdf", image=k8_conf_count_raw_plots$boxplot)

## Wrote the image to: images/62_k8confcount_boxplot.pdf

pp(file="images/63_k8confcount_corheat.pdf", image=k8_conf_count_norm_plots$corheat)

## Wrote the image to: images/63_k8confcount_corheat.pdf

pp(file="images/64_k8confcount_smc.pdf", image=k8_conf_count_norm_plots$smc)

## Wrote the image to: images/64_k8confcount_smc.pdf

pp(file="images/65_k8confcount_disheat.pdf", image=k8_conf_count_norm_plots$disheat)

## Wrote the image to: images/65_k8confcount_disheat.pdf

pp(file="images/66_k8confcount_smd.pdf", image=k8_conf_count_norm_plots$smd)

## Wrote the image to: images/66_k8confcount_smd.pdf

pp(file="images/67_k8confcount_pca.pdf", image=k8_conf_count_norm_plots$pcaplot)

## Wrote the image to: images/67_k8confcount_pca.pdf

## See if the additional information in the 'batch' factor made a difference

pp(file="images/68_k8confcount_normbatch_pca.pdf", image=k8_conf_count_normbatch_plots$pcaplot)

## Wrote the image to: images/68_k8confcount_normbatch_pca.pdf

pp(file="images/69_k8confcount_tsne.pdf", image=k8_conf_count_normbatch_plots$tsneplot)

## Wrote the image to: images/69_k8confcount_tsne.pdf

5.5.3 Both filters, variance partition

Variance Partition is a neat tool to see what experimental factors have the most variance associated with them. It is slow, so I will use it on the smallest data set we have created (the confidence/count filtered).

k8_conf_count_varpart <- varpart(expt=k8_conf_count, predictor=NULL,
                                 factors=c("condition", "batch"),
                                 modify_expt=TRUE)

## Attempting mixed linear model with: ~  (1|condition) + (1|batch)

## Fitting the expressionset to the model, this is slow.

## Projected run time: ~ 0.4 min

## Placing factor: condition at the beginning of the model.

k8_conf_count <- k8_conf_count_varpart[["modified_expt"]]
k8_conf_count_varpart$percent_plot

## These genes are the most likely to be interesting for the difference between non/embryogenic
k8_conf_count_varpart$partition_plot

## Another fun thing to use is pca_information()
## testing <- pca_information(expt_data=k8_conf_count)

6 Make a nice stacked libsize barplot

tt <- sm(library(ggplot2))
maximum <- plot_libsize(k10_raw, text=FALSE)
maximum_alpha <- maximum$table
maximum_alpha$alpha <- alpha(maximum_alpha$colors, 0.4)
conf <- plot_libsize(k10_conf, text=FALSE)
conf_alpha <- conf$table
conf_alpha$alpha <- alpha(conf_alpha$colors, 0.6)
count <- plot_libsize(k10_count, text=FALSE)
count_alpha <- count$table
count_alpha$alpha <- alpha(count_alpha$colors, 0.8)
conf_count <- plot_libsize(k10_conf_count, text=FALSE)
conf_count_alpha <- conf_count$table
conf_count_alpha$alpha <- alpha(conf_count_alpha$colors, 1.0)

all <- rbind(maximum_alpha, conf_alpha,
             count_alpha, conf_count_alpha)
all <- as.data.frame(all)

## In this instance, they are stacked on top of 0, which I suspect is the goal.
columns <- ggplot(all, aes(x=id, y=sum)) +
  geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
  scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")

maximum <- plot_libsize(k8_raw, text=FALSE)
maximum_alpha <- maximum$table
maximum_alpha$alpha <- alpha(maximum_alpha$colors, 0.4)
conf <- plot_libsize(k8_conf, text=FALSE)
conf_alpha <- conf$table
conf_alpha$alpha <- alpha(conf_alpha$colors, 0.6)
count <- plot_libsize(k8_count, text=FALSE)
count_alpha <- count$table
count_alpha$alpha <- alpha(count_alpha$colors, 0.8)
conf_count <- plot_libsize(k8_conf_count, text=FALSE)
conf_count_alpha <- conf_count$table
conf_count_alpha$alpha <- alpha(conf_count_alpha$colors, 1.0)

all <- rbind(maximum_alpha, conf_alpha,
             count_alpha, conf_count_alpha)
all <- as.data.frame(all)

## In this instance, they are stacked on top of 0, which I suspect is the goal.
columns <- ggplot(all, aes(x=id, y=sum)) +
  geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
  scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")
pp(file="images/70_stacked_libsize_columns.pdf",image=columns)

## Wrote the image to: images/70_stacked_libsize_columns.pdf

7 Write up the favorites

Let us assume for a moment that we will choose a couple of the above for looking at more closely. I suspect k8_count_conf, k10_count_conf, and k8_count and k8_conf are good candidates.

output <- paste0("excel/k8_conf_count_by_gene-v", ver, ".xlsx")
#if (!file.exists(output)) {
k8_conf_count_written <- sm(write_expt(k8_conf_count, excel=output, violin=TRUE))

#}

output <- paste0("excel/k10_conf_count_by_gene-v", ver, ".xlsx")

#if (!file.exists(output)) {
k10_conf_count_written <- sm(write_expt(k10_conf_count, excel=output, violin=TRUE))

#}

output <- paste0("excel/k8_count_by_gene-v", ver, ".xlsx")
#if (!file.exists(output)) {
k8_count_written <- sm(write_expt(k8_count, excel=output, violin=TRUE))

#}

output <- paste0("excel/k8_conf_by_gene-v", ver, ".xlsx")
#if (!file.exists(output)) {
k8_conf_written <- sm(write_expt(k8_conf, excel=output, violin=TRUE))

#}

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  ## tmp <- sm(saveme(filename=this_save))
}

8 Differential Expression by gene, Solanum betaceum: 20180310

For the moment, I will assume we only want to deal with the 8 samples. However, I will try a count-filtered run, an annotation filtered run, and both.

8.1 Count filtered

keepers <- list(
  "embryo" = c("embryogenic", "nonembryogenic"))

k8_count_de <- sm(all_pairwise(k8_count))
k10_count_de <- sm(all_pairwise(k10_count))

8.1.1 Generate tables

k8_count_tables <- sm(combine_de_tables(
  k8_count_de, keepers=keepers, type="deseq",
  excel=paste0("excel/k8_count_by_gene_de-v", ver, ".xlsx"),
  abundant_excel=paste0("excel/k8_count_by_gene_abundant-v", ver, ".xlsx"),
  sig_excel=paste0("excel/k8_count_by_gene_sig-v", ver, ".xlsx")))
k10_count_tables <- sm(combine_de_tables(
  k10_count_de, keepers=keepers, type="deseq",
  excel=paste0("excel/k10_count_by_gene_de-v", ver, ".xlsx"),
  abundant_excel=paste0("excel/k10_count_by_gene_abundant-v", ver, ".xlsx"),
  sig_excel=paste0("excel/k10_count_by_gene_sig-v", ver, ".xlsx")))

k8k10_comparison <- compare_de_results(k8_count_tables, k10_count_tables)

extract_fasta_sequences <- function(k8_count_tables) {
    sig_table <- k8_count_tables[["significant"]][["deseq"]][["ups"]][[1]]
    subset_df <- sig_table[, c("transcriptid", "transcriptseq")]
    names <- subset_df[[1]]
    sequences <- subset_df[[2]]
    seq <- as.list(sequences)
    names(seq) <- names
    testing <- seqinr::write.fasta(seq, names, file.out="test_ups.fasta")

    sig_table <- k8_count_tables[["significant"]][["deseq"]][["downs"]][[1]]
    subset_df <- sig_table[, c("transcriptid", "transcriptseq")]
    names <- subset_df[[1]]
    sequences <- subset_df[[2]]
    seq <- as.list(sequences)
    names(seq) <- names
    testing <- seqinr::write.fasta(seq, names, file.out="test_downs.fasta")
}

8.2 Test for genes ‘strong in batch’

In the sample estimation worksheet, we created a list of genes putatively strongest in terms of the signal from the principal component associated with batch (Comp 2). Let us look at the differential expression data and see where these genes lie.

test_batch_idx <- rownames(k8_count_tables[["data"]][[1]]) %in% most_by_batch
test_batch <- k8_count_tables[["data"]][[1]][test_batch_idx, ]
strong_batch_de_idx <- abs(test_batch[["deseq_logfc"]] > 1) & test_batch[["deseq_adjp"]] <= 0.05
strong_batch_de <- test_batch[strong_batch_de_idx, ]
## The following genes are not to be trusted I think.
knitr::kable(strong_batch_de)

	geneid	transcriptid	blastxname	blastxqueryloc	blastxhitloc	blastxidentity	blastxevalue	blastxrecname	blastxtaxonomy	blastpname	blastpqueryloc	blastphitloc	blastpidentity	blastpevalue	blastprecname	blastptaxonomy	protid	protcoords	tmhmmexpaa	tmhmmhelices	tmhmmtopology	eggnogid	eggnogdescription	keggdata	geneontologyblast	transcriptseq	length	limma_logfc	limma_adjp	deseq_logfc	deseq_adjp	edger_logfc	edger_adjp	limma_ave	limma_t	limma_b	limma_p	deseq_basemean	deseq_lfcse	deseq_stat	deseq_p	edger_logcpm	edger_lr	edger_p	basic_nummed	basic_denmed	basic_numvar	basic_denvar	basic_logfc	basic_t	basic_p	basic_adjp	limma_adjp_fdr	deseq_adjp_fdr	edger_adjp_fdr	basic_adjp_fdr	lfc_meta	lfc_var	lfc_varbymed	p_meta	p_var
TRINITY_DN10128_c0_g1	TRINITY_DN10128_c0_g1	TRINITY_DN10128_c0_g1_i1				0.00	1						0.00	1					0.00	0						TTGAGATATAGTTGATTGATTGATTGGAACAGTTGAATTGCGAGTAAACTGATTACCGATATAGATAGATTTATTGAGATAGAGTTGATTGATCGATTGAAACAATTGAATTGCGAGTAACTGAGTGTAGATATAGATAGATTGGTTGAGATTGAGATATAGTTGATTGATTGAAAGTTGAAACAGTTGAATTGTGAGTAAACTGAATACAAATATAGATAGATTGGTTGAGATTGATATATAGTTGATTGATTGAAACAGTTGAATTGTGAGTAAACTGAATACAAATATAGATAGATTGGTTGAGATTGAGATATAGTTGATTGATTGAAACAGCTGAATTGTGAGTAAACTGAATACAGATATAGATAGATTTATTGAGATTGAGATACAGTTGATTGATTGATTGAAACAGTTGAATTGTGAGTAAACTGAATACAAATATAGATAGATTGGTTGAGATTGATATATAGTTGATTGATTGAAACAGTTGAATTGTGAGTAAACCGAATACAAATATAGATAGATTTATTGAGATTGAGATATAGATGATTGATTGATTGAAACAGTTGAATTGCGAGTAAACAAAATATCGATATAGATAGATTTATTGAGATTGAGATACAGTTGATTGATTGATTGAAACAGTTGACTTGTGAGTAAACTGAATA	671	2.144	0.1032	2.645	0.0323	2.456	0.0544	0.2185	-3.142	-2.332	0.0074	24.52	0.9289	2.848	0.0044	1.608	8.642	0.0033	0.5070	1.779	3.561e+00	1.093e+00	1.272	1.413	2.207e-01	5.090e-01	1.032e-01	6.695e-03	5.006e-03	2.847e-01	2.557	1.502e+00	5.876e-01	5.034e-03	4.572e-06
TRINITY_DN11001_c0_g7	TRINITY_DN11001_c0_g7	TRINITY_DN11001_c0_g7_i1				0.00	1						0.00	1			TRINITY_DN11001_c0_g7::TRINITY_DN11001_c0_g7_i1::g.168124::m.168124	86-1168[+]	0.00	0						GGTATCAACGCAGAGTACGGGGCTCACTCAAAAATCTCCGCCCTAATACCAAATCCCAATTCTTCAAGCTCTCTTTCTCTTATCAATGGCGACAATGGCTTCCACCTTGGCTTCCTCCGTGATTTCCAAGACAAAGTTCCTTGACACACACAAATCATCATCCTTGTATGGTGTGCCAATTTCCTCACAAGCTAGATTGAAAATCGTAAAATCGACTCACCAGAACATGTCTGTTTCTATGTCTGCTGATGCTTCTCCTCCTCCGCCTTACGATCTCGGAAGTTTCAGTTTTAATCCGATTAAGGAATCGATTGTTGCTCGGGAAATGACACGTAGGTACATGACGGATATGATCACTTATGCTGATACTGACGTCGTCGTTGTTGGTGCTGGATCTGCTGGTCTATCTTGTGCTTATGAGCTCAGCAAGAACCCTGATGTTCAGGTGGCCATCCTTGAGCAATCTGTGAGCCCTGGTGGAGGTGCTTGGCTAGGCGGACAACTCTTCTCAGCTATGATTGTGCGTAAGCCAGCACATCTTTTCTTGAACGAGCTAGGCATTGACTACGACGAGCAAGACAACTACGTGGTCGTCAAGCACGCTGCCTTGTTTACCTCGACCATCATGAGCAAGCTTTTGGCCAGGCCAAATGTGAAGCTCTTCAATGCTGTTGCAACAGAGGACCTTATCGTGAAGAACGGAAGAGTTGGTGGTGTTGTCACCAACTGGTCTTTGGTTTCCCAGAACCATGACACACAATCCTGCATGGACCCTAATGTTATGGAGGCTAAGGTTGTGGTTAGCTCCTGCGGCCACGACGGTCCCATGGGTGCCACTGGTGTTAAGAGGCTTAGGAGCATTGGCATGATTAACAGTGTTCCTGGAATGAAAGCTTTGGATATGAACACTGCTGAGGATGCAATTGTTAGACTTACCAGAGAAGTTGTACCTGGAATGATTATAACAGGGATGGAAGTTGCTGAAATTGATGGAGCACCAAGAATGGGTCCAACTTTTGGAGCTATGATGATATCAGGGCAGAAGGCTGCGCACCTTGCCCTACGGGCGTTGGGATTGCCCAACGCCCTTGACGGAACTGCAGAAACAAGCATCCACCCGGAGCTTATGTTGGCTGCAGCTGATGAAGCTGAAACTGCTGATGCTTAAATGTAATTATAGTTCCTAAGAAAGGATATGAGAAATATTAGTTTTTCCTAGATGAGCATGGTTGTGTGATGGAGATAAATTGGCTAGTTGGGGATGAATAAAAATCTGCGAGACCTTTGTCTGTTTCATGTCAATTCTTCTTTTACTTTTGATTTATTTTGGGACTTAGAAACTATATGTGAAAAAAGTGGTAAGAGTAAGACTTGTTTTAGTCGTTGAATAATCCTATGGGCCTCATTTTGGTACTTGATGTGGTTTGGGCCTCGCGGGAGAAGCTGTCTCTTATAC	1456	2.442	0.0992	2.884	0.0098	2.763	0.0469	0.8301	-3.175	-2.321	0.0069	47.12	0.8753	3.295	0.0010	2.490	9.033	0.0027	0.4770	2.438	5.537e+00	1.171e+00	1.961	1.636	1.735e-01	4.583e-01	9.918e-02	1.533e-03	4.058e-03	2.303e-01	2.800	1.451e+00	5.180e-01	3.529e-03	9.467e-06
TRINITY_DN17811_c0_g5	TRINITY_DN17811_c0_g5	TRINITY_DN17811_c0_g5_i1	AZF2_ARATH	866-375	67-187	32.12	0	Zinc finger protein AZF2;	Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis	AZF2_ARATH	128-353	67-242	30.87	0	Zinc finger protein AZF2;	Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis	TRINITY_DN17811_c0_g5::TRINITY_DN17811_c0_g5_i1::g.74924::m.74924	177-1247[-]	22.83	1	i12-34o	COG5048	Zinc finger protein	KEGG:ath:AT3G19580	GO:0005634,cellular_component,nucleus; GO:0003677,molecular_function,DNA binding; GO:0046872,molecular_function,metal ion binding; GO:0043565,molecular_function,sequence-specific DNA binding; GO:0003700,molecular_function,transcription factor activity, sequence-specific DNA binding; GO:0044212,molecular_function,transcription regulatory region DNA binding; GO:0009738,biological_process,abscisic acid-activated signaling pathway; GO:0009793,biological_process,embryo development ending in seed dormancy; GO:0042538,biological_process,hyperosmotic salinity response; GO:0045892,biological_process,negative regulation of transcription, DNA-templated; GO:0009737,biological_process,response to abscisic acid; GO:0010200,biological_process,response to chitin; GO:0009414,biological_process,response to water deprivation; GO:0006351,biological_process,transcription, DNA-templated	AAAGTAAACTACCAACAATACTTGATGAAAACAACAACAAAAAAAAAAAAAATTAATTTAGGAAACAAATTTATGTATATTTATGAACAACAATAATTCTAAAAATTCATTATCACAAATTTTTGTGACTAATTAATTGATCAATAAGTGTAAAAAAAAATGAAAAACTAAAATTATTAGTAATAGCAATCAACCAAAGGGGCTGCAGATAACACAAGACTTTGTTCAAATTTGTGATCATCTTCTGATGGTGCTGGTAAATTAAGATCAAGAGATAATATATTTCTTCTTGATTTTTCGTCGTGATGGGATGAAGATGATTCTGAAACATTATTGTTTGTAGTAGAGGTTGTCATCGCTGTAGGAGGTGGTCTGTGTCTTCTCATGTGTCCACCTAAAGCTTGTCCCGATGAAAACTCCGATCCACAAATTGAACATTCGTGGATTTTGGTCTTGATGTTATTAACAACTATTTTGTTTGAGAGAGATGTACTATTAATTTTATTGAGACGCCCTTGATCCTTATCGTTCTCTAGATGATGATGATGTGAATCATCATGACCGCCACCACTAGCAGTGCCAGTATTGATAGTGACAGATTTTTTCTCTTCCACTATGGTTTTGGGCTTTTTGTGACTTGCTCTGTGTCCACCAAGTGCTTGAAATGAAGGGAAAGTTCTATTGCAAGTTTTGCATTCATAGACATAAAAGCCAGCTTTTCCAGTTGTACTATTACTAGTGGTCATTTCTGCAAATTTCCTACTGCTGATTTTCTCCTTTTTGACTTCACTTATTACATCTTGTACTTTCTGGCCACATCCACTTTGTGCTAATAGAATTAAACAATTAGCCATATCTTCATCTTCCTCAGATGTGCTAGTTGTTGAAATTTGAGTAGTAGAAATGGTTGGTGAATATTGAATATTTGTTGACTTGTAAAAATCACCTCCGCTACCACCATCACCACCACCAGCATCATCGCTTCCACCAGCGATCGATGAGCTGGTGGTAGTGACAGTCGTCGTCCCCGCTATATTAAGTGGGGACGATGGTCGCGATCGCTTAGTGCGCTTTCCTTTGAAAGCGCATGAGCGATCGGTGAAGTTTTGACTCAAAACTTCCATGACTGCAGAAACAAGAAAAGAAGGCAAAACAAGTGACAAAGTGAACAAGAAGAAGAGTGATAGAAAGAAAGAATATAAAGAGAGAGAAAGAGTGATGTAATTATCCCGTACTCTGCGTTGATAC	1248	2.803	0.0092	2.995	0.0006	2.901	0.0006	0.4279	-5.071	1.004	0.0002	41.17	0.7266	4.122	0.0000	2.352	19.200	0.0000	0.3112	2.383	3.526e+00	4.157e+00	2.072	1.445	1.989e-01	4.854e-01	9.166e-03	5.997e-05	1.893e-05	2.597e-01	3.006	1.448e+00	4.815e-01	7.826e-05	8.777e-09
TRINITY_DN8597_c0_g1	TRINITY_DN8597_c0_g1	TRINITY_DN8597_c0_g1_i1	ISP4_SCHPO	22-939	455-755	40.07	0	Sexual differentiation process protein isp4;	Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae; Schizosaccharomyces	ISP4_SCHPO	8-313	455-755	40.07	0	Sexual differentiation process protein isp4;	Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae; Schizosaccharomyces	TRINITY_DN8597_c0_g1::TRINITY_DN8597_c0_g1_i1::g.50377::m.50377	1-1014[+]	163.33	8	o15-36i43-65o75-97i129-151o201-223i228-245o250-267i279-301o			KEGG:spo:SPBC29B5.02c	GO:0005783,cellular_component,endoplasmic reticulum; GO:0005789,cellular_component,endoplasmic reticulum membrane; GO:0005794,cellular_component,Golgi apparatus; GO:0016021,cellular_component,integral component of membrane; GO:0005886,cellular_component,plasma membrane; GO:1901584,molecular_function,tetrapeptide transmembrane transporter activity; GO:0015031,biological_process,protein transport; GO:1901583,biological_process,tetrapeptide transmembrane transport	GTATCAACGCAGAGTACGGGGATGAAGAGTTACAAGCAAGTCCCTCAGTGGTGGTTCCTCGCATTATTGGTAGGTAGCATAGCCCTATCACTTGTGATGTGTTTTGTGTGGAAAGAAGACGTGCAGCTGCCGTGGTGGGGCATGTTGTTTGCGTTTGGTCTTGCTTTTATTGTTACTCTCCCTATAGGGGTCATTCAAGCGACTACCAATCAGCAACCTGGATATGACATAATTGCACAGTTCATAATTGGTTATATCCTCCCAGGGAAACCAATCGCGAACTTGCTTTTCAAAATATATGGCCGGACAAGTACTGTTCATGCTCTCTCCTTTTTAGCCGATCTTAAACTTGGTCACTACATGAAAATTCCGCCACGATGCATGTACACAGCTCAGCTCGTGGGGACACTGGTTGCTGGTACAATCAACCTTGCTGTAGCATGGTGGATGTTAGGAAGCATCGAGAACATTTGTGACGTTGAGACTCTTCATCCGGACAGCCCATGGACCTGTCCTAAATTCCGAGTGACTTTTGATGCATCTGTCATATGGGGTCTGATTGGACCACAACGGTTATTTGGTCCCGGAGGATTATACAGGAACTTGGTATGGTTATTCCTTATAGGTGCATTGCTACCGGTGCCTATTTGGGTGTTAAGCAAAATCTTCCCGGAAAAGAAATGGATCCCATTGATCAACATACCTGTTATATCATATGGTTTCGCGGGAATGCCACCAGCCACACCAACGAATATTGCTAGCTGGCTTATAACGGGTATGATATTCAACTATTTCGTGTTCAAATACAGAAAAGAATGGTGGAAGAAATACAACTATGTTCTGTCAGCCGCGTTGGATGCTGGAACAGCTTTTATGGGTGTTTTATTGTTTTTCGCGTTGCAAAACGAGGGTAAAAACTTGAAATGGTGGGGAACAGAGTTAGACCATTGTCCCTTAGCAACTTGTCCAACAGCCCCTGGGATCATAGTGGAAGGATGTCCAGTTTTCAAGTAGGCCTAACAACAAGAATGAATGGTTCTAAGCATTGTTGATGTAATTTTGATAGTTTTTTTGTGTTGTTCAACTAAATATATTTTGTTATGAGAGTTTTCTGATCTTGTTTGCAGATGAGCTAATGAAAGACAAAATATTGTGAATTGAAATTGTGATTAACTTTTTACTAAATCAAAATATTGTTAAT	1201	2.184	0.2483	2.724	0.0492	2.533	0.1689	0.4656	-2.341	-3.715	0.0350	35.02	1.0200	2.669	0.0076	2.218	5.484	0.0192	0.0566	2.195	6.038e+00	1.931e+00	2.139	1.111	3.196e-01	5.966e-01	2.482e-01	1.145e-02	2.761e-02	3.918e-01	2.607	1.515e+00	5.813e-01	2.060e-02	1.891e-04
TRINITY_DN9223_c1_g4	TRINITY_DN9223_c1_g4	TRINITY_DN9223_c1_g4_i1	EB1C_ARATH	118-1104	1-323	62.46	0	Microtubule-associated protein RP/EB family member 1C;	Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis	EB1C_ARATH	40-368	1-323	62.46	0	Microtubule-associated protein RP/EB family member 1C;	Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis	TRINITY_DN9223_c1_g4::TRINITY_DN9223_c1_g4_i1::g.67832::m.67832	1-1119[+]	0.00	0		COG5217	microtubule-associated protein RP EB family member	KEGG:ath:AT5G67270; KO:K10436	GO:0005618,cellular_component,cell wall; GO:0005874,cellular_component,microtubule; GO:0005730,cellular_component,nucleolus; GO:0005634,cellular_component,nucleus; GO:0009524,cellular_component,phragmoplast; GO:0005819,cellular_component,spindle; GO:0008017,molecular_function,microtubule binding; GO:0051301,biological_process,cell division; GO:0030865,biological_process,cortical cytoskeleton organization; GO:0007067,biological_process,mitotic nuclear division; GO:0009652,biological_process,thigmotropism	GGTATCAACGCAGAGTACGGGGAAACCCTCATTTCAAACTCTCTTTCTCTCTCCGTCCGCCCTCTCTCTCTCTCTCTTTCTACAGAGCCTTCTAGAGATCTTCTCCGAAGAGGAGAAATGGCAGCACACATAGGAATGATGGATGGGGCATACTTCGTTGGCAGATCGGAAATCCTGGCTTGGATCAATTCCACACTTCATCTTAATCTTTCCAAAGTCGAAGAGGCGTGTACTGGCGCGGTTCATTGCCAGCTAATGGATGCGGCTCACCCAGGGCTAGTCCCTATGCACAAGGTCAATTTTGATGCGAAGAATGAATATGAGATGATCCAGAACTATAAGGTTCTTCAGGATGTATTCACCAAACTCAAGATAACCAAGCACATTGAGGTCAGCAAACTGGTGAAAGGAAGACCTCTTGACAACCTGGAGTTCATGCAATGGATGAAGAGATACTGCGATTCTGTTAATGGAGGCAATATTCACAGTTACAATCCTCTTGAAAGAAGGGAAGGCTGCAAGGGTGCCAAAGAACTGAACAAAAGATCTGCCCCTTCACAAAATGCTACCAAAAATGCCTCAACTGCTTCAAAGCATGTCACCCATAACACTAGAAGAATTGATGTACCTCATGTAAGCAGCACAAGTCAATCTGGTAAGATCTCCAGGCCTTCTTCAAGTGGAGGAGCGTCAACCTATTCTGAAACAGAACGTACAGCACATGAACAACAGATTACGGAGTTGAAGTTATCAGTGGATAGTCTTGAAAAGGAAAGGGACTTCTACTTTTCAAAGTTGAGGGACATTGAAATCCTGTGCCAATGTCCAGAGATTGAAAACCTACCTGTGGTTGAAGCAGTGAAGAGGATTCTGTATGCTATGGATGATGATGCATCACTGGTAGATACTGAAGCCATGATATCTGAGCAACACCAGCAAGTAGAGACATTGAGTTGCATATCTGAAGAGGCAGAAGAAAGGCTGAGGGTAGACACTCAAAAGAGAAAAAACATTGTCAATGTTGACGTTGACCATGCTGCCAGCAACACATTGTCTCCAAGGCAAAGGATGGCTGATGCTTCTGATGTCCATTGCAGTGGATCACTTGTGACATATTGATCAACCTTGTCTCCTGGATAAGTAACTTGTTCTATCTGGGTTGTGGAATAATTCCTGCTACCGTCTTTGTTACTCTGTAGTTAGTTATATGTTGTCAGAGCATACTTTATTTCTAAGCAATACCGTTTGAGGAATTCCATCAGTATAGAAGTAGTGAATATATGAACCTATCTGGTTTGTCTGATTGTCGGTGTACAACATTGCCCGATGTTATTGCTGTTGTATACTGCATTGTAGTGCAAGTCTTGGATTATGATGATTCCTGTGGTTTAATGGATACTAAAGTTGGTATTGGTATTGGTTTTGATTTTAGAAAATATCTTGAAAACAAACATAAGGTAAATTGACAGTCTCCACACAACTTAGT	1486	1.297	0.4481	2.497	0.0432	2.374	0.2114	1.0040	-1.681	-4.864	0.1156	47.83	0.9164	2.725	0.0064	2.462	4.793	0.0286	1.2260	2.853	6.885e+00	6.604e-01	1.626	1.249	2.873e-01	5.698e-01	4.480e-01	9.701e-03	4.023e-02	3.577e-01	2.162	1.865e+00	8.627e-01	5.020e-02	3.330e-03
TRINITY_DN9367_c2_g7	TRINITY_DN9367_c2_g7	TRINITY_DN9367_c2_g7_i1				0.00	1						0.00	1			TRINITY_DN9367_c2_g7::TRINITY_DN9367_c2_g7_i1::g.195987::m.195987	1-459[+]	30.82	1	i21-38o					GGTATCAACGCAGAGTACGGGACAAACAAATTCATCATACAAAACAATATGGCTTCTAATACTTTTTCTATTCTCATTCTTGTTGTGTCTTTATTTTTGTCTCAGTTCTCATTTTCTTTTGGTGATGGTCCCATACCAGTTGTTCACCTCATTAGCAACTTGACTCCTGGTCATACACCAACTATGCAAGTTACTTGTAAATTACAACATTGGCTCCCTTTGCTTAGTGTAACTTTAGTAATAGGCCAAGATTTTCCATTTGATGCTGGTATTATCAATGATATATATGTATGTTATGCTGAATGGGGTGTATTTTATGCTTGGTTTAATGCATATGATCCACTTAAGGAAGACAAAGGCCAACCTGTTGTCTATTTTTCTTTTCGAGATCGCGGTGTTTATAAGAGTTACGATAAAGCTACTTGGACTGGAGTTACAGATTGGAACAACGGACACTAGCTTTTCGTATTTTTATCATGTCAAGATCGAATTTCGAATTTTTTCTTTATGATAAAAATAAAATTATATAATTTGTGTGTGTGAGTCAAGGTGACGGATGATCTGAATGGGCTTCTTAATTCCCCCGGCAATTGTTTGTTTTTGTCTAGGGTTTTCTTGTTATTTTATTGTCTCTTTGATTCACATTAGTATATTGGTTTTATACTACTTTTTTTTTTCATCTACATTTTTGTTATTGTTCTAATTTATGTGGTTAAAAGGCAACCAAATTGAAAAACTTATATAAAATGCATCAACTTTATTTATACGGTAAAAGTTATCATTAACTAGTATTTATAT	798	2.340	0.1089	2.939	0.0089	2.811	0.0436	0.7559	-3.092	-2.448	0.0082	37.64	0.8836	3.326	0.0009	2.170	9.213	0.0024	0.5538	2.600	4.703e+00	3.898e-01	2.047	1.870	1.451e-01	4.256e-01	1.089e-01	1.370e-03	3.685e-03	1.965e-01	2.786	1.499e+00	5.382e-01	3.821e-03	1.484e-05

8.3 Confidence filtered

k8_conf_de <- sm(all_pairwise(k8_conf, parallel=FALSE))

k8_conf_tables <- sm(combine_de_tables(
  k8_conf_de, keepers=keepers,
  excel=paste0("excel/k8_conf_de-v", ver, ".xlsx"),
  sig_excel=paste0("excel/k8_conf_sig-v", ver, ".xlsx")))

8.4 Both

k8_conf_count_de <- sm(all_pairwise(k8_conf_count, parallel=FALSE))

k8_conf_count_tables <- sm(combine_de_tables(
  k8_conf_count_de, keepers=keepers,
  excel=paste0("excel/k8_conf_count_de-v", ver, ".xlsx"),
  sig_excel=paste0("excel/k8_conf_count_sig-v", ver, ".xlsx")))

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}

9 Gene Ontology Searches using Solanum betaceum trinotate GO associations: 20180310

9.1 Searching with goseq

I want to search the set of up/down genes using goseq, since they changed how they accept annotations, I am changing the goseq search to match the new methods. I think they will be better but am not yet certain.

sig_tables <- k8_count_tables[["significant"]]
sigups <- sig_tables[["deseq"]][["ups"]][[1]]
sigdowns <- sig_tables[["deseq"]][["downs"]][[1]]

putative_ontologies <- load_trinotate_go("reference/trinotate.csv")

## Warning: Expected 3 pieces. Additional pieces discarded in 83811 rows [4,
## 5, 6, 11, 18, 19, 22, 29, 40, 83, 84, 87, 90, 91, 98, 99, 100, 103, 104,
## 105, ...].

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 182246 rows
## [1, 2, 3, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 20, 21, 23, 24, 25, 26,
## 27, ...].

length_db <- putative_ontologies[["length_table"]]
go_db <- putative_ontologies[["go_table"]]
head(go_db)

##                     ID           GO
## 1: TRINITY_DN614_c0_g1 character(0)
## 2: TRINITY_DN654_c0_g1 character(0)
## 3: TRINITY_DN609_c0_g1 character(0)
## 4: TRINITY_DN609_c0_g1 c(GO:0005525
## 5: TRINITY_DN694_c0_g1 c(GO:0005737
## 6: TRINITY_DN694_c0_g1 c(GO:0005737

goseq_up <- sm(simple_goseq(
  sig_genes=sigups, go_db=go_db, length_db=length_db,
  excel=paste0("excel/goseq_deseq_count_sigup-v", ver, ".xlsx")))
goseq_down <- sm(simple_goseq(
  sig_genes=sigdowns, go_db=go_db, length_db=length_db,
  excel=paste0("excel/goseq_deseq_count_sigdown-v", ver, ".xlsx")))

topgo_up <- simple_topgo(
  sig_genes=sigups, go_db=go_db,
  excel=paste0("excel/topgo_deseq_count_sigup-v", ver, ".xlsx"))

## simple_topgo(): Set densities=TRUE for ontology density plots.

## Writing data to: excel/topgo_deseq_count_sigup-v20180310.xlsx.

topgo_down <- simple_topgo(
  sig_genes=sigdowns, go_db=go_db,
  excel=paste0("excel/topgo_deseq_count_sigdown-v", ver, ".xlsx"))

## simple_topgo(): Set densities=TRUE for ontology density plots.

## Writing data to: excel/topgo_deseq_count_sigdown-v20180310.xlsx.

9.2 Search using the top/bottom n genes

Sandra also wants to see what groups are represented by the top-n genes in Embryogenic/NonEmbryogenic. The likely subsets for this were made when we did write_expt() in in 02_sample_estimation.html.

median_by_cond <- k8_count_written[["medians"]]

top_1k_embryo_idx <- head(order(median_by_cond[["embryogenic"]], decreasing=FALSE), n=1000)
top_1k_embryo_genes <- rownames(median_by_cond)[top_1k_embryo_idx]
top_1k_embryo <- k8_count_tables[["data"]][[1]][top_1k_embryo_genes, ]

bottom_1k_embryo_idx <- head(order(median_by_cond[["embryogenic"]], decreasing=TRUE), n=1000)
bottom_1k_embryo_genes <- rownames(median_by_cond)[bottom_1k_embryo_idx]
bottom_1k_embryo <- k8_count_tables[["data"]][[1]][bottom_1k_embryo_genes, ]

top_1k_nonembryo_idx <- head(order(median_by_cond[["nonembryogenic"]], decreasing=FALSE), n=1000)
top_1k_nonembryo_genes <- rownames(median_by_cond)[top_1k_nonembryo_idx]
top_1k_nonembryo <- k8_count_tables[["data"]][[1]][top_1k_nonembryo_genes, ]

bottom_1k_nonembryo_idx <- head(order(median_by_cond[["nonembryogenic"]], decreasing=TRUE), n=1000)
bottom_1k_nonembryo_genes <- rownames(median_by_cond)[bottom_1k_nonembryo_idx]
bottom_1k_nonembryo <- k8_count_tables[["data"]][[1]][bottom_1k_nonembryo_genes, ]

top1k_embryo_goseq <- sm(simple_goseq(
  sig_genes=top_1k_embryo, go_db=go_db, length_db=length_db,
  excel=paste0("excel/goseq_top1k_embryogenic-v", ver, ".xlsx")))
bottom1k_embryo_goseq <- sm(simple_goseq(
  sig_genes=bottom_1k_embryo, go_db=go_db, length_db=length_db,
  excel=paste0("excel/goseq_bottom1k_embryogenic-v", ver, ".xlsx")))

top1k_nonembryo_goseq <- sm(simple_goseq(
  sig_genes=top_1k_nonembryo, go_db=go_db, length_db=length_db,
  excel=paste0("excel/goseq_top1k_nonembryo-v", ver, ".xlsx")))

bottom1k_nonembryo_goseq <- sm(simple_goseq(
  sig_genes=bottom_1k_nonembryo, go_db=go_db, length_db=length_db,
  excel=paste0("excel/goseq_bottom1k_nonembryo-v", ver, ".xlsx")))

neg_ctrl <- random_ontology(input=k8_count, n=1000, go_db=go_db, length_db=length_db,
                            excel=paste0("excel/goseq_random-v", ver, ".xlsx"))

## Using the ID column from your table rather than the row names.

## Found 1000 genes out of 1000 from the sig_genes in the go_db.

## Found 1000 genes out of 1000 from the sig_genes in the length_db.

## Warning in pcls(G): initial point very close to some inequality constraints

## Using manually entered categories.

## Calculating the p-values...

## 'select()' returned 1:1 mapping between keys and columns

## simple_goseq(): Calculating q-values

## Using GO.db to extract terms and categories.

## simple_goseq(): Filling godata with terms, this is slow.

## Testing that go categories are defined.

## Removing undefined categories.

## Gathering synonyms.

## Gathering category definitions.

## simple_goseq(): Making pvalue plots for the ontologies.

## Writing data to: excel/goseq_random-v20180310.xlsx.

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}

10 Transcript specific data

The following sections do not use the gene->transcript maps.

11 Annotation version: 20180310

Any annotation data will need to come from our trinotate information… An interesting caveat, kallisto uses a transcript database. tximport is the library used to read the counts/transcript into R, it is able to take a set oF mappings between gene/transcript in order to provide gene counts rather than transcript.

putative_annotations <- load_trinotate_annotations("reference/trinotate.csv")
annot_by_tx <- putative_annotations
rownames(annot_by_tx) <- make.names(putative_annotations[["transcript_id"]], unique=TRUE)

## I split the annotations into one set keyed by gene IDs and one by transcript IDs.
## This way it is possible to make an expressionset of transcripts or genes by
## removing/keeping the tx_gene_map argument.

## Why would you want to do this? you might ask...  Well, sadly the annotation database
## Does not always have filled-in data for the first transcript for a given gene ID,
## therefore the annotation information for those genes will be lost when we merge
## the count tables / annotation tables by gene ID.
## This is of course also true for transcript IDs, but for a vastly smaller number of them.
transcript_gene_map <- putative_annotations[, c("transcript_id", "gene_id")]
k10tx_raw <- create_expt(metadata="sample_sheets/all_samples.xlsx",
                         gene_info=annot_by_tx)

## Reading the sample metadata.

## The sample definitions comprises: 10, 7 rows, columns.

## Reading count tables.

## Finished reading count tables.

## Matched 234330 annotations and counts.

## Bringing together the count matrix and gene information.

putative_annotations <- fData(k10tx_raw)  ## A cheap way to get back the unique gene IDs.
dim(exprs(k10tx_raw))

## [1] 234330     10

## On further examination, we decided to remove both samples 1003 and 1008
## which Sandra pointed out are in fact from the same flow cytometry isolation.
k8tx_raw <- create_expt(metadata="sample_sheets/kept_samples.xlsx",
                        gene_info=annot_by_tx)

## Reading the sample metadata.

## The sample definitions comprises: 8, 7 rows, columns.

## Reading count tables.

## Finished reading count tables.

## Matched 234330 annotations and counts.

## Bringing together the count matrix and gene information.

dim(exprs(k8tx_raw))

## [1] 234330      8

## If we do not use the tx_gene_map information, we get 234,330 transcripts in the data set.
## Instead, we get 58,486 genes.

## Perform a count/gene cutoff here
threshold <- 4
method <- "cbcb"  ## other methods include 'genefilter' and 'simple'
k10tx_count <- sm(normalize_expt(k10tx_raw, filter=method, thresh=threshold))
dim(exprs(k10tx_count))

## [1] 42143    10

k8tx_count <- sm(normalize_expt(k8tx_raw, filter=method, thresh=threshold))
dim(exprs(k8tx_count))

## [1] 38571     8

## First get rid of the rRNA
sb_rrna <- putative_annotations[["rrna_subunit"]] != ""
rrna_droppers <- rownames(putative_annotations)[sb_rrna]
k10tx_norrna <- exclude_genes_expt(expt=k10tx_raw, ids=rrna_droppers)

## Before removal, there were 234330 entries.

## Now there are 234309 entries.

## Percent kept: 99.990, 99.988, 99.991, 99.993, 99.992, 99.989, 99.989, 99.987, 99.993, 99.995

## Percent removed: 0.010, 0.012, 0.009, 0.007, 0.008, 0.011, 0.011, 0.013, 0.007, 0.005

k8tx_norrna <- exclude_genes_expt(expt=k8tx_raw, ids=rrna_droppers)

## Before removal, there were 234330 entries.

## Now there are 234309 entries.

## Percent kept: 99.990, 99.988, 99.991, 99.992, 99.989, 99.989, 99.987, 99.995

## Percent removed: 0.010, 0.012, 0.009, 0.008, 0.011, 0.011, 0.013, 0.005

## Keep only the blastx hits with a low e-value and high identity.
filtered_keepers <- putative_annotations[["blastx_evalue"]] <= 1e-10 &
  putative_annotations[["blastx_identity"]] >= 60
keepers <- rownames(putative_annotations)[filtered_keepers]
## This excludes all but 33,451 transcripts.

## Now keep only those transcript IDs which fall into the above categories.
k10tx_conf <- exclude_genes_expt(expt=k10tx_norrna, ids=keepers, method="keep")

## Before removal, there were 234309 entries.

## Now there are 33451 entries.

## Percent kept: 25.523, 25.012, 26.033, 25.603, 26.420, 26.603, 27.573, 24.233, 24.821, 27.304

## Percent removed: 74.477, 74.988, 73.967, 74.397, 73.580, 73.397, 72.427, 75.767, 75.179, 72.696

dim(exprs(k10tx_conf))

## [1] 33451    10

k8tx_conf <- exclude_genes_expt(expt=k8tx_norrna, ids=keepers, method="keep")

## Before removal, there were 234309 entries.

## Now there are 33451 entries.

## Percent kept: 25.523, 25.012, 26.033, 26.420, 26.603, 27.573, 24.233, 27.304

## Percent removed: 74.477, 74.988, 73.967, 73.580, 73.397, 72.427, 75.767, 72.696

dim(exprs(k8tx_conf))

## [1] 33451     8

## Perform both the confidence and count filtration
k10tx_conf_count <- sm(normalize_expt(k10tx_conf, filter=method, thresh=threshold))
dim(exprs(k10tx_conf_count))

## [1] 18018    10

k8tx_conf_count <- sm(normalize_expt(k8tx_conf, filter=method, thresh=threshold))
dim(exprs(k8tx_conf_count))

## [1] 16826     8

rrnatx_expt <- exclude_genes_expt(expt=k10tx_raw, ids=sb_rrna, method="keep")

## Before removal, there were 234330 entries.

## Now there are 21 entries.

## Percent kept: 0.010, 0.012, 0.009, 0.007, 0.008, 0.011, 0.011, 0.013, 0.007, 0.005

## Percent removed: 99.990, 99.988, 99.991, 99.993, 99.992, 99.989, 99.989, 99.987, 99.993, 99.995

## This has only 21 putative rRNA transcripts.

12 Plot changes in number of putative genes

genes_by_subset <- data.frame(
  samples = c(rep("ten", 4), rep("eight", 4)),
  colors = c(rep("#1B9E77", 4), rep("#7570B3", 4)),
  alpha = c("#1B9E77FF", "#1B9E77CC", "#1B0E77AA",
            "#1B0E7777", "#7570B3FF", "#7570B3CC", "#7570B3AA", "#7570B377"),
  type = rep(c("raw", "count", "conf", "conf_count"), 2),
  genes = c(nrow(exprs(k10tx_raw)), nrow(exprs(k10tx_count)),
            nrow(exprs(k10tx_conf)), nrow(exprs(k10tx_conf_count)),
            nrow(exprs(k8tx_raw)), nrow(exprs(k8tx_count)),
            nrow(exprs(k8tx_conf)), nrow(exprs(k8tx_conf_count))))

library(ggplot2)
columns <- ggplot(genes_by_subset, aes(x=samples, y=genes)) +
  geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
  scale_fill_manual(values=c(levels(as.factor(genes_by_subset$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")
columns

c1 <- genes_by_subset[["samples"]] == "ten"
c2 <- genes_by_subset[["samples"]] == "eight"
tmp <- cbind(genes_by_subset[c1, "genes"], genes_by_subset[c2, "genes"])
colnames(tmp) <- c("ten", "eight")
rownames(tmp) <- c("raw", "count", "conf", "conf_count")
tmp

##               ten  eight
## raw        234330 234330
## count       42143  38571
## conf        33451  33451
## conf_count  18018  16826

13 Try to understand how similar bowtie2->htseq counts are to kallisto for this data.

We have suspiciously low mapping numbers for the reads against our denovo transcriptome. Therefore I tried a bowtie2 alignment of the reads against the transcriptome and counted with htseq. I was rather shocked to see that htseq excluded a huge majority of the reads, the following block shows just how poorly that data matches the others.

bowtie_data <- read.table(file="preprocessing/hpgl1000/outputs/bowtie2_solanum_betaceum/hpgl1000.count")
colnames(bowtie_data) <- c("hpgl1000id", "count")
bowtie_comp <- merge(exprs(k8tx_raw), bowtie_data, by.x="row.names", by.y="hpgl1000id")
rownames(bowtie_comp) <- bowtie_comp[["Row.names"]]
bowtie_comp <- bowtie_comp[, -1]
bowtie_sad <- plot_corheat(expt_data=bowtie_comp)

bowtie_sad$plot

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}

14 Sample Estimation version: 20180310

Since I used kallisto for this work, I want to figure out tximport.

In 01_annotation.Rmd, I decided to open up a few lines of inquiry:

k10_raw: All samples using the full set of putative annotations.
k8_raw: Samples excluding hpgl1003 with the full set of annotations.
k10_count: All samples, removing those with < ‘threshold (5)’ counts/gene.
k8_count: 8 samples, ibid.
k10_conf: All samples, but removing transcripts with annotations less confident than 1e-10 blastx e-value and less identity than 60 percent to some other known gene.
k8_conf: 8 samples, ibid.
k10_count_conf: A combination of the count filter and confidence filter.
k8_count_conf: ibid, but with the 8 samples.

15 Examine these data sets

15.1 Raw data

Start with the full data set. This will take a rather long time to perform all the graphs, and since we are not likely to use it, I will limit myself to just plotting 1-2 graphs.

15.1.1 k10

k10_libsize <- plot_libsize(k10tx_raw)
k10_libsize$plot

k10_density <- sm(plot_density(k10tx_raw))
sm(k10_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))

k10_boxplot <- sm(plot_boxplot(k10tx_raw))
k10_boxplot

k10_corheat <- plot_corheat(k10tx_raw)

k10_tsne <- plot_tsne(k10tx_raw)
k10_tsne$plot

15.1.2 k8

k8_libsize <- plot_libsize(k8tx_raw)
k8_libsize$plot

k8_density <- sm(plot_density(k8tx_raw))
sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))

k8_boxplot <- sm(plot_boxplot(k8tx_raw))
k8_boxplot

k8_corheat <- plot_corheat(k8tx_raw)

k8_tsne <- plot_tsne(k8tx_raw)
k8_tsne$plot

15.2 Count filtered data

When we get to the count filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.

15.2.1 k10

k10tx_count_raw_plots <- sm(graph_metrics(k10tx_count))

k10tx_count_norm <- sm(normalize_expt(k10tx_count, transform="log2", norm="quant", convert="cpm"))
k10tx_count_norm_plots <- sm(graph_metrics(k10tx_count_norm))

k10tx_count_normbatch <- sm(normalize_expt(k10tx_count, batch="svaseq", transform="log2", convert="cpm"))
k10tx_count_normbatch_plots <- sm(graph_metrics(k10tx_count_normbatch))

15.2.1.1 Look at some plots

Let us see how the k10 count data appears

k10tx_count_raw_plots$libsize

sm(k10tx_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,200)))

k10tx_count_raw_plots$boxplot

k10tx_count_norm_plots$corheat

k10tx_count_norm_plots$smc

k10tx_count_norm_plots$disheat

k10tx_count_norm_plots$smd

k10tx_count_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k10tx_count_normbatch_plots$pcaplot

15.2.2 k8

k8tx_count_raw_plots <- sm(graph_metrics(k8tx_count))

k8tx_count_norm <- sm(normalize_expt(k8tx_count, transform="log2", norm="quant", convert="cpm"))
k8tx_count_norm_plots <- sm(graph_metrics(k8tx_count_norm))

k8tx_count_normbatch <- sm(normalize_expt(k8tx_count, batch="svaseq", transform="log2", convert="cpm"))
k8tx_count_normbatch_plots <- sm(graph_metrics(k8tx_count_normbatch))

15.2.2.1 Look at some plots

Let us see how the k8 count data appears

k8tx_count_raw_plots$libsize

sm(k8tx_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,200)))

k8tx_count_raw_plots$boxplot

k8tx_count_norm_plots$corheat

k8tx_count_norm_plots$smc

k8tx_count_norm_plots$disheat

k8tx_count_norm_plots$smd

k8tx_count_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k8tx_count_normbatch_plots$pcaplot

k8tx_count_normbatch_plots$tsneplot

15.3 Conf filtered data

When we get to the conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.

15.3.1 k10

k10tx_conf_raw_plots <- sm(graph_metrics(k10tx_conf))

k10tx_conf_norm <- sm(normalize_expt(k10tx_conf, transform="log2", norm="quant", convert="cpm"))
k10tx_conf_norm_plots <- sm(graph_metrics(k10tx_conf_norm))

k10tx_conf_normbatch <- sm(normalize_expt(k10tx_conf, batch="svaseq", transform="log2",
                                          convert="cpm", filter=TRUE))
k10tx_conf_normbatch_plots <- sm(graph_metrics(k10tx_conf_normbatch))

15.3.1.1 Look at some plots

Let us see how the k10 conf data appears

k10tx_conf_raw_plots$libsize

sm(k10tx_conf_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))

k10tx_conf_raw_plots$boxplot

k10tx_conf_norm_plots$corheat

k10tx_conf_norm_plots$smc

k10tx_conf_norm_plots$disheat

k10tx_conf_norm_plots$smd

k10tx_conf_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k10tx_conf_normbatch_plots$pcaplot

15.3.2 k8

k8tx_conf_raw_plots <- sm(graph_metrics(k8tx_conf))

k8tx_conf_norm <- sm(normalize_expt(k8tx_conf, transform="log2", norm="quant", convert="cpm"))
k8tx_conf_norm_plots <- sm(graph_metrics(k8tx_conf_norm))

k8tx_conf_normbatch <- sm(normalize_expt(k8tx_conf, batch="svaseq", transform="log2",
                                       convert="cpm", filter=TRUE))
k8tx_conf_normbatch_plots <- sm(graph_metrics(k8tx_conf_normbatch))

15.3.2.1 Look at some plots

Let us see how the k8 conf data appears

k8tx_conf_raw_plots$libsize

sm(k8tx_conf_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))

k8tx_conf_raw_plots$boxplot

k8tx_conf_norm_plots$corheat

k8tx_conf_norm_plots$smc

k8tx_conf_norm_plots$disheat

k8tx_conf_norm_plots$smd

k8tx_conf_norm_plots$pcaplot

k8tx_conf_norm_plots$tsneplot

## See if the additional information in the 'batch' factor made a difference
k8tx_conf_normbatch_plots$pcaplot

k8tx_conf_normbatch_plots$tsneplot

15.4 Count_Conf filtered data

When we get to the count_conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.

15.4.1 k10

k10tx_conf_count_raw_plots <- sm(graph_metrics(k10tx_conf_count))

k10tx_conf_count_norm <- sm(normalize_expt(k10tx_conf_count, transform="log2",
                                           norm="quant", convert="cpm"))
k10tx_conf_count_norm_plots <- sm(graph_metrics(k10tx_conf_count_norm))

k10tx_conf_count_normbatch <- sm(normalize_expt(k10tx_conf_count, batch="svaseq", transform="log2",
                                                convert="cpm", filter=TRUE))
k10tx_conf_count_normbatch_plots <- sm(graph_metrics(k10tx_conf_count_normbatch))

15.4.1.1 Look at some plots

Let us see how the k10 count_conf data appears

k10tx_conf_count_raw_plots$libsize

sm(k10tx_conf_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))

k10tx_conf_count_raw_plots$boxplot

k10tx_conf_count_norm_plots$corheat

k10tx_conf_count_norm_plots$smc

k10tx_conf_count_norm_plots$disheat

k10tx_conf_count_norm_plots$smd

k10tx_conf_count_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k10tx_conf_count_normbatch_plots$pcaplot

k10tx_conf_count_normbatch_plots$tsneplot

15.4.2 k8

k8tx_conf_count_raw_plots <- sm(graph_metrics(k8tx_conf_count))

k8tx_conf_count_norm <- sm(normalize_expt(k8tx_conf_count, transform="log2",
                                          norm="quant", convert="cpm"))
k8tx_conf_count_norm_plots <- sm(graph_metrics(k8tx_conf_count_norm))

k8tx_conf_count_normbatch <- sm(normalize_expt(k8tx_conf_count, batch="svaseq",
                                               transform="log2", convert="cpm"))
k8tx_conf_count_normbatch_plots <- sm(graph_metrics(k8tx_conf_count_normbatch))

15.4.2.1 Look at some plots

Let us see how the k8 count_conf data appears

k8tx_conf_count_raw_plots$libsize

sm(k8tx_conf_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))

k8tx_conf_count_raw_plots$boxplot

k8tx_conf_count_norm_plots$corheat

k8tx_conf_count_norm_plots$smc

k8tx_conf_count_norm_plots$disheat

k8tx_conf_count_norm_plots$smd

k8tx_conf_count_norm_plots$pcaplot

## See if the additional information in the 'batch' factor made a difference
k8tx_conf_count_normbatch_plots$pcaplot

k8tx_conf_count_normbatch_plots$tsneplot

16 Make a nice stacked libsize barplot

library(ggplot2)
maximum <- plot_libsize(k8tx_raw, text=FALSE)
maximum_alpha <- maximum$table
maximum_alpha$alpha <- alpha(maximum_alpha$colors, 0.4)
conf <- plot_libsize(k8tx_conf, text=FALSE)
conf_alpha <- conf$table
conf_alpha$alpha <- alpha(conf_alpha$colors, 0.6)
count <- plot_libsize(k8tx_count, text=FALSE)
count_alpha <- count$table
count_alpha$alpha <- alpha(count_alpha$colors, 0.8)
conf_count <- plot_libsize(k8tx_conf_count, text=FALSE)
conf_count_alpha <- conf_count$table
conf_count_alpha$alpha <- alpha(conf_count_alpha$colors, 1.0)

all <- rbind(maximum_alpha, conf_alpha,
             count_alpha, conf_count_alpha)

## The following will stack the library sizes from each subset one on top of the other.
## This is likely not desired, but it does provide sufficient room to print the library
## sizes if one wishes.
stacked <- ggplot(all, aes(x=id)) +
  geom_bar(aes(weight=sum, alpha=alpha, fill=colors), color="black") +
  scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")
stacked

## In this instance, they are stacked on top of 0, which I suspect is the goal.
columns <- ggplot(all, aes(x=id, y=sum)) +
  geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
  scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
  theme(axis.text=ggplot2::element_text(size=10, colour="black"),
        axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
        legend.position="none")
columns

17 Write up the favorites

Let us assume for a moment that we will choose a couple of the above for looking at more closely. I suspect k8_count_conf, k10_count_conf, and k8_count and k8_conf are good candidates.

output <- paste0("excel/k8_conf_count_by_tx-v", ver, ".xlsx")
k8_conf_count_written <- sm(write_expt(k8tx_conf_count, excel=output))

output <- paste0("excel/k10_conf_count_by_tx-v", ver, ".xlsx")

k10tx_conf_count_written <- sm(write_expt(k10tx_conf_count, excel=output))

output <- paste0("excel/k8_count_by_tx-v", ver, ".xlsx")
k8tx_count_written <- sm(write_expt(k8tx_count, excel=output))

output <- paste0("excel/k8_conf_by_tx-v", ver, ".xlsx")
k8tx_conf_written <- sm(write_expt(k8tx_conf, excel=output))

17.1 Quantify tx/gene

I want to see if there are some changes in the number of reads/tx between the embryogenic/nonembryogenic. I am not entirely certain how to do this, so I will poke about and see what happens.

I do have a data structure telling me the mapping of tx->gene.

In the following block, make a table of the number of txs per gene and look at the distribution.

transcript_gene_map <- putative_annotations[, c("transcript_id", "gene_id")]
mappings <- data.table::as.data.table(transcript_gene_map)
mappings[["num"]] <- 1
num_tx <- mappings[, list(num_tx=sum(num)), by="gene_id"]

tx_hist <- plot_histogram(num_tx$num_tx)
tx_hist

17.2 Look for how many transcripts are near 0.

A side question from the above: how many of transcripts are actually being used by the embryogenic/nonembryogenic samples? In order to attempt that question, I split the raw data into two separate pieces (embryo/nonembryo), extract the rowMeans for each transcript by these (oh damn I have a function which does this automagically), and see how many are near 0 (<= 0.001 tpm). When we do that, we observe that many more embryogenic transcripts are actually used than non-embryogenic.

start <- sm(normalize_expt(k8tx_raw, convert="cpm"))
embryo <- subset_expt(start, subset="condition=='embryogenic'")
nonembryo <- subset_expt(start, subset="condition=='nonembryogenic'")
embryo_df <- as.data.frame(exprs(embryo))
embryo_df$em_mean <- rowMeans(embryo_df)
non_df <- as.data.frame(exprs(nonembryo))
non_df$non_mean <- rowMeans(non_df)
e <- as.data.frame(embryo_df[["em_mean"]])
rownames(e) <- rownames(embryo_df)
colnames(e) <- "em_mean"
n <- as.data.frame(non_df[["non_mean"]])
rownames(n) <- rownames(non_df)
colnames(n) <- "non_mean"
means <- merge(x=e, y=n, by="row.names")
rownames(means) <- means[["Row.names"]]
means <- means[, -1]
near_zero <- means <= 0.001
embryo_near_zero <- sum(near_zero[, "em_mean"])
embryo_near_zero

## [1] 35080

non_near_zero <- sum(near_zero[, "non_mean"])
non_near_zero

## [1] 52196

plot_boxplot(embryo)

## This data will benefit from being displayed on the log scale.

## If this is not desired, set scale='raw'

## Some entries are 0.  We are on log scale, adding 1 to the data.

## Changed 391753 zero count features.

plot_boxplot(nonembryo)

## This data will benefit from being displayed on the log scale.

## If this is not desired, set scale='raw'

## Some entries are 0.  We are on log scale, adding 1 to the data.

## Changed 508520 zero count features.

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}

18 Differential expression analyses, transcripts

For the moment, I will assume we only want to deal with the 8 samples. However, I will try a count-filtered run, an annotation filtered run, and both.

18.1 Count filtered

keepers <- list(
  "embryo" = c("embryogenic", "nonembryogenic"))
k8tx_count_de <- sm(all_pairwise(k8tx_count))
k8tx_count_tables <- sm(combine_de_tables(k8tx_count_de, keepers=keepers,
                                          excel=paste0("excel/k8_count_de_tx-v", ver, ".xlsx"),
                                          sig_excel=paste0("excel/k8_count_sig_tx-v", ver, ".xlsx")))

18.2 Confidence filtered

k8tx_conf_de <- sm(all_pairwise(k8tx_conf))
k8tx_conf_tables <- sm(combine_de_tables(k8tx_conf_de, keepers=keepers,
                                         excel=paste0("excel/k8_conf_de_tx-v", ver, ".xlsx"),
                                         sig_excel=paste0("excel/k8_conf_sig_tx-v", ver, ".xlsx")))

18.3 Both

k8tx_conf_count_de <- sm(all_pairwise(k8tx_conf_count))
k8tx_conf_count_tables <- sm(combine_de_tables(k8tx_conf_count_de, keepers=keepers,
                                               excel=paste0("excel/k8_conf_count_de_tx-v", ver, ".xlsx"),
                                               sig_excel=paste0("excel/k8_conf_count_sig_tx-v", ver, ".xlsx")))

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}

19 Look for transcripts with respect to gene

20 Look for highly different transcripts for relatively unchanged genes.

21 Tx vs. genes

I loaded my differential expression data with respect to genes and transcripts. Let us try finding some genes which are relatively unchanged in that data and see if any of them have significantly changed transcripts. It is worth noting that in my previous analysis, it seemed highly likely that there are at least some to be found.

## First pull the table of data
gene_data <- k8_count_tables[["data"]][[1]]
tx_data <- k8tx_count_tables[["data"]][[1]]

dim(gene_data)

## [1] 58486    67

candidate_genes <- abs(gene_data[["deseq_logfc"]]) <= 0.2
candidates <- gene_data[candidate_genes, ]
dim(candidates)

## [1] 40725    67

## hmm that is too many.
candidate_genes <- candidates[["deseq_basemean"]] >= 1000
candidates <- candidates[candidate_genes, ]
dim(candidates)

## [1] 327  67

## well, that is more than I can deal with, but it is a start.

## Now look back at the transcripts and immediately be annoyed because a bunch are <NA>.
new_geneids <- gsub(pattern="(^.*)_i.*$", replacement="\\1", x=rownames(tx_data))
tx_data[["gene_id"]] <- new_geneids
keepers <- tx_data[["gene_id"]] %in% rownames(candidates)
tx_candidates <- tx_data[keepers, ]

test_gene <- "TRINITY_DN9916_c0_g2"
gene_data <- candidates[test_gene, ]
gene_data[["deseq_basemean"]]

## [1] 1487

test_tx <- tx_data[["gene_id"]] == test_gene
test_tx <- 2 ^ tx_data[test_tx, c("basic_nummed", "basic_denmed")]
test_kept <- rowSums(as.matrix(test_tx)) > 2
test_tx <- test_tx[test_kept, ]
colnames(test_tx) <- c("embryogenic", "nonembryogenic")
test_df <- as.data.frame(test_tx)
test_df[["tx"]] <- rownames(test_df)
em <- test_df[, c("tx", "embryogenic")]
em[["condition"]] <- "embryogenic"
colnames(em) <- c("tx", "abundance", "condition")
no <- test_df[, c("tx", "nonembryogenic")]
no[["condition"]] <- "nonembryogenic"
colnames(no) <- c("tx", "abundance", "condition")
plotted <- rbind(em, no)

library(ggplot2)
stacked <- ggplot(plotted, aes_string(x="condition")) +
  geom_bar(aes(weight=abundance, fill=tx))

pp(file="/tmp/test_transcripts_one_gene.png", image=stacked)

## Wrote the image to: /tmp/test_transcripts_one_gene.png

if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}

message(paste0("This is hpgltools commit: ", get_git_commit()))

## If you wish to reproduce this exact build of hpgltools, invoke the following:

## > git clone http://github.com/abelew/hpgltools.git

## > git reset 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc

## R> packrat::restore()

## This is hpgltools commit: Wed Mar 21 15:55:32 2018 -0400: 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc

this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))

## Saving to 04_tx_to_gene-v20171102.rda.xz

tmp <- sm(saveme(filename=this_save))
pander::pander(sessionInfo())

R version 3.4.4 (2018-03-15)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: grid, parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: hpgltools(v.2018.03), Rgraphviz(v.2.22.0), graph(v.1.56.0), SparseM(v.1.77), topGO(v.2.30.1), GO.db(v.3.5.0), AnnotationDbi(v.1.40.0), IRanges(v.2.12.0), S4Vectors(v.0.16.0), Biobase(v.2.38.0), BiocGenerics(v.0.24.0), bindrcpp(v.0.2), edgeR(v.3.20.9), foreach(v.1.4.4), Vennerable(v.3.1.0.9000), ruv(v.0.9.7) and ggplot2(v.2.2.1)

loaded via a namespace (and not attached): snow(v.0.4-2), backports(v.1.1.2), Hmisc(v.4.1-1), plyr(v.1.8.4), lazyeval(v.0.2.1), splines(v.3.4.4), BiocParallel(v.1.12.0), GenomeInfoDb(v.1.14.0), sva(v.3.26.0), digest(v.0.6.15), htmltools(v.0.3.6), gdata(v.2.18.0), magrittr(v.1.5), checkmate(v.1.8.5), memoise(v.1.1.0), cluster(v.2.0.6), doParallel(v.1.0.11), openxlsx(v.4.0.17), limma(v.3.34.9), Biostrings(v.2.46.0), readr(v.1.1.1), annotate(v.1.56.1), matrixStats(v.0.53.1), prettyunits(v.1.0.2), colorspace(v.1.3-2), blob(v.1.1.0), ggrepel(v.0.7.0), BiasedUrn(v.1.07), dplyr(v.0.7.4), RCurl(v.1.95-4.10), tximport(v.1.6.0), roxygen2(v.6.0.1), genefilter(v.1.60.0), lme4(v.1.1-15), bindr(v.0.1.1), survival(v.2.41-3), iterators(v.1.0.9), glue(v.1.2.0), gtable(v.0.2.0), zlibbioc(v.1.24.0), XVector(v.0.18.0), DelayedArray(v.0.4.1), DEoptimR(v.1.0-8), scales(v.0.5.0), DBI(v.0.8), Rcpp(v.0.12.16), xtable(v.1.8-2), progress(v.1.1.2), htmlTable(v.1.11.2), foreign(v.0.8-69), bit(v.1.1-12), preprocessCore(v.1.40.0), Formula(v.1.2-2), httr(v.1.3.1), htmlwidgets(v.1.0), gplots(v.3.0.1), RColorBrewer(v.1.1-2), acepack(v.1.4.1), pkgconfig(v.2.0.1), XML(v.3.98-1.10), nnet(v.7.3-12), locfit(v.1.5-9.1), tidyselect(v.0.2.4), labeling(v.0.3), rlang(v.0.2.0), reshape2(v.1.4.3), munsell(v.0.4.3), tools(v.3.4.4), RSQLite(v.2.0), devtools(v.1.13.5), evaluate(v.0.10.1), stringr(v.1.3.0), yaml(v.2.1.18), knitr(v.1.20), bit64(v.0.9-7), pander(v.0.6.1), geneLenDataBase(v.1.14.0), robustbase(v.0.92-8), caTools(v.1.17.1), purrr(v.0.2.4), RBGL(v.1.54.0), nlme(v.3.1-131.1), xml2(v.1.2.0), biomaRt(v.2.34.2), compiler(v.3.4.4), pbkrtest(v.0.4-7), rstudioapi(v.0.7), variancePartition(v.1.8.1), tibble(v.1.4.2), geneplotter(v.1.56.0), stringi(v.1.1.7), highr(v.0.6), GenomicFeatures(v.1.30.3), lattice(v.0.20-35), Matrix(v.1.2-12), commonmark(v.1.4), nloptr(v.1.0.4), pillar(v.1.2.1), goseq(v.1.30.0), data.table(v.1.10.4-3), bitops(v.1.0-6), corpcor(v.1.6.9), qvalue(v.2.10.0), rtracklayer(v.1.38.3), GenomicRanges(v.1.30.3), colorRamps(v.2.3), R6(v.2.2.2), latticeExtra(v.0.6-28), directlabels(v.2017.03.31), RMySQL(v.0.10.14), KernSmooth(v.2.23-15), gridExtra(v.2.3), codetools(v.0.2-15), MASS(v.7.3-49), gtools(v.3.5.0), assertthat(v.0.2.0), rhdf5(v.2.22.0), SummarizedExperiment(v.1.8.1), DESeq2(v.1.18.1), rprojroot(v.1.3-2), withr(v.2.1.2), GenomicAlignments(v.1.14.1), Rsamtools(v.1.30.0), GenomeInfoDbData(v.1.0.0), mgcv(v.1.8-23), hms(v.0.4.2), doSNOW(v.1.0.16), quadprog(v.1.5-5), rpart(v.4.1-13), tidyr(v.0.8.0), minqa(v.1.2.4), rmarkdown(v.1.9), Rtsne(v.0.13) and base64enc(v.0.1-3)

LS0tCnRpdGxlOiAiQ29tYmluZWQgU29sYW51bSBhbmFseXNlcy4iCmF1dGhvcjogImF0YiBhYmVsZXdAZ21haWwuY29tIgpkYXRlOiAiYHIgU3lzLkRhdGUoKWAiCm91dHB1dDoKIGh0bWxfZG9jdW1lbnQ6CiAgY29kZV9kb3dubG9hZDogdHJ1ZQogIGNvZGVfZm9sZGluZzogc2hvdwogIGZpZ19jYXB0aW9uOiB0cnVlCiAgZmlnX2hlaWdodDogNwogIGZpZ193aWR0aDogNwogIGhpZ2hsaWdodDogdGFuZ28KICBrZWVwX21kOiBmYWxzZQogIG1vZGU6IHNlbGZjb250YWluZWQKICBudW1iZXJfc2VjdGlvbnM6IHRydWUKICBzZWxmX2NvbnRhaW5lZDogdHJ1ZQogIHRoZW1lOiBjb3NtbwogIHRvYzogdHJ1ZQogIHRvY19mbG9hdDoKICAgIGNvbGxhcHNlZDogZmFsc2UKICAgIHNtb290aF9zY3JvbGw6IGZhbHNlCi0tLQoKPHN0eWxlPgogIGJvZHkgLm1haW4tY29udGFpbmVyIHsKICAgIG1heC13aWR0aDogMTYwMHB4Owp9Cjwvc3R5bGU+CgpgYGB7ciBvcHRpb25zLCBpbmNsdWRlPUZBTFNFfQpsaWJyYXJ5KCJocGdsdG9vbHMiKQp0dCA8LSBkZXZ0b29sczo6bG9hZF9hbGwoIn4vaHBnbHRvb2xzIikKa25pdHI6Om9wdHNfa25pdCRzZXQocHJvZ3Jlc3M9VFJVRSwKICAgICAgICAgICAgICAgICAgICAgdmVyYm9zZT1UUlVFLAogICAgICAgICAgICAgICAgICAgICB3aWR0aD05MCwKICAgICAgICAgICAgICAgICAgICAgZWNobz1UUlVFKQprbml0cjo6b3B0c19jaHVuayRzZXQoZXJyb3I9VFJVRSwKICAgICAgICAgICAgICAgICAgICAgIGZpZy53aWR0aD04LAogICAgICAgICAgICAgICAgICAgICAgZmlnLmhlaWdodD04LAogICAgICAgICAgICAgICAgICAgICAgZHBpPTk2KQpvcHRpb25zKGRpZ2l0cz00LAogICAgICAgIHN0cmluZ3NBc0ZhY3RvcnM9RkFMU0UsCiAgICAgICAga25pdHIuZHVwbGljYXRlLmxhYmVsPSJhbGxvdyIpCmdncGxvdDI6OnRoZW1lX3NldChnZ3Bsb3QyOjp0aGVtZV9idyhiYXNlX3NpemU9MTApKQpzZXQuc2VlZCgxKQp2ZXIgPC0gIjIwMTgwMzEwIgpwcmV2aW91c19maWxlIDwtICJpbmRleC5SbWQiCgp0bXAgPC0gdHJ5KHNtKGxvYWRtZShmaWxlbmFtZT1wYXN0ZTAoZ3N1YihwYXR0ZXJuPSJcXC5SbWQiLCByZXBsYWNlPSIiLCB4PXByZXZpb3VzX2ZpbGUpLCAiLXYiLCB2ZXIsICIucmRhLnh6IikpKSkKc2tpcF9sb2FkIDwtIFRSVUUKYGBgCgojIEdlbmUgc3BlY2lmaWMgZGF0YQoKVGhlIGZvbGxvd2luZyAzIHNlY3Rpb25zIGFyZSBnZW5lIHNwZWNpZmljLiAgTGF0ZXIsIHRoZXkgd2lsbCBiZSByZXBlYXRlZCBmb3IKdGhlIHRyYW5zY3JpcHRzLgoKYGBge3IgYW5ub3RhdGlvbiwgY2hpbGQ9JzAxX2Fubm90YXRpb24uUm1kJ30KYGBgCgpgYGB7ciBzYW1wbGVfZXN0aW1hdGlvbiwgY2hpbGQ9JzAyX3NhbXBsZV9lc3RpbWF0aW9uLlJtZCd9CmBgYAoKYGBge3IgZGlmZmVyZW50aWFsX2V4cHJlc3Npb24sIGNoaWxkPScwM19kaWZmZXJlbnRpYWxfZXhwcmVzc2lvbi5SbWQnfQpgYGAKCmBgYHtyIGdlbmVfb250b2xvZ3ksIGNoaWxkPScwNF9nZW5lX29udG9sb2d5LlJtZCd9CmBgYAoKIyBUcmFuc2NyaXB0IHNwZWNpZmljIGRhdGEKClRoZSBmb2xsb3dpbmcgc2VjdGlvbnMgZG8gbm90IHVzZSB0aGUgZ2VuZS0+dHJhbnNjcmlwdCBtYXBzLgoKYGBge3IgYW5ub3RhdGlvbl90eCwgY2hpbGQ9JzAxX2Fubm90YXRpb25fdHguUm1kJ30KYGBgCgpgYGB7ciBzYW1wbGVfZXN0aW1hdGlvbl90eCwgY2hpbGQ9JzAyX3NhbXBsZV9lc3RpbWF0aW9uX3R4LlJtZCd9CmBgYAoKYGBge3IgZGlmZmVyZW50aWFsX2V4cHJlc3Npb25fdHgsIGNoaWxkPScwM19kaWZmZXJlbnRpYWxfZXhwcmVzc2lvbl90eC5SbWQnfQpgYGAKCiMgTG9vayBmb3IgdHJhbnNjcmlwdHMgd2l0aCByZXNwZWN0IHRvIGdlbmUKCmBgYHtyIHR4X3Blcl9nZW5lLCBjaGlsZD0nMDRfdHhfdG9fZ2VuZS5SbWQnfQpgYGAKCmBgYHtyIHNhdmVtZX0KbWVzc2FnZShwYXN0ZTAoIlRoaXMgaXMgaHBnbHRvb2xzIGNvbW1pdDogIiwgZ2V0X2dpdF9jb21taXQoKSkpCnRoaXNfc2F2ZSA8LSBwYXN0ZTAoZ3N1YihwYXR0ZXJuPSJcXC5SbWQiLCByZXBsYWNlPSIiLCB4PXJtZF9maWxlKSwgIi12IiwgdmVyLCAiLnJkYS54eiIpCm1lc3NhZ2UocGFzdGUwKCJTYXZpbmcgdG8gIiwgdGhpc19zYXZlKSkKdG1wIDwtIHNtKHNhdmVtZShmaWxlbmFtZT10aGlzX3NhdmUpKQpwYW5kZXI6OnBhbmRlcihzZXNzaW9uSW5mbygpKQpgYGAK

Combined Solanum analyses.

atb abelew@gmail.com

2018-03-27

1 Gene specific data

2 Annotation version: 20180310

2.1 Load all trinotate data

2.2 Create the expressionsets

2.2.1 How many putative features in the raw data?

2.3 Threshold cutoff

2.3.1 How many high-count features are in the data?

2.4 Confidence cutoff

2.5 How many high-confidence features are in the data?

2.6 Both cutoffs

2.6.1 How many high-count and high-confidence features?

3 Plot changes in number of putative genes

4 Gene Sample Estimation version: 20180310

5 Examine these data sets

5.1 Raw data

5.1.1 k10 raw

5.1.2 k8 raw

5.1.2.1 A peculiarity of kallisto

5.2 Quick side question

5.3 Count filtered data

5.3.1 k10 count filtering

5.3.2 Some variance partition plots

5.3.2.1 Look at some plots

5.3.2.2 Odd densities again

5.3.3 k8 count filtering

5.3.3.1 Look at some plots

5.3.4 Look more closely at variance

5.4 Conf filtered data

5.4.1 k10 confidence filtering

5.4.1.1 Look at some plots

5.4.2 k8 confidence filtering

5.4.2.1 Look at some plots

5.5 Count and Conf filtered data

5.5.1 k10 both filters

5.5.1.1 Look at some plots

5.5.2 k8 both filters

5.5.2.1 Look at some plots

5.5.3 Both filters, variance partition

6 Make a nice stacked libsize barplot

7 Write up the favorites

8 Differential Expression by gene, Solanum betaceum: 20180310

8.1 Count filtered

8.1.1 Generate tables

8.2 Test for genes ‘strong in batch’

8.3 Confidence filtered

8.4 Both

9 Gene Ontology Searches using Solanum betaceum trinotate GO associations: 20180310

9.1 Searching with goseq

9.2 Search using the top/bottom n genes

10 Transcript specific data

11 Annotation version: 20180310

12 Plot changes in number of putative genes

13 Try to understand how similar bowtie2->htseq counts are to kallisto for this data.

14 Sample Estimation version: 20180310

15 Examine these data sets

15.1 Raw data

15.1.1 k10

15.1.2 k8

15.2 Count filtered data

15.2.1 k10

15.2.1.1 Look at some plots

15.2.2 k8

15.2.2.1 Look at some plots

15.3 Conf filtered data

15.3.1 k10

15.3.1.1 Look at some plots

15.3.2 k8

15.3.2.1 Look at some plots

15.4 Count_Conf filtered data

15.4.1 k10

15.4.1.1 Look at some plots

15.4.2 k8

15.4.2.1 Look at some plots

16 Make a nice stacked libsize barplot

17 Write up the favorites

17.1 Quantify tx/gene

17.2 Look for how many transcripts are near 0.