The following 3 sections are gene specific. Later, they will be repeated for the transcripts.
Any annotation data will need to come from our trinotate result. An interesting caveat, kallisto uses a transcript database. tximport is the library used to read the counts/transcript into R, it is able to take a set of mappings between gene/transcript in order to provide gene counts rather than transcript.
An important aspect of using trinity for this creation of our annotation data: there are many putative transcripts for each putative gene, many of which are probably not real, but instead are artifacts of the assembler. Therefore we need to do something to try to distinguish between the reals and fakers. The following code will attempt to address this problem.
putative_annotations <- sm(load_trinotate_annotations("reference/trinotate.csv"))
annot_by_gene <- putative_annotations
## Let us make one set of annotations by transcript ID and one by gene ID
rownames(annot_by_gene) <- make.names(putative_annotations[["gene_id"]], unique=TRUE)
annot_by_tx <- putative_annotations
rownames(annot_by_tx) <- make.names(putative_annotations[["transcript_id"]], unique=TRUE)
In order to make functional expressionsets of this data, we will need to use the mappings from gene <-> transcripts, that is the final argument to create_expt().
## I split the annotations into one set keyed by gene IDs and one by transcript IDs.
## This way it is possible to make an expressionset of transcripts or genes by
## removing/keeping the tx_gene_map argument.
## Why would you want to do this? you might ask... Well, sadly the annotation database
## Does not always have filled-in data for the first transcript for a given gene ID,
## therefore the annotation information for those genes will be lost when we merge
## the count tables / annotation tables by gene ID.
## This is of course also true for transcript IDs, but for a vastly smaller number of them.
transcript_gene_map <- putative_annotations[, c("transcript_id", "gene_id")]
k10_raw <- create_expt(metadata="sample_sheets/all_samples.xlsx",
gene_info=annot_by_gene,
tx_gene_map=transcript_gene_map)
## Reading the sample metadata.
## The sample definitions comprises: 10, 7 rows, columns.
## Reading count tables.
## Using the transcript<->gene mapping.
## Finished reading count tables.
## Matched 58486 annotations and counts.
## Bringing together the count matrix and gene information.
putative_annotations <- fData(k10_raw) ## A cheap way to get back the unique gene IDs.
## On further examination, we decided to remove both samples 1003 and 1008
## which Sandra pointed out are in fact from the same flow cytometry isolation.
k8_raw <- create_expt(metadata="sample_sheets/kept_samples.xlsx",
gene_info=annot_by_gene,
tx_gene_map=transcript_gene_map)
## Reading the sample metadata.
## The sample definitions comprises: 8, 7 rows, columns.
## Reading count tables.
## Using the transcript<->gene mapping.
## Finished reading count tables.
## Matched 58486 annotations and counts.
## Bringing together the count matrix and gene information.
## If we do not use the tx_gene_map information, we get 234,330 transcripts in the data set.
## Instead, we get 58,486 genes.
dim(exprs(k10_raw))
## [1] 58486 10
dim(exprs(k8_raw))
## [1] 58486 8
Start by dropping all features with less than 4 counts across all samples. This should get rid of a significant number of the false positive transcripts. While I am at it, I will drop the features annotated as rRNA.
## Perform a count/gene cutoff here
threshold <- 4
method <- "cbcb" ## other methods include 'genefilter' and 'simple'
k10_count <- sm(normalize_expt(k10_raw, filter=method, thresh=threshold))
k8_count <- sm(normalize_expt(k8_raw, filter=method, thresh=threshold))
## First get rid of the rRNA
sb_rrna <- putative_annotations[["rrna_subunit"]] != ""
rrna_droppers <- rownames(putative_annotations)[sb_rrna]
k10_norrna <- exclude_genes_expt(expt=k10_raw, ids=rrna_droppers)
## Before removal, there were 58486 entries.
## Now there are 58481 entries.
## Percent kept: 99.994, 99.994, 99.994, 99.996, 99.996, 99.993, 99.991, 99.994, 99.996, 99.995
## Percent removed: 0.006, 0.006, 0.006, 0.004, 0.004, 0.007, 0.009, 0.006, 0.004, 0.005
k8_norrna <- exclude_genes_expt(expt=k8_raw, ids=rrna_droppers)
## Before removal, there were 58486 entries.
## Now there are 58481 entries.
## Percent kept: 99.994, 99.994, 99.994, 99.996, 99.993, 99.991, 99.994, 99.995
## Percent removed: 0.006, 0.006, 0.006, 0.004, 0.007, 0.009, 0.006, 0.005
dim(exprs(k10_count))
## [1] 24132 10
dim(exprs(k8_count))
## [1] 23211 8
Now lets drop those features for which we have relatively lower confidence in the BLAST results from trinotate.
## Keep only the blastx hits with a low e-value and high identity.
filtered_keepers <- putative_annotations[["blastx_evalue"]] <= 1e-10 &
putative_annotations[["blastx_identity"]] >= 60
keepers <- rownames(putative_annotations)[filtered_keepers]
## This excludes all but 33,451 transcripts.
## Now keep only those transcript IDs which fall into the above categories.
k10_conf <- exclude_genes_expt(expt=k10_norrna, ids=keepers, method="keep")
## Before removal, there were 58481 entries.
## Now there are 4888 entries.
## Percent kept: 23.611, 22.922, 24.308, 23.931, 24.798, 24.603, 25.471, 22.574, 23.289, 25.368
## Percent removed: 76.389, 77.078, 75.692, 76.069, 75.202, 75.397, 74.529, 77.426, 76.711, 74.632
k8_conf <- exclude_genes_expt(expt=k8_norrna, ids=keepers, method="keep")
## Before removal, there were 58481 entries.
## Now there are 4888 entries.
## Percent kept: 23.611, 22.922, 24.308, 24.798, 24.603, 25.471, 22.574, 25.368
## Percent removed: 76.389, 77.078, 75.692, 75.202, 75.397, 74.529, 77.426, 74.632
dim(exprs(k10_conf))
## [1] 4888 10
dim(exprs(k8_conf))
## [1] 4888 8
Finally, lets be super-strict and remove both.
## Perform both the confidence and count filtration
k10_conf_count <- sm(normalize_expt(k10_conf, filter=method, thresh=threshold))
k8_conf_count <- sm(normalize_expt(k8_conf, filter=method, thresh=threshold))
rrna_expt <- exclude_genes_expt(expt=k10_raw, ids=sb_rrna, method="keep")
## Before removal, there were 58486 entries.
## Now there are 5 entries.
## Percent kept: 0.006, 0.006, 0.006, 0.004, 0.004, 0.007, 0.009, 0.006, 0.004, 0.005
## Percent removed: 99.994, 99.994, 99.994, 99.996, 99.996, 99.993, 99.991, 99.994, 99.996, 99.995
## This has only 21 putative rRNA transcripts.
dim(exprs(k10_conf_count))
## [1] 4392 10
dim(exprs(k8_conf_count))
## [1] 4328 8
There are a few ways to visualize the changes wrought in the data by these operations. Here are two bar plots which attempt to show what happened, the first stacks the relatively-sized library sizes atop one another, the second stacks them on the x-axis.
genes_by_subset <- data.frame(
samples = c(rep("ten", 4), rep("eight", 4)),
colors = c(rep("#1B9E77", 4), rep("#7570B3", 4)),
alpha = c("#1B9E77FF", "#1B9E77CC", "#1B0E77AA",
"#1B0E7777", "#7570B3FF", "#7570B3CC", "#7570B3AA", "#7570B377"),
type = rep(c("raw", "count", "conf", "conf_count"), 2),
genes = c(nrow(exprs(k10_raw)), nrow(exprs(k10_count)),
nrow(exprs(k10_conf)), nrow(exprs(k10_conf_count)),
nrow(exprs(k8_raw)), nrow(exprs(k8_count)),
nrow(exprs(k8_conf)), nrow(exprs(k8_conf_count))))
library(ggplot2)
columns <- ggplot(genes_by_subset, aes(x=samples, y=genes)) +
geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
scale_fill_manual(values=c(levels(as.factor(genes_by_subset$colors)))) +
theme(axis.text=ggplot2::element_text(size=10, colour="black"),
axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
legend.position="none")
c1 <- genes_by_subset[["samples"]] == "ten"
c2 <- genes_by_subset[["samples"]] == "eight"
tmp <- cbind(genes_by_subset[c1, "genes"], genes_by_subset[c2, "genes"])
colnames(tmp) <- c("ten", "eight")
rownames(tmp) <- c("raw", "count", "conf", "conf_count")
tmp
## ten eight
## raw 58486 58486
## count 24132 23211
## conf 4888 4888
## conf_count 4392 4328
pp(file="images/genes_by_columns.png", image=columns)
## Wrote the image to: images/genes_by_columns.png
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
}
Since I used kallisto for this work, I want to figure out tximport.
In 01_annotation.Rmd, I decided to open up a few lines of inquiry:
Start with the full data set. This will take a rather long time to perform all the graphs, and since we are not likely to use it, I will limit myself to just plotting 1-2 graphs.
This data set is unlikely to be very helpful for differential expression, but may be useful for diagnosing other concerns in the data.
k10_libsize <- plot_libsize(k10_raw)
pp(file="images/01_k10_libsize.pdf", image=k10_libsize$plot)
## Wrote the image to: images/01_k10_libsize.pdf
k10_density <- sm(plot_density(k10_raw))
k10_density <- sm(k10_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
pp(file="images/02_k10_density.pdf", image=k10_density)
## Wrote the image to: images/02_k10_density.pdf
k10_boxplot <- sm(plot_boxplot(k10_raw))
pp(file="images/03_k10_boxplot.pdf", image=k10_boxplot)
## Wrote the image to: images/03_k10_boxplot.pdf
k10_corheat <- plot_corheat(k10_raw)
pp(file="images/04_k10_corheat.pdf", image=k10_corheat)
## Wrote the image to: images/04_k10_corheat.pdf
k10_tsne <- plot_tsne(k10_raw)
pp(file="images/05_k10_tsne.pdf", image=k10_tsne)
## Wrote the image to: images/05_k10_tsne.pdf
Most notably, it makes obvious some concerns with samples 1003, 1007, and 1008. We chose to remove 1003 and 1008, but will keep an eye on 1007.
The metrics of the data when we remove those two samples should generally be quite similar.
k8_libsize <- plot_libsize(k8_raw)
pp(file="images/06_k8_libsize.pdf", image=k8_libsize$plot)
## Wrote the image to: images/06_k8_libsize.pdf
k8_density <- sm(plot_density(k8_raw))
k8_density <- sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
pp(file="images/07_k8_density.pdf", image=k8_density)
## Wrote the image to: images/07_k8_density.pdf
k8_boxplot <- sm(plot_boxplot(k8_raw))
pp(file="images/08_k8_boxplot.pdf", image=k8_boxplot)
## Wrote the image to: images/08_k8_boxplot.pdf
k8_corheat <- plot_corheat(k8_raw)
pp(file="images/09_k8_corheat.pdf", image=k8_corheat)
## Wrote the image to: images/09_k8_corheat.pdf
k8_tsne <- plot_tsne(k8_raw)
pp(file="images/10_k8_tsne.pdf", image=k8_tsne)
## Wrote the image to: images/10_k8_tsne.pdf
Here is the same density plot, but looking from 0 to 10. In it we can see that there is a tremendous number of ‘genes’ which have fractional counts. I am certain that this is important, but I am not sure what to do with it. I have a feeling that when we do a count-based filtering, this tremendous jump by sample 1009 will fall away…
odd <- sm(k8_density + ggplot2::scale_x_continuous(limits=c(0,10)))
pp(file="images/11_k8_odd_density.pdf", image=odd)
## Wrote the image to: images/11_k8_odd_density.pdf
Can we ask the data to give us relatively reliable estimates of which genes/transcripts are in all 4 embryogenic samples and in none of the non-embryogenic samples?
Yesterday when I went to do this, my rpkm function failed because I neglected to include a sequence length column in the annotation data. That has been remedied, as a result these cutoffs probably should be adjust to something more appropriate to rpkm-scale data. Though I am not entirely certain where they should be adjusted, perhaps <1 and 5?
k8_raw_rpkm <- normalize_expt(k8_count, convert="rpkm", norm="quant")
## This function will replace the expt$expressionset slot with:
## rpkm(quant(data))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Filter is false, this should likely be set to something, good
## choices include cbcb, kofa, pofa (anything but FALSE). If you want this to
## stay FALSE, keep in mind that if other normalizations are performed, then the
## resulting libsizes are likely to be strange (potentially negative!)
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: normalizing the data with quant.
## Step 3: converting the data with rpkm.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
low_cutoff <- 3
high_cutoff <- 10
k8_embryo_low_nonembryo_high <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) <= low_cutoff &
rowMeans(exprs(k8_raw_rpkm)[, 5:8]) >= high_cutoff
checkme_first <- rownames(exprs(k8_raw_rpkm))[k8_embryo_low_nonembryo_high]
dim(exprs(k8_raw_rpkm)[checkme_first, ])
## [1] 228 8
low_embryo_high_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_first, ], fData(k8_raw_rpkm)[checkme_first, ], by="row.names")
write.csv(file="low_embryo_high_nonembryo.csv", low_embryo_high_nonembryo)
k8_embryo_high_nonembryo_low <- rowMeans(exprs(k8_raw_rpkm)[, 1:4]) >= high_cutoff & rowMeans(exprs(k8_raw_rpkm)[, 5:8]) <= low_cutoff
checkme_second <- rownames(exprs(k8_raw_rpkm))[k8_embryo_high_nonembryo_low]
dim(exprs(k8_raw_rpkm)[checkme_second, ])
## [1] 43 8
high_embryo_low_nonembryo <- merge(exprs(k8_raw_rpkm)[checkme_second, ], fData(k8_raw_rpkm)[checkme_second, ], by="row.names")
write.csv(file="high_embryo_low_nonembryo.csv", high_embryo_low_nonembryo)
When we get to the count filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.
For these data, per orm some extra analyses including a PC loading, PC vs actor analysis, and variance partition.
k10_count_raw_plots <- sm(graph_metrics(k10_count))
k10_count_norm <- sm(normalize_expt(k10_count, transform="log2", norm="quant", convert="cpm"))
k10_count_norm_plots <- sm(graph_metrics(k10_count_norm))
k10_count_normbatch <- sm(normalize_expt(k10_count, batch="svaseq", transform="log2", convert="cpm"))
k10_count_normbatch_plots <- sm(graph_metrics(k10_count_normbatch))
pp(file="images/12_anova_fstat_heatmap.pdf", image=pca_info$anova_fstat_heatmap)
## Wrote the image to: images/12_anova_fstat_heatmap.pdf
pp(file="images/13_k10_varpart_pct.pdf", image=k10_varpart$percent_plot)
## Wrote the image to: images/13_k10_varpart_pct.pdf
pp(file="images/14_k10_varpart_partition.pdf", image=k10_varpart$partition_plot)
## Wrote the image to: images/14_k10_varpart_partition.pdf
Let us see how the k10 count data appears
pp(file="images/15_k10count_libsize.pdf", image=k10_count_raw_plots$libsize)
## Wrote the image to: images/15_k10count_libsize.pdf
k10_count_density <- sm(k10_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,200)))
pp(file="images/16_k10count_density.pdf", image=k10_count_density)
## Wrote the image to: images/16_k10count_density.pdf
pp(file="images/17_k10count_boxplot.pdf", image=k10_count_raw_plots$boxplot)
## Wrote the image to: images/17_k10count_boxplot.pdf
pp(file="images/18_k10count_corheat.pdf", image=k10_count_norm_plots$corheat)
## Wrote the image to: images/18_k10count_corheat.pdf
pp(file="images/19_k10count_smc.pdf", image=k10_count_norm_plots$smc)
## Wrote the image to: images/19_k10count_smc.pdf
pp(file="images/20_k10count_disheat.pdf", image=k10_count_norm_plots$disheat)
## Wrote the image to: images/20_k10count_disheat.pdf
pp(file="images/21_k10count_smd.pdf", image=k10_count_norm_plots$smd)
## Wrote the image to: images/21_k10count_smd.pdf
pp(file="images/22_k10count_pca.pdf", image=k10_count_norm_plots$pcaplot)
## Wrote the image to: images/22_k10count_pca.pdf
## See if the additional information in the 'batch' factor made a difference
pp(file="images/23_k10count_normbatch_pca.pdf", image=k10_count_normbatch_plots$pcaplot)
## Wrote the image to: images/23_k10count_normbatch_pca.pdf
sm(k10_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(0,10)))
It looks like I was mostly correct above, this makes it clearer that sample 1003 has to go, and since 1008 is in the same batch it is likely to get the axe still. Though judging from this plot, maybe 1009 is a candidate for removal too.
These are the data primarily used in the following analyses (differential expression/ontology/etc).
k8_count_raw_plots <- sm(graph_metrics(k8_count))
k8_count_norm <- sm(normalize_expt(k8_count, transform="log2", norm="quant", convert="cpm"))
k8_count_norm_plots <- sm(graph_metrics(k8_count_norm))
k8_count_normbatch <- sm(normalize_expt(k8_count, batch="svaseq", transform="log2", convert="cpm"))
k8_count_normbatch_plots <- sm(graph_metrics(k8_count_normbatch))
Let us see how the k8 count data appears
pp(file="images/20_k8count_libsize.pdf", image=k8_count_raw_plots$libsize)
## Wrote the image to: images/20_k8count_libsize.pdf
k8_count_density <- sm(k8_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,200)))
pp(file="images/21_k8count_density.pdf", image=k8_count_density)
## Wrote the image to: images/21_k8count_density.pdf
pp(file="images/22_k8count_boxplot.pdf", image=k8_count_raw_plots$boxplot)
## Wrote the image to: images/22_k8count_boxplot.pdf
pp(file="images/23_k8count_corheat.pdf", image=k8_count_norm_plots$corheat)
## Wrote the image to: images/23_k8count_corheat.pdf
pp(file="images/24_k8count_smc.pdf", image=k8_count_norm_plots$smc)
## Wrote the image to: images/24_k8count_smc.pdf
pp(file="images/25_k8count_disheat.pdf", image=k8_count_norm_plots$disheat)
## Wrote the image to: images/25_k8count_disheat.pdf
pp(file="images/26_k8count_smd.pdf", image=k8_count_norm_plots$smd)
## Wrote the image to: images/26_k8count_smd.pdf
pp(file="images/27_k8count_pca.pdf", image=k8_count_norm_plots$pcaplot)
## Wrote the image to: images/27_k8count_pca.pdf
## See if the additional information in the 'batch' factor made a difference
pp(file="images/28_k8count_normbatch_pca.pdf", image=k8_count_normbatch_plots$pcaplot)
## Wrote the image to: images/28_k8count_normbatch_pca.pdf
pp(file="images/29_k8count_tsne.pdf", image=k8_count_normbatch_plots$tsneplot)
## Wrote the image to: images/29_k8count_tsne.pdf
We can use variancePartition to look more closely at where the variance in the data is coming from.
pca_info <- pca_information(k8_count, plot_pcas=TRUE, num_components=4,
expt_factors=c("condition", "batch", "rnaconcentration"))
## More shallow curves in these plots suggest more genes in this principle component.
k8_varpart <- varpart(k8_count, predictor=NULL, factors=c("condition","batch"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 2 min
## Placing factor: condition at the beginning of the model.
write.csv(k8_varpart$sorted_df, file="excel/varpart_sorted.csv")
high_scores <- pca_highscores(k8_count, n=100)
## Top 100 Strongest genes vs. component 1.
high_scores$highest[, "Comp.1"]
## [1] "14.53:TRINITY_DN11326_c0_g3" "13.33:TRINITY_DN7305_c1_g7"
## [3] "13.19:TRINITY_DN18956_c2_g1" "13.13:TRINITY_DN11934_c0_g7"
## [5] "12.63:TRINITY_DN15106_c4_g3" "12.56:TRINITY_DN16546_c0_g6"
## [7] "12.34:TRINITY_DN20010_c1_g5" "12.15:TRINITY_DN15897_c1_g3"
## [9] "12.15:TRINITY_DN10836_c0_g3" "11.9:TRINITY_DN15055_c3_g4"
## [11] "11.75:TRINITY_DN15734_c1_g2" "11.59:TRINITY_DN7489_c0_g3"
## [13] "11.55:TRINITY_DN13204_c0_g1" "11.47:TRINITY_DN16838_c0_g1"
## [15] "11.37:TRINITY_DN14567_c0_g2" "11.27:TRINITY_DN7489_c0_g7"
## [17] "11.27:TRINITY_DN16152_c0_g6" "10.84:TRINITY_DN8744_c0_g1"
## [19] "10.71:TRINITY_DN11842_c4_g1" "10.62:TRINITY_DN13516_c1_g1"
## [21] "10.57:TRINITY_DN8078_c1_g3" "10.45:TRINITY_DN15545_c0_g3"
## [23] "10.41:TRINITY_DN11322_c1_g5" "10.41:TRINITY_DN17690_c0_g4"
## [25] "10.21:TRINITY_DN9459_c1_g3" "10.14:TRINITY_DN13058_c0_g1"
## [27] "10.07:TRINITY_DN15333_c1_g5" "9.923:TRINITY_DN16729_c0_g3"
## [29] "9.871:TRINITY_DN8985_c1_g8" "9.83:TRINITY_DN14132_c0_g1"
## [31] "9.825:TRINITY_DN12928_c0_g13" "9.767:TRINITY_DN14230_c2_g3"
## [33] "9.687:TRINITY_DN18952_c1_g4" "9.679:TRINITY_DN18760_c1_g4"
## [35] "9.651:TRINITY_DN8086_c3_g1" "9.651:TRINITY_DN14199_c1_g4"
## [37] "9.626:TRINITY_DN6957_c2_g5" "9.61:TRINITY_DN15734_c1_g4"
## [39] "9.592:TRINITY_DN14545_c1_g1" "9.589:TRINITY_DN14291_c0_g2"
## [41] "9.532:TRINITY_DN19416_c3_g2" "9.464:TRINITY_DN11811_c1_g4"
## [43] "9.438:TRINITY_DN8979_c2_g6" "9.344:TRINITY_DN8411_c0_g5"
## [45] "9.315:TRINITY_DN10513_c0_g3" "9.309:TRINITY_DN16231_c1_g6"
## [47] "9.302:TRINITY_DN12894_c1_g4" "9.29:TRINITY_DN11126_c0_g1"
## [49] "9.205:TRINITY_DN9864_c0_g5" "9.145:TRINITY_DN11615_c1_g1"
## [51] "9.09:TRINITY_DN16511_c0_g8" "9.046:TRINITY_DN17042_c1_g4"
## [53] "9.043:TRINITY_DN13936_c0_g2" "9.03:TRINITY_DN12936_c0_g4"
## [55] "8.978:TRINITY_DN16578_c3_g2" "8.912:TRINITY_DN7883_c1_g1"
## [57] "8.901:TRINITY_DN11505_c0_g1" "8.888:TRINITY_DN11628_c1_g1"
## [59] "8.809:TRINITY_DN7877_c0_g1" "8.806:TRINITY_DN11976_c1_g2"
## [61] "8.797:TRINITY_DN16152_c0_g8" "8.762:TRINITY_DN14513_c0_g4"
## [63] "8.741:TRINITY_DN15573_c0_g3" "8.715:TRINITY_DN14860_c2_g4"
## [65] "8.712:TRINITY_DN14732_c0_g3" "8.698:TRINITY_DN17291_c1_g3"
## [67] "8.63:TRINITY_DN8013_c0_g4" "8.612:TRINITY_DN17213_c0_g1"
## [69] "8.608:TRINITY_DN9824_c4_g2" "8.527:TRINITY_DN13759_c0_g3"
## [71] "8.526:TRINITY_DN11028_c0_g5" "8.521:TRINITY_DN18874_c0_g4"
## [73] "8.454:TRINITY_DN9025_c2_g5" "8.452:TRINITY_DN16578_c3_g3"
## [75] "8.352:TRINITY_DN10485_c0_g2" "8.333:TRINITY_DN12962_c0_g2"
## [77] "8.323:TRINITY_DN15701_c0_g3" "8.321:TRINITY_DN9025_c2_g1"
## [79] "8.292:TRINITY_DN16566_c0_g1" "8.292:TRINITY_DN12634_c0_g6"
## [81] "8.29:TRINITY_DN11495_c1_g5" "8.253:TRINITY_DN7877_c0_g2"
## [83] "8.223:TRINITY_DN11375_c0_g4" "8.175:TRINITY_DN16221_c1_g2"
## [85] "8.148:TRINITY_DN11923_c0_g1" "8.11:TRINITY_DN18336_c1_g2"
## [87] "8.07:TRINITY_DN9532_c1_g2" "8.026:TRINITY_DN14568_c1_g3"
## [89] "8.024:TRINITY_DN491_c0_g1" "7.997:TRINITY_DN17712_c1_g3"
## [91] "7.99:TRINITY_DN17472_c1_g2" "7.974:TRINITY_DN9528_c0_g4"
## [93] "7.926:TRINITY_DN18524_c2_g3" "7.9:TRINITY_DN13728_c0_g6"
## [95] "7.878:TRINITY_DN17799_c1_g5" "7.823:TRINITY_DN19276_c4_g2"
## [97] "7.791:TRINITY_DN17399_c1_g1" "7.774:TRINITY_DN11918_c0_g1"
## [99] "7.765:TRINITY_DN14890_c0_g3" "7.764:TRINITY_DN10143_c1_g6"
## Theoretically that list should agree well with the top of the variance partition table vs. Condition
## assuming condition fits component 1.
## In contrast, since component 2 seems to be the best for splitting the data, here is component 2.
most_by_batch <- high_scores$highest[, "Comp.2"]
most_by_batch <- gsub(pattern="^.*:(.*$)", replacement="\\1", x=most_by_batch)
most_by_batch
## [1] "TRINITY_DN18570_c1_g2" "TRINITY_DN7305_c1_g7"
## [3] "TRINITY_DN11697_c1_g1" "TRINITY_DN16210_c1_g5"
## [5] "TRINITY_DN9078_c2_g1" "TRINITY_DN11256_c0_g1"
## [7] "TRINITY_DN10320_c1_g2" "TRINITY_DN9633_c1_g2"
## [9] "TRINITY_DN9548_c1_g4" "TRINITY_DN17153_c1_g2"
## [11] "TRINITY_DN12103_c0_g1" "TRINITY_DN18792_c1_g4"
## [13] "TRINITY_DN10496_c1_g1" "TRINITY_DN14604_c1_g4"
## [15] "TRINITY_DN18209_c0_g3" "TRINITY_DN7869_c0_g3"
## [17] "TRINITY_DN8181_c1_g1" "TRINITY_DN7021_c0_g1"
## [19] "TRINITY_DN18394_c1_g2" "TRINITY_DN17811_c0_g5"
## [21] "TRINITY_DN11151_c0_g3" "TRINITY_DN15959_c1_g2"
## [23] "TRINITY_DN14297_c2_g1" "TRINITY_DN18698_c1_g1"
## [25] "TRINITY_DN9582_c0_g3" "TRINITY_DN12083_c1_g6"
## [27] "TRINITY_DN15470_c0_g5" "TRINITY_DN8612_c1_g5"
## [29] "TRINITY_DN11811_c1_g1" "TRINITY_DN15599_c0_g8"
## [31] "TRINITY_DN6713_c0_g4" "TRINITY_DN17511_c1_g1"
## [33] "TRINITY_DN17349_c0_g5" "TRINITY_DN12356_c0_g1"
## [35] "TRINITY_DN18879_c2_g2" "TRINITY_DN11111_c0_g1"
## [37] "TRINITY_DN9599_c0_g2" "TRINITY_DN11427_c0_g5"
## [39] "TRINITY_DN19771_c2_g1" "TRINITY_DN18486_c1_g2"
## [41] "TRINITY_DN11934_c0_g7" "TRINITY_DN11001_c0_g7"
## [43] "TRINITY_DN7548_c2_g1" "TRINITY_DN7838_c0_g4"
## [45] "TRINITY_DN7838_c0_g9" "TRINITY_DN6621_c2_g2"
## [47] "TRINITY_DN12218_c1_g8" "TRINITY_DN8201_c4_g3"
## [49] "TRINITY_DN14030_c0_g6" "TRINITY_DN7283_c1_g2"
## [51] "TRINITY_DN6823_c1_g5" "TRINITY_DN9223_c1_g4"
## [53] "TRINITY_DN8181_c1_g2" "TRINITY_DN14409_c1_g1"
## [55] "TRINITY_DN11030_c0_g2" "TRINITY_DN10215_c1_g7"
## [57] "TRINITY_DN15447_c1_g2" "TRINITY_DN16292_c0_g6"
## [59] "TRINITY_DN16343_c0_g4" "TRINITY_DN9854_c1_g1"
## [61] "TRINITY_DN9128_c0_g5" "TRINITY_DN12828_c0_g3"
## [63] "TRINITY_DN11641_c0_g1" "TRINITY_DN9448_c2_g4"
## [65] "TRINITY_DN10128_c0_g1" "TRINITY_DN9396_c0_g6"
## [67] "TRINITY_DN10699_c0_g1" "TRINITY_DN12162_c0_g7"
## [69] "TRINITY_DN8196_c1_g1" "TRINITY_DN6666_c2_g1"
## [71] "TRINITY_DN15779_c0_g8" "TRINITY_DN13628_c0_g2"
## [73] "TRINITY_DN11639_c0_g1" "TRINITY_DN8514_c0_g5"
## [75] "TRINITY_DN12923_c0_g2" "TRINITY_DN9187_c0_g2"
## [77] "TRINITY_DN12753_c2_g1" "TRINITY_DN10100_c0_g1"
## [79] "TRINITY_DN6493_c3_g4" "TRINITY_DN8669_c0_g5"
## [81] "TRINITY_DN6372_c0_g3" "TRINITY_DN13516_c1_g6"
## [83] "TRINITY_DN7705_c2_g6" "TRINITY_DN18960_c1_g8"
## [85] "TRINITY_DN8597_c0_g1" "TRINITY_DN13972_c8_g1"
## [87] "TRINITY_DN11641_c1_g1" "TRINITY_DN12766_c0_g2"
## [89] "TRINITY_DN9367_c2_g7" "TRINITY_DN13428_c1_g5"
## [91] "TRINITY_DN10648_c1_g3" "TRINITY_DN9128_c0_g4"
## [93] "TRINITY_DN16438_c1_g9" "TRINITY_DN16292_c0_g5"
## [95] "TRINITY_DN10054_c0_g5" "TRINITY_DN18739_c1_g6"
## [97] "TRINITY_DN15354_c0_g3" "TRINITY_DN9001_c2_g5"
## [99] "TRINITY_DN6364_c3_g1" "TRINITY_DN17213_c0_g1"
When we get to the conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.
k10_conf_raw_plots <- sm(graph_metrics(k10_conf))
k10_conf_norm <- sm(normalize_expt(k10_conf, transform="log2", norm="quant", convert="cpm"))
k10_conf_norm_plots <- sm(graph_metrics(k10_conf_norm))
k10_conf_normbatch <- sm(normalize_expt(k10_conf, batch="svaseq", transform="log2",
convert="cpm", filter=TRUE))
k10_conf_normbatch_plots <- sm(graph_metrics(k10_conf_normbatch))
Let us see how the k10 conf data appears
pp(file="images/30_k10conf_libsize.pdf", image=k10_conf_raw_plots$libsize)
## Wrote the image to: images/30_k10conf_libsize.pdf
k10conf_density <- sm(k10_conf_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))
pp(file="images/31_k10conf_density.pdf", image=k10conf_density)
## Wrote the image to: images/31_k10conf_density.pdf
pp(file="images/32_k10conf_boxplot.pdf", image=k10_conf_raw_plots$boxplot)
## Wrote the image to: images/32_k10conf_boxplot.pdf
pp(file="images/33_k10conf_corheat.pdf", image=k10_conf_norm_plots$corheat)
## Wrote the image to: images/33_k10conf_corheat.pdf
## This plot is a particularly nice example of why 1003/1008 are annoying.
pp(file="images/34_k10conf_smc.pdf", image=k10_conf_norm_plots$smc)
## Wrote the image to: images/34_k10conf_smc.pdf
pp(file="images/35_k10conf_disheat.pdf", image=k10_conf_norm_plots$disheat)
## Wrote the image to: images/35_k10conf_disheat.pdf
pp(file="images/36_k10conf_smd.pdf", image=k10_conf_norm_plots$smd)
## Wrote the image to: images/36_k10conf_smd.pdf
pp(file="images/37_k10conf_pca.pdf", image=k10_conf_norm_plots$pcaplot)
## Wrote the image to: images/37_k10conf_pca.pdf
## See if the additional information in the 'batch' factor made a difference
pp(file="images/38_k10conf_normbatch_pca.pdf", image=k10_conf_normbatch_plots$pcaplot)
## Wrote the image to: images/38_k10conf_normbatch_pca.pdf
k8_conf_raw_plots <- sm(graph_metrics(k8_conf))
k8_conf_norm <- sm(normalize_expt(k8_conf, transform="log2", norm="quant", convert="cpm"))
k8_conf_norm_plots <- sm(graph_metrics(k8_conf_norm))
k8_conf_normbatch <- sm(normalize_expt(k8_conf, batch="svaseq", transform="log2",
convert="cpm", filter=TRUE))
k8_conf_normbatch_plots <- sm(graph_metrics(k8_conf_normbatch))
Let us see how the k8 conf data appears
pp(file="images/39_k8conf_libsize.pdf", image=k8_conf_raw_plots$libsize)
## Wrote the image to: images/39_k8conf_libsize.pdf
k8_conf_density <- sm(k8_conf_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))
pp(file="images/40_k8conf_density.pdf", image=k8_conf_density)
## Wrote the image to: images/40_k8conf_density.pdf
pp(file="images/41_k8conf_boxplot.pdf", image=k8_conf_raw_plots$boxplot)
## Wrote the image to: images/41_k8conf_boxplot.pdf
pp(file="images/42_k8conf_corheat.pdf", image=k8_conf_norm_plots$corheat)
## Wrote the image to: images/42_k8conf_corheat.pdf
pp(file="images/43_k8conf_smc.pdf", image=k8_conf_norm_plots$smc)
## Wrote the image to: images/43_k8conf_smc.pdf
pp(file="images/44_k8conf_disheat.pdf", image=k8_conf_norm_plots$disheat)
## Wrote the image to: images/44_k8conf_disheat.pdf
pp(file="images/45_k8conf_smd.pdf", image=k8_conf_norm_plots$smd)
## Wrote the image to: images/45_k8conf_smd.pdf
pp(file="images/46_k8conf_pca.pdf", image=k8_conf_norm_plots$pcaplot)
## Wrote the image to: images/46_k8conf_pca.pdf
pp(file="images/47_k8conf_tsne.pdf", image=k8_conf_norm_plots$tsneplot)
## Wrote the image to: images/47_k8conf_tsne.pdf
## See if the additional information in the 'batch' factor made a difference
pp(file="images/48_k8conf_normbatch_pca.pdf", image=k8_conf_normbatch_plots$pcaplot)
## Wrote the image to: images/48_k8conf_normbatch_pca.pdf
pp(file="images/49_k8conf_normbatch_tsne.pdf", image=k8_conf_normbatch_plots$tsneplot)
## Wrote the image to: images/49_k8conf_normbatch_tsne.pdf
When we get to the count_conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.
k10_conf_count_raw_plots <- sm(graph_metrics(k10_conf_count))
k10_conf_count_norm <- sm(normalize_expt(k10_conf_count, transform="log2",
norm="quant", convert="cpm"))
k10_conf_count_norm_plots <- sm(graph_metrics(k10_conf_count_norm))
k10_conf_count_normbatch <- sm(normalize_expt(k10_conf_count, batch="svaseq", transform="log2",
convert="cpm", filter=TRUE))
k10_conf_count_normbatch_plots <- sm(graph_metrics(k10_conf_count_normbatch))
Let us see how the k10 count_conf data appears
pp(file="images/50_k10confcount_libsize.pdf", image=k10_conf_count_raw_plots$libsize)
## Wrote the image to: images/50_k10confcount_libsize.pdf
k10confcount_density <- sm(k10_conf_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))
pp(file="images/51_k10confcount_density.pdf", image=k10confcount_density)
## Wrote the image to: images/51_k10confcount_density.pdf
pp(file="images/52_k10confcount_boxplot.pdf", image=k10_conf_count_raw_plots$boxplot)
## Wrote the image to: images/52_k10confcount_boxplot.pdf
pp(file="images/53_k10confcount_corheat.pdf", image=k10_conf_count_norm_plots$corheat)
## Wrote the image to: images/53_k10confcount_corheat.pdf
pp(file="images/54_k10confcount_smc.pdf", image=k10_conf_count_norm_plots$smc)
## Wrote the image to: images/54_k10confcount_smc.pdf
pp(file="images/55_k10confcount_disheat.pdf", image=k10_conf_count_norm_plots$disheat)
## Wrote the image to: images/55_k10confcount_disheat.pdf
pp(file="images/56_k10confcount_smd.pdf", image=k10_conf_count_norm_plots$smd)
## Wrote the image to: images/56_k10confcount_smd.pdf
pp(file="images/57_k10confcount_pca.pdf", image=k10_conf_count_norm_plots$pcaplot)
## Wrote the image to: images/57_k10confcount_pca.pdf
## See if the additional information in the 'batch' factor made a difference
pp(file="images/58_k10confcount_normbatch_pca.pdf", image=k10_conf_count_normbatch_plots$pcaplot)
## Wrote the image to: images/58_k10confcount_normbatch_pca.pdf
pp(file="images/59_k10confcount_tsne.pdf", image=k10_conf_count_normbatch_plots$tsneplot)
## Wrote the image to: images/59_k10confcount_tsne.pdf
k8_conf_count_raw_plots <- sm(graph_metrics(k8_conf_count))
k8_conf_count_norm <- sm(normalize_expt(k8_conf_count, transform="log2",
norm="quant", convert="cpm"))
k8_conf_count_norm_plots <- sm(graph_metrics(k8_conf_count_norm))
k8_conf_count_normbatch <- sm(normalize_expt(k8_conf_count, batch="svaseq",
transform="log2", convert="cpm"))
k8_conf_count_normbatch_plots <- sm(graph_metrics(k8_conf_count_normbatch))
Let us see how the k8 count_conf data appears
pp(file="images/60_k8confcount_libsize.pdf", image=k8_conf_count_raw_plots$libsize)
## Wrote the image to: images/60_k8confcount_libsize.pdf
k8confcount_density <- sm(k8_conf_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))
pp(file="images/61_k8confcount_density.pdf", image=k8confcount_density)
## Wrote the image to: images/61_k8confcount_density.pdf
pp(file="images/62_k8confcount_boxplot.pdf", image=k8_conf_count_raw_plots$boxplot)
## Wrote the image to: images/62_k8confcount_boxplot.pdf
pp(file="images/63_k8confcount_corheat.pdf", image=k8_conf_count_norm_plots$corheat)
## Wrote the image to: images/63_k8confcount_corheat.pdf
pp(file="images/64_k8confcount_smc.pdf", image=k8_conf_count_norm_plots$smc)
## Wrote the image to: images/64_k8confcount_smc.pdf
pp(file="images/65_k8confcount_disheat.pdf", image=k8_conf_count_norm_plots$disheat)
## Wrote the image to: images/65_k8confcount_disheat.pdf
pp(file="images/66_k8confcount_smd.pdf", image=k8_conf_count_norm_plots$smd)
## Wrote the image to: images/66_k8confcount_smd.pdf
pp(file="images/67_k8confcount_pca.pdf", image=k8_conf_count_norm_plots$pcaplot)
## Wrote the image to: images/67_k8confcount_pca.pdf
## See if the additional information in the 'batch' factor made a difference
pp(file="images/68_k8confcount_normbatch_pca.pdf", image=k8_conf_count_normbatch_plots$pcaplot)
## Wrote the image to: images/68_k8confcount_normbatch_pca.pdf
pp(file="images/69_k8confcount_tsne.pdf", image=k8_conf_count_normbatch_plots$tsneplot)
## Wrote the image to: images/69_k8confcount_tsne.pdf
Variance Partition is a neat tool to see what experimental factors have the most variance associated with them. It is slow, so I will use it on the smallest data set we have created (the confidence/count filtered).
k8_conf_count_varpart <- varpart(expt=k8_conf_count, predictor=NULL,
factors=c("condition", "batch"),
modify_expt=TRUE)
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.4 min
## Placing factor: condition at the beginning of the model.
k8_conf_count <- k8_conf_count_varpart[["modified_expt"]]
k8_conf_count_varpart$percent_plot
## These genes are the most likely to be interesting for the difference between non/embryogenic
k8_conf_count_varpart$partition_plot
## Another fun thing to use is pca_information()
## testing <- pca_information(expt_data=k8_conf_count)
tt <- sm(library(ggplot2))
maximum <- plot_libsize(k10_raw, text=FALSE)
maximum_alpha <- maximum$table
maximum_alpha$alpha <- alpha(maximum_alpha$colors, 0.4)
conf <- plot_libsize(k10_conf, text=FALSE)
conf_alpha <- conf$table
conf_alpha$alpha <- alpha(conf_alpha$colors, 0.6)
count <- plot_libsize(k10_count, text=FALSE)
count_alpha <- count$table
count_alpha$alpha <- alpha(count_alpha$colors, 0.8)
conf_count <- plot_libsize(k10_conf_count, text=FALSE)
conf_count_alpha <- conf_count$table
conf_count_alpha$alpha <- alpha(conf_count_alpha$colors, 1.0)
all <- rbind(maximum_alpha, conf_alpha,
count_alpha, conf_count_alpha)
all <- as.data.frame(all)
## In this instance, they are stacked on top of 0, which I suspect is the goal.
columns <- ggplot(all, aes(x=id, y=sum)) +
geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
theme(axis.text=ggplot2::element_text(size=10, colour="black"),
axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
legend.position="none")
maximum <- plot_libsize(k8_raw, text=FALSE)
maximum_alpha <- maximum$table
maximum_alpha$alpha <- alpha(maximum_alpha$colors, 0.4)
conf <- plot_libsize(k8_conf, text=FALSE)
conf_alpha <- conf$table
conf_alpha$alpha <- alpha(conf_alpha$colors, 0.6)
count <- plot_libsize(k8_count, text=FALSE)
count_alpha <- count$table
count_alpha$alpha <- alpha(count_alpha$colors, 0.8)
conf_count <- plot_libsize(k8_conf_count, text=FALSE)
conf_count_alpha <- conf_count$table
conf_count_alpha$alpha <- alpha(conf_count_alpha$colors, 1.0)
all <- rbind(maximum_alpha, conf_alpha,
count_alpha, conf_count_alpha)
all <- as.data.frame(all)
## In this instance, they are stacked on top of 0, which I suspect is the goal.
columns <- ggplot(all, aes(x=id, y=sum)) +
geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
theme(axis.text=ggplot2::element_text(size=10, colour="black"),
axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
legend.position="none")
pp(file="images/70_stacked_libsize_columns.pdf",image=columns)
## Wrote the image to: images/70_stacked_libsize_columns.pdf
Let us assume for a moment that we will choose a couple of the above for looking at more closely. I suspect k8_count_conf, k10_count_conf, and k8_count and k8_conf are good candidates.
output <- paste0("excel/k8_conf_count_by_gene-v", ver, ".xlsx")
#if (!file.exists(output)) {
k8_conf_count_written <- sm(write_expt(k8_conf_count, excel=output, violin=TRUE))
#}
output <- paste0("excel/k10_conf_count_by_gene-v", ver, ".xlsx")
#if (!file.exists(output)) {
k10_conf_count_written <- sm(write_expt(k10_conf_count, excel=output, violin=TRUE))
#}
output <- paste0("excel/k8_count_by_gene-v", ver, ".xlsx")
#if (!file.exists(output)) {
k8_count_written <- sm(write_expt(k8_count, excel=output, violin=TRUE))
#}
output <- paste0("excel/k8_conf_by_gene-v", ver, ".xlsx")
#if (!file.exists(output)) {
k8_conf_written <- sm(write_expt(k8_conf, excel=output, violin=TRUE))
#}
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## tmp <- sm(saveme(filename=this_save))
}
For the moment, I will assume we only want to deal with the 8 samples. However, I will try a count-filtered run, an annotation filtered run, and both.
keepers <- list(
"embryo" = c("embryogenic", "nonembryogenic"))
k8_count_de <- sm(all_pairwise(k8_count))
k10_count_de <- sm(all_pairwise(k10_count))
k8_count_tables <- sm(combine_de_tables(
k8_count_de, keepers=keepers, type="deseq",
excel=paste0("excel/k8_count_by_gene_de-v", ver, ".xlsx"),
abundant_excel=paste0("excel/k8_count_by_gene_abundant-v", ver, ".xlsx"),
sig_excel=paste0("excel/k8_count_by_gene_sig-v", ver, ".xlsx")))
k10_count_tables <- sm(combine_de_tables(
k10_count_de, keepers=keepers, type="deseq",
excel=paste0("excel/k10_count_by_gene_de-v", ver, ".xlsx"),
abundant_excel=paste0("excel/k10_count_by_gene_abundant-v", ver, ".xlsx"),
sig_excel=paste0("excel/k10_count_by_gene_sig-v", ver, ".xlsx")))
k8k10_comparison <- compare_de_results(k8_count_tables, k10_count_tables)
extract_fasta_sequences <- function(k8_count_tables) {
sig_table <- k8_count_tables[["significant"]][["deseq"]][["ups"]][[1]]
subset_df <- sig_table[, c("transcriptid", "transcriptseq")]
names <- subset_df[[1]]
sequences <- subset_df[[2]]
seq <- as.list(sequences)
names(seq) <- names
testing <- seqinr::write.fasta(seq, names, file.out="test_ups.fasta")
sig_table <- k8_count_tables[["significant"]][["deseq"]][["downs"]][[1]]
subset_df <- sig_table[, c("transcriptid", "transcriptseq")]
names <- subset_df[[1]]
sequences <- subset_df[[2]]
seq <- as.list(sequences)
names(seq) <- names
testing <- seqinr::write.fasta(seq, names, file.out="test_downs.fasta")
}
In the sample estimation worksheet, we created a list of genes putatively strongest in terms of the signal from the principal component associated with batch (Comp 2). Let us look at the differential expression data and see where these genes lie.
test_batch_idx <- rownames(k8_count_tables[["data"]][[1]]) %in% most_by_batch
test_batch <- k8_count_tables[["data"]][[1]][test_batch_idx, ]
strong_batch_de_idx <- abs(test_batch[["deseq_logfc"]] > 1) & test_batch[["deseq_adjp"]] <= 0.05
strong_batch_de <- test_batch[strong_batch_de_idx, ]
## The following genes are not to be trusted I think.
knitr::kable(strong_batch_de)
geneid | transcriptid | blastxname | blastxqueryloc | blastxhitloc | blastxidentity | blastxevalue | blastxrecname | blastxtaxonomy | blastpname | blastpqueryloc | blastphitloc | blastpidentity | blastpevalue | blastprecname | blastptaxonomy | rrnasubunit | rrnasubunitregion | protid | protcoords | pfamdata | signalpdata | tmhmmexpaa | tmhmmhelices | tmhmmtopology | eggnogid | eggnogdescription | keggdata | geneontologyblast | geneontologypfam | transcriptseq | peptide | length | limma_logfc | limma_adjp | deseq_logfc | deseq_adjp | edger_logfc | edger_adjp | limma_ave | limma_t | limma_b | limma_p | deseq_basemean | deseq_lfcse | deseq_stat | deseq_p | edger_logcpm | edger_lr | edger_p | basic_nummed | basic_denmed | basic_numvar | basic_denvar | basic_logfc | basic_t | basic_p | basic_adjp | limma_adjp_fdr | deseq_adjp_fdr | edger_adjp_fdr | basic_adjp_fdr | lfc_meta | lfc_var | lfc_varbymed | p_meta | p_var | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TRINITY_DN10128_c0_g1 | TRINITY_DN10128_c0_g1 | TRINITY_DN10128_c0_g1_i1 | 0.00 | 1 | 0.00 | 1 | 0.00 | 0 | TTGAGATATAGTTGATTGATTGATTGGAACAGTTGAATTGCGAGTAAACTGATTACCGATATAGATAGATTTATTGAGATAGAGTTGATTGATCGATTGAAACAATTGAATTGCGAGTAACTGAGTGTAGATATAGATAGATTGGTTGAGATTGAGATATAGTTGATTGATTGAAAGTTGAAACAGTTGAATTGTGAGTAAACTGAATACAAATATAGATAGATTGGTTGAGATTGATATATAGTTGATTGATTGAAACAGTTGAATTGTGAGTAAACTGAATACAAATATAGATAGATTGGTTGAGATTGAGATATAGTTGATTGATTGAAACAGCTGAATTGTGAGTAAACTGAATACAGATATAGATAGATTTATTGAGATTGAGATACAGTTGATTGATTGATTGAAACAGTTGAATTGTGAGTAAACTGAATACAAATATAGATAGATTGGTTGAGATTGATATATAGTTGATTGATTGAAACAGTTGAATTGTGAGTAAACCGAATACAAATATAGATAGATTTATTGAGATTGAGATATAGATGATTGATTGATTGAAACAGTTGAATTGCGAGTAAACAAAATATCGATATAGATAGATTTATTGAGATTGAGATACAGTTGATTGATTGATTGAAACAGTTGACTTGTGAGTAAACTGAATA | 671 | 2.144 | 0.1032 | 2.645 | 0.0323 | 2.456 | 0.0544 | 0.2185 | -3.142 | -2.332 | 0.0074 | 24.52 | 0.9289 | 2.848 | 0.0044 | 1.608 | 8.642 | 0.0033 | 0.5070 | 1.779 | 3.561e+00 | 1.093e+00 | 1.272 | 1.413 | 2.207e-01 | 5.090e-01 | 1.032e-01 | 6.695e-03 | 5.006e-03 | 2.847e-01 | 2.557 | 1.502e+00 | 5.876e-01 | 5.034e-03 | 4.572e-06 | |||||||||||||||||||||||
TRINITY_DN11001_c0_g7 | TRINITY_DN11001_c0_g7 | TRINITY_DN11001_c0_g7_i1 | 0.00 | 1 | 0.00 | 1 | TRINITY_DN11001_c0_g7::TRINITY_DN11001_c0_g7_i1::g.168124::m.168124 | 86-1168[+] | 0.00 | 0 | GGTATCAACGCAGAGTACGGGGCTCACTCAAAAATCTCCGCCCTAATACCAAATCCCAATTCTTCAAGCTCTCTTTCTCTTATCAATGGCGACAATGGCTTCCACCTTGGCTTCCTCCGTGATTTCCAAGACAAAGTTCCTTGACACACACAAATCATCATCCTTGTATGGTGTGCCAATTTCCTCACAAGCTAGATTGAAAATCGTAAAATCGACTCACCAGAACATGTCTGTTTCTATGTCTGCTGATGCTTCTCCTCCTCCGCCTTACGATCTCGGAAGTTTCAGTTTTAATCCGATTAAGGAATCGATTGTTGCTCGGGAAATGACACGTAGGTACATGACGGATATGATCACTTATGCTGATACTGACGTCGTCGTTGTTGGTGCTGGATCTGCTGGTCTATCTTGTGCTTATGAGCTCAGCAAGAACCCTGATGTTCAGGTGGCCATCCTTGAGCAATCTGTGAGCCCTGGTGGAGGTGCTTGGCTAGGCGGACAACTCTTCTCAGCTATGATTGTGCGTAAGCCAGCACATCTTTTCTTGAACGAGCTAGGCATTGACTACGACGAGCAAGACAACTACGTGGTCGTCAAGCACGCTGCCTTGTTTACCTCGACCATCATGAGCAAGCTTTTGGCCAGGCCAAATGTGAAGCTCTTCAATGCTGTTGCAACAGAGGACCTTATCGTGAAGAACGGAAGAGTTGGTGGTGTTGTCACCAACTGGTCTTTGGTTTCCCAGAACCATGACACACAATCCTGCATGGACCCTAATGTTATGGAGGCTAAGGTTGTGGTTAGCTCCTGCGGCCACGACGGTCCCATGGGTGCCACTGGTGTTAAGAGGCTTAGGAGCATTGGCATGATTAACAGTGTTCCTGGAATGAAAGCTTTGGATATGAACACTGCTGAGGATGCAATTGTTAGACTTACCAGAGAAGTTGTACCTGGAATGATTATAACAGGGATGGAAGTTGCTGAAATTGATGGAGCACCAAGAATGGGTCCAACTTTTGGAGCTATGATGATATCAGGGCAGAAGGCTGCGCACCTTGCCCTACGGGCGTTGGGATTGCCCAACGCCCTTGACGGAACTGCAGAAACAAGCATCCACCCGGAGCTTATGTTGGCTGCAGCTGATGAAGCTGAAACTGCTGATGCTTAAATGTAATTATAGTTCCTAAGAAAGGATATGAGAAATATTAGTTTTTCCTAGATGAGCATGGTTGTGTGATGGAGATAAATTGGCTAGTTGGGGATGAATAAAAATCTGCGAGACCTTTGTCTGTTTCATGTCAATTCTTCTTTTACTTTTGATTTATTTTGGGACTTAGAAACTATATGTGAAAAAAGTGGTAAGAGTAAGACTTGTTTTAGTCGTTGAATAATCCTATGGGCCTCATTTTGGTACTTGATGTGGTTTGGGCCTCGCGGGAGAAGCTGTCTCTTATAC | 1456 | 2.442 | 0.0992 | 2.884 | 0.0098 | 2.763 | 0.0469 | 0.8301 | -3.175 | -2.321 | 0.0069 | 47.12 | 0.8753 | 3.295 | 0.0010 | 2.490 | 9.033 | 0.0027 | 0.4770 | 2.438 | 5.537e+00 | 1.171e+00 | 1.961 | 1.636 | 1.735e-01 | 4.583e-01 | 9.918e-02 | 1.533e-03 | 4.058e-03 | 2.303e-01 | 2.800 | 1.451e+00 | 5.180e-01 | 3.529e-03 | 9.467e-06 | |||||||||||||||||||||
TRINITY_DN17811_c0_g5 | TRINITY_DN17811_c0_g5 | TRINITY_DN17811_c0_g5_i1 | AZF2_ARATH | 866-375 | 67-187 | 32.12 | 0 | Zinc finger protein AZF2; | Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis | AZF2_ARATH | 128-353 | 67-242 | 30.87 | 0 | Zinc finger protein AZF2; | Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis | TRINITY_DN17811_c0_g5::TRINITY_DN17811_c0_g5_i1::g.74924::m.74924 | 177-1247[-] | 22.83 | 1 | i12-34o | COG5048 | Zinc finger protein | KEGG:ath:AT3G19580 | GO:0005634,cellular_component,nucleus; GO:0003677,molecular_function,DNA binding; GO:0046872,molecular_function,metal ion binding; GO:0043565,molecular_function,sequence-specific DNA binding; GO:0003700,molecular_function,transcription factor activity, sequence-specific DNA binding; GO:0044212,molecular_function,transcription regulatory region DNA binding; GO:0009738,biological_process,abscisic acid-activated signaling pathway; GO:0009793,biological_process,embryo development ending in seed dormancy; GO:0042538,biological_process,hyperosmotic salinity response; GO:0045892,biological_process,negative regulation of transcription, DNA-templated; GO:0009737,biological_process,response to abscisic acid; GO:0010200,biological_process,response to chitin; GO:0009414,biological_process,response to water deprivation; GO:0006351,biological_process,transcription, DNA-templated | AAAGTAAACTACCAACAATACTTGATGAAAACAACAACAAAAAAAAAAAAAATTAATTTAGGAAACAAATTTATGTATATTTATGAACAACAATAATTCTAAAAATTCATTATCACAAATTTTTGTGACTAATTAATTGATCAATAAGTGTAAAAAAAAATGAAAAACTAAAATTATTAGTAATAGCAATCAACCAAAGGGGCTGCAGATAACACAAGACTTTGTTCAAATTTGTGATCATCTTCTGATGGTGCTGGTAAATTAAGATCAAGAGATAATATATTTCTTCTTGATTTTTCGTCGTGATGGGATGAAGATGATTCTGAAACATTATTGTTTGTAGTAGAGGTTGTCATCGCTGTAGGAGGTGGTCTGTGTCTTCTCATGTGTCCACCTAAAGCTTGTCCCGATGAAAACTCCGATCCACAAATTGAACATTCGTGGATTTTGGTCTTGATGTTATTAACAACTATTTTGTTTGAGAGAGATGTACTATTAATTTTATTGAGACGCCCTTGATCCTTATCGTTCTCTAGATGATGATGATGTGAATCATCATGACCGCCACCACTAGCAGTGCCAGTATTGATAGTGACAGATTTTTTCTCTTCCACTATGGTTTTGGGCTTTTTGTGACTTGCTCTGTGTCCACCAAGTGCTTGAAATGAAGGGAAAGTTCTATTGCAAGTTTTGCATTCATAGACATAAAAGCCAGCTTTTCCAGTTGTACTATTACTAGTGGTCATTTCTGCAAATTTCCTACTGCTGATTTTCTCCTTTTTGACTTCACTTATTACATCTTGTACTTTCTGGCCACATCCACTTTGTGCTAATAGAATTAAACAATTAGCCATATCTTCATCTTCCTCAGATGTGCTAGTTGTTGAAATTTGAGTAGTAGAAATGGTTGGTGAATATTGAATATTTGTTGACTTGTAAAAATCACCTCCGCTACCACCATCACCACCACCAGCATCATCGCTTCCACCAGCGATCGATGAGCTGGTGGTAGTGACAGTCGTCGTCCCCGCTATATTAAGTGGGGACGATGGTCGCGATCGCTTAGTGCGCTTTCCTTTGAAAGCGCATGAGCGATCGGTGAAGTTTTGACTCAAAACTTCCATGACTGCAGAAACAAGAAAAGAAGGCAAAACAAGTGACAAAGTGAACAAGAAGAAGAGTGATAGAAAGAAAGAATATAAAGAGAGAGAAAGAGTGATGTAATTATCCCGTACTCTGCGTTGATAC | 1248 | 2.803 | 0.0092 | 2.995 | 0.0006 | 2.901 | 0.0006 | 0.4279 | -5.071 | 1.004 | 0.0002 | 41.17 | 0.7266 | 4.122 | 0.0000 | 2.352 | 19.200 | 0.0000 | 0.3112 | 2.383 | 3.526e+00 | 4.157e+00 | 2.072 | 1.445 | 1.989e-01 | 4.854e-01 | 9.166e-03 | 5.997e-05 | 1.893e-05 | 2.597e-01 | 3.006 | 1.448e+00 | 4.815e-01 | 7.826e-05 | 8.777e-09 | ||||||
TRINITY_DN8597_c0_g1 | TRINITY_DN8597_c0_g1 | TRINITY_DN8597_c0_g1_i1 | ISP4_SCHPO | 22-939 | 455-755 | 40.07 | 0 | Sexual differentiation process protein isp4; | Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae; Schizosaccharomyces | ISP4_SCHPO | 8-313 | 455-755 | 40.07 | 0 | Sexual differentiation process protein isp4; | Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae; Schizosaccharomyces | TRINITY_DN8597_c0_g1::TRINITY_DN8597_c0_g1_i1::g.50377::m.50377 | 1-1014[+] | 163.33 | 8 | o15-36i43-65o75-97i129-151o201-223i228-245o250-267i279-301o | KEGG:spo:SPBC29B5.02c | GO:0005783,cellular_component,endoplasmic reticulum; GO:0005789,cellular_component,endoplasmic reticulum membrane; GO:0005794,cellular_component,Golgi apparatus; GO:0016021,cellular_component,integral component of membrane; GO:0005886,cellular_component,plasma membrane; GO:1901584,molecular_function,tetrapeptide transmembrane transporter activity; GO:0015031,biological_process,protein transport; GO:1901583,biological_process,tetrapeptide transmembrane transport | GTATCAACGCAGAGTACGGGGATGAAGAGTTACAAGCAAGTCCCTCAGTGGTGGTTCCTCGCATTATTGGTAGGTAGCATAGCCCTATCACTTGTGATGTGTTTTGTGTGGAAAGAAGACGTGCAGCTGCCGTGGTGGGGCATGTTGTTTGCGTTTGGTCTTGCTTTTATTGTTACTCTCCCTATAGGGGTCATTCAAGCGACTACCAATCAGCAACCTGGATATGACATAATTGCACAGTTCATAATTGGTTATATCCTCCCAGGGAAACCAATCGCGAACTTGCTTTTCAAAATATATGGCCGGACAAGTACTGTTCATGCTCTCTCCTTTTTAGCCGATCTTAAACTTGGTCACTACATGAAAATTCCGCCACGATGCATGTACACAGCTCAGCTCGTGGGGACACTGGTTGCTGGTACAATCAACCTTGCTGTAGCATGGTGGATGTTAGGAAGCATCGAGAACATTTGTGACGTTGAGACTCTTCATCCGGACAGCCCATGGACCTGTCCTAAATTCCGAGTGACTTTTGATGCATCTGTCATATGGGGTCTGATTGGACCACAACGGTTATTTGGTCCCGGAGGATTATACAGGAACTTGGTATGGTTATTCCTTATAGGTGCATTGCTACCGGTGCCTATTTGGGTGTTAAGCAAAATCTTCCCGGAAAAGAAATGGATCCCATTGATCAACATACCTGTTATATCATATGGTTTCGCGGGAATGCCACCAGCCACACCAACGAATATTGCTAGCTGGCTTATAACGGGTATGATATTCAACTATTTCGTGTTCAAATACAGAAAAGAATGGTGGAAGAAATACAACTATGTTCTGTCAGCCGCGTTGGATGCTGGAACAGCTTTTATGGGTGTTTTATTGTTTTTCGCGTTGCAAAACGAGGGTAAAAACTTGAAATGGTGGGGAACAGAGTTAGACCATTGTCCCTTAGCAACTTGTCCAACAGCCCCTGGGATCATAGTGGAAGGATGTCCAGTTTTCAAGTAGGCCTAACAACAAGAATGAATGGTTCTAAGCATTGTTGATGTAATTTTGATAGTTTTTTTGTGTTGTTCAACTAAATATATTTTGTTATGAGAGTTTTCTGATCTTGTTTGCAGATGAGCTAATGAAAGACAAAATATTGTGAATTGAAATTGTGATTAACTTTTTACTAAATCAAAATATTGTTAAT | 1201 | 2.184 | 0.2483 | 2.724 | 0.0492 | 2.533 | 0.1689 | 0.4656 | -2.341 | -3.715 | 0.0350 | 35.02 | 1.0200 | 2.669 | 0.0076 | 2.218 | 5.484 | 0.0192 | 0.0566 | 2.195 | 6.038e+00 | 1.931e+00 | 2.139 | 1.111 | 3.196e-01 | 5.966e-01 | 2.482e-01 | 1.145e-02 | 2.761e-02 | 3.918e-01 | 2.607 | 1.515e+00 | 5.813e-01 | 2.060e-02 | 1.891e-04 | ||||||||
TRINITY_DN9223_c1_g4 | TRINITY_DN9223_c1_g4 | TRINITY_DN9223_c1_g4_i1 | EB1C_ARATH | 118-1104 | 1-323 | 62.46 | 0 | Microtubule-associated protein RP/EB family member 1C; | Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis | EB1C_ARATH | 40-368 | 1-323 | 62.46 | 0 | Microtubule-associated protein RP/EB family member 1C; | Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae; Arabidopsis | TRINITY_DN9223_c1_g4::TRINITY_DN9223_c1_g4_i1::g.67832::m.67832 | 1-1119[+] | 0.00 | 0 | COG5217 | microtubule-associated protein RP EB family member | KEGG:ath:AT5G67270; KO:K10436 | GO:0005618,cellular_component,cell wall; GO:0005874,cellular_component,microtubule; GO:0005730,cellular_component,nucleolus; GO:0005634,cellular_component,nucleus; GO:0009524,cellular_component,phragmoplast; GO:0005819,cellular_component,spindle; GO:0008017,molecular_function,microtubule binding; GO:0051301,biological_process,cell division; GO:0030865,biological_process,cortical cytoskeleton organization; GO:0007067,biological_process,mitotic nuclear division; GO:0009652,biological_process,thigmotropism | GGTATCAACGCAGAGTACGGGGAAACCCTCATTTCAAACTCTCTTTCTCTCTCCGTCCGCCCTCTCTCTCTCTCTCTTTCTACAGAGCCTTCTAGAGATCTTCTCCGAAGAGGAGAAATGGCAGCACACATAGGAATGATGGATGGGGCATACTTCGTTGGCAGATCGGAAATCCTGGCTTGGATCAATTCCACACTTCATCTTAATCTTTCCAAAGTCGAAGAGGCGTGTACTGGCGCGGTTCATTGCCAGCTAATGGATGCGGCTCACCCAGGGCTAGTCCCTATGCACAAGGTCAATTTTGATGCGAAGAATGAATATGAGATGATCCAGAACTATAAGGTTCTTCAGGATGTATTCACCAAACTCAAGATAACCAAGCACATTGAGGTCAGCAAACTGGTGAAAGGAAGACCTCTTGACAACCTGGAGTTCATGCAATGGATGAAGAGATACTGCGATTCTGTTAATGGAGGCAATATTCACAGTTACAATCCTCTTGAAAGAAGGGAAGGCTGCAAGGGTGCCAAAGAACTGAACAAAAGATCTGCCCCTTCACAAAATGCTACCAAAAATGCCTCAACTGCTTCAAAGCATGTCACCCATAACACTAGAAGAATTGATGTACCTCATGTAAGCAGCACAAGTCAATCTGGTAAGATCTCCAGGCCTTCTTCAAGTGGAGGAGCGTCAACCTATTCTGAAACAGAACGTACAGCACATGAACAACAGATTACGGAGTTGAAGTTATCAGTGGATAGTCTTGAAAAGGAAAGGGACTTCTACTTTTCAAAGTTGAGGGACATTGAAATCCTGTGCCAATGTCCAGAGATTGAAAACCTACCTGTGGTTGAAGCAGTGAAGAGGATTCTGTATGCTATGGATGATGATGCATCACTGGTAGATACTGAAGCCATGATATCTGAGCAACACCAGCAAGTAGAGACATTGAGTTGCATATCTGAAGAGGCAGAAGAAAGGCTGAGGGTAGACACTCAAAAGAGAAAAAACATTGTCAATGTTGACGTTGACCATGCTGCCAGCAACACATTGTCTCCAAGGCAAAGGATGGCTGATGCTTCTGATGTCCATTGCAGTGGATCACTTGTGACATATTGATCAACCTTGTCTCCTGGATAAGTAACTTGTTCTATCTGGGTTGTGGAATAATTCCTGCTACCGTCTTTGTTACTCTGTAGTTAGTTATATGTTGTCAGAGCATACTTTATTTCTAAGCAATACCGTTTGAGGAATTCCATCAGTATAGAAGTAGTGAATATATGAACCTATCTGGTTTGTCTGATTGTCGGTGTACAACATTGCCCGATGTTATTGCTGTTGTATACTGCATTGTAGTGCAAGTCTTGGATTATGATGATTCCTGTGGTTTAATGGATACTAAAGTTGGTATTGGTATTGGTTTTGATTTTAGAAAATATCTTGAAAACAAACATAAGGTAAATTGACAGTCTCCACACAACTTAGT | 1486 | 1.297 | 0.4481 | 2.497 | 0.0432 | 2.374 | 0.2114 | 1.0040 | -1.681 | -4.864 | 0.1156 | 47.83 | 0.9164 | 2.725 | 0.0064 | 2.462 | 4.793 | 0.0286 | 1.2260 | 2.853 | 6.885e+00 | 6.604e-01 | 1.626 | 1.249 | 2.873e-01 | 5.698e-01 | 4.480e-01 | 9.701e-03 | 4.023e-02 | 3.577e-01 | 2.162 | 1.865e+00 | 8.627e-01 | 5.020e-02 | 3.330e-03 | |||||||
TRINITY_DN9367_c2_g7 | TRINITY_DN9367_c2_g7 | TRINITY_DN9367_c2_g7_i1 | 0.00 | 1 | 0.00 | 1 | TRINITY_DN9367_c2_g7::TRINITY_DN9367_c2_g7_i1::g.195987::m.195987 | 1-459[+] | 30.82 | 1 | i21-38o | GGTATCAACGCAGAGTACGGGACAAACAAATTCATCATACAAAACAATATGGCTTCTAATACTTTTTCTATTCTCATTCTTGTTGTGTCTTTATTTTTGTCTCAGTTCTCATTTTCTTTTGGTGATGGTCCCATACCAGTTGTTCACCTCATTAGCAACTTGACTCCTGGTCATACACCAACTATGCAAGTTACTTGTAAATTACAACATTGGCTCCCTTTGCTTAGTGTAACTTTAGTAATAGGCCAAGATTTTCCATTTGATGCTGGTATTATCAATGATATATATGTATGTTATGCTGAATGGGGTGTATTTTATGCTTGGTTTAATGCATATGATCCACTTAAGGAAGACAAAGGCCAACCTGTTGTCTATTTTTCTTTTCGAGATCGCGGTGTTTATAAGAGTTACGATAAAGCTACTTGGACTGGAGTTACAGATTGGAACAACGGACACTAGCTTTTCGTATTTTTATCATGTCAAGATCGAATTTCGAATTTTTTCTTTATGATAAAAATAAAATTATATAATTTGTGTGTGTGAGTCAAGGTGACGGATGATCTGAATGGGCTTCTTAATTCCCCCGGCAATTGTTTGTTTTTGTCTAGGGTTTTCTTGTTATTTTATTGTCTCTTTGATTCACATTAGTATATTGGTTTTATACTACTTTTTTTTTTCATCTACATTTTTGTTATTGTTCTAATTTATGTGGTTAAAAGGCAACCAAATTGAAAAACTTATATAAAATGCATCAACTTTATTTATACGGTAAAAGTTATCATTAACTAGTATTTATAT | 798 | 2.340 | 0.1089 | 2.939 | 0.0089 | 2.811 | 0.0436 | 0.7559 | -3.092 | -2.448 | 0.0082 | 37.64 | 0.8836 | 3.326 | 0.0009 | 2.170 | 9.213 | 0.0024 | 0.5538 | 2.600 | 4.703e+00 | 3.898e-01 | 2.047 | 1.870 | 1.451e-01 | 4.256e-01 | 1.089e-01 | 1.370e-03 | 3.685e-03 | 1.965e-01 | 2.786 | 1.499e+00 | 5.382e-01 | 3.821e-03 | 1.484e-05 |
k8_conf_de <- sm(all_pairwise(k8_conf, parallel=FALSE))
k8_conf_tables <- sm(combine_de_tables(
k8_conf_de, keepers=keepers,
excel=paste0("excel/k8_conf_de-v", ver, ".xlsx"),
sig_excel=paste0("excel/k8_conf_sig-v", ver, ".xlsx")))
k8_conf_count_de <- sm(all_pairwise(k8_conf_count, parallel=FALSE))
k8_conf_count_tables <- sm(combine_de_tables(
k8_conf_count_de, keepers=keepers,
excel=paste0("excel/k8_conf_count_de-v", ver, ".xlsx"),
sig_excel=paste0("excel/k8_conf_count_sig-v", ver, ".xlsx")))
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
}
I want to search the set of up/down genes using goseq, since they changed how they accept annotations, I am changing the goseq search to match the new methods. I think they will be better but am not yet certain.
sig_tables <- k8_count_tables[["significant"]]
sigups <- sig_tables[["deseq"]][["ups"]][[1]]
sigdowns <- sig_tables[["deseq"]][["downs"]][[1]]
putative_ontologies <- load_trinotate_go("reference/trinotate.csv")
## Warning: Expected 3 pieces. Additional pieces discarded in 83811 rows [4,
## 5, 6, 11, 18, 19, 22, 29, 40, 83, 84, 87, 90, 91, 98, 99, 100, 103, 104,
## 105, ...].
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 182246 rows
## [1, 2, 3, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 20, 21, 23, 24, 25, 26,
## 27, ...].
length_db <- putative_ontologies[["length_table"]]
go_db <- putative_ontologies[["go_table"]]
head(go_db)
## ID GO
## 1: TRINITY_DN614_c0_g1 character(0)
## 2: TRINITY_DN654_c0_g1 character(0)
## 3: TRINITY_DN609_c0_g1 character(0)
## 4: TRINITY_DN609_c0_g1 c(GO:0005525
## 5: TRINITY_DN694_c0_g1 c(GO:0005737
## 6: TRINITY_DN694_c0_g1 c(GO:0005737
goseq_up <- sm(simple_goseq(
sig_genes=sigups, go_db=go_db, length_db=length_db,
excel=paste0("excel/goseq_deseq_count_sigup-v", ver, ".xlsx")))
goseq_down <- sm(simple_goseq(
sig_genes=sigdowns, go_db=go_db, length_db=length_db,
excel=paste0("excel/goseq_deseq_count_sigdown-v", ver, ".xlsx")))
topgo_up <- simple_topgo(
sig_genes=sigups, go_db=go_db,
excel=paste0("excel/topgo_deseq_count_sigup-v", ver, ".xlsx"))
## simple_topgo(): Set densities=TRUE for ontology density plots.
## Writing data to: excel/topgo_deseq_count_sigup-v20180310.xlsx.
topgo_down <- simple_topgo(
sig_genes=sigdowns, go_db=go_db,
excel=paste0("excel/topgo_deseq_count_sigdown-v", ver, ".xlsx"))
## simple_topgo(): Set densities=TRUE for ontology density plots.
## Writing data to: excel/topgo_deseq_count_sigdown-v20180310.xlsx.
Sandra also wants to see what groups are represented by the top-n genes in Embryogenic/NonEmbryogenic. The likely subsets for this were made when we did write_expt() in in 02_sample_estimation.html.
median_by_cond <- k8_count_written[["medians"]]
top_1k_embryo_idx <- head(order(median_by_cond[["embryogenic"]], decreasing=FALSE), n=1000)
top_1k_embryo_genes <- rownames(median_by_cond)[top_1k_embryo_idx]
top_1k_embryo <- k8_count_tables[["data"]][[1]][top_1k_embryo_genes, ]
bottom_1k_embryo_idx <- head(order(median_by_cond[["embryogenic"]], decreasing=TRUE), n=1000)
bottom_1k_embryo_genes <- rownames(median_by_cond)[bottom_1k_embryo_idx]
bottom_1k_embryo <- k8_count_tables[["data"]][[1]][bottom_1k_embryo_genes, ]
top_1k_nonembryo_idx <- head(order(median_by_cond[["nonembryogenic"]], decreasing=FALSE), n=1000)
top_1k_nonembryo_genes <- rownames(median_by_cond)[top_1k_nonembryo_idx]
top_1k_nonembryo <- k8_count_tables[["data"]][[1]][top_1k_nonembryo_genes, ]
bottom_1k_nonembryo_idx <- head(order(median_by_cond[["nonembryogenic"]], decreasing=TRUE), n=1000)
bottom_1k_nonembryo_genes <- rownames(median_by_cond)[bottom_1k_nonembryo_idx]
bottom_1k_nonembryo <- k8_count_tables[["data"]][[1]][bottom_1k_nonembryo_genes, ]
top1k_embryo_goseq <- sm(simple_goseq(
sig_genes=top_1k_embryo, go_db=go_db, length_db=length_db,
excel=paste0("excel/goseq_top1k_embryogenic-v", ver, ".xlsx")))
bottom1k_embryo_goseq <- sm(simple_goseq(
sig_genes=bottom_1k_embryo, go_db=go_db, length_db=length_db,
excel=paste0("excel/goseq_bottom1k_embryogenic-v", ver, ".xlsx")))
top1k_nonembryo_goseq <- sm(simple_goseq(
sig_genes=top_1k_nonembryo, go_db=go_db, length_db=length_db,
excel=paste0("excel/goseq_top1k_nonembryo-v", ver, ".xlsx")))
bottom1k_nonembryo_goseq <- sm(simple_goseq(
sig_genes=bottom_1k_nonembryo, go_db=go_db, length_db=length_db,
excel=paste0("excel/goseq_bottom1k_nonembryo-v", ver, ".xlsx")))
neg_ctrl <- random_ontology(input=k8_count, n=1000, go_db=go_db, length_db=length_db,
excel=paste0("excel/goseq_random-v", ver, ".xlsx"))
## Using the ID column from your table rather than the row names.
## Found 1000 genes out of 1000 from the sig_genes in the go_db.
## Found 1000 genes out of 1000 from the sig_genes in the length_db.
## Warning in pcls(G): initial point very close to some inequality constraints
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## Using GO.db to extract terms and categories.
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
## Writing data to: excel/goseq_random-v20180310.xlsx.
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
}
The following sections do not use the gene->transcript maps.
Any annotation data will need to come from our trinotate information… An interesting caveat, kallisto uses a transcript database. tximport is the library used to read the counts/transcript into R, it is able to take a set oF mappings between gene/transcript in order to provide gene counts rather than transcript.
putative_annotations <- load_trinotate_annotations("reference/trinotate.csv")
annot_by_tx <- putative_annotations
rownames(annot_by_tx) <- make.names(putative_annotations[["transcript_id"]], unique=TRUE)
## I split the annotations into one set keyed by gene IDs and one by transcript IDs.
## This way it is possible to make an expressionset of transcripts or genes by
## removing/keeping the tx_gene_map argument.
## Why would you want to do this? you might ask... Well, sadly the annotation database
## Does not always have filled-in data for the first transcript for a given gene ID,
## therefore the annotation information for those genes will be lost when we merge
## the count tables / annotation tables by gene ID.
## This is of course also true for transcript IDs, but for a vastly smaller number of them.
transcript_gene_map <- putative_annotations[, c("transcript_id", "gene_id")]
k10tx_raw <- create_expt(metadata="sample_sheets/all_samples.xlsx",
gene_info=annot_by_tx)
## Reading the sample metadata.
## The sample definitions comprises: 10, 7 rows, columns.
## Reading count tables.
## Finished reading count tables.
## Matched 234330 annotations and counts.
## Bringing together the count matrix and gene information.
putative_annotations <- fData(k10tx_raw) ## A cheap way to get back the unique gene IDs.
dim(exprs(k10tx_raw))
## [1] 234330 10
## On further examination, we decided to remove both samples 1003 and 1008
## which Sandra pointed out are in fact from the same flow cytometry isolation.
k8tx_raw <- create_expt(metadata="sample_sheets/kept_samples.xlsx",
gene_info=annot_by_tx)
## Reading the sample metadata.
## The sample definitions comprises: 8, 7 rows, columns.
## Reading count tables.
## Finished reading count tables.
## Matched 234330 annotations and counts.
## Bringing together the count matrix and gene information.
dim(exprs(k8tx_raw))
## [1] 234330 8
## If we do not use the tx_gene_map information, we get 234,330 transcripts in the data set.
## Instead, we get 58,486 genes.
## Perform a count/gene cutoff here
threshold <- 4
method <- "cbcb" ## other methods include 'genefilter' and 'simple'
k10tx_count <- sm(normalize_expt(k10tx_raw, filter=method, thresh=threshold))
dim(exprs(k10tx_count))
## [1] 42143 10
k8tx_count <- sm(normalize_expt(k8tx_raw, filter=method, thresh=threshold))
dim(exprs(k8tx_count))
## [1] 38571 8
## First get rid of the rRNA
sb_rrna <- putative_annotations[["rrna_subunit"]] != ""
rrna_droppers <- rownames(putative_annotations)[sb_rrna]
k10tx_norrna <- exclude_genes_expt(expt=k10tx_raw, ids=rrna_droppers)
## Before removal, there were 234330 entries.
## Now there are 234309 entries.
## Percent kept: 99.990, 99.988, 99.991, 99.993, 99.992, 99.989, 99.989, 99.987, 99.993, 99.995
## Percent removed: 0.010, 0.012, 0.009, 0.007, 0.008, 0.011, 0.011, 0.013, 0.007, 0.005
k8tx_norrna <- exclude_genes_expt(expt=k8tx_raw, ids=rrna_droppers)
## Before removal, there were 234330 entries.
## Now there are 234309 entries.
## Percent kept: 99.990, 99.988, 99.991, 99.992, 99.989, 99.989, 99.987, 99.995
## Percent removed: 0.010, 0.012, 0.009, 0.008, 0.011, 0.011, 0.013, 0.005
## Keep only the blastx hits with a low e-value and high identity.
filtered_keepers <- putative_annotations[["blastx_evalue"]] <= 1e-10 &
putative_annotations[["blastx_identity"]] >= 60
keepers <- rownames(putative_annotations)[filtered_keepers]
## This excludes all but 33,451 transcripts.
## Now keep only those transcript IDs which fall into the above categories.
k10tx_conf <- exclude_genes_expt(expt=k10tx_norrna, ids=keepers, method="keep")
## Before removal, there were 234309 entries.
## Now there are 33451 entries.
## Percent kept: 25.523, 25.012, 26.033, 25.603, 26.420, 26.603, 27.573, 24.233, 24.821, 27.304
## Percent removed: 74.477, 74.988, 73.967, 74.397, 73.580, 73.397, 72.427, 75.767, 75.179, 72.696
dim(exprs(k10tx_conf))
## [1] 33451 10
k8tx_conf <- exclude_genes_expt(expt=k8tx_norrna, ids=keepers, method="keep")
## Before removal, there were 234309 entries.
## Now there are 33451 entries.
## Percent kept: 25.523, 25.012, 26.033, 26.420, 26.603, 27.573, 24.233, 27.304
## Percent removed: 74.477, 74.988, 73.967, 73.580, 73.397, 72.427, 75.767, 72.696
dim(exprs(k8tx_conf))
## [1] 33451 8
## Perform both the confidence and count filtration
k10tx_conf_count <- sm(normalize_expt(k10tx_conf, filter=method, thresh=threshold))
dim(exprs(k10tx_conf_count))
## [1] 18018 10
k8tx_conf_count <- sm(normalize_expt(k8tx_conf, filter=method, thresh=threshold))
dim(exprs(k8tx_conf_count))
## [1] 16826 8
rrnatx_expt <- exclude_genes_expt(expt=k10tx_raw, ids=sb_rrna, method="keep")
## Before removal, there were 234330 entries.
## Now there are 21 entries.
## Percent kept: 0.010, 0.012, 0.009, 0.007, 0.008, 0.011, 0.011, 0.013, 0.007, 0.005
## Percent removed: 99.990, 99.988, 99.991, 99.993, 99.992, 99.989, 99.989, 99.987, 99.993, 99.995
## This has only 21 putative rRNA transcripts.
genes_by_subset <- data.frame(
samples = c(rep("ten", 4), rep("eight", 4)),
colors = c(rep("#1B9E77", 4), rep("#7570B3", 4)),
alpha = c("#1B9E77FF", "#1B9E77CC", "#1B0E77AA",
"#1B0E7777", "#7570B3FF", "#7570B3CC", "#7570B3AA", "#7570B377"),
type = rep(c("raw", "count", "conf", "conf_count"), 2),
genes = c(nrow(exprs(k10tx_raw)), nrow(exprs(k10tx_count)),
nrow(exprs(k10tx_conf)), nrow(exprs(k10tx_conf_count)),
nrow(exprs(k8tx_raw)), nrow(exprs(k8tx_count)),
nrow(exprs(k8tx_conf)), nrow(exprs(k8tx_conf_count))))
library(ggplot2)
columns <- ggplot(genes_by_subset, aes(x=samples, y=genes)) +
geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
scale_fill_manual(values=c(levels(as.factor(genes_by_subset$colors)))) +
theme(axis.text=ggplot2::element_text(size=10, colour="black"),
axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
legend.position="none")
columns
c1 <- genes_by_subset[["samples"]] == "ten"
c2 <- genes_by_subset[["samples"]] == "eight"
tmp <- cbind(genes_by_subset[c1, "genes"], genes_by_subset[c2, "genes"])
colnames(tmp) <- c("ten", "eight")
rownames(tmp) <- c("raw", "count", "conf", "conf_count")
tmp
## ten eight
## raw 234330 234330
## count 42143 38571
## conf 33451 33451
## conf_count 18018 16826
We have suspiciously low mapping numbers for the reads against our denovo transcriptome. Therefore I tried a bowtie2 alignment of the reads against the transcriptome and counted with htseq. I was rather shocked to see that htseq excluded a huge majority of the reads, the following block shows just how poorly that data matches the others.
bowtie_data <- read.table(file="preprocessing/hpgl1000/outputs/bowtie2_solanum_betaceum/hpgl1000.count")
colnames(bowtie_data) <- c("hpgl1000id", "count")
bowtie_comp <- merge(exprs(k8tx_raw), bowtie_data, by.x="row.names", by.y="hpgl1000id")
rownames(bowtie_comp) <- bowtie_comp[["Row.names"]]
bowtie_comp <- bowtie_comp[, -1]
bowtie_sad <- plot_corheat(expt_data=bowtie_comp)
bowtie_sad$plot
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
}
Since I used kallisto for this work, I want to figure out tximport.
In 01_annotation.Rmd, I decided to open up a few lines of inquiry:
Start with the full data set. This will take a rather long time to perform all the graphs, and since we are not likely to use it, I will limit myself to just plotting 1-2 graphs.
k10_libsize <- plot_libsize(k10tx_raw)
k10_libsize$plot
k10_density <- sm(plot_density(k10tx_raw))
sm(k10_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k10_boxplot <- sm(plot_boxplot(k10tx_raw))
k10_boxplot
k10_corheat <- plot_corheat(k10tx_raw)
k10_tsne <- plot_tsne(k10tx_raw)
k10_tsne$plot
k8_libsize <- plot_libsize(k8tx_raw)
k8_libsize$plot
k8_density <- sm(plot_density(k8tx_raw))
sm(k8_density$plot + ggplot2::scale_x_continuous(limits=c(1,100)))
k8_boxplot <- sm(plot_boxplot(k8tx_raw))
k8_boxplot
k8_corheat <- plot_corheat(k8tx_raw)
k8_tsne <- plot_tsne(k8tx_raw)
k8_tsne$plot
When we get to the count filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.
k10tx_count_raw_plots <- sm(graph_metrics(k10tx_count))
k10tx_count_norm <- sm(normalize_expt(k10tx_count, transform="log2", norm="quant", convert="cpm"))
k10tx_count_norm_plots <- sm(graph_metrics(k10tx_count_norm))
k10tx_count_normbatch <- sm(normalize_expt(k10tx_count, batch="svaseq", transform="log2", convert="cpm"))
k10tx_count_normbatch_plots <- sm(graph_metrics(k10tx_count_normbatch))
Let us see how the k10 count data appears
k10tx_count_raw_plots$libsize
sm(k10tx_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,200)))
k10tx_count_raw_plots$boxplot
k10tx_count_norm_plots$corheat
k10tx_count_norm_plots$smc
k10tx_count_norm_plots$disheat
k10tx_count_norm_plots$smd
k10tx_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k10tx_count_normbatch_plots$pcaplot
k8tx_count_raw_plots <- sm(graph_metrics(k8tx_count))
k8tx_count_norm <- sm(normalize_expt(k8tx_count, transform="log2", norm="quant", convert="cpm"))
k8tx_count_norm_plots <- sm(graph_metrics(k8tx_count_norm))
k8tx_count_normbatch <- sm(normalize_expt(k8tx_count, batch="svaseq", transform="log2", convert="cpm"))
k8tx_count_normbatch_plots <- sm(graph_metrics(k8tx_count_normbatch))
Let us see how the k8 count data appears
k8tx_count_raw_plots$libsize
sm(k8tx_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,200)))
k8tx_count_raw_plots$boxplot
k8tx_count_norm_plots$corheat
k8tx_count_norm_plots$smc
k8tx_count_norm_plots$disheat
k8tx_count_norm_plots$smd
k8tx_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k8tx_count_normbatch_plots$pcaplot
k8tx_count_normbatch_plots$tsneplot
When we get to the conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.
k10tx_conf_raw_plots <- sm(graph_metrics(k10tx_conf))
k10tx_conf_norm <- sm(normalize_expt(k10tx_conf, transform="log2", norm="quant", convert="cpm"))
k10tx_conf_norm_plots <- sm(graph_metrics(k10tx_conf_norm))
k10tx_conf_normbatch <- sm(normalize_expt(k10tx_conf, batch="svaseq", transform="log2",
convert="cpm", filter=TRUE))
k10tx_conf_normbatch_plots <- sm(graph_metrics(k10tx_conf_normbatch))
Let us see how the k10 conf data appears
k10tx_conf_raw_plots$libsize
sm(k10tx_conf_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))
k10tx_conf_raw_plots$boxplot
k10tx_conf_norm_plots$corheat
k10tx_conf_norm_plots$smc
k10tx_conf_norm_plots$disheat
k10tx_conf_norm_plots$smd
k10tx_conf_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k10tx_conf_normbatch_plots$pcaplot
k8tx_conf_raw_plots <- sm(graph_metrics(k8tx_conf))
k8tx_conf_norm <- sm(normalize_expt(k8tx_conf, transform="log2", norm="quant", convert="cpm"))
k8tx_conf_norm_plots <- sm(graph_metrics(k8tx_conf_norm))
k8tx_conf_normbatch <- sm(normalize_expt(k8tx_conf, batch="svaseq", transform="log2",
convert="cpm", filter=TRUE))
k8tx_conf_normbatch_plots <- sm(graph_metrics(k8tx_conf_normbatch))
Let us see how the k8 conf data appears
k8tx_conf_raw_plots$libsize
sm(k8tx_conf_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))
k8tx_conf_raw_plots$boxplot
k8tx_conf_norm_plots$corheat
k8tx_conf_norm_plots$smc
k8tx_conf_norm_plots$disheat
k8tx_conf_norm_plots$smd
k8tx_conf_norm_plots$pcaplot
k8tx_conf_norm_plots$tsneplot
## See if the additional information in the 'batch' factor made a difference
k8tx_conf_normbatch_plots$pcaplot
k8tx_conf_normbatch_plots$tsneplot
When we get to the count_conf filtered data, I suspect things will be more interesting. Therefore let us look at the full spectrum of plots before/after normalization.
k10tx_conf_count_raw_plots <- sm(graph_metrics(k10tx_conf_count))
k10tx_conf_count_norm <- sm(normalize_expt(k10tx_conf_count, transform="log2",
norm="quant", convert="cpm"))
k10tx_conf_count_norm_plots <- sm(graph_metrics(k10tx_conf_count_norm))
k10tx_conf_count_normbatch <- sm(normalize_expt(k10tx_conf_count, batch="svaseq", transform="log2",
convert="cpm", filter=TRUE))
k10tx_conf_count_normbatch_plots <- sm(graph_metrics(k10tx_conf_count_normbatch))
Let us see how the k10 count_conf data appears
k10tx_conf_count_raw_plots$libsize
sm(k10tx_conf_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))
k10tx_conf_count_raw_plots$boxplot
k10tx_conf_count_norm_plots$corheat
k10tx_conf_count_norm_plots$smc
k10tx_conf_count_norm_plots$disheat
k10tx_conf_count_norm_plots$smd
k10tx_conf_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k10tx_conf_count_normbatch_plots$pcaplot
k10tx_conf_count_normbatch_plots$tsneplot
k8tx_conf_count_raw_plots <- sm(graph_metrics(k8tx_conf_count))
k8tx_conf_count_norm <- sm(normalize_expt(k8tx_conf_count, transform="log2",
norm="quant", convert="cpm"))
k8tx_conf_count_norm_plots <- sm(graph_metrics(k8tx_conf_count_norm))
k8tx_conf_count_normbatch <- sm(normalize_expt(k8tx_conf_count, batch="svaseq",
transform="log2", convert="cpm"))
k8tx_conf_count_normbatch_plots <- sm(graph_metrics(k8tx_conf_count_normbatch))
Let us see how the k8 count_conf data appears
k8tx_conf_count_raw_plots$libsize
sm(k8tx_conf_count_raw_plots$density + ggplot2::scale_x_continuous(limits=c(1,100)))
k8tx_conf_count_raw_plots$boxplot
k8tx_conf_count_norm_plots$corheat
k8tx_conf_count_norm_plots$smc
k8tx_conf_count_norm_plots$disheat
k8tx_conf_count_norm_plots$smd
k8tx_conf_count_norm_plots$pcaplot
## See if the additional information in the 'batch' factor made a difference
k8tx_conf_count_normbatch_plots$pcaplot
k8tx_conf_count_normbatch_plots$tsneplot
library(ggplot2)
maximum <- plot_libsize(k8tx_raw, text=FALSE)
maximum_alpha <- maximum$table
maximum_alpha$alpha <- alpha(maximum_alpha$colors, 0.4)
conf <- plot_libsize(k8tx_conf, text=FALSE)
conf_alpha <- conf$table
conf_alpha$alpha <- alpha(conf_alpha$colors, 0.6)
count <- plot_libsize(k8tx_count, text=FALSE)
count_alpha <- count$table
count_alpha$alpha <- alpha(count_alpha$colors, 0.8)
conf_count <- plot_libsize(k8tx_conf_count, text=FALSE)
conf_count_alpha <- conf_count$table
conf_count_alpha$alpha <- alpha(conf_count_alpha$colors, 1.0)
all <- rbind(maximum_alpha, conf_alpha,
count_alpha, conf_count_alpha)
## The following will stack the library sizes from each subset one on top of the other.
## This is likely not desired, but it does provide sufficient room to print the library
## sizes if one wishes.
stacked <- ggplot(all, aes(x=id)) +
geom_bar(aes(weight=sum, alpha=alpha, fill=colors), color="black") +
scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
theme(axis.text=ggplot2::element_text(size=10, colour="black"),
axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
legend.position="none")
stacked
## In this instance, they are stacked on top of 0, which I suspect is the goal.
columns <- ggplot(all, aes(x=id, y=sum)) +
geom_col(position="identity", aes(fill=colors, alpha=alpha), color="black") +
scale_fill_manual(values=c(levels(as.factor(all$colors)))) +
theme(axis.text=ggplot2::element_text(size=10, colour="black"),
axis.text.x=ggplot2::element_text(angle=90, vjust=0.5),
legend.position="none")
columns
Let us assume for a moment that we will choose a couple of the above for looking at more closely. I suspect k8_count_conf, k10_count_conf, and k8_count and k8_conf are good candidates.
output <- paste0("excel/k8_conf_count_by_tx-v", ver, ".xlsx")
k8_conf_count_written <- sm(write_expt(k8tx_conf_count, excel=output))
output <- paste0("excel/k10_conf_count_by_tx-v", ver, ".xlsx")
k10tx_conf_count_written <- sm(write_expt(k10tx_conf_count, excel=output))
output <- paste0("excel/k8_count_by_tx-v", ver, ".xlsx")
k8tx_count_written <- sm(write_expt(k8tx_count, excel=output))
output <- paste0("excel/k8_conf_by_tx-v", ver, ".xlsx")
k8tx_conf_written <- sm(write_expt(k8tx_conf, excel=output))
I want to see if there are some changes in the number of reads/tx between the embryogenic/nonembryogenic. I am not entirely certain how to do this, so I will poke about and see what happens.
I do have a data structure telling me the mapping of tx->gene.
In the following block, make a table of the number of txs per gene and look at the distribution.
transcript_gene_map <- putative_annotations[, c("transcript_id", "gene_id")]
mappings <- data.table::as.data.table(transcript_gene_map)
mappings[["num"]] <- 1
num_tx <- mappings[, list(num_tx=sum(num)), by="gene_id"]
tx_hist <- plot_histogram(num_tx$num_tx)
tx_hist
A side question from the above: how many of transcripts are actually being used by the embryogenic/nonembryogenic samples? In order to attempt that question, I split the raw data into two separate pieces (embryo/nonembryo), extract the rowMeans for each transcript by these (oh damn I have a function which does this automagically), and see how many are near 0 (<= 0.001 tpm). When we do that, we observe that many more embryogenic transcripts are actually used than non-embryogenic.
start <- sm(normalize_expt(k8tx_raw, convert="cpm"))
embryo <- subset_expt(start, subset="condition=='embryogenic'")
nonembryo <- subset_expt(start, subset="condition=='nonembryogenic'")
embryo_df <- as.data.frame(exprs(embryo))
embryo_df$em_mean <- rowMeans(embryo_df)
non_df <- as.data.frame(exprs(nonembryo))
non_df$non_mean <- rowMeans(non_df)
e <- as.data.frame(embryo_df[["em_mean"]])
rownames(e) <- rownames(embryo_df)
colnames(e) <- "em_mean"
n <- as.data.frame(non_df[["non_mean"]])
rownames(n) <- rownames(non_df)
colnames(n) <- "non_mean"
means <- merge(x=e, y=n, by="row.names")
rownames(means) <- means[["Row.names"]]
means <- means[, -1]
near_zero <- means <= 0.001
embryo_near_zero <- sum(near_zero[, "em_mean"])
embryo_near_zero
## [1] 35080
non_near_zero <- sum(near_zero[, "non_mean"])
non_near_zero
## [1] 52196
plot_boxplot(embryo)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, adding 1 to the data.
## Changed 391753 zero count features.
plot_boxplot(nonembryo)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, adding 1 to the data.
## Changed 508520 zero count features.
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
}
For the moment, I will assume we only want to deal with the 8 samples. However, I will try a count-filtered run, an annotation filtered run, and both.
keepers <- list(
"embryo" = c("embryogenic", "nonembryogenic"))
k8tx_count_de <- sm(all_pairwise(k8tx_count))
k8tx_count_tables <- sm(combine_de_tables(k8tx_count_de, keepers=keepers,
excel=paste0("excel/k8_count_de_tx-v", ver, ".xlsx"),
sig_excel=paste0("excel/k8_count_sig_tx-v", ver, ".xlsx")))
k8tx_conf_de <- sm(all_pairwise(k8tx_conf))
k8tx_conf_tables <- sm(combine_de_tables(k8tx_conf_de, keepers=keepers,
excel=paste0("excel/k8_conf_de_tx-v", ver, ".xlsx"),
sig_excel=paste0("excel/k8_conf_sig_tx-v", ver, ".xlsx")))
k8tx_conf_count_de <- sm(all_pairwise(k8tx_conf_count))
k8tx_conf_count_tables <- sm(combine_de_tables(k8tx_conf_count_de, keepers=keepers,
excel=paste0("excel/k8_conf_count_de_tx-v", ver, ".xlsx"),
sig_excel=paste0("excel/k8_conf_count_sig_tx-v", ver, ".xlsx")))
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
}
I loaded my differential expression data with respect to genes and transcripts. Let us try finding some genes which are relatively unchanged in that data and see if any of them have significantly changed transcripts. It is worth noting that in my previous analysis, it seemed highly likely that there are at least some to be found.
## First pull the table of data
gene_data <- k8_count_tables[["data"]][[1]]
tx_data <- k8tx_count_tables[["data"]][[1]]
dim(gene_data)
## [1] 58486 67
candidate_genes <- abs(gene_data[["deseq_logfc"]]) <= 0.2
candidates <- gene_data[candidate_genes, ]
dim(candidates)
## [1] 40725 67
## hmm that is too many.
candidate_genes <- candidates[["deseq_basemean"]] >= 1000
candidates <- candidates[candidate_genes, ]
dim(candidates)
## [1] 327 67
## well, that is more than I can deal with, but it is a start.
## Now look back at the transcripts and immediately be annoyed because a bunch are <NA>.
new_geneids <- gsub(pattern="(^.*)_i.*$", replacement="\\1", x=rownames(tx_data))
tx_data[["gene_id"]] <- new_geneids
keepers <- tx_data[["gene_id"]] %in% rownames(candidates)
tx_candidates <- tx_data[keepers, ]
test_gene <- "TRINITY_DN9916_c0_g2"
gene_data <- candidates[test_gene, ]
gene_data[["deseq_basemean"]]
## [1] 1487
test_tx <- tx_data[["gene_id"]] == test_gene
test_tx <- 2 ^ tx_data[test_tx, c("basic_nummed", "basic_denmed")]
test_kept <- rowSums(as.matrix(test_tx)) > 2
test_tx <- test_tx[test_kept, ]
colnames(test_tx) <- c("embryogenic", "nonembryogenic")
test_df <- as.data.frame(test_tx)
test_df[["tx"]] <- rownames(test_df)
em <- test_df[, c("tx", "embryogenic")]
em[["condition"]] <- "embryogenic"
colnames(em) <- c("tx", "abundance", "condition")
no <- test_df[, c("tx", "nonembryogenic")]
no[["condition"]] <- "nonembryogenic"
colnames(no) <- c("tx", "abundance", "condition")
plotted <- rbind(em, no)
library(ggplot2)
stacked <- ggplot(plotted, aes_string(x="condition")) +
geom_bar(aes(weight=abundance, fill=tx))
pp(file="/tmp/test_transcripts_one_gene.png", image=stacked)
## Wrote the image to: /tmp/test_transcripts_one_gene.png
if (!isTRUE(get0("skip_load"))) {
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
pander::pander(sessionInfo())
}
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc
## R> packrat::restore()
## This is hpgltools commit: Wed Mar 21 15:55:32 2018 -0400: 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 04_tx_to_gene-v20171102.rda.xz
tmp <- sm(saveme(filename=this_save))
pander::pander(sessionInfo())
R version 3.4.4 (2018-03-15)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: grid, parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: hpgltools(v.2018.03), Rgraphviz(v.2.22.0), graph(v.1.56.0), SparseM(v.1.77), topGO(v.2.30.1), GO.db(v.3.5.0), AnnotationDbi(v.1.40.0), IRanges(v.2.12.0), S4Vectors(v.0.16.0), Biobase(v.2.38.0), BiocGenerics(v.0.24.0), bindrcpp(v.0.2), edgeR(v.3.20.9), foreach(v.1.4.4), Vennerable(v.3.1.0.9000), ruv(v.0.9.7) and ggplot2(v.2.2.1)
loaded via a namespace (and not attached): snow(v.0.4-2), backports(v.1.1.2), Hmisc(v.4.1-1), plyr(v.1.8.4), lazyeval(v.0.2.1), splines(v.3.4.4), BiocParallel(v.1.12.0), GenomeInfoDb(v.1.14.0), sva(v.3.26.0), digest(v.0.6.15), htmltools(v.0.3.6), gdata(v.2.18.0), magrittr(v.1.5), checkmate(v.1.8.5), memoise(v.1.1.0), cluster(v.2.0.6), doParallel(v.1.0.11), openxlsx(v.4.0.17), limma(v.3.34.9), Biostrings(v.2.46.0), readr(v.1.1.1), annotate(v.1.56.1), matrixStats(v.0.53.1), prettyunits(v.1.0.2), colorspace(v.1.3-2), blob(v.1.1.0), ggrepel(v.0.7.0), BiasedUrn(v.1.07), dplyr(v.0.7.4), RCurl(v.1.95-4.10), tximport(v.1.6.0), roxygen2(v.6.0.1), genefilter(v.1.60.0), lme4(v.1.1-15), bindr(v.0.1.1), survival(v.2.41-3), iterators(v.1.0.9), glue(v.1.2.0), gtable(v.0.2.0), zlibbioc(v.1.24.0), XVector(v.0.18.0), DelayedArray(v.0.4.1), DEoptimR(v.1.0-8), scales(v.0.5.0), DBI(v.0.8), Rcpp(v.0.12.16), xtable(v.1.8-2), progress(v.1.1.2), htmlTable(v.1.11.2), foreign(v.0.8-69), bit(v.1.1-12), preprocessCore(v.1.40.0), Formula(v.1.2-2), httr(v.1.3.1), htmlwidgets(v.1.0), gplots(v.3.0.1), RColorBrewer(v.1.1-2), acepack(v.1.4.1), pkgconfig(v.2.0.1), XML(v.3.98-1.10), nnet(v.7.3-12), locfit(v.1.5-9.1), tidyselect(v.0.2.4), labeling(v.0.3), rlang(v.0.2.0), reshape2(v.1.4.3), munsell(v.0.4.3), tools(v.3.4.4), RSQLite(v.2.0), devtools(v.1.13.5), evaluate(v.0.10.1), stringr(v.1.3.0), yaml(v.2.1.18), knitr(v.1.20), bit64(v.0.9-7), pander(v.0.6.1), geneLenDataBase(v.1.14.0), robustbase(v.0.92-8), caTools(v.1.17.1), purrr(v.0.2.4), RBGL(v.1.54.0), nlme(v.3.1-131.1), xml2(v.1.2.0), biomaRt(v.2.34.2), compiler(v.3.4.4), pbkrtest(v.0.4-7), rstudioapi(v.0.7), variancePartition(v.1.8.1), tibble(v.1.4.2), geneplotter(v.1.56.0), stringi(v.1.1.7), highr(v.0.6), GenomicFeatures(v.1.30.3), lattice(v.0.20-35), Matrix(v.1.2-12), commonmark(v.1.4), nloptr(v.1.0.4), pillar(v.1.2.1), goseq(v.1.30.0), data.table(v.1.10.4-3), bitops(v.1.0-6), corpcor(v.1.6.9), qvalue(v.2.10.0), rtracklayer(v.1.38.3), GenomicRanges(v.1.30.3), colorRamps(v.2.3), R6(v.2.2.2), latticeExtra(v.0.6-28), directlabels(v.2017.03.31), RMySQL(v.0.10.14), KernSmooth(v.2.23-15), gridExtra(v.2.3), codetools(v.0.2-15), MASS(v.7.3-49), gtools(v.3.5.0), assertthat(v.0.2.0), rhdf5(v.2.22.0), SummarizedExperiment(v.1.8.1), DESeq2(v.1.18.1), rprojroot(v.1.3-2), withr(v.2.1.2), GenomicAlignments(v.1.14.1), Rsamtools(v.1.30.0), GenomeInfoDbData(v.1.0.0), mgcv(v.1.8-23), hms(v.0.4.2), doSNOW(v.1.0.16), quadprog(v.1.5-5), rpart(v.4.1-13), tidyr(v.0.8.0), minqa(v.1.2.4), rmarkdown(v.1.9), Rtsne(v.0.13) and base64enc(v.0.1-3)