index.html annotation.html

This document is intended to provide an opportunity to compare the samples and ensure that they are all of sufficient quality to be included in the final differential expression analyses. In addition, this should provide an opportunity to compare the results from the alignments to the esmeraldo, nonesmeraldo, and combined haplotypes. In addition, with the recent inclusion of all the human alignments, it may be used to quantify them as well.

1 Create testing subsets

For some analyses, we might want to consider only the samples of types CL14 or CLBrener.

I am going to use the following convention when working with the various subsets:

  1. Names of data sets begin with the data included:
  1. cl_ means include all samples.
  2. clbr_ means the clbrener samples.
  3. cl14_ means the cl14 samples.
  4. kcl_,kclbr_,kcl14_ means the ‘kept’ samples respectively after excluding outliers.
  1. The following portion refers to the haplotype used for mapping the reads.
  1. esmer_ means the alignments mapped against the Esmeraldo haplotype.
  2. nonesmer_ means those against NonEsmeraldo.
  3. all_ means the combined haplotypes including Esmeraldo+NonEsmeraldo+Unassigned.
  1. The final portion refers to the data type created.
  1. expt means the expt object which contains the annotation data, sample data, read counts.
  2. metrics means the set of plots describing the data.
  3. xxxnorm means normalized data using xxx where xxx is defined by an abbreviation of the normalization used
    1. lqcf: log2(quantile(cpm(filtered(data))))
    2. clqcf: combat(log2(quantile(cpm(filtered(data)))))
    3. slqcf: sva(log2(quantile(cpm(filtered(data)))))
    4. Other letters may be added as needed in each slot.
## I am going to copy the original expts to names which follow the conventions below
cl_esmer_expt <- esmer_expt
cl_nonesmer_expt <- nonesmer_expt
cl_all_expt <- all_expt
## Split the esmer data into CLBR/CL14
clbr_esmer_expt <- expt_subset(esmer_expt, "type=='CLBr'")
cl14_esmer_expt <- expt_subset(esmer_expt, "type=='CL14'")
## Split the nonesmer data into CLBR/CL14
clbr_nonesmer_expt <- expt_subset(nonesmer_expt, "type=='CLBr'")
cl14_nonesmer_expt <- expt_subset(nonesmer_expt, "type=='CL14'")
## And the combined data
clbr_all_expt <- expt_subset(all_expt, "type=='CLBr'")
cl14_all_expt <- expt_subset(all_expt, "type=='CL14'")

Now we have 9 expt objects, one for each haplotype (esmer, nonesmer, combined(all)) for each subset (all data, cl14 data, and clbr data)

cl_esmer_metrics <- sm(graph_metrics(esmer_expt))
cl_nonesmer_metrics <- sm(graph_metrics(nonesmer_expt))
cl_all_metrics <- sm(graph_metrics(all_expt))
clbr_esmer_metrics <- sm(graph_metrics(clbr_esmer_expt))
clbr_nonesmer_metrics <- sm(graph_metrics(clbr_nonesmer_expt))
clbr_all_metrics <- sm(graph_metrics(clbr_all_expt))
cl14_esmer_metrics <- sm(graph_metrics(cl14_esmer_expt))
cl14_nonesmer_metrics <- sm(graph_metrics(cl14_nonesmer_expt))
cl14_all_metrics <- sm(graph_metrics(cl14_all_expt))

1.1 Quick check the library sizes

I find myself thinking I should plot these together in a single plot somehow. They are kind of fascinating, as there are more reads on the nonesmer mappings than esmer for all samples. Why is this?

cl_esmer_metrics$libsize

cl_nonesmer_metrics$libsize

## NonEsmeraldo often gets >= 10% more reads.
cl_all_metrics$libsize

1.2 Check global data distributions across subsets

As a starting point, let us see how similar the data distributions are across subsets/haplotypes.

This will be printed in order: esmer mapping:, all-data, clbr-data, cl14-data nonesmer mappings, all-data, clbr-data, cl14-data all mappings: all-data, clbr-data, cl14-data human-data: all

library(directlabels)
library(ggplot2)
directlabels::direct.label(cl_esmer_metrics$density)

## The distribusions mostly look good but hpgl0125 and hpgl 476/475/482 are a
## little off as I think we will see more clearly in the smaller dataset plots
directlabels::direct.label(clbr_esmer_metrics$density)

## Ahh the lower density samples are the amastigote 60 hours, that makes sense:
## 476,482,475 are all A60.
directlabels::direct.label(cl14_esmer_metrics$density)

## This is similarly true for the cl14 samples, but at both time points, 60 and
## 96 hours.

directlabels::direct.label(cl_nonesmer_metrics$density)

directlabels::direct.label(clbr_nonesmer_metrics$density)

directlabels::direct.label(cl14_nonesmer_metrics$density)

## Surprisingly to me, the nonesmer and esmer density plots are different for
## some samples.

directlabels::direct.label(cl_all_metrics$density)

directlabels::direct.label(clbr_all_metrics$density)

## Ahh the lower density samples are the amastigote 60 hours, that makes sense:
## 476,482,475 are all A60.
directlabels::direct.label(cl14_nonesmer_metrics$density)

## Surprisingly to me, the nonesmer and esmer density plots are different for
## some samples.

cl_esmer_metrics$boxplot

cl_nonesmer_metrics$boxplot

cl_all_metrics$boxplot

We could in theory print out the pca, heatmaps etc now, but I think I want first to perform a default normalization of the data and see the trends in the data a bit more clearly.

cl_esmer_lqcf <- default_norm(cl_esmer_expt, transform="log2")
cl_nonesmer_lqcf <- default_norm(cl_nonesmer_expt, transform="log2")
cl_all_lqcf <- default_norm(cl_all_expt, transform="log2")
clbr_esmer_lqcf <- default_norm(clbr_esmer_expt, transform="log2")
clbr_nonesmer_lqcf <- default_norm(clbr_nonesmer_expt, transform="log2")
clbr_all_lqcf <- default_norm(clbr_all_expt, transform="log2")
cl14_esmer_lqcf <- default_norm(cl14_esmer_expt, transform="log2")
cl14_nonesmer_lqcf <- default_norm(cl14_nonesmer_expt, transform="log2")
cl14_all_lqcf <- default_norm(cl14_all_expt, transform="log2")

2 Redo metric plotting on normalized data

cl_esmer_lqcf_metrics <- sm(graph_metrics(cl_esmer_lqcf))
cl_nonesmer_lqcf_metrics <- sm(graph_metrics(cl_nonesmer_lqcf))
cl_all_lqcf_metrics <- sm(graph_metrics(cl_all_lqcf))
clbr_esmer_lqcf_metrics <- sm(graph_metrics(clbr_esmer_lqcf))
clbr_nonesmer_lqcf_metrics <- sm(graph_metrics(clbr_nonesmer_lqcf))
clbr_all_lqcf_metrics <- sm(graph_metrics(clbr_all_lqcf))
cl14_esmer_lqcf_metrics <- sm(graph_metrics(cl14_esmer_lqcf))
cl14_nonesmer_lqcf_metrics <- sm(graph_metrics(cl14_nonesmer_lqcf))
cl14_all_lqcf_metrics <- sm(graph_metrics(cl14_all_lqcf))

For the following plots, I think I will focus on the combined data unless there is something which is unclear unless it is separated.

2.1 Correlation heatmaps

The correlations in these plots are exceedingly high, 0.7<x<1.0.

cl_esmer_lqcf_metrics$corheat

cl_nonesmer_lqcf_metrics$corheat

## Once again the differences between esmer and nonesmer are striking
cl_all_lqcf_metrics$corheat

## HPGL0485 (cl14 epimastigote) is clearly the strangest sample of all.
## Its cohorts, HPGL0128 looks most like a CL14 60 hour sample; while HPGL0129
## looks like a CLBr epimastigote.
## HPGL0490 is also problematic, as it looks very similar to the CLBr 96 hour samples.

cl_esmer_lqcf_metrics$disheat

cl_nonesmer_lqcf_metrics$disheat

cl_all_lqcf_metrics$disheat

## Same story as above I think, but the problematic samples HPGL0485 is even
## starker to my eyes.

cl_all_lqcf_metrics$smc

Once again, I see that the nonesmer mappings are surprisingly (to me) different than the esmer. So much so that in the most pathologically difficult samples, we get pretty big shifts in clustering for the epimastigote and trypomastigote samples. This is starting to make me think that a large number of the genes attributed to the ‘Unassigned’ haplotype are actually Esmeraldo.

2.2 PCA

For the PCA plots, I think it might prove useful to look at the CLBr and CL14 samples separately. However I will start out with the entire set like the heatmaps.

cl_esmer_lqcf_metrics$pcaplot

cl_nonesmer_lqcf_metrics$pcaplot

cl_all_lqcf_metrics$pcaplot

cl14_esmer_lqcf_metrics$pcaplot

cl14_nonesmer_lqcf_metrics$pcaplot

cl14_all_lqcf_metrics$pcaplot

clbr_esmer_lqcf_metrics$pcaplot

clbr_nonesmer_lqcf_metrics$pcaplot

clbr_all_lqcf_metrics$pcaplot

2.3 A separate, simpler Figure S02A:

Note that this figure must be generated before removing troubling samples because I reused the names of some of the expressionsets.

## 20170214 This is causing a segmentation fault. hmm it seems fixed now...
## All I did was restart the computer, R worries my sometimes.
cl_all_written <- write_expt(cl_all_expt,
                             excel=paste0("excel/testing-v", ver, ".xlsx"))
svg(file="images/figS02a_atb_v2.3.svg")
cl_all_lqcf_metrics$corheat
dev.off()
testing <- normalize_expt(cl_all_expt, filter=FALSE, convert="cpm",
                          norm="quant", transform="log2")
testing_disheat <- plot_disheat(testing)
testing_pca <- plot_pca(testing, plot_labels=FALSE)
pdf(file="images/figS02a_atb_v2.2_all_samples.pdf")
testing_disheat$plot
dev.off()
svg(file="images/figS02b_atb_v2.2_all_samples.svg")
testing_pca$plot
dev.off()

## I think therefore we want:
## 1.  A PCA of all samples log2(cpm(quant(filt(sva())))) (not in that order)
## 2.  Corresponding correlation heatmap, distance, and smc.
test_data <- normalize_expt(cl_all_expt, transform="log2", norm="quant",
                            convert="cpm", filter=TRUE)
##test_write <- write_expt(cl_all_expt, transform="log2", norm="quant",
##                         convert="cpm", filter=TRUE,
##                         excel=paste0("excel/cl_all_expt-v", ver, ".xlsx"))
fig_pca <- plot_pca(test_data)
fig_cd <- plot_corheat(test_data)
fig_dd <- plot_disheat(test_data)
fig_smc <- plot_sm(test_data)
svg(file="images/figS02v2a_atb_v2.0.svg")
fig_pca$plot
dev.off()
pdf(file="images/figS02v2b_atb_v2.0.pdf")
fig_cd$plot
dev.off()
pdf(file="images/figS02v2c_atb_v2.0.pdf")
fig_dd$plot
dev.off()
svg(file="images/figS02v2d_atb_v2.0.svg")
fig_smc
dev.off()

2.4 A couple of possible implementations of Figure 2

The text introducing Figure 2 is as follows:

“Mapped sequencing data derived from all libraries were analyzed using two methods to inspect the relationships between samples and to identify outliers. Samples were subjected to principal component analysis (PCA) as well as hierarchical clustering. The resulting heat map and PCA plot (Figure 2) showed a clear separation between samples as well as the expected clustering between biological triplicates, except for one library generated from RNA isolated from CL Brener trypomastigotes, which was identified as an outlier and was excluded from further analyses (Figure 2C). Of the 24887 genes analyzed, 23191 were expressed at a level of at least 1 count per million in at least 2 samples.”

I take this to mean that we are performing a heatmap and PCA of all samples.

The heat map provided in the powerpoint, appears to me to not have all samples, but is instead missing the CL14 epimastigotes. While it is true we remove them, this piece of text does say we are looking at all samples – which is it?

library(Biobase)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport,
##     clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply,
##     parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
##     anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max, which.min
## Welcome to Bioconductor
##
##     Vignettes contain introductory material; view with 'browseVignettes()'. To
##     cite Bioconductor, see 'citation("Biobase")', and for packages
##     'citation("pkgname")'.
new_names <- paste0(rownames(pData(cl_all_lqcf$expressionset)), "_",
                    pData(cl_all_lqcf$expressionset)$originalcond, "_",
                    pData(cl_all_lqcf$expressionset)$batch)
distance_plot <- plot_disheat(cl_all_lqcf, expt_names=new_names)

correlation_plot <- plot_corheat(cl_all_lqcf, expt_names=new_names)

library(Heatplus)
row_correlations <- hpgl_cor(exprs(cl_all_lqcf$expressionset))
row_design <- as.data.frame(pData(cl_all_lqcf$expressionset))
## Reorder the design to match the correlation clustering order
row_design <- row_design[rownames(row_correlations), ]

column_distances <- as.matrix(dist(t(exprs(cl_all_lqcf$expressionset))))
column_design <- as.data.frame(pData(cl_all_lqcf$expressionset))
## Reorder these according to the distance clustering
column_design <- column_design[rownames(column_distances), ]

## Now choose the columns of the design to add as annotations
row_design <- row_design[, c("batch","replicate")]
row_design$replicate <- as.factor(row_design$replicate)
column_design <- column_design[, c("type","stage")]

my_annotations <- list(Row=list(data=row_design),
                       Col=list(data=column_design))
my_labels <- list(Row=list(nrow=5),
                  Col=list(nrow=5))
my_cluster_distances <- list(Row=list(cuth=2),
                             Col=list(cuth=100))
cordist <- column_distances
for (r in 1:nrow(row_correlations)) {
    for (c in 1:ncol(row_correlations)) {
        if (c > r) {
            cordist[r,c] <- row_correlations[r,c]
        } else if (r == c) {
            cordist[r,c] <- 0
        }
    }
}
## ok, so now we go from -1 (highest correlation) to 0 (lowest correlation),
## then up to high numbers (distance)
map1 <- annHeatmap2(cordist, scale="none",
                    cluster=my_cluster_distances,
                    ann=my_annotations,
                    labels=my_labels)
## heatmap.2 heatmaps have: $breaks (color breaks), $col (colors by breaks),
## and $colorTable (low,high,color assignments)
## annHeatmap2 has $breaks and $col -- perhaps I can use that.
map2 <- map1
cor_breaks <- rev(correlation_plot[["map"]][["breaks"]])
dist_breaks <- distance_plot[["map"]][["breaks"]]
dist_breaks <- dist_breaks[2:length(dist_breaks)]
total_breaks <- append(cor_breaks, dist_breaks)
map2[["data"]][["col"]] <- append(correlation_plot[["map"]][["col"]], distance_plot[["map"]][["col"]])
map2[["data"]][["breaks"]] <- total_breaks
##map2$data$x2 <- map2$data$x
plot(map2)
## Warning in image.default(1:nc, 1:nr, t(x2), axes = FALSE, xlim = c(0.5, : unsorted
## 'breaks' will be sorted before use

3 Remove troubled samples

My previous notes, and the plots above suggest strongly that the samples HPGL0485, HPGL0490, and HPGL0163 are problematic. I have already excluded HPGL0163, so now I just want to drop the other two.

I am going to therefore create new base kexpt objects and overwrite the normalized data with the new normalized data.

old_subset <- "sampleid!='HPGL0485'&sampleid!='HPGL0490'&sampleid!='HPGL0129'&sampleid!='HPGL0128'"
##new_subset <- "sampleid!='HPGL0128'&sampleid!='HPGL0129'&sampleid!='HPGL0483'&sampleid!='HPGL0485'&sampleid!='HPGL0486'&sampleid!='HPGL0487'&sampleid!='HPGL0490'"
##new_subset <- "stage!='Epi'&sampleid!='HPGL0490'"
outlier_subset <- "sampleid!='HPGL0490'"
noepi_subset <- "stage!='Epi'"
cl_all_kexptv1 <- expt_subset(cl_all_expt, subset=outlier_subset)
cl_all_kexpt <- expt_subset(cl_all_kexptv1, subset=noepi_subset)
expt_filename <- paste0("excel/cl_all_kexptv1-", ver, ".xlsx")
printed_expt <- write_expt(cl_all_kexptv1, violin=FALSE,
                           excel=expt_filename,
                           filter=TRUE, transform="log2", norm="quant",
                           convert="raw", batch="sva")
## Writing the legend.
## The sheet: legend is in legend.
## Writing the raw reads.
## Graphing the raw reads.
## There is just one batch in this data.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## The sheet: raw_graphs is in legend, raw_reads, raw_graphs.
## Writing the normalized reads.
## Graphing the normalized reads.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## The sheet: norm_graphs is in legend, raw_reads, raw_graphs, norm_data, norm_graphs.
## Writing the median reads by factor.
## The factor CL14.A60 has 3 rows.
## The factor CL14.A96 has 3 rows.
## The factor CL14.Tryp has 3 rows.
## The factor CLBr.A60 has 3 rows.
## The factor CLBr.A96 has 3 rows.
## The factor CLBr.Tryp has 2 rows.
cl_esmer_kexpt <- expt_subset(cl_esmer_expt, subset=outlier_subset)
cl_nonesmer_kexpt <- expt_subset(cl_nonesmer_expt, subset=outlier_subset)
cl_all_kexpt <- expt_subset(cl_all_expt, subset=outlier_subset)
clbr_esmer_kexpt <- expt_subset(clbr_esmer_expt, subset=outlier_subset)
clbr_nonesmer_kexpt <- expt_subset(clbr_nonesmer_expt, subset=outlier_subset)
clbr_all_kexpt <- expt_subset(clbr_all_expt, subset=outlier_subset)
cl14_esmer_kexpt <- expt_subset(cl14_esmer_expt, subset=outlier_subset)
cl14_nonesmer_kexpt <- expt_subset(cl14_nonesmer_expt, subset=outlier_subset)
cl14_all_kexpt <- expt_subset(cl14_all_expt, subset=outlier_subset)
cl_esmer_lqcf <- default_norm(cl_esmer_kexpt, transform="log2")
cl_nonesmer_lqcf <- default_norm(cl_nonesmer_kexpt, transform="log2")
cl_all_lqcf <- default_norm(cl_all_kexpt, transform="log2")
clbr_esmer_lqcf <- default_norm(clbr_esmer_kexpt, transform="log2")
clbr_nonesmer_lqcf <- default_norm(clbr_nonesmer_kexpt, transform="log2")
clbr_all_lqcf <- default_norm(clbr_all_kexpt, transform="log2")
cl14_esmer_lqcf <- default_norm(cl14_esmer_kexpt, transform="log2")
cl14_nonesmer_lqcf <- default_norm(cl14_nonesmer_kexpt, transform="log2")
cl14_all_lqcf <- default_norm(cl14_all_kexpt, transform="log2")

cl_esmer_lqcf_metrics <- sm(graph_metrics(cl_esmer_lqcf))
cl_nonesmer_lqcf_metrics <- sm(graph_metrics(cl_nonesmer_lqcf))
cl_all_lqcf_metrics <- sm(graph_metrics(cl_all_lqcf))
clbr_esmer_lqcf_metrics <- sm(graph_metrics(clbr_esmer_lqcf))
clbr_nonesmer_lqcf_metrics <- sm(graph_metrics(clbr_nonesmer_lqcf))
clbr_all_lqcf_metrics <- sm(graph_metrics(clbr_all_lqcf))
cl14_esmer_lqcf_metrics <- sm(graph_metrics(cl14_esmer_lqcf))
cl14_nonesmer_lqcf_metrics <- sm(graph_metrics(cl14_nonesmer_lqcf))
cl14_all_lqcf_metrics <- sm(graph_metrics(cl14_all_lqcf))

3.1 A separate, simpler Figure 02A:

distance_plot <- plot_disheat(cl_all_lqcf)

testing_data <- normalize_expt(cl_all_kexpt, transform="log2", convert="cpm",
                               filter=TRUE, batch="ssva")
## This function will replace the expt$expressionset slot with:
## log2(ssva(cpm(cbcb(data))))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 1049 low-count genes (22256 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 2268 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with ssva.
## Note to self:  If you get an error like 'x contains missing values'; I think this
##  means that the data has too many 0's and needs to have a better low-count filter applied.
## batch_counts: Before batch correction, 9511 entries 0<x<1.
## batch_counts: Before batch correction, 2268 entries are >= 0.
## Passing the batch method to get_model_adjust().
## It understands a few additional batch methods.
## Not able to discern the state of the data.
## Going to use a simplistic metric to guess if it is log scale.
## The be method chose 4 surrogate variable(s).
## Did not understand ssva, assuming supervised sva.
## sva warning: controls provided so supervised sva is being performed.
## Number of significant surrogate variables is:  4
## The number of elements which are < 0 after batch correction is: 8165
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
correlation_plot <- plot_corheat(testing_data, keysize=1.5)

pdf(file="images/fig02a_atb_v2.3.pdf")
correlation_plot$plot
dev.off()
## png
##   2
pca_plot <- plot_pca(testing_data, plot_labels=FALSE)
## There is just one batch in this data.
## Not putting labels on the plot.
svg(file="images/fig02b_atb_v2.3.svg")
pca_plot$plot
dev.off()
## png
##   2

4 Re-visualize chosen normalized metrics

Now the visualizations should be maximally pretty.

cl_nonesmer_lqcf_metrics$corheat

clbr_nonesmer_lqcf_metrics$corheat

cl14_nonesmer_lqcf_metrics$corheat

cl_nonesmer_lqcf_metrics$pcaplot

clbr_nonesmer_lqcf_metrics$pcaplot

cl14_nonesmer_lqcf_metrics$pcaplot

5 Violins!

Lets see if my new violin function works.

clbr_filt <- normalize_expt(clbr_all_kexpt, filter="cbcb")
varpart_b <- varpart(clbr_filt, predictor=NULL, factors=c("actualbatch"))
varpart_b$partition_plot
varpart_c <- varpart(clbr_filt, predictor=NULL, factors=c("condition"))
varpart_c$partition_plot
varpart_cb <- varpart(clbr_filt, predictor=NULL,
                      factors=c("condition","actualbatch"))
varpart_cb$partition_plot
tmp <- sm(saveme(filename=this_save))
---
title: "RNAseq of T.cruzi CL14/CLBr: Sample Estimation"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  <!-- Document prelude revision 2016-10 -->
body .main-container {
max-width: 1600px;
}
</style>

```{r options, include=FALSE}
## These are the options I tend to favor
library("hpgltools")
tt <- sm(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=90,
                     echo=TRUE)
knitr::opts_chunk$set(
    fig.width=8,
    fig.height=8,
    dpi=96)
old_options <- options(
    error = NULL,
    digits = 4,
    stringsAsFactors = FALSE,
    knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
set.seed(1)
previous_file <- "annotation.Rmd"
rmd_file <- "sample_estimation.Rmd"
ver <- "20170301"
previous_save <- paste0(gsub(pattern="\\.Rmd", replace=".rda.xz", x=previous_file))
this_save <- paste0(gsub(pattern="\\.Rmd", replace=".rda.xz", x=rmd_file))
```

[index.html](index.html) [annotation.html](annotation.html)

```{r rendering, include=FALSE, eval=FALSE}
## This block is used to render a document from within it.
rmarkdown::render(rmd_file)

## An extra renderer for pdf output
rmarkdown::render(rmd_file, output_format="pdf_document", output_options=c("skip_html"))
## Or to save/load large Rdata files.
hpgltools:::saveme()
hpgltools:::loadme()
rm(list=ls())
```

```{r loadme, include=FALSE}
tmp <- sm(loadme(filename=previous_save))
```

This document is intended to provide an opportunity to compare the samples and ensure that they are
all of sufficient quality to be included in the final differential expression analyses.  In
addition, this should provide an opportunity to compare the results from the alignments to the
esmeraldo, nonesmeraldo, and combined haplotypes.  In addition, with the recent inclusion of all the
human alignments, it may be used to quantify them as well.

# Create testing subsets

For some analyses, we might want to consider only the samples of types CL14 or CLBrener.

I am going to use the following convention when working with the various subsets:

1.  Names of data sets begin with the data included:
  a. cl_ means include all samples.
  b. clbr_ means the clbrener samples.
  c. cl14_ means the cl14 samples.
  d. kcl_,kclbr_,kcl14_ means the 'kept' samples respectively after excluding outliers.
2.  The following portion refers to the haplotype used for mapping the reads.
  a. esmer_ means the alignments mapped against the Esmeraldo haplotype.
  b. nonesmer_ means those against NonEsmeraldo.
  c. all_ means the combined haplotypes including Esmeraldo+NonEsmeraldo+Unassigned.
3.  The final portion refers to the data type created.
  a. expt means the expt object which contains the annotation data, sample data, read counts.
  b. metrics means the set of plots describing the data.
  c. xxxnorm means normalized data using xxx where xxx is defined by an abbreviation of the normalization used
    i.   lqcf: log2(quantile(cpm(filtered(data))))
    ii.  clqcf: combat(log2(quantile(cpm(filtered(data)))))
    iii. slqcf: sva(log2(quantile(cpm(filtered(data)))))
    iv.  Other letters may be added as needed in each slot.

```{r create_subsets}
## I am going to copy the original expts to names which follow the conventions below
cl_esmer_expt <- esmer_expt
cl_nonesmer_expt <- nonesmer_expt
cl_all_expt <- all_expt
## Split the esmer data into CLBR/CL14
clbr_esmer_expt <- expt_subset(esmer_expt, "type=='CLBr'")
cl14_esmer_expt <- expt_subset(esmer_expt, "type=='CL14'")
## Split the nonesmer data into CLBR/CL14
clbr_nonesmer_expt <- expt_subset(nonesmer_expt, "type=='CLBr'")
cl14_nonesmer_expt <- expt_subset(nonesmer_expt, "type=='CL14'")
## And the combined data
clbr_all_expt <- expt_subset(all_expt, "type=='CLBr'")
cl14_all_expt <- expt_subset(all_expt, "type=='CL14'")
```

Now we have 9 expt objects, one for each haplotype (esmer, nonesmer, combined(all))
for each subset (all data, cl14 data, and clbr data)

```{r initial_graphs_hidden1, fig.show="hide"}
cl_esmer_metrics <- sm(graph_metrics(esmer_expt))
cl_nonesmer_metrics <- sm(graph_metrics(nonesmer_expt))
cl_all_metrics <- sm(graph_metrics(all_expt))
clbr_esmer_metrics <- sm(graph_metrics(clbr_esmer_expt))
clbr_nonesmer_metrics <- sm(graph_metrics(clbr_nonesmer_expt))
clbr_all_metrics <- sm(graph_metrics(clbr_all_expt))
cl14_esmer_metrics <- sm(graph_metrics(cl14_esmer_expt))
cl14_nonesmer_metrics <- sm(graph_metrics(cl14_nonesmer_expt))
cl14_all_metrics <- sm(graph_metrics(cl14_all_expt))
```

## Quick check the library sizes

I find myself thinking I should plot these together in a single plot somehow.
They are kind of fascinating, as there are more reads on the nonesmer mappings than esmer for all
samples.  Why is this?

```{r libsizes}
cl_esmer_metrics$libsize
cl_nonesmer_metrics$libsize
## NonEsmeraldo often gets >= 10% more reads.
cl_all_metrics$libsize
```

## Check global data distributions across subsets

As a starting point, let us see how similar the data distributions are across subsets/haplotypes.

This will be printed in order:
esmer mapping:, all-data, clbr-data, cl14-data
nonesmer mappings, all-data, clbr-data, cl14-data
all mappings: all-data, clbr-data, cl14-data
human-data: all

```{r densities}
library(directlabels)
library(ggplot2)
directlabels::direct.label(cl_esmer_metrics$density)
## The distribusions mostly look good but hpgl0125 and hpgl 476/475/482 are a
## little off as I think we will see more clearly in the smaller dataset plots
directlabels::direct.label(clbr_esmer_metrics$density)
## Ahh the lower density samples are the amastigote 60 hours, that makes sense:
## 476,482,475 are all A60.
directlabels::direct.label(cl14_esmer_metrics$density)
## This is similarly true for the cl14 samples, but at both time points, 60 and
## 96 hours.

directlabels::direct.label(cl_nonesmer_metrics$density)
directlabels::direct.label(clbr_nonesmer_metrics$density)
directlabels::direct.label(cl14_nonesmer_metrics$density)
## Surprisingly to me, the nonesmer and esmer density plots are different for
## some samples.

directlabels::direct.label(cl_all_metrics$density)
directlabels::direct.label(clbr_all_metrics$density)
## Ahh the lower density samples are the amastigote 60 hours, that makes sense:
## 476,482,475 are all A60.
directlabels::direct.label(cl14_nonesmer_metrics$density)
## Surprisingly to me, the nonesmer and esmer density plots are different for
## some samples.

cl_esmer_metrics$boxplot
cl_nonesmer_metrics$boxplot
cl_all_metrics$boxplot
```

We could in theory print out the pca, heatmaps etc now, but I think I want first to perform a
default normalization of the data and see the trends in the data a bit more clearly.

```{r default_norm}
cl_esmer_lqcf <- default_norm(cl_esmer_expt, transform="log2")
cl_nonesmer_lqcf <- default_norm(cl_nonesmer_expt, transform="log2")
cl_all_lqcf <- default_norm(cl_all_expt, transform="log2")
clbr_esmer_lqcf <- default_norm(clbr_esmer_expt, transform="log2")
clbr_nonesmer_lqcf <- default_norm(clbr_nonesmer_expt, transform="log2")
clbr_all_lqcf <- default_norm(clbr_all_expt, transform="log2")
cl14_esmer_lqcf <- default_norm(cl14_esmer_expt, transform="log2")
cl14_nonesmer_lqcf <- default_norm(cl14_nonesmer_expt, transform="log2")
cl14_all_lqcf <- default_norm(cl14_all_expt, transform="log2")
```

# Redo metric plotting on normalized data

```{r initial_graphs_hidden, fig.show="hide"}
cl_esmer_lqcf_metrics <- sm(graph_metrics(cl_esmer_lqcf))
cl_nonesmer_lqcf_metrics <- sm(graph_metrics(cl_nonesmer_lqcf))
cl_all_lqcf_metrics <- sm(graph_metrics(cl_all_lqcf))
clbr_esmer_lqcf_metrics <- sm(graph_metrics(clbr_esmer_lqcf))
clbr_nonesmer_lqcf_metrics <- sm(graph_metrics(clbr_nonesmer_lqcf))
clbr_all_lqcf_metrics <- sm(graph_metrics(clbr_all_lqcf))
cl14_esmer_lqcf_metrics <- sm(graph_metrics(cl14_esmer_lqcf))
cl14_nonesmer_lqcf_metrics <- sm(graph_metrics(cl14_nonesmer_lqcf))
cl14_all_lqcf_metrics <- sm(graph_metrics(cl14_all_lqcf))
```

For the following plots, I think I will focus on the combined data unless there is something which
is unclear unless it is separated.

## Correlation heatmaps

The correlations in these plots are exceedingly high, 0.7<x<1.0.

```{r all_heatmaps}
cl_esmer_lqcf_metrics$corheat
cl_nonesmer_lqcf_metrics$corheat
## Once again the differences between esmer and nonesmer are striking
cl_all_lqcf_metrics$corheat
## HPGL0485 (cl14 epimastigote) is clearly the strangest sample of all.
## Its cohorts, HPGL0128 looks most like a CL14 60 hour sample; while HPGL0129
## looks like a CLBr epimastigote.
## HPGL0490 is also problematic, as it looks very similar to the CLBr 96 hour samples.

cl_esmer_lqcf_metrics$disheat
cl_nonesmer_lqcf_metrics$disheat
cl_all_lqcf_metrics$disheat
## Same story as above I think, but the problematic samples HPGL0485 is even
## starker to my eyes.

cl_all_lqcf_metrics$smc
```

Once again, I see that the nonesmer mappings are surprisingly (to me) different than the
esmer.  So much so that in the most pathologically difficult samples, we get pretty big shifts
in clustering for the epimastigote and trypomastigote samples.  This is starting to make me think
that a large number of the genes attributed to the 'Unassigned' haplotype are actually Esmeraldo.

## PCA

For the PCA plots, I think it might prove useful to look at the CLBr and CL14 samples separately.
However I will start out with the entire set like the heatmaps.

```{r pca_plots}
cl_esmer_lqcf_metrics$pcaplot
cl_nonesmer_lqcf_metrics$pcaplot
cl_all_lqcf_metrics$pcaplot
cl14_esmer_lqcf_metrics$pcaplot
cl14_nonesmer_lqcf_metrics$pcaplot
cl14_all_lqcf_metrics$pcaplot
clbr_esmer_lqcf_metrics$pcaplot
clbr_nonesmer_lqcf_metrics$pcaplot
clbr_all_lqcf_metrics$pcaplot
```

## A separate, simpler Figure S02A:

Note that this figure must be generated before removing troubling samples because I reused the names
of some of the expressionsets.

```{r figure_s02a_simpler, eval=FALSE}
## 20170214 This is causing a segmentation fault. hmm it seems fixed now...
## All I did was restart the computer, R worries my sometimes.
cl_all_written <- write_expt(cl_all_expt,
                             excel=paste0("excel/testing-v", ver, ".xlsx"))
svg(file="images/figS02a_atb_v2.3.svg")
cl_all_lqcf_metrics$corheat
dev.off()
testing <- normalize_expt(cl_all_expt, filter=FALSE, convert="cpm",
                          norm="quant", transform="log2")
testing_disheat <- plot_disheat(testing)
testing_pca <- plot_pca(testing, plot_labels=FALSE)
pdf(file="images/figS02a_atb_v2.2_all_samples.pdf")
testing_disheat$plot
dev.off()
svg(file="images/figS02b_atb_v2.2_all_samples.svg")
testing_pca$plot
dev.off()

## I think therefore we want:
## 1.  A PCA of all samples log2(cpm(quant(filt(sva())))) (not in that order)
## 2.  Corresponding correlation heatmap, distance, and smc.
test_data <- normalize_expt(cl_all_expt, transform="log2", norm="quant",
                            convert="cpm", filter=TRUE)
##test_write <- write_expt(cl_all_expt, transform="log2", norm="quant",
##                         convert="cpm", filter=TRUE,
##                         excel=paste0("excel/cl_all_expt-v", ver, ".xlsx"))
fig_pca <- plot_pca(test_data)
fig_cd <- plot_corheat(test_data)
fig_dd <- plot_disheat(test_data)
fig_smc <- plot_sm(test_data)
svg(file="images/figS02v2a_atb_v2.0.svg")
fig_pca$plot
dev.off()
pdf(file="images/figS02v2b_atb_v2.0.pdf")
fig_cd$plot
dev.off()
pdf(file="images/figS02v2c_atb_v2.0.pdf")
fig_dd$plot
dev.off()
svg(file="images/figS02v2d_atb_v2.0.svg")
fig_smc
dev.off()
```

## A couple of possible implementations of Figure 2

The text introducing Figure 2 is as follows:

"Mapped sequencing data derived from all libraries were analyzed using two methods to inspect the
relationships between samples and to identify outliers. Samples were subjected to principal
component analysis (PCA) as well as hierarchical clustering. The resulting heat map and PCA plot
(Figure 2) showed a clear separation between samples as well as the expected clustering between
biological triplicates, except for one library generated from RNA isolated from CL Brener
trypomastigotes, which was identified as an outlier and was excluded from further analyses (Figure
2C). Of the 24887 genes analyzed, 23191 were expressed at a level of  at least 1 count per million
in at least 2 samples."

I take this to mean that we are performing a heatmap and PCA of _all_ samples.

The heat map provided in the powerpoint, appears to me to not have all samples, but is instead
missing the CL14 epimastigotes.  While it is true we remove them, this piece of text does say we are
looking at all samples -- which is it?

```{r figure2_silly}
library(Biobase)
new_names <- paste0(rownames(pData(cl_all_lqcf$expressionset)), "_",
                    pData(cl_all_lqcf$expressionset)$originalcond, "_",
                    pData(cl_all_lqcf$expressionset)$batch)
distance_plot <- plot_disheat(cl_all_lqcf, expt_names=new_names)
correlation_plot <- plot_corheat(cl_all_lqcf, expt_names=new_names)

library(Heatplus)
row_correlations <- hpgl_cor(exprs(cl_all_lqcf$expressionset))
row_design <- as.data.frame(pData(cl_all_lqcf$expressionset))
## Reorder the design to match the correlation clustering order
row_design <- row_design[rownames(row_correlations), ]

column_distances <- as.matrix(dist(t(exprs(cl_all_lqcf$expressionset))))
column_design <- as.data.frame(pData(cl_all_lqcf$expressionset))
## Reorder these according to the distance clustering
column_design <- column_design[rownames(column_distances), ]

## Now choose the columns of the design to add as annotations
row_design <- row_design[, c("batch","replicate")]
row_design$replicate <- as.factor(row_design$replicate)
column_design <- column_design[, c("type","stage")]

my_annotations <- list(Row=list(data=row_design),
                       Col=list(data=column_design))
my_labels <- list(Row=list(nrow=5),
                  Col=list(nrow=5))
my_cluster_distances <- list(Row=list(cuth=2),
                             Col=list(cuth=100))
cordist <- column_distances
for (r in 1:nrow(row_correlations)) {
    for (c in 1:ncol(row_correlations)) {
        if (c > r) {
            cordist[r,c] <- row_correlations[r,c]
        } else if (r == c) {
            cordist[r,c] <- 0
        }
    }
}
## ok, so now we go from -1 (highest correlation) to 0 (lowest correlation),
## then up to high numbers (distance)
map1 <- annHeatmap2(cordist, scale="none",
                    cluster=my_cluster_distances,
                    ann=my_annotations,
                    labels=my_labels)
## heatmap.2 heatmaps have: $breaks (color breaks), $col (colors by breaks),
## and $colorTable (low,high,color assignments)
## annHeatmap2 has $breaks and $col -- perhaps I can use that.
map2 <- map1
cor_breaks <- rev(correlation_plot[["map"]][["breaks"]])
dist_breaks <- distance_plot[["map"]][["breaks"]]
dist_breaks <- dist_breaks[2:length(dist_breaks)]
total_breaks <- append(cor_breaks, dist_breaks)
map2[["data"]][["col"]] <- append(correlation_plot[["map"]][["col"]], distance_plot[["map"]][["col"]])
map2[["data"]][["breaks"]] <- total_breaks
##map2$data$x2 <- map2$data$x
plot(map2)
```

# Remove troubled samples

My previous notes, and the plots above suggest strongly that the samples HPGL0485, HPGL0490, and HPGL0163 are problematic.
I have already excluded HPGL0163, so now I just want to drop the other two.

I am going to therefore create new base kexpt objects and overwrite the normalized data with the new normalized data.

```{r sample_removal, fig.show="hide"}
old_subset <- "sampleid!='HPGL0485'&sampleid!='HPGL0490'&sampleid!='HPGL0129'&sampleid!='HPGL0128'"
##new_subset <- "sampleid!='HPGL0128'&sampleid!='HPGL0129'&sampleid!='HPGL0483'&sampleid!='HPGL0485'&sampleid!='HPGL0486'&sampleid!='HPGL0487'&sampleid!='HPGL0490'"
##new_subset <- "stage!='Epi'&sampleid!='HPGL0490'"
outlier_subset <- "sampleid!='HPGL0490'"
noepi_subset <- "stage!='Epi'"
cl_all_kexptv1 <- expt_subset(cl_all_expt, subset=outlier_subset)
cl_all_kexpt <- expt_subset(cl_all_kexptv1, subset=noepi_subset)
expt_filename <- paste0("excel/cl_all_kexptv1-", ver, ".xlsx")
printed_expt <- write_expt(cl_all_kexptv1, violin=FALSE,
                           excel=expt_filename,
                           filter=TRUE, transform="log2", norm="quant",
                           convert="raw", batch="sva")

cl_esmer_kexpt <- expt_subset(cl_esmer_expt, subset=outlier_subset)
cl_nonesmer_kexpt <- expt_subset(cl_nonesmer_expt, subset=outlier_subset)
cl_all_kexpt <- expt_subset(cl_all_expt, subset=outlier_subset)
clbr_esmer_kexpt <- expt_subset(clbr_esmer_expt, subset=outlier_subset)
clbr_nonesmer_kexpt <- expt_subset(clbr_nonesmer_expt, subset=outlier_subset)
clbr_all_kexpt <- expt_subset(clbr_all_expt, subset=outlier_subset)
cl14_esmer_kexpt <- expt_subset(cl14_esmer_expt, subset=outlier_subset)
cl14_nonesmer_kexpt <- expt_subset(cl14_nonesmer_expt, subset=outlier_subset)
cl14_all_kexpt <- expt_subset(cl14_all_expt, subset=outlier_subset)
cl_esmer_lqcf <- default_norm(cl_esmer_kexpt, transform="log2")
cl_nonesmer_lqcf <- default_norm(cl_nonesmer_kexpt, transform="log2")
cl_all_lqcf <- default_norm(cl_all_kexpt, transform="log2")
clbr_esmer_lqcf <- default_norm(clbr_esmer_kexpt, transform="log2")
clbr_nonesmer_lqcf <- default_norm(clbr_nonesmer_kexpt, transform="log2")
clbr_all_lqcf <- default_norm(clbr_all_kexpt, transform="log2")
cl14_esmer_lqcf <- default_norm(cl14_esmer_kexpt, transform="log2")
cl14_nonesmer_lqcf <- default_norm(cl14_nonesmer_kexpt, transform="log2")
cl14_all_lqcf <- default_norm(cl14_all_kexpt, transform="log2")

cl_esmer_lqcf_metrics <- sm(graph_metrics(cl_esmer_lqcf))
cl_nonesmer_lqcf_metrics <- sm(graph_metrics(cl_nonesmer_lqcf))
cl_all_lqcf_metrics <- sm(graph_metrics(cl_all_lqcf))
clbr_esmer_lqcf_metrics <- sm(graph_metrics(clbr_esmer_lqcf))
clbr_nonesmer_lqcf_metrics <- sm(graph_metrics(clbr_nonesmer_lqcf))
clbr_all_lqcf_metrics <- sm(graph_metrics(clbr_all_lqcf))
cl14_esmer_lqcf_metrics <- sm(graph_metrics(cl14_esmer_lqcf))
cl14_nonesmer_lqcf_metrics <- sm(graph_metrics(cl14_nonesmer_lqcf))
cl14_all_lqcf_metrics <- sm(graph_metrics(cl14_all_lqcf))
```

## A separate, simpler Figure 02A:

```{r figure_02a_simpler}
distance_plot <- plot_disheat(cl_all_lqcf)

testing_data <- normalize_expt(cl_all_kexpt, transform="log2", convert="cpm",
                               filter=TRUE, batch="ssva")
correlation_plot <- plot_corheat(testing_data, keysize=1.5)
pdf(file="images/fig02a_atb_v2.3.pdf")
correlation_plot$plot
dev.off()
pca_plot <- plot_pca(testing_data, plot_labels=FALSE)
svg(file="images/fig02b_atb_v2.3.svg")
pca_plot$plot
dev.off()
```

# Re-visualize chosen normalized metrics

Now the visualizations should be maximally pretty.

```{r vis_pretty}
cl_nonesmer_lqcf_metrics$corheat
clbr_nonesmer_lqcf_metrics$corheat
cl14_nonesmer_lqcf_metrics$corheat
cl_nonesmer_lqcf_metrics$pcaplot
clbr_nonesmer_lqcf_metrics$pcaplot
cl14_nonesmer_lqcf_metrics$pcaplot
```

# Violins!

Lets see if my new violin function works.

```{r violin, eval=FALSE}
clbr_filt <- normalize_expt(clbr_all_kexpt, filter="cbcb")
varpart_b <- varpart(clbr_filt, predictor=NULL, factors=c("actualbatch"))
varpart_b$partition_plot
varpart_c <- varpart(clbr_filt, predictor=NULL, factors=c("condition"))
varpart_c$partition_plot
varpart_cb <- varpart(clbr_filt, predictor=NULL,
                      factors=c("condition","actualbatch"))
varpart_cb$partition_plot
```

```{r save_estimation}
tmp <- sm(saveme(filename=this_save))
```
