index.html annotation.html

This document is intended to provide an opportunity to compare the samples and ensure that they are all of sufficient quality to be included in the final differential expression analyses. In addition, this should provide an opportunity to compare the results from the alignments to the esmeraldo, nonesmeraldo, and combined haplotypes. In addition, with the recent inclusion of all the human alignments, it may be used to quantify them as well.

1 Create testing subsets

For some analyses, we might want to consider only the samples of types CL14 or CLBrener.

I am going to use the following convention when working with the various subsets:

  1. Names of data sets begin with the data included:
  1. cl_ means include all samples.
  2. clbr_ means the clbrener samples.
  3. cl14_ means the cl14 samples.
  4. kcl_,kclbr_,kcl14_ means the ‘kept’ samples respectively after excluding outliers.
  1. The following portion refers to the haplotype used for mapping the reads.
  1. esmer_ means the alignments mapped against the Esmeraldo haplotype.
  2. nonesmer_ means those against NonEsmeraldo.
  3. all_ means the combined haplotypes including Esmeraldo+NonEsmeraldo+Unassigned.
  1. The final portion refers to the data type created.
  1. expt means the expt object which contains the annotation data, sample data, read counts.
  2. metrics means the set of plots describing the data.
  3. xxxnorm means normalized data using xxx where xxx is defined by an abbreviation of the normalization used i. lqcf: log2(quantile(cpm(filtered(data)))) ii. clqcf: combat(log2(quantile(cpm(filtered(data))))) iii. slqcf: sva(log2(quantile(cpm(filtered(data))))) iv. Other letters may be added as needed in each slot.
all_expt <- sm(create_expt(metadata="sample_sheets/cl14clbr_samples_combined_all.xlsx",
                           file_column="clbrenerfile", gene_info=all_genes))
all_expt <- exclude_genes_expt(all_expt)
## Before removal, there were 25100 entries.
## Now there are 23305 entries.
## Percent kept: 99.781, 99.815, 99.814, 99.818, 97.848, 98.584, 99.776, 99.449, 99.331, 99.457, 99.321, 99.395, 99.157, 98.435, 98.543, 99.306, 99.396, 99.653, 99.663, 99.249, 99.436, 98.720, 99.886, 99.761
## Percent removed: 0.219, 0.185, 0.186, 0.182, 2.152, 1.416, 0.224, 0.551, 0.669, 0.543, 0.679, 0.605, 0.843, 1.565, 1.457, 0.694, 0.604, 0.347, 0.337, 0.751, 0.564, 1.280, 0.114, 0.239
outlier_subset <- "sampleid!='HPGL0490'"
all_minus_one <- subset_expt(all_expt, subset=outlier_subset)
## Using a subset expression.
## There were 24, now there are 23 samples.
all_minus_one <- subset_expt(all_minus_one, subset="type!='CL14'")
## Using a subset expression.
## There were 23, now there are 11 samples.
all_minus_cl14 <- subset_expt(all_expt, subset="type!='CL14'")
## Using a subset expression.
## There were 24, now there are 12 samples.
tmp_expt <- sm(normalize_expt(all_minus_cl14, filter=TRUE, convert="cpm", norm="quant", transform="log2"))
pp(file="images/all_clbr_pca.pdf", image=plot_pca(tmp_expt)$plot)
## Writing the image to: images/all_clbr_pca.pdf and calling dev.off().

tmp_expt <- sm(normalize_expt(all_minus_one, filter=TRUE, convert="cpm", norm="quant", transform="log2"))
pp(file="images/all_clbr_minus_one_pca.pdf", image=plot_pca(tmp_expt)$plot)
## Writing the image to: images/all_clbr_minus_one_pca.pdf and calling dev.off().

## I am going to copy the original expts to names which follow the conventions below
cl_esmer_expt <- esmer_expt
cl_nonesmer_expt <- nonesmer_expt
cl_all_expt <- all_expt
## Split the esmer data into CLBR/CL14
clbr_esmer_expt <- subset_expt(esmer_expt, "type=='CLBr'")
## Using a subset expression.
## There were 18, now there are 9 samples.
cl14_esmer_expt <- subset_expt(esmer_expt, "type=='CL14'")
## Using a subset expression.
## There were 18, now there are 9 samples.
## Split the nonesmer data into CLBR/CL14
clbr_nonesmer_expt <- subset_expt(nonesmer_expt, "type=='CLBr'")
## Using a subset expression.
## There were 18, now there are 9 samples.
cl14_nonesmer_expt <- subset_expt(nonesmer_expt, "type=='CL14'")
## Using a subset expression.
## There were 18, now there are 9 samples.
## And the combined data
clbr_all_expt <- subset_expt(all_expt, "type=='CLBr'")
## Using a subset expression.
## There were 24, now there are 12 samples.
cl14_all_expt <- subset_expt(all_expt, "type=='CL14'")
## Using a subset expression.
## There were 24, now there are 12 samples.
clbr_all_norm <- normalize_expt(clbr_all_expt, norm="quant", convert="cpm", transform="log2", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 1209 low-count genes (22096 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## Step 5: not doing batch correction.
pp(file="images/clbrener_sample_pca.pdf", image=plot_pca(clbr_all_norm)$plot)
## Writing the image to: images/clbrener_sample_pca.pdf and calling dev.off().

Now we have 9 expt objects, one for each haplotype (esmer, nonesmer, combined(all)) for each subset (all data, cl14 data, and clbr data)

cl_esmer_metrics <- sm(graph_metrics(esmer_expt))
cl_nonesmer_metrics <- sm(graph_metrics(nonesmer_expt))
cl_all_metrics <- sm(graph_metrics(all_expt))
clbr_esmer_metrics <- sm(graph_metrics(clbr_esmer_expt))
clbr_nonesmer_metrics <- sm(graph_metrics(clbr_nonesmer_expt))
clbr_all_metrics <- sm(graph_metrics(clbr_all_expt))
cl14_esmer_metrics <- sm(graph_metrics(cl14_esmer_expt))
cl14_nonesmer_metrics <- sm(graph_metrics(cl14_nonesmer_expt))
cl14_all_metrics <- sm(graph_metrics(cl14_all_expt))

1.1 Quick check the library sizes

I find myself thinking I should plot these together in a single plot somehow. They are kind of fascinating, as there are more reads on the nonesmer mappings than esmer for all samples. Why is this?

cl_esmer_metrics$libsize

cl_nonesmer_metrics$libsize

## NonEsmeraldo often gets >= 10% more reads.
cl_all_metrics$libsize

1.2 Check global data distributions across subsets

As a starting point, let us see how similar the data distributions are across subsets/haplotypes.

This will be printed in order: esmer mapping:, all-data, clbr-data, cl14-data nonesmer mappings, all-data, clbr-data, cl14-data all mappings: all-data, clbr-data, cl14-data human-data: all

cl_esmer_metrics$density

## The distribusions mostly look good but hpgl0125 and hpgl 476/475/482 are a
## little off as I think we will see more clearly in the smaller dataset plots
clbr_esmer_metrics$density

## Ahh the lower density samples are the amastigote 60 hours, that makes sense:
## 476,482,475 are all A60.
cl14_esmer_metrics$cvplot

## This is similarly true for the cl14 samples, but at both time points, 60 and
## 96 hours.

cl_nonesmer_metrics$density

clbr_nonesmer_metrics$density

cl14_nonesmer_metrics$density

## Surprisingly to me, the nonesmer and esmer density plots are different for
## some samples.

cl_all_metrics$density

clbr_all_metrics$density

## Ahh the lower density samples are the amastigote 60 hours, that makes sense:
## 476,482,475 are all A60.
cl14_nonesmer_metrics$density

## Surprisingly to me, the nonesmer and esmer density plots are different for
## some samples.

cl_esmer_metrics$boxplot

cl_nonesmer_metrics$boxplot

cl_all_metrics$boxplot

We could in theory print out the pca, heatmaps etc now, but I think I want first to perform a default normalization of the data and see the trends in the data a bit more clearly.

cl_esmer_lqcf <- default_norm(cl_esmer_expt, transform="log2")
cl_nonesmer_lqcf <- default_norm(cl_nonesmer_expt, transform="log2")
cl_all_lqcf <- default_norm(cl_all_expt, transform="log2")
clbr_esmer_lqcf <- default_norm(clbr_esmer_expt, transform="log2")
clbr_nonesmer_lqcf <- default_norm(clbr_nonesmer_expt, transform="log2")
clbr_all_lqcf <- default_norm(clbr_all_expt, transform="log2")
cl14_esmer_lqcf <- default_norm(cl14_esmer_expt, transform="log2")
cl14_nonesmer_lqcf <- default_norm(cl14_nonesmer_expt, transform="log2")
cl14_all_lqcf <- default_norm(cl14_all_expt, transform="log2")

2 Redo metric plotting on normalized data

cl_esmer_lqcf_metrics <- sm(graph_metrics(cl_esmer_lqcf))
cl_nonesmer_lqcf_metrics <- sm(graph_metrics(cl_nonesmer_lqcf))
cl_all_lqcf_metrics <- sm(graph_metrics(cl_all_lqcf))
clbr_esmer_lqcf_metrics <- sm(graph_metrics(clbr_esmer_lqcf))
clbr_nonesmer_lqcf_metrics <- sm(graph_metrics(clbr_nonesmer_lqcf))
clbr_all_lqcf_metrics <- sm(graph_metrics(clbr_all_lqcf))
cl14_esmer_lqcf_metrics <- sm(graph_metrics(cl14_esmer_lqcf))
cl14_nonesmer_lqcf_metrics <- sm(graph_metrics(cl14_nonesmer_lqcf))
cl14_all_lqcf_metrics <- sm(graph_metrics(cl14_all_lqcf))

For the following plots, I think I will focus on the combined data unless there is something which is unclear unless it is separated.

2.1 Correlation heatmaps

The correlations in these plots are exceedingly high, 0.7<x<1.0.

cl_esmer_lqcf_metrics$corheat

cl_nonesmer_lqcf_metrics$corheat

## Once again the differences between esmer and nonesmer are striking
cl_all_lqcf_metrics$corheat

## HPGL0485 (cl14 epimastigote) is clearly the strangest sample of all.
## Its cohorts, HPGL0128 looks most like a CL14 60 hour sample; while HPGL0129
## looks like a CLBr epimastigote.
## HPGL0490 is also problematic, as it looks very similar to the CLBr 96 hour samples.

cl_esmer_lqcf_metrics$disheat

cl_nonesmer_lqcf_metrics$disheat

cl_all_lqcf_metrics$disheat

## Same story as above I think, but the problematic samples HPGL0485 is even
## starker to my eyes.

cl_all_lqcf_metrics$smc

Once again, I see that the nonesmer mappings are surprisingly (to me) different than the esmer. So much so that in the most pathologically difficult samples, we get pretty big shifts in clustering for the epimastigote and trypomastigote samples. This is starting to make me think that a large number of the genes attributed to the ‘Unassigned’ haplotype are actually Esmeraldo.

2.2 PCA

For the PCA plots, I think it might prove useful to look at the CLBr and CL14 samples separately. However I will start out with the entire set like the heatmaps.

cl_esmer_lqcf_metrics$pc_plot

cl_nonesmer_lqcf_metrics$pc_plot

cl_all_lqcf_metrics$pc_plot

cl14_esmer_lqcf_metrics$pc_plot

cl14_nonesmer_lqcf_metrics$pc_plot

cl14_all_lqcf_metrics$pc_plot

clbr_esmer_lqcf_metrics$pc_plot

clbr_nonesmer_lqcf_metrics$pc_plot

clbr_all_lqcf_metrics$pc_plot

2.3 A separate, simpler Figure S02A:

Note that this figure must be generated before removing troubling samples because I reused the names of some of the expressionsets.

## 20170214 This is causing a segmentation fault. hmm it seems fixed now...
## All I did was restart the computer, R worries my sometimes.
cl_all_written <- write_expt(cl_all_expt,
                             excel=paste0("excel/testing-v", ver, ".xlsx"))
svg(file="images/figS02a_atb_v2.3.svg")
cl_all_lqcf_metrics$corheat
dev.off()
testing <- sm(normalize_expt(cl_all_expt, filter=FALSE, convert="cpm",
                             norm="quant", transform="log2"))
testing_disheat <- plot_disheat(testing)
testing_pca <- plot_pca(testing, plot_labels=FALSE)
pdf(file="images/figS02a_atb_v2.2_all_samples.pdf")
testing_disheat$plot
dev.off()
svg(file="images/figS02b_atb_v2.2_all_samples.svg")
testing_pca$plot
dev.off()

## I think therefore we want:
## 1.  A PCA of all samples log2(cpm(quant(filt(sva())))) (not in that order)
## 2.  Corresponding correlation heatmap, distance, and smc.
test_data <- sm(normalize_expt(cl_all_expt, transform="log2", norm="quant",
                               convert="cpm", filter=TRUE))
##test_write <- write_expt(cl_all_expt, transform="log2", norm="quant",
##                         convert="cpm", filter=TRUE,
##                         excel=paste0("excel/cl_all_expt-v", ver, ".xlsx"))
fig_pca <- plot_pca(test_data)
fig_cd <- plot_corheat(test_data)
fig_dd <- plot_disheat(test_data)
fig_smc <- plot_sm(test_data)
svg(file="images/figS02v2a_atb_v2.0.svg")
fig_pca$plot
dev.off()
pdf(file="images/figS02v2b_atb_v2.0.pdf")
fig_cd$plot
dev.off()
pdf(file="images/figS02v2c_atb_v2.0.pdf")
fig_dd$plot
dev.off()
svg(file="images/figS02v2d_atb_v2.0.svg")
fig_smc$plot
dev.off()

2.4 A couple of possible implementations of Figure 2

The text introducing Figure 2 is as follows:

“Mapped sequencing data derived from all libraries were analyzed using two methods to inspect the relationships between samples and to identify outliers. Samples were subjected to principal component analysis (PCA) as well as hierarchical clustering. The resulting heat map and PCA plot (Figure 2) showed a clear separation between samples as well as the expected clustering between biological triplicates, except for one library generated from RNA isolated from CL Brener trypomastigotes, which was identified as an outlier and was excluded from further analyses (Figure 2C). Of the 24887 genes analyzed, 23191 were expressed at a level of at least 1 count per million in at least 2 samples.”

I take this to mean that we are performing a heatmap and PCA of all samples.

The heat map provided in the powerpoint, appears to me to not have all samples, but is instead missing the CL14 epimastigotes. While it is true we remove them, this piece of text does say we are looking at all samples – which is it?

library(Biobase)
new_names <- paste0(rownames(pData(cl_all_lqcf$expressionset)), "_",
                    pData(cl_all_lqcf$expressionset)$originalcond, "_",
                    pData(cl_all_lqcf$expressionset)$batch)
distance_plot <- plot_disheat(cl_all_lqcf, expt_names=new_names)

correlation_plot <- plot_corheat(cl_all_lqcf, expt_names=new_names)

library(Heatplus)
row_correlations <- hpgl_cor(exprs(cl_all_lqcf$expressionset))
row_design <- as.data.frame(pData(cl_all_lqcf$expressionset))
## Reorder the design to match the correlation clustering order
row_design <- row_design[rownames(row_correlations), ]

column_distances <- as.matrix(dist(t(exprs(cl_all_lqcf$expressionset))))
column_design <- as.data.frame(pData(cl_all_lqcf$expressionset))
## Reorder these according to the distance clustering
column_design <- column_design[rownames(column_distances), ]

## Now choose the columns of the design to add as annotations
row_design <- row_design[, c("batch","replicate")]
row_design$replicate <- as.factor(row_design$replicate)
column_design <- column_design[, c("type","stage")]

my_annotations <- list(Row=list(data=row_design),
                       Col=list(data=column_design))
my_labels <- list(Row=list(nrow=5),
                  Col=list(nrow=5))
my_cluster_distances <- list(Row=list(cuth=2),
                             Col=list(cuth=100))
cordist <- column_distances
for (r in 1:nrow(row_correlations)) {
    for (c in 1:ncol(row_correlations)) {
        if (c > r) {
            cordist[r,c] <- row_correlations[r,c]
        } else if (r == c) {
            cordist[r,c] <- 0
        }
    }
}
## ok, so now we go from -1 (highest correlation) to 0 (lowest correlation),
## then up to high numbers (distance)
map1 <- annHeatmap2(cordist, scale="none",
                    cluster=my_cluster_distances,
                    ann=my_annotations,
                    labels=my_labels)
## heatmap.2 heatmaps have: $breaks (color breaks), $col (colors by breaks),
## and $colorTable (low,high,color assignments)
## annHeatmap2 has $breaks and $col -- perhaps I can use that.
plot(map1)

3 Remove troubled samples

My previous notes, and the plots above suggest strongly that the samples HPGL0485, HPGL0490, and HPGL0163 are problematic. I have already excluded HPGL0163, so now I just want to drop the other two.

I am going to therefore create new base kexpt objects and overwrite the normalized data with the new normalized data.

old_subset <- "sampleid!='HPGL0485'&sampleid!='HPGL0490'&sampleid!='HPGL0129'&sampleid!='HPGL0128'"
##new_subset <- "sampleid!='HPGL0128'&sampleid!='HPGL0129'&sampleid!='HPGL0483'&sampleid!='HPGL0485'&sampleid!='HPGL0486'&sampleid!='HPGL0487'&sampleid!='HPGL0490'"
##new_subset <- "stage!='Epi'&sampleid!='HPGL0490'"
outlier_subset <- "sampleid!='HPGL0490'"
noepi_subset <- "stage!='Epi'"
cl_all_kexptv1 <- subset_expt(cl_all_expt, subset=outlier_subset)
## Using a subset expression.
## There were 24, now there are 23 samples.
cl_all_kexpt <- subset_expt(cl_all_kexptv1, subset=noepi_subset)
## Using a subset expression.
## There were 23, now there are 17 samples.
expt_filename <- paste0("excel/cl_all_kexptv1-", ver, ".xlsx")
printed_expt <- write_expt(cl_all_kexptv1, violin=FALSE,
                           excel=expt_filename,
                           filter=TRUE, transform="log2", norm="quant",
                           convert="raw", batch="sva")
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete

## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete
## Writing the normalized reads.
## Graphing the normalized reads.
## Error in quantile.default(prop_median, p = c(1, 3)/4) : 
##   missing values and NaN's not allowed if 'na.rm' is FALSE
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 2 rows containing missing values (geom_bar).
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete
## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete
## Writing the median reads by factor.
cl_esmer_kexpt <- subset_expt(cl_esmer_expt, subset=outlier_subset)
## Using a subset expression.
## There were 18, now there are 17 samples.
cl_nonesmer_kexpt <- subset_expt(cl_nonesmer_expt, subset=outlier_subset)
## Using a subset expression.
## There were 18, now there are 17 samples.
cl_all_kexpt <- subset_expt(cl_all_expt, subset=outlier_subset)
## Using a subset expression.
## There were 24, now there are 23 samples.
clbr_esmer_kexpt <- subset_expt(clbr_esmer_expt, subset=outlier_subset)
## Using a subset expression.
## There were 9, now there are 8 samples.
clbr_nonesmer_kexpt <- subset_expt(clbr_nonesmer_expt, subset=outlier_subset)
## Using a subset expression.
## There were 9, now there are 8 samples.
clbr_all_kexpt <- subset_expt(clbr_all_expt, subset=outlier_subset)
## Using a subset expression.
## There were 12, now there are 11 samples.
cl14_esmer_kexpt <- subset_expt(cl14_esmer_expt, subset=outlier_subset)
## Using a subset expression.
## There were 9, now there are 9 samples.
cl14_nonesmer_kexpt <- subset_expt(cl14_nonesmer_expt, subset=outlier_subset)
## Using a subset expression.
## There were 9, now there are 9 samples.
cl14_all_kexpt <- subset_expt(cl14_all_expt, subset=outlier_subset)
## Using a subset expression.
## There were 12, now there are 12 samples.
cl_esmer_lqcf <- default_norm(cl_esmer_kexpt, transform="log2")
cl_nonesmer_lqcf <- default_norm(cl_nonesmer_kexpt, transform="log2")
cl_all_lqcf <- default_norm(cl_all_kexpt, transform="log2")
clbr_esmer_lqcf <- default_norm(clbr_esmer_kexpt, transform="log2")
clbr_nonesmer_lqcf <- default_norm(clbr_nonesmer_kexpt, transform="log2")
clbr_all_lqcf <- default_norm(clbr_all_kexpt, transform="log2")
cl14_esmer_lqcf <- default_norm(cl14_esmer_kexpt, transform="log2")
cl14_nonesmer_lqcf <- default_norm(cl14_nonesmer_kexpt, transform="log2")
cl14_all_lqcf <- default_norm(cl14_all_kexpt, transform="log2")

cl_esmer_lqcf_metrics <- sm(graph_metrics(cl_esmer_lqcf))
cl_nonesmer_lqcf_metrics <- sm(graph_metrics(cl_nonesmer_lqcf))
cl_all_lqcf_metrics <- sm(graph_metrics(cl_all_lqcf))
clbr_esmer_lqcf_metrics <- sm(graph_metrics(clbr_esmer_lqcf))
clbr_nonesmer_lqcf_metrics <- sm(graph_metrics(clbr_nonesmer_lqcf))
clbr_all_lqcf_metrics <- sm(graph_metrics(clbr_all_lqcf))
cl14_esmer_lqcf_metrics <- sm(graph_metrics(cl14_esmer_lqcf))
cl14_nonesmer_lqcf_metrics <- sm(graph_metrics(cl14_nonesmer_lqcf))
cl14_all_lqcf_metrics <- sm(graph_metrics(cl14_all_lqcf))

3.1 A separate, simpler Figure 02A:

distance_plot <- plot_disheat(cl_all_lqcf)

testing_data <- normalize_expt(cl_all_kexpt, transform="log2", convert="cpm",
                               filter=TRUE, batch="ssva")
## This function will replace the expt$expressionset slot with:
## log2(ssva(cpm(cbcb(data))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 948 low-count genes (22357 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 4818 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with ssva.
## Note to self:  If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 491121 entries are x>1: 96%.
## batch_counts: Before batch/surrogate estimation, 4818 entries are x==0: 1%.
## batch_counts: Before batch/surrogate estimation, 18272 entries are 0<x<1: 4%.
## The be method chose 1 surrogate variable(s).
## Did not understand ssva, assuming supervised sva.
## sva warning: controls provided so supervised sva is being performed.
## Number of significant surrogate variables is:  1
## There are 1511 (0%) elements which are < 0 after batch correction.
## Setting low elements to zero.
correlation_plot <- plot_corheat(testing_data, keysize=1.5)

pdf(file="images/fig02a_atb_v2.3.pdf")
correlation_plot$plot
dev.off()
## png 
##   2
pca_plot <- plot_pca(testing_data, plot_labels=FALSE)
## Not putting labels on the plot.
svg(file="images/fig02b_atb_v2.3.svg")
pca_plot$plot
dev.off()
## png 
##   2

4 Re-visualize chosen normalized metrics

Now the visualizations should be maximally pretty.

cl_nonesmer_lqcf_metrics$corheat

clbr_nonesmer_lqcf_metrics$corheat

cl14_nonesmer_lqcf_metrics$corheat

cl_nonesmer_lqcf_metrics$pcaplot
## NULL
clbr_nonesmer_lqcf_metrics$pcaplot
## NULL
cl14_nonesmer_lqcf_metrics$pcaplot
## NULL

5 Violins!

Lets see if my new violin function works.

clbr_filt <- normalize_expt(clbr_all_kexpt, filter="cbcb")
varpart_cb <- varpart(clbr_filt, predictor=NULL, factors=c("condition", "actualbatch"))
varpart_cb$partition_plot
gff_file <- "reference/gff/clbrener_8.1_complete_genes.gff.gz"
gff_annotations <- sm(load_gff_annotations(gff_file, type="gene"))
rownames(gff_annotations) <- gff_annotations[["ID"]]

all_expt <- create_expt(metadata="sample_sheets/cl14clbr_samples_combined_all.xlsx",
                        file_column="clbrenerfile", gene_info=gff_annotations)
## Reading the sample metadata.
## Dropped 12 rows from the sample metadata because they were blank.
## The sample definitions comprises: 28 rows(samples) and 23 columns(metadata fields).
## Reading count tables.
## Reading count files with read.table().
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0123/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0125/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0126/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0127/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0128/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0129/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0473/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0474/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0475/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0476/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0477/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0478/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0479/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0480/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0481/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0482/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0483/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0484/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0485/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0486/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0487/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0488/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/cl14/hpgl0489/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## /mnt/sshfs/cbcbsub01/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_all_2016/preprocessing/parasite/clbr/hpgl0490/outputs/tophat_tcruzi_all/accepted_paired.count.xz contains 25105 rows and merges to 25105 rows.
## Finished reading count data.
## Warning in create_expt(metadata = "sample_sheets/
## cl14clbr_samples_combined_all.xlsx", : Some samples were removed when cross
## referencing the samples against the count data.
## Matched 25099 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 25100 rows and 24 columns.
all_expt <- exclude_genes_expt(all_expt)
## The txtype column is null, doing nothing.
var_expt <- subset_expt(all_expt, subset="type=='CLBr'")
## Using a subset expression.
## There were 24, now there are 12 samples.
snp_expt <- count_expt_snps(var_expt, type="percent", annot_column="bcffile")
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
## Parsed with column specification:
## cols(
##   `# chr_pos_ref_alt` = col_character(),
##   diff_count = col_double(),
##   all_count = col_double(),
##   pct = col_double()
## )
snp_numbers <- plot_libsize(snp_expt)
## Error in plot_libsize(snp_expt): (list) object cannot be coerced to type 'double'
snp_numbers$plot
## Error in eval(expr, envir, enclos): object 'snp_numbers' not found
snp_sets <- get_snp_sets(snp_expt, factor="condition")
## The factor CLBr.A96 has 3 rows.
## The factor CLBr.A60 has 3 rows.
## The factor CLBr.Tryp has 3 rows.
## The factor CLBr.Epi has 3 rows.
## Iterating over 845 elements.
snp_summary <- snps_vs_genes(var_expt, snp_sets)
## Error in .find_seqnames_col(df_colnames0, seqnames.field0, prefix): cannnot find seqnames column
snp_genes <- snps_vs_intersections(var_expt, snp_sets,
                                   chr_column="seqnames")
## Error in snps_vs_intersections(var_expt, snp_sets, chr_column = "seqnames"): could not find function "snps_vs_intersections"
snp_norm <- normalize_expt(snp_expt, filter=TRUE, transform="log2")
## This function will replace the expt$expressionset slot with:
## log2(cbcb(data))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data unconverted.  It is often advisable to cpm/rpkm
##  the data to normalize for sampling differences, keep in mind though that rpkm
##  has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
##  will try to detect this).
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (8295 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: transforming the data with log2.
## transform_counts: Found 66453 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
test <- plot_corheat(snp_norm)

test2 <- plot_sample_heatmap(snp_norm, Rowv=FALSE)

index.html 01_annotation.html

pander::pander(sessionInfo())

R version 3.6.1 (2019-07-05)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: foreach(v.1.5.0), ruv(v.0.9.7.1), Heatplus(v.2.32.1), hpgltools(v.1.0), Biobase(v.2.46.0) and BiocGenerics(v.0.32.0)

loaded via a namespace (and not attached): tidyselect(v.1.0.0), lme4(v.1.1-23), RSQLite(v.2.2.0), AnnotationDbi(v.1.48.0), grid(v.3.6.1), BiocParallel(v.1.20.1), Rtsne(v.0.15), devtools(v.2.3.0), munsell(v.0.5.0), codetools(v.0.2-16), preprocessCore(v.1.48.0), statmod(v.1.4.34), withr(v.2.2.0), colorspace(v.1.4-1), GOSemSim(v.2.12.1), knitr(v.1.28), rstudioapi(v.0.11), stats4(v.3.6.1), DOSE(v.3.12.0), labeling(v.0.3), urltools(v.1.7.3), GenomeInfoDbData(v.1.2.2), polyclip(v.1.10-0), bit64(v.0.9-7), farver(v.2.0.3), rprojroot(v.1.3-2), vctrs(v.0.2.4), xfun(v.0.13), BiocFileCache(v.1.10.2), fastcluster(v.1.1.25), R6(v.2.4.1), doParallel(v.1.0.15), GenomeInfoDb(v.1.22.1), graphlayouts(v.0.7.0), locfit(v.1.5-9.4), bitops(v.1.0-6), fgsea(v.1.12.0), gridGraphics(v.0.5-0), DelayedArray(v.0.12.3), assertthat(v.0.2.1), scales(v.1.1.0), ggraph(v.2.0.2), enrichplot(v.1.6.1), gtable(v.0.3.0), sva(v.3.34.0), processx(v.3.4.2), tidygraph(v.1.1.2), rlang(v.0.4.6), genefilter(v.1.68.0), splines(v.3.6.1), rtracklayer(v.1.46.0), europepmc(v.0.3), BiocManager(v.1.30.10), yaml(v.2.2.1), reshape2(v.1.4.4), GenomicFeatures(v.1.38.2), backports(v.1.1.6), qvalue(v.2.18.0), clusterProfiler(v.3.14.3), tools(v.3.6.1), usethis(v.1.6.1), ggplotify(v.0.0.5), ggplot2(v.3.3.0), ellipsis(v.0.3.0), gplots(v.3.0.3), RColorBrewer(v.1.1-2), sessioninfo(v.1.1.1), ggridges(v.0.5.2), Rcpp(v.1.0.4.6), plyr(v.1.8.6), base64enc(v.0.1-3), progress(v.1.2.2), zlibbioc(v.1.32.0), purrr(v.0.3.4), RCurl(v.1.98-1.2), ps(v.1.3.2), prettyunits(v.1.1.1), openssl(v.1.4.1), viridis(v.0.5.1), cowplot(v.1.0.0), S4Vectors(v.0.24.4), SummarizedExperiment(v.1.16.1), ggrepel(v.0.8.2), colorRamps(v.2.3), fs(v.1.4.1), variancePartition(v.1.16.1), magrittr(v.1.5), data.table(v.1.12.8), DO.db(v.2.9), openxlsx(v.4.1.4), triebeard(v.0.3.0), matrixStats(v.0.56.0), pkgload(v.1.0.2), hms(v.0.5.3), evaluate(v.0.14), xtable(v.1.8-4), pbkrtest(v.0.4-8.6), XML(v.3.99-0.3), IRanges(v.2.20.2), gridExtra(v.2.3), testthat(v.2.3.2), compiler(v.3.6.1), biomaRt(v.2.42.1), tibble(v.3.0.1), KernSmooth(v.2.23-17), crayon(v.1.3.4), minqa(v.1.2.4), htmltools(v.0.4.0), mgcv(v.1.8-31), corpcor(v.1.6.9), snow(v.0.4-3), tidyr(v.1.0.2), DBI(v.1.1.0), tweenr(v.1.0.1), dbplyr(v.1.4.3), MASS(v.7.3-51.6), rappdirs(v.0.3.1), boot(v.1.3-25), readr(v.1.3.1), Matrix(v.1.2-18), cli(v.2.0.2), quadprog(v.1.5-8), gdata(v.2.18.0), igraph(v.1.2.5), GenomicRanges(v.1.38.0), pkgconfig(v.2.0.3), rvcheck(v.0.1.8), GenomicAlignments(v.1.22.1), xml2(v.1.3.2), annotate(v.1.64.0), XVector(v.0.26.0), stringr(v.1.4.0), callr(v.3.4.3), digest(v.0.6.25), Biostrings(v.2.54.0), rmarkdown(v.2.1), fastmatch(v.1.1-0), edgeR(v.3.28.1), directlabels(v.2020.1.31), curl(v.4.3), Rsamtools(v.2.2.3), gtools(v.3.8.2), nloptr(v.1.2.2.1), lifecycle(v.0.2.0), nlme(v.3.1-147), jsonlite(v.1.6.1), desc(v.1.2.0), viridisLite(v.0.3.0), askpass(v.1.1), limma(v.3.42.2), fansi(v.0.4.1), pillar(v.1.4.3), lattice(v.0.20-41), httr(v.1.4.1), pkgbuild(v.1.0.7), survival(v.3.1-12), GO.db(v.3.10.0), glue(v.1.4.0), remotes(v.2.1.1), zip(v.2.0.4), iterators(v.1.0.12), pander(v.0.6.3), bit(v.1.1-15.2), ggforce(v.0.3.1), stringi(v.1.4.6), blob(v.1.2.1), doSNOW(v.1.0.18), caTools(v.1.18.0), memoise(v.1.1.0) and dplyr(v.0.8.5)

message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 5763684c1dc9aaca47da1768d30388741268c02c
## This is hpgltools commit: Sat Apr 25 14:42:57 2020 -0400: 5763684c1dc9aaca47da1768d30388741268c02c
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_sample_estimation_202005-v202005.rda.xz
tmp <- sm(saveme(filename=this_save))
---
title: "RNAseq of T.cruzi CL14/CLBr: Sample Estimation"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
library("hpgltools")
tt <- sm(devtools::load_all("/data/hpgltools"))
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=90,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      fig.width=8,
                      fig.height=8,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=12))
set.seed(1)
ver <- "20170810"
previous_file <- "01_annotation.Rmd"

tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))

ver <- "202005"
rmd_file <- "02_sample_estimation_202005.Rmd"
```

```{r render, eval=FALSE, include=FALSE}
rmarkdown::render(rmd_file)

rmarkdown::render(rmd_file, output_format="pdf_document")
```

[index.html](index.html) [annotation.html](annotation.html)

This document is intended to provide an opportunity to compare the samples and ensure that they are
all of sufficient quality to be included in the final differential expression analyses.  In
addition, this should provide an opportunity to compare the results from the alignments to the
esmeraldo, nonesmeraldo, and combined haplotypes.  In addition, with the recent inclusion of all the
human alignments, it may be used to quantify them as well.

# Create testing subsets

For some analyses, we might want to consider only the samples of types CL14 or CLBrener.

I am going to use the following convention when working with the various subsets:

1.  Names of data sets begin with the data included:
  a. cl_ means include all samples.
  b. clbr_ means the clbrener samples.
  c. cl14_ means the cl14 samples.
  d. kcl_,kclbr_,kcl14_ means the 'kept' samples respectively after excluding outliers.
2.  The following portion refers to the haplotype used for mapping the reads.
  a. esmer_ means the alignments mapped against the Esmeraldo haplotype.
  b. nonesmer_ means those against NonEsmeraldo.
  c. all_ means the combined haplotypes including Esmeraldo+NonEsmeraldo+Unassigned.
3.  The final portion refers to the data type created.
  a. expt means the expt object which contains the annotation data, sample data, read counts.
  b. metrics means the set of plots describing the data.
  c. xxxnorm means normalized data using xxx where xxx is defined by an abbreviation of the normalization used
    i.   lqcf: log2(quantile(cpm(filtered(data))))
    ii.  clqcf: combat(log2(quantile(cpm(filtered(data)))))
    iii. slqcf: sva(log2(quantile(cpm(filtered(data)))))
    iv.  Other letters may be added as needed in each slot.

```{r create_expt}
all_expt <- sm(create_expt(metadata="sample_sheets/cl14clbr_samples_combined_all.xlsx",
                           file_column="clbrenerfile", gene_info=all_genes))
all_expt <- exclude_genes_expt(all_expt)

outlier_subset <- "sampleid!='HPGL0490'"
all_minus_one <- subset_expt(all_expt, subset=outlier_subset)
all_minus_one <- subset_expt(all_minus_one, subset="type!='CL14'")
all_minus_cl14 <- subset_expt(all_expt, subset="type!='CL14'")

tmp_expt <- sm(normalize_expt(all_minus_cl14, filter=TRUE, convert="cpm", norm="quant", transform="log2"))
pp(file="images/all_clbr_pca.pdf", image=plot_pca(tmp_expt)$plot)
tmp_expt <- sm(normalize_expt(all_minus_one, filter=TRUE, convert="cpm", norm="quant", transform="log2"))
pp(file="images/all_clbr_minus_one_pca.pdf", image=plot_pca(tmp_expt)$plot)
```

```{r create_subsets}
## I am going to copy the original expts to names which follow the conventions below
cl_esmer_expt <- esmer_expt
cl_nonesmer_expt <- nonesmer_expt
cl_all_expt <- all_expt
## Split the esmer data into CLBR/CL14
clbr_esmer_expt <- subset_expt(esmer_expt, "type=='CLBr'")
cl14_esmer_expt <- subset_expt(esmer_expt, "type=='CL14'")
## Split the nonesmer data into CLBR/CL14
clbr_nonesmer_expt <- subset_expt(nonesmer_expt, "type=='CLBr'")
cl14_nonesmer_expt <- subset_expt(nonesmer_expt, "type=='CL14'")
## And the combined data
clbr_all_expt <- subset_expt(all_expt, "type=='CLBr'")
cl14_all_expt <- subset_expt(all_expt, "type=='CL14'")

clbr_all_norm <- normalize_expt(clbr_all_expt, norm="quant", convert="cpm", transform="log2", filter=TRUE)
pp(file="images/clbrener_sample_pca.pdf", image=plot_pca(clbr_all_norm)$plot)
```

Now we have 9 expt objects, one for each haplotype (esmer, nonesmer, combined(all))
for each subset (all data, cl14 data, and clbr data)

```{r initial_graphs_hidden1, fig.show="hide"}
cl_esmer_metrics <- sm(graph_metrics(esmer_expt))
cl_nonesmer_metrics <- sm(graph_metrics(nonesmer_expt))
cl_all_metrics <- sm(graph_metrics(all_expt))
clbr_esmer_metrics <- sm(graph_metrics(clbr_esmer_expt))
clbr_nonesmer_metrics <- sm(graph_metrics(clbr_nonesmer_expt))
clbr_all_metrics <- sm(graph_metrics(clbr_all_expt))
cl14_esmer_metrics <- sm(graph_metrics(cl14_esmer_expt))
cl14_nonesmer_metrics <- sm(graph_metrics(cl14_nonesmer_expt))
cl14_all_metrics <- sm(graph_metrics(cl14_all_expt))
```

## Quick check the library sizes

I find myself thinking I should plot these together in a single plot somehow.
They are kind of fascinating, as there are more reads on the nonesmer mappings than esmer for all
samples.  Why is this?

```{r libsizes}
cl_esmer_metrics$libsize
cl_nonesmer_metrics$libsize
## NonEsmeraldo often gets >= 10% more reads.
cl_all_metrics$libsize
```

## Check global data distributions across subsets

As a starting point, let us see how similar the data distributions are across subsets/haplotypes.

This will be printed in order:
esmer mapping:, all-data, clbr-data, cl14-data
nonesmer mappings, all-data, clbr-data, cl14-data
all mappings: all-data, clbr-data, cl14-data
human-data: all

```{r densities}
cl_esmer_metrics$density
## The distribusions mostly look good but hpgl0125 and hpgl 476/475/482 are a
## little off as I think we will see more clearly in the smaller dataset plots
clbr_esmer_metrics$density
## Ahh the lower density samples are the amastigote 60 hours, that makes sense:
## 476,482,475 are all A60.
cl14_esmer_metrics$cvplot
## This is similarly true for the cl14 samples, but at both time points, 60 and
## 96 hours.

cl_nonesmer_metrics$density
clbr_nonesmer_metrics$density
cl14_nonesmer_metrics$density
## Surprisingly to me, the nonesmer and esmer density plots are different for
## some samples.

cl_all_metrics$density
clbr_all_metrics$density
## Ahh the lower density samples are the amastigote 60 hours, that makes sense:
## 476,482,475 are all A60.
cl14_nonesmer_metrics$density
## Surprisingly to me, the nonesmer and esmer density plots are different for
## some samples.

cl_esmer_metrics$boxplot
cl_nonesmer_metrics$boxplot
cl_all_metrics$boxplot
```

We could in theory print out the pca, heatmaps etc now, but I think I want first to perform a
default normalization of the data and see the trends in the data a bit more clearly.

```{r default_norm}
cl_esmer_lqcf <- default_norm(cl_esmer_expt, transform="log2")
cl_nonesmer_lqcf <- default_norm(cl_nonesmer_expt, transform="log2")
cl_all_lqcf <- default_norm(cl_all_expt, transform="log2")
clbr_esmer_lqcf <- default_norm(clbr_esmer_expt, transform="log2")
clbr_nonesmer_lqcf <- default_norm(clbr_nonesmer_expt, transform="log2")
clbr_all_lqcf <- default_norm(clbr_all_expt, transform="log2")
cl14_esmer_lqcf <- default_norm(cl14_esmer_expt, transform="log2")
cl14_nonesmer_lqcf <- default_norm(cl14_nonesmer_expt, transform="log2")
cl14_all_lqcf <- default_norm(cl14_all_expt, transform="log2")
```

# Redo metric plotting on normalized data

```{r initial_graphs_hidden, fig.show="hide"}
cl_esmer_lqcf_metrics <- sm(graph_metrics(cl_esmer_lqcf))
cl_nonesmer_lqcf_metrics <- sm(graph_metrics(cl_nonesmer_lqcf))
cl_all_lqcf_metrics <- sm(graph_metrics(cl_all_lqcf))
clbr_esmer_lqcf_metrics <- sm(graph_metrics(clbr_esmer_lqcf))
clbr_nonesmer_lqcf_metrics <- sm(graph_metrics(clbr_nonesmer_lqcf))
clbr_all_lqcf_metrics <- sm(graph_metrics(clbr_all_lqcf))
cl14_esmer_lqcf_metrics <- sm(graph_metrics(cl14_esmer_lqcf))
cl14_nonesmer_lqcf_metrics <- sm(graph_metrics(cl14_nonesmer_lqcf))
cl14_all_lqcf_metrics <- sm(graph_metrics(cl14_all_lqcf))
```

For the following plots, I think I will focus on the combined data unless there is something which
is unclear unless it is separated.

## Correlation heatmaps

The correlations in these plots are exceedingly high, 0.7<x<1.0.

```{r all_heatmaps}
cl_esmer_lqcf_metrics$corheat
cl_nonesmer_lqcf_metrics$corheat
## Once again the differences between esmer and nonesmer are striking
cl_all_lqcf_metrics$corheat
## HPGL0485 (cl14 epimastigote) is clearly the strangest sample of all.
## Its cohorts, HPGL0128 looks most like a CL14 60 hour sample; while HPGL0129
## looks like a CLBr epimastigote.
## HPGL0490 is also problematic, as it looks very similar to the CLBr 96 hour samples.

cl_esmer_lqcf_metrics$disheat
cl_nonesmer_lqcf_metrics$disheat
cl_all_lqcf_metrics$disheat
## Same story as above I think, but the problematic samples HPGL0485 is even
## starker to my eyes.

cl_all_lqcf_metrics$smc
```

Once again, I see that the nonesmer mappings are surprisingly (to me) different than the
esmer.  So much so that in the most pathologically difficult samples, we get pretty big shifts
in clustering for the epimastigote and trypomastigote samples.  This is starting to make me think
that a large number of the genes attributed to the 'Unassigned' haplotype are actually Esmeraldo.

## PCA

For the PCA plots, I think it might prove useful to look at the CLBr and CL14 samples separately.
However I will start out with the entire set like the heatmaps.

```{r pca_plots}
cl_esmer_lqcf_metrics$pc_plot
cl_nonesmer_lqcf_metrics$pc_plot
cl_all_lqcf_metrics$pc_plot
cl14_esmer_lqcf_metrics$pc_plot
cl14_nonesmer_lqcf_metrics$pc_plot
cl14_all_lqcf_metrics$pc_plot
clbr_esmer_lqcf_metrics$pc_plot
clbr_nonesmer_lqcf_metrics$pc_plot
clbr_all_lqcf_metrics$pc_plot
```

## A separate, simpler Figure S02A:

Note that this figure must be generated before removing troubling samples because I reused the names
of some of the expressionsets.

```{r figure_s02a_simpler, eval=FALSE}
## 20170214 This is causing a segmentation fault. hmm it seems fixed now...
## All I did was restart the computer, R worries my sometimes.
cl_all_written <- write_expt(cl_all_expt,
                             excel=paste0("excel/testing-v", ver, ".xlsx"))
svg(file="images/figS02a_atb_v2.3.svg")
cl_all_lqcf_metrics$corheat
dev.off()
testing <- sm(normalize_expt(cl_all_expt, filter=FALSE, convert="cpm",
                             norm="quant", transform="log2"))
testing_disheat <- plot_disheat(testing)
testing_pca <- plot_pca(testing, plot_labels=FALSE)
pdf(file="images/figS02a_atb_v2.2_all_samples.pdf")
testing_disheat$plot
dev.off()
svg(file="images/figS02b_atb_v2.2_all_samples.svg")
testing_pca$plot
dev.off()

## I think therefore we want:
## 1.  A PCA of all samples log2(cpm(quant(filt(sva())))) (not in that order)
## 2.  Corresponding correlation heatmap, distance, and smc.
test_data <- sm(normalize_expt(cl_all_expt, transform="log2", norm="quant",
                               convert="cpm", filter=TRUE))
##test_write <- write_expt(cl_all_expt, transform="log2", norm="quant",
##                         convert="cpm", filter=TRUE,
##                         excel=paste0("excel/cl_all_expt-v", ver, ".xlsx"))
fig_pca <- plot_pca(test_data)
fig_cd <- plot_corheat(test_data)
fig_dd <- plot_disheat(test_data)
fig_smc <- plot_sm(test_data)
svg(file="images/figS02v2a_atb_v2.0.svg")
fig_pca$plot
dev.off()
pdf(file="images/figS02v2b_atb_v2.0.pdf")
fig_cd$plot
dev.off()
pdf(file="images/figS02v2c_atb_v2.0.pdf")
fig_dd$plot
dev.off()
svg(file="images/figS02v2d_atb_v2.0.svg")
fig_smc$plot
dev.off()
```

## A couple of possible implementations of Figure 2

The text introducing Figure 2 is as follows:

"Mapped sequencing data derived from all libraries were analyzed using two methods to inspect the
relationships between samples and to identify outliers. Samples were subjected to principal
component analysis (PCA) as well as hierarchical clustering. The resulting heat map and PCA plot
(Figure 2) showed a clear separation between samples as well as the expected clustering between
biological triplicates, except for one library generated from RNA isolated from CL Brener
trypomastigotes, which was identified as an outlier and was excluded from further analyses (Figure
2C). Of the 24887 genes analyzed, 23191 were expressed at a level of  at least 1 count per million
in at least 2 samples."

I take this to mean that we are performing a heatmap and PCA of _all_ samples.

The heat map provided in the powerpoint, appears to me to not have all samples, but is instead
missing the CL14 epimastigotes.  While it is true we remove them, this piece of text does say we are
looking at all samples -- which is it?

```{r figure2_silly}
library(Biobase)
new_names <- paste0(rownames(pData(cl_all_lqcf$expressionset)), "_",
                    pData(cl_all_lqcf$expressionset)$originalcond, "_",
                    pData(cl_all_lqcf$expressionset)$batch)
distance_plot <- plot_disheat(cl_all_lqcf, expt_names=new_names)
correlation_plot <- plot_corheat(cl_all_lqcf, expt_names=new_names)

library(Heatplus)
row_correlations <- hpgl_cor(exprs(cl_all_lqcf$expressionset))
row_design <- as.data.frame(pData(cl_all_lqcf$expressionset))
## Reorder the design to match the correlation clustering order
row_design <- row_design[rownames(row_correlations), ]

column_distances <- as.matrix(dist(t(exprs(cl_all_lqcf$expressionset))))
column_design <- as.data.frame(pData(cl_all_lqcf$expressionset))
## Reorder these according to the distance clustering
column_design <- column_design[rownames(column_distances), ]

## Now choose the columns of the design to add as annotations
row_design <- row_design[, c("batch","replicate")]
row_design$replicate <- as.factor(row_design$replicate)
column_design <- column_design[, c("type","stage")]

my_annotations <- list(Row=list(data=row_design),
                       Col=list(data=column_design))
my_labels <- list(Row=list(nrow=5),
                  Col=list(nrow=5))
my_cluster_distances <- list(Row=list(cuth=2),
                             Col=list(cuth=100))
cordist <- column_distances
for (r in 1:nrow(row_correlations)) {
    for (c in 1:ncol(row_correlations)) {
        if (c > r) {
            cordist[r,c] <- row_correlations[r,c]
        } else if (r == c) {
            cordist[r,c] <- 0
        }
    }
}
## ok, so now we go from -1 (highest correlation) to 0 (lowest correlation),
## then up to high numbers (distance)
map1 <- annHeatmap2(cordist, scale="none",
                    cluster=my_cluster_distances,
                    ann=my_annotations,
                    labels=my_labels)
## heatmap.2 heatmaps have: $breaks (color breaks), $col (colors by breaks),
## and $colorTable (low,high,color assignments)
## annHeatmap2 has $breaks and $col -- perhaps I can use that.
plot(map1)
```

# Remove troubled samples

My previous notes, and the plots above suggest strongly that the samples HPGL0485, HPGL0490, and HPGL0163 are problematic.
I have already excluded HPGL0163, so now I just want to drop the other two.

I am going to therefore create new base kexpt objects and overwrite the normalized data with the new normalized data.

```{r sample_removal, fig.show="hide"}
old_subset <- "sampleid!='HPGL0485'&sampleid!='HPGL0490'&sampleid!='HPGL0129'&sampleid!='HPGL0128'"
##new_subset <- "sampleid!='HPGL0128'&sampleid!='HPGL0129'&sampleid!='HPGL0483'&sampleid!='HPGL0485'&sampleid!='HPGL0486'&sampleid!='HPGL0487'&sampleid!='HPGL0490'"
##new_subset <- "stage!='Epi'&sampleid!='HPGL0490'"
outlier_subset <- "sampleid!='HPGL0490'"
noepi_subset <- "stage!='Epi'"
cl_all_kexptv1 <- subset_expt(cl_all_expt, subset=outlier_subset)
cl_all_kexpt <- subset_expt(cl_all_kexptv1, subset=noepi_subset)
expt_filename <- paste0("excel/cl_all_kexptv1-", ver, ".xlsx")
printed_expt <- write_expt(cl_all_kexptv1, violin=FALSE,
                           excel=expt_filename,
                           filter=TRUE, transform="log2", norm="quant",
                           convert="raw", batch="sva")

cl_esmer_kexpt <- subset_expt(cl_esmer_expt, subset=outlier_subset)
cl_nonesmer_kexpt <- subset_expt(cl_nonesmer_expt, subset=outlier_subset)
cl_all_kexpt <- subset_expt(cl_all_expt, subset=outlier_subset)
clbr_esmer_kexpt <- subset_expt(clbr_esmer_expt, subset=outlier_subset)
clbr_nonesmer_kexpt <- subset_expt(clbr_nonesmer_expt, subset=outlier_subset)
clbr_all_kexpt <- subset_expt(clbr_all_expt, subset=outlier_subset)
cl14_esmer_kexpt <- subset_expt(cl14_esmer_expt, subset=outlier_subset)
cl14_nonesmer_kexpt <- subset_expt(cl14_nonesmer_expt, subset=outlier_subset)
cl14_all_kexpt <- subset_expt(cl14_all_expt, subset=outlier_subset)
cl_esmer_lqcf <- default_norm(cl_esmer_kexpt, transform="log2")
cl_nonesmer_lqcf <- default_norm(cl_nonesmer_kexpt, transform="log2")
cl_all_lqcf <- default_norm(cl_all_kexpt, transform="log2")
clbr_esmer_lqcf <- default_norm(clbr_esmer_kexpt, transform="log2")
clbr_nonesmer_lqcf <- default_norm(clbr_nonesmer_kexpt, transform="log2")
clbr_all_lqcf <- default_norm(clbr_all_kexpt, transform="log2")
cl14_esmer_lqcf <- default_norm(cl14_esmer_kexpt, transform="log2")
cl14_nonesmer_lqcf <- default_norm(cl14_nonesmer_kexpt, transform="log2")
cl14_all_lqcf <- default_norm(cl14_all_kexpt, transform="log2")

cl_esmer_lqcf_metrics <- sm(graph_metrics(cl_esmer_lqcf))
cl_nonesmer_lqcf_metrics <- sm(graph_metrics(cl_nonesmer_lqcf))
cl_all_lqcf_metrics <- sm(graph_metrics(cl_all_lqcf))
clbr_esmer_lqcf_metrics <- sm(graph_metrics(clbr_esmer_lqcf))
clbr_nonesmer_lqcf_metrics <- sm(graph_metrics(clbr_nonesmer_lqcf))
clbr_all_lqcf_metrics <- sm(graph_metrics(clbr_all_lqcf))
cl14_esmer_lqcf_metrics <- sm(graph_metrics(cl14_esmer_lqcf))
cl14_nonesmer_lqcf_metrics <- sm(graph_metrics(cl14_nonesmer_lqcf))
cl14_all_lqcf_metrics <- sm(graph_metrics(cl14_all_lqcf))
```

## A separate, simpler Figure 02A:

```{r figure_02a_simpler}
distance_plot <- plot_disheat(cl_all_lqcf)

testing_data <- normalize_expt(cl_all_kexpt, transform="log2", convert="cpm",
                               filter=TRUE, batch="ssva")
correlation_plot <- plot_corheat(testing_data, keysize=1.5)
pdf(file="images/fig02a_atb_v2.3.pdf")
correlation_plot$plot
dev.off()
pca_plot <- plot_pca(testing_data, plot_labels=FALSE)
svg(file="images/fig02b_atb_v2.3.svg")
pca_plot$plot
dev.off()
```

# Re-visualize chosen normalized metrics

Now the visualizations should be maximally pretty.

```{r vis_pretty}
cl_nonesmer_lqcf_metrics$corheat
clbr_nonesmer_lqcf_metrics$corheat
cl14_nonesmer_lqcf_metrics$corheat
cl_nonesmer_lqcf_metrics$pcaplot
clbr_nonesmer_lqcf_metrics$pcaplot
cl14_nonesmer_lqcf_metrics$pcaplot
```

# Violins!

Lets see if my new violin function works.

```{r violin, eval=FALSE}
clbr_filt <- normalize_expt(clbr_all_kexpt, filter="cbcb")
varpart_cb <- varpart(clbr_filt, predictor=NULL, factors=c("condition", "actualbatch"))
varpart_cb$partition_plot
```

```{r variants}
gff_file <- "reference/gff/clbrener_8.1_complete_genes.gff.gz"
gff_annotations <- sm(load_gff_annotations(gff_file, type="gene"))
rownames(gff_annotations) <- gff_annotations[["ID"]]

all_expt <- create_expt(metadata="sample_sheets/cl14clbr_samples_combined_all.xlsx",
                        file_column="clbrenerfile", gene_info=gff_annotations)
all_expt <- exclude_genes_expt(all_expt)
var_expt <- subset_expt(all_expt, subset="type=='CLBr'")

snp_expt <- count_expt_snps(var_expt, type="percent", annot_column="bcffile")

snp_numbers <- plot_libsize(snp_expt)
snp_numbers$plot

snp_sets <- get_snp_sets(snp_expt, factor="condition")

snp_summary <- snps_vs_genes(var_expt, snp_sets)
snp_genes <- snps_vs_intersections(var_expt, snp_sets,
                                   chr_column="seqnames")

snp_norm <- normalize_expt(snp_expt, filter=TRUE, transform="log2")
test <- plot_corheat(snp_norm)
test2 <- plot_sample_heatmap(snp_norm, Rowv=FALSE)

```

[index.html](index.html)
[01_annotation.html](01_annotation.html)

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
```
