2 Create expressionset

Note that as of 20190124, two samples are still missing: hpgl0914 and hpgl0749. I am rerunning their mapping with salmon now with the assumption that I just missed them previously. I noted their status in the ‘skipped’ column of the online sample sheet.

Note 20190125: hpgl0749 mapped; but hpgl0914 has some weirdness still. It looks like the forward reads have a gzip CRC error, so I used zcat to extract the available data and will then retrim/remap.

Another note, I am using the ‘state’ column, which was missing the field ‘la_infected’ for the Leishmania amazonensis samples; this resulted in a set of ‘NA’ conditions. I therefore added la_infected to the relevant fields to the sample sheet online and my working sheet.

I also found an error in the time attribution for sample hpgl0461. The time was set to undefined, while in the study it was t24h. That has been changed in both the online and my copy of the sample sheet.

## Reading the sample metadata.
## The sample definitions comprises: 292 rows(samples) and 35 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count tables.
## Matched 19629 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.

3 Extract the mbio data

Now we have a 291 sample data set, but we only want the samples from the human portion of the mBio paper, which Najib helpfully defined in the ‘study’ column of the sample sheet as ‘mBio’.

Thus I will pull those samples from the sample sheet and set the conditions/batches to what I am assuming are reasonable values. An important caveat: we need to concatenate the existing columns: ‘expt_time’ and ‘state’ in order to get useful values for the condition.

In addition, I am removing the L. amazonensis samples for the moment.

## There were 273, now there are 82 samples.
## There were 82, now there are 66 samples.

4 Perform a few metrics

Now make some plots and see if I get similar ones to those observed in the paper.

Here is the link with the PCA plots and such: https://mbio.asm.org/content/7/3/e00027-16/figures-only

Unless I am mistaken, the only things I have to compare against are some fancy PCA plots in the main paper and a few raw-ish ones in the supplemental.

5 Normalize and plot more

5.2 limma batch adjust

In this iteration, we use limma’s function to remove batch effect, which I think is what was used in order to make the figure in the paper. This is borne out by the fact that the image generated is nearly identical to the one in the paper.

## This function will replace the expt$expressionset slot with:
## log2(limma(cpm(quant(hpgl(data)))))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Step 1: performing count filter with option: hpgl
## Removing 7814 low-count genes (11815 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1337 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with limma.
## Note to self:  If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## batch_counts: Before batch correction, 47487 entries 0<=x<1.
## batch_counts: Before batch correction, 1337 entries are >= 0.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 1337 entries are x<=0.
## The be method chose 9 surrogate variable(s).
## batch_counts: Using limma's removeBatchEffect to remove batch effect.
## If you receive a warning: 'NANs produced', one potential reason is that the data was quantile normalized.
## The number of elements which are < 0 after batch correction is: 1776
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
## Not putting labels on the plot.

5.3 svaseq batch adjust

Finally, I employ my favorite method: svaseq(). This squishes the time-based differences in the data and highlights the differences between the various infection states.

## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(hpgl(data))))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Step 1: performing count filter with option: hpgl
## Removing 7814 low-count genes (11815 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 5536 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self:  If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## batch_counts: Before batch correction, 46466 entries 0<=x<1.
## batch_counts: Before batch correction, 5536 entries are >= 0.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 5536 entries are x<=0.
## The be method chose 8 surrogate variable(s).
## Attempting svaseq estimation with 8 surrogates.
## The number of elements which are < 0 after batch correction is: 1084
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
## Not putting labels on the plot.

5.4 Bead effect

I do not recall what methods were used to estimate the ‘bead effect’ in the data. Therefore I am copy/pasting the relevant logs from Laura and will then try to recapitulate the tasks performed separately.

I think the makeTab() function is what was used to regenerate the p-values for the bead-adjusted data.

Following the function definition is a representative invocation performed by Laura. (I copy/pasted from her log with minor formatting changes).

## Quantile normalize counts
countsSubQ <- qNorm(counts)
## Specify model
mod = model.matrix(~0+condition+batch)
## Use voom to transform quantile-normalized count data to log2-counts per million, estimate mean-variance relationship
## and use m-v relationship to computer appropriate observational-level weights
v <- voom(countsSubQ, mod)
## Fit a linear model for each gene using the specified design contained in v
fit <- lmFit(v)

makeTab <- function(contrFit, coef1, coef2, ...) {
  ## Compute test statistic
  stat <- pmin(abs(contrFit$t[, coef1]), abs(contrFit$t[, coef2]))
  ## Compute pvalue for stat
  pval <- pmax(contrFit$p.value[, coef1], contrFit$p.value[, coef2])
  ## Adjust pvalue for multiple testing
  adj.pval <- p.adjust(pval, method="BH")
  ## Make the toptable
  tab <- topTable(contrFit, coef=coef1, sort.by="none", ...)
  coef1_name <- colnames(contrFit$coef)[coef1]
  coef2_name <- colnames(contrFit$coef)[coef2]
  new_tab <- data.frame(tab$ID, tab$logFC, contrFit$coef[, coef2], tab$AveExpr,
                        tab$t, contrFit$t[, coef2], stat, pval, adj.pval)
  new_tab <- new_tab[order(-stat), ]

  colnames(new_tab) <- c("ID", paste0("logFC_", coef1_name),
                         paste0("logFC_", coef2_name),
                         "AveExpr",
                         paste0("t_", coef1_name),
                         paste0("t_", coef2_name),
                         "stat",
                         "P.Value",
                         "adj.P.Value")
  new_tab
}

## eBayes finds an F-statistic from the set of t-statistics for that gene
beads24.infLM24.contr.mat <- makeContrasts(uninf_inf=(conditioninfLM24-conditionuninf24),
                                           beads_inf=(conditioninfLM24-conditionbeads24),
                                           levels=v$design)
beads24.infLM24.fit <- contrasts.fit(fit, beads24.infLM24.contr.mat)
beads24.infLM24.eb <- eBayes(beads24.infLM24.fit)

beads24.infLM24.topTab <- makeTab(beads24.infLM24.eb, 1, 2, number=nrow(v$E))

As far as the above goes, it mostly makes sense. My question is, how do we get the modified logFC values? Presumably that is later down in the log.

5.5 makeTab2

Looking further down I found the following invocations, which partially but incompletely answer my question.

## Define makeTab2 function
## construct a DE result table for infection vs. uninfected and beads
## contrFit: the result of eBayes after conrasts.fit
## cellmeansFit: the cell means fit (lmFit(v) above)
## conjContrasts: the 'conjuctive' null test (infection vs. uninf AND infect vs. beads)
## disjContrast: the 'other' test (beads vs. uninf)
makeTab2 <- function(contrFit, cellmeansFit, conjContrasts, disjContrast) {
  ## Get average expression for all relevant terms
  contr_level_counts <- rowSums(contrFit$contrasts[, c(conjContrasts, disjContrast)] != 0)
  ## Define the condition levels involved in the tests
  levels_to_use <- names(contr_level_counts)[contr_level_counts > 0]
  ## Extract the average counts for each, make into table
  ave_expression_mat <- cellmeansFit$coef[, levels_to_use]
  exp_table <- data.frame(ID=rownames(ave_expression_mat))
  exp_table <- cbind(exp_table, as.data.frame(ave_expression_mat))
  names(exp_table)[-1] <- paste(
    "AveExpr", gsub("condition","",levels_to_use),
    sep=":")
  ## Compute test statistic, adjusted pval, and logFC for conjuctive test
  ## Add to table
  stat <- rowMins(abs(contrFit$t[, conjContrasts]))
  pval <- rowMaxs(contrFit$p.value[, conjContrasts])
  adj.pval <- p.adjust(pval, method="BH")
  fcs <- as.data.frame(contrFit$coef[, conjContrasts])
  names(fcs) <- paste("logFC", names(fcs), sep=":")
  conj_pvals <- as.data.frame(apply(contrFit$p.value[, conjContrasts], 2,
                                    p.adjust, method="BH"))
  names(conj_pvals) <- paste("adj.P.Val", names(conj_pvals), sep=":")
  conj_table <- data.frame(ID=rownames(contrFit))
  conj_table <- cbind(conj_table, fcs, conj_pvals, stat=stat, adj.P.Value=adj.pval)
  names(conj_table)[seq(2 + 2 * length(conjContrasts), ncol(conj_table))] <- paste(
    c("stat","adj.P.Value"),
    paste(conjContrasts,collapse=":"),
    sep=":")
  ## Make the table for the 'other' test
  disj_table <- data.frame(ID=rownames(contrFit),
                           logFC=contrFit$coef[, disjContrast],
                           adj.P.Value=p.adjust(contrFit$p.value[, disjContrast], method="BH"))
  names(disj_table)[-1] <- paste(c("logFC", "adj.P.Value"), disjContrast, sep=":")
  ## Combine tables, making sure all tables are in the same order
  stopifnot(all(exp_table$ID == conj_table$ID & exp_table$ID == disj_table$ID))
  out_table <- cbind(exp_table, conj_table[, -1], disj_table[, -1])

  ## order output table by the statistic in the disjunctive test
  o <- order(-stat)
  out_table[o,]
}

infLM4.infLM24.contr.mat <- makeContrasts(uninf_inf=((conditioninfLM24-conditionuninf24)-(conditioninfLM4-conditionuninf4)),
                                          beads_inf=((conditioninfLM24-conditionbeads24)-(conditioninfLM4-conditionbeads4)),
                                          uninf_beads=((conditionbeads24-conditionuninf24)-(conditionbeads4-conditionuninf4)), levels=v$design)
infLM4.infLM24.fit <- contrasts.fit(fit, infLM4.infLM24.contr.mat)
infLM4.infLM24.eb <- eBayes(infLM4.infLM24.fit)

infLM4.infLM24.topTab <- makeTab2(infLM4.infLM24.eb, fit, c("uninf_inf", "beads_inf"),
                                  c("uninf_beads"))

I think that is everything performed. If I understand what I see, then it is doing the following:

  • quantile normalize and filter the count table.
  • perform voom with a condition+batch experimental model.
  • invoke lmFit on the result.
  • set up a set of contrasts which include:
    • uninf_inf = (infected at time y - uninfected at time y) - (infected at time x - uninfected at time x)
    • beads_inf = (infected at time y - beads at time y) - (infected at time x - beads at time x)
    • uninf_beads = (beads at time y - uninfected at time y) - (beads at time x - uninfected at time x)
  • Invoke contrasts.fit and eBayes
  • Invoke the makeTab(2)() function with the eBayes result as arg1, the lmFit result as arg2, and the character list of c(“uninf_inf”, “beads_inf”) as the third arg and finally “uninf_beads” as the final argument. makeTab2() does the following:
    • uses the original fit to extract the average expression values.
    • finds the minimum t statistic and maximum pvalue for each gene from the uninf_inf and beads_inf columns.
    • uses p.adjust on this set of maximized pvalues.
    • pulls the logFC values for each of the uninf_inf and beads_inf columns.
    • does p.adjust on the pvalues of the contrast pvalues.
    • uses cbind on these pieces to make a single table.
    • uses cbind on the expression table, the newly created table (conj) and the disjunct table.
    • orders them according to the t statistic.

I do not see how this set of operations gives us a better picture of the effect of beads during an infection. The primary thing I see in it is the modification of the p-values and the compound contrast of (infy-uninfy)-(infx-uninfx) It seems to me that this is the perfect time for an interaction model?

6 Implementing the contrasts from the paper

With the above in mind, it is pretty trivial for me to perform limma/edger with the same contrasts. I will first invoke my interpretation of the paper contrasts using limma_pairwise() and for the 4 hour data lmajor data.

After rereading the previous implementation, I think I get it. It was in fact using two contrasts: infected/uninfected and infected/beads. It reported the infected/beads result and then took the least significant of the p-value and t statistics of the two contrasts, re-adjusted them, and reported these.

6.1 hpgltools method

## This function will replace the expt$expressionset slot with:
## hpgl(data)
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
##  some metrics are easier to see when the data is log2 transformed, but
##  EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted.  It is often advisable to cpm/rpkm
##  the data to normalize for sampling differences, keep in mind though that rpkm
##  has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
##  will try to detect this).
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: hpgl
## Removing 7814 low-count genes (11815 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
## Using limma's removeBatchEffect to visualize with(out) batch inclusion.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.

6.2 Compare against a sheet I downloaded from the paper.

I saved the worksheet ‘infLM4_before’ as inline-supplementary-material-5_infLM4_before.csv It is 4 hpi / uninfected, which is happily a contrast I performed.

## Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4],
## 'X5' [5]
## Parsed with column specification:
## cols(
##   `DE genes in L. major-infected human macrophages relative to uninfected controls, 4 hpi, not accounting for phagocytosis` = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character()
## )
## [1] 5118   52
## Warning: NaNs produced
## 
##  Pearson's product-moment correlation
## 
## data:  common[["Fold change"]] and common[["limma_logfc"]]
## t = 160, df = 2000, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9610 0.9672
## sample estimates:
##    cor 
## 0.9642
## Used Bon Ferroni corrected t test(s) between columns.

## Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4],
## 'X5' [5], 'X6' [6]
## Parsed with column specification:
## cols(
##   `DE genes in L. major-infected human macrophages relative to uninfected controls, 4 hpi, with accounting for phagocytosis` = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character(),
##   X6 = col_character()
## )
## [1] 2955   53
## Warning: NaNs produced
## 
##  Pearson's product-moment correlation
## 
## data:  common[["Fold change beads v inf"]] and common[["limma_logfc"]]
## t = 110, df = 1200, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9451 0.9561
## sample estimates:
##    cor 
## 0.9509
## Used Bon Ferroni corrected t test(s) between columns.

R version 3.5.2 (2018-12-20)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: ruv(v.0.9.7), bindrcpp(v.0.2.2), hpgltools(v.2018.11), Biobase(v.2.42.0) and BiocGenerics(v.0.28.0)

loaded via a namespace (and not attached): tidyselect(v.0.2.5), lme4(v.1.1-19), htmlwidgets(v.1.3), RSQLite(v.2.1.1), AnnotationDbi(v.1.44.0), grid(v.3.5.2), BiocParallel(v.1.16.5), devtools(v.2.0.1), munsell(v.0.5.0), codetools(v.0.2-16), preprocessCore(v.1.45.0), units(v.0.6-2), withr(v.2.1.2), colorspace(v.1.4-0), GOSemSim(v.2.8.0), knitr(v.1.21), rstudioapi(v.0.9.0), stats4(v.3.5.2), Vennerable(v.3.1.0.9000), robustbase(v.0.93-3), DOSE(v.3.8.2), labeling(v.0.3), urltools(v.1.7.1), tximport(v.1.10.1), GenomeInfoDbData(v.1.2.0), bit64(v.0.9-7), farver(v.1.1.0), rprojroot(v.1.3-2), xfun(v.0.4), R6(v.2.3.0), doParallel(v.1.0.14), GenomeInfoDb(v.1.18.1), locfit(v.1.5-9.1), bitops(v.1.0-6), fgsea(v.1.8.0), gridGraphics(v.0.3-0), DelayedArray(v.0.8.0), assertthat(v.0.2.0), scales(v.1.0.0), ggraph(v.1.0.2), nnet(v.7.3-12), enrichplot(v.1.2.0), gtable(v.0.2.0), sva(v.3.30.1), processx(v.3.2.1), rlang(v.0.3.1), genefilter(v.1.64.0), splines(v.3.5.2), rtracklayer(v.1.42.1), lazyeval(v.0.2.1), acepack(v.1.4.1), europepmc(v.0.3), checkmate(v.1.9.1), yaml(v.2.2.0), reshape2(v.1.4.3), GenomicFeatures(v.1.34.1), backports(v.1.1.3), qvalue(v.2.14.1), Hmisc(v.4.1-1), RBGL(v.1.58.1), clusterProfiler(v.3.10.1), tools(v.3.5.2), usethis(v.1.4.0), ggplotify(v.0.0.3), ggplot2(v.3.1.0), gplots(v.3.0.1), RColorBrewer(v.1.1-2), sessioninfo(v.1.1.1), ggridges(v.0.5.1), Rcpp(v.1.0.0), plyr(v.1.8.4), base64enc(v.0.1-3), progress(v.1.2.0), zlibbioc(v.1.28.0), purrr(v.0.2.5), RCurl(v.1.95-4.11), ps(v.1.3.0), prettyunits(v.1.0.2), rpart(v.4.1-13), viridis(v.0.5.1), cowplot(v.0.9.4), S4Vectors(v.0.20.1), SummarizedExperiment(v.1.12.0), ggrepel(v.0.8.0), cluster(v.2.0.7-1), colorRamps(v.2.3), fs(v.1.2.6), variancePartition(v.1.12.1), magrittr(v.1.5), data.table(v.1.12.0), DO.db(v.2.9), openxlsx(v.4.1.0), triebeard(v.0.3.0), packrat(v.0.5.0), matrixStats(v.0.54.0), pkgload(v.1.0.2), hms(v.0.4.2), evaluate(v.0.12), xtable(v.1.8-3), pbkrtest(v.0.4-7), XML(v.3.98-1.16), IRanges(v.2.16.0), gridExtra(v.2.3), testthat(v.2.0.1), compiler(v.3.5.2), biomaRt(v.2.38.0), tibble(v.2.0.1), KernSmooth(v.2.23-15), crayon(v.1.3.4), minqa(v.1.2.4), htmltools(v.0.3.6), mgcv(v.1.8-26), corpcor(v.1.6.9), Formula(v.1.2-3), geneplotter(v.1.60.0), tidyr(v.0.8.2), DBI(v.1.0.0), tweenr(v.1.0.1), MASS(v.7.3-51.1), Matrix(v.1.2-15), readr(v.1.3.1), cli(v.1.0.1), gdata(v.2.18.0), bindr(v.0.1.1), igraph(v.1.2.2), GenomicRanges(v.1.34.0), pkgconfig(v.2.0.2), rvcheck(v.0.1.3), GenomicAlignments(v.1.18.1), foreign(v.0.8-71), xml2(v.1.2.0), foreach(v.1.4.4), annotate(v.1.60.0), XVector(v.0.22.0), stringr(v.1.3.1), callr(v.3.1.1), digest(v.0.6.18), graph(v.1.60.0), Biostrings(v.2.50.2), rmarkdown(v.1.11), fastmatch(v.1.1-0), htmlTable(v.1.13.1), edgeR(v.3.24.3), Rsamtools(v.1.34.0), gtools(v.3.8.1), nloptr(v.1.2.1), nlme(v.3.1-137), jsonlite(v.1.6), desc(v.1.2.0), viridisLite(v.0.3.0), limma(v.3.38.3), pillar(v.1.3.1), lattice(v.0.20-38), DEoptimR(v.1.0-8), httr(v.1.4.0), pkgbuild(v.1.0.2), survival(v.2.43-3), GO.db(v.3.7.0), glue(v.1.3.0), remotes(v.2.0.2), zip(v.1.0.0), UpSetR(v.1.3.3), iterators(v.1.0.10), pander(v.0.6.3), bit(v.1.1-14), ggforce(v.0.1.3), stringi(v.1.2.4), blob(v.1.1.1), DESeq2(v.1.22.2), latticeExtra(v.0.6-28), caTools(v.1.17.1.1), memoise(v.1.1.0) and dplyr(v.0.7.8)

## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset b0e11455e9f02944597c1bb5027f8ecbeb14201b
## This is hpgltools commit: Fri Jan 18 13:42:11 2019 -0500: b0e11455e9f02944597c1bb5027f8ecbeb14201b
---
title: "20190124 Recapitulating previous results: which genes are DE in human macrophages at 4hrs upon infection with L. major?"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
</style>

```{r options, include=FALSE}
library(hpgltools)
tt <- sm(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=120,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      fig.width=8,
                      fig.height=8,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=12))
ver <- "20190124"
rundate <- format(Sys.Date(), format="%Y%m%d")
rmd_file <- paste0("20190124_mbio_human.Rmd")
```

# Gather annotation data

```{r annotations}
hs_annot <- load_biomart_annotations()$annotation
rownames(hs_annot) <- make.names(
  paste0(hs_annot[["ensembl_transcript_id"]], ".",
         hs_annot[["transcript_version"]]),
  unique=TRUE)
hs_tx_gene <- hs_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
hs_tx_gene[["id"]] <- rownames(hs_tx_gene)
hs_tx_gene <- hs_tx_gene[, c("id", "ensembl_gene_id")]
new_hs_annot <- hs_annot
rownames(new_hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique=TRUE)
```

# Create expressionset

Note that as of 20190124, two samples are still missing: hpgl0914 and hpgl0749.
I am rerunning their mapping with salmon now with the assumption that I just
missed them previously.  I noted their status in the 'skipped' column of the
online sample sheet.

Note 20190125: hpgl0749 mapped; but hpgl0914 has some weirdness still.  It looks
like the forward reads have a gzip CRC error, so I used zcat to extract the
available data and will then retrim/remap.

Another note, I am using the 'state' column, which was missing the field
'la_infected' for the Leishmania amazonensis samples; this resulted in a set of
'NA' conditions.  I therefore added la_infected to the relevant fields to the
sample sheet online and my working sheet.

I also found an error in the time attribution for sample hpgl0461.  The time was
set to undefined, while in the study it was t24h.  That has been changed in both
the online and my copy of the sample sheet.

```{r expressionset}
sample_sheet <- "sample_sheets/all_leishmania_samples_20190124.xlsx"
lots <- create_expt(sample_sheet,
                    gene_info=new_hs_annot,
                    tx_gene_map=hs_tx_gene)
```

# Extract the mbio data

Now we have a 291 sample data set, but we only want the samples from the human
portion of the mBio paper, which Najib helpfully defined in the 'study' column
of the sample sheet as 'mBio'.

Thus I will pull those samples from the sample sheet and set the
conditions/batches to what I am assuming are reasonable values.
An important caveat: we need to concatenate the existing columns: 'expt_time'
and 'state' in order to get useful values for the condition.

In addition, I am removing the L. amazonensis samples for the moment.

```{r subset_expt}
mbio_expt <- subset_expt(lots, subset="study=='mBio'")
mbio_expt <- subset_expt(mbio_expt, subset="studybatch!='unused'")
##mbio_expt <- subset_expt(mbio_expt, subset="pathogenspecies!='lamazonensis'")
##mbio_expt <- subset_expt(mbio_expt, subset="donor!='thp1'")
mbio_expt <- set_expt_batches(mbio_expt, "studybatch")
metadata <- pData(mbio_expt)
new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])
mbio_expt <- set_expt_conditions(mbio_expt, "state")
```

# Perform a few metrics

Now make some plots and see if I get similar ones to those observed in the
paper.

Here is the link with the PCA plots and such:
https://mbio.asm.org/content/7/3/e00027-16/figures-only

Unless I am mistaken, the only things I have to compare against are some fancy
PCA plots in the main paper and a few raw-ish ones in the supplemental.

```{r show_initial_metrics}
libsize <- plot_libsize(mbio_expt)
libsize$plot
```

# Normalize and plot more

## No batch adjust

This first plot makes no attempt to handle the various batch effects in the data.

```{r normalize_replot}
mbio_norm <- sm(normalize_expt(mbio_expt, transform="log2", convert="cpm", filter=TRUE, norm="quant"))
mbio_pca <- plot_pca(mbio_norm, size_column="expttime", plot_labels=FALSE, cis=NULL,
                     size_order=c("t4h", "t24h", "t48h", "t72h"))
##mbio_pca <- plot_pca(mbio_norm)
mbio_pca$plot
```

## limma batch adjust

In this iteration, we use limma's function to remove batch effect, which I think
is what was used in order to make the figure in the paper.  This is borne out by
the fact that the image generated is nearly identical to the one in the paper.

```{r normalize_limma_replot}
mbio_batch1 <- normalize_expt(mbio_expt, transform="log2", convert="cpm",
                              filter=TRUE, norm="quant", batch="limma")
mbio_pca1 <- plot_pca(mbio_batch1, size_column="expttime", plot_labels=FALSE,
                      cis=NULL, size_order=c("t4h", "t24h", "t48h", "t72h"))
##mbio_pca <- plot_pca(mbio_norm)
mbio_pca1$plot
```

## svaseq batch adjust

Finally, I employ my favorite method: svaseq().  This squishes the time-based
differences in the data and highlights the differences between the various
infection states.

```{r normalize_svaseq_replot}
mbio_batch2 <- normalize_expt(mbio_expt, transform="log2", convert="cpm",
                              filter=TRUE, batch="svaseq")
mbio_pca2 <- plot_pca(mbio_batch2, size_column="expttime", plot_labels=FALSE,
                      cis=NULL, size_order=c("t4h", "t24h", "t48h", "t72h"))
##mbio_pca <- plot_pca(mbio_norm)
mbio_pca2$plot
```

## Bead effect

I do not recall what methods were used to estimate the 'bead effect' in the
data.  Therefore I am copy/pasting the relevant logs from Laura and will then
try to recapitulate the tasks performed separately.

I think the makeTab() function is what was used to regenerate the p-values for
the bead-adjusted data.

Following the function definition is a representative invocation performed by
Laura. (I copy/pasted from her log with minor formatting changes).

```{r lauras_code, eval=FALSE}
## Quantile normalize counts
countsSubQ <- qNorm(counts)
## Specify model
mod = model.matrix(~0+condition+batch)
## Use voom to transform quantile-normalized count data to log2-counts per million, estimate mean-variance relationship
## and use m-v relationship to computer appropriate observational-level weights
v <- voom(countsSubQ, mod)
## Fit a linear model for each gene using the specified design contained in v
fit <- lmFit(v)

makeTab <- function(contrFit, coef1, coef2, ...) {
  ## Compute test statistic
  stat <- pmin(abs(contrFit$t[, coef1]), abs(contrFit$t[, coef2]))
  ## Compute pvalue for stat
  pval <- pmax(contrFit$p.value[, coef1], contrFit$p.value[, coef2])
  ## Adjust pvalue for multiple testing
  adj.pval <- p.adjust(pval, method="BH")
  ## Make the toptable
  tab <- topTable(contrFit, coef=coef1, sort.by="none", ...)
  coef1_name <- colnames(contrFit$coef)[coef1]
  coef2_name <- colnames(contrFit$coef)[coef2]
  new_tab <- data.frame(tab$ID, tab$logFC, contrFit$coef[, coef2], tab$AveExpr,
                        tab$t, contrFit$t[, coef2], stat, pval, adj.pval)
  new_tab <- new_tab[order(-stat), ]

  colnames(new_tab) <- c("ID", paste0("logFC_", coef1_name),
                         paste0("logFC_", coef2_name),
                         "AveExpr",
                         paste0("t_", coef1_name),
                         paste0("t_", coef2_name),
                         "stat",
                         "P.Value",
                         "adj.P.Value")
  new_tab
}

## eBayes finds an F-statistic from the set of t-statistics for that gene
beads24.infLM24.contr.mat <- makeContrasts(uninf_inf=(conditioninfLM24-conditionuninf24),
                                           beads_inf=(conditioninfLM24-conditionbeads24),
                                           levels=v$design)
beads24.infLM24.fit <- contrasts.fit(fit, beads24.infLM24.contr.mat)
beads24.infLM24.eb <- eBayes(beads24.infLM24.fit)

beads24.infLM24.topTab <- makeTab(beads24.infLM24.eb, 1, 2, number=nrow(v$E))
```

As far as the above goes, it mostly makes sense.  My question is, how do we get
the modified logFC values?  Presumably that is later down in the log.

## makeTab2

Looking further down I found the following invocations, which partially but
incompletely answer my question.

```{r lauras_maketab2, eval=FALSE}
## Define makeTab2 function
## construct a DE result table for infection vs. uninfected and beads
## contrFit: the result of eBayes after conrasts.fit
## cellmeansFit: the cell means fit (lmFit(v) above)
## conjContrasts: the 'conjuctive' null test (infection vs. uninf AND infect vs. beads)
## disjContrast: the 'other' test (beads vs. uninf)
makeTab2 <- function(contrFit, cellmeansFit, conjContrasts, disjContrast) {
  ## Get average expression for all relevant terms
  contr_level_counts <- rowSums(contrFit$contrasts[, c(conjContrasts, disjContrast)] != 0)
  ## Define the condition levels involved in the tests
  levels_to_use <- names(contr_level_counts)[contr_level_counts > 0]
  ## Extract the average counts for each, make into table
  ave_expression_mat <- cellmeansFit$coef[, levels_to_use]
  exp_table <- data.frame(ID=rownames(ave_expression_mat))
  exp_table <- cbind(exp_table, as.data.frame(ave_expression_mat))
  names(exp_table)[-1] <- paste(
    "AveExpr", gsub("condition","",levels_to_use),
    sep=":")
  ## Compute test statistic, adjusted pval, and logFC for conjuctive test
  ## Add to table
  stat <- rowMins(abs(contrFit$t[, conjContrasts]))
  pval <- rowMaxs(contrFit$p.value[, conjContrasts])
  adj.pval <- p.adjust(pval, method="BH")
  fcs <- as.data.frame(contrFit$coef[, conjContrasts])
  names(fcs) <- paste("logFC", names(fcs), sep=":")
  conj_pvals <- as.data.frame(apply(contrFit$p.value[, conjContrasts], 2,
                                    p.adjust, method="BH"))
  names(conj_pvals) <- paste("adj.P.Val", names(conj_pvals), sep=":")
  conj_table <- data.frame(ID=rownames(contrFit))
  conj_table <- cbind(conj_table, fcs, conj_pvals, stat=stat, adj.P.Value=adj.pval)
  names(conj_table)[seq(2 + 2 * length(conjContrasts), ncol(conj_table))] <- paste(
    c("stat","adj.P.Value"),
    paste(conjContrasts,collapse=":"),
    sep=":")
  ## Make the table for the 'other' test
  disj_table <- data.frame(ID=rownames(contrFit),
                           logFC=contrFit$coef[, disjContrast],
                           adj.P.Value=p.adjust(contrFit$p.value[, disjContrast], method="BH"))
  names(disj_table)[-1] <- paste(c("logFC", "adj.P.Value"), disjContrast, sep=":")
  ## Combine tables, making sure all tables are in the same order
  stopifnot(all(exp_table$ID == conj_table$ID & exp_table$ID == disj_table$ID))
  out_table <- cbind(exp_table, conj_table[, -1], disj_table[, -1])

  ## order output table by the statistic in the disjunctive test
  o <- order(-stat)
  out_table[o,]
}

infLM4.infLM24.contr.mat <- makeContrasts(uninf_inf=((conditioninfLM24-conditionuninf24)-(conditioninfLM4-conditionuninf4)),
                                          beads_inf=((conditioninfLM24-conditionbeads24)-(conditioninfLM4-conditionbeads4)),
                                          uninf_beads=((conditionbeads24-conditionuninf24)-(conditionbeads4-conditionuninf4)), levels=v$design)
infLM4.infLM24.fit <- contrasts.fit(fit, infLM4.infLM24.contr.mat)
infLM4.infLM24.eb <- eBayes(infLM4.infLM24.fit)

infLM4.infLM24.topTab <- makeTab2(infLM4.infLM24.eb, fit, c("uninf_inf", "beads_inf"),
                                  c("uninf_beads"))
```

I think that is everything performed.  If I understand what I see, then it is
doing the following:

* quantile normalize and filter the count table.
* perform voom with a condition+batch experimental model.
* invoke lmFit on the result.
* set up a set of contrasts which include:
    * uninf_inf = (infected at time y - uninfected at time y) - (infected at time x - uninfected at time x)
    * beads_inf = (infected at time y - beads at time y) - (infected at time x - beads at time x)
    * uninf_beads = (beads at time y - uninfected at time y) - (beads at time x - uninfected at time x)
* Invoke contrasts.fit and eBayes
* Invoke the makeTab(2)() function with the eBayes result as arg1, the lmFit result as arg2, and
  the character list of c("uninf_inf", "beads_inf") as the third arg and finally "uninf_beads"
  as the final argument.  makeTab2() does the following:
    * uses the original fit to extract the average expression values.
    * finds the minimum t statistic and maximum pvalue for each gene from the uninf_inf and beads_inf columns.
    * uses p.adjust on this set of maximized pvalues.
    * pulls the logFC values for each of the uninf_inf and beads_inf columns.
    * does p.adjust on the pvalues of the contrast pvalues.
    * uses cbind on these pieces to make a single table.
    * uses cbind on the expression table, the newly created table (conj) and the disjunct table.
    * orders them according to the t statistic.

I do not see how this set of operations gives us a better picture of the effect
of beads during an infection.  The primary thing I see in it is the modification
of the p-values and the compound contrast of (infy-uninfy)-(infx-uninfx)
It seems to me that this is the perfect time for an interaction model?

# Implementing the contrasts from the paper

With the above in mind, it is pretty trivial for me to perform limma/edger with the same contrasts.
I will first invoke my interpretation of the paper contrasts using limma_pairwise() and for the
4 hour data lmajor data.

After rereading the previous implementation, I think I get it.  It was in fact
using two contrasts: infected/uninfected and infected/beads.  It reported the
infected/beads result and then took the least significant of the p-value and t
statistics of the two contrasts, re-adjusted them, and reported these.

## hpgltools method

```{r pairwise, fig.show="hide"}
keepers <- list(
  "4hpi_uninf" = c("lm_infected_t4h", "uninfected_t4h"),
  "4hpi_beads" = c("lm_infected_t4h", "bead_t4h")
)
mbio_filt <- set_expt_conditions(mbio_expt, new_condition)
mbio_filt <- normalize_expt(mbio_filt, filter=TRUE)
## Something weird happened, my counts are somehow getting cast as non-integers!
## Thus I am invoking force=TRUE until I figure out what is going on.
mbio_pairwise <- all_pairwise(mbio_filt, model_batch=TRUE,
                              do_ebseq=FALSE, force=TRUE)
excel_file <- glue::glue("excel/{rundate}_mbio_pairwise_tables-v{ver}.xlsx")
mbio_tables <- sm(combine_de_tables(mbio_pairwise, keepers=keepers, excel=excel_file))
```

```{r pairwise_plots}
mbio_pairwise$comp$heat
mbio_tables[["deseq_ma_plots"]][["4hpi_uninf"]][["plot"]]
```

## Compare against a sheet I downloaded from the paper.

I saved the worksheet 'infLM4_before' as inline-supplementary-material-5_infLM4_before.csv
It is 4 hpi / uninfected, which is happily a contrast I performed.

```{r compare_pairwise}
old_table <- readr::read_csv("excel/inline-supplementary-material-5_infLM4_before.csv")
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])
rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_uninf"]]
common <- merge(old_table, new_table, by="row.names")
dim(common)
common[["Fold change"]] <- log2(as.numeric(common[["Fold change"]]))
cor.test(common[["Fold change"]], common[["limma_logfc"]])
test <- plot_linear_scatter(common[, c("Fold change", "limma_logfc")])
test$scatter

old_table <- readr::read_csv("excel/inline-supplementary-material-5_infLM4_after.csv")
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])
rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_beads"]]
common <- merge(old_table, new_table, by="row.names")
dim(common)
common[["Fold change beads v inf"]] <- log2(as.numeric(common[["Fold change beads v inf"]]))
cor.test(common[["Fold change beads v inf"]], common[["limma_logfc"]])
test <- plot_linear_scatter(common[, c("Fold change beads v inf", "limma_logfc")])
test$scatter
```

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
## message(paste0("Saving to ", savefile))
## tmp <- sm(saveme(filename=savefile))
```
