1 Gather annotation data

hs_annot <- load_biomart_annotations()$annotation
## The biomart annotations file already exists, loading from it.
rownames(hs_annot) <- make.names(
  paste0(hs_annot[["ensembl_transcript_id"]], ".",
         hs_annot[["transcript_version"]]),
  unique = TRUE)
hs_tx_gene <- hs_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
hs_tx_gene[["id"]] <- rownames(hs_tx_gene)
hs_tx_gene <- hs_tx_gene[, c("id", "ensembl_gene_id")]
new_hs_annot <- hs_annot
rownames(new_hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique = TRUE)

2 Create expressionset

Note that as of 20190124, two samples are still missing: hpgl0914 and hpgl0749. I am rerunning their mapping with salmon now with the assumption that I just missed them previously. I noted their status in the ‘skipped’ column of the online sample sheet.

Note 20190125: hpgl0749 mapped; but hpgl0914 has some weirdness still. It looks like the forward reads have a gzip CRC error, so I used zcat to extract the available data and will then retrim/remap.

Note: 20190210: All samples mapped.

Another note, I am using the ‘state’ column, which was missing the field ‘la_infected’ for the Leishmania amazonensis samples; this resulted in a set of ‘NA’ conditions. I therefore added la_infected to the relevant fields to the sample sheet online and my working sheet.

I also found an error in the time attribution for sample hpgl0461. The time was set to undefined, while in the study it was t24h. That has been changed in both the online and my copy of the sample sheet.

##sample_sheet <- "sample_sheets/all_leishmania_samples_20190124.xlsx"
sample_sheet <- "sample_sheets/all_leishmania_samples_20190225.xlsx"
lots <- create_expt(sample_sheet,
                    file_column = "hsapiensfile",
                    gene_info = new_hs_annot,
                    tx_gene_map = hs_tx_gene)
## Reading the sample metadata.
## Dropped 16 rows from the sample metadata because they were blank.
## The sample definitions comprises: 421 rows(samples) and 59 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count data.
## Warning in create_expt(sample_sheet, file_column = "hsapiensfile", gene_info
## = new_hs_annot, : Some samples were removed when cross referencing the samples
## against the count data.
## Matched 16933 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 16933 rows and 267 columns.

3 Extract the mbio data

Now we have a 267 sample data set, but we only want the samples from the human portion of the mBio paper, which Najib helpfully defined in the ‘study’ column of the sample sheet as ‘mBio’.

Thus I will pull those samples from the sample sheet and set the conditions/batches to what I am assuming are reasonable values. An important caveat: we need to concatenate the existing columns: ‘expt_time’ and ‘state’ in order to get useful values for the condition.

In addition, I am removing the L. amazonensis samples for the moment.

mbio_expt <- subset_expt(lots, subset = "study=='mbio'")
## Using a subset expression.
## There were 267, now there are 82 samples.
mbio_expt <- subset_expt(mbio_expt, subset = "studybatch!='unused'")
## Using a subset expression.
## There were 82, now there are 66 samples.
##mbio_expt <- subset_expt(mbio_expt, subset = "pathogenspecies!='lamazonensis'")
##mbio_expt <- subset_expt(mbio_expt, subset = "donor!='thp1'")
mbio_expt <- set_expt_batches(mbio_expt, "studybatch")
metadata <- pData(mbio_expt)
new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])
mbio_expt <- set_expt_conditions(mbio_expt, "state")

4 Perform a few metrics

Now make some plots and see if I get similar ones to those observed in the paper.

Here is the link with the PCA plots and such: https://mbio.asm.org/content/7/3/e00027-16/figures-only

Unless I am mistaken, the only things I have to compare against are some fancy PCA plots in the main paper and a few raw-ish ones in the supplemental.

libsize <- plot_libsize(mbio_expt)
libsize$plot

5 Normalize and plot more

5.1 No batch adjust

This first plot makes no attempt to handle the various batch effects in the data.

mbio_norm <- sm(normalize_expt(mbio_expt, transform = "log2", convert = "cpm", filter = TRUE, norm = "quant"))
mbio_pca <- plot_pca(mbio_norm, size_column = "expttime", plot_labels = FALSE, cis = NULL,
                     size_order = c("t4h", "t24h", "t48h", "t72h"))
## Not putting labels on the PC plot.
##mbio_pca <- plot_pca(mbio_norm)
mbio_pca$plot
## Warning: Use of `pca_data[["condition"]]` is discouraged. Use
## `.data[["condition"]]` instead.

5.2 limma batch adjust

In this iteration, we use limma’s function to remove batch effect, which I think is what was used in order to make the figure in the paper. This is borne out by the fact that the image generated is nearly identical to the one in the paper.

5.2.1 Original code

filterCounts = function(counts, lib.size = NULL, thresh = 1, minSamples = 2) {
  cpms <- 2^cbcbSEQ::log2CPM(counts, lib.size = lib.size)$y
  keep <- rowSums(cpms > thresh) >= minSamples
  counts <- counts[keep, ]
  counts
}
condition <- pData(mbio_expt)$condition
x <- table(condition)
countsTable <- exprs(mbio_expt)
dim(countsTable)
##[1] 20956    72
counts <- filterCounts(countsTable, thresh = 1, minSamples = min(x))
dim(counts)
##[1] 13801    72
##12666 prior to adding in THP1 batch

##Quantile normalize counts
library(preprocessCore)
countsSubQ <- cbcbSEQ::qNorm(counts)

##Explore data for batch effects
##Transform data to log2 counts per million
x <- cbcbSEQ::log2CPM(countsSubQ)

##Compute principal components
library(corpcor)
s <- cbcbSEQ::makeSVD(x$y)

batch <- pData(mbio_expt)$batch
cbcbSEQ::pcRes(s$v, s$d, condition, batch)
cbcbSEQ::plotPC(s$v, s$d)
mbio_batch1 <- sm(normalize_expt(t1, transform = "log2", convert = "cpm",
                                 filter = TRUE, norm = "quant", batch = "limma"))
## Error in normalize_expt(t1, transform = "log2", convert = "cpm", filter = TRUE, : object 't1' not found
mbio_pca1 <- plot_pca(mbio_batch1,
                      cis = NULL,
                      size_column = "expttime", size_order = c("t4h", "t24h", "t48h", "t72h"))
## Error in plot_pca(mbio_batch1, cis = NULL, size_column = "expttime", size_order = c("t4h", : object 'mbio_batch1' not found
##mbio_pca <- plot_pca(mbio_norm)
mbio_pca1$plot
## Error in eval(expr, envir, enclos): object 'mbio_pca1' not found

5.3 svaseq batch adjust

Finally, I employ my favorite method: svaseq(). This squishes the time-based differences in the data and highlights the differences between the various infection states.

mbio_batch2 <- sm(normalize_expt(mbio_expt, transform = "log2", convert = "cpm",
                                 filter = TRUE, batch = "svaseq"))
mbio_pca2 <- plot_pca(mbio_batch2, size_column = "expttime", plot_labels = FALSE,
                      cis = NULL, size_order = c("t4h", "t24h", "t48h", "t72h"))
## Not putting labels on the PC plot.
##mbio_pca <- plot_pca(mbio_norm)
mbio_pca2$plot
## Warning: Use of `pca_data[["condition"]]` is discouraged. Use
## `.data[["condition"]]` instead.

5.4 Bead effect

I do not recall what methods were used to estimate the ‘bead effect’ in the data. Therefore I am copy/pasting the relevant logs from Laura and will then try to recapitulate the tasks performed separately.

I think the makeTab() function is what was used to regenerate the p-values for the bead-adjusted data.

Following the function definition is a representative invocation performed by Laura. (I copy/pasted from her log with minor formatting changes).

## Quantile normalize counts
countsSubQ <- qNorm(counts)
## Specify model
mod = model.matrix(~0+condition+batch)
## Use voom to transform quantile-normalized count data to log2-counts per million, estimate mean-variance relationship
## and use m-v relationship to computer appropriate observational-level weights
v <- voom(countsSubQ, mod)
## Fit a linear model for each gene using the specified design contained in v
fit <- lmFit(v)

makeTab <- function(contrFit, coef1, coef2, ...) {
  ## Compute test statistic
  stat <- pmin(abs(contrFit$t[, coef1]), abs(contrFit$t[, coef2]))
  ## Compute pvalue for stat
  pval <- pmax(contrFit$p.value[, coef1], contrFit$p.value[, coef2])
  ## Adjust pvalue for multiple testing
  adj.pval <- p.adjust(pval, method = "BH")
  ## Make the toptable
  tab <- topTable(contrFit, coef = coef1, sort.by = "none", ...)
  coef1_name <- colnames(contrFit$coef)[coef1]
  coef2_name <- colnames(contrFit$coef)[coef2]
  new_tab <- data.frame(tab$ID, tab$logFC, contrFit$coef[, coef2], tab$AveExpr,
                        tab$t, contrFit$t[, coef2], stat, pval, adj.pval)
  new_tab <- new_tab[order(-stat), ]

  colnames(new_tab) <- c("ID", paste0("logFC_", coef1_name),
                         paste0("logFC_", coef2_name),
                         "AveExpr",
                         paste0("t_", coef1_name),
                         paste0("t_", coef2_name),
                         "stat",
                         "P.Value",
                         "adj.P.Value")
  new_tab
}

## eBayes finds an F-statistic from the set of t-statistics for that gene
beads24.infLM24.contr.mat <- makeContrasts(uninf_inf=(conditioninfLM24-conditionuninf24),
                                           beads_inf=(conditioninfLM24-conditionbeads24),
                                           levels = v$design)
beads24.infLM24.fit <- contrasts.fit(fit, beads24.infLM24.contr.mat)
beads24.infLM24.eb <- eBayes(beads24.infLM24.fit)

beads24.infLM24.topTab <- makeTab(beads24.infLM24.eb, 1, 2, number = nrow(v$E))

As far as the above goes, it mostly makes sense. My question is, how do we get the modified logFC values? Presumably that is later down in the log.

5.5 makeTab2

Looking further down I found the following invocations, which partially but incompletely answer my question.

## Define makeTab2 function
## construct a DE result table for infection vs. uninfected and beads
## contrFit: the result of eBayes after conrasts.fit
## cellmeansFit: the cell means fit (lmFit(v) above)
## conjContrasts: the 'conjuctive' null test (infection vs. uninf AND infect vs. beads)
## disjContrast: the 'other' test (beads vs. uninf)
makeTab2 <- function(contrFit, cellmeansFit, conjContrasts, disjContrast) {
  ## Get average expression for all relevant terms
  contr_level_counts <- rowSums(contrFit$contrasts[, c(conjContrasts, disjContrast)] != 0)
  ## Define the condition levels involved in the tests
  levels_to_use <- names(contr_level_counts)[contr_level_counts > 0]
  ## Extract the average counts for each, make into table
  ave_expression_mat <- cellmeansFit$coef[, levels_to_use]
  exp_table <- data.frame(ID = rownames(ave_expression_mat))
  exp_table <- cbind(exp_table, as.data.frame(ave_expression_mat))
  names(exp_table)[-1] <- paste(
    "AveExpr", gsub("condition","",levels_to_use),
    sep = ":")
  ## Compute test statistic, adjusted pval, and logFC for conjuctive test
  ## Add to table
  stat <- rowMins(abs(contrFit$t[, conjContrasts]))
  pval <- rowMaxs(contrFit$p.value[, conjContrasts])
  adj.pval <- p.adjust(pval, method = "BH")
  fcs <- as.data.frame(contrFit$coef[, conjContrasts])
  names(fcs) <- paste("logFC", names(fcs), sep = ":")
  conj_pvals <- as.data.frame(apply(contrFit$p.value[, conjContrasts], 2,
                                    p.adjust, method = "BH"))
  names(conj_pvals) <- paste("adj.P.Val", names(conj_pvals), sep = ":")
  conj_table <- data.frame(ID = rownames(contrFit))
  conj_table <- cbind(conj_table, fcs, conj_pvals, stat = stat, adj.P.Value = adj.pval)
  names(conj_table)[seq(2 + 2 * length(conjContrasts), ncol(conj_table))] <- paste(
    c("stat","adj.P.Value"),
    paste(conjContrasts,collapse = ":"),
    sep = ":")
  ## Make the table for the 'other' test
  disj_table <- data.frame(ID = rownames(contrFit),
                           logFC = contrFit$coef[, disjContrast],
                           adj.P.Value = p.adjust(contrFit$p.value[, disjContrast], method = "BH"))
  names(disj_table)[-1] <- paste(c("logFC", "adj.P.Value"), disjContrast, sep = ":")
  ## Combine tables, making sure all tables are in the same order
  stopifnot(all(exp_table$ID == conj_table$ID & exp_table$ID == disj_table$ID))
  out_table <- cbind(exp_table, conj_table[, -1], disj_table[, -1])

  ## order output table by the statistic in the disjunctive test
  o <- order(-stat)
  out_table[o,]
}

infLM4.infLM24.contr.mat <- makeContrasts(uninf_inf=((conditioninfLM24-conditionuninf24)-(conditioninfLM4-conditionuninf4)),
                                          beads_inf=((conditioninfLM24-conditionbeads24)-(conditioninfLM4-conditionbeads4)),
                                          uninf_beads=((conditionbeads24-conditionuninf24)-(conditionbeads4-conditionuninf4)), levels = v$design)
infLM4.infLM24.fit <- contrasts.fit(fit, infLM4.infLM24.contr.mat)
infLM4.infLM24.eb <- eBayes(infLM4.infLM24.fit)

infLM4.infLM24.topTab <- makeTab2(infLM4.infLM24.eb, fit, c("uninf_inf", "beads_inf"),
                                  c("uninf_beads"))

I think that is everything performed. If I understand what I see, then it is doing the following:

  • quantile normalize and filter the count table.
  • perform voom with a condition+batch experimental model.
  • invoke lmFit on the result.
  • set up a set of contrasts which include:
    • uninf_inf = (infected at time y - uninfected at time y) - (infected at time x - uninfected at time x)
    • beads_inf = (infected at time y - beads at time y) - (infected at time x - beads at time x)
    • uninf_beads = (beads at time y - uninfected at time y) - (beads at time x - uninfected at time x)
  • Invoke contrasts.fit and eBayes
  • Invoke the makeTab(2)() function with the eBayes result as arg1, the lmFit result as arg2, and the character list of c(“uninf_inf”, “beads_inf”) as the third arg and finally “uninf_beads” as the final argument. makeTab2() does the following:
    • uses the original fit to extract the average expression values.
    • finds the minimum t statistic and maximum pvalue for each gene from the uninf_inf and beads_inf columns.
    • uses p.adjust on this set of maximized pvalues.
    • pulls the logFC values for each of the uninf_inf and beads_inf columns.
    • does p.adjust on the pvalues of the contrast pvalues.
    • uses cbind on these pieces to make a single table.
    • uses cbind on the expression table, the newly created table (conj) and the disjunct table.
    • orders them according to the t statistic.

I do not see how this set of operations gives us a better picture of the effect of beads during an infection. The primary thing I see in it is the modification of the p-values and the compound contrast of (infy-uninfy)-(infx-uninfx) It seems to me that this might instead be time for an interaction model?

6 Implementing the contrasts from the paper

With the above in mind, it is pretty trivial for me to perform limma/edger with the same contrasts. I will first invoke my interpretation of the paper contrasts using limma_pairwise() and for the 4 hour data lmajor data.

After rereading the previous implementation, I think I get it. It was in fact using two contrasts: infected/uninfected and infected/beads. It reported the infected/beads result and then took the least significant of the p-value and t statistics of the two contrasts, re-adjusted them, and reported these.

6.1 hpgltools method

keepers <- list(
  "4hpi_uninf" = c("lm_infected_t4h", "uninfected_t4h"),
  "4hpi_beads" = c("lm_infected_t4h", "bead_t4h")
)
mbio_filt <- set_expt_conditions(mbio_expt, new_condition)
mbio_filt <- sm(normalize_expt(mbio_filt, filter = TRUE))
## Something weird happened, my counts are somehow getting cast as non-integers!
## Thus I am invoking force = TRUE until I figure out what is going on.
mbio_pairwise <- sm(all_pairwise(mbio_filt, model_batch = TRUE,
                                 do_ebseq = FALSE, force = TRUE))
excel_file <- glue::glue("excel/{rundate}_mbio_pairwise_tables-v{ver}.xlsx")
mbio_tables <- sm(combine_de_tables(mbio_pairwise, keepers = keepers, excel = excel_file))
mbio_pairwise$comp$heat

mbio_tables[["plots"]][["4hpi_uninf"]][["deseq_ma_plots"]][["plot"]]

6.2 Compare against a sheet I downloaded from the paper.

I saved the worksheet ‘infLM4_before’ as inline-supplementary-material-5_infLM4_before.csv It is 4 hpi / uninfected, which is happily a contrast I performed.

old_table <- readr::read_csv("excel/inline-supplementary-material-5_infLM4_before.csv")
## Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4], 'X5' [5]
## 
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   `DE genes in L. major-infected human macrophages relative to uninfected controls, 4 hpi, not accounting for phagocytosis` = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character()
## )
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])

rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_uninf"]]
common <- merge(old_table, new_table, by = "row.names")
dim(common)
## [1] 4308   52
common[["Fold change"]] <- as.numeric(common[["Fold change"]])
neg_idx <- common[["Fold change"]] < 0
common[neg_idx, "Fold change"] <- log2(1 / (-1 * common[neg_idx, "Fold change"]))
common[!neg_idx, "Fold change"] <- log2(common[!neg_idx, "Fold change"])

cor.test(common[["Fold change"]], common[["limma_logfc"]])
## 
##  Pearson's product-moment correlation
## 
## data:  common[["Fold change"]] and common[["limma_logfc"]]
## t = 111, df = 4306, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8534 0.8689
## sample estimates:
##    cor 
## 0.8613
test <- plot_linear_scatter(common[, c("Fold change", "limma_logfc")])
test$scatter

old_table <- readr::read_csv("excel/inline-supplementary-material-5_infLM4_before.csv")
## Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4], 'X5' [5]
## 
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   `DE genes in L. major-infected human macrophages relative to uninfected controls, 4 hpi, not accounting for phagocytosis` = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character()
## )
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])
rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_beads"]]
common <- merge(old_table, new_table, by = "row.names")
dim(common)
## [1] 4308   52
neg_idx <- common[["Fold change"]] < 0
common[["Fold change"]] <- as.numeric(common[["Fold change"]])
common[neg_idx, "Fold change"] <- log2(1 / (-1 * common[neg_idx, "Fold change"]))
common[!neg_idx, "Fold change"] <- log2(common[!neg_idx, "Fold change"])

cor.test(common[["Fold change"]], common[["limma_logfc"]])
## 
##  Pearson's product-moment correlation
## 
## data:  common[["Fold change"]] and common[["limma_logfc"]]
## t = 93, df = 4306, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8058 0.8258
## sample estimates:
##   cor 
## 0.816
test <- plot_linear_scatter(common[, c("Fold change", "limma_logfc")])
test$scatter

6.3 Compare modified p-value data (after bead effect)

old_table <- readr::read_csv("excel/inline-supplementary-material-5_infLM4_after.csv")
## Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4], 'X5' [5],
## 'X6' [6]
## 
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   `DE genes in L. major-infected human macrophages relative to uninfected controls, 4 hpi, with accounting for phagocytosis` = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character(),
##   X6 = col_character()
## )
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])
rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_beads"]]
common <- merge(old_table, new_table, by = "row.names")
dim(common)
## [1] 2470   53
common[["Fold change beads v inf"]] <- log2(as.numeric(common[["Fold change uninf v inf"]]))
## Warning: NaNs produced
cor.test(common[["Fold change beads v inf"]], common[["limma_logfc"]])
## 
##  Pearson's product-moment correlation
## 
## data:  common[["Fold change beads v inf"]] and common[["limma_logfc"]]
## t = 36, df = 955, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7353 0.7883
## sample estimates:
##    cor 
## 0.7631
test <- plot_linear_scatter(common[, c("Fold change beads v inf", "limma_logfc")])
test$scatter

t1_infuninf <- mbio_tables[["data"]][[1]]
t2_beaduninf <- mbio_tables[["data"]][[2]]
lfc_pvals <- data.frame(
  row.names = rownames(t1_infuninf),
  "l2fc" = t1_infuninf[["limma_logfc"]],
  "infuninf" = t1_infuninf[["limma_adjp"]],
  "beaduninf" = t2_beaduninf[["limma_adjp"]])
rownames(lfc_pvals) <- rownames(t1_infuninf)
lfc_pvals[["worst"]] <- 1.0
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:testthat':
## 
##     matches
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:hpgltools':
## 
##     combine
## The following object is masked from 'package:testthat':
## 
##     matches
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tibble)
lfc_pvals <- lfc_pvals %>%
  rownames_to_column("rownames") %>%
  rowwise() %>%
  mutate(worst = pmax(infuninf, beaduninf))

merged <- merge(old_table, lfc_pvals, by.x = "ID", by.y = "rownames", all.x = TRUE)
merged[["adj.P.Value"]] <- as.numeric(merged[["adj.P.Value"]])
merged[["infuninf"]] <- as.numeric(merged[["infuninf"]])
merged[["beaduninf"]] <- as.numeric(merged[["beaduninf"]])
merged[["worst"]] <- as.numeric(merged[["worst"]])
cor.test(merged[["adj.P.Value"]], merged[["worst"]])
## 
##  Pearson's product-moment correlation
## 
## data:  merged[["adj.P.Value"]] and merged[["worst"]]
## t = 12, df = 2468, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1989 0.2734
## sample estimates:
##    cor 
## 0.2365
cor.test(merged[["adj.P.Value"]], merged[["infuninf"]])
## 
##  Pearson's product-moment correlation
## 
## data:  merged[["adj.P.Value"]] and merged[["infuninf"]]
## t = 9, df = 2468, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1408 0.2171
## sample estimates:
##    cor 
## 0.1792
cor.test(merged[["adj.P.Value"]], merged[["beaduninf"]])
## 
##  Pearson's product-moment correlation
## 
## data:  merged[["adj.P.Value"]] and merged[["beaduninf"]]
## t = 12, df = 2468, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1945 0.2691
## sample estimates:
##    cor 
## 0.2321
## I think this is suggestive that the two pvalue metrics are similar?

7 Repeat figure 5

In order to recreate figure 5, I think all I need to do is generate a set of logFCs for the mouse data used in Figure 5/Table S7 and compare them. If my regenerated logFCs are similar, then I can call it success, as my human logFCs have a correlation coefficient of ~ 0.96.

The only problem with doing this is that it looks to me that I have data from multiple mouse experiments all with the study name ‘lminfectome’. I think therefore, I must figure out what samples I actually want to use and therefore presumably split the ‘lminfectome’ experiment into a couple/few separate experiments.

mm_annot <- load_biomart_annotations(species = "mmusculus")$annotation
## The biomart annotations file already exists, loading from it.
rownames(mm_annot) <- make.names(
  paste0(mm_annot[["ensembl_transcript_id"]], ".",
         mm_annot[["transcript_version"]]),
  unique = TRUE)
mm_tx_gene <- mm_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
mm_tx_gene[["id"]] <- rownames(mm_tx_gene)
mm_tx_gene <- mm_tx_gene[, c("id", "ensembl_gene_id")]
new_mm_annot <- mm_annot
rownames(new_mm_annot) <- make.names(mm_annot[["ensembl_gene_id"]], unique = TRUE)

lots_mm <- create_expt(sample_sheet,
                       file_column = "mmusculusfile",
                       gene_info = new_mm_annot,
                       tx_gene_map = mm_tx_gene)
## Reading the sample metadata.
## Dropped 16 rows from the sample metadata because they were blank.
## The sample definitions comprises: 421 rows(samples) and 59 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count data.
## Warning in create_expt(sample_sheet, file_column = "mmusculusfile", gene_info
## = new_mm_annot, : Some samples were removed when cross referencing the samples
## against the count data.
## Matched 19660 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 19660 rows and 100 columns.
bmc_expt <- subset_expt(lots_mm, subset = "study=='bmc'")
## Using a subset expression.
## There were 100, now there are 24 samples.
bmc_expt <- set_expt_batches(bmc_expt, "studybatch")
metadata <- pData(bmc_expt)
new_condition <- paste0(metadata[["infectstate"]], "_", metadata[["expttime"]])
bmc_expt <- set_expt_conditions(bmc_expt, new_condition)

bmc_norm <- sm(normalize_expt(bmc_expt, transform = "log2", filter = TRUE,
                              convert = "cpm", norm = "quant"))
test <- plot_pca(bmc_norm)
test$plot

bmc_normbatch <- sm(normalize_expt(bmc_expt, transform = "log2", filter = TRUE,
                                   convert = "cpm", norm = "quant", batch = "limma"))
test <- plot_pca(bmc_normbatch)
test$plot

bmc_normbatch <- sm(normalize_expt(bmc_expt, transform = "log2", filter = TRUE,
                                   convert = "cpm", norm = "quant", batch = "svaseq"))
test <- plot_pca(bmc_normbatch)
test$plot

## This plot is similar to the one observed in figure 2B of
## https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2237-2
## YAY!

bmc_pairwise <- all_pairwise(bmc_expt, model_batch = TRUE, force = TRUE)
## Plotting a PCA before surrogate/batch inclusion.
## Not putting labels on the PC plot.
## Using limma's removeBatchEffect to visualize with(out) batch inclusion.
## Not putting labels on the PC plot.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.

bmc_tables <- sm(combine_de_tables(bmc_pairwise))

compare_table <- readr::read_tsv("excel/inline-supplementary-material-7.csv", skip = 1)
## 
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   `Gene Symbol` = col_character(),
##   `Human ID` = col_character(),
##   `logFC in human` = col_double(),
##   `Mouse ID` = col_character(),
##   `logFC in mouse` = col_double()
## )
bmc_table <- bmc_tables$data[["yes_t4h_vs_no_t4h"]]
compare_merged <- merge(compare_table, bmc_table, by.x = "Mouse ID", by.y = "row.names", all.x = TRUE)
cor.test(compare_merged[["logFC in mouse"]], compare_merged[["limma_logfc"]])
## 
##  Pearson's product-moment correlation
## 
## data:  compare_merged[["logFC in mouse"]] and compare_merged[["limma_logfc"]]
## t = 133, df = 1557, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9542 0.9623
## sample estimates:
##    cor 
## 0.9584

8 Big Questions from Najib

In the following blocks, I will attempt to directly address the ‘Big Questions’ posed by Najib in my TODO document. For the moment, it will not be very well organized, as I did some of this stuff before Najib’s questions appeared. Once I finish, I will reorganize it though.

8.1 4 hour human macrophages

Which genes are DE in human macrophages at 4 hours upon infection with L. major?

This question was addressed above.

8.2 4 hour bead effect

This question is mostly addressed above, but needs to be expanded slightly. It is not a very interesting question to me, to be honest; since my degree of agreement with Laura’s previous analyses is very high, I am content to just say: “whatever she said is correct for this question, lets move to something new.”

8.3 Shared with CIDEIM

How many of those are shared with DE genes in CIDEIM’s human macrophages infected with L. panamensis?

test_expt <- subset_expt(lots, subset = "study=='mbio'|lab=='ade'")
## Using a subset expression.
## There were 267, now there are 132 samples.
test_expt <- subset_expt(test_expt, subset = "celltype=='macrophage'")
## Using a subset expression.
## There were 132, now there are 105 samples.
test_expt <- subset_expt(test_expt, subset = "state!='bead'")
## Using a subset expression.
## There were 105, now there are 91 samples.
test_expt <- set_expt_batches(test_expt, "lab")
metadata <- pData(test_expt)
new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])
test_expt <- set_expt_conditions(test_expt, new_condition)
metadata <- pData(test_expt)
new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])

test_expt <- set_expt_conditions(test_expt, new_condition)
test_filt <- sm(normalize_expt(test_expt, filter = TRUE))

test_norm <- sm(normalize_expt(test_expt, norm = "quant", convert = "cpm",
                               transform = "log2", batch = "limma",
                               filter = TRUE))
test_pca <- plot_pca(test_norm, plot_labels = FALSE, cis = FALSE)
## Not putting labels on the PC plot.
test_pca$plot
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

test_norm <- sm(normalize_expt(test_expt, norm = "quant", convert = "cpm",
                               transform = "log2", batch = "svaseq",
                               filter = TRUE))
test_pca <- plot_pca(test_norm, plot_labels = FALSE, cis = FALSE)
## Not putting labels on the PC plot.
test_pca$plot

new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])
test_filt <- set_expt_conditions(test_filt, new_condition)
test_de <- sm(all_pairwise(test_filt, model_batch = "svaseq", force = TRUE))
keepers <- list(
  "lm_t4h" = c("lm_infected_t4h", "uninfected_t4h"),
  "mbio" = c("lp_healing_undefined", "uninfected_undefined"))
test_table <- combine_de_tables(test_de, keepers = keepers)

comparison_df <- merge(test_table[["data"]][["lm_t4h"]],
                       test_table[["data"]][["mbio"]], by = "row.names")
comp_df <- comparison_df[, c("deseq_logfc.x", "deseq_logfc.y")]
rownames(comp_df) <- rownames(comparison_df)
colnames(comp_df) <- c("lm_t4h", "mbio")
compare_scatter <- plot_linear_scatter(comp_df)
compare_scatter$scatter
compare_scatter$correlation
pander::pander(sessionInfo())

R version 4.0.3 (2020-10-10)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: tibble(v.3.0.6), dplyr(v.1.0.4), tidyr(v.1.1.2), ruv(v.0.9.7.1), hpgltools(v.1.0), testthat(v.3.0.2), R6(v.2.5.0), Biobase(v.2.50.0) and BiocGenerics(v.0.36.0)

loaded via a namespace (and not attached): utf8(v.1.1.4), R.utils(v.2.10.1), tidyselect(v.1.1.0), lme4(v.1.1-26), RSQLite(v.2.2.3), AnnotationDbi(v.1.52.0), grid(v.4.0.3), BiocParallel(v.1.24.1), IHW(v.1.18.0), devtools(v.2.3.2), scatterpie(v.0.1.5), munsell(v.0.5.0), preprocessCore(v.1.52.1), codetools(v.0.2-18), statmod(v.1.4.35), withr(v.2.4.1), colorspace(v.2.0-0), GOSemSim(v.2.16.1), highr(v.0.8), knitr(v.1.31), rstudioapi(v.0.13), stats4(v.4.0.3), Vennerable(v.3.1.0.9000), robustbase(v.0.93-7), DOSE(v.3.16.0), labeling(v.0.4.2), MatrixGenerics(v.1.2.1), slam(v.0.1-48), tximport(v.1.18.0), lpsymphony(v.1.18.0), GenomeInfoDbData(v.1.2.4), polyclip(v.1.10-0), bit64(v.4.0.5), farver(v.2.0.3), rprojroot(v.2.0.2), downloader(v.0.4), vctrs(v.0.3.6), generics(v.0.1.0), xfun(v.0.21), BiocFileCache(v.1.14.0), doParallel(v.1.0.16), GenomeInfoDb(v.1.26.2), graphlayouts(v.0.7.1), locfit(v.1.5-9.4), bitops(v.1.0-6), cachem(v.1.0.4), fgsea(v.1.16.0), DelayedArray(v.0.16.1), assertthat(v.0.2.1), scales(v.1.1.1), ggraph(v.2.0.5), enrichplot(v.1.10.2), gtable(v.0.3.0), sva(v.3.38.0), processx(v.3.4.5), tidygraph(v.1.2.0), rlang(v.0.4.10), genefilter(v.1.72.1), splines(v.4.0.3), rtracklayer(v.1.50.0), broom(v.0.7.5), BiocManager(v.1.30.10), yaml(v.2.2.1), reshape2(v.1.4.4), GenomicFeatures(v.1.42.1), backports(v.1.2.1), qvalue(v.2.22.0), RBGL(v.1.66.0), clusterProfiler(v.3.18.1), tools(v.4.0.3), usethis(v.2.0.1), ggplot2(v.3.3.3), ellipsis(v.0.3.1), gplots(v.3.1.1), jquerylib(v.0.1.3), RColorBrewer(v.1.1-2), sessioninfo(v.1.1.1), Rcpp(v.1.0.6), plyr(v.1.8.6), progress(v.1.2.2), zlibbioc(v.1.36.0), purrr(v.0.3.4), RCurl(v.1.98-1.2), ps(v.1.5.0), prettyunits(v.1.1.1), openssl(v.1.4.3), viridis(v.0.5.1), cowplot(v.1.1.1), S4Vectors(v.0.28.1), SummarizedExperiment(v.1.20.0), ggrepel(v.0.9.1), colorRamps(v.2.3), fs(v.1.5.0), variancePartition(v.1.20.0), magrittr(v.2.0.1), data.table(v.1.14.0), DO.db(v.2.9), openxlsx(v.4.2.3), matrixStats(v.0.58.0), pkgload(v.1.2.0), hms(v.1.0.0), evaluate(v.0.14), xtable(v.1.8-4), pbkrtest(v.0.5-0.1), XML(v.3.99-0.5), IRanges(v.2.24.1), gridExtra(v.2.3), compiler(v.4.0.3), biomaRt(v.2.46.3), KernSmooth(v.2.23-18), crayon(v.1.4.1), shadowtext(v.0.0.7), R.oo(v.1.24.0), minqa(v.1.2.4), htmltools(v.0.5.1.1), corpcor(v.1.6.9), mgcv(v.1.8-34), geneplotter(v.1.68.0), DBI(v.1.1.1), tweenr(v.1.0.1), dbplyr(v.2.1.0), MASS(v.7.3-53.1), rappdirs(v.0.3.3), boot(v.1.3-27), readr(v.1.4.0), Matrix(v.1.3-2), cli(v.2.3.1), R.methodsS3(v.1.8.1), igraph(v.1.2.6), GenomicRanges(v.1.42.0), pkgconfig(v.2.0.3), rvcheck(v.0.1.8), GenomicAlignments(v.1.26.0), xml2(v.1.3.2), foreach(v.1.5.1), annotate(v.1.68.0), bslib(v.0.2.4), XVector(v.0.30.0), stringr(v.1.4.0), callr(v.3.5.1), digest(v.0.6.27), graph(v.1.68.0), Biostrings(v.2.58.0), rmarkdown(v.2.7), fastmatch(v.1.1-0), edgeR(v.3.32.1), PROPER(v.1.22.0), curl(v.4.3), Rsamtools(v.2.6.0), gtools(v.3.8.2), nloptr(v.1.2.2.2), lifecycle(v.1.0.0), nlme(v.3.1-152), jsonlite(v.1.7.2), desc(v.1.2.0), viridisLite(v.0.3.0), askpass(v.1.1), limma(v.3.46.0), fansi(v.0.4.2), pillar(v.1.5.0), lattice(v.0.20-41), DEoptimR(v.1.0-8), fastmap(v.1.1.0), httr(v.1.4.2), pkgbuild(v.1.2.0), survival(v.3.2-7), GO.db(v.3.12.1), glue(v.1.4.2), remotes(v.2.2.0), fdrtool(v.1.2.16), zip(v.2.1.1), iterators(v.1.0.13), pander(v.0.6.3), bit(v.4.0.4), ggforce(v.0.3.2), stringi(v.1.5.3), sass(v.0.3.1), blob(v.1.2.1), DESeq2(v.1.30.1), caTools(v.1.18.1) and memoise(v.2.0.0)

message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset f7d3248107373a9f6b0a79b70e90b82371490162
## This is hpgltools commit: Thu Mar 4 15:10:56 2021 -0500: f7d3248107373a9f6b0a79b70e90b82371490162
## message(paste0("Saving to ", savefile))
## tmp <- sm(saveme(filename = savefile))
---
title: "20190124 Recapitulating previous results: which genes are DE in human macrophages at 4hrs upon infection with L. major?"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type = "text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
</style>

```{r options, include = FALSE}
library(hpgltools)
tt <- sm(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(progress = TRUE,
                     verbose = TRUE,
                     width = 120,
                     echo = TRUE)
knitr::opts_chunk$set(error = TRUE,
                      fig.width = 8,
                      fig.height = 8,
                      dpi = 96)
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "20190124"
rundate <- format(Sys.Date(), format = "%Y%m%d")
rmd_file <- paste0("20190124_mbio_human.Rmd")
```

# Gather annotation data

```{r annotations}
hs_annot <- load_biomart_annotations()$annotation
rownames(hs_annot) <- make.names(
  paste0(hs_annot[["ensembl_transcript_id"]], ".",
         hs_annot[["transcript_version"]]),
  unique = TRUE)
hs_tx_gene <- hs_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
hs_tx_gene[["id"]] <- rownames(hs_tx_gene)
hs_tx_gene <- hs_tx_gene[, c("id", "ensembl_gene_id")]
new_hs_annot <- hs_annot
rownames(new_hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique = TRUE)
```

# Create expressionset

Note that as of 20190124, two samples are still missing: hpgl0914 and hpgl0749.
I am rerunning their mapping with salmon now with the assumption that I just
missed them previously.  I noted their status in the 'skipped' column of the
online sample sheet.

Note 20190125: hpgl0749 mapped; but hpgl0914 has some weirdness still.  It looks
like the forward reads have a gzip CRC error, so I used zcat to extract the
available data and will then retrim/remap.

Note: 20190210: All samples mapped.

Another note, I am using the 'state' column, which was missing the field
'la_infected' for the Leishmania amazonensis samples; this resulted in a set of
'NA' conditions.  I therefore added la_infected to the relevant fields to the
sample sheet online and my working sheet.

I also found an error in the time attribution for sample hpgl0461.  The time was
set to undefined, while in the study it was t24h.  That has been changed in both
the online and my copy of the sample sheet.

```{r expressionset}
##sample_sheet <- "sample_sheets/all_leishmania_samples_20190124.xlsx"
sample_sheet <- "sample_sheets/all_leishmania_samples_20190225.xlsx"
lots <- create_expt(sample_sheet,
                    file_column = "hsapiensfile",
                    gene_info = new_hs_annot,
                    tx_gene_map = hs_tx_gene)
```

# Extract the mbio data

Now we have a 267 sample data set, but we only want the samples from the human
portion of the mBio paper, which Najib helpfully defined in the 'study' column
of the sample sheet as 'mBio'.

Thus I will pull those samples from the sample sheet and set the
conditions/batches to what I am assuming are reasonable values.
An important caveat: we need to concatenate the existing columns: 'expt_time'
and 'state' in order to get useful values for the condition.

In addition, I am removing the L. amazonensis samples for the moment.

```{r subset_expt}
mbio_expt <- subset_expt(lots, subset = "study=='mbio'")
mbio_expt <- subset_expt(mbio_expt, subset = "studybatch!='unused'")
##mbio_expt <- subset_expt(mbio_expt, subset = "pathogenspecies!='lamazonensis'")
##mbio_expt <- subset_expt(mbio_expt, subset = "donor!='thp1'")
mbio_expt <- set_expt_batches(mbio_expt, "studybatch")
metadata <- pData(mbio_expt)
new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])
mbio_expt <- set_expt_conditions(mbio_expt, "state")
```

# Perform a few metrics

Now make some plots and see if I get similar ones to those observed in the
paper.

Here is the link with the PCA plots and such:
https://mbio.asm.org/content/7/3/e00027-16/figures-only

Unless I am mistaken, the only things I have to compare against are some fancy
PCA plots in the main paper and a few raw-ish ones in the supplemental.

```{r show_initial_metrics}
libsize <- plot_libsize(mbio_expt)
libsize$plot
```

# Normalize and plot more

## No batch adjust

This first plot makes no attempt to handle the various batch effects in the data.

```{r normalize_replot}
mbio_norm <- sm(normalize_expt(mbio_expt, transform = "log2", convert = "cpm", filter = TRUE, norm = "quant"))
mbio_pca <- plot_pca(mbio_norm, size_column = "expttime", plot_labels = FALSE, cis = NULL,
                     size_order = c("t4h", "t24h", "t48h", "t72h"))
##mbio_pca <- plot_pca(mbio_norm)
mbio_pca$plot
```

## limma batch adjust

In this iteration, we use limma's function to remove batch effect, which I think
is what was used in order to make the figure in the paper.  This is borne out by
the fact that the image generated is nearly identical to the one in the paper.

### Original code

```{r limma_original, eval = FALSE}
filterCounts = function(counts, lib.size = NULL, thresh = 1, minSamples = 2) {
  cpms <- 2^cbcbSEQ::log2CPM(counts, lib.size = lib.size)$y
  keep <- rowSums(cpms > thresh) >= minSamples
  counts <- counts[keep, ]
  counts
}
condition <- pData(mbio_expt)$condition
x <- table(condition)
countsTable <- exprs(mbio_expt)
dim(countsTable)
##[1] 20956    72
counts <- filterCounts(countsTable, thresh = 1, minSamples = min(x))
dim(counts)
##[1] 13801    72
##12666 prior to adding in THP1 batch

##Quantile normalize counts
library(preprocessCore)
countsSubQ <- cbcbSEQ::qNorm(counts)

##Explore data for batch effects
##Transform data to log2 counts per million
x <- cbcbSEQ::log2CPM(countsSubQ)

##Compute principal components
library(corpcor)
s <- cbcbSEQ::makeSVD(x$y)

batch <- pData(mbio_expt)$batch
cbcbSEQ::pcRes(s$v, s$d, condition, batch)
cbcbSEQ::plotPC(s$v, s$d)

```

```{r normalize_limma_replot}
mbio_batch1 <- sm(normalize_expt(t1, transform = "log2", convert = "cpm",
                                 filter = TRUE, norm = "quant", batch = "limma"))
mbio_pca1 <- plot_pca(mbio_batch1,
                      cis = NULL,
                      size_column = "expttime", size_order = c("t4h", "t24h", "t48h", "t72h"))
##mbio_pca <- plot_pca(mbio_norm)
mbio_pca1$plot
```

## svaseq batch adjust

Finally, I employ my favorite method: svaseq().  This squishes the time-based
differences in the data and highlights the differences between the various
infection states.

```{r normalize_svaseq_replot}
mbio_batch2 <- sm(normalize_expt(mbio_expt, transform = "log2", convert = "cpm",
                                 filter = TRUE, batch = "svaseq"))
mbio_pca2 <- plot_pca(mbio_batch2, size_column = "expttime", plot_labels = FALSE,
                      cis = NULL, size_order = c("t4h", "t24h", "t48h", "t72h"))
##mbio_pca <- plot_pca(mbio_norm)
mbio_pca2$plot
```

## Bead effect

I do not recall what methods were used to estimate the 'bead effect' in the
data.  Therefore I am copy/pasting the relevant logs from Laura and will then
try to recapitulate the tasks performed separately.

I think the makeTab() function is what was used to regenerate the p-values for
the bead-adjusted data.

Following the function definition is a representative invocation performed by
Laura. (I copy/pasted from her log with minor formatting changes).

```{r lauras_code, eval = FALSE}
## Quantile normalize counts
countsSubQ <- qNorm(counts)
## Specify model
mod = model.matrix(~0+condition+batch)
## Use voom to transform quantile-normalized count data to log2-counts per million, estimate mean-variance relationship
## and use m-v relationship to computer appropriate observational-level weights
v <- voom(countsSubQ, mod)
## Fit a linear model for each gene using the specified design contained in v
fit <- lmFit(v)

makeTab <- function(contrFit, coef1, coef2, ...) {
  ## Compute test statistic
  stat <- pmin(abs(contrFit$t[, coef1]), abs(contrFit$t[, coef2]))
  ## Compute pvalue for stat
  pval <- pmax(contrFit$p.value[, coef1], contrFit$p.value[, coef2])
  ## Adjust pvalue for multiple testing
  adj.pval <- p.adjust(pval, method = "BH")
  ## Make the toptable
  tab <- topTable(contrFit, coef = coef1, sort.by = "none", ...)
  coef1_name <- colnames(contrFit$coef)[coef1]
  coef2_name <- colnames(contrFit$coef)[coef2]
  new_tab <- data.frame(tab$ID, tab$logFC, contrFit$coef[, coef2], tab$AveExpr,
                        tab$t, contrFit$t[, coef2], stat, pval, adj.pval)
  new_tab <- new_tab[order(-stat), ]

  colnames(new_tab) <- c("ID", paste0("logFC_", coef1_name),
                         paste0("logFC_", coef2_name),
                         "AveExpr",
                         paste0("t_", coef1_name),
                         paste0("t_", coef2_name),
                         "stat",
                         "P.Value",
                         "adj.P.Value")
  new_tab
}

## eBayes finds an F-statistic from the set of t-statistics for that gene
beads24.infLM24.contr.mat <- makeContrasts(uninf_inf=(conditioninfLM24-conditionuninf24),
                                           beads_inf=(conditioninfLM24-conditionbeads24),
                                           levels = v$design)
beads24.infLM24.fit <- contrasts.fit(fit, beads24.infLM24.contr.mat)
beads24.infLM24.eb <- eBayes(beads24.infLM24.fit)

beads24.infLM24.topTab <- makeTab(beads24.infLM24.eb, 1, 2, number = nrow(v$E))
```

As far as the above goes, it mostly makes sense.  My question is, how do we get
the modified logFC values?  Presumably that is later down in the log.

## makeTab2

Looking further down I found the following invocations, which partially but
incompletely answer my question.

```{r lauras_maketab2, eval = FALSE}
## Define makeTab2 function
## construct a DE result table for infection vs. uninfected and beads
## contrFit: the result of eBayes after conrasts.fit
## cellmeansFit: the cell means fit (lmFit(v) above)
## conjContrasts: the 'conjuctive' null test (infection vs. uninf AND infect vs. beads)
## disjContrast: the 'other' test (beads vs. uninf)
makeTab2 <- function(contrFit, cellmeansFit, conjContrasts, disjContrast) {
  ## Get average expression for all relevant terms
  contr_level_counts <- rowSums(contrFit$contrasts[, c(conjContrasts, disjContrast)] != 0)
  ## Define the condition levels involved in the tests
  levels_to_use <- names(contr_level_counts)[contr_level_counts > 0]
  ## Extract the average counts for each, make into table
  ave_expression_mat <- cellmeansFit$coef[, levels_to_use]
  exp_table <- data.frame(ID = rownames(ave_expression_mat))
  exp_table <- cbind(exp_table, as.data.frame(ave_expression_mat))
  names(exp_table)[-1] <- paste(
    "AveExpr", gsub("condition","",levels_to_use),
    sep = ":")
  ## Compute test statistic, adjusted pval, and logFC for conjuctive test
  ## Add to table
  stat <- rowMins(abs(contrFit$t[, conjContrasts]))
  pval <- rowMaxs(contrFit$p.value[, conjContrasts])
  adj.pval <- p.adjust(pval, method = "BH")
  fcs <- as.data.frame(contrFit$coef[, conjContrasts])
  names(fcs) <- paste("logFC", names(fcs), sep = ":")
  conj_pvals <- as.data.frame(apply(contrFit$p.value[, conjContrasts], 2,
                                    p.adjust, method = "BH"))
  names(conj_pvals) <- paste("adj.P.Val", names(conj_pvals), sep = ":")
  conj_table <- data.frame(ID = rownames(contrFit))
  conj_table <- cbind(conj_table, fcs, conj_pvals, stat = stat, adj.P.Value = adj.pval)
  names(conj_table)[seq(2 + 2 * length(conjContrasts), ncol(conj_table))] <- paste(
    c("stat","adj.P.Value"),
    paste(conjContrasts,collapse = ":"),
    sep = ":")
  ## Make the table for the 'other' test
  disj_table <- data.frame(ID = rownames(contrFit),
                           logFC = contrFit$coef[, disjContrast],
                           adj.P.Value = p.adjust(contrFit$p.value[, disjContrast], method = "BH"))
  names(disj_table)[-1] <- paste(c("logFC", "adj.P.Value"), disjContrast, sep = ":")
  ## Combine tables, making sure all tables are in the same order
  stopifnot(all(exp_table$ID == conj_table$ID & exp_table$ID == disj_table$ID))
  out_table <- cbind(exp_table, conj_table[, -1], disj_table[, -1])

  ## order output table by the statistic in the disjunctive test
  o <- order(-stat)
  out_table[o,]
}

infLM4.infLM24.contr.mat <- makeContrasts(uninf_inf=((conditioninfLM24-conditionuninf24)-(conditioninfLM4-conditionuninf4)),
                                          beads_inf=((conditioninfLM24-conditionbeads24)-(conditioninfLM4-conditionbeads4)),
                                          uninf_beads=((conditionbeads24-conditionuninf24)-(conditionbeads4-conditionuninf4)), levels = v$design)
infLM4.infLM24.fit <- contrasts.fit(fit, infLM4.infLM24.contr.mat)
infLM4.infLM24.eb <- eBayes(infLM4.infLM24.fit)

infLM4.infLM24.topTab <- makeTab2(infLM4.infLM24.eb, fit, c("uninf_inf", "beads_inf"),
                                  c("uninf_beads"))
```

I think that is everything performed.  If I understand what I see, then it is
doing the following:

* quantile normalize and filter the count table.
* perform voom with a condition+batch experimental model.
* invoke lmFit on the result.
* set up a set of contrasts which include:
    * uninf_inf = (infected at time y - uninfected at time y) - (infected at time x - uninfected at time x)
    * beads_inf = (infected at time y - beads at time y) - (infected at time x - beads at time x)
    * uninf_beads = (beads at time y - uninfected at time y) - (beads at time x - uninfected at time x)
* Invoke contrasts.fit and eBayes
* Invoke the makeTab(2)() function with the eBayes result as arg1, the lmFit result as arg2, and
  the character list of c("uninf_inf", "beads_inf") as the third arg and finally "uninf_beads"
  as the final argument.  makeTab2() does the following:
    * uses the original fit to extract the average expression values.
    * finds the minimum t statistic and maximum pvalue for each gene from the uninf_inf and beads_inf columns.
    * uses p.adjust on this set of maximized pvalues.
    * pulls the logFC values for each of the uninf_inf and beads_inf columns.
    * does p.adjust on the pvalues of the contrast pvalues.
    * uses cbind on these pieces to make a single table.
    * uses cbind on the expression table, the newly created table (conj) and the disjunct table.
    * orders them according to the t statistic.

I do not see how this set of operations gives us a better picture of the effect
of beads during an infection.  The primary thing I see in it is the modification
of the p-values and the compound contrast of (infy-uninfy)-(infx-uninfx)
It seems to me that this might instead be time for an interaction model?

# Implementing the contrasts from the paper

With the above in mind, it is pretty trivial for me to perform limma/edger with the same contrasts.
I will first invoke my interpretation of the paper contrasts using limma_pairwise() and for the
4 hour data lmajor data.

After rereading the previous implementation, I think I get it.  It was in fact
using two contrasts: infected/uninfected and infected/beads.  It reported the
infected/beads result and then took the least significant of the p-value and t
statistics of the two contrasts, re-adjusted them, and reported these.

## hpgltools method

```{r pairwise, fig.show = "hide"}
keepers <- list(
  "4hpi_uninf" = c("lm_infected_t4h", "uninfected_t4h"),
  "4hpi_beads" = c("lm_infected_t4h", "bead_t4h")
)
mbio_filt <- set_expt_conditions(mbio_expt, new_condition)
mbio_filt <- sm(normalize_expt(mbio_filt, filter = TRUE))
## Something weird happened, my counts are somehow getting cast as non-integers!
## Thus I am invoking force = TRUE until I figure out what is going on.
mbio_pairwise <- sm(all_pairwise(mbio_filt, model_batch = TRUE,
                                 do_ebseq = FALSE, force = TRUE))
excel_file <- glue::glue("excel/{rundate}_mbio_pairwise_tables-v{ver}.xlsx")
mbio_tables <- sm(combine_de_tables(mbio_pairwise, keepers = keepers, excel = excel_file))
```

```{r pairwise_plots}
mbio_pairwise$comp$heat
mbio_tables[["plots"]][["4hpi_uninf"]][["deseq_ma_plots"]][["plot"]]
```

## Compare against a sheet I downloaded from the paper.

I saved the worksheet 'infLM4_before' as inline-supplementary-material-5_infLM4_before.csv
It is 4 hpi / uninfected, which is happily a contrast I performed.

```{r compare_pairwise}
old_table <- readr::read_csv("excel/inline-supplementary-material-5_infLM4_before.csv")
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])

rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_uninf"]]
common <- merge(old_table, new_table, by = "row.names")
dim(common)
common[["Fold change"]] <- as.numeric(common[["Fold change"]])
neg_idx <- common[["Fold change"]] < 0
common[neg_idx, "Fold change"] <- log2(1 / (-1 * common[neg_idx, "Fold change"]))
common[!neg_idx, "Fold change"] <- log2(common[!neg_idx, "Fold change"])

cor.test(common[["Fold change"]], common[["limma_logfc"]])
test <- plot_linear_scatter(common[, c("Fold change", "limma_logfc")])
test$scatter

old_table <- readr::read_csv("excel/inline-supplementary-material-5_infLM4_before.csv")
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])
rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_beads"]]
common <- merge(old_table, new_table, by = "row.names")
dim(common)
neg_idx <- common[["Fold change"]] < 0
common[["Fold change"]] <- as.numeric(common[["Fold change"]])
common[neg_idx, "Fold change"] <- log2(1 / (-1 * common[neg_idx, "Fold change"]))
common[!neg_idx, "Fold change"] <- log2(common[!neg_idx, "Fold change"])

cor.test(common[["Fold change"]], common[["limma_logfc"]])
test <- plot_linear_scatter(common[, c("Fold change", "limma_logfc")])
test$scatter
```

## Compare modified p-value data (after bead effect)

```{r modified}
old_table <- readr::read_csv("excel/inline-supplementary-material-5_infLM4_after.csv")
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])
rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_beads"]]
common <- merge(old_table, new_table, by = "row.names")
dim(common)
common[["Fold change beads v inf"]] <- log2(as.numeric(common[["Fold change uninf v inf"]]))
cor.test(common[["Fold change beads v inf"]], common[["limma_logfc"]])
test <- plot_linear_scatter(common[, c("Fold change beads v inf", "limma_logfc")])
test$scatter

t1_infuninf <- mbio_tables[["data"]][[1]]
t2_beaduninf <- mbio_tables[["data"]][[2]]
lfc_pvals <- data.frame(
  row.names = rownames(t1_infuninf),
  "l2fc" = t1_infuninf[["limma_logfc"]],
  "infuninf" = t1_infuninf[["limma_adjp"]],
  "beaduninf" = t2_beaduninf[["limma_adjp"]])
rownames(lfc_pvals) <- rownames(t1_infuninf)
lfc_pvals[["worst"]] <- 1.0
library(tidyr)
library(dplyr)
library(tibble)
lfc_pvals <- lfc_pvals %>%
  rownames_to_column("rownames") %>%
  rowwise() %>%
  mutate(worst = pmax(infuninf, beaduninf))

merged <- merge(old_table, lfc_pvals, by.x = "ID", by.y = "rownames", all.x = TRUE)
merged[["adj.P.Value"]] <- as.numeric(merged[["adj.P.Value"]])
merged[["infuninf"]] <- as.numeric(merged[["infuninf"]])
merged[["beaduninf"]] <- as.numeric(merged[["beaduninf"]])
merged[["worst"]] <- as.numeric(merged[["worst"]])
cor.test(merged[["adj.P.Value"]], merged[["worst"]])
cor.test(merged[["adj.P.Value"]], merged[["infuninf"]])
cor.test(merged[["adj.P.Value"]], merged[["beaduninf"]])
## I think this is suggestive that the two pvalue metrics are similar?
```

# Repeat figure 5

In order to recreate figure 5, I think all I need to do is generate a set of
logFCs for the mouse data used in Figure 5/Table S7 and compare them.  If my
regenerated logFCs are similar, then I can call it success, as my human logFCs
have a correlation coefficient of ~ 0.96.

The only problem with doing this is that it looks to me that I have data from
multiple mouse experiments all with the study name 'lminfectome'.  I think
therefore, I must figure out what samples I actually want to use and therefore
presumably split the 'lminfectome' experiment into a couple/few separate
experiments.

```{r mouse_data}
mm_annot <- load_biomart_annotations(species = "mmusculus")$annotation
rownames(mm_annot) <- make.names(
  paste0(mm_annot[["ensembl_transcript_id"]], ".",
         mm_annot[["transcript_version"]]),
  unique = TRUE)
mm_tx_gene <- mm_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
mm_tx_gene[["id"]] <- rownames(mm_tx_gene)
mm_tx_gene <- mm_tx_gene[, c("id", "ensembl_gene_id")]
new_mm_annot <- mm_annot
rownames(new_mm_annot) <- make.names(mm_annot[["ensembl_gene_id"]], unique = TRUE)

lots_mm <- create_expt(sample_sheet,
                       file_column = "mmusculusfile",
                       gene_info = new_mm_annot,
                       tx_gene_map = mm_tx_gene)

bmc_expt <- subset_expt(lots_mm, subset = "study=='bmc'")
bmc_expt <- set_expt_batches(bmc_expt, "studybatch")
metadata <- pData(bmc_expt)
new_condition <- paste0(metadata[["infectstate"]], "_", metadata[["expttime"]])
bmc_expt <- set_expt_conditions(bmc_expt, new_condition)

bmc_norm <- sm(normalize_expt(bmc_expt, transform = "log2", filter = TRUE,
                              convert = "cpm", norm = "quant"))
test <- plot_pca(bmc_norm)
test$plot

bmc_normbatch <- sm(normalize_expt(bmc_expt, transform = "log2", filter = TRUE,
                                   convert = "cpm", norm = "quant", batch = "limma"))
test <- plot_pca(bmc_normbatch)
test$plot

bmc_normbatch <- sm(normalize_expt(bmc_expt, transform = "log2", filter = TRUE,
                                   convert = "cpm", norm = "quant", batch = "svaseq"))
test <- plot_pca(bmc_normbatch)
test$plot
## This plot is similar to the one observed in figure 2B of
## https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2237-2
## YAY!

bmc_pairwise <- all_pairwise(bmc_expt, model_batch = TRUE, force = TRUE)
bmc_tables <- sm(combine_de_tables(bmc_pairwise))

compare_table <- readr::read_tsv("excel/inline-supplementary-material-7.csv", skip = 1)
bmc_table <- bmc_tables$data[["yes_t4h_vs_no_t4h"]]
compare_merged <- merge(compare_table, bmc_table, by.x = "Mouse ID", by.y = "row.names", all.x = TRUE)
cor.test(compare_merged[["logFC in mouse"]], compare_merged[["limma_logfc"]])
```

# Big Questions from Najib

In the following blocks, I will attempt to directly address the 'Big Questions'
posed by Najib in my TODO document.  For the moment, it will not be very well
organized, as I did some of this stuff before Najib's questions appeared.  Once
I finish, I will reorganize it though.

## 4 hour human macrophages

Which genes are DE in human macrophages at 4 hours upon infection with L. major?

This question was addressed above.

## 4 hour bead effect

This question is mostly addressed above, but needs to be expanded slightly.  It
is not a very interesting question to me, to be honest; since my degree of
agreement with Laura's previous analyses is very high, I am content to just say:
"whatever she said is correct for this question, lets move to something new."

## Shared with CIDEIM

How many of those are shared with DE genes in CIDEIM's human macrophages
infected with L. panamensis?

```{r cideim_comparison}
test_expt <- subset_expt(lots, subset = "study=='mbio'|lab=='ade'")
test_expt <- subset_expt(test_expt, subset = "celltype=='macrophage'")
test_expt <- subset_expt(test_expt, subset = "state!='bead'")

test_expt <- set_expt_batches(test_expt, "lab")
metadata <- pData(test_expt)
new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])
test_expt <- set_expt_conditions(test_expt, new_condition)
metadata <- pData(test_expt)
new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])

test_expt <- set_expt_conditions(test_expt, new_condition)
```

```{r plot_mbio_cideim}
test_filt <- sm(normalize_expt(test_expt, filter = TRUE))

test_norm <- sm(normalize_expt(test_expt, norm = "quant", convert = "cpm",
                               transform = "log2", batch = "limma",
                               filter = TRUE))
test_pca <- plot_pca(test_norm, plot_labels = FALSE, cis = FALSE)
test_pca$plot

test_norm <- sm(normalize_expt(test_expt, norm = "quant", convert = "cpm",
                               transform = "log2", batch = "svaseq",
                               filter = TRUE))
test_pca <- plot_pca(test_norm, plot_labels = FALSE, cis = FALSE)
test_pca$plot
```

```{r de_mbio_cideim, eval = FALSE}
new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])
test_filt <- set_expt_conditions(test_filt, new_condition)
test_de <- sm(all_pairwise(test_filt, model_batch = "svaseq", force = TRUE))
keepers <- list(
  "lm_t4h" = c("lm_infected_t4h", "uninfected_t4h"),
  "mbio" = c("lp_healing_undefined", "uninfected_undefined"))
test_table <- combine_de_tables(test_de, keepers = keepers)

comparison_df <- merge(test_table[["data"]][["lm_t4h"]],
                       test_table[["data"]][["mbio"]], by = "row.names")
comp_df <- comparison_df[, c("deseq_logfc.x", "deseq_logfc.y")]
rownames(comp_df) <- rownames(comparison_df)
colnames(comp_df) <- c("lm_t4h", "mbio")
compare_scatter <- plot_linear_scatter(comp_df)
compare_scatter$scatter
compare_scatter$correlation
```

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
## message(paste0("Saving to ", savefile))
## tmp <- sm(saveme(filename = savefile))
```
