## The biomart annotations file already exists, loading from it.
rownames(hs_annot) <- make.names(
  paste0(hs_annot[["ensembl_transcript_id"]], ".",
         hs_annot[["transcript_version"]]),
  unique=TRUE)
hs_tx_gene <- hs_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
hs_tx_gene[["id"]] <- rownames(hs_tx_gene)
hs_tx_gene <- hs_tx_gene[, c("id", "ensembl_gene_id")]
new_hs_annot <- hs_annot
rownames(new_hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique=TRUE)
Note that as of 20190124, two samples are still missing: hpgl0914 and hpgl0749. I am rerunning their mapping with salmon now with the assumption that I just missed them previously. I noted their status in the ‘skipped’ column of the online sample sheet.
Note 20190125: hpgl0749 mapped; but hpgl0914 has some weirdness still. It looks like the forward reads have a gzip CRC error, so I used zcat to extract the available data and will then retrim/remap.
Note: 20190210: All samples mapped.
Another note: I am using the ‘state’ column, which was missing the value ‘la_infected’ for the Leishmania amazonensis samples; this resulted in a set of ‘NA’ conditions. I therefore added ‘la_infected’ to the relevant fields in both the online sample sheet and my working copy.
I also found an error in the time attribution for sample hpgl0461: the time was set to undefined, while in the study it was t24h. That has been changed in both the online sample sheet and my copy.
##sample_sheet <- "sample_sheets/all_leishmania_samples_20190124.xlsx"
sample_sheet <- "sample_sheets/all_leishmania_samples_20190225.xlsx"
lots <- create_expt(sample_sheet,
                    file_column="hsapiensfile",
                    gene_info=new_hs_annot,
                    tx_gene_map=hs_tx_gene)
## Reading the sample metadata.
## The sample definitions comprises: 421 rows(samples) and 55 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count tables.
## Matched 19629 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
Now we have a 291-sample data set, but we only want the samples from the human portion of the mBio paper, which Najib helpfully identified in the ‘study’ column of the sample sheet as ‘mBio’.
Thus I will pull those samples from the sample sheet and set the conditions/batches to what I assume are reasonable values. One important caveat: we need to concatenate the existing ‘expt_time’ and ‘state’ columns in order to get useful values for the condition.
In addition, I am removing the L. amazonensis samples for the moment.
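The subset invocations themselves did not survive my editing. Given the ‘study’ column described above, the first step was presumably something like the sketch below; I have not reconstructed the second filter (82 down to 66 samples), though the commented-out lines that follow show likely candidates.
## Hypothetical reconstruction: keep only the samples Najib tagged as 'mBio'.
mbio_expt <- subset_expt(lots, subset="study=='mBio'")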
## There were 267, now there are 82 samples.
## There were 82, now there are 66 samples.
##mbio_expt <- subset_expt(mbio_expt, subset="pathogenspecies!='lamazonensis'")
##mbio_expt <- subset_expt(mbio_expt, subset="donor!='thp1'")
mbio_expt <- set_expt_batches(mbio_expt, "studybatch")
metadata <- pData(mbio_expt)
new_condition <- paste0(metadata[["state"]], "_", metadata[["expttime"]])
mbio_expt <- set_expt_conditions(mbio_expt, "state")
Now make some plots and see if I get similar ones to those observed in the paper.
Here is the link with the PCA plots and such: https://mbio.asm.org/content/7/3/e00027-16/figures-only
Unless I am mistaken, the only things I have to compare against are some fancy PCA plots in the main paper and a few raw-ish ones in the supplemental material.
This first plot makes no attempt to handle the various batch effects in the data.
mbio_norm <- sm(normalize_expt(mbio_expt, transform="log2", convert="cpm", filter=TRUE, norm="quant"))
mbio_pca <- plot_pca(mbio_norm, size_column="expttime", plot_labels=FALSE, cis=NULL,
                     size_order=c("t4h", "t24h", "t48h", "t72h"))
## Not putting labels on the plot.
## Warning: Removed 1 rows containing missing values (geom_point).
## Error in grid.Call.graphics(C_setviewport, vp, TRUE): non-finite location and/or size for viewport
In this iteration, we use limma’s removeBatchEffect() to handle the batch effect, which I think is what was used to make the figure in the paper. This is borne out by the fact that the image generated is nearly identical to the one in the paper.
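As an aside, normalize_expt(..., batch="limma") ultimately wraps limma::removeBatchEffect(). A minimal standalone sketch of the same operation follows; the name ‘log2_cpm’ and the ‘condition’/‘batch’ metadata columns are assumptions on my part, not verbatim from this workflow.
library(limma)
## 'log2_cpm' is assumed to be a normalized log2 expression matrix and
## 'metadata' the pData() data frame from above. The design matrix tells
## removeBatchEffect() which differences to preserve while the batch
## coefficients are estimated and subtracted.
design <- model.matrix(~ condition, data=metadata)
corrected <- removeBatchEffect(log2_cpm, batch=metadata[["batch"]], design=design)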
mbio_batch1 <- normalize_expt(mbio_expt, transform="log2", convert="cpm",
                              filter=TRUE, norm="quant", batch="limma")
## This function will replace the expt$expressionset slot with:
## log2(limma(cpm(quant(cbcb(data)))))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Step 1: performing count filter with option: cbcb
## Removing 7800 low-count genes (11829 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1572 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with limma.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## batch_counts: Before batch correction, 47979 entries 0<=x<1.
## batch_counts: Before batch correction, 1572 entries are >= 0.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 731163 entries are x>1.
## batch_counts: Before batch/surrogate estimation, 1572 entries are x==0.
## batch_counts: Before batch/surrogate estimation, 47979 entries are 0<x<1.
## The be method chose 9 surrogate variable(s).
## batch_counts: Using limma's removeBatchEffect to remove batch effect.
## If you receive a warning: 'NANs produced', one potential reason is that the data was quantile normalized.
## The number of elements which are < 0 after batch correction is: 1815
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
mbio_pca1 <- plot_pca(mbio_batch1, size_column="expttime", plot_labels=FALSE,
                      cis=NULL, size_order=c("t4h", "t24h", "t48h", "t72h"))
## Not putting labels on the plot.
## Warning: Removed 1 rows containing missing values (geom_point).
## Error in grid.Call.graphics(C_setviewport, vp, TRUE): non-finite location and/or size for viewport
Finally, I employ my favorite method: svaseq(). This squishes the time-based differences in the data and highlights the differences between the various infection states.
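Under the hood, this option delegates to sva::svaseq(). A minimal sketch of the direct call, with the object names (‘filtered_counts’, the ‘condition’ column) assumed rather than taken from this workflow:
library(sva)
## 'filtered_counts' should be the low-count-filtered but NOT log-transformed
## count matrix (svaseq applies its own log internally); 'metadata' is the
## pData() data frame from above.
mod <- model.matrix(~ condition, data=metadata)  ## full model: condition of interest
mod0 <- model.matrix(~ 1, data=metadata)         ## null model: intercept only
sva_result <- svaseq(as.matrix(filtered_counts), mod, mod0)
sva_result$n.sv  ## how many surrogate variables were estimated
sva_result$sv    ## the surrogates, ready to append to a downstream model matrix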
mbio_batch2 <- normalize_expt(mbio_expt, transform="log2", convert="cpm",
                              filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 7800 low-count genes (11829 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 5634 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## batch_counts: Before batch correction, 47040 entries 0<=x<1.
## batch_counts: Before batch correction, 5634 entries are >= 0.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 728040 entries are x>1.
## batch_counts: Before batch/surrogate estimation, 5634 entries are x==0.
## batch_counts: Before batch/surrogate estimation, 47040 entries are 0<x<1.
## The be method chose 8 surrogate variable(s).
## Attempting svaseq estimation with 8 surrogates.
## The number of elements which are < 0 after batch correction is: 1120
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
mbio_pca2 <- plot_pca(mbio_batch2, size_column="expttime", plot_labels=FALSE,
                      cis=NULL, size_order=c("t4h", "t24h", "t48h", "t72h"))
## Not putting labels on the plot.
## Warning: Removed 1 rows containing missing values (geom_point).
## Error in grid.Call.graphics(C_setviewport, vp, TRUE): non-finite location and/or size for viewport
I do not recall what methods were used to estimate the ‘bead effect’ in the data. Therefore I am copy/pasting the relevant logs from Laura; I will then separately try to recapitulate the tasks she performed.
I think the makeTab() function is what was used to regenerate the p-values for the bead-adjusted data.
Following the function definition is a representative invocation performed by Laura. (I copy/pasted from her log with minor formatting changes).
## Quantile normalize counts
countsSubQ <- qNorm(counts)
## Specify model
mod <- model.matrix(~0 + condition + batch)
## Use voom to transform quantile-normalized count data to log2-counts per million, estimate the mean-variance relationship,
## and use the m-v relationship to compute appropriate observation-level weights
v <- voom(countsSubQ, mod)
## Fit a linear model for each gene using the specified design contained in v
fit <- lmFit(v)
makeTab <- function(contrFit, coef1, coef2, ...) {
  ## Compute test statistic
  stat <- pmin(abs(contrFit$t[, coef1]), abs(contrFit$t[, coef2]))
  ## Compute pvalue for stat
  pval <- pmax(contrFit$p.value[, coef1], contrFit$p.value[, coef2])
  ## Adjust pvalue for multiple testing
  adj.pval <- p.adjust(pval, method="BH")
  ## Make the toptable
  tab <- topTable(contrFit, coef=coef1, sort.by="none", ...)
  coef1_name <- colnames(contrFit$coef)[coef1]
  coef2_name <- colnames(contrFit$coef)[coef2]
  new_tab <- data.frame(tab$ID, tab$logFC, contrFit$coef[, coef2], tab$AveExpr,
                        tab$t, contrFit$t[, coef2], stat, pval, adj.pval)
  new_tab <- new_tab[order(-stat), ]
  colnames(new_tab) <- c("ID", paste0("logFC_", coef1_name),
                         paste0("logFC_", coef2_name),
                         "AveExpr",
                         paste0("t_", coef1_name),
                         paste0("t_", coef2_name),
                         "stat",
                         "P.Value",
                         "adj.P.Value")
  new_tab
}
## eBayes finds an F-statistic from the set of t-statistics for that gene
beads24.infLM24.contr.mat <- makeContrasts(uninf_inf=(conditioninfLM24-conditionuninf24),
                                           beads_inf=(conditioninfLM24-conditionbeads24),
                                           levels=v$design)
beads24.infLM24.fit <- contrasts.fit(fit, beads24.infLM24.contr.mat)
beads24.infLM24.eb <- eBayes(beads24.infLM24.fit)
beads24.infLM24.topTab <- makeTab(beads24.infLM24.eb, 1, 2, number=nrow(v$E))
As far as the above goes, it mostly makes sense. My question is: how do we get the modified logFC values? Presumably that comes further down in the log.
Looking further down, I found the following invocations, which partially answer my question.
## Define makeTab2 function
## construct a DE result table for infection vs. uninfected and beads
## contrFit: the result of eBayes after contrasts.fit
## cellmeansFit: the cell means fit (lmFit(v) above)
## conjContrasts: the 'conjunctive' null test (infection vs. uninf AND infection vs. beads)
## disjContrast: the 'other' test (beads vs. uninf)
makeTab2 <- function(contrFit, cellmeansFit, conjContrasts, disjContrast) {
  ## Get average expression for all relevant terms
  contr_level_counts <- rowSums(contrFit$contrasts[, c(conjContrasts, disjContrast)] != 0)
  ## Define the condition levels involved in the tests
  levels_to_use <- names(contr_level_counts)[contr_level_counts > 0]
  ## Extract the average counts for each, make into table
  ave_expression_mat <- cellmeansFit$coef[, levels_to_use]
  exp_table <- data.frame(ID=rownames(ave_expression_mat))
  exp_table <- cbind(exp_table, as.data.frame(ave_expression_mat))
  names(exp_table)[-1] <- paste(
    "AveExpr", gsub("condition", "", levels_to_use),
    sep=":")
  ## Compute test statistic, adjusted pval, and logFC for conjunctive test
  ## Add to table
  ## (rowMins/rowMaxs are provided by, e.g., the matrixStats package)
  stat <- rowMins(abs(contrFit$t[, conjContrasts]))
  pval <- rowMaxs(contrFit$p.value[, conjContrasts])
  adj.pval <- p.adjust(pval, method="BH")
  fcs <- as.data.frame(contrFit$coef[, conjContrasts])
  names(fcs) <- paste("logFC", names(fcs), sep=":")
  conj_pvals <- as.data.frame(apply(contrFit$p.value[, conjContrasts], 2,
                                    p.adjust, method="BH"))
  names(conj_pvals) <- paste("adj.P.Val", names(conj_pvals), sep=":")
  conj_table <- data.frame(ID=rownames(contrFit))
  conj_table <- cbind(conj_table, fcs, conj_pvals, stat=stat, adj.P.Value=adj.pval)
  names(conj_table)[seq(2 + 2 * length(conjContrasts), ncol(conj_table))] <- paste(
    c("stat", "adj.P.Value"),
    paste(conjContrasts, collapse=":"),
    sep=":")
  ## Make the table for the 'other' test
  disj_table <- data.frame(ID=rownames(contrFit),
                           logFC=contrFit$coef[, disjContrast],
                           adj.P.Value=p.adjust(contrFit$p.value[, disjContrast], method="BH"))
  names(disj_table)[-1] <- paste(c("logFC", "adj.P.Value"), disjContrast, sep=":")
  ## Combine tables, making sure all tables are in the same order
  stopifnot(all(exp_table$ID == conj_table$ID & exp_table$ID == disj_table$ID))
  out_table <- cbind(exp_table, conj_table[, -1], disj_table[, -1])
  ## Order the output table by the conjunctive test statistic
  o <- order(-stat)
  out_table[o, ]
}
infLM4.infLM24.contr.mat <- makeContrasts(
  uninf_inf=((conditioninfLM24-conditionuninf24)-(conditioninfLM4-conditionuninf4)),
  beads_inf=((conditioninfLM24-conditionbeads24)-(conditioninfLM4-conditionbeads4)),
  uninf_beads=((conditionbeads24-conditionuninf24)-(conditionbeads4-conditionuninf4)),
  levels=v$design)
infLM4.infLM24.fit <- contrasts.fit(fit, infLM4.infLM24.contr.mat)
infLM4.infLM24.eb <- eBayes(infLM4.infLM24.fit)
infLM4.infLM24.topTab <- makeTab2(infLM4.infLM24.eb, fit, c("uninf_inf", "beads_inf"),
                                  c("uninf_beads"))
I think that is everything performed. If I understand what I see, it merges the two ‘conjunctive’ contrasts into a single conservative test and reports the ‘disjunctive’ beads-vs-uninfected contrast alongside it.
I do not see how this set of operations gives us a better picture of the effect of beads during an infection. The primary things I see in it are the modification of the p-values and the compound contrast of (inf_y - uninf_y) - (inf_x - uninf_x). It seems to me that this is the perfect time for an interaction model?
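For the record, here is a minimal sketch of what such an interaction model might look like with limma/voom. It reuses countsSubQ from Laura’s log above, but the ‘condition’ and ‘time’ factors are assumptions on my part.
## 'condition' (uninf/beads/inf) and 'time' (t4h/t24h/...) are assumed to be
## factors aligned with the columns of countsSubQ.
mod_int <- model.matrix(~ condition * time)
v_int <- voom(countsSubQ, mod_int)
fit_int <- eBayes(lmFit(v_int, mod_int))
## A joint F-test over the interaction coefficients asks whether the infection
## effect changes between time points, i.e. the (inf_y-uninf_y)-(inf_x-uninf_x)
## question, without any manual p-value surgery.
topTable(fit_int, coef=grep(":", colnames(mod_int)), number=10)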
With the above in mind, it is pretty trivial for me to perform limma/edgeR with the same contrasts. I will first invoke my interpretation of the paper’s contrasts using limma_pairwise(), starting with the 4 hour L. major data.
After rereading the previous implementation, I think I get it. It was in fact using two contrasts: infected/uninfected and infected/beads. It reported the infected/beads result, then took the less significant of the two contrasts’ p-values and t statistics, re-adjusted the p-values, and reported those as well.
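Restated as a minimal recipe, distilled from the makeTab() definition above (‘eb’ stands in for the eBayes result of the two-contrast fit):
## Conjunctive test: a gene must pass BOTH contrasts, so keep the weaker
## t statistic and the worse (larger) p-value, then re-adjust.
stat <- pmin(abs(eb$t[, "uninf_inf"]), abs(eb$t[, "beads_inf"]))
pval <- pmax(eb$p.value[, "uninf_inf"], eb$p.value[, "beads_inf"])
adj_pval <- p.adjust(pval, method="BH")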
keepers <- list(
  "4hpi_uninf" = c("lm_infected_t4h", "uninfected_t4h"),
  "4hpi_beads" = c("lm_infected_t4h", "bead_t4h"))
mbio_filt <- set_expt_conditions(mbio_expt, new_condition)
mbio_filt <- normalize_expt(mbio_filt, filter=TRUE)
## This function will replace the expt$expressionset slot with:
## cbcb(data)
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 7800 low-count genes (11829 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
## Something weird happened, my counts are somehow getting cast as non-integers!
## Thus I am invoking force=TRUE until I figure out what is going on.
mbio_pairwise <- all_pairwise(mbio_filt, model_batch=TRUE,
                              do_ebseq=FALSE, force=TRUE)
## Plotting a PCA before surrogates/batch inclusion.
## Using limma's removeBatchEffect to visualize with(out) batch inclusion.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
excel_file <- glue::glue("excel/{rundate}_mbio_pairwise_tables-v{ver}.xlsx")
mbio_tables <- sm(combine_de_tables(mbio_pairwise, keepers=keepers, excel=excel_file))
I saved the worksheet ‘infLM4_before’ as inline-supplementary-material-5_infLM4_before.csv. It is the 4 hpi / uninfected comparison, which is happily a contrast I performed.
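The read invocation did not survive my copy/paste. Judging from the parser messages below, it was presumably a readr call along these lines (a reconstruction, not the verbatim original):
## Hypothetical reconstruction of the hidden read; readr emits the
## 'Parsed with column specification' block shown below.
old_table <- readr::read_csv("inline-supplementary-material-5_infLM4_before.csv")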
## Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4],
## 'X5' [5]
## Parsed with column specification:
## cols(
## `DE genes in L. major-infected human macrophages relative to uninfected controls, 4 hpi, not accounting for phagocytosis` = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character(),
## X5 = col_character()
## )
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])
rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_uninf"]]
common <- merge(old_table, new_table, by="row.names")
dim(common)
## [1] 5119 52
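The correlation test that produced the output below was presumably the following (with the ‘Fold change’ column having been coerced to numeric in a step that did not survive):
## Compare the paper's fold changes against my limma logFCs for the shared genes.
cor.test(common[["Fold change"]], common[["limma_logfc"]])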
## Warning: NaNs produced
##
## Pearson's product-moment correlation
##
## data: common[["Fold change"]] and common[["limma_logfc"]]
## t = 160, df = 2000, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9612 0.9674
## sample estimates:
## cor
## 0.9644
## Used Bon Ferroni corrected t test(s) between columns.
## Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4],
## 'X5' [5]
## Parsed with column specification:
## cols(
## `DE genes in L. major-infected human macrophages relative to uninfected controls, 4 hpi, not accounting for phagocytosis` = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character(),
## X5 = col_character()
## )
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])
rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_beads"]]
common <- merge(old_table, new_table, by="row.names")
dim(common)
## [1] 5119 52
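Again, the call producing the output below was presumably:
## Same comparison, this time against the beads-vs-infected fold changes.
cor.test(common[["Fold change beads v inf"]], common[["limma_logfc"]])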
## Warning: NaNs produced
##
## Pearson's product-moment correlation
##
## data: common[["Fold change beads v inf"]] and common[["limma_logfc"]]
## t = 120, df = 2000, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9283 0.9395
## sample estimates:
## cor
## 0.9341
## Used Bon Ferroni corrected t test(s) between columns.
## Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4],
## 'X5' [5], 'X6' [6]
## Parsed with column specification:
## cols(
## `DE genes in L. major-infected human macrophages relative to uninfected controls, 4 hpi, with accounting for phagocytosis` = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character(),
## X5 = col_character(),
## X6 = col_character()
## )
colnames(old_table) <- old_table[1, ]
old_table <- as.data.frame(old_table[-1, ])
rownames(old_table) <- old_table[["ID"]]
new_table <- mbio_tables[["data"]][["4hpi_beads"]]
common <- merge(old_table, new_table, by="row.names")
dim(common)
## [1] 2956 53
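And once more, presumably, for the phagocytosis-adjusted table:
cor.test(common[["Fold change beads v inf"]], common[["limma_logfc"]])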
## Warning: NaNs produced
##
## Pearson's product-moment correlation
##
## data: common[["Fold change beads v inf"]] and common[["limma_logfc"]]
## t = 87, df = 1200, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9209 0.9365
## sample estimates:
## cor
## 0.9291
## Used Bon Ferroni corrected t test(s) between columns.
t1_infuninf <- mbio_tables[["data"]][[1]]
t2_beaduninf <- mbio_tables[["data"]][[2]]
lfc_pvals <- data.frame(
  row.names=rownames(t1_infuninf),
  "l2fc" = t1_infuninf[["limma_logfc"]],
  "infuninf" = t1_infuninf[["limma_adjp"]],
  "beaduninf" = t2_beaduninf[["limma_adjp"]])
rownames(lfc_pvals) <- rownames(t1_infuninf)
lfc_pvals[["worst"]] <- 1.0
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:hpgltools':
##
## combine
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tibble)
lfc_pvals <- lfc_pvals %>%
  rownames_to_column("rownames") %>%
  rowwise() %>%
  mutate(worst=pmax(infuninf, beaduninf))
merged <- merge(old_table, lfc_pvals, by.x="row.names", by.y="rownames", all.x=TRUE)
cor.test(as.numeric(merged[["adj.P.Val"]]), as.numeric(merged[["worst"]]))
## Error in cor.test.default(as.numeric(merged[["adj.P.Val"]]), as.numeric(merged[["worst"]])): 'x' and 'y' must have the same length
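The code lines for the next two tests were lost, but the deparsed error messages below preserve the calls exactly:
cor.test(as.numeric(merged[["adj.P.Val"]]), as.numeric(merged[["infuninf"]]))
cor.test(as.numeric(merged[["adj.P.Val"]]), as.numeric(merged[["beaduninf"]]))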
## Error in cor.test.default(as.numeric(merged[["adj.P.Val"]]), as.numeric(merged[["infuninf"]])): 'x' and 'y' must have the same length
## Error in cor.test.default(as.numeric(merged[["adj.P.Val"]]), as.numeric(merged[["beaduninf"]])): 'x' and 'y' must have the same length
## I think this is suggestive that the two pvalue metrics are similar?
## Just to repeat our previous check that the fold changes are maintained.
cor.test(log2(as.numeric(merged[["Fold change"]])), as.numeric(merged[["l2fc"]]))
## Error in cor.test.default(log2(as.numeric(merged[["Fold change"]])), as.numeric(merged[["l2fc"]])): 'x' and 'y' must have the same length
In order to recreate Figure 5, I think all I need to do is generate a set of logFCs for the mouse data used in Figure 5/Table S7 and compare them. If my regenerated logFCs are similar, then I can call it a success, as my human logFCs have a correlation coefficient of ~0.96.
The only problem is that I appear to have data from multiple mouse experiments, all with the study name ‘lminfectome’. I therefore need to figure out which samples I actually want to use, presumably by splitting the ‘lminfectome’ experiment into a couple/few separate experiments.
## The biomart annotations file already exists, loading from it.
rownames(mm_annot) <- make.names(
  paste0(mm_annot[["ensembl_transcript_id"]], ".",
         mm_annot[["transcript_version"]]),
  unique=TRUE)
mm_tx_gene <- mm_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
mm_tx_gene[["id"]] <- rownames(mm_tx_gene)
mm_tx_gene <- mm_tx_gene[, c("id", "ensembl_gene_id")]
new_mm_annot <- mm_annot
rownames(new_mm_annot) <- make.names(mm_annot[["ensembl_gene_id"]], unique=TRUE)
lots_mm <- create_expt(sample_sheet,
                       file_column="mmusculusfile",
                       gene_info=new_mm_annot,
                       tx_gene_map=mm_tx_gene)
## Reading the sample metadata.
## The sample definitions comprises: 421 rows(samples) and 55 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count tables.
## Matched 19660 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
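The subset that produced bmc_expt (used below) did not survive either; it was presumably something like the following, with the subset string entirely hypothetical:
## Hypothetical: pull the BMC Genomics mouse samples (100 down to 24) out of
## the full mouse data set; the real filter may use different columns/values.
bmc_expt <- subset_expt(lots_mm, subset="study=='bmc'")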
## There were 100, now there are 24 samples.
bmc_expt <- set_expt_batches(bmc_expt, "studybatch")
metadata <- pData(bmc_expt)
new_condition <- paste0(metadata[["infectstate"]], "_", metadata[["expttime"]])
bmc_expt <- set_expt_conditions(bmc_expt, new_condition)
bmc_norm <- normalize_expt(bmc_expt, transform="log2", filter=TRUE,
                           convert="cpm", norm="quant")
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 9148 low-count genes (10512 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 130 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
bmc_normbatch <- normalize_expt(bmc_expt, transform="log2", filter=TRUE,
                                convert="cpm", norm="quant", batch="limma")
## This function will replace the expt$expressionset slot with:
## log2(limma(cpm(quant(cbcb(data)))))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Step 1: performing count filter with option: cbcb
## Removing 9148 low-count genes (10512 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 130 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with limma.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## batch_counts: Before batch correction, 10663 entries 0<=x<1.
## batch_counts: Before batch correction, 130 entries are >= 0.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 241495 entries are x>1.
## batch_counts: Before batch/surrogate estimation, 130 entries are x==0.
## batch_counts: Before batch/surrogate estimation, 10663 entries are 0<x<1.
## The be method chose 4 surrogate variable(s).
## batch_counts: Using limma's removeBatchEffect to remove batch effect.
## If you receive a warning: 'NANs produced', one potential reason is that the data was quantile normalized.
## The number of elements which are < 0 after batch correction is: 171
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
## This plot is shockingly similar to the one observed in figure 2B of
## https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2237-2
## YAY!
bmc_pairwise <- all_pairwise(bmc_expt, model_batch=TRUE, force=TRUE)
## Plotting a PCA before surrogates/batch inclusion.
## Using limma's removeBatchEffect to visualize with(out) batch inclusion.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
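The ‘Writing a legend’ / ‘Working on table’ messages below come from combine_de_tables(); the invocation was presumably analogous to the human one above (the excel filename is my guess, not the original):
## Hypothetical reconstruction, mirroring the earlier combine_de_tables() call.
bmc_excel <- glue::glue("excel/{rundate}_bmc_pairwise_tables-v{ver}.xlsx")
bmc_tables <- combine_de_tables(bmc_pairwise, excel=bmc_excel)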
## Writing a legend of columns.
## Working on table 1/28: no_t48h_vs_no_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 2/28: no_t4h_vs_no_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 3/28: no_t72h_vs_no_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 4/28: yes_t24h_vs_no_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 5/28: yes_t48h_vs_no_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 6/28: yes_t4h_vs_no_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 7/28: yes_t72h_vs_no_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 8/28: no_t4h_vs_no_t48h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 9/28: no_t72h_vs_no_t48h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 10/28: yes_t24h_vs_no_t48h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 11/28: yes_t48h_vs_no_t48h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 12/28: yes_t4h_vs_no_t48h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 13/28: yes_t72h_vs_no_t48h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 14/28: no_t72h_vs_no_t4h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 15/28: yes_t24h_vs_no_t4h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 16/28: yes_t48h_vs_no_t4h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 17/28: yes_t4h_vs_no_t4h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 18/28: yes_t72h_vs_no_t4h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 19/28: yes_t24h_vs_no_t72h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 20/28: yes_t48h_vs_no_t72h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 21/28: yes_t4h_vs_no_t72h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 22/28: yes_t72h_vs_no_t72h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 23/28: yes_t48h_vs_yes_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 24/28: yes_t4h_vs_yes_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 25/28: yes_t72h_vs_yes_t24h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 26/28: yes_t4h_vs_yes_t48h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 27/28: yes_t72h_vs_yes_t48h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Working on table 28/28: yes_t72h_vs_yes_t4h
## The ebseq table is null.
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Parsed with column specification:
## cols(
## `Gene Symbol` = col_character(),
## `Human ID` = col_character(),
## `logFC in human` = col_double(),
## `Mouse ID` = col_character(),
## `logFC in mouse` = col_double()
## )
bmc_table <- bmc_tables$data[["yes_t4h_vs_no_t4h"]]
compare_merged <- merge(compare_table, bmc_table, by.x="Mouse ID", by.y="row.names", all.x=TRUE)
cor.test(compare_merged[["logFC in mouse"]], compare_merged[["limma_logfc"]])
##
## Pearson's product-moment correlation
##
## data: compare_merged[["logFC in mouse"]] and compare_merged[["limma_logfc"]]
## t = 130, df = 1600, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9542 0.9623
## sample estimates:
## cor
## 0.9584
In the following blocks, I will attempt to directly address the ‘Big Questions’ posed by Najib in my TODO document. For the moment this will not be very well organized, as I did some of this work before Najib’s questions appeared; once I finish, I will reorganize it.
Which genes are DE in human macrophages at 4 hours upon infection with L. major?
This question was addressed above.
This question is mostly addressed above, but needs to be expanded slightly. It is not a very interesting question to me, to be honest; since my degree of agreement with Laura’s previous analyses is very high, I am content to say: “whatever she said is correct for this question, let’s move on to something new.”