This document first explores differentially expressed genes in human cells 4 hours after infection, followed by the same question in mouse cells.
I want to perform a series of comparisons among the host cells: human and mouse. Thus I need to collect annotation data for both species and get the set of orthologs between them.
In the following block, I download the human annotations from BioMart. In addition, I take a moment to recreate the transcript IDs as observed in the salmon count tables (yes, I know they are not actually count tables). Finally, I create a table which maps transcripts to genes; this will be used when we generate the expressionset, so that the R package 'tximport' can summarize transcript abundances to gene-level expression.
## The biomart annotations file already exists, loading from it.
rownames(hs_annot) <- make.names(
  paste0(hs_annot[["ensembl_transcript_id"]], ".",
         hs_annot[["transcript_version"]]),
  unique=TRUE)
hs_tx_gene <- hs_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
hs_tx_gene[["id"]] <- rownames(hs_tx_gene)
hs_tx_gene <- hs_tx_gene[, c("id", "ensembl_gene_id")]
new_hs_annot <- hs_annot
rownames(new_hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique=TRUE)
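The mapping logic above can be sketched on a toy annotation table; the column names mirror the BioMart fields used here, while the IDs themselves are made up for illustration.

```r
## Toy annotation table with the same columns pulled from BioMart above.
annot <- data.frame(
  ensembl_transcript_id = c("ENST0001", "ENST0002", "ENST0003"),
  transcript_version = c(1, 2, 1),
  ensembl_gene_id = c("ENSG0001", "ENSG0001", "ENSG0002"),
  stringsAsFactors = FALSE)

## Recreate the versioned transcript IDs as salmon reports them
## (e.g. ENST0001.1) and build the two-column transcript->gene map.
annot[["id"]] <- paste0(annot[["ensembl_transcript_id"]], ".",
                        annot[["transcript_version"]])
tx_gene <- annot[, c("id", "ensembl_gene_id")]

## tximport would then collapse transcript abundances to genes, roughly:
##   txi <- tximport::tximport(files, type = "salmon", tx2gene = tx_gene)
## where 'files' is a named vector of paths to salmon quant.sf outputs.
print(tx_gene)
```

The only requirement on the map is that its first column matches the transcript IDs in the quantification files and its second column names the gene to collapse onto.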
The question is reasonably self-contained. I want to compare the uninfected human samples against any samples which were infected for 4 hours. So let us first pull those samples and then poke at them a bit.
The following block creates an expressionset using all human-quantified samples. As mentioned previously, it uses the table of transcript<->gene mappings and the BioMart annotations.
Given this set of ~440 samples, it then drops those which do not belong to the 4-hour comparison, and resets the condition and batch factors to the 'infection state' metadatum and 'study', respectively.
sample_sheet <- "sample_sheets/leishmania_host_metasheet_20190426.xlsx"
hs_expt <- create_expt(sample_sheet,
                       file_column="hsapiensfile",
                       gene_info=new_hs_annot,
                       tx_gene_map=hs_tx_gene)
## Reading the sample metadata.
## The sample definitions comprises: 437 rows(samples) and 55 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count tables.
## Matched 19629 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 19629 rows and 267 columns.
## There were 267, now there are 247 samples.
## There were 247, now there are 64 samples.
hs_t4h_expt <- set_expt_conditions(hs_t4h_expt, fact="infectstate")
hs_t4h_expt <- set_expt_batches(hs_t4h_expt, fact="study")
table(hs_t4h_expt$conditions)
##
## no stim yes
## 18 35 11
##
## lps-timecourse m-gm-csf mbio
## 8 39 17
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor no has 18 rows.
## The factor stim has 35 rows.
## The factor yes has 11 rows.
## Note: zip::zip() is deprecated, please use zip::zipr() instead
Let us compute some generic metrics of the t4h human expressionset. As per usual, I plot the metrics first on the raw data, followed by the same metrics on log2(svaseq(cpm(quantile(filtered(data))))).
hs_t4h_norm <- normalize_expt(hs_t4h_expt, norm="quant", convert="cpm",
                              transform="log2", filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(quant(cbcb(data)))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Warning in normalize_expt(hs_t4h_expt, norm = "quant", convert = "cpm", :
## Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: cbcb
## Removing 7360 low-count genes (12269 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 3822 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 711279 entries are x>1: 90.6%.
## batch_counts: Before batch/surrogate estimation, 3822 entries are x==0: 0.487%.
## batch_counts: Before batch/surrogate estimation, 70115 entries are 0<x<1: 8.93%.
## The be method chose 12 surrogate variable(s).
## Attempting svaseq estimation with 12 surrogates.
## There are 2034 (0.259%) elements which are < 0 after batch correction.
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues
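For readers unfamiliar with hpgltools, the chain of transformations logged above can be approximated in base R. This is a rough sketch under the assumption that normalize_expt()'s steps correspond to a low-count filter, quantile normalization, cpm conversion, and a log2 transform; the svaseq step requires the sva package and is only indicated in a comment.

```r
## A small random count matrix standing in for the real expressionset.
set.seed(1)
counts <- matrix(rnbinom(60, mu = 100, size = 1), nrow = 10,
                 dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))

## Step 1: drop low-count genes (a crude stand-in for the cbcb filter).
keep <- rowSums(counts) > 10
filtered <- counts[keep, , drop = FALSE]

## Step 2: quantile normalization -- force every sample to share a
## distribution by replacing each value with the mean of its rank.
ranks <- apply(filtered, 2, rank, ties.method = "first")
means <- rowMeans(apply(filtered, 2, sort))
quant <- apply(ranks, 2, function(r) means[r])
rownames(quant) <- rownames(filtered)

## Step 3: counts-per-million; Step 4: log2 with a pseudocount for zeros.
cpm <- sweep(quant, 2, colSums(quant), "/") * 1e6
l2 <- log2(cpm + 1)

## Step 5 (not run here): surrogate variable estimation, roughly
##   sva::svaseq(dat, mod = model.matrix(~condition), mod0 = model.matrix(~1))
```

After quantile normalization every column contains the same set of values, so downstream distribution plots of the samples should overlap almost exactly.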
I perhaps should have removed the stimulated samples sooner, but I was curious to see their effect on the distribution first.
## There were 64, now there are 29 samples.
## There were 29, now there are 26 samples.
hs_t4h_inf_norm <- normalize_expt(hs_t4h_inf, transform="log2", convert="cpm",
                                  filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 7851 low-count genes (11778 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 2698 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 285443 entries are x>1: 93.2%.
## batch_counts: Before batch/surrogate estimation, 2698 entries are x==0: 0.881%.
## batch_counts: Before batch/surrogate estimation, 18087 entries are 0<x<1: 5.91%.
## The be method chose 6 surrogate variable(s).
## Attempting svaseq estimation with 6 surrogates.
## There are 722 (0.236%) elements which are < 0 after batch correction.
keepers <- list("infection" = c("yes", "no"))
hs_t4h_de <- all_pairwise(hs_t4h_inf, model_batch="svaseq", filter=TRUE, force=TRUE)
## batch_counts: Before batch/surrogate estimation, 302042 entries are x>1: 98.6%.
## batch_counts: Before batch/surrogate estimation, 2698 entries are x==0: 0.881%.
## batch_counts: Before batch/surrogate estimation, 172 entries are 0<x<1: 0.0562%.
## The be method chose 5 surrogate variable(s).
## Attempting svaseq estimation with 5 surrogates.
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (11778 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1567 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## Plotting a PCA before surrogates/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (11778 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 2698 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 285443 entries are x>1: 93.2%.
## batch_counts: Before batch/surrogate estimation, 2698 entries are x==0: 0.881%.
## batch_counts: Before batch/surrogate estimation, 18087 entries are 0<x<1: 5.91%.
## The be method chose 6 surrogate variable(s).
## Attempting svaseq estimation with 6 surrogates.
## There are 722 (0.236%) elements which are < 0 after batch correction.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
hs_t4h_table <- combine_de_tables(hs_t4h_de, keepers=keepers,
                                  excel="excel/HsM0Lm4h_de_tables.xlsx")
## Deleting the file excel/HsM0Lm4h_de_tables.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/1: infection which is: yes/no.
## Found table with yes_vs_no
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Adding venn plots for infection.
## Limma expression coefficients for infection; R^2: 0.958; equation: y = 0.985x + 0.022
## Edger expression coefficients for infection; R^2: 0.952; equation: y = 1.05x - 0.454
## DESeq2 expression coefficients for infection; R^2: 0.952; equation: y = 1.05x - 0.474
## Writing summary information.
## Attempting to add the comparison plot to pairwise_summary at row: 23 and column: 1
## Performing save of the workbook.
Most of this follows the same process as was performed for the human: collect the mouse annotation data from BioMart, rebuild the versioned transcript IDs, and create the transcript<->gene map.
## The biomart annotations file already exists, loading from it.
rownames(mm_annot) <- make.names(
  paste0(mm_annot[["ensembl_transcript_id"]], ".",
         mm_annot[["transcript_version"]]),
  unique=TRUE)
mm_tx_gene <- mm_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
mm_tx_gene[["id"]] <- rownames(mm_tx_gene)
mm_tx_gene <- mm_tx_gene[, c("id", "ensembl_gene_id")]
new_mm_annot <- mm_annot
rownames(new_mm_annot) <- make.names(mm_annot[["ensembl_gene_id"]], unique=TRUE)
The question is reasonably self-contained. I want to compare the uninfected mouse samples against any samples which were infected for 4 hours. So let us first pull those samples and then poke at them a bit.
mm_expt <- create_expt(sample_sheet,
                       file_column="mmusculusfile",
                       gene_info=new_mm_annot,
                       tx_gene_map=mm_tx_gene)
## Reading the sample metadata.
## The sample definitions comprises: 437 rows(samples) and 55 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count tables.
## Matched 19660 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 19660 rows and 105 columns.
## There were 105, now there are 41 samples.
##
## no stim yes
## 11 24 6
##
## undefined
## 41
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## Writing the normalized reads.
## Graphing the normalized reads.
## Writing the median reads by factor.
## The factor no has 11 rows.
## The factor stim has 24 rows.
## The factor yes has 6 rows.
mm_t4h_norm <- normalize_expt(mm_t4h_expt, norm="quant", convert="cpm",
                              transform="log2", filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(quant(cbcb(data)))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Warning in normalize_expt(mm_t4h_expt, norm = "quant", convert = "cpm", :
## Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: cbcb
## Removing 9350 low-count genes (10310 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 38 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 390888 entries are x>1: 92.5%.
## batch_counts: Before batch/surrogate estimation, 38 entries are x==0: 0.00899%.
## batch_counts: Before batch/surrogate estimation, 31784 entries are 0<x<1: 7.52%.
## The be method chose 8 surrogate variable(s).
## Attempting svaseq estimation with 8 surrogates.
## There are 807 (0.191%) elements which are < 0 after batch correction.
## There were 41, now there are 17 samples.
mm_t4h_nostim_norm <- sm(normalize_expt(mm_t4h_nostim, filter=TRUE,
                                        norm="quant", convert="cpm",
                                        transform="log2", batch="svaseq"))
mm_t4h_nostim_pca <- plot_pca(mm_t4h_nostim_norm)
mm_t4h_nostim_pca$plot
mm_t4h_inf_norm <- normalize_expt(mm_t4h_expt, transform="log2", convert="cpm",
                                  filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 9350 low-count genes (10310 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 3253 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 385615 entries are x>1: 91.2%.
## batch_counts: Before batch/surrogate estimation, 3253 entries are x==0: 0.770%.
## batch_counts: Before batch/surrogate estimation, 33842 entries are 0<x<1: 8.01%.
## The be method chose 7 surrogate variable(s).
## Attempting svaseq estimation with 7 surrogates.
## There are 1186 (0.281%) elements which are < 0 after batch correction.
## This function will replace the expt$expressionset slot with:
## cbcb(data)
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 9679 low-count genes (9981 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
## batch_counts: Before batch/surrogate estimation, 168728 entries are x>1: 99.4%.
## batch_counts: Before batch/surrogate estimation, 571 entries are x==0: 0.337%.
## batch_counts: Before batch/surrogate estimation, 110 entries are 0<x<1: 0.0648%.
## The be method chose 2 surrogate variable(s).
## Attempting svaseq estimation with 2 surrogates.
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (9981 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 15 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## Plotting a PCA before surrogates/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (9981 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 571 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 163054 entries are x>1: 96.1%.
## batch_counts: Before batch/surrogate estimation, 571 entries are x==0: 0.337%.
## batch_counts: Before batch/surrogate estimation, 6052 entries are 0<x<1: 3.57%.
## The be method chose 4 surrogate variable(s).
## Attempting svaseq estimation with 4 surrogates.
## There are 149 (0.0878%) elements which are < 0 after batch correction.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
mm_t4h_table <- combine_de_tables(mm_t4h_de, keepers=keepers,
                                  excel="excel/MmM0Lm4h_de_tables.xlsx")
## Deleting the file excel/MmM0Lm4h_de_tables.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/1: infection which is: yes/no.
## Found table with yes_vs_no
## 20181210 a pthread error in normalize.quantiles leads me to robust.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Used Bon Ferroni corrected t test(s) between columns.
## Adding venn plots for infection.
## Limma expression coefficients for infection; R^2: 0.915; equation: y = 0.952x + 0.2
## Edger expression coefficients for infection; R^2: 0.908; equation: y = 1.03x - 0.29
## DESeq2 expression coefficients for infection; R^2: 0.908; equation: y = 1.03x - 0.304
## Writing summary information.
## Attempting to add the comparison plot to pairwise_summary at row: 23 and column: 1
## Performing save of the workbook.
Let us see if our human differential expression result is similar to that obtained in Table S2.
I downloaded the supplementary tables from the paper; I believe #5 is the one we want to compare against. The fold change was weirdly encoded in the table: up-regulation was written as a positive fold change and down-regulation as a negative one, which to me is like trying to print out sqrt(-4) without using i.
previous_hs <- readxl::read_excel("excel/inline-supplementary-material-5.xls", sheet=2)
previous_hs_lfc <- previous_hs[, c("ID", "Fold change")]
## The following undoes the signed fold-change encoding
## and puts the values back on the log2 scale.
neg_idx <- previous_hs_lfc[[2]] < 0
previous_hs_lfc[neg_idx, 2] <- -1 * (1 / previous_hs_lfc[neg_idx, 2])
previous_hs_lfc[[2]] <- log2(previous_hs_lfc[[2]])
merged <- merge(previous_hs_lfc, hs_t4h_table$data[[1]], by.x="ID", by.y="row.names")
cor.test(merged[["limma_logfc"]], merged[["Fold change"]])
##
## Pearson's product-moment correlation
##
## data: merged[["limma_logfc"]] and merged[["Fold change"]]
## t = 107, df = 5056, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8243 0.8412
## sample estimates:
## cor
## 0.833
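To make the fold-change re-encoding concrete, here is the same arithmetic on a toy vector: a published fold change of -2 becomes the ratio 0.5, whose log2 (-1) is the mirror image of the +1 obtained from a fold change of +2.

```r
## Signed fold changes as published: negative values mean down-regulation.
fc <- c(4, 2, -2, -4)

## Convert -x to 1/x to recover a true ratio, then take log2 so that
## up- and down-regulation are symmetric around zero.
neg <- fc < 0
fc[neg] <- -1 / fc[neg]  # -2 -> 0.5, -4 -> 0.25
lfc <- log2(fc)          # 2, 1, -1, -2
```

This is exactly the transformation applied to the `Fold change` column above before correlating against the limma log2 fold changes.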
previous_mm <- readxl::read_excel("excel/12864_2015_2237_MOESM3_ESM.xls", sheet=2, skip=1)
previous_mm_lfc <- previous_mm[, c("ID", "Fold change")]
neg_idx <- previous_mm_lfc[[2]] < 0
previous_mm_lfc[neg_idx, 2] <- -1 * (1 / previous_mm_lfc[neg_idx, 2])
previous_mm_lfc[[2]] <- log2(previous_mm_lfc[[2]])
merged <- merge(previous_mm_lfc, mm_t4h_table$data[[1]], by.x="ID", by.y="row.names")
cor.test(merged[["limma_logfc"]], merged[["Fold change"]])
##
## Pearson's product-moment correlation
##
## data: merged[["limma_logfc"]] and merged[["Fold change"]]
## t = 216, df = 5655, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9417 0.9473
## sample estimates:
## cor
## 0.9445
## Used Bon Ferroni corrected t test(s) between columns.
All of the neutrophil data is in mouse, apparently. This will make it more difficult, perhaps impossible, to get an accurate answer.
So instead, let us look at infected vs. uninfected in the mouse macrophages, and then compare against the earliest of Sacks' neutrophil timepoints.
subset <- "(hostcelltype=='PMN'&host=='mus_musculus'&expttime=='t12h') |
  (hostcelltype=='macrophage'&host=='mus_musculus')"
neut_macr_mus <- subset_expt(mm_expt, subset=subset)
## There were 105, now there are 80 samples.
## There were 80, now there are 56 samples.
neut_macr_mus <- set_expt_conditions(neut_macr_mus, fact="infectstate")
neut_macr_mus <- set_expt_batches(neut_macr_mus, fact="hostcelltype")
neut_macr_mus_norm <- normalize_expt(neut_macr_mus, convert="cpm",
                                     norm="quant", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## cpm(quant(cbcb(data)))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (19660 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
neut_macr_mus_normbatch <- normalize_expt(neut_macr_mus, convert="cpm",
                                          norm="quant", filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## svaseq(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Warning in normalize_expt(neut_macr_mus, convert = "cpm", norm = "quant", :
## Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (19660 remaining).
## Step 2: normalizing the data with quant.
## Using normalize.quantiles.robust due to a thread error in preprocessCore.
## Step 3: converting the data with cpm.
## Step 4: not transforming the data.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 515332 entries are x>1: 46.8%.
## batch_counts: Before batch/surrogate estimation, 323375 entries are x==0: 29.4%.
## batch_counts: Before batch/surrogate estimation, 262253 entries are 0<x<1: 23.8%.
## The be method chose 10 surrogate variable(s).
## Attempting svaseq estimation with 10 surrogates.
## There are 30709 (2.79%) elements which are < 0 after batch correction.
neut_macr_mus_filt <- sm(normalize_expt(neut_macr_mus, filter="simple"))
neut_macr_mus_de <- sm(all_pairwise(
  neut_macr_mus_filt, parallel=FALSE,
  force=TRUE, model_batch="svaseq"))
neut_macr_mus_table <- sm(combine_de_tables(
  neut_macr_mus_de, keepers=keepers,
  excel="excel/MmM0Lm4h_vs_MmPMNLm12h_de_tables.xlsx"))
neut_macr_mus_sig <- sm(extract_significant_genes(
  neut_macr_mus_table,
  excel="excel/MmM0Lm4h_vs_MmPMNLm12h_sig_tables.xlsx"))
I think this handles questions a through e?
R version 3.6.0 RC (2019-04-18 r76404)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: edgeR(v.3.24.3), foreach(v.1.4.4), ruv(v.0.9.7), hpgltools(v.1.0), Biobase(v.2.42.0) and BiocGenerics(v.0.28.0)
loaded via a namespace (and not attached): tidyselect(v.0.2.5), lme4(v.1.1-21), RSQLite(v.2.1.1), AnnotationDbi(v.1.44.0), htmlwidgets(v.1.3), grid(v.3.6.0), BiocParallel(v.1.16.6), Rtsne(v.0.15), devtools(v.2.0.1), munsell(v.0.5.0), codetools(v.0.2-16), preprocessCore(v.1.45.0), withr(v.2.1.2), colorspace(v.1.4-1), GOSemSim(v.2.8.0), knitr(v.1.22), rstudioapi(v.0.10), stats4(v.3.6.0), Vennerable(v.3.1.0.9000), robustbase(v.0.93-4), DOSE(v.3.8.2), labeling(v.0.3), urltools(v.1.7.2), tximport(v.1.10.1), GenomeInfoDbData(v.1.2.0), polyclip(v.1.10-0), bit64(v.0.9-7), farver(v.1.1.0), rprojroot(v.1.3-2), xfun(v.0.6), R6(v.2.4.0), doParallel(v.1.0.14), GenomeInfoDb(v.1.18.2), locfit(v.1.5-9.1), bitops(v.1.0-6), fgsea(v.1.8.0), gridGraphics(v.0.3-0), DelayedArray(v.0.8.0), assertthat(v.0.2.1), promises(v.1.0.1), scales(v.1.0.0), nnet(v.7.3-12), ggraph(v.1.0.2), enrichplot(v.1.2.0), gtable(v.0.3.0), sva(v.3.30.1), processx(v.3.3.0), rlang(v.0.3.3), genefilter(v.1.64.0), splines(v.3.6.0), rtracklayer(v.1.42.2), lazyeval(v.0.2.2), acepack(v.1.4.1), checkmate(v.1.9.1), europepmc(v.0.3), yaml(v.2.2.0), reshape2(v.1.4.3), GenomicFeatures(v.1.34.7), crosstalk(v.1.0.0), backports(v.1.1.3), httpuv(v.1.5.0), qvalue(v.2.14.1), Hmisc(v.4.2-0), RBGL(v.1.58.2), clusterProfiler(v.3.10.1), tools(v.3.6.0), usethis(v.1.4.0), ggplotify(v.0.0.3), ggplot2(v.3.1.0), gplots(v.3.0.1.1), RColorBrewer(v.1.1-2), blockmodeling(v.0.3.4), sessioninfo(v.1.1.1), ggridges(v.0.5.1), Rcpp(v.1.0.1), plyr(v.1.8.4), base64enc(v.0.1-3), progress(v.1.2.0), zlibbioc(v.1.28.0), purrr(v.0.3.2), RCurl(v.1.95-4.12), ps(v.1.3.0), prettyunits(v.1.0.2), rpart(v.4.1-13), viridis(v.0.5.1), cowplot(v.0.9.4), S4Vectors(v.0.20.1), cluster(v.2.0.8), SummarizedExperiment(v.1.12.0), ggrepel(v.0.8.0), colorRamps(v.2.3), fs(v.1.2.7), variancePartition(v.1.12.3), magrittr(v.1.5), data.table(v.1.12.0), DO.db(v.2.9), openxlsx(v.4.1.0), triebeard(v.0.3.0), packrat(v.0.5.0), matrixStats(v.0.54.0), pkgload(v.1.0.2), hms(v.0.4.2), 
mime(v.0.6), evaluate(v.0.13), xtable(v.1.8-3), pbkrtest(v.0.4-7), XML(v.3.98-1.19), readxl(v.1.3.1), IRanges(v.2.16.0), gridExtra(v.2.3), testthat(v.2.0.1), compiler(v.3.6.0), biomaRt(v.2.38.0), tibble(v.2.1.1), KernSmooth(v.2.23-15), crayon(v.1.3.4), minqa(v.1.2.4), htmltools(v.0.3.6), mgcv(v.1.8-28), corpcor(v.1.6.9), later(v.0.8.0), snow(v.0.4-3), Formula(v.1.2-3), geneplotter(v.1.60.0), tidyr(v.0.8.3), DBI(v.1.0.0), tweenr(v.1.0.1), MASS(v.7.3-51.3), boot(v.1.3-20), Matrix(v.1.2-17), readr(v.1.3.1), cli(v.1.1.0), quadprog(v.1.5-5), gdata(v.2.18.0), igraph(v.1.2.4), GenomicRanges(v.1.34.0), pkgconfig(v.2.0.2), registry(v.0.5-1), rvcheck(v.0.1.3), GenomicAlignments(v.1.18.1), foreign(v.0.8-71), plotly(v.4.8.0), xml2(v.1.2.0), annotate(v.1.60.1), rngtools(v.1.3.1), pkgmaker(v.0.27), XVector(v.0.22.0), bibtex(v.0.4.2), doRNG(v.1.7.1), EBSeq(v.1.22.1), stringr(v.1.4.0), callr(v.3.2.0), digest(v.0.6.18), graph(v.1.60.0), Biostrings(v.2.50.2), cellranger(v.1.1.0), rmarkdown(v.1.12), fastmatch(v.1.1-0), htmlTable(v.1.13.1), directlabels(v.2018.05.22), curl(v.3.3), shiny(v.1.2.0), Rsamtools(v.1.34.1), gtools(v.3.8.1), nloptr(v.1.2.1), nlme(v.3.1-137), jsonlite(v.1.6), desc(v.1.2.0), viridisLite(v.0.3.0), limma(v.3.38.3), pillar(v.1.3.1), lattice(v.0.20-38), DEoptimR(v.1.0-8), httr(v.1.4.0), pkgbuild(v.1.0.3), survival(v.2.44-1.1), GO.db(v.3.7.0), glue(v.1.3.1), remotes(v.2.0.2), zip(v.2.0.1), UpSetR(v.1.3.3), iterators(v.1.0.10), pander(v.0.6.3), bit(v.1.1-14), ggforce(v.0.2.1), stringi(v.1.4.3), blob(v.1.1.1), DESeq2(v.1.22.2), doSNOW(v.1.0.16), latticeExtra(v.0.6-28), caTools(v.1.17.1.2), memoise(v.1.1.0) and dplyr(v.0.8.0.1)
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 0ebb3165f07d676a83da460824f337251efbcf69
## This is hpgltools commit: Fri Apr 12 15:03:48 2019 -0400: 0ebb3165f07d676a83da460824f337251efbcf69