This document will first explore differentially expressed genes in humans 4 hours after infection, followed by the same question in mice.
I want to perform a series of comparisons among the host cells: human and mouse. Thus I need to collect annotation data for both species and get the set of orthologs between them.
In the following block, I download the human annotations from biomart. In addition, I take a moment to recreate the transcript IDs as observed in the salmon count tables (yes, I know they are not actually count tables). Finally, I create a table which maps transcripts to genes; this will be used when we generate the expressionset so that we get gene expression levels from transcripts via the R package 'tximport'.
hs_annot <- load_biomart_annotations()$annotation
## The biomart annotations file already exists, loading from it.
rownames(hs_annot) <- make.names(
  paste0(hs_annot[["ensembl_transcript_id"]], ".",
         hs_annot[["transcript_version"]]),
  unique=TRUE)
hs_tx_gene <- hs_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
hs_tx_gene[["id"]] <- rownames(hs_tx_gene)
hs_tx_gene <- hs_tx_gene[, c("id", "ensembl_gene_id")]
new_hs_annot <- hs_annot
rownames(new_hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique=TRUE)
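As a quick aside, the reason for the make.names(..., unique=TRUE) calls above is that pasted identifiers can collide, and rownames must be unique. A self-contained toy illustration (the IDs here are invented, not real Ensembl identifiers):

```r
## Hypothetical transcript IDs: the first two collide once versioned.
ids <- c("ENST0001", "ENST0001", "ENST0002")
vers <- c(1, 1, 3)
## make.names() with unique=TRUE disambiguates the duplicate by appending
## a suffix, so the result is safe to use as rownames.
nm <- make.names(paste0(ids, ".", vers), unique=TRUE)
print(nm)  # "ENST0001.1" "ENST0001.1.1" "ENST0002.3"
```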
The question is reasonably self-contained. I want to compare the uninfected human samples against any samples which were infected for 4 hours. So let us first pull those samples and then poke at them a bit.
The following block creates an expressionset using all human-quantified samples. As mentioned previously, it uses the table of transcript<->gene mappings, and the biomart annotations.
Given this set of ~440 samples, it then drops the samples marked as skipped, keeps only the 4-hour timepoint, and resets the condition and batch factors to the 'infection state' metadatum and 'study', respectively.
sample_sheet <- "sample_sheets/leishmania_host_metasheet_20190426.xlsx"
hs_expt <- create_expt(sample_sheet,
                       savefile="Hs_M0Lm4h.rda",
                       file_column="hsapiensfile",
                       gene_info=new_hs_annot,
                       tx_gene_map=hs_tx_gene)
## Reading the sample metadata.
## Dropped 2 rows from the sample metadata because they were blank.
## The sample definitions comprises: 437 rows(samples) and 55 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count data.
## Warning in create_expt(sample_sheet, savefile = "Hs_M0Lm4h.rda", file_column =
## "hsapiensfile", : Some samples were removed when cross referencing the samples
## against the count data.
## Matched 16933 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 16933 rows and 267 columns.
hs_expt_noskipped <- subset_expt(hs_expt, subset="skipped!='yes'")
## Using a subset expression.
## There were 267, now there are 247 samples.
hs_t4h_expt <- subset_expt(hs_expt_noskipped, subset="expttime=='t4h'")
## Using a subset expression.
## There were 247, now there are 64 samples.
hs_t4h_expt <- set_expt_conditions(hs_t4h_expt, fact="infectstate")
hs_t4h_expt <- set_expt_batches(hs_t4h_expt, fact="study")
table(hs_t4h_expt$conditions)
##
## no stim yes
## 18 35 11
table(hs_t4h_expt$batches)
##
## lps-timecourse m-gm-csf mbio
## 8 39 17
hs_written <- write_expt(hs_t4h_expt, excel="excel/HsM0Lm4h_expt.xlsx")
## Deleting the file excel/HsM0Lm4h_expt.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## Warning: ggrepel: 27 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: ggrepel: 35 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 53 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 41 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 54 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Attempting mixed linear model with: ~ condition + batch
## Fitting the expressionset to the model, this is slow.
## Loading required package: Matrix
##
## Total:101 s
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## Warning: ggrepel: 30 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: ggrepel: 37 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 33 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Attempting mixed linear model with: ~ condition + batch
## Fitting the expressionset to the model, this is slow.
##
## Total:73 s
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.
Let us compute some generic metrics of the t4h human expressionset. As usual, I plot the metrics of the raw data first, followed by the same metrics of log2(quantile(cpm(sva(filtered(data))))).
hs_t4h_plots <- sm(graph_metrics(hs_t4h_expt))
hs_t4h_norm <- normalize_expt(hs_t4h_expt, norm="quant", convert="cpm",
                              transform="log2", filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(quant(cbcb(data)))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Warning in normalize_expt(hs_t4h_expt, norm = "quant", convert = "cpm", :
## Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: cbcb
## Removing 6178 low-count genes (10755 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## The method is: svaseq.
## Step 4: doing batch correction with svaseq.
## Using the current state of normalization.
## Passing the data to all_adjusters using the svaseq estimate type.
## batch_counts: Before batch/surrogate estimation, 567379 entries are x>1: 82%.
## batch_counts: Before batch/surrogate estimation, 34944 entries are x==0: 5%.
## batch_counts: Before batch/surrogate estimation, 85997 entries are 0<x<1: 12%.
## The be method chose 12 surrogate variables.
## Attempting svaseq estimation with 12 surrogates.
## There are 10791 (2%) elements which are < 0 after batch correction.
## Setting low elements to zero.
## Step 4: transforming the data with log2.
## transform_counts: Found 10791 values equal to 0, adding 1 to the matrix.
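For orientation, the cpm and log2 steps of that chain amount to scaling each column to a per-million library size and then compressing the dynamic range with a pseudocount. A base-R sketch on a made-up count matrix (the quantile and svaseq steps are omitted here, and the numbers are invented):

```r
## Toy counts: 3 genes x 2 samples (made-up values).
counts <- matrix(c(10, 0, 90, 20, 30, 50), nrow=3,
                 dimnames=list(c("g1", "g2", "g3"), c("s1", "s2")))
## Counts-per-million: divide each column by its library size, scale up.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6
## log2 transform with a pseudocount of 1, as normalize_expt reports doing.
lcpm <- log2(cpm + 1)
```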
hs_t4h_norm_plots <- sm(graph_metrics(hs_t4h_norm))
hs_t4h_plots$legend
hs_t4h_plots$libsize
hs_t4h_plots$boxplot
hs_t4h_norm_plots$pc_plot
## Warning: ggrepel: 23 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
interactive <- plotly_pca(hs_t4h_norm)
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues
interactive$plot
## Warning: ggrepel: 23 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
interactive$plotly
I perhaps should have removed the stimulated samples sooner, but I was curious to see their effect on the distribution first.
hs_t4h_inf <- subset_expt(hs_t4h_expt, subset="condition!='stim'")
## Using a subset expression.
## There were 64, now there are 29 samples.
hs_t4h_inf <- subset_expt(hs_t4h_inf, subset="batch!='lps-timecourse'")
## Using a subset expression.
## There were 29, now there are 26 samples.
hs_t4h_inf_norm <- normalize_expt(hs_t4h_inf, transform="log2", convert="cpm",
                                  filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 6868 low-count genes (10065 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## The method is: svaseq.
## Step 4: doing batch correction with svaseq.
## Using the current state of normalization.
## Passing the data to all_adjusters using the svaseq estimate type.
## batch_counts: Before batch/surrogate estimation, 228379 entries are x>1: 87%.
## batch_counts: Before batch/surrogate estimation, 10707 entries are x==0: 4%.
## batch_counts: Before batch/surrogate estimation, 22604 entries are 0<x<1: 9%.
## The be method chose 6 surrogate variables.
## Attempting svaseq estimation with 6 surrogates.
## There are 2743 (1%) elements which are < 0 after batch correction.
## Setting low elements to zero.
## Step 4: transforming the data with log2.
## transform_counts: Found 2743 values equal to 0, adding 1 to the matrix.
hs_t4h_pca <- plot_pca(hs_t4h_inf_norm, plot_title="H. sapiens, L. major, t4h")
hs_t4h_pca$plot
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
keepers <- list("infection" = c("yes", "no"))
hs_t4h_de <- all_pairwise(hs_t4h_inf, model_batch="svaseq", filter=TRUE, force=TRUE)
## batch_counts: Before batch/surrogate estimation, 247395 entries are x>1: 95%.
## batch_counts: Before batch/surrogate estimation, 10707 entries are x==0: 4%.
## batch_counts: Before batch/surrogate estimation, 2154 entries are 0<x<1: 1%.
## The be method chose 5 surrogate variables.
## Attempting svaseq estimation with 5 surrogates.
## Plotting a PCA before surrogate/batch inclusion.
## Not putting labels on the PC plot.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (10065 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## The method is: svaseq.
## Step 4: doing batch correction with svaseq.
## Using the current state of normalization.
## Passing the data to all_adjusters using the svaseq estimate type.
## batch_counts: Before batch/surrogate estimation, 228379 entries are x>1: 87%.
## batch_counts: Before batch/surrogate estimation, 10707 entries are x==0: 4%.
## batch_counts: Before batch/surrogate estimation, 22604 entries are 0<x<1: 9%.
## The be method chose 6 surrogate variables.
## Attempting svaseq estimation with 6 surrogates.
## There are 2743 (1%) elements which are < 0 after batch correction.
## Setting low elements to zero.
## Step 4: transforming the data with log2.
## transform_counts: Found 2743 values equal to 0, adding 1 to the matrix.
## Not putting labels on the PC plot.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
hs_t4h_table <- combine_de_tables(hs_t4h_de, keepers=keepers,
                                  excel="excel/HsM0Lm4h_de_tables.xlsx")
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/1: infection which is: yes/no.
## Found table with yes_vs_no
## Adding venn plots for infection.
## Limma expression coefficients for infection; R^2: 0.97; equation: y = 0.974x + 0.123
## Deseq expression coefficients for infection; R^2: 0.966; equation: y = 0.95x + 0.349
## Edger expression coefficients for infection; R^2: 0.966; equation: y = 0.951x + 0.461
## Writing summary information, compare_plot is: TRUE.
## Performing save of excel/HsM0Lm4h_de_tables.xlsx.
hs_t4h_sig <- sm(extract_significant_genes(hs_t4h_table, excel="excel/HsM0Lm4h_sig_tables.xlsx"))
Most of this follows the same process as was performed for the human data.
mm_annot <- load_biomart_annotations(species="mmusculus")$annotation
## The biomart annotations file already exists, loading from it.
rownames(mm_annot) <- make.names(
  paste0(mm_annot[["ensembl_transcript_id"]], ".",
         mm_annot[["transcript_version"]]),
  unique=TRUE)
mm_tx_gene <- mm_annot[, c("ensembl_gene_id", "ensembl_transcript_id")]
mm_tx_gene[["id"]] <- rownames(mm_tx_gene)
mm_tx_gene <- mm_tx_gene[, c("id", "ensembl_gene_id")]
new_mm_annot <- mm_annot
rownames(new_mm_annot) <- make.names(mm_annot[["ensembl_gene_id"]], unique=TRUE)
The question is reasonably self-contained. I want to compare the uninfected mouse samples against any samples which were infected for 4 hours. So let us first pull those samples and then poke at them a bit.
mm_expt <- create_expt(sample_sheet,
                       file_column="mmusculusfile",
                       gene_info=new_mm_annot,
                       tx_gene_map=mm_tx_gene)
## Reading the sample metadata.
## Dropped 2 rows from the sample metadata because they were blank.
## The sample definitions comprises: 437 rows(samples) and 55 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count data.
## Warning in create_expt(sample_sheet, file_column = "mmusculusfile", gene_info
## = new_mm_annot, : Some samples were removed when cross referencing the samples
## against the count data.
## Matched 19660 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 19660 rows and 105 columns.
mm_t4h_expt <- subset_expt(mm_expt, subset="expttime=='t4h'")
## Using a subset expression.
## There were 105, now there are 41 samples.
mm_t4h_expt <- set_expt_conditions(mm_t4h_expt, fact="infectstate")
table(mm_t4h_expt$conditions)
##
## no stim yes
## 11 24 6
table(mm_t4h_expt$batches)
##
## undefined
## 41
mm_written <- write_expt(mm_t4h_expt, excel="excel/MmM0Lm4h_expt.xlsx")
## Deleting the file excel/MmM0Lm4h_expt.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## Warning: ggrepel: 26 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: ggrepel: 25 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 23 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ condition
## Fitting the expressionset to the model, this is slow.
## Error in if (ncol(exprObj) != nrow(data)) { : argument is of length zero
## A couple of common errors:
## An error like 'vtv downdated' may be because there are too many 0s, filter the data and rerun.
## An error like 'number of levels of each grouping factor must be < number of observations' means
## that the factor used is not appropriate for the analysis - it really only works for factors
## which are shared among multiple samples.
## Retrying with only condition in the model.
##
## Total:91 s
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## Warning: ggrepel: 19 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: ggrepel: 11 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ condition
## Fitting the expressionset to the model, this is slow.
## Error in if (ncol(exprObj) != nrow(data)) { : argument is of length zero
## A couple of common errors:
## An error like 'vtv downdated' may be because there are too many 0s, filter the data and rerun.
## An error like 'number of levels of each grouping factor must be < number of observations' means
## that the factor used is not appropriate for the analysis - it really only works for factors
## which are shared among multiple samples.
## Retrying with only condition in the model.
##
## Total:63 s
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.
mm_t4h_plots <- sm(graph_metrics(mm_t4h_expt))
mm_t4h_norm <- normalize_expt(mm_t4h_expt, norm="quant", convert="cpm",
                              transform="log2", filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(quant(cbcb(data)))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Warning in normalize_expt(mm_t4h_expt, norm = "quant", convert = "cpm", :
## Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: cbcb
## Removing 9350 low-count genes (10310 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## The method is: svaseq.
## Step 4: doing batch correction with svaseq.
## Using the current state of normalization.
## Passing the data to all_adjusters using the svaseq estimate type.
## batch_counts: Before batch/surrogate estimation, 390851 entries are x>1: 92%.
## batch_counts: Before batch/surrogate estimation, 38 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 31821 entries are 0<x<1: 8%.
## The be method chose 7 surrogate variables.
## Attempting svaseq estimation with 7 surrogates.
## There are 1390 (0%) elements which are < 0 after batch correction.
## Setting low elements to zero.
## Step 4: transforming the data with log2.
## transform_counts: Found 1390 values equal to 0, adding 1 to the matrix.
mm_t4h_norm_plots <- sm(graph_metrics(mm_t4h_norm))
mm_t4h_plots$legend
mm_t4h_plots$libsize
mm_t4h_plots$boxplot
mm_t4h_norm_plots$pc_plot
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
mm_t4h_nostim <- subset_expt(mm_t4h_expt, subset="condition!='stim'")
## Using a subset expression.
## There were 41, now there are 17 samples.
mm_t4h_nostim_norm <- sm(normalize_expt(mm_t4h_nostim, filter=TRUE,
                                        norm="quant", convert="cpm",
                                        transform="log2", batch="svaseq"))
mm_t4h_nostim_pca <- plot_pca(mm_t4h_nostim_norm)
mm_t4h_nostim_pca$plot
mm_t4h_inf_norm <- normalize_expt(mm_t4h_expt, transform="log2", convert="cpm",
                                  filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 9350 low-count genes (10310 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## The method is: svaseq.
## Step 4: doing batch correction with svaseq.
## Using the current state of normalization.
## Passing the data to all_adjusters using the svaseq estimate type.
## batch_counts: Before batch/surrogate estimation, 385615 entries are x>1: 91%.
## batch_counts: Before batch/surrogate estimation, 3253 entries are x==0: 1%.
## batch_counts: Before batch/surrogate estimation, 33842 entries are 0<x<1: 8%.
## The be method chose 7 surrogate variables.
## Attempting svaseq estimation with 7 surrogates.
## There are 1637 (0%) elements which are < 0 after batch correction.
## Setting low elements to zero.
## Step 4: transforming the data with log2.
## transform_counts: Found 1637 values equal to 0, adding 1 to the matrix.
mm_t4h_pca <- plot_pca(mm_t4h_inf_norm, plot_title="M. musculus, L. major, t4h")
mm_t4h_pca$plot
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
mm_t4h_nostim_filt <- normalize_expt(mm_t4h_nostim, filter=TRUE)
## This function will replace the expt$expressionset slot with:
## cbcb(data)
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 9679 low-count genes (9981 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## The method is: raw.
## Step 4: not doing batch correction.
## Step 4: not transforming the data.
mm_t4h_de <- all_pairwise(mm_t4h_nostim_filt, model_batch="svaseq", force=TRUE)
## batch_counts: Before batch/surrogate estimation, 168728 entries are x>1: 99%.
## batch_counts: Before batch/surrogate estimation, 571 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 110 entries are 0<x<1: 0%.
## The be method chose 2 surrogate variables.
## Attempting svaseq estimation with 2 surrogates.
## Plotting a PCA before surrogate/batch inclusion.
## Not putting labels on the PC plot.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (9981 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## The method is: svaseq.
## Step 4: doing batch correction with svaseq.
## Using the current state of normalization.
## Passing the data to all_adjusters using the svaseq estimate type.
## batch_counts: Before batch/surrogate estimation, 163054 entries are x>1: 96%.
## batch_counts: Before batch/surrogate estimation, 571 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 6052 entries are 0<x<1: 4%.
## The be method chose 4 surrogate variables.
## Attempting svaseq estimation with 4 surrogates.
## There are 241 (0%) elements which are < 0 after batch correction.
## Setting low elements to zero.
## Step 4: transforming the data with log2.
## transform_counts: Found 241 values equal to 0, adding 1 to the matrix.
## Not putting labels on the PC plot.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
mm_t4h_table <- combine_de_tables(mm_t4h_de, keepers=keepers,
                                  excel="excel/MmM0Lm4h_de_tables.xlsx")
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/1: infection which is: yes/no.
## Found table with yes_vs_no
## Adding venn plots for infection.
## Limma expression coefficients for infection; R^2: 0.916; equation: y = 0.952x + 0.25
## Deseq expression coefficients for infection; R^2: 0.909; equation: y = 0.876x + 1.18
## Edger expression coefficients for infection; R^2: 0.908; equation: y = 0.875x + 1.11
## Writing summary information, compare_plot is: TRUE.
## Performing save of excel/MmM0Lm4h_de_tables.xlsx.
mm_t4h_sig <- sm(extract_significant_genes(mm_t4h_table,
                                           excel="excel/MmM0Lm4h_sig_tables.xlsx"))
Let us see if our human differential expression result is similar to that obtained in Table S2.
I downloaded the supplementary tables from the paper; I believe #5 is the one we want to compare against. The fold-change metric was weirdly encoded in the table: it was written as a positive or negative fold change, which to me is like trying to print out sqrt(-4) without using i.
previous_hs <- readxl::read_excel("excel/inline-supplementary-material-5.xls", sheet=2)
previous_hs_lfc <- previous_hs[, c("ID", "Fold change")]
## The following addresses the way the fold changes were written
## and puts them back on the log scale.
neg_idx <- previous_hs_lfc[[2]] < 0
previous_hs_lfc[neg_idx, 2] <- -1 * (1 / previous_hs_lfc[neg_idx, 2])
previous_hs_lfc[[2]] <- log2(previous_hs_lfc[[2]])

merged <- merge(previous_hs_lfc, hs_t4h_table$data[[1]], by.x="ID", by.y="row.names")
cor.test(merged[["limma_logfc"]], merged[["Fold change"]])
##
## Pearson's product-moment correlation
##
## data: merged[["limma_logfc"]] and merged[["Fold change"]]
## t = 62, df = 4200, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6746 0.7063
## sample estimates:
## cor
## 0.6908
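That signed fold-change decoding can be sanity-checked on a toy vector (the values below are invented for illustration):

```r
## Made-up signed fold changes: +2 means doubled, -2 means halved.
fc <- c(2, -2, 4, -4, 1)
neg_idx <- fc < 0
## Convert a negative fold change back to a ratio: -2 becomes 0.5.
fc[neg_idx] <- -1 * (1 / fc[neg_idx])
lfc <- log2(fc)
print(lfc)  # 1 -1 2 -2 0
```

Note that the decoded values are symmetric around zero, which is exactly the property the positive/negative encoding obscures.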
previous_mm <- readxl::read_excel("excel/12864_2015_2237_MOESM3_ESM.xls", sheet=2, skip=1)
previous_mm_lfc <- previous_mm[, c("ID", "Fold change")]
neg_idx <- previous_mm_lfc[[2]] < 0
previous_mm_lfc[neg_idx, 2] <- -1 * (1 / previous_mm_lfc[neg_idx, 2])
previous_mm_lfc[[2]] <- log2(previous_mm_lfc[[2]])

merged <- merge(previous_mm_lfc, mm_t4h_table$data[[1]], by.x="ID", by.y="row.names")
cor.test(merged[["limma_logfc"]], merged[["Fold change"]])
##
## Pearson's product-moment correlation
##
## data: merged[["limma_logfc"]] and merged[["Fold change"]]
## t = 216, df = 5655, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9417 0.9473
## sample estimates:
## cor
## 0.9445
plot_linear_scatter(merged[, c("limma_logfc", "Fold change")])$scatter
All of the neutrophil data is in mouse, apparently. This will make it more difficult, perhaps impossible, to get an accurate answer.
So instead, let us look at infected vs. uninfected in mouse and then compare to the earliest Sacks' timepoints in neutrophils.
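The subset expression below combines cell type, host, and timepoint in one filter string. The same logic can be illustrated with base subset() on a stand-in metadata frame (the column values here are invented, only the column names match the real sheet):

```r
## Stand-in metadata with the same columns the filter uses.
meta <- data.frame(hostcelltype=c("PMN", "macrophage", "fibroblast"),
                   host=rep("mus_musculus", 3),
                   expttime=c("t12h", "t4h", "t12h"))
## Keep 12h PMN samples plus all macrophage samples.
keep <- subset(meta, (hostcelltype == "PMN" & expttime == "t12h") |
                     hostcelltype == "macrophage")
nrow(keep)  # 2
```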
subset <- "(hostcelltype=='PMN'&host=='mus_musculus'&expttime=='t12h') |
  (hostcelltype=='macrophage'&host=='mus_musculus')"
neut_macr_mus <- subset_expt(mm_expt, subset=subset)
## Using a subset expression.
## There were 105, now there are 80 samples.
neut_macr_mus <- subset_expt(neut_macr_mus, subset="infectstate!='stim'")
## Using a subset expression.
## There were 80, now there are 56 samples.
neut_macr_mus <- set_expt_conditions(neut_macr_mus, fact="infectstate")
neut_macr_mus <- set_expt_batches(neut_macr_mus, fact="hostcelltype")
neut_macr_mus_norm <- normalize_expt(neut_macr_mus, convert="cpm",
                                     norm="quant", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## cpm(quant(cbcb(data)))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (19660 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## The method is: raw.
## Step 4: not doing batch correction.
## Step 4: not transforming the data.
plot_pca(neut_macr_mus_norm)$plot
## Warning: ggrepel: 31 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
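The filter/quant/cpm chain reported above can be approximated directly with edgeR and preprocessCore (both loaded in this session). This is a hedged sketch on toy data, not the hpgltools implementation; the `cbcb` filter in particular is replaced here by a crude row-sum cutoff:

```r
## Rough equivalent of cpm(quant(filter(data))) on an assumed genes x samples
## count matrix; toy Poisson counts stand in for the real data.
library(edgeR)
library(preprocessCore)
set.seed(1)
counts <- matrix(rpois(60, lambda = 50), nrow = 10)
filtered <- counts[rowSums(counts) > 1, , drop = FALSE]  # crude low-count filter
quant <- normalize.quantiles(as.matrix(filtered))        # quantile normalization
cpms <- cpm(quant)                                       # convert to counts per million
```

Each column of `cpms` then sums to one million, which is why the pre-conversion library sizes must be kept around for limma, as the messages above note.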
neut_macr_mus_normbatch <- normalize_expt(neut_macr_mus, convert="cpm",
                                          norm="quant", filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## svaseq(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Warning in normalize_expt(neut_macr_mus, convert = "cpm", norm = "quant", :
## Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (19660 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## The method is: svaseq.
## Step 4: doing batch correction with svaseq.
## Using the current state of normalization.
## Passing the data to all_adjusters using the svaseq estimate type.
## batch_counts: Before batch/surrogate estimation, 515332 entries are x>1: 47%.
## batch_counts: Before batch/surrogate estimation, 323375 entries are x==0: 29%.
## batch_counts: Before batch/surrogate estimation, 262253 entries are 0<x<1: 24%.
## The be method chose 10 surrogate variables.
## Attempting svaseq estimation with 10 surrogates.
## There are 30709 (3%) elements which are < 0 after batch correction.
## Setting low elements to zero.
## Step 4: not transforming the data.
plot_pca(neut_macr_mus_normbatch)$plot
## Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
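The `batch="svaseq"` step above delegates to the sva package. A minimal sketch of the underlying call, using toy data (the matrix dimensions and condition labels are invented; only the use of `sva::svaseq` with a condition model and a null model reflects what the output messages describe):

```r
## Estimate surrogate variables with svaseq: mod carries the condition of
## interest, mod0 is the intercept-only null model.
library(sva)
set.seed(1)
counts <- matrix(rpois(600, lambda = 20), nrow = 100)  # 100 genes x 6 samples
condition <- factor(rep(c("infected", "uninfected"), each = 3))
mod <- model.matrix(~ condition)
mod0 <- model.matrix(~ 1)
sv <- svaseq(counts, mod, mod0)
sv$n.sv  # number of surrogate variables chosen by the 'be' method
```

In the run above, the same estimator chose 10 surrogates for the 56 real samples; svaseq expects non-log counts, which is why the data are left untransformed at this stage.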
## NB: 'keepers' is assumed to have been defined earlier in the document.
neut_macr_mus_filt <- sm(normalize_expt(neut_macr_mus, filter="simple"))
neut_macr_mus_de <- sm(all_pairwise(
  neut_macr_mus_filt, parallel=FALSE, force=TRUE, model_batch="svaseq"))
neut_macr_mus_table <- sm(combine_de_tables(
  neut_macr_mus_de, keepers=keepers,
  excel="excel/MmM0Lm4h_vs_MmPMNLm12h_de_tables.xlsx"))
neut_macr_mus_sig <- sm(extract_significant_genes(
  neut_macr_mus_table,
  excel="excel/MmM0Lm4h_vs_MmPMNLm12h_sig_tables.xlsx"))
I think this handles questions a through e?
pander::pander(sessionInfo())
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: splines, parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: edgeR(v.3.32.1), ruv(v.0.9.7.1), lme4(v.1.1-26), Matrix(v.1.3-2), BiocParallel(v.1.24.1), variancePartition(v.1.20.0), hpgltools(v.1.0), testthat(v.3.0.2), R6(v.2.5.0), Biobase(v.2.50.0) and BiocGenerics(v.0.36.0)
loaded via a namespace (and not attached): rappdirs(v.0.3.3), rtracklayer(v.1.50.0), R.methodsS3(v.1.8.1), tidyr(v.1.1.2), ggplot2(v.3.3.3), bit64(v.4.0.5), knitr(v.1.31), DelayedArray(v.0.16.1), R.utils(v.2.10.1), data.table(v.1.14.0), RCurl(v.1.98-1.2), doParallel(v.1.0.16), generics(v.0.1.0), GenomicFeatures(v.1.42.1), preprocessCore(v.1.52.1), callr(v.3.5.1), cowplot(v.1.1.1), usethis(v.2.0.1), RSQLite(v.2.2.3), shadowtext(v.0.0.7), bit(v.4.0.4), enrichplot(v.1.10.2), xml2(v.1.3.2), SummarizedExperiment(v.1.20.0), assertthat(v.0.2.1), viridis(v.0.5.1), xfun(v.0.21), tximport(v.1.18.0), hms(v.1.0.0), jquerylib(v.0.1.3), evaluate(v.0.14), IHW(v.1.18.0), DEoptimR(v.1.0-8), progress(v.1.2.2), caTools(v.1.18.1), dbplyr(v.2.1.0), readxl(v.1.3.1), igraph(v.1.2.6), DBI(v.1.1.1), geneplotter(v.1.68.0), htmlwidgets(v.1.5.3), stats4(v.4.0.3), purrr(v.0.3.4), ellipsis(v.0.3.1), crosstalk(v.1.1.1), dplyr(v.1.0.4), backports(v.1.2.1), annotate(v.1.68.0), biomaRt(v.2.46.3), MatrixGenerics(v.1.2.1), blockmodeling(v.1.0.0), vctrs(v.0.3.6), remotes(v.2.2.0), cachem(v.1.0.4), withr(v.2.4.1), ggforce(v.0.3.2), robustbase(v.0.93-7), GenomicAlignments(v.1.26.0), fdrtool(v.1.2.16), prettyunits(v.1.1.1), DOSE(v.3.16.0), lazyeval(v.0.2.2), crayon(v.1.4.1), genefilter(v.1.72.1), pkgconfig(v.2.0.3), slam(v.0.1-48), labeling(v.0.4.2), tweenr(v.1.0.1), GenomeInfoDb(v.1.26.2), nlme(v.3.1-152), pkgload(v.1.1.0), devtools(v.2.3.2), rlang(v.0.4.10), lifecycle(v.1.0.0), downloader(v.0.4), BiocFileCache(v.1.14.0), directlabels(v.2021.1.13), cellranger(v.1.1.0), rprojroot(v.2.0.2), polyclip(v.1.10-0), matrixStats(v.0.58.0), graph(v.1.68.0), lpsymphony(v.1.18.0), boot(v.1.3-27), processx(v.3.4.5), viridisLite(v.0.3.0), bitops(v.1.0-6), R.oo(v.1.24.0), KernSmooth(v.2.23-18), pander(v.0.6.3), Biostrings(v.2.58.0), EBSeq(v.1.30.0), blob(v.1.2.1), stringr(v.1.4.0), qvalue(v.2.22.0), readr(v.1.4.0), S4Vectors(v.0.28.1), scales(v.1.1.1), memoise(v.2.0.0), magrittr(v.2.0.1), plyr(v.1.8.6), 
gplots(v.3.1.1), zlibbioc(v.1.36.0), compiler(v.4.0.3), scatterpie(v.0.1.5), RColorBrewer(v.1.1-2), DESeq2(v.1.30.0), Rsamtools(v.2.6.0), cli(v.2.3.0), XVector(v.0.30.0), ps(v.1.5.0), MASS(v.7.3-53.1), mgcv(v.1.8-34), tidyselect(v.1.1.0), stringi(v.1.5.3), highr(v.0.8), yaml(v.2.2.1), GOSemSim(v.2.16.1), askpass(v.1.1), locfit(v.1.5-9.4), ggrepel(v.0.9.1), grid(v.4.0.3), sass(v.0.3.1), fastmatch(v.1.1-0), tools(v.4.0.3), rstudioapi(v.0.13), foreach(v.1.5.1), gridExtra(v.2.3), farver(v.2.0.3), Rtsne(v.0.15), ggraph(v.2.0.4), digest(v.0.6.27), rvcheck(v.0.1.8), BiocManager(v.1.30.10), quadprog(v.1.5-8), Rcpp(v.1.0.6), GenomicRanges(v.1.42.0), broom(v.0.7.4), httr(v.1.4.2), AnnotationDbi(v.1.52.0), colorspace(v.2.0-0), XML(v.3.99-0.5), fs(v.1.5.0), IRanges(v.2.24.1), RBGL(v.1.66.0), statmod(v.1.4.35), PROPER(v.1.22.0), graphlayouts(v.0.7.1), plotly(v.4.9.3), sessioninfo(v.1.1.1), xtable(v.1.8-4), jsonlite(v.1.7.2), nloptr(v.1.2.2.2), tidygraph(v.1.2.0), corpcor(v.1.6.9), Vennerable(v.3.1.0.9000), pillar(v.1.4.7), htmltools(v.0.5.1.1), glue(v.1.4.2), fastmap(v.1.1.0), minqa(v.1.2.4), clusterProfiler(v.3.18.1), codetools(v.0.2-18), fgsea(v.1.16.0), pkgbuild(v.1.2.0), lattice(v.0.20-41), bslib(v.0.2.4), tibble(v.3.0.6), sva(v.3.38.0), pbkrtest(v.0.5-0.1), curl(v.4.3), colorRamps(v.2.3), gtools(v.3.8.2), zip(v.2.1.1), GO.db(v.3.12.1), openxlsx(v.4.2.3), openssl(v.1.4.3), survival(v.3.2-7), limma(v.3.46.0), rmarkdown(v.2.7), desc(v.1.2.0), munsell(v.0.5.0), DO.db(v.2.9), fastcluster(v.1.1.25), GenomeInfoDbData(v.1.2.4), iterators(v.1.0.13), reshape2(v.1.4.4) and gtable(v.0.3.0)
message("This is hpgltools commit: ", get_git_commit())
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 7219ed2313bbb619d4c80d25632cc1142bf79fdd
## This is hpgltools commit: Mon Feb 22 13:29:16 2021 -0500: 7219ed2313bbb619d4c80d25632cc1142bf79fdd
## message(paste0("Saving to ", savefile))
## tmp <- sm(saveme(filename=savefile))