Take a look at these samples. I will spend a few minutes (ok, having started the paper, hours) reading their previous document to fill this intro out.
The following command was used to invoke a combination of trimomatic, fastp, fastqc, hisat2, samtools, htseq, and a couple gatk tools.
I used the ensembl mm39 release 112. That should work fine with the most recent biomart gene annotations (202411).
For the moment, the useast server seems to be down, I think I will just use the archive until that is fixed.
##mm_annot <- load_biomart_annotations(species = "mmusculus", archive = FALSE)
mm_annot <- load_biomart_annotations(species = "mmusculus")
## The biomart annotations file already exists, loading from it.
I will probably disable the following block once I have the starting metadata in a completed state so that if someone wants to see all the numbers of mapped reads etc without having my working tree they will still be available.
meta <- gather_preprocessing_metadata("sample_sheets/all_samples.xlsx", species = "mm39_112",
new_metadata = sample_sheet)
This results in a new column ‘hisat_count_table’, so let us use that, and the gene annotations, to create an expressionset.
mm_expt <- create_expt(sample_sheet,
gene_info = mm_genes,
file_column = "hisat_count_table") %>%
set_expt_conditions(fact = "flox")
## Reading the sample metadata.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 24 rows(samples) and 27 columns(metadata fields).
## Matched 25409 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 25425 features and 24 samples.
## The numbers of samples by condition are:
##
## control_1a control_2b control_s9 ko_1a_c25b ko_2b_54C1 ko_2b_54C2 ko_s9_s8
## 3 3 3 3 3 3 3
## so_s9_r14
## 3
## A modified expressionSet containing 25425 genes and 24 sample. There are 28 metadata columns and 13 annotation columns.
## The primary condition is comprised of:
## control_1a, control_2b, control_s9, ko_1a_c25b, ko_2b_54C1, ko_2b_54C2, ko_s9_s8, so_s9_r14.
## Its current state is: raw(data).
combined <- paste0(pData(mm_expt)[["state"]], "_",
pData(mm_expt)[["line"]])
pData(mm_expt)[["combined"]] <- combined
plot_meta_sankey(mm_expt, factors = c("state", "line", "batch"))
## A sankey plot describing the metadata of 24 samples,
## including 26 out of 0 nodes and traversing metadata factors:
## .
Plot a few things to make sure there are not samples which are immeidately problematic.
## Library sizes of 24 samples,
## ranging from 14,686,859 to 17,960,838.
## The following samples have less than 16526.25 genes.
## [1] "1_IATF-1A_" "2_IATF-1A_" "3_IATF-1A_" "4_IATF-2B_" "5_IATF-2B_" "6_IATF-2B_"
## [7] "7_IATF-S9_" "8_IATF-S9_" "9_IATF-S9_" "10_IATF-1A" "11_IATF-1A" "12_IATF-1A"
## [13] "13_IATF-2B" "14_IATF-2B" "15_IATF-2B" "16_IATF-2B" "19_IATF-S9" "20_IATF-S9"
## [19] "21_IATF-S9" "22_IATF-S9" "23_IATF-S9" "24_IATF-S9"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 24 samples.
## These samples have an average 16.65 CPM coverage and 15156 genes observed, ranging from 14576 to
## 16635.
## A heatmap of pairwise sample distances ranging from:
## 16297.552607677 to 497679.486643764.
## A heatmap of pairwise sample correlations ranging from:
## 0.573161898662224 to 0.999309770057417.
Nothing jumps out, except it is quite rare for raw data to cluster this tightly, that is pretty weird.
mm_norm <- normalize_expt(mm_expt, transform = "log2", convert = "cpm",
norm = "quant", filter = TRUE)
## Removing 13165 low-count genes (12260 remaining).
## transform_counts: Found 1918 values equal to 0, adding 1 to the matrix.
## A heatmap of pairwise sample distances ranging from:
## 14.9259724949041 to 139.831122474879.
## A heatmap of pairwise sample correlations ranging from:
## 0.836561417109657 to 0.998137561607948.
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by control_1a, control_2b, control_s9, ko_1a_c25b, ko_2b_54C1, ko_2b_54C2, ko_s9_s8, so_s9_r14
## Shapes are defined by b1, b2, b3.
Wow, these samples are clustering absurdly well.
## Error in simple_varpart(mm_expt, factors = c("state", "line", "batch")): unused argument (factors = c("state", "line", "batch"))
## Error: object 'mm_varpart' not found
## Error: object 'mm_varpart' not found
Make the cell status the factor to compare; have the lineage as a secondary factor, perhaps add sva?
## The numbers of samples by condition are:
##
## control ko
## 9 15
## The number of samples by batch are:
##
## l1a l2b ls9
## 6 9 9
keepers <- list(
"ko_vs_control" = c("ko", "control"))
mm_state_line_de <- all_pairwise(
mm_state_line, model_batch = TRUE, keepers = keepers, filter = TRUE)
##
## control ko
## 9 15
##
## l1a l2b ls9
## 6 9 9
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## Basic step 0/3: Transforming data.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$best_libsize.
## Warning in createContrastL(objFlt$formula, objFlt$data, L): Contrasts with only a
## single non-zero term are already evaluated by default.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$best_libsize.
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: batch in model/limma.
## The primary analysis performed 21 comparisons.
## The logFC agreement among the methods follows:
## k_vs_cntrl
## basic_vs_deseq 0.9383
## basic_vs_dream 0.8849
## basic_vs_ebseq 0.8385
## basic_vs_edger 0.9404
## basic_vs_limma 0.8921
## basic_vs_noiseq 0.9777
## deseq_vs_dream 0.8504
## deseq_vs_ebseq 0.8373
## deseq_vs_edger 0.9865
## deseq_vs_limma 0.8518
## deseq_vs_noiseq 0.9321
## dream_vs_ebseq 0.9391
## dream_vs_edger 0.8640
## dream_vs_limma 0.9959
## dream_vs_noiseq 0.8772
## ebseq_vs_edger 0.8685
## ebseq_vs_limma 0.9366
## ebseq_vs_noiseq 0.8303
## edger_vs_limma 0.8654
## edger_vs_noiseq 0.9315
## limma_vs_noiseq 0.8780
mm_state_line_tables <- combine_de_tables(
mm_state_line_de, label_column = "mgi_symbol",
excel = glue("excel/ko_vs_control_table-v{ver}.xlsx"))
## Error in loadNamespace(x) : there is no package called 'Vennerable'
## Error in loadNamespace(x) : there is no package called 'Vennerable'
## Error in loadNamespace(x) : there is no package called 'Vennerable'
## Error in loadNamespace(x) : there is no package called 'Vennerable'
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 ko_vs_control 539 863 523 830 715
## limma_sigdown
## 1 556
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
mm_state_line_sig <- extract_significant_genes(
mm_state_line_tables, according_to = "deseq",
excel = glue("excel/ko_vs_control_sig-v{ver}.xlsx"))
mm_state_line_sig
## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## ko_vs_control 539 863
##
## control ko
## 9 15
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## Basic step 0/3: Transforming data.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$best_libsize.
## Warning in createContrastL(objFlt$formula, objFlt$data, L): Contrasts with only a
## single non-zero term are already evaluated by default.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$best_libsize.
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: sva.
## The primary analysis performed 21 comparisons.
## The logFC agreement among the methods follows:
## k_vs_cntrl
## basic_vs_deseq 0.9861
## basic_vs_dream 0.9887
## basic_vs_ebseq 0.8385
## basic_vs_edger 0.9787
## basic_vs_limma 0.9951
## basic_vs_noiseq 0.9777
## deseq_vs_dream 0.9948
## deseq_vs_ebseq 0.8317
## deseq_vs_edger 0.9951
## deseq_vs_limma 0.9889
## deseq_vs_noiseq 0.9682
## dream_vs_ebseq 0.8316
## dream_vs_edger 0.9894
## dream_vs_limma 0.9944
## dream_vs_noiseq 0.9782
## ebseq_vs_edger 0.8404
## ebseq_vs_limma 0.8411
## ebseq_vs_noiseq 0.8303
## edger_vs_limma 0.9831
## edger_vs_noiseq 0.9615
## limma_vs_noiseq 0.9756
all_tables_sva <- combine_de_tables(
all_de_sva, label_column = "mgi_symbol",
excel = glue("excel/ko_vs_control_sva_table-v{ver}.xlsx"))
## Error in loadNamespace(x) : there is no package called 'Vennerable'
## Error in loadNamespace(x) : there is no package called 'Vennerable'
## Error in loadNamespace(x) : there is no package called 'Vennerable'
## Error in loadNamespace(x) : there is no package called 'Vennerable'
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 ko_vs_control 552 857 531 835 643
## limma_sigdown
## 1 760
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
all_sig_sva <- extract_significant_genes(
all_tables_sva, excel = glue("excel/three_contrasts_sva_sig-v{ver}.xlsx"))
all_sig_sva
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down ebseq_up
## ko_vs_control 643 760 531 835 552 857 186
## ebseq_down basic_up basic_down
## ko_vs_control 406 2479 0
The following compares the ko/control for each of the cell groups.
## The numbers of samples by condition are:
##
## control_l1a control_l2b control_ls9 ko_l1a ko_l2b ko_ls9
## 3 3 3 3 6 6
keepers <- list(
"l1a" = c("ko_l1a", "control_l1a"),
"l2b" = c("ko_l2b", "control_l2b"),
"ls9" = c("ko_ls9", "control_ls9"))
mm_de <- all_pairwise(mm_combined, keepers = keepers, keep_underscore = TRUE)
##
## control_l1a control_l2b control_ls9 ko_l1a ko_l2b ko_ls9
## 3 3 3 3 6 6
##
## b1 b2 b3
## 8 8 8
## Basic step 0/3: Filtering data.
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## Basic step 0/3: Transforming data.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$libsize.
## Warning in createContrastL(objFlt$formula, objFlt$data, L): Contrasts with only a
## single non-zero term are already evaluated by default.
## The contrast control_ls9 is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_l1a is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_l2b is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_ls9 is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast control_ls9 is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_l1a is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_l2b is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_ls9 is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_l1a is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_l2b is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_ls9 is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_l2b is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_ls9 is not in the results.
## If this is not an extra contrast, then this is an error.
## The contrast ko_ls9 is not in the results.
## If this is not an extra contrast, then this is an error.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$libsize.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of basic, ko_l1a_vs_control_l1a and ebseq,
## ko_l1a_vs_control_l1a failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of basic, ko_l2b_vs_control_l2b and ebseq,
## ko_l2b_vs_control_l2b failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of basic, ko_ls9_vs_control_ls9 and ebseq,
## ko_ls9_vs_control_ls9 failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of deseq, ko_l1a_vs_control_l1a and ebseq,
## ko_l1a_vs_control_l1a failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of deseq, ko_l2b_vs_control_l2b and ebseq,
## ko_l2b_vs_control_l2b failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of deseq, ko_ls9_vs_control_ls9 and ebseq,
## ko_ls9_vs_control_ls9 failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of dream, ko_l1a_vs_control_l1a and ebseq,
## ko_l1a_vs_control_l1a failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of dream, ko_l2b_vs_control_l2b and ebseq,
## ko_l2b_vs_control_l2b failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of dream, ko_ls9_vs_control_ls9 and ebseq,
## ko_ls9_vs_control_ls9 failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of ebseq, control_l2b_vs_control_l1a and edger,
## control_l2b_vs_control_l1a failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of ebseq, control_l2b_vs_control_l1a and limma,
## control_l2b_vs_control_l1a failed.
## Warning in correlate_de_tables(results, annot_df = annot_df, extra_contrasts =
## extra_contrasts): The merge of ebseq, control_l2b_vs_control_l1a and noiseq,
## control_l2b_vs_control_l1a failed.
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: batch in model/limma.
## The primary analysis performed 15 comparisons.
mm_tables <- combine_de_tables(mm_de, label_column = "mgi_symbol",
excel = "excel/three_contrasts.xlsx")
## Deleting the file excel/three_contrasts.xlsx before writing the tables.
## Error in ggplot2::geom_point(size = size, alpha = alpha, colour = grDevices::hsv(linear_model_weights * :
## Problem while setting up geom aesthetics.
## ℹ Error occurred in the 13th layer.
## Caused by error in `check_aesthetics()`:
## ! Aesthetics must be either length 1 or the same as the data (25425).
## ✖ Fix the following mappings: `colour`.
## Error in ggplot2::geom_point(size = size, alpha = alpha, colour = grDevices::hsv(linear_model_weights * :
## Problem while setting up geom aesthetics.
## ℹ Error occurred in the 13th layer.
## Caused by error in `check_aesthetics()`:
## ! Aesthetics must be either length 1 or the same as the data (25425).
## ✖ Fix the following mappings: `colour`.
## Error in ggplot2::geom_point(size = size, alpha = alpha, colour = grDevices::hsv(linear_model_weights * :
## Problem while setting up geom aesthetics.
## ℹ Error occurred in the 13th layer.
## Caused by error in `check_aesthetics()`:
## ! Aesthetics must be either length 1 or the same as the data (25425).
## ✖ Fix the following mappings: `colour`.
## Error in ggplot2::geom_point(size = size, alpha = alpha, colour = grDevices::hsv(linear_model_weights * :
## Problem while setting up geom aesthetics.
## ℹ Error occurred in the 13th layer.
## Caused by error in `check_aesthetics()`:
## ! Aesthetics must be either length 1 or the same as the data (25425).
## ✖ Fix the following mappings: `colour`.
## Error in ggplot2::geom_point(size = size, alpha = alpha, colour = grDevices::hsv(linear_model_weights * :
## Problem while setting up geom aesthetics.
## ℹ Error occurred in the 13th layer.
## Caused by error in `check_aesthetics()`:
## ! Aesthetics must be either length 1 or the same as the data (25425).
## ✖ Fix the following mappings: `colour`.
## Error in ggplot2::geom_point(size = size, alpha = alpha, colour = grDevices::hsv(linear_model_weights * :
## Problem while setting up geom aesthetics.
## ℹ Error occurred in the 13th layer.
## Caused by error in `check_aesthetics()`:
## ! Aesthetics must be either length 1 or the same as the data (25425).
## ✖ Fix the following mappings: `colour`.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown
## 1 ko_l1a_vs_control_l1a 1203 1651 1325 1711
## 2 ko_l2b_vs_control_l2b 1612 1557 1751 1674
## 3 ko_ls9_vs_control_ls9 1089 539 1077 579
## limma_sigup limma_sigdown
## 1 1711 1457
## 2 1723 1655
## 3 752 735
## Plot describing unique/shared genes in a differential expression table.
## Deleting the file excel/three_contrasts-sig.xlsx before writing the tables.
## A set of genes deemed significant according to limma, edger, deseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down
## ko_l1a_vs_control_l1a 1711 1457 1325 1711 1203 1651
## ko_l2b_vs_control_l2b 1723 1655 1751 1674 1612 1557
## ko_ls9_vs_control_ls9 752 735 1077 579 1089 539
## basic_up basic_down
## ko_l1a_vs_control_l1a 0 0
## ko_l2b_vs_control_l2b 0 0
## ko_ls9_vs_control_ls9 0 0
## Running gProfiler on every set of significant genes found:
## BP CC CORUM HP KEGG MIRNA MF REAC TF WP
## ko_l1a_vs_control_l1a_up 319 67 0 0 0 0 62 0 150 0
## ko_l1a_vs_control_l1a_down 869 116 0 0 0 0 84 5 293 1
## ko_l2b_vs_control_l2b_up 182 80 0 0 0 0 72 0 106 0
## ko_l2b_vs_control_l2b_down 877 103 0 1 2 0 88 15 302 1
## ko_ls9_vs_control_ls9_up 501 100 0 0 4 0 63 5 254 0
## ko_ls9_vs_control_ls9_down 404 38 0 0 0 2 30 3 21 0
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Error in .stopf("Duplicate values in %s not allowed", universeArg): Duplicate values in names(stats) not allowed
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'dotplot': object 'mm_cp' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'dotplot': object 'mm_cp' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'gse' in selecting a method for function 'plot_topn_gsea': object 'mm_cp' not found
## Error: object 'topn_gsea' not found
Use UpSetR to get shared/unique genes deemed significant.
sig_intersect <- upsetr_sig(mm_sig)
intersect_write <- write_upset_groups(sig_intersect, excel = "excel/contrast_intsections.xlsx")
## Deleting the file excel/contrast_intsections.xlsx before writing the tables.