I think for speed sake, I will run these through salmon.
I have been making some changes to my pipeline recently; so I will put 100k reads into a test directory.
I am going to use that tree to run everything first, notably I want my tools to collect more thorough statistics on runtime etc to improve my nascent heuristics to choose memory/time on the cluster.
First let us load some reasonably current biomart annotations.
## The biomart annotations file already exists, loading from it.
My code is a little bit picky about ensuring that the various IDs match. Thus I will likely need to do a little work to make certain that the various version numbers match properly.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## Writing new metadata to: sample_sheets/PRJNA675090_modified.xlsx
## Deleting the file sample_sheets/PRJNA675090_modified.xlsx before writing the tables.
I keep meaning to add a little function to strip off the tx_version suffix, given that tximport now has an argument to ignore it.
We usually do all of our quantification at the gene level, but it seems like this might be an occasion when transcript changes and/or splicing might be of interest. Thus I am going to generate a couple of summarized experiments, one for each. For the moment we will just use the gene level abundances.
tx_gene_map <- tx_annot[, c("ensembl_transcript_id", "ensembl_gene_id")]
rownames(tx_gene_map) <- make.names(gsub(x = rownames(tx_gene_map),
pattern = "\\.\\d+$", replacement = ""), unique = TRUE)
hs_se_tx <- create_se(meta[["new_meta"]], gene_info = tx_annot, file_column = "salmon_count_table")
## Reading the sample metadata.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 41 rows(samples) and 22 columns(metadata fields).
## Warning in create_se(meta[["new_meta"]], gene_info = tx_annot, file_column =
## "salmon_count_table"): Some samples were removed when cross referencing the samples
## against the count data.
## Matched 85820 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the summarized experiment to 'se.rda'.
## The final summarized experiment has 97117 rows and 22 columns.
hs_se_gene <- create_se(meta[["new_meta"]], gene_info = annot, file_column = "salmon_count_table",
tx_gene_map = tx_gene_map)
## Reading the sample metadata.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 41 rows(samples) and 22 columns(metadata fields).
## In some cases, (notably salmon) the format of the IDs used by this can be tricky.
## It is likely to require the transcript ID followed by a '.' and the ensembl column:
## 'transcript_version', which is explicitly different than the gene version column.
## If this is not correctly performed, very few genes will be observed
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## transcripts missing from tx2gene: 7591
## summarizing abundance
## summarizing counts
## summarizing length
## Warning in create_se(meta[["new_meta"]], gene_info = annot, file_column =
## "salmon_count_table", : Some samples were removed when cross referencing the samples
## against the count data.
## Matched 21356 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the summarized experiment to 'se.rda'.
## The final summarized experiment has 21356 rows and 22 columns.
I have extracted two potentially interesting columns from the metadata, after that I will need to read more carefully in the paper to try to get a sense of what is what…
## The numbers of samples by condition are:
##
## control HCM
## 5 35
## The number of samples by batch are:
##
## exome rnaseq
## 17 23
## The colors used in the expressionset are: #1B9E77, #7570B3.
## Library sizes of 40 samples,
## ranging from 2,677,665 to 7,112,244.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 40 samples.
## These samples have an average 4.765 CPM coverage and 16851 genes observed, ranging from 13986 to
## 19653.
## Warning: ggrepel: 21 unlabeled data points (too many overlaps). Consider increasing
## max.overlaps
## A sankey plot describing the metadata of 40 samples,
## including 5 out of 0 nodes and traversing metadata factors:
## condition, batch.
Presumably we will need to separate the exome and rnaseq data; but for the moment I will leave them together to see what I can see.
## Running normalize_se.
## Removing 1203 low-count genes (20153 remaining).
## transform_counts: Found 12213 values equal to 0, adding 1 to the matrix.
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by control, HCM
## Shapes are defined by exome, rnaseq.
I suppose it should be no surprise that the two experiment types are very different. Let us therefore skip on the exome data for now.
hs_rna <- subset_se(hs_se, subset = "batch=='rnaseq'")
hs_rna_norm <- normalize(hs_rna, transform = "log2", convert = "cpm",
filter = TRUE, norm = "quant")
## Running normalize_se.
## Removing 8111 low-count genes (13245 remaining).
## transform_counts: Found 9325 values equal to 0, adding 1 to the matrix.
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by control, HCM
## Shapes are defined by rnaseq.
## png
## 2
In our initial PCA plot, SRR12999746 looks particularly strange.
hs_idx <- colnames(hs_rna) != "SRR12999746"
kept_ids <- colnames(hs_rna)[hs_idx]
hs_excluded <- subset_se(hs_rna, ids = kept_ids)
excluded_norm <- normalize(hs_excluded, filter = TRUE, convert = "cpm",
transform = "log2", norm = "quant")
## Running normalize_se.
## Removing 8234 low-count genes (13122 remaining).
## transform_counts: Found 8289 values equal to 0, adding 1 to the matrix.
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by control, HCM
## Shapes are defined by rnaseq.
excluded_nb <- normalize(hs_excluded, filter = TRUE, convert = "cpm",
transform = "log2", batch = "svaseq")
## Running normalize_se.
## Removing 8234 low-count genes (13122 remaining).
## transform_counts: Found 2514 values less than 0.
## Warning in transform_counts(count_table, method = transform, ...): NaNs produced
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by control, HCM
## Shapes are defined by rnaseq.
hs_de <- all_pairwise(hs_excluded, filter = TRUE, model_svs = "svaseq",
model_fstring = "~ 0 + condition", force = TRUE)
## control HCM
## 5 17
## Running normalize_se.
## Removing 8234 low-count genes (13122 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Running normalize_se.
## Warning in choose_binom_dataset(input, force = force): This data was inappropriately
## forced into integers.
## This received a matrix of SVs.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Warning in choose_binom_dataset(input, force = force): This data was inappropriately
## forced into integers.
## Warning in createContrastL(objFlt$formula, objFlt$data, L): Contrasts with only a
## single non-zero term are already evaluated by default.
## Warning in choose_binom_dataset(input, force = force): This data was inappropriately
## forced into integers.
## Warning in choose_binom_dataset(input, force = force): This data was inappropriately
## forced into integers.
## conditions
## control HCM
## 5 17
## conditions
## control HCM
## 5 17
## Warning in choose_binom_dataset(input, force = force): This data was inappropriately
## forced into integers.
## conditions
## control HCM
## 5 17
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 1 comparisons.
## The logFC agreement among the methods follows:
## HCM_vs_cnt
## basic_vs_deseq 0.5069
## basic_vs_dream 0.8682
## basic_vs_ebseq 0.7508
## basic_vs_edger 0.7705
## basic_vs_limma 0.8652
## basic_vs_noiseq 0.8735
## deseq_vs_dream 0.5136
## deseq_vs_ebseq 0.7515
## deseq_vs_edger 0.8013
## deseq_vs_limma 0.5004
## deseq_vs_noiseq 0.6683
## dream_vs_ebseq 0.7205
## dream_vs_edger 0.7613
## dream_vs_limma 0.9961
## dream_vs_noiseq 0.8213
## ebseq_vs_edger 0.8179
## ebseq_vs_limma 0.7083
## ebseq_vs_noiseq 0.9443
## edger_vs_limma 0.7511
## edger_vs_noiseq 0.8365
## limma_vs_noiseq 0.8138
## Deleting the file excel/control_vs_hcm.xlsx before writing the tables.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 HCM_vs_control 199 189 163 157 92
## limma_sigdown
## 1 195
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
## Deleting the file excel/control_vs_hcm-sig.xlsx before writing the tables.
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down ebseq_up
## HCM_vs_control 92 195 163 157 199 189 135
## ebseq_down basic_up basic_down
## HCM_vs_control 130 153 160
The default settings for the various ontology/enrichment tools I have assume human data.
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
## ReactomePA v1.50.0 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
##
## Please cite:
##
## Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for
## reactome pathway analysis and visualization. Molecular BioSystems. 2016,
## 12(2):477-479
## Loading required package: org.Hs.eg.db
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:hpgltools':
##
## conditions<-, normalize, sd, var
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
## rownames, sapply, saveRDS, setdiff, table, tapply, union, unique,
## unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with 'browseVignettes()'.
## To cite Bioconductor, see 'citation("Biobase")', and for packages
## 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:hpgltools':
##
## findMatches, first, second
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:hpgltools':
##
## shift
## Deleting the file excel/all_cp_HCM_vs_control_up.xlsx before writing the tables.
## Writing a sheet containing the legend.
## Writing the BP data.
## Writing the MF data.
## Writing the CC data.
## Writing the KEGG data.
## Finished writing excel file.
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## Deleting the file excel/all_cp_HCM_vs_control_down.xlsx before writing the tables.
## Writing a sheet containing the legend.
## Writing the BP data.
## Writing the MF data.
## Writing the CC data.
## Writing the KEGG data.
## Finished writing excel file.
I am just fooling around now.
hs_gp <- all_gprofiler(hs_sig)
fun_plots <- plot_enrichresult(hs_gp[["HCM_vs_control_up"]][["BP_enrich"]])
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider increasing
## max.overlaps
I am always amazed at how many genes are shared across groups.
A few specific mSigDB categories: C8, C7, C5, C3, C2
This is actually C2 because I need to download a new copy of msigdb in the new format.
I downloaded a fresh copy of the human mSigDB because my copy was in the old format and that no longer parses well.
up_cp_c8 <- simple_clusterprofiler(input_up, de_table = table,
msig_db = msigdb,
do_msigdb = TRUE, msigdb_category = "C8",
do_mesh = TRUE, do_dose = TRUE, orgdb_from = "ENSEMBL")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'input_up' not found
c2 <- load_gmt_signatures(signatures = "reference/msigdb_v2024.1.Hs.db", signature_category = "c2",
id_type = "entrez")
c2_enricher <- simple_cp_enricher(sig_genes = input_up, de_table = table, db = c2)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'sig_genes' in selecting a method for function 'simple_cp_enricher': object 'input_up' not found
c2_merged <- merge(as.data.frame(c2_enricher), msig_meta, by.x = "ID", by.y = "standard_name", all.x = TRUE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c2_enricher' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'write_xlsx': object 'c2_merged' not found
c3 <- load_gmt_signatures(signatures = "reference/msigdb_v2024.1.Hs.db", signature_category = "c3",
id_type = "entrez")
c3_enricher <- simple_cp_enricher(sig_genes = input_up, de_table = table, db = c3)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'sig_genes' in selecting a method for function 'simple_cp_enricher': object 'input_up' not found
c3_merged <- merge(as.data.frame(c3_enricher), msig_meta, by.x = "ID", by.y = "standard_name", all.x = TRUE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c3_enricher' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'write_xlsx': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c3_merged' not found
c4 <- load_gmt_signatures(signatures = "reference/msigdb_v2024.1.Hs.db", signature_category = "c4",
id_type = "entrez")
c4_enricher <- simple_cp_enricher(sig_genes = input_up, de_table = table, db = c4)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'sig_genes' in selecting a method for function 'simple_cp_enricher': object 'input_up' not found
c4_merged <- merge(as.data.frame(c4_enricher), msig_meta, by.x = "ID", by.y = "standard_name", all.x = TRUE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c4_enricher' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'write_xlsx': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c4_merged' not found
c5 <- load_gmt_signatures(signatures = "reference/msigdb_v2024.1.Hs.db", signature_category = "c5",
id_type = "entrez")
c5_enricher <- simple_cp_enricher(sig_genes = input_up, de_table = table, db = c5)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'sig_genes' in selecting a method for function 'simple_cp_enricher': object 'input_up' not found
c5_merged <- merge(as.data.frame(c5_enricher), msig_meta, by.x = "ID", by.y = "standard_name", all.x = TRUE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c5_enricher' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'write_xlsx': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c5_merged' not found
C6 returns no hits.
c7 <- load_gmt_signatures(signatures = "reference/msigdb_v2024.1.Hs.db", signature_category = "c7",
id_type = "entrez")
c7_enricher <- simple_cp_enricher(sig_genes = input_up, de_table = table, db = c7)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'sig_genes' in selecting a method for function 'simple_cp_enricher': object 'input_up' not found
c7_merged <- merge(as.data.frame(c7_enricher), msig_meta, by.x = "ID", by.y = "standard_name", all.x = TRUE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c7_enricher' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'write_xlsx': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c7_merged' not found
c8 <- load_gmt_signatures(signatures = "reference/msigdb_v2024.1.Hs.db", signature_category = "c8",
id_type = "entrez")
c8_enricher <- simple_cp_enricher(sig_genes = input_up, de_table = table, db = c8)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'sig_genes' in selecting a method for function 'simple_cp_enricher': object 'input_up' not found
c8_merged <- merge(as.data.frame(c8_enricher), msig_meta, by.x = "ID", by.y = "standard_name", all.x = TRUE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c8_enricher' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'write_xlsx': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'c8_merged' not found
input_up <- as.data.frame(hs_sig[["deseq"]][["ups"]][[1]])
input_down <- as.data.frame(hs_sig[["deseq"]][["downs"]][[1]])
table <- hs_table[["data"]][[1]]
up_gp <- simple_gprofiler(input_up)
up_cp_c8 <- simple_clusterprofiler(input_up, de_table = table,
do_msigdb = TRUE, msigdb_category = "C8",
do_mesh = TRUE, do_dose = TRUE, orgdb_from = "ENSEMBL")
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## Warning in simple_clusterprofiler(input_up, de_table = table, do_msigdb = TRUE, : I do
## not know this DOSE organism, leaving it as human.
## snapshotDate(): 2024-10-28
## Warning in simple_clusterprofiler(input_up, de_table = table, do_msigdb = TRUE, : I do
## not know this mesh organism, leaving it as human.
## loading from cache
## snapshotDate(): 2024-10-28
## loading from cache
## Deleting the file excel/c2_increased.xlsx before writing the tables.
## write_xlsx() wrote excel/c2_increased.xlsx.
## The cursor is on sheet first, row: 129 column: 14.