Thanks to my time playing with Carrie and the Taneyhill samples, I re-evaluated suppa and think there are some potentially compelling use-case scenarios. Thus, I want to take the improved processing I employed with that data and translate it to some of our other datasets, starting with Cecilia’s human macrophages infected with Leishmania major or amazonensis.
My reworked suppa invocation is able to use either salmon quantifications or htseq-count transcript tables. Thus I think I have everything I need in the multiple_leishmania project directory wherein I passed all of our samples across every project to a single salmon-based pipeline. In addition I have some sample sheets there which should make it relatively easy to pull out the relevant samples.
This document is in its own tree, so I am going to symlink my preprocessing directory from the multiple_leishmania tree and copy the most relevant sample sheet here.
I thus pruned the master sample sheet to include only the mbio data, then separated that into two new sheets, one with every macrophage sample (e.g. no promastigotes), and one with only the major infections or uninfected (e.g. no amazonensis nor beads).
The most relevant metadata columns for my purposes are therefore:
(In the file ‘suppa_samples_only_major.xlsx’) sampleid, infect_state, expt_time, donor, hg38_111_salmon
With that in mind, let us attempt to spin up suppa to compare ‘yes’ vs ‘no’ from the infect_state column using the salmon data.
I wrote some code to simplify the usage of suppa so that I can feed it a sample sheet filename, column of interest, and numerator/denominator. The code then collects the quantifications associated with the relevant samples, and runs the various suppa tools to generate all the various intermediate files used to compare splicing events among the desired groups.
## The biomart annotations file already exists, loading from it.
I deleted the annotations and genome for this older release of the human genome. Thus I will need to either remap or redownload them. I am thinking that those are old enough it is worth while to redo.
I decided also to use the newest human annotation set with the assumption that there are new fun splicing annotations available.
module add cyoa/202402
cd preprocessing
base=~/scratch/rnaseq/splicing_leishmania_2023/
start=$(pwd)
for i in $(/bin/cat ${base}/sample_sheets/only_major_sampleids.txt); do
cd $i
cyoa --method salmon --species hg38_111 --input r1.fastq.xz:r2.fastq.xz
cyoa --method hisat --species hg38_111 --input r1.fastq.xz:r2.fastq.xz --gff_type gene --gff_tag ID --stranded no
cd $start
done
start=$(pwd)
for i in $(/bin/cat ${base}/sample_sheets/bead_samples.txt); do
cd $i
cyoa --method salmon --species hg38_111 --input r1.fastq.xz:r2.fastq.xz
cyoa --method hisat --species hg38_111 --input r1.fastq.xz:r2.fastq.xz --gff_type gene --gff_tag ID --stranded no
cd $start
done
ok, so now the relevant metadata column is hg38_111_salmon
output_prefix <- "outputs/90suppa_hg38_111_infect_state"
dpsi_prefix <- file.path(output_prefix, "diff")
tpm_prefix <- file.path(output_prefix, "tpm")
event_prefix <- file.path(output_prefix, "events")
numerator <- "yes"
denominator <- "no"
comparison_prefix <- glue("{numerator}_{denominator}")
tx_dpsi <- file.path(dpsi_prefix, glue("{comparison_prefix}_ioi_empirical.dpsi"))
tx_tpm <- file.path(dpsi_prefix, glue("{comparison_prefix}_ioi_empirical_avglogtpm.tab"))
## test_events_ioe <- file.path(event_prefix, "local_as_events_SE_strict.ioe")
tx_events <- file.path(event_prefix, "transcript_events.ioi")
tx_psi <- file.path(dpsi_prefix, glue("{comparison_prefix}_ioi_classical.psivec"))
## Attempt to see delta transcripts.
#tx_plot <- plot_suppa(file_prefix = output_prefix, type = "transcript", annot = annot,
# numerator = numerator, denominator = denominator)
inf_uninf_event_plot <- plot_suppa(file_prefix = "outputs/90suppa_hg38_111_infect_state",
numerator = numerator,
denominator = denominator, type = "type",
annot = hg38_111_annot$gene_annotation,
annot_column = "hgnc_symbol",
alpha = 0.5)
inf_uninf_event_plot$ma
## Warning: Removed 121164 rows containing missing values or values outside the scale range (`geom_point()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_text_repel()`).
## Warning: ggrepel: 528 unlabeled data points (too many overlaps). Consider increasing max.overlaps
sig_events <- inf_uninf_event_plot[["sig"]]
## I think the following should not be needed next time I run this
## (e.g. I think I changed the function to take this into account.
rownames(sig_events) <- sig_events[["Row.names"]]
sig_events[["Row.names"]] <- NULL
inf_uninf_events <- merge(sig_events, annot,
by.x = "gene_name", by.y = "row.names",
all.x = TRUE)
write_xlsx(excel = glue("excel/{comparison_prefix}_testing_suppa_sigevents.xlsx"),
data = inf_uninf_events)
## Deleting the file excel/yes_no_testing_suppa_sigevents.xlsx before writing the tables.
## write_xlsx() wrote excel/yes_no_testing_suppa_sigevents.xlsx.
## The cursor is on sheet first, row: 570 column: 98.
module add cyoa
module add suppa
cyoa --method suppa --species hg38_111 --input sample_sheets/suppa_samples_uninf_bead.xlsx \
--file_column hg38_111_salmon --condition_column pathogen_species
numerator <- "none"
denominator <- "bead"
comparison_prefix <- glue("{numerator}_{denominator}")
bead_uninf_plot <- plot_suppa(file_prefix = "outputs/90suppa_hg38_111_pathogen_species",
numerator = numerator,
denominator = denominator, type = "type",
annot = hg38_111_annot$gene_annotation,
annot_column = "hgnc_symbol",
alpha = 0.5)
## Error in read.table(t, sep = "\t"): duplicate 'row.names' are not allowed
## Error in eval(expr, envir, enclos): object 'bead_uninf_plot' not found
## Error in eval(expr, envir, enclos): object 'bead_uninf_plot' not found
## I think the following should not be needed next time I run this
## (e.g. I think I changed the function to take this into account.
rownames(sig_events) <- sig_events[["Row.names"]]
sig_events[["Row.names"]] <- NULL
bead_uninf_events <- merge(sig_events, annot,
by.x = "gene_name", by.y = "row.names",
all.x = TRUE)
write_xlsx(excel = glue("excel/{comparison_prefix}_testing_suppa_sigevents.xlsx"),
data = bead_uninf_events)
## Deleting the file excel/none_bead_testing_suppa_sigevents.xlsx before writing the tables.
## write_xlsx() wrote excel/none_bead_testing_suppa_sigevents.xlsx.
## The cursor is on sheet first, row: 570 column: 98.
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## Add a little logic here to use enrichplot::dotplot().
## png
## 2
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## Add a little logic here to use enrichplot::dotplot().
module add cyoa
module add suppa
cyoa --method suppa --species hg38_111 --input sample_sheets/suppa_samples_uninf_all4hr.xlsx \
--file_column hg38_111_salmon --condition_column infect_state \
--numerator inf --denominator uninf
numerator <- "t4h"
denominator <- "no"
comparison_prefix <- glue("{numerator}_{denominator}")
uninf_4h_plot <- plot_suppa(file_prefix = "outputs/90suppa_hg38_111_infect_state",
numerator = numerator,
denominator = denominator, type = "type",
annot = hg38_111_annot$gene_annotation,
annot_column = "hgnc_symbol",
alpha = 0.5)
uninf_4h_plot$ma
## Warning: Removed 120087 rows containing missing values or values outside the scale range (`geom_point()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range (`geom_text_repel()`).
## Warning: ggrepel: 824 unlabeled data points (too many overlaps). Consider increasing max.overlaps
sig_events <- uninf_4h_plot[["sig"]]
## I think the following should not be needed next time I run this
## (e.g. I think I changed the function to take this into account.
rownames(sig_events) <- sig_events[["Row.names"]]
sig_events[["Row.names"]] <- NULL
t4h_uninf_events <- merge(sig_events, annot,
by.x = "gene_name", by.y = "row.names",
all.x = TRUE)
write_xlsx(excel = glue("excel/{comparison_prefix}_testing_suppa_sigevents.xlsx"),
data = t4h_uninf_events)
## Deleting the file excel/t4h_no_testing_suppa_sigevents.xlsx before writing the tables.
## write_xlsx() wrote excel/t4h_no_testing_suppa_sigevents.xlsx.
## The cursor is on sheet first, row: 877 column: 79.
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## Add a little logic here to use enrichplot::dotplot().
## png
## 2
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## Add a little logic here to use enrichplot::dotplot().