1 Introduction

Thanks to my time playing with Carrie and the Taneyhill samples, I re-evaluated suppa and think there are some potentially compelling use-case scenarios. Thus, I want to take the improved processing I employed with that data and translate it to some of our other datasets, starting with Cecilia’s human macrophages infected with Leishmania major or amazonensis.

My reworked suppa invocation is able to use either salmon quantifications or htseq-count transcript tables. Thus I think I have everything I need in the multiple_leishmania project directory wherein I passed all of our samples across every project to a single salmon-based pipeline. In addition I have some sample sheets there which should make it relatively easy to pull out the relevant samples.

1.1 Setting up

This document is in its own tree, so I am going to symlink my preprocessing directory from the multiple_leishmania tree and copy the most relevant sample sheet here.

I thus pruned the master sample sheet to include only the mbio data, then separated that into two new sheets, one with every macrophage sample (e.g. no promastigotes), and one with only the major infections or uninfected (e.g. no amazonensis nor beads).

The most relevant metadata columns for my purposes are therefore:

(In the file ‘suppa_samples_only_major.xlsx’) sampleid, infect_state, expt_time, donor, hg38_111_salmon

With that in mind, let us attempt to spin up suppa to compare ‘yes’ vs ‘no’ from the infect_state column using the salmon data.

1.1.1 Some details

I wrote some code to simplify the usage of suppa so that I can feed it a sample sheet filename, column of interest, and numerator/denominator. The code then collects the quantifications associated with the relevant samples, and runs the various suppa tools to generate all the various intermediate files used to compare splicing events among the desired groups.

1.2 Annotations

hg38_111_annot <- load_biomart_annotations(archive = FALSE)
## The biomart annotations file already exists, loading from it.
annot <- hg38_111_annot[["annotation"]]

I deleted the annotations and genome for this older release of the human genome. Thus I will need to either remap or redownload them. I am thinking that those are old enough it is worth while to redo.

1.3 So redo salmon first

I decided also to use the newest human annotation set with the assumption that there are new fun splicing annotations available.

module add cyoa/202402
cd preprocessing
base=~/scratch/rnaseq/splicing_leishmania_2023/

start=$(pwd)
for i in $(/bin/cat  ${base}/sample_sheets/only_major_sampleids.txt); do
    cd $i
    cyoa --method salmon --species hg38_111 --input r1.fastq.xz:r2.fastq.xz
    cyoa --method hisat --species hg38_111 --input r1.fastq.xz:r2.fastq.xz --gff_type gene --gff_tag ID --stranded no
    cd $start
done


start=$(pwd)
for i in $(/bin/cat  ${base}/sample_sheets/bead_samples.txt); do
    cd $i
    cyoa --method salmon --species hg38_111 --input r1.fastq.xz:r2.fastq.xz
    cyoa --method hisat --species hg38_111 --input r1.fastq.xz:r2.fastq.xz --gff_type gene --gff_tag ID --stranded no
    cd $start
done

ok, so now the relevant metadata column is hg38_111_salmon

module add cyoa
module add suppa
cyoa --method suppa --species hg38_111 --input sample_sheets/suppa_samples_only_major.xlsx \
     --file_column hg38_111_salmon --condition_column infect_state

2 Gather output files and plot the result

output_prefix <- "outputs/90suppa_hg38_111_infect_state"
dpsi_prefix <- file.path(output_prefix, "diff")
tpm_prefix <- file.path(output_prefix, "tpm")
event_prefix <- file.path(output_prefix, "events")

numerator <- "yes"
denominator <- "no"
comparison_prefix <- glue("{numerator}_{denominator}")
tx_dpsi <- file.path(dpsi_prefix, glue("{comparison_prefix}_ioi_empirical.dpsi"))
tx_tpm <- file.path(dpsi_prefix, glue("{comparison_prefix}_ioi_empirical_avglogtpm.tab"))
## test_events_ioe <- file.path(event_prefix, "local_as_events_SE_strict.ioe")
tx_events <- file.path(event_prefix, "transcript_events.ioi")
tx_psi <- file.path(dpsi_prefix, glue("{comparison_prefix}_ioi_classical.psivec"))
## Attempt to see delta transcripts.
#tx_plot <- plot_suppa(file_prefix = output_prefix, type = "transcript", annot = annot,
#                      numerator = numerator, denominator = denominator)

3 By event types and infected/uninfected

inf_uninf_event_plot <- plot_suppa(file_prefix = "outputs/90suppa_hg38_111_infect_state",
                                   numerator = numerator,
                                   denominator = denominator, type = "type",
                                   annot = hg38_111_annot$gene_annotation,
                                   annot_column = "hgnc_symbol",
                                   alpha = 0.5)
inf_uninf_event_plot$ma
## Warning: Removed 121164 rows containing missing values or values outside the scale range (`geom_point()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_text_repel()`).
## Warning: ggrepel: 528 unlabeled data points (too many overlaps). Consider increasing max.overlaps

sig_events <- inf_uninf_event_plot[["sig"]]
## I think the following should not be needed next time I run this
## (e.g. I think I changed the function to take this into account.
rownames(sig_events) <- sig_events[["Row.names"]]
sig_events[["Row.names"]] <- NULL
inf_uninf_events <- merge(sig_events, annot,
                          by.x = "gene_name", by.y = "row.names",
                          all.x = TRUE)
write_xlsx(excel = glue("excel/{comparison_prefix}_testing_suppa_sigevents.xlsx"),
                        data = inf_uninf_events)
## Deleting the file excel/yes_no_testing_suppa_sigevents.xlsx before writing the tables.
## write_xlsx() wrote excel/yes_no_testing_suppa_sigevents.xlsx.
## The cursor is on sheet first, row: 570 column: 98.

4 Repeat, considering uninfected vs bead

module add cyoa
module add suppa
cyoa --method suppa --species hg38_111 --input sample_sheets/suppa_samples_uninf_bead.xlsx \
     --file_column hg38_111_salmon --condition_column pathogen_species

4.1 Look at the result

numerator <- "none"
denominator <- "bead"
comparison_prefix <- glue("{numerator}_{denominator}")
bead_uninf_plot <- plot_suppa(file_prefix = "outputs/90suppa_hg38_111_pathogen_species",
                              numerator = numerator,
                              denominator = denominator, type = "type",
                              annot = hg38_111_annot$gene_annotation,
                              annot_column = "hgnc_symbol",
                              alpha = 0.5)
## Error in read.table(t, sep = "\t"): duplicate 'row.names' are not allowed
bead_uninf_plot$ma
## Error in eval(expr, envir, enclos): object 'bead_uninf_plot' not found
sig_events <- bead_uninf_plot[["sig"]]
## Error in eval(expr, envir, enclos): object 'bead_uninf_plot' not found
## I think the following should not be needed next time I run this
## (e.g. I think I changed the function to take this into account.
rownames(sig_events) <- sig_events[["Row.names"]]
sig_events[["Row.names"]] <- NULL
bead_uninf_events <- merge(sig_events, annot,
                           by.x = "gene_name", by.y = "row.names",
                           all.x = TRUE)
write_xlsx(excel = glue("excel/{comparison_prefix}_testing_suppa_sigevents.xlsx"),
                        data = bead_uninf_events)
## Deleting the file excel/none_bead_testing_suppa_sigevents.xlsx before writing the tables.
## write_xlsx() wrote excel/none_bead_testing_suppa_sigevents.xlsx.
## The cursor is on sheet first, row: 570 column: 98.

4.2 Take a moment to examine these events

inf_uninf_gp <- simple_gprofiler(inf_uninf_events[["gene_name"]])
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## Add a little logic here to use enrichplot::dotplot().
enrichplot::dotplot(inf_uninf_gp$GO_enrich)

pp(file = "images/inf_uninf_bp.png", height=20, width=12)
inf_uninf_gp$pvalue_plots$BP
dev.off()
## png 
##   2
bead_uninf_gp <- simple_gprofiler(bead_uninf_events[["gene_name"]])
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## Add a little logic here to use enrichplot::dotplot().
enrichplot::dotplot(bead_uninf_gp$GO_enrich)

4.3 Consider 4 hr vs uninfected

module add cyoa
module add suppa
cyoa --method suppa --species hg38_111 --input sample_sheets/suppa_samples_uninf_all4hr.xlsx \
     --file_column hg38_111_salmon --condition_column infect_state \
     --numerator inf --denominator uninf
numerator <- "t4h"
denominator <- "no"
comparison_prefix <- glue("{numerator}_{denominator}")
uninf_4h_plot <- plot_suppa(file_prefix = "outputs/90suppa_hg38_111_infect_state",
                              numerator = numerator,
                              denominator = denominator, type = "type",
                              annot = hg38_111_annot$gene_annotation,
                              annot_column = "hgnc_symbol",
                              alpha = 0.5)
uninf_4h_plot$ma
## Warning: Removed 120087 rows containing missing values or values outside the scale range (`geom_point()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range (`geom_text_repel()`).
## Warning: ggrepel: 824 unlabeled data points (too many overlaps). Consider increasing max.overlaps

sig_events <- uninf_4h_plot[["sig"]]
## I think the following should not be needed next time I run this
## (e.g. I think I changed the function to take this into account.
rownames(sig_events) <- sig_events[["Row.names"]]
sig_events[["Row.names"]] <- NULL
t4h_uninf_events <- merge(sig_events, annot,
                           by.x = "gene_name", by.y = "row.names",
                           all.x = TRUE)
write_xlsx(excel = glue("excel/{comparison_prefix}_testing_suppa_sigevents.xlsx"),
                        data = t4h_uninf_events)
## Deleting the file excel/t4h_no_testing_suppa_sigevents.xlsx before writing the tables.
## write_xlsx() wrote excel/t4h_no_testing_suppa_sigevents.xlsx.
## The cursor is on sheet first, row: 877 column: 79.
t4h_uninf_gp <- simple_gprofiler(t4h_uninf_events[["gene_name"]])
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## Add a little logic here to use enrichplot::dotplot().
enrichplot::dotplot(t4h_uninf_gp$GO_enrich)

pp(file = "images/inf_uninf_bp.png", height=20, width=12)
t4h_uninf_gp$pvalue_plots$BP
dev.off()
## png 
##   2
bead_uninf_gp <- simple_gprofiler(bead_uninf_events[["gene_name"]])
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## Add a little logic here to use enrichplot::dotplot().
enrichplot::dotplot(bead_uninf_gp$GO_enrich)

---
title: "Exploring splicing in our infected/uninfected macrophage samples."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
body .main-container {
  max-width: 1600px;
}
</style>

```{r options, include=FALSE}
library(hpgltools)
library(reticulate)
tt <- try(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(
  progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  fig.pos = "t", fig.align = "center", dpi = if (knitr::is_latex_output()) 72 else 300,
  out.width = "100%", dev = "png",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "202305"
previous_file <- ""
ver <- format(Sys.Date(), "%Y%m%d")

##tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "index.Rmd"
```

# Introduction

Thanks to my time playing with Carrie and the Taneyhill samples, I
re-evaluated suppa and think there are some potentially compelling
use-case scenarios.  Thus, I want to take the improved processing I
employed with that data and translate it to some of our other
datasets, starting with Cecilia's human macrophages infected with
Leishmania major or amazonensis.

My reworked suppa invocation is able to use either salmon
quantifications or htseq-count transcript tables.  Thus I think I have
everything I need in the multiple_leishmania project directory wherein
I passed all of our samples across every project to a single
salmon-based pipeline.  In addition I have some sample sheets there
which should make it relatively easy to pull out the relevant samples.

## Setting up

This document is in its own tree, so I am going to symlink my
preprocessing directory from the multiple_leishmania tree and copy the
most relevant sample sheet here.

I thus pruned the master sample sheet to include only the mbio data,
then separated that into two new sheets, one with every macrophage
sample (e.g. no promastigotes), and one with only the major infections
or uninfected (e.g. no amazonensis nor beads).

The most relevant metadata columns for my purposes are therefore:

(In the file 'suppa_samples_only_major.xlsx')
sampleid, infect_state, expt_time, donor, hg38_111_salmon

With that in mind, let us attempt to spin up suppa to compare 'yes' vs
'no' from the infect_state column using the salmon data.

### Some details

I wrote some code to simplify the usage of suppa so that I can feed it
a sample sheet filename, column of interest, and
numerator/denominator.  The code then collects the quantifications
associated with the relevant samples, and runs the various suppa tools
to generate all the various intermediate files used to compare
splicing events among the desired groups.

## Annotations

```{r}
hg38_111_annot <- load_biomart_annotations(archive = FALSE)
annot <- hg38_111_annot[["annotation"]]
```

I deleted the annotations and genome for this older release of the
human genome.  Thus I will need to either remap or redownload them.  I
am thinking that those are old enough it is worth while to redo.

## So redo salmon first

I decided also to use the newest human annotation set with the
assumption that there are new fun splicing annotations available.

```{bash, eval=FALSE}
module add cyoa/202402
cd preprocessing
base=~/scratch/rnaseq/splicing_leishmania_2023/

start=$(pwd)
for i in $(/bin/cat  ${base}/sample_sheets/only_major_sampleids.txt); do
    cd $i
    cyoa --method salmon --species hg38_111 --input r1.fastq.xz:r2.fastq.xz
    cyoa --method hisat --species hg38_111 --input r1.fastq.xz:r2.fastq.xz --gff_type gene --gff_tag ID --stranded no
    cd $start
done


start=$(pwd)
for i in $(/bin/cat  ${base}/sample_sheets/bead_samples.txt); do
    cd $i
    cyoa --method salmon --species hg38_111 --input r1.fastq.xz:r2.fastq.xz
    cyoa --method hisat --species hg38_111 --input r1.fastq.xz:r2.fastq.xz --gff_type gene --gff_tag ID --stranded no
    cd $start
done
```

ok, so now the relevant metadata column is hg38_111_salmon

```{bash, eval=FALSE}
module add cyoa
module add suppa
cyoa --method suppa --species hg38_111 --input sample_sheets/suppa_samples_only_major.xlsx \
     --file_column hg38_111_salmon --condition_column infect_state
```

# Gather output files and plot the result

```{r}
output_prefix <- "outputs/90suppa_hg38_111_infect_state"
dpsi_prefix <- file.path(output_prefix, "diff")
tpm_prefix <- file.path(output_prefix, "tpm")
event_prefix <- file.path(output_prefix, "events")

numerator <- "yes"
denominator <- "no"
comparison_prefix <- glue("{numerator}_{denominator}")
tx_dpsi <- file.path(dpsi_prefix, glue("{comparison_prefix}_ioi_empirical.dpsi"))
tx_tpm <- file.path(dpsi_prefix, glue("{comparison_prefix}_ioi_empirical_avglogtpm.tab"))
## test_events_ioe <- file.path(event_prefix, "local_as_events_SE_strict.ioe")
tx_events <- file.path(event_prefix, "transcript_events.ioi")
tx_psi <- file.path(dpsi_prefix, glue("{comparison_prefix}_ioi_classical.psivec"))
## Attempt to see delta transcripts.
#tx_plot <- plot_suppa(file_prefix = output_prefix, type = "transcript", annot = annot,
#                      numerator = numerator, denominator = denominator)
```

# By event types and infected/uninfected

```{r}
inf_uninf_event_plot <- plot_suppa(file_prefix = "outputs/90suppa_hg38_111_infect_state",
                                   numerator = numerator,
                                   denominator = denominator, type = "type",
                                   annot = hg38_111_annot$gene_annotation,
                                   annot_column = "hgnc_symbol",
                                   alpha = 0.5)
inf_uninf_event_plot$ma

sig_events <- inf_uninf_event_plot[["sig"]]
## I think the following should not be needed next time I run this
## (e.g. I think I changed the function to take this into account.
rownames(sig_events) <- sig_events[["Row.names"]]
sig_events[["Row.names"]] <- NULL
inf_uninf_events <- merge(sig_events, annot,
                          by.x = "gene_name", by.y = "row.names",
                          all.x = TRUE)
write_xlsx(excel = glue("excel/{comparison_prefix}_testing_suppa_sigevents.xlsx"),
                        data = inf_uninf_events)
```

# Repeat, considering uninfected vs bead

```{bash, eval=FALSE}
module add cyoa
module add suppa
cyoa --method suppa --species hg38_111 --input sample_sheets/suppa_samples_uninf_bead.xlsx \
     --file_column hg38_111_salmon --condition_column pathogen_species
```

## Look at the result

```{r}
numerator <- "none"
denominator <- "bead"
comparison_prefix <- glue("{numerator}_{denominator}")
bead_uninf_plot <- plot_suppa(file_prefix = "outputs/90suppa_hg38_111_pathogen_species",
                              numerator = numerator,
                              denominator = denominator, type = "type",
                              annot = hg38_111_annot$gene_annotation,
                              annot_column = "hgnc_symbol",
                              alpha = 0.5)
bead_uninf_plot$ma

sig_events <- bead_uninf_plot[["sig"]]
## I think the following should not be needed next time I run this
## (e.g. I think I changed the function to take this into account.
rownames(sig_events) <- sig_events[["Row.names"]]
sig_events[["Row.names"]] <- NULL
bead_uninf_events <- merge(sig_events, annot,
                           by.x = "gene_name", by.y = "row.names",
                           all.x = TRUE)
write_xlsx(excel = glue("excel/{comparison_prefix}_testing_suppa_sigevents.xlsx"),
                        data = bead_uninf_events)
```

## Take a moment to examine these events

```{r}
inf_uninf_gp <- simple_gprofiler(inf_uninf_events[["gene_name"]])
enrichplot::dotplot(inf_uninf_gp$GO_enrich)
pp(file = "images/inf_uninf_bp.png", height=20, width=12)
inf_uninf_gp$pvalue_plots$BP
dev.off()

bead_uninf_gp <- simple_gprofiler(bead_uninf_events[["gene_name"]])
enrichplot::dotplot(bead_uninf_gp$GO_enrich)
```

## Consider 4 hr vs uninfected

```{bash, eval=FALSE}
module add cyoa
module add suppa
cyoa --method suppa --species hg38_111 --input sample_sheets/suppa_samples_uninf_all4hr.xlsx \
     --file_column hg38_111_salmon --condition_column infect_state \
     --numerator inf --denominator uninf
```

```{r}
numerator <- "t4h"
denominator <- "no"
comparison_prefix <- glue("{numerator}_{denominator}")
uninf_4h_plot <- plot_suppa(file_prefix = "outputs/90suppa_hg38_111_infect_state",
                              numerator = numerator,
                              denominator = denominator, type = "type",
                              annot = hg38_111_annot$gene_annotation,
                              annot_column = "hgnc_symbol",
                              alpha = 0.5)
uninf_4h_plot$ma

sig_events <- uninf_4h_plot[["sig"]]
## I think the following should not be needed next time I run this
## (e.g. I think I changed the function to take this into account.
rownames(sig_events) <- sig_events[["Row.names"]]
sig_events[["Row.names"]] <- NULL
t4h_uninf_events <- merge(sig_events, annot,
                           by.x = "gene_name", by.y = "row.names",
                           all.x = TRUE)
write_xlsx(excel = glue("excel/{comparison_prefix}_testing_suppa_sigevents.xlsx"),
                        data = t4h_uninf_events)
```


```{r}
t4h_uninf_gp <- simple_gprofiler(t4h_uninf_events[["gene_name"]])
enrichplot::dotplot(t4h_uninf_gp$GO_enrich)
pp(file = "images/inf_uninf_bp.png", height=20, width=12)
t4h_uninf_gp$pvalue_plots$BP
dev.off()

bead_uninf_gp <- simple_gprofiler(bead_uninf_events[["gene_name"]])
enrichplot::dotplot(bead_uninf_gp$GO_enrich)
```
