1 Introduction

This dataset contains data from two experiments:

  1. ‘rph’: This appears to me to be a reasonably simple comparison of a ‘normal’ PA14 lab strain and two methods to knock out the rph gene ribonuclease PH; “Phosphorolytic 3’ to 5’ exoribonuclease” which helps tRNA 3’ end maturation. It removes residues after the CCA end, and also sometimes adds nucleotides using nucleoside diphosphates as input. Potentially helps degrade 16S when starving.
  2. ‘alginate’: This is a very complex experiment which uses a single strain (PA14 with a plasmid containing algU). AlgU regulates the rsmA promoter which is important for virulence and has quite a lot of literature surrounding its function. I am guessing this is a knockout strain for algU? Either way, this strain was grown either in LB or SCFM (synthetic cystic fibrosis sputum media) with one or more supplements: DMSO (control), or IPTG (induce), and/or Eb/EbS/EbO (reducers).

2 Annotations

I am only now checking out pseudomonas.com. At first glance I am not seeing a good way to programmatically download the annotation data, but there is a nice catalog of files to download. In addition, it appears they do a good job of ensuring that all of the strain annotations are at NCBI as genbank files; so I should not have significant troubles if I want to get these annotations.

With that in mind, I am using two data sources for annotations, the gff file I extracted from the NCBI reference genome and copied to the local ‘reference/’ directory (thus I assume it is identical to what is at pseudomonas.com), and annotations from microbesonline.org.

pa14_gff <- load_gff_annotations("reference/paeruginosa_pa14.gff", id_col="gene_id")
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = TRUE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = TRUE)
## Returning a df with 16 columns and 11946 rows.
rownames(pa14_gff) <- pa14_gff[["gene_id"]]
## The Alias column has PA14_00010

pa14_microbes <- as.data.frame(load_microbesonline_annotations("PA14"))
## Found 1 entry.
## Pseudomonas aeruginosa UCBPP-PA14Proteobacteria2006-11-22yes105972208963
## The species being downloaded is: Pseudomonas aeruginosa UCBPP-PA14
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=208963;export=tab
## The sysName column has PA14_0010

pa14_annot <- merge(pa14_gff, pa14_microbes, by.x="Alias", by.y="sysName")
rownames(pa14_annot) <- pa14_annot[["gene_id"]]

## The identifiers are a bit odd, so we need to do a little work
pa14_length <- pa14_annot[, c("gene_id", "width", "Alias")]
pa14_go <- load_microbesonline_go(species="PA14", id_column="sysName")
## Found 1 entry.
## Pseudomonas aeruginosa UCBPP-PA14Proteobacteria2006-11-22yes105972208963
## The species being downloaded is: Pseudomonas aeruginosa UCBPP-PA14 and is being downloaded as 208963.tab.
pa14_go_length <- merge(pa14_go, pa14_length, by.x="sysName", by.y="Alias")
pa14_go <- pa14_go_length[, c("sysName", "GO")]
colnames(pa14_go) <- c("ID", "GO")
pa14_length <- pa14_go_length[, c("sysName", "width")]
##pa14_length_ids <- unique(pa14_length)[["gene_id"]]
##pa14_length <- pa14_length[pa14_length_ids, ]
rownames(pa14_length) <- make.names(pa14_length[[1]], unique=TRUE)
colnames(pa14_length) <- c("ID", "width")

3 Make the expressionset

Previously, I loaded all samples from both groups of experiments together and separated them. I decided in this iteration to separate them in the hopes of avoiding any potential confusion.

The following line reads in the sample sheet and creates a data structure which contains all of the sample annotations, the gene annotations, and the counts/gene.

sk_expt <- create_expt("sample_sheets/all_samples_sk_202206.xlsx",
                       gene_info=pa14_annot,
                       file_column="hisatcounttablereverse")
## Reading the sample metadata.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## The sample definitions comprises: 45 rows(samples) and 32 columns(metadata fields).
## Matched 5972 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 5979 features and 45 samples.

4 Quick view of the data

The first view of the data is to see how many counts/sample were observed and to get a sense of how many genes were observed:

The range of reads/sample goes from ~ 3.1 to 5.9 million, which seems quite reasonable to me.

The range of genes observed goes from ~ 5,925 to 5,980. This is a beautifully tight range.

pa14_libsize <- plot_libsize(sk_expt)
pa14_libsize$plot

pa14_nonzero <- plot_nonzero(sk_expt)
pa14_nonzero$plot
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

5 Separate the experiments and write out the data

I suspect that SooKyoung might enjoy poking at the counts in excel. Before writing it out though let us set the conditions and batches so the excel output has some pretty colors.

sk_expt <- set_expt_conditions(sk_expt, fact="media") %>%
  set_expt_batches(fact="bioreplicate")

rph_expt <- subset_expt(sk_expt, subset="media=='LB'") %>%
  set_expt_conditions(fact="strains") %>%
  sanitize_expt_metadata()
## subset_expt(): There were 45, now there are 9 samples.
rph_written <- write_expt(rph_expt,
                          excel=glue::glue("excel/rph_data-v{ver}.xlsx"))
## Deleting the file excel/rph_data-v20220630.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## 
## Total:43 s
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## 
## Total:40 s

6 Alginate induce vs. reduce

Before writing out the alginate data, let us take a moment to separate the factors for inducing and reducing expression

alginate_expt <- subset_expt(sk_expt, subset="media!='LB'") %>%
  set_expt_conditions(fact="media")
## subset_expt(): There were 45, now there are 36 samples.
conditions <- pData(alginate_expt)[["condition"]]
induce_bool <- grepl(pattern="IPTG", x=conditions)
induce_factor <- rep("not_induced", length(induce_bool))
induce_factor[induce_bool] <- "IPTG"

reduce_bool <- grepl(pattern="Eb", x=conditions)
reduce_factor <- rep("not_reduced", length(reduce_bool))
eb_samples <-  grepl(pattern="Eb", x=conditions)
reduce_factor[eb_samples] <- "Eb"
ebo_samples <- grepl(pattern="EbO", x=conditions)
reduce_factor[ebo_samples] <- "EbO"
ebs_samples <- grepl(pattern="EbS", x=conditions)
reduce_factor[ebs_samples] <- "EbS"

media_factor <- conditions
lb_bool <- grepl(pattern="LB", x=conditions)
media_factor[lb_bool] <- "LB"
media_factor[!lb_bool] <- "SCFM"

pData(alginate_expt)[["media_add"]] <- pData(alginate_expt)[["media"]]
pData(alginate_expt)[["media"]] <- media_factor
pData(alginate_expt)[["induced"]] <- as.factor(induce_factor)
pData(alginate_expt)[["reduced"]] <- as.factor(reduce_factor)

induce_reduce_factor <- paste0(induce_factor, "_", reduce_factor)
pData(alginate_expt)[["induce_reduce"]] <- induce_reduce_factor

media_induce_reduce_factor <- paste0(media_factor, "_", induce_reduce_factor)

alginate_expt <- set_expt_conditions(alginate_expt, fact=media_induce_reduce_factor) %>%
  set_expt_batches(fact="bioreplicate")

reduce_bool <- reduce_factor
reduce_idx <- grepl(pattern="Eb", x=reduce_bool)
reduce_bool[reduce_idx] <- "reduced"
pData(alginate_expt)[["reduced_bool"]] <- reduce_bool

7 Write the new alginate data

Now we have rewritten the alginate metadata to have separate factors for if/when samples were induced via IPTG and if/when the were reduced via Eb*, and the media factor is only Lb vs. SCFM.

alginate_written <- write_expt(alginate_expt,
                               excel=glue::glue("excel/alginate_data-v{ver}.xlsx"))
## Deleting the file excel/alginate_data-v20220630.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## 
## Total:39 s
## Error in solve.default(t(mod) %*% mod) : 
##   Lapack routine dgesv: system is exactly singular: U[18,18] = 0
## Error in solve.default(t(mod) %*% mod) : 
##   Lapack routine dgesv: system is exactly singular: U[18,18] = 0
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## 
## Total:42 s

8 rph data

Lets take a moment to look at and perform some simple analyses of the rph data set first. One thing to note, my sanitizer function removed the delta from the condition, perhaps I should have it replace delta with ‘d’ or something? In any event, the deletion strain is more similar to wild-type than the transposon mutant – by a lot.

rph_norm <- normalize_expt(rph_expt, transform="log2",
                           convert="cpm", norm="quant", filter=TRUE)
## Removing 211 low-count genes (5768 remaining).
## transform_counts: Found 220 values equal to 0, adding 1 to the matrix.
plot_pca(rph_norm)$plot

plot_boxplot(rph_expt)
## 920 entries are 0.  We are on a log scale, adding 1 to the data.

8.1 Differential expression

The data is clear enough that I do not think it warrants any intrusive things like sva, I will just include bioreplicate in the statistical model.

rph_de <- all_pairwise(rph_expt, model_batch=TRUE, filter=TRUE)
## Using limma's removeBatchEffect to visualize with(out) batch inclusion.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.

## I suspect that only the first two contrasts will be of significant interest.
interesting <- list(
    "deletion_vs_wt" = c("pa14rph", "pa14wt"),
    "trans_vs_wt" = c("pa14tnrph", "pa14wt"),
    "deletion_vs_trans" = c("pa14rph", "pa14tnrph"))
rph_tables <- combine_de_tables(
    rph_de, keepers=interesting,
    excel=glue::glue("excel/rph_tables-v{ver}.xlsx"))
## Deleting the file excel/rph_tables-v20220630.xlsx before writing the tables.
rph_sig <- extract_significant_genes(
    rph_tables,
    excel=glue::glue("excel/rph_sig-v{ver}.xlsx"))
## Deleting the file excel/rph_sig-v20220630.xlsx before writing the tables.

8.2 Ontology

Let us see if there are some groups of genes of interest as identified by GO.

deletion_ups <- rph_sig[["deseq"]][["ups"]][[1]]
rownames(deletion_ups) <- make.names(deletion_ups[["namex"]], unique=TRUE)
deletion_downs <- rph_sig[["deseq"]][["downs"]][[1]]
rownames(deletion_downs) <- make.names(deletion_downs[["namex"]], unique=TRUE)

trans_ups <- rph_sig[["deseq"]][["ups"]][[2]]
rownames(trans_ups) <- make.names(trans_ups[["namex"]], unique=TRUE)
trans_downs <- rph_sig[["deseq"]][["downs"]][[2]]
rownames(trans_downs) <- make.names(trans_downs[["namex"]], unique=TRUE)

## The go data from microbesonline is keyed by the gene name
## (e.g. dnaA), not gene ID or PA id or whatever.

deletion_up_goseq <- simple_goseq(deletion_ups, go_db=pa14_go, length_db=pa14_length,
                                  excel=glue::glue("excel/deletion_up_goseq-v{ver}.xlsx"))
## Found 62 go_db genes and 62 length_db genes out of 210.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## Deleting the file excel/deletion_up_goseq-v20220630.xlsx before writing the tables.
## Writing a sheet containing the legend.
## Writing the BP data.
## Writing the MF data.
## Writing the CC data.
## Finished writing excel file.
deletion_up_goseq[["pvalue_plots"]][["bpp_plot_over"]]

deletion_up_goseq[["pvalue_plots"]][["mfp_plot_over"]]

deletion_down_goseq <- simple_goseq(deletion_downs, go_db=pa14_go, length_db=pa14_length,
                                    excel=glue::glue("excel/deletion_down_goseq-v{ver}.xlsx"))
## Found 34 go_db genes and 34 length_db genes out of 79.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## Deleting the file excel/deletion_down_goseq-v20220630.xlsx before writing the tables.
## Writing a sheet containing the legend.
## Writing the BP data.
## Writing the MF data.
## Writing the CC data.
## Finished writing excel file.
deletion_down_goseq[["pvalue_plots"]][["bpp_plot_over"]]

deletion_down_goseq[["pvalue_plots"]][["mfp_plot_over"]]

trans_up_goseq <- simple_goseq(trans_ups, go_db=pa14_go, length_db=pa14_length,
                               excel=glue::glue("excel/trans_up_goseq-v{ver}.xlsx"))
## Found 332 go_db genes and 332 length_db genes out of 913.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## Deleting the file excel/trans_up_goseq-v20220630.xlsx before writing the tables.
## Writing a sheet containing the legend.
## Writing the BP data.
## Writing the MF data.
## Writing the CC data.
## Finished writing excel file.
trans_down_goseq <- simple_goseq(trans_downs, go_db=pa14_go, length_db=pa14_length,
                                 excel=glue::glue("excel/trans_down_goseq-v{ver}.xlsx"))
## Found 285 go_db genes and 285 length_db genes out of 777.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## Deleting the file excel/trans_down_goseq-v20220630.xlsx before writing the tables.
## Writing a sheet containing the legend.
## Writing the BP data.
## Writing the MF data.
## Writing the CC data.
## Finished writing excel file.

8.3 Circos

As a final quick thing, lets plot the genes via circos.

pa14_annot[["chromosome"]] <- "Pseudomonas_aeruginosa_UCBPP_PA14"
rph_delta_df <- rph_tables[["data"]][["deletion_vs_wt"]][, c("deseq_logfc", "deseq_adjp")]
rph_trans_df <- rph_tables[["data"]][["trans_vs_wt"]][, c("deseq_logfc", "deseq_adjp")]

rph_cfg <- circos_prefix(pa14_annot, name="rph", cog_column = "COGFun",
                            start_column="start.x", end_column="end", strand_column="strand.x",
                            chr_column="chromosome", id_column="gene_id")
## This assumes you have a colors.conf in circos/colors/ and fonts.conf in circos/fonts/
## It also assumes you have conf/ideogram.conf, conf/ticks.conf, and conf/housekeeping.conf
## It will write circos/conf/rph.conf with a reasonable first approximation config file.
## Wrote karyotype to circos/conf/ideograms/rph.conf
## This should match the ideogram= line in rph.conf
## Wrote ticks to circos/conf/ticks_rph.conf
rph_kary <- circos_karyotype(rph_cfg, fasta="reference/paeruginosa_pa14.fasta")
## Wrote karyotype to circos/conf/karyotypes/rph.conf
## This should match the karyotype= line in rph.conf
rph_plus_minus <- circos_plus_minus(rph_cfg, width=0.06, thickness=40)
## Writing data file: circos/data/rph_plus_go.txt with the + strand GO data.
## Writing data file: circos/data/rph_minus_go.txt with the - strand GO data.
## Wrote the +/- config files.  Appending their inclusion to the master file.
## Returning the inner width: 0.88.  Use it as the outer for the next ring.
rph_delta_hist <- circos_hist(rph_cfg, rph_delta_df, colname="deseq_logfc",
                              basename="delta", outer=rph_plus_minus)
## Assuming the input is a dataframe.
## Writing data file: circos/data/rph_deltadeseq_logfc_hist.txt with the deltadeseq_logfc column.
## Returning the inner width: 0.8.  Use it as the outer for the next ring.
rph_tn_hist <- circos_hist(rph_cfg, rph_trans_df, colname="deseq_logfc",
                           basename="trans", outer=rph_delta_hist)
## Assuming the input is a dataframe.
## Writing data file: circos/data/rph_transdeseq_logfc_hist.txt with the transdeseq_logfc column.
## Returning the inner width: 0.72.  Use it as the outer for the next ring.
rph_finish <- circos_suffix(rph_cfg)
rph_made <- circos_make(rph_cfg, target="rph")

9 Examine the algU data

There are two media, an inducer, and multiple methods of reducing expression…

9.1 Initial visualization

alg_norm <- normalize_expt(alginate_expt, transform="log2",
                           convert="cpm", norm="quant", filter=TRUE)
## Removing 50 low-count genes (5929 remaining).
## transform_counts: Found 199 values equal to 0, adding 1 to the matrix.
plot_pca(alg_norm)$plot
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

I perceive a few different groups, which if broadly read from left to right include:

  1. synthetic sputum, neither IPTG nor Eb*. This is the most ‘divergent’ group of all.
  2. LB, neither IPTG nor Eb*. These are closest to the top-left SCFM samples.
  3. Adjacent to #2 are the LB+Eb0 and LB+Eb: E.g. the reduced LB samples.
  4. Below #3 are the induced SCFM samples on the left.
  5. Immediately to the right are the various SCFM samples with both induction and reduction (IPTG + Eb*).
  6. The final large group is all LB, the left most group is induced.
  7. The last clustered LB group on the farthest right is comprised of all the LB samples with both induction and reduction (IPTG + Eb*).

Given this, there are a few ways to compare the data. The following three groups describe the easily measurable comparisons in order of difference (thus #1 is a huge difference, #3 is small):

  1. There is a profound difference between LB and SCFM samples.
  2. In both of the media types, there is a palpable difference between the no-additives and everything else (IPTG and/or Eb*).
  3. Among the samples where the various additives were applied, there is a palpable difference between when IPTG was added vs. IPTG and Eb*.

9.2 Differential expression

I spoke with Theresa and we agreed that a likely good statistical model is: ~ media + induce:reduce which is to say we will attempt an interaction comparison of induce vs. reduce controlling for media.

For simplicity sake, I will also do the simple comparisons among the conditions as they are delineated in the above PCA plot.

9.2.1 All Pairwise

I will start with the simpler task:

interesting <- list(
    "scfm_lb" = c("SCFMnotinducednotreduced", "LBnotinducednotreduced"),
    "lb_lbinduce" = c("LBnotinducednotreduced", "LBIPTGnotreduced"),
    "scfm_scfminduce" = c("SCFMnotinducednotreduced", "SCFMIPTGnotreduced"),
    "lb_lbcompound" = c("LBnotinducednotreduced", "LBnotinducedEb"),
    "iptg_iptgeb" = c("LBIPTGnotreduced", "LBIPTGEb"),
    "iptg_iptgebo" = c("LBIPTGnotreduced", "LBIPTGEbO"),
    "iptg_iptgebs" = c("LBIPTGnotreduced", "LBIPTGEbS"),
    "iptgeb_iptgebo" = c("LBIPTGEb", "LBIPTGEbO"),
    "iptgeb_iptgebs" = c("LBIPTGEb", "LBIPTGEbS"),
    "scfmiptg_lbiptg" = c("SCFMIPTGnotreduced", "LBIPTGnotreduced"),
    "scfmdmso_scfmiptg" = c("SCFMnotinducednotreduced", "SCFMIPTGnotreduced"),
    "scfmiptg_scfmiptgeb" = c("SCFMIPTGnotreduced", "SCFMIPTGEb"),
    "scfmiptg_scfmiptgebo" = c("SCFMIPTGnotreduced", "SCFMIPTGEbO"),
    "scfmiptg_scfmiptgebs" = c("SCFMIPTGnotreduced", "SCFMIPTGEbS"))

alg_de_nobatch <- all_pairwise(alginate_expt, model_batch=FALSE)
## Finished running DE analyses, collecting outputs.
## Comparing analyses.

alg_nobatch_tables <- combine_de_tables(
    alg_de_nobatch, keepers=interesting,
    excel=glue::glue("excel/alg_tables_nobatch-v{ver}.xlsx"))
## Deleting the file excel/alg_tables_nobatch-v20220630.xlsx before writing the tables.
alg_nobatch_sig <- extract_significant_genes(
    alg_nobatch_tables,
    excel = glue::glue("excel/alg_sig_nobatch-v{ver}.xlsx"))
## Deleting the file excel/alg_sig_nobatch-v20220630.xlsx before writing the tables.
alg_de <- all_pairwise(alginate_expt, model_batch=TRUE)
## Using limma's removeBatchEffect to visualize with(out) batch inclusion.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.

alg_tables <- combine_de_tables(
    alg_de, keepers=interesting,
    excel=glue::glue("excel/alg_tables_batch-v{ver}.xlsx"))
## Deleting the file excel/alg_tables_batch-v20220630.xlsx before writing the tables.
alg_sig <- extract_significant_genes(
    alg_tables,
    excel=glue::glue("excel/alg_sig_batch-v{ver}.xlsx"))
## Deleting the file excel/alg_sig_batch-v20220630.xlsx before writing the tables.

9.2.2 Interaction Model

There is an important caveat for comparing this data and thinking about an interaction model: In the LB, we have a full rank matrix of LB+nothing, LB+inducer, LB+reducers, and LB+inducer+reducers. In the SCFM, we do not, we have only SCFM+nothing, SCFM+inducer, SCFM+inducer+reducers – but no SCFM+reducers. As a result, we do not have the power to do some of the most interesting stuff, e.g. media+inducer:reducers.

If we take away the SCFM samples, we can do inducer:reducers, but that is much less exciting.

pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
---
title: "Some Pseudomonas RNAseq data: SK."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
</style>

```{r options, include=FALSE}
library("hpgltools")
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(width=120,
                     progress=TRUE,
                     verbose=TRUE,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
rundate <- format(Sys.Date(), format="%Y%m%d")
previous_file <- ""
ver <- format(Sys.Date(), "%Y%m%d")

##tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "index_sk_202206.Rmd"
```

# Introduction

This dataset contains data from two experiments:

1.  'rph': This appears to me to be a reasonably simple comparison of
    a 'normal' PA14 lab strain and two methods to knock out the rph
    gene ribonuclease PH; "Phosphorolytic 3' to 5' exoribonuclease"
    which helps tRNA 3' end maturation.  It removes residues after the
    CCA end, and also sometimes adds nucleotides using nucleoside
    diphosphates as input.  Potentially helps degrade 16S when
    starving.
2.  'alginate': This is a very complex experiment which uses a single
    strain (PA14 with a plasmid containing algU).  AlgU regulates the
    rsmA promoter which is important for virulence and has quite a lot
    of literature surrounding its function.  I am guessing this is a
    knockout strain for algU?  Either way, this strain was grown
    either in LB or SCFM (synthetic cystic fibrosis sputum media) with
    one or more supplements: DMSO (control), or IPTG (induce), and/or
    Eb/EbS/EbO (reducers).

# Annotations

I am only now checking out pseudomonas.com.  At first glance I am not
seeing a good way to programmatically download the annotation data,
but there is a nice catalog of files to download.  In addition, it
appears they do a good job of ensuring that all of the strain
annotations are at NCBI as genbank files; so I should not have
significant troubles if I want to get these annotations.

With that in mind, I am using two data sources for annotations, the
gff file I extracted from the NCBI reference genome and copied to the
local 'reference/' directory (thus I assume it is identical to what is
at pseudomonas.com), and annotations from microbesonline.org.

```{r annotation}
pa14_gff <- load_gff_annotations("reference/paeruginosa_pa14.gff", id_col="gene_id")
rownames(pa14_gff) <- pa14_gff[["gene_id"]]
## The Alias column has PA14_00010

pa14_microbes <- as.data.frame(load_microbesonline_annotations("PA14"))
## The sysName column has PA14_0010

pa14_annot <- merge(pa14_gff, pa14_microbes, by.x="Alias", by.y="sysName")
rownames(pa14_annot) <- pa14_annot[["gene_id"]]

## The identifiers are a bit odd, so we need to do a little work
pa14_length <- pa14_annot[, c("gene_id", "width", "Alias")]
pa14_go <- load_microbesonline_go(species="PA14", id_column="sysName")
pa14_go_length <- merge(pa14_go, pa14_length, by.x="sysName", by.y="Alias")
pa14_go <- pa14_go_length[, c("sysName", "GO")]
colnames(pa14_go) <- c("ID", "GO")
pa14_length <- pa14_go_length[, c("sysName", "width")]
##pa14_length_ids <- unique(pa14_length)[["gene_id"]]
##pa14_length <- pa14_length[pa14_length_ids, ]
rownames(pa14_length) <- make.names(pa14_length[[1]], unique=TRUE)
colnames(pa14_length) <- c("ID", "width")
```

# Make the expressionset

Previously, I loaded all samples from both groups of experiments
together and separated them.  I decided in this iteration to separate
them in the hopes of avoiding any potential confusion.

The following line reads in the sample sheet and creates a data
structure which contains all of the sample annotations, the gene
annotations, and the counts/gene.

```{r expressionset}
sk_expt <- create_expt("sample_sheets/all_samples_sk_202206.xlsx",
                       gene_info=pa14_annot,
                       file_column="hisatcounttablereverse")
```

# Quick view of the data

The first view of the data is to see how many counts/sample were
observed and to get a sense of how many genes were observed:

The range of reads/sample goes from ~ 3.1 to 5.9 million, which seems
quite reasonable to me.

The range of genes observed goes from ~ 5,925 to 5,980.  This is a
beautifully tight range.

```{r metrics}
pa14_libsize <- plot_libsize(sk_expt)
pa14_libsize$plot

pa14_nonzero <- plot_nonzero(sk_expt)
pa14_nonzero$plot
```

# Separate the experiments and write out the data

I suspect that SooKyoung might enjoy poking at the counts in excel.
Before writing it out though let us set the conditions and batches so
the excel output has some pretty colors.

```{r write_expt}
sk_expt <- set_expt_conditions(sk_expt, fact="media") %>%
  set_expt_batches(fact="bioreplicate")

rph_expt <- subset_expt(sk_expt, subset="media=='LB'") %>%
  set_expt_conditions(fact="strains") %>%
  sanitize_expt_metadata()

rph_written <- write_expt(rph_expt,
                          excel=glue::glue("excel/rph_data-v{ver}.xlsx"))
```

# Alginate induce vs. reduce

Before writing out the alginate data, let us take a moment to separate
the factors for inducing and reducing expression

```{r alginate_reduce_induce}
alginate_expt <- subset_expt(sk_expt, subset="media!='LB'") %>%
  set_expt_conditions(fact="media")

conditions <- pData(alginate_expt)[["condition"]]
induce_bool <- grepl(pattern="IPTG", x=conditions)
induce_factor <- rep("not_induced", length(induce_bool))
induce_factor[induce_bool] <- "IPTG"

reduce_bool <- grepl(pattern="Eb", x=conditions)
reduce_factor <- rep("not_reduced", length(reduce_bool))
eb_samples <-  grepl(pattern="Eb", x=conditions)
reduce_factor[eb_samples] <- "Eb"
ebo_samples <- grepl(pattern="EbO", x=conditions)
reduce_factor[ebo_samples] <- "EbO"
ebs_samples <- grepl(pattern="EbS", x=conditions)
reduce_factor[ebs_samples] <- "EbS"

media_factor <- conditions
lb_bool <- grepl(pattern="LB", x=conditions)
media_factor[lb_bool] <- "LB"
media_factor[!lb_bool] <- "SCFM"

pData(alginate_expt)[["media_add"]] <- pData(alginate_expt)[["media"]]
pData(alginate_expt)[["media"]] <- media_factor
pData(alginate_expt)[["induced"]] <- as.factor(induce_factor)
pData(alginate_expt)[["reduced"]] <- as.factor(reduce_factor)

induce_reduce_factor <- paste0(induce_factor, "_", reduce_factor)
pData(alginate_expt)[["induce_reduce"]] <- induce_reduce_factor

media_induce_reduce_factor <- paste0(media_factor, "_", induce_reduce_factor)

alginate_expt <- set_expt_conditions(alginate_expt, fact=media_induce_reduce_factor) %>%
  set_expt_batches(fact="bioreplicate")

reduce_bool <- reduce_factor
reduce_idx <- grepl(pattern="Eb", x=reduce_bool)
reduce_bool[reduce_idx] <- "reduced"
pData(alginate_expt)[["reduced_bool"]] <- reduce_bool
```

# Write the new alginate data

Now we have rewritten the alginate metadata to have separate factors
for if/when samples were induced via IPTG and if/when the were reduced
via Eb*, and the media factor is only Lb vs. SCFM.

```{r write_alginate}
alginate_written <- write_expt(alginate_expt,
                               excel=glue::glue("excel/alginate_data-v{ver}.xlsx"))
```

# rph data

Lets take a moment to look at and perform some simple analyses of the
rph data set first.  One thing to note, my sanitizer function removed
the delta from the condition, perhaps I should have it replace delta
with 'd' or something?  In any event, the deletion strain is more
similar to wild-type than the transposon mutant -- by a lot.

```{r rph}
rph_norm <- normalize_expt(rph_expt, transform="log2",
                           convert="cpm", norm="quant", filter=TRUE)
plot_pca(rph_norm)$plot

plot_boxplot(rph_expt)
```

## Differential expression

The data is clear enough that I do not think it warrants any intrusive
things like sva, I will just include bioreplicate in the statistical
model.

```{r rph_de}
rph_de <- all_pairwise(rph_expt, model_batch=TRUE, filter=TRUE)
## I suspect that only the first two contrasts will be of significant interest.
interesting <- list(
    "deletion_vs_wt" = c("pa14rph", "pa14wt"),
    "trans_vs_wt" = c("pa14tnrph", "pa14wt"),
    "deletion_vs_trans" = c("pa14rph", "pa14tnrph"))
rph_tables <- combine_de_tables(
    rph_de, keepers=interesting,
    excel=glue::glue("excel/rph_tables-v{ver}.xlsx"))
rph_sig <- extract_significant_genes(
    rph_tables,
    excel=glue::glue("excel/rph_sig-v{ver}.xlsx"))
```

## Ontology

Let us see if there are some groups of genes of interest as identified
by GO.

```{r go_rph}
deletion_ups <- rph_sig[["deseq"]][["ups"]][[1]]
rownames(deletion_ups) <- make.names(deletion_ups[["namex"]], unique=TRUE)
deletion_downs <- rph_sig[["deseq"]][["downs"]][[1]]
rownames(deletion_downs) <- make.names(deletion_downs[["namex"]], unique=TRUE)

trans_ups <- rph_sig[["deseq"]][["ups"]][[2]]
rownames(trans_ups) <- make.names(trans_ups[["namex"]], unique=TRUE)
trans_downs <- rph_sig[["deseq"]][["downs"]][[2]]
rownames(trans_downs) <- make.names(trans_downs[["namex"]], unique=TRUE)

## The go data from microbesonline is keyed by the gene name
## (e.g. dnaA), not gene ID or PA id or whatever.

deletion_up_goseq <- simple_goseq(deletion_ups, go_db=pa14_go, length_db=pa14_length,
                                  excel=glue::glue("excel/deletion_up_goseq-v{ver}.xlsx"))
deletion_up_goseq[["pvalue_plots"]][["bpp_plot_over"]]
deletion_up_goseq[["pvalue_plots"]][["mfp_plot_over"]]

deletion_down_goseq <- simple_goseq(deletion_downs, go_db=pa14_go, length_db=pa14_length,
                                    excel=glue::glue("excel/deletion_down_goseq-v{ver}.xlsx"))
deletion_down_goseq[["pvalue_plots"]][["bpp_plot_over"]]
deletion_down_goseq[["pvalue_plots"]][["mfp_plot_over"]]

trans_up_goseq <- simple_goseq(trans_ups, go_db=pa14_go, length_db=pa14_length,
                               excel=glue::glue("excel/trans_up_goseq-v{ver}.xlsx"))
trans_down_goseq <- simple_goseq(trans_downs, go_db=pa14_go, length_db=pa14_length,
                                 excel=glue::glue("excel/trans_down_goseq-v{ver}.xlsx"))
```

## Circos

As a final quick thing, lets plot the genes via circos.

```{r circos_fun}
pa14_annot[["chromosome"]] <- "Pseudomonas_aeruginosa_UCBPP_PA14"
rph_delta_df <- rph_tables[["data"]][["deletion_vs_wt"]][, c("deseq_logfc", "deseq_adjp")]
rph_trans_df <- rph_tables[["data"]][["trans_vs_wt"]][, c("deseq_logfc", "deseq_adjp")]

rph_cfg <- circos_prefix(pa14_annot, name="rph", cog_column = "COGFun",
                            start_column="start.x", end_column="end", strand_column="strand.x",
                            chr_column="chromosome", id_column="gene_id")
rph_kary <- circos_karyotype(rph_cfg, fasta="reference/paeruginosa_pa14.fasta")
rph_plus_minus <- circos_plus_minus(rph_cfg, width=0.06, thickness=40)
rph_delta_hist <- circos_hist(rph_cfg, rph_delta_df, colname="deseq_logfc",
                              basename="delta", outer=rph_plus_minus)
rph_tn_hist <- circos_hist(rph_cfg, rph_trans_df, colname="deseq_logfc",
                           basename="trans", outer=rph_delta_hist)
rph_finish <- circos_suffix(rph_cfg)
rph_made <- circos_make(rph_cfg, target="rph")
```

# Examine the algU data

There are two media, an inducer, and multiple methods of reducing expression...

## Initial visualization

```{r alginate02}
alg_norm <- normalize_expt(alginate_expt, transform="log2",
                           convert="cpm", norm="quant", filter=TRUE)
plot_pca(alg_norm)$plot
```

I perceive a few different groups, which if broadly read from left to
right include:

1.  synthetic sputum, neither IPTG nor Eb*.  This is the most
    'divergent' group of all.
2.  LB, neither IPTG nor Eb*.  These are closest to the top-left SCFM
    samples.
3.  Adjacent to #2 are the LB+Eb0 and LB+Eb:  E.g. the reduced LB
    samples.
4.  Below #3 are the induced SCFM samples on the left.
5.  Immediately to the right are the various SCFM samples with both
    induction and reduction (IPTG + Eb*).
6.  The final large group is all LB, the left most group is induced.
7.  The last clustered LB group on the farthest right is comprised of
    all the LB samples with both induction and reduction (IPTG + Eb*).

Given this, there are a few ways to compare the data.  The following
three groups describe the easily measurable comparisons in order of
difference (thus #1 is a huge difference, #3 is small):

1.  There is a profound difference between LB and SCFM samples.
2.  In both of the media types, there is a palpable difference between
    the no-additives and everything else (IPTG and/or Eb*).
3.  Among the samples where the various additives were applied, there
    is a palpable difference between when IPTG was added vs. IPTG
    _and_ Eb*.

## Differential expression

I spoke with Theresa and we agreed that a likely good statistical
model is: ~ media + induce:reduce which is to say we will attempt an
interaction comparison of induce vs. reduce controlling for media.

For simplicity sake, I will also do the simple comparisons among the
conditions as they are delineated in the above PCA plot.

### All Pairwise

I will start with the simpler task:

```{r simple_de}
interesting <- list(
    "scfm_lb" = c("SCFMnotinducednotreduced", "LBnotinducednotreduced"),
    "lb_lbinduce" = c("LBnotinducednotreduced", "LBIPTGnotreduced"),
    "scfm_scfminduce" = c("SCFMnotinducednotreduced", "SCFMIPTGnotreduced"),
    "lb_lbcompound" = c("LBnotinducednotreduced", "LBnotinducedEb"),
    "iptg_iptgeb" = c("LBIPTGnotreduced", "LBIPTGEb"),
    "iptg_iptgebo" = c("LBIPTGnotreduced", "LBIPTGEbO"),
    "iptg_iptgebs" = c("LBIPTGnotreduced", "LBIPTGEbS"),
    "iptgeb_iptgebo" = c("LBIPTGEb", "LBIPTGEbO"),
    "iptgeb_iptgebs" = c("LBIPTGEb", "LBIPTGEbS"),
    "scfmiptg_lbiptg" = c("SCFMIPTGnotreduced", "LBIPTGnotreduced"),
    "scfmdmso_scfmiptg" = c("SCFMnotinducednotreduced", "SCFMIPTGnotreduced"),
    "scfmiptg_scfmiptgeb" = c("SCFMIPTGnotreduced", "SCFMIPTGEb"),
    "scfmiptg_scfmiptgebo" = c("SCFMIPTGnotreduced", "SCFMIPTGEbO"),
    "scfmiptg_scfmiptgebs" = c("SCFMIPTGnotreduced", "SCFMIPTGEbS"))

alg_de_nobatch <- all_pairwise(alginate_expt, model_batch=FALSE)

alg_nobatch_tables <- combine_de_tables(
    alg_de_nobatch, keepers=interesting,
    excel=glue::glue("excel/alg_tables_nobatch-v{ver}.xlsx"))
alg_nobatch_sig <- extract_significant_genes(
    alg_nobatch_tables,
    excel = glue::glue("excel/alg_sig_nobatch-v{ver}.xlsx"))

alg_de <- all_pairwise(alginate_expt, model_batch=TRUE)
alg_tables <- combine_de_tables(
    alg_de, keepers=interesting,
    excel=glue::glue("excel/alg_tables_batch-v{ver}.xlsx"))
alg_sig <- extract_significant_genes(
    alg_tables,
    excel=glue::glue("excel/alg_sig_batch-v{ver}.xlsx"))
```

### Interaction Model

There is an important caveat for comparing this data and thinking
about an interaction model:  In the LB, we have a full rank matrix of
LB+nothing, LB+inducer, LB+reducers, and LB+inducer+reducers.  In the
SCFM, we do not, we have only SCFM+nothing, SCFM+inducer,
SCFM+inducer+reducers -- but no SCFM+reducers.  As a result, we do not
have the power to do some of the most interesting stuff,
e.g. media+inducer:reducers.

If we take away the SCFM samples, we can do inducer:reducers, but that
is much less exciting.

```{r saveme, eval=FALSE}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
```
