1 Important note

I copied my index_rofA and just performed replace-string on it for the times and strains:

  1. s/revert/complement/g
  2. s/transition/start/g
  3. s/exponential/e20m/g
  4. s/stationary/e60m/g

2 S.pyogenes 5448 pdxR RNASeq version: 202204

2.1 Preprocessing

I used my cyoa tool to process these samples by doing the following:

  1. Copying the data from the sequencer into the directory ‘preprocessing/’
  2. Used a slightly involved shell command to create a directory for each sample and copy the reads for it to the ‘unprocessed/’ subdirectory within it.
  3. invoked the following:
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d ./*)
do
  cd $i
  rm -rf outputs scripts
  cyoa --task pipe --method prnas --species spyogenes_5005 \
       --gff_type gene --gff_tag locus_tag \
       --input $(/bin/ls unprocessed/* | tr '\n' ':' | sed 's/:$//g')
  cd $start
done

The above for loop goes into each sample and does the following:

  1. Trims the data, heavily compresses the outputs.
  2. Runs fastqc
  3. Runs hisat2 using my spyogenes_5005 indices.
  4. Converts the sam alignment to sorted/indexed bam.
  5. Makes a couple of extra copies of it with some filters.
  6. Compresses the aligned/unaligned reads.
  7. Runs htseq-count on the alignments to count reads/gene.

Note the following steps were not actually run because I had a speeling error. But since they are not necessary for the explicitly RNASeq analyses I first want to do, I ignored it. I am curious though to see if there are other mutations in these strains, so I will likely run those portions manually.

  1. Runs freebayes on the alignments to look for variants.
  2. Sorts/compresses the freebayes output.
  3. Does some parsing of the freebayes output and provides some tables about where mutations were observed.

2.2 Collect annotation information

Same two primary annotation sources, the gff file used for mapping/counting, and microbesonline.org. Note that since I moved to just downloading the material from the web interface, I no longer have a handy method to get the taxon ID, so I go there and hunt down the taxId manually.

gff_annot <- load_gff_annotations("reference/spyogenes_5005.gff", type = "CDS")
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = FALSE)
## Returning a df with 19 columns and 1841 rows.
rownames(gff_annot) <- gff_annot[["locus_tag"]]
head(gff_annot)
##               seqnames start  end width strand                 source type
## M5005_Spy0001 CP000017   232 1587  1356      + EMBL/GenBank/SwissProt  CDS
## M5005_Spy0002 CP000017  1742 2878  1137      + EMBL/GenBank/SwissProt  CDS
## M5005_Spy0003 CP000017  2953 3150   198      + EMBL/GenBank/SwissProt  CDS
## M5005_Spy0004 CP000017  3480 4595  1116      + EMBL/GenBank/SwissProt  CDS
## M5005_Spy0005 CP000017  4665 5234   570      + EMBL/GenBank/SwissProt  CDS
## M5005_Spy0006 CP000017  5237 8740  3504      + EMBL/GenBank/SwissProt  CDS
##               score phase codon_start gene     locus_tag  old_locus_tag
## M5005_Spy0001    NA     1           1 dnaA M5005_Spy0001 M5005_Spy_0001
## M5005_Spy0002    NA     1           1 dnaN M5005_Spy0002 M5005_Spy_0002
## M5005_Spy0003    NA     1           1 <NA> M5005_Spy0003 M5005_Spy_0003
## M5005_Spy0004    NA     1           1 <NA> M5005_Spy0004 M5005_Spy_0004
## M5005_Spy0005    NA     1           1  pth M5005_Spy0005 M5005_Spy_0005
## M5005_Spy0006    NA     1           1 trcF M5005_Spy0006 M5005_Spy_0006
##                                                 product protein_id transl_table
## M5005_Spy0001 chromosomal replication initiator protein AAZ50620.1           11
## M5005_Spy0002             DNA polymerase III beta chain AAZ50621.1           11
## M5005_Spy0003                putative cytosolic protein AAZ50622.1           11
## M5005_Spy0004                       GTP-binding protein AAZ50623.1           11
## M5005_Spy0005                   peptidyl-tRNA hydrolase AAZ50624.1           11
## M5005_Spy0006      transcription-repair coupling factor AAZ50625.1           11
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   translation
## M5005_Spy0001                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             MTENEQIFWNRVLELAQSQLKQATYEFFVHDARLLKVDKHIATIYLDQMKELFWEKNLKDVILTAGFEVYNAQISVDYVFEEDLMIEQNQTKINQKPKQQALNSLPTVTSDLNSKYSFENFIQGDENRWAVAASIAVANTPGTTYNPLFIWGGPGLGKTHLLNAIGNSVLLENPNARIKYITAENFINEFVIHIRLDTMDELKEKFRNLDLLLIDDIQSLAKKTLSGTQEEFFNTFNALHNNNKQIVLTSDRTPDHLNDLEDRLVTRFKWGLTVNITPPDFETRVAILTNKIQEYNFIFPQDTIEYLAGQFDSNVRDLEGALKDISLVANFKQIDTITVDIAAEAIRARKQDGPKMTVIPIEEIQAQVGKFYGVTVKEIKATKRTQNIVLARQVAMFLAREMTDNSLPKIGKEFGGRDHSTVLHAYNKIKNMISQDESLRIEIETIKNKIK
## M5005_Spy0002                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MIQFSINRTLFIHALNTTKRAISTKNAIPILSSIKIEVTSTGVTLTGSNGQISIENTIPVSNENAGLLITSPGAILLEASFFINIISSLPDISINVKEIEQHQVVLTSGKSEITLKGKDVDQYPRLQEVSTENPLILKTKLLKSIIAETAFAASLQESRPILTGVHIVLSNHKDFKAVATDSHRMSQRLITLDNTSADFDVVIPSKSLREFSAVFTDDIETVEVFFSPSQILFRSEHISFYTRLLEGNYPDTDRLLMTEFETEVVFNTQSLRHAMERAFLISNATQNGTVKLEITQNHISAHVNSPEVGKVNEDLDIVSQSGSDLTISFNPTYLIESLKAIKSETVKIHFLSPVRPFTLTPGDEEESFIQLITPVRTN
## M5005_Spy0003                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               MYQIGSFVEMKKPHACVIKETGKKANQWKVLRVGADIKIQCTNCQHVIMMSRYDFERKLKKVLQP
## M5005_Spy0004                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             MALTAGIVGLPNVGKSTLFNAITKAGAEAANYPFATIDPNVGMVEVPDERLQKLTELITPKKTVPTTFEFTDIAGIVKGASRGEGLGNKFLANIREIDAIVHVVRAFDDENVMREQGREDAFVDPIADIDTINLELILADLESINKRYARVEKMARTQKDKESVAEFNVLQKIKPVLEDGKSARTIEFTEDEAKVVKGLFLLTTKPVLYVANVDEDKVANPDGIDYVKQIRDFAATENAEVVVISARAEEEISELDDEDKEEFLEAIGLTESGVDKLTRAAYHLLGLGTYFTAGEKEVRAWTFKRGIKAPQAAGIIHSDFERGFIRAVTMSYDDLMTYGSEKAVKEAGRLREEGKEYVVQDGDIMEFRFNV
## M5005_Spy0005                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   MVKMIVGLGNPGSKYEKTKHNIGFMAIDNIVKNLDVTFTDDKNFKAQIGSTFINHEKVYFVKPTTFMNNSGIAVKALLTYYNIDITDLIVIYDDLDMEVSKLRLRSKGSAGGHNGIKSIIAHIGTQEFNRIKVGIGRPLKGMTVINHVMGQFNTEDNIAISLTLDRVVNAVKFYLQENDFEKTMQKFNG
## M5005_Spy0006 MDILELFSQNKKVQSWHSGLTTLGRQLVMGLSGSSKTLAIASAYLDDQKKIVVVTSTQNEVEKLASDLSSLLDEELVFQFFADDVAAAEFIFASMDKALSRIETLQFLRNPKSQGVLIVSLSGLRILLPNPDVFTKSQIQLTVGEDYDSDTLTKQLMTIGYQKVSQVISPGEFSRRGDILDIYEITQELPYRLEFFGDDIDSIRQFHPETQKSFEQLEGIFINPASDLIFEVSDFQRGIEQLEKALQTAQDDKKSYLEDVLAVSKNGFKHKDIRKFQSLFYEKEWSLLDYIPKGTPIFFDDFQKLVDKNARFDLEIANLLTEDLQQGKALSNLNYFTDNYRELRHYKPATFFSNFHKGLGNIKFDQMHQLTQYAMQEFFNQFPLLIDEIKRYQKNQTTVIVQVESQYAYERLEKSFQDYQFRLPLVSANQIVSRESQIVIGAISSGFYFADEKLALITEHEIYHKKIKRRARRSNISNAERLKDYNELAVGDYVVHNVHGIGRFLGIETIQIQGIHRDYVTIQYQNSDRISLPIDQISSLSKYVSADGKEPKINKLNDGRFQKTKQKVARQVEDIADDLLKLYAERSQQKGFSFSPDDDLQRAFDDDFAFVETEDQLRSIKEIKADMESMQPMDRLLVGDVGFGKTEVAMRAAFKAVNDHKQVAVLVPTTVLAQQHYENFKARFENYPVEVDVLSRFRSKKEQAETLERVRKGQIDIIIGTHRLLSKDVVFSDLGLIVIDEEQRFGVKHKETLKELKTKVDVLTLTATPIPRTLHMSMLGIRDLSVIETPPTNRYPVQTYVLENNPGLVREAIIREMDRGGQIFYVYNKVDTIEKKVAELQELVPEASIGFVHGQMSEIQLENTLIDFINGDYDVLVATTIIETGVDISNVNTLFIENADHMGLSTLYQLRGRVGRSNRIAYAYLMYRPDKVLTEVSEKRLEAIKGFTELGSGFKIAMRDLSIRGAGNILGASQSGFIDSVGFEMYSQLLEQAIASKQGKTTVRQKGNTEINLQIDAYLPDDYIADERQKIDIYKRIREIQSREDYLNLQDELMDRFGEYPDQVAYLLEIALLKHYMDNAFAELVERKNNQVIVRFEVTSLTYFLTQDYFEALSKTHLKAKISEHQGKIDIVFDVRHQKDYRILEELMLFGERLSEIKIRKNNSVFK
##               eC_number note
## M5005_Spy0001           <NA>
## M5005_Spy0002   2.7.7.7 <NA>
## M5005_Spy0003           <NA>
## M5005_Spy0004           <NA>
## M5005_Spy0005  3.1.1.29 <NA>
## M5005_Spy0006           <NA>
microbes_annot <- load_microbesonline_annotations(species="5005")
## Found 1 entry.
## Streptococcus pyogenes MGAS5005Firmicutesyes2005-08-25yes101950293653
## The species being downloaded is: Streptococcus pyogenes MGAS5005
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=293653;export=tab
microbes_annot <- as.data.frame(microbes_annot)
rownames(microbes_annot) <- make.names(gsub(x=microbes_annot[["sysName"]],
                                            pattern="Spy_", replacement="Spy"), unique=TRUE)

both_annot <- merge(gff_annot, microbes_annot, by="row.names", all.y=TRUE)
rownames(both_annot) <- both_annot[["Row.names"]]
both_annot[["Row.names"]] <- NULL

2.3 Rerunning my fasta36 mapper across strains

Once upon a time I wrote a little tool to compare closely species/strains. It is essentially a poor-man’s ortholog search. It has been a very long time since last I used it, and it required some modifications in order to work properly. The most notable problem is that the pathnames it uses are too long for fasta36. I shortened them and fixed a couple of old errors and it appears to work again.

single_hits <- readr::read_tsv("single_multi/outputs/fasta_spyogenes_5448_cds_spyogenes_5005_cds/spyogenes_5448_cds_singles.txt")
## Rows: 1340 Columns: 2
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): AKK69518.1, AAZ50620.1:1.000:0
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(single_hits) <- c("from", "to")
single_hits[["to"]] <- gsub(pattern="(^.*?):.*$", replacement="\\1",
                            x=single_hits[["to"]], perl=TRUE)
## The gff annotation has the protein_id column matching this cds entry.

single_both <- merge(mgas_annot, single_hits, by.x="protein_id", by.y="from", all.x=TRUE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': object 'mgas_annot' not found
single_both <- merge(single_both, ref_gff_annot, by.x="to", by.y="protein_id", all.x=TRUE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': object 'single_both' not found
## Now pick up the missing old_locus_tags and put in the 5448 IDs
missing_ids <- is.na(single_both[["old_locus_tag"]])
## Error in eval(expr, envir, enclos): object 'single_both' not found
new_ids <- single_both[missing_ids, "locus_tag.x"]
## Error in eval(expr, envir, enclos): object 'single_both' not found
single_both[missing_ids, "old_locus_tag"] <- new_ids
## Error in eval(expr, envir, enclos): object 'new_ids' not found
single_both <- merge(single_both, microbes_annot, by.x="old_locus_tag", by.y="sysName", all.x=TRUE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': object 'single_both' not found
rownames(single_both) <- make.names(single_both[["gene_id"]], unique=TRUE)
## Error in make.names(single_both[["gene_id"]], unique = TRUE): object 'single_both' not found

2.4 Create expressionSet

Note that I did the following to the sample sheet provided by Dr. McIver:

  1. Changed the dP_R20_4 sample to dP_R20_2 (The original sample name is still there are the original column).
  2. Added columns at the end containing the count table locations.
  3. Added columns ‘short_media’, ‘growth_phase’, ‘genotype’ which hopefully contain the relevant metadata extracted from the sample descriptions.
  4. Added a column ‘Experiment’ which is either ‘rofA’ or ‘pdxR’.
pdxr_expt <- create_expt(metadata = "sample_sheets/all_samples.xlsx",
                        gene_info = both_annot, file_column = "spyogenes5005genecounts") %>%
  subset_expt(subset="experiment=='pdxR'") %>%
  set_expt_conditions(fact="genotype") %>%
  set_expt_batches(fact="growthphase")
## Reading the sample metadata.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## The sample definitions comprises: 72 rows(samples) and 27 columns(metadata fields).
## Matched 1926 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 1926 features and 72 samples.
## subset_expt(): There were 72, now there are 36 samples.
gene_lengths <- as.data.frame(cbind(rownames(fData(pdxr_expt)),
                                    fData(pdxr_expt)[, c("width")]))
colnames(gene_lengths) <- c("ID", "length")

cog_db <- as.data.frame(cbind(rownames(fData(pdxr_expt)),
                              fData(pdxr_expt)[, c("COGFun")]))
colnames(cog_db) <- c("ID", "GO")
undef_idx <- cog_db[["GO"]] == "undefined"
cog_db <- cog_db[!undef_idx, ]
cog_db <- cog_db %>%
  mutate(GO = strsplit(as.character(GO), "")) %>%
  unnest(GO)

go_db <- as.data.frame(cbind(rownames(fData(pdxr_expt)),
                             fData(pdxr_expt)[, c("GO")]))
colnames(go_db) <- c("ID", "GO")
go_db <- go_db %>%
  mutate(GO = strsplit(as.character(GO), ",")) %>%
  unnest(GO)

I have 36 samples to play with, let us see what they look like.

2.5 Poke expressionSet

pdxr_libsize <- plot_libsize(pdxr_expt)
pdxr_libsize$plot

pdxr_filter_plot <- plot_libsize_prepost(pdxr_expt)
pdxr_filter_plot$count_plot

pdxr_filter_plot$lowgene_plot
## Warning: Using alpha for a discrete variable is not advised.

pdxr_nonzero <- plot_nonzero(pdxr_expt)
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
pdxr_nonzero$plot
## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## awesome, there is a range of ~ 10 genes!

written <- write_expt(pdxr_expt, excel = glue::glue("excel/pdxr_expt-v{ver}.xlsx"))
## Deleting the file excel/pdxr_expt-v202204.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## `geom_smooth()` using formula 'y ~ x'
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## 
## Total:12 s
## `geom_smooth()` using formula 'y ~ x'
## 
## Total:12 s

2.6 Quick visualizations

2.6.1 Without considering time

Let us start with some views of the data without thinking about batch effects.

pdxr_norm <- normalize_expt(pdxr_expt, filter=TRUE, norm="quant", convert="cpm", transform="log2")
## Removing 85 low-count genes (1841 remaining).
pdxr_pca <- plot_pca(pdxr_norm)
pdxr_pca$plot

pdxr_heatmap <- plot_disheat(pdxr_norm)
pdxr_heatmap$plot

pdxr_sm <- plot_sm(pdxr_norm)
## Performing correlation.
pdxr_sm$plot

2.6.2 Considering time

Repeat the previous plots, but this time using limma’s batch removal method (which is just a residuals).

pdxr_nb <- normalize_expt(pdxr_expt, filter=TRUE, norm="quant", convert="cpm",
                             transform="log2", batch="limma")
## Removing 85 low-count genes (1841 remaining).
## If you receive a warning: 'NANs produced', one potential reason is that the data was quantile normalized.
## Setting 51 low elements to zero.
## transform_counts: Found 51 values equal to 0, adding 1 to the matrix.
pdxr_nb_pca <- plot_pca(pdxr_nb)
pdxr_nb_pca$plot

pdxr_nb_heatmap <- plot_corheat(pdxr_nb)
pdxr_nb_heatmap$plot

Well, it is pretty clear that time is the dominant factor.

2.7 Differential Expression analyses

I am going to do the DE in 3 separate pieces:

  1. Only compare strains
  2. Only compare times
  3. Compare the concatenation of strains and times.

2.7.1 Strain only comparisons

strain_de <- all_pairwise(pdxr_expt, model_batch=TRUE, filter=TRUE)
## This DE analysis will perform all pairwise comparisons among:
## 
## complement      delta         WT 
##         12         12         12
## This analysis will include a batch factor in the model comprised of:
## 
##  e20m  e60m start 
##    12    12    12
## This will pre-filter the input data using normalize_expt's: TRUE argument.
## Using limma's removeBatchEffect to visualize with(out) batch inclusion.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.

strain_keepers <- list(
    "delta_wt" = c("delta", "WT"),
    "complement_wt" = c("complement", "WT"),
    "delta_complement" = c("delta", "complement"))
strain_tables <- combine_de_tables(
    strain_de, keepers = strain_keepers,
    excel = glue::glue("excel/pdxr_strain_tables-v{ver}.xlsx"))
## Deleting the file excel/pdxr_strain_tables-v202204.xlsx before writing the tables.
strain_sig <- extract_significant_genes(
    strain_tables,
    excel = glue::glue("excel/pdxr_strain_sig-v{ver}.xlsx"))
## Deleting the file excel/pdxr_strain_sig-v202204.xlsx before writing the tables.
## Using p column: limma_adjp.
## Using p column: edger_adjp.
## Using p column: deseq_adjp.
## Using p column: ebseq_adjp.
## Using p column: basic_adjp.

2.7.2 Time only comparisons

pdxr_time <- set_expt_conditions(pdxr_expt, fact="growthphase") %>%
  set_expt_batches(fact="genotype")
time_de <- all_pairwise(pdxr_time, model_batch=TRUE)
## This DE analysis will perform all pairwise comparisons among:
## 
##  e20m  e60m start 
##    12    12    12
## This analysis will include a batch factor in the model comprised of:
## 
## complement      delta         WT 
##         12         12         12
## Using limma's removeBatchEffect to visualize with(out) batch inclusion.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.

time_keepers <- list(
    "e20m_start" = c("e20m", "start"),
    "e60m_e20m" = c("e60m", "e20m"),
    "e60m_start" = c("e60m", "start"))
time_tables <- combine_de_tables(
    time_de, keepers = time_keepers,
    excel = glue::glue("excel/pdxr_time_tables-v{ver}.xlsx"))
## Deleting the file excel/pdxr_time_tables-v202204.xlsx before writing the tables.
time_sig <- extract_significant_genes(
    time_tables,
    excel = glue::glue("excel/pdxr_time_sig-v{ver}.xlsx"))
## Deleting the file excel/pdxr_time_sig-v202204.xlsx before writing the tables.
## Using p column: limma_adjp.
## Using p column: edger_adjp.
## Using p column: deseq_adjp.
## Using p column: ebseq_adjp.
## Using p column: basic_adjp.

2.7.3 Combining the two comparisons

combined_factors <- paste0(pData(pdxr_expt)[["genotype"]], "_",
                           pData(pdxr_expt)[["growthphase"]])
combined_expt <- set_expt_conditions(pdxr_expt, fact=combined_factors)

combined_de <- all_pairwise(combined_expt, model_batch="svaseq", filter=TRUE)

combined_keepers <- list(
    "e20m_delta_vs_wt" = c("deltae20m", "WTe20m"),
    "e20m_delta_vs_complement" = c("deltae20m", "complemente20m"),
    "e60m_delta_vs_wt" = c("deltae60m", "WTe60m"),
    "e60m_delta_vs_complement" = c("deltae60m", "complemente60m"),
    "start_delta_vs_wt" = c("deltastart", "WTstart"),
    "start_delta_vs_complement" = c("deltastart", "complementstart"),
    "WT_e20m_vs_start" = c("WTe20m", "WTstart"),
    "delta_e20m_vs_start" = c("deltae20m", "deltastart"),
    "complement_e20m_vs_start" = c("complemente20m", "complementstart"),
    "WT_e60m_vs_start" = c("WTe60m", "WTstart"),
    "delta_e60m_vs_start" = c("deltae60m", "deltastart"),
    "complement_e60m_vs_start" = c("complemente60m", "complementstart"),
    "WT_e60m_vs_e20m" = c("WTe60m", "WTe20m"),
    "delta_e60m_vs_e20m" = c("deltae60m", "deltae20m"),
    "complement_e60m_vs_e20m" = c("complemente60m", "complemente20m"))

combined_tables <- combine_de_tables(
    combined_de, keepers = combined_keepers,
    excel = glue::glue("excel/pdxr_combined_tables-v{ver}.xlsx"))
combined_sig <- extract_significant_genes(
    combined_tables,
    excel = glue::glue("excel/pdxr_combined_sig-v{ver}.xlsx"))

3 Gene Set Enrichment via COGs

Dr. McIver just reminded me of the plot which is of interest for considering the significance of the various categories of gene functions in the data. I think a potentially good version of this would be to use some of the visualizations from clusterprofiler against the COGs as per:

https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html

In order to get an initial set to test, I will try some quick and dirty goseq with the COG categories.

time_e20start_up <- time_sig[["deseq"]][["ups"]][["e20m_start"]]
e20start_up_goseq_cog <- simple_goseq(time_e20start_up, go_db=cog_db, length_db=gene_lengths,
                                      expand_categories=FALSE)
## Found 104 go_db genes and 170 length_db genes out of 170.
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "qvalue"
## Error in `[.data.frame`(plotting, , c("term", "over_represented_pvalue", : undefined columns selected
e20start_up_goseq_go <- simple_goseq(time_e20start_up, go_db=go_db, length_db=gene_lengths)
## Found 170 go_db genes and 170 length_db genes out of 170.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## The score column is null, defaulting to score.
## Possible columns are:
## [1] "category"                 "over_represented_pvalue" 
## [3] "under_represented_pvalue" "numDEInCat"              
## [5] "numInCat"                 "term"                    
## [7] "ontology"                 "qvalue"
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomText, : object 'test_color' not found
e20start_up_goseq_go$pvalue_plots$mfp_plot_over
## Error in eval(expr, envir, enclos): object 'e20start_up_goseq_go' not found
e20start_up_goseq_go$pvalue_plots$bpp_plot_over
## Error in eval(expr, envir, enclos): object 'e20start_up_goseq_go' not found

4 Make a circos picture

Work on making a pretty picture, there are kind of a lot of potential contrasts, so lets just start with one.

circos_cfg <- circos_prefix(fData(pdxr_expt), name="pdxr", chr_column="seqnames", start_column="start.x",
                            stop_column = "end", strand_column="strand.x", cog_column = "COGFun")
## This assumes you have a colors.conf in circos/colors/ and fonts.conf in circos/fonts/
## It also assumes you have conf/ideogram.conf, conf/ticks.conf, and conf/housekeeping.conf
## It will write circos/conf/pdxr.conf with a reasonable first approximation config file.
## Setting 86 undefined entries to the plus strand.
## Warning in circos_prefix(fData(pdxr_expt), name = "pdxr", chr_column =
## "seqnames", : NAs introduced by coercion

## Warning in circos_prefix(fData(pdxr_expt), name = "pdxr", chr_column =
## "seqnames", : NAs introduced by coercion
## Wrote karyotype to circos/conf/ideograms/pdxr.conf
## This should match the ideogram= line in pdxr.conf
## Wrote ticks to circos/conf/ticks_pdxr.conf
kary <- circos_karyotype(circos_cfg, fasta = "reference/spyogenes_5005.fasta")
## Wrote karyotype to circos/conf/karyotypes/pdxr.conf
## This should match the karyotype= line in pdxr.conf
plus_minus <- circos_plus_minus(circos_cfg, width = 0.06, thickness = 40,
                                url_string = "http://www.microbesonline.org/cgi-bin/fetchLocus.cgi?locus=[id]")
## Writing data file: circos/data/pdxr_plus_go.txt with the + strand GO data.
## Writing data file: circos/data/pdxr_minus_go.txt with the - strand GO data.
## Wrote the +/- config files.  Appending their inclusion to the master file.
## Returning the inner width: 0.88.  Use it as the outer for the next ring.
exp_delta_vs_wt_hist <- circos_hist(circos_cfg, combined_tables, tablename="start_delta_vs_wt",
                                    colname="deseq_logfc", basename="start_deltawt",
                                    outer=plus_minus)
## Error in "combined_de" %in% class(input): object 'combined_tables' not found
cjb_suffix <- circos_suffix(circos_cfg)
made <- circos_make(circos_cfg, target="pdxr")

circos/pdxr.png

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern = "\\.Rmd", replace = "", x = rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename = this_save))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 605cc89b5f1cadea6923b53ac71e234ba0181fe7
## This is hpgltools commit: Wed Aug 10 22:39:40 2022 -0400: 605cc89b5f1cadea6923b53ac71e234ba0181fe7
## Saving to index_pdxR-v202204.rda.xz
---
title: "S.pyogenes 5448 rnaseq comparing pdxR strains across time."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include = FALSE}
library(hpgltools)
library(tidyverse)
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(progress = TRUE,
                     verbose = TRUE,
                     width = 90,
                     echo = TRUE)
knitr::opts_chunk$set(error = TRUE,
                      fig.width = 8,
                      fig.height = 8,
                      dpi = 96)
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 10))
set.seed(1)
ver <- "202204"
rmd_file <- "index_pdxR.Rmd"
```

# Important note

I copied my index_rofA and just performed replace-string on it for the times and strains:

1.  s/revert/complement/g
2.  s/transition/start/g
3.  s/exponential/e20m/g
4.  s/stationary/e60m/g

# S.pyogenes 5448 pdxR RNASeq version: `r ver`

## Preprocessing

I used my cyoa tool to process these samples by doing the following:

1.  Copying the data from the sequencer into the directory 'preprocessing/'
2.  Used a slightly involved shell command to create a directory for
    each sample and copy the reads for it to the 'unprocessed/'
    subdirectory within it.
3.  invoked the following:

```{bash, eval=FALSE}
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d ./*)
do
  cd $i
  rm -rf outputs scripts
  cyoa --task pipe --method prnas --species spyogenes_5005 \
       --gff_type gene --gff_tag locus_tag \
       --input $(/bin/ls unprocessed/* | tr '\n' ':' | sed 's/:$//g')
  cd $start
done
```

The above for loop goes into each sample and does the following:

1.  Trims the data, heavily compresses the outputs.
2.  Runs fastqc
3.  Runs hisat2 using my spyogenes_5005 indices.
4.  Converts the sam alignment to sorted/indexed bam.
5.  Makes a couple of extra copies of it with some filters.
6.  Compresses the aligned/unaligned reads.
7.  Runs htseq-count on the alignments to count reads/gene.

  Note the following steps were not actually run because I had a
  speeling error.  But since they are not necessary for the explicitly
  RNASeq analyses I first want to do, I ignored it.  I am curious
  though to see if there are other mutations in these strains, so I
  will likely run those portions manually.

8.  Runs freebayes on the alignments to look for variants.
9.  Sorts/compresses the freebayes output.
10. Does some parsing of the freebayes output and provides some tables
    about where mutations were observed.

## Collect annotation information

Same two primary annotation sources, the gff file used for mapping/counting,
and microbesonline.org.  Note that since I moved to just downloading the
material from the web interface, I no longer have a handy method to get the
taxon ID, so I go there and hunt down the taxId manually.

```{r annotation}
gff_annot <- load_gff_annotations("reference/spyogenes_5005.gff", type = "CDS")
rownames(gff_annot) <- gff_annot[["locus_tag"]]
head(gff_annot)

microbes_annot <- load_microbesonline_annotations(species="5005")
microbes_annot <- as.data.frame(microbes_annot)
rownames(microbes_annot) <- make.names(gsub(x=microbes_annot[["sysName"]],
                                            pattern="Spy_", replacement="Spy"), unique=TRUE)

both_annot <- merge(gff_annot, microbes_annot, by="row.names", all.y=TRUE)
rownames(both_annot) <- both_annot[["Row.names"]]
both_annot[["Row.names"]] <- NULL
```

## Rerunning my fasta36 mapper across strains

Once upon a time I wrote a little tool to compare closely
species/strains.  It is essentially a poor-man's ortholog search.  It
has been a very long time since last I used it, and it required some
modifications in order to work properly.  The most notable problem is
that the pathnames it uses are too long for fasta36.  I shortened them
and fixed a couple of old errors and it appears to work again.

```{r single_multi}
single_hits <- readr::read_tsv("single_multi/outputs/fasta_spyogenes_5448_cds_spyogenes_5005_cds/spyogenes_5448_cds_singles.txt")
colnames(single_hits) <- c("from", "to")
single_hits[["to"]] <- gsub(pattern="(^.*?):.*$", replacement="\\1",
                            x=single_hits[["to"]], perl=TRUE)
## The gff annotation has the protein_id column matching this cds entry.

single_both <- merge(mgas_annot, single_hits, by.x="protein_id", by.y="from", all.x=TRUE)
single_both <- merge(single_both, ref_gff_annot, by.x="to", by.y="protein_id", all.x=TRUE)

## Now pick up the missing old_locus_tags and put in the 5448 IDs
missing_ids <- is.na(single_both[["old_locus_tag"]])
new_ids <- single_both[missing_ids, "locus_tag.x"]
single_both[missing_ids, "old_locus_tag"] <- new_ids

single_both <- merge(single_both, microbes_annot, by.x="old_locus_tag", by.y="sysName", all.x=TRUE)
rownames(single_both) <- make.names(single_both[["gene_id"]], unique=TRUE)
```

## Create expressionSet

Note that I did the following to the sample sheet provided by
Dr. McIver:

1.  Changed the dP_R20_4 sample to dP_R20_2 (The original sample name
    is still there are the original column).
2.  Added columns at the end containing the count table locations.
3.  Added columns 'short_media', 'growth_phase', 'genotype' which
    hopefully contain the relevant metadata extracted from the sample
    descriptions.
4.  Added a column 'Experiment' which is either 'rofA' or 'pdxR'.

```{r expt}
pdxr_expt <- create_expt(metadata = "sample_sheets/all_samples.xlsx",
                        gene_info = both_annot, file_column = "spyogenes5005genecounts") %>%
  subset_expt(subset="experiment=='pdxR'") %>%
  set_expt_conditions(fact="genotype") %>%
  set_expt_batches(fact="growthphase")
```

```{r goseq_data}
gene_lengths <- as.data.frame(cbind(rownames(fData(pdxr_expt)),
                                    fData(pdxr_expt)[, c("width")]))
colnames(gene_lengths) <- c("ID", "length")

cog_db <- as.data.frame(cbind(rownames(fData(pdxr_expt)),
                              fData(pdxr_expt)[, c("COGFun")]))
colnames(cog_db) <- c("ID", "GO")
undef_idx <- cog_db[["GO"]] == "undefined"
cog_db <- cog_db[!undef_idx, ]
cog_db <- cog_db %>%
  mutate(GO = strsplit(as.character(GO), "")) %>%
  unnest(GO)

go_db <- as.data.frame(cbind(rownames(fData(pdxr_expt)),
                             fData(pdxr_expt)[, c("GO")]))
colnames(go_db) <- c("ID", "GO")
go_db <- go_db %>%
  mutate(GO = strsplit(as.character(GO), ",")) %>%
  unnest(GO)
```


I have 36 samples to play with, let us see what they look like.

## Poke expressionSet

```{r poke}
pdxr_libsize <- plot_libsize(pdxr_expt)
pdxr_libsize$plot

pdxr_filter_plot <- plot_libsize_prepost(pdxr_expt)
pdxr_filter_plot$count_plot
pdxr_filter_plot$lowgene_plot

pdxr_nonzero <- plot_nonzero(pdxr_expt)
pdxr_nonzero$plot
## awesome, there is a range of ~ 10 genes!

written <- write_expt(pdxr_expt, excel = glue::glue("excel/pdxr_expt-v{ver}.xlsx"))
```

## Quick visualizations

### Without considering time

Let us start with some views of the data without thinking about batch effects.

```{r view_nobatch}
pdxr_norm <- normalize_expt(pdxr_expt, filter=TRUE, norm="quant", convert="cpm", transform="log2")
pdxr_pca <- plot_pca(pdxr_norm)
pdxr_pca$plot

pdxr_heatmap <- plot_disheat(pdxr_norm)
pdxr_heatmap$plot

pdxr_sm <- plot_sm(pdxr_norm)
pdxr_sm$plot
```

### Considering time

Repeat the previous plots, but this time using limma's batch removal
method (which is just a residuals).

```{r nb_view}
pdxr_nb <- normalize_expt(pdxr_expt, filter=TRUE, norm="quant", convert="cpm",
                             transform="log2", batch="limma")
pdxr_nb_pca <- plot_pca(pdxr_nb)
pdxr_nb_pca$plot

pdxr_nb_heatmap <- plot_corheat(pdxr_nb)
pdxr_nb_heatmap$plot
```

Well, it is pretty clear that time is the dominant factor.

## Differential Expression analyses

I am going to do the DE in 3 separate pieces:

1.  Only compare strains
2.  Only compare times
3.  Compare the concatenation of strains and times.

### Strain only comparisons

```{r pairwise_strain}
strain_de <- all_pairwise(pdxr_expt, model_batch=TRUE, filter=TRUE)
strain_keepers <- list(
    "delta_wt" = c("delta", "WT"),
    "complement_wt" = c("complement", "WT"),
    "delta_complement" = c("delta", "complement"))
strain_tables <- combine_de_tables(
    strain_de, keepers = strain_keepers,
    excel = glue::glue("excel/pdxr_strain_tables-v{ver}.xlsx"))
strain_sig <- extract_significant_genes(
    strain_tables,
    excel = glue::glue("excel/pdxr_strain_sig-v{ver}.xlsx"))
```

### Time only comparisons

```{r pairwise_time}
pdxr_time <- set_expt_conditions(pdxr_expt, fact="growthphase") %>%
  set_expt_batches(fact="genotype")
time_de <- all_pairwise(pdxr_time, model_batch=TRUE)
time_keepers <- list(
    "e20m_start" = c("e20m", "start"),
    "e60m_e20m" = c("e60m", "e20m"),
    "e60m_start" = c("e60m", "start"))
time_tables <- combine_de_tables(
    time_de, keepers = time_keepers,
    excel = glue::glue("excel/pdxr_time_tables-v{ver}.xlsx"))
time_sig <- extract_significant_genes(
    time_tables,
    excel = glue::glue("excel/pdxr_time_sig-v{ver}.xlsx"))
```

### Combining the two comparisons

```{r combined_de, eval=FALSE}
combined_factors <- paste0(pData(pdxr_expt)[["genotype"]], "_",
                           pData(pdxr_expt)[["growthphase"]])
combined_expt <- set_expt_conditions(pdxr_expt, fact=combined_factors)

combined_de <- all_pairwise(combined_expt, model_batch="svaseq", filter=TRUE)

combined_keepers <- list(
    "e20m_delta_vs_wt" = c("deltae20m", "WTe20m"),
    "e20m_delta_vs_complement" = c("deltae20m", "complemente20m"),
    "e60m_delta_vs_wt" = c("deltae60m", "WTe60m"),
    "e60m_delta_vs_complement" = c("deltae60m", "complemente60m"),
    "start_delta_vs_wt" = c("deltastart", "WTstart"),
    "start_delta_vs_complement" = c("deltastart", "complementstart"),
    "WT_e20m_vs_start" = c("WTe20m", "WTstart"),
    "delta_e20m_vs_start" = c("deltae20m", "deltastart"),
    "complement_e20m_vs_start" = c("complemente20m", "complementstart"),
    "WT_e60m_vs_start" = c("WTe60m", "WTstart"),
    "delta_e60m_vs_start" = c("deltae60m", "deltastart"),
    "complement_e60m_vs_start" = c("complemente60m", "complementstart"),
    "WT_e60m_vs_e20m" = c("WTe60m", "WTe20m"),
    "delta_e60m_vs_e20m" = c("deltae60m", "deltae20m"),
    "complement_e60m_vs_e20m" = c("complemente60m", "complemente20m"))

combined_tables <- combine_de_tables(
    combined_de, keepers = combined_keepers,
    excel = glue::glue("excel/pdxr_combined_tables-v{ver}.xlsx"))
combined_sig <- extract_significant_genes(
    combined_tables,
    excel = glue::glue("excel/pdxr_combined_sig-v{ver}.xlsx"))
```

# Gene Set Enrichment via COGs

Dr. McIver just reminded me of the plot which is of interest for
considering the significance of the various categories of gene
functions in the data.  I think a potentially good version of this
would be to use some of the visualizations from clusterprofiler
against the COGs as per:

https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html

In order to get an initial set to test, I will try some quick and
dirty goseq with the COG categories.

```{r cog_gsea}
time_e20start_up <- time_sig[["deseq"]][["ups"]][["e20m_start"]]
e20start_up_goseq_cog <- simple_goseq(time_e20start_up, go_db=cog_db, length_db=gene_lengths,
                                      expand_categories=FALSE)

e20start_up_goseq_go <- simple_goseq(time_e20start_up, go_db=go_db, length_db=gene_lengths)
e20start_up_goseq_go$pvalue_plots$mfp_plot_over
e20start_up_goseq_go$pvalue_plots$bpp_plot_over
```

# Make a circos picture

Work on making a pretty picture, there are kind of a lot of potential contrasts, so lets just
start with one.

```{r circos_result}
circos_cfg <- circos_prefix(fData(pdxr_expt), name="pdxr", chr_column="seqnames", start_column="start.x",
                            stop_column = "end", strand_column="strand.x", cog_column = "COGFun")
kary <- circos_karyotype(circos_cfg, fasta = "reference/spyogenes_5005.fasta")
plus_minus <- circos_plus_minus(circos_cfg, width = 0.06, thickness = 40,
                                url_string = "http://www.microbesonline.org/cgi-bin/fetchLocus.cgi?locus=[id]")
exp_delta_vs_wt_hist <- circos_hist(circos_cfg, combined_tables, tablename="start_delta_vs_wt",
                                    colname="deseq_logfc", basename="start_deltawt",
                                    outer=plus_minus)
cjb_suffix <- circos_suffix(circos_cfg)
made <- circos_make(circos_cfg, target="pdxr")
```

[circos/pdxr.png](circos/pdxr.png)


```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern = "\\.Rmd", replace = "", x = rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename = this_save))
}
```
