1 Introduction

This project seeks to understand the effects of deleting/complementing pknF in Mtb. I cannot honestly say I know anything about the biology involved except the bits from the review Volker kindly sent me.

pknF is one of the surprisingly large number of serine threonine protein kinases in Mtb. Most bacteria have only a couple/few of these, but Mtb has 11, and uses them with similar frequency to the two component systems in order to respond to its changing environment during infection (I assume, I have not gotten that far in the review yet). In previous experiments performed by people in Volker’s lab, a deletion of this particular STPK causes a much more virulent phenotype during mouse infection.

2 Preprocessing

I wrote a quick sample sheet for these 9 samples and will use my cyoa toy to perform the various tasks to process the data.

2.1 What annotations do I have?

I am assuming the most appropriate annotations come from microbesonline? Let us check real quick. I should probably go back through my old logs to answer these questions.

annot <- load_microbesonline_annotations("CDC1551")
## Found 1 entry.
## Mycobacterium tuberculosis CDC1551Actinobacteriayes2007-05-08yes10429383331
## The species being downloaded is: Mycobacterium tuberculosis CDC1551
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=83331;export=tab
annot <- as.data.frame(annot)
rownames(annot) <- make.names(annot[["sysName"]], unique = TRUE)
go_db <- load_microbesonline_go(id = 83331, id_column = "sysName")
## The species being downloaded is: Mycobacterium tuberculosis CDC1551 and is being downloaded as 83331.tab.
h37_annot <- load_microbesonline_annotations("H37Rv")
## Found 1 entry.
## Mycobacterium tuberculosis H37RvActinobacteriayes2007-05-08yes10404783332
## The species being downloaded is: Mycobacterium tuberculosis H37Rv
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=83332;export=tab
h37_go <- load_microbesonline_go(id = 83332, id_column = "sysName")
## The species being downloaded is: Mycobacterium tuberculosis H37Rv and is being downloaded as 83332.tab.

That looks pretty reasonable to me, it even has COG groups and all the fun stuff I want.

2.2 Which genome am I using?

Before I start, I should recall which is my current reference genome and see if there is a better alternative since last I looked.

My current genome is strain H37Rv last downloaded in 2022, it is derived from accession NC_000962.3 / assembly GCF_000195955.2. Glancing at the gff file I have, I am guessing that the most appropriate tags to index against are either locus_tag(Rv0003) or gene(recF).

cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v Undeter); do
    cd $i
    cyoa --method prnaseq --species mtuberculosis_h37rv \
         --gff_type gene --gff_tag locus_tag \
         --introns 0 --freebayes 1 \
         --input $(/bin/ls *.gz | tr '\n' ':' | sed 's/:$//g')
    cd $start
done

2.3 The correct Strain is CDC1551

Ok, so let us redo this accordingly.

cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v Undeter); do
    cd $i
    cyoa --method prnaseq --species mycobacterium_tuberculosis_cdc1551 \
         --gff_type gene --gff_tag locus_tag \
         --introns 0 --freebayes 1 \
         --input $(/bin/ls *.gz | tr '\n' ':' | sed 's/:$//g')
    cd $start
done

3 Ortholog search of CDC1551/H37Rv

A few minutes ago I was making a couple of notes to myself, one of which was to explicitly note where PknF is in this strain. Sadly, when I looked into the annotations I downloaded it was unnamed. I was able to make a good guess given that I knew the location in H37Rv and saw a group of genes in CDC1551 in basically the same place in a similar orientation. Later, when the DE finished I was able to prove it by showing that MT1788 (position 1962996) is the most decreased in the Δ strain and most increased in the complement.

With that in mind, I was a little sad to not have the full set of H37Rv annotations available and so am spinning up my favorite ortholog search tool ‘orthofinder’. The following block is where I am going to do that.

I first need to extract the amino acid sequences from H37Rv (I have them in place already for CDC1551).

mkdir orthologs
cd orthologs
cp ~/libraries/genome/mtub*.gb .
cp ~/libraries/genome/myco*.faa .
cyoa --method gb2gff --input *.gb
ls *.faa
cyoa --method orthofinder --input mtuberculosis_h37rv.faa:mycobacterium_tuberculosis_cdc1551.faa

For the curious, that last command generates the following on the computing cluster:

orthofinder -f outputs/50orthofinder/input \
  -o outputs/50orthofinder/output \
  2>>outputs/50orthofinder/orthofinder.stderr 1>>outputs/50orthofinder/orthofinder.stdout
mv outputs/50orthofinder/output/Results_Dec06/* outputs/50orthofinder/output
rmdir outputs/50orthofinder/output/Results_Dec06

It writes the input file for me so that I cannot introduce any annoying typeographical errrors[sic]. It also creates a little job that parses the quite confusing outputs from orthofinder into an (hopefully) easy to read table.

My hope is to shortly be able to append all the H37Rv IDs to my annotation set for CDC1551. We shall see, orthofinder takes ~ 10 minutes / 1000 genes.

3.0.1 Read in orthogroup information

The orthologs produced by orthofinder appear to pick up single-groups for 3343 genes, which seems pretty good, I am assuming the remainder of genes are multi-gene families which will also be nicely identified.

single_orthologs <- as.data.frame(readr::read_tsv("orthologs/outputs/50orthofinder/orthogroups_single_named.tsv"))
## New names:
## • `` -> `...4`
## Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 3343 Columns: 4
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): Orthogroup, mtuberculosis_h37rv, ...4
## lgl (1): mycobacterium_tuberculosis_cdc1551
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rownames(single_orthologs) <- single_orthologs[["Orthogroup"]]
single_orthologs <- single_orthologs[, c(2, 4)]
colnames(single_orthologs) <- c("h37rv", "cdc1551")

## There are a bunch of 1:many mappings and many:many mappings it turns out!
## and I need to reformat the text file containing them, supido orthofinder
## added some random characters in the lists of genes which make it weird to parse.
all_orthogroups <- as.data.frame(readr::read_tsv("orthologs/outputs/50orthofinder/orthogroups_all_named.tsv"))
## New names:
## • `` -> `...2`
## Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 3551 Columns: 2
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): Orthogroup, ...2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Ok, so I need to do a little work to clean up the (1|many):(1|many) mappings; until then let us just merge the 1:1 mappings into the CDC1551 annotations.

I think I will need to get the NIH IDs from the genbank or gff file before merging to the microbesonline annotations. An important caveat I did not consider before: the ortholog searches were explicitly on CDS/proteins and my general annotations are not.

h37_cds <- load_gff_annotations("orthologs/mtuberculosis_h37rv_all.gff", type = "CDS")
## Returning a df with 30 columns and 3906 rows.
h37_cds_singles <- merge(h37_cds, single_orthologs, by.x = "protein_id",
                         by.y = "h37rv", all.x = TRUE)

cdc_cds <- load_gff_annotations("orthologs/mycobacterium_tuberculosis_cdc1551_cds.gff", type = "CDS")
## Returning a df with 18 columns and 4189 rows.
h37_ncbi_cdc <- merge(h37_cds_singles, cdc_cds, by.x = "cdc1551", by.y = "protein_id")

single_gene_mapping <- h37_ncbi_cdc[, c("locus_tag.x", "locus_tag.y", "gene.x",
                                        "product.x", "gene.y", "product.y")]
colnames(single_gene_mapping) <- c("h37rv_tag", "cdc1551_tag", "h37_gene",
                                   "h37_product", "cdc_gene", "cdc_product")

3.1 Add the single gene maps to the annotations

Oh, the way I did the above, this will almost certainly end up with redundant columns; but since those columns are coming from different sources it will be a good control to ensure that everything worked properly (e.g. if they prove not to be redundant then we know there were shenanigans in my ortholog mapping).

annot <- merge(annot, single_gene_mapping, by.x = "row.names",
               by.y = "cdc1551_tag", all.x = TRUE)
rownames(annot) <- annot[["Row.names"]]
annot[["Row.names"]] <- NULL

4 A slightly random aside, rRNA and 6S

Given the pecularities we observed vis a vis the Pseudomonas reads, there is explicit interest from April to know how well the ribosomal depletion went; so I did a little poking and found that the ribosomal genes are named rrf(5S), rrs(16S), and rrl(23S). When I make a bacterial ribosomal database I usually like to include a few tRNAs and 6S, even though 5S/6S/tRNAs should be size-selected out of the libraries pretty consistently. It turns out the biology of RNAs which interact with σ in Mtb is kind of wild and there is no known 6s, but a somewhat similar RNA named ‘Ms1’.

Anyhow, I made a rRNA database for Mtb and here is a query against it to get a sense of likely rRNA reads in the samples.

cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v Undetermined); do
    cd $i
    cyoa --method hisat --libtype rRNA --species mtuberculosis_h37rv \
         --input $(/bin/ls *-trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//g')
    cd $start
done

5 Append information to the sample sheet

We have a relatively sparse sample sheet to start with, lets add some summary stats from cyoa to it.

There is one important piece of information that Daniel alerted me to which is not included in the gathered metadata (I think): these samples appear to have an extraorindarily large amount of potentially other bacterial sequence in them; and it appears to be dominated by Pseudomonas aeruginosa – I am assuming PA01 and will try mapping the reads against that strain to see what is going on.

Until those mappings are complete, I will continue on as normal, but keep in mind that the following will almost certain need to change.

spec <- make_rnaseq_spec()
gathered <- gather_preprocessing_metadata("sample_sheets/all_samples.xlsx",
                                         species = "mycobacterium_tuberculosis_cdc1551",
                                         specification = spec)
## Skipping for now
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/WT_1/outputs/*salmon_mycobacterium_tuberculosis_cdc1551/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/WT_2/outputs/*salmon_mycobacterium_tuberculosis_cdc1551/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/WT_3/outputs/*salmon_mycobacterium_tuberculosis_cdc1551/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/Mutant_pknF_1/outputs/*salmon_mycobacterium_tuberculosis_cdc1551/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/Mutant_pknF_2/outputs/*salmon_mycobacterium_tuberculosis_cdc1551/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/Mutant_pknF_3/outputs/*salmon_mycobacterium_tuberculosis_cdc1551/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/Comp_pknF_1/outputs/*salmon_mycobacterium_tuberculosis_cdc1551/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/Comp_pknF_2/outputs/*salmon_mycobacterium_tuberculosis_cdc1551/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/Comp_pknF_3/outputs/*salmon_mycobacterium_tuberculosis_cdc1551/quant.sf.
## Writing new metadata to: sample_sheets/all_samples_modified.xlsx
## Deleting the file sample_sheets/all_samples_modified.xlsx before writing the tables.

6 Create an expressionset

expt <- create_expt(gathered[["new_file"]], gene_info = annot, file_column = "hisatcounttable")
## Reading the sample metadata.
## The sample definitions comprises: 9 rows(samples) and 19 columns(metadata fields).
## Matched 4245 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 4245 features and 9 samples.
written <- write_expt(expt, excel = glue("excel/all_samples-v{ver}.xlsx"))
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 86 entries are 0.  We are on a log scale, adding 1 to the data.
## 
## Changed 86 zero count features.
## 
## Naively calculating coefficient of variation/dispersion with respect to condition.
## 
## Finished calculating dispersion estimates.
## 
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## 
## Total:29 s

7 Poke at it

plot_libsize(expt)
## Library sizes of 9 samples, 
## ranging from 3,338,710 to 13,076,335.

plot_nonzero(expt)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 9 samples.
## These samples have an average 8.867 CPM coverage and 4235 genes observed, ranging from 4227 to
## 4242.

norm <- normalize_expt(expt, transform = "log2", convert = "cpm", norm = "quant", filter = TRUE)
## Removing 91 low-count genes (4154 remaining).
plot_corheat(norm)
## A heatmap of pairwise sample correlations ranging from: 
## 0.924732991192705 to 0.987981556516807.

plot_disheat(norm)
## A heatmap of pairwise sample distances ranging from: 
## 23.2338031782721 to 58.1252671359335.

plot_pca(norm)
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by wt, delta, complement
## Shapes are defined by b1, b2, b3.

nb <- normalize_expt(expt, transform = "log2", convert = "cpm", batch = "limma", filter = TRUE)
## Removing 91 low-count genes (4154 remaining).
## If you receive a warning: 'NANs produced', one potential reason is that the data was quantile normalized.
## Setting 2 low elements to zero.
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
plot_pca(nb)
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by wt, delta, complement
## Shapes are defined by b1, b2, b3.

ns <- normalize_expt(expt, transform = "log2", convert = "cpm", batch = "svaseq", filter = TRUE)
## Removing 91 low-count genes (4154 remaining).
## Setting 2 low elements to zero.
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
plot_pca(ns)
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by wt, delta, complement
## Shapes are defined by b1, b2, b3.

Well, that is not great. I arbitrarily tried a batch in model and sva adjustment, but am not including them without further information.

7.1 Check the rRNA content

The following does not work because I fell into the assumption that these are H37Rv samples again! Doofus.

I guess I could pull the IDs because I matched them against H37Rv; but I also mapped against H37Rv, so I could just use that.

rrna_loci <- c("Rvnr01", "Rvnr02", "Rvnr03")
rrna_expt <- expt[rrna_loci, ]

While I was typing the above incorrectly, the alignments started finishing the query against my Mtb rRNA database. It looks like ~ 3% of the reads map to rRNA.

8 varpart

varpart <- simple_varpart(expt)
## 
## Total:37 s
varpart
## The result of using variancePartition with the model:
## ~ condition + batch

9 Kraken matrix

genus_expt <- create_expt(gathered[["new_file"]],
                          file_column = "krakenmatrix", file_type = "table")
## Reading the sample metadata.
## The sample definitions comprises: 9 rows(samples) and 19 columns(metadata fields).
## Warning in first_rownames != current_rownames: longer object length is not a multiple of shorter object length
## Warning in read_counts_expt(sample_ids, filenames, countdir = countdir, : The file:
## /home/trey/sshfs/scratch/atb/rnaseq/mycobacterium_tuberculosis_2023/preprocessing/WT_2/outputs/05kraken_bacteria/kraken_report_matrix.tsv
## has mismatched rownames.
## Warning in first_rownames != current_rownames: longer object length is not a multiple of shorter object length
## Warning in read_counts_expt(sample_ids, filenames, countdir = countdir, : The file:
## /home/trey/sshfs/scratch/atb/rnaseq/mycobacterium_tuberculosis_2023/preprocessing/WT_3/outputs/05kraken_bacteria/kraken_report_matrix.tsv
## has mismatched rownames.
## Warning in first_rownames != current_rownames: longer object length is not a multiple of shorter object length
## Warning in read_counts_expt(sample_ids, filenames, countdir = countdir, : The file:
## /home/trey/sshfs/scratch/atb/rnaseq/mycobacterium_tuberculosis_2023/preprocessing/Mutant_pknF_1/outputs/05kraken_bacteria/kraken_report_matrix.tsv
## has mismatched rownames.
## Warning in first_rownames != current_rownames: longer object length is not a multiple of shorter object length
## Warning in read_counts_expt(sample_ids, filenames, countdir = countdir, : The file:
## /home/trey/sshfs/scratch/atb/rnaseq/mycobacterium_tuberculosis_2023/preprocessing/Mutant_pknF_2/outputs/05kraken_bacteria/kraken_report_matrix.tsv
## has mismatched rownames.
## Warning in first_rownames != current_rownames: longer object length is not a multiple of shorter object length
## Warning in read_counts_expt(sample_ids, filenames, countdir = countdir, : The file:
## /home/trey/sshfs/scratch/atb/rnaseq/mycobacterium_tuberculosis_2023/preprocessing/Mutant_pknF_3/outputs/05kraken_bacteria/kraken_report_matrix.tsv
## has mismatched rownames.
## Warning in first_rownames != current_rownames: longer object length is not a multiple of shorter object length
## Warning in read_counts_expt(sample_ids, filenames, countdir = countdir, : The file:
## /home/trey/sshfs/scratch/atb/rnaseq/mycobacterium_tuberculosis_2023/preprocessing/Comp_pknF_1/outputs/05kraken_bacteria/kraken_report_matrix.tsv
## has mismatched rownames.
## Warning in first_rownames != current_rownames: longer object length is not a multiple of shorter object length
## Warning in read_counts_expt(sample_ids, filenames, countdir = countdir, : The file:
## /home/trey/sshfs/scratch/atb/rnaseq/mycobacterium_tuberculosis_2023/preprocessing/Comp_pknF_2/outputs/05kraken_bacteria/kraken_report_matrix.tsv
## has mismatched rownames.
## Warning in first_rownames != current_rownames: longer object length is not a multiple of shorter object length
## Warning in read_counts_expt(sample_ids, filenames, countdir = countdir, : The file:
## /home/trey/sshfs/scratch/atb/rnaseq/mycobacterium_tuberculosis_2023/preprocessing/Comp_pknF_3/outputs/05kraken_bacteria/kraken_report_matrix.tsv
## has mismatched rownames.
## Warning in create_expt(gathered[["new_file"]], file_column = "krakenmatrix", : There are some NAs in this data,
## the 'handle_nas' parameter may be required.
## Matched 191 annotations and counts.
## Bringing together the count matrix and gene information.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 191 features and 9 samples.
genus_norm <- normalize_expt(genus_expt, convert = "cpm")
plot_disheat(genus_norm)
## A heatmap of pairwise sample distances ranging from: 
## 4945.06863345337 to 102810.646015848.

genus_normv2 <- normalize_expt(genus_expt, convert = "cpm", transform = "log2")
plot_pca(genus_normv2)
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by wt, delta, complement
## Shapes are defined by b1, b2, b3.

plot_libsize(genus_expt)
## Library sizes of 9 samples, 
## ranging from 371,301 to 2,637,312.

head(exprs(genus_expt))
##                  WT_1 WT_2 WT_3 Mutant_pknF_1 Mutant_pknF_2 Mutant_pknF_3 Comp_pknF_1 Comp_pknF_2 Comp_pknF_3
## Acetobacter        96   29   66           116            81            53          27         146         143
## Achromobacter      26   20   15            34            28            21           6          49          59
## Acidiferrobacter   26    8   12            18            21            10           2          23          31
## Acidihalobacter    14    5   11            23            38            11           4          38          40
## Acinetobacter     819  314  448           770           691           477         147        1130         962
## Actinobacillus     26   20   26            36            36            10           4          28          32
exprs(genus_expt)["Pseudomonas", ]
##          WT_1          WT_2          WT_3 Mutant_pknF_1 Mutant_pknF_2 Mutant_pknF_3   Comp_pknF_1   Comp_pknF_2 
##       1509718        546307       1020882       1635517       1502338        902653        341293       2477879 
##   Comp_pknF_3 
##       2388934

10 Differential Expression

I am going to assume for the moment that we can learn interesting things from the data, but that it will require more intervention than usual for a bacterial dataset.

keepers <- list(
  "delta_wt" = c("delta", "wt"),
  "comp_wt" = c("complement", "wt"),
  "delta_comp" = c("delta", "complement"))
de <- all_pairwise(expt, keepers = keepers, filter = TRUE, model_batch = "svaseq")
## 
##         wt      delta complement 
##          3          3          3
## Removing 0 low-count genes (4154 remaining).
## Setting 2 low elements to zero.
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.

tables <- combine_de_tables(
  de, keepers = keepers, excel = glue("excel/de_tables-v{ver}.xlsx"))
## Checking limma for name delta_wt:wt_vs_delta
## Checking deseq for name delta_wt:wt_vs_delta
## Checking edger for name delta_wt:wt_vs_delta
## Checking ebseq for name delta_wt:wt_vs_delta
## Checking noiseq for name delta_wt:wt_vs_delta
## Checking basic for name delta_wt:wt_vs_delta
## Checking limma for name comp_wt:wt_vs_complement
## Checking deseq for name comp_wt:wt_vs_complement
## Checking edger for name comp_wt:wt_vs_complement
## Checking ebseq for name comp_wt:wt_vs_complement
## Checking noiseq for name comp_wt:wt_vs_complement
## Checking basic for name comp_wt:wt_vs_complement
## Checking limma for name delta_comp:complement_vs_delta
## Checking deseq for name delta_comp:complement_vs_delta
## Checking edger for name delta_comp:complement_vs_delta
## Checking ebseq for name delta_comp:delta_vs_complement
## Checking noiseq for name delta_comp:complement_vs_delta
## Checking basic for name delta_comp:complement_vs_delta
## About to start combine_mapped_table()
## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: ebseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## Finished combine_mapped_table()
## About to start combine_mapped_table()
## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: ebseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## Finished combine_mapped_table()
## About to start combine_mapped_table()
## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## Finished combine_mapped_table()
sig <- extract_significant_genes(
  tables, according_to = "deseq", excel = glue("excel/de_sig-v{ver}.xlsx"))

10.1 Also provide a less stringent version

less_stringent <- extract_significant_genes(
  tables, according_to = "deseq", lfc = 0.6, p = 1,
  excel = glue("excel/de_sig_0.6lfc_1.0pval-v{ver}.xlsx"))

10.1.1 Test batch-in-model

Given the PCA above, it seems at least possible that including batch in model might provide more reliable results than sva. So, the following block does that. Spoiler alert: I don’t think it helped.

de_batch <- all_pairwise(expt, keepers = keepers, filter = TRUE, model_batch = TRUE)
## 
##         wt      delta complement 
##          3          3          3 
## 
## b1 b2 b3 
##  3  3  3
de_batch
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: batch in model/limma.
## The primary analysis performed 15 comparisons.
tables_batch <- combine_de_tables(de_batch, keepers = keepers,
                                  excel = glue("excel/de_tables_batch-v{ver}.xlsx"))
## Checking limma for name delta_wt:wt_vs_delta
## Checking deseq for name delta_wt:wt_vs_delta
## Checking edger for name delta_wt:wt_vs_delta
## Checking ebseq for name delta_wt:wt_vs_delta
## Checking noiseq for name delta_wt:wt_vs_delta
## Checking basic for name delta_wt:wt_vs_delta
## Checking limma for name comp_wt:wt_vs_complement
## Checking deseq for name comp_wt:wt_vs_complement
## Checking edger for name comp_wt:wt_vs_complement
## Checking ebseq for name comp_wt:wt_vs_complement
## Checking noiseq for name comp_wt:wt_vs_complement
## Checking basic for name comp_wt:wt_vs_complement
## Checking limma for name delta_comp:complement_vs_delta
## Checking deseq for name delta_comp:complement_vs_delta
## Checking edger for name delta_comp:complement_vs_delta
## Checking ebseq for name delta_comp:delta_vs_complement
## Checking noiseq for name delta_comp:complement_vs_delta
## Checking basic for name delta_comp:complement_vs_delta
## About to start combine_mapped_table()
## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: ebseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## Finished combine_mapped_table()
## About to start combine_mapped_table()
## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: ebseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## Finished combine_mapped_table()
## About to start combine_mapped_table()
## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## Finished combine_mapped_table()
tables_batch
## A set of combined differential expression results.

##                          table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup limma_sigdown
## 1         wt_vs_delta-inverted          31             1          16             3           2             1
## 2    wt_vs_complement-inverted          17             7          14             6           2             1
## 3 complement_vs_delta-inverted          12             9          11             8           1             1

sig_batch <- extract_significant_genes(tables_batch, according_to = "deseq",
                                       excel = glue("excel/de_sig_batch-v{ver}.xlsx"))
sig_batch
## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
##            deseq_up deseq_down
## delta_wt         31          1
## comp_wt          17          7
## delta_comp       12          9

10.2 The intersection of Δ/WT and Δ/comp

library(UpSetR)
first_up <- sig[["deseq"]][["ups"]][["delta_wt"]]
second_up <- sig[["deseq"]][["ups"]][["delta_comp"]]

first_down <- sig[["deseq"]][["downs"]][["delta_wt"]]
second_down <- sig[["deseq"]][["downs"]][["delta_comp"]]

comparison <- list(
  "d_w_up" = rownames(first_up),
  "d_c_up" = rownames(second_up),
  "d_w_down" = rownames(first_down),
  "d_c_down" = rownames(second_down))
query <- UpSetR::fromList(comparison)
sets <- UpSetR::upset(query)
sets

## So, there are 2 shared down and 12 up.  Also 1 weirdo (which given
## its proximity to the operon containing PknF is likely interesting)

overlap_ids <- overlap_groups(comparison)
interesting_up <- overlap_geneids(overlap_ids, "d_w_up:d_c_up")
interesting_up
##  d_w_up3  d_w_up6 d_w_up18 d_w_up20 d_w_up29 d_w_up30 d_w_up37 d_w_up45 d_w_up57 d_w_up71 d_w_up90 d_w_up95 
## "MT3214" "MT0995" "MT2020" "MT0423" "MT1255" "MT0424" "MT2579" "MT2574" "MT3712" "MT2576" "MT1256" "MT0595"
interesting_down <- overlap_geneids(overlap_ids, "d_w_down:d_c_down")
interesting_down
## d_c_up25 d_c_up26 
## "MT1702" "MT0995"
weirdo <- overlap_geneids(overlap_ids, "d_w_up:d_c_down")
weirdo
##  d_w_up2 
## "MT1790"
overlapping_meta <- annot[c(interesting_up, interesting_down, weirdo), ]
written <- write_xlsx(overlapping_meta, excel = "excel/overlapping_stringent_genes.xlsx")
## Deleting the file excel/overlapping_stringent_genes.xlsx before writing the tables.

10.3 Repeat the intersection of Δ/WT and Δ/comp with the less significant data

first_up <- less_stringent[["deseq"]][["ups"]][["delta_wt"]]
second_up <- less_stringent[["deseq"]][["ups"]][["delta_comp"]]

first_down <- less_stringent[["deseq"]][["downs"]][["delta_wt"]]
second_down <- less_stringent[["deseq"]][["downs"]][["delta_comp"]]

comparison <- list(
  "d_w_up" = rownames(first_up),
  "d_c_up" = rownames(second_up),
  "d_w_down" = rownames(first_down),
  "d_c_down" = rownames(second_down))
query <- UpSetR::fromList(comparison)
sets <- UpSetR::upset(query)
sets

## So, there are 2 shared down and 12 up.  Also 1 weirdo (which given
## its proximity to the operon containing PknF is likely interesting)

overlap_ids <- overlap_groups(comparison)
interesting_up <- overlap_geneids(overlap_ids, "d_w_up:d_c_up")
interesting_up
##     d_w_up1     d_w_up3     d_w_up6     d_w_up7     d_w_up9    d_w_up10    d_w_up11    d_w_up12    d_w_up14 
##    "MT1789"    "MT3214"    "MT0995"    "MT1002"    "MT2707"    "MT2115"    "MT3249"    "MT1001"    "MT0996" 
##    d_w_up16    d_w_up17    d_w_up19    d_w_up21    d_w_up23    d_w_up24    d_w_up25    d_w_up28    d_w_up29 
##    "MT1003"    "MT2052"    "MT2020"    "MT0423"    "MT0705"  "MT1003.1"    "MT2216"    "MT2215"    "MT0368" 
##    d_w_up30    d_w_up31    d_w_up32    d_w_up33    d_w_up36    d_w_up37    d_w_up39    d_w_up41    d_w_up42 
##    "MT1255"    "MT0424"    "MT1511"    "MT0265"    "MT0706" "MT3573.12"    "MT2579"    "MT0845"    "MT0296" 
##    d_w_up44    d_w_up45    d_w_up46    d_w_up47    d_w_up48    d_w_up49    d_w_up50    d_w_up51    d_w_up52 
##    "MT1000"    "MT1200"    "MT1400"    "MT2703"    "MT3250"    "MT2574"    "MT2021"    "MT1512"    "MT2053" 
##    d_w_up53    d_w_up54    d_w_up58    d_w_up59    d_w_up60    d_w_up61    d_w_up62    d_w_up65    d_w_up66 
##    "MT2704"    "MT2705"    "MT3907"    "MT2578"    "MT2783"    "MT3217"    "MT3712"    "MT2815"    "MT3342" 
##    d_w_up67    d_w_up70    d_w_up72    d_w_up76    d_w_up77    d_w_up81    d_w_up82    d_w_up83    d_w_up84 
##    "MT0397"  "MT0706.1"    "MT0367"    "MT2576"    "MT3263"    "MT2050"    "MT0787"    "MT2600"    "MT2577" 
##    d_w_up85    d_w_up86    d_w_up87    d_w_up90    d_w_up92    d_w_up95    d_w_up96    d_w_up97    d_w_up98 
##    "MT2816"    "MT3632"    "MT3216"    "MT3644"    "MT0297" "MT3573.11"    "MT1256"    "MT0366"    "MT2573" 
##   d_w_up101   d_w_up102   d_w_up105   d_w_up107   d_w_up108   d_w_up110   d_w_up115   d_w_up125   d_w_up131 
##    "MT0595"    "MT2306"    "MT1011"    "MT0596"    "MT1636"    "MT1637"    "MT0425"    "MT0846"    "MT1199" 
##   d_w_up133   d_w_up136   d_w_up139   d_w_up143   d_w_up150   d_w_up151   d_w_up152   d_w_up159   d_w_up172 
##    "MT1510"    "MT3646"    "MT0115"    "MT2702"    "MT1641"    "MT3165"    "MT2904"    "MT1540"    "MT3326" 
##   d_w_up173   d_w_up174   d_w_up175   d_w_up183   d_w_up186   d_w_up191   d_w_up200   d_w_up207   d_w_up208 
##    "MT2090"    "MT0295"    "MT3985"    "MT2091"    "MT3645"    "MT1102"    "MT0365"    "MT2571"    "MT1639" 
##   d_w_up209   d_w_up212   d_w_up217   d_w_up218   d_w_up224   d_w_up231   d_w_up236   d_w_up264   d_w_up265 
##    "MT2049"    "MT2005"    "MT3219"    "MT3220"    "MT2572"    "MT2015"    "MT2701"  "MT3712.1"    "MT1508" 
##   d_w_up267   d_w_up269   d_w_up270   d_w_up274   d_w_up277   d_w_up286   d_w_up287   d_w_up297   d_w_up303 
##    "MT3983"    "MT3984"    "MT1638"    "MT2061"    "MT2575"    "MT3212"    "MT2022"    "MT1774"    "MT0305" 
##   d_w_up311   d_w_up313   d_w_up317   d_w_up319   d_w_up340   d_w_up343   d_w_up344   d_w_up347   d_w_up348 
##    "MT2698"    "MT2383"    "MT1126"    "MT1012"    "MT1779"    "MT1010"    "MT0292"    "MT1198"    "MT2570" 
##   d_w_up398   d_w_up400   d_w_up401   d_w_up425   d_w_up438   d_w_up439   d_w_up441   d_w_up443   d_w_up445 
##    "MT2305"    "MT3002"    "MT2059"    "MT2660"    "MT0298"    "MT1257"    "MT2087"    "MT0070"    "MT1479" 
##   d_w_up451 
##    "MT2303"
interesting_down <- overlap_geneids(overlap_ids, "d_w_down:d_c_down")
interesting_down
##   d_c_up105   d_c_up106   d_c_up107   d_c_up108   d_c_up109   d_c_up110   d_c_up111   d_c_up112   d_c_up114 
##  "MT3427.1"    "MT3644"    "MT0598"    "MT2954"    "MT2600"    "MT0368"  "MT0706.1"    "MT2705"    "MT0425" 
##   d_c_up118   d_c_up119   d_c_up120   d_c_up123   d_c_up124   d_c_up127   d_c_up130   d_c_up131   d_c_up134 
##    "MT0295"    "MT2305"    "MT1225"    "MT3249"  "MT3194.1"    "MT0996"    "MT2815"    "MT3263"    "MT1639" 
##   d_c_up135   d_c_up136   d_c_up139   d_c_up140   d_c_up142   d_c_up144   d_c_up145   d_c_up147   d_c_up150 
##    "MT2116"    "MT3342"    "MT0896"    "MT3218"    "MT1126"    "MT1557"    "MT3165"    "MT2215"    "MT3632" 
##   d_c_up153   d_c_up155   d_c_up158   d_c_up159   d_c_up163   d_c_up167   d_c_up172   d_c_up174   d_c_up181 
##    "MT0736"    "MT0846"    "MT0265"    "MT2053"    "MT0730"    "MT0599"    "MT1017"    "MT1000"    "MT1779" 
##   d_c_up182   d_c_up183   d_c_up184   d_c_up192   d_c_up193   d_c_up201   d_c_up206   d_c_up208   d_c_up209 
##    "MT0746"    "MT1605"  "MT1003.1"    "MT0938"    "MT2660"    "MT1774"    "MT1510"    "MT2878"    "MT2216" 
##   d_c_up212   d_c_up214   d_c_up215   d_c_up225   d_c_up226   d_c_up227   d_c_up230   d_w_down2   d_w_down3 
##    "MT2572"    "MT2889"    "MT0397"    "MT2606"    "MT0787"    "MT2383"    "MT1010"    "MT1152"    "MT3745" 
##   d_w_down5  d_w_down11  d_w_down13  d_w_down15  d_w_down16  d_w_down23  d_w_down24  d_w_down25  d_w_down30 
##  "MT3718.1"    "MT0174"    "MT3413"  "MT1169.1"    "MT1170"    "MT3905"    "MT0992"    "MT1067"    "MT1184" 
##  d_w_down31  d_w_down37  d_w_down45  d_w_down47  d_w_down53  d_w_down55  d_w_down56  d_w_down60  d_w_down67 
##    "MT1757"    "MT0117"    "MT0273"    "MT2349"    "MT2246"    "MT3585"    "MT1798"    "MT3844"    "MT2952" 
##  d_w_down68  d_w_down70  d_w_down76  d_w_down78  d_w_down90  d_w_down92  d_w_down95  d_w_down99 d_w_down100 
##    "MT2396"  "MT0204.1"    "MT3133"    "MT0084"    "MT0918"    "MT2730"    "MT2106"    "MT3402"    "MT2148" 
## d_w_down102 d_w_down111 
##    "MT2808"    "MT2393"
weirdo <- overlap_geneids(overlap_ids, "d_w_up:d_c_down")
weirdo
##    d_w_up2  d_w_up146  d_w_up187  d_w_up281  d_w_up328  d_w_up332  d_w_up396  d_w_up405  d_w_up433  d_w_up452 
##   "MT1790"   "MT0850"   "MT2527"   "MT2069"   "MT3211"   "MT0342"   "MT2075" "MT1707.1"   "MT3283"   "MT2068"
overlapping <- as.character(c(interesting_up, interesting_down, weirdo))

overlapping_meta <- annot[c(interesting_up, interesting_down, weirdo), ]
written <- write_xlsx(overlapping_meta, excel = "excel/overlapping_less_stringent_genes.xlsx")
## Deleting the file excel/overlapping_less_stringent_genes.xlsx before writing the tables.

11 Ontology enrichment

Given the relative sparsity of the annotations for CDC1551, I might either go back to my H37Rv mapping data (which was pretty decent), or use the orthologs and their annotations; but since I already have the data in its current form loaded, let us just try it and see.

11.1 Volker’s explicit query first

There are only 15 genes in the set of interesting intersections; so I am going to group them together.

In my annotation block above I downloaded the H37Rv and CDC1551 annotations/GO categories. There appears to be a bug in the GO loader, it needed the actual ID number instead of a unique string like the annotations; happily the annotation load prints out the ID number so I did not have to think very hard.

11.1.1 Length and COG databases

I forgot for a couple weeks to perform the same overrepresentation analyses using the COG function assignments. Let us create that association table here and add the goseq searches below.

annot[["length"]] <- abs(annot[["start"]] - annot[["stop"]])
length_db <- as.data.frame(annot[, c("name", "length")])
colnames(length_db) <- c("ID", "length")

start_cog_df <- annot[, c("desc", "COG", "COGFun", "COGDesc", "TIGRFam", "TIGRRoles")]
start_cog_df[["ID"]] <- rownames(start_cog_df)
cog_df <- start_cog_df[, c("ID", "COGFun")]
colnames(cog_df) <- c("ID", "GO")

11.2 The ontology searches

query_genes <- unique(c(interesting_up, interesting_down, weirdo))

goseq_query <- simple_goseq(query_genes, go_db = go_db,
                            length_db = length_db, min_xref = 4)
## Found 113 go_db genes and 147 length_db genes out of 187.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
goseq_query
## A set of ontologies produced by goseq using 187
## with significance cutoff 0.1.
## There are 25 MF hits, 19, BP hits, and 2 CC hits.
## Category mfp_plot_over is the most populated with 12 hits.

cog_query <- simple_goseq(query_genes, go_db = cog_df,
                          expand_categories = FALSE,
                          length_db = length_db, min_xref = 3)
## Found 187 go_db genes and 147 length_db genes out of 187.

11.2.1 Same thing, less stringent

goseq_query <- simple_goseq(unique(overlapping), go_db = go_db,
                            length_db = length_db, min_xref = 4)
goseq_query
written <- write_goseq_data(goseq_query, excel = "excel/goseq_less_stringent_results.xlsx")

topgo_query <- simple_topgo(unique(overlapping), excel = "excel/topgo_less_stringent_results.xlsx"))
## Error: <text>:6:100: unexpected ')'
## 5: 
## 6: topgo_query <- simple_topgo(unique(overlapping), excel = "excel/topgo_less_stringent_results.xlsx"))
##                                                                                                       ^

Given that there are only a few genes in the above query, let us also do the three primary contrasts separately as I think they will prove more interesting.

goseq_up_delta_wt <- simple_goseq(sig[["deseq"]][["ups"]][["delta_wt"]], go_db = go_db,
                                  length_db = length_db)
## Found 67 go_db genes and 83 length_db genes out of 108.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
goseq_up_delta_wt
## A set of ontologies produced by goseq using 108
## with significance cutoff 0.1.
## There are 27 MF hits, 12, BP hits, and 2 CC hits.
## Category mfp_plot_over is the most populated with 11 hits.

goseq_down_delta_wt <- simple_goseq(sig[["deseq"]][["downs"]][["delta_wt"]], go_db = go_db,
                                    length_db = length_db)
## Found 3 go_db genes and 17 length_db genes out of 17.
goseq_down_delta_wt
## NULL
goseq_up_comp_wt <- simple_goseq(sig[["deseq"]][["ups"]][["comp_wt"]], go_db = go_db,
                                  length_db = length_db, min_xref = 20)
## Found 27 go_db genes and 51 length_db genes out of 54.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
goseq_up_comp_wt
## A set of ontologies produced by goseq using 54
## with significance cutoff 0.1.
## There are 20 MF hits, 11, BP hits, and 0 CC hits.
## Category mfp_plot_over is the most populated with 8 hits.

goseq_down_comp_wt <- simple_goseq(sig[["deseq"]][["downs"]][["comp_wt"]], go_db = go_db,
                                    length_db = length_db, min_xref = 5)
## Found 6 go_db genes and 24 length_db genes out of 25.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
goseq_down_comp_wt
## A set of ontologies produced by goseq using 25
## with significance cutoff 0.1.
## There are 4 MF hits, 3, BP hits, and 0 CC hits.
## Category mfp_plot_over is the most populated with 3 hits.

goseq_up_delta_comp <- simple_goseq(sig[["deseq"]][["ups"]][["delta_comp"]], go_db = go_db,
                                    length_db = length_db, min_xref = 10)
## Found 20 go_db genes and 32 length_db genes out of 36.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
goseq_up_delta_comp
## A set of ontologies produced by goseq using 36
## with significance cutoff 0.1.
## There are 8 MF hits, 3, BP hits, and 1 CC hits.
## Category mfp_plot_over is the most populated with 7 hits.

goseq_down_delta_comp <- simple_goseq(sig[["deseq"]][["downs"]][["delta_comp"]], go_db = go_db,
                                    length_db = length_db, min_xref = 5)
## Found 7 go_db genes and 31 length_db genes out of 32.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
goseq_down_delta_comp
## A set of ontologies produced by goseq using 32
## with significance cutoff 0.1.
## There are 4 MF hits, 2, BP hits, and 1 CC hits.
## Category mfp_plot_over is the most populated with 4 hits.

12 Circos

I think it would be nice to have the wt mean expression values as a baseline

means <- mean_by_factor(norm)[["medians"]]
## The factor wt has 3 rows.
## The factor delta has 3 rows.
## The factor complement has 3 rows.

12.1 Collect the STPKs

stpk_operons <- c("pknA", "pknB", ## ~15k, perhaps ending a 7 gene operon (fhaA, fhaB, pstP, rodA, rbpA, pknA, pknB)
                 ## No pknC
                 "pknD", ## ~ 1038500, perhaps 2 gene operon with pstS2
                 "pknE", ## ~ 2000000, perhaps with vapB34, vapC34, Rv1742, pknE
                 "pknF", ## ~ 2000000, close to E, pknF, Rv1747, Rv1748
                 "pknG", ## ~ 2000000, Rv0412c, hlnH, pknG
                 "pknH", ## ~ 1415000, maybe 8 genes, Rv1273c, Rv1272c, Rv1271c, lprA, Rv1269c, Rv1268c, embR, pknH
                 "pknI", ## ~ 3222000, ffh, Rv2915c, pknI, Rv2913c, Rv2912c
                 "pknJ", ## ~ 2345000, Rv2082, Rv2083, Rv2084, Rv2085, Rv2086, Rv2087, pknJ
                 "pknK", ## ~ 3442500, pknK, Rv3079c
                 "pknL",
                 "lppH" ## when searching for pknM, lpphH, Rv3577, arsB2 4015000
                 )
mapped_stpk <- single_gene_mapping[["h37_gene"]] %in% stpk_operons
mapped_ids <- single_gene_mapping[mapped_stpk, ]
mapped_ids
##      h37rv_tag cdc1551_tag h37_gene                                        h37_product cdc_gene
## 14     Rv0014c      MT0017     pknB               serine/threonine-protein kinase PknB     <NA>
## 15     Rv0015c      MT0018     pknA               serine/threonine-protein kinase PknA     <NA>
## 346    Rv0410c      MT0423     pknG               serine/threonine-protein kinase PknG     <NA>
## 778    Rv0931c      MT0958     pknD               serine/threonine-protein kinase PknD     <NA>
## 1051   Rv1266c      MT1304     pknH               serine/threonine-protein kinase PknH     <NA>
## 1462    Rv1743      MT1785     pknE               serine/threonine-protein kinase PknE     <NA>
## 1465    Rv1746      MT1788     pknF               serine/threonine-protein kinase PknF     <NA>
## 1755    Rv2088      MT2149     pknJ transmembrane serine/threonine-protein kinase PknJ     <NA>
## 1832    Rv2176      MT2232     pknL               serine/threonine-protein kinase PknL     <NA>
## 2474   Rv2914c      MT2982     pknI               serine/threonine-protein kinase PknI     <NA>
## 2614   Rv3080c      MT3165     pknK               serine/threonine-protein kinase PknK     <NA>
## 3040    Rv3576      MT3681     lppH                                   lipoprotein LppH     <NA>
##                                                 cdc_product
## 14                          serine/threonine protein kinase
## 15                          serine/threonine protein kinase
## 346                         serine/threonine protein kinase
## 778                         serine/threonine protein kinase
## 1051                        serine/threonine protein kinase
## 1462                        serine/threonine protein kinase
## 1465                        serine/threonine protein kinase
## 1755                        serine/threonine protein kinase
## 1832                        serine/threonine protein kinase
## 2474                        serine/threonine protein kinase
## 2614 serine/threonine protein kinase / MalT-related protein
## 3040                                   hypothetical protein
mapped_ids[["cdc1551_tag"]]
##  [1] "MT0017" "MT0018" "MT0423" "MT0958" "MT1304" "MT1785" "MT1788" "MT2149" "MT2232" "MT2982" "MT3165" "MT3681"
## Oh no shit the reason I couldn't find them before was the difference between pknB and PknB, damnit.
## So that is annoying I likely did not need to do the silly ortholog map at all.

Let us add a column to the annotations noting the STPKs and potentially include them in the circos plot.

The current map is in order from outer->inner

    • strand genes colored by COGFun
    • strand genes
  • WT gene expression on log scale, color scale chosen automatically as heatmap
  • Δ/WT comparison on log scale, color scale chosen automatically as heatmap
  • comp/WT comparison, Ibid.
  • Δ/comp comparison, Ibid.
  • ideally, tiles showing the location of the 11 STPKs and 1 rando which might also be a STPK.
annot[["chromosome"]] <- "AE000516"
annot[["stpk"]] <- "NULL"
annot[mapped_ids[["cdc1551_tag"]], "stpk"] <- "stpk"
stpk_rows <- annot[mapped_ids[["cdc1551_tag"]], ]

delta_wt_df <- tables[["data"]][["delta_wt"]][, c("deseq_logfc", "deseq_adjp")]
comp_wt_df <- tables[["data"]][["comp_wt"]][, c("deseq_logfc", "deseq_adjp")]
delta_comp_df <- tables[["data"]][["delta_comp"]][, c("deseq_logfc", "deseq_adjp")]

pknf_cfg <- circos_prefix(annot, name="pknf", cog_column = "COGFun",
                          start_column = "start", stop_column = "stop",
                          strand_column = "strand", chr_column = "chromosome")
## This assumes you have a colors.conf in circos/colors/ and fonts.conf in circos/fonts/
## It also assumes you have conf/ideogram.conf, conf/ticks.conf, and conf/housekeeping.conf
## It will write circos/conf/pknf.conf with a reasonable first approximation config file.
## Wrote karyotype to circos/conf/ideograms/pknf.conf
## This should match the ideogram= line in pknf.conf
## Wrote ticks to circos/conf/ticks_pknf.conf
pknf_kary <- circos_karyotype(
  pknf_cfg, fasta = "~/libraries/genome/mycobacterium_tuberculosis_cdc1551.fasta")
## Wrote karyotype to circos/conf/karyotypes/pknf.conf
## This should match the karyotype= line in pknf.conf
plus_minus <- circos_plus_minus(pknf_cfg, width = 0.06, thickness = 40)
## Writing data file: circos/data/pknf_plus_go.txt with the + strand GO data.
## Writing data file: circos/data/pknf_minus_go.txt with the - strand GO data.
## Wrote the +/- config files.  Appending their inclusion to the master file.
## Returning the inner width: 0.88.  Use it as the outer for the next ring.
wt_exprs_heat <- circos_heatmap(pknf_cfg, means, colname = "wt", basename = "mean_wt",
                             outer = plus_minus)
## Assuming the input is a dataframe.
## Writing data file: circos/data/pknf_mean_wtwt_heatmap.txt with the mean_wtwt column.
## Returning the inner width: 0.78.  Use it as the outer for the next ring.
delta_wt_hist <- circos_heatmap(pknf_cfg, delta_wt_df, colname = "deseq_logfc",
                             basename = "delta_wt", outer = wt_exprs_heat)
## Assuming the input is a dataframe.
## Writing data file: circos/data/pknf_delta_wtdeseq_logfc_heatmap.txt with the delta_wtdeseq_logfc column.
## Returning the inner width: 0.68.  Use it as the outer for the next ring.
comp_wt_hist <- circos_heatmap(pknf_cfg, comp_wt_df, colname = "deseq_logfc",
                            basename = "comp_wt", outer = delta_wt_hist)
## Assuming the input is a dataframe.
## Writing data file: circos/data/pknf_comp_wtdeseq_logfc_heatmap.txt with the comp_wtdeseq_logfc column.
## Returning the inner width: 0.58.  Use it as the outer for the next ring.
delta_comp_hist <- circos_heatmap(pknf_cfg, delta_comp_df, colname = "deseq_logfc",
                                  basename = "delta_comp", outer = comp_wt_hist)
## Assuming the input is a dataframe.
## Writing data file: circos/data/pknf_delta_compdeseq_logfc_heatmap.txt with the delta_compdeseq_logfc column.
## Returning the inner width: 0.48.  Use it as the outer for the next ring.
tile_colors <- "0000AA"
names(tile_colors) <- c("stpk")
stpk_tiles <- circos_tile(pknf_cfg, stpk_rows, colname = "stpk",
                          colors = tile_colors, width = 0.125,
                          padding = 0, thickness = 125, margin = 0.00,
                          basename = "stpk", outer = delta_comp_hist)
## Writing data file: circos/data/pknfstpk_tile.txt with the stpkstpk column.
## Returning the inner width: 0.355.  Use it as the outer for the next ring.
finish <- circos_suffix(pknf_cfg)
made <- circos_make(pknf_cfg, target = "pknf")

One of the primary reasons I usually do histograms instead of heatmaps: I do not really understand how to lay out the color range so that it is consistent across different ranges of logFC values (I suppose I could just normalize them to -1<=x<=1?

The following is pulled from a Pseudomonas tiling of specific genes of interest; I have a feeling that the various STPKs may want to be added in a similar fashion

other <- c("gene1652095", "gene1652097", "gene1652099", "gene1652101",
           "gene1652103", "gene1652105", "gene1652107", "gene1652109",
           "gene1652111", "gene1652113", "gene1652115", "gene1652117",
           "gene1652119", "gene1652121", "gene1652123", "gene1652125",
           "gene1652127", "gene1652129", "gene1652131", "gene1652133",
           "gene1652135", "gene1652137", "gene1652139", "gene1652141",
           "gene1652143", "gene1652145", "gene1652147", "gene1652149",
           "gene1652151", "gene1652153", "gene1652155", "gene1652157",
           "gene1652159", "gene1652161", "gene1652163", "gene1652165")
## also BLUE
pel <- c("gene1654787", "gene1654789", "gene1654793",
         "gene1654795", "gene1654797", "gene1654799")
extra[down_orn, "flag"] <- "down_orn"
extra[up_orn, "flag"] <- "up_orn"
extra[effectors, "flag"] <- "effectors"
extra[other, "flag"] <- "other"
extra[pel, "flag"] <- "pel"
keep_idx <- extra[["flag"]] != ""
extra <- extra[keep_idx, ]
summary(as.factor(extra$flag))
---
title: "Sequencing 3 Mycobacterium Tuberculosis strains."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
body .main-container {
  max-width: 1600px;
}
</style>

```{r options, include=FALSE}
library(hpgltools)
library(hpgldata)
library(reticulate)
library(glue)
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(
  progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  fig.pos = "t", fig.align = "center", dpi = if (knitr::is_latex_output()) 72 else 300,
  out.width = "100%", dev = "png",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "202311"
previous_file <- ""
ver <- format(Sys.Date(), "%Y%m%d")

##tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "index.Rmd"
```

# Introduction

This project seeks to understand the effects of deleting/complementing
pknF in Mtb. I cannot honestly say I know anything about the biology
involved except the bits from the review Volker kindly sent me.

pknF is one of the surprisingly large number of serine threonine
protein kinases in Mtb.  Most bacteria have only a couple/few of
these, but Mtb has 11, and uses them with similar frequency to the two
component systems in order to respond to its changing environment
during infection (I assume, I have not gotten that far in the review
yet).  In previous experiments performed by people in Volker's lab,
a deletion of this particular STPK causes a much more virulent
phenotype during mouse infection.

# Preprocessing

I wrote a quick sample sheet for these 9 samples and will use my cyoa
toy to perform the various tasks to process the data.

## What annotations do I have?

I am assuming the most appropriate annotations come from
microbesonline?  Let us check real quick.  I should probably go back
through my old logs to answer these questions.

```{r}
annot <- load_microbesonline_annotations("CDC1551")
annot <- as.data.frame(annot)
rownames(annot) <- make.names(annot[["sysName"]], unique = TRUE)
go_db <- load_microbesonline_go(id = 83331, id_column = "sysName")

h37_annot <- load_microbesonline_annotations("H37Rv")
h37_go <- load_microbesonline_go(id = 83332, id_column = "sysName")
```

That looks pretty reasonable to me, it even has COG groups and all the
fun stuff I want.

## Which genome am I using?

Before I start, I should recall which is my current reference genome
and see if there is a better alternative since last I looked.

My current genome is strain H37Rv last downloaded in 2022, it is
derived from accession NC_000962.3 / assembly GCF_000195955.2.
Glancing at the gff file I have, I am guessing that the most
appropriate tags to index against are either locus_tag(Rv0003) or
gene(recF).

```{bash, eval=FALSE}
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v Undeter); do
    cd $i
    cyoa --method prnaseq --species mtuberculosis_h37rv \
         --gff_type gene --gff_tag locus_tag \
         --introns 0 --freebayes 1 \
         --input $(/bin/ls *.gz | tr '\n' ':' | sed 's/:$//g')
    cd $start
done
```

## The correct Strain is CDC1551

Ok, so let us redo this accordingly.

```{bash, eval=FALSE}
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v Undeter); do
    cd $i
    cyoa --method prnaseq --species mycobacterium_tuberculosis_cdc1551 \
         --gff_type gene --gff_tag locus_tag \
         --introns 0 --freebayes 1 \
         --input $(/bin/ls *.gz | tr '\n' ':' | sed 's/:$//g')
    cd $start
done
```

# Ortholog search of CDC1551/H37Rv

A few minutes ago I was making a couple of notes to myself, one of
which was to explicitly note where PknF is in this strain.  Sadly,
when I looked into the annotations I downloaded it was unnamed.  I
was able to make a good guess given that I knew the location in H37Rv
and saw a group of genes in CDC1551 in basically the same place in a similar
orientation.  Later, when the DE finished I was able to prove it by
showing that MT1788 (position 1962996) is the most decreased in the Δ
strain and most increased in the complement.

With that in mind, I was a little sad to not have the full set of
H37Rv annotations available and so am spinning up my favorite ortholog
search tool 'orthofinder'.  The following block is where I am going to
do that.

I first need to extract the amino acid sequences from H37Rv (I have
them in place already for CDC1551).

```{bash, eval=FALSE}
mkdir orthologs
cd orthologs
cp ~/libraries/genome/mtub*.gb .
cp ~/libraries/genome/myco*.faa .
cyoa --method gb2gff --input *.gb
ls *.faa
cyoa --method orthofinder --input mtuberculosis_h37rv.faa:mycobacterium_tuberculosis_cdc1551.faa
```

For the curious, that last command generates the following on the
computing cluster:

```{bash, eval=FALSE}
orthofinder -f outputs/50orthofinder/input \
  -o outputs/50orthofinder/output \
  2>>outputs/50orthofinder/orthofinder.stderr 1>>outputs/50orthofinder/orthofinder.stdout
mv outputs/50orthofinder/output/Results_Dec06/* outputs/50orthofinder/output
rmdir outputs/50orthofinder/output/Results_Dec06
```

It writes the input file for me so that I cannot introduce any
annoying typeographical errrors[sic].  It also creates a little job
that parses the quite confusing outputs from orthofinder into an
(hopefully) easy to read table.

My hope is to shortly be able to append all the H37Rv IDs to my
annotation set for CDC1551.  We shall see, orthofinder takes ~ 10
minutes / 1000 genes.

### Read in orthogroup information

The orthologs produced by orthofinder appear to pick up single-groups
for 3343 genes, which seems pretty good, I am assuming the remainder
of genes are multi-gene families which will also be nicely identified.

```{r}
single_orthologs <- as.data.frame(readr::read_tsv("orthologs/outputs/50orthofinder/orthogroups_single_named.tsv"))
rownames(single_orthologs) <- single_orthologs[["Orthogroup"]]
single_orthologs <- single_orthologs[, c(2, 4)]
colnames(single_orthologs) <- c("h37rv", "cdc1551")

## There are a bunch of 1:many mappings and many:many mappings it turns out!
## and I need to reformat the text file containing them, supido orthofinder
## added some random characters in the lists of genes which make it weird to parse.
all_orthogroups <- as.data.frame(readr::read_tsv("orthologs/outputs/50orthofinder/orthogroups_all_named.tsv"))
```

Ok, so I need to do a little work to clean up the (1|many):(1|many)
mappings; until then let us just merge the 1:1 mappings into the
CDC1551 annotations.

I think I will need to get the NIH IDs from the genbank or gff file
before merging to the microbesonline annotations.  An important caveat
I did not consider before: the ortholog searches were explicitly on
CDS/proteins and my general annotations are not.

```{r}
h37_cds <- load_gff_annotations("orthologs/mtuberculosis_h37rv_all.gff", type = "CDS")

h37_cds_singles <- merge(h37_cds, single_orthologs, by.x = "protein_id",
                         by.y = "h37rv", all.x = TRUE)

cdc_cds <- load_gff_annotations("orthologs/mycobacterium_tuberculosis_cdc1551_cds.gff", type = "CDS")
h37_ncbi_cdc <- merge(h37_cds_singles, cdc_cds, by.x = "cdc1551", by.y = "protein_id")

single_gene_mapping <- h37_ncbi_cdc[, c("locus_tag.x", "locus_tag.y", "gene.x",
                                        "product.x", "gene.y", "product.y")]
colnames(single_gene_mapping) <- c("h37rv_tag", "cdc1551_tag", "h37_gene",
                                   "h37_product", "cdc_gene", "cdc_product")
```

## Add the single gene maps to the annotations

Oh, the way I did the above, this will almost certainly end up with
redundant columns; but since those columns are coming from different
sources it will be a good control to ensure that everything worked
properly (e.g. if they prove not to be redundant then we know there
were shenanigans in my ortholog mapping).

```{r}
annot <- merge(annot, single_gene_mapping, by.x = "row.names",
               by.y = "cdc1551_tag", all.x = TRUE)
rownames(annot) <- annot[["Row.names"]]
annot[["Row.names"]] <- NULL
```

# A slightly random aside, rRNA and 6S

Given the pecularities we observed vis a vis the Pseudomonas reads,
there is explicit interest from April to know how well the ribosomal
depletion went; so I did a little poking and found that the ribosomal
genes are named rrf(5S), rrs(16S), and rrl(23S).  When I make a
bacterial ribosomal database I usually like to include a few tRNAs and 6S, even
though 5S/6S/tRNAs should be size-selected out of the libraries pretty
consistently.  It turns out the biology of RNAs which interact with
σ in Mtb is kind of wild and there is no known 6s, but a somewhat
similar RNA named 'Ms1'.

Anyhow, I made a rRNA database for Mtb and here is a query against it
to get a sense of likely rRNA reads in the samples.

```{bash, eval=FALSE}
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v Undetermined); do
    cd $i
    cyoa --method hisat --libtype rRNA --species mtuberculosis_h37rv \
         --input $(/bin/ls *-trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//g')
    cd $start
done
```

# Append information to the sample sheet

We have a relatively sparse sample sheet to start with, lets add some
summary stats from cyoa to it.

There is one important piece of information that Daniel alerted me to
which is not included in the gathered metadata (I think): these
samples appear to have an extraorindarily large amount of potentially
other bacterial sequence in them; and it appears to be dominated by
Pseudomonas aeruginosa -- I am assuming PA01 and will try mapping the
reads against that strain to see what is going on.

Until those mappings are complete, I will continue on as normal, but
keep in mind that the following will almost certain need to change.

```{r}
spec <- make_rnaseq_spec()
gathered <- gather_preprocessing_metadata("sample_sheets/all_samples.xlsx",
                                         species = "mycobacterium_tuberculosis_cdc1551",
                                         specification = spec)
```

# Create an expressionset

```{r}
expt <- create_expt(gathered[["new_file"]], gene_info = annot, file_column = "hisatcounttable")

written <- write_expt(expt, excel = glue("excel/all_samples-v{ver}.xlsx"))
```

# Poke at it

```{r}
plot_libsize(expt)
plot_nonzero(expt)

norm <- normalize_expt(expt, transform = "log2", convert = "cpm", norm = "quant", filter = TRUE)
plot_corheat(norm)
plot_disheat(norm)
plot_pca(norm)

nb <- normalize_expt(expt, transform = "log2", convert = "cpm", batch = "limma", filter = TRUE)
plot_pca(nb)

ns <- normalize_expt(expt, transform = "log2", convert = "cpm", batch = "svaseq", filter = TRUE)
plot_pca(ns)
```

Well, that is not great.  I arbitrarily tried a batch in model and sva
adjustment, but am not including them without further information.

## Check the rRNA content

The following does not work because I fell into the assumption that
these are H37Rv samples again!  Doofus.

I guess I could pull the IDs because I matched them against H37Rv;
but I also mapped against H37Rv, so I could just use that.

```{r, eval=FALSE}
rrna_loci <- c("Rvnr01", "Rvnr02", "Rvnr03")
rrna_expt <- expt[rrna_loci, ]
```

While I was typing the above incorrectly, the alignments started
finishing the query against my Mtb rRNA database.  It looks like ~ 3%
of the reads map to rRNA.

# varpart

```{r}
varpart <- simple_varpart(expt)
varpart
```

# Kraken matrix

```{r}
genus_expt <- create_expt(gathered[["new_file"]],
                          file_column = "krakenmatrix", file_type = "table")
genus_norm <- normalize_expt(genus_expt, convert = "cpm")
plot_disheat(genus_norm)
genus_normv2 <- normalize_expt(genus_expt, convert = "cpm", transform = "log2")
plot_pca(genus_normv2)
plot_libsize(genus_expt)
head(exprs(genus_expt))
exprs(genus_expt)["Pseudomonas", ]
```

# Differential Expression

I am going to assume for the moment that we can learn interesting
things from the data, but that it will require more intervention than
usual for a bacterial dataset.

```{r}
keepers <- list(
  "delta_wt" = c("delta", "wt"),
  "comp_wt" = c("complement", "wt"),
  "delta_comp" = c("delta", "complement"))
de <- all_pairwise(expt, keepers = keepers, filter = TRUE, model_batch = "svaseq")
tables <- combine_de_tables(
  de, keepers = keepers, excel = glue("excel/de_tables-v{ver}.xlsx"))
sig <- extract_significant_genes(
  tables, according_to = "deseq", excel = glue("excel/de_sig-v{ver}.xlsx"))
```

## Also provide a less stringent version

```{r}
less_stringent <- extract_significant_genes(
  tables, according_to = "deseq", lfc = 0.6, p = 1,
  excel = glue("excel/de_sig_0.6lfc_1.0pval-v{ver}.xlsx"))
```

### Test batch-in-model

Given the PCA above, it seems at least possible that including batch
in model might provide more reliable results than sva.  So, the
following block does that.  Spoiler alert: I don't think it helped.

```{r}
de_batch <- all_pairwise(expt, keepers = keepers, filter = TRUE, model_batch = TRUE)
de_batch
tables_batch <- combine_de_tables(de_batch, keepers = keepers,
                                  excel = glue("excel/de_tables_batch-v{ver}.xlsx"))
tables_batch
sig_batch <- extract_significant_genes(tables_batch, according_to = "deseq",
                                       excel = glue("excel/de_sig_batch-v{ver}.xlsx"))
sig_batch
```

## The intersection of Δ/WT and Δ/comp

```{r}
library(UpSetR)
first_up <- sig[["deseq"]][["ups"]][["delta_wt"]]
second_up <- sig[["deseq"]][["ups"]][["delta_comp"]]

first_down <- sig[["deseq"]][["downs"]][["delta_wt"]]
second_down <- sig[["deseq"]][["downs"]][["delta_comp"]]

comparison <- list(
  "d_w_up" = rownames(first_up),
  "d_c_up" = rownames(second_up),
  "d_w_down" = rownames(first_down),
  "d_c_down" = rownames(second_down))
query <- UpSetR::fromList(comparison)
sets <- UpSetR::upset(query)
sets
## So, there are 2 shared down and 12 up.  Also 1 weirdo (which given
## its proximity to the operon containing PknF is likely interesting)

overlap_ids <- overlap_groups(comparison)
interesting_up <- overlap_geneids(overlap_ids, "d_w_up:d_c_up")
interesting_up

interesting_down <- overlap_geneids(overlap_ids, "d_w_down:d_c_down")
interesting_down

weirdo <- overlap_geneids(overlap_ids, "d_w_up:d_c_down")
weirdo

overlapping_meta <- annot[c(interesting_up, interesting_down, weirdo), ]
written <- write_xlsx(overlapping_meta, excel = "excel/overlapping_stringent_genes.xlsx")
```

## Repeat the intersection of Δ/WT and Δ/comp with the less significant data

```{r}
first_up <- less_stringent[["deseq"]][["ups"]][["delta_wt"]]
second_up <- less_stringent[["deseq"]][["ups"]][["delta_comp"]]

first_down <- less_stringent[["deseq"]][["downs"]][["delta_wt"]]
second_down <- less_stringent[["deseq"]][["downs"]][["delta_comp"]]

comparison <- list(
  "d_w_up" = rownames(first_up),
  "d_c_up" = rownames(second_up),
  "d_w_down" = rownames(first_down),
  "d_c_down" = rownames(second_down))
query <- UpSetR::fromList(comparison)
sets <- UpSetR::upset(query)
sets
## So, there are 2 shared down and 12 up.  Also 1 weirdo (which given
## its proximity to the operon containing PknF is likely interesting)

overlap_ids <- overlap_groups(comparison)
interesting_up <- overlap_geneids(overlap_ids, "d_w_up:d_c_up")
interesting_up

interesting_down <- overlap_geneids(overlap_ids, "d_w_down:d_c_down")
interesting_down

weirdo <- overlap_geneids(overlap_ids, "d_w_up:d_c_down")
weirdo

overlapping <- as.character(c(interesting_up, interesting_down, weirdo))

overlapping_meta <- annot[c(interesting_up, interesting_down, weirdo), ]
written <- write_xlsx(overlapping_meta, excel = "excel/overlapping_less_stringent_genes.xlsx")
```

# Ontology enrichment

Given the relative sparsity of the annotations for CDC1551, I might
either go back to my H37Rv mapping data (which was pretty decent), or
use the orthologs and their annotations; but since I already have the
data in its current form loaded, let us just try it and see.

## Volker's explicit query first

There are only 15 genes in the set of interesting intersections; so I
am going to group them together.

In my annotation block above I downloaded the H37Rv and CDC1551
annotations/GO categories.  There appears to be a bug in the GO
loader, it needed the actual ID number instead of a unique string like
the annotations; happily the annotation load prints out the ID number
so I did not have to think very hard.

### Length and COG databases

I forgot for a couple weeks to perform the same overrepresentation
analyses using the COG function assignments.  Let us create that
association table here and add the goseq searches below.

```{r}
annot[["length"]] <- abs(annot[["start"]] - annot[["stop"]])
length_db <- as.data.frame(annot[, c("name", "length")])
colnames(length_db) <- c("ID", "length")

start_cog_df <- annot[, c("desc", "COG", "COGFun", "COGDesc", "TIGRFam", "TIGRRoles")]
start_cog_df[["ID"]] <- rownames(start_cog_df)
cog_df <- start_cog_df[, c("ID", "COGFun")]
colnames(cog_df) <- c("ID", "GO")
```

## The ontology searches

```{r}
query_genes <- unique(c(interesting_up, interesting_down, weirdo))

goseq_query <- simple_goseq(query_genes, go_db = go_db,
                            length_db = length_db, min_xref = 4)
goseq_query

cog_query <- simple_goseq(query_genes, go_db = cog_df,
                          expand_categories = FALSE,
                          length_db = length_db, min_xref = 3)
```

### Same thing, less stringent

```{r}
goseq_query <- simple_goseq(unique(overlapping), go_db = go_db,
                            length_db = length_db, min_xref = 4)
goseq_query
written <- write_goseq_data(goseq_query, excel = "excel/goseq_less_stringent_results.xlsx")

topgo_query <- simple_topgo(unique(overlapping), excel = "excel/topgo_less_stringent_results.xlsx"))
```

Given that there are only a few genes in the above query, let us also
do the three primary contrasts separately as I think they will prove
more interesting.

```{r}
goseq_up_delta_wt <- simple_goseq(sig[["deseq"]][["ups"]][["delta_wt"]], go_db = go_db,
                                  length_db = length_db)
goseq_up_delta_wt

goseq_down_delta_wt <- simple_goseq(sig[["deseq"]][["downs"]][["delta_wt"]], go_db = go_db,
                                    length_db = length_db)
goseq_down_delta_wt


goseq_up_comp_wt <- simple_goseq(sig[["deseq"]][["ups"]][["comp_wt"]], go_db = go_db,
                                  length_db = length_db, min_xref = 20)
goseq_up_comp_wt

goseq_down_comp_wt <- simple_goseq(sig[["deseq"]][["downs"]][["comp_wt"]], go_db = go_db,
                                    length_db = length_db, min_xref = 5)
goseq_down_comp_wt


goseq_up_delta_comp <- simple_goseq(sig[["deseq"]][["ups"]][["delta_comp"]], go_db = go_db,
                                    length_db = length_db, min_xref = 10)
goseq_up_delta_comp

goseq_down_delta_comp <- simple_goseq(sig[["deseq"]][["downs"]][["delta_comp"]], go_db = go_db,
                                    length_db = length_db, min_xref = 5)
goseq_down_delta_comp
```

# Circos

I think it would be nice to have the wt mean expression values as a
baseline

```{r}
means <- mean_by_factor(norm)[["medians"]]
```

## Collect the STPKs

```{r}
stpk_operons <- c("pknA", "pknB", ## ~15k, perhaps ending a 7 gene operon (fhaA, fhaB, pstP, rodA, rbpA, pknA, pknB)
                 ## No pknC
                 "pknD", ## ~ 1038500, perhaps 2 gene operon with pstS2
                 "pknE", ## ~ 2000000, perhaps with vapB34, vapC34, Rv1742, pknE
                 "pknF", ## ~ 2000000, close to E, pknF, Rv1747, Rv1748
                 "pknG", ## ~ 2000000, Rv0412c, hlnH, pknG
                 "pknH", ## ~ 1415000, maybe 8 genes, Rv1273c, Rv1272c, Rv1271c, lprA, Rv1269c, Rv1268c, embR, pknH
                 "pknI", ## ~ 3222000, ffh, Rv2915c, pknI, Rv2913c, Rv2912c
                 "pknJ", ## ~ 2345000, Rv2082, Rv2083, Rv2084, Rv2085, Rv2086, Rv2087, pknJ
                 "pknK", ## ~ 3442500, pknK, Rv3079c
                 "pknL",
                 "lppH" ## when searching for pknM, lpphH, Rv3577, arsB2 4015000
                 )
mapped_stpk <- single_gene_mapping[["h37_gene"]] %in% stpk_operons
mapped_ids <- single_gene_mapping[mapped_stpk, ]
mapped_ids
mapped_ids[["cdc1551_tag"]]
## Oh no shit the reason I couldn't find them before was the difference between pknB and PknB, damnit.
## So that is annoying I likely did not need to do the silly ortholog map at all.
```

Let us add a column to the annotations noting the STPKs and
potentially include them in the circos plot.

The current map is in order from outer->inner

* + strand genes colored by COGFun
* - strand genes
* WT gene expression on log scale, color scale chosen automatically as
  heatmap
* Δ/WT comparison on log scale, color scale chosen automatically as
  heatmap
* comp/WT comparison, Ibid.
* Δ/comp comparison, Ibid.
* ideally, tiles showing the location of the 11 STPKs and 1 rando
  which might also be a STPK.

```{r}
annot[["chromosome"]] <- "AE000516"
annot[["stpk"]] <- "NULL"
annot[mapped_ids[["cdc1551_tag"]], "stpk"] <- "stpk"
stpk_rows <- annot[mapped_ids[["cdc1551_tag"]], ]

delta_wt_df <- tables[["data"]][["delta_wt"]][, c("deseq_logfc", "deseq_adjp")]
comp_wt_df <- tables[["data"]][["comp_wt"]][, c("deseq_logfc", "deseq_adjp")]
delta_comp_df <- tables[["data"]][["delta_comp"]][, c("deseq_logfc", "deseq_adjp")]

pknf_cfg <- circos_prefix(annot, name="pknf", cog_column = "COGFun",
                          start_column = "start", stop_column = "stop",
                          strand_column = "strand", chr_column = "chromosome")
pknf_kary <- circos_karyotype(
  pknf_cfg, fasta = "~/libraries/genome/mycobacterium_tuberculosis_cdc1551.fasta")
plus_minus <- circos_plus_minus(pknf_cfg, width = 0.06, thickness = 40)
wt_exprs_heat <- circos_heatmap(pknf_cfg, means, colname = "wt", basename = "mean_wt",
                             outer = plus_minus)
delta_wt_hist <- circos_heatmap(pknf_cfg, delta_wt_df, colname = "deseq_logfc",
                             basename = "delta_wt", outer = wt_exprs_heat)
comp_wt_hist <- circos_heatmap(pknf_cfg, comp_wt_df, colname = "deseq_logfc",
                            basename = "comp_wt", outer = delta_wt_hist)
delta_comp_hist <- circos_heatmap(pknf_cfg, delta_comp_df, colname = "deseq_logfc",
                                  basename = "delta_comp", outer = comp_wt_hist)
tile_colors <- "0000AA"
names(tile_colors) <- c("stpk")
stpk_tiles <- circos_tile(pknf_cfg, stpk_rows, colname = "stpk",
                          colors = tile_colors, width = 0.125,
                          padding = 0, thickness = 125, margin = 0.00,
                          basename = "stpk", outer = delta_comp_hist)
finish <- circos_suffix(pknf_cfg)
made <- circos_make(pknf_cfg, target = "pknf")
```

One of the primary reasons I usually do histograms instead of
heatmaps: I do not really understand how to lay out the color range so
that it is consistent across different ranges of logFC values (I
suppose I could just normalize them to -1<=x<=1?

The following is pulled from a Pseudomonas tiling of specific genes of
interest; I have a feeling that the various STPKs may want to be added
in a similar fashion



```{r, eval=FALSE}
other <- c("gene1652095", "gene1652097", "gene1652099", "gene1652101",
           "gene1652103", "gene1652105", "gene1652107", "gene1652109",
           "gene1652111", "gene1652113", "gene1652115", "gene1652117",
           "gene1652119", "gene1652121", "gene1652123", "gene1652125",
           "gene1652127", "gene1652129", "gene1652131", "gene1652133",
           "gene1652135", "gene1652137", "gene1652139", "gene1652141",
           "gene1652143", "gene1652145", "gene1652147", "gene1652149",
           "gene1652151", "gene1652153", "gene1652155", "gene1652157",
           "gene1652159", "gene1652161", "gene1652163", "gene1652165")
## also BLUE
pel <- c("gene1654787", "gene1654789", "gene1654793",
         "gene1654795", "gene1654797", "gene1654799")
extra[down_orn, "flag"] <- "down_orn"
extra[up_orn, "flag"] <- "up_orn"
extra[effectors, "flag"] <- "effectors"
extra[other, "flag"] <- "other"
extra[pel, "flag"] <- "pel"
keep_idx <- extra[["flag"]] != ""
extra <- extra[keep_idx, ]
summary(as.factor(extra$flag))


```
