TNSeq of Streptococcus pyogenes strain HSC5

Installation and setup

This is an R Markdown document which makes heavy use of the hpgltools package. The following section demonstrates how to set it up in a clean R environment.

## Use R's install.packages to install devtools.
install.packages("devtools")
## Use devtools to install hpgltools.
devtools::install_github("elsayedlab/hpgltools")
## Load hpgltools into the R environment.
library(hpgltools)
## Use hpgltools' autoloads_all() function to install the many packages used by hpgltools.
autoloads_all()

TODO list

The following are some requests I have received, along with whether I think I have addressed them.

  1. Dval report in CDM with/without Asn. (I think I did)
  2. PCA plot including duplicates. (I think I did)
  3. Use L897_RS06150 as a benchmark to ensure that contrasts work in the intended direction.

TNSeq of HSC5 during infections

Get the genome

The following block is responsible for reading the genome and annotation. There are a couple of sources for these data which may be considered ‘canonical’, and this has caused some confusion:

The NCBI genome accession NC_021807 is a mostly completed sequencing effort for strain HSC5. Its annotation data includes two ID fields: ‘locus_tag’ and ‘old_locus_tag’.

Though microbesonline does not seem to have a genome for HSC5, I did find a separate reference genome at https://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2563366592#browse. This genome uses what I believe to be the old_locus_tag IDs.

In the context of the NCBI genome, our favorite gene is accession “L897_RS06150”: “asparagine synthetase A”, “catalyzes the formation of asparagine from aspartate and ammonia; Derived by automated computational analysis using gene prediction method: Protein Homology.” It lies between an rRNA methyltransferase and carbamate kinase and may be found at position 1210395 of 1818351 (66% of the way around, or 8 o’clock).
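
For the record, the clock arithmetic is just the gene's position as a fraction of the genome length; a quick sanity check in R:

## Position of asnA as a fraction of the 1,818,351 nt genome.
1210395 / 1818351
## [1] 0.6657
## ...which on a 12 hour clock face is about 8 o'clock:
0.6657 * 12
## [1] 7.988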

Using the DOE resource, this same gene has the ID L897_06325 and is between arcC (carbamate kinase) and the 16S guanine966 N2 methyltransferase. (That is an important modification for the accuracy of the ribosome, I think; yay ribosomes!)

The asnA region. Here is the region of most interest from IGV, with the following attributes from top to bottom:

  1. A bar of the entire genome with the visible region displayed as a red box.
  2. 8+ CDS regions with asnA in the middle in blue, including a red copy of asnA below.
  3. Blue vertical bars denoting the available CDS TA insertion positions.
  4. 4 libraries (thyt0, thyt1, asn+, asn-), where each library has a coverage map with range from 0-500 reads, followed by blue, red, and gray blocks denoting forward, reverse, and multi reads respectively.

gff_file <- "reference/mgas_hsc5.gff.gz"
hsc5_info <- gff2df(gff_file)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Returning a df with 15 columns and 1812 rows.
## The following lines are likely replaced by the above
## but will stay here for the moment
## annotations_hsc5 <- rtracklayer::import.gff3(gff_file)
##annotation_hsc5_info <- BiocGenerics::as.data.frame(annotations_hsc5)
##rownames(annotation_hsc5_info) <- make.names(annotation_hsc5_info$locus_tag, unique=TRUE)
genes_hsc5 <- hsc5_info[hsc5_info$type=="gene",]
rownames(genes_hsc5) <- make.names(genes_hsc5$locus_tag, unique=TRUE)
gene_annotations_hsc5 <- subset(genes_hsc5, select = c("start", "end", "width", "strand", "locus_tag"))
gene_lengths <- gene_annotations_hsc5[,c("width","start")]
gene_lengths$ID <- rownames(gene_lengths)
gene_lengths <- gene_lengths[,c("ID","width")]
short_annotations_hsc5 <- gene_annotations_hsc5[,c("width", "locus_tag")]
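
Since both naming schemes appear in the resources above, it is handy to keep a cross-reference between them. A minimal sketch, assuming hsc5_info (and therefore genes_hsc5) carries both the ‘locus_tag’ and ‘old_locus_tag’ columns, as the NCBI annotation for NC_021807 does:

## Cross-reference between the NCBI locus_tag and the older IDs used at JGI/IMG.
id_map <- genes_hsc5[, c("locus_tag", "old_locus_tag")]
## Per the discussion above, this row should pair L897_RS06150 with L897_06325.
id_map[id_map[["locus_tag"]] == "L897_RS06150", ]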

tooltip_data_hsc5 <- make_tooltips(annotations=hsc5_info, desc_col=c("old_locus_tag","gene"), id_col="locus_tag")

xlsx_writer <- hpgltools::write_xls(gene_annotations_hsc5, sheet="genes")
## Converted strand to characters.
openxlsx::saveWorkbook(wb=xlsx_writer$workbook, file="excel/annotations.xlsx", overwrite=TRUE)

##go_entries_hsc5 = strsplit(as.character(microbes_go_hsc5$GO), split=",", perl=TRUE)
##microbes_go_oneperrow_hsc5 = data.frame(name = rep(microbes_go_hsc5$sysName, sapply(go_entries_hsc5, length)), GO = unlist(go_entries_hsc5))
##microbes_go_hsc5 = microbes_go_oneperrow_hsc5
##rm(microbes_go_oneperrow_hsc5)
##rm(go_entries_hsc5)
## These are used for gene ontology stuff...
##microbes_lengths_hsc5 = microbes_hsc5[,c("sysName", "start","stop")]
##microbes_lengths_hsc5$length = abs(microbes_hsc5$start - microbes_hsc5$stop)
##microbes_lengths_hsc5 = microbes_lengths_hsc5[,c("sysName","length")]

Set up the experiments

The following block sets up an expressionset of the data using all_samples.csv; this includes runs 1(a), 2(b), and 3(c). As a couple of images will show, run 1 is too different to use. It is worth noting that we might be able to use it if we batch correct the data, but I think the loss makes it not worthwhile.
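
For reference, create_expt() builds the expressionset from the metadata in that csv. The following is a hypothetical sketch of the first few rows of all_samples.csv; the column names (sampleid, condition, batch, file) are my assumption about the usual sample-sheet layout, not a copy of the real file:

sampleid,condition,batch,file
HPGL0596,thyt0,a,processed_data/counts/run1/thyt0-trimmed-v0M1.count.xz
HPGL0596b,thyt0,b,processed_data/counts/run2/thyt0run2-trimmed-v0M1.count.xz
HPGL0596c,thyt0,c,processed_data/counts/run3/thyt0-trimmed-v0M1.count.xz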

hsc5_expt <- create_expt("all_samples.csv")
## processed_data/counts/run1/thyt0-trimmed-v0M1.count.xz contains 1817 rows.
## processed_data/counts/run1/thyt1-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run1/cdmwithasn-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run1/cdmnoasn-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run2/thyt0run2-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run2/thyt1run2-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run2/asnrun2-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run2/noasnrun2-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run3/thyt0-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run3/thyt1-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run3/asn-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run3/noasn-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
knitr::kable(head(Biobase::exprs(hsc5_expt$expressionset)))
              HPGL0596  HPGL0597  HPGL0598  HPGL0599 HPGL0596b HPGL0597b HPGL0598b HPGL0599b HPGL0596c HPGL0597c HPGL0598c HPGL0599c
L897_RS00005         0         0         0         1        21         2         4         6        26         5        11        12
L897_RS00010         0         1         2         0        12        20         9         4        19        45        25        11
L897_RS00015        10         0        12        15        20        11        23        16        38        55        40        33
L897_RS00020        91        24        26        57      1010      1565      2094      1521      2369      2190      1865      1772
L897_RS00025         0         0         0         0         1         1         0         1         1         2         3         0
L897_RS00030      3523      1707      2947      1236     25280     57446     41772     39227     51103     74479     42698     40760
## Show the first few lines of the count table.
norm_hsc5 <- normalize_expt(hsc5_expt, transform="log2", norm="quant", convert="cpm", filter_low=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(data)))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Filter is false, this should likely be set to something, good
##  choices include cbcb, kofa, pofa (anything but FALSE).  If you want this to
##  stay FALSE, keep in mind that if other normalizations are performed, then the
##  resulting libsizes are likely to be strange (potentially negative!)
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## Performing batch correction at step 5.
## Step 5: not doing batch correction.
## Normalize the data
plot_pca(norm_hsc5)$plot

## Show that the first run is far too different from the others.
batch_hsc5 <- normalize_expt(hsc5_expt, transform="log2", norm="rle", convert="cpm", batch=TRUE, filter_low=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(batch-correct(cpm(rle(data))))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Filter is false, this should likely be set to something, good
##  choices include cbcb, kofa, pofa (anything but FALSE).  If you want this to
##  stay FALSE, keep in mind that if other normalizations are performed, then the
##  resulting libsizes are likely to be strange (potentially negative!)
## Step 1: not doing count filtering.
## Step 2: normalizing the data with rle.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## Performing batch correction at step 5.
## Step 5: doing batch correction with TRUE.
## batch_counts: Before batch correction, 544 entries 0<x<1.
## batch_counts: Before batch correction, 5380 entries are >= 0.
## batch_counts: Using limma's removeBatchEffect to remove batch effect.
## The number of elements which are < 0 after batch correction is: 5057
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
plot_pca(batch_hsc5)$plot

## Try again with batch correction and see that they still do not fit together well.
## Run 1 can in no way be compared to runs 2/3

Set up the experiment without run 1.

The csv file kept_samples.csv excludes the earlier samples, and when we look at the preliminary statistics plots, the data looks much better without run 1(a).

hsc5_expt <- s_p(create_expt("kept_samples.csv"))$result
## processed_data/counts/run2/thyt0run2-trimmed-v0M1.count.xz contains 1817 rows.
## processed_data/counts/run2/thyt1run2-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run2/asnrun2-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run2/noasnrun2-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run3/thyt0-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run3/thyt1-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run3/asn-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
## processed_data/counts/run3/noasn-trimmed-v0M1.count.xz contains 1817 rows and merges to 1817 rows.
norm_hsc5 <- s_p(normalize_expt(hsc5_expt, transform="log2", norm="tmm", convert="cpm", filter_low=FALSE, batch=FALSE))$result
## This function will replace the expt$expressionset slot with:
## log2(FALSE(cpm(tmm(data))))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Filter is false, this should likely be set to something, good
##  choices include cbcb, kofa, pofa (anything but FALSE).  If you want this to
##  stay FALSE, keep in mind that if other normalizations are performed, then the
##  resulting libsizes are likely to be strange (potentially negative!)
## Step 1: not doing count filtering.
## Step 2: normalizing the data with tmm.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## Performing batch correction at step 5.
## Step 5: doing batch correction with FALSE.
## batch_counts: Before batch correction, 241 entries 0<x<1.
## batch_counts: Before batch correction, 2802 entries are >= 0.
## Did not recognize the batch correction, leaving the table alone.
## Recognized batch corrections include: 'limma', 'combatmod', 'sva',
## limmaresid, combat_noprior, combat, svaseq, and ruvg.
## The number of elements which are < 0 after batch correction is: 2802
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
plot_pca(norm_hsc5)$plot

## There is a pretty clear batch effect on PC1
## The top two PCs have pretty much the same amount of variance between them
## Thus I think adding batch to the model will be a good thing
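
For reference, ‘adding batch to the model’ amounts to fitting a design like the following. This is only a sketch of the kind of design matrix limma/edgeR end up seeing; the exact construction inside hpgltools may differ, and it assumes the expt design contains ‘condition’ and ‘batch’ columns:

## Hypothetical design with both condition and batch as factors.
fct_condition <- as.factor(hsc5_expt$design$condition)
fct_batch <- as.factor(hsc5_expt$design$batch)
design <- model.matrix(~ 0 + fct_condition + fct_batch)
head(design)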

norm_hsc5 <- s_p(normalize_expt(hsc5_expt, transform="log2", norm="tmm", convert="cpm", filter_low=FALSE, batch=TRUE))$result
## This function will replace the expt$expressionset slot with:
## log2(batch-correct(cpm(tmm(data))))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Filter is false, this should likely be set to something, good
##  choices include cbcb, kofa, pofa (anything but FALSE).  If you want this to
##  stay FALSE, keep in mind that if other normalizations are performed, then the
##  resulting libsizes are likely to be strange (potentially negative!)
## Step 1: not doing count filtering.
## Step 2: normalizing the data with tmm.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## Performing batch correction at step 5.
## Step 5: doing batch correction with TRUE.
## batch_counts: Before batch correction, 241 entries 0<x<1.
## batch_counts: Before batch correction, 2802 entries are >= 0.
## batch_counts: Using limma's removeBatchEffect to remove batch effect.
## The number of elements which are < 0 after batch correction is: 2834
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
plot_pca(norm_hsc5)$plot

## Very nice

Graph metrics using this data set

The following block creates graphs of all the likely metrics we may want to use to diagnose strengths/weaknesses in the data.

## This block generates all the possible graphs but hides them, so that we can view them when interested and not before.
unmodified_metrics <- s_p(graph_metrics(hsc5_expt))$result
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## I am reasonably sure this should be log scaled and am setting it.
##   If this is incorrect, set scale='raw'
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
normalized_metrics <- s_p(graph_metrics(norm_hsc5, qq=TRUE))$result
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Plotting a density plot.
## QQ plotting!
## Making plot of HPGL0596b(1) vs. a sample distribution.
## Making plot of HPGL0597b(2) vs. a sample distribution.
## Making plot of HPGL0598b(3) vs. a sample distribution.
## Making plot of HPGL0599b(4) vs. a sample distribution.
## Making plot of HPGL0596c(5) vs. a sample distribution.
## Making plot of HPGL0597c(6) vs. a sample distribution.
## Making plot of HPGL0598c(7) vs. a sample distribution.
## Making plot of HPGL0599c(8) vs. a sample distribution.

And this block shows a few of particular interest:

  1. The number of reads in each library (we hope that these are not too dissimilar).
  2. A density plot showing the number of genes observed at each read count (we hope that the libraries are similar, at least after normalization).
  3. A boxplot describing the number of reads per gene per library. This shows the same thing as the density plot, just in a different fashion.
  4. A correlation heatmap of the normalized data (we hope that samples of the same condition cluster together).
  5. A distance heatmap of the normalized data (we hope that samples of the same condition cluster together).
  6. A qq distribution of each sample against the median (we hope that no sample is very different from the others).

## Now show the metrics of interest
unmodified_metrics$libsize

## 596b (thyt0, run 2) has many fewer reads
unmodified_metrics$density

## and so is shifted a bit to the left.
unmodified_metrics$boxplot

## and also down on a box plot.

## The following are plots of the normalized data.
normalized_metrics$corheat

## After normalization it behaves pretty well, though
normalized_metrics$disheat

## This last one might be new to folks reading this document
normalized_metrics$qqlog

## Each of the 8 plots comprises a comparison of the rank-ordered genes vs. the mean of all samples.
## The idea is that if one or more samples are significantly shifted in some way, then those samples are untenably
## different from the rest.  The color from light-blue -> dark-blue shows the density of genes at the given normalized cpm.

Fitness analyses

The fitness analyses we have performed are a sort of differential expression bastardization. In this case, I pass the unmodified data to limma/edger/deseq and tell them all to include batch in the experimental model. Each of these tools is then free to apply its own normalizations to the data and perform its own version of differential expression. In this ‘fitness’ analysis, when a log2 fold-change is > 0, the number of insertions supported in a given gene is greater in the first-named condition of the contrast compared to the second, and that gene is therefore of “higher fitness” in the first. Unfortunately, I have not been very clear in my descriptions of this in the past and have caused some confusion here.
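
To make the direction concrete, here is a toy example of the sign convention, assuming the contrasts are defined as first_vs_second = first - second on the log2 scale (which matches how the tables below are named):

## Toy sign-convention check on the log2 scale.
## Suppose a gene averages 64 reads in noasn and 16 reads in asn;
## the noasn_vs_asn log2 fold-change is then positive:
log2(64) - log2(16)
## [1] 2
## logFC > 0 -> more insertions in the first-named condition (noasn),
## which we interpret as "higher fitness" without asparagine.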

It is worth noting that I repeated this analysis using the raw data and then again using data from which the low-count genes were removed. The definition of ‘low-count’ in this context is rather important, because genes with 0 reads in a sample are considered to be (near)essential in that condition. We therefore remove 346 genes which have very few reads across all samples. (The limit used is a threshold of 1 cpm across >= 2 samples.)
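
A minimal sketch of that threshold applied directly with edgeR's cpm(); the hpgltools filters (cbcb and friends) wrap similar logic, though their exact implementations may differ:

## Keep genes with at least 1 cpm in at least 2 samples, drop the rest.
counts <- Biobase::exprs(hsc5_expt$expressionset)
keep <- rowSums(edgeR::cpm(counts) >= 1) >= 2
filtered_counts <- counts[keep, ]
dim(filtered_counts)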

all_de <- s_p(all_pairwise(hsc5_expt, model_batch=TRUE))$result
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$normalized$intermediate_counts$normalization$libsize
## Limma step 1/6: choosing model.
## Limma step 2/6: running voom
## The voom input was not cpm, converting now.
## The voom input was not log2, transforming now.
## limma step 3/6: running lmFit
## Limma step 4/6: making and fitting contrasts.
## Limma step 5/6: Running eBayes and topTable.
## Limma step 6/6: Writing limma outputs.
## limma step 6/6: 1/10: Printing table: asn.
## limma step 6/6: 2/10: Printing table: noasn.
## limma step 6/6: 3/10: Printing table: thyt0.
## limma step 6/6: 4/10: Printing table: thyt1.
## limma step 6/6: 5/10: Printing table: noasn_vs_asn.
## limma step 6/6: 6/10: Printing table: thyt0_vs_asn.
## limma step 6/6: 7/10: Printing table: thyt1_vs_asn.
## limma step 6/6: 8/10: Printing table: thyt0_vs_noasn.
## limma step 6/6: 9/10: Printing table: thyt1_vs_noasn.
## limma step 6/6: 10/10: Printing table: thyt1_vs_thyt0.
## Starting DESeq2 pairwise comparisons.
## The data should be suitable for deseq2.
## If deseq freaks out, check the state of the count table and ensure that it is in integer counts.
## DESeq2 step 1/5: Including batch and condition in the deseq model.
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## DESeq2 step 2/5: Estimate size factors.
## DESeq2 step 3/5: estimate Dispersions.
## DESeq2 step 4/5: nbinomWaldTest.
## DESeq2 step 5/5: 1/6: Printing table: noasn_vs_asn
## DESeq2 step 5/5: 2/6: Printing table: thyt0_vs_asn
## DESeq2 step 5/5: 3/6: Printing table: thyt1_vs_asn
## DESeq2 step 5/5: 4/6: Printing table: thyt0_vs_noasn
## DESeq2 step 5/5: 5/6: Printing table: thyt1_vs_noasn
## DESeq2 step 5/5: 6/6: Printing table: thyt1_vs_thyt0
## Collected coefficients for: asn
## Collected coefficients for: noasn
## Collected coefficients for: thyt0
## Collected coefficients for: thyt1
## Starting edgeR pairwise comparisons.
## The data should be suitable for EdgeR.
## If EdgeR freaks out, check the state of the count table and ensure that it is in integer counts.
## EdgeR step 1/9: normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## EdgeR step 5/9: Estimating GLM Trended dispersion.
## EdgeR step 6/9: Estimating GLM Tagged dispersion.
## EdgeR step 7/9: Running glmFit.
## EdgeR step 8/9: Making pairwise contrasts.
## EdgeR step 9/9: 1/6: Printing table: noasn_vs_asn.
## EdgeR step 9/9: 2/6: Printing table: thyt0_vs_asn.
## EdgeR step 9/9: 3/6: Printing table: thyt1_vs_asn.
## EdgeR step 9/9: 4/6: Printing table: thyt0_vs_noasn.
## EdgeR step 9/9: 5/6: Printing table: thyt1_vs_noasn.
## EdgeR step 9/9: 6/6: Printing table: thyt1_vs_thyt0.
## Starting basic pairwise comparison.
## This basic pairwise function assumes log2, converted, normalized counts, normalizing now.
## Basic step 1/3: Creating median and variance tables.
## Basic step 2/3: Performing comparisons.
## Basic step 2/3: 1/6: Performing log2 subtraction: noasn_vs_asn
## Basic step 2/3: 2/6: Performing log2 subtraction: thyt0_vs_asn
## Basic step 2/3: 3/6: Performing log2 subtraction: thyt1_vs_asn
## Basic step 2/3: 4/6: Performing log2 subtraction: thyt0_vs_noasn
## Basic step 2/3: 5/6: Performing log2 subtraction: thyt1_vs_noasn
## Basic step 2/3: 6/6: Performing log2 subtraction: thyt1_vs_thyt0
## Basic step 3/3: Creating faux DE Tables.
## Basic: Returning tables.
## 1/6: Comparing analyses: noasn_vs_asn
## 2/6: Comparing analyses: thyt0_vs_asn
## 3/6: Comparing analyses: thyt1_vs_asn
## 4/6: Comparing analyses: thyt0_vs_noasn
## 5/6: Comparing analyses: thyt1_vs_noasn
## 6/6: Comparing analyses: thyt1_vs_thyt0

## Holy crap, DESeq2 really doesn't agree with either limma or edgeR!
## I wonder if this is due to low-count filtering?
low_filtered <- s_p(normalize_expt(hsc5_expt, filter=TRUE))$result
## This function will replace the expt$expressionset slot with:
## filter(data)
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
##  some metrics are easier to see when the data is log2 transformed, but
##  EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted.  It is often advisable to cpm/rpkm
##  the data to normalize for sampling differences, keep in mind though that rpkm
##  has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
##  will try to detect this).
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 233 low-count genes (1579 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Performing batch correction at step 5.
## Step 5: not doing batch correction.
low_de <- s_p(all_pairwise(low_filtered, model_batch=TRUE))$result
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$best_libsize.
## Limma step 1/6: choosing model.
## Limma step 2/6: running voom
## The voom input was not cpm, converting now.
## The voom input was not log2, transforming now.
## limma step 3/6: running lmFit
## Limma step 4/6: making and fitting contrasts.
## Limma step 5/6: Running eBayes and topTable.
## Limma step 6/6: Writing limma outputs.
## limma step 6/6: 1/10: Printing table: asn.
## limma step 6/6: 2/10: Printing table: noasn.
## limma step 6/6: 3/10: Printing table: thyt0.
## limma step 6/6: 4/10: Printing table: thyt1.
## limma step 6/6: 5/10: Printing table: noasn_vs_asn.
## limma step 6/6: 6/10: Printing table: thyt0_vs_asn.
## limma step 6/6: 7/10: Printing table: thyt1_vs_asn.
## limma step 6/6: 8/10: Printing table: thyt0_vs_noasn.
## limma step 6/6: 9/10: Printing table: thyt1_vs_noasn.
## limma step 6/6: 10/10: Printing table: thyt1_vs_thyt0.
## Starting DESeq2 pairwise comparisons.
## The data should be suitable for deseq2.
## If deseq freaks out, check the state of the count table and ensure that it is in integer counts.
## DESeq2 step 1/5: Including batch and condition in the deseq model.
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## DESeq2 step 2/5: Estimate size factors.
## DESeq2 step 3/5: estimate Dispersions.
## DESeq2 step 4/5: nbinomWaldTest.
## DESeq2 step 5/5: 1/6: Printing table: noasn_vs_asn
## DESeq2 step 5/5: 2/6: Printing table: thyt0_vs_asn
## DESeq2 step 5/5: 3/6: Printing table: thyt1_vs_asn
## DESeq2 step 5/5: 4/6: Printing table: thyt0_vs_noasn
## DESeq2 step 5/5: 5/6: Printing table: thyt1_vs_noasn
## DESeq2 step 5/5: 6/6: Printing table: thyt1_vs_thyt0
## Collected coefficients for: asn
## Collected coefficients for: noasn
## Collected coefficients for: thyt0
## Collected coefficients for: thyt1
## Starting edgeR pairwise comparisons.
## The data should be suitable for EdgeR.
## If EdgeR freaks out, check the state of the count table and ensure that it is in integer counts.
## EdgeR step 1/9: normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## EdgeR step 5/9: Estimating GLM Trended dispersion.
## EdgeR step 6/9: Estimating GLM Tagged dispersion.
## EdgeR step 7/9: Running glmFit.
## EdgeR step 8/9: Making pairwise contrasts.
## EdgeR step 9/9: 1/6: Printing table: noasn_vs_asn.
## EdgeR step 9/9: 2/6: Printing table: thyt0_vs_asn.
## EdgeR step 9/9: 3/6: Printing table: thyt1_vs_asn.
## EdgeR step 9/9: 4/6: Printing table: thyt0_vs_noasn.
## EdgeR step 9/9: 5/6: Printing table: thyt1_vs_noasn.
## EdgeR step 9/9: 6/6: Printing table: thyt1_vs_thyt0.
## Starting basic pairwise comparison.
## This basic pairwise function assumes log2, converted, normalized counts, normalizing now.
## Basic step 1/3: Creating median and variance tables.
## Basic step 2/3: Performing comparisons.
## Basic step 2/3: 1/6: Performing log2 subtraction: noasn_vs_asn
## Basic step 2/3: 2/6: Performing log2 subtraction: thyt0_vs_asn
## Basic step 2/3: 3/6: Performing log2 subtraction: thyt1_vs_asn
## Basic step 2/3: 4/6: Performing log2 subtraction: thyt0_vs_noasn
## Basic step 2/3: 5/6: Performing log2 subtraction: thyt1_vs_noasn
## Basic step 2/3: 6/6: Performing log2 subtraction: thyt1_vs_thyt0
## Basic step 3/3: Creating faux DE Tables.
## Basic: Returning tables.
## 1/6: Comparing analyses: noasn_vs_asn
## 2/6: Comparing analyses: thyt0_vs_asn
## 3/6: Comparing analyses: thyt1_vs_asn
## 4/6: Comparing analyses: thyt0_vs_noasn
## 5/6: Comparing analyses: thyt1_vs_noasn
## 6/6: Comparing analyses: thyt1_vs_thyt0

## That is part of the difference, but not all of it
## (note that the low end of the color-spectrum is now a rho of 0.6 rather than 0.5)
## I have a suspicion that the way I am including batch in DESeq's model is either incorrect or can be changed.
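
For what it is worth, the standard way to include batch in DESeq2 is as an additive term in the design formula. A hand-rolled sketch outside of hpgltools, assuming integer counts and that the expt design carries ‘condition’ and ‘batch’ columns:

## Hypothetical direct DESeq2 invocation with ~ batch + condition.
counts <- Biobase::exprs(hsc5_expt$expressionset)
meta <- data.frame(condition = as.factor(hsc5_expt$design$condition),
                   batch = as.factor(hsc5_expt$design$batch))
dds <- DESeq2::DESeqDataSetFromMatrix(countData = round(counts),
                                      colData = meta,
                                      design = ~ batch + condition)
dds <- DESeq2::DESeq(dds)
deseq_noasn_vs_asn <- DESeq2::results(dds, contrast = c("condition", "noasn", "asn"))
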
keepers <- list(
    "thyt1_thyt0" = c("thyt1", "thyt0"),
    "asnminus_asnplus" = c("noasn", "asn"))
initial_result <- combine_de_tables(low_de, annot_df=short_annotations_hsc5,
                                    keepers=keepers, excel="excel/initial_fitness.xlsx")
## Deleting the file excel/initial_fitness.xlsx before writing the tables.
## Working on 1/2: thyt1_thyt0
## Found table with thyt1_vs_thyt0
## Running create_combined_table with inverse=FALSE
## The table is: thyt1_vs_thyt0
##            table total limma_up limma_sigup deseq_up deseq_sigup edger_up
## 1 thyt1_vs_thyt0  1579       38           6        6           6       49
##   edger_sigup basic_up basic_sigup limma_down limma_sigdown deseq_down
## 1          10       61           7         74            35          7
##   deseq_sigdown edger_down edger_sigdown basic_down basic_sigdown meta_up
## 1             7         63            26         53            10      27
##   meta_sigup meta_down meta_sigdown
## 1         14        55           36
## Working on 2/2: asnminus_asnplus
## Found table with noasn_vs_asn
## Running create_combined_table with inverse=FALSE
## The table is: noasn_vs_asn
##            table total limma_up limma_sigup deseq_up deseq_sigup edger_up
## 1 thyt1_vs_thyt0  1579       38           6        6           6       49
## 2   noasn_vs_asn  1579       41          21       18          18       50
##   edger_sigup basic_up basic_sigup limma_down limma_sigdown deseq_down
## 1          10       61           7         74            35          7
## 2          22       68          15        117            68         29
##   deseq_sigdown edger_down edger_sigdown basic_down basic_sigdown meta_up
## 1             7         63            26         53            10      27
## 2            29        119            62         74            21      36
##   meta_sigup meta_down meta_sigdown
## 1         14        55           36
## 2         23        82           60
## Attempting to add a coefficient plot for thyt1_thyt0 at column 37
## Warning: Removed 139 rows containing non-finite values (stat_smooth).
## Warning: Removed 139 rows containing missing values (geom_point).
## Attempting to add a coefficient plot for asnminus_asnplus at column 37

## Warning: Removed 137 rows containing non-finite values (stat_smooth).
## Warning: Removed 137 rows containing missing values (geom_point).
## Writing summary information.
## The sheet pairwise_summary already exists, it will get overwritten
## Attempting to add the comparison plot to pairwise_summary at row: 20 and column: 1
## Performing save of the workbook.

initial_result$plots$thyt1_thyt0
## Warning: Removed 139 rows containing non-finite values (stat_smooth).
## Warning: Removed 139 rows containing missing values (geom_point).

## Showing relative counts between t1 and t0.

test_plot <- initial_result$plots$asnminus_asnplus
## The gene of most interest is L897_RS06150
favorite_gene <- test_plot$data["L897_RS06150",]
favorite_gene
##              first second
## L897_RS06150 6.884  8.362
2^8.4 / 2^6.8
## [1] 3.031
## Ok, that is encouraging at least, it supports 3x more insertions when in asn+ media.
test_plot +
    ggrepel::geom_text_repel(data=favorite_gene, nudge_x=-2, nudge_y=1,
                             ggplot2::aes(x=first, y=second, label = rownames(favorite_gene))) +
    ggplot2::geom_point(data=favorite_gene, ggplot2::aes(x=first, y=second), colour="red")
## Warning: Removed 137 rows containing non-finite values (stat_smooth).
## Warning: Removed 137 rows containing missing values (geom_point).

Dval

The following is from Miriam: The CDS Dval is calculated by dividing the number of hits in each gene by the number of TAs. I guess plotting it against length might be interesting?
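
Following Miriam's definition, here is a hand-rolled sketch of a per-gene Dval: hits per gene divided by the number of TA sites within the gene. It assumes a single-contig genome fasta and reuses gene_annotations_hsc5 from above; it approximates, rather than reproduces, what the convert_counts() call below does:

## Rough per-gene Dval: reads per gene / TA sites per gene.
genome <- Biostrings::readDNAStringSet("reference/mgas_hsc5.fasta")[[1]]
ta_per_gene <- apply(gene_annotations_hsc5, 1, function(g) {
    gene_seq <- Biostrings::subseq(genome, start=as.numeric(g[["start"]]),
                                   end=as.numeric(g[["end"]]))
    ## TA is its own reverse complement, so strand does not matter here.
    Biostrings::countPattern("TA", gene_seq)
})
counts <- Biobase::exprs(hsc5_expt$expressionset)
shared <- intersect(rownames(counts), names(ta_per_gene))
## Genes with zero TA sites divide to Inf; they carry no insertion sites at all.
dval_sketch <- counts[shared, ] / ta_per_gene[shared]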

dval_hsc5 <- convert_counts(hsc5_expt, convert="cp_seq_m", annotations="reference/mgas_hsc5.gff.gz", fasta="reference/mgas_hsc5.fasta", entry_type="gene")

quick_csv <- dval_hsc5$count_table
colnames(quick_csv) <- make.names(hsc5_expt$design$condition, unique=TRUE)
write.csv(quick_csv, file="excel/dval_alone.csv")

dval_df <- as.data.frame(dval_hsc5$count_table)
dval_df$median <- log2(matrixStats::rowMedians(as.matrix(dval_df)) + 1)
dval_df <- merge(dval_df, gene_lengths, by="row.names")
dval_df$median_vs_width <- dval_df$median / dval_df$width
rownames(dval_df) <- dval_df$ID
keepers <- list(
    "thyt1_thyt0" = c("thyt1","thyt0"),
    "asnminus_asnplus" = c("noasn","asn")
    )
initial_result <- combine_de_tables(low_de, annot_df=dval_df, excel="excel/initial_fitness_withdval.xlsx", keepers=keepers)
## Deleting the file excel/initial_fitness_withdval.xlsx before writing the tables.
## Working on 1/2: thyt1_thyt0
## Found table with thyt1_vs_thyt0
## Running create_combined_table with inverse=FALSE
## The table is: thyt1_vs_thyt0
##            table total limma_up limma_sigup deseq_up deseq_sigup edger_up
## 1 thyt1_vs_thyt0  1579       38           6        6           6       49
##   edger_sigup basic_up basic_sigup limma_down limma_sigdown deseq_down
## 1          10       61           7         74            35          7
##   deseq_sigdown edger_down edger_sigdown basic_down basic_sigdown meta_up
## 1             7         63            26         53            10      27
##   meta_sigup meta_down meta_sigdown
## 1         14        55           36
## Working on 2/2: asnminus_asnplus
## Found table with noasn_vs_asn
## Running create_combined_table with inverse=FALSE
## The table is: noasn_vs_asn
##            table total limma_up limma_sigup deseq_up deseq_sigup edger_up
## 1 thyt1_vs_thyt0  1579       38           6        6           6       49
## 2   noasn_vs_asn  1579       41          21       18          18       50
##   edger_sigup basic_up basic_sigup limma_down limma_sigdown deseq_down
## 1          10       61           7         74            35          7
## 2          22       68          15        117            68         29
##   deseq_sigdown edger_down edger_sigdown basic_down basic_sigdown meta_up
## 1             7         63            26         53            10      27
## 2            29        119            62         74            21      36
##   meta_sigup meta_down meta_sigdown
## 1         14        55           36
## 2         23        82           60
## Converted rownames to characters.
## Attempting to add a coefficient plot for thyt1_thyt0 at column 48
## Warning: Removed 139 rows containing non-finite values (stat_smooth).
## Warning: Removed 139 rows containing missing values (geom_point).
## Converted rownames to characters.
## Attempting to add a coefficient plot for asnminus_asnplus at column 48

## Warning: Removed 137 rows containing non-finite values (stat_smooth).
## Warning: Removed 137 rows containing missing values (geom_point).
## Writing summary information.
## The sheet pairwise_summary already exists, it will get overwritten
## Attempting to add the comparison plot to pairwise_summary at row: 20 and column: 1
## Performing save of the workbook.

dval_width <- dval_df[, c("width", "median")]
plot_scatter(dval_width)

density_mtrx <- as.matrix(Biobase::exprs(hsc5_expt$expressionset))
density_df <- as.data.frame(density_mtrx)
density_df$median <- log2(matrixStats::rowMedians(density_mtrx) + 1)
density_df <- merge(density_df, gene_lengths, by="row.names")
density_df <- density_df[, c("width", "median")]
plot_scatter(density_df)

xlsx_writer <- hpgltools::write_xls(dval_df, sheet="dval")
## Converted Row.names to characters.
## Error in openxlsx::writeDataTable(wb, sheet, x = data, tableStyle = "TableStyleMedium9", : Column names of x must be case-insensitive unique.
openxlsx::saveWorkbook(wb=xlsx_writer$workbook, file="excel/dval.xlsx", overwrite=TRUE)

View a distribution from essentiality

While Yoann is here, let's do a quick histogram and see if the essentiality metrics make sense. As we go from m1 -> m16, the sensitivity of the essentiality tool decreases and a larger number of spurious 0's occurs.

## Testing to see if some parameters are better than others
run2_thyt0_m1 <- read.csv("essentiality/mh_ess/run2/05v1M1l20gen_thyt0_genome.bam_essentiality_m1.csv", comment.char="#", header=FALSE, sep="\t")
colnames(run2_thyt0_m1) <- c("ORF","k","n","r","s","zbar","call")
plot_histogram(run2_thyt0_m1$zbar) + ggplot2::scale_y_continuous(limits=c(0,6))
## No binwidth provided, setting it to 0.004 in order to have 500 bins.

run2_thyt0_m2 <- read.csv("essentiality/mh_ess/run2/05v1M1l20gen_thyt0_genome.bam_essentiality_m2.csv", comment.char="#", header=FALSE, sep="\t")
colnames(run2_thyt0_m2) <- c("ORF","k","n","r","s","zbar","call")
plot_histogram(run2_thyt0_m2$zbar) + ggplot2::scale_y_continuous(limits=c(0,6))
## No binwidth provided, setting it to 0.004 in order to have 500 bins.

run2_thyt0_m4 <- read.csv("essentiality/mh_ess/run2/05v1M1l20gen_thyt0_genome.bam_essentiality_m4.csv", comment.char="#", header=FALSE, sep="\t")
colnames(run2_thyt0_m4) <- c("ORF","k","n","r","s","zbar","call")
plot_histogram(run2_thyt0_m4$zbar) + ggplot2::scale_y_continuous(limits=c(0,6))
## No binwidth provided, setting it to 0.004 in order to have 500 bins.

## This is not good
run2_thyt0_m16 <- read.csv("essentiality/mh_ess/run2/05v1M1l20gen_thyt0_genome.bam_essentiality_m16.csv", comment.char="#", header=FALSE, sep="\t")
colnames(run2_thyt0_m16) <- c("ORF","k","n","r","s","zbar","call")
plot_histogram(run2_thyt0_m16$zbar) + ggplot2::scale_y_continuous(limits=c(0,6))
## No binwidth provided, setting it to 0.004 in order to have 500 bins.
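
As a quick sanity check on the calls themselves, one can tabulate the ‘call’ column; assuming these files follow the Gumbel-style output, the values should be along the lines of E (essential), NE (non-essential), S (too short to call), and U (uncertain):

## Tabulate the essentiality calls from the m1 parameters.
table(run2_thyt0_m1[["call"]])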

Count TAs

The function ‘tnseq_saturation()’ gives some idea of how well saturated each library is. In the ideal world, the number of zeros is lower while the numbers of 8s, 16s, 32s, etc. are higher.
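
A minimal sketch of the same idea, tabulating reads per TA position straight from one of the wig files. It assumes the usual wig layout of a single header line followed by position/count pairs; the real files may carry more header lines:

## Tabulate reads per TA position from a wig file.
wig_file <- "essentiality/mh_ess/run2/tas_05v1M1l20gen_thyt0_genome.bam.txt.wig"
wig <- read.table(wig_file, skip=1, col.names=c("position", "reads"))
## Fraction of TA sites with no observed insertions (lower is better).
mean(wig[["reads"]] == 0)
## A well saturated library shifts weight out of the zero bin into the higher bins.
table(cut(wig[["reads"]], breaks=c(-Inf, 0, 1, 2, 4, 8, 16, 32, Inf)))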

tas_thyt0 <- tnseq_saturation("essentiality/mh_ess/run2/tas_05v1M1l20gen_thyt0_genome.bam.txt.wig")
tas_thyt0
tas_thyt1 <- tnseq_saturation("essentiality/mh_ess/run2/tas_05v1M1l20gen_thyt1_genome.bam.txt.wig")
tas_thyt1
tas_asp <- tnseq_saturation("essentiality/mh_ess/run2/tas_05v1M1l20gen_asp_genome.bam.txt.wig")
tas_asp
tas_noasp <- tnseq_saturation("essentiality/mh_ess/run2/tas_05v1M1l20gen_noasp_genome.bam.txt.wig")
tas_noasp

Some circos goodness

Let's make a right quick plot of the library densities.

## norm_data was not defined above; assuming the normalized matrix from norm_hsc5.
norm_data <- as.data.frame(Biobase::exprs(norm_hsc5$expressionset))
merged <- merge(gene_annotations_hsc5, norm_data, by="row.names")
rownames(merged) <- merged$Row.names

hsc5_cfg <- hpgltools:::circos_prefix("hsc5")
hsc5_kary <- hpgltools:::circos_karyotype(name="hsc5", fasta="reference/mgas_hsc5.fasta")
## annotation_hsc5_info was commented out above; genes_hsc5 carries the same coordinates.
go_table <- genes_hsc5[, c("start", "end", "strand")]
go_table$go <- ""
hsc5_plus_minus <- hpgltools:::circos_plus_minus(go_table, cfgout=hsc5_cfg)
hsc5_thyt0 <- hpgltools:::circos_hist(merged, cfgout=hsc5_cfg, colname="HPGL0596", outer=hsc5_plus_minus, fill_color="black", color="black")
hsc5_thyt1 <- hpgltools:::circos_hist(merged, cfgout=hsc5_cfg, colname="HPGL0597", outer=hsc5_thyt0, fill_color="blue", color="black")
hsc5_noasp <- hpgltools:::circos_hist(merged, cfgout=hsc5_cfg, colname="HPGL0599", outer=hsc5_thyt1, fill_color="green", color="black")
hsc5_asp <- hpgltools:::circos_hist(merged, cfgout=hsc5_cfg, colname="HPGL0598", outer=hsc5_noasp, fill_color="red", color="black")
hpgltools:::circos_suffix(hsc5_cfg)
hpgltools:::circos_make("hsc5")