1 TNSeq of S. agalactiae and 3 concentrations of calprotectin.

This worksheet aims to lay out the tasks I performed when analyzing some TNSeq data of a group B streptococcus.

2 Three concentrations of calprotectin (none, 60, and 480; units still to be confirmed: mM, mg/mL, or something else?)

Despite the oddities in processing the raw reads, the coverage looks good to me and some genes are obviously essential. The next question: did any gene change status as more calprotectin was added?
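One way to ask that question is sketched below. It assumes one essentiality table per condition with the column names that appear in the parsing output of section 7.1 (control_orf/control_call, low_orf/low_call, high_orf/high_call). The gene IDs and the call codes (E, NE, U) are placeholders invented for illustration, not values from this data set.

```r
# Toy per-condition essentiality tables; IDs and calls are invented.
control <- data.frame(control_orf = c("g1", "g2", "g3"),
                      control_call = c("E", "NE", "U"),
                      stringsAsFactors = FALSE)
low <- data.frame(low_orf = c("g1", "g2", "g3"),
                  low_call = c("E", "NE", "NE"),
                  stringsAsFactors = FALSE)
high <- data.frame(high_orf = c("g1", "g2", "g3"),
                   high_call = c("E", "E", "NE"),
                   stringsAsFactors = FALSE)

# Join the three tables on the gene ID; the key keeps the name control_orf.
calls <- merge(control, low, by.x = "control_orf", by.y = "low_orf")
calls <- merge(calls, high, by.x = "control_orf", by.y = "high_orf")

# A gene changed status if its call is not identical in all three conditions.
changed <- calls[calls$control_call != calls$low_call |
                 calls$low_call != calls$high_call, ]
changed
```

In this toy input g2 and g3 are reported, because their calls differ between at least two conditions.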

3 Grab annotation data for Streptococcus agalactiae.

The ID for strain a909 at microbesonline.org is: 205921.

Let us load the annotations from my gff file along with those from microbesonline.

As a heads up, the count tables use IDs which look like: SAK_RS00185. These match the locus_tag column of the gff annotations. I first took them to match the microbesonline ‘sysName’ column as well, but that is incorrect: ‘sysName’ is the same as the gff ‘old_locus_tag’ column. In addition, there are a bunch of unused columns in both data sets which we likely want to prune.
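A quick way to confirm which annotation column matches the count-table IDs is to measure the overlap directly. The sketch below uses toy data frames: only SAK_RS00185 is quoted from the count tables, and every other ID (in particular the old-style tags) is invented for illustration.

```r
# Toy data; all IDs other than SAK_RS00185 are invented.
count_ids <- c("SAK_RS00185", "SAK_RS00190", "SAK_RS00195")
gff_annot <- data.frame(
  locus_tag = c("SAK_RS00185", "SAK_RS00190", "SAK_RS00195"),
  old_locus_tag = c("SAK_0037", "SAK_0038", "SAK_0039"),
  stringsAsFactors = FALSE)
microbes_annot <- data.frame(
  sysName = c("SAK_0037", "SAK_0038", "SAK_0039"),
  stringsAsFactors = FALSE)

# Fraction of count-table IDs found in a candidate column.
match_fraction <- function(ids, column) mean(ids %in% column)

fractions <- c(
  locus_tag = match_fraction(count_ids, gff_annot$locus_tag),
  old_locus_tag = match_fraction(count_ids, gff_annot$old_locus_tag),
  sysName = match_fraction(count_ids, microbes_annot$sysName))
fractions

# Bridge the microbesonline rows to the count IDs through old_locus_tag.
bridged <- merge(gff_annot, microbes_annot,
                 by.x = "old_locus_tag", by.y = "sysName")
```

The column with a fraction near 1 is the one to merge on. The bridged table then carries locus_tag alongside the microbesonline rows.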

There are three relatively closely related strains which may be sufficiently similar to use in this analysis. The actual strain is cjb111, but its genome has not yet been finished, as far as I can tell. Therefore I will repeat most (all?) tasks with strains a909 and 2603V/R (written vr2603 or 2603vr in my file names) to see if they may be more useful.

3.1 Strain a909

## Found 1 entry.
##                             Genome     Phylum Paper     Loaded Complete #Chr.
## 3107 Streptococcus agalactiae A909 Firmicutes   yes 2006-01-18      yes     1
##      #Plasmids #Genes tax_id
## 3107         0   2136 205921
## The species being downloaded is: Streptococcus agalactiae A909
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=205921;export=tab
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Returning a df with 34 columns and 4426 rows.

3.2 Strain cjb111

## Found 1 entry.
##                               Genome     Phylum Paper     Loaded Complete #Chr.
## 3108 Streptococcus agalactiae CJB111 Firmicutes   yes 2007-05-08       no    NA
##      #Plasmids #Genes tax_id
## 3108        NA   2208 342617
## The species being downloaded is: Streptococcus agalactiae CJB111
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=342617;export=tab
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Returning a df with 14 columns and 2208 rows.

3.3 Strain 2603V/R

microbesonline lists this strain as 2603V/R. I get confused and sometimes write it as vr2603 or 2603vr; under a few specific circumstances R acts strangely when names start with numbers.

## Found 1 entry.
##                                Genome     Phylum Paper Loaded Complete #Chr.
## 3105 Streptococcus agalactiae 2603V/R Firmicutes   yes             yes     1
##      #Plasmids #Genes tax_id
## 3105         0   2273 208435
## The species being downloaded is: Streptococcus agalactiae 2603V/R
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=208435;export=tab
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Returning a df with 28 columns and 4611 rows.

4 Create Expressionsets

The following block merges the various counts, annotations, and experimental metadata.

Just as with the annotations, I will create one expressionset for each strain.
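The merge described here can be sketched with base R alone. This is not the function which produced the output below; it only illustrates the steps the log reports: merge the per-sample count tables, join the annotations, and set lost annotations to 'undefined'. All gene IDs, sample names, and values are hypothetical.

```r
# Toy count tables, one per sample (hypothetical IDs and values).
sample_01 <- data.frame(id = c("g1", "g2", "g3"), s01 = c(10, 0, 5))
sample_02 <- data.frame(id = c("g1", "g2", "g3"), s02 = c(12, 1, 7))
counts <- Reduce(function(a, b) merge(a, b, by = "id"),
                 list(sample_01, sample_02))

# Toy annotations which lack an entry for g3.
annot <- data.frame(id = c("g1", "g2"),
                    description = c("kinase", "transporter"),
                    stringsAsFactors = FALSE)

# Left join so that every counted gene is kept, then fill the gaps.
merged <- merge(counts, annot, by = "id", all.x = TRUE)
merged$description[is.na(merged$description)] <- "undefined"

# Sample metadata: one row per count column.
metadata <- data.frame(sample = c("s01", "s02"),
                       condition = c("control", "cal_low"),
                       stringsAsFactors = FALSE)
stopifnot(all(metadata$sample %in% colnames(merged)))
merged
```

The left join (all.x = TRUE) is the design choice that matters: genes with counts but no annotation stay in the table rather than silently disappearing.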

4.1 Strain a909

## Reading the sample metadata.
## The sample definitions comprises: 9 rows(samples) and 29 columns(metadata fields).
## Reading count tables.
## Reading count tables with read.table().
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/tnseq/sagalacticae_2019/preprocessing/01/outputs/bowtie_sagalactiae_a909/trimmed_ca-v0M1.count.xz contains 2212 rows.
## preprocessing/02/outputs/bowtie_sagalactiae_a909/trimmed_ca-v0M1.count.xz contains 2212 rows and merges to 2212 rows.
## preprocessing/03/outputs/bowtie_sagalactiae_a909/trimmed_ca-v0M1.count.xz contains 2212 rows and merges to 2212 rows.
## preprocessing/04/outputs/bowtie_sagalactiae_a909/trimmed_ca-v0M1.count.xz contains 2212 rows and merges to 2212 rows.
## preprocessing/05/outputs/bowtie_sagalactiae_a909/trimmed_ca-v0M1.count.xz contains 2212 rows and merges to 2212 rows.
## preprocessing/06/outputs/bowtie_sagalactiae_a909/trimmed_ca-v0M1.count.xz contains 2212 rows and merges to 2212 rows.
## preprocessing/07/outputs/bowtie_sagalactiae_a909/trimmed_ca-v0M1.count.xz contains 2212 rows and merges to 2212 rows.
## preprocessing/08/outputs/bowtie_sagalactiae_a909/trimmed_ca-v0M1.count.xz contains 2212 rows and merges to 2212 rows.
## preprocessing/09/outputs/bowtie_sagalactiae_a909/trimmed_ca-v0M1.count.xz contains 2212 rows and merges to 2212 rows.
## Finished reading count tables.
## Matched 2048 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 2207 rows and 9 columns.
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete

## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete
## Attempting mixed linear model with: ~  (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:14 s
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete

## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete
## Attempting mixed linear model with: ~  (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:16 s
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.

4.2 Strain cjb111

## Reading the sample metadata.
## The sample definitions comprises: 9 rows(samples) and 29 columns(metadata fields).
## Reading count tables.
## Reading count tables with read.table().
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/tnseq/sagalacticae_2019/preprocessing/01/outputs/bowtie_sagalactiae_cjb111/trimmed_ca-v0M1.count.xz contains 2213 rows.
## preprocessing/02/outputs/bowtie_sagalactiae_cjb111/trimmed_ca-v0M1.count.xz contains 2213 rows and merges to 2213 rows.
## preprocessing/03/outputs/bowtie_sagalactiae_cjb111/trimmed_ca-v0M1.count.xz contains 2213 rows and merges to 2213 rows.
## preprocessing/04/outputs/bowtie_sagalactiae_cjb111/trimmed_ca-v0M1.count.xz contains 2213 rows and merges to 2213 rows.
## preprocessing/05/outputs/bowtie_sagalactiae_cjb111/trimmed_ca-v0M1.count.xz contains 2213 rows and merges to 2213 rows.
## preprocessing/06/outputs/bowtie_sagalactiae_cjb111/trimmed_ca-v0M1.count.xz contains 2213 rows and merges to 2213 rows.
## preprocessing/07/outputs/bowtie_sagalactiae_cjb111/trimmed_ca-v0M1.count.xz contains 2213 rows and merges to 2213 rows.
## preprocessing/08/outputs/bowtie_sagalactiae_cjb111/trimmed_ca-v0M1.count.xz contains 2213 rows and merges to 2213 rows.
## preprocessing/09/outputs/bowtie_sagalactiae_cjb111/trimmed_ca-v0M1.count.xz contains 2213 rows and merges to 2213 rows.
## Finished reading count tables.
## Matched 2208 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 2208 rows and 9 columns.
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete

## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete
## Attempting mixed linear model with: ~  (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:11 s
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete

## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete
## Attempting mixed linear model with: ~  (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:13 s
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.

4.3 Strain 2603V/R

## Reading the sample metadata.
## The sample definitions comprises: 9 rows(samples) and 29 columns(metadata fields).
## Reading count tables.
## Reading count tables with read.table().
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/tnseq/sagalacticae_2019/preprocessing/01/outputs/bowtie_sagalactiae_2603vr/trimmed_ca-v0M1.count.xz contains 2284 rows.
## preprocessing/02/outputs/bowtie_sagalactiae_2603vr/trimmed_ca-v0M1.count.xz contains 2284 rows and merges to 2284 rows.
## preprocessing/03/outputs/bowtie_sagalactiae_2603vr/trimmed_ca-v0M1.count.xz contains 2284 rows and merges to 2284 rows.
## preprocessing/04/outputs/bowtie_sagalactiae_2603vr/trimmed_ca-v0M1.count.xz contains 2284 rows and merges to 2284 rows.
## preprocessing/05/outputs/bowtie_sagalactiae_2603vr/trimmed_ca-v0M1.count.xz contains 2284 rows and merges to 2284 rows.
## preprocessing/06/outputs/bowtie_sagalactiae_2603vr/trimmed_ca-v0M1.count.xz contains 2284 rows and merges to 2284 rows.
## preprocessing/07/outputs/bowtie_sagalactiae_2603vr/trimmed_ca-v0M1.count.xz contains 2284 rows and merges to 2284 rows.
## preprocessing/08/outputs/bowtie_sagalactiae_2603vr/trimmed_ca-v0M1.count.xz contains 2284 rows and merges to 2284 rows.
## preprocessing/09/outputs/bowtie_sagalactiae_2603vr/trimmed_ca-v0M1.count.xz contains 2284 rows and merges to 2284 rows.
## Finished reading count tables.
## Matched 2193 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 2279 rows and 9 columns.
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete

## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete
## Attempting mixed linear model with: ~  (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:11 s
## A couple of common errors:
## An error like 'vtv downdated' may be because there are too many 0s, filter the data and rerun.
## An error like 'number of levels of each grouping factor must be < number of observations' means
## that the factor used is not appropriate for the analysis - it really only works for factors
## which are shared among multiple samples.
## Retrying with only condition in the model.
## Loading required package: Matrix
## 
## Total:6 s
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete

## Warning in doTryCatch(return(expr), name, parentenv, handler): display list
## redraw incomplete
## Attempting mixed linear model with: ~  (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:15 s
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.

5 A few diagnostic plots

5.3 Strain 2603V/R

I think this looks reasonable, though it makes me wonder whether samples 04 and 09 were switched. As long as we are willing to state that the primary difference is between calprotectin and control, I would suggest not pursuing this; it is reasonable to assume the samples are not switched and this is just how they are. If, however, the primary goal is to investigate changing concentrations of calprotectin, then I would want to look into this distribution of samples, or else state that the two concentrations show no significant difference unless we get more samples to look at.
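A simple numeric companion to eyeballing the plot is to ask, for every sample, which other sample it correlates with best and whether that neighbour shares its condition label. The sketch below is an illustrative technique on invented toy counts, not the method used for the plots in this worksheet. With so few replicates it can only hint at a swap, never prove one.

```r
# Toy counts (genes in rows, samples in columns); all values are invented.
counts <- cbind(
  ctrl_1 = c(10, 20, 30, 40),
  ctrl_2 = c(11, 19, 31, 39),
  cal_1 = c(40, 30, 20, 10),
  cal_2 = c(39, 31, 19, 11))
condition <- c(ctrl_1 = "control", ctrl_2 = "control",
               cal_1 = "cal", cal_2 = "cal")

# Pairwise sample correlations; mask the diagonal so that a sample cannot
# be its own nearest neighbour.
cors <- cor(counts)
diag(cors) <- -Inf
nearest <- colnames(cors)[apply(cors, 2, which.max)]
names(nearest) <- colnames(cors)

# Flag samples whose nearest neighbour carries a different condition label.
suspect <- names(nearest)[condition[nearest] != condition[names(nearest)]]
suspect
```

Here no sample is flagged, because each toy sample sits closest to its own replicate.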

6 Check tnseq saturation

I moved this above the differential “expression”/“fitness” analysis so that we can add the results from it as annotation data to the DE tables if requested.

saturation_01 <- tnseq_saturation(
  "preprocessing/01/outputs/essentiality_sagalactiae_a909/trimmed_ca-v0M1.wig",
  adjust=2)
saturation_01$plot
saturation_01$hits_summary
ess_plts <- plot_essentiality(
  "preprocessing/01/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
ess_plts[["zbar"]]

saturation_02 <- tnseq_saturation(
  "preprocessing/02/outputs/essentiality_sagalactiae_a909/trimmed_ca-v0M1.wig")
saturation_02$plot
saturation_02$hits_summary
ess_plts <- plot_essentiality(
  "preprocessing/02/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
ess_plts[["zbar"]]

saturation_03 <- tnseq_saturation(
  "preprocessing/03/outputs/essentiality_sagalactiae_a909/trimmed_ca-v0M1.wig")
saturation_03$plot
saturation_03$hits_summary
ess_plts <- plot_essentiality(
  "preprocessing/03/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
ess_plts[["zbar"]]

saturation_04 <- tnseq_saturation(
  "preprocessing/04/outputs/essentiality_sagalactiae_a909/trimmed_ca-v0M1.wig")
saturation_04$plot
saturation_04$hits_summary
ess_plts <- plot_essentiality(
  "preprocessing/04/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
ess_plts[["zbar"]]

saturation_05 <- tnseq_saturation(
  "preprocessing/05/outputs/essentiality_sagalactiae_a909/trimmed_ca-v0M1.wig")
saturation_05$plot
saturation_05$hits_summary
ess_plts <- plot_essentiality(
  "preprocessing/05/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
ess_plts[["zbar"]]

saturation_06 <- tnseq_saturation(
  "preprocessing/06/outputs/essentiality_sagalactiae_a909/trimmed_ca-v0M1.wig")
saturation_06$plot
saturation_06$hits_summary
ess_plts <- plot_essentiality(
  "preprocessing/06/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
ess_plts[["zbar"]]

saturation_07 <- tnseq_saturation(
  "preprocessing/07/outputs/essentiality_sagalactiae_a909/trimmed_ca-v0M1.wig")
saturation_07$plot
saturation_07$hits_summary
ess_plts <- plot_essentiality(
  "preprocessing/07/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
ess_plts[["zbar"]]

saturation_08 <- tnseq_saturation(
  "preprocessing/08/outputs/essentiality_sagalactiae_a909/trimmed_ca-v0M1.wig")
saturation_08$plot
saturation_08$hits_summary
ess_plts <- plot_essentiality(
  "preprocessing/08/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
ess_plts[["zbar"]]

saturation_09 <- tnseq_saturation(
  "preprocessing/09/outputs/essentiality_sagalactiae_a909/trimmed_ca-v0M1.wig")
saturation_09$plot
saturation_09$hits_summary
ess_plts <- plot_essentiality(
  "preprocessing/09/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
ess_plts[["zbar"]]


saturation_control <- tnseq_saturation(
  "preprocessing/combined_control/outputs/essentiality_sagalactiae_a909/trimmed_ca-v1m1.wig")
saturation_control$plot
ess_plts <- plot_essentiality(
  "preprocessing/combined_control/outputs/essentiality_sagalactiae_a909/mh_ess-trimmed_ca-v1m1_gene_tas_m2.csv")
ess_plts[["zbar"]]
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
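The per-sample calls above differ only in the sample number, so the input paths can also be generated rather than typed, which guards against mismatched sample numbers. The sketch below builds the paths for samples 01 through 09 and only calls tnseq_saturation() and plot_essentiality() when those functions and the input files are actually present. It uses the default arguments throughout, so the adjust=2 used for sample 01 would need to be passed separately.

```r
# Sample directories are numbered 01 through 09.
samples <- sprintf("%02d", 1:9)
ess_dir <- file.path("preprocessing", samples, "outputs",
                     "essentiality_sagalactiae_a909")
wig_files <- file.path(ess_dir, "trimmed_ca-v0M1.wig")
csv_files <- file.path(ess_dir, "mh_ess-trimmed_ca-v0M1_gene_tas_m2.csv")
names(wig_files) <- samples
names(csv_files) <- samples

# Only run the analysis when the functions and the inputs are available.
if (exists("tnseq_saturation") && exists("plot_essentiality") &&
    all(file.exists(wig_files)) && all(file.exists(csv_files))) {
  saturations <- lapply(wig_files, tnseq_saturation)
  ess_plots <- lapply(csv_files, plot_essentiality)
}
```

Individual results can then be pulled out by sample name, for example saturations[["05"]]$plot.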

7 Changed genes

For differential expression, I am going to assume, until I hear otherwise, that my batch assignments are not correct and that the 1, 2, 3 assignments in the sample names do not actually delineate separate batches. If they do delineate separate batches, that might be taken as a (very) small degree of evidence that 04 and 09 were switched.
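Under that assumption the model contains only condition. The sketch below builds such a design in base R and makes control the reference level, which is what the DESeq2 note in the output below recommends. The level names (control, cal_low, cal_high) are taken from the logs; the assignment of three samples to each condition is hypothetical and stands in for the real sample sheet.

```r
# Hypothetical sample-to-condition assignment: 9 samples, 3 per condition.
condition <- factor(rep(c("control", "cal_low", "cal_high"), each = 3))
levels(condition)  # alphabetical order, so control is not the reference

# Make control the reference level.
condition <- relevel(condition, ref = "control")

# Condition-only designs, with and without an intercept.
design_intercept <- model.matrix(~ condition)
design_cellmeans <- model.matrix(~ 0 + condition)
colnames(design_cellmeans)
```

The no-intercept form (~ 0 + condition) gives one column per condition, which makes pairwise contrasts such as cal_low versus control straightforward to write.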

7.1 Strain a909

## Parsed with column specification:
## cols(
##   control_orf = col_character(),
##   control_k = col_double(),
##   control_n = col_double(),
##   control_r = col_double(),
##   control_s = col_double(),
##   control_zbar = col_double(),
##   control_call = col_character()
## )
## Parsed with column specification:
## cols(
##   low_orf = col_character(),
##   low_k = col_double(),
##   low_n = col_double(),
##   low_r = col_double(),
##   low_s = col_double(),
##   low_zbar = col_double(),
##   low_call = col_character()
## )
## Parsed with column specification:
## cols(
##   high_orf = col_character(),
##   high_k = col_double(),
##   high_n = col_double(),
##   high_r = col_double(),
##   high_s = col_double(),
##   high_zbar = col_double(),
##   high_call = col_character()
## )
## Plotting a PCA before surrogates/batch inclusion.
## Not putting labels on the plot.
## Assuming no batch in model for testing pca.
## Not putting labels on the plot.
## Starting basic_pairwise().
## Starting basic pairwise comparison.
## Basic step 0/3: Filtering data.
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## Basic step 0/3: Transforming data.
## Basic step 1/3: Creating mean and variance tables.
## Basic step 2/3: Performing 6 comparisons.
## Basic step 3/3: Creating faux DE Tables.
## Basic: Returning tables.
## Starting deseq_pairwise().
## Starting DESeq2 pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Choosing the non-intercept containing model.
## DESeq2 step 1/5: Including only condition in the deseq model.
## converting counts to integer mode
##   it appears that the last variable in the design formula, 'condition',
##   has a factor level, 'control', which is not the reference level. we recommend
##   to use factor(...,levels=...) or relevel() to set this as the reference level
##   before proceeding. for more information, please see the 'Note on factor levels'
##   in vignette('DESeq2').
##   it appears that the last variable in the design formula, 'condition',
##   has a factor level, 'control', which is not the reference level. we recommend
##   to use factor(...,levels=...) or relevel() to set this as the reference level
##   before proceeding. for more information, please see the 'Note on factor levels'
##   in vignette('DESeq2').
## DESeq2 step 2/5: Estimate size factors.
## DESeq2 step 3/5: Estimate dispersions.
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Using a parametric fitting seems to have worked.
## DESeq2 step 4/5: nbinomWaldTest.
## Starting ebseq_pairwise().
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Starting EBSeq pairwise subset.
## Choosing the non-intercept containing model.
## Starting EBTest of cal_high vs. cal_low.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of cal_high vs. control.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of cal_low vs. control.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting edger_pairwise().
## Starting edgeR pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Choosing the non-intercept containing model.
## EdgeR step 1/9: Importing and normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## EdgeR step 5/9: Estimating GLM Trended dispersion.
## EdgeR step 6/9: Estimating GLM Tagged dispersion.
## EdgeR step 7/9: Running glmFit, switch to glmQLFit by changing the argument 'edger_test'.
## EdgeR step 8/9: Making pairwise contrasts.

## Starting limma_pairwise().
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$libsize.
## Limma step 1/6: choosing model.
## Choosing the non-intercept containing model.
## Limma step 2/6: running limma::voom(), switch with the argument 'which_voom'.
## Using normalize.method=quantile for voom.

## Limma step 3/6: running lmFit with method: ls.
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust=FALSE and trend=FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/3: Creating table: cal_low_vs_cal_high.  Adjust=BH
## Limma step 6/6: 2/3: Creating table: control_vs_cal_high.  Adjust=BH
## Limma step 6/6: 3/3: Creating table: control_vs_cal_low.  Adjust=BH
## Limma step 6/6: 1/3: Creating table: cal_high.  Adjust=BH
## Limma step 6/6: 2/3: Creating table: cal_low.  Adjust=BH
## Limma step 6/6: 3/3: Creating table: control.  Adjust=BH
## Comparing analyses.

## Deleting the file excel/20200211-a909_tables-v20191105.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/2: low_vs_control which is: cal_low/control.
## Found inverse table with control_vs_cal_low
## Working on 2/2: high_vs_control which is: cal_high/control.
## Found inverse table with control_vs_cal_high
## Adding venn plots for low_vs_control.

## Limma expression coefficients for low_vs_control; R^2: 0.996; equation: y = 0.998x + 0.0239
## Deseq expression coefficients for low_vs_control; R^2: 0.991; equation: y = 0.991x + 0.0985
## Edger expression coefficients for low_vs_control; R^2: 0.997; equation: y = 0.999x + 0.0131
## Adding venn plots for high_vs_control.

## Limma expression coefficients for high_vs_control; R^2: 0.996; equation: y = 0.998x + 0.0239
## Deseq expression coefficients for high_vs_control; R^2: 0.991; equation: y = 0.991x + 0.0985
## Edger expression coefficients for high_vs_control; R^2: 0.997; equation: y = 0.999x + 0.0131
## Writing summary information, compare_plot is: TRUE.
## Performing save of excel/20200211-a909_tables-v20191105.xlsx.

7.2 Strain 2603V/R

## Plotting a PCA before surrogates/batch inclusion.
## Not putting labels on the plot.
## Assuming no batch in model for testing pca.
## Not putting labels on the plot.
## Starting basic_pairwise().
## Starting basic pairwise comparison.
## Basic step 0/3: Filtering data.
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## Basic step 0/3: Transforming data.
## Basic step 1/3: Creating mean and variance tables.
## Basic step 2/3: Performing 6 comparisons.
## Basic step 3/3: Creating faux DE Tables.
## Basic: Returning tables.
## Starting deseq_pairwise().
## Starting DESeq2 pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Choosing the non-intercept containing model.
## DESeq2 step 1/5: Including only condition in the deseq model.
## converting counts to integer mode
##   it appears that the last variable in the design formula, 'condition',
##   has a factor level, 'control', which is not the reference level. we recommend
##   to use factor(...,levels=...) or relevel() to set this as the reference level
##   before proceeding. for more information, please see the 'Note on factor levels'
##   in vignette('DESeq2').
##   it appears that the last variable in the design formula, 'condition',
##   has a factor level, 'control', which is not the reference level. we recommend
##   to use factor(...,levels=...) or relevel() to set this as the reference level
##   before proceeding. for more information, please see the 'Note on factor levels'
##   in vignette('DESeq2').
## DESeq2 step 2/5: Estimate size factors.
## DESeq2 step 3/5: Estimate dispersions.
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Using a parametric fitting seems to have worked.
## DESeq2 step 4/5: nbinomWaldTest.
## Starting ebseq_pairwise().
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Starting EBSeq pairwise subset.
## Choosing the non-intercept containing model.
## Starting EBTest of cal_high vs. cal_low.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of cal_high vs. control.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of cal_low vs. control.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting edger_pairwise().
## Starting edgeR pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Choosing the non-intercept containing model.
## EdgeR step 1/9: Importing and normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## EdgeR step 5/9: Estimating GLM Trended dispersion.
## EdgeR step 6/9: Estimating GLM Tagged dispersion.
## EdgeR step 7/9: Running glmFit, switch to glmQLFit by changing the argument 'edger_test'.
## EdgeR step 8/9: Making pairwise contrasts.

## Starting limma_pairwise().
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$libsize.
## Limma step 1/6: choosing model.
## Choosing the non-intercept containing model.
## Limma step 2/6: running limma::voom(), switch with the argument 'which_voom'.
## Using normalize.method=quantile for voom.

## Limma step 3/6: running lmFit with method: ls.
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust=FALSE and trend=FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/3: Creating table: cal_low_vs_cal_high.  Adjust=BH
## Limma step 6/6: 2/3: Creating table: control_vs_cal_high.  Adjust=BH
## Limma step 6/6: 3/3: Creating table: control_vs_cal_low.  Adjust=BH
## Limma step 6/6: 1/3: Creating table: cal_high.  Adjust=BH
## Limma step 6/6: 2/3: Creating table: cal_low.  Adjust=BH
## Limma step 6/6: 3/3: Creating table: control.  Adjust=BH
## Comparing analyses.

## Deleting the file excel/20200211-vr2603_tables-v20191105.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/2: low_vs_control which is: cal_low/control.
## Found inverse table with control_vs_cal_low
## Working on 2/2: high_vs_control which is: cal_high/control.
## Found inverse table with control_vs_cal_high
## Adding venn plots for low_vs_control.

## Limma expression coefficients for low_vs_control; R^2: 0.996; equation: y = 1x - 0.003
## Deseq expression coefficients for low_vs_control; R^2: 0.992; equation: y = 0.996x + 0.0363
## Edger expression coefficients for low_vs_control; R^2: 0.997; equation: y = 0.999x + 0.00572
## Adding venn plots for high_vs_control.

## Limma expression coefficients for high_vs_control; R^2: 0.996; equation: y = 1x - 0.003
## Deseq expression coefficients for high_vs_control; R^2: 0.992; equation: y = 0.996x + 0.0363
## Edger expression coefficients for high_vs_control; R^2: 0.997; equation: y = 0.999x + 0.00572
## Writing summary information, compare_plot is: TRUE.
## Performing save of excel/20200211-vr2603_tables-v20191105.xlsx.

7.3 Strain CJB111

## Plotting a PCA before surrogates/batch inclusion.
## Not putting labels on the plot.
## Assuming no batch in model for testing pca.
## Not putting labels on the plot.
## Starting basic_pairwise().
## Starting basic pairwise comparison.
## Basic step 0/3: Filtering data.
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## Basic step 0/3: Transforming data.
## Basic step 1/3: Creating mean and variance tables.
## Basic step 2/3: Performing 6 comparisons.
## Basic step 3/3: Creating faux DE Tables.
## Basic: Returning tables.
## Starting deseq_pairwise().
## Starting DESeq2 pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Choosing the non-intercept containing model.
## DESeq2 step 1/5: Including only condition in the deseq model.
## converting counts to integer mode
##   it appears that the last variable in the design formula, 'condition',
##   has a factor level, 'control', which is not the reference level. we recommend
##   to use factor(...,levels=...) or relevel() to set this as the reference level
##   before proceeding. for more information, please see the 'Note on factor levels'
##   in vignette('DESeq2').
##   it appears that the last variable in the design formula, 'condition',
##   has a factor level, 'control', which is not the reference level. we recommend
##   to use factor(...,levels=...) or relevel() to set this as the reference level
##   before proceeding. for more information, please see the 'Note on factor levels'
##   in vignette('DESeq2').
## DESeq2 step 2/5: Estimate size factors.
## DESeq2 step 3/5: Estimate dispersions.
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Using a parametric fitting seems to have worked.
## DESeq2 step 4/5: nbinomWaldTest.
## Starting ebseq_pairwise().
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Starting EBSeq pairwise subset.
## Choosing the non-intercept containing model.
## Starting EBTest of cal_high vs. cal_low.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of cal_high vs. control.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of cal_low vs. control.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting edger_pairwise().
## Starting edgeR pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Choosing the non-intercept containing model.
## EdgeR step 1/9: Importing and normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## EdgeR step 5/9: Estimating GLM Trended dispersion.
## EdgeR step 6/9: Estimating GLM Tagged dispersion.
## EdgeR step 7/9: Running glmFit, switch to glmQLFit by changing the argument 'edger_test'.
## EdgeR step 8/9: Making pairwise contrasts.
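
The DESeq2 note earlier in this output says that 'control' is not the reference level of 'condition'. R orders factor levels alphabetically by default, so 'cal_high' ends up as the baseline unless the factor is releveled. The following is a minimal, language-independent sketch of what releveling changes in an intercept-containing design; it is not the worksheet's R code, and the helper names `order_levels` and `treatment_coding` are hypothetical.

```python
def order_levels(levels, reference):
    """Return the levels with `reference` moved to the front."""
    if reference not in levels:
        raise ValueError(f"{reference!r} is not one of {levels}")
    return [reference] + [lvl for lvl in levels if lvl != reference]


def treatment_coding(conditions, levels):
    """Build an intercept-plus-indicator design matrix.

    The first entry of `levels` is the baseline and gets no indicator column.
    """
    header = ["intercept"] + levels[1:]
    rows = [[1] + [int(cond == lvl) for lvl in levels[1:]] for cond in conditions]
    return header, rows


# Hypothetical sample sheet: one condition label per sample.
conditions = ["cal_high", "cal_low", "control"]
alphabetical = sorted(set(conditions))          # 'cal_high' sorts first
releveled = order_levels(alphabetical, "control")
header, design = treatment_coding(conditions, releveled)
```

With 'control' first, each remaining column reads directly as a treatment-versus-control effect. The runs logged here use the non-intercept model (~ 0 + factors) with explicit contrasts, so the note is probably harmless in this case.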

## Starting limma_pairwise().
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$libsize.
## Limma step 1/6: choosing model.
## Choosing the non-intercept containing model.
## Limma step 2/6: running limma::voom(), switch with the argument 'which_voom'.
## Using normalize.method=quantile for voom.

## Limma step 3/6: running lmFit with method: ls.
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust=FALSE and trend=FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/3: Creating table: cal_low_vs_cal_high.  Adjust=BH
## Limma step 6/6: 2/3: Creating table: control_vs_cal_high.  Adjust=BH
## Limma step 6/6: 3/3: Creating table: control_vs_cal_low.  Adjust=BH
## Limma step 6/6: 1/3: Creating table: cal_high.  Adjust=BH
## Limma step 6/6: 2/3: Creating table: cal_low.  Adjust=BH
## Limma step 6/6: 3/3: Creating table: control.  Adjust=BH
## Comparing analyses.

## Deleting the file excel/20200211-cjb111_tables-v20191105.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/2: low_vs_control which is: cal_low/control.
## Found inverse table with control_vs_cal_low
## Working on 2/2: high_vs_control which is: cal_high/control.
## Found inverse table with control_vs_cal_high
## Adding venn plots for low_vs_control.

## Limma expression coefficients for low_vs_control; R^2: 0.995; equation: y = 0.996x + 0.0383
## Deseq expression coefficients for low_vs_control; R^2: 0.991; equation: y = 0.987x + 0.147
## Edger expression coefficients for low_vs_control; R^2: 0.996; equation: y = 0.997x + 0.0308
## Adding venn plots for high_vs_control.

## Limma expression coefficients for high_vs_control; R^2: 0.995; equation: y = 0.996x + 0.0383
## Deseq expression coefficients for high_vs_control; R^2: 0.991; equation: y = 0.987x + 0.147
## Edger expression coefficients for high_vs_control; R^2: 0.996; equation: y = 0.997x + 0.0308
## Writing summary information, compare_plot is: TRUE.
## Performing save of excel/20200211-cjb111_tables-v20191105.xlsx.
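
The log above reports that the requested low_vs_control comparison (cal_low/control) was found as the inverse table control_vs_cal_low. Because log2(a/b) = -log2(b/a), the two tables should differ only in the sign of the log fold change, with the p-values unchanged. The sketch below illustrates that inversion with made-up numbers; `invert_contrast` is a hypothetical helper, and I am only assuming that this is what the table-combining step does internally.

```python
def invert_contrast(name, rows):
    """Flip a 'B_vs_A' table into 'A_vs_B' by negating each log2 fold change."""
    numerator, denominator = name.split("_vs_")
    flipped = [{**row, "logFC": -row["logFC"]} for row in rows]
    return f"{denominator}_vs_{numerator}", flipped


# Made-up values for illustration only.
table = [
    {"gene": "geneA", "logFC": 1.5, "adj_p": 0.01},
    {"gene": "geneB", "logFC": -0.25, "adj_p": 0.60},
]
new_name, inverted = invert_contrast("control_vs_cal_low", table)
```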

8 Circos

## This assumes you have a colors.conf in circos/colors/ and fonts.conf in circos/fonts/
## It also assumes you have conf/ideogram.conf, conf/ticks.conf, and conf/housekeeping.conf
## It will write circos/conf/a909.conf with a reasonable first approximation config file.
## Wrote karyotype to circos/conf/ideograms/a909.conf
## This should match the karyotype= line in a909.conf
## Wrote ticks to circos/conf/ticks_a909.conf
## Wrote karyotype to circos/conf/karyotypes/a909.conf
## This should match the karyotype= line in a909.conf
## Writing data file: circos/data/a909_plus_go.txt with the + strand GO data.
## Writing data file: circos/data/a909_minus_go.txt with the - strand GO data.
## Wrote the +/- config files.  Appending their inclusion to the master file.
## Returning the inner width: 0.92.  Use it as the outer for the next ring.
## Writing data file: circos/data/a909_lowdeseq_logfc_hist.txt with the lowdeseq_logfc column.
## Writing data file: circos/data/a909_highdeseq_logfc_hist.txt with the highdeseq_logfc column.
## Writing data file: circos/data/a909_control_tilecontrol_call_tile.txt with the control_call column.
## Writing data file: circos/data/a909_low_tilelow_call_tile.txt with the low_call column.
## Writing data file: circos/data/a909_high_tilehigh_call_tile.txt with the high_call column.
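
Each of the histogram data files named above is, as far as I understand the Circos data format, a plain text file with one `chromosome start end value` line per feature. Below is a minimal sketch of writing such a track. The coordinates and values are made up, `write_hist_track` is a hypothetical helper, and 'chr1' is a placeholder which must match the chromosome name in the karyotype file.

```python
import io


def write_hist_track(genes, chrom, handle):
    """Write one 'chromosome start end value' line per gene."""
    for gene in genes:
        handle.write(f"{chrom} {gene['start']} {gene['end']} {gene['value']:.3f}\n")


# Hypothetical coordinates and log fold changes.
genes = [
    {"start": 100, "end": 1200, "value": -1.25},
    {"start": 1300, "end": 2000, "value": 0.5},
]
buffer = io.StringIO()
write_hist_track(genes, "chr1", buffer)
track_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of side effects; a real run would open a file under circos/data/ instead.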

9 Circos email conversation

Here is (most of) the text of a recent email from Kevin:

"I think a plot of the following might be the best for the paper. It would just remove the 2 Bayesians of the high and low, and add the mapping of the transposon (Krmit) to the genome:

  1. + Strand ORFs (COG colored)
  2. - Strand ORFs (COG colored)
  3. + Strand Krmit insertions
  4. - Strand Krmit insertions
  5. DESeq2 low/control samples
  6. DESeq2 high/control samples
  7. DeJesus Bayesian result for control sample only combined.

That way we can see the coverage of the Krmit insertions and only the Bayesian in control conditions (broadly essential)."

I do not think this is a problem, but I will probably draw items 3 and 4 as three rings instead, one each for the +/- strand data of the control, low, and high samples.
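
Stacking several rings amounts to handing each ring the inner radius of the previous one as its outer radius, which is what the "Returning the inner width ... Use it as the outer for the next ring" messages in the Circos output describe. Here is a small sketch of that bookkeeping. The starting radius of 0.92 is taken from the earlier a909 run, while the ring width, the spacing, and the `stack_rings` helper are my own assumptions.

```python
def stack_rings(names, outer=0.92, width=0.08, spacing=0.01):
    """Assign (name, outer, inner) radii to each ring, working inward."""
    layout = []
    for name in names:
        inner = outer - width
        layout.append((name, round(outer, 4), round(inner, 4)))
        # The next ring starts just inside this one.
        outer = inner - spacing
    return layout


rings = stack_rings(["control", "low", "high"])
```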

In a separate hallway conversation, Kevin suggested plotting the RPKM values of the control/low/high ‘master’ libraries.

ergo…

## Reading the sample metadata.
## The sample definitions comprises: 3 rows(samples) and 11 columns(metadata fields).
## Reading count tables.
## Reading count tables with read.table().
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/tnseq/sagalacticae_2019/preprocessing/combined_control/outputs/bowtie_sagalactiae_a909/trimmed_ca-v1m1.count.xz contains 2212 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/tnseq/sagalacticae_2019/preprocessing/combined_low/outputs/bowtie_sagalactiae_a909/trimmed_ca-v1m1.count.xz contains 2212 rows and merges to 2212 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/tnseq/sagalacticae_2019/preprocessing/combined_high/outputs/bowtie_sagalactiae_a909/trimmed_ca-v1m1.count.xz contains 2212 rows and merges to 2212 rows.
## Finished reading count tables.
## Matched 2048 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 2207 rows and 3 columns.
## This function will replace the expt$expressionset slot with:
## log2(rpkm(data))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Filter is false, this should likely be set to something, good
##  choices include cbcb, kofa, pofa (anything but FALSE).  If you want this to
##  stay FALSE, keep in mind that if other normalizations are performed, then the
##  resulting libsizes are likely to be strange (potentially negative!)
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: not normalizing the data.
## Step 3: converting the data with rpkm.
## Step 4: transforming the data with log2.
## transform_counts: Found 436 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
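
For reference, the log2(rpkm(data)) conversion logged above can be sketched as follows. This is a minimal illustration with made-up counts and gene lengths, not the normalization code used by the worksheet; it assumes the library size is the column sum of the counts, and it adds 1 before the log as the transform_counts message describes when zeros are present.

```python
import math


def rpkm(counts, lengths, library_size=None):
    """Reads per kilobase of gene per million reads in the library."""
    if library_size is None:
        library_size = sum(counts)
    return [c * 1e9 / (length * library_size) for c, length in zip(counts, lengths)]


def log2_plus_one(values):
    """log2(x + 1), so a zero stays zero instead of becoming -inf."""
    return [math.log2(v + 1) for v in values]


# Made-up counts and gene lengths (bp) for a single library.
counts = [0, 500, 1500]
lengths = [1000, 1000, 3000]
values = rpkm(counts, lengths)
logged = log2_plus_one(values)
```

The second and third genes have different raw counts but the same RPKM, because the longer gene collects proportionally more reads.
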
## This assumes you have a colors.conf in circos/colors/ and fonts.conf in circos/fonts/
## It also assumes you have conf/ideogram.conf, conf/ticks.conf, and conf/housekeeping.conf
## It will write circos/conf/a909v2.conf with a reasonable first approximation config file.
## Wrote karyotype to circos/conf/ideograms/a909v2.conf
## This should match the karyotype= line in a909v2.conf
## Wrote ticks to circos/conf/ticks_a909v2.conf
## Wrote karyotype to circos/conf/karyotypes/a909v2.conf
## This should match the karyotype= line in a909v2.conf
## Writing data file: circos/data/a909v2_plus_go.txt with the + strand GO data.
## Writing data file: circos/data/a909v2_minus_go.txt with the - strand GO data.
## Wrote the +/- config files.  Appending their inclusion to the master file.
## Returning the inner width: 0.89.  Use it as the outer for the next ring.
## Writing data file: circos/data/a909v2_control_rpkmcontrol_hist.txt with the control_rpkmcontrol column.
## Writing data file: circos/data/a909v2_low_rpkmlow_hist.txt with the low_rpkmlow column.
## Writing data file: circos/data/a909v2_high_rpkmhigh_hist.txt with the high_rpkmhigh column.
## Writing data file: circos/data/a909v2_control_tilecontrol_call_tile.txt with the control_call column.
## Writing data file: circos/data/a909v2_low_tilelow_call_tile.txt with the low_call column.
## Writing data file: circos/data/a909v2_high_tilehigh_call_tile.txt with the high_call column.
## Writing data file: circos/data/a909v2_lowdeseq_logfc_hist.txt with the lowdeseq_logfc column.
## Writing data file: circos/data/a909v2_highdeseq_logfc_hist.txt with the highdeseq_logfc column.
## This assumes you have a colors.conf in circos/colors/ and fonts.conf in circos/fonts/
## It also assumes you have conf/ideogram.conf, conf/ticks.conf, and conf/housekeeping.conf
## It will write circos/conf/a909v3.conf with a reasonable first approximation config file.
## Wrote karyotype to circos/conf/ideograms/a909v3.conf
## This should match the karyotype= line in a909v3.conf
## Wrote ticks to circos/conf/ticks_a909v3.conf
## Wrote karyotype to circos/conf/karyotypes/a909v3.conf
## This should match the karyotype= line in a909v3.conf
## Writing data file: circos/data/a909v3_plus_go.txt with the + strand GO data.
## Writing data file: circos/data/a909v3_minus_go.txt with the - strand GO data.
## Wrote the +/- config files.  Appending their inclusion to the master file.
## Returning the inner width: 0.89.  Use it as the outer for the next ring.
## Writing data file: circos/data/a909v3_control_rpkmcontrol_hist.txt with the control_rpkmcontrol column.
## Writing data file: circos/data/a909v3_low_rpkmlow_hist.txt with the low_rpkmlow column.
## Writing data file: circos/data/a909v3_high_rpkmhigh_hist.txt with the high_rpkmhigh column.
## Writing data file: circos/data/a909v3_lowdeseq_logfc_hist.txt with the lowdeseq_logfc column.
## Writing data file: circos/data/a909v3_highdeseq_logfc_hist.txt with the highdeseq_logfc column.
## Reading the sample metadata.
## The sample definitions comprises: 3 rows(samples) and 11 columns(metadata fields).
## Reading count tables.
## Reading count tables with read.table().
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/tnseq/sagalacticae_2019/preprocessing/combined_control/outputs/bowtie_sagalactiae_a909/trimmed_ca-v1m1.count.xz contains 2212 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/tnseq/sagalacticae_2019/preprocessing/combined_low/outputs/bowtie_sagalactiae_a909/trimmed_ca-v1m1.count.xz contains 2212 rows and merges to 2212 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/tnseq/sagalacticae_2019/preprocessing/combined_high/outputs/bowtie_sagalactiae_a909/trimmed_ca-v1m1.count.xz contains 2212 rows and merges to 2212 rows.
## Finished reading count tables.
## Matched 2048 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 2207 rows and 3 columns.
## This function will replace the expt$expressionset slot with:
## log2(rpkm(data))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Filter is false, this should likely be set to something, good
##  choices include cbcb, kofa, pofa (anything but FALSE).  If you want this to
##  stay FALSE, keep in mind that if other normalizations are performed, then the
##  resulting libsizes are likely to be strange (potentially negative!)
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: not normalizing the data.
## Step 3: converting the data with rpkm.
## Step 4: transforming the data with log2.
## transform_counts: Found 436 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## This assumes you have a colors.conf in circos/colors/ and fonts.conf in circos/fonts/
## It also assumes you have conf/ideogram.conf, conf/ticks.conf, and conf/housekeeping.conf
## It will write circos/conf/a909v4.conf with a reasonable first approximation config file.
## Wrote karyotype to circos/conf/ideograms/a909v4.conf
## This should match the karyotype= line in a909v4.conf
## Wrote ticks to circos/conf/ticks_a909v4.conf
## Wrote karyotype to circos/conf/karyotypes/a909v4.conf
## This should match the karyotype= line in a909v4.conf
## Writing data file: circos/data/a909v4_plus_go.txt with the + strand GO data.
## Writing data file: circos/data/a909v4_minus_go.txt with the - strand GO data.
## Wrote the +/- config files.  Appending their inclusion to the master file.
## Returning the inner width: 0.89.  Use it as the outer for the next ring.
## Writing data file: circos/data/a909v4_control_rpkmcontrol_hist.txt with the control_rpkmcontrol column.
## Writing data file: circos/data/a909v4_low_rpkmlow_hist.txt with the low_rpkmlow column.
## Writing data file: circos/data/a909v4_high_rpkmhigh_hist.txt with the high_rpkmhigh column.
## Writing data file: circos/data/a909v4_control_tilecontrol_call_tile.txt with the control_call column.
## Writing data file: circos/data/a909v4_low_tilelow_call_tile.txt with the low_call column.
## Writing data file: circos/data/a909v4_high_tilehigh_call_tile.txt with the high_call column.
