1 TODO

  • This is a copy of the same file dated 20180611, but an attempt to react to a query from Yan:
  • I am using this worksheet to also attempt to address figure-specific TODO items.

"I was baffled by your result that protein (or peptide) abundance of identified peptides were not changed between March and May. That will completely counter my assumption that a complete cleaning of the mass spec will solve our problem. I wonder how much would it will take to generate more detailed data from our old friend PE/PPE proteins. The information that will be helpful would be intensity, mass accuracy, and chromatography peak width of each peptide in each sample. Please feel free to stop by tomorrow if you have questions.

I like this group of proteins because 1. it’s of extra interest to Volker and Dallas, 2. it’s a small list of proteins, so easy to evaluate. and 3. Protein abundance covers a wide range, from relatively high abundance to barely detectable."

I think therefore I will leave the analyses here alone, but jump down to where I parse/plot the DIA data and attempt to subset it for the PPE proteins in ways similar/identical to Yan’s query. With that in mind, I am going to set eval=FALSE on the blocks of code in this file until I find the one which is actually relevant.

In addition, I am moving blocks of code/analysis up to the top which I think will be relevant.

1.1 Plotting metrics of identified peaks from DIA data

The pyprophet data for all DIA samples should provide the metrics in which Yan is interested. I will focus on the tuberculist transition libraries because that was the one I saw first.

1.1.1 The tuberculist data

pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]
tb_pyprophet_fun <- extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx",
                                           pyprophet_column="tuberculistscored")
## Attempting to read the tsv file for: 2018_0315Briken01: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken01_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken02: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken02_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken03: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken03_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken04: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken04_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken05: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken05_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken06: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken06_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken21: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken21_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken22: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken22_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken23: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken23_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken24: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken24_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken25: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken25_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken26: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken26_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA07: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA07_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA08: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA08_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA09: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA09_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA11: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA11_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA12: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA12_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA01: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA01_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA02: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA02_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA03: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA03_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA04: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA04_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA05: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA05_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA06: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA06_vs_whole_HCD_dia_scored.tsv.
pe_pyprophet_fun <- subset_pyprophet_data(tb_pyprophet_fun, subset=pe_genes)

tb_mass_plot <- sm(plot_pyprophet_distribution(pe_pyprophet_fun, column="mass"))
pp(file="images/20180720_ppe_mass_boxplot.png", image=tb_mass_plot[["dotboxplot"]])
## Writing the image to: images/20180720_ppe_mass_boxplot.png and calling dev.off().

tb_intensity_plot <- sm(plot_pyprophet_distribution(pe_pyprophet_fun, column="intensity"))
pp(file="images/20180720_ppe_intensity_boxplot.png", image=tb_intensity_plot[["dotboxplot"]])
## Writing the image to: images/20180720_ppe_intensity_boxplot.png and calling dev.off().

tb_lwidth_plot <- sm(plot_pyprophet_distribution(pe_pyprophet_fun, column="leftwidth"))
pp(file="images/20180720_ppe_lwidth_boxplot.png", image=tb_lwidth_plot[["dotboxplot"]])
## Writing the image to: images/20180720_ppe_lwidth_boxplot.png and calling dev.off().

tb_rwidth_plot <- sm(plot_pyprophet_distribution(pe_pyprophet_fun, column="rightwidth"))
pp(file="images/20180720_ppe_rwidth_boxplot.png", image=tb_rwidth_plot[["dotboxplot"]])
## Writing the image to: images/20180720_ppe_rwidth_boxplot.png and calling dev.off().

tb_pyprophet_fun <- extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx",
                                           pyprophet_column="tuberculistscored")
## Attempting to read the tsv file for: 2018_0315Briken01: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken01_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken02: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken02_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken03: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken03_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken04: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken04_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken05: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken05_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken06: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken06_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken21: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken21_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken22: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken22_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken23: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken23_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken24: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken24_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken25: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken25_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0315Briken26: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0315Briken26_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA07: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA07_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA08: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA08_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA09: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA09_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA11: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA11_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA12: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA12_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA01: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA01_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA02: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA02_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA03: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA03_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA04: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA04_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA05: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA05_vs_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2018_0502BrikenDIA06: results/pyprophet/20180611/whole_8mz_tuberculist/2018_0502BrikenDIA06_vs_whole_HCD_dia_scored.tsv.
tb_mass_plot <- sm(plot_pyprophet_distribution(tb_pyprophet_fun, column="mass"))
tb_mass_plot[["boxplot"]]

tb_deltart_plot_all <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun, column="delta_rt"))
tb_deltart_plot_all[["boxplot"]]

tb_deltart_plot_real <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
tb_deltart_plot_real[["boxplot"]]

tb_deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
tb_deltart_plot_decoys[["boxplot"]]

tb_deltart_plot_decoys[["density"]]

tb_deltart_plot_decoys[["violin"]]

tb_widthvsmass <- sm(plot_pyprophet_data(tb_pyprophet_fun, legend=FALSE))
tb_widthvsmass$plot

2 Analyzing data from openMS and friends.

In preprocessing_comet_highres.Rmd, I used the openMS tutorials and supplemental materials from a couple papers to hopefully correctly perform the various preprocessing tasks required to extract intensity data from DIA/SWATH transitions.

The final steps of that process combined the transition intensities from every sample into a metadata file (results/tric/HCD_meta.tsv), an intensity matrix (results/tric/HCD_outmatrix.tsv), and a feature aligned output matrix (results/tric/aligned_comet_HCD.tsv).

My reading of the SWATH2stats and MSstats source code suggests to me that the log2(intensities) of the feature aligned data are our final proxy for protein abundance. At first glance, this suggests to me that these data might follow a distribution similar to RNASeq data (negative binomial, but perhaps with a bigger tail?). In addition, by the time we use tric on the data, we have a count matrix and sample annotation data frames which look remarkably similar to those used in a RNASeq expressionset. Indeed, by the end of the MSstats processing, it creates a MSnSet class of its own which uses fData/exprs/pData.

For the curious, my reasoning for saying that the log intensities are our proxy for abundance comes from MSstats/R/DataProcess.R in a clause which looks like:

if (logTrans == 2) {
  work[["ABUNDANCE"]] <- log2(work[["ABUNDANCE"]])
} else if (logTrans == 10) {
  work[["ABUNDANCE"]] <- log10(work[["ABUNDANCE"]])
} else {
  ## Above there was a check for only log 2 and 10, but we can do e if we want.
  ## I might go back up there and remove that check. Long live e! 2.718282 rules!
  work[["ABUNDANCE"]] <- log(work[["ABUNDANCE"]]) / log(logTrans)
}

(Note: I added the natural log to the set of conditions, but otherwise the logic is unchanged.)

With that in mind, I want to use some tools with which I am familiar in order to try to understand these data. Therefore I will first attempt to coerce my tric aligned data and annotations into a ‘normal’ expressionset. Then I want to do some diagnostic plots which, if I am wrong and these distributions are not as expected, will be conceptually incorrect (I don’t yet think I am wrong).

2.1 Sample annotation via SWATH2stats

I am using the SWATH2stats vignette as my primary source of information. Thus I see that it uses the OpenSWATH_SM3_GoldStandardAutomatedResults_human_peakgroups.txt which has a format nearly identical to the tric output matrix. Thus for the moment I will assume that the proper input for SWATH2stats is ‘results/tric/comet_HCD.tsv’ and not the metadata nor output matrix. In addition, SWATH2stats provides a similar conversion function which takes the tric output and coerces it into a similar matrix after performing its various filtering tasks. Therefore I will use that.

I keep a sample sheet of all the DIA samples used in this analysis in ‘sample_sheets/dia_samples.xlsx’ It should contain all the other required data with one important caveat, I removed 1 sample by ‘commenting’ it (e.g. prefixing it with ‘##’ – which is an admittedly dumb thing to do in an excel file.

One last caveat: I hacked the SWATH2stats sample_annotation() function to add a couple columns in an attempt to make it a little more robust when faced with sample sheets with differently named columns.

In addition, SWATH2stats provides some nice filtering and combination functions which should be considered when generating various expressionset data structures later.

2.1.1 Creating a swath2stats experiment using our comet-derived library data

loaded <- sm(devtools::load_all("~/scratch/git/SWATH2stats"))

our_tric_data <- read.csv("results/tric/20180611/whole_8mz/comet_HCD.tsv", sep="\t")

sample_annot <- openxlsx::read.xlsx("sample_sheets/Mtb_dia_samples.xlsx")
colnames(sample_annot) <- gsub(pattern="\\.", replacement="", x=colnames(sample_annot))
## Drop samples starting with comments
keep_idx <- ! grepl(pattern="^#", x=sample_annot[["sampleid"]])
sample_annot <- sample_annot[keep_idx, ]
sample_annot[["sampleid"]] <- paste0("s", sample_annot[["sampleid"]])
rownames(sample_annot) <- make.names(sample_annot[["sampleid"]], unique=TRUE)

expt_idx <- sample_annot[["expt_id"]] == "may2018" | sample_annot[["expt_id"]] == "mar2018"
expt_idx[is.na(sample_annot[["expt_id"]])] <- FALSE
sample_annot <- sample_annot[expt_idx, ]
mz_idx <- sample_annot[["windowsize"]] == "8"
sample_annot <- sample_annot[mz_idx, ]
## Set the mzXML column to match the filename column in the data.

## s2s, my witty way of shortening SWATH2stats...
our_s2s_exp <- sample_annotation(data=our_tric_data,
                                 sample_annotation=sample_annot,
                                 run_column="run",
                                 fullpeptidename_column="fullunimodpeptidename")
## Found the same mzXML files in the annotations and data.

2.1.2 Creating a swath2stats experiment using the tuberculist-derived library data

This just repeats the previous stanza, but uses the tuberculist transition-library swath data as its input rather than our own comet data.

There is one important caveat in the following block: I used a regex to remove the second half of geneID_geneName so that later when I merge in the annotation data I have it will match.

tb_tric_data <- read.csv("results/tric/20180611/whole_8mz_tuberculist/comet_HCD.tsv", sep="\t")

tb_tric_data[["ProteinName"]] <- gsub(pattern="^(.*)_.*$", replacement="\\1",
                                      x=tb_tric_data[["ProteinName"]])
tb_s2s_exp <- sample_annotation(data=tb_tric_data,
                                sample_annotation=sample_annot,
                                fullpeptidename_column="fullunimodpeptidename")
## Found the same mzXML files in the annotations and data.

Now I have a couple data structures which should prove useful for the metrics provided by SWATH2stats, MSstats, and my own hpgltools.

3 SWATH2stats continued

The various metrics and filters provided by SWATH2stats seem quite reasonable to me. The only thing that really bothers me is that they are all case sensitive and I found that the most recent tric changed the capitalization of a column, causing these to all fall down. Therefore I went in and made everything case insensitive in a fashion similar to that done by MSstats (except I hate capital letters, so I used tolower() rather than toupper()).

3.1 Run filters on our comet data

## Get correlations on a sample by sample basis
pp(file="images/20180611_our_swath2stats_sample_cor.png")
## Going to write the image to: images/20180611_our_swath2stats_sample_cor.png when dev.off() is called.
sample_cor <- plot_correlation_between_samples(our_s2s_exp, size=2,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
## png 
##   2
sample_cond_rep_cor <- plot_correlation_between_samples(our_s2s_exp, size=2,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")

decoy_lists <- assess_decoy_rate(our_s2s_exp)
## Number of non-decoy peptides: 3259
## Number of decoy peptides: 182
## Decoy rate: 0.0558
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(our_s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(our_s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
## The average FDR by run on assay level is 0.007
## The average FDR by run on peptide level is 0.007
## The average FDR by run on protein level is 0.019

chosen_mscore <- mscore4assayfdr(our_s2s_exp, FFT=0.7, fdr_target=0.02)
## Target assay FDR: 0.02
## Required overall m-score cutoff: 0.0019953
## achieving assay FDR: 0.0189
prot_score <- mscore4protfdr(our_s2s_exp, FFT=0.7, fdr_target=0.02)
## Target protein FDR: 0.02
## Required overall m-score cutoff: 0.00050119
## achieving protein FDR: 0.0181
our_mscore_filtered <- filter_mscore(our_s2s_exp, chosen_mscore)
## Original dimension: 47362, new dimension: 45761, difference: 1601.
our_freq_mscore <- filter_mscore_freqobs(our_s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
## Peptides need to have been quantified in more conditions than: 18.4 in order to pass this percentage-based threshold.
## Fraction of peptides selected: 0.33
## Original dimension: 47841, new dimension: 26232, difference: 21609.
our_data_filtered_fdr <- filter_mscore_fdr(our_mscore_filtered, FFT=0.7,
                                           overall_protein_fdr_target=prot_score,
                                           upper_overall_peptide_fdr_limit=0.05)
## Target protein FDR: 0.000501187233627273
## Required overall m-score cutoff: 0.01
## achieving protein FDR: 0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.05
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected: 
## Total proteins selected: 945
## Final target proteins: 945
## Final decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 3203
## Final target peptides: 3203
## Final decoy peptides: 0
## Total peptides selected from:
## Total peptides: 3203
## Final target peptides: 3203
## Final decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
our_only_proteotypic <- filter_proteotypic_peptides(our_data_filtered_fdr)
## Number of proteins detected: 947
## Protein identifiers: Rv1908c, Rv0242c, Rv3224, Rv0667, Rv1133c, Rv3036c
## Number of proteins detected that are supported by a proteotypic peptide: 930
## Number of proteotypic peptides detected: 3158
our_all_filtered <- filter_all_peptides(our_only_proteotypic)
## Number of proteins detected: 930
## First 6 protein identifiers: Rv1908c, Rv0242c, Rv3224, Rv0667, Rv1133c, Rv3036c
our_only_strong <- filter_on_max_peptides(data=our_all_filtered, n_peptides=10)
## Before filtering:
##   Number of proteins: 930
##   Number of peptides: 3158
## 
## Percentage of peptides removed: 8.61%
## 
## After filtering:
##   Number of proteins: 928
##   Number of peptides: 2886
our_only_minimum <- filter_on_min_peptides(data=our_only_strong, n_peptides=3)
## Before filtering:
##   Number of proteins: 928
##   Number of peptides: 2886
## 
## Percentage of peptides removed: 0%
## 
## After filtering:
##   Number of proteins: 895
##   Number of peptides: 2886
## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
matrix_prefix <- file.path("results", "swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}
protein_matrix_all <- write_matrix_proteins(
  our_s2s_exp, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_all.csv"))
## Protein overview matrix results/swath2stats/20180611/our_protein_all.csv written to working folder.
dim(protein_matrix_all)
## [1] 1110   24
protein_matrix_mscore <- write_matrix_proteins(
  our_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_matrix_mscore.csv"))
## Protein overview matrix results/swath2stats/20180611/our_protein_matrix_mscore.csv written to working folder.
dim(protein_matrix_mscore)
## [1] 945  24
peptide_matrix_mscore <- write_matrix_peptides(
  our_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_peptide_matrix_mscore.csv"))
## Peptide overview matrix results/swath2stats/20180611/our_peptide_matrix_mscore.csv written to working folder.
dim(peptide_matrix_mscore)
## [1] 3203   24
protein_matrix_minimum <- write_matrix_proteins(
  our_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_matrix_minimum.csv"))
## Protein overview matrix results/swath2stats/20180611/our_protein_matrix_minimum.csv written to working folder.
dim(protein_matrix_minimum)
## [1] 895  24
peptide_matrix_minimum <- write_matrix_peptides(
  our_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_peptide_matrix_minimum.csv"))
## Peptide overview matrix results/swath2stats/20180611/our_peptide_matrix_minimum.csv written to working folder.
dim(peptide_matrix_minimum)
## [1] 38491    24
rt_cor <- plot_correlation_between_samples(
  our_only_minimum, column.values="intensity", fun.aggregate=sum, size=2)

## I have no effing clue what this plot means.
variation <- plot_variation(our_only_minimum, fun.aggregate=sum)

cols <- colnames(our_only_minimum)
our_disaggregated <- disaggregate(our_only_minimum, all.columns=TRUE)
## The library contains6transitions per precursor.
## The data table was transformed into a table containing one row per transition.
our_msstats_input <- convert_MSstats(our_disaggregated)
## One or several columns required by MSstats were not in the data. The columns were created and filled with NAs.
## Missing columns: productcharge, isotopelabeltype
## isotopelabeltype was filled with light.

3.2 Run filters on the tuberculist

Repeat the above stanza using the tuberculist data.

## Get correlations on a sample by sample basis
pp(file="images/20180611_tb_swath2stats_sample_cor.png")
## Going to write the image to: images/20180611_tb_swath2stats_sample_cor.png when dev.off() is called.
sample_cor <- plot_correlation_between_samples(tb_s2s_exp, size=2,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
## png 
##   2
sample_cond_rep_cor <- plot_correlation_between_samples(tb_s2s_exp, size=2,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")

decoy_lists <- assess_decoy_rate(tb_s2s_exp)
## Number of non-decoy peptides: 23975
## Number of decoy peptides: 1762
## Decoy rate: 0.0735
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(tb_s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(tb_s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
## The average FDR by run on assay level is 0.008
## The average FDR by run on peptide level is 0.009
## The average FDR by run on protein level is 0.044

chosen_mscore <- mscore4assayfdr(tb_s2s_exp, FFT=0.7, fdr_target=0.02)
## Target assay FDR: 0.02
## Required overall m-score cutoff: 0.0056234
## achieving assay FDR: 0.019
prot_score <- mscore4protfdr(tb_s2s_exp, FFT=0.7, fdr_target=0.02)
## Target protein FDR: 0.02
## Required overall m-score cutoff: 0.00070795
## achieving protein FDR: 0.0198
tb_mscore_filtered <- filter_mscore(tb_s2s_exp, chosen_mscore)
## Original dimension: 252491, new dimension: 231348, difference: 21143.
tb_freq_mscore <- filter_mscore_freqobs(tb_s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
## Peptides need to have been quantified in more conditions than: 18.4 in order to pass this percentage-based threshold.
## Fraction of peptides selected: 0.083
## Original dimension: 256996, new dimension: 56485, difference: 200511.
tb_data_filtered_fdr <- filter_mscore_fdr(tb_mscore_filtered, FFT=0.7,
                                           overall_protein_fdr_target=prot_score,
                                           upper_overall_peptide_fdr_limit=0.05)
## Target protein FDR: 0.000707945784384137
## Required overall m-score cutoff: 0.01
## achieving protein FDR: 0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.05
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected: 
## Total proteins selected: 3087
## Final target proteins: 3087
## Final decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 22671
## Final target peptides: 22671
## Final decoy peptides: 0
## Total peptides selected from:
## Total peptides: 22671
## Final target peptides: 22671
## Final decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
tb_only_proteotypic <- filter_proteotypic_peptides(tb_data_filtered_fdr)
## Number of proteins detected: 3101
## Protein identifiers: Rv1270c, Rv0161, Rv1613, Rv2540c, Rv0440, Rv1327c
## Number of proteins detected that are supported by a proteotypic peptide: 2964
## Number of proteotypic peptides detected: 22510
tb_all_filtered <- filter_all_peptides(tb_only_proteotypic)
## Number of proteins detected: 2966
## First 6 protein identifiers: Rv1270c, Rv0161, Rv1613, Rv2540c, Rv0440, Rv1327c
tb_only_strong <- filter_on_max_peptides(data=tb_all_filtered, n_peptides=10)
## Before filtering:
##   Number of proteins: 2964
##   Number of peptides: 22510
## 
## Percentage of peptides removed: 23.15%
## 
## After filtering:
##   Number of proteins: 2936
##   Number of peptides: 17299
tb_only_minimum <- filter_on_min_peptides(data=tb_only_strong, n_peptides=3)
## Before filtering:
##   Number of proteins: 2936
##   Number of peptides: 17299
## 
## Percentage of peptides removed: 0.03%
## 
## After filtering:
##   Number of proteins: 2679
##   Number of peptides: 17293
## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
matrix_prefix <- file.path("results", "swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}
protein_matrix_all <- write_matrix_proteins(
  tb_s2s_exp, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_all.csv"))
## Protein overview matrix results/swath2stats/20180611/tb_protein_all.csv written to working folder.
dim(protein_matrix_all)
## [1] 4570   24
protein_matrix_mscore <- write_matrix_proteins(
  tb_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_matrix_mscore.csv"))
## Protein overview matrix results/swath2stats/20180611/tb_protein_matrix_mscore.csv written to working folder.
dim(protein_matrix_mscore)
## [1] 3087   24
peptide_matrix_mscore <- write_matrix_peptides(
  tb_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_peptide_matrix_mscore.csv"))
## Peptide overview matrix results/swath2stats/20180611/tb_peptide_matrix_mscore.csv written to working folder.
dim(peptide_matrix_mscore)
## [1] 22671    24
protein_matrix_minimum <- write_matrix_proteins(
  tb_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_matrix_minimum.csv"))
## Protein overview matrix results/swath2stats/20180611/tb_protein_matrix_minimum.csv written to working folder.
dim(protein_matrix_minimum)
## [1] 2679   24
peptide_matrix_minimum <- write_matrix_peptides(
  tb_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_peptide_matrix_minimum.csv"))
## Peptide overview matrix results/swath2stats/20180611/tb_peptide_matrix_minimum.csv written to working folder.
dim(peptide_matrix_minimum)
## [1] 164565     24
rt_cor <- plot_correlation_between_samples(size=2,
  tb_only_minimum, column.values="intensity", fun.aggregate=sum)

## I have no effing clue what this plot means.
variation <- plot_variation(tb_only_minimum, fun.aggregate=sum)

cols <- colnames(tb_only_minimum)
tb_disaggregated <- disaggregate(tb_only_minimum, all.columns=TRUE)
## The library contains5transitions per precursor.
## The data table was transformed into a table containing one row per transition.
tb_msstats_input <- convert_MSstats(tb_disaggregated)
## One or several columns required by MSstats were not in the data. The columns were created and filled with NAs.
## Missing columns: productcharge, isotopelabeltype
## isotopelabeltype was filled with light.

3.3 Some new plots

In response to some interesting queries from Yan, I made a few little functions which query and plot data from the scored data provided by openswath/pyprophet. Let us look at their results here.

3.3.1 Using the comet data

our_pyprophet_fun <- sm(extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx"))

our_mass_plot <- sm(plot_pyprophet_distribution(our_pyprophet_fun, column="mass"))
our_mass_plot[["boxplot"]]

our_mass_plot[["density"]]

our_deltart_plot_all <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun, column="delta_rt"))
our_deltart_plot_all[["boxplot"]]

our_deltart_plot_real <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
our_deltart_plot_real[["boxplot"]]

our_deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
our_deltart_plot_decoys[["boxplot"]]

our_deltart_plot_decoys[["density"]]

our_widthvsmass <- sm(plot_pyprophet_data(our_pyprophet_fun, legend=FALSE))
our_widthvsmass$plot

3.4 MSstats

msstats.org seems to provide a complete solution for performing reasonable metrics of this data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

3.4.1 Comet derived msstats

devtools::load_all("~/scratch/git/MSstats")
our_msstats_quant <- dataProcess(our_msstats_input)
our_msstats_plots <- sm(dataProcessPlots(our_msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(our_msstats_input$condition))
my_levels
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_filtrate", "delta_filtrate", "comp_filtrate",
               "delta_filtrate", "comp_filtrate", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_filtrate", "wt_filtrate", "wt_whole",
                 "wt_whole"))
our_results <- list()
for (c in 1:length(rownames(comparisons))) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  our_results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                   data=our_msstats_quant))
}

4 This is Relevant to Yan’s query!

4.1 P/PE protein QC plots for Yan

Yan asked for the p/pe protein qc plots. ok. I changed the dataProcessPlots to return something useful, so that should be possible now.

pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]

## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- our_msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="",
                        x=levels(our_msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots

pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in 1:length(pe_plots)) {
  plot(pe_plots[[p]])
}
dev.off()
length(pe_plots)

4.1.1 Tuberculist derived msstats

tb_msstats_quant <- dataProcess(tb_msstats_input)
##tb_msstats_plots <- sm(dataProcessPlots(tb_msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(tb_msstats_input$condition))
my_levels
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_filtrate", "delta_filtrate", "comp_filtrate",
               "delta_filtrate", "comp_filtrate", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_filtrate", "wt_filtrate", "wt_whole",
                 "wt_whole"))
tb_results <- list()
for (c in 1:length(rownames(comparisons))) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                   data=tb_msstats_quant))
}

5 Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities from SWATH2stats, metadata, and annotation data; attempt to create a ‘normal’ expressionset; poke at it to see what I can learn.

5.1 Massaging the metadata

I want to use the same metadata as were used for MSstats. It has a few important differences from the requirements of hpgltools: pretty much only that I do not allow rownames/sampleIDs to start with a number.

5.2 Massaging the intensity matrix

I do not want the \1 before the protein names, I already merged them into one entry per gene vis SWATH2stats.

5.2.1 Make matrices with our libraries

our_prot_mtrx <- read.csv(file.path("results", "swath2stats", ver, "our_protein_matrix_minimum.csv"))
rownames(our_prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=our_prot_mtrx[["proteinname"]])
our_prot_mtrx <- our_prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
colnames(our_prot_mtrx) <- gsub(pattern="^.*2018", replacement="s2018", x=colnames(our_prot_mtrx))

5.2.2 Make matrices with the tb libraries

tb_prot_mtrx <- read.csv(file.path("results", "swath2stats", ver, "tb_protein_matrix_minimum.csv"))
rownames(tb_prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=tb_prot_mtrx[["proteinname"]])
tb_prot_mtrx <- tb_prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
colnames(tb_prot_mtrx) <- gsub(pattern="^.*2018", replacement="s2018", x=colnames(tb_prot_mtrx))

5.3 Merge the pieces

Now we should have sufficient pieces to make an expressionset.

While here, I will also split the data into a cf and whole-cell pair of data structures.

5.3.1 Our data

## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(our_prot_mtrx)
our_metadata <- sample_annot[reordered, ]

our_protein_expt <- create_expt(metadata=our_metadata,
                                count_dataframe=our_prot_mtrx,
                                gene_info=mtb_annotations)
## Reading the sample metadata.
## The sample definitions comprises: 23, 27 rows, columns.
## Matched 892 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
new_colors <- c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E", "#E6AB02")
names(new_colors) <- c("comp_filtrate", "comp_whole", "delta_filtrate", "delta_whole", "wt_filtrate", "wt_whole")
names(new_colors) <- c("wt_whole", "comp_whole", "delta_whole", "delta_filtrate", "wt_filtrate", "comp_filtrate")
our_protein_expt <- set_expt_colors(our_protein_expt, new_colors)
## The new colors are a character, changing according to condition.
our_whole_expt <- subset_expt(our_protein_expt, subset="collectiontype=='whole'")
## There were 23, now there are 11 samples.
our_cf_expt <- subset_expt(our_protein_expt, subset="collectiontype=='filtrate'")
## There were 23, now there are 12 samples.

5.3.2 Tb data

## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(tb_prot_mtrx)
metadata <- sample_annot[reordered, ]

tb_protein_expt <- sm(create_expt(metadata,
                                  count_dataframe=tb_prot_mtrx,
                                  gene_info=mtb_annotations))
written <- write_expt(tb_protein_expt, excel=paste0("excel/written-v", ver, ".xlsx"))
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Writing the normalized reads.
## Graphing the normalized reads.

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Writing the median reads by factor.
## The factor comp_filtrate has 3 rows.
## The factor comp_whole has 2 rows.
## The factor delta_filtrate has 3 rows.
## The factor delta_whole has 3 rows.
## The factor wt_filtrate has 6 rows.
## The factor wt_whole has 6 rows.

tb_whole_expt <- subset_expt(tb_protein_expt, subset="collectiontype=='whole'")
## There were 23, now there are 11 samples.
tb_cf_expt <- subset_expt(tb_protein_expt, subset="collectiontype=='filtrate'")
## There were 23, now there are 12 samples.

5.4 Metrics of the full data set

5.4.1 Our libraries

our_protein_metrics <- sm(graph_metrics(our_protein_expt))
our_protein_norm <- sm(normalize_expt(our_protein_expt, transform="log2", convert="cpm",
                                      norm="quant", filter=TRUE))
our_protein_norm_metrics <- sm(graph_metrics(our_protein_norm))
our_protein_fsva <- sm(normalize_expt(our_protein_expt, transform="log2",
                                   batch="fsva", filter=TRUE))
our_protein_fsva_metrics <- sm(graph_metrics(our_protein_fsva))

5.4.2 Tb libraries

tb_protein_metrics <- sm(graph_metrics(tb_protein_expt))
tb_protein_varplot <- plot_variance_coefficients(tb_protein_expt, x_axis="batch")
## Naively calculating coefficient of variation and quartile dispersion with respect to batch.
## Finished calculating dispersion estimates.
tb_protein_varplot <- plot_variance_coefficients(tb_protein_expt)
## Naively calculating coefficient of variation and quartile dispersion with respect to condition.
## Finished calculating dispersion estimates.
tb_protein_norm <- sm(normalize_expt(tb_protein_expt, transform="log2", convert="cpm",
                                     norm="quant", filter=TRUE))
tb_protein_norm_metrics <- sm(graph_metrics(tb_protein_norm))
tb_protein_fsva <- sm(normalize_expt(tb_protein_expt, transform="log2", convert="cpm",
                                     batch="fsva", filter=TRUE))
tb_protein_fsva_metrics <- sm(graph_metrics(tb_protein_fsva))

5.5 Metrics of the whole-cell data set

5.5.1 Our libraries

our_whole_metrics <- sm(graph_metrics(our_whole_expt))
our_whole_norm <- sm(normalize_expt(our_whole_expt, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE))
our_whole_norm_metrics <- sm(graph_metrics(our_whole_norm))
our_whole_fsva <- sm(normalize_expt(our_whole_expt, transform="log2", convert="cpm",
                                    batch="fsva", filter=TRUE))
our_whole_fsva_metrics <- sm(graph_metrics(our_whole_fsva))

5.5.2 Tb libraries

tb_whole_metrics <- sm(graph_metrics(tb_whole_expt))
tb_whole_norm <- sm(normalize_expt(tb_whole_expt, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE))
tb_whole_norm_metrics <- sm(graph_metrics(tb_whole_norm))
tb_whole_fsva <- sm(normalize_expt(tb_whole_expt, transform="log2", convert="cpm",
                                    batch="fsva", filter=TRUE))
tb_whole_fsva_metrics <- sm(graph_metrics(tb_whole_fsva))

5.6 Metrics of the filtrate data set

5.6.1 Our libraries

our_cf_metrics <- sm(graph_metrics(our_cf_expt))
our_cf_norm <- sm(normalize_expt(our_cf_expt, transform="log2", convert="cpm",
                                 norm="quant", filter=TRUE))
our_cf_norm_metrics <- sm(graph_metrics(our_cf_norm))
our_cf_fsva <- sm(normalize_expt(our_cf_expt, transform="log2", convert="cpm",
                                 batch="fsva", filter=TRUE))
our_cf_fsva_metrics <- sm(graph_metrics(our_cf_fsva))

5.6.2 Tb libraries

tb_cf_metrics <- sm(graph_metrics(tb_cf_expt))
tb_cf_norm <- sm(normalize_expt(tb_cf_expt, transform="log2", convert="cpm",
                                 norm="quant", filter=TRUE))
tb_cf_norm_metrics <- sm(graph_metrics(tb_cf_norm))
tb_cf_fsva <- sm(normalize_expt(tb_cf_expt, transform="log2", convert="cpm",
                                 batch="fsva", filter=TRUE))
tb_cf_fsva_metrics <- sm(graph_metrics(tb_cf_fsva))

5.7 plot some metrics

5.7.1 Our libraries

pp(image=our_protein_metrics$libsize,
   file=file.path("images", paste0(ver, "_our_libsize.png")))
## It seems to me that the scale of the data is all within an order of magnitude or two.
## I cannot get used to these absurdly large numbers though.
pp(image=our_protein_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_our_norm_pca.png")))
## There appears to be a nice split in the data, however the un-assayable batch
## effect is a problem.
pp(image=our_protein_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## fsva seems to get some handle on the data, but I don't think we should rely
## upon it.
pp(image=our_protein_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Once again, the whole-cell/culture-filtrate split is very large.
pp(image=our_protein_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## There are two obvious distributions in the data, once again split between types.
pp(image=our_protein_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## This recapitulates the previous plot.

pp(image=our_whole_metrics$libsize,
   file=file.path("images", paste0(ver, "_whole_libsize.png")))
pp(image=our_whole_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_norm_pca.png")))
pp(image=our_whole_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_fsva_pca.png")))
pp(image=our_whole_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_whole_norm_corheat.png")))
pp(image=our_whole_metrics$density,
   file=file.path("images", paste0(ver, "_whole_raw_density.png")))
pp(image=our_whole_metrics$boxplot,
   file=file.path("images", paste0(ver, "_whole_boxplot.png")))

pp(image=our_cf_metrics$libsize,
   file=file.path("images", paste0(ver, "_libsize.png")))
pp(image=our_cf_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_norm_pca.png")))
pp(image=our_cf_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
pp(image=our_cf_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
pp(image=our_cf_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
pp(image=our_cf_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))

5.7.2 Tb libraries

I think the figure S1 intensities may come from these:

## Reminder:  tb_protein_metrics <- sm(graph_metrics(tb_protein_expt))
figure_expt <- set_expt_samplenames(tb_protein_expt, pData(tb_protein_expt)[["figurename"]])
fig_s1a <- plot_libsize(figure_expt, title="", order_by="condition", y_label="Sum(intensities) by run",
                        order=c("wt_whole", "delta_whole", "comp_whole",
                                "wt_filtrate", "delta_filtrate", "comp_filtrate"))
## Warning in mode(current): NAs introduced by coercion to integer range
## The scale difference between the smallest and largest
## libraries is > 10. Assuming a log10 scale is better, set scale=FALSE if not.
pp(file="images/fig_s1a.pdf", image=fig_s1a$plot)
## Writing the image to: images/fig_s1a.pdf and calling dev.off().

pp(image=tb_protein_metrics$libsize,
   file=file.path("images", paste0(ver, "_tb_libsize.png")))
## Writing the image to: images/20180611_tb_libsize.png and calling dev.off().

## It seems to me that the scale of the data is all within an order of magnitude or two.
## I cannot get used to these absurdly large numbers though.
pp(image=tb_protein_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_tb_norm_pca.png")))
## Writing the image to: images/20180611_tb_norm_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## There appears to be a nice split in the data, however the un-assayable batch
## effect is a problem.
pp(image=tb_protein_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## Writing the image to: images/20180611_fsva_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## fsva seems to get some handle on the data, but I don't think we should rely
## upon it.
pp(image=tb_protein_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Writing the image to: images/20180611_norm_corheat.png and calling dev.off().

## Once again, the whole-cell/culture-filtrate split is very large.
pp(image=tb_protein_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## Writing the image to: images/20180611_raw_density.png and calling dev.off().

## There are two obvious distributions in the data, once again split between types.
pp(image=tb_protein_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## Writing the image to: images/20180611_boxplot.png and calling dev.off().

## This recapitulates the previous plot.

pp(image=tb_whole_metrics$libsize,
   file=file.path("images", paste0(ver, "_whole_libsize.png")))
## Writing the image to: images/20180611_whole_libsize.png and calling dev.off().

pp(image=tb_whole_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_norm_pca.png")))
## Writing the image to: images/20180611_whole_norm_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=tb_whole_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_fsva_pca.png")))
## Writing the image to: images/20180611_whole_fsva_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=tb_whole_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_whole_norm_corheat.png")))
## Writing the image to: images/20180611_whole_norm_corheat.png and calling dev.off().

pp(image=tb_whole_metrics$density,
   file=file.path("images", paste0(ver, "_whole_raw_density.png")))
## Writing the image to: images/20180611_whole_raw_density.png and calling dev.off().

pp(image=tb_whole_metrics$boxplot,
   file=file.path("images", paste0(ver, "_whole_boxplot.png")))
## Writing the image to: images/20180611_whole_boxplot.png and calling dev.off().

pp(image=tb_cf_metrics$libsize,
   file=file.path("images", paste0(ver, "_libsize.png")))
## Writing the image to: images/20180611_libsize.png and calling dev.off().

pp(image=tb_cf_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_norm_pca.png")))
## Writing the image to: images/20180611_norm_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=tb_cf_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## Writing the image to: images/20180611_fsva_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=tb_cf_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Writing the image to: images/20180611_norm_corheat.png and calling dev.off().

pp(image=tb_cf_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## Writing the image to: images/20180611_raw_density.png and calling dev.off().

pp(image=tb_cf_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## Writing the image to: images/20180611_boxplot.png and calling dev.off().

6 Attempt some quantification comparisons?

6.1 Our libraries

our_pairwise_filt <- sm(normalize_expt(our_protein_expt, filter=TRUE))
our_pairwise_comp <- sm(all_pairwise(our_pairwise_filt, model_batch="fsva", force=TRUE))
our_pairwise_nobatch <- sm(all_pairwise(our_pairwise_filt, model_batch=FALSE, force=TRUE))

6.2 Tb libraries

tb_pairwise_filt <- sm(normalize_expt(tb_protein_expt, filter=TRUE))
tb_pairwise_comp <- sm(all_pairwise(tb_pairwise_filt, model_batch="fsva", force=TRUE))
tb_pairwise_nobatch <- sm(all_pairwise(tb_pairwise_filt, model_batch=FALSE, force=TRUE))

6.2.1 Show a few metrics from the hpgltools pairwise comparisons

our_pairwise_comp$comparison$heat

tb_pairwise_comp$comparison$heat

7 For each msstats run, do a DE table

7.1 wt_filtrate vs wt_whole

7.1.1 Our libraries

keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_filtrate", "wt_whole"))
droppers <- c("undefined")
names(droppers) <- "log2fc"

## Make sure to set the rownames so it will merge into the excel file.
set_name <- "wt_filtrate_vs_wt_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
## Error in eval(expr, envir, enclos): object 'our_results' not found
wtcf_wtwhole_tables <- sm(combine_de_tables(
  ## our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excludes=droppers,
  excel=paste0("excel/our_wtcf_vs_wtwhole_tables-v", ver, ".xlsx")))
wtcf_nobatch_wtwhole_tables <- sm(combine_de_tables(
  ## our_pairwise_nobatch, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_nobatch, keepers=keepers,
  excludes=droppers,
  excel=paste0("excel/our_wtcf_vs_wtwhole_nobatch_tables-v", ver, ".xlsx")))
##comp_table <- wtcf_wtwhole_tables$data[[set_name]]
##cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")

##comp_table <- wtcf_nobatch_wtwhole_tables$data[[set_name]]
##cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")

7.1.2 Tb libraries

keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_filtrate", "wt_whole"))
droppers <- c("undefined")
names(droppers) <- "log2fc"

## Make sure to set the rownames so it will merge into the excel file.
set_name <- "wt_filtrate_vs_wt_whole"
## rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  ## tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excludes=droppers,
  excel=paste0("excel/tb_wtcf_vs_wtwhole_tables-v", ver, ".xlsx")))
wtcf_nobatch_wtwhole_tables <- sm(combine_de_tables(
  ## tb_pairwise_nobatch, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_nobatch, keepers=keepers,
  excludes=droppers,
  excel=paste0("excel/tb_wtcf_vs_wtwhole_nobatch_tables-v", ver, ".xlsx")))
##comp_table <- wtcf_wtwhole_tables$data[[set_name]]
##cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")

##comp_table <- wtcf_nobatch_wtwhole_tables$data[[set_name]]
##cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")

7.2 delta_filtrate vs delta_whole

7.2.1 Our libraries

keepers <- list(
  "deltacf_vs_deltawhole" = c("delta_filtrate", "delta_whole"))
set_name <- "delta_filtrate_vs_delta_whole"
## rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
deltacf_deltawhole_tables <- sm(combine_de_tables(
  ##  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_deltacf_vs_deltawhole_tables-v", ver, ".xlsx")))

7.2.2 Tb libraries

keepers <- list(
  "deltacf_vs_deltawhole" = c("delta_filtrate", "delta_whole"))
set_name <- "delta_filtrate_vs_delta_whole"
## rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
deltacf_deltawhole_tables <- sm(combine_de_tables(
##  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_deltacf_vs_deltawhole_tables-v", ver, ".xlsx")))

7.3 comp_filtrate vs comp_whole

7.3.1 Our libraries

keepers <- list(
  "compcf_vs_compwhole" = c("comp_filtrate", "comp_whole"))
set_name <- "comp_filtrate_vs_comp_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
## Error in eval(expr, envir, enclos): object 'our_results' not found
compcf_compwhole_tables <- sm(combine_de_tables(
##  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_compcf_vs_compwhole_tables-v", ver, ".xlsx")))

7.3.2 Tb libraries

keepers <- list(
  "compcf_vs_compwhole" = c("comp_filtrate", "comp_whole"))
set_name <- "comp_filtrate_vs_comp_whole"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compcf_compwhole_tables <- sm(combine_de_tables(
##  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_compcf_vs_compwhole_tables-v", ver, ".xlsx")))

7.4 delta_filtrate vs wt_filtrate

7.4.1 Our libraries

keepers <- list(
  "deltacf_vs_wtcf" = c("delta_filtrate", "wt_filtrate"))
set_name <- "delta_filtrate_vs_wt_filtrate"
##rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
deltacf_wtcf_tables <- sm(combine_de_tables(
##  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_deltacf_vs_wtcf_tables-v", ver, ".xlsx")))

7.4.2 Tb libraries

keepers <- list(
  "deltacf_vs_wtcf" = c("delta_filtrate", "wt_filtrate"))
set_name <- "delta_filtrate_vs_wt_filtrate"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
deltacf_wtcf_tables <- sm(combine_de_tables(
  ##tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_deltacf_vs_wtcf_tables-v", ver, ".xlsx")))

7.5 comp_filtrate vs wt_filtrate

7.5.1 Our libraries

keepers <- list(
  "compcf_vs_wtcf" = c("comp_filtrate", "wt_filtrate"))
set_name <- "comp_filtrate_vs_wt_filtrate"
##rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compcf_wtcf_tables <- sm(combine_de_tables(
##  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_compcf_vs_wtcf_tables-v", ver, ".xlsx")))

7.5.2 Tb libraries

keepers <- list(
  "compcf_vs_wtcf" = c("comp_filtrate", "wt_filtrate"))
set_name <- "comp_filtrate_vs_wt_filtrate"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compcf_wtcf_tables <- sm(combine_de_tables(
  ##  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_compcf_vs_wtcf_tables-v", ver, ".xlsx")))

7.6 delta_whole vs wt_whole

7.6.1 Our libraries

keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_filtrate", "wt_whole"))
set_name <- "delta_whole_vs_wt_whole"
##rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  ## our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_deltawhole_vs_wtwhole_tables-v", ver, ".xlsx")))

7.6.2 Tb libraries

keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_filtrate", "wt_whole"))
set_name <- "delta_whole_vs_wt_whole"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  ##tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_deltawhole_vs_wtwhole_tables-v", ver, ".xlsx")))

7.7 comp_whole vs wt_whole

7.7.1 Our libraries

keepers <- list(
  "compwhole_vs_wtwhole" = c("comp_whole", "wt_whole"))
set_name <- "comp_whole_vs_wt_whole"
##rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compwhole_wtwhole_tables <- sm(combine_de_tables(
  ##our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_compwhole_vs_wtwhole_tables-v", ver, ".xlsx")))

7.7.2 Tb libraries

keepers <- list(
  "compwhole_vs_wtwhole" = c("comp_whole", "wt_whole"))
set_name <- "comp_whole_vs_wt_whole"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compwhole_wtwhole_tables <- sm(combine_de_tables(
  ##tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_compwhole_vs_wtwhole_tables-v", ver, ".xlsx")))
our_table <- our_pairwise_comp$deseq$all_tables$wt_whole_vs_wt_filtrate
tb_table <- tb_pairwise_comp$deseq$all_tables$wt_whole_vs_wt_filtrate
our_order <- order(our_table[["limma_logfc"]])
rownames(tb_table) <- gsub(pattern="^(.*)_.*", replacement="\\1", x=rownames(tb_table))

test_table <- merge(our_table, tb_table, by="row.names")
cor.test(test_table$logFC.x, test_table$logFC.y)

8 RNASeq vs proteomics

Take the analysis performed on 02_relative_pe_expression, which unfortunately was only using the DDA data, now reuse it with the DIA data and see what happens.

all_de <- read.table("external_data/limma_result.csv", header=TRUE, sep="\t")

pe_ids <- read.table("reference/annotated_pe_genes.txt")
colnames(pe_ids) <- "gene"
pe_annotations <- merge(pe_ids, mtb_annotations, by.x="gene", by.y="id")
## Error in fix.by(by.y, y): 'by' must specify a uniquely valid column
chosen_colnames <- c("rv_id", "chr", "start", "end", "width", "strand",
                     "source", "type", "score", "phase", "id", "gene_id",
                     "description", "function")
colnames(pe_annotations) <- chosen_colnames
## Error in colnames(pe_annotations) <- chosen_colnames: object 'pe_annotations' not found
rownames(pe_annotations) <- pe_annotations[["id"]]
## Error in eval(expr, envir, enclos): object 'pe_annotations' not found
tmp_de <- all_de
colnames(tmp_de) <- c("id", "logfc", "ave_expr", "adjp", "abbrev_id", "description", "gene_length", "type")
colnames(mtb_annotations) <- chosen_colnames
all_de <- merge(tmp_de, mtb_annotations, by="id")
colnames(all_de) <- c("id","logfc", "ave_expr", "adjp", "abbrev_id", "description", "gene_length", "type",
                      "rv_id", "start", "end", "width", "strand", "source", "type", "na", "anotherna",
                      "mtb_id", "another_abbrev_id", "another_description", "html_string")
all_de <- all_de[, c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)]

pe_de <- merge(pe_annotations, all_de, by="id")
## Error in merge(pe_annotations, all_de, by = "id"): object 'pe_annotations' not found
colnames(pe_de) <- c("id", "rv_id", "chromosome", "start", "end", "width", "strand",
                     "source", "type", "score", "phase", "gene_id", "description", "function",
                     "logfc", "ave_expr", "adjp", "another_gene", "another_description", "gene_length",
                     "another_type", "another_id", "another_start", "another_end", "another_width",
                     "another_strand", "another_source", "another_type")
## Error in colnames(pe_de) <- c("id", "rv_id", "chromosome", "start", "end", : object 'pe_de' not found
pe_de <- pe_de[, c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17)]
## Error in eval(expr, envir, enclos): object 'pe_de' not found
## pe_de_annot <- merge(pe_de, mtb_annotations, by="locus_tag")
## The contrast is written as wt - delta, so add back the logFC I think
## Hopefully I did not reverse this in my head.
pe_de$ave_minus_log <- pe_de$ave_expr - pe_de$logfc
## Error in eval(expr, envir, enclos): object 'pe_de' not found
all_de$ave_minus_log <- all_de$ave_expr - all_de$logfc
pe_de$pseudo_rpkm <- (pe_de$ave_minus_log / pe_de$width) * 1000
## Error in eval(expr, envir, enclos): object 'pe_de' not found
all_de$pseudo_rpkm <- (all_de$ave_minus_log / all_de$width) * 1000

maximum_expression <- max(pe_de[["pseudo_rpkm"]])
## Error in eval(expr, envir, enclos): object 'pe_de' not found
maximum_all <- max(all_de$pseudo_rpkm)
pe_de[["exprs_vs_max"]] <- pe_de[["pseudo_rpkm"]] / maximum_expression
## Error in eval(expr, envir, enclos): object 'pe_de' not found
all_de$exprs_vs_max <- all_de$pseudo_rpkm / maximum_all

table_order <- order(pe_de[["exprs_vs_max"]], decreasing=TRUE)
## Error in order(pe_de[["exprs_vs_max"]], decreasing = TRUE): object 'pe_de' not found
pe_de <- pe_de[table_order, ]
## Error in eval(expr, envir, enclos): object 'pe_de' not found
knitr::kable(head(pe_de, n=50))
## Error in head(pe_de, n = 50): object 'pe_de' not found
write.csv(x=pe_de, file="pe_relative_expression.csv")
## Error in is.data.frame(x): object 'pe_de' not found
plot_histogram(pe_de$exprs_vs_max, bins=20)
## Error in plot_histogram(pe_de$exprs_vs_max, bins = 20): object 'pe_de' not found
prot_table <- tb_pairwise_nobatch$limma$all_tables[["wt_whole_vs_delta_whole"]]

all_combined_table <- merge(all_de, prot_table, by.x="id", by.y="row.names")
cor.test(all_combined_table$ave_expr, all_combined_table$AveExpr, method="spearman")
## Warning in cor.test.default(all_combined_table$ave_expr,
## all_combined_table$AveExpr, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  all_combined_table$ave_expr and all_combined_table$AveExpr
## S = 1.6e+09, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.4659
all_testing <- all_combined_table[, c("ave_expr", "AveExpr")]
all_comparison <- plot_linear_scatter(all_testing)
## Used Bon Ferroni corrected t test(s) between columns.
all_comparison$scatter

pe_combined_table <- merge(pe_de, prot_table, by.x="id", by.y="row.names")
## Error in merge(pe_de, prot_table, by.x = "id", by.y = "row.names"): object 'pe_de' not found
cor.test(pe_combined_table$ave_expr, pe_combined_table$AveExpr, method="spearman")
## Error in cor.test(pe_combined_table$ave_expr, pe_combined_table$AveExpr, : object 'pe_combined_table' not found
pe_testing <- pe_combined_table[, c("ave_expr", "AveExpr")]
## Error in eval(expr, envir, enclos): object 'pe_combined_table' not found
pe_comparison <- plot_linear_scatter(pe_testing)
## Error in data.frame(df[, c(1, 2)]): object 'pe_testing' not found
pe_comparison$scatter
## Error in eval(expr, envir, enclos): object 'pe_comparison' not found
if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 96dbc75fdeb993475ff61dd47697873d31a447fc
## R> packrat::restore()
## This is hpgltools commit: Fri Jul 27 17:35:57 2018 -0400: 96dbc75fdeb993475ff61dd47697873d31a447fc
## Saving to 02_swath2stats_20180720-v20180611.rda.xz

R version 3.5.1 (2018-07-02)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: foreach(v.1.4.4), Vennerable(v.3.1.0.9000), ruv(v.0.9.7), SWATH2stats(v.1.11.3), bindrcpp(v.0.2.2) and hpgltools(v.2018.03)

loaded via a namespace (and not attached): Rtsne(v.0.13), colorspace(v.1.3-2), rprojroot(v.1.3-2), htmlTable(v.1.12), corpcor(v.1.6.9), XVector(v.0.20.0), GenomicRanges(v.1.32.6), base64enc(v.0.1-3), rstudioapi(v.0.7), roxygen2(v.6.1.0), ggrepel(v.0.8.0), bit64(v.0.9-7), AnnotationDbi(v.1.42.1), xml2(v.1.2.0), codetools(v.0.2-15), splines(v.3.5.1), doParallel(v.1.0.11), robustbase(v.0.93-2), geneplotter(v.1.58.0), knitr(v.1.20), Formula(v.1.2-3), annotate(v.1.58.0), cluster(v.2.0.7-1), graph(v.1.58.0), compiler(v.3.5.1), httr(v.1.3.1), backports(v.1.1.2), assertthat(v.0.2.0), Matrix(v.1.2-14), lazyeval(v.0.2.1), limma(v.3.36.2), acepack(v.1.4.1), htmltools(v.0.3.6), prettyunits(v.1.0.2), tools(v.3.5.1), gtable(v.0.2.0), glue(v.1.3.0), GenomeInfoDbData(v.1.1.0), reshape2(v.1.4.3), dplyr(v.0.7.6), Rcpp(v.0.12.18), Biobase(v.2.40.0), BRAIN(v.1.26.0), preprocessCore(v.1.42.0), nlme(v.3.1-137), iterators(v.1.0.10), stringr(v.1.3.1), openxlsx(v.4.1.0), testthat(v.2.0.0), PolynomF(v.1.0-2), gtools(v.3.8.1), devtools(v.1.13.6), XML(v.3.98-1.12), DEoptimR(v.1.0-8), edgeR(v.3.22.3), directlabels(v.2018.05.22), MASS(v.7.3-50), zlibbioc(v.1.26.0), scales(v.0.5.0), doSNOW(v.1.0.16), hms(v.0.4.2), RBGL(v.1.56.0), parallel(v.3.5.1), SummarizedExperiment(v.1.10.1), RColorBrewer(v.1.1-2), yaml(v.2.2.0), memoise(v.1.1.0), gridExtra(v.2.3), pander(v.0.6.2), ggplot2(v.3.0.0), rpart(v.4.1-13), biomaRt(v.2.36.1), latticeExtra(v.0.6-28), stringi(v.1.2.4), RSQLite(v.2.1.1), genefilter(v.1.62.0), S4Vectors(v.0.18.3), checkmate(v.1.8.5), BiocGenerics(v.0.26.0), zip(v.1.0.0), BiocParallel(v.1.14.2), GenomeInfoDb(v.1.16.0), rlang(v.0.2.1), pkgconfig(v.2.0.1), commonmark(v.1.5), matrixStats(v.0.54.0), bitops(v.1.0-6), evaluate(v.0.11), lattice(v.0.20-35), purrr(v.0.2.5), bindr(v.0.1.1), htmlwidgets(v.1.2), labeling(v.0.3), bit(v.1.1-14), tidyselect(v.0.2.4), plyr(v.1.8.4), magrittr(v.1.5), DESeq2(v.1.20.0), R6(v.2.2.2), snow(v.0.4-2), IRanges(v.2.14.10), Hmisc(v.4.1-1), DelayedArray(v.0.6.2), DBI(v.1.0.0), pillar(v.1.3.0), foreign(v.0.8-71), withr(v.2.1.2), mgcv(v.1.8-24), survival(v.2.42-6), RCurl(v.1.95-4.11), nnet(v.7.3-12), tibble(v.1.4.2), crayon(v.1.3.4), KernSmooth(v.2.23-15), rmarkdown(v.1.10), progress(v.1.2.0), locfit(v.1.5-9.1), grid(v.3.5.1), sva(v.3.28.0), data.table(v.1.11.4), blob(v.1.1.1), digest(v.0.6.15), xtable(v.1.8-2), stats4(v.3.5.1), munsell(v.0.5.0) and quadprog(v.1.5-5)

---
title: "M.tuberculosis 20180611: Analyzing data from OpenSwathWorkFlow/TRIC (20180720)."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
if (!isTRUE(get0("skip_load"))) {
  library(hpgltools)
  tt <- devtools::load_all("~/hpgltools")
  knitr::opts_knit$set(progress=TRUE,
                       verbose=TRUE,
                       width=90,
                       echo=TRUE)
  knitr::opts_chunk$set(error=TRUE,
                        fig.width=8,
                        fig.height=8,
                        dpi=96)
  old_options <- options(digits=4,
                         stringsAsFactors=FALSE,
                         knitr.duplicate.label="allow")
  ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
  ver <- "20180611"
  previous_file <- "01_preprocessing_comet_highres.Rmd"

  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))
  previous_file <- "01_annotation_20180611.Rmd"
  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))

  rmd_file <- "02_swath2stats_20180720.Rmd"
}
```

# TODO

* This is a copy of the same file dated 20180611, but an attempt to react to a
  query from Yan:
* I am using this worksheet to also attempt to address figure-specific TODO
  items.

 "I was baffled by your result that protein (or peptide) abundance of identified
 peptides were not changed between March and May. That will completely counter
 my assumption that a complete cleaning of the mass spec will solve our
 problem. I wonder how much would it will take to generate more detailed data
 from our old friend PE/PPE proteins. The information that will be helpful would
 be intensity, mass accuracy, and chromatography peak width of each peptide in
 each sample. Please feel free to stop by tomorrow if you have questions.

 I like this group of proteins because 1. it's of extra interest to Volker and
 Dallas, 2. it's a small list of proteins, so easy to evaluate. and 3. Protein
 abundance covers a wide range, from relatively high abundance to barely
 detectable."

I think therefore I will leave the analyses here alone, but jump down to where I
parse/plot the DIA data and attempt to subset it for the PPE proteins in ways
similar/identical to Yan's query.  With that in mind, I am going to set
eval=FALSE on the blocks of code in this file until I find the one which is
actually relevant.

In addition, I am moving blocks of code/analysis up to the top which I think
will be relevant.

## Plotting metrics of identified peaks from DIA data

The pyprophet data for all DIA samples should provide the metrics in which Yan
is interested.  I will focus on the tuberculist transition libraries because
that was the one I saw first.

### The tuberculist data

```{r tb_pe_pyprophet_plots}
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]
tb_pyprophet_fun <- extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx",
                                           pyprophet_column="tuberculistscored")

pe_pyprophet_fun <- subset_pyprophet_data(tb_pyprophet_fun, subset=pe_genes)

tb_mass_plot <- sm(plot_pyprophet_distribution(pe_pyprophet_fun, column="mass"))
pp(file="images/20180720_ppe_mass_boxplot.png", image=tb_mass_plot[["dotboxplot"]])

tb_intensity_plot <- sm(plot_pyprophet_distribution(pe_pyprophet_fun, column="intensity"))
pp(file="images/20180720_ppe_intensity_boxplot.png", image=tb_intensity_plot[["dotboxplot"]])

tb_lwidth_plot <- sm(plot_pyprophet_distribution(pe_pyprophet_fun, column="leftwidth"))
pp(file="images/20180720_ppe_lwidth_boxplot.png", image=tb_lwidth_plot[["dotboxplot"]])

tb_rwidth_plot <- sm(plot_pyprophet_distribution(pe_pyprophet_fun, column="rightwidth"))
pp(file="images/20180720_ppe_rwidth_boxplot.png", image=tb_rwidth_plot[["dotboxplot"]])
```

```{r tb_pyprophet_plots}
tb_pyprophet_fun <- extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx",
                                           pyprophet_column="tuberculistscored")

tb_mass_plot <- sm(plot_pyprophet_distribution(tb_pyprophet_fun, column="mass"))
tb_mass_plot[["boxplot"]]

tb_deltart_plot_all <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun, column="delta_rt"))
tb_deltart_plot_all[["boxplot"]]

tb_deltart_plot_real <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
tb_deltart_plot_real[["boxplot"]]

tb_deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
tb_deltart_plot_decoys[["boxplot"]]
tb_deltart_plot_decoys[["density"]]
tb_deltart_plot_decoys[["violin"]]

tb_widthvsmass <- sm(plot_pyprophet_data(tb_pyprophet_fun, legend=FALSE))
tb_widthvsmass$plot
```

Analyzing data from openMS and friends.
=======================================

In preprocessing_comet_highres.Rmd, I used the openMS tutorials and supplemental
materials from a couple papers to hopefully correctly perform the various
preprocessing tasks required to extract intensity data from DIA/SWATH
transitions.

The final steps of that process combined the transition intensities from every
sample into a metadata file (results/tric/HCD_meta.tsv), an intensity matrix
(results/tric/HCD_outmatrix.tsv), and a feature aligned output matrix
(results/tric/aligned_comet_HCD.tsv).

My reading of the SWATH2stats and MSstats source code suggests to me that the
log2(intensities) of the feature aligned data are our final proxy for protein
abundance.  At first glance, this suggests to me that these data might follow a
distribution similar to RNASeq data (negative binomial, but perhaps with a
bigger tail?).  In addition, by the time we use tric on the data, we have a
count matrix and sample annotation data frames which look remarkably similar to
those used in a RNASeq expressionset.  Indeed, by the end of the MSstats
processing, it creates a MSnSet class of its own which uses fData/exprs/pData.

For the curious, my reasoning for saying that the log intensities are our proxy
for abundance comes from MSstats/R/DataProcess.R in a clause which looks like:

```{r snippet, eval=FALSE}
if (logTrans == 2) {
  work[["ABUNDANCE"]] <- log2(work[["ABUNDANCE"]])
} else if (logTrans == 10) {
  work[["ABUNDANCE"]] <- log10(work[["ABUNDANCE"]])
} else {
  ## Above there was a check for only log 2 and 10, but we can do e if we want.
  ## I might go back up there and remove that check. Long live e! 2.718282 rules!
  work[["ABUNDANCE"]] <- log(work[["ABUNDANCE"]]) / log(logTrans)
}
```

(Note: I added the natural log to the set of conditions, but otherwise the logic
is unchanged.)

With that in mind, I want to use some tools with which I am familiar in order to
try to understand these data.  Therefore I will first attempt to coerce my tric
aligned data and annotations into a 'normal' expressionset.  Then I want to do
some diagnostic plots which, if I am wrong and these distributions are not as
expected, will be conceptually incorrect (I don't yet think I am wrong).

## Sample annotation via SWATH2stats

I am using the SWATH2stats vignette as my primary source of information.  Thus I
see that it uses the
OpenSWATH_SM3_GoldStandardAutomatedResults_human_peakgroups.txt which has a
format nearly identical to the tric output matrix.  Thus for the moment I will
assume that the proper input for SWATH2stats is 'results/tric/comet_HCD.tsv' and
not the metadata nor output matrix.  In addition, SWATH2stats provides a similar
conversion function which takes the tric output and coerces it into a similar
matrix after performing its various filtering tasks.  Therefore I will use that.

I keep a sample sheet of all the DIA samples used in this analysis in
'sample_sheets/dia_samples.xlsx'  It should contain all the other required data
with one important caveat, I removed 1 sample by 'commenting' it (e.g. prefixing
it with '##' -- which is an admittedly dumb thing to do in an excel file.

One last caveat: I hacked the SWATH2stats sample_annotation() function to add a
couple columns in an attempt to make it a little more robust when faced with
sample sheets with differently named columns.

In addition, SWATH2stats provides some nice filtering and combination
functions which should be considered when generating various expressionset data
structures later.

### Creating a swath2stats experiment using our comet-derived library data

```{r our_swath2stats_initial}
loaded <- sm(devtools::load_all("~/scratch/git/SWATH2stats"))

our_tric_data <- read.csv("results/tric/20180611/whole_8mz/comet_HCD.tsv", sep="\t")

sample_annot <- openxlsx::read.xlsx("sample_sheets/Mtb_dia_samples.xlsx")
colnames(sample_annot) <- gsub(pattern="\\.", replacement="", x=colnames(sample_annot))
## Drop samples starting with comments
keep_idx <- ! grepl(pattern="^#", x=sample_annot[["sampleid"]])
sample_annot <- sample_annot[keep_idx, ]
sample_annot[["sampleid"]] <- paste0("s", sample_annot[["sampleid"]])
rownames(sample_annot) <- make.names(sample_annot[["sampleid"]], unique=TRUE)

expt_idx <- sample_annot[["expt_id"]] == "may2018" | sample_annot[["expt_id"]] == "mar2018"
expt_idx[is.na(sample_annot[["expt_id"]])] <- FALSE
sample_annot <- sample_annot[expt_idx, ]
mz_idx <- sample_annot[["windowsize"]] == "8"
sample_annot <- sample_annot[mz_idx, ]
## Set the mzXML column to match the filename column in the data.

## s2s, my witty way of shortening SWATH2stats...
our_s2s_exp <- sample_annotation(data=our_tric_data,
                                 sample_annotation=sample_annot,
                                 run_column="run",
                                 fullpeptidename_column="fullunimodpeptidename")
```

### Creating a swath2stats experiment using the tuberculist-derived library data

This just repeats the previous stanza, but uses the tuberculist
transition-library swath data as its input rather than our own comet
data.

There is one important caveat in the following block: I used a regex to remove
the second half of geneID_geneName so that later when I merge in the annotation
data I have it will match.

```{r tb_swath2stats_initial}
tb_tric_data <- read.csv("results/tric/20180611/whole_8mz_tuberculist/comet_HCD.tsv", sep="\t")

tb_tric_data[["ProteinName"]] <- gsub(pattern="^(.*)_.*$", replacement="\\1",
                                      x=tb_tric_data[["ProteinName"]])
tb_s2s_exp <- sample_annotation(data=tb_tric_data,
                                sample_annotation=sample_annot,
                                fullpeptidename_column="fullunimodpeptidename")
```

Now I have a couple data structures which should prove useful for the metrics
provided by SWATH2stats, MSstats, and my own hpgltools.

# SWATH2stats continued

The various metrics and filters provided by SWATH2stats seem quite reasonable to
me.  The only thing that really bothers me is that they are all case sensitive
and I found that the most recent tric changed the capitalization of a column,
causing these to all fall down.  Therefore I went in and made everything case
insensitive in a fashion similar to that done by MSstats (except I hate capital
letters, so I used tolower() rather than toupper()).

## Run filters on our comet data

```{r our_swath2stats_processing}
## Get correlations on a sample by sample basis
pp(file="images/20180611_our_swath2stats_sample_cor.png")
sample_cor <- plot_correlation_between_samples(our_s2s_exp, size=2,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
sample_cond_rep_cor <- plot_correlation_between_samples(our_s2s_exp, size=2,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")

decoy_lists <- assess_decoy_rate(our_s2s_exp)
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(our_s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(our_s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
chosen_mscore <- mscore4assayfdr(our_s2s_exp, FFT=0.7, fdr_target=0.02)
prot_score <- mscore4protfdr(our_s2s_exp, FFT=0.7, fdr_target=0.02)

our_mscore_filtered <- filter_mscore(our_s2s_exp, chosen_mscore)
our_freq_mscore <- filter_mscore_freqobs(our_s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
our_data_filtered_fdr <- filter_mscore_fdr(our_mscore_filtered, FFT=0.7,
                                           overall_protein_fdr_target=prot_score,
                                           upper_overall_peptide_fdr_limit=0.05)
our_only_proteotypic <- filter_proteotypic_peptides(our_data_filtered_fdr)
our_all_filtered <- filter_all_peptides(our_only_proteotypic)
our_only_strong <- filter_on_max_peptides(data=our_all_filtered, n_peptides=10)
our_only_minimum <- filter_on_min_peptides(data=our_only_strong, n_peptides=3)

## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
matrix_prefix <- file.path("results", "swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}
protein_matrix_all <- write_matrix_proteins(
  our_s2s_exp, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_all.csv"))
dim(protein_matrix_all)
protein_matrix_mscore <- write_matrix_proteins(
  our_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_matrix_mscore.csv"))
dim(protein_matrix_mscore)
peptide_matrix_mscore <- write_matrix_peptides(
  our_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_peptide_matrix_mscore.csv"))
dim(peptide_matrix_mscore)
protein_matrix_minimum <- write_matrix_proteins(
  our_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_matrix_minimum.csv"))
dim(protein_matrix_minimum)
peptide_matrix_minimum <- write_matrix_peptides(
  our_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_peptide_matrix_minimum.csv"))
dim(peptide_matrix_minimum)

rt_cor <- plot_correlation_between_samples(
  our_only_minimum, column.values="intensity", fun.aggregate=sum, size=2)
## I have no effing clue what this plot means.
variation <- plot_variation(our_only_minimum, fun.aggregate=sum)

cols <- colnames(our_only_minimum)
our_disaggregated <- disaggregate(our_only_minimum, all.columns=TRUE)
our_msstats_input <- convert_MSstats(our_disaggregated)
```

## Run filters on the tuberculist

Repeat the above stanza using the tuberculist data.

```{r tb_swath2stats_processing}
## Get correlations on a sample by sample basis
pp(file="images/20180611_tb_swath2stats_sample_cor.png")
sample_cor <- plot_correlation_between_samples(tb_s2s_exp, size=2,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
sample_cond_rep_cor <- plot_correlation_between_samples(tb_s2s_exp, size=2,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")

decoy_lists <- assess_decoy_rate(tb_s2s_exp)
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(tb_s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(tb_s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
chosen_mscore <- mscore4assayfdr(tb_s2s_exp, FFT=0.7, fdr_target=0.02)
prot_score <- mscore4protfdr(tb_s2s_exp, FFT=0.7, fdr_target=0.02)

tb_mscore_filtered <- filter_mscore(tb_s2s_exp, chosen_mscore)
tb_freq_mscore <- filter_mscore_freqobs(tb_s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
tb_data_filtered_fdr <- filter_mscore_fdr(tb_mscore_filtered, FFT=0.7,
                                           overall_protein_fdr_target=prot_score,
                                           upper_overall_peptide_fdr_limit=0.05)
tb_only_proteotypic <- filter_proteotypic_peptides(tb_data_filtered_fdr)
tb_all_filtered <- filter_all_peptides(tb_only_proteotypic)
tb_only_strong <- filter_on_max_peptides(data=tb_all_filtered, n_peptides=10)
tb_only_minimum <- filter_on_min_peptides(data=tb_only_strong, n_peptides=3)

## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
matrix_prefix <- file.path("results", "swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}
protein_matrix_all <- write_matrix_proteins(
  tb_s2s_exp, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_all.csv"))
dim(protein_matrix_all)
protein_matrix_mscore <- write_matrix_proteins(
  tb_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_matrix_mscore.csv"))
dim(protein_matrix_mscore)
peptide_matrix_mscore <- write_matrix_peptides(
  tb_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_peptide_matrix_mscore.csv"))
dim(peptide_matrix_mscore)
protein_matrix_minimum <- write_matrix_proteins(
  tb_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_matrix_minimum.csv"))
dim(protein_matrix_minimum)
peptide_matrix_minimum <- write_matrix_peptides(
  tb_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_peptide_matrix_minimum.csv"))
dim(peptide_matrix_minimum)

rt_cor <- plot_correlation_between_samples(size=2,
  tb_only_minimum, column.values="intensity", fun.aggregate=sum)
## I have no effing clue what this plot means.
variation <- plot_variation(tb_only_minimum, fun.aggregate=sum)

cols <- colnames(tb_only_minimum)
tb_disaggregated <- disaggregate(tb_only_minimum, all.columns=TRUE)
tb_msstats_input <- convert_MSstats(tb_disaggregated)
```

## Some new plots

In response to some interesting queries from Yan, I made a few little functions
which query and plot data from the scored data provided by openswath/pyprophet.
Let us look at their results here.

### Using the comet data

```{r our_pyprophet_plots}
our_pyprophet_fun <- sm(extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx"))

our_mass_plot <- sm(plot_pyprophet_distribution(our_pyprophet_fun, column="mass"))
our_mass_plot[["boxplot"]]
our_mass_plot[["density"]]

our_deltart_plot_all <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun, column="delta_rt"))
our_deltart_plot_all[["boxplot"]]

our_deltart_plot_real <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
our_deltart_plot_real[["boxplot"]]

our_deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
our_deltart_plot_decoys[["boxplot"]]
our_deltart_plot_decoys[["density"]]

our_widthvsmass <- sm(plot_pyprophet_data(our_pyprophet_fun, legend=FALSE))
our_widthvsmass$plot
```

## MSstats

msstats.org seems to provide a complete solution for performing reasonable metrics of this data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

### Comet derived msstats

```{r our_msstats_quant, eval=FALSE}
devtools::load_all("~/scratch/git/MSstats")
our_msstats_quant <- dataProcess(our_msstats_input)
our_msstats_plots <- sm(dataProcessPlots(our_msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(our_msstats_input$condition))
my_levels
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_filtrate", "delta_filtrate", "comp_filtrate",
               "delta_filtrate", "comp_filtrate", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_filtrate", "wt_filtrate", "wt_whole",
                 "wt_whole"))
our_results <- list()
for (c in 1:length(rownames(comparisons))) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  our_results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                   data=our_msstats_quant))
}
```

# This is Relevant to Yan's query!

## P/PE protein QC plots for Yan

Yan asked for the p/pe protein qc plots. ok.  I changed the dataProcessPlots to
return something useful, so that should be possible now.

```{r pe, eval=FALSE}
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]

## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- our_msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="",
                        x=levels(our_msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots

pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in 1:length(pe_plots)) {
  plot(pe_plots[[p]])
}
dev.off()
length(pe_plots)
```

### Tuberculist derived msstats

```{r tb_msstats_quant, eval=FALSE}
tb_msstats_quant <- dataProcess(tb_msstats_input)
##tb_msstats_plots <- sm(dataProcessPlots(tb_msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(tb_msstats_input$condition))
my_levels
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_filtrate", "delta_filtrate", "comp_filtrate",
               "delta_filtrate", "comp_filtrate", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_filtrate", "wt_filtrate", "wt_whole",
                 "wt_whole"))
tb_results <- list()
for (c in 1:length(rownames(comparisons))) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                   data=tb_msstats_quant))
}
```

# Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities from
SWATH2stats, metadata, and annotation data;  attempt to create a 'normal'
expressionset; poke at it to see what I can learn.

## Massaging the metadata

I want to use the same metadata as were used for MSstats.  It has a few
important differences from the requirements of hpgltools: pretty much only that
I do not allow rownames/sampleIDs to start with a number.

## Massaging the intensity matrix

I do not want the \1 before the protein names, I already merged them into one
entry per gene vis SWATH2stats.

### Make matrices with our libraries

```{r our_protein_matrix}
our_prot_mtrx <- read.csv(file.path("results", "swath2stats", ver, "our_protein_matrix_minimum.csv"))
rownames(our_prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=our_prot_mtrx[["proteinname"]])
our_prot_mtrx <- our_prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
colnames(our_prot_mtrx) <- gsub(pattern="^.*2018", replacement="s2018", x=colnames(our_prot_mtrx))
```

### Make matrices with the tb libraries

```{r tb_protein_matrix}
tb_prot_mtrx <- read.csv(file.path("results", "swath2stats", ver, "tb_protein_matrix_minimum.csv"))
rownames(tb_prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=tb_prot_mtrx[["proteinname"]])
tb_prot_mtrx <- tb_prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
colnames(tb_prot_mtrx) <- gsub(pattern="^.*2018", replacement="s2018", x=colnames(tb_prot_mtrx))
```

## Merge the pieces

Now we should have sufficient pieces to make an expressionset.

While here, I will also split the data into a cf and whole-cell pair of data structures.

### Our data

```{r our_expt}
## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(our_prot_mtrx)
our_metadata <- sample_annot[reordered, ]

our_protein_expt <- create_expt(metadata=our_metadata,
                                count_dataframe=our_prot_mtrx,
                                gene_info=mtb_annotations)
new_colors <- c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E", "#E6AB02")
names(new_colors) <- c("comp_filtrate", "comp_whole", "delta_filtrate", "delta_whole", "wt_filtrate", "wt_whole")
names(new_colors) <- c("wt_whole", "comp_whole", "delta_whole", "delta_filtrate", "wt_filtrate", "comp_filtrate")
our_protein_expt <- set_expt_colors(our_protein_expt, new_colors)

our_whole_expt <- subset_expt(our_protein_expt, subset="collectiontype=='whole'")
our_cf_expt <- subset_expt(our_protein_expt, subset="collectiontype=='filtrate'")
```

### Tb data

```{r tb_expt}
## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(tb_prot_mtrx)
metadata <- sample_annot[reordered, ]

tb_protein_expt <- sm(create_expt(metadata,
                                  count_dataframe=tb_prot_mtrx,
                                  gene_info=mtb_annotations))
written <- write_expt(tb_protein_expt, excel=paste0("excel/written-v", ver, ".xlsx"))
tb_whole_expt <- subset_expt(tb_protein_expt, subset="collectiontype=='whole'")
tb_cf_expt <- subset_expt(tb_protein_expt, subset="collectiontype=='filtrate'")
```

## Metrics of the full data set

### Our libraries

```{r our_protein_metrics, fig.show='hide'}
our_protein_metrics <- sm(graph_metrics(our_protein_expt))
our_protein_norm <- sm(normalize_expt(our_protein_expt, transform="log2", convert="cpm",
                                      norm="quant", filter=TRUE))
our_protein_norm_metrics <- sm(graph_metrics(our_protein_norm))
our_protein_fsva <- sm(normalize_expt(our_protein_expt, transform="log2",
                                   batch="fsva", filter=TRUE))
our_protein_fsva_metrics <- sm(graph_metrics(our_protein_fsva))
```

### Tb libraries

```{r tb_protein_metrics, fig.show='hide'}
tb_protein_metrics <- sm(graph_metrics(tb_protein_expt))
tb_protein_varplot <- plot_variance_coefficients(tb_protein_expt, x_axis="batch")
tb_protein_varplot <- plot_variance_coefficients(tb_protein_expt)
tb_protein_norm <- sm(normalize_expt(tb_protein_expt, transform="log2", convert="cpm",
                                     norm="quant", filter=TRUE))
tb_protein_norm_metrics <- sm(graph_metrics(tb_protein_norm))
tb_protein_fsva <- sm(normalize_expt(tb_protein_expt, transform="log2", convert="cpm",
                                     batch="fsva", filter=TRUE))
tb_protein_fsva_metrics <- sm(graph_metrics(tb_protein_fsva))
```

## Metrics of the whole-cell data set

### Our libraries

```{r our_whole_metrics, fig.show='hide'}
our_whole_metrics <- sm(graph_metrics(our_whole_expt))
our_whole_norm <- sm(normalize_expt(our_whole_expt, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE))
our_whole_norm_metrics <- sm(graph_metrics(our_whole_norm))
our_whole_fsva <- sm(normalize_expt(our_whole_expt, transform="log2", convert="cpm",
                                    batch="fsva", filter=TRUE))
our_whole_fsva_metrics <- sm(graph_metrics(our_whole_fsva))
```

### Tb libraries

```{r tb_whole_metrics, fig.show='hide'}
tb_whole_metrics <- sm(graph_metrics(tb_whole_expt))
tb_whole_norm <- sm(normalize_expt(tb_whole_expt, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE))
tb_whole_norm_metrics <- sm(graph_metrics(tb_whole_norm))
tb_whole_fsva <- sm(normalize_expt(tb_whole_expt, transform="log2", convert="cpm",
                                    batch="fsva", filter=TRUE))
tb_whole_fsva_metrics <- sm(graph_metrics(tb_whole_fsva))
```

## Metrics of the filtrate data set

### Our libraries

```{r our_cf_metrics, fig.show='hide'}
our_cf_metrics <- sm(graph_metrics(our_cf_expt))
our_cf_norm <- sm(normalize_expt(our_cf_expt, transform="log2", convert="cpm",
                                 norm="quant", filter=TRUE))
our_cf_norm_metrics <- sm(graph_metrics(our_cf_norm))
our_cf_fsva <- sm(normalize_expt(our_cf_expt, transform="log2", convert="cpm",
                                 batch="fsva", filter=TRUE))
our_cf_fsva_metrics <- sm(graph_metrics(our_cf_fsva))
```

### Tb libraries

```{r tb_cf_metrics, fig.show='hide'}
tb_cf_metrics <- sm(graph_metrics(tb_cf_expt))
tb_cf_norm <- sm(normalize_expt(tb_cf_expt, transform="log2", convert="cpm",
                                 norm="quant", filter=TRUE))
tb_cf_norm_metrics <- sm(graph_metrics(tb_cf_norm))
tb_cf_fsva <- sm(normalize_expt(tb_cf_expt, transform="log2", convert="cpm",
                                 batch="fsva", filter=TRUE))
tb_cf_fsva_metrics <- sm(graph_metrics(tb_cf_fsva))
```

## plot some metrics

### Our libraries

```{r our_print_metrics, eval=FALSE}
pp(image=our_protein_metrics$libsize,
   file=file.path("images", paste0(ver, "_our_libsize.png")))
## It seems to me that the scale of the data is all within an order of magnitude or two.
## I cannot get used to these absurdly large numbers though.
pp(image=our_protein_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_our_norm_pca.png")))
## There appears to be a nice split in the data, however the un-assayable batch
## effect is a problem.
pp(image=our_protein_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## fsva seems to get some handle on the data, but I don't think we should rely
## upon it.
pp(image=our_protein_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Once again, the whole-cell/culture-filtrate split is very large.
pp(image=our_protein_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## There are two obvious distributions in the data, once again split between types.
pp(image=our_protein_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## This recapitulates the previous plot.

pp(image=our_whole_metrics$libsize,
   file=file.path("images", paste0(ver, "_whole_libsize.png")))
pp(image=our_whole_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_norm_pca.png")))
pp(image=our_whole_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_fsva_pca.png")))
pp(image=our_whole_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_whole_norm_corheat.png")))
pp(image=our_whole_metrics$density,
   file=file.path("images", paste0(ver, "_whole_raw_density.png")))
pp(image=our_whole_metrics$boxplot,
   file=file.path("images", paste0(ver, "_whole_boxplot.png")))

pp(image=our_cf_metrics$libsize,
   file=file.path("images", paste0(ver, "_libsize.png")))
pp(image=our_cf_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_norm_pca.png")))
pp(image=our_cf_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
pp(image=our_cf_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
pp(image=our_cf_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
pp(image=our_cf_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
```

### Tb libraries

I think the figure S1 intensities may come from these:

```{r figure_s1a}
## Reminder:  tb_protein_metrics <- sm(graph_metrics(tb_protein_expt))
figure_expt <- set_expt_samplenames(tb_protein_expt, pData(tb_protein_expt)[["figurename"]])
fig_s1a <- plot_libsize(figure_expt, title="", order_by="condition", y_label="Sum(intensities) by run",
                        order=c("wt_whole", "delta_whole", "comp_whole",
                                "wt_filtrate", "delta_filtrate", "comp_filtrate"))
pp(file="images/fig_s1a.pdf", image=fig_s1a$plot)
```

```{r tb_print_metrics}
pp(image=tb_protein_metrics$libsize,
   file=file.path("images", paste0(ver, "_tb_libsize.png")))
## It seems to me that the scale of the data is all within an order of magnitude or two.
## I cannot get used to these absurdly large numbers though.
pp(image=tb_protein_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_tb_norm_pca.png")))
## There appears to be a nice split in the data, however the un-assayable batch
## effect is a problem.
pp(image=tb_protein_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## fsva seems to get some handle on the data, but I don't think we should rely
## upon it.
pp(image=tb_protein_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Once again, the whole-cell/culture-filtrate split is very large.
pp(image=tb_protein_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## There are two obvious distributions in the data, once again split between types.
pp(image=tb_protein_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## This recapitulates the previous plot.

pp(image=tb_whole_metrics$libsize,
   file=file.path("images", paste0(ver, "_whole_libsize.png")))
pp(image=tb_whole_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_norm_pca.png")))
pp(image=tb_whole_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_fsva_pca.png")))
pp(image=tb_whole_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_whole_norm_corheat.png")))
pp(image=tb_whole_metrics$density,
   file=file.path("images", paste0(ver, "_whole_raw_density.png")))
pp(image=tb_whole_metrics$boxplot,
   file=file.path("images", paste0(ver, "_whole_boxplot.png")))

pp(image=tb_cf_metrics$libsize,
   file=file.path("images", paste0(ver, "_libsize.png")))
pp(image=tb_cf_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_norm_pca.png")))
pp(image=tb_cf_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
pp(image=tb_cf_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
pp(image=tb_cf_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
pp(image=tb_cf_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
```

# Attempt some quantification comparisons?

## Our libraries

```{r our_de_testing, fig.show='hide'}
our_pairwise_filt <- sm(normalize_expt(our_protein_expt, filter=TRUE))
our_pairwise_comp <- sm(all_pairwise(our_pairwise_filt, model_batch="fsva", force=TRUE))
our_pairwise_nobatch <- sm(all_pairwise(our_pairwise_filt, model_batch=FALSE, force=TRUE))
```

## Tb libraries

```{r tb_de_testing, fig.show='hide'}
tb_pairwise_filt <- sm(normalize_expt(tb_protein_expt, filter=TRUE))
tb_pairwise_comp <- sm(all_pairwise(tb_pairwise_filt, model_batch="fsva", force=TRUE))
tb_pairwise_nobatch <- sm(all_pairwise(tb_pairwise_filt, model_batch=FALSE, force=TRUE))
```

### Show a few metrics from the hpgltools pairwise comparisons

```{r show_de_testing}
our_pairwise_comp$comparison$heat
tb_pairwise_comp$comparison$heat
```

# For each msstats run, do a DE table

## wt_filtrate vs wt_whole

### Our libraries

```{r our_combine_wtcf_wtwhole, fig.show='hide'}
keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_filtrate", "wt_whole"))
droppers <- c("undefined")
names(droppers) <- "log2fc"

## Make sure to set the rownames so it will merge into the excel file.
set_name <- "wt_filtrate_vs_wt_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  ## our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excludes=droppers,
  excel=paste0("excel/our_wtcf_vs_wtwhole_tables-v", ver, ".xlsx")))

wtcf_nobatch_wtwhole_tables <- sm(combine_de_tables(
  ## our_pairwise_nobatch, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_nobatch, keepers=keepers,
  excludes=droppers,
  excel=paste0("excel/our_wtcf_vs_wtwhole_nobatch_tables-v", ver, ".xlsx")))

##comp_table <- wtcf_wtwhole_tables$data[[set_name]]
##cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")

##comp_table <- wtcf_nobatch_wtwhole_tables$data[[set_name]]
##cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
```

### Tb libraries

```{r tb_combine_wtcf_wtwhole, fig.show='hide'}
keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_filtrate", "wt_whole"))
droppers <- c("undefined")
names(droppers) <- "log2fc"

## Make sure to set the rownames so it will merge into the excel file.
set_name <- "wt_filtrate_vs_wt_whole"
## rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  ## tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excludes=droppers,
  excel=paste0("excel/tb_wtcf_vs_wtwhole_tables-v", ver, ".xlsx")))

wtcf_nobatch_wtwhole_tables <- sm(combine_de_tables(
  ## tb_pairwise_nobatch, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_nobatch, keepers=keepers,
  excludes=droppers,
  excel=paste0("excel/tb_wtcf_vs_wtwhole_nobatch_tables-v", ver, ".xlsx")))

##comp_table <- wtcf_wtwhole_tables$data[[set_name]]
##cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")

##comp_table <- wtcf_nobatch_wtwhole_tables$data[[set_name]]
##cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
```

## delta_filtrate vs delta_whole

### Our libraries

```{r our_combine_deltacf_deltawhole, fig.show='hide'}
keepers <- list(
  "deltacf_vs_deltawhole" = c("delta_filtrate", "delta_whole"))
set_name <- "delta_filtrate_vs_delta_whole"
## rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
deltacf_deltawhole_tables <- sm(combine_de_tables(
  ##  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_deltacf_vs_deltawhole_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_deltacf_deltawhole, fig.show='hide'}
keepers <- list(
  "deltacf_vs_deltawhole" = c("delta_filtrate", "delta_whole"))
set_name <- "delta_filtrate_vs_delta_whole"
## rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
deltacf_deltawhole_tables <- sm(combine_de_tables(
##  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_deltacf_vs_deltawhole_tables-v", ver, ".xlsx")))
```

## comp_filtrate vs comp_whole

### Our libraries

```{r our_combine_compcf_compwhole, fig.show='hide'}
keepers <- list(
  "compcf_vs_compwhole" = c("comp_filtrate", "comp_whole"))
set_name <- "comp_filtrate_vs_comp_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compcf_compwhole_tables <- sm(combine_de_tables(
##  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_compcf_vs_compwhole_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_compcf_compwhole, fig.show='hide'}
keepers <- list(
  "compcf_vs_compwhole" = c("comp_filtrate", "comp_whole"))
set_name <- "comp_filtrate_vs_comp_whole"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compcf_compwhole_tables <- sm(combine_de_tables(
##  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_compcf_vs_compwhole_tables-v", ver, ".xlsx")))
```

## delta_filtrate vs wt_filtrate

### Our libraries

```{r our_combine_deltacf_wtcf, fig.show='hide'}
keepers <- list(
  "deltacf_vs_wtcf" = c("delta_filtrate", "wt_filtrate"))
set_name <- "delta_filtrate_vs_wt_filtrate"
##rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
deltacf_wtcf_tables <- sm(combine_de_tables(
##  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_deltacf_vs_wtcf_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_deltacf_wtcf, fig.show='hide'}
keepers <- list(
  "deltacf_vs_wtcf" = c("delta_filtrate", "wt_filtrate"))
set_name <- "delta_filtrate_vs_wt_filtrate"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
deltacf_wtcf_tables <- sm(combine_de_tables(
  ##tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_deltacf_vs_wtcf_tables-v", ver, ".xlsx")))
```

## comp_filtrate vs wt_filtrate

### Our libraries

```{r our_combine_compcf_wtcf, fig.show='hide'}
keepers <- list(
  "compcf_vs_wtcf" = c("comp_filtrate", "wt_filtrate"))
set_name <- "comp_filtrate_vs_wt_filtrate"
##rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compcf_wtcf_tables <- sm(combine_de_tables(
##  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_compcf_vs_wtcf_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_compcf_wtcf, fig.show='hide'}
keepers <- list(
  "compcf_vs_wtcf" = c("comp_filtrate", "wt_filtrate"))
set_name <- "comp_filtrate_vs_wt_filtrate"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compcf_wtcf_tables <- sm(combine_de_tables(
  ##  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_compcf_vs_wtcf_tables-v", ver, ".xlsx")))
```

## delta_whole vs wt_whole

### Our libraries

```{r our_combine_delawhole_wtwhole, fig.show='hide'}
keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_filtrate", "wt_whole"))
set_name <- "delta_whole_vs_wt_whole"
##rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  ## our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_deltawhole_vs_wtwhole_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_delawhole_wtwhole, fig.show='hide'}
keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_filtrate", "wt_whole"))
set_name <- "delta_whole_vs_wt_whole"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  ##tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_deltawhole_vs_wtwhole_tables-v", ver, ".xlsx")))
```

## comp_whole vs wt_whole

### Our libraries

```{r our_combine_compwhole_wtwhole, fig.show='hide'}
keepers <- list(
  "compwhole_vs_wtwhole" = c("comp_whole", "wt_whole"))
set_name <- "comp_whole_vs_wt_whole"
##rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compwhole_wtwhole_tables <- sm(combine_de_tables(
  ##our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  our_pairwise_comp, keepers=keepers,
  excel=paste0("excel/our_compwhole_vs_wtwhole_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_compwhole_wtwhole, fig.show='hide'}
keepers <- list(
  "compwhole_vs_wtwhole" = c("comp_whole", "wt_whole"))
set_name <- "comp_whole_vs_wt_whole"
##rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compwhole_wtwhole_tables <- sm(combine_de_tables(
  ##tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  tb_pairwise_comp, keepers=keepers,
  excel=paste0("excel/tb_compwhole_vs_wtwhole_tables-v", ver, ".xlsx")))
```

```{r compare_our_tb, eval=FALSE}
our_table <- our_pairwise_comp$deseq$all_tables$wt_whole_vs_wt_filtrate
tb_table <- tb_pairwise_comp$deseq$all_tables$wt_whole_vs_wt_filtrate
our_order <- order(our_table[["limma_logfc"]])
rownames(tb_table) <- gsub(pattern="^(.*)_.*", replacement="\\1", x=rownames(tb_table))

test_table <- merge(our_table, tb_table, by="row.names")
cor.test(test_table$logFC.x, test_table$logFC.y)
```

# RNASeq vs proteomics

Take the analysis performed on 02_relative_pe_expression, which unfortunately
was only using the DDA data, now reuse it with the DIA data and see what happens.

```{r prot_vs_rnaseq}
all_de <- read.table("external_data/limma_result.csv", header=TRUE, sep="\t")

pe_ids <- read.table("reference/annotated_pe_genes.txt")
colnames(pe_ids) <- "gene"
pe_annotations <- merge(pe_ids, mtb_annotations, by.x="gene", by.y="id")
chosen_colnames <- c("rv_id", "chr", "start", "end", "width", "strand",
                     "source", "type", "score", "phase", "id", "gene_id",
                     "description", "function")
colnames(pe_annotations) <- chosen_colnames
rownames(pe_annotations) <- pe_annotations[["id"]]

tmp_de <- all_de
colnames(tmp_de) <- c("id", "logfc", "ave_expr", "adjp", "abbrev_id", "description", "gene_length", "type")
colnames(mtb_annotations) <- chosen_colnames
all_de <- merge(tmp_de, mtb_annotations, by="id")
colnames(all_de) <- c("id","logfc", "ave_expr", "adjp", "abbrev_id", "description", "gene_length", "type",
                      "rv_id", "start", "end", "width", "strand", "source", "type", "na", "anotherna",
                      "mtb_id", "another_abbrev_id", "another_description", "html_string")
all_de <- all_de[, c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)]

pe_de <- merge(pe_annotations, all_de, by="id")
colnames(pe_de) <- c("id", "rv_id", "chromosome", "start", "end", "width", "strand",
                     "source", "type", "score", "phase", "gene_id", "description", "function",
                     "logfc", "ave_expr", "adjp", "another_gene", "another_description", "gene_length",
                     "another_type", "another_id", "another_start", "another_end", "another_width",
                     "another_strand", "another_source", "another_type")
pe_de <- pe_de[, c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17)]
## pe_de_annot <- merge(pe_de, mtb_annotations, by="locus_tag")
## The contrast is written as wt - delta, so add back the logFC I think
## Hopefully I did not reverse this in my head.
pe_de$ave_minus_log <- pe_de$ave_expr - pe_de$logfc
all_de$ave_minus_log <- all_de$ave_expr - all_de$logfc
pe_de$pseudo_rpkm <- (pe_de$ave_minus_log / pe_de$width) * 1000
all_de$pseudo_rpkm <- (all_de$ave_minus_log / all_de$width) * 1000

maximum_expression <- max(pe_de[["pseudo_rpkm"]])
maximum_all <- max(all_de$pseudo_rpkm)
pe_de[["exprs_vs_max"]] <- pe_de[["pseudo_rpkm"]] / maximum_expression
all_de$exprs_vs_max <- all_de$pseudo_rpkm / maximum_all

table_order <- order(pe_de[["exprs_vs_max"]], decreasing=TRUE)
pe_de <- pe_de[table_order, ]

knitr::kable(head(pe_de, n=50))

write.csv(x=pe_de, file="pe_relative_expression.csv")

plot_histogram(pe_de$exprs_vs_max, bins=20)

prot_table <- tb_pairwise_nobatch$limma$all_tables[["wt_whole_vs_delta_whole"]]

all_combined_table <- merge(all_de, prot_table, by.x="id", by.y="row.names")
cor.test(all_combined_table$ave_expr, all_combined_table$AveExpr, method="spearman")
all_testing <- all_combined_table[, c("ave_expr", "AveExpr")]
all_comparison <- plot_linear_scatter(all_testing)
all_comparison$scatter

pe_combined_table <- merge(pe_de, prot_table, by.x="id", by.y="row.names")
cor.test(pe_combined_table$ave_expr, pe_combined_table$AveExpr, method="spearman")
pe_testing <- pe_combined_table[, c("ave_expr", "AveExpr")]
pe_comparison <- plot_linear_scatter(pe_testing)
pe_comparison$scatter
```

```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}
```
