1 TODO

  • Remove MSstats logging. – done
  • Make explicit spearman correlations between methods. – done
    • Do both for all data and for the top-50/bottom-50 – done, but weird.
  • Make 100% certain that the samples are annotated correctly. – done.
  • Limma/MSstats/EdgeR venn disagram for all up in CF
    • Repeat in all down CF
    • Repeat with a cutoff, top-50
  • Individual plots for P/PE proteins.

2 Analyzing data from openMS and friends.

In preprocessing_comet_highres.Rmd, I used the openMS tutorials and supplemental materials from a couple papers to hopefully correctly perform the various preprocessing tasks required to extract intensity data from DIA/SWATH transitions.

The final steps of that process combined the transition intensities from every sample into a metadata file (results/tric/HCD_meta.tsv), an intensity matrix (results/tric/HCD_outmatrix.tsv), and a feature aligned output matrix (results/tric/aligned_comet_HCD.tsv).

My reading of the SWATH2stats and MSstats source code suggests to me that the log2(intensities) of the feature aligned data are our final proxy for protein abundance. At first glance, this suggests to me that these data might follow a distribution similar to RNASeq data (negative binomial, but perhaps with a bigger tail?). In addition, by the time we use tric on the data, we have a count matrix and sample annotation data frames which look remarkably similar to those used in a RNASeq expressionset. Indeed, by the end of the MSstats processing, it creates a MSnSet class of its own which uses fData/exprs/pData.

For the curious, my reasoning for saying that the log intensities are our proxy for abundance comes from MSstats/R/DataProcess.R in a clause which looks like:

if (logTrans == 2) {
  work[["ABUNDANCE"]] <- log2(work[["ABUNDANCE"]])
} else if (logTrans == 10) {
  work[["ABUNDANCE"]] <- log10(work[["ABUNDANCE"]])
} else {
  ## Above there was a check for only log 2 and 10, but we can do e if we want.
  ## I might go back up there and remove that check. Long live e! 2.718282 rules!
  work[["ABUNDANCE"]] <- log(work[["ABUNDANCE"]]) / log(logTrans)
}

(Note: I added the natural log to the set of conditions, but otherwise the logic is unchanged.)

With that in mind, I want to use some tools with which I am familiar in order to try to understand these data. Therefore I will first attempt to coerce my tric aligned data and annotations into a ‘normal’ expressionset. Then I want to do some diagnostic plots which, if I am wrong and these distributions are not as expected, will be conceptually incorrect (I don’t yet think I am wrong).

2.1 Sample annotation via SWATH2stats

I am using the SWATH2stats vignette as my primary source of information. Thus I see that it uses the OpenSWATH_SM3_GoldStandardAutomatedResults_human_peakgroups.txt which has a format nearly identical to the tric output matrix. Thus for the moment I will assume that the proper input for SWATH2stats is ‘results/tric/comet_HCD.tsv’ and not the metadata nor output matrix. In addition, SWATH2stats provides a similar conversion function which takes the tric output and coerces it into a similar matrix after performing its various filtering tasks. Therefore I will use that.

I keep a sample sheet of all the DIA samples used in this analysis in ‘sample_sheets/dia_samples.xlsx’ It should contain all the other required data with one important caveat, I removed 1 sample by ‘commenting’ it (e.g. prefixing it with ‘##’ – which is an admittedly dumb thing to do in an excel file.

One last caveat: I hacked the SWATH2stats sample_annotation() function to add a couple columns in an attempt to make it a little more robust when faced with sample sheets with differently named columns.

In addition, SWATH2stats provides some nice filtering and combination functions which should be considered when generating various expressionset data structures later.

2.1.1 Creating a swath2stats experiment using our comet-derived library data

our_tric_data <- read.csv("results/tric/20180611/whole_8mz/comet_HCD.tsv", sep="\t")

sample_annot <- openxlsx::read.xlsx("sample_sheets/Mtb_dia_samples.xlsx")
rownames(sample_annot) <- make.names(sample_annot[["sampleid"]], unique=TRUE)
## Drop samples starting with comments
keep_idx <- ! grepl(pattern="##", x=sample_annot[["sampleid"]])
sample_annot <- sample_annot[keep_idx, ]
expt_idx <- sample_annot[["expt_id"]] == "may2018" | sample_annot[["expt_id"]] == "mar2018"
expt_idx[is.na(sample_annot[["expt_id"]])] <- FALSE
sample_annot <- sample_annot[expt_idx, ]
mz_idx <- sample_annot[["windowsize"]] == "8"
sample_annot <- sample_annot[mz_idx, ]
## Set the mzXML column to match the filename column in the data.
loaded <- sm(devtools::load_all("~/scratch/git/SWATH2stats"))
## s2s, my witty way of shortening SWATH2stats...
our_s2s_exp <- sample_annotation(data=our_tric_data,
                                 sample_annotation=sample_annot,
                                 run_column="run",
                                 fullpeptidename_column="fullunimodpeptidename")
## Found the same mzXML files in the annotations and data.

2.1.2 Creating a swath2stats experiment using the tuberculist-derived library data

This just repeats the previous stanza, but uses the tuberculist transition-library swath data as its input rather than our own comet data.

There is one important caveat in the following block: I used a regex to remove the second half of geneID_geneName so that later when I merge in the annotation data I have it will match.

tb_tric_data <- read.csv("results/tric/20180611/whole_8mz_tuberculist/comet_HCD.tsv", sep="\t")

tb_tric_data[["ProteinName"]] <- gsub(pattern="^(.*)_.*$", replacement="\\1",
                                      x=tb_tric_data[["ProteinName"]])
tb_s2s_exp <- sample_annotation(data=tb_tric_data,
                                sample_annotation=sample_annot,
                                fullpeptidename_column="fullunimodpeptidename")
## Found the same mzXML files in the annotations and data.

Now I have a couple data structures which should prove useful for the metrics provided by SWATH2stats, MSstats, and my own hpgltools.

3 SWATH2stats continued

The various metrics and filters provided by SWATH2stats seem quite reasonable to me. The only thing that really bothers me is that they are all case sensitive and I found that the most recent tric changed the capitalization of a column, causing these to all fall down. Therefore I went in and made everything case insensitive in a fashion similar to that done by MSstats (except I hate capital letters, so I used tolower() rather than toupper()).

3.1 Run filters on our comet data

## Get correlations on a sample by sample basis
pp(file="images/20180611_our_swath2stats_sample_cor.png")
## Going to write the image to: images/20180611_our_swath2stats_sample_cor.png when dev.off() is called.
sample_cor <- plot_correlation_between_samples(our_s2s_exp, size=2,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
## png 
##   2
sample_cond_rep_cor <- plot_correlation_between_samples(our_s2s_exp, size=2,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")

decoy_lists <- assess_decoy_rate(our_s2s_exp)
## Number of non-decoy peptides: 3259
## Number of decoy peptides: 182
## Decoy rate: 0.0558
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(our_s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(our_s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
## The average FDR by run on assay level is 0.007
## The average FDR by run on peptide level is 0.007
## The average FDR by run on protein level is 0.019

chosen_mscore <- mscore4assayfdr(our_s2s_exp, FFT=0.7, fdr_target=0.02)
## Target assay FDR: 0.02
## Required overall m-score cutoff: 0.0019953
## achieving assay FDR: 0.0189
prot_score <- mscore4protfdr(our_s2s_exp, FFT=0.7, fdr_target=0.02)
## Target protein FDR: 0.02
## Required overall m-score cutoff: 0.00050119
## achieving protein FDR: 0.0181
our_mscore_filtered <- filter_mscore(our_s2s_exp, chosen_mscore)
## Original dimension: 47362, new dimension: 45761, difference: 1601.
our_freq_mscore <- filter_mscore_freqobs(our_s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
## Peptides need to have been quantified in more conditions than: 18.4 in order to pass this percentage-based threshold.
## Fraction of peptides selected: 1
## Original dimension: 47841, new dimension: 47841, difference: 0.
our_data_filtered_fdr <- filter_mscore_fdr(our_mscore_filtered, FFT=0.7,
                                           overall_protein_fdr_target=prot_score,
                                           upper_overall_peptide_fdr_limit=0.05)
## Target protein FDR: 0.000501187233627273
## Required overall m-score cutoff: 0.01
## achieving protein FDR: 0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.05
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected: 
## Total proteins selected: 945
## Final target proteins: 945
## Final decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 3203
## Final target peptides: 3203
## Final decoy peptides: 0
## Total peptides selected from:
## Total peptides: 3203
## Final target peptides: 3203
## Final decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
our_only_proteotypic <- filter_proteotypic_peptides(our_data_filtered_fdr)
## Number of proteins detected: 947
## Protein identifiers: Rv1908c, Rv0242c, Rv3224, Rv0667, Rv1133c, Rv3036c
## Number of proteins detected that are supported by a proteotypic peptide: 930
## Number of proteotypic peptides detected: 3158
our_all_filtered <- filter_all_peptides(our_only_proteotypic)
## Number of proteins detected: 930
## First 6 protein identifiers: Rv1908c, Rv0242c, Rv3224, Rv0667, Rv1133c, Rv3036c
our_only_strong <- filter_on_max_peptides(data=our_all_filtered, n_peptides=10)
## Before filtering: 
##   Number of proteins: 930
##   Number of peptides: 3158
## 
## Percentage of peptides removed: 8.61%
## 
## After filtering: 
##   Number of proteins: 928
##   Number of peptides: 2886
our_only_minimum <- filter_on_min_peptides(data=our_only_strong, n_peptides=3)
## Before filtering:
##   Number of proteins: 928
##   Number of peptides: 2886
## 
## Percentage of peptides removed: 0%
## 
## After filtering:
##   Number of proteins: 895
##   Number of peptides: 2886
## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
matrix_prefix <- file.path("results", "swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}
protein_matrix_all <- write_matrix_proteins(
  our_s2s_exp, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_all.csv"))
## Protein overview matrix results/swath2stats/20180611/our_protein_all.csv written to working folder.
dim(protein_matrix_all)
## [1] 1110   24
protein_matrix_mscore <- write_matrix_proteins(
  our_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_matrix_mscore.csv"))
## Protein overview matrix results/swath2stats/20180611/our_protein_matrix_mscore.csv written to working folder.
dim(protein_matrix_mscore)
## [1] 945  24
peptide_matrix_mscore <- write_matrix_peptides(
  our_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_peptide_matrix_mscore.csv"))
## Peptide overview matrix results/swath2stats/20180611/our_peptide_matrix_mscore.csv written to working folder.
dim(peptide_matrix_mscore)
## [1] 3203   24
protein_matrix_minimum <- write_matrix_proteins(
  our_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_matrix_minimum.csv"))
## Protein overview matrix results/swath2stats/20180611/our_protein_matrix_minimum.csv written to working folder.
dim(protein_matrix_minimum)
## [1] 895  24
peptide_matrix_minimum <- write_matrix_peptides(
  our_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_peptide_matrix_minimum.csv"))
## Peptide overview matrix results/swath2stats/20180611/our_peptide_matrix_minimum.csv written to working folder.
dim(peptide_matrix_minimum)
## [1] 38491    24
rt_cor <- plot_correlation_between_samples(
  our_only_minimum, column.values="intensity", fun.aggregate=sum, size=2)

## I have no effing clue what this plot means.
variation <- plot_variation(our_only_minimum, fun.aggregate=sum)

cols <- colnames(our_only_minimum)
our_disaggregated <- disaggregate(our_only_minimum, all.columns=TRUE)
## The library contains6transitions per precursor.
## The data table was transformed into a table containing one row per transition.
our_msstats_input <- convert_for_msstats(our_disaggregated)
## One or several columns required by MSstats were not in the data. The columns were created and filled with NAs.
## Missing columns: fragmention, productcharge, isotopelabeltype
## isotopelabeltype was filled with light.

3.2 Run filters on the tuberculist

Repeat the above stanza using the tuberculist data.

## Get correlations on a sample by sample basis
pp(file="images/20180611_tb_swath2stats_sample_cor.png")
## Going to write the image to: images/20180611_tb_swath2stats_sample_cor.png when dev.off() is called.
sample_cor <- plot_correlation_between_samples(tb_s2s_exp, size=2,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
## png 
##   2
sample_cond_rep_cor <- plot_correlation_between_samples(tb_s2s_exp, size=2,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")

decoy_lists <- assess_decoy_rate(tb_s2s_exp)
## Number of non-decoy peptides: 23975
## Number of decoy peptides: 1762
## Decoy rate: 0.0735
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(tb_s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(tb_s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
## The average FDR by run on assay level is 0.008
## The average FDR by run on peptide level is 0.009
## The average FDR by run on protein level is 0.044

chosen_mscore <- mscore4assayfdr(tb_s2s_exp, FFT=0.7, fdr_target=0.02)
## Target assay FDR: 0.02
## Required overall m-score cutoff: 0.0056234
## achieving assay FDR: 0.019
prot_score <- mscore4protfdr(tb_s2s_exp, FFT=0.7, fdr_target=0.02)
## Target protein FDR: 0.02
## Required overall m-score cutoff: 0.00070795
## achieving protein FDR: 0.0198
tb_mscore_filtered <- filter_mscore(tb_s2s_exp, chosen_mscore)
## Original dimension: 252491, new dimension: 231348, difference: 21143.
tb_freq_mscore <- filter_mscore_freqobs(tb_s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
## Peptides need to have been quantified in more conditions than: 18.4 in order to pass this percentage-based threshold.
## Fraction of peptides selected: 1
## Original dimension: 256996, new dimension: 256996, difference: 0.
tb_data_filtered_fdr <- filter_mscore_fdr(tb_mscore_filtered, FFT=0.7,
                                           overall_protein_fdr_target=prot_score,
                                           upper_overall_peptide_fdr_limit=0.05)
## Target protein FDR: 0.000707945784384137
## Required overall m-score cutoff: 0.01
## achieving protein FDR: 0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.05
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected: 
## Total proteins selected: 3087
## Final target proteins: 3087
## Final decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 22671
## Final target peptides: 22671
## Final decoy peptides: 0
## Total peptides selected from:
## Total peptides: 22671
## Final target peptides: 22671
## Final decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
tb_only_proteotypic <- filter_proteotypic_peptides(tb_data_filtered_fdr)
## Number of proteins detected: 3101
## Protein identifiers: Rv1270c, Rv0161, Rv1613, Rv2540c, Rv0440, Rv1327c
## Number of proteins detected that are supported by a proteotypic peptide: 2964
## Number of proteotypic peptides detected: 22510
tb_all_filtered <- filter_all_peptides(tb_only_proteotypic)
## Number of proteins detected: 2966
## First 6 protein identifiers: Rv1270c, Rv0161, Rv1613, Rv2540c, Rv0440, Rv1327c
tb_only_strong <- filter_on_max_peptides(data=tb_all_filtered, n_peptides=10)
## Before filtering: 
##   Number of proteins: 2964
##   Number of peptides: 22510
## 
## Percentage of peptides removed: 23.15%
## 
## After filtering: 
##   Number of proteins: 2936
##   Number of peptides: 17299
tb_only_minimum <- filter_on_min_peptides(data=tb_only_strong, n_peptides=3)
## Before filtering:
##   Number of proteins: 2936
##   Number of peptides: 17299
## 
## Percentage of peptides removed: 0.03%
## 
## After filtering:
##   Number of proteins: 2679
##   Number of peptides: 17293
## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
matrix_prefix <- file.path("results", "swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}
protein_matrix_all <- write_matrix_proteins(
  tb_s2s_exp, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_all.csv"))
## Protein overview matrix results/swath2stats/20180611/tb_protein_all.csv written to working folder.
dim(protein_matrix_all)
## [1] 4570   24
protein_matrix_mscore <- write_matrix_proteins(
  tb_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_matrix_mscore.csv"))
## Protein overview matrix results/swath2stats/20180611/tb_protein_matrix_mscore.csv written to working folder.
dim(protein_matrix_mscore)
## [1] 3087   24
peptide_matrix_mscore <- write_matrix_peptides(
  tb_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_peptide_matrix_mscore.csv"))
## Peptide overview matrix results/swath2stats/20180611/tb_peptide_matrix_mscore.csv written to working folder.
dim(peptide_matrix_mscore)
## [1] 22671    24
protein_matrix_minimum <- write_matrix_proteins(
  tb_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_matrix_minimum.csv"))
## Protein overview matrix results/swath2stats/20180611/tb_protein_matrix_minimum.csv written to working folder.
dim(protein_matrix_minimum)
## [1] 2679   24
peptide_matrix_minimum <- write_matrix_peptides(
  tb_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_peptide_matrix_minimum.csv"))
## Peptide overview matrix results/swath2stats/20180611/tb_peptide_matrix_minimum.csv written to working folder.
dim(peptide_matrix_minimum)
## [1] 164565     24
rt_cor <- plot_correlation_between_samples(size=2,
  tb_only_minimum, column.values="intensity", fun.aggregate=sum)

## I have no effing clue what this plot means.
variation <- plot_variation(tb_only_minimum, fun.aggregate=sum)

cols <- colnames(tb_only_minimum)
tb_disaggregated <- disaggregate(tb_only_minimum, all.columns=TRUE)
## The library contains5transitions per precursor.
## The data table was transformed into a table containing one row per transition.
tb_msstats_input <- convert_for_msstats(tb_disaggregated)
## One or several columns required by MSstats were not in the data. The columns were created and filled with NAs.
## Missing columns: fragmention, productcharge, isotopelabeltype
## isotopelabeltype was filled with light.

3.3 Some new plots

In response to some interesting queries from Yan, I made a few little functions which query and plot data from the scored data provided by openswath/pyprophet. Let us look at their results here.

3.3.1 Using the comet data

our_pyprophet_fun <- sm(extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx"))

our_mass_plot <- sm(plot_pyprophet_distribution(our_pyprophet_fun, column="mass"))
our_mass_plot[["boxplot"]]

our_mass_plot[["density"]]

our_deltart_plot_all <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun, column="delta_rt"))
our_deltart_plot_all[["boxplot"]]

our_deltart_plot_real <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
our_deltart_plot_real[["boxplot"]]

our_deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
our_deltart_plot_decoys[["boxplot"]]

our_deltart_plot_decoys[["density"]]

our_widthvsmass <- sm(plot_pyprophet_data(our_pyprophet_fun, legend=FALSE))
our_widthvsmass$plot

3.3.2 The tuberculist data

tb_pyprophet_fun <- sm(extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx",
                                              pyprophet_column="tuberculistscored"))

tb_mass_plot <- sm(plot_pyprophet_distribution(tb_pyprophet_fun, column="mass"))
tb_mass_plot[["boxplot"]]

tb_mass_plot[["density"]]

tb_deltart_plot_all <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun, column="delta_rt"))
tb_deltart_plot_all[["boxplot"]]

tb_deltart_plot_real <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
tb_deltart_plot_real[["boxplot"]]

tb_deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
tb_deltart_plot_decoys[["boxplot"]]

tb_deltart_plot_decoys[["density"]]

tb_widthvsmass <- sm(plot_pyprophet_data(tb_pyprophet_fun, legend=FALSE))
tb_widthvsmass$plot

3.4 MSstats

msstats.org seems to provide a complete solution for performing reasonable metrics of this data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

3.4.1 Comet derived msstats

devtools::load_all("~/scratch/git/MSstats")
## Loading MSstats
## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)
our_msstats_quant <- sm(dataProcess(our_msstats_input))
our_msstats_plots <- sm(dataProcessPlots(our_msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(our_msstats_input$condition))
my_levels
## [1] "comp_cf"     "comp_whole"  "delta_cf"    "delta_whole" "wt_cf"      
## [6] "wt_whole"
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_cf", "delta_cf", "comp_cf",
               "delta_cf", "comp_cf", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_cf", "wt_cf", "wt_whole",
                 "wt_whole"))
our_results <- list()
for (c in 1:length(rownames(comparisons))) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  our_results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                   data=our_msstats_quant))
}
## Starting wt_cf_vs_wt_whole
## Warning in our_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting delta_cf_vs_delta_whole
## Warning in our_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting comp_cf_vs_comp_whole
## Warning in our_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting delta_cf_vs_wt_cf
## Warning in our_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting comp_cf_vs_wt_cf
## Warning in our_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting delta_whole_vs_wt_whole
## Warning in our_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting comp_whole_vs_wt_whole
## Warning in our_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length

3.4.2 Tuberculist derived msstats

tb_msstats_quant <- sm(dataProcess(tb_msstats_input))
##tb_msstats_plots <- sm(dataProcessPlots(tb_msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(tb_msstats_input$condition))
my_levels
## [1] "comp_cf"     "comp_whole"  "delta_cf"    "delta_whole" "wt_cf"      
## [6] "wt_whole"
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_cf", "delta_cf", "comp_cf",
               "delta_cf", "comp_cf", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_cf", "wt_cf", "wt_whole",
                 "wt_whole"))
tb_results <- list()
for (c in 1:length(rownames(comparisons))) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                   data=tb_msstats_quant))
}
## Starting wt_cf_vs_wt_whole
## Warning in tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting delta_cf_vs_delta_whole
## Warning in tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting comp_cf_vs_comp_whole
## Warning in tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting delta_cf_vs_wt_cf
## Warning in tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting comp_cf_vs_wt_cf
## Warning in tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting delta_whole_vs_wt_whole
## Warning in tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length
## Starting comp_whole_vs_wt_whole
## Warning in tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix
## = comparison, : number of items to replace is not a multiple of replacement
## length

3.4.3 P/PE protein QC plots for Yan

Yan asked for the p/pe protein qc plots. ok. I changed the dataProcessPlots to return something useful, so that should be possible now.

pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]

## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- our_msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="",
                        x=levels(our_msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots

pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in 1:length(pe_plots)) {
  plot(pe_plots[[p]])
}
## Warning: Removed 132 rows containing non-finite values (stat_boxplot).
## Warning: Removed 220 rows containing non-finite values (stat_boxplot).
## Warning: Removed 418 rows containing non-finite values (stat_boxplot).
## Warning: Removed 330 rows containing non-finite values (stat_boxplot).
## Warning: Removed 132 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1452 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1052 rows containing non-finite values (stat_boxplot).
## Warning: Removed 858 rows containing non-finite values (stat_boxplot).
## Warning: Removed 264 rows containing non-finite values (stat_boxplot).
## Warning: Removed 154 rows containing non-finite values (stat_boxplot).
## Warning: Removed 132 rows containing non-finite values (stat_boxplot).
## Warning: Removed 352 rows containing non-finite values (stat_boxplot).
## Warning: Removed 616 rows containing non-finite values (stat_boxplot).
## Warning: Removed 638 rows containing non-finite values (stat_boxplot).
## Warning: Removed 550 rows containing non-finite values (stat_boxplot).
## Warning: Removed 440 rows containing non-finite values (stat_boxplot).
## Warning: Removed 814 rows containing non-finite values (stat_boxplot).
## Warning: Removed 462 rows containing non-finite values (stat_boxplot).
## Warning: Removed 242 rows containing non-finite values (stat_boxplot).
dev.off()
## png 
##   2
length(pe_plots)
## [1] 19

4 Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities from SWATH2stats, metadata, and annotation data; attempt to create a ‘normal’ expressionset; poke at it to see what I can learn.

4.1 Massaging the metadata

I want to use the same metadata as were used for MSstats. It has a few important differences from the requirements of hpgltools: pretty much only that I do not allow rownames/sampleIDs to start with a number.

4.2 Massaging the intensity matrix

I do not want the \1 before the protein names, I already merged them into one entry per gene vis SWATH2stats.

4.2.1 Make matrices with our libraries

our_prot_mtrx <- read.csv(file.path("results", "swath2stats", ver, "our_protein_matrix_minimum.csv"))
rownames(our_prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=our_prot_mtrx[["proteinname"]])
our_prot_mtrx <- our_prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
colnames(our_prot_mtrx) <- gsub(pattern="^X", replacement="s", x=colnames(our_prot_mtrx))

4.2.2 Make matrices with the tb libraries

tb_prot_mtrx <- read.csv(file.path("results", "swath2stats", ver, "tb_protein_matrix_minimum.csv"))
rownames(tb_prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=tb_prot_mtrx[["proteinname"]])
tb_prot_mtrx <- tb_prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
colnames(tb_prot_mtrx) <- gsub(pattern="^X", replacement="s", x=colnames(tb_prot_mtrx))

4.3 Merge the pieces

Now we should have sufficient pieces to make an expressionset.

While here, I will also split the data into a cf and whole-cell pair of data structures.

4.3.1 Our data

## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(our_prot_mtrx)
metadata <- sample_annot[reordered, ]

our_protein_expt <- sm(create_expt(metadata=metadata,
                                   count_dataframe=our_prot_mtrx,
                                   gene_info=mtb_annotations))

our_whole_expt <- subset_expt(our_protein_expt, subset="collectiontype=='whole'")
## There were 23, now there are 11 samples.
our_cf_expt <- subset_expt(our_protein_expt, subset="collectiontype=='cf'")
## There were 23, now there are 12 samples.

4.3.2 Tb data

## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(tb_prot_mtrx)
metadata <- sample_annot[reordered, ]

tb_protein_expt <- sm(create_expt(metadata,
                                  count_dataframe=tb_prot_mtrx,
                                  gene_info=mtb_annotations))
tb_whole_expt <- subset_expt(tb_protein_expt, subset="collectiontype=='whole'")
## There were 23, now there are 11 samples.
tb_cf_expt <- subset_expt(tb_protein_expt, subset="collectiontype=='cf'")
## There were 23, now there are 12 samples.

4.4 Metrics of the full data set

4.4.1 Our libraries

our_protein_metrics <- sm(graph_metrics(our_protein_expt))
our_protein_norm <- sm(normalize_expt(our_protein_expt, transform="log2", convert="cpm",
                                      norm="quant", filter=TRUE))
our_protein_norm_metrics <- sm(graph_metrics(our_protein_norm))
our_protein_fsva <- sm(normalize_expt(our_protein_expt, transform="log2",
                                   batch="fsva", filter=TRUE))
our_protein_fsva_metrics <- sm(graph_metrics(our_protein_fsva))

4.4.2 Tb libraries

tb_protein_metrics <- sm(graph_metrics(tb_protein_expt))
tb_protein_norm <- sm(normalize_expt(tb_protein_expt, transform="log2", convert="cpm",
                                      norm="quant", filter=TRUE))
tb_protein_norm_metrics <- sm(graph_metrics(tb_protein_norm))
tb_protein_fsva <- sm(normalize_expt(tb_protein_expt, transform="log2", convert="cpm",
                                      batch="fsva", filter=TRUE))
tb_protein_fsva_metrics <- sm(graph_metrics(tb_protein_fsva))

4.5 Metrics of the whole-cell data set

4.5.1 Our libraries

our_whole_metrics <- sm(graph_metrics(our_whole_expt))
our_whole_norm <- sm(normalize_expt(our_whole_expt, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE))
our_whole_norm_metrics <- sm(graph_metrics(our_whole_norm))
our_whole_fsva <- sm(normalize_expt(our_whole_expt, transform="log2", convert="cpm",
                                    batch="fsva", filter=TRUE))
our_whole_fsva_metrics <- sm(graph_metrics(our_whole_fsva))

4.5.2 Tb libraries

tb_whole_metrics <- sm(graph_metrics(tb_whole_expt))
tb_whole_norm <- sm(normalize_expt(tb_whole_expt, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE))
tb_whole_norm_metrics <- sm(graph_metrics(tb_whole_norm))
tb_whole_fsva <- sm(normalize_expt(tb_whole_expt, transform="log2", convert="cpm",
                                    batch="fsva", filter=TRUE))
tb_whole_fsva_metrics <- sm(graph_metrics(tb_whole_fsva))

4.6 Metrics of the filtrate data set

4.6.1 Our libraries

our_cf_metrics <- sm(graph_metrics(our_cf_expt))
our_cf_norm <- sm(normalize_expt(our_cf_expt, transform="log2", convert="cpm",
                                 norm="quant", filter=TRUE))
our_cf_norm_metrics <- sm(graph_metrics(our_cf_norm))
our_cf_fsva <- sm(normalize_expt(our_cf_expt, transform="log2", convert="cpm",
                                 batch="fsva", filter=TRUE))
our_cf_fsva_metrics <- sm(graph_metrics(our_cf_fsva))

4.6.2 Tb libraries

tb_cf_metrics <- sm(graph_metrics(tb_cf_expt))
tb_cf_norm <- sm(normalize_expt(tb_cf_expt, transform="log2", convert="cpm",
                                 norm="quant", filter=TRUE))
tb_cf_norm_metrics <- sm(graph_metrics(tb_cf_norm))
tb_cf_fsva <- sm(normalize_expt(tb_cf_expt, transform="log2", convert="cpm",
                                 batch="fsva", filter=TRUE))
tb_cf_fsva_metrics <- sm(graph_metrics(tb_cf_fsva))

4.7 plot some metrics

4.7.1 Our libraries

pp(image=our_protein_metrics$libsize,
   file=file.path("images", paste0(ver, "_our_libsize.png")))
## Writing the image to: images/20180611_our_libsize.png and calling dev.off().

## It seems to me that the scale of the data is all within an order of magnitude or two.
## I cannot get used to these absurdly large numbers though.
pp(image=our_protein_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_our_norm_pca.png")))
## Writing the image to: images/20180611_our_norm_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## There appears to be a nice split in the data, however the un-assayable batch
## effect is a problem.
pp(image=our_protein_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## Writing the image to: images/20180611_fsva_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## fsva seems to get some handle on the data, but I don't think we should rely
## upon it.
pp(image=our_protein_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Writing the image to: images/20180611_norm_corheat.png and calling dev.off().

## Once again, the whole-cell/culture-filtrate split is very large.
pp(image=our_protein_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## Writing the image to: images/20180611_raw_density.png and calling dev.off().

## There are two obvious distributions in the data, once again split between types.
pp(image=our_protein_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## Writing the image to: images/20180611_boxplot.png and calling dev.off().

## This recapitulates the previous plot.

pp(image=our_whole_metrics$libsize,
   file=file.path("images", paste0(ver, "_whole_libsize.png")))
## Writing the image to: images/20180611_whole_libsize.png and calling dev.off().

pp(image=our_whole_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_norm_pca.png")))
## Writing the image to: images/20180611_whole_norm_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=our_whole_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_fsva_pca.png")))
## Writing the image to: images/20180611_whole_fsva_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=our_whole_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_whole_norm_corheat.png")))
## Writing the image to: images/20180611_whole_norm_corheat.png and calling dev.off().

pp(image=our_whole_metrics$density,
   file=file.path("images", paste0(ver, "_whole_raw_density.png")))
## Writing the image to: images/20180611_whole_raw_density.png and calling dev.off().

pp(image=our_whole_metrics$boxplot,
   file=file.path("images", paste0(ver, "_whole_boxplot.png")))
## Writing the image to: images/20180611_whole_boxplot.png and calling dev.off().

pp(image=our_cf_metrics$libsize,
   file=file.path("images", paste0(ver, "_libsize.png")))
## Writing the image to: images/20180611_libsize.png and calling dev.off().

pp(image=our_cf_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_norm_pca.png")))
## Writing the image to: images/20180611_norm_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=our_cf_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## Writing the image to: images/20180611_fsva_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=our_cf_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Writing the image to: images/20180611_norm_corheat.png and calling dev.off().

pp(image=our_cf_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## Writing the image to: images/20180611_raw_density.png and calling dev.off().

pp(image=our_cf_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## Writing the image to: images/20180611_boxplot.png and calling dev.off().

4.7.2 Tb libraries

pp(image=tb_protein_metrics$libsize,
   file=file.path("images", paste0(ver, "_tb_libsize.png")))
## Writing the image to: images/20180611_tb_libsize.png and calling dev.off().

## It seems to me that the scale of the data is all within an order of magnitude or two.
## I cannot get used to these absurdly large numbers though.
pp(image=tb_protein_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_tb_norm_pca.png")))
## Writing the image to: images/20180611_tb_norm_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## There appears to be a nice split in the data, however the un-assayable batch
## effect is a problem.
pp(image=tb_protein_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## Writing the image to: images/20180611_fsva_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## fsva seems to get some handle on the data, but I don't think we should rely
## upon it.
pp(image=tb_protein_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Writing the image to: images/20180611_norm_corheat.png and calling dev.off().

## Once again, the whole-cell/culture-filtrate split is very large.
pp(image=tb_protein_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## Writing the image to: images/20180611_raw_density.png and calling dev.off().

## There are two obvious distributions in the data, once again split between types.
pp(image=tb_protein_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## Writing the image to: images/20180611_boxplot.png and calling dev.off().

## This recapitulates the previous plot.

pp(image=tb_whole_metrics$libsize,
   file=file.path("images", paste0(ver, "_whole_libsize.png")))
## Writing the image to: images/20180611_whole_libsize.png and calling dev.off().

pp(image=tb_whole_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_norm_pca.png")))
## Writing the image to: images/20180611_whole_norm_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=tb_whole_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_fsva_pca.png")))
## Writing the image to: images/20180611_whole_fsva_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=tb_whole_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_whole_norm_corheat.png")))
## Writing the image to: images/20180611_whole_norm_corheat.png and calling dev.off().

pp(image=tb_whole_metrics$density,
   file=file.path("images", paste0(ver, "_whole_raw_density.png")))
## Writing the image to: images/20180611_whole_raw_density.png and calling dev.off().

pp(image=tb_whole_metrics$boxplot,
   file=file.path("images", paste0(ver, "_whole_boxplot.png")))
## Writing the image to: images/20180611_whole_boxplot.png and calling dev.off().

pp(image=tb_cf_metrics$libsize,
   file=file.path("images", paste0(ver, "_libsize.png")))
## Writing the image to: images/20180611_libsize.png and calling dev.off().

pp(image=tb_cf_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_norm_pca.png")))
## Writing the image to: images/20180611_norm_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=tb_cf_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## Writing the image to: images/20180611_fsva_pca.png and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pp(image=tb_cf_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Writing the image to: images/20180611_norm_corheat.png and calling dev.off().

pp(image=tb_cf_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## Writing the image to: images/20180611_raw_density.png and calling dev.off().

pp(image=tb_cf_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## Writing the image to: images/20180611_boxplot.png and calling dev.off().

5 Attempt some quantification comparisons?

5.1 Our libraries

our_pairwise_filt <- sm(normalize_expt(our_protein_expt, filter=TRUE))
our_pairwise_comp <- sm(all_pairwise(our_pairwise_filt, model_batch="fsva", force=TRUE))
our_pairwise_nobatch <- sm(all_pairwise(our_pairwise_filt, model_batch=FALSE, force=TRUE))

5.2 Tb libraries

tb_pairwise_filt <- sm(normalize_expt(tb_protein_expt, filter=TRUE))
tb_pairwise_comp <- sm(all_pairwise(tb_pairwise_filt, model_batch="fsva", force=TRUE))
tb_pairwise_nobatch <- sm(all_pairwise(tb_pairwise_filt, model_batch=FALSE, force=TRUE))

5.2.1 Show a few metrics from the hpgltools pairwise comparisons

our_pairwise_comp$comparison$heat

tb_pairwise_comp$comparison$heat

6 For each msstats run, do a DE table

6.1 wt_cf vs wt_whole

6.1.1 Our libraries

keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_cf", "wt_whole"))
droppers <- c("undefined")
names(droppers) <- "log2fc"

## Make sure to set the rownames so it will merge into the excel file.
set_name <- "wt_cf_vs_wt_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excludes=droppers,
  excel=paste0("excel/our_wtcf_vs_wtwhole_tables-v", ver, ".xlsx")))
wtcf_nobatch_wtwhole_tables <- sm(combine_de_tables(
  our_pairwise_nobatch, keepers=keepers, extra_annot=our_results[[set_name]],
  excludes=droppers,
  excel=paste0("excel/our_wtcf_vs_wtwhole_nobatch_tables-v", ver, ".xlsx")))
comp_table <- wtcf_wtwhole_tables$data[[set_name]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
## Warning in cor.test.default(comp_table$log2fc, comp_table$deseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  comp_table$log2fc and comp_table$deseq_logfc
## S = 6.7e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.3996
comp_table <- wtcf_nobatch_wtwhole_tables$data[[set_name]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
## Warning in cor.test.default(comp_table$log2fc, comp_table$deseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  comp_table$log2fc and comp_table$deseq_logfc
## S = 5.3e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.5197

6.1.2 Tb libraries

keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_cf", "wt_whole"))
droppers <- c("undefined")
names(droppers) <- "log2fc"

## Make sure to set the rownames so it will merge into the excel file.
set_name <- "wt_cf_vs_wt_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excludes=droppers,
  excel=paste0("excel/tb_wtcf_vs_wtwhole_tables-v", ver, ".xlsx")))
wtcf_nobatch_wtwhole_tables <- sm(combine_de_tables(
  tb_pairwise_nobatch, keepers=keepers, extra_annot=tb_results[[set_name]],
  excludes=droppers,
  excel=paste0("excel/tb_wtcf_vs_wtwhole_nobatch_tables-v", ver, ".xlsx")))
comp_table <- wtcf_wtwhole_tables$data[[set_name]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
## Warning in cor.test.default(comp_table$log2fc, comp_table$deseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  comp_table$log2fc and comp_table$deseq_logfc
## S = 1.1e+09, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.6209
comp_table <- wtcf_nobatch_wtwhole_tables$data[[set_name]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
## Warning in cor.test.default(comp_table$log2fc, comp_table$deseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  comp_table$log2fc and comp_table$deseq_logfc
## S = 9.5e+08, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.6581

6.2 delta_cf vs delta_whole

6.2.1 Our libraries

keepers <- list(
  "deltacf_vs_deltawhole" = c("delta_cf", "delta_whole"))
set_name <- "delta_cf_vs_delta_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
deltacf_deltawhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_deltacf_vs_deltawhole_tables-v", ver, ".xlsx")))

6.2.2 Tb libraries

keepers <- list(
  "deltacf_vs_deltawhole" = c("delta_cf", "delta_whole"))
set_name <- "delta_cf_vs_delta_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
deltacf_deltawhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_deltacf_vs_deltawhole_tables-v", ver, ".xlsx")))

6.3 comp_cf vs comp_whole

6.3.1 Our libraries

keepers <- list(
  "compcf_vs_compwhole" = c("comp_cf", "comp_whole"))
set_name <- "comp_cf_vs_comp_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compcf_compwhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_compcf_vs_compwhole_tables-v", ver, ".xlsx")))

6.3.2 Tb libraries

keepers <- list(
  "compcf_vs_compwhole" = c("comp_cf", "comp_whole"))
set_name <- "comp_cf_vs_comp_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compcf_compwhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_compcf_vs_compwhole_tables-v", ver, ".xlsx")))

6.4 delta_cf vs wt_cf

6.4.1 Our libraries

keepers <- list(
  "deltacf_vs_wtcf" = c("delta_cf", "wt_cf"))
set_name <- "delta_cf_vs_wt_cf"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
deltacf_wtcf_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_deltacf_vs_wtcf_tables-v", ver, ".xlsx")))

6.4.2 Tb libraries

keepers <- list(
  "deltacf_vs_wtcf" = c("delta_cf", "wt_cf"))
set_name <- "delta_cf_vs_wt_cf"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
deltacf_wtcf_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_deltacf_vs_wtcf_tables-v", ver, ".xlsx")))

6.5 comp_cf vs wt_cf

6.5.1 Our libraries

keepers <- list(
  "compcf_vs_wtcf" = c("comp_cf", "wt_cf"))
set_name <- "comp_cf_vs_wt_cf"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compcf_wtcf_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_compcf_vs_wtcf_tables-v", ver, ".xlsx")))

6.5.2 Tb libraries

keepers <- list(
  "compcf_vs_wtcf" = c("comp_cf", "wt_cf"))
set_name <- "comp_cf_vs_wt_cf"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compcf_wtcf_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_compcf_vs_wtcf_tables-v", ver, ".xlsx")))

6.6 delta_whole vs wt_whole

6.6.1 Our libraries

keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_cf", "wt_whole"))
set_name <- "delta_whole_vs_wt_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_deltawhole_vs_wtwhole_tables-v", ver, ".xlsx")))

6.6.2 Tb libraries

keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_cf", "wt_whole"))
set_name <- "delta_whole_vs_wt_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_deltawhole_vs_wtwhole_tables-v", ver, ".xlsx")))

6.7 comp_whole vs wt_whole

6.7.1 Our libraries

keepers <- list(
  "compwhole_vs_wtwhole" = c("comp_whole", "wt_whole"))
set_name <- "comp_whole_vs_wt_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compwhole_wtwhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_compwhole_vs_wtwhole_tables-v", ver, ".xlsx")))

6.7.2 Tb libraries

keepers <- list(
  "compwhole_vs_wtwhole" = c("comp_whole", "wt_whole"))
set_name <- "comp_whole_vs_wt_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compwhole_wtwhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_compwhole_vs_wtwhole_tables-v", ver, ".xlsx")))
our_table <- our_pairwise_comp$deseq$all_tables$wt_whole_vs_wt_cf
tb_table <- tb_pairwise_comp$deseq$all_tables$wt_whole_vs_wt_cf
our_order <- order(our_table[["limma_logfc"]])
## Error in order(our_table[["limma_logfc"]]): argument 1 is not a vector
rownames(tb_table) <- gsub(pattern="^(.*)_.*", replacement="\\1", x=rownames(tb_table))

test_table <- merge(our_table, tb_table, by="row.names")
cor.test(test_table$logFC.x, test_table$logFC.y)
## 
##  Pearson's product-moment correlation
## 
## data:  test_table$logFC.x and test_table$logFC.y
## t = 15, df = 890, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4021 0.5063
## sample estimates:
##    cor 
## 0.4558

7 TODO

  • 2018-04-10: Make sure my invocations of SWATH2stats/MSstats are correct.
if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}
---
title: "M.tuberculosis 20180611: Analyzing data from OpenSwathWorkFlow/TRIC."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
if (!isTRUE(get0("skip_load"))) {
  library(hpgltools)
  tt <- devtools::load_all("~/hpgltools")
  knitr::opts_knit$set(progress=TRUE,
                       verbose=TRUE,
                       width=90,
                       echo=TRUE)
  knitr::opts_chunk$set(error=TRUE,
                        fig.width=8,
                        fig.height=8,
                        dpi=96)
  old_options <- options(digits=4,
                         stringsAsFactors=FALSE,
                         knitr.duplicate.label="allow")
  ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
  ver <- "20180611"
  previous_file <- "01_preprocessing_comet_highres.Rmd"

  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))
  previous_file <- "01_annotation_20180611.Rmd"
  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))

  rmd_file <- paste0("02_swath2stats_", ver, ".Rmd")
}
```

# TODO

* Remove MSstats logging. -- done
* Make explicit spearman correlations between methods. -- done
    * Do both for all data and for the top-50/bottom-50 -- done, but weird.
* Make 100% certain that the samples are annotated correctly. -- done.
* Limma/MSstats/EdgeR venn disagram for all up in CF
   * Repeat in all down CF
   * Repeat with a cutoff, top-50
* Individual plots for P/PE proteins.

Analyzing data from openMS and friends.
=======================================

In preprocessing_comet_highres.Rmd, I used the openMS tutorials and supplemental
materials from a couple papers to hopefully correctly perform the various
preprocessing tasks required to extract intensity data from DIA/SWATH
transitions.

The final steps of that process combined the transition intensities from every
sample into a metadata file (results/tric/HCD_meta.tsv), an intensity matrix
(results/tric/HCD_outmatrix.tsv), and a feature aligned output matrix
(results/tric/aligned_comet_HCD.tsv).

My reading of the SWATH2stats and MSstats source code suggests to me that the
log2(intensities) of the feature aligned data are our final proxy for protein
abundance.  At first glance, this suggests to me that these data might follow a
distribution similar to RNASeq data (negative binomial, but perhaps with a
bigger tail?).  In addition, by the time we use tric on the data, we have a
count matrix and sample annotation data frames which look remarkably similar to
those used in a RNASeq expressionset.  Indeed, by the end of the MSstats
processing, it creates a MSnSet class of its own which uses fData/exprs/pData.

For the curious, my reasoning for saying that the log intensities are our proxy
for abundance comes from MSstats/R/DataProcess.R in a clause which looks like:

```{r snippet, eval=FALSE}
if (logTrans == 2) {
  work[["ABUNDANCE"]] <- log2(work[["ABUNDANCE"]])
} else if (logTrans == 10) {
  work[["ABUNDANCE"]] <- log10(work[["ABUNDANCE"]])
} else {
  ## Above there was a check for only log 2 and 10, but we can do e if we want.
  ## I might go back up there and remove that check. Long live e! 2.718282 rules!
  work[["ABUNDANCE"]] <- log(work[["ABUNDANCE"]]) / log(logTrans)
}
```

(Note: I added the natural log to the set of conditions, but otherwise the logic
is unchanged.)

With that in mind, I want to use some tools with which I am familiar in order to
try to understand these data.  Therefore I will first attempt to coerce my tric
aligned data and annotations into a 'normal' expressionset.  Then I want to do
some diagnostic plots which, if I am wrong and these distributions are not as
expected, will be conceptually incorrect (I don't yet think I am wrong).

## Sample annotation via SWATH2stats

I am using the SWATH2stats vignette as my primary source of information.  Thus I
see that it uses the
OpenSWATH_SM3_GoldStandardAutomatedResults_human_peakgroups.txt which has a
format nearly identical to the tric output matrix.  Thus for the moment I will
assume that the proper input for SWATH2stats is 'results/tric/comet_HCD.tsv' and
not the metadata nor output matrix.  In addition, SWATH2stats provides a similar
conversion function which takes the tric output and coerces it into a similar
matrix after performing its various filtering tasks.  Therefore I will use that.

I keep a sample sheet of all the DIA samples used in this analysis in
'sample_sheets/dia_samples.xlsx'  It should contain all the other required data
with one important caveat, I removed 1 sample by 'commenting' it (e.g. prefixing
it with '##' -- which is an admittedly dumb thing to do in an excel file.

One last caveat: I hacked the SWATH2stats sample_annotation() function to add a
couple columns in an attempt to make it a little more robust when faced with
sample sheets with differently named columns.

In addition, SWATH2stats provides some nice filtering and combination
functions which should be considered when generating various expressionset data
structures later.

### Creating a swath2stats experiment using our comet-derived library data

```{r our_swath2stats_initial}
our_tric_data <- read.csv("results/tric/20180611/whole_8mz/comet_HCD.tsv", sep="\t")

sample_annot <- openxlsx::read.xlsx("sample_sheets/Mtb_dia_samples.xlsx")
rownames(sample_annot) <- make.names(sample_annot[["sampleid"]], unique=TRUE)
## Drop samples starting with comments
keep_idx <- ! grepl(pattern="##", x=sample_annot[["sampleid"]])
sample_annot <- sample_annot[keep_idx, ]
expt_idx <- sample_annot[["expt_id"]] == "may2018" | sample_annot[["expt_id"]] == "mar2018"
expt_idx[is.na(sample_annot[["expt_id"]])] <- FALSE
sample_annot <- sample_annot[expt_idx, ]
mz_idx <- sample_annot[["windowsize"]] == "8"
sample_annot <- sample_annot[mz_idx, ]
## Set the mzXML column to match the filename column in the data.
loaded <- sm(devtools::load_all("~/scratch/git/SWATH2stats"))
## s2s, my witty way of shortening SWATH2stats...
our_s2s_exp <- sample_annotation(data=our_tric_data,
                                 sample_annotation=sample_annot,
                                 run_column="run",
                                 fullpeptidename_column="fullunimodpeptidename")
```

### Creating a swath2stats experiment using the tuberculist-derived library data

This just repeats the previous stanza, but uses the tuberculist
transition-library swath data as its input rather than our own comet
data.

There is one important caveat in the following block: I used a regex to remove
the second half of geneID_geneName so that later when I merge in the annotation
data I have it will match.

```{r tb_swath2stats_initial}
tb_tric_data <- read.csv("results/tric/20180611/whole_8mz_tuberculist/comet_HCD.tsv", sep="\t")

tb_tric_data[["ProteinName"]] <- gsub(pattern="^(.*)_.*$", replacement="\\1",
                                      x=tb_tric_data[["ProteinName"]])
tb_s2s_exp <- sample_annotation(data=tb_tric_data,
                                sample_annotation=sample_annot,
                                fullpeptidename_column="fullunimodpeptidename")
```

Now I have a couple data structures which should prove useful for the metrics
provided by SWATH2stats, MSstats, and my own hpgltools.

# SWATH2stats continued

The various metrics and filters provided by SWATH2stats seem quite reasonable to
me.  The only thing that really bothers me is that they are all case sensitive
and I found that the most recent tric changed the capitalization of a column,
causing these to all fall down.  Therefore I went in and made everything case
insensitive in a fashion similar to that done by MSstats (except I hate capital
letters, so I used tolower() rather than toupper()).

## Run filters on our comet data

```{r our_swath2stats_processing}
## Get correlations on a sample by sample basis
pp(file="images/20180611_our_swath2stats_sample_cor.png")
sample_cor <- plot_correlation_between_samples(our_s2s_exp, size=2,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
sample_cond_rep_cor <- plot_correlation_between_samples(our_s2s_exp, size=2,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")

decoy_lists <- assess_decoy_rate(our_s2s_exp)
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(our_s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(our_s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
chosen_mscore <- mscore4assayfdr(our_s2s_exp, FFT=0.7, fdr_target=0.02)
prot_score <- mscore4protfdr(our_s2s_exp, FFT=0.7, fdr_target=0.02)

our_mscore_filtered <- filter_mscore(our_s2s_exp, chosen_mscore)
our_freq_mscore <- filter_mscore_freqobs(our_s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
our_data_filtered_fdr <- filter_mscore_fdr(our_mscore_filtered, FFT=0.7,
                                           overall_protein_fdr_target=prot_score,
                                           upper_overall_peptide_fdr_limit=0.05)
our_only_proteotypic <- filter_proteotypic_peptides(our_data_filtered_fdr)
our_all_filtered <- filter_all_peptides(our_only_proteotypic)
our_only_strong <- filter_on_max_peptides(data=our_all_filtered, n_peptides=10)
our_only_minimum <- filter_on_min_peptides(data=our_only_strong, n_peptides=3)

## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
matrix_prefix <- file.path("results", "swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}
protein_matrix_all <- write_matrix_proteins(
  our_s2s_exp, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_all.csv"))
dim(protein_matrix_all)
protein_matrix_mscore <- write_matrix_proteins(
  our_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_matrix_mscore.csv"))
dim(protein_matrix_mscore)
peptide_matrix_mscore <- write_matrix_peptides(
  our_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_peptide_matrix_mscore.csv"))
dim(peptide_matrix_mscore)
protein_matrix_minimum <- write_matrix_proteins(
  our_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_protein_matrix_minimum.csv"))
dim(protein_matrix_minimum)
peptide_matrix_minimum <- write_matrix_peptides(
  our_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "our_peptide_matrix_minimum.csv"))
dim(peptide_matrix_minimum)

rt_cor <- plot_correlation_between_samples(
  our_only_minimum, column.values="intensity", fun.aggregate=sum, size=2)
## I have no effing clue what this plot means.
variation <- plot_variation(our_only_minimum, fun.aggregate=sum)

cols <- colnames(our_only_minimum)
our_disaggregated <- disaggregate(our_only_minimum, all.columns=TRUE)
our_msstats_input <- convert_for_msstats(our_disaggregated)
```

## Run filters on the tuberculist

Repeat the above stanza using the tuberculist data.

```{r tb_swath2stats_processing}
## Get correlations on a sample by sample basis
pp(file="images/20180611_tb_swath2stats_sample_cor.png")
sample_cor <- plot_correlation_between_samples(tb_s2s_exp, size=2,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
sample_cond_rep_cor <- plot_correlation_between_samples(tb_s2s_exp, size=2,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")

decoy_lists <- assess_decoy_rate(tb_s2s_exp)
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(tb_s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(tb_s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
chosen_mscore <- mscore4assayfdr(tb_s2s_exp, FFT=0.7, fdr_target=0.02)
prot_score <- mscore4protfdr(tb_s2s_exp, FFT=0.7, fdr_target=0.02)

tb_mscore_filtered <- filter_mscore(tb_s2s_exp, chosen_mscore)
tb_freq_mscore <- filter_mscore_freqobs(tb_s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
tb_data_filtered_fdr <- filter_mscore_fdr(tb_mscore_filtered, FFT=0.7,
                                           overall_protein_fdr_target=prot_score,
                                           upper_overall_peptide_fdr_limit=0.05)
tb_only_proteotypic <- filter_proteotypic_peptides(tb_data_filtered_fdr)
tb_all_filtered <- filter_all_peptides(tb_only_proteotypic)
tb_only_strong <- filter_on_max_peptides(data=tb_all_filtered, n_peptides=10)
tb_only_minimum <- filter_on_min_peptides(data=tb_only_strong, n_peptides=3)

## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
matrix_prefix <- file.path("results", "swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}
protein_matrix_all <- write_matrix_proteins(
  tb_s2s_exp, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_all.csv"))
dim(protein_matrix_all)
protein_matrix_mscore <- write_matrix_proteins(
  tb_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_matrix_mscore.csv"))
dim(protein_matrix_mscore)
peptide_matrix_mscore <- write_matrix_peptides(
  tb_mscore_filtered, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_peptide_matrix_mscore.csv"))
dim(peptide_matrix_mscore)
protein_matrix_minimum <- write_matrix_proteins(
  tb_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_protein_matrix_minimum.csv"))
dim(protein_matrix_minimum)
peptide_matrix_minimum <- write_matrix_peptides(
  tb_only_minimum, write.csv=TRUE,
  filename=file.path(matrix_prefix, "tb_peptide_matrix_minimum.csv"))
dim(peptide_matrix_minimum)

rt_cor <- plot_correlation_between_samples(size=2,
  tb_only_minimum, column.values="intensity", fun.aggregate=sum)
## I have no effing clue what this plot means.
variation <- plot_variation(tb_only_minimum, fun.aggregate=sum)

cols <- colnames(tb_only_minimum)
tb_disaggregated <- disaggregate(tb_only_minimum, all.columns=TRUE)
tb_msstats_input <- convert_for_msstats(tb_disaggregated)
```

## Some new plots

In response to some interesting queries from Yan, I made a few little functions
which query and plot data from the scored data provided by openswath/pyprophet.
Let us look at their results here.

### Using the comet data

```{r our_pyprophet_plots}
our_pyprophet_fun <- sm(extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx"))

our_mass_plot <- sm(plot_pyprophet_distribution(our_pyprophet_fun, column="mass"))
our_mass_plot[["boxplot"]]
our_mass_plot[["density"]]

our_deltart_plot_all <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun, column="delta_rt"))
our_deltart_plot_all[["boxplot"]]

our_deltart_plot_real <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
our_deltart_plot_real[["boxplot"]]

our_deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  our_pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
our_deltart_plot_decoys[["boxplot"]]
our_deltart_plot_decoys[["density"]]

our_widthvsmass <- sm(plot_pyprophet_data(our_pyprophet_fun, legend=FALSE))
our_widthvsmass$plot
```

### The tuberculist data

```{r tb_pyprophet_plots}
tb_pyprophet_fun <- sm(extract_pyprophet_data(metadata="sample_sheets/Mtb_dia_samples.xlsx",
                                              pyprophet_column="tuberculistscored"))

tb_mass_plot <- sm(plot_pyprophet_distribution(tb_pyprophet_fun, column="mass"))
tb_mass_plot[["boxplot"]]
tb_mass_plot[["density"]]

tb_deltart_plot_all <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun, column="delta_rt"))
tb_deltart_plot_all[["boxplot"]]

tb_deltart_plot_real <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
tb_deltart_plot_real[["boxplot"]]

tb_deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  tb_pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
tb_deltart_plot_decoys[["boxplot"]]
tb_deltart_plot_decoys[["density"]]

tb_widthvsmass <- sm(plot_pyprophet_data(tb_pyprophet_fun, legend=FALSE))
tb_widthvsmass$plot
```

## MSstats

msstats.org seems to provide a complete solution for performing reasonable metrics of this data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

### Comet derived msstats

```{r our_msstats_quant}
devtools::load_all("~/scratch/git/MSstats")
our_msstats_quant <- sm(dataProcess(our_msstats_input))
our_msstats_plots <- sm(dataProcessPlots(our_msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(our_msstats_input$condition))
my_levels
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_cf", "delta_cf", "comp_cf",
               "delta_cf", "comp_cf", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_cf", "wt_cf", "wt_whole",
                 "wt_whole"))
our_results <- list()
for (c in 1:length(rownames(comparisons))) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  our_results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                   data=our_msstats_quant))
}
```

### Tuberculist derived msstats

```{r tb_msstats_quant}
tb_msstats_quant <- sm(dataProcess(tb_msstats_input))
##tb_msstats_plots <- sm(dataProcessPlots(tb_msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(tb_msstats_input$condition))
my_levels
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_cf", "delta_cf", "comp_cf",
               "delta_cf", "comp_cf", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_cf", "wt_cf", "wt_whole",
                 "wt_whole"))
tb_results <- list()
for (c in 1:length(rownames(comparisons))) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  tb_results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                   data=tb_msstats_quant))
}
```

### P/PE protein QC plots for Yan

Yan asked for the p/pe protein qc plots. ok.  I changed the dataProcessPlots to
return something useful, so that should be possible now.

```{r pe}
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]

## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- our_msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="",
                        x=levels(our_msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots

pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in 1:length(pe_plots)) {
  plot(pe_plots[[p]])
}
dev.off()
length(pe_plots)
```

# Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities from
SWATH2stats, metadata, and annotation data;  attempt to create a 'normal'
expressionset; poke at it to see what I can learn.

## Massaging the metadata

I want to use the same metadata as were used for MSstats.  It has a few
important differences from the requirements of hpgltools: pretty much only that
I do not allow rownames/sampleIDs to start with a number.

## Massaging the intensity matrix

I do not want the \1 before the protein names, I already merged them into one
entry per gene vis SWATH2stats.

### Make matrices with our libraries

```{r our_protein_matrix}
our_prot_mtrx <- read.csv(file.path("results", "swath2stats", ver, "our_protein_matrix_minimum.csv"))
rownames(our_prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=our_prot_mtrx[["proteinname"]])
our_prot_mtrx <- our_prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
colnames(our_prot_mtrx) <- gsub(pattern="^X", replacement="s", x=colnames(our_prot_mtrx))
```

### Make matrices with the tb libraries

```{r tb_protein_matrix}
tb_prot_mtrx <- read.csv(file.path("results", "swath2stats", ver, "tb_protein_matrix_minimum.csv"))
rownames(tb_prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=tb_prot_mtrx[["proteinname"]])
tb_prot_mtrx <- tb_prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
colnames(tb_prot_mtrx) <- gsub(pattern="^X", replacement="s", x=colnames(tb_prot_mtrx))
```

## Merge the pieces

Now we should have sufficient pieces to make an expressionset.

While here, I will also split the data into a cf and whole-cell pair of data structures.

### Our data

```{r our_expt}
## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(our_prot_mtrx)
metadata <- sample_annot[reordered, ]

our_protein_expt <- sm(create_expt(metadata=metadata,
                                   count_dataframe=our_prot_mtrx,
                                   gene_info=mtb_annotations))

our_whole_expt <- subset_expt(our_protein_expt, subset="collectiontype=='whole'")
our_cf_expt <- subset_expt(our_protein_expt, subset="collectiontype=='cf'")
```

### Tb data

```{r tb_expt}
## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(tb_prot_mtrx)
metadata <- sample_annot[reordered, ]

tb_protein_expt <- sm(create_expt(metadata,
                                  count_dataframe=tb_prot_mtrx,
                                  gene_info=mtb_annotations))
tb_whole_expt <- subset_expt(tb_protein_expt, subset="collectiontype=='whole'")
tb_cf_expt <- subset_expt(tb_protein_expt, subset="collectiontype=='cf'")
```

## Metrics of the full data set

### Our libraries

```{r our_protein_metrics, fig.show='hide'}
our_protein_metrics <- sm(graph_metrics(our_protein_expt))
our_protein_norm <- sm(normalize_expt(our_protein_expt, transform="log2", convert="cpm",
                                      norm="quant", filter=TRUE))
our_protein_norm_metrics <- sm(graph_metrics(our_protein_norm))
our_protein_fsva <- sm(normalize_expt(our_protein_expt, transform="log2",
                                   batch="fsva", filter=TRUE))
our_protein_fsva_metrics <- sm(graph_metrics(our_protein_fsva))
```

### Tb libraries

```{r tb_protein_metrics, fig.show='hide'}
tb_protein_metrics <- sm(graph_metrics(tb_protein_expt))
tb_protein_norm <- sm(normalize_expt(tb_protein_expt, transform="log2", convert="cpm",
                                      norm="quant", filter=TRUE))
tb_protein_norm_metrics <- sm(graph_metrics(tb_protein_norm))
tb_protein_fsva <- sm(normalize_expt(tb_protein_expt, transform="log2", convert="cpm",
                                      batch="fsva", filter=TRUE))
tb_protein_fsva_metrics <- sm(graph_metrics(tb_protein_fsva))
```

## Metrics of the whole-cell data set

### Our libraries

```{r our_whole_metrics, fig.show='hide'}
our_whole_metrics <- sm(graph_metrics(our_whole_expt))
our_whole_norm <- sm(normalize_expt(our_whole_expt, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE))
our_whole_norm_metrics <- sm(graph_metrics(our_whole_norm))
our_whole_fsva <- sm(normalize_expt(our_whole_expt, transform="log2", convert="cpm",
                                    batch="fsva", filter=TRUE))
our_whole_fsva_metrics <- sm(graph_metrics(our_whole_fsva))
```

### Tb libraries

```{r tb_whole_metrics, fig.show='hide'}
tb_whole_metrics <- sm(graph_metrics(tb_whole_expt))
tb_whole_norm <- sm(normalize_expt(tb_whole_expt, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE))
tb_whole_norm_metrics <- sm(graph_metrics(tb_whole_norm))
tb_whole_fsva <- sm(normalize_expt(tb_whole_expt, transform="log2", convert="cpm",
                                    batch="fsva", filter=TRUE))
tb_whole_fsva_metrics <- sm(graph_metrics(tb_whole_fsva))
```

## Metrics of the filtrate data set

### Our libraries

```{r our_cf_metrics, fig.show='hide'}
our_cf_metrics <- sm(graph_metrics(our_cf_expt))
our_cf_norm <- sm(normalize_expt(our_cf_expt, transform="log2", convert="cpm",
                                 norm="quant", filter=TRUE))
our_cf_norm_metrics <- sm(graph_metrics(our_cf_norm))
our_cf_fsva <- sm(normalize_expt(our_cf_expt, transform="log2", convert="cpm",
                                 batch="fsva", filter=TRUE))
our_cf_fsva_metrics <- sm(graph_metrics(our_cf_fsva))
```

### Tb libraries

```{r tb_cf_metrics, fig.show='hide'}
tb_cf_metrics <- sm(graph_metrics(tb_cf_expt))
tb_cf_norm <- sm(normalize_expt(tb_cf_expt, transform="log2", convert="cpm",
                                 norm="quant", filter=TRUE))
tb_cf_norm_metrics <- sm(graph_metrics(tb_cf_norm))
tb_cf_fsva <- sm(normalize_expt(tb_cf_expt, transform="log2", convert="cpm",
                                 batch="fsva", filter=TRUE))
tb_cf_fsva_metrics <- sm(graph_metrics(tb_cf_fsva))
```

## plot some metrics

### Our libraries

```{r our_print_metrics}
pp(image=our_protein_metrics$libsize,
   file=file.path("images", paste0(ver, "_our_libsize.png")))
## It seems to me that the scale of the data is all within an order of magnitude or two.
## I cannot get used to these absurdly large numbers though.
pp(image=our_protein_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_our_norm_pca.png")))
## There appears to be a nice split in the data, however the un-assayable batch
## effect is a problem.
pp(image=our_protein_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## fsva seems to get some handle on the data, but I don't think we should rely
## upon it.
pp(image=our_protein_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Once again, the whole-cell/culture-filtrate split is very large.
pp(image=our_protein_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## There are two obvious distributions in the data, once again split between types.
pp(image=our_protein_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## This recapitulates the previous plot.

pp(image=our_whole_metrics$libsize,
   file=file.path("images", paste0(ver, "_whole_libsize.png")))
pp(image=our_whole_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_norm_pca.png")))
pp(image=our_whole_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_fsva_pca.png")))
pp(image=our_whole_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_whole_norm_corheat.png")))
pp(image=our_whole_metrics$density,
   file=file.path("images", paste0(ver, "_whole_raw_density.png")))
pp(image=our_whole_metrics$boxplot,
   file=file.path("images", paste0(ver, "_whole_boxplot.png")))

pp(image=our_cf_metrics$libsize,
   file=file.path("images", paste0(ver, "_libsize.png")))
pp(image=our_cf_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_norm_pca.png")))
pp(image=our_cf_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
pp(image=our_cf_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
pp(image=our_cf_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
pp(image=our_cf_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
```

### Tb libraries

```{r tb_print_metrics}
pp(image=tb_protein_metrics$libsize,
   file=file.path("images", paste0(ver, "_tb_libsize.png")))
## It seems to me that the scale of the data is all within an order of magnitude or two.
## I cannot get used to these absurdly large numbers though.
pp(image=tb_protein_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_tb_norm_pca.png")))
## There appears to be a nice split in the data, however the un-assayable batch
## effect is a problem.
pp(image=tb_protein_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
## fsva seems to get some handle on the data, but I don't think we should rely
## upon it.
pp(image=tb_protein_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
## Once again, the whole-cell/culture-filtrate split is very large.
pp(image=tb_protein_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
## There are two obvious distributions in the data, once again split between types.
pp(image=tb_protein_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
## This recapitulates the previous plot.

pp(image=tb_whole_metrics$libsize,
   file=file.path("images", paste0(ver, "_whole_libsize.png")))
pp(image=tb_whole_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_norm_pca.png")))
pp(image=tb_whole_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_whole_fsva_pca.png")))
pp(image=tb_whole_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_whole_norm_corheat.png")))
pp(image=tb_whole_metrics$density,
   file=file.path("images", paste0(ver, "_whole_raw_density.png")))
pp(image=tb_whole_metrics$boxplot,
   file=file.path("images", paste0(ver, "_whole_boxplot.png")))

pp(image=tb_cf_metrics$libsize,
   file=file.path("images", paste0(ver, "_libsize.png")))
pp(image=tb_cf_norm_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_norm_pca.png")))
pp(image=tb_cf_fsva_metrics$pcaplot,
   file=file.path("images", paste0(ver, "_fsva_pca.png")))
pp(image=tb_cf_norm_metrics$corheat,
   file=file.path("images", paste0(ver, "_norm_corheat.png")))
pp(image=tb_cf_metrics$density,
   file=file.path("images", paste0(ver, "_raw_density.png")))
pp(image=tb_cf_metrics$boxplot,
   file=file.path("images", paste0(ver, "_boxplot.png")))
```

# Attempt some quantification comparisons?

## Our libraries

```{r our_de_testing, fig.show='hide'}
our_pairwise_filt <- sm(normalize_expt(our_protein_expt, filter=TRUE))
our_pairwise_comp <- sm(all_pairwise(our_pairwise_filt, model_batch="fsva", force=TRUE))
our_pairwise_nobatch <- sm(all_pairwise(our_pairwise_filt, model_batch=FALSE, force=TRUE))
```

## Tb libraries

```{r tb_de_testing, fig.show='hide'}
tb_pairwise_filt <- sm(normalize_expt(tb_protein_expt, filter=TRUE))
tb_pairwise_comp <- sm(all_pairwise(tb_pairwise_filt, model_batch="fsva", force=TRUE))
tb_pairwise_nobatch <- sm(all_pairwise(tb_pairwise_filt, model_batch=FALSE, force=TRUE))
```

### Show a few metrics from the hpgltools pairwise comparisons

```{r show_de_testing}
our_pairwise_comp$comparison$heat
tb_pairwise_comp$comparison$heat
```

# For each msstats run, do a DE table

## wt_cf vs wt_whole

### Our libraries

```{r our_combine_wtcf_wtwhole, fig.show='hide'}
keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_cf", "wt_whole"))
droppers <- c("undefined")
names(droppers) <- "log2fc"

## Make sure to set the rownames so it will merge into the excel file.
set_name <- "wt_cf_vs_wt_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excludes=droppers,
  excel=paste0("excel/our_wtcf_vs_wtwhole_tables-v", ver, ".xlsx")))

wtcf_nobatch_wtwhole_tables <- sm(combine_de_tables(
  our_pairwise_nobatch, keepers=keepers, extra_annot=our_results[[set_name]],
  excludes=droppers,
  excel=paste0("excel/our_wtcf_vs_wtwhole_nobatch_tables-v", ver, ".xlsx")))

comp_table <- wtcf_wtwhole_tables$data[[set_name]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")

comp_table <- wtcf_nobatch_wtwhole_tables$data[[set_name]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
```

### Tb libraries

```{r tb_combine_wtcf_wtwhole, fig.show='hide'}
keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_cf", "wt_whole"))
droppers <- c("undefined")
names(droppers) <- "log2fc"

## Make sure to set the rownames so it will merge into the excel file.
set_name <- "wt_cf_vs_wt_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excludes=droppers,
  excel=paste0("excel/tb_wtcf_vs_wtwhole_tables-v", ver, ".xlsx")))

wtcf_nobatch_wtwhole_tables <- sm(combine_de_tables(
  tb_pairwise_nobatch, keepers=keepers, extra_annot=tb_results[[set_name]],
  excludes=droppers,
  excel=paste0("excel/tb_wtcf_vs_wtwhole_nobatch_tables-v", ver, ".xlsx")))

comp_table <- wtcf_wtwhole_tables$data[[set_name]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")

comp_table <- wtcf_nobatch_wtwhole_tables$data[[set_name]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
```

## delta_cf vs delta_whole

### Our libraries

```{r our_combine_deltacf_deltawhole, fig.show='hide'}
keepers <- list(
  "deltacf_vs_deltawhole" = c("delta_cf", "delta_whole"))
set_name <- "delta_cf_vs_delta_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
deltacf_deltawhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_deltacf_vs_deltawhole_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_deltacf_deltawhole, fig.show='hide'}
keepers <- list(
  "deltacf_vs_deltawhole" = c("delta_cf", "delta_whole"))
set_name <- "delta_cf_vs_delta_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
deltacf_deltawhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_deltacf_vs_deltawhole_tables-v", ver, ".xlsx")))
```

## comp_cf vs comp_whole

### Our libraries

```{r our_combine_compcf_compwhole, fig.show='hide'}
keepers <- list(
  "compcf_vs_compwhole" = c("comp_cf", "comp_whole"))
set_name <- "comp_cf_vs_comp_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compcf_compwhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_compcf_vs_compwhole_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_compcf_compwhole, fig.show='hide'}
keepers <- list(
  "compcf_vs_compwhole" = c("comp_cf", "comp_whole"))
set_name <- "comp_cf_vs_comp_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compcf_compwhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_compcf_vs_compwhole_tables-v", ver, ".xlsx")))
```

## delta_cf vs wt_cf

### Our libraries

```{r our_combine_deltacf_wtcf, fig.show='hide'}
keepers <- list(
  "deltacf_vs_wtcf" = c("delta_cf", "wt_cf"))
set_name <- "delta_cf_vs_wt_cf"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
deltacf_wtcf_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_deltacf_vs_wtcf_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_deltacf_wtcf, fig.show='hide'}
keepers <- list(
  "deltacf_vs_wtcf" = c("delta_cf", "wt_cf"))
set_name <- "delta_cf_vs_wt_cf"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
deltacf_wtcf_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_deltacf_vs_wtcf_tables-v", ver, ".xlsx")))
```

## comp_cf vs wt_cf

### Our libraries

```{r our_combine_compcf_wtcf, fig.show='hide'}
keepers <- list(
  "compcf_vs_wtcf" = c("comp_cf", "wt_cf"))
set_name <- "comp_cf_vs_wt_cf"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compcf_wtcf_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_compcf_vs_wtcf_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_compcf_wtcf, fig.show='hide'}
keepers <- list(
  "compcf_vs_wtcf" = c("comp_cf", "wt_cf"))
set_name <- "comp_cf_vs_wt_cf"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compcf_wtcf_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_compcf_vs_wtcf_tables-v", ver, ".xlsx")))
```

## delta_whole vs wt_whole

### Our libraries

```{r our_combine_delawhole_wtwhole, fig.show='hide'}
keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_cf", "wt_whole"))
set_name <- "delta_whole_vs_wt_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_deltawhole_vs_wtwhole_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_delawhole_wtwhole, fig.show='hide'}
keepers <- list(
  "wtcf_vs_wtwhole" = c("wt_cf", "wt_whole"))
set_name <- "delta_whole_vs_wt_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
wtcf_wtwhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_deltawhole_vs_wtwhole_tables-v", ver, ".xlsx")))
```

## comp_whole vs wt_whole

### Our libraries

```{r our_combine_compwhole_wtwhole, fig.show='hide'}
keepers <- list(
  "compwhole_vs_wtwhole" = c("comp_whole", "wt_whole"))
set_name <- "comp_whole_vs_wt_whole"
rownames(our_results[[set_name]]) <- our_results[[set_name]][["Protein"]]
compwhole_wtwhole_tables <- sm(combine_de_tables(
  our_pairwise_comp, keepers=keepers, extra_annot=our_results[[set_name]],
  excel=paste0("excel/our_compwhole_vs_wtwhole_tables-v", ver, ".xlsx")))
```

### Tb libraries

```{r tb_combine_compwhole_wtwhole, fig.show='hide'}
keepers <- list(
  "compwhole_vs_wtwhole" = c("comp_whole", "wt_whole"))
set_name <- "comp_whole_vs_wt_whole"
rownames(tb_results[[set_name]]) <- tb_results[[set_name]][["Protein"]]
compwhole_wtwhole_tables <- sm(combine_de_tables(
  tb_pairwise_comp, keepers=keepers, extra_annot=tb_results[[set_name]],
  excel=paste0("excel/tb_compwhole_vs_wtwhole_tables-v", ver, ".xlsx")))
```

```{r compare_our_tb}
our_table <- our_pairwise_comp$deseq$all_tables$wt_whole_vs_wt_cf
tb_table <- tb_pairwise_comp$deseq$all_tables$wt_whole_vs_wt_cf
our_order <- order(our_table[["limma_logfc"]])
rownames(tb_table) <- gsub(pattern="^(.*)_.*", replacement="\\1", x=rownames(tb_table))

test_table <- merge(our_table, tb_table, by="row.names")
cor.test(test_table$logFC.x, test_table$logFC.y)
```

# TODO

* 2018-04-10:  Make sure my invocations of SWATH2stats/MSstats are correct.

```{r saveme, eval=FALSE}
if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}
```
