1 TODO

  • Remove MSstats logging. – done
  • Make explicit spearman correlations between methods. – done
    • Do both for all data and for the top-50/bottom-50 – done, but weird.
  • Make 100% certain that the samples are annotated correctly. – done.
  • Limma/MSstats/EdgeR Venn diagram for all up in CF
  • Repeat in all down CF
  • Repeat with a cutoff, top-50
  • Individual plots for P/PE proteins.

2 Analyzing data from openMS and friends.

In preprocessing_comet_highres.Rmd, I used the openMS tutorials and supplemental materials from a couple papers to perform (I hope correctly) the various preprocessing tasks required to extract intensity data from DIA/SWATH transitions.

The final steps of that process combined the transition intensities from every sample into a metadata file (results/tric/HCD_meta.tsv), an intensity matrix (results/tric/HCD_outmatrix.tsv), and a feature-aligned output matrix (results/tric/aligned_comet_HCD.tsv).

My reading of the SWATH2stats and MSstats source code suggests that the log2(intensities) of the feature-aligned data are our final proxy for protein abundance. At first glance, these data might therefore follow a distribution similar to RNASeq data (negative binomial, but perhaps with a bigger tail?). In addition, by the time we run tric on the data, we have a count matrix and sample annotation data frames which look remarkably similar to those used in an RNASeq expressionset. Indeed, by the end of its processing, MSstats creates an MSnSet class of its own which uses fData/exprs/pData.

For the curious, my reasoning for saying that the log intensities are our proxy for abundance comes from MSstats/R/DataProcess.R in a clause which looks like:

if (logTrans == 2) {
  work[["ABUNDANCE"]] <- log2(work[["ABUNDANCE"]])
} else if (logTrans == 10) {
  work[["ABUNDANCE"]] <- log10(work[["ABUNDANCE"]])
} else {
  ## Above there was a check for only log 2 and 10, but we can do e if we want.
  ## I might go back up there and remove that check. Long live e! 2.718282 rules!
  work[["ABUNDANCE"]] <- log(work[["ABUNDANCE"]]) / log(logTrans)
}

(Note: I added the natural log to the set of conditions, but otherwise the logic is unchanged.)
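Since the general clause has to reproduce the fixed bases, the change-of-base arithmetic is easy to sanity check (a standalone check, not MSstats code):

```r
## Sanity check of the change-of-base identity behind the final clause:
## log_b(x) = log(x) / log(b), so the general branch agrees with the
## dedicated log2/log10 branches, and logTrans = exp(1) yields plain log().
x <- c(4, 8, 100)
stopifnot(all.equal(log(x) / log(2), log2(x)))
stopifnot(all.equal(log(x) / log(10), log10(x)))
stopifnot(all.equal(log(x) / log(exp(1)), log(x)))
```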

With that in mind, I want to use some tools with which I am familiar to try to understand these data. I will therefore first attempt to coerce my tric-aligned data and annotations into a ‘normal’ expressionset, and then generate some diagnostic plots. If I am wrong and these distributions are not as expected, those plots will be conceptually incorrect (I do not yet think I am wrong).
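Before doing that with the real matrix, here is a minimal sketch, on simulated log-normal intensities rather than the tric output, of the kind of diagnostic I have in mind: if log2(intensity) is roughly normal, a normal quantile-quantile plot should be close to linear.

```r
## Simulated stand-in for an intensity column; the real data come from
## results/tric/aligned_comet_HCD.tsv.
set.seed(42)
fake_intensity <- rlnorm(5000, meanlog=10, sdlog=1.5)
log_int <- log2(fake_intensity)
## If the log2 intensities are approximately normal, the qq points fall on
## a straight line (correlation with theoretical quantiles near 1).
qq <- qqnorm(log_int, plot.it=FALSE)
cor(qq$x, qq$y)
hist(log_int, breaks=50, main="log2(intensity), simulated")
```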

2.1 Sample annotation via SWATH2stats

I am using the SWATH2stats vignette as my primary source of information. It uses OpenSWATH_SM3_GoldStandardAutomatedResults_human_peakgroups.txt, which has a format nearly identical to my tric output matrix. For the moment, then, I will assume that the proper input for SWATH2stats is ‘results/tric/aligned_comet_HCD.tsv’ and not the metadata nor the output matrix.

I keep a sample sheet of all the DIA samples used in this analysis in ‘sample_sheets/dia_samples.xlsx’. It should contain all the other required data, with one important caveat: I removed 1 sample by ‘commenting’ it (prefixing it with ‘##’), which is an admittedly dumb thing to do in an excel file.

One last caveat: I hacked the SWATH2stats sample_annotation() function to add a couple columns in an attempt to make it a little more robust when faced with sample sheets with differently named columns.

In addition, SWATH2stats provides some nice filtering and combination functions which should be considered when generating various expressionset data structures later.

tric_data <- read.csv("results/tric/aligned_comet_HCD.tsv", sep="\t")

sample_annot <- openxlsx::read.xlsx("sample_sheets/dia_samples.xlsx")
rownames(sample_annot) <- sample_annot[["sampleid"]]
## Drop samples which were commented out (prefixed with '##')
keep_idx <- ! grepl(pattern="^##", x=rownames(sample_annot))
sample_annot <- sample_annot[keep_idx, ]
## Set the mzXML column to match the filename column in the data.
devtools::load_all("~/scratch/git/SWATH2stats")
## Loading SWATH2stats
## s2s, my witty way of shortening SWATH2stats...
s2s_exp <- SWATH2stats::sample_annotation(data=tric_data,
                                          sample_annotation=sample_annot,
                                          data_type="OpenSWATH",
                                          data_file_column="filename")
############
## IMPORTANT CAVEAT
############
## HEY YOU, TREY, READ THIS OR YOU WILL BE PISSED LATER
############
## I am keeping only those rows which are from the march run.
dim(s2s_exp)
## [1] 145308     70
kept_rows <- s2s_exp[["BioReplicate"]] == "mar"
s2s_exp <- s2s_exp[kept_rows, ]
dim(s2s_exp)
## [1] 92624    70
## Wow, that dropped about 36% of the data

Now I have a couple data structures which should prove useful for the metrics provided by SWATH2stats, MSstats, and my own hpgltools.

3 SWATH2stats continued

Let's return to some of the metrics provided by SWATH2stats.

## Get correlations on a sample by sample basis
pp(file="images/swath2stats_sample_cor.png")
## Going to write the image to: images/swath2stats_sample_cor.png when dev.off() is called.
sample_cor <- SWATH2stats::plot_correlation_between_samples(s2s_exp,
                                                            Comparison=transition_group_id ~ Run,
                                                            fun.aggregate=sum,
                                                            column.values="Intensity")
## Warning: Ignoring unknown aesthetics: fill
dev.off()
## png 
##   2
sample_cond_rep_cor <- SWATH2stats::plot_correlation_between_samples(s2s_exp,
                                                                     Comparison=transition_group_id ~
                                                                       Condition + BioReplicate + Run,
                                                                     fun.aggregate=sum,
                                                                     column.values="Intensity")
## Warning: Ignoring unknown aesthetics: fill

## I am a little concerned that these values do not seem to change when I use
## filtered/normalized data.  So I am rerunning them manually for a moment --
## perhaps I messed something up when I rewrote portions of the
## sample_annotation() function in SWATH2stats.

## ahh I think I see the problem.  The default value for fun.aggregate is NULL,
## which causes dcast to default to length.  I think this is not likely to be
## valid for this data.  I am not certain, however, what is the appropriate
## function.  If I had to guess, I would go with sum()?
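A toy demonstration of why the choice matters (made-up numbers, not the real transitions); base R's tapply aggregates the same way dcast does here, so the length() default counts transitions per cell while sum() totals their intensities:

```r
## Several transitions can land in one (peptide, run) cell; the aggregation
## function decides what the cell value means.
toy <- data.frame(peptide=c("A", "A", "A", "B"),
                  run=c("r1", "r1", "r2", "r1"),
                  intensity=c(10, 20, 5, 7))
by_length <- tapply(toy$intensity, list(toy$peptide, toy$run), length)
by_sum <- tapply(toy$intensity, list(toy$peptide, toy$run), sum)
by_length["A", "r1"]  ## 2: a row count, not an intensity
by_sum["A", "r1"]     ## 30: the summed intensity
```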

assess_decoy_rate(s2s_exp)
## Number of non-decoy peptides: 9558
## Number of decoy peptides: 1605
## Decoy rate: 0.1679
## This seems a bit high to me, yesno?
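For reference, the reported rate is just decoys over targets; scaling it by FFT (the estimated fraction of false targets, 0.7 here) gives a rough sense of how the raw decoy rate translates into an FDR estimate. The FFT scaling is my reading of the SWATH2stats approach, so treat it as an assumption:

```r
## Decoy-rate arithmetic from the assess_decoy_rate() output above.
n_target <- 9558
n_decoy <- 1605
decoy_rate <- n_decoy / n_target
round(decoy_rate, 4)          ## 0.1679, matching the reported rate
## Assumed FFT scaling: naive FDR estimate = decoy_rate * FFT.
round(decoy_rate * 0.7, 4)    ## 0.1175
```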
fdr_overall <- assess_fdr_overall(s2s_exp, output="Rconsole")

byrun_fdr <- assess_fdr_byrun(s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
## The average FDR by run on assay level is 0.009
## The average FDR by run on peptide level is 0.009
## The average FDR by run on protein level is 0.034

mscore4assayfdr(s2s_exp, FFT=0.7, fdr_target=0.02)
## Target assay FDR: 0.02
## Required overall m-score cutoff:0.0044668
## achieving assay FDR =0.0194
## [1] 0.004467
prot_score <- mscore4protfdr(s2s_exp, FFT=0.7, fdr_target=0.02)
## Target protein FDR:0.02
## Required overall m-score cutoff:0.0014125
## achieving protein FDR =0.0199
mscore_filtered <- SWATH2stats::filter_mscore(s2s_exp, prot_score)
## Dimension difference: 12662, 0
data_filtered_mscore <- filter_mscore_freqobs(s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
## Treshold, peptides need to have been quantified in more conditions than: 13.6
## Fraction of peptides selected: 0.16
## Dimension difference: 62450, 0
data_filtered_fdr <- filter_mscore_fdr(s2s_exp, FFT=0.7,
                                       overall_protein_fdr_target=0.03,
                                       upper_overall_peptide_fdr_limit=0.05)
## Target protein FDR:0.03
## Required overall m-score cutoff:0.0017783
## achieving protein FDR =0.0286
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR:0.05
## Required overall m-score cutoff:0.0079433
## achieving peptide FDR =0.0485
## Proteins selected: 
## Total proteins selected: 1832
## Thereof target proteins: 1760
## Thereof decoy proteins: 72
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 9342
## Thereof target peptides: 9190
## Thereof decoy peptides: 152
## Total peptides selected from:
## Total peptides: 9853
## Thereof target peptides: 9847
## Thereof decoy peptides: 299
## The average FDR by run on assay level is 0.006
## The average FDR by run on peptide level is 0.007
## The average FDR by run on protein level is 0.025
## Individual run FDR quality of the peptides selected from:
## 0.00666
## The decoys have been removed from the returned data
only_proteotypic <- filter_proteotypic_peptides(data_filtered_fdr)
## Number of proteins detected: 1743
## Protein identifiers: Rv2050, Rv0516c, Rv0045c, Rv0704, Rv2555c, Rv2442c
## Number of proteins detected that are supported by a proteotypic peptide: 1717
## Number of proteotypic peptides detected: 9108
all_filtered <- filter_all_peptides(data_filtered_fdr)
## Number of proteins detected: 1745
## Protein identifiers: Rv2050, Rv0516c, Rv0045c, Rv0704, Rv2555c, Rv2442c
only_strong <- filter_on_max_peptides(data_filtered_fdr, 5)
## Before filtering: 
##   Number of proteins: 1756
##   Number of peptides: 9190
## 
## Percentage of peptides removed: 40.12%
## 
## After filtering: 
##   Number of proteins: 1756
##   Number of peptides: 5503
only_minimum <- filter_on_min_peptides(only_strong, 2)
## Before filtering: 
##   Number of proteins: 1756
##   Number of peptides: 5503
## 
## Percentage of peptides removed: 8.4%
## 
## After filtering: 
##   Number of proteins: 1294
##   Number of peptides: 5041
## I think these matrices are probably smarter to use than the raw outmatrix from tric,
## but I am not a fan of rewriting the sample column names.
protein_matrix_mscore <- write_matrix_proteins(mscore_filtered, write.csv=TRUE,
                                               filename="swath2stats_protein_matrix_mscore.csv",
                                               rm.decoy=FALSE)
## Protein overview matrix swath2stats_protein_matrix_mscore.csv written to working folder.
dim(protein_matrix_mscore)
## [1] 1755   18
peptide_matrix_mscore <- write_matrix_peptides(mscore_filtered, write.csv=TRUE,
                                               filename="swath2stats_peptide_matrix_mscore.csv",
                                               rm.decoy=FALSE)
## Peptide overview matrix swath2stats_peptide_matrix_mscore.csv written to working folder.
protein_matrix_minimum <- write_matrix_proteins(only_minimum, write.csv=TRUE,
                                                filename="swath2stats_protein_matrix_minimum.csv",
                                                rm.decoy=FALSE)
## Protein overview matrix swath2stats_protein_matrix_minimum.csv written to working folder.
dim(protein_matrix_minimum)
## [1] 1294   18
peptide_matrix_minimum <- write_matrix_peptides(only_minimum, write.csv=TRUE,
                                                filename="swath2stats_peptide_matrix_minimum.csv",
                                                rm.decoy=FALSE)
## Peptide overview matrix swath2stats_peptide_matrix_minimum.csv written to working folder.
SWATH2stats::plot_correlation_between_samples(s2s_exp,
                                              column.values="RT",
                                              fun.aggregate=sum)
## Warning: Ignoring unknown aesthetics: fill

##           Var1         Var2  value   method
## 1    wt_cf_mar    wt_cf_mar 1.0000  pearson
## 3    wt_cf_mar wt_whole_mar 0.3573  pearson
## 4 wt_whole_mar wt_whole_mar 1.0000  pearson
## 6 wt_whole_mar    wt_cf_mar 0.2628 spearman
## I have no effing clue what this plot means.
variation <- SWATH2stats::plot_variation(s2s_exp, fun.aggregate=sum,
                                         Comparison=transition_group_id ~ Condition)

## Something in SWATH2stats::disaggregate was written poorly and is looking for
## a variable named 'cols'
cols <- colnames(only_minimum)
disaggregated <- SWATH2stats::disaggregate(only_minimum, all.columns=TRUE)
## The library contains between 4 and 6 transitions per precursor.
## The data table was transformed into a table containing one row per transition.
msstats_input <- SWATH2stats::convert4MSstats(disaggregated)
## One or several columns required by MSstats were not in the data. The columns were created and filled with NAs.
## Missing columns: ProductCharge, IsotopeLabelType
## IsotopeLabelType was filled with light.
alfq_input <- sm(SWATH2stats::convert4aLFQ(disaggregated))
mapdia_input <- sm(SWATH2stats::convert4mapDIA(disaggregated, RT=TRUE))

3.1 MSstats

msstats.org seems to provide a complete solution for computing reasonable metrics on these data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

devtools::load_all("~/scratch/git/MSstats")
## Loading MSstats
## Warning: character(0)

msstats_quant <- sm(MSstats::dataProcess(msstats_input))

msstats_plots <- sm(MSstats::dataProcessPlots(msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(msstats_input$Condition))
my_levels
## [1] "wt_cf"    "wt_whole"
comparison <- ghetto_contrast_matrix("wt_cf", "wt_whole")
msstats_comparison <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                  data=msstats_quant))

3.2 Do the same comparison with the modified data.

## I want to test a hypothesis about the effect of 0/NA on the data, so I am replacing them with
## 2, which should give me a minimal value for all the log2fc calculations to work with.
msstats_modified <- msstats_quant
abundances <- msstats_quant$ProcessedData[["ABUNDANCE"]]
na_idx <- is.na(abundances)
msstats_modified$ProcessedData[na_idx, "ABUNDANCE"] <- 2
msstats_modified_comp <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                     data=msstats_modified))
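A toy sketch of the hypothesis (made-up log2 abundances, not the real proteins): with all-NA in one group there is no usable estimate, while imputing a small constant floors the fold change instead of dropping the protein.

```r
## Hypothetical log2 abundances for one protein, unobserved in the filtrate.
cf    <- c(NA, NA, NA)
whole <- c(10, 11, 10.5)
mean(whole) - mean(cf, na.rm=TRUE)     ## NaN: nothing to estimate from
cf_imputed <- ifelse(is.na(cf), 2, cf)
mean(whole) - mean(cf_imputed)         ## 8.5: a large, floored log2fc
```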

3.2.1 P/PE protein QC plots for Yan

Yan asked for the P/PE protein QC plots. OK. I changed dataProcessPlots() to return something useful, so that should be possible now.

pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]

## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="", x=levels(msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots

pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in seq_along(pe_plots)) {
  plot(pe_plots[[p]])
}
## Warning: Removed 114 rows containing non-finite values (stat_boxplot).
## Warning: Removed 108 rows containing non-finite values (stat_boxplot).
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).
## Warning: Removed 174 rows containing non-finite values (stat_boxplot).
## Warning: Removed 240 rows containing non-finite values (stat_boxplot).
## Warning: Removed 138 rows containing non-finite values (stat_boxplot).
## Warning: Removed 66 rows containing non-finite values (stat_boxplot).
## Warning: Removed 89 rows containing non-finite values (stat_boxplot).
dev.off()
## png 
##   2

4 Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities from SWATH2stats together with the metadata and annotation data, attempt to create a ‘normal’ expressionset, and poke at it to see what I can learn.

4.1 Massaging the metadata

I want to use the same metadata as were used for MSstats. It has one important difference from the requirements of hpgltools: I do not allow rownames/sampleIDs to start with a number.

metadata <- sample_annot
metadata[["sampleid"]] <- paste0("s", metadata[["sampleid"]])
rownames(metadata) <- metadata[["sampleid"]]

4.2 Massaging the gene annotation data and adding the msstats data.

I have my own annotation data from the gff file/microbesonline/whatever; I can add the MSstats results to it so that later I can print them all together.

## Here is a neat little thing I can do: add the MSstats results to my annotation data.
## Then, when I print out the tables of the limma/etc results, the MSstats
## results will come along for free.
msstats_table <- msstats_comparison[["ComparisonResult"]]
rownames(msstats_table) <- gsub(pattern="^1\\/", replacement="",
                                x=msstats_table[["Protein"]])
mtb_annotations_with_msstats <- merge(mtb_annotations, msstats_table,
                                      by="row.names", all.x=TRUE)
rownames(mtb_annotations_with_msstats) <- mtb_annotations_with_msstats[["Row.names"]]
mtb_annotations_with_msstats <- mtb_annotations_with_msstats[, -1]

msstats_table_modified <- msstats_modified_comp[["ComparisonResult"]]
rownames(msstats_table_modified) <- gsub(pattern="^1\\/", replacement="",
                                         x=msstats_table_modified[["Protein"]])
mtb_annotations_with_modified_msstats <- merge(mtb_annotations, msstats_table_modified,
                                               by="row.names", all.x=TRUE)
rownames(mtb_annotations_with_modified_msstats) <- mtb_annotations_with_modified_msstats[["Row.names"]]
mtb_annotations_with_modified_msstats <- mtb_annotations_with_modified_msstats[, -1]

4.3 Massaging the intensity matrix

I do not want the ‘1/’ prefix before the protein names; I already merged them into one entry per gene via SWATH2stats.

prot_mtrx <- read.csv("swath2stats_protein_matrix_mscore.csv")
rownames(prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=prot_mtrx[["ProteinName"]])
prot_mtrx <- prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
fun <- gsub(pattern="^.*_(2018.*$)", replacement="\\1", x=colnames(prot_mtrx))
colnames(prot_mtrx) <- paste0("s", fun)

decoy_idx <- ! grepl(pattern="DECOY", x=rownames(prot_mtrx))
prot_mtrx <- prot_mtrx[decoy_idx, ]

prot_mtrx_modified <- prot_mtrx
zero_idx <- prot_mtrx == 0
prot_mtrx_modified[zero_idx] <- 2

4.4 Merge the pieces

Now we should have sufficient pieces to make an expressionset.

## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(prot_mtrx)
metadata <- metadata[reordered, ]
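As a self-contained illustration of the reorder-by-name trick above (toy names, not the real samples): indexing a data frame by a character vector both subsets and reorders, which guarantees the metadata rows line up with the matrix columns regardless of whether SWATH2stats reordered anything.

```r
## Toy demonstration: row-indexing by a character vector both subsets and
## reorders, so metadata rows end up in matrix-column order.
mat <- matrix(1:6, nrow=2, dimnames=list(c("g1", "g2"), c("s3", "s1", "s2")))
meta <- data.frame(condition=c("a", "b", "c"), row.names=c("s1", "s2", "s3"))
meta_ordered <- meta[colnames(mat), , drop=FALSE]
stopifnot(identical(rownames(meta_ordered), colnames(mat)))
```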

protein_expt <- create_expt(metadata,
                            count_dataframe=prot_mtrx,
                            gene_info=mtb_annotations_with_msstats)
## Reading the sample metadata.
## The sample definitions comprises: 17, 15 rows, columns.
## Matched 1708 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
protein_modified_expt <- create_expt(metadata,
                                     count_dataframe=prot_mtrx_modified,
                                     gene_info=mtb_annotations_with_modified_msstats)
## Reading the sample metadata.
## The sample definitions comprises: 17, 15 rows, columns.
## Matched 1708 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
protein_metrics <- sm(graph_metrics(protein_expt))
protein_norm <- sm(normalize_expt(protein_expt, transform="log2", convert="cpm",
                                  norm="quant", filter=TRUE))
protein_norm_metrics <- sm(graph_metrics(protein_norm))
protein_metrics$libsize

protein_norm_metrics$pcaplot

5 Attempt some quantification comparisons?

pairwise_filt <- sm(normalize_expt(protein_expt, filter=TRUE))
pairwise_comp <- all_pairwise(pairwise_filt, parallel=FALSE, force=TRUE)
## Using limma's removeBatchEffect to visualize before/after batch inclusion.
## Starting limma pairwise comparison.
## Leaving the data alone, regardless of normalization state.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$best_libsize.
## Limma step 1/6: choosing model.
## Choosing the non-intercept containing model.
## Limma step 2/6: running limma::voom(), switch with the argument 'which_voom'.
## Using normalize.method=quantile for voom.
## Limma step 3/6: running lmFit with method: ls.
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust=FALSE and trend=FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/1: Creating table: wt_whole_vs_wt_cf.  Adjust=BH
## Limma step 6/6: 1/2: Creating table: wt_cf.  Adjust=BH
## Limma step 6/6: 2/2: Creating table: wt_whole.  Adjust=BH
## Starting DESeq2 pairwise comparisons.
## About to round the data, this is a pretty terrible thing to do. But if you,
## like me, want to see what happens when you put non-standard data into deseq, then here you go.
## Warning in choose_binom_dataset(input, force = force): This data was inappropriately
## forced into integers.
## Choosing the non-intercept containing model.
## DESeq2 step 1/5: Including batch and condition in the deseq model.
## Warning in import_deseq(data, column_data, model_string, tximport = input[["tximport"]]
## [["raw"]]): Converted down 1458 elements because they are larger than the maximum integer
## size.
## converting counts to integer mode
## DESeq2 step 2/5: Estimate size factors.
## DESeq2 step 3/5: Estimate dispersions.
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Using a parametric fitting seems to have worked.
## DESeq2 step 4/5: nbinomWaldTest.
## Plotting dispersions.

## Starting edgeR pairwise comparisons.
## About to round the data, this is a pretty terrible thing to do. But if you,
## like me, want to see what happens when you put non-standard data into deseq, then here you go.
## Warning in choose_binom_dataset(input, force = force): This data was inappropriately
## forced into integers.
## Choosing the non-intercept containing model.
## EdgeR step 1/9: importing and normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## EdgeR step 5/9: Estimating GLM Trended dispersion.
## EdgeR step 6/9: Estimating GLM Tagged dispersion.
## EdgeR step 7/9: Running glmFit, switch to glmQLFit by changing the argument 'edger_test'.
## EdgeR step 8/9: Making pairwise contrasts.

## Starting basic pairwise comparison.
## Leaving the data alone, regardless of normalization state.
## Basic step 0/3: Transforming data.
## Basic step 1/3: Creating median and variance tables.
## Basic step 2/3: Performing 3 comparisons.
## Basic step 3/3: Creating faux DE Tables.
## Basic: Returning tables.
## Comparing analyses 1/1: wt_whole_vs_wt_cf

keepers <- list("filtrate_vs_cells" = c("wt_cf", "wt_whole"))
pairwise_tables <- combine_de_tables(pairwise_comp,
                                     excel="excel/test_pairwise_de_with_msstats.xlsx",
                                     keepers=keepers)
## Deleting the file excel/test_pairwise_de_with_msstats.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/1: filtrate_vs_cells
## Found inverse table with wt_whole_vs_wt_cf
## Adding venn plots for wt_cf_vs_wt_whole.

## Limma expression coefficients for wt_cf_vs_wt_whole; R^2: 0.433; equation: y = 0.562x - 0.965
## Edger expression coefficients for wt_cf_vs_wt_whole; R^2: 0.461; equation: y = 1.42x - 25
## DESeq2 expression coefficients for wt_cf_vs_wt_whole; R^2: 0.408; equation: y = 1.4x - 23.5
## Writing summary information.
## Attempting to add the comparison plot to pairwise_summary at row: 19 and column: 1
## Performing save of the workbook.

droppers <- c("undefined")
names(droppers) <- "log2fc"
pairwise_onlyall <- combine_de_tables(pairwise_comp,
                                      excel="excel/test_pairwise_de_only_msstats.xlsx",
                                      keepers=keepers,
                                      excludes=droppers)
## Deleting the file excel/test_pairwise_de_only_msstats.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/1: filtrate_vs_cells
## Found inverse table with wt_whole_vs_wt_cf
## Removed 741261 genes using undefined as a string against column log2fc.
## Adding venn plots for wt_cf_vs_wt_whole.

## Limma expression coefficients for wt_cf_vs_wt_whole; R^2: 0.433; equation: y = 0.562x - 0.965
## Edger expression coefficients for wt_cf_vs_wt_whole; R^2: 0.461; equation: y = 1.42x - 25
## DESeq2 expression coefficients for wt_cf_vs_wt_whole; R^2: 0.408; equation: y = 1.4x - 23.5
## Writing summary information.
## Attempting to add the comparison plot to pairwise_summary at row: 19 and column: 1
## Performing save of the workbook.

pairwise_sig <- sm(extract_significant_genes(pairwise_tables,
                                             excel="excel/test_pairwise_sig_with_msstats.xlsx"))

solo_proteins <- features_in_single_condition(protein_expt)
proteins_only_cf <- solo_proteins[["solo_this"]][["wt_cf"]]
proteins_only_cf
##  [1] "Rv0116c" "Rv0256c" "Rv0398c" "Rv1184c" "Rv2080"  "Rv2240c" "Rv2623"  "Rv2668" 
##  [9] "Rv3312A" "Rv3568c"
proteins_only_whole <- solo_proteins[["solo_this"]][["wt_whole"]]
length(proteins_only_whole)
## [1] 788
pairwise_modified_comp <- all_pairwise(protein_modified_expt, parallel=FALSE, force=TRUE)
## Using limma's removeBatchEffect to visualize before/after batch inclusion.
## Starting limma pairwise comparison.
## Leaving the data alone, regardless of normalization state.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$libsize.
## Limma step 1/6: choosing model.
## Choosing the non-intercept containing model.
## Limma step 2/6: running limma::voom(), switch with the argument 'which_voom'.
## Using normalize.method=quantile for voom.
## Limma step 3/6: running lmFit with method: ls.
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust=FALSE and trend=FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/1: Creating table: wt_whole_vs_wt_cf.  Adjust=BH
## Limma step 6/6: 1/2: Creating table: wt_cf.  Adjust=BH
## Limma step 6/6: 2/2: Creating table: wt_whole.  Adjust=BH
## Starting DESeq2 pairwise comparisons.
## About to round the data, this is a pretty terrible thing to do. But if you,
## like me, want to see what happens when you put non-standard data into deseq, then here you go.
## Warning in choose_binom_dataset(input, force = force): This data was inappropriately
## forced into integers.
## Choosing the non-intercept containing model.
## DESeq2 step 1/5: Including batch and condition in the deseq model.
## Warning in import_deseq(data, column_data, model_string, tximport = input[["tximport"]]
## [["raw"]]): Converted down 1458 elements because they are larger than the maximum integer
## size.
## converting counts to integer mode
## DESeq2 step 2/5: Estimate size factors.
## DESeq2 step 3/5: Estimate dispersions.
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Using a parametric fitting seems to have worked.
## DESeq2 step 4/5: nbinomWaldTest.
## Plotting dispersions.

## Starting edgeR pairwise comparisons.
## About to round the data, this is a pretty terrible thing to do. But if you,
## like me, want to see what happens when you put non-standard data into deseq, then here you go.
## Warning in choose_binom_dataset(input, force = force): This data was inappropriately
## forced into integers.
## Choosing the non-intercept containing model.
## EdgeR step 1/9: importing and normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## EdgeR step 5/9: Estimating GLM Trended dispersion.
## EdgeR step 6/9: Estimating GLM Tagged dispersion.
## EdgeR step 7/9: Running glmFit, switch to glmQLFit by changing the argument 'edger_test'.
## EdgeR step 8/9: Making pairwise contrasts.

## Starting basic pairwise comparison.
## Leaving the data alone, regardless of normalization state.
## Basic step 0/3: Transforming data.
## Basic step 1/3: Creating median and variance tables.
## Basic step 2/3: Performing 3 comparisons.
## Basic step 3/3: Creating faux DE Tables.
## Basic: Returning tables.
## Comparing analyses 1/1: wt_whole_vs_wt_cf

pairwise_modified_tables <- combine_de_tables(pairwise_modified_comp,
                                              excel="excel/test_pairwise_modified_de_with_msstats.xlsx",
                                              keepers=keepers)
## Deleting the file excel/test_pairwise_modified_de_with_msstats.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/1: filtrate_vs_cells
## Found inverse table with wt_whole_vs_wt_cf
## Adding venn plots for wt_cf_vs_wt_whole.

## Limma expression coefficients for wt_cf_vs_wt_whole; R^2: 0.438; equation: y = 0.547x - 0.718
## Edger expression coefficients for wt_cf_vs_wt_whole; R^2: 0.247; equation: y = 0.966x - 6.19
## DESeq2 expression coefficients for wt_cf_vs_wt_whole; R^2: 0.199; equation: y = 0.861x + 2.28
## Writing summary information.
## Attempting to add the comparison plot to pairwise_summary at row: 19 and column: 1
## Performing save of the workbook.

droppers <- c("undefined")
names(droppers) <- "log2fc"
pairwise_modified_onlyall <- combine_de_tables(pairwise_modified_comp,
                                               excel="excel/test_pairwise_de_only_msstats.xlsx",
                                               keepers=keepers,
                                               excludes=droppers)
## Deleting the file excel/test_pairwise_de_only_msstats.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/1: filtrate_vs_cells
## Found inverse table with wt_whole_vs_wt_cf
## Removed 770556 genes using undefined as a string against column log2fc.
## Adding venn plots for wt_cf_vs_wt_whole.

## Limma expression coefficients for wt_cf_vs_wt_whole; R^2: 0.438; equation: y = 0.547x - 0.718
## Edger expression coefficients for wt_cf_vs_wt_whole; R^2: 0.247; equation: y = 0.966x - 6.19
## DESeq2 expression coefficients for wt_cf_vs_wt_whole; R^2: 0.199; equation: y = 0.861x + 2.28
## Writing summary information.
## Attempting to add the comparison plot to pairwise_summary at row: 19 and column: 1
## Performing save of the workbook.

pairwise_modified_sig <- sm(extract_significant_genes(
  pairwise_modified_tables,
  excel="excel/test_pairwise_modified_sig_with_msstats.xlsx"))

5.1 Compare hpgltools/MSstats

Can we compare limma and friends to MSstats?

hpgl_table <- pairwise_tables$data[[1]]
msstats_table <- msstats_comparison$ComparisonResult
rownames(msstats_table) <- gsub(pattern="^1/", replacement="", x=msstats_table$Protein)

merged_table <- merge(hpgl_table, msstats_table, by="row.names")
write.csv(file="images/merged_table.csv", merged_table)
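One quirk of merging on rownames is worth remembering: `merge(..., by="row.names")` returns the shared identifiers in a `Row.names` column rather than as rownames. A minimal toy sketch (the frames `a` and `b` are hypothetical stand-ins for the hpgltools and MSstats tables):

```r
## merge(by="row.names") keeps only rows whose names appear in both frames and
## moves the shared names into a 'Row.names' column.
a <- data.frame(limma_logfc = c(1, 2), row.names = c("p1", "p2"))
b <- data.frame(log2FC = c(1.5, 2.5), row.names = c("p2", "p3"))
m <- merge(a, b, by = "row.names")
m
```

Only the protein shared by both frames ("p2") survives the merge, which is why the merged table below has fewer rows than either input.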
cor.test(merged_table$limma_logfc, merged_table$log2FC)
## 
##  Pearson's product-moment correlation
## 
## data:  merged_table$limma_logfc and merged_table$log2FC
## t = 35, df = 910, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7290 0.7844
## sample estimates:
##    cor 
## 0.7581
cor.test(merged_table$limma_logfc, merged_table$log2FC, method="spearman")
## Warning in cor.test.default(merged_table$limma_logfc, merged_table$log2FC, : Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  merged_table$limma_logfc and merged_table$log2FC
## S = 2.9e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.7698
cor.test(merged_table$deseq_logfc, merged_table$log2FC)
## 
##  Pearson's product-moment correlation
## 
## data:  merged_table$deseq_logfc and merged_table$log2FC
## t = 15, df = 910, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3832 0.4885
## sample estimates:
##    cor 
## 0.4374
cor.test(merged_table$deseq_logfc, merged_table$log2FC, method="spearman")
## Warning in cor.test.default(merged_table$deseq_logfc, merged_table$log2FC, : Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  merged_table$deseq_logfc and merged_table$log2FC
## S = 3.9e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.6907
cor.test(merged_table$edger_logfc, merged_table$log2FC)
## 
##  Pearson's product-moment correlation
## 
## data:  merged_table$edger_logfc and merged_table$log2FC
## t = 16, df = 910, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4117 0.5139
## sample estimates:
##    cor 
## 0.4643
cor.test(merged_table$edger_logfc, merged_table$log2FC, method="spearman")
## Warning in cor.test.default(merged_table$edger_logfc, merged_table$log2FC, : Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  merged_table$edger_logfc and merged_table$log2FC
## S = 3.4e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.7282
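Rather than calling `cor.test()` once per method pair, a single correlation matrix covers all of the log2FC columns at once. A small sketch with fake data (the column names mirror `merged_table` above; the `fake` frame itself is invented for illustration):

```r
## Build a toy table shaped like merged_table: three hpgltools fold-change
## columns plus an MSstats-style log2FC correlated with the limma values.
set.seed(42)
fake <- data.frame(
  limma_logfc = rnorm(20),
  deseq_logfc = rnorm(20),
  edger_logfc = rnorm(20))
fake[["log2FC"]] <- fake[["limma_logfc"]] + rnorm(20, sd = 0.5)
## pairwise.complete.obs handles NAs, e.g. those left by 'undefined' entries.
pearson_cors <- cor(fake, use = "pairwise.complete.obs", method = "pearson")
spearman_cors <- cor(fake, use = "pairwise.complete.obs", method = "spearman")
round(pearson_cors["log2FC", ], 3)
```

This loses the p-values and confidence intervals that `cor.test()` reports, but it makes the method-vs-method comparison a one-liner.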
combined_table <- pairwise_tables$data[[1]]
undefined_idx <- combined_table[["log2fc"]] == "undefined"
combined_table[undefined_idx, "log2fc"] <- NA
combined_table[["log2fc"]] <- as.numeric(combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]])
## 
##  Pearson's product-moment correlation
## 
## data:  combined_table[["limma_logfc"]] and combined_table[["log2fc"]]
## t = 35, df = 900, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7272 0.7830
## sample estimates:
##    cor 
## 0.7565
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]], method="spearman")
## Warning in cor.test.default(combined_table[["limma_logfc"]], combined_table[["log2fc"]], :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  combined_table[["limma_logfc"]] and combined_table[["log2fc"]]
## S = 2.9e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##   rho 
## 0.769
## Note: despite the "top_50"/"bottom_50" names, these keep 100 entries each
## (n=100); hence the df=98 in the correlation tests below.
high_to_low_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=TRUE)
low_to_high_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=FALSE)
top_50 <- head(combined_table[high_to_low_idx, ], n=100)
bottom_50 <- head(combined_table[low_to_high_idx, ], n=100)

cor.test(top_50$limma_logfc, top_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  top_50$limma_logfc and top_50$log2fc
## t = 7.1, df = 98, p-value = 2e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4342 0.6978
## sample estimates:
##    cor 
## 0.5811
cor.test(top_50$limma_logfc, top_50$log2fc, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  top_50$limma_logfc and top_50$log2fc
## S = 70000, p-value = 5e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.5802
cor.test(top_50$deseq_logfc, top_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  top_50$deseq_logfc and top_50$log2fc
## t = 5.3, df = 98, p-value = 6e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3074 0.6142
## sample estimates:
##    cor 
## 0.4751
cor.test(top_50$deseq_logfc, top_50$log2fc, method="spearman")
## Warning in cor.test.default(top_50$deseq_logfc, top_50$log2fc, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  top_50$deseq_logfc and top_50$log2fc
## S = 19000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.8865
cor.test(top_50$edger_logfc, top_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  top_50$edger_logfc and top_50$log2fc
## t = 6.5, df = 98, p-value = 3e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3960 0.6734
## sample estimates:
##    cor 
## 0.5497
cor.test(top_50$edger_logfc, top_50$log2fc, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  top_50$edger_logfc and top_50$log2fc
## S = 15000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.9088
cor.test(bottom_50$limma_logfc, bottom_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  bottom_50$limma_logfc and bottom_50$log2fc
## t = 1.8, df = 98, p-value = 0.08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01932  0.36157
## sample estimates:
##    cor 
## 0.1778
cor.test(bottom_50$limma_logfc, bottom_50$log2fc, method="spearman")
## Warning in cor.test.default(bottom_50$limma_logfc, bottom_50$log2fc, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  bottom_50$limma_logfc and bottom_50$log2fc
## S = 130000, p-value = 0.04
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.2049
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  bottom_50$deseq_logfc and bottom_50$log2fc
## t = -0.27, df = 98, p-value = 0.8
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2225  0.1701
## sample estimates:
##      cor 
## -0.02726
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc, method="spearman")
## Warning in cor.test.default(bottom_50$deseq_logfc, bottom_50$log2fc, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  bottom_50$deseq_logfc and bottom_50$log2fc
## S = 150000, p-value = 0.4
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##     rho 
## 0.08359
cor.test(bottom_50$edger_logfc, bottom_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  bottom_50$edger_logfc and bottom_50$log2fc
## t = 0.11, df = 98, p-value = 0.9
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1856  0.2072
## sample estimates:
##     cor 
## 0.01123
cor.test(bottom_50$edger_logfc, bottom_50$log2fc, method="spearman")
## Warning in cor.test.default(bottom_50$edger_logfc, bottom_50$log2fc, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  bottom_50$edger_logfc and bottom_50$log2fc
## S = 140000, p-value = 0.2
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.1402
up_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] > 0]
up_in_msstats <- up_in_msstats[!is.na(up_in_msstats)]
up_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] > 0]
up_in_limma <- up_in_limma[!is.na(up_in_limma)]
up_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] > 0]
up_in_edger <- up_in_edger[!is.na(up_in_edger)]
up_venn_sets <- list(
  "msstats" = up_in_msstats,
  "limma" = up_in_limma,
  "edger" = up_in_edger)
testing <- Vennerable::Venn(Sets=up_venn_sets)
pp(file="/tmp/up_venn.png")
## Going to write the image to: /tmp/up_venn.png when dev.off() is called.
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
## png 
##   2
Vennerable::plot(testing, doWeights=FALSE)

down_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] < 0]
down_in_msstats <- down_in_msstats[!is.na(down_in_msstats)]
down_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] < 0]
down_in_limma <- down_in_limma[!is.na(down_in_limma)]
down_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] < 0]
down_in_edger <- down_in_edger[!is.na(down_in_edger)]
down_venn_sets <- list(
  "msstats" = down_in_msstats,
  "limma" = down_in_limma,
  "edger" = down_in_edger)
testing <- Vennerable::Venn(Sets=down_venn_sets)
pp(file="/tmp/down_venn.png")
## Going to write the image to: /tmp/down_venn.png when dev.off() is called.
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
## png 
##   2
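The Venn diagrams above are written out as images, so the raw overlap counts are not shown in the text. Base set operations recover them directly; the toy sets below are hypothetical stand-ins for `up_venn_sets`/`down_venn_sets`:

```r
## Toy three-way sets standing in for the msstats/limma/edger gene lists.
toy_sets <- list(
  msstats = c("a", "b", "c", "d"),
  limma   = c("b", "c", "d", "e"),
  edger   = c("c", "d", "e", "f"))
## Genes called in the same direction by all three methods:
shared_all <- Reduce(intersect, toy_sets)
## Genes unique to MSstats:
only_msstats <- setdiff(toy_sets$msstats, union(toy_sets$limma, toy_sets$edger))
length(shared_all)
length(only_msstats)
```

The same two lines applied to the real sets give the center and the msstats-only region of the Venn diagram as plain numbers.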

6 Repeat with the modified data

hpgl_table <- pairwise_modified_tables$data[[1]]
msstats_table <- msstats_modified_comp$ComparisonResult
rownames(msstats_table) <- gsub(pattern="^1/", replacement="", x=msstats_table$Protein)

merged_table <- merge(hpgl_table, msstats_table, by="row.names")
write.csv(file="images/merged_table_modified.csv", merged_table)
cor.test(merged_table$limma_logfc, merged_table$log2FC)
## 
##  Pearson's product-moment correlation
## 
## data:  merged_table$limma_logfc and merged_table$log2FC
## t = 36, df = 910, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375 0.7914
## sample estimates:
##    cor 
## 0.7658
cor.test(merged_table$limma_logfc, merged_table$log2FC, method="spearman")
## Warning in cor.test.default(merged_table$limma_logfc, merged_table$log2FC, : Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  merged_table$limma_logfc and merged_table$log2FC
## S = 2.8e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.7768
cor.test(merged_table$deseq_logfc, merged_table$log2FC)
## 
##  Pearson's product-moment correlation
## 
## data:  merged_table$deseq_logfc and merged_table$log2FC
## t = 18, df = 910, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4648 0.5606
## sample estimates:
##    cor 
## 0.5143
cor.test(merged_table$deseq_logfc, merged_table$log2FC, method="spearman")
## Warning in cor.test.default(merged_table$deseq_logfc, merged_table$log2FC, : Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  merged_table$deseq_logfc and merged_table$log2FC
## S = 3.8e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.6923
cor.test(merged_table$edger_logfc, merged_table$log2FC)
## 
##  Pearson's product-moment correlation
## 
## data:  merged_table$edger_logfc and merged_table$log2FC
## t = 18, df = 910, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4744 0.5690
## sample estimates:
##    cor 
## 0.5233
cor.test(merged_table$edger_logfc, merged_table$log2FC, method="spearman")
## Warning in cor.test.default(merged_table$edger_logfc, merged_table$log2FC, : Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  merged_table$edger_logfc and merged_table$log2FC
## S = 3.3e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.7357
combined_table <- pairwise_modified_tables$data[[1]]
undefined_idx <- combined_table[["log2fc"]] == "undefined"
combined_table[undefined_idx, "log2fc"] <- NA
combined_table[["log2fc"]] <- as.numeric(combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]])
## 
##  Pearson's product-moment correlation
## 
## data:  combined_table[["limma_logfc"]] and combined_table[["log2fc"]]
## t = 35, df = 900, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7272 0.7830
## sample estimates:
##    cor 
## 0.7565
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]], method="spearman")
## Warning in cor.test.default(combined_table[["limma_logfc"]], combined_table[["log2fc"]], :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  combined_table[["limma_logfc"]] and combined_table[["log2fc"]]
## S = 2.9e+07, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##   rho 
## 0.769
## Note: despite the "top_50"/"bottom_50" names, these keep 100 entries each
## (n=100); hence the df=98 in the correlation tests below.
high_to_low_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=TRUE)
low_to_high_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=FALSE)
top_50 <- head(combined_table[high_to_low_idx, ], n=100)
bottom_50 <- head(combined_table[low_to_high_idx, ], n=100)

cor.test(top_50$limma_logfc, top_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  top_50$limma_logfc and top_50$log2fc
## t = 7.1, df = 98, p-value = 2e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4342 0.6978
## sample estimates:
##    cor 
## 0.5811
cor.test(top_50$limma_logfc, top_50$log2fc, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  top_50$limma_logfc and top_50$log2fc
## S = 70000, p-value = 5e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.5802
cor.test(top_50$deseq_logfc, top_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  top_50$deseq_logfc and top_50$log2fc
## t = 5.3, df = 98, p-value = 6e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3074 0.6142
## sample estimates:
##    cor 
## 0.4751
cor.test(top_50$deseq_logfc, top_50$log2fc, method="spearman")
## Warning in cor.test.default(top_50$deseq_logfc, top_50$log2fc, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  top_50$deseq_logfc and top_50$log2fc
## S = 19000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.8865
cor.test(top_50$edger_logfc, top_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  top_50$edger_logfc and top_50$log2fc
## t = 6.5, df = 98, p-value = 3e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3960 0.6734
## sample estimates:
##    cor 
## 0.5497
cor.test(top_50$edger_logfc, top_50$log2fc, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  top_50$edger_logfc and top_50$log2fc
## S = 15000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.9088
cor.test(bottom_50$limma_logfc, bottom_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  bottom_50$limma_logfc and bottom_50$log2fc
## t = 1.8, df = 98, p-value = 0.08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01932  0.36157
## sample estimates:
##    cor 
## 0.1778
cor.test(bottom_50$limma_logfc, bottom_50$log2fc, method="spearman")
## Warning in cor.test.default(bottom_50$limma_logfc, bottom_50$log2fc, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  bottom_50$limma_logfc and bottom_50$log2fc
## S = 130000, p-value = 0.04
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.2049
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  bottom_50$deseq_logfc and bottom_50$log2fc
## t = -0.27, df = 98, p-value = 0.8
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2225  0.1701
## sample estimates:
##      cor 
## -0.02726
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc, method="spearman")
## Warning in cor.test.default(bottom_50$deseq_logfc, bottom_50$log2fc, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  bottom_50$deseq_logfc and bottom_50$log2fc
## S = 150000, p-value = 0.4
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##     rho 
## 0.08359
cor.test(bottom_50$edger_logfc, bottom_50$log2fc)
## 
##  Pearson's product-moment correlation
## 
## data:  bottom_50$edger_logfc and bottom_50$log2fc
## t = 0.11, df = 98, p-value = 0.9
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1856  0.2072
## sample estimates:
##     cor 
## 0.01123
cor.test(bottom_50$edger_logfc, bottom_50$log2fc, method="spearman")
## Warning in cor.test.default(bottom_50$edger_logfc, bottom_50$log2fc, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  bottom_50$edger_logfc and bottom_50$log2fc
## S = 140000, p-value = 0.2
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.1402
up_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] > 0]
up_in_msstats <- up_in_msstats[!is.na(up_in_msstats)]
up_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] > 0]
up_in_limma <- up_in_limma[!is.na(up_in_limma)]
up_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] > 0]
up_in_edger <- up_in_edger[!is.na(up_in_edger)]
up_venn_sets <- list(
  "msstats" = up_in_msstats,
  "limma" = up_in_limma,
  "edger" = up_in_edger)
testing <- Vennerable::Venn(Sets=up_venn_sets)
pp(file="/tmp/up_venn.png")
## Going to write the image to: /tmp/up_venn.png when dev.off() is called.
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
## png 
##   2
Vennerable::plot(testing, doWeights=FALSE)

down_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] < 0]
down_in_msstats <- down_in_msstats[!is.na(down_in_msstats)]
down_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] < 0]
down_in_limma <- down_in_limma[!is.na(down_in_limma)]
down_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] < 0]
down_in_edger <- down_in_edger[!is.na(down_in_edger)]
down_venn_sets <- list(
  "msstats" = down_in_msstats,
  "limma" = down_in_limma,
  "edger" = down_in_edger)
testing <- Vennerable::Venn(Sets=down_venn_sets)
pp(file="/tmp/down_venn.png")
## Going to write the image to: /tmp/down_venn.png when dev.off() is called.
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
## png 
##   2

7 Everything below here might get dropped?

I think I wedged all of the following work into the short block above, so I will stop evaluating the following blocks and check whether that is true.

7.0.1 Expressionset from the TRIC intensity matrix

intensity_mtrx <- read.csv(file="results/tric/HCD_outmatrix.tsv", sep="\t")
## The HCD_outmatrix has columns including: Peptide, Protein,
## A series of Intensity_sample_id columns
## A series of RT_sample_id columns
## A series of score_sample_id columns
## And at the end: RT_mean, RT_std, and pg_pvalue.
## Since SWATH2stats/MSstats uses the intensities, let's get those columns out,
## standardize the column names to match the annotation data, and use them for
## creating an expressionset.
intensity_mtrx[["rownames"]] <- intensity_mtrx[["Protein"]]
intensity_mtrx[["rownames"]] <- gsub(pattern="^[[:digit:]]+\\/",
                                     replacement="", x=intensity_mtrx[["rownames"]])
## Standardize the rownames.  This might be a bad idea, as it keeps every
## peptide from each protein separate and does nothing to sum/median/otherwise
## aggregate them.  But for the purposes of testing out the data I think it is
## ok.
rownames(intensity_mtrx) <- make.names(intensity_mtrx[["rownames"]], unique=TRUE)
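If one did want to collapse peptides to the protein level rather than keeping them separate, `rowsum()` can sum the intensities of peptides sharing a protein ID. A sketch on a toy matrix (the matrix and grouping vector are invented; in the real data the grouping factor would be `intensity_mtrx[["Protein"]]`, taken before `make.names()` uniquifies the names):

```r
## Four toy peptides across three samples, mapping to two proteins.
peptide_mtx <- matrix(1:12, nrow = 4,
                      dimnames = list(paste0("pep", 1:4), paste0("s", 1:3)))
protein_ids <- c("protA", "protA", "protB", "protB")
## rowsum() sums the rows within each protein group, per column.
protein_mtx <- rowsum(peptide_mtx, group = protein_ids)
protein_mtx
```

Summing is only one choice; a median or top-n mean (as aLFQ does later) would need `aggregate()` or a split/apply instead.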

## Now let's get rid of the extraneous text in the column names and simplify them
## to the sample names as referenced in the sample sheet.
all_columns <- colnames(intensity_mtrx)
intense_columns <- grepl(pattern="Intensity", x=all_columns)
intensity_mtrx <- intensity_mtrx[, intense_columns]
all_columns <- colnames(intensity_mtrx)
new_columns <- gsub(pattern="Intensity_", replacement="", x=all_columns)
new_columns <- gsub(pattern="_dia_.*$", replacement="", x=new_columns)
colnames(intensity_mtrx) <- paste0("s", new_columns)
intensity_mtrx[is.na(intensity_mtrx)] <- 0

## Column names in an expressionset are not allowed to start with a number.  I
## have settled on prefixing sample names with 's' to handle this problem.
metadata <- sample_annot
metadata$sampleid <- paste0("s", metadata$sampleid)
colnames(intensity_mtrx) <- gsub(pattern="_vs_HCD", replacement="", x=colnames(intensity_mtrx))
rownames(metadata) <- metadata$sampleid
## Theoretically this is not needed anymore...
intensity_mtrx <- intensity_mtrx[, rownames(metadata)]
## Ok, at this point, we should have all the pieces for a more or less normal expressionset.
dim(intensity_mtrx)
test_expt <- create_expt(metadata=metadata,
                         count_dataframe=intensity_mtrx,
                         gene_info=mtb_annotations)

7.0.2 Play with the hpgltools derived expressionset of peptide intensities

Let's see if anything makes sense in this intensity expressionset.

## First, log2 and normalize the data.
test_norm <- normalize_expt(test_expt, transform="log2", convert="cpm", norm="quantile", filter=TRUE)
test_normbatch <- sm(normalize_expt(test_expt, transform="log2", convert="cpm",
                                    norm="quantile", filter=TRUE, batch="limma"))

test_metrics <- sm(graph_metrics(test_expt))
test_metrics_norm <- sm(graph_metrics(test_norm))
test_metrics_normbatch <- sm(graph_metrics(test_normbatch))

Now let's see what the data look like, assuming I did not do anything horrible to them.

test_metrics$legend
test_metrics$libsize  ## Wow the intensities get ridiculous, what?
test_metrics$nonzero
## Interesting, but nothing which super-jumps out at me
test_metrics$density
## hmm very interesting, the good news is that they all have basically the same
## distribution.  I did not expect the CF samples to have such a strong
## distribution, though.

test_metrics_norm$corheat
## This suggests to me a likely batch effect
test_metrics_norm$disheat
test_metrics$smc
test_metrics$tsneplot
## Yeah, there is some wonkiness between the old/new samples.

test_metrics_normbatch$pcaplot
## Wow, this is after invoking 'removeBatchEffect' from limma.  That suggests
## pretty strongly to me that we should probably not examine the new and old
## data together.  In addition, I am thinking that one whole-cell lysate sample
## might not actually be whole-cell.

7.1 Remove old batch/weirdo sample

kept_testing <- subset_expt(test_expt,
                            subset="bioreplicate == 'mar'")
## The new batch IDs: x,y,z are associated with the 3 runs of these samples, one
## which is 8 m/z, one which is 20 m/z, and one which is a reordered 20 m/z.

test_norm <- normalize_expt(kept_testing, transform="log2", convert="cpm", norm="quantile", filter=TRUE)
test_normbatch <- sm(normalize_expt(kept_testing, transform="log2", convert="cpm",
                                    filter=TRUE, batch="limma"))
new_metrics <- sm(graph_metrics(kept_testing))
new_metrics_norm <- sm(graph_metrics(test_norm))
new_metrics_normbatch <- sm(graph_metrics(test_normbatch))

7.1.1 Plot only new data

Removing the January data seems to have made this a lot easier to look at. There might be some batchyness associated with reordering the samples, but I am thinking it is not huge.

tt <- plot_topn(kept_testing, direct=TRUE)
new_metrics$legend
new_metrics$libsize
pp("images/kept_norm_topn.png", image=new_metrics$topnplot)
pp("images/kept_norm_hcd_pca.png", image=new_metrics_norm$pcaplot)
new_metrics_norm$tsneplot
pp("images/kept_norm_hcd_corheat.png", image=new_metrics_norm$corheat)
new_metrics_norm$disheat
new_metrics_normbatch$pcaplot
pp("images/normbatch_tsne.png", image=new_metrics_normbatch$tsneplot)

7.1.2 How is the variance?

Lets see what variancePartition has to say about this data.

new_varpart <- varpart(kept_testing, factors=c("condition", "batch"))
pp("images/varpart_partition.png", image=new_varpart$partition_plot)
## That is not terrible.

## The modified_expt slot from varpart() adds some metadata to the expressionset,
## including the calculated % variance of condition/batch/residual for each peptide.
kept_testing <- new_varpart$modified_expt

7.2 Back to hpgltools

If you will recall, in the original block where I read the data into SWATH2stats, I invoked write_matrix_proteins() and write_matrix_peptides() and wrote out two csv files, 'swath2stats_protein_matrix.csv' and '_peptide_matrix.csv'. If my earlier work is correct, I should be able to use the protein matrix to get a truer sense of the relative abundances between my experimental conditions.

library(aLFQ)
## alfq_input is assumed to have been created upstream (e.g. via
## SWATH2stats::convert4aLFQ()) and is not shown here.
alfq_process <- ProteinInference(alfq_input,
                                 peptide_method="top",
                                 peptide_topx=3,
                                 peptide_strictness="loose",
                                 peptide_summary="mean",
                                 transition_topx=3,
                                 transition_strictness="loose",
                                 transition_summary="sum",
                                 fasta=NA,
                                 model=NA,
                                 combine_precursors=FALSE)

8 Index version: 20180215

9 TODO

  • 2018-04-10: Make sure my invocations of SWATH2stats/MSstats are correct.
if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 7de4503f6bb5724c28cce24af5dbee22bb1c0cae
## R> packrat::restore()
## This is hpgltools commit: Thu Apr 12 22:08:53 2018 -0400: 7de4503f6bb5724c28cce24af5dbee22bb1c0cae
## Saving to 02_swath2stats-v20180215.rda.xz

R version 3.4.4 (2018-03-15)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: foreach(v.1.4.4), Vennerable(v.3.1.0.9000), edgeR(v.3.20.9), ruv(v.0.9.7), MSstats(v.3.10.2), SWATH2stats(v.1.9.1) and hpgltools(v.2018.03)

loaded via a namespace (and not attached): snow(v.0.4-2), backports(v.1.1.2), Hmisc(v.4.1-1), plyr(v.1.8.4), lazyeval(v.0.2.1), splines(v.3.4.4), BiocParallel(v.1.12.0), GenomeInfoDb(v.1.14.0), ggplot2(v.2.2.1), sva(v.3.26.0), digest(v.0.6.15), BiocInstaller(v.1.28.0), htmltools(v.0.3.6), gdata(v.2.18.0), magrittr(v.1.5), checkmate(v.1.8.5), memoise(v.1.1.0), cluster(v.2.0.6), doParallel(v.1.0.11), openxlsx(v.4.0.17), limma(v.3.34.9), annotate(v.1.56.1), matrixStats(v.0.53.1), colorspace(v.1.3-2), blob(v.1.1.0), ggrepel(v.0.7.0), dplyr(v.0.7.4), RCurl(v.1.95-4.10), graph(v.1.56.0), roxygen2(v.6.0.1), genefilter(v.1.60.0), lme4(v.1.1-15), bindr(v.0.1.1), impute(v.1.52.0), survival(v.2.41-3), iterators(v.1.0.9), glue(v.1.2.0), gtable(v.0.2.0), zlibbioc(v.1.24.0), XVector(v.0.18.0), DelayedArray(v.0.4.1), BiocGenerics(v.0.24.0), DEoptimR(v.1.0-8), scales(v.0.5.0.9000), vsn(v.3.46.0), DBI(v.0.8), Rcpp(v.0.12.16), mzR(v.2.12.0), xtable(v.1.8-2), htmlTable(v.1.11.2), foreign(v.0.8-69), bit(v.1.1-12), preprocessCore(v.1.40.0), Formula(v.1.2-2), stats4(v.3.4.4), htmlwidgets(v.1.0), gplots(v.3.0.1), RColorBrewer(v.1.1-2), acepack(v.1.4.1), pkgconfig(v.2.0.1), XML(v.3.98-1.10), nnet(v.7.3-12), locfit(v.1.5-9.1), labeling(v.0.3), rlang(v.0.2.0.9001), reshape2(v.1.4.3), AnnotationDbi(v.1.40.0), munsell(v.0.4.3), tools(v.3.4.4), RSQLite(v.2.0), devtools(v.1.13.5), evaluate(v.0.10.1), stringr(v.1.3.0), mzID(v.1.16.0), yaml(v.2.1.18), knitr(v.1.20), bit64(v.0.9-7), pander(v.0.6.1), robustbase(v.0.92-8), caTools(v.1.17.1), bindrcpp(v.0.2), RBGL(v.1.54.0), nlme(v.3.1-131.1), xml2(v.1.2.0), compiler(v.3.4.4), rstudioapi(v.0.7), testthat(v.2.0.0), affyio(v.1.48.0), marray(v.1.56.0), tibble(v.1.4.2), geneplotter(v.1.56.0), stringi(v.1.1.7), highr(v.0.6), MSnbase(v.2.4.2), lattice(v.0.20-35), ProtGenerics(v.1.10.0), Matrix(v.1.2-12), commonmark(v.1.4), nloptr(v.1.0.4), pillar(v.1.2.1), MALDIquant(v.1.17), data.table(v.1.10.4-3), bitops(v.1.0-6), corpcor(v.1.6.9), 
GenomicRanges(v.1.30.3), R6(v.2.2.2), latticeExtra(v.0.6-28), pcaMethods(v.1.70.0), directlabels(v.2017.03.31), affy(v.1.56.0), KernSmooth(v.2.23-15), gridExtra(v.2.3), IRanges(v.2.12.0), codetools(v.0.2-15), MASS(v.7.3-49), gtools(v.3.5.0), assertthat(v.0.2.0), SummarizedExperiment(v.1.8.1), DESeq2(v.1.18.1), rprojroot(v.1.3-2), minpack.lm(v.1.2-1), withr(v.2.1.2), S4Vectors(v.0.16.0), GenomeInfoDbData(v.1.0.0), mgcv(v.1.8-23), parallel(v.3.4.4), doSNOW(v.1.0.16), quadprog(v.1.5-5), grid(v.3.4.4), rpart(v.4.1-13), minqa(v.1.2.4), rmarkdown(v.1.9), logging(v.0.7-103), Rtsne(v.0.13), Biobase(v.2.38.0) and base64enc(v.0.1-3)

---
title: "M.tuberculosis 2018: Analyzing data from OpenSwathWorkFlow/TRIC."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
if (!isTRUE(get0("skip_load"))) {
  library(hpgltools)
  tt <- devtools::load_all("~/hpgltools")
  knitr::opts_knit$set(progress=TRUE,
                       verbose=TRUE,
                       width=90,
                       echo=TRUE)
  knitr::opts_chunk$set(error=TRUE,
                        fig.width=8,
                        fig.height=8,
                        dpi=96)
  old_options <- options(digits=4,
                         stringsAsFactors=FALSE,
                         knitr.duplicate.label="allow")
  ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
  ver <- "20180215"
  previous_file <- "01_preprocessing_comet_highres.Rmd"

  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))
  previous_file <- "01_annotation.Rmd"
  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))

  rmd_file <- "02_swath2stats.Rmd"
}
```

# TODO

* Remove MSstats logging. -- done
* Make explicit spearman correlations between methods. -- done
    * Do both for all data and for the top-50/bottom-50 -- done, but weird.
* Make 100% certain that the samples are annotated correctly. -- done.
* Limma/MSstats/EdgeR venn diagram for all up in CF
    * Repeat in all down CF
    * Repeat with a cutoff, top-50
* Individual plots for P/PE proteins.

Analyzing data from OpenMS and friends.
=======================================

In preprocessing_comet_highres.Rmd, I used the OpenMS tutorials and supplemental
materials from a couple papers to perform (hopefully correctly) the various
preprocessing tasks required to extract intensity data from DIA/SWATH
transitions.

The final steps of that process combined the transition intensities from every
sample into a metadata file (results/tric/HCD_meta.tsv), an intensity matrix
(results/tric/HCD_outmatrix.tsv), and a feature aligned output matrix
(results/tric/aligned_comet_HCD.tsv).

My reading of the SWATH2stats and MSstats source code suggests to me that the
log2(intensities) of the feature aligned data are our final proxy for protein
abundance.  At first glance, this suggests to me that these data might follow a
distribution similar to RNASeq data (negative binomial, but perhaps with a
bigger tail?).  In addition, by the time we use tric on the data, we have a
count matrix and sample annotation data frames which look remarkably similar to
those used in an RNASeq expressionset.  Indeed, by the end of the MSstats
processing, it creates an MSnSet class of its own which uses fData/exprs/pData.
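To make that analogy concrete, here is a minimal sketch of how the same three
pieces (exprs/pData/fData) assemble into a Biobase ExpressionSet; the protein
names and intensity values here are entirely hypothetical.

```{r expressionset_sketch, eval=FALSE}
library(Biobase)
## Hypothetical toy data: 3 proteins x 2 samples.
intensities <- matrix(2 ^ rnorm(6, mean=20), nrow=3,
                      dimnames=list(paste0("prot", 1:3), c("s01", "s02")))
## pData: one row per sample; fData: one row per feature.
pdata <- data.frame(condition=c("wt_cf", "wt_whole"),
                    row.names=colnames(intensities))
fdata <- data.frame(description=paste("protein", 1:3),
                    row.names=rownames(intensities))
toy_eset <- ExpressionSet(assayData=log2(intensities),
                          phenoData=AnnotatedDataFrame(pdata),
                          featureData=AnnotatedDataFrame(fdata))
exprs(toy_eset)  ## the log2 abundance proxy
pData(toy_eset)  ## the sample annotations
```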

For the curious, my reasoning for saying that the log intensities are our proxy
for abundance comes from MSstats/R/DataProcess.R in a clause which looks like:

```{r snippet, eval=FALSE}
if (logTrans == 2) {
  work[["ABUNDANCE"]] <- log2(work[["ABUNDANCE"]])
} else if (logTrans == 10) {
  work[["ABUNDANCE"]] <- log10(work[["ABUNDANCE"]])
} else {
  ## Above there was a check for only log 2 and 10, but we can do e if we want.
  ## I might go back up there and remove that check. Long live e! 2.718282 rules!
  work[["ABUNDANCE"]] <- log(work[["ABUNDANCE"]]) / log(logTrans)
}
```

(Note: I added the natural log to the set of conditions, but otherwise the logic
is unchanged.)
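As a quick sanity check, the change-of-base identity in that else clause
(log(x)/log(b) gives log base b of x) reproduces the explicit log2/log10
branches, so the generalization is safe:

```{r logtrans_check, eval=FALSE}
x <- c(4, 1024)
all.equal(log(x) / log(2), log2(x))      ## matches the logTrans == 2 branch
all.equal(log(x) / log(10), log10(x))    ## matches the logTrans == 10 branch
all.equal(log(x) / log(exp(1)), log(x))  ## the added natural-log case
```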

With that in mind, I want to use some tools with which I am familiar in order to
try to understand these data.  Therefore I will first attempt to coerce my tric
aligned data and annotations into a 'normal' expressionset.  Then I want to do
some diagnostic plots which, if I am wrong and these distributions are not as
expected, will be conceptually incorrect (I don't yet think I am wrong).

## Sample annotation via SWATH2stats

I am using the SWATH2stats vignette as my primary source of information.  It
uses OpenSWATH_SM3_GoldStandardAutomatedResults_human_peakgroups.txt, which has
a format nearly identical to my tric output matrix.  For the moment, then, I
will assume that the proper input for SWATH2stats is
'results/tric/comet_HCD.tsv' and not the metadata nor the output matrix.

I keep a sample sheet of all the DIA samples used in this analysis in
'sample_sheets/dia_samples.xlsx'.  It should contain all the other required
data, with one important caveat: I removed 1 sample by 'commenting' it
(prefixing it with '##', which is an admittedly dumb thing to do in an excel
file).

One last caveat: I hacked the SWATH2stats sample_annotation() function to add a
couple columns in an attempt to make it a little more robust when faced with
sample sheets with differently named columns.

In addition, SWATH2stats provides some nice filtering and combination
functions which should be considered when generating various expressionset data
structures later.

```{r swath2stats_sample_annotations}
tric_data <- read.csv("results/tric/aligned_comet_HCD.tsv", sep="\t")

sample_annot <- openxlsx::read.xlsx("sample_sheets/dia_samples.xlsx")
rownames(sample_annot) <- sample_annot[["sampleid"]]
## Drop samples starting with comments
keep_idx <- ! grepl(pattern="##", x=rownames(sample_annot))
sample_annot <- sample_annot[keep_idx, ]
## Set the mzXML column to match the filename column in the data.
devtools::load_all("~/scratch/git/SWATH2stats")
## s2s, my witty way of shortening SWATH2stats...
s2s_exp <- SWATH2stats::sample_annotation(data=tric_data,
                                          sample_annotation=sample_annot,
                                          data_type="OpenSWATH",
                                          data_file_column="filename")
############
## IMPORTANT CAVEAT
############
## HEY YOU, TREY, READ THIS OR YOU WILL BE PISSED LATER
############
## I am keeping only those rows which are from the march run.
dim(s2s_exp)
kept_rows <- s2s_exp[["BioReplicate"]] == "mar"
s2s_exp <- s2s_exp[kept_rows, ]
dim(s2s_exp)
## Wow, that dropped 46% of the data
```

Now I have a couple data structures which should prove useful for the metrics
provided by SWATH2stats, MSstats, and my own hpgltools.

# SWATH2stats continued

Let's return to some of the metrics provided by SWATH2stats.

```{r swath2stats_processing}
## Get correlations on a sample by sample basis
pp(file="images/swath2stats_sample_cor.png")
sample_cor <- SWATH2stats::plot_correlation_between_samples(s2s_exp,
                                                            Comparison=transition_group_id ~ Run,
                                                            fun.aggregate=sum,
                                                            column.values="Intensity")
dev.off()
sample_cond_rep_cor <- SWATH2stats::plot_correlation_between_samples(s2s_exp,
                                                                     Comparison=transition_group_id ~
                                                                       Condition + BioReplicate + Run,
                                                                     fun.aggregate=sum,
                                                                     column.values="Intensity")
## I am a little concerned that these values do not seem to change when I took
## filtered/normalized data.  So I am rerunning them manually for a moment --
## perhaps I messed something up when I rewrote portions of the
## sample_annotation() function in SWATH2stats.

## ahh I think I see the problem.  The default value for fun.aggregate is NULL,
## which causes dcast to default to length.  I think this is not likely to be
## valid for this data.  I am not certain, however, what is the appropriate
## function.  If I had to guess, I would go with sum()?

assess_decoy_rate(s2s_exp)
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(s2s_exp, output="Rconsole")

byrun_fdr <- assess_fdr_byrun(s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
mscore4assayfdr(s2s_exp, FFT=0.7, fdr_target=0.02)
prot_score <- mscore4protfdr(s2s_exp, FFT=0.7, fdr_target=0.02)

mscore_filtered <- SWATH2stats::filter_mscore(s2s_exp, prot_score)
data_filtered_mscore <- filter_mscore_freqobs(s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
data_filtered_fdr <- filter_mscore_fdr(s2s_exp, FFT=0.7,
                                       overall_protein_fdr_target=0.03,
                                       upper_overall_peptide_fdr_limit=0.05)
only_proteotypic <- filter_proteotypic_peptides(data_filtered_fdr)
all_filtered <- filter_all_peptides(data_filtered_fdr)
only_strong <- filter_on_max_peptides(data_filtered_fdr, 5)
only_minimum <- filter_on_min_peptides(only_strong, 2)

## I think these matrices are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rewriting the sample column names.
protein_matrix_mscore <- write_matrix_proteins(mscore_filtered, write.csv=TRUE,
                                               filename="swath2stats_protein_matrix_mscore.csv",
                                               rm.decoy=FALSE)
dim(protein_matrix_mscore)
peptide_matrix_mscore <- write_matrix_peptides(mscore_filtered, write.csv=TRUE,
                                               filename="swath2stats_peptide_matrix_mscore.csv",
                                               rm.decoy=FALSE)
protein_matrix_minimum <- write_matrix_proteins(only_minimum, write.csv=TRUE,
                                                filename="swath2stats_protein_matrix_minimum.csv",
                                                rm.decoy=FALSE)
dim(protein_matrix_minimum)
peptide_matrix_minimum <- write_matrix_peptides(only_minimum, write.csv=TRUE,
                                                filename="swath2stats_peptide_matrix_minimum.csv",
                                                rm.decoy=FALSE)

SWATH2stats::plot_correlation_between_samples(s2s_exp,
                                              column.values="RT",
                                              fun.aggregate=sum)
## I have no effing clue what this plot means.
variation <- SWATH2stats::plot_variation(s2s_exp, fun.aggregate=sum,
                                         Comparison=transition_group_id ~ Condition)

## Something in SWATH2stats::disaggregate was written poorly and is looking for
## a variable named 'cols'
cols <- colnames(only_minimum)
disaggregated <- SWATH2stats::disaggregate(only_minimum, all.columns=TRUE)
msstats_input <- SWATH2stats::convert4MSstats(disaggregated)

alfq_input <- sm(SWATH2stats::convert4aLFQ(disaggregated))
mapdia_input <- sm(SWATH2stats::convert4mapDIA(disaggregated, RT=TRUE))
```
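To illustrate the fun.aggregate concern in the comments above, here is a toy
example (hypothetical data): with duplicate transition/run pairs, dcast() given
fun.aggregate=NULL falls back to length() and silently returns counts rather
than intensities.

```{r dcast_aggregate_sketch, eval=FALSE}
library(reshape2)
toy <- data.frame(transition_group_id=c("tg1", "tg1", "tg2"),
                  Run=c("r1", "r1", "r1"),
                  Intensity=c(10, 30, 5))
## Default aggregation counts the duplicated tg1 rows (2) instead of summing.
dcast(toy, transition_group_id ~ Run, value.var="Intensity")
## With an explicit aggregate, tg1 becomes 40 and tg2 stays 5.
dcast(toy, transition_group_id ~ Run, value.var="Intensity", fun.aggregate=sum)
```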

## MSstats

MSstats (msstats.org) seems to provide a complete solution for computing reasonable metrics from these data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

```{r msstats_quant}
devtools::load_all("~/scratch/git/MSstats")
msstats_quant <- sm(MSstats::dataProcess(msstats_input))

msstats_plots <- sm(MSstats::dataProcessPlots(msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(msstats_input$Condition))
my_levels

comparison <- ghetto_contrast_matrix("wt_cf", "wt_whole")
msstats_comparison <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                  data=msstats_quant))
```

## Do the same comparison with the modified data.

```{r msstats_quant_modified}
## I want to test a hypothesis about the effect of 0/NA on the data, so I am replacing them with
## 2, which should give me a minimal value for all the log2fc calculations to work with.
msstats_modified <- msstats_quant
abundances <- msstats_quant$ProcessedData[["ABUNDANCE"]]
na_idx <- is.na(abundances)
msstats_modified$ProcessedData[na_idx, "ABUNDANCE"] <- 2
msstats_modified_comp <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                                     data=msstats_modified))
```

### P/PE protein QC plots for Yan

Yan asked for the p/pe protein qc plots. ok.  I changed the dataProcessPlots to
return something useful, so that should be possible now.

```{r pe}
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]

## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="", x=levels(msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots

pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in seq_along(pe_plots)) {
  plot(pe_plots[[p]])
}
dev.off()
```

# Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities
from SWATH2stats along with the metadata and annotation data, attempt to create
a 'normal' expressionset, and poke at it to see what I can learn.

## Massaging the metadata

I want to use the same metadata as were used for MSstats.  It has one important
difference from the requirements of hpgltools: I do not allow
rownames/sampleIDs to start with a number.

```{r protein_peptide_matrices}
metadata <- sample_annot
metadata[["sampleid"]] <- paste0("s", metadata[["sampleid"]])
rownames(metadata) <- metadata[["sampleid"]]
```

## Massaging the gene annotation data and adding the msstats data.

I have my own annotation data from the gff file/microbesonline/whatever, I can
add the MSstats result to it so that later I can print them all together.

```{r msstats_annotations}
## Here is a neat little thing I can do:  Add the MSstats results to my annotation data.
## Then when I print out the tables of the limma/etc results, they MSstats
## results will come along for free.
msstats_table <- msstats_comparison[["ComparisonResult"]]
rownames(msstats_table) <- gsub(pattern="^1\\/", replacement="",
                                x=msstats_table[["Protein"]])
mtb_annotations_with_msstats <- merge(mtb_annotations, msstats_table,
                                      by="row.names", all.x=TRUE)
rownames(mtb_annotations_with_msstats) <- mtb_annotations_with_msstats[["Row.names"]]
mtb_annotations_with_msstats <- mtb_annotations_with_msstats[, -1]

msstats_table_modified <- msstats_modified_comp[["ComparisonResult"]]
rownames(msstats_table_modified) <- gsub(pattern="^1\\/", replacement="",
                                         x=msstats_table_modified[["Protein"]])
mtb_annotations_with_modified_msstats <- merge(mtb_annotations, msstats_table_modified,
                                               by="row.names", all.x=TRUE)
rownames(mtb_annotations_with_modified_msstats) <- mtb_annotations_with_modified_msstats[["Row.names"]]
mtb_annotations_with_modified_msstats <- mtb_annotations_with_modified_msstats[, -1]
```

## Massaging the intensity matrix

I do not want the '1/' prefix before the protein names; I already merged them
into one entry per gene via SWATH2stats.

```{r protein_matrix}
prot_mtrx <- read.csv("swath2stats_protein_matrix_mscore.csv")
rownames(prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=prot_mtrx[["ProteinName"]])
prot_mtrx <- prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
sample_ids <- gsub(pattern="^.*_(2018.*$)", replacement="\\1", x=colnames(prot_mtrx))
colnames(prot_mtrx) <- paste0("s", sample_ids)

decoy_idx <- ! grepl(pattern="DECOY", x=rownames(prot_mtrx))
prot_mtrx <- prot_mtrx[decoy_idx, ]

prot_mtrx_modified <- prot_mtrx
zero_idx <- prot_mtrx == 0
prot_mtrx_modified[zero_idx] <- 2
```

## Merge the pieces

Now we should have sufficient pieces to make an expressionset.

```{r expt}
## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(prot_mtrx)
metadata <- metadata[reordered, ]

protein_expt <- create_expt(metadata,
                            count_dataframe=prot_mtrx,
                            gene_info=mtb_annotations_with_msstats)

protein_modified_expt <- create_expt(metadata,
                                     count_dataframe=prot_mtrx_modified,
                                     gene_info=mtb_annotations_with_modified_msstats)
```

```{r protein_metrics, fig.show='hide'}
protein_metrics <- sm(graph_metrics(protein_expt))
protein_norm <- sm(normalize_expt(protein_expt, transform="log2", convert="cpm",
                                  norm="quant", filter=TRUE))
protein_norm_metrics <- sm(graph_metrics(protein_norm))
```

```{r print_metrics}
protein_metrics$libsize
protein_norm_metrics$pcaplot
```

# Attempt some quantification comparisons?

```{r de_testing}
pairwise_filt <- sm(normalize_expt(protein_expt, filter=TRUE))
pairwise_comp <- all_pairwise(pairwise_filt, parallel=FALSE, force=TRUE)

keepers <- list("filtrate_vs_cells" = c("wt_cf", "wt_whole"))
pairwise_tables <- combine_de_tables(pairwise_comp,
                                     excel="excel/test_pairwise_de_with_msstats.xlsx",
                                     keepers=keepers)
droppers <- c("undefined")
names(droppers) <- "log2fc"
pairwise_onlyall <- combine_de_tables(pairwise_comp,
                                      excel="excel/test_pairwise_de_only_msstats.xlsx",
                                      keepers=keepers,
                                      excludes=droppers)
pairwise_sig <- sm(extract_significant_genes(pairwise_tables,
                                             excel="excel/test_pairwise_sig_with_msstats.xlsx"))

solo_proteins <- features_in_single_condition(protein_expt)
proteins_only_cf <- solo_proteins[["solo_this"]][["wt_cf"]]
proteins_only_cf
proteins_only_whole <- solo_proteins[["solo_this"]][["wt_whole"]]
length(proteins_only_whole)
```

```{r de_testing_modified}
pairwise_modified_comp <- all_pairwise(protein_modified_expt, parallel=FALSE, force=TRUE)

pairwise_modified_tables <- combine_de_tables(pairwise_modified_comp,
                                              excel="excel/test_pairwise_modified_de_with_msstats.xlsx",
                                              keepers=keepers)
droppers <- c("undefined")
names(droppers) <- "log2fc"
pairwise_modified_onlyall <- combine_de_tables(pairwise_modified_comp,
                                               excel="excel/test_pairwise_de_only_msstats.xlsx",
                                               keepers=keepers,
                                               excludes=droppers)
pairwise_modified_sig <- sm(extract_significant_genes(
  pairwise_modified_tables,
  excel="excel/test_pairwise_modified_sig_with_msstats.xlsx"))
```

## Compare hpgltools/MSstats

Can we compare limma and friends to MSstats?

```{r compare_comparisons}
hpgl_table <- pairwise_tables$data[[1]]
msstats_table <- msstats_comparison$ComparisonResult
rownames(msstats_table) <- gsub(pattern="^1/", replacement="", x=msstats_table$Protein)

merged_table <- merge(hpgl_table, msstats_table, by="row.names")
write.csv(file="images/merged_table.csv", merged_table)
cor.test(merged_table$limma_logfc, merged_table$log2FC)
cor.test(merged_table$limma_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$deseq_logfc, merged_table$log2FC)
cor.test(merged_table$deseq_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$edger_logfc, merged_table$log2FC)
cor.test(merged_table$edger_logfc, merged_table$log2FC, method="spearman")

combined_table <- pairwise_tables$data[[1]]
undefined_idx <- combined_table[["log2fc"]] == "undefined"
combined_table[undefined_idx, "log2fc"] <- NA
combined_table[["log2fc"]] <- as.numeric(combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]], method="spearman")

high_to_low_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=TRUE)
low_to_high_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=FALSE)
top_50 <- head(combined_table[high_to_low_idx, ], n=50)
bottom_50 <- head(combined_table[low_to_high_idx, ], n=50)

cor.test(top_50$limma_logfc, top_50$log2fc)
cor.test(top_50$limma_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$deseq_logfc, top_50$log2fc)
cor.test(top_50$deseq_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$edger_logfc, top_50$log2fc)
cor.test(top_50$edger_logfc, top_50$log2fc, method="spearman")

cor.test(bottom_50$limma_logfc, bottom_50$log2fc)
cor.test(bottom_50$limma_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc)
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$edger_logfc, bottom_50$log2fc)
cor.test(bottom_50$edger_logfc, bottom_50$log2fc, method="spearman")

up_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] > 0]
up_in_msstats <- up_in_msstats[!is.na(up_in_msstats)]
up_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] > 0]
up_in_limma <- up_in_limma[!is.na(up_in_limma)]
up_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] > 0]
up_in_edger <- up_in_edger[!is.na(up_in_edger)]
up_venn_sets <- list(
  "msstats" = up_in_msstats,
  "limma" = up_in_limma,
  "edger" = up_in_edger)
testing <- Vennerable::Venn(Sets=up_venn_sets)
pp(file="/tmp/up_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
Vennerable::plot(testing, doWeights=FALSE)

down_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] < 0]
down_in_msstats <- down_in_msstats[!is.na(down_in_msstats)]
down_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] < 0]
down_in_limma <- down_in_limma[!is.na(down_in_limma)]
down_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] < 0]
down_in_edger <- down_in_edger[!is.na(down_in_edger)]
down_venn_sets <- list(
  "msstats" = down_in_msstats,
  "limma" = down_in_limma,
  "edger" = down_in_edger)
testing <- Vennerable::Venn(Sets=down_venn_sets)
pp(file="/tmp/down_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
```

# Repeat with the modified data

```{r compare_comparisons_modified}
hpgl_table <- pairwise_modified_tables$data[[1]]
msstats_table <- msstats_modified_comp$ComparisonResult
rownames(msstats_table) <- gsub(pattern="^1/", replacement="", x=msstats_table$Protein)

merged_table <- merge(hpgl_table, msstats_table, by="row.names")
write.csv(file="images/merged_table_modified.csv", merged_table)
cor.test(merged_table$limma_logfc, merged_table$log2FC)
cor.test(merged_table$limma_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$deseq_logfc, merged_table$log2FC)
cor.test(merged_table$deseq_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$edger_logfc, merged_table$log2FC)
cor.test(merged_table$edger_logfc, merged_table$log2FC, method="spearman")

combined_table <- pairwise_modified_tables$data[[1]]
undefined_idx <- combined_table[["log2fc"]] == "undefined"
combined_table[undefined_idx, "log2fc"] <- NA
combined_table[["log2fc"]] <- as.numeric(combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]], method="spearman")

high_to_low_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=TRUE)
low_to_high_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=FALSE)
top_50 <- head(combined_table[high_to_low_idx, ], n=50)
bottom_50 <- head(combined_table[low_to_high_idx, ], n=50)

cor.test(top_50$limma_logfc, top_50$log2fc)
cor.test(top_50$limma_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$deseq_logfc, top_50$log2fc)
cor.test(top_50$deseq_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$edger_logfc, top_50$log2fc)
cor.test(top_50$edger_logfc, top_50$log2fc, method="spearman")

cor.test(bottom_50$limma_logfc, bottom_50$log2fc)
cor.test(bottom_50$limma_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc)
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$edger_logfc, bottom_50$log2fc)
cor.test(bottom_50$edger_logfc, bottom_50$log2fc, method="spearman")

up_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] > 0]
up_in_msstats <- up_in_msstats[!is.na(up_in_msstats)]
up_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] > 0]
up_in_limma <- up_in_limma[!is.na(up_in_limma)]
up_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] > 0]
up_in_edger <- up_in_edger[!is.na(up_in_edger)]
up_venn_sets <- list(
  "msstats" = up_in_msstats,
  "limma" = up_in_limma,
  "edger" = up_in_edger)
testing <- Vennerable::Venn(Sets=up_venn_sets)
pp(file="/tmp/up_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
Vennerable::plot(testing, doWeights=FALSE)

down_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] < 0]
down_in_msstats <- down_in_msstats[!is.na(down_in_msstats)]
down_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] < 0]
down_in_limma <- down_in_limma[!is.na(down_in_limma)]
down_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] < 0]
down_in_edger <- down_in_edger[!is.na(down_in_edger)]
down_venn_sets <- list(
  "msstats" = down_in_msstats,
  "limma" = down_in_limma,
  "edger" = down_in_edger)
testing <- Vennerable::Venn(Sets=down_venn_sets)
pp(file="/tmp/down_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
```

# Everything below here might get dropped?

I think I wedged all of the following work into that short block above.
So, stop evaluating the following blocks and see if that is true.

### Expressionset from the TRIC intensity matrix

```{r hpgltools_expt, eval=FALSE}
intensity_mtrx <- read.csv(file="results/tric/HCD_outmatrix.tsv", sep="\t")
## The HCD_outmatrix has columns including: Peptide, Protein,
## A series of Intensity_sample_id columns
## A series of RT_sample_id columns
## A series of score_sample_id columns
## And at the end: RT_mean, RT_std, and pg_pvalue.
## Since SWATH2stats/MSstats uses the intensities, lets get those columns out,
## standardize the column names to match the annotation data, and use them for
## creating an expressionset.
intensity_mtrx[["rownames"]] <- intensity_mtrx[["Protein"]]
intensity_mtrx[["rownames"]] <- gsub(pattern="^[[:digit:]]+\\/",
                                     replacement="", x=intensity_mtrx[["rownames"]])
## Standardize the rownames, this might be a bad idea, as this will keep
## separate every peptide from each protein and not do anything to
## sum/median/whatever them.  But for the purposes of testing out the data I
## think it is ok.
rownames(intensity_mtrx) <- make.names(intensity_mtrx[["rownames"]], unique=TRUE)

## Now lets get rid of the extraneous text in the column names and simplify them
## to the sample names as referenced in the sample sheet.
all_columns <- colnames(intensity_mtrx)
intense_columns <- grepl(pattern="Intensity", x=all_columns)
intensity_mtrx <- intensity_mtrx[, intense_columns]
all_columns <- colnames(intensity_mtrx)
new_columns <- gsub(pattern="Intensity_", replacement="", x=all_columns)
new_columns <- gsub(pattern="_dia_.*$", replacement="", x=new_columns)
colnames(intensity_mtrx) <- paste0("s", new_columns)
intensity_mtrx[is.na(intensity_mtrx)] <- 0

## No columns in an expression set are allowed to start with a number.  I have
## used prefixing samples with 's' as a standard to handle this problem...
metadata <- sample_annot
metadata$sampleid <- paste0("s", metadata$sampleid)
colnames(intensity_mtrx) <- gsub(pattern="_vs_HCD", replacement="", x=colnames(intensity_mtrx))
rownames(metadata) <- metadata$sampleid
## Theoretically this is not needed anymore...
intensity_mtrx <- intensity_mtrx[, rownames(metadata)]
## Ok, at this point, we should have all the pieces for a more or less normal expressionset.
dim(intensity_mtrx)
test_expt <- create_expt(metadata=metadata,
                         count_dataframe=intensity_mtrx,
                         gene_info=mtb_annotations)
```

### Play with the hpgltools derived expressionset of peptide intensities

Let's see if anything makes sense in this intensity expressionset.

```{r hpgltools_expt_examine, fig.show="hide", eval=FALSE}
## First, log2 and normalize the data.
test_norm <- normalize_expt(test_expt, transform="log2", convert="cpm", norm="quantile", filter=TRUE)
test_normbatch <- sm(normalize_expt(test_expt, transform="log2", convert="cpm",
                                    norm="quantile", filter=TRUE, batch="limma"))

test_metrics <- sm(graph_metrics(test_expt))
test_metrics_norm <- sm(graph_metrics(test_norm))
test_metrics_normbatch <- sm(graph_metrics(test_normbatch))
```

Now let's see what the data look like, assuming I did not do anything horrible
to them.

```{r hpgltools_plots, eval=FALSE}
test_metrics$legend
test_metrics$libsize  ## Wow the intensities get ridiculous, what?
test_metrics$nonzero
## Interesting, but nothing which super-jumps out at me
test_metrics$density
## hmm very interesting, the good news is that they all have basically the same
## distribution.  I did not expect the CF samples to have such a stronger
## distribution though.

test_metrics_norm$corheat
## This suggests to me a likely batch effect
test_metrics_norm$disheat
test_metrics$smc
test_metrics$tsneplot
## Yeah there is some wonkyness between the old/new samples.

test_metrics_normbatch$pcaplot
## Wow, this is after invoking 'removeBatchEffect' from limma.  That suggests
## pretty strongly to me that we should probably not examine the new and old
## data together.  In addition I am thinking that one whole-cell lysate sample
## might not actually be whole-cell.
```

## Remove old batch/weirdo sample

```{r removals, fig.show="hide", eval=FALSE}
kept_testing <- subset_expt(test_expt,
                            subset="bioreplicate == 'mar'")
## The new batch IDs: x,y,z are associated with the 3 runs of these samples, one
## which is 8 m/z, one which is 20 m/z, and one which is a reordered 20 m/z.

test_norm <- normalize_expt(kept_testing, transform="log2", convert="cpm", norm="quantile", filter=TRUE)
test_normbatch <- sm(normalize_expt(kept_testing, transform="log2", convert="cpm",
                                    filter=TRUE, batch="limma"))
new_metrics <- sm(graph_metrics(kept_testing))
new_metrics_norm <- sm(graph_metrics(test_norm))
new_metrics_normbatch <- sm(graph_metrics(test_normbatch))
```

### Plot only new data

Removing the January data seems to have made this a lot easier to look at.
There might be some batchyness associated with reordering the samples, but I am
thinking it is not huge.


```{r new_data_plots, eval=FALSE}
tt <- plot_topn(kept_testing, direct=TRUE)
new_metrics$legend
new_metrics$libsize
pp("images/kept_norm_topn.png", image=new_metrics$topnplot)
pp("images/kept_norm_hcd_pca.png", image=new_metrics_norm$pcaplot)
new_metrics_norm$tsneplot
pp("images/kept_norm_hcd_corheat.png", image=new_metrics_norm$corheat)
new_metrics_norm$disheat
new_metrics_normbatch$pcaplot
pp("images/normbatch_tsne.png", image=new_metrics_normbatch$tsneplot)
```

### How is the variance?

Let's see what variancePartition has to say about these data.

```{r varpart, eval=FALSE}
new_varpart <- varpart(kept_testing, factors=c("condition", "batch"))
pp("images/varpart_partition.png", image=new_varpart$partition_plot)
## That is not terrible.

## The modified_expt slot from varpart() adds some metadata to the expressionset,
## including the calculated % variance for condition/batch/residual for each peptide.
kept_testing <- new_varpart$modified_expt
```

## Back to hpgltools

If you will recall, in the original block where I read the data into
SWATH2stats, I invoked write_matrix_proteins() and write_matrix_peptides(),
writing out two csv files: 'swath2stats_protein_matrix.csv' and
'_peptide_matrix.csv'.  If my earlier work is correct, I should be able to use
the protein matrix to get a truer sense of the relative abundances between my
experimental conditions.
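As a quick sanity check, one could read that protein matrix back in and
log2-transform it before comparing conditions.  This is only a minimal sketch:
it assumes the csv has proteins as rows and samples as columns, which I have
not verified here.

```r
## Hedged sketch: re-read the SWATH2stats protein matrix and log2-transform it.
## Assumes proteins are rows and samples are columns; adjust row.names if not.
prot <- read.csv("swath2stats_protein_matrix.csv", row.names=1)
log_prot <- log2(as.matrix(prot) + 1)  ## +1 avoids log2(0) for missing values
summary(as.vector(log_prot))
```

The +1 pseudo-count is just one common way to dodge zeros; a filter step like
the normalize_expt() calls above would be the more careful option.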

```{r alfq_quant, eval=FALSE}
library(aLFQ)
alfq_process <- ProteinInference(alfq_input,
                                 peptide_method="top",
                                 peptide_topx=3,
                                 peptide_strictness="loose",
                                 peptide_summary="mean",
                                 transition_topx=3,
                                 transition_strictness="loose",
                                 transition_summary="sum",
                                 fasta=NA,
                                 model=NA,
                                 combine_precursors=FALSE)
```
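For intuition about what the peptide_method="top", peptide_topx=3,
peptide_summary="mean" settings above are asking for, here is a toy base-R
sketch of a top-3/mean peptide roll-up per protein.  The numbers are made up
and this is an analogy, not aLFQ's actual implementation.

```r
## Toy illustration of a 'top 3 peptides, mean' roll-up per protein.
pep <- data.frame(protein=c("P1", "P1", "P1", "P1", "P2", "P2"),
                  intensity=c(10, 40, 30, 20, 5, 15))
top3_mean <- function(x) {
  mean(sort(x, decreasing=TRUE)[seq_len(min(3, length(x)))])
}
tapply(pep$intensity, pep$protein, top3_mean)
## P1: mean(40, 30, 20) = 30;  P2: mean(15, 5) = 10
```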

# Index version: `r ver`

# TODO

* 2018-04-10:  Make sure my invocations of SWATH2stats/MSstats are correct.

```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}
```
