1 TODO

  • Remove MSstats logging. – done
  • Make explicit spearman correlations between methods. – done
    • Do both for all data and for the top-50/bottom-50 – done, but weird.
  • Make 100% certain that the samples are annotated correctly. – done.
  • Limma/MSstats/EdgeR venn disagram for all up in CF
    • Repeat in all down CF
    • Repeat with a cutoff, top-50
  • Individual plots for P/PE proteins.

2 Analyzing data from openMS and friends.

In preprocessing_comet_highres.Rmd, I used the openMS tutorials and supplemental materials from a couple papers to hopefully correctly perform the various preprocessing tasks required to extract intensity data from DIA/SWATH transitions.

The final steps of that process combined the transition intensities from every sample into a metadata frile (results/tric/HCD_meta.tsv), an intensity matrix (results/tric/HCD_outmatrix.tsv), and a feature aligned output matrix (results/tric/aligned_comet_HCD.tsv).

My reading of the SWATH2stats and MSstats source code suggests to me that the log2(intensities) of the feature aligned data are our final proxy for protein abundance. At first glance, this suggests to me that these data might follow a distribution similar to RNASeq data (negative binomial, but perhaps with a bigger tail?). In addition, by the time we use tric on the data, we have a count matrix and sample annotation data frames which look remarkably similar to those used in a RNASeq expressionset. Indeed, by the end of the MSstats processing, it creates a MSnSet class of its own which uses fData/exprs/pData.

For the curious, my reasoning for saying that the log intensities are our proxy for abundance comes from MSstats/R/DataProcess.R in a clause which looks like:

if (logTrans == 2) {
  work[["ABUNDANCE"]] <- log2(work[["ABUNDANCE"]])
} else if (logTrans == 10) {
  work[["ABUNDANCE"]] <- log10(work[["ABUNDANCE"]])
} else {
  ## Above there was a check for only log 2 and 10, but we can do e if we want.
  ## I might go back up there and remove that check. Long live e! 2.718282 rules!
  work[["ABUNDANCE"]] <- log(work[["ABUNDANCE"]]) / log(logTrans)
}

(Note: I added the natural log to the set of conditions, but otherwise the logic is unchanged.)

With that in mind, I want to use some tools with which I am familiar in order to try to understand these data. Therefore I will first attempt to coerce my tric aligned data and annotations into a ‘normal’ expressionset. Then I want to do some diagnostic plots which, if I am wrong and these distributions are not as expected, will be conceptually incorrect (I don’t yet think I am wrong).

2.1 Sample annotation via SWATH2stats

I am using the SWATH2stats vignette as my primary source of information. Thus I see that it uses the OpenSWATH_SM3_GoldStandardAutomatedResults_human_peakgroups.txt which has a format nearly identical to my tric output matrix. Thus for the moment I will assume that the proper input for SWATH2stats is ‘results/tric/comet_HCD.tsv’ and not the metadata nor output matrix.

I keep a sample sheet of all the DIA samples used in this analysis in ‘sample_sheets/dia_samples.xlsx’ It should contain all the other required data with one important caveat, I removed 1 sample by ‘commenting’ it (e.g. prefixing it with ‘##’ – which is an admittedly dumb thing to do in an excel file.

One last caveat: I hacked the SWATH2stats sample_annotation() function to add a couple columns in an attempt to make it a little more robust when faced with sample sheets with differently named columns.

In addition, SWATH2stats provides some nice filtering and combination functions which should be considered when generating various expressionset data structures later.

tric_data <- read.csv("results/tric_201805/comet_HCD.tsv", sep="\t")

sample_annot <- openxlsx::read.xlsx("sample_sheets/Mtb_dia_samples.xlsx")
rownames(sample_annot) <- make.names(sample_annot[["sampleid"]], unique=TRUE)
## Drop samples starting with comments
keep_idx <- ! grepl(pattern="##", x=sample_annot[["sampleid"]])
sample_annot <- sample_annot[keep_idx, ]
expt_idx <- sample_annot[["expt_id"]] == "may2018" | sample_annot[["expt_id"]] == "mar2018"
expt_idx[is.na(sample_annot[["expt_id"]])] <- FALSE
sample_annot <- sample_annot[expt_idx, ]
mz_idx <- sample_annot[["windowsize"]] == "8"
sample_annot <- sample_annot[mz_idx, ]
## Set the mzXML column to match the filename column in the data.
devtools::load_all("~/scratch/git/SWATH2stats")
## Loading SWATH2stats
## s2s, my witty way of shortening SWATH2stats...
s2s_exp <- sample_annotation(data=tric_data,
                             sample_annotation=sample_annot,
                             fullpeptidename_column="fullunimodpeptidename")
## Found the same mzXML files in the annotations and data.

Now I have a couple data structures which should prove useful for the metrics provided by SWATH2stats, MSstats, and my own hpgltools.

3 SWATH2stats continued

Lets return to some of the metrics provided by swath2stats.

## Get correlations on a sample by sample basis
pp(file="images/20180523_swath2stats_sample_cor.png")
## Going to write the image to: images/20180523_swath2stats_sample_cor.png when dev.off() is called.
sample_cor <- plot_correlation_between_samples(s2s_exp,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
## png 
##   2
sample_cond_rep_cor <- plot_correlation_between_samples(s2s_exp,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")

## I am a little concerned that these values do not seem to change when I took
## filtered/normalized data.  So I am rerunning them manually for a moment --
## perhaps I messed something up when I rewrote portions of the
## sample_annotation() function in SWATH2stats.

## ahh I think I see the problem.  The default value for fun.aggregate is NULL,
## which causes dcast to default to length.  I think this is not likely to be
## valid for this data.  I am not certain, however, what is the appropriate
## function.  If I had to guess, I would go with sum()?

assess_decoy_rate(s2s_exp)
## Number of non-decoy peptides: 18096
## Number of decoy peptides: 1436
## Decoy rate: 0.0794
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
## The average FDR by run on assay level is 0.011
## The average FDR by run on peptide level is 0.012
## The average FDR by run on protein level is 0.049

chosen_mscore <- mscore4assayfdr(s2s_exp, FFT=0.7, fdr_target=0.02)
## Target assay FDR: 0.02
## Required overall m-score cutoff:0.0039811
## achieving assay FDR =0.0186
prot_score <- mscore4protfdr(s2s_exp, FFT=0.7, fdr_target=0.02)
## Target protein FDR:0.02
## Required overall m-score cutoff:0.00089125
## achieving protein FDR =0.0194
mscore_filtered <- filter_mscore(s2s_exp, chosen_mscore)
## Dimension difference: 15921, 0
data_filtered_mscore <- filter_mscore_freqobs(s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
## Treshold, peptides need to have been quantified in more conditions than: 18.4
## Fraction of peptides selected: 1
## Dimension difference: 0, 0
data_filtered_fdr <- filter_mscore_fdr(mscore_filtered, FFT=0.7,
                                       overall_protein_fdr_target=prot_score,
                                       upper_overall_peptide_fdr_limit=0.05)
## Target protein FDR:0.000891250938133746
## Required overall m-score cutoff:0.01
## achieving protein FDR =0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.05
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected: 
## Total proteins selected: 2583
## Thereof target proteins: 2583
## Thereof decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 16956
## Thereof target peptides: 16956
## Thereof decoy peptides: 0
## Total peptides selected from:
## Total peptides: 16956
## Thereof target peptides: 16956
## Thereof decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
only_proteotypic <- filter_proteotypic_peptides(data_filtered_fdr)
## Number of proteins detected: 2544
## Protein identifiers: Rv2224c, Rv2220, Rv3267, Rv2241, Rv1613, Rv2427c
## Number of proteins detected that are supported by a proteotypic peptide: 2505
## Number of proteotypic peptides detected: 16822
all_filtered <- filter_all_peptides(only_proteotypic)
## Number of proteins detected: 2505
## First 6 protein identifiers: Rv2224c, Rv2220, Rv3267, Rv2241, Rv1613, Rv2427c
only_strong <- filter_on_max_peptides(data=all_filtered, n_peptides=10)
## Before filtering: 
##   Number of proteins: 2505
##   Number of peptides: 16822
## 
## Percentage of peptides removed: 21.43%
## 
## After filtering: 
##   Number of proteins: 2491
##   Number of peptides: 13217
only_minimum <- filter_on_min_peptides(data=only_strong, n_peptides=3)
## Before filtering: 
##   Number of proteins: 2491
##   Number of peptides: 13217
## 
## Percentage of peptides removed: 0.03%
## 
## After filtering: 
##   Number of proteins: 2329
##   Number of peptides: 13213
## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
protein_matrix_all <- write_matrix_proteins(
  s2s_exp, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/protein_all.csv"))
## Protein overview matrix results/swath2stats_20180528/protein_all.csv written to working folder.
dim(protein_matrix_all)
## [1] 3614   24
protein_matrix_mscore <- write_matrix_proteins(
  mscore_filtered, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/protein_matrix_mscore.csv"))
## Protein overview matrix results/swath2stats_20180528/protein_matrix_mscore.csv written to working folder.
dim(protein_matrix_mscore)
## [1] 2583   24
peptide_matrix_mscore <- write_matrix_peptides(
  mscore_filtered, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/peptide_matrix_mscore.csv"))
## Peptide overview matrix results/swath2stats_20180528/peptide_matrix_mscore.csv written to working folder.
dim(peptide_matrix_mscore)
## [1] 16956    24
protein_matrix_minimum <- write_matrix_proteins(
  only_minimum, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/protein_matrix_minimum.csv"))
## Protein overview matrix results/swath2stats_20180528/protein_matrix_minimum.csv written to working folder.
dim(protein_matrix_minimum)
## [1] 2329   24
peptide_matrix_minimum <- write_matrix_peptides(
  only_minimum, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/peptide_matrix_minimum.csv"))
## Peptide overview matrix results/swath2stats_20180528/peptide_matrix_minimum.csv written to working folder.
dim(peptide_matrix_minimum)
## [1] 115412     24
rt_cor <- plot_correlation_between_samples(only_minimum,
                                           column.values="intensity",
                                           fun.aggregate=sum)

## I have no effing clue what this plot means.
variation <- plot_variation(only_minimum, fun.aggregate=sum,
                            comparison=transition_group_id ~ condition)

## Something in SWATH2stats::disaggregate was written poorly and is looking for
## a variable named 'cols'
cols <- colnames(only_minimum)
disaggregated <- disaggregate(only_minimum, all.columns=TRUE)
## The library contains 6 transitions per precursor.
##                   
## The data table was transformed into a table containing one row per transition.
msstats_input <- convert4MSstats(disaggregated)
## One or several columns required by MSstats were not in the data. The columns were created and filled with NAs.
## Missing columns: fragmention, productcharge, isotopelabeltype
## isotopelabeltype was filled with light.
##alfq_input <- sm(convert4aLFQ(disaggregated))
##mapdia_input <- sm(convert4mapDIA(disaggregated, RT=TRUE))

3.1 MSstats

msstats.org seems to provide a complete solution for performing reasonable metrics of this data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

devtools::load_all("~/scratch/git/MSstats")
## Loading MSstats
## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)

## Warning: character(0)
msstats_quant <- sm(dataProcess(msstats_input))
msstats_plots <- sm(dataProcessPlots(msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(msstats_input$condition))
my_levels
## [1] "comp_cf"     "comp_whole"  "delta_cf"    "delta_whole" "wt_cf"      
## [6] "wt_whole"
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_cf", "delta_cf", "comp_cf",
               "delta_cf", "comp_cf", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_cf", "wt_cf", "wt_whole",
                 "wt_whole"))
results <- list()
for (c in 1:length(comparisons)) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                               data=msstats_quant))
}
## Starting wt_cf-wt_whole
## Warning in results[name] <- sm(MSstats::groupComparison(contrast.matrix =
## comparison, : number of items to replace is not a multiple of replacement
## length
## Starting delta_cf-delta_whole
## Warning in results[name] <- sm(MSstats::groupComparison(contrast.matrix =
## comparison, : number of items to replace is not a multiple of replacement
## length
## Starting comp_cf-comp_whole
## Warning in results[name] <- sm(MSstats::groupComparison(contrast.matrix =
## comparison, : number of items to replace is not a multiple of replacement
## length
## Starting delta_cf-wt_cf
## Warning in results[name] <- sm(MSstats::groupComparison(contrast.matrix =
## comparison, : number of items to replace is not a multiple of replacement
## length
## Starting comp_cf-wt_cf
## Warning in results[name] <- sm(MSstats::groupComparison(contrast.matrix =
## comparison, : number of items to replace is not a multiple of replacement
## length
## Starting delta_whole-wt_whole
## Warning in results[name] <- sm(MSstats::groupComparison(contrast.matrix =
## comparison, : number of items to replace is not a multiple of replacement
## length
## Starting comp_whole-wt_whole
## Warning in results[name] <- sm(MSstats::groupComparison(contrast.matrix =
## comparison, : number of items to replace is not a multiple of replacement
## length
## Starting NA
## Error in comparisons[c, ]: subscript out of bounds

3.1.1 P/PE protein QC plots for Yan

Yan asked for the p/pe protein qc plots. ok. I changed the dataProcessPlots to return something useful, so that should be possible now.

pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]

## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="", x=levels(msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots

pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in 1:length(pe_plots)) {
  plot(pe_plots[[p]])
}
dev.off()

4 Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities from SWATH2stats, metadata, and annotation data; attempt to create a ‘normal’ expressionset; poke at it to see what I can learn.

4.1 Massaging the metadata

I want to use the same metadata as were used for MSstats. It has a few important differences from the requirements of hpgltools: pretty much only that I do not allow rownames/sampleIDs to start with a number.

metadata <- sample_annot
metadata[["sampleid"]] <- paste0("s", metadata[["sampleid"]])
rownames(metadata) <- metadata[["sampleid"]]

4.2 Massaging the gene annotation data and adding the msstats data.

I have my own annotation data from the gff file/microbesonline/whatever, I can add the MSstats result to it so that later I can print them all together.

## Here is a neat little thing I can do:  Add the MSstats results to my annotation data.
## Then when I print out the tables of the limma/etc results, they MSstats
## results will come along for free.
big_table <- results[[1]]
for (r in 2:length(results)) {
  big_table <- merge(big_table, results[[r]], by="Protein")
}
## Warning in merge.data.frame(big_table, results[[r]], by = "Protein"):
## column names 'Label.x', 'log2FC.x', 'SE.x', 'Tvalue.x', 'DF.x', 'pvalue.x',
## 'adj.pvalue.x', 'issue.x', 'MissingPercentage.x', 'ImputationPercentage.x',
## 'Label.y', 'log2FC.y', 'SE.y', 'Tvalue.y', 'DF.y', 'pvalue.y',
## 'adj.pvalue.y', 'issue.y', 'MissingPercentage.y', 'ImputationPercentage.y'
## are duplicated in the result

## Warning in merge.data.frame(big_table, results[[r]], by = "Protein"):
## column names 'Label.x', 'log2FC.x', 'SE.x', 'Tvalue.x', 'DF.x', 'pvalue.x',
## 'adj.pvalue.x', 'issue.x', 'MissingPercentage.x', 'ImputationPercentage.x',
## 'Label.y', 'log2FC.y', 'SE.y', 'Tvalue.y', 'DF.y', 'pvalue.y',
## 'adj.pvalue.y', 'issue.y', 'MissingPercentage.y', 'ImputationPercentage.y'
## are duplicated in the result
## Warning in merge.data.frame(big_table, results[[r]], by = "Protein"):
## column names 'Label.x', 'log2FC.x', 'SE.x', 'Tvalue.x', 'DF.x', 'pvalue.x',
## 'adj.pvalue.x', 'issue.x', 'MissingPercentage.x', 'ImputationPercentage.x',
## 'Label.y', 'log2FC.y', 'SE.y', 'Tvalue.y', 'DF.y', 'pvalue.y',
## 'adj.pvalue.y', 'issue.y', 'MissingPercentage.y', 'ImputationPercentage.y',
## 'Label.x', 'log2FC.x', 'SE.x', 'Tvalue.x', 'DF.x', 'pvalue.x',
## 'adj.pvalue.x', 'issue.x', 'MissingPercentage.x', 'ImputationPercentage.x',
## 'Label.y', 'log2FC.y', 'SE.y', 'Tvalue.y', 'DF.y', 'pvalue.y',
## 'adj.pvalue.y', 'issue.y', 'MissingPercentage.y', 'ImputationPercentage.y'
## are duplicated in the result

## Warning in merge.data.frame(big_table, results[[r]], by = "Protein"):
## column names 'Label.x', 'log2FC.x', 'SE.x', 'Tvalue.x', 'DF.x', 'pvalue.x',
## 'adj.pvalue.x', 'issue.x', 'MissingPercentage.x', 'ImputationPercentage.x',
## 'Label.y', 'log2FC.y', 'SE.y', 'Tvalue.y', 'DF.y', 'pvalue.y',
## 'adj.pvalue.y', 'issue.y', 'MissingPercentage.y', 'ImputationPercentage.y',
## 'Label.x', 'log2FC.x', 'SE.x', 'Tvalue.x', 'DF.x', 'pvalue.x',
## 'adj.pvalue.x', 'issue.x', 'MissingPercentage.x', 'ImputationPercentage.x',
## 'Label.y', 'log2FC.y', 'SE.y', 'Tvalue.y', 'DF.y', 'pvalue.y',
## 'adj.pvalue.y', 'issue.y', 'MissingPercentage.y', 'ImputationPercentage.y'
## are duplicated in the result
rownames(big_table) <- big_table[["Protein"]]
big_table <- big_table[, -1]

mtb_annotations_with_msstats <- merge(mtb_annotations, big_table,
                                      by="row.names", all.x=TRUE)
## Error in merge(mtb_annotations, big_table, by = "row.names", all.x = TRUE): object 'mtb_annotations' not found
rownames(mtb_annotations_with_msstats) <- mtb_annotations_with_msstats[["Row.names"]]
## Error in eval(expr, envir, enclos): object 'mtb_annotations_with_msstats' not found
mtb_annotations_with_msstats <- mtb_annotations_with_msstats[, -1]
## Error in eval(expr, envir, enclos): object 'mtb_annotations_with_msstats' not found

4.3 Massaging the intensity matrix

I do not want the \1 before the protein names, I already merged them into one entry per gene vis SWATH2stats.

prot_mtrx <- read.csv(paste0("results/swath2stats_", ver, "/protein_matrix_minimum.csv"))
rownames(prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=prot_mtrx[["proteinname"]])
prot_mtrx <- prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
fun <- gsub(pattern="^.*_(2018.*$)", replacement="\\1", x=colnames(prot_mtrx))
colnames(prot_mtrx) <- paste0("s", fun)

4.4 Merge the pieces

Now we should have sufficient pieces to make an expressionset.

## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(prot_mtrx)
metadata <- metadata[reordered, ]

protein_expt <- create_expt(metadata,
                            count_dataframe=prot_mtrx,
                            gene_info=mtb_annotations_with_msstats)
## Reading the sample metadata.
## The sample definitions comprises: 23, 21 rows, columns.
## Error in create_expt(metadata, count_dataframe = prot_mtrx, gene_info = mtb_annotations_with_msstats): object 'mtb_annotations_with_msstats' not found
whole_expt <- subset_expt(protein_expt, subset="collectiontype=='whole'")
## Error in sampleNames(expt): object 'protein_expt' not found
cf_expt <- subset_expt(protein_expt, subset="collectiontype=='cf'")
## Error in sampleNames(expt): object 'protein_expt' not found
protein_metrics <- sm(graph_metrics(protein_expt))
## Error in corheat[["plot"]]: subscript out of bounds
protein_norm <- sm(normalize_expt(protein_expt, transform="log2", convert="cpm",
                                  norm="quant", filter=TRUE))
## Error in normalize_expt(protein_expt, transform = "log2", convert = "cpm", : object 'protein_expt' not found
protein_norm_metrics <- sm(graph_metrics(protein_norm))
## Error in corheat[["plot"]]: subscript out of bounds
protein_fsva <- sm(normalize_expt(protein_expt, transform="log2", convert="cpm",
                                  batch="fsva", filter=TRUE))
## Error in normalize_expt(protein_expt, transform = "log2", convert = "cpm", : object 'protein_expt' not found
protein_fsva_metrics <- sm(graph_metrics(protein_fsva))
## Error in corheat[["plot"]]: subscript out of bounds
whole_metrics <- sm(graph_metrics(whole_expt))
## Error in corheat[["plot"]]: subscript out of bounds
whole_norm <- sm(normalize_expt(whole_expt, transform="log2", convert="cpm",
                                  norm="quant", filter=TRUE))
## Error in normalize_expt(whole_expt, transform = "log2", convert = "cpm", : object 'whole_expt' not found
whole_norm_metrics <- sm(graph_metrics(whole_norm))
## Error in corheat[["plot"]]: subscript out of bounds
whole_fsva <- sm(normalize_expt(whole_expt, transform="log2", convert="cpm",
                                  batch="fsva", filter=TRUE))
## Error in normalize_expt(whole_expt, transform = "log2", convert = "cpm", : object 'whole_expt' not found
whole_fsva_metrics <- sm(graph_metrics(whole_fsva))
## Error in corheat[["plot"]]: subscript out of bounds
cf_metrics <- sm(graph_metrics(cf_expt))
## Error in corheat[["plot"]]: subscript out of bounds
cf_norm <- sm(normalize_expt(cf_expt, transform="log2", convert="cpm",
                                  norm="quant", filter=TRUE))
## Error in normalize_expt(cf_expt, transform = "log2", convert = "cpm", : object 'cf_expt' not found
cf_norm_metrics <- sm(graph_metrics(cf_norm))
## Error in corheat[["plot"]]: subscript out of bounds
cf_fsva <- sm(normalize_expt(cf_expt, transform="log2", convert="cpm",
                                  batch="fsva", filter=TRUE))
## Error in normalize_expt(cf_expt, transform = "log2", convert = "cpm", : object 'cf_expt' not found
cf_fsva_metrics <- sm(graph_metrics(cf_fsva))
## Error in corheat[["plot"]]: subscript out of bounds
pp(image=protein_metrics$libsize, file="images/20180523_libsize.png")
## Error in pp(image = protein_metrics$libsize, file = "images/20180523_libsize.png"): object 'protein_metrics' not found
pp(image=protein_norm_metrics$pcaplot, file="images/20180523_norm_pca.png")
## Error in pp(image = protein_norm_metrics$pcaplot, file = "images/20180523_norm_pca.png"): object 'protein_norm_metrics' not found
pp(image=protein_fsva_metrics$pcaplot, file="images/20180523_fsva_pca.png")
## Error in pp(image = protein_fsva_metrics$pcaplot, file = "images/20180523_fsva_pca.png"): object 'protein_fsva_metrics' not found
pp(image=protein_norm_metrics$corheat, file="images/20180523_norm_corheat.png")
## Error in pp(image = protein_norm_metrics$corheat, file = "images/20180523_norm_corheat.png"): object 'protein_norm_metrics' not found
pp(image=protein_metrics$density, file="images/20180523_raw_density.png")
## Error in pp(image = protein_metrics$density, file = "images/20180523_raw_density.png"): object 'protein_metrics' not found
pp(image=protein_metrics$boxplot, file="images/20180523_boxplot.png")
## Error in pp(image = protein_metrics$boxplot, file = "images/20180523_boxplot.png"): object 'protein_metrics' not found
pp(image=whole_metrics$libsize, file="images/20180523_whole_libsize.png")
## Error in pp(image = whole_metrics$libsize, file = "images/20180523_whole_libsize.png"): object 'whole_metrics' not found
pp(image=whole_norm_metrics$pcaplot, file="images/20180523_whole_norm_pca.png")
## Error in pp(image = whole_norm_metrics$pcaplot, file = "images/20180523_whole_norm_pca.png"): object 'whole_norm_metrics' not found
pp(image=whole_fsva_metrics$pcaplot, file="images/20180523_whole_fsva_pca.png")
## Error in pp(image = whole_fsva_metrics$pcaplot, file = "images/20180523_whole_fsva_pca.png"): object 'whole_fsva_metrics' not found
pp(image=whole_norm_metrics$corheat, file="images/20180523_whole_norm_corheat.png")
## Error in pp(image = whole_norm_metrics$corheat, file = "images/20180523_whole_norm_corheat.png"): object 'whole_norm_metrics' not found
pp(image=whole_metrics$density, file="images/20180523_whole_raw_density.png")
## Error in pp(image = whole_metrics$density, file = "images/20180523_whole_raw_density.png"): object 'whole_metrics' not found
pp(image=whole_metrics$boxplot, file="images/20180523_whole_boxplot.png")
## Error in pp(image = whole_metrics$boxplot, file = "images/20180523_whole_boxplot.png"): object 'whole_metrics' not found
pp(image=cf_metrics$libsize, file="images/20180523_libsize.png")
## Error in pp(image = cf_metrics$libsize, file = "images/20180523_libsize.png"): object 'cf_metrics' not found
pp(image=cf_norm_metrics$pcaplot, file="images/20180523_norm_pca.png")
## Error in pp(image = cf_norm_metrics$pcaplot, file = "images/20180523_norm_pca.png"): object 'cf_norm_metrics' not found
pp(image=cf_fsva_metrics$pcaplot, file="images/20180523_fsva_pca.png")
## Error in pp(image = cf_fsva_metrics$pcaplot, file = "images/20180523_fsva_pca.png"): object 'cf_fsva_metrics' not found
pp(image=cf_norm_metrics$corheat, file="images/20180523_norm_corheat.png")
## Error in pp(image = cf_norm_metrics$corheat, file = "images/20180523_norm_corheat.png"): object 'cf_norm_metrics' not found
pp(image=cf_metrics$density, file="images/20180523_raw_density.png")
## Error in pp(image = cf_metrics$density, file = "images/20180523_raw_density.png"): object 'cf_metrics' not found
pp(image=cf_metrics$boxplot, file="images/20180523_boxplot.png")
## Error in pp(image = cf_metrics$boxplot, file = "images/20180523_boxplot.png"): object 'cf_metrics' not found

5 Attempt some quantification comparisons?

pairwise_filt <- sm(normalize_expt(protein_expt, filter=TRUE))
## Error in normalize_expt(protein_expt, filter = TRUE): object 'protein_expt' not found
pairwise_comp <- sm(all_pairwise(pairwise_filt, model_batch="fsva", force=TRUE))
## Error in get_model_adjust(input, estimate_type = model_batch, surrogates = surrogates): object 'pairwise_filt' not found

6 For each msstats run, do a DE table

keepers <- list(
  "wt_cf_minus_wt_whole" = c("wt_cf", "wt_whole"),
  "delta_cf_minus_delta_whole" = c("delta_cf", "delta_whole"),
  "comp_cf_minus_comp_whole" = c("comp_cf", "comp_whole"),
  "delta_cf_minus_wt_cf" = c("delta_cf", "wt_cf"),
  "comp_cf_minus_wt_cf" = c("comp_cf", "wt_cf"),
  "delta_whole_minus_wt_whole" = c("delta_whole", "wt_whole"),
  "comp_whole_minus_wt_whole" = c("comp_whole", "wt_whole"))
pairwise_tables <- sm(combine_de_tables(
  pairwise_comp,
  excel=paste0("excel/pairwise_de_with_msstats-v", ver, ".xlsx"),
  keepers=keepers))
## Error in combine_de_tables(pairwise_comp, excel = paste0("excel/pairwise_de_with_msstats-v", : object 'pairwise_comp' not found
pairwise_sig <- sm(extract_significant_genes(
  pairwise_tables,
  excel=paste0("excel/test_pairwise_sig_with_msstats-v", ver, ".xlsx")))
## Error in extract_significant_genes(pairwise_tables, excel = paste0("excel/test_pairwise_sig_with_msstats-v", : object 'pairwise_tables' not found
droppers <- c("undefined")
names(droppers) <- "log2fc"
pairwise_onlyall <- sm(combine_de_tables(
  pairwise_comp,
  excel=paste0("excel/test_pairwise_de_only_msstats-v", ver, ".xlsx"),
  keepers=keepers,
  excludes=droppers))
solo_proteins <- features_in_single_condition(protein_expt)
## Error in pData(expt): object 'protein_expt' not found
proteins_only_cf <- solo_proteins[["solo_this"]][["wt_cf"]]
## Error in eval(expr, envir, enclos): object 'solo_proteins' not found
proteins_only_cf
## Error in eval(expr, envir, enclos): object 'proteins_only_cf' not found
proteins_only_whole <- solo_proteins[["solo_this"]][["wt_whole"]]
## Error in eval(expr, envir, enclos): object 'solo_proteins' not found
length(proteins_only_whole)
## Error in eval(expr, envir, enclos): object 'proteins_only_whole' not found

6.1 Compare hpgltools/MSstats

Can we compare limma and riends to MSstats?

6.1.1 wt cf / wt whole

hpgl_table <- pairwise_tables[["data"]][["wt_cf_vs_wt_whole"]]
msstats_table <- results[["


msstats
msstats_table <- msstats_comparison$ComparisonResult
rownames(msstats_table) <- gsub(pattern="^1/", replacement="", x=msstats_table$Protein)

merged_table <- merge(hpgl_table, msstats_table, by="row.names")
write.csv(file="images/merged_table.csv", merged_table)
cor.test(merged_table$limma_logfc, merged_table$log2FC)
cor.test(merged_table$limma_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$deseq_logfc, merged_table$log2FC)
cor.test(merged_table$deseq_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$edger_logfc, merged_table$log2FC)
cor.test(merged_table$edger_logfc, merged_table$log2FC, method="spearman")

combined_table <- pairwise_tables$data[[1]]
undefined_idx <- combined_table[["log2fc"]] == "undefined"
combined_table[undefined_idx, "log2fc"] <- NA
combined_table[["log2fc"]] <- as.numeric(combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]], method="spearman")

high_to_low_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=TRUE)
low_to_high_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=FALSE)
top_50 <- head(combined_table[high_to_low_idx, ], n=100)
bottom_50 <- head(combined_table[low_to_high_idx, ], n=100)

cor.test(top_50$limma_logfc, top_50$log2fc)
cor.test(top_50$limma_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$deseq_logfc, top_50$log2fc)
cor.test(top_50$deseq_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$edger_logfc, top_50$log2fc)
cor.test(top_50$edger_logfc, top_50$log2fc, method="spearman")

cor.test(bottom_50$limma_logfc, bottom_50$log2fc)
cor.test(bottom_50$limma_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc)
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$edger_logfc, bottom_50$log2fc)
cor.test(bottom_50$edger_logfc, bottom_50$log2fc, method="spearman")

up_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] > 0]
up_in_msstats <- up_in_msstats[!is.na(up_in_msstats)]
up_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] > 0]
up_in_limma <- up_in_limma[!is.na(up_in_limma)]
up_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] > 0]
up_in_edger <- up_in_edger[!is.na(up_in_edger)]
up_venn_sets <- list(
  "msstats" = up_in_msstats,
  "limma" = up_in_limma,
  "edger" = up_in_edger)
testing <- Vennerable::Venn(Sets=up_venn_sets, )
pp(file="/tmp/up_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
Vennerable::plot(testing, doWeights=FALSE)

down_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] < 0]
down_in_msstats <- down_in_msstats[!is.na(down_in_msstats)]
down_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] < 0]
down_in_limma <- down_in_limma[!is.na(down_in_limma)]
down_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] < 0]
down_in_edger <- down_in_edger[!is.na(down_in_edger)]
down_venn_sets <- list(
  "msstats" = down_in_msstats,
  "limma" = down_in_limma,
  "edger" = down_in_edger)
testing <- Vennerable::Venn(Sets=down_venn_sets, )
pp(file="/tmp/down_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
## Error: <text>:7:61: unexpected string constant
## 8: 
## 9: merged_table <- merge(hpgl_table, msstats_table, by="
##                                                                ^

7 Repeat with the modified data

hpgl_table <- pairwise_modified_tables$data[[1]]
## Error in eval(expr, envir, enclos): object 'pairwise_modified_tables' not found
msstats_table <- msstats_modified_comp$ComparisonResult
## Error in eval(expr, envir, enclos): object 'msstats_modified_comp' not found
rownames(msstats_table) <- gsub(pattern="^1/", replacement="", x=msstats_table_modified$Protein)
## Error in gsub(pattern = "^1/", replacement = "", x = msstats_table_modified$Protein): object 'msstats_table_modified' not found
merged_table <- merge(hpgl_table, msstats_table, by="row.names")
## Error in merge(hpgl_table, msstats_table, by = "row.names"): object 'hpgl_table' not found
write.csv(file="images/merged_table_modified.csv", merged_table)
## Error in is.data.frame(x): object 'merged_table' not found
cor.test(merged_table$limma_logfc, merged_table$log2FC)
## Error in cor.test(merged_table$limma_logfc, merged_table$log2FC): object 'merged_table' not found
cor.test(merged_table$limma_logfc, merged_table$log2FC, method="spearman")
## Error in cor.test(merged_table$limma_logfc, merged_table$log2FC, method = "spearman"): object 'merged_table' not found
cor.test(merged_table$deseq_logfc, merged_table$log2FC)
## Error in cor.test(merged_table$deseq_logfc, merged_table$log2FC): object 'merged_table' not found
cor.test(merged_table$deseq_logfc, merged_table$log2FC, method="spearman")
## Error in cor.test(merged_table$deseq_logfc, merged_table$log2FC, method = "spearman"): object 'merged_table' not found
cor.test(merged_table$edger_logfc, merged_table$log2FC)
## Error in cor.test(merged_table$edger_logfc, merged_table$log2FC): object 'merged_table' not found
cor.test(merged_table$edger_logfc, merged_table$log2FC, method="spearman")
## Error in cor.test(merged_table$edger_logfc, merged_table$log2FC, method = "spearman"): object 'merged_table' not found
combined_table <- pairwise_tables$data[[1]]
## Error in eval(expr, envir, enclos): object 'pairwise_tables' not found
undefined_idx <- combined_table[["log2fc"]] == "undefined"
## Error in eval(expr, envir, enclos): object 'combined_table' not found
combined_table[undefined_idx, "log2fc"] <- NA
## Error in combined_table[undefined_idx, "log2fc"] <- NA: object 'combined_table' not found
combined_table[["log2fc"]] <- as.numeric(combined_table[["log2fc"]])
## Error in eval(expr, envir, enclos): object 'combined_table' not found
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]])
## Error in cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]]): object 'combined_table' not found
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]], method="spearman")
## Error in cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]], : object 'combined_table' not found
high_to_low_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=TRUE)
## Error in order(combined_table[["log2fc"]], na.last = TRUE, decreasing = TRUE): object 'combined_table' not found
low_to_high_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=FALSE)
## Error in order(combined_table[["log2fc"]], na.last = TRUE, decreasing = FALSE): object 'combined_table' not found
top_50 <- head(combined_table[high_to_low_idx, ], n=100)
## Error in head(combined_table[high_to_low_idx, ], n = 100): object 'combined_table' not found
bottom_50 <- head(combined_table[low_to_high_idx, ], n=100)
## Error in head(combined_table[low_to_high_idx, ], n = 100): object 'combined_table' not found
cor.test(top_50$limma_logfc, top_50$log2fc)
## Error in cor.test(top_50$limma_logfc, top_50$log2fc): object 'top_50' not found
cor.test(top_50$limma_logfc, top_50$log2fc, method="spearman")
## Error in cor.test(top_50$limma_logfc, top_50$log2fc, method = "spearman"): object 'top_50' not found
cor.test(top_50$deseq_logfc, top_50$log2fc)
## Error in cor.test(top_50$deseq_logfc, top_50$log2fc): object 'top_50' not found
cor.test(top_50$deseq_logfc, top_50$log2fc, method="spearman")
## Error in cor.test(top_50$deseq_logfc, top_50$log2fc, method = "spearman"): object 'top_50' not found
cor.test(top_50$edger_logfc, top_50$log2fc)
## Error in cor.test(top_50$edger_logfc, top_50$log2fc): object 'top_50' not found
cor.test(top_50$edger_logfc, top_50$log2fc, method="spearman")
## Error in cor.test(top_50$edger_logfc, top_50$log2fc, method = "spearman"): object 'top_50' not found
cor.test(bottom_50$limma_logfc, bottom_50$log2fc)
## Error in cor.test(bottom_50$limma_logfc, bottom_50$log2fc): object 'bottom_50' not found
cor.test(bottom_50$limma_logfc, bottom_50$log2fc, method="spearman")
## Error in cor.test(bottom_50$limma_logfc, bottom_50$log2fc, method = "spearman"): object 'bottom_50' not found
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc)
## Error in cor.test(bottom_50$deseq_logfc, bottom_50$log2fc): object 'bottom_50' not found
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc, method="spearman")
## Error in cor.test(bottom_50$deseq_logfc, bottom_50$log2fc, method = "spearman"): object 'bottom_50' not found
cor.test(bottom_50$edger_logfc, bottom_50$log2fc)
## Error in cor.test(bottom_50$edger_logfc, bottom_50$log2fc): object 'bottom_50' not found
cor.test(bottom_50$edger_logfc, bottom_50$log2fc, method="spearman")
## Error in cor.test(bottom_50$edger_logfc, bottom_50$log2fc, method = "spearman"): object 'bottom_50' not found
up_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] > 0]
## Error in rownames(combined_table): object 'combined_table' not found
up_in_msstats <- up_in_msstats[!is.na(up_in_msstats)]
## Error in eval(expr, envir, enclos): object 'up_in_msstats' not found
up_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] > 0]
## Error in rownames(combined_table): object 'combined_table' not found
up_in_limma <- up_in_limma[!is.na(up_in_limma)]
## Error in eval(expr, envir, enclos): object 'up_in_limma' not found
up_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] > 0]
## Error in rownames(combined_table): object 'combined_table' not found
up_in_edger <- up_in_edger[!is.na(up_in_edger)]
## Error in eval(expr, envir, enclos): object 'up_in_edger' not found
up_venn_sets <- list(
  "msstats" = up_in_msstats,
  "limma" = up_in_limma,
  "edger" = up_in_edger)
## Error in eval(expr, envir, enclos): object 'up_in_msstats' not found
testing <- Vennerable::Venn(Sets=up_venn_sets, )
## Error in Vennerable::Venn(Sets = up_venn_sets, ): object 'up_venn_sets' not found
pp(file="/tmp/up_venn.png")
## Going to write the image to: /tmp/up_venn.png when dev.off() is called.
Vennerable::plot(testing, doWeights=FALSE)
## Error in Vennerable::plot(testing, doWeights = FALSE): object 'testing' not found
dev.off()
## png 
##   2
Vennerable::plot(testing, doWeights=FALSE)
## Error in Vennerable::plot(testing, doWeights = FALSE): object 'testing' not found
down_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] < 0]
## Error in rownames(combined_table): object 'combined_table' not found
down_in_msstats <- down_in_msstats[!is.na(down_in_msstats)]
## Error in eval(expr, envir, enclos): object 'down_in_msstats' not found
down_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] < 0]
## Error in rownames(combined_table): object 'combined_table' not found
down_in_limma <- down_in_limma[!is.na(down_in_limma)]
## Error in eval(expr, envir, enclos): object 'down_in_limma' not found
down_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] < 0]
## Error in rownames(combined_table): object 'combined_table' not found
down_in_edger <- down_in_edger[!is.na(down_in_edger)]
## Error in eval(expr, envir, enclos): object 'down_in_edger' not found
down_venn_sets <- list(
  "msstats" = down_in_msstats,
  "limma" = down_in_limma,
  "edger" = down_in_edger)
## Error in eval(expr, envir, enclos): object 'down_in_msstats' not found
testing <- Vennerable::Venn(Sets=down_venn_sets, )
## Error in Vennerable::Venn(Sets = down_venn_sets, ): object 'down_venn_sets' not found
pp(file="/tmp/down_venn.png")
## Going to write the image to: /tmp/down_venn.png when dev.off() is called.
Vennerable::plot(testing, doWeights=FALSE)
## Error in Vennerable::plot(testing, doWeights = FALSE): object 'testing' not found
dev.off()
## png 
##   2

8 Everything below here might get dropped?

I think I wedged all of the following work into that short block above. So, stop evaluating the following blocks and see if that is true.

8.0.1 Expressionset from the TRIC intensity matrix

intensity_mtrx <- read.csv(file="results/tric/HCD_outmatrix.tsv", sep="\t")
## The HCD_outmatrix has columns including: Peptide, Protein,
## A series of Intensity_sample_id columns
## A series of RT_sample_id columns
## A series of score_sample_id columns
## And at the end: RT_mean, RT_std, and pg_pvalue.
## Since SWATH2stats/MSstats uses the intensities, lets get those columns out,
## standardize the column names to match the annotation data, and use them for
## creating an expressionset.
intensity_mtrx[["rownames"]] <- intensity_mtrx[["Protein"]]
intensity_mtrx[["rownames"]] <- gsub(pattern="^[[:digit:]]+\\/",
                                     replacement="", x=intensity_mtrx[["rownames"]])
## Standardize the rownames, this might be a bad idea, as this will keep
## separate every peptide from each protein and not do anything to
## sum/median/whatever them.  But for the purposes of testing out the data I
## think it is ok.
rownames(intensity_mtrx) <- make.names(intensity_mtrx[["rownames"]], unique=TRUE)

## Now lets get rid of the extraneous text in the column names and simplify them
## to the sample names as referenced in the sample sheet.
all_columns <- colnames(intensity_mtrx)
intense_columns <- grepl(pattern="Intensity", x=all_columns)
intensity_mtrx <- intensity_mtrx[, intense_columns]
all_columns <- colnames(intensity_mtrx)
new_columns <- gsub(pattern="Intensity_", replacement="", x=all_columns)
new_columns <- gsub(pattern="_dia_.*$", replacement="", x=new_columns)
colnames(intensity_mtrx) <- paste0("s", new_columns)
intensity_mtrx[is.na(intensity_mtrx)] <- 0

## No columns in an expression set are allowed to start with a number.  I have
## unsed prefixing samples with 's' as a standard to handle this problem...
metadata <- sample_annot
metadata$sampleid <- paste0("s", metadata$sampleid)
colnames(intensity_mtrx) <- gsub(pattern="_vs_HCD", replacement="", x=colnames(intensity_mtrx))
rownames(metadata) <- metadata$sampleid
## Theoretically this is not needed anymore...
intensity_mtrx <- intensity_mtrx[, rownames(metadata)]
## Ok, at this point, we should have all the pieces for a more or less normal expressionset.
dim(intensity_mtrx)
test_expt <- create_expt(metadata=metadata,
                         count_dataframe=intensity_mtrx,
                         gene_info=mtb_annotations)

8.0.2 Play with the hpgltools derived expressionset of peptide intensities

Lets see if anything makes sense in this intensity expressionset.

## First, log2 and normalize the data.
test_norm <- normalize_expt(test_expt, transform="log2", convert="cpm", norm="quantile", filter=TRUE)
test_normbatch <- sm(normalize_expt(test_expt, transform="log2", convert="cpm",
                                    norm="quantile", filter=TRUE, batch="limma"))

test_metrics <- sm(graph_metrics(test_expt))
test_metrics_norm <- sm(graph_metrics(test_norm))
test_metrics_normbatch <- sm(graph_metrics(test_normbatch))

Now lets see what the data looks like, assuming I did not do anything horrible to it.

test_metrics$legend
test_metrics$libsize  ## Wow the intensities get ridiculous, what?
test_metrics$nonzero
## Interesting, but nothing which super-jumps out at me
test_metrics$density
## hmm very interesting, the good news is that they all have basically the same
## distribution.  I did not expect the CF samples to have such a stronger
## distribution though.

test_metrics_norm$corheat
## This suggests to me a likely batch effect
test_metrics_norm$disheat
test_metrics$smc
test_metrics$tsneplot
## Yeah there is some wonkyness between the old/new samples.

test_metrics_normbatch$pcaplot
## Wow, this is after invoking 'removeBatchEffect' from limma.  That suggests
## pretty strongly to me that we should probalby not examine the new and old
## data together.  In addition I am thinking that one whole-cell lysate sample
## might not be.

8.1 Remove old batch/weirdo sample

kept_testing <- subset_expt(test_expt,
                            subset="bioreplicate == 'mar'")
## The new batch IDs: x,y,z are associated with the 3 runs of these samples, one
## which is 8 m/z, one which is 20 m/z, and one which is a reordered 20 m/z.

test_norm <- normalize_expt(kept_testing, transform="log2", convert="cpm", norm="quantile", filter=TRUE)
test_normbatch <- sm(normalize_expt(kept_testing, transform="log2", convert="cpm",
                                    filter=TRUE, batch="limma"))
new_metrics <- sm(graph_metrics(kept_testing))
new_metrics_norm <- sm(graph_metrics(test_norm))
new_metrics_normbatch <- sm(graph_metrics(test_normbatch))

8.1.1 Plot only new data

Removing the January data seems to have made this a lot easier to look at. There might be some batchyness associated with reordering the samples, but I am thinking it is not huge.

tt = plot_topn(kept_testing, direct=TRUE)
new_metrics$legend
new_metrics$libsize
pp("images/kept_norm_topn.png", image=new_metrics$topnplot)
pp("images/kept_norm_hcd_pca.png", image=new_metrics_norm$pcaplot)
new_metrics_norm$tsneplot
pp("images/kept_norm_hcd_corheat.png", image=new_metrics_norm$corheat)
new_metrics_norm$disheat
new_metrics_normbatch$pcaplot
pp("images/normbatch_tsne.png", image=new_metrics_normbatch$tsneplot)

8.1.2 How is the variance?

Lets see what variancePartition has to say about this data.

new_varpart <- varpart(kept_testing, factors=c("condition", "batch"))
pp("images/varpart_partition.png", image=new_varpart$partition_plot)
## That is not terrible.

## the modified_expt slow from varpart() adds some metadata to the expressionset
## including the calculated % variance for condition/batch/residual for each peptide.
kept_testing <- new_varpart$modified_expt
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]
sum(pe_genes %in% rownames(exprs(protein_norm)))
## Error in exprs(protein_norm): object 'protein_norm' not found
found_pe_idx <- rownames(exprs(protein_expt)) %in% pe_genes
## Error in exprs(protein_expt): object 'protein_expt' not found
found_pe <- rownames(exprs(protein_expt))[found_pe_idx]
## Error in exprs(protein_expt): object 'protein_expt' not found
pe_expt <- exclude_genes_expt(protein_expt, method="keep", ids=found_pe)
## Error in exclude_genes_expt(protein_expt, method = "keep", ids = found_pe): object 'protein_expt' not found
plot_boxplot(pe_expt)
## Error in plot_boxplot(pe_expt): object 'pe_expt' not found
exprs(pe_expt)
## Error in exprs(pe_expt): object 'pe_expt' not found

8.2 Back to hpgltools

If you will recall, in the original block where I read the data into SWATH2stats, I invoked write_matrix_proteins() and write_matrix_peptides() and wrote out 2 csv files ‘swath2stats_protein_matrix.csv’ and ’_peptide_matrix.csv’. If my earlier work is correct, I should be able to use the protein matrix to get a truer sense of the relative abundances between my experimental conditions.

library(aLFQ)
alfq_process <- ProteinInference(alfq_input,
                                 peptide_method="top",
                                 peptide_topx=3,
                                 peptide_strictness="loose",
                                 peptide_summary="mean",
                                 transition_topx=3,
                                 transition_strictness="loose",
                                 transition_summary="sum",
                                 fasta=NA,
                                 model=NA,
                                 combine_precursors=FALSE)

9 Index version: 20180528

10 TODO

  • 2018-04-10: Make sure my invocations of SWATH2stats/MSstats are correct.
if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}
---
title: "M.tuberculosis 201805: Analyzing data from OpenSwathWorkFlow/TRIC."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
if (!isTRUE(get0("skip_load"))) {
  library(hpgltools)
  tt <- devtools::load_all("~/hpgltools")
  knitr::opts_knit$set(progress=TRUE,
                       verbose=TRUE,
                       width=90,
                       echo=TRUE)
  knitr::opts_chunk$set(error=TRUE,
                        fig.width=8,
                        fig.height=8,
                        dpi=96)
  old_options <- options(digits=4,
                         stringsAsFactors=FALSE,
                         knitr.duplicate.label="allow")
  ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
  ver <- "20180528"
  previous_file <- "01_preprocessing_comet_highres.Rmd"

  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))
  previous_file <- "01_annotation.Rmd"
  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))

  rmd_file <- "02_swath2stats_20180520.Rmd"
}
```

# TODO

* Remove MSstats logging. -- done
* Make explicit spearman correlations between methods. -- done
    * Do both for all data and for the top-50/bottom-50 -- done, but weird.
* Make 100% certain that the samples are annotated correctly. -- done.
* Limma/MSstats/EdgeR venn disagram for all up in CF
   * Repeat in all down CF
   * Repeat with a cutoff, top-50
* Individual plots for P/PE proteins.

Analyzing data from openMS and friends.
=======================================

In preprocessing_comet_highres.Rmd, I used the openMS tutorials and supplemental
materials from a couple papers to hopefully correctly perform the various
preprocessing tasks required to extract intensity data from DIA/SWATH
transitions.

The final steps of that process combined the transition intensities from every
sample into a metadata frile (results/tric/HCD_meta.tsv), an intensity matrix
(results/tric/HCD_outmatrix.tsv), and a feature aligned output matrix
(results/tric/aligned_comet_HCD.tsv).

My reading of the SWATH2stats and MSstats source code suggests to me that the
log2(intensities) of the feature aligned data are our final proxy for protein
abundance.  At first glance, this suggests to me that these data might follow a
distribution similar to RNASeq data (negative binomial, but perhaps with a
bigger tail?).  In addition, by the time we use tric on the data, we have a
count matrix and sample annotation data frames which look remarkably similar to
those used in a RNASeq expressionset.  Indeed, by the end of the MSstats
processing, it creates a MSnSet class of its own which uses fData/exprs/pData.

For the curious, my reasoning for saying that the log intensities are our proxy
for abundance comes from MSstats/R/DataProcess.R in a clause which looks like:

```{r snippet, eval=FALSE}
if (logTrans == 2) {
  work[["ABUNDANCE"]] <- log2(work[["ABUNDANCE"]])
} else if (logTrans == 10) {
  work[["ABUNDANCE"]] <- log10(work[["ABUNDANCE"]])
} else {
  ## Above there was a check for only log 2 and 10, but we can do e if we want.
  ## I might go back up there and remove that check. Long live e! 2.718282 rules!
  work[["ABUNDANCE"]] <- log(work[["ABUNDANCE"]]) / log(logTrans)
}
```

(Note: I added the natural log to the set of conditions, but otherwise the logic
is unchanged.)

With that in mind, I want to use some tools with which I am familiar in order to
try to understand these data.  Therefore I will first attempt to coerce my tric
aligned data and annotations into a 'normal' expressionset.  Then I want to do
some diagnostic plots which, if I am wrong and these distributions are not as
expected, will be conceptually incorrect (I don't yet think I am wrong).

## Sample annotation via SWATH2stats

I am using the SWATH2stats vignette as my primary source of information.  Thus I
see that it uses the
OpenSWATH_SM3_GoldStandardAutomatedResults_human_peakgroups.txt which has a
format nearly identical to my tric output matrix.  Thus for the moment I will
assume that the proper input for SWATH2stats is 'results/tric/comet_HCD.tsv' and
not the metadata nor output matrix.

I keep a sample sheet of all the DIA samples used in this analysis in
'sample_sheets/dia_samples.xlsx'  It should contain all the other required data
with one important caveat, I removed 1 sample by 'commenting' it (e.g. prefixing
it with '##' -- which is an admittedly dumb thing to do in an excel file.

One last caveat: I hacked the SWATH2stats sample_annotation() function to add a
couple columns in an attempt to make it a little more robust when faced with
sample sheets with differently named columns.

In addition, SWATH2stats provides some nice filtering and combination
functions which should be considered when generating various expressionset data
structures later.

```{r swath2stats_sample_annotations}
tric_data <- read.csv("results/tric_201805/comet_HCD.tsv", sep="\t")

sample_annot <- openxlsx::read.xlsx("sample_sheets/Mtb_dia_samples.xlsx")
rownames(sample_annot) <- make.names(sample_annot[["sampleid"]], unique=TRUE)
## Drop samples starting with comments
keep_idx <- ! grepl(pattern="##", x=sample_annot[["sampleid"]])
sample_annot <- sample_annot[keep_idx, ]
expt_idx <- sample_annot[["expt_id"]] == "may2018" | sample_annot[["expt_id"]] == "mar2018"
expt_idx[is.na(sample_annot[["expt_id"]])] <- FALSE
sample_annot <- sample_annot[expt_idx, ]
mz_idx <- sample_annot[["windowsize"]] == "8"
sample_annot <- sample_annot[mz_idx, ]
## Set the mzXML column to match the filename column in the data.
devtools::load_all("~/scratch/git/SWATH2stats")
## s2s, my witty way of shortening SWATH2stats...
s2s_exp <- sample_annotation(data=tric_data,
                             sample_annotation=sample_annot,
                             fullpeptidename_column="fullunimodpeptidename")
```

Now I have a couple data structures which should prove useful for the metrics
provided by SWATH2stats, MSstats, and my own hpgltools.

# SWATH2stats continued

Lets return to some of the metrics provided by swath2stats.

```{r swath2stats_processing}
## Get correlations on a sample by sample basis
pp(file="images/20180523_swath2stats_sample_cor.png")
sample_cor <- plot_correlation_between_samples(s2s_exp,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
sample_cond_rep_cor <- plot_correlation_between_samples(s2s_exp,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")
## I am a little concerned that these values do not seem to change when I took
## filtered/normalized data.  So I am rerunning them manually for a moment --
## perhaps I messed something up when I rewrote portions of the
## sample_annotation() function in SWATH2stats.

## ahh I think I see the problem.  The default value for fun.aggregate is NULL,
## which causes dcast to default to length.  I think this is not likely to be
## valid for this data.  I am not certain, however, what is the appropriate
## function.  If I had to guess, I would go with sum()?

assess_decoy_rate(s2s_exp)
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
chosen_mscore <- mscore4assayfdr(s2s_exp, FFT=0.7, fdr_target=0.02)
prot_score <- mscore4protfdr(s2s_exp, FFT=0.7, fdr_target=0.02)

mscore_filtered <- filter_mscore(s2s_exp, chosen_mscore)
data_filtered_mscore <- filter_mscore_freqobs(s2s_exp, 0.01, 0.8, rm.decoy=FALSE)
data_filtered_fdr <- filter_mscore_fdr(mscore_filtered, FFT=0.7,
                                       overall_protein_fdr_target=prot_score,
                                       upper_overall_peptide_fdr_limit=0.05)
only_proteotypic <- filter_proteotypic_peptides(data_filtered_fdr)
all_filtered <- filter_all_peptides(only_proteotypic)
only_strong <- filter_on_max_peptides(data=all_filtered, n_peptides=10)
only_minimum <- filter_on_min_peptides(data=only_strong, n_peptides=3)

## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
protein_matrix_all <- write_matrix_proteins(
  s2s_exp, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/protein_all.csv"))
dim(protein_matrix_all)
protein_matrix_mscore <- write_matrix_proteins(
  mscore_filtered, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/protein_matrix_mscore.csv"))
dim(protein_matrix_mscore)
peptide_matrix_mscore <- write_matrix_peptides(
  mscore_filtered, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/peptide_matrix_mscore.csv"))
dim(peptide_matrix_mscore)
protein_matrix_minimum <- write_matrix_proteins(
  only_minimum, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/protein_matrix_minimum.csv"))
dim(protein_matrix_minimum)
peptide_matrix_minimum <- write_matrix_peptides(
  only_minimum, write.csv=TRUE,
  filename=paste0("results/swath2stats_", ver, "/peptide_matrix_minimum.csv"))
dim(peptide_matrix_minimum)

rt_cor <- plot_correlation_between_samples(only_minimum,
                                           column.values="intensity",
                                           fun.aggregate=sum)
## I have no effing clue what this plot means.
variation <- plot_variation(only_minimum, fun.aggregate=sum,
                            comparison=transition_group_id ~ condition)

## Something in SWATH2stats::disaggregate was written poorly and is looking for
## a variable named 'cols'
cols <- colnames(only_minimum)
disaggregated <- disaggregate(only_minimum, all.columns=TRUE)
msstats_input <- convert4MSstats(disaggregated)

##alfq_input <- sm(convert4aLFQ(disaggregated))
##mapdia_input <- sm(convert4mapDIA(disaggregated, RT=TRUE))
```

## MSstats

msstats.org seems to provide a complete solution for performing reasonable metrics of this data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

```{r msstats_quant}
devtools::load_all("~/scratch/git/MSstats")
msstats_quant <- sm(dataProcess(msstats_input))
msstats_plots <- sm(dataProcessPlots(msstats_quant, type="QCPLOT"))

my_levels <- levels(as.factor(msstats_input$condition))
my_levels
comparisons <- ghetto_contrast_matrix(
  numerators=c("wt_cf", "delta_cf", "comp_cf",
               "delta_cf", "comp_cf", "delta_whole",
               "comp_whole"),
  denominators=c("wt_whole", "delta_whole", "comp_whole",
                 "wt_cf", "wt_cf", "wt_whole",
                 "wt_whole"))
results <- list()
for (c in 1:length(comparisons)) {
  name <- rownames(comparisons)[c]
  message("Starting ", name)
  comp <- comparisons[c, ]
  comparison <- t(as.matrix(comp))
  rownames(comparison) <- name
  results[name] <- sm(MSstats::groupComparison(contrast.matrix=comparison,
                                               data=msstats_quant))
}
```

### P/PE protein QC plots for Yan

Yan asked for the p/pe protein qc plots. ok.  I changed the dataProcessPlots to
return something useful, so that should be possible now.

```{r pe, eval=FALSE}
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]

## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="", x=levels(msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots

pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in 1:length(pe_plots)) {
  plot(pe_plots[[p]])
}
dev.off()
```

# Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities from
SWATH2stats, metadata, and annotation data;  attempt to create a 'normal'
expressionset; poke at it to see what I can learn.

## Massaging the metadata

I want to use the same metadata as were used for MSstats.  It has a few
important differences from the requirements of hpgltools: pretty much only that
I do not allow rownames/sampleIDs to start with a number.

```{r protein_peptide_matrices}
metadata <- sample_annot
metadata[["sampleid"]] <- paste0("s", metadata[["sampleid"]])
rownames(metadata) <- metadata[["sampleid"]]
```

## Massaging the gene annotation data and adding the msstats data.

I have my own annotation data from the gff file/microbesonline/whatever, I can
add the MSstats result to it so that later I can print them all together.

```{r msstats_annotations}
## Here is a neat little thing I can do:  Add the MSstats results to my annotation data.
## Then when I print out the tables of the limma/etc results, they MSstats
## results will come along for free.
big_table <- results[[1]]
for (r in 2:length(results)) {
  big_table <- merge(big_table, results[[r]], by="Protein")
}
rownames(big_table) <- big_table[["Protein"]]
big_table <- big_table[, -1]

mtb_annotations_with_msstats <- merge(mtb_annotations, big_table,
                                      by="row.names", all.x=TRUE)
rownames(mtb_annotations_with_msstats) <- mtb_annotations_with_msstats[["Row.names"]]
mtb_annotations_with_msstats <- mtb_annotations_with_msstats[, -1]
```

## Massaging the intensity matrix

I do not want the \1 before the protein names, I already merged them into one
entry per gene vis SWATH2stats.

```{r protein_matrix}
prot_mtrx <- read.csv(paste0("results/swath2stats_", ver, "/protein_matrix_minimum.csv"))
rownames(prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=prot_mtrx[["proteinname"]])
prot_mtrx <- prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
fun <- gsub(pattern="^.*_(2018.*$)", replacement="\\1", x=colnames(prot_mtrx))
colnames(prot_mtrx) <- paste0("s", fun)
```

## Merge the pieces

Now we should have sufficient pieces to make an expressionset.

```{r expt}
## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(prot_mtrx)
metadata <- metadata[reordered, ]

protein_expt <- create_expt(metadata,
                            count_dataframe=prot_mtrx,
                            gene_info=mtb_annotations_with_msstats)
whole_expt <- subset_expt(protein_expt, subset="collectiontype=='whole'")
cf_expt <- subset_expt(protein_expt, subset="collectiontype=='cf'")
```

```{r protein_metrics, fig.show='hide'}
protein_metrics <- sm(graph_metrics(protein_expt))
protein_norm <- sm(normalize_expt(protein_expt, transform="log2", convert="cpm",
                                  norm="quant", filter=TRUE))
protein_norm_metrics <- sm(graph_metrics(protein_norm))
protein_fsva <- sm(normalize_expt(protein_expt, transform="log2", convert="cpm",
                                  batch="fsva", filter=TRUE))
protein_fsva_metrics <- sm(graph_metrics(protein_fsva))
```

```{r whole_metrics, fig.show='hide'}
whole_metrics <- sm(graph_metrics(whole_expt))
whole_norm <- sm(normalize_expt(whole_expt, transform="log2", convert="cpm",
                                  norm="quant", filter=TRUE))
whole_norm_metrics <- sm(graph_metrics(whole_norm))
whole_fsva <- sm(normalize_expt(whole_expt, transform="log2", convert="cpm",
                                  batch="fsva", filter=TRUE))
whole_fsva_metrics <- sm(graph_metrics(whole_fsva))
```

```{r cf_metrics, fig.show='hide'}
cf_metrics <- sm(graph_metrics(cf_expt))
cf_norm <- sm(normalize_expt(cf_expt, transform="log2", convert="cpm",
                                  norm="quant", filter=TRUE))
cf_norm_metrics <- sm(graph_metrics(cf_norm))
cf_fsva <- sm(normalize_expt(cf_expt, transform="log2", convert="cpm",
                                  batch="fsva", filter=TRUE))
cf_fsva_metrics <- sm(graph_metrics(cf_fsva))
```

```{r print_metrics}
pp(image=protein_metrics$libsize, file="images/20180523_libsize.png")
pp(image=protein_norm_metrics$pcaplot, file="images/20180523_norm_pca.png")
pp(image=protein_fsva_metrics$pcaplot, file="images/20180523_fsva_pca.png")
pp(image=protein_norm_metrics$corheat, file="images/20180523_norm_corheat.png")
pp(image=protein_metrics$density, file="images/20180523_raw_density.png")
pp(image=protein_metrics$boxplot, file="images/20180523_boxplot.png")

pp(image=whole_metrics$libsize, file="images/20180523_whole_libsize.png")
pp(image=whole_norm_metrics$pcaplot, file="images/20180523_whole_norm_pca.png")
pp(image=whole_fsva_metrics$pcaplot, file="images/20180523_whole_fsva_pca.png")
pp(image=whole_norm_metrics$corheat, file="images/20180523_whole_norm_corheat.png")
pp(image=whole_metrics$density, file="images/20180523_whole_raw_density.png")
pp(image=whole_metrics$boxplot, file="images/20180523_whole_boxplot.png")

pp(image=cf_metrics$libsize, file="images/20180523_libsize.png")
pp(image=cf_norm_metrics$pcaplot, file="images/20180523_norm_pca.png")
pp(image=cf_fsva_metrics$pcaplot, file="images/20180523_fsva_pca.png")
pp(image=cf_norm_metrics$corheat, file="images/20180523_norm_corheat.png")
pp(image=cf_metrics$density, file="images/20180523_raw_density.png")
pp(image=cf_metrics$boxplot, file="images/20180523_boxplot.png")
```

# Attempt some quantification comparisons?

```{r de_testing}
pairwise_filt <- sm(normalize_expt(protein_expt, filter=TRUE))
pairwise_comp <- sm(all_pairwise(pairwise_filt, model_batch="fsva", force=TRUE))
```

# For each msstats run, do a DE table

```{r multi_tales}
keepers <- list(
  "wt_cf_minus_wt_whole" = c("wt_cf", "wt_whole"),
  "delta_cf_minus_delta_whole" = c("delta_cf", "delta_whole"),
  "comp_cf_minus_comp_whole" = c("comp_cf", "comp_whole"),
  "delta_cf_minus_wt_cf" = c("delta_cf", "wt_cf"),
  "comp_cf_minus_wt_cf" = c("comp_cf", "wt_cf"),
  "delta_whole_minus_wt_whole" = c("delta_whole", "wt_whole"),
  "comp_whole_minus_wt_whole" = c("comp_whole", "wt_whole"))
pairwise_tables <- sm(combine_de_tables(
  pairwise_comp,
  excel=paste0("excel/pairwise_de_with_msstats-v", ver, ".xlsx"),
  keepers=keepers))
pairwise_sig <- sm(extract_significant_genes(
  pairwise_tables,
  excel=paste0("excel/test_pairwise_sig_with_msstats-v", ver, ".xlsx")))
```

```{r msstats_only, eval=FALSE}
droppers <- c("undefined")
names(droppers) <- "log2fc"
pairwise_onlyall <- sm(combine_de_tables(
  pairwise_comp,
  excel=paste0("excel/test_pairwise_de_only_msstats-v", ver, ".xlsx"),
  keepers=keepers,
  excludes=droppers))
```

```{r solo_features}
solo_proteins <- features_in_single_condition(protein_expt)
proteins_only_cf <- solo_proteins[["solo_this"]][["wt_cf"]]
proteins_only_cf
proteins_only_whole <- solo_proteins[["solo_this"]][["wt_whole"]]
length(proteins_only_whole)
```

## Compare hpgltools/MSstats

Can we compare limma and riends to MSstats?

### wt cf / wt whole

```{r compare_comparisons}
hpgl_table <- pairwise_tables[["data"]][["wt_cf_vs_wt_whole"]]
msstats_table <- results[["


msstats
msstats_table <- msstats_comparison$ComparisonResult
rownames(msstats_table) <- gsub(pattern="^1/", replacement="", x=msstats_table$Protein)

merged_table <- merge(hpgl_table, msstats_table, by="row.names")
write.csv(file="images/merged_table.csv", merged_table)
cor.test(merged_table$limma_logfc, merged_table$log2FC)
cor.test(merged_table$limma_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$deseq_logfc, merged_table$log2FC)
cor.test(merged_table$deseq_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$edger_logfc, merged_table$log2FC)
cor.test(merged_table$edger_logfc, merged_table$log2FC, method="spearman")

combined_table <- pairwise_tables$data[[1]]
undefined_idx <- combined_table[["log2fc"]] == "undefined"
combined_table[undefined_idx, "log2fc"] <- NA
combined_table[["log2fc"]] <- as.numeric(combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]], method="spearman")

high_to_low_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=TRUE)
low_to_high_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=FALSE)
top_50 <- head(combined_table[high_to_low_idx, ], n=100)
bottom_50 <- head(combined_table[low_to_high_idx, ], n=100)

cor.test(top_50$limma_logfc, top_50$log2fc)
cor.test(top_50$limma_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$deseq_logfc, top_50$log2fc)
cor.test(top_50$deseq_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$edger_logfc, top_50$log2fc)
cor.test(top_50$edger_logfc, top_50$log2fc, method="spearman")

cor.test(bottom_50$limma_logfc, bottom_50$log2fc)
cor.test(bottom_50$limma_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc)
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$edger_logfc, bottom_50$log2fc)
cor.test(bottom_50$edger_logfc, bottom_50$log2fc, method="spearman")

up_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] > 0]
up_in_msstats <- up_in_msstats[!is.na(up_in_msstats)]
up_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] > 0]
up_in_limma <- up_in_limma[!is.na(up_in_limma)]
up_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] > 0]
up_in_edger <- up_in_edger[!is.na(up_in_edger)]
up_venn_sets <- list(
  "msstats" = up_in_msstats,
  "limma" = up_in_limma,
  "edger" = up_in_edger)
testing <- Vennerable::Venn(Sets=up_venn_sets, )
pp(file="/tmp/up_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
Vennerable::plot(testing, doWeights=FALSE)

down_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] < 0]
down_in_msstats <- down_in_msstats[!is.na(down_in_msstats)]
down_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] < 0]
down_in_limma <- down_in_limma[!is.na(down_in_limma)]
down_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] < 0]
down_in_edger <- down_in_edger[!is.na(down_in_edger)]
down_venn_sets <- list(
  "msstats" = down_in_msstats,
  "limma" = down_in_limma,
  "edger" = down_in_edger)
testing <- Vennerable::Venn(Sets=down_venn_sets, )
pp(file="/tmp/down_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
```

# Repeat with the modified data

```{r compare_comparisons_modified}
hpgl_table <- pairwise_modified_tables$data[[1]]
msstats_table <- msstats_modified_comp$ComparisonResult
rownames(msstats_table) <- gsub(pattern="^1/", replacement="", x=msstats_table_modified$Protein)

merged_table <- merge(hpgl_table, msstats_table, by="row.names")
write.csv(file="images/merged_table_modified.csv", merged_table)
cor.test(merged_table$limma_logfc, merged_table$log2FC)
cor.test(merged_table$limma_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$deseq_logfc, merged_table$log2FC)
cor.test(merged_table$deseq_logfc, merged_table$log2FC, method="spearman")
cor.test(merged_table$edger_logfc, merged_table$log2FC)
cor.test(merged_table$edger_logfc, merged_table$log2FC, method="spearman")

combined_table <- pairwise_tables$data[[1]]
undefined_idx <- combined_table[["log2fc"]] == "undefined"
combined_table[undefined_idx, "log2fc"] <- NA
combined_table[["log2fc"]] <- as.numeric(combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]])
cor.test(combined_table[["limma_logfc"]], combined_table[["log2fc"]], method="spearman")

high_to_low_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=TRUE)
low_to_high_idx <- order(combined_table[["log2fc"]], na.last=TRUE, decreasing=FALSE)
top_50 <- head(combined_table[high_to_low_idx, ], n=100)
bottom_50 <- head(combined_table[low_to_high_idx, ], n=100)

cor.test(top_50$limma_logfc, top_50$log2fc)
cor.test(top_50$limma_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$deseq_logfc, top_50$log2fc)
cor.test(top_50$deseq_logfc, top_50$log2fc, method="spearman")
cor.test(top_50$edger_logfc, top_50$log2fc)
cor.test(top_50$edger_logfc, top_50$log2fc, method="spearman")

cor.test(bottom_50$limma_logfc, bottom_50$log2fc)
cor.test(bottom_50$limma_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc)
cor.test(bottom_50$deseq_logfc, bottom_50$log2fc, method="spearman")
cor.test(bottom_50$edger_logfc, bottom_50$log2fc)
cor.test(bottom_50$edger_logfc, bottom_50$log2fc, method="spearman")

up_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] > 0]
up_in_msstats <- up_in_msstats[!is.na(up_in_msstats)]
up_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] > 0]
up_in_limma <- up_in_limma[!is.na(up_in_limma)]
up_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] > 0]
up_in_edger <- up_in_edger[!is.na(up_in_edger)]
up_venn_sets <- list(
  "msstats" = up_in_msstats,
  "limma" = up_in_limma,
  "edger" = up_in_edger)
testing <- Vennerable::Venn(Sets=up_venn_sets, )
pp(file="/tmp/up_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
Vennerable::plot(testing, doWeights=FALSE)

down_in_msstats <- rownames(combined_table)[combined_table[["log2fc"]] < 0]
down_in_msstats <- down_in_msstats[!is.na(down_in_msstats)]
down_in_limma <- rownames(combined_table)[combined_table[["limma_logfc"]] < 0]
down_in_limma <- down_in_limma[!is.na(down_in_limma)]
down_in_edger <- rownames(combined_table)[combined_table[["edger_logfc"]] < 0]
down_in_edger <- down_in_edger[!is.na(down_in_edger)]
down_venn_sets <- list(
  "msstats" = down_in_msstats,
  "limma" = down_in_limma,
  "edger" = down_in_edger)
testing <- Vennerable::Venn(Sets=down_venn_sets, )
pp(file="/tmp/down_venn.png")
Vennerable::plot(testing, doWeights=FALSE)
dev.off()
```

# Everything below here might get dropped?

I think I wedged all of the following work into that short block above.
So, stop evaluating the following blocks and see if that is true.

### Expressionset from the TRIC intensity matrix

```{r hpgltools_expt, eval=FALSE}
intensity_mtrx <- read.csv(file="results/tric/HCD_outmatrix.tsv", sep="\t")
## The HCD_outmatrix has columns including: Peptide, Protein,
## A series of Intensity_sample_id columns
## A series of RT_sample_id columns
## A series of score_sample_id columns
## And at the end: RT_mean, RT_std, and pg_pvalue.
## Since SWATH2stats/MSstats uses the intensities, lets get those columns out,
## standardize the column names to match the annotation data, and use them for
## creating an expressionset.
intensity_mtrx[["rownames"]] <- intensity_mtrx[["Protein"]]
intensity_mtrx[["rownames"]] <- gsub(pattern="^[[:digit:]]+\\/",
                                     replacement="", x=intensity_mtrx[["rownames"]])
## Standardize the rownames, this might be a bad idea, as this will keep
## separate every peptide from each protein and not do anything to
## sum/median/whatever them.  But for the purposes of testing out the data I
## think it is ok.
rownames(intensity_mtrx) <- make.names(intensity_mtrx[["rownames"]], unique=TRUE)

## Now lets get rid of the extraneous text in the column names and simplify them
## to the sample names as referenced in the sample sheet.
all_columns <- colnames(intensity_mtrx)
intense_columns <- grepl(pattern="Intensity", x=all_columns)
intensity_mtrx <- intensity_mtrx[, intense_columns]
all_columns <- colnames(intensity_mtrx)
new_columns <- gsub(pattern="Intensity_", replacement="", x=all_columns)
new_columns <- gsub(pattern="_dia_.*$", replacement="", x=new_columns)
colnames(intensity_mtrx) <- paste0("s", new_columns)
intensity_mtrx[is.na(intensity_mtrx)] <- 0

## No columns in an expression set are allowed to start with a number.  I have
## unsed prefixing samples with 's' as a standard to handle this problem...
metadata <- sample_annot
metadata$sampleid <- paste0("s", metadata$sampleid)
colnames(intensity_mtrx) <- gsub(pattern="_vs_HCD", replacement="", x=colnames(intensity_mtrx))
rownames(metadata) <- metadata$sampleid
## Theoretically this is not needed anymore...
intensity_mtrx <- intensity_mtrx[, rownames(metadata)]
## Ok, at this point, we should have all the pieces for a more or less normal expressionset.
dim(intensity_mtrx)
test_expt <- create_expt(metadata=metadata,
                         count_dataframe=intensity_mtrx,
                         gene_info=mtb_annotations)
```

### Play with the hpgltools derived expressionset of peptide intensities

Lets see if anything makes sense in this intensity expressionset.

```{r hpgltools_expt_examine, fig.show="hide", eval=FALSE}
## First, log2 and normalize the data.
test_norm <- normalize_expt(test_expt, transform="log2", convert="cpm", norm="quantile", filter=TRUE)
test_normbatch <- sm(normalize_expt(test_expt, transform="log2", convert="cpm",
                                    norm="quantile", filter=TRUE, batch="limma"))

test_metrics <- sm(graph_metrics(test_expt))
test_metrics_norm <- sm(graph_metrics(test_norm))
test_metrics_normbatch <- sm(graph_metrics(test_normbatch))
```

Now lets see what the data looks like, assuming I did not do anything horrible
to it.

```{r hpgltools_plots, eval=FALSE}
test_metrics$legend
test_metrics$libsize  ## Wow the intensities get ridiculous, what?
test_metrics$nonzero
## Interesting, but nothing which super-jumps out at me
test_metrics$density
## hmm very interesting, the good news is that they all have basically the same
## distribution.  I did not expect the CF samples to have such a stronger
## distribution though.

test_metrics_norm$corheat
## This suggests to me a likely batch effect
test_metrics_norm$disheat
test_metrics$smc
test_metrics$tsneplot
## Yeah there is some wonkyness between the old/new samples.

test_metrics_normbatch$pcaplot
## Wow, this is after invoking 'removeBatchEffect' from limma.  That suggests
## pretty strongly to me that we should probalby not examine the new and old
## data together.  In addition I am thinking that one whole-cell lysate sample
## might not be.
```

## Remove old batch/weirdo sample

```{r removals, fig.show="hide", eval=FALSE}
kept_testing <- subset_expt(test_expt,
                            subset="bioreplicate == 'mar'")
## The new batch IDs: x,y,z are associated with the 3 runs of these samples, one
## which is 8 m/z, one which is 20 m/z, and one which is a reordered 20 m/z.

test_norm <- normalize_expt(kept_testing, transform="log2", convert="cpm", norm="quantile", filter=TRUE)
test_normbatch <- sm(normalize_expt(kept_testing, transform="log2", convert="cpm",
                                    filter=TRUE, batch="limma"))
new_metrics <- sm(graph_metrics(kept_testing))
new_metrics_norm <- sm(graph_metrics(test_norm))
new_metrics_normbatch <- sm(graph_metrics(test_normbatch))
```

### Plot only new data

Removing the January data seems to have made this a lot easier to look at.
There might be some batchyness associated with reordering the samples, but I am
thinking it is not huge.


```{r new_data_plots, eval=FALSE}
tt = plot_topn(kept_testing, direct=TRUE)
new_metrics$legend
new_metrics$libsize
pp("images/kept_norm_topn.png", image=new_metrics$topnplot)
pp("images/kept_norm_hcd_pca.png", image=new_metrics_norm$pcaplot)
new_metrics_norm$tsneplot
pp("images/kept_norm_hcd_corheat.png", image=new_metrics_norm$corheat)
new_metrics_norm$disheat
new_metrics_normbatch$pcaplot
pp("images/normbatch_tsne.png", image=new_metrics_normbatch$tsneplot)
```

### How is the variance?

Lets see what variancePartition has to say about this data.

```{r varpart, eval=FALSE}
new_varpart <- varpart(kept_testing, factors=c("condition", "batch"))
pp("images/varpart_partition.png", image=new_varpart$partition_plot)
## That is not terrible.

## the modified_expt slow from varpart() adds some metadata to the expressionset
## including the calculated % variance for condition/batch/residual for each peptide.
kept_testing <- new_varpart$modified_expt
```

```{r pe_proteins}
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]
sum(pe_genes %in% rownames(exprs(protein_norm)))
found_pe_idx <- rownames(exprs(protein_expt)) %in% pe_genes
found_pe <- rownames(exprs(protein_expt))[found_pe_idx]
pe_expt <- exclude_genes_expt(protein_expt, method="keep", ids=found_pe)
plot_boxplot(pe_expt)
exprs(pe_expt)
```


## Back to hpgltools

If you will recall, in the original block where I read the data into
SWATH2stats, I invoked write_matrix_proteins() and write_matrix_peptides() and
wrote out 2 csv files 'swath2stats_protein_matrix.csv' and
'_peptide_matrix.csv'.  If my earlier work is correct, I should be able to use
the protein matrix to get a truer sense of the relative abundances between my
experimental conditions.

```{r alfq_quant, eval=FALSE}
library(aLFQ)
alfq_process <- ProteinInference(alfq_input,
                                 peptide_method="top",
                                 peptide_topx=3,
                                 peptide_strictness="loose",
                                 peptide_summary="mean",
                                 transition_topx=3,
                                 transition_strictness="loose",
                                 transition_summary="sum",
                                 fasta=NA,
                                 model=NA,
                                 combine_precursors=FALSE)
```

# Index version: `r ver`

# TODO

* 2018-04-10:  Make sure my invocations of SWATH2stats/MSstats are correct.

```{r saveme, eval=FALSE}
if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}
```
