1 Analyzing data from openMS and friends.

1.1 SWATH2stats preprocessing

I am using my slightly modified copy of SWATH2stats. This seeks to ensure that changes in the case of columns in the metadata from one version of OpenMS to another do not trouble me.

1.1.1 Creating a swath2stats experiment using the tuberculist-derived library data

There is one important caveat in the following block: I used a regex to remove the second half of geneID_geneName so that later when I merge in the annotation data I have it will match.

1.2 Some new plots

In response to some interesting queries from Yan, I made a few little functions which query and plot data from the scored data provided by openswath/pyprophet. Let us look at their results here.

## Writing the image to: images/all_masses_observed-v20191021.png and calling dev.off().

## Writing the image to: images/all_drt_observed-v20191021.png and calling dev.off().

## Writing the image to: images/all_intensities_observed-v20191021.png and calling dev.off().

## Writing the image to: images/all_peptide_identifications-v20191021.png and calling dev.off().

## Writing the image to: images/all_protein_identifications-v20191021.png and calling dev.off().

## Writing the image to: images/all_counts_vs_intensities-v20191021.png and calling dev.off().

## Writing the image to: images/two_samples_wc_cf_dscore_mass-v20191021.png and calling dev.off().

## Writing the image to: images/two_samples_mass_rt-v20191021.png and calling dev.off().

## Writing the image to: images/two_samples_wc_cf_intensity_mass-v20191021.png and calling dev.off().

## Writing the image to: images/two_samples_wc_cf_rt_mscore-v20191021.png and calling dev.off().

## Writing the image to: images/two_samples_wc_cf_rt_width-v20191021.png and calling dev.off().

## Writing the image to: images/whole_lwidths_vs_counts.png and calling dev.off().

1.3 Show a series of all identifications for marker proteins

There are a few proteins for which Volker has relatively specific assumptions/expectations. Let us see what they look like and if they follow a trend which makes some sense…

The primary thing to recall, I think, is that in our previous data sets, there were a pretty large number of samples for which no identifications were made for many of these proteins. Does that remain true?

1.4 Grab annotations

## The species being downloaded is: Mycobacterium tuberculosis H37Rv
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=83332;export=tab

I will use the annotations to make finding the correct protein IDs easier.

##        gene name         description
## Rv0287 esxG esxG esat-6 like protein
## Writing the image to: images/whole_osw_esxG_intensities-v20191021.png and calling dev.off().

##        gene name                            description
## Rv0288 esxH esxH low molecular weight protein antigen 7
## Writing the image to: images/whole_osw_esxH_intensities-v20191021.png and calling dev.off().

##        gene name                          description
## Rv3763 lpqH lpqH 19 kda lipoprotein antigen precursor
## Writing the image to: images/whole_osw_lpqh_intensities-v20191021.png and calling dev.off().

## Writing the image to: images/whole_osw_groel1_intensities-v20191021.png and calling dev.off().

## Writing the image to: images/whole_osw_groel2_intensities-v20191021.png and calling dev.off().

##        gene name                               description
## Rv1860  apa  apa alanine and proline rich secreted protein
## Writing the image to: images/whole_osw_fap_intensities-v20191021.png and calling dev.off().

## Writing the image to: images/whole_osw_ppe1_intensities-v20191021.png and calling dev.off().
## Warning: Removed 35 rows containing non-finite values (stat_ydensity).
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: Removed 35 rows containing non-finite values (stat_ydensity).
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

##         gene name                           description
## Rv1908c katG katG catalase-peroxidase-peroxynitritase T
## Writing the image to: images/whole_osw_katg_intensities-v20191021.png and calling dev.off().

##          gene  name        description
## Rv2430c PPE41 PPE41 PPE family protein
## Adding 2019_0801Briken01
## Adding 2019_0801Briken04
## Adding 2019_0801Briken07
## Adding 2019_0801Briken10
## Adding 2019_0801Briken13
## Adding 2019_0801Briken16
## Adding 2019_0709Briken01
## Adding 2019_0709Briken04
## Adding 2019_0709Briken07
## Adding 2019_0709Briken10
## Adding 2019_0709Briken13
## Adding 2019_0709Briken16
## Adding 2019_0801Briken02
## Adding 2019_0801Briken08
## Adding 2019_0801Briken11
## Adding 2019_0801Briken14
## Adding 2019_0801Briken17
## Adding 2019_0709Briken02
## Adding 2019_0709Briken05
## Adding 2019_0709Briken08
## Adding 2019_0709Briken11
## Adding 2019_0709Briken14
## Adding 2019_0709Briken17
## Adding 2019_0801Briken03
## Adding 2019_0801Briken06
## Adding 2019_0801Briken09
## Adding 2019_0801Briken12
## Adding 2019_0801Briken15
## Adding 2019_0801Briken18
## Adding 2019_0709Briken03
## Adding 2019_0709Briken06
## Adding 2019_0709Briken09
## Adding 2019_0709Briken12
## Adding 2019_0709Briken15
## Adding 2019_0709Briken18
## Writing the image to: images/PPE41_intensities-v20191021.png and calling dev.off().

##         gene name       description
## Rv2431c PE25 PE25 PE family protein
## Adding 2019_0801Briken01
## Adding 2019_0801Briken04
## Adding 2019_0801Briken07
## Adding 2019_0801Briken10
## Adding 2019_0801Briken13
## Adding 2019_0801Briken16
## Adding 2019_0709Briken01
## Adding 2019_0709Briken04
## Adding 2019_0709Briken07
## Adding 2019_0709Briken10
## Adding 2019_0709Briken13
## Adding 2019_0709Briken16
## Adding 2019_0801Briken02
## Adding 2019_0801Briken08
## Adding 2019_0801Briken11
## Adding 2019_0801Briken14
## Adding 2019_0801Briken17
## Adding 2019_0709Briken02
## Adding 2019_0709Briken05
## Adding 2019_0709Briken08
## Adding 2019_0709Briken11
## Adding 2019_0709Briken14
## Adding 2019_0709Briken17
## Adding 2019_0801Briken03
## Adding 2019_0801Briken06
## Adding 2019_0801Briken09
## Adding 2019_0801Briken12
## Adding 2019_0801Briken15
## Adding 2019_0801Briken18
## Adding 2019_0709Briken03
## Adding 2019_0709Briken06
## Adding 2019_0709Briken09
## Adding 2019_0709Briken12
## Adding 2019_0709Briken15
## Adding 2019_0709Briken18
## Writing the image to: images/PE25_intensities-v20191021.png and calling dev.off().

##        gene name       description
## Rv3477 PE31 PE31 PE family protein
## Writing the image to: images/PE31_intensities-v20191021.png and calling dev.off().

##        gene name                            description
## Rv0288 esxH esxH low molecular weight protein antigen 7
## Writing the image to: images/esxH_intensities-v20191021.png and calling dev.off().

##          gene   name          description
## Rv1748 Rv1748 Rv1748 hypothetical protein
## Writing the image to: images/Rv1748_intensities-v20191021.png and calling dev.off().
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning in max(data$density): no non-missing arguments to max; returning -Inf

##        gene name                                       description
## Rv1746 pknF pknF anchored-membrane serine/threonine-protein kinase
## Writing the image to: images/pknF_intensities-v20191021.png and calling dev.off().

##         gene name                     description
## Rv0410c pknG pknG serine/threonine-protein kinase
## Writing the image to: images/pknG_intensities-v20191021.png and calling dev.off().

2 Start SWATH2stats

I want to load the data and metadata into SWATH2stats in preparation for MSstats and my own hpgltools-base analyses.

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   run_id = col_character(),
##   filename = col_character(),
##   Sequence = col_character(),
##   FullPeptideName = col_character(),
##   aggr_Peak_Area = col_logical(),
##   aggr_Peak_Apex = col_logical(),
##   aggr_Fragment_Annotation = col_logical(),
##   ProteinName = col_character(),
##   align_runid = col_character(),
##   align_origfilename = col_character()
## )
## See spec(...) for full column specifications.
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   run_id = col_character(),
##   filename = col_character(),
##   Sequence = col_character(),
##   FullPeptideName = col_character(),
##   aggr_Peak_Area = col_logical(),
##   aggr_Peak_Apex = col_logical(),
##   aggr_Fragment_Annotation = col_logical(),
##   ProteinName = col_character(),
##   align_runid = col_character(),
##   align_origfilename = col_character()
## )
## See spec(...) for full column specifications.
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   FigureReplicate = col_logical(),
##   `Figure Name` = col_logical(),
##   `Sample Description` = col_logical(),
##   `Replicate State` = col_logical(),
##   enzyme = col_logical(),
##   harvestdate = col_logical(),
##   prepdate = col_logical(),
##   rundate = col_logical(),
##   runinfo = col_logical(),
##   tuberculist_scored = col_logical(),
##   include_exclude = col_logical(),
##   Run_note = col_logical()
## )
## See spec(...) for full column specifications.
## Loading SWATH2stats
## Found the same mzXML files in the annotations and data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken01.mzXML has 4377 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken02.mzXML has 3590 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken03.mzXML has 4163 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken04.mzXML has 4355 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken05.mzXML has 3636 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken06.mzXML has 4003 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken07.mzXML has 4573 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken08.mzXML has 3614 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken09.mzXML has 4418 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken10.mzXML has 4960 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken11.mzXML has 5853 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken12.mzXML has 6303 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken13.mzXML has 2828 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken14.mzXML has 6987 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken15.mzXML has 4934 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken16.mzXML has 3510 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken17.mzXML has 4010 entries in the data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken18.mzXML has 6040 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken01.mzXML has 13393 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken02.mzXML has 13968 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken03.mzXML has 14417 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken04.mzXML has 11498 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken06.mzXML has 14028 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken07.mzXML has 12127 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken08.mzXML has 13509 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken09.mzXML has 13867 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken10.mzXML has 13339 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken11.mzXML has 8334 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken12.mzXML has 13904 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken13.mzXML has 12833 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken14.mzXML has 8462 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken15.mzXML has 13365 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken16.mzXML has 12468 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken17.mzXML has 8173 entries in the data.
## preprocessing/01mzXML/dia/20190801/2019_0801Briken18.mzXML has 13436 entries in the data.

Now I have a couple data structures which should prove useful for the metrics provided by SWATH2stats, MSstats, and my own hpgltools.

3 SWATH2stats continued

The various metrics and filters provided by SWATH2stats seem quite reasonable to me. The only thing that really bothers me is that they are all case sensitive and I found that the most recent tric changed the capitalization of a column, causing these to all fall down. Therefore I went in and made everything case insensitive in a fashion similar to that done by MSstats (except I hate capital letters, so I used tolower() rather than toupper()).

3.1 Perform filters

The following block performs the metrics and filters suggested by swath2stats. These first define the decoy hit rate in the data, then filter the data based on that. It also filters out hits with less than ideal m-scores and proteins with non-optimal distributions of peptide hits (either due to too few peptides or a weird distribution of intensities).

## Going to write the image to: images/s2s_correlation-v20191021.png when dev.off() is called.
## png 
##   2
## Going to write the image to: images/s2_correlation.png when dev.off() is called.
## png 
##   2

3.2 Perform SWATH2stats filters

I just realized something which should be added to me SWATH2stats fork: A simplified filter functions which invokes all of these so that I can make sure that there are no typeographikal errors introduced by my invocation of each of these things, one at a time.

## Number of non-decoy peptides: 17968
## Number of decoy peptides: 1413
## Decoy rate: 0.0786
## The average FDR by run on assay level is 0.01
## The average FDR by run on peptide level is 0.011
## The average FDR by run on protein level is 0.043
## Target assay FDR: 0.02
## Required overall m-score cutoff: 0.0039811
## achieving assay FDR: 0.0196
## Target protein FDR: 0.02
## Required overall m-score cutoff: 0.00079433
## achieving protein FDR: 0.019
## Original dimension: 288943, new dimension: 257900, difference: 31043.
## Peptides need to have been quantified in more conditions than: 28 in order to pass this percentage-based threshold.
## Fraction of peptides selected: 0.14
## Original dimension: 288943, new dimension: 95329, difference: 193614.
## Target protein FDR: 0.000794328234724281
## Required overall m-score cutoff: 0.01
## achieving protein FDR: 0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.05
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected: 
## Total proteins selected: 2813
## Final target proteins: 2813
## Final decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 16178
## Final target peptides: 16178
## Final decoy peptides: 0
## Total peptides selected from:
## Total peptides: 16178
## Final target peptides: 16178
## Final decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
## Number of proteins detected: 2830
## Protein identifiers: Rv3570c, Rv2524c, Rv3908, Rv2427c, Rv1579c, Rv0913c
## Number of proteins detected that are supported by a proteotypic peptide: 2716
## Number of proteotypic peptides detected: 16060
## Number of proteins detected: 2716
## First 6 protein identifiers: Rv3570c, Rv2524c, Rv3908, Rv2427c, Rv1579c, Rv0913c
## Before filtering:
##   Number of proteins: 2716
##   Number of peptides: 16060
## 
## Percentage of peptides removed: 16.95%
## 
## After filtering:
##   Number of proteins: 2716
##   Number of peptides: 13338
## Before filtering:
##   Number of proteins: 2716
##   Number of peptides: 13338
## 
## Percentage of peptides removed: 9.18%
## 
## After filtering:
##   Number of proteins: 1845
##   Number of peptides: 12113
## Number of non-decoy peptides: 17968
## Number of decoy peptides: 1413
## Decoy rate: 0.0786
## There were 288943 observations and 4332 decoy observations.
## The average FDR by run on assay level is 0.01
## The average FDR by run on peptide level is 0.011
## The average FDR by run on protein level is 0.043
## Target assay FDR: 0.1
## Required overall m-score cutoff: 0.01
## achieving assay FDR: 0.0471
## Target protein FDR: 0.1
## Required overall m-score cutoff: 0.0031623
## achieving protein FDR: 0.094
## Starting mscore filter.
## Starting mscore filter.
## Original dimension: 288943, new dimension: 288928, difference: 15.
## Starting freqobs filter.
## Peptides need to have been quantified in more conditions than: 26.25 in order to pass this percentage-based threshold.
## Fraction of peptides selected: 0.15
## Original dimension: 288928, new dimension: 101508, difference: 187420.
## Starting fdr filter.
## Target protein FDR: 0.00316227766016838
## Required overall m-score cutoff: 0.01
## achieving protein FDR: 0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.1
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected: 
## Total proteins selected: 895
## Final target proteins: 895
## Final decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 2927
## Final target peptides: 2927
## Final decoy peptides: 0
## Total peptides selected from:
## Total peptides: 2927
## Final target peptides: 2927
## Final decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
## Starting proteotypic filter.
## Number of proteins detected: 912
## Protein identifiers: Rv3879c, Rv0577, Rv3800c, Rv0360c, Rv1023, Rv1808
## Number of proteins detected that are supported by a proteotypic peptide: 874
## Number of proteotypic peptides detected: 2900
## Starting peptide filter.
## Number of proteins detected: 874
## First 6 protein identifiers: Rv3879c, Rv0577, Rv3800c, Rv0360c, Rv1023, Rv1808
## Starting maximum peptide filter.
## Before filtering:
##   Number of proteins: 874
##   Number of peptides: 2900
## 
## Percentage of peptides removed: 2.38%
## 
## After filtering:
##   Number of proteins: 874
##   Number of peptides: 2831
## Skipping min peptide filter.
## We went from 4159/19381 proteins/peptides to:
##              874/2831 proteins/peptides.

3.3 Write out matrices of the results

swath2stats provides a couple of ways to print out its results, one in a format specifically intended for MSstats, and another as a more canonical matrix of rows = proteins, columns = samples.

Let us reset the version back to 20190327 here.

## Protein overview matrix preprocessing/10swath2stats/20191021/protein_matrix_unfiltered.csv written to working folder.
## [1] 4159   36
## Protein overview matrix preprocessing/10swath2stats/20191021/protein_matrix_mscore.csv written to working folder.
## [1] 3072   36
## Peptide overview matrix preprocessing/10swath2stats/20191021/peptide_matrix_mscore.csv written to working folder.
## [1] 17968    36
## Protein overview matrix preprocessing/10swath2stats/20191021/protein_matrix_filtered.csv written to working folder.
## [1] 874  36
## Peptide overview matrix preprocessing/10swath2stats/20191021/peptide_matrix_filtered.csv written to working folder.
## [1] 2831   36

## The library contains 1 transitions per precursor.
## The data table was transformed into a table containing one row per transition.
## One or several columns required by MSstats were not in the data. The columns were created and filled with NAs.
## Missing columns: productcharge, isotopelabeltype
## isotopelabeltype was filled with light.

4 Test aLFQ

I want to revisit aLFQ, I think it might provide better protein-level quantification methods. aLFQ looks promising, but I have not figured out valid parameters for using it.

4.1 MSstats

msstats.org seems to provide a complete solution for performing reasonable metrics of this data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

5 Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities from SWATH2stats, metadata, and annotation data; attempt to create a ‘normal’ expressionset; poke at it to see what I can learn.

5.1 Massaging the metadata

I want to use the same metadata as were used for MSstats. It has a few important differences from the requirements of hpgltools: pretty much only that I do not allow rownames/sampleIDs to start with a number.

5.2 Massaging the intensity matrix

I do not want the \1 before the protein names, I already merged them into one entry per gene via SWATH2stats.

5.3 Merge the pieces

Now we should have sufficient pieces to make an expressionset.

While here, I will also split the data into a cf and whole-cell pair of data structures.

## Reading the sample metadata.
## The sample definitions comprises: 35 rows(samples) and 20 columns(metadata fields).
## Matched 869 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 873 rows and 35 columns.
## Reading the sample metadata.
## The sample definitions comprises: 35 rows(samples) and 20 columns(metadata fields).
## Matched 2927 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 2954 rows and 35 columns.
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Warning in mode(current): NAs introduced by coercion to integer range
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, adding 1 to the data.
## Changed 1759 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 1759 zero count features.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.

## Printing a color to condition legend.

## Going to write the image to: images/legend.png when dev.off() is called.
## png 
##   2
## Going to write the image to: images/density.png when dev.off() is called.
## png 
##   2
## Going to write the image to: images/box.png when dev.off() is called.
## png 
##   2
## Going to write the image to: libsize.png when dev.off() is called.
## png 
##   2
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (873 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 102 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.

## Printing a color to condition legend.

## Going to write the image to: images/cor.png when dev.off() is called.
## png 
##   2
## Going to write the image to: images/dis.png when dev.off() is called.
## png 
##   2
## Going to write the image to: images/cv.png when dev.off() is called.
## png 
##   2
## Going to write the image to: images/pca.png when dev.off() is called.
## png 
##   2
## Writing the image to: images/20191021_filtered_pca.png and calling dev.off().

## Using a subset expression.
## There were 35, now there are 17 samples.
## This function will replace the expt$expressionset slot with:
## log2(ruvg(cpm(quant(cbcb(data)))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (873 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 74 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with ruvg.
## Note to self:  If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 14731 entries are x>1: 99%.
## batch_counts: Before batch/surrogate estimation, 74 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 36 entries are 0<x<1: 0%.
## The be method chose 3 surrogate variable(s).
## Using RUVSeq and edgeR for batch correction (similar to lmfit residuals.)
## Warning in RUVSeq::RUVg(linear_mtrx, ruv_controls, k = chosen_surrogates): The expression matrix does not contain counts.
## Please, pass a matrix of counts (not logged) or set isLog to TRUE to skip the log transformation
## There are 4 (0%) elements which are < 0 after batch correction.
## Setting low elements to zero.

## Writing the image to: images/20191021_fsva_filtered_pca.png and calling dev.off().

## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 181 low-count genes (2773 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 6746 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.

## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~  (1|condition)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:18 s
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~  (1|condition)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:11 s
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~  (1|condition)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:33 s
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~  (1|condition)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:49 s
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.
## Using a subset expression.
## There were 35, now there are 34 samples.
## Using a subset expression.
## There were 35, now there are 34 samples.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/9: whdt_wt which is: dt_wh/wt_wh.
## Found inverse table with wt_wh_vs_dt_wh
## The ebseq table is null.
## Working on 2/9: whcp_wt which is: cp_wh/wt_wh.
## Found inverse table with wt_wh_vs_cp_wh
## The ebseq table is null.
## Working on 3/9: whdt_cp which is: dt_wh/cp_wh.
## Found table with dt_wh_vs_cp_wh
## The ebseq table is null.
## Working on 4/9: cfdt_wt which is: dt_cf/wt_cf.
## Found inverse table with wt_cf_vs_dt_cf
## The ebseq table is null.
## Working on 5/9: cfcp_wt which is: cp_cf/wt_cf.
## Found inverse table with wt_cf_vs_cp_cf
## The ebseq table is null.
## Working on 6/9: cfdt_cp which is: dt_cf/cp_cf.
## Found table with dt_cf_vs_cp_cf
## The ebseq table is null.
## Working on 7/9: wtcf_wh which is: wt_cf/wt_wh.
## Found inverse table with wt_wh_vs_wt_cf
## The ebseq table is null.
## Working on 8/9: dtcf_wh which is: dt_cf/dt_wh.
## Found inverse table with dt_wh_vs_dt_cf
## The ebseq table is null.
## Working on 9/9: cpcf_wh which is: cp_cf/cp_wh.
## Found inverse table with cp_wh_vs_cp_cf
## The ebseq table is null.
## Adding venn plots for whdt_wt.
## Limma expression coefficients for whdt_wt; R^2: 0.566; equation: y = 0.634x + 3.26
## Deseq expression coefficients for whdt_wt; R^2: 0.674; equation: y = 0.675x + 8.82
## Edger expression coefficients for whdt_wt; R^2: 0.69; equation: y = 0.726x + 1.64
## Adding venn plots for whcp_wt.
## Limma expression coefficients for whcp_wt; R^2: 0.566; equation: y = 0.634x + 3.26
## Deseq expression coefficients for whcp_wt; R^2: 0.674; equation: y = 0.675x + 8.82
## Edger expression coefficients for whcp_wt; R^2: 0.69; equation: y = 0.726x + 1.64
## Adding venn plots for whdt_cp.
## Limma expression coefficients for whdt_cp; R^2: 0.566; equation: y = 0.634x + 3.26
## Deseq expression coefficients for whdt_cp; R^2: 0.674; equation: y = 0.675x + 8.82
## Edger expression coefficients for whdt_cp; R^2: 0.69; equation: y = 0.726x + 1.64
## Adding venn plots for cfdt_wt.
## Limma expression coefficients for cfdt_wt; R^2: 0.566; equation: y = 0.634x + 3.26
## Deseq expression coefficients for cfdt_wt; R^2: 0.674; equation: y = 0.675x + 8.82
## Edger expression coefficients for cfdt_wt; R^2: 0.69; equation: y = 0.726x + 1.64
## Adding venn plots for cfcp_wt.
## Limma expression coefficients for cfcp_wt; R^2: 0.566; equation: y = 0.634x + 3.26
## Deseq expression coefficients for cfcp_wt; R^2: 0.674; equation: y = 0.675x + 8.82
## Edger expression coefficients for cfcp_wt; R^2: 0.69; equation: y = 0.726x + 1.64
## Adding venn plots for cfdt_cp.
## Limma expression coefficients for cfdt_cp; R^2: 0.566; equation: y = 0.634x + 3.26
## Deseq expression coefficients for cfdt_cp; R^2: 0.674; equation: y = 0.675x + 8.82
## Edger expression coefficients for cfdt_cp; R^2: 0.69; equation: y = 0.726x + 1.64
## Adding venn plots for wtcf_wh.
## Limma expression coefficients for wtcf_wh; R^2: 0.566; equation: y = 0.634x + 3.26
## Deseq expression coefficients for wtcf_wh; R^2: 0.674; equation: y = 0.675x + 8.82
## Edger expression coefficients for wtcf_wh; R^2: 0.69; equation: y = 0.726x + 1.64
## Adding venn plots for dtcf_wh.
## Limma expression coefficients for dtcf_wh; R^2: 0.566; equation: y = 0.634x + 3.26
## Deseq expression coefficients for dtcf_wh; R^2: 0.674; equation: y = 0.675x + 8.82
## Edger expression coefficients for dtcf_wh; R^2: 0.69; equation: y = 0.726x + 1.64
## Adding venn plots for cpcf_wh.
## Limma expression coefficients for cpcf_wh; R^2: 0.566; equation: y = 0.634x + 3.26
## Deseq expression coefficients for cpcf_wh; R^2: 0.674; equation: y = 0.675x + 8.82
## Edger expression coefficients for cpcf_wh; R^2: 0.69; equation: y = 0.726x + 1.64
## Writing summary information, compare_plot is: TRUE.
## Performing save of excel/de_20191210_tables_v20191021.xlsx.

6 Satisfy TODO 20191105

Provide a plot of the ratio of delta(cf/whole) with respect to the ratio of wt(cf/whole)

## Warning in plot_multihistogram(df): NAs introduced by coercion

6.1 Lets see what happens if we rewrite the expressionset with NAs

## In condition wt_wh there are 264 rows which are all zero.
## In condition wt_cf there are 1377 rows which are all zero.
## In condition dt_wh there are 370 rows which are all zero.
## In condition dt_cf there are 1256 rows which are all zero.
## In condition cp_wh there are 269 rows which are all zero.
## In condition cp_cf there are 1291 rows which are all zero.
## Error in write_xls(interesting_nas, excel = "excel/testme_first.xlsx"): could not find function "write_xls"
## Error in write_xls(interesting_nas, excel = "excel/testme_first.xlsx"): could not find function "write_xls"

7 A nice subset request from Volker

Let us pull the following subset from the DE tables for Volker, it should provide a set of proteins most obviously of interest; assuming the false negatives are not too severe.

  1. lfc <= -6 for delta/wt && lfc >= -3 for comp/wt
  2. lfc >= 10 delta/wt && lfc <= 4 comp/wt

This will hopefully find things which are sufficiently different from the deletion and complement samples to be interesting.

9 Request from 20191128

Here is the text of my email:

Query 1: Given existing table of all DE proteins perform the following: a. Take the deseq_logfc for the delta(cf/wh) b. Take the deseq_logfc for the wt(cf/wh) c. Subtract a and b. d. Given c, extract the set which are negative and <= x where x is likely 0.5 at least at first.

Query 2: For the subset acquired in Query 1 do the following: a. Take the deseq_logfc for the delta(cf/wh). b. Take the deseq_logfc for the complement(cf/wh). c. Subtract a and b.

Query 3: For the results of Q1, Q2, plot them as a scatter plot. Take from this the set of genes close than some Z-value.

Focus on one stringency? start with filtered and unfiltered.

Query 4: Plot y axis: cf(delta/comp) Plot x axis: cf(delta/wt)

Query 5: Plot y axis: wh(delta/comp) Plot x axis: wh(delta/wt)

For all of the above use ggplotly (eg. hover plots).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.985  -0.335   0.030   0.013   0.377   2.842
## [1] 702
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -4.692  -0.498  -0.071  -0.144   0.280   1.796
## Warning in plot_multihistogram(df): NAs introduced by coercion

## Warning in plot_multihistogram(df): NAs introduced by coercion

## Warning in plot_multihistogram(df): NAs introduced by coercion

## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset ca7dfe3c77056f495da05ef0c95b6208370a23f1commit!
## This is hpgltools commit: Sat Nov 23 19:41:15 2019 -0500: ca7dfe3c77056f495da05ef0c95b6208370a23f1This is hpgltools commit: Sat Nov 23 19:41:15 2019 -0500: commit!
## Saving to 03_swath2stats_20191021-v20191021.rda.xz

R version 3.6.1 (2019-07-05)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: edgeR(v.3.28.0), variancePartition(v.1.16.0), ruv(v.0.9.7.1), SWATH2stats(v.1.13.5), testthat(v.2.3.1), hpgltools(v.1.0), Biobase(v.2.46.0) and BiocGenerics(v.0.32.0)

loaded via a namespace (and not attached): rappdirs(v.0.3.1), rtracklayer(v.1.46.0), R.methodsS3(v.1.7.1), tidyr(v.1.0.0), ggplot2(v.3.2.1), acepack(v.1.4.1), bit64(v.0.9-7), knitr(v.1.26), aroma.light(v.3.16.0), DelayedArray(v.0.12.0), R.utils(v.2.9.2), data.table(v.1.12.8), rpart(v.4.1-15), hwriter(v.1.3.2), RCurl(v.1.95-4.12), doParallel(v.1.0.15), GenomicFeatures(v.1.38.0), preprocessCore(v.1.48.0), callr(v.3.4.0), cowplot(v.1.0.0), usethis(v.1.5.1), RSQLite(v.2.1.4), europepmc(v.0.3), bit(v.1.1-14), enrichplot(v.1.6.0), httpuv(v.1.5.2), xml2(v.1.2.2), SummarizedExperiment(v.1.16.0), assertthat(v.0.2.1), viridis(v.0.5.1), xfun(v.0.11), hms(v.0.5.2), promises(v.1.1.0), evaluate(v.0.14), DEoptimR(v.1.0-8), fansi(v.0.4.0), progress(v.1.2.2), caTools(v.1.17.1.3), dbplyr(v.1.4.2), htmlwidgets(v.1.5.1), igraph(v.1.2.4.2), DBI(v.1.0.0), geneplotter(v.1.64.0), EDASeq(v.2.20.0), stats4(v.3.6.1), purrr(v.0.3.3), ellipsis(v.0.3.0), crosstalk(v.1.0.0), selectr(v.0.4-2), dplyr(v.0.8.3), backports(v.1.1.5), annotate(v.1.64.0), biomaRt(v.2.42.0), vctrs(v.0.2.0), remotes(v.2.1.0), BRAIN(v.1.32.0), withr(v.2.1.2), ggforce(v.0.3.1), triebeard(v.0.3.0), robustbase(v.0.93-5), checkmate(v.1.9.4), GenomicAlignments(v.1.22.1), prettyunits(v.1.0.2), cluster(v.2.1.0), DOSE(v.3.12.0), lazyeval(v.0.2.2), crayon(v.1.3.4), genefilter(v.1.68.0), pkgconfig(v.2.0.3), labeling(v.0.3), tweenr(v.1.0.1), GenomeInfoDb(v.1.22.0), nlme(v.3.1-143), PolynomF(v.2.0-2), pkgload(v.1.0.2), nnet(v.7.3-12), devtools(v.2.2.1), rlang(v.0.4.2), lifecycle(v.0.1.0), BiocFileCache(v.1.10.2), directlabels(v.2018.05.22), cellranger(v.1.1.0), rprojroot(v.1.3-2), polyclip(v.1.10-0), matrixStats(v.0.55.0), graph(v.1.64.0), Matrix(v.1.2-18), urltools(v.1.7.3), boot(v.1.3-23), base64enc(v.0.1-3), ggridges(v.0.5.1), processx(v.3.4.1), viridisLite(v.0.3.0), bitops(v.1.0-6), R.oo(v.1.23.0), KernSmooth(v.2.23-16), pander(v.0.6.3), Biostrings(v.2.54.0), blob(v.1.2.0), stringr(v.1.4.0), qvalue(v.2.18.0), ShortRead(v.1.44.0), readr(v.1.3.1), gridGraphics(v.0.4-1), S4Vectors(v.0.24.1), scales(v.1.1.0), memoise(v.1.1.0), magrittr(v.1.5), plyr(v.1.8.5), gplots(v.3.0.1.1), gdata(v.2.18.0), zlibbioc(v.1.32.0), compiler(v.3.6.1), RColorBrewer(v.1.1-2), lme4(v.1.1-21), DESeq2(v.1.26.0), Rsamtools(v.2.2.1), cli(v.2.0.0), XVector(v.0.26.0), ps(v.1.3.0), htmlTable(v.1.13.3), Formula(v.1.2-3), MASS(v.7.3-51.4), mgcv(v.1.8-31), tidyselect(v.0.2.5), stringi(v.1.4.3), yaml(v.2.2.0), GOSemSim(v.2.12.0), askpass(v.1.1), locfit(v.1.5-9.1), latticeExtra(v.0.6-28), ggrepel(v.0.8.1), grid(v.3.6.1), fastmatch(v.1.1-0), tools(v.3.6.1), rstudioapi(v.0.10), foreach(v.1.4.7), foreign(v.0.8-72), gridExtra(v.2.3), farver(v.2.0.1), Rtsne(v.0.15), ggraph(v.2.0.0), digest(v.0.6.23), rvcheck(v.0.1.7), BiocManager(v.1.30.10), shiny(v.1.4.0), quadprog(v.1.5-8), Rcpp(v.1.0.3), GenomicRanges(v.1.38.0), later(v.1.0.0), httr(v.1.4.1), AnnotationDbi(v.1.48.0), colorspace(v.1.4-1), rvest(v.0.3.5), XML(v.3.98-1.20), fs(v.1.3.1), IRanges(v.2.20.1), splines(v.3.6.1), RBGL(v.1.62.1), graphlayouts(v.0.5.0), ggplotify(v.0.0.4), plotly(v.4.9.1), sessioninfo(v.1.1.1), xtable(v.1.8-4), jsonlite(v.1.6), nloptr(v.1.2.1), tidygraph(v.1.1.2), corpcor(v.1.6.9), zeallot(v.0.1.0), Vennerable(v.3.1.0.9000), R6(v.2.4.1), Hmisc(v.4.3-0), mime(v.0.7), pillar(v.1.4.2), htmltools(v.0.4.0), fastmap(v.1.0.1), glue(v.1.3.1), minqa(v.1.2.4), clusterProfiler(v.3.14.0), BiocParallel(v.1.20.0), DESeq(v.1.38.0), RUVSeq(v.1.20.0), codetools(v.0.2-16), fgsea(v.1.12.0), pkgbuild(v.1.0.6), lattice(v.0.20-38), tibble(v.2.1.3), sva(v.3.34.0), pbkrtest(v.0.4-7), curl(v.4.3), colorRamps(v.2.3), gtools(v.3.8.1), zip(v.2.0.4), GO.db(v.3.10.0), openxlsx(v.4.1.4), openssl(v.1.4.1), survival(v.3.1-8), limma(v.3.42.0), rmarkdown(v.1.18), desc(v.1.2.0), readODS(v.1.6.7), munsell(v.0.5.0), DO.db(v.2.9), fastcluster(v.1.1.25), GenomeInfoDbData(v.1.2.2), iterators(v.1.0.12), reshape2(v.1.4.3) and gtable(v.0.3.0)

---
title: "M.tuberculosis 20191021: All Samples OpenSwathWorkFlow/TRIC."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    toc_float: true
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
</style>

```{r options, include=FALSE}
library("hpgltools")
tt <- devtools::load_all("/data/hpgltools")
knitr::opts_knit$set(width=120,
                     progress=TRUE,
                     verbose=TRUE,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
previous_file <- "01_annotation_20190801.Rmd"
rundate <- format(Sys.Date(), format="%Y%m%d")
ver <- "20190801"
tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))
ver <- "20191021"

rmd_file <- paste0("03_swath2stats_", ver, ".Rmd")
savefile <- gsub(pattern="\\.Rmd", replace="\\.rda\\.xz", x=rmd_file)
```

Analyzing data from openMS and friends.
=======================================

## SWATH2stats preprocessing

I am using my slightly modified copy of SWATH2stats.  This seeks to ensure that
changes in the case of columns in the metadata from one version of OpenMS to
another do not trouble me.

### Creating a swath2stats experiment using the tuberculist-derived library data

There is one important caveat in the following block: I used a regex to remove
the second half of geneID_geneName so that later when I merge in the annotation
data I have it will match.

## Some new plots

In response to some interesting queries from Yan, I made a few little functions
which query and plot data from the scored data provided by openswath/pyprophet.
Let us look at their results here.

```{r tb_pyprophet_plots}
pyp_metadata <- glue::glue("sample_sheets/Mtb_dia_samples_{ver}.ods")
pyprophet_fun <- sm(extract_pyprophet_data(metadata=pyp_metadata,
                                           pyprophet_column="diascored"))
## Visualize the mass distributions of each sample
mass_plot <- sm(plot_pyprophet_distribution(pyprophet_fun, column="mass"))
pp(file=glue::glue("images/all_masses_observed-v{ver}.png"),
                   image=mass_plot[["violin"]])
## That second to last sample looks pretty odd.

## Look at the delta rt times observed.  I am continually puzzled as to why these numbers
## get so high.  I suspect the actual reason is that I do not understand this column in the
## data, but finding reliable documentation is non-trivial, I just blew 45 minutes (re)reading
## a couple of reviews and some papers in which I thought I saw relevant information
## and found nothing.
deltart_plot_all <- sm(plot_pyprophet_distribution(
  pyprophet_fun, column="delta_rt"))
pp(file=glue::glue("images/all_drt_observed-v{ver}.png"),
   image=deltart_plot_all[["violin"]])


deltart_plot_all <- sm(plot_pyprophet_distribution(
  scale=TRUE,
  pyprophet_fun, column="intensity"))
pp(file=glue::glue("images/all_intensities_observed-v{ver}.png"),
   image=deltart_plot_all[["violin"]])

## This plot is the same as above, but includes _only_ the non-decoy values.
deltart_plot_real <- sm(plot_pyprophet_distribution(
  pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
deltart_plot_real[["violin"]]
## And this time we have _only_ the decoys.  I am not really sure what one should expect in
## these, but the differences are definitely intriguing.
deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
deltart_plot_decoys[["violin"]]

dscore_plot <- sm(plot_pyprophet_distribution(
  pyprophet_fun,
  column="d_score"))
dscore_plot[["violin"]]

dscore_plot_nodecoy <- sm(plot_pyprophet_distribution(
  pyprophet_fun, keep_decoys=FALSE,
  column="d_score"))
dscore_plot_nodecoy[["violin"]]

mscore_plot <- sm(plot_pyprophet_distribution(
  pyprophet_fun, keep_decoys=FALSE,
  column="m_score"))
mscore_plot[["violin"]]
mscore_plot_nodecoy <- sm(plot_pyprophet_distribution(
  pyprophet_fun, keep_decoys=FALSE,
  column="m_score"))
mscore_plot_nodecoy[["violin"]]

## How many identifications were observed in each sample?
pyprophet_identifications <- sm(plot_pyprophet_counts(
  pyprophet_fun, keep_decoys=FALSE,
  type="count"))
pp(file=glue::glue("images/all_peptide_identifications-v{ver}.png"),
   image=pyprophet_identifications$plot)
pyprophet_identifications <- sm(plot_pyprophet_counts(
  pyprophet_fun, keep_decoys=FALSE,
  type="protein_count"))
pp(file=glue::glue("images/all_protein_identifications-v{ver}.png"),
   image=pyprophet_identifications$plot)

## The range in values is a little surprising to me, but I do not
## think it is crazytown.

## Sum(intensity) vs. sum(identifications).  We saw in a previous plot that sample
## 17 was odd, so I would sort of expect it to be far away on this plot?
pyprophet_xy <- sm(plot_pyprophet_xy(
  pyprophet_fun,
  x_type="count", y_type="intensity"))
pp(file=glue::glue("images/all_counts_vs_intensities-v{ver}.png"),
   image=pyprophet_xy)
## hmm I do not see a strong trend

pyprophet_xy <- sm(plot_pyprophet_points(
  pyprophet_fun, alpha=0.7, color_by="condition", legend=FALSE,
  sample=c("2019_0801Briken04", "2019_0709Briken01"),
  xaxis="d_score", yaxis="mass"))
pp(file=glue::glue("images/two_samples_wc_cf_dscore_mass-v{ver}.png"),
   image=pyprophet_xy$plot)

pyprophet_xy <- sm(plot_pyprophet_points(
  pyprophet_fun, alpha=0.3, color_by="condition", legend=FALSE,
  sample=c("2019_0801Briken04", "2019_0709Briken01"),
  xaxis="mz", yaxis="leftwidth"))
pp(file=glue::glue("images/two_samples_mass_rt-v{ver}.png"),
   image=pyprophet_xy$plot)

pyprophet_xy <- sm(plot_pyprophet_points(
  pyprophet_fun, alpha=0.4, color_by="condition", legend=FALSE,
  sample=c("2019_0801Briken04", "2019_0709Briken01"), yscale="log2",
  xaxis="mass", yaxis="intensity"))
pp(file=glue::glue("images/two_samples_wc_cf_intensity_mass-v{ver}.png"),
   image=pyprophet_xy$plot)

pyprophet_xy <- sm(plot_pyprophet_points(
  pyprophet_fun, alpha=0.3, color_by="condition", legend=FALSE,
  sample=c("2019_0801Briken04", "2019_0709Briken01"),
  xaxis="rt", yaxis="m_score"))
pyprophet_xy$plot
pp(file=glue::glue("images/two_samples_wc_cf_rt_mscore-v{ver}.png"),
   image=pyprophet_xy$plot)

pyprophet_xy <- sm(plot_pyprophet_points(
  pyprophet_fun, alpha=0.3, color_by="condition", legend=FALSE,
  sample=c("2019_0801Briken04", "2019_0709Briken01"),
  xaxis="assay_rt", yaxis="leftwidth", x_scale="log2", y_scale="log2"))
pyprophet_xy$plot
pp(file=glue::glue("images/two_samples_wc_cf_rt_width-v{ver}.png"),
   image=pyprophet_xy$plot)

pyprophet_xy <- sm(plot_pyprophet_points(
  pyprophet_fun, alpha=0.3, color_by="condition", legend=FALSE,
  sample=c("2019_0801Briken04", "2019_0709Briken01"),
  xaxis="mz", yaxis="assay_rt"))
pyprophet_xy$plot

## This has so far been a pretty reliable plot to show that the observed peak widths are
## very consistent across samples.
pyprophet_lwidths <- sm(plot_pyprophet_xy(
  pyprophet_fun,
  x_type="count", y_type="leftwidth"))
pp(file="images/whole_lwidths_vs_counts.png", image=pyprophet_lwidths)
## In this case, we can see strongly that sample #17 is strange.
```

## Show a series of all identifications for marker proteins

There are a few proteins for which Volker has relatively specific
assumptions/expectations.  Let us see what they look like and if they follow a
trend which makes some sense...

The primary thing to recall, I think, is that in our previous data sets, there
were a pretty large number of samples for which no identifications were made for
many of these proteins.  Does that remain true?

## Grab annotations

```{r mtb_annotations}
## The old way of getting genome/annotation data
mtb_gff <- "reference/mycobacterium_tuberculosis_h37rv_2.gff.gz"
mtb_genome <- "reference/mtuberculosis_h37rv_genbank.fasta"
mtb_cds <- "reference/mtb_cds.fasta"

mtb_annotations <- sm(load_gff_annotations(mtb_gff, type="gene"))
colnames(mtb_annotations) <- gsub(pattern="\\.", replacement="", x=colnames(mtb_annotations))
mtb_annotations[["description"]] <- gsub(pattern="\\+", replacement=" ",
                                         x=mtb_annotations[["description"]])
mtb_annotations[["function"]] <- gsub(pattern="\\+", replacement=" ",
                                      x=mtb_annotations[["function"]])

mtb_microbes <- load_microbesonline_annotations(id=83332)
mtb_annotations <- merge(mtb_annotations, mtb_microbes,
                         by.x="ID", by.y="sysName")
rownames(mtb_annotations) <- mtb_annotations[["ID"]]
mtb_annotations[["width"]] <- abs(mtb_annotations[["start.y"]] - mtb_annotations[["stop"]])
```

I will use the annotations to make finding the correct protein IDs easier.

```{r plot_individual_proteins}
gene_xref <- mtb_annotations[, c("gene", "name", "description")]

gene_xref[gene_xref[["name"]] == "esxG", ]
intensities_esxG <- sm(plot_pyprophet_protein(pyprophet_fun, scale="log",
                                              title="esxG Intensities",
                                              column="intensity", protein="Rv0287"))
pp(file=glue::glue("images/whole_osw_esxG_intensities-v{ver}.png"),
   image=intensities_esxG)
## Is this a good or terrible spread of observed intensities for proteomics data?
## If I found a range like this in RNASeq data, I would just throw it away out of hand.
## Sample 17 did not have any observations, but everything else did.

## The range of dRT values. I wish this metric made sense to me!
drt_esxG <- sm(plot_pyprophet_protein(pyprophet_fun,
                                      column="delta_rt", protein="Rv0287",
                                      min_data=100, max_data=1000))
drt_esxG

gene_xref[gene_xref[["name"]] == "esxH", ]
intensities_esxH <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="esxH Intensities",
  scale="log", column="intensity", protein="Rv0288"))
pp(file=glue::glue("images/whole_osw_esxH_intensities-v{ver}.png"),
   image=intensities_esxH)
## All samples have observations, but a bit sparser.

gene_xref[gene_xref[["name"]] == "lpqH", ]
intensities_lpqH <- sm(plot_pyprophet_protein(
  pyprophet_fun, scale="log",
  title="lpqH_intensities", column="intensity", protein="Rv3763"))
pp(file=glue::glue("images/whole_osw_lpqh_intensities-v{ver}.png"),
   image=intensities_lpqH)
## Very few observations, but they are relatively consistent?

intensities_groel1 <- sm(plot_pyprophet_protein(pyprophet_fun,
                                                title="groEL1 intensities", scale="log",
                                                column="intensity", protein="Rv3417"))
pp(file=glue::glue("images/whole_osw_groel1_intensities-v{ver}.png"),
   image=intensities_groel1)

intensities_groel2 <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="groEL2 intensities", scale="log",
  column="intensity", protein="Rv0440"))
pp(file=glue::glue("images/whole_osw_groel2_intensities-v{ver}.png"),
   image=intensities_groel2)
## Man I am loving the density of observation, but I wish they were more consistent!
## Also, sample 17 is sparser.

gene_xref[rownames(gene_xref) == "Rv1860", ]
intensities_fap <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="fap intensities", scale="log",
  column="intensity", protein="Rv1860"))
pp(file=glue::glue("images/whole_osw_fap_intensities-v{ver}.png"),
   image=intensities_fap)
## Sample 17 is basically a null.

intensities_katg <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="PPE1 (Rv0096) intensities", scale="log",
  column="intensity", protein="Rv0096"))
pp(file=glue::glue("images/whole_osw_ppe1_intensities-v{ver}.png"),
   image=intensities_katg)
## What is up with that one, weirdly low, observation for each sample?

gene_xref[gene_xref[["name"]] == "katG", ]
intensities_katg <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="katG intensities", scale="log",
  column="intensity", protein="Rv1908c"))
pp(file=glue::glue("images/whole_osw_katg_intensities-v{ver}.png"),
   image=intensities_katg)

gene_xref[gene_xref[["name"]] == "PPE41", ]
ppe41_intensities <- plot_pyprophet_protein(
  pyprophet_fun,
  title="PPE41 (Rv2430c) intensities", scale="log",
  column="intensity", protein="Rv2430c")
pp(file=glue::glue("images/PPE41_intensities-v{ver}.png"),
   image=ppe41_intensities)

gene_xref[gene_xref[["name"]] == "PE25", ]
pe25_intensities <- plot_pyprophet_protein(
  pyprophet_fun, show_all=TRUE,
  title="PE25 (Rv2431c) intensities", scale="log",
  column="intensity", protein="Rv2431c")
pp(file=glue::glue("images/PE25_intensities-v{ver}.png"),
   image=pe25_intensities)

gene_xref[gene_xref[["name"]] == "PE31", ]
pe31_intensities <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="PE31 (Rv3477) intensities", scale="log",
  column="intensity", protein="Rv3477"))
pp(file=glue::glue("images/PE31_intensities-v{ver}.png"),
   image=pe31_intensities)

gene_xref[gene_xref[["name"]] == "esxH", ]
esxH_intensities <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="esxH (Rv0288) intensities", scale="log",
  column="intensity", protein="Rv0288"))
pp(file=glue::glue("images/esxH_intensities-v{ver}.png"),
   image=esxH_intensities)

gene_xref[gene_xref[["name"]] == "Rv1748", ]
Rv1748_intensities <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="Rv1748 intensities", scale="log",
  column="intensity", protein="Rv1748"))
pp(file=glue::glue("images/Rv1748_intensities-v{ver}.png"),
   image=Rv1748_intensities)

gene_xref[gene_xref[["name"]] == "pknF", ]
pknF_intensities <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="pknF (Rv1746) intensities", scale="log",
  column="intensity", protein="Rv1746"))
pp(file=glue::glue("images/pknF_intensities-v{ver}.png"),
   image=pknF_intensities)

gene_xref[gene_xref[["name"]] == "pknG", ]
pknG_intensities <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="pknG (Rv0410c) intensities", scale="log",
  column="intensity", protein="Rv0410c"))
pp(file=glue::glue("images/pknG_intensities-v{ver}.png"),
   image=pknG_intensities)
```

# Start SWATH2stats

I want to load the data and metadata into SWATH2stats in preparation for MSstats
and my own hpgltools-base analyses.

```{r tb_swath2stats_initial}
cf_version <- "20190718"
cf_tric_file <- file.path("preprocessing", "09tric", cf_version,
                          "whole_8mz_tuberculist", "comet_HCD.tsv")
cf_tric_data <- readr::read_tsv(cf_tric_file)
cf_tric_data[["ProteinName"]] <- gsub(pattern="^(.*)_.*$", replacement="\\1",
                                      x=cf_tric_data[["ProteinName"]])
wc_version <- "20190801"
wc_tric_file <- file.path("preprocessing", "09tric", wc_version,
                          "whole_8mz_tuberculist", "comet_HCD.tsv")
wc_tric_data <- readr::read_tsv(wc_tric_file)
wc_tric_data[["ProteinName"]] <- gsub(pattern="^(.*)_.*$", replacement="\\1",
                                      x=wc_tric_data[["ProteinName"]])
tric_data <- rbind(wc_tric_data, cf_tric_data)
sample_sheet <- file.path("sample_sheets", glue::glue("Mtb_dia_samples_{ver}.ods"))
sample_annot <- extract_metadata(sample_sheet)
## A weird thing is happening with column names, I will chase it down later.
colnames(sample_annot) <- gsub(x=colnames(sample_annot),
                               pattern="[[:punct:]]", replace="")
kept <- ! grepl(x=rownames(sample_annot), pattern="^s\\.\\.")
sample_annot <- sample_annot[kept, ]
devtools::load_all("~/scratch/git/SWATH2stats_myforked")
s2s_exp <- sample_annotation(data=tric_data,
                             sample_annotation=sample_annot,
                             fullpeptidename_column="fullpeptidename")
```

Now I have a couple data structures which should prove useful for the metrics
provided by SWATH2stats, MSstats, and my own hpgltools.

# SWATH2stats continued

The various metrics and filters provided by SWATH2stats seem quite reasonable to
me.  The only thing that really bothers me is that they are all case sensitive
and I found that the most recent tric changed the capitalization of a column,
causing these to all fall down.  Therefore I went in and made everything case
insensitive in a fashion similar to that done by MSstats (except I hate capital
letters, so I used tolower() rather than toupper()).

## Perform filters

The following block performs the metrics and filters suggested by swath2stats.
These first define the decoy hit rate in the data, then filter the data based on
that.  It also filters out hits with less than ideal m-scores and proteins with
non-optimal distributions of peptide hits (either due to too few peptides or a
weird distribution of intensities).

```{r tb_swath2stats_processing}
## Get correlations on a sample by sample basis
pp(file=glue::glue("images/s2s_correlation-v{ver}.png"))
sample_cond_rep_cor <- plot_correlation_between_samples(
  s2s_exp, size=2,
  comparison=transition_group_id ~
    condition + bioreplicate + run,
  fun.aggregate=mean,
  column.values="intensity")
dev.off()

## Do the same thing, but use the sum of the intensities peptide/protein
## instead of the mean...
pp(file="images/s2_correlation.png")
sample_cond_rep_cor <- plot_correlation_between_samples(
  s2s_exp, size=2,
  comparison=transition_group_id ~
    condition + bioreplicate + run,
  fun.aggregate=sum,
  column.values="intensity")
dev.off()
## I would love to know why it is that the spearman and pearson correlations
## in this data are so oddly different compare to previous data sets.  Is this
## an artifact of the fact that this time I am _only_ looking at CF samples?
## Are the high intensity numbers messing with non-rank-based correlations?
```

## Perform SWATH2stats filters

I just realized something which should be added to me SWATH2stats fork:
A simplified filter functions which invokes all of these so that I can make sure
that there are no typeographikal errors introduced by my invocation of each of
these things, one at a time.

```{r swath2stats_filters, fig.show="hide"}
decoy_lists <- assess_decoy_rate(s2s_exp)
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(s2s_exp, output="Rconsole", plot=TRUE)

byrun_fdr <- assess_fdr_byrun(s2s_exp, FFT=0.7, plot=TRUE, output="Rconsole")
chosen_mscore <- mscore4assayfdr(s2s_exp, FFT=0.7, fdr_target=0.02)
prot_score <- mscore4protfdr(s2s_exp, FFT=0.7, fdr_target=0.02)

filtered_ms <- filter_mscore(s2s_exp, chosen_mscore, rm.decoy=TRUE)
filtered_fq <- filter_mscore_freqobs(s2s_exp, 0.01, 0.8, rm.decoy=TRUE)
filtered_ms_fdr <- filter_mscore_fdr(filtered_ms, FFT=0.7, rm.decoy=TRUE,
                                     overall_protein_fdr_target=prot_score,
                                     upper_overall_peptide_fdr_limit=0.05)
filtered_ms_fdr_pr <- filter_proteotypic_peptides(filtered_ms, rm.decoy=TRUE)
filtered_ms_fdr_pr_all <- filter_all_peptides(filtered_ms_fdr_pr)
filtered_ms_fdr_pr_all_str <- filter_on_max_peptides(data=filtered_ms_fdr_pr_all,
                                                     n_peptides=10, rm.decoy=TRUE)
old_filtered_all_filters <- filter_on_min_peptides(data=filtered_ms_fdr_pr_all_str,
                                               n_peptides=3, rm.decoy=TRUE)

filtered_all_filters <- s2s_all_filters(s2s_exp, target_fdr=0.1,
                                        mscore=0.1, upper_fdr=0.1,
                                        do_min=FALSE)
```

## Write out matrices of the results

swath2stats provides a couple of ways to print out its results, one in a format
specifically intended for MSstats, and another as a more canonical matrix of
rows = proteins, columns = samples.

Let us reset the version back to 20190327 here.

```{r swath2stats_matrices}
## I think these matrixes are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rerwriting the sample column names.
matrix_prefix <- file.path("preprocessing", "10swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}

## I want to write a few iterations of the filtered data
## Starting with the raw and moving down, perhaps I should just
## add this logic to my fancy new filtering function?
protein_matrix_unfilt <- write_matrix_proteins(
  filtered_all_filters[["raw"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "protein_matrix_unfiltered.csv"))
dim(protein_matrix_unfilt)
protein_matrix_mscore <- write_matrix_proteins(
  filtered_all_filters[["mscore_filtered"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "protein_matrix_mscore.csv"))
dim(protein_matrix_mscore)
peptide_matrix_mscore <- write_matrix_peptides(
  filtered_all_filters[["mscore_filtered"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "peptide_matrix_mscore.csv"))
dim(peptide_matrix_mscore)
protein_matrix_filtered <- write_matrix_proteins(
  filtered_all_filters[["final"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "protein_matrix_filtered.csv"))
dim(protein_matrix_filtered)
peptide_matrix_filtered <- write_matrix_peptides(
  filtered_all_filters[["final"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "peptide_matrix_filtered.csv"))
dim(peptide_matrix_filtered)

## Do the correlation of the sum of peptides/protein
rt_sum_cor <- plot_correlation_between_samples(
  filtered_all_filters[["final"]], column.values="intensity",
  fun.aggregate=sum, size=2,
  comparison=transition_group_id ~
    condition + bioreplicate + run)

## And their means
rt_mean_cor <- plot_correlation_between_samples(
  filtered_all_filters[["final"]], column.values="intensity",
  fun.aggregate=mean, size=2,
  comparison=transition_group_id ~
    condition + bioreplicate + run)

cols <- colnames(filtered_all_filters[["final"]])
disaggregated <- disaggregate(filtered_all_filters[["final"]], all.columns=TRUE)
msstats_input <- convert_MSstats(disaggregated)
```

# Test aLFQ

I want to revisit aLFQ, I think it might provide better protein-level
quantification methods.  aLFQ looks promising, but I have not figured out valid
parameters for using it.

```{r alfq, eval=FALSE}
summary(msstats_input)
devtools::load_all("~/scratch/git/aLFQ")
alfq_input <- tric_data[, c("align_origfilename", "ProteinName", "FullPeptideName", "transition_group_id",
                            "FullPeptideName", "Charge", "Intensity")]
colnames(alfq_input) <- c("run_id", "protein_id", "peptide_id", "transition_id", "peptide_sequence",
                          "precursor_charge", "transition_intensity")
alfq_input[["concentration"]] <- "?"
alfq_inference <- aLFQ::ProteinInference.default(alfq_input, consensus_proteins=FALSE,
                                                 consensus_peptides=FALSE, transition_strictness="loose",
                                                 consensus_transitions=FALSE)
alfq_quantities <- aLFQ::AbsoluteQuantification.default(alfq_inference,
                                                        total_protein_concentration=100)

alfq_norm <- alfq_quantities[[2]]
alfq_min <- min(alfq_norm[["normalized_concentration"]])
alfq_norm[["norm"]] <- alfq_norm[["normalized_concentration"]] / alfq_min
## Hmm that does not look right.
alfq_long <- data.table::as.data.table(alfq_norm)
alfq_wide <- data.table::dcast(alfq_long, protein_id ~ run_id, value.var="norm")
alfq_mtrx <- as.data.frame(alfq_wide)
rownames(alfq_mtrx) <- alfq_mtrx[["protein_id"]]
alfq_mtrx <- as.matrix(alfq_mtrx[, -1])
colnames(alfq_mtrx) <- gsub(x=colnames(alfq_mtrx),
                            pattern="^.*(2019_.*Briken[[:digit:]]{2}).*$",
                            replacement="\\1")
```

## MSstats

msstats.org seems to provide a complete solution for performing reasonable metrics of this data.

I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf

I made some moderately intrusive changes to MSstats to make it clearer, as well.

```{r tb_msstats_quant, eval=FALSE}
tt <- sm(devtools::load_all("~/scratch/git/MSstats"))

msstats_ver <- "20191021"
checkpoint <- paste0("msstats_dataprocess-v", msstats_ver, ".rda")
if (file.exists(checkpoint)) {
  load(file=checkpoint)
} else {
  msstats_quant <- dataProcess(msstats_input)
  save(file=checkpoint, list=c("msstats_quant"))
}

##checkpoint <- paste0("msstats_plots-v", msstats_ver, ".rda")
##if (file.exists(checkpoint)) {
##  load(file=checkpoint)
##} else {
##  msstats_plots <- dataProcessPlots(msstats_quant, type="QCPLOT")
##  save(file=checkpoint, list=c("msstats_plots"))
##}

my_levels <- levels(as.factor(msstats_input$condition))
my_levels
comparisons <- make_simplified_contrast_matrix(
  numerators=c("dt_whole", "cp_whole"),
  denominators=c("wt_whole", "wt_whole"))

msstats_results <- list()
checkpoint <- paste0("msstats_group-v", ver, ".rda")
if (file.exists(checkpoint)) {
  load(file=checkpoint)
} else {
  for (c in 1:length(rownames(comparisons))) {
    name <- rownames(comparisons)[c]
    message("Starting ", name)
    comp <- comparisons[c, ]
    comp <- t(as.matrix(comp))
    rownames(comp) <- name
    msstats_results[[name]] <- sm(MSstats::groupComparison(contrast.matrix=comp,
                                                           data=msstats_quant))
    message("Finished ", name)
  }
  save(file=checkpoint, list=c("msstats_results"))
}
```

### P/PE protein QC plots for Yan

Yan asked for the p/pe protein qc plots. ok.  I changed the dataProcessPlots to
return something useful, so that should be possible now.

```{r pe, eval=FALSE}
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]

## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="",
                        x=levels(msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots

pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in 1:length(pe_plots)) {
  plot(pe_plots[[p]])
}
dev.off()
length(pe_plots)
```

# Create hpgltools expressionset

Since I am not certain I understand these data, I will take the intensities from
SWATH2stats, metadata, and annotation data;  attempt to create a 'normal'
expressionset; poke at it to see what I can learn.

## Massaging the metadata

I want to use the same metadata as were used for MSstats.  It has a few
important differences from the requirements of hpgltools: pretty much only that
I do not allow rownames/sampleIDs to start with a number.

## Massaging the intensity matrix

I do not want the \1 before the protein names, I already merged them into one
entry per gene via SWATH2stats.

### Pyprophet matrix

There are two ways to get the matrix of intensities, either directly from pyprophet, or from
the filtered data provided by swath2stats.  Here is the former, but I should be using the latter.

```{r tb_protein_matrix, eval=FALSE}
prot_mtrx <- read.csv(file.path("preprocessing", "10swath2stats",
                                ver, "protein_matrix_filtered.csv"))
rownames(prot_mtrx) <- gsub(pattern="^1\\/", replacement="",
                            x=prot_mtrx[["proteinname"]])
prot_mtrx <- prot_mtrx[, -1]
prot_keepers <- grepl(pattern="^Rv", x=rownames(prot_mtrx))
prot_mtrx <- prot_mtrx[prot_keepers, ]
## Important question: Did SWATH2stats reorder my data?
colnames(prot_mtrx) <- gsub(pattern="^(.*)(2018.*)$",
                            replacement="s\\2", x=colnames(prot_mtrx))

unfilt_mtrx <- read.csv(file.path("preprocessing", "10swath2stats",
                                  ver, "protein_matrix_unfiltered.csv"))
rownames(unfilt_mtrx) <- gsub(pattern="^1\\/", replacement="",
                              x=unfilt_mtrx[["proteinname"]])
unfilt_mtrx <- unfilt_mtrx[, -1]
unfilt_keepers <- grepl(pattern="^Rv", x=rownames(unfilt_mtrx))
unfilt_mtrx <- unfilt_mtrx[unfilt_keepers, ]
## Important question: Did SWATH2stats reorder my data?
colnames(unfilt_mtrx) <- gsub(pattern="^(.*)(2018.*)$",
                              replacement="s\\2", x=colnames(unfilt_mtrx))
```

### From SWATH2stats

```{r prot_matrix_swath2stats}
prot_mtrx <- protein_matrix_filtered
colnames(prot_mtrx) <- gsub(x=colnames(prot_mtrx),
                            pattern="^(.*)_(2019.*)$", replacement="\\2")
rownames(prot_mtrx) <- prot_mtrx[["proteinname"]]
prot_mtrx[["proteinname"]] <- NULL
rv_idx <- grepl(x=rownames(prot_mtrx), pattern="^Rv")
prot_mtrx <- prot_mtrx[rv_idx, ]

unfilt_mtrx <- protein_matrix_unfilt
colnames(unfilt_mtrx) <- gsub(x=colnames(unfilt_mtrx),
                              pattern="^(.*)_(2019.*)$", replacement="\\2")
rownames(unfilt_mtrx) <- gsub(pattern="^1\\/", replacement="", x=unfilt_mtrx[["proteinname"]])
unfilt_mtrx[["proteinname"]] <- NULL
rv_idx <- grepl(x=rownames(unfilt_mtrx), pattern="^Rv")
unfilt_mtrx <- unfilt_mtrx[rv_idx, ]
```

## Merge the pieces

Now we should have sufficient pieces to make an expressionset.

While here, I will also split the data into a cf and whole-cell pair of data structures.

```{r tb_expt, fig.show="hide"}
## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(prot_mtrx)
metadata <- sample_annot[reordered, ]
colnames(prot_mtrx) <- gsub(x=colnames(prot_mtrx), pattern="^.*(2019.*$)", replacement="s\\1")
mtb_annotations <- mtb_annotations[, -1]
protein_expt <- create_expt(sample_annot,
                            count_dataframe=prot_mtrx,
                            gene_info=mtb_annotations)

reordered <- colnames(unfilt_mtrx)
metadata <- sample_annot[reordered, ]
colnames(unfilt_mtrx) <- gsub(x=colnames(unfilt_mtrx), pattern="^.*(2019.*$)", replacement="s\\1")
mtb_annotations <- mtb_annotations[, -1]
unfilt_expt <- create_expt(sample_annot,
                           count_dataframe=unfilt_mtrx,
                           gene_info=mtb_annotations)
```

```{r all_pca}
raw_filt <- graph_metrics(protein_expt)
pp(file="images/legend.png")
raw_filt$legend
dev.off()

pp(file="images/density.png")
raw_filt$density
dev.off()
pp(file="images/box.png")
raw_filt$boxplot
dev.off()
pp(file="libsize.png")
raw_filt$libsize
dev.off()

filt_norm <- normalize_expt(protein_expt, filter=TRUE,
                            norm="quant", convert="cpm",
                            transform="log2")
norm_filt <- graph_metrics(filt_norm)

pp(file="images/cor.png")
norm_filt$corheat
dev.off()

pp(file="images/dis.png")
norm_filt$disheat
dev.off()

pp(file="images/cv.png")
norm_filt$cvplot
dev.off()

pp(file="images/pca.png")
norm_filt$pc_plot
dev.off()

plt <- plot_pca(filt_norm)$plot
pp(file=glue::glue("images/{ver}_filtered_pca.png"),
   image=plt)

cf <- subset_expt(protein_expt, subset="collectiontype=='whole_cell'")
sva_norm <- normalize_expt(cf, filter=TRUE, convert="cpm", norm="quant",
                           transform="log2", batch="ruvg")
plt <- plot_pca(sva_norm)$plot
plt
pp(file=glue::glue("images/{ver}_fsva_filtered_pca.png"),
   image=plt)


unfilt_norm <- normalize_expt(unfilt_expt, filter=TRUE,
                              norm="quant", convert="cpm",
                              transform="log2")
plot_pca(unfilt_norm)$plot
```

```{r write_expt, fig.show="hide"}
written_file <- glue::glue("excel/{rundate}_protein_expt-v{ver}.xlsx")
protein_write <- write_expt(protein_expt, violin=TRUE,
                            batch="raw", excel=written_file)

written_file <- glue::glue("excel/{rundate}_unfilt_protein_expt-v{ver}.xlsx")
unfilt_write <- write_expt(unfilt_expt, violin=TRUE,
                            batch="raw", excel=written_file)

protein_expt <- subset_expt(protein_expt, subset="sampleid!='2019_0709Briken17'")
unfilt_expt <- subset_expt(unfilt_expt, subset="sampleid!='2019_0709Briken17'")
protein_filt <- sm(normalize_expt(protein_expt, filter=TRUE))
protein_norm <- sm(normalize_expt(protein_filt, norm="quant", transform="log2",
                                  convert="cpm", filter=TRUE))

protein_unfilt <- sm(normalize_expt(unfilt_expt, filter=TRUE))
protein_de <- sm(all_pairwise(protein_filt, model_batch=FALSE, force=TRUE, parallel=FALSE))
unfilt_de <- sm(all_pairwise(protein_unfilt, model_batch=FALSE, force=TRUE, parallel=FALSE))

keepers <- list(
  "whdt_wt" = c("dt_wh", "wt_wh"),
  "whcp_wt" = c("cp_wh", "wt_wh"),
  "whdt_cp" = c("dt_wh", "cp_wh"),
  "cfdt_wt" = c("dt_cf", "wt_cf"),
  "cfcp_wt" = c("cp_cf", "wt_cf"),
  "cfdt_cp" = c("dt_cf", "cp_cf"),
  "wtcf_wh" = c("wt_cf", "wt_wh"),
  "dtcf_wh" = c("dt_cf", "dt_wh"),
  "cpcf_wh" = c("cp_cf", "cp_wh")
)

protein_tables <- combine_de_tables(
  protein_de, keepers=keepers,
  excel=glue::glue("excel/de_{rundate}_tables_v{ver}.xlsx"))
unfilt_tables <- sm(combine_de_tables(
  unfilt_de, keepers=keepers,
  excel=glue::glue("excel/de_{rundate}_unfilt_tables_v{ver}.xlsx")))

protein_sig <- sm(extract_significant_genes(
  protein_tables,
  excel=glue::glue("excel/sig_{rundate}_tables_v{ver}.xlsx")))
unfilt_sig <- sm(extract_significant_genes(
  unfilt_tables,
  excel=glue::glue("excel/sig_{rundate}_unfilt_tables_v{ver}.xlsx")))
```

# Satisfy TODO 20191105

Provide a plot of the ratio of delta(cf/whole) with respect to the ratio of wt(cf/whole)

```{r ratio_plots}
x_data <- protein_tables[["data"]][["dtcf_wh"]][, c("deseq_logfc", "deseq_adjp")]
y_data <- protein_tables[["data"]][["wtcf_wh"]][, c("deseq_logfc", "deseq_adjp")]
xy <- merge(x_data, y_data, by.x="row.names", by.y="row.names")
rownames(xy) <- xy[["Row.names"]]
xy <- xy[, c("deseq_logfc.x", "deseq_logfc.y")]
colnames(xy) <- c("delta_ratio_cf_wh", "wt_ratio_cf_wh")
xy_scatter <- plot_linear_scatter(xy)
xy_scatter$scatter
plotly::ggplotly(xy_scatter$scatter)
```

## Lets see what happens if we rewrite the expressionset with NAs

```{r na_experiment, fig.show="hide"}
## Put NAs into the data for the set of proteins for which there are
## 0s in _not_all_ of the samples for each condition.
protein_nas <- add_conditional_nas(unfilt_expt)
unfilt_nas_de <- sm(all_pairwise(protein_nas, model_batch=FALSE,
                                 parallel=FALSE, do_ebseq=FALSE,
                                 test_pca=FALSE, force=TRUE))

unfilt_nas_tables <- sm(combine_de_tables(
  unfilt_nas_de, keepers=keepers,
  excel=glue::glue("excel/{rundate}_unfilt_nas_tables_v{ver}.xlsx")))

one_nas_table <- unfilt_nas_tables[["data"]][["cfdt_wt"]]
observations <- one_nas_table[, c("wtwhobservations", "wtcfobservations", "dtwhobservations", "dtcfobservations")]
interesting_idx <- observations[["wtwhobservations"]] > 4 &
  observations[["wtcfobservations"]] > 4
interesting_nas <- one_nas_table[interesting_idx, ]
interesting_order <- order(interesting_nas[["deseq_logfc"]])
interesting_nas <- interesting_nas[interesting_order, ]
logfc_idx <- interesting_nas[["deseq_logfc"]] < 0
interesting_nas <- interesting_nas[logfc_idx, ]
p_idx <- interesting_nas[["deseq_adjp"]] < 0.1
interesting_nas <- interesting_nas[p_idx, ]
test <- write_xls(interesting_nas, excel="excel/testme_first.xlsx")

interesting_idx <- observations[["wtcfobservations"]] > 2 &
  observations[["dtcfobservations"]] < 1
interesting_nas <- one_nas_table[interesting_idx, ]
test <- write_xls(interesting_nas, excel="excel/testme_first.xlsx")

unfilt_nas_sig <- sm(extract_significant_genes(
  unfilt_nas_tables,
  excel=glue::glue("excel/sig_{rundate}_unfilt_nas_tables_v{ver}.xlsx")))
```

# A nice subset request from Volker

Let us pull the following subset from the DE tables for Volker, it should
provide a set of proteins most obviously of interest; assuming the false
negatives are not too severe.

1.  lfc <= -6 for delta/wt && lfc >= -3 for comp/wt
2.  lfc >= 10 delta/wt && lfc <= 4 comp/wt

This will hopefully find things which are sufficiently different from the
deletion and complement samples to be interesting.

```{r interesting_subset, eval=FALSE}
dt_wt <- unfilt_nas_tables[["data"]][["dt_wt"]]
cp_wt <- unfilt_nas_tables[["data"]][["cp_wt"]]
down_idx <- dt_wt[["deseq_logfc"]] <= -6 & cp_wt[["deseq_logfc"]] >= -3
down_table <- dt_wt[down_idx, ]
up_idx <- dt_wt[["deseq_logfc"]] >= 10 & cp_wt[["deseq_logfc"]] <= 4
up_table <- dt_wt[up_idx, ]

down_subset <- write_xls(
  data=down_table,
  excel=glue::glue("excel/de_{rundate}_more_down_delta.xlsx"))
up_subset <- write_xls(
  data=up_table,
  excel=glue::glue("excel/de_{rundate}_more_up_delta.xlsx"))
```

# Playing with new EncyclopeDIA

```{r load_encyclopedia, eval=FALSE}
enc_metadata <- sample_annote
nc_matrix <- read.table("preprocessing/09encyclopedia/20191001cf_quantities.elib.proteins.txt",
                         header=TRUE)
enc_pep_matrix <- read.table("preprocessing/09encyclopedia/20191001cf_quantities.elib.peptides.txt", header=TRUE)
rownames(enc_matrix) <- enc_matrix[["Protein"]]
enc_matrix <- enc_matrix[, -1]
enc_matrix <- enc_matrix[, -1]
enc_matrix <- enc_matrix[, -1]
colnames(enc_matrix)
colnames(enc_matrix) <- gsub(pattern="X", replacement="s", x=colnames(enc_matrix))
colnames(enc_matrix) <- gsub(pattern="\\.mzML", replacement="", x=colnames(enc_matrix))
colnames(enc_matrix) <- gsub(pattern="^X", replacement="s", x=colnames(enc_matrix))
colnames(enc_pep_matrix) <- gsub(pattern="X", replacement="s", x=colnames(enc_pep_matrix))
colnames(enc_pep_matrix) <- gsub(pattern="\\.mzML", replacement="", x=colnames(enc_pep_matrix))
colnames(enc_pep_matrix) <- gsub(pattern="^X", replacement="s", x=colnames(enc_pep_matrix))

colnames(enc_matrix)
rownames(enc_metadata)

enc_expt <- create_expt(metadata=enc_metadata, count_dataframe=enc_matrix,
                                                gene_info=mtb_annotations)
enc_norm <- normalize_expt(enc_expt, norm="quant", convert="cpm", filter=TRUE,
                                                      transform="log2")
plot_pca(enc_norm)$plot
```

# Request from 20191128

Here is the text of my email:

Query 1:  Given existing table of all DE proteins perform the following:
  a. Take the deseq_logfc for the delta(cf/wh)
  b. Take the deseq_logfc for the wt(cf/wh)
  c. Subtract a and b.
  d. Given c, extract the set which are negative and <= x where x is likely 0.5 at least at first.

 Query 2: For the subset acquired in Query 1 do the following:
   a.  Take the deseq_logfc for the delta(cf/wh).
   b.  Take the deseq_logfc for the complement(cf/wh).
   c.  Subtract a and b.

Query 3: For the results of Q1, Q2, plot them as a scatter plot.
Take from this the set of genes close than some Z-value.

Focus on one stringency?  start with filtered and unfiltered.

Query 4:
Plot y axis: cf(delta/comp)
Plot x axis: cf(delta/wt)

Query 5:
Plot y axis: wh(delta/comp)
Plot x axis: wh(delta/wt)

For all of the above use ggplotly (eg. hover plots).

```{r query1}
q1a <- protein_tables[["data"]][["dtcf_wh"]][, c("deseq_logfc", "deseq_adjp")]
q1b <- protein_tables[["data"]][["wtcf_wh"]][, c("deseq_logfc", "deseq_adjp")]
testthat::expect_equal(rownames(q1a), rownames(q1b))
q1c <- q1a[["deseq_logfc"]] - q1b[["deseq_logfc"]]
names(q1c) <- rownames(q1a)
summary(q1c)
interesting_idx <- q1c <= 0.5
sum(interesting_idx)
kept_ids <- rownames(q1a)[interesting_idx]
q1d <- q1c[interesting_idx]

q2a <- protein_tables[["data"]][["dtcf_wh"]][interesting_idx, c("deseq_logfc", "deseq_adjp")]
q2b <- protein_tables[["data"]][["cpcf_wh"]][interesting_idx, c("deseq_logfc", "deseq_adjp")]
q2c <- q2a[["deseq_logfc"]] - q2b[["deseq_logfc"]]
names(q2c) <- rownames(q2a)
summary(q2c)

q3 <- as.data.frame(cbind(q1d, q2c))
rownames(q3) <- names(q2c)
colnames(q3) <- c("dt_vs_wt", "dt_vs_cp")
q3_scatter <- plot_linear_scatter(q3)
q3_scatter$scatter
q3_plotly <- plotly::ggplotly(q3_scatter$scatter)
q3_saved <- htmlwidgets::saveWidget(q3_plotly, "20191126_query3.html")

q4x <- protein_tables[["data"]][["cfdt_cp"]][, c("deseq_logfc", "deseq_adjp")]
q4y <- protein_tables[["data"]][["cfdt_wt"]][, c("deseq_logfc", "deseq_adjp")]
q4 <- as.data.frame(cbind(q4x[["deseq_logfc"]], q4y[["deseq_logfc"]]))
rownames(q4) <- rownames(q4x)
colnames(q4) <- c("cfdt_cp", "cfdt_wt")
q4_scatter <- plot_linear_scatter(q4)
q4_scatter$scatter
q4_plotly <- plotly::ggplotly(q4_scatter$scatter)
q4_saved <- htmlwidgets::saveWidget(q4_plotly, "20191126_query4.html")

q5x <- protein_tables[["data"]][["whdt_cp"]][, c("deseq_logfc", "deseq_adjp")]
q5y <- protein_tables[["data"]][["whdt_wt"]][, c("deseq_logfc", "deseq_adjp")]
q5 <- as.data.frame(cbind(q5x[["deseq_logfc"]], q5y[["deseq_logfc"]]))
rownames(q5) <- rownames(q5x)
colnames(q5) <- c("whdt_cp", "whdt_wt")
q5_scatter <- plot_linear_scatter(q5)
q5_scatter$scatter
q5_plotly <- plotly::ggplotly(q5_scatter$scatter)
q5_saved <- htmlwidgets::saveWidget(q5_plotly, "20191126_query5.html")
```

```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
  pander::pander(sessionInfo())
}
```

```{r loadme, eval=FALSE}
tmp <- loadme(filename=this_save)
```
