I am using my slightly modified copy of SWATH2stats. It ensures that changes in the capitalization of metadata columns from one version of OpenMS to the next do not trouble me.
There is one important caveat in the following block: I used a regex to remove the second half of geneID_geneName so that later, when I merge in the annotation data I have, it will match.
In response to some interesting queries from Yan, I wrote a few small functions which query and plot the scored data provided by openswath/pyprophet. Let us look at their results here.
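As a minimal sketch of that trimming step (the identifiers here are made up; the real pattern is applied to the ProteinName column further down):

```r
## Example identifiers in geneID_geneName form (hypothetical values).
ids <- c("Rv0287_esxG", "Rv0288_esxH", "Rv3763_lpqH")
## The greedy .* keeps everything before the final underscore, leaving only
## the gene ID so it matches the annotation data.
gene_ids <- gsub(pattern = "^(.*)_.*$", replacement = "\\1", x = ids)
gene_ids
## [1] "Rv0287" "Rv0288" "Rv3763"
```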
pyp_metadata <- glue::glue("sample_sheets/Mtb_dia_samples_{ver}.ods")
pyprophet_fun <- extract_pyprophet_data(metadata=pyp_metadata,
                                        pyprophet_column="diascored")
## Parsed with column specification:
## cols(
## .default = col_character(),
## FigureReplicate = col_logical(),
## `Figure Name` = col_logical(),
## `Sample Description` = col_logical(),
## `Replicate State` = col_logical(),
## enzyme = col_logical(),
## harvestdate = col_logical(),
## prepdate = col_logical(),
## rundate = col_logical(),
## runinfo = col_logical(),
## tuberculist_scored = col_logical(),
## include_exclude = col_logical(),
## Run_note = col_logical()
## )
## See spec(...) for full column specifications.
## Attempting to read the tsv file for: 2019_0801Briken01: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken01_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken01: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken01_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken02: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken02_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken02: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken02_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken03: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken03_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken03: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken03_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken04: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken04_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken04: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken04_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken05: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken05_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken06: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken06_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken06: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken06_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken07: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken07_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken07: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken07_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken08: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken08_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken08: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken08_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken09: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken09_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken09: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken09_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken10: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken10_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken10: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken10_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken11: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken11_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken11: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken11_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken12: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken12_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken12: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken12_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken13: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken13_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken13: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken13_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken14: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken14_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken14: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken14_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken15: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken15_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken15: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken15_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken16: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken16_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken16: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken16_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken17: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken17_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken17: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken17_vs_20190718_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0801Briken18: preprocessing/08pyprophet/20190801/whole_8mz_tuberculist/2019_0801Briken18_vs_20190801_whole_HCD_dia_scored.tsv.
## Attempting to read the tsv file for: 2019_0709Briken18: preprocessing/08pyprophet/20190718/whole_8mz_tuberculist/2019_0709Briken18_vs_20190718_whole_HCD_dia_scored.tsv.
## Visualize the mass distributions of each sample
mass_plot <- plot_pyprophet_distribution(pyprophet_fun, column="mass")
## Adding 2019_0801Briken01
## Adding 2019_0709Briken01
## Adding 2019_0801Briken02
## Adding 2019_0709Briken02
## Adding 2019_0801Briken03
## Adding 2019_0709Briken03
## Adding 2019_0801Briken04
## Adding 2019_0709Briken04
## Adding 2019_0709Briken05
## Adding 2019_0801Briken06
## Adding 2019_0709Briken06
## Adding 2019_0801Briken07
## Adding 2019_0709Briken07
## Adding 2019_0801Briken08
## Adding 2019_0709Briken08
## Adding 2019_0801Briken09
## Adding 2019_0709Briken09
## Adding 2019_0801Briken10
## Adding 2019_0709Briken10
## Adding 2019_0801Briken11
## Adding 2019_0709Briken11
## Adding 2019_0801Briken12
## Adding 2019_0709Briken12
## Adding 2019_0801Briken13
## Adding 2019_0709Briken13
## Adding 2019_0801Briken14
## Adding 2019_0709Briken14
## Adding 2019_0801Briken15
## Adding 2019_0709Briken15
## Adding 2019_0801Briken16
## Adding 2019_0709Briken16
## Adding 2019_0801Briken17
## Adding 2019_0709Briken17
## Adding 2019_0801Briken18
## Adding 2019_0709Briken18
## Writing the image to: images/whole_masses_observed.png and calling dev.off().
## That second to last sample looks pretty odd.
## Look at the delta RT times observed. I am continually puzzled as to why these
## numbers get so high. I suspect the real reason is that I do not understand this
## column in the data, but finding reliable documentation is non-trivial: I just
## blew 45 minutes (re)reading a couple of reviews and some papers in which I
## thought I had seen relevant information, and found nothing.
deltart_plot_all <- sm(plot_pyprophet_distribution(
  pyprophet_fun, column="delta_rt"))
pp(file="images/whole_drt_observed.png", image=deltart_plot_all[["violin"]])
## Writing the image to: images/whole_drt_observed.png and calling dev.off().
## This plot is the same as above, but includes _only_ the non-decoy values.
deltart_plot_real <- sm(plot_pyprophet_distribution(
  pyprophet_fun,
  column="delta_rt", keep_decoys=FALSE))
deltart_plot_real[["violin"]]
## And this time we have _only_ the decoys. I am not really sure what one should expect in
## these, but the differences are definitely intriguing.
deltart_plot_decoys <- sm(plot_pyprophet_distribution(
  pyprophet_fun,
  column="delta_rt", keep_real=FALSE))
deltart_plot_decoys[["violin"]]
## How many identifications were observed in each sample?
pyprophet_identifications <- sm(plot_pyprophet_counts(
  pyprophet_fun, keep_decoys=FALSE,
  type="count"))
pp(file="images/whole_num_identifications.png", image=pyprophet_identifications$plot)
## Writing the image to: images/whole_num_identifications.png and calling dev.off().
## The range in values is a little surprising to me, but I do not
## think it is crazytown.
## Sum(intensity) vs. sum(identifications). We saw in a previous plot that sample
## 17 was odd, so I would sort of expect it to be far away on this plot?
pyprophet_xy <- sm(plot_pyprophet_xy(
  pyprophet_fun,
  x_type="count", y_type="intensity"))
pp(file="images/whole_counts_vs_intensities.png", image=pyprophet_xy)
## Writing the image to: images/whole_counts_vs_intensities.png and calling dev.off().
## Hmm, I do not see a strong trend.
## This has so far been a pretty reliable plot to show that the observed peak widths are
## very consistent across samples.
pyprophet_lwidths <- sm(plot_pyprophet_xy(
  pyprophet_fun,
  x_type="count", y_type="leftwidth"))
pp(file="images/whole_lwidths_vs_counts.png", image=pyprophet_lwidths)
## Writing the image to: images/whole_lwidths_vs_counts.png and calling dev.off().
There are a few proteins for which Volker has relatively specific assumptions/expectations. Let us see what they look like and whether they follow a trend which makes some sense…
The primary thing to recall, I think, is that in our previous data sets, there were a pretty large number of samples for which no identifications were made for many of these proteins. Does that remain true?
intensities_esxG <- sm(plot_pyprophet_protein(pyprophet_fun, scale="log",
                                              title="esxG Intensities",
                                              column="intensity", protein="Rv0287"))
pp(file=paste0("images/whole_osw_esxG_intensities-v", ver, ".png"), image=intensities_esxG)
## Writing the image to: images/whole_osw_esxG_intensities-v20191021.png and calling dev.off().
## Is this a good or terrible spread of observed intensities for proteomics data?
## If I found a range like this in RNASeq data, I would just throw it away out of hand.
## Sample 17 did not have any observations, but everything else did.
## The range of dRT values. I wish this metric made sense to me!
drt_esxG <- sm(plot_pyprophet_protein(pyprophet_fun,
                                      column="delta_rt", protein="Rv0287",
                                      min_data=100, max_data=1000))
drt_esxG
intensities_esxH <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="esxH Intensities",
  scale="log", column="intensity", protein="Rv0288"))
pp(file=paste0("images/whole_osw_esxH_intensities-v", ver, ".png"), image=intensities_esxH)
## Writing the image to: images/whole_osw_esxH_intensities-v20191021.png and calling dev.off().
## All samples have observations, but they are a bit sparser.
intensities_lpqH <- sm(plot_pyprophet_protein(
  pyprophet_fun, scale="log",
  title="lpqH_intensities", column="intensity", protein="Rv3763"))
pp(file=paste0("images/whole_osw_lpqh_intensities-v", ver, ".png"), image=intensities_lpqH)
## Writing the image to: images/whole_osw_lpqh_intensities-v20191021.png and calling dev.off().
## Very few observations, but they are relatively consistent?
intensities_groel1 <- plot_pyprophet_protein(pyprophet_fun,
                                             title="groEL1 intensities", scale="log",
                                             column="intensity", protein="Rv3417")
## Adding 2019_0801Briken01
## Adding 2019_0709Briken01
## Adding 2019_0801Briken02
## Adding 2019_0709Briken02
## Adding 2019_0801Briken03
## Adding 2019_0709Briken03
## Adding 2019_0801Briken04
## Adding 2019_0709Briken04
## Adding 2019_0709Briken05
## Adding 2019_0801Briken06
## Adding 2019_0709Briken06
## Adding 2019_0801Briken07
## Adding 2019_0709Briken07
## Adding 2019_0801Briken08
## Adding 2019_0709Briken08
## Adding 2019_0801Briken09
## Adding 2019_0709Briken09
## Adding 2019_0801Briken10
## Adding 2019_0709Briken10
## Adding 2019_0801Briken11
## Adding 2019_0709Briken11
## Adding 2019_0801Briken12
## Adding 2019_0709Briken12
## Adding 2019_0801Briken13
## Adding 2019_0709Briken13
## Adding 2019_0801Briken14
## Adding 2019_0709Briken14
## Adding 2019_0801Briken15
## Adding 2019_0709Briken15
## Adding 2019_0801Briken16
## Adding 2019_0709Briken16
## Adding 2019_0801Briken17
## Adding 2019_0709Briken17
## Adding 2019_0801Briken18
## Adding 2019_0709Briken18
## Writing the image to: images/whole_osw_groel1_intensities-v20191021.png and calling dev.off().
intensities_groel2 <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="groEL2 intensities", scale="log",
  column="intensity", protein="Rv0440"))
pp(file=paste0("images/whole_osw_groel2_intensities-v", ver, ".png"), image=intensities_groel2)
## Writing the image to: images/whole_osw_groel2_intensities-v20191021.png and calling dev.off().
## Man, I am loving the density of observations, but I wish they were more consistent!
## Also, sample 17 is sparser.
intensities_fap <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="fap intensities", scale="log",
  column="intensity", protein="Rv1860"))
pp(file=paste0("images/whole_osw_fap_intensities-v", ver, ".png"), image=intensities_fap)
## Writing the image to: images/whole_osw_fap_intensities-v20191021.png and calling dev.off().
## Sample 17 is basically a null.
intensities_katg <- sm(plot_pyprophet_protein(
  pyprophet_fun,
  title="katG intensities", scale="log",
  column="intensity", protein="Rv1908"))
pp(file="images/whole_osw_katg_intensities.png", image=intensities_katg)
## Writing the image to: images/whole_osw_katg_intensities.png and calling dev.off().
I want to load the data and metadata into SWATH2stats in preparation for MSstats and my own hpgltools-based analyses.
cf_version <- "20190718"
cf_tric_file <- file.path("preprocessing", "09tric", cf_version,
                          "whole_8mz_tuberculist", "comet_HCD.tsv")
cf_tric_data <- readr::read_tsv(cf_tric_file)
## Parsed with column specification:
## cols(
## .default = col_double(),
## run_id = col_character(),
## filename = col_character(),
## Sequence = col_character(),
## FullPeptideName = col_character(),
## aggr_Peak_Area = col_logical(),
## aggr_Peak_Apex = col_logical(),
## aggr_Fragment_Annotation = col_logical(),
## ProteinName = col_character(),
## align_runid = col_character(),
## align_origfilename = col_character()
## )
## See spec(...) for full column specifications.
cf_tric_data[["ProteinName"]] <- gsub(pattern="^(.*)_.*$", replacement="\\1",
                                      x=cf_tric_data[["ProteinName"]])
wc_version <- "20190801"
wc_tric_file <- file.path("preprocessing", "09tric", wc_version,
                          "whole_8mz_tuberculist", "comet_HCD.tsv")
wc_tric_data <- readr::read_tsv(wc_tric_file)
## Parsed with column specification:
## cols(
## .default = col_double(),
## run_id = col_character(),
## filename = col_character(),
## Sequence = col_character(),
## FullPeptideName = col_character(),
## aggr_Peak_Area = col_logical(),
## aggr_Peak_Apex = col_logical(),
## aggr_Fragment_Annotation = col_logical(),
## ProteinName = col_character(),
## align_runid = col_character(),
## align_origfilename = col_character()
## )
## See spec(...) for full column specifications.
wc_tric_data[["ProteinName"]] <- gsub(pattern="^(.*)_.*$", replacement="\\1",
                                      x=wc_tric_data[["ProteinName"]])
tric_data <- rbind(wc_tric_data, cf_tric_data)
sample_sheet <- file.path("sample_sheets", glue::glue("Mtb_dia_samples_{ver}.ods"))
sample_annot <- extract_metadata(sample_sheet)
## Parsed with column specification:
## cols(
## .default = col_character(),
## FigureReplicate = col_logical(),
## `Figure Name` = col_logical(),
## `Sample Description` = col_logical(),
## `Replicate State` = col_logical(),
## enzyme = col_logical(),
## harvestdate = col_logical(),
## prepdate = col_logical(),
## rundate = col_logical(),
## runinfo = col_logical(),
## tuberculist_scored = col_logical(),
## include_exclude = col_logical(),
## Run_note = col_logical()
## )
## See spec(...) for full column specifications.
## A weird thing is happening with the column names; I will chase it down later.
colnames(sample_annot) <- gsub(x=colnames(sample_annot),
                               pattern="[[:punct:]]", replacement="")
kept <- ! grepl(x=rownames(sample_annot), pattern="^s\\.\\.")
sample_annot <- sample_annot[kept, ]
devtools::load_all("~/scratch/git/SWATH2stats_myforked")
## Loading SWATH2stats
s2s_exp <- sample_annotation(data=tric_data, verbose=TRUE,
                             sample_annotation=sample_annot,
                             fullpeptidename_column="fullpeptidename")
## Found the same mzXML files in the annotations and data.
## preprocessing/01mzXML/dia/20190718/2019_0709Briken01.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken02.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken03.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken04.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken05.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken06.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken07.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken08.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken09.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken10.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken11.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken12.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken13.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken14.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken15.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken16.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken17.mzXML
## preprocessing/01mzXML/dia/20190718/2019_0709Briken18.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken01.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken02.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken03.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken04.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken06.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken07.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken08.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken09.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken10.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken11.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken12.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken13.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken14.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken15.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken16.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken17.mzXML
## preprocessing/01mzXML/dia/20190801/2019_0801Briken18.mzXML
## 35 samples were read from the annotations.
## 293275 transitions were read from the data and merged with the annotations.
Now I have a couple of data structures which should prove useful for the metrics provided by SWATH2stats, MSstats, and my own hpgltools.
The various metrics and filters provided by SWATH2stats seem quite reasonable to me. The only thing that really bothers me is that they are all case sensitive, and I found that the most recent tric changed the capitalization of a column, causing them all to fall down. I therefore went in and made everything case insensitive, in a fashion similar to that done by MSstats (except that I hate capital letters, so I used tolower() rather than toupper()).
The following block performs the metrics and filters suggested by SWATH2stats. It first defines the decoy hit rate in the data, then filters the data based on that rate. It also filters out hits with less-than-ideal m-scores and proteins with non-optimal distributions of peptide hits (either too few peptides or a weird distribution of intensities).
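The case normalization amounts to something like this, sketched on a toy data frame (the column names here are illustrative, not the exact tric output):

```r
## A toy data frame with the sort of mixed-case columns tric emits
## (hypothetical names and values).
df <- data.frame(ProteinName = "Rv0287", FullPeptideName = "PEPTIDER",
                 Intensity = 100)
## Lowercase every column name so downstream lookups are case insensitive.
colnames(df) <- tolower(colnames(df))
colnames(df)
## [1] "proteinname"     "fullpeptidename" "intensity"
```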
## Get correlations on a sample by sample basis
pp(file=glue::glue("images/s2s_correlation-v{ver}.png"))
## Going to write the image to: images/s2s_correlation-v20191021.png when dev.off() is called.
sample_cond_rep_cor <- plot_correlation_between_samples(
  s2s_exp, size=2,
  comparison=transition_group_id ~
    condition + bioreplicate + run,
  fun.aggregate=mean,
  column.values="intensity")
dev.off()
## png
## 2
## Do the same thing, but use the sum of the intensities per peptide/protein
## instead of the mean...
sample_cond_rep_cor <- plot_correlation_between_samples(
  s2s_exp, size=2,
  comparison=transition_group_id ~
    condition + bioreplicate + run,
  fun.aggregate=sum,
  column.values="intensity")
## I would love to know why the spearman and pearson correlations in this data
## are so oddly different compared to previous data sets. Is this an artifact
## of the fact that this time I am _only_ looking at CF samples? Are the high
## intensity numbers messing with the non-rank-based correlations?
I just realized something which should be added to my SWATH2stats fork: a simplified filter function which invokes all of these, so that I can make sure no typographical errors are introduced by invoking each of them one at a time.
## Number of non-decoy peptides: 17968
## Number of decoy peptides: 1413
## Decoy rate: 0.0786
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(s2s_exp, output="Rconsole", plot=TRUE)
## The average FDR by run on assay level is 0.01
## The average FDR by run on peptide level is 0.011
## The average FDR by run on protein level is 0.043
## Target assay FDR: 0.02
## Required overall m-score cutoff: 0.0039811
## achieving assay FDR: 0.0196
## Target protein FDR: 0.02
## Required overall m-score cutoff: 0.00079433
## achieving protein FDR: 0.019
## Original dimension: 288943, new dimension: 257900, difference: 31043.
## Peptides need to have been quantified in more conditions than: 28 in order to pass this percentage-based threshold.
## Fraction of peptides selected: 0.14
## Original dimension: 288943, new dimension: 95329, difference: 193614.
filtered_ms_fdr <- filter_mscore_fdr(filtered_ms, FFT=0.7, rm.decoy=TRUE,
                                     overall_protein_fdr_target=prot_score,
                                     upper_overall_peptide_fdr_limit=0.05)
## Target protein FDR: 0.000794328234724281
## Required overall m-score cutoff: 0.01
## achieving protein FDR: 0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.05
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected:
## Total proteins selected: 2813
## Final target proteins: 2813
## Final decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 16178
## Final target peptides: 16178
## Final decoy peptides: 0
## Total peptides selected from:
## Total peptides: 16178
## Final target peptides: 16178
## Final decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
## Number of proteins detected: 2830
## Protein identifiers: Rv3570c, Rv2524c, Rv3908, Rv2427c, Rv1579c, Rv0913c
## Number of proteins detected that are supported by a proteotypic peptide: 2716
## Number of proteotypic peptides detected: 16060
## Number of proteins detected: 2716
## First 6 protein identifiers: Rv3570c, Rv2524c, Rv3908, Rv2427c, Rv1579c, Rv0913c
filtered_ms_fdr_pr_all_str <- filter_on_max_peptides(data=filtered_ms_fdr_pr_all,
                                                     n_peptides=10, rm.decoy=TRUE)
## Before filtering:
## Number of proteins: 2716
## Number of peptides: 16060
##
## Percentage of peptides removed: 16.95%
##
## After filtering:
## Number of proteins: 2716
## Number of peptides: 13338
old_filtered_all_filters <- filter_on_min_peptides(data=filtered_ms_fdr_pr_all_str,
                                                   n_peptides=3, rm.decoy=TRUE)
## Before filtering:
## Number of proteins: 2716
## Number of peptides: 13338
##
## Percentage of peptides removed: 9.18%
##
## After filtering:
## Number of proteins: 1845
## Number of peptides: 12113
filtered_all_filters <- s2s_all_filters(s2s_exp, target_fdr=0.1, mscore=0.1,
                                        upper_fdr=0.1, do_min=FALSE)
## Number of non-decoy peptides: 17968
## Number of decoy peptides: 1413
## Decoy rate: 0.0786
## There were 288943 observations and 4332 decoy observations.
## The average FDR by run on assay level is 0.01
## The average FDR by run on peptide level is 0.011
## The average FDR by run on protein level is 0.043
## Target assay FDR: 0.1
## Required overall m-score cutoff: 0.01
## achieving assay FDR: 0.0471
## Target protein FDR: 0.1
## Required overall m-score cutoff: 0.0031623
## achieving protein FDR: 0.094
## Starting mscore filter.
## Starting mscore filter.
## Original dimension: 288943, new dimension: 288928, difference: 15.
## Starting freqobs filter.
## Peptides need to have been quantified in more conditions than: 26.25 in order to pass this percentage-based threshold.
## Fraction of peptides selected: 0.15
## Original dimension: 288928, new dimension: 101508, difference: 187420.
## Starting fdr filter.
## Target protein FDR: 0.00316227766016838
## Required overall m-score cutoff: 0.01
## achieving protein FDR: 0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.1
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected:
## Total proteins selected: 895
## Final target proteins: 895
## Final decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 2927
## Final target peptides: 2927
## Final decoy peptides: 0
## Total peptides selected from:
## Total peptides: 2927
## Final target peptides: 2927
## Final decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
## Starting proteotypic filter.
## Number of proteins detected: 912
## Protein identifiers: Rv3879c, Rv0577, Rv3800c, Rv0360c, Rv1023, Rv1808
## Number of proteins detected that are supported by a proteotypic peptide: 874
## Number of proteotypic peptides detected: 2900
## Starting peptide filter.
## Number of proteins detected: 874
## First 6 protein identifiers: Rv3879c, Rv0577, Rv3800c, Rv0360c, Rv1023, Rv1808
## Starting maximum peptide filter.
## Before filtering:
## Number of proteins: 874
## Number of peptides: 2900
##
## Percentage of peptides removed: 2.38%
##
## After filtering:
## Number of proteins: 874
## Number of peptides: 2831
## Skipping min peptide filter.
## We went from 4159/19381 proteins/peptides to:
## 874/2831 proteins/peptides.
SWATH2stats provides a couple of ways to write out its results: one in a format specifically intended for MSstats, and another as a more canonical matrix with rows = proteins and columns = samples.
Let us reset the version back to 20190327 here.
## I think these matrices are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rewriting the sample column names.
matrix_prefix <- file.path("preprocessing", "10swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix)
}
## I want to write a few iterations of the filtered data, starting with the
## raw data and moving down; perhaps I should just add this logic to my fancy
## new filtering function?
protein_matrix_unfilt <- write_matrix_proteins(
  filtered_all_filters[["raw"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "protein_matrix_unfiltered.csv"))
## Protein overview matrix preprocessing/10swath2stats/20191021/protein_matrix_unfiltered.csv written to working folder.
## [1] 4159 36
protein_matrix_mscore <- write_matrix_proteins(
  filtered_all_filters[["mscore_filtered"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "protein_matrix_mscore.csv"))
## Protein overview matrix preprocessing/10swath2stats/20191021/protein_matrix_mscore.csv written to working folder.
## [1] 3072 36
peptide_matrix_mscore <- write_matrix_peptides(
  filtered_all_filters[["mscore_filtered"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "peptide_matrix_mscore.csv"))
## Peptide overview matrix preprocessing/10swath2stats/20191021/peptide_matrix_mscore.csv written to working folder.
## [1] 17968 36
protein_matrix_filtered <- write_matrix_proteins(
  filtered_all_filters[["final"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "protein_matrix_filtered.csv"))
## Protein overview matrix preprocessing/10swath2stats/20191021/protein_matrix_filtered.csv written to working folder.
## [1] 874 36
peptide_matrix_filtered <- write_matrix_peptides(
  filtered_all_filters[["final"]], write.csv=TRUE,
  filename=file.path(matrix_prefix, "peptide_matrix_filtered.csv"))
## Peptide overview matrix preprocessing/10swath2stats/20191021/peptide_matrix_filtered.csv written to working folder.
## [1] 2831 36
## Do the correlation of the summed peptide intensities per protein.
rt_sum_cor <- plot_correlation_between_samples(
  filtered_all_filters[["final"]], column.values="intensity",
  fun.aggregate=sum, size=2,
  comparison=transition_group_id ~
    condition + bioreplicate + run)
## And their means
rt_mean_cor <- plot_correlation_between_samples(
  filtered_all_filters[["final"]], column.values="intensity",
  fun.aggregate=mean, size=2,
  comparison=transition_group_id ~
    condition + bioreplicate + run)
cols <- colnames(filtered_all_filters[["final"]])
disaggregated <- disaggregate(filtered_all_filters[["final"]], all.columns=TRUE)
## The library contains 1 transitions per precursor.
## The data table was transformed into a table containing one row per transition.
## One or several columns required by MSstats were not in the data. The columns were created and filled with NAs.
## Missing columns: productcharge, isotopelabeltype
## isotopelabeltype was filled with light.
I want to revisit aLFQ; I think it might provide better protein-level quantification methods. It looks promising, but I have not yet figured out valid parameters for using it.
summary(msstats_input)
devtools::load_all("~/scratch/git/aLFQ")
alfq_input <- tric_data[, c("align_origfilename", "ProteinName", "FullPeptideName",
                            "transition_group_id", "FullPeptideName", "Charge",
                            "Intensity")]
colnames(alfq_input) <- c("run_id", "protein_id", "peptide_id", "transition_id",
                          "peptide_sequence", "precursor_charge",
                          "transition_intensity")
alfq_input[["concentration"]] <- "?"
alfq_inference <- aLFQ::ProteinInference.default(alfq_input, consensus_proteins=FALSE,
                                                 consensus_peptides=FALSE,
                                                 transition_strictness="loose",
                                                 consensus_transitions=FALSE)
alfq_quantities <- aLFQ::AbsoluteQuantification.default(alfq_inference,
                                                        total_protein_concentration=100)
summary(alfq_quantities[[2]])
alfq_norm <- alfq_quantities[[2]]
alfq_min <- min(alfq_norm[["normalized_concentration"]])
alfq_norm[["norm"]] <- alfq_norm[["normalized_concentration"]] / alfq_min
## Hmm that does not look right.
msstats.org seems to provide a complete solution for computing reasonable metrics on these data.
I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf
I also made some moderately intrusive changes to MSstats to make it clearer.
tt <- sm(devtools::load_all("~/scratch/git/MSstats"))
msstats_ver <- "20191021"
checkpoint <- paste0("msstats_dataprocess-v", msstats_ver, ".rda")
if (file.exists(checkpoint)) {
load(file=checkpoint)
} else {
msstats_quant <- dataProcess(msstats_input)
save(file=checkpoint, list=c("msstats_quant"))
}
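This load-or-compute checkpoint pattern recurs a few times below, so it can be wrapped in a small helper. A sketch; the name `with_checkpoint` is my own and is not part of hpgltools:

```r
## Load a cached object from an rda checkpoint if it exists; otherwise
## evaluate the supplied expression, save the result, and return it.
## R's lazy evaluation means 'expr' is never computed on a cache hit.
with_checkpoint <- function(checkpoint, name, expr) {
  if (file.exists(checkpoint)) {
    load(file=checkpoint)
    get(name)
  } else {
    result <- expr
    assign(name, result)
    save(file=checkpoint, list=name)
    result
  }
}

## Usage mirroring the block above:
## msstats_quant <- with_checkpoint(checkpoint, "msstats_quant",
##                                  dataProcess(msstats_input))
```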
##checkpoint <- paste0("msstats_plots-v", msstats_ver, ".rda")
##if (file.exists(checkpoint)) {
## load(file=checkpoint)
##} else {
## msstats_plots <- dataProcessPlots(msstats_quant, type="QCPLOT")
## save(file=checkpoint, list=c("msstats_plots"))
##}
my_levels <- levels(as.factor(msstats_input$condition))
my_levels
comparisons <- make_simplified_contrast_matrix(
numerators=c("dt_whole", "cp_whole"),
denominators=c("wt_whole", "wt_whole"))
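make_simplified_contrast_matrix() is a local helper; my assumption is that it builds a rows-as-contrasts matrix with +1 for each numerator level and -1 for each denominator level. A minimal sketch of that idea in base R (function and level names here are illustrative, not the helper's actual implementation):

```r
## Build a contrast matrix: one row per comparison, one column per
## condition level, +1 in the numerator column, -1 in the denominator.
simple_contrasts <- function(numerators, denominators, levels) {
  mat <- matrix(0, nrow=length(numerators), ncol=length(levels),
                dimnames=list(paste(numerators, denominators, sep="_vs_"),
                              levels))
  for (i in seq_along(numerators)) {
    mat[i, numerators[i]] <- 1
    mat[i, denominators[i]] <- -1
  }
  mat
}

example_levels <- c("cp_whole", "dt_whole", "wt_whole")
simple_contrasts(c("dt_whole", "cp_whole"),
                 c("wt_whole", "wt_whole"),
                 example_levels)
```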
msstats_results <- list()
checkpoint <- paste0("msstats_group-v", ver, ".rda")
if (file.exists(checkpoint)) {
load(file=checkpoint)
} else {
for (i in seq_len(nrow(comparisons))) {
name <- rownames(comparisons)[i]
message("Starting ", name)
comp <- comparisons[i, ]
comp <- t(as.matrix(comp))
rownames(comp) <- name
msstats_results[[name]] <- sm(MSstats::groupComparison(contrast.matrix=comp,
data=msstats_quant))
message("Finished ", name)
}
save(file=checkpoint, list=c("msstats_results"))
}
Yan asked for the PE/PPE protein QC plots. I changed dataProcessPlots() to return something useful, so that should now be possible.
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]
## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="",
x=levels(msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots
pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in seq_along(pe_plots)) {
plot(pe_plots[[p]])
}
dev.off()
length(pe_plots)
Since I am not certain I understand these data, I will take the intensities from SWATH2stats along with the metadata and annotation data, attempt to create a ‘normal’ expressionset, and poke at it to see what I can learn.
I want to use the same metadata as were used for MSstats. There is really only one important difference from the requirements of hpgltools: I do not allow rownames/sample IDs to start with a number.
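The renaming used below boils down to prefixing an ‘s’ so that no sample ID starts with a digit. A minimal illustration (the example IDs are invented):

```r
## hpgltools-style sample IDs must not begin with a digit, so prefix
## any ID that does with 's'; IDs that already start with a letter
## pass through unchanged.
fix_sample_ids <- function(ids) {
  gsub(pattern="^([[:digit:]])", replacement="s\\1", x=ids)
}

fix_sample_ids(c("2019_0801Briken01", "wt_whole_a"))
## The first becomes "s2019_0801Briken01"; the second is untouched.
```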
I do not want the ‘1/’ prefix before the protein names; I already merged them into one entry per gene via SWATH2stats.
There are two ways to get the matrix of intensities: directly from pyprophet, or from the filtered data provided by SWATH2stats. Here is the former, but I should be using the latter.
prot_mtrx <- read.csv(file.path("preprocessing", "10swath2stats",
ver, "protein_matrix_filtered.csv"))
rownames(prot_mtrx) <- gsub(pattern="^1\\/", replacement="",
x=prot_mtrx[["proteinname"]])
prot_mtrx <- prot_mtrx[, -1]
prot_keepers <- grepl(pattern="^Rv", x=rownames(prot_mtrx))
prot_mtrx <- prot_mtrx[prot_keepers, ]
## Important question: Did SWATH2stats reorder my data?
colnames(prot_mtrx) <- gsub(pattern="^(.*)(2018.*)$",
replacement="s\\2", x=colnames(prot_mtrx))
unfilt_mtrx <- read.csv(file.path("preprocessing", "10swath2stats",
ver, "protein_matrix_unfiltered.csv"))
rownames(unfilt_mtrx) <- gsub(pattern="^1\\/", replacement="",
x=unfilt_mtrx[["proteinname"]])
unfilt_mtrx <- unfilt_mtrx[, -1]
unfilt_keepers <- grepl(pattern="^Rv", x=rownames(unfilt_mtrx))
unfilt_mtrx <- unfilt_mtrx[unfilt_keepers, ]
## Important question: Did SWATH2stats reorder my data?
colnames(unfilt_mtrx) <- gsub(pattern="^(.*)(2018.*)$",
replacement="s\\2", x=colnames(unfilt_mtrx))
prot_mtrx <- protein_matrix_filtered
colnames(prot_mtrx) <- gsub(x=colnames(prot_mtrx),
pattern="^(.*)_(2019.*)$", replacement="\\2")
rownames(prot_mtrx) <- prot_mtrx[["proteinname"]]
prot_mtrx[["proteinname"]] <- NULL
rv_idx <- grepl(x=rownames(prot_mtrx), pattern="^Rv")
prot_mtrx <- prot_mtrx[rv_idx, ]
unfilt_mtrx <- protein_matrix_unfilt
colnames(unfilt_mtrx) <- gsub(x=colnames(unfilt_mtrx),
pattern="^(.*)_(2019.*)$", replacement="\\2")
rownames(unfilt_mtrx) <- gsub(pattern="^1\\/", replacement="", x=unfilt_mtrx[["proteinname"]])
unfilt_mtrx[["proteinname"]] <- NULL
rv_idx <- grepl(x=rownames(unfilt_mtrx), pattern="^Rv")
unfilt_mtrx <- unfilt_mtrx[rv_idx, ]
## The old way of getting genome/annotation data
mtb_gff <- "reference/mycobacterium_tuberculosis_h37rv_2.gff.gz"
mtb_genome <- "reference/mtuberculosis_h37rv_genbank.fasta"
mtb_cds <- "reference/mtb_cds.fasta"
mtb_annotations <- sm(load_gff_annotations(mtb_gff, type="gene"))
colnames(mtb_annotations) <- gsub(pattern="\\.", replacement="", x=colnames(mtb_annotations))
mtb_annotations[["description"]] <- gsub(pattern="\\+", replacement=" ",
x=mtb_annotations[["description"]])
mtb_annotations[["function"]] <- gsub(pattern="\\+", replacement=" ",
x=mtb_annotations[["function"]])
mtb_microbes <- load_microbesonline_annotations(id=83332)
## The species being downloaded is: Mycobacterium tuberculosis H37Rv
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=83332;export=tab
Now we should have sufficient pieces to make an expressionset.
While here, I will also split the data into a culture-filtrate (cf) and whole-cell pair of data structures.
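The cf/whole-cell split amounts to subsetting the samples by the suffix of the condition column. A sketch of the idea on a plain matrix and metadata (hpgltools has its own subsetting helpers; the metadata here are invented for illustration):

```r
## Split samples into culture-filtrate (cf) and whole-cell (wh) sets
## by the suffix of the condition column in the metadata.
meta <- data.frame(condition=c("wt_wh", "wt_cf", "dt_wh", "dt_cf"),
                   row.names=c("s1", "s2", "s3", "s4"))
cf_samples <- rownames(meta)[grepl("_cf$", meta$condition)]
wh_samples <- rownames(meta)[grepl("_wh$", meta$condition)]
## Then subset the intensity matrix by those column sets, e.g.:
## cf_mtrx <- prot_mtrx[, cf_samples]
## wh_mtrx <- prot_mtrx[, wh_samples]
```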
## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(prot_mtrx)
metadata <- sample_annot[reordered, ]
colnames(prot_mtrx) <- gsub(x=colnames(prot_mtrx), pattern="^.*(2019.*$)", replacement="s\\1")
mtb_annotations <- mtb_annotations[, -1]
protein_expt <- create_expt(sample_annot,
count_dataframe=prot_mtrx,
gene_info=mtb_annotations)
## Reading the sample metadata.
## The sample definitions comprises: 35 rows(samples) and 20 columns(metadata fields).
## Matched 869 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 873 rows and 35 columns.
reordered <- colnames(unfilt_mtrx)
metadata <- sample_annot[reordered, ]
colnames(unfilt_mtrx) <- gsub(x=colnames(unfilt_mtrx), pattern="^.*(2019.*$)", replacement="s\\1")
mtb_annotations <- mtb_annotations[, -1]
unfilt_expt <- create_expt(sample_annot,
count_dataframe=unfilt_mtrx,
gene_info=mtb_annotations)
## Reading the sample metadata.
## The sample definitions comprises: 35 rows(samples) and 20 columns(metadata fields).
## Matched 2927 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 2954 rows and 35 columns.
filt_norm <- normalize_expt(protein_expt, filter=TRUE,
norm="quant", convert="cpm",
transform="log2")
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (873 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 102 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
unfilt_norm <- normalize_expt(unfilt_expt, filter=TRUE,
norm="quant", convert="cpm",
transform="log2")
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 181 low-count genes (2773 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 6746 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
written_file <- glue::glue("excel/{rundate}_protein_expt-v{ver}.xlsx")
protein_write <- write_expt(protein_expt, violin=TRUE,
batch="raw", excel=written_file)
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ (1|condition)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
##
## Total:10 s
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ (1|condition)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
##
## Total:9 s
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.
## Note: zip::zip() is deprecated, please use zip::zipr() instead
written_file <- glue::glue("excel/{rundate}_unfilt_protein_expt-v{ver}.xlsx")
unfilt_write <- write_expt(unfilt_expt, violin=TRUE,
batch="raw", excel=written_file)
## Writing the first sheet, containing a legend and some summary data.
## Writing the raw reads.
## Graphing the raw reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ (1|condition)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
##
## Total:18 s
## Placing factor: condition at the beginning of the model.
## Writing the normalized reads.
## Graphing the normalized reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ (1|condition)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
##
## Total:25 s
## Placing factor: condition at the beginning of the model.
## Writing the median reads by factor.
protein_filt <- sm(normalize_expt(protein_expt, filter=TRUE))
protein_norm <- sm(normalize_expt(protein_filt, norm="quant", transform="log2",
convert="cpm", filter=TRUE))
protein_unfilt <- sm(normalize_expt(unfilt_expt, filter=TRUE))
protein_de <- sm(all_pairwise(protein_filt, model_batch=FALSE, force=TRUE, parallel=FALSE))
unfilt_de <- sm(all_pairwise(protein_unfilt, model_batch=FALSE, force=TRUE, parallel=FALSE))
keepers <- list(
"whdt_wt" = c("dt_wh", "wt_wh"),
"whcp_wt" = c("cp_wh", "wt_wh"),
"cfdt_wt" = c("dt_cf", "wt_cf"),
"cfcp_wt" = c("cp_cf", "wt_cf"),
"wtcf_wh" = c("wt_cf", "wt_wh"),
"dtcf_wh" = c("dt_cf", "dt_cf")
)
protein_tables <- sm(combine_de_tables(
protein_de, keepers=keepers,
excel=glue::glue("excel/de_{rundate}_tables_v{ver}.xlsx")))
protein_sig <- sm(extract_significant_genes(
protein_tables,
excel=glue::glue("excel/sig_{rundate}_tables_v{ver}.xlsx")))
unfilt_tables <- sm(combine_de_tables(
unfilt_de, keepers=keepers,
excel=glue::glue("excel/de_{rundate}_unfilt_tables_v{ver}.xlsx")))
unfilt_sig <- extract_significant_genes(
unfilt_tables,
excel=glue::glue("excel/sig_{rundate}_unfilt_tables_v{ver}.xlsx"))
## Writing a legend of columns.
## Did not find the ebseq_logfc, skipping ebseq.
## Writing excel data according to limma for whdt_wt: 1/24.
## After (adj)p filter, the up genes table has 72 genes.
## After (adj)p filter, the down genes table has 109 genes.
## After fold change filter, the up genes table has 70 genes.
## After fold change filter, the down genes table has 108 genes.
## Writing excel data according to limma for whcp_wt: 2/24.
## After (adj)p filter, the up genes table has 13 genes.
## After (adj)p filter, the down genes table has 5 genes.
## After fold change filter, the up genes table has 13 genes.
## After fold change filter, the down genes table has 5 genes.
## Writing excel data according to limma for cfdt_wt: 3/24.
## After (adj)p filter, the up genes table has 0 genes.
## After (adj)p filter, the down genes table has 0 genes.
## After fold change filter, the up genes table has 0 genes.
## After fold change filter, the down genes table has 0 genes.
## Writing excel data according to limma for cfcp_wt: 4/24.
## After (adj)p filter, the up genes table has 1 genes.
## After (adj)p filter, the down genes table has 0 genes.
## After fold change filter, the up genes table has 1 genes.
## After fold change filter, the down genes table has 0 genes.
## Writing excel data according to limma for wtcf_wh: 5/24.
## After (adj)p filter, the up genes table has 879 genes.
## After (adj)p filter, the down genes table has 702 genes.
## After fold change filter, the up genes table has 869 genes.
## After fold change filter, the down genes table has 692 genes.
## Printing significant genes to the file: excel/sig_20191029_unfilt_tables_v20191021.xlsx
## 1/5: Creating significant table up_limma_whdt_wt
## 2/5: Creating significant table up_limma_whcp_wt
## The up table cfdt_wt is empty.
## The down table cfdt_wt is empty.
## 4/5: Creating significant table up_limma_cfcp_wt
## The down table cfcp_wt is empty.
## 5/5: Creating significant table up_limma_wtcf_wh
## Writing excel data according to edger for whdt_wt: 1/24.
## After (adj)p filter, the up genes table has 3 genes.
## After (adj)p filter, the down genes table has 20 genes.
## After fold change filter, the up genes table has 3 genes.
## After fold change filter, the down genes table has 20 genes.
## Writing excel data according to edger for whcp_wt: 2/24.
## After (adj)p filter, the up genes table has 3 genes.
## After (adj)p filter, the down genes table has 21 genes.
## After fold change filter, the up genes table has 3 genes.
## After fold change filter, the down genes table has 21 genes.
## Writing excel data according to edger for cfdt_wt: 3/24.
## After (adj)p filter, the up genes table has 373 genes.
## After (adj)p filter, the down genes table has 84 genes.
## After fold change filter, the up genes table has 373 genes.
## After fold change filter, the down genes table has 84 genes.
## Writing excel data according to edger for cfcp_wt: 4/24.
## After (adj)p filter, the up genes table has 199 genes.
## After (adj)p filter, the down genes table has 133 genes.
## After fold change filter, the up genes table has 199 genes.
## After fold change filter, the down genes table has 133 genes.
## Writing excel data according to edger for wtcf_wh: 5/24.
## After (adj)p filter, the up genes table has 190 genes.
## After (adj)p filter, the down genes table has 1031 genes.
## After fold change filter, the up genes table has 189 genes.
## After fold change filter, the down genes table has 1031 genes.
## Printing significant genes to the file: excel/sig_20191029_unfilt_tables_v20191021.xlsx
## 1/5: Creating significant table up_edger_whdt_wt
## 2/5: Creating significant table up_edger_whcp_wt
## 3/5: Creating significant table up_edger_cfdt_wt
## 4/5: Creating significant table up_edger_cfcp_wt
## 5/5: Creating significant table up_edger_wtcf_wh
## Writing excel data according to deseq for whdt_wt: 1/24.
## After (adj)p filter, the up genes table has 64 genes.
## After (adj)p filter, the down genes table has 119 genes.
## After fold change filter, the up genes table has 49 genes.
## After fold change filter, the down genes table has 117 genes.
## Writing excel data according to deseq for whcp_wt: 2/24.
## After (adj)p filter, the up genes table has 55 genes.
## After (adj)p filter, the down genes table has 66 genes.
## After fold change filter, the up genes table has 55 genes.
## After fold change filter, the down genes table has 66 genes.
## Writing excel data according to deseq for cfdt_wt: 3/24.
## After (adj)p filter, the up genes table has 384 genes.
## After (adj)p filter, the down genes table has 99 genes.
## After fold change filter, the up genes table has 384 genes.
## After fold change filter, the down genes table has 96 genes.
## Writing excel data according to deseq for cfcp_wt: 4/24.
## After (adj)p filter, the up genes table has 213 genes.
## After (adj)p filter, the down genes table has 145 genes.
## After fold change filter, the up genes table has 212 genes.
## After fold change filter, the down genes table has 144 genes.
## Writing excel data according to deseq for wtcf_wh: 5/24.
## After (adj)p filter, the up genes table has 234 genes.
## After (adj)p filter, the down genes table has 1313 genes.
## After fold change filter, the up genes table has 183 genes.
## After fold change filter, the down genes table has 1282 genes.
## Printing significant genes to the file: excel/sig_20191029_unfilt_tables_v20191021.xlsx
## 1/5: Creating significant table up_deseq_whdt_wt
## 2/5: Creating significant table up_deseq_whcp_wt
## 3/5: Creating significant table up_deseq_cfdt_wt
## 4/5: Creating significant table up_deseq_cfcp_wt
## 5/5: Creating significant table up_deseq_wtcf_wh
## Writing excel data according to basic for whdt_wt: 1/24.
## After (adj)p filter, the up genes table has 0 genes.
## After (adj)p filter, the down genes table has 1 genes.
## After fold change filter, the up genes table has 0 genes.
## After fold change filter, the down genes table has 1 genes.
## Writing excel data according to basic for whcp_wt: 2/24.
## After (adj)p filter, the up genes table has 0 genes.
## After (adj)p filter, the down genes table has 0 genes.
## After fold change filter, the up genes table has 0 genes.
## After fold change filter, the down genes table has 0 genes.
## Writing excel data according to basic for cfdt_wt: 3/24.
## After (adj)p filter, the up genes table has 0 genes.
## After (adj)p filter, the down genes table has 0 genes.
## After fold change filter, the up genes table has 0 genes.
## After fold change filter, the down genes table has 0 genes.
## Writing excel data according to basic for cfcp_wt: 4/24.
## After (adj)p filter, the up genes table has 0 genes.
## After (adj)p filter, the down genes table has 0 genes.
## After fold change filter, the up genes table has 0 genes.
## After fold change filter, the down genes table has 0 genes.
## Writing excel data according to basic for wtcf_wh: 5/24.
## After (adj)p filter, the up genes table has 68 genes.
## After (adj)p filter, the down genes table has 1765 genes.
## After fold change filter, the up genes table has 67 genes.
## After fold change filter, the down genes table has 1762 genes.
## Printing significant genes to the file: excel/sig_20191029_unfilt_tables_v20191021.xlsx
## The up table whdt_wt is empty.
## The up table whcp_wt is empty.
## The down table whcp_wt is empty.
## The up table cfdt_wt is empty.
## The down table cfdt_wt is empty.
## The up table cfcp_wt is empty.
## The down table cfcp_wt is empty.
## 5/5: Creating significant table up_basic_wtcf_wh
## Adding significance bar plots.
## Before removal, there were 873 entries.
## Now there are 871 entries.
## Percent kept: 99.983, 99.956, 99.969, 99.930, 99.986, 99.927, 99.975, 99.917, 99.954, 99.984, 99.938, 99.981, 99.922, 99.987, 99.944, 99.982, 99.913, 99.976, 98.712, 99.090, 99.056, 99.987, 98.827, 99.977, 99.896, 99.149, 98.707, 99.986, 99.891, 99.208, 99.691, 98.733, 98.759, 99.860, 99.964
## Percent removed: 0.017, 0.044, 0.031, 0.070, 0.014, 0.073, 0.025, 0.083, 0.046, 0.016, 0.062, 0.019, 0.078, 0.013, 0.056, 0.018, 0.087, 0.024, 1.288, 0.910, 0.944, 0.013, 1.173, 0.023, 0.104, 0.851, 1.293, 0.014, 0.109, 0.792, 0.309, 1.267, 1.241, 0.140, 0.036
remove_norm <- normalize_expt(remove_two, transform="log2", convert="cpm", norm="quant",
filter=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (871 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 86 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## Put NAs into the data for proteins which have 0s in some, but not
## all, of the samples of each condition.
protein_nas <- add_conditional_nas(unfilt_expt)
## In condition wt_wh there are 264 rows which are all zero.
## In condition wt_cf there are 1377 rows which are all zero.
## In condition dt_wh there are 370 rows which are all zero.
## In condition dt_cf there are 1071 rows which are all zero.
## In condition cp_wh there are 269 rows which are all zero.
## In condition cp_cf there are 1291 rows which are all zero.
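add_conditional_nas() is a local helper; as I read the comment above, the gist is that a zero while other replicates in the same condition are nonzero is probably a missed measurement, so those zeros become NAs, while rows that are zero across a whole condition are left alone. A rough sketch of that idea, not the actual implementation:

```r
## Within each condition, convert zeros to NAs for rows that have at
## least one nonzero and at least one zero value in that condition;
## rows which are all zero within the condition are untouched.
na_partial_zero <- function(mtrx, conditions) {
  for (cond in unique(conditions)) {
    cols <- which(conditions == cond)
    sub <- mtrx[, cols, drop=FALSE]
    partial <- rowSums(sub != 0) > 0 & rowSums(sub == 0) > 0
    sub[partial & sub == 0] <- NA  # 'partial' recycles down each column
    mtrx[, cols] <- sub
  }
  mtrx
}

m <- matrix(c(0, 5, 0, 0,
              2, 3, 1, 4), nrow=2, byrow=TRUE)
na_partial_zero(m, conditions=c("a", "a", "b", "b"))
## Row 1 is partially zero in condition "a", so m[1, 1] becomes NA;
## it is all zero in condition "b", so those zeros are kept.
```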
Let us pull the following subset from the DE tables for Volker; it should provide the set of proteins most obviously of interest, assuming the false negatives are not too severe.
This will hopefully find things which are sufficiently changed in the deletion, but not in the complement, samples to be interesting.
dt_wt <- unfilt_nas_tables[["data"]][["dt_wt"]]
cp_wt <- unfilt_nas_tables[["data"]][["cp_wt"]]
down_idx <- dt_wt[["deseq_logfc"]] <= -6 & cp_wt[["deseq_logfc"]] >= -3
down_table <- dt_wt[down_idx, ]
up_idx <- dt_wt[["deseq_logfc"]] >= 10 & cp_wt[["deseq_logfc"]] <= 4
up_table <- dt_wt[up_idx, ]
down_subset <- write_xls(
data=down_table,
excel=glue::glue("excel/de_{rundate}_more_down_delta.xlsx"))
up_subset <- write_xls(
data=up_table,
excel=glue::glue("excel/de_{rundate}_more_up_delta.xlsx"))
enc_metadata <- sample_annot
enc_matrix <- read.table("preprocessing/09encyclopedia/20191001cf_quantities.elib.proteins.txt",
header=TRUE)
enc_pep_matrix <- read.table("preprocessing/09encyclopedia/20191001cf_quantities.elib.peptides.txt", header=TRUE)
rownames(enc_matrix) <- enc_matrix[["Protein"]]
## Drop the three leading annotation columns, leaving only intensities.
enc_matrix <- enc_matrix[, -(1:3)]
colnames(enc_matrix)
colnames(enc_matrix) <- gsub(pattern="^X", replacement="s", x=colnames(enc_matrix))
colnames(enc_matrix) <- gsub(pattern="\\.mzML", replacement="", x=colnames(enc_matrix))
colnames(enc_pep_matrix) <- gsub(pattern="^X", replacement="s", x=colnames(enc_pep_matrix))
colnames(enc_pep_matrix) <- gsub(pattern="\\.mzML", replacement="", x=colnames(enc_pep_matrix))
colnames(enc_matrix)
rownames(enc_metadata)
enc_expt <- create_expt(metadata=enc_metadata, count_dataframe=enc_matrix,
gene_info=mtb_annotations)
enc_norm <- normalize_expt(enc_expt, norm="quant", convert="cpm", filter=TRUE,
transform="log2")
plot_pca(enc_norm)$plot
if (!isTRUE(get0("skip_load"))) {
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
pander::pander(sessionInfo())
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset c2881f6d97e1ec981fd1481cf46d6bc875fac423
## This is hpgltools commit: Tue Oct 22 10:22:30 2019 -0400: c2881f6d97e1ec981fd1481cf46d6bc875fac423
## Saving to 03_swath2stats_20191021-v20191021.rda.xz
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: edgeR(v.3.27.13), variancePartition(v.1.15.8), SWATH2stats(v.1.13.5), testthat(v.2.2.1), hpgltools(v.1.0), Biobase(v.2.45.1) and BiocGenerics(v.0.31.6)
loaded via a namespace (and not attached): tidyselect(v.0.2.5), lme4(v.1.1-21), htmlwidgets(v.1.5.1), RSQLite(v.2.1.2), AnnotationDbi(v.1.47.1), grid(v.3.6.1), BiocParallel(v.1.19.4), Rtsne(v.0.15), devtools(v.2.2.1), munsell(v.0.5.0), preprocessCore(v.1.47.1), codetools(v.0.2-16), withr(v.2.1.2), colorspace(v.1.4-1), GOSemSim(v.2.11.0), knitr(v.1.25), rstudioapi(v.0.10), stats4(v.3.6.1), Vennerable(v.3.1.0.9000), robustbase(v.0.93-5), readODS(v.1.6.7), DOSE(v.3.11.2), labeling(v.0.3), urltools(v.1.7.3), GenomeInfoDbData(v.1.2.1), polyclip(v.1.10-0), bit64(v.0.9-7), farver(v.1.1.0), rprojroot(v.1.3-2), vctrs(v.0.2.0), xfun(v.0.10), BiocFileCache(v.1.9.1), fastcluster(v.1.1.25), R6(v.2.4.0), doParallel(v.1.0.15), GenomeInfoDb(v.1.21.2), graphlayouts(v.0.5.0), locfit(v.1.5-9.1), bitops(v.1.0-6), fgsea(v.1.11.1), gridGraphics(v.0.4-1), DelayedArray(v.0.11.8), assertthat(v.0.2.1), scales(v.1.0.0), nnet(v.7.3-12), ggraph(v.2.0.0), enrichplot(v.1.5.2), gtable(v.0.3.0), sva(v.3.33.1), processx(v.3.4.1), tidygraph(v.1.1.2), rlang(v.0.4.0), zeallot(v.0.1.0), genefilter(v.1.67.1), splines(v.3.6.1), rtracklayer(v.1.45.6), lazyeval(v.0.2.2), acepack(v.1.4.1), selectr(v.0.4-1), checkmate(v.1.9.4), europepmc(v.0.3), BiocManager(v.1.30.8), yaml(v.2.2.0), reshape2(v.1.4.3), GenomicFeatures(v.1.37.4), backports(v.1.1.5), Hmisc(v.4.2-0), qvalue(v.2.17.0), RBGL(v.1.61.0), clusterProfiler(v.3.13.0), tools(v.3.6.1), usethis(v.1.5.1), ggplotify(v.0.0.4), ggplot2(v.3.2.1), ellipsis(v.0.3.0), gplots(v.3.0.1.1), RColorBrewer(v.1.1-2), sessioninfo(v.1.1.1), ggridges(v.0.5.1), Rcpp(v.1.0.2), plyr(v.1.8.4), base64enc(v.0.1-3), progress(v.1.2.2), zlibbioc(v.1.31.0), purrr(v.0.3.3), RCurl(v.1.95-4.12), ps(v.1.3.0), prettyunits(v.1.0.2), rpart(v.4.1-15), openssl(v.1.4.1), viridis(v.0.5.1), cowplot(v.1.0.0), S4Vectors(v.0.23.25), cluster(v.2.1.0), SummarizedExperiment(v.1.15.9), ggrepel(v.0.8.1), colorRamps(v.2.3), fs(v.1.3.1), magrittr(v.1.5), data.table(v.1.12.6), openxlsx(v.4.1.0.1), 
DO.db(v.2.9), BRAIN(v.1.31.0), triebeard(v.0.3.0), matrixStats(v.0.55.0), pkgload(v.1.0.2), hms(v.0.5.1), evaluate(v.0.14), xtable(v.1.8-4), pbkrtest(v.0.4-7), XML(v.3.98-1.20), IRanges(v.2.19.17), gridExtra(v.2.3), compiler(v.3.6.1), biomaRt(v.2.41.9), tibble(v.2.1.3), KernSmooth(v.2.23-16), crayon(v.1.3.4), minqa(v.1.2.4), htmltools(v.0.4.0), corpcor(v.1.6.9), mgcv(v.1.8-29), Formula(v.1.2-3), geneplotter(v.1.63.0), tidyr(v.1.0.0), DBI(v.1.0.0), tweenr(v.1.0.1), dbplyr(v.1.4.2), MASS(v.7.3-51.4), rappdirs(v.0.3.1), boot(v.1.3-23), Matrix(v.1.2-17), readr(v.1.3.1), cli(v.1.1.0), quadprog(v.1.5-7), gdata(v.2.18.0), igraph(v.1.2.4.1), GenomicRanges(v.1.37.17), pkgconfig(v.2.0.3), rvcheck(v.0.1.5), GenomicAlignments(v.1.21.7), foreign(v.0.8-72), xml2(v.1.2.2), foreach(v.1.4.7), annotate(v.1.63.0), XVector(v.0.25.0), rvest(v.0.3.4), stringr(v.1.4.0), callr(v.3.3.2), PolynomF(v.2.0-2), digest(v.0.6.22), graph(v.1.63.0), Biostrings(v.2.53.2), rmarkdown(v.1.16), cellranger(v.1.1.0), fastmatch(v.1.1-0), htmlTable(v.1.13.2), directlabels(v.2018.05.22), curl(v.4.2), Rsamtools(v.2.1.7), gtools(v.3.8.1), nloptr(v.1.2.1), lifecycle(v.0.1.0), nlme(v.3.1-141), jsonlite(v.1.6), desc(v.1.2.0), viridisLite(v.0.3.0), askpass(v.1.1), limma(v.3.41.18), pillar(v.1.4.2), lattice(v.0.20-38), DEoptimR(v.1.0-8), httr(v.1.4.1), pkgbuild(v.1.0.6), survival(v.2.44-1.1), GO.db(v.3.8.2), glue(v.1.3.1), remotes(v.2.1.0), zip(v.2.0.4), iterators(v.1.0.12), pander(v.0.6.3), bit(v.1.1-14), ggforce(v.0.3.1), stringi(v.1.4.3), blob(v.1.2.0), DESeq2(v.1.25.17), latticeExtra(v.0.6-28), caTools(v.1.17.1.2), memoise(v.1.1.0) and dplyr(v.0.8.3)