This worksheet is in response to Yan’s email:
“Following up to our earlier conversation. Here are some thing I’m hoping you can extract out of the current data: 1. with Briken data, High resolution vs low resolution MSMS for library generation: Generate separate libraries with 0116DDA01 (low resolution) and 0129DDA01 (high resolution), then use the two separate libraries to quantify DIA01-03, 11-13 with corresponding fragment mass accuracy (0.5 Da vs 0.02Da). Basically the DIA data is done with high res. I want to see if high res library data will help.”
This document will attempt to do that with the single low-res DDA data set therefore.
echo "Setting variables for the rest of the script."
search_method="comet"
cleavages="2"
refdb="reference/reference.fasta"
decoy_string="DECOY_"
confidence_cutoff="0.05"
ippp="0.97"
## irtdb="reference/reference.fasta"
dda_method="HCD"
## I am an idiot, I keep typing arbf when I test my output files...
test="Rv1611"
dda_input=$(/bin/ls mzXML/dda_${dda_method}/*.mzXML)
startdir="/fs/cbcb-scratch/abelew/mycobacterium_tuberculosis_2018"
mkdir -p results/pepxml
fdr_result="results/fdr_controlled/fdr_library_${search_method}_${dda_method}.xml"
mkdir -p results/fdr_controlled
ipp_result="results/iprophet/iprophet_${search_method}_${dda_method}.xml"
mkdir -p results/iprophet
mayu_out="results/mayu/mayu_${search_method}_${dda_method}.out"
mkdir -p results/mayu
spectral_basename="results/spectral_libraries/${search_method}_${dda_method}"
spectral_library="${spectral_basename}.splib"
mkdir -p results/spectral_libraries
consensus="results/spectral_consensus/${search_method}_${dda_method}"
mkdir -p results/spectral_consensus
declare -a mz_windows=("8mz" "20mz")
tempdir=$(pwd)/tmp
dscore_cutoff="10"
## Setting variables for the rest of the script.
## It appears that the module command causes runr to die.
module add openms/2.0.0
module add tpp
module add mayu
## Yay modules!
The thermo .raw files were moved into directories labeled: raw/dda, raw/dia, and raw/std
For the dda files, the windows ‘RawConverter’ from the TPP was used to convert to mzXML. For the dia files, the windows ‘MSConvertGUI’ was used to convert to mzXML.
In theory, one is supposed to use RawConverter for both, but it crashes. This is a pity, as it has some options for centroided conversion which look to be useful.
I copied the human genpep database to reference/ (AUP00005640.fasta) Upon figuring out the sequences of the ABRF peptides, I added them as abrf.fasta and concatenated the two files into reference.fasta.
A default configuration for comet was created by running comet -p
and editing the resulting file in order to be more similar to the suggestions in the tutorial. The relevant changes and options mentioned in the tutorial are:
Once I was satisfied, I saved it to comet/comet_params.txt
I am using comet here primarily because it is what was used in: Tutorial-1_LibraryGeneration.pdf. With that in mind, as long as the results are in a single pep.xml file for each dda acquisition file, then I am all good I think.
With that in mind, I believe I will do some testing in the near future to see if I can also use mgsf/tandem/etc.
## If using the cluster, comet is in module 'comet'
## This is the old CID input.
##export INPUT="2018_0116BrikenDDA01.mzXML"
##export INPUT="2018_0129BrikenDDA01.mzXML"
##export BASE=$(basename ${INPUT} .mzXML)
## Crap in a hat, the new high-res mzXML file has many fewer spectra than the old one
## and includes many fewer hits in comet.
## I will use both and see if that helps.
echo "Starting run of comet using the configuration file:
comet/comet_${dda_method}_params.txt against ${dda_input}."
for input in ${dda_input}
do
base=$(basename "${dda_input}" .mzXML)
/cbcb/sw/RedHat-7-x86_64/common/local/tpp/5.1.0/bin/comet \
-Pcomet/comet_${dda_method}_params.txt \
"${input}"
done
## I want to look into using some of the other search engines here.
Apparently comet can only report the first hit of a bunch when reporting its matches. TPP provides a thing to fix this:
If I read the documentation correctly, this should give me the alternate hits when available.
echo "Starting refreshparser to hopefully add back the alternate matches dropped by comet."
for input in $(/bin/ls mzXML/dda_${dda_method}/*.pep.xml)
do
echo "Refreshparser-ing ${input}."
RefreshParser \
"${input}" \
"${refdb}"
done
This tool is intended to combine the searches for each input file into a single set of controlled FDRs. It works by examining each search output file and creating a distribution of the search scores for each set of targets and decoys. Once it has these distributions, it computes the likelihood that each individual search was correct.
It is worth noting that I installed tpp from the git tree and statically linked as much of it as possible so that it is possible for me to do test runs on my workstation. Hopefully it will run properly as a result, but this is not guaranteed as my workstation runs a recent version of Debian linux, while I compiled tpp on the cluster’s RedHat 7.0 architecture using gcc 6.2.0.
Here are some of the xinteract options (note that xinteract options must not be separated by spaces, unlike most unix-like tools.):
It turns out xinteract creates multiple output files, I should figure out the others.
echo "Invoking xinteract to merge searches, set decoys, and set up fdr."
xinteract \
-d${decoy_string} \
-OARPpd \
-N${fdr_result} \
"$(/bin/ls "mzXML/dda_${dda_method}/"*.pep.xml)"
grep "${test}" "${fdr_result}" | head
prophet_table <- extract_peprophet_data("comet/pepxml/fdr_library_HCD.xml")
## Extracting spectrum queries.
## Extracting the spectrum_query metadata.
## Extracting the search_result metadata.
## Extracting modification metadata.
## Filling in modification information, this is slow.
##
|
| | 0%
|
| | 1%
|
|= | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|== | 4%
|
|=== | 4%
|
|=== | 5%
|
|==== | 5%
|
|==== | 6%
|
|==== | 7%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|====== | 10%
|
|======= | 10%
|
|======= | 11%
|
|======= | 12%
|
|======== | 12%
|
|======== | 13%
|
|========= | 13%
|
|========= | 14%
|
|========= | 15%
|
|========== | 15%
|
|========== | 16%
|
|=========== | 16%
|
|=========== | 17%
|
|=========== | 18%
|
|============ | 18%
|
|============ | 19%
|
|============= | 19%
|
|============= | 20%
|
|============= | 21%
|
|============== | 21%
|
|============== | 22%
|
|=============== | 22%
|
|=============== | 23%
|
|=============== | 24%
|
|================ | 24%
|
|================ | 25%
|
|================= | 25%
|
|================= | 26%
|
|================= | 27%
|
|================== | 27%
|
|================== | 28%
|
|=================== | 28%
|
|=================== | 29%
|
|=================== | 30%
|
|==================== | 30%
|
|==================== | 31%
|
|==================== | 32%
|
|===================== | 32%
|
|===================== | 33%
|
|====================== | 33%
|
|====================== | 34%
|
|====================== | 35%
|
|======================= | 35%
|
|======================= | 36%
|
|======================== | 36%
|
|======================== | 37%
|
|======================== | 38%
|
|========================= | 38%
|
|========================= | 39%
|
|========================== | 39%
|
|========================== | 40%
|
|========================== | 41%
|
|=========================== | 41%
|
|=========================== | 42%
|
|============================ | 42%
|
|============================ | 43%
|
|============================ | 44%
|
|============================= | 44%
|
|============================= | 45%
|
|============================== | 45%
|
|============================== | 46%
|
|============================== | 47%
|
|=============================== | 47%
|
|=============================== | 48%
|
|================================ | 48%
|
|================================ | 49%
|
|================================ | 50%
|
|================================= | 50%
|
|================================= | 51%
|
|================================= | 52%
|
|================================== | 52%
|
|================================== | 53%
|
|=================================== | 53%
|
|=================================== | 54%
|
|=================================== | 55%
|
|==================================== | 55%
|
|==================================== | 56%
|
|===================================== | 56%
|
|===================================== | 57%
|
|===================================== | 58%
|
|====================================== | 58%
|
|====================================== | 59%
|
|======================================= | 59%
|
|======================================= | 60%
|
|======================================= | 61%
|
|======================================== | 61%
|
|======================================== | 62%
|
|========================================= | 62%
|
|========================================= | 63%
|
|========================================= | 64%
|
|========================================== | 64%
|
|========================================== | 65%
|
|=========================================== | 65%
|
|=========================================== | 66%
|
|=========================================== | 67%
|
|============================================ | 67%
|
|============================================ | 68%
|
|============================================= | 68%
|
|============================================= | 69%
|
|============================================= | 70%
|
|============================================== | 70%
|
|============================================== | 71%
|
|============================================== | 72%
|
|=============================================== | 72%
|
|=============================================== | 73%
|
|================================================ | 73%
|
|================================================ | 74%
|
|================================================ | 75%
|
|================================================= | 75%
|
|================================================= | 76%
|
|================================================== | 76%
|
|================================================== | 77%
|
|================================================== | 78%
|
|=================================================== | 78%
|
|=================================================== | 79%
|
|==================================================== | 79%
|
|==================================================== | 80%
|
|==================================================== | 81%
|
|===================================================== | 81%
|
|===================================================== | 82%
|
|====================================================== | 82%
|
|====================================================== | 83%
|
|====================================================== | 84%
|
|======================================================= | 84%
|
|======================================================= | 85%
|
|======================================================== | 85%
|
|======================================================== | 86%
|
|======================================================== | 87%
|
|========================================================= | 87%
|
|========================================================= | 88%
|
|========================================================== | 88%
|
|========================================================== | 89%
|
|========================================================== | 90%
|
|=========================================================== | 90%
|
|=========================================================== | 91%
|
|=========================================================== | 92%
|
|============================================================ | 92%
|
|============================================================ | 93%
|
|============================================================= | 93%
|
|============================================================= | 94%
|
|============================================================= | 95%
|
|============================================================== | 95%
|
|============================================================== | 96%
|
|=============================================================== | 96%
|
|=============================================================== | 97%
|
|=============================================================== | 98%
|
|================================================================ | 98%
|
|================================================================ | 99%
|
|=================================================================| 99%
|
|=================================================================| 100%
## Extracting the search_score metadata.
## Extracting the peptideprophet_result probabilities.
## Extracting the search parameters.
summary(prophet_table)
## protein decoy peptide start_scan
## Length:25712 Mode :logical Length:25712 Min. : 752
## Class :character FALSE:24742 Class :character 1st Qu.:14042
## Mode :character TRUE :970 Mode :character Median :25587
## Mean :26422
## 3rd Qu.:38252
## Max. :55070
##
## end_scan index precursor_neutral_mass assumed_charge
## Min. : 752 Min. : 1 Min. : 670 2:11066
## 1st Qu.:14042 1st Qu.: 6429 1st Qu.:1340 3:11496
## Median :25587 Median :12856 Median :1750 4: 2899
## Mean :26422 Mean :12856 Mean :1877 5: 234
## 3rd Qu.:38252 3rd Qu.:19284 3rd Qu.:2273 6: 17
## Max. :55070 Max. :25712 Max. :6133
##
## retention_time_sec peptide_prev_aa peptide_next_aa num_tot_proteins
## Min. : 757 Length:25712 Length:25712 0: 970
## 1st Qu.:2557 Class :character Class :character 1:24395
## Median :3868 Mode :character Mode :character 2: 258
## Mean :3901 3: 27
## 3rd Qu.:5231 4: 6
## Max. :6901 5: 56
##
## num_matched_ions tot_num_ions matched_ion_ratio
## 13 : 1273 Length:25712 Min. :0.000
## 12 : 1217 Class :character 1st Qu.:0.200
## 11 : 1193 Mode :character Median :0.328
## 10 : 1184 Mean :0.356
## 14 : 1181 3rd Qu.:0.500
## 15 : 1050 Max. :0.938
## (Other):18614
## calc_neutral_pep_mass massdiff num_tol_term num_missed_cleavages
## Min. : 669 Min. :-0.0857 2:25712 0:18860
## 1st Qu.:1340 1st Qu.: 0.0004 1: 5949
## Median :1750 Median : 0.0047 2: 903
## Mean :1876 Mean : 0.4330
## 3rd Qu.:2273 3rd Qu.: 1.0042
## Max. :6132 Max. : 1.0624
##
## num_matched_peptides xcorr deltacn deltacnstar
## Min. : 2 Min. :0.04 Min. :0.001 Min. :0.000
## 1st Qu.: 92 1st Qu.:1.73 1st Qu.:0.579 1st Qu.:0.000
## Median :130 Median :2.61 Median :0.727 Median :0.000
## Mean :123 Mean :2.82 Mean :0.665 Mean :0.001
## 3rd Qu.:160 3rd Qu.:3.71 3rd Qu.:0.808 3rd Qu.:0.000
## Max. :314 Max. :9.92 Max. :1.000 Max. :0.820
##
## spscore sprank expect prophet_probability
## Min. : 0 1 :24999 Min. : 0.0 Min. :0.0501
## 1st Qu.: 145 2 : 443 1st Qu.: 0.0 1st Qu.:0.9958
## Median : 367 3 : 121 Median : 0.0 Median :1.0000
## Mean : 509 4 : 53 Mean : 1.9 Mean :0.9155
## 3rd Qu.: 720 5 : 33 3rd Qu.: 0.0 3rd Qu.:1.0000
## Max. :4423 6 : 13 Max. :999.0 Max. :1.0000
## (Other): 50
## fval ntt nmc massd isomassd
## Min. :-1.95 2:25712 0:18860 Min. :-0.09500 0:14641
## 1st Qu.: 4.97 1: 6852 1st Qu.:-0.00100 1:11071
## Median : 6.91 Median : 0.00100
## Mean : 6.70 Mean : 0.00093
## 3rd Qu.: 8.95 3rd Qu.: 0.00300
## Max. :13.37 Max. : 0.06500
##
## RT RT_score modified_peptides variable_mods
## Min. :-658 Min. :-2.000 Length:25712 Length:25712
## 1st Qu.:1525 1st Qu.: 0.050 Class :character Class :character
## Median :2004 Median : 0.640 Mode :character Mode :character
## Mean :2114 Mean : 0.687
## 3rd Qu.:2570 3rd Qu.: 1.222
## Max. :7523 Max. : 6.000
##
## static_mods
## Length:25712
## Class :character
## Mode :character
##
##
##
##
a_plot <- plot_prophet(prophet_table,
xaxis="precursor_neutral_mass",
yaxis="matched_ion_ratio",
xscale=exp(1))
a_plot
## Now make some plots of the results to see if my matches are good/bad/indifferent.
plotly::ggplotly(a_plot)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`