M.tuberculosis 20180913: New samples! Analyzing data from OpenSwathWorkFlow/TRIC.
20180913
This is a copy of 20180817, further simplified to include only the wt/mutant samples and only the most recent technical replicate.
Therefore, the first thing I did was to copy the sample sheet and remove most of the rows, leaving only the 12 which comprise the 3 biological replicates of wt/mutant from the technical run dated 20180817.
Notes
- Remove complement.
- Work only with newest samples.
- Define intersections between limma/edger/deseq/msstats for multiple thresholds.
- Ontology tests
New samples!
In this copy of the worksheet, I am going to deal only with the tuberculist libraries, and only with the most recent technical replicate of the data.
This worksheet is a copy of 20180806, but this time with feeling! (or new samples) A minor caveat: I had some difficulties converting the new data to an openMS-suitable format. I am not certain if I messed up a priori, or if a change in the version of the software on the spec caused the problem, but for future reference:
To properly convert the data to mzXML, make sure to use the TPP MSConvert utility and have ‘TPP’ compatibility on.
Analyzing data from openMS and friends.
SWATH2stats preprocessing
I am using my slightly modified copy of SWATH2stats. The modifications ensure that changes in the capitalization of the metadata columns from one version of OpenMS to another do not trouble me.
Creating a swath2stats experiment using the tuberculist-derived library data
There is one important caveat in the following block: I used a regex to remove the second half of geneID_geneName so that the IDs will match the annotation data I merge in later.
Notes 20180913
If I want to subset like this, I need to delete the pyprophet outputs for anything not included in this set and rerun feature_alignment.py.
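A minimal sketch of that cleanup, assuming the scored files live under a results/pyprophet/ tree and that the sample sheet's scored-file column (used in the pyprophet block further down) names the keepers; none of these paths are verified:
## Sketch only: drop pyprophet-scored outputs for samples excluded from this
## subset, then rerun TRIC's feature_alignment.py on the survivors.
## The directory layout, file pattern, and column name are assumptions.
scored_dir <- file.path("results", "pyprophet", ver)
all_scored <- list.files(scored_dir, pattern="scored\\.tsv$", full.names=TRUE)
kept_scored <- file.path(scored_dir, basename(sample_annot[["tuberculistscored"]]))
file.remove(setdiff(all_scored, kept_scored))
## Then: feature_alignment.py --in <remaining scored tsvs> --out comet_HCD.tsv ...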
tric_data <- read.csv(
  paste0("results/tric/", ver, "/whole_8mz_tuberculist/comet_HCD.tsv"), sep="\t")
tric_data[["ProteinName"]] <- gsub(pattern="^(.*)_.*$", replacement="\\1",
                                   x=tric_data[["ProteinName"]])
sample_annot <- extract_metadata(paste0("sample_sheets/Mtb_dia_samples_", ver, ".xlsx"))
kept <- ! grepl(x=rownames(sample_annot), pattern="^s\\.\\.")
sample_annot <- sample_annot[kept, ]
devtools::load_all("~/scratch/git/SWATH2stats")
## Loading SWATH2stats
s2s_exp <- sample_annotation(data=tric_data, check_files=FALSE,
                             sample_annotation=sample_annot,
                             fullpeptidename_column="fullpeptidename")
## Not checking that the files are identical between the annotation and data.
Now I have a couple data structures which should prove useful for the metrics provided by SWATH2stats, MSstats, and my own hpgltools.
SWATH2stats continued
The various metrics and filters provided by SWATH2stats seem quite reasonable to me. The only thing that really bothers me is that they are all case sensitive, and the most recent tric changed the capitalization of a column, causing them all to fall down. Therefore I went in and made everything case insensitive, in a fashion similar to that done by MSstats (except I hate capital letters, so I used tolower() rather than toupper()).
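The change amounts to something like the following (a sketch of the idea, not the actual patch):
## Sketch: lower-case the column names on import so that OpenMS/TRIC version
## changes in capitalization cannot break the downstream filters.
colnames(tric_data) <- tolower(colnames(tric_data))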
Perform filters
The following block performs the metrics and filters suggested by SWATH2stats. These first measure the decoy hit rate in the data, then filter the data based on it. The block also filters out hits with less-than-ideal m-scores and proteins with non-optimal distributions of peptide hits (either too few peptides or a weird distribution of intensities).
## Get correlations on a sample by sample basis
pp(file="images/20180611_tb_swath2stats_sample_cor.png")
## Going to write the image to: images/20180611_tb_swath2stats_sample_cor.png when dev.off() is called.
sample_cor <- plot_correlation_between_samples(s2s_exp, size=2,
                                               fun.aggregate=sum,
                                               column.values="intensity")
dev.off()
## png
## 2
sample_cond_rep_cor <- plot_correlation_between_samples(s2s_exp, size=2,
                                                        comparison=transition_group_id ~
                                                          condition + bioreplicate + run,
                                                        fun.aggregate=sum,
                                                        column.values="intensity")
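The decoy-rate call producing the next messages was dropped in the copy; presumably it was:
## Sketch: report decoy vs. non-decoy peptide counts and the resulting decoy rate.
decoy_rate <- assess_decoy_rate(s2s_exp)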
## Number of non-decoy peptides: 21557
## Number of decoy peptides: 939
## Decoy rate: 0.0436
## This seems a bit high to me, yesno?
fdr_overall <- assess_fdr_overall(s2s_exp, output="Rconsole", plot=TRUE)
## The average FDR by run on assay level is 0.009
## The average FDR by run on peptide level is 0.01
## The average FDR by run on protein level is 0.047
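The m-score threshold calculations and the first two filters also went missing; a sketch reconstructed from the messages below (the FFT, FDR targets, and percentage are inferred, so treat them as assumptions):
## Sketch: derive m-score cutoffs for the assay- and protein-level FDR targets,
## then filter on them; prot_score and mscore_filtered feed the next block.
assay_score <- mscore4assayfdr(s2s_exp, FFT=0.7, fdr_target=0.02)
prot_score <- mscore4protfdr(s2s_exp, FFT=0.7, fdr_target=0.02)
mscore_filtered <- filter_mscore(s2s_exp, assay_score)
freq_filtered <- filter_mscore_freqobs(s2s_exp, assay_score, percentage=0.8)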
## Target assay FDR: 0.02
## Required overall m-score cutoff: 0.0070795
## achieving assay FDR: 0.0181
## Target protein FDR: 0.02
## Required overall m-score cutoff: 0.00089125
## achieving protein FDR: 0.0182
## Original dimension: 133447, new dimension: 128204, difference: 5243.
## Peptides need to have been quantified in more conditions than: 9.6 in order to pass this percentage-based threshold.
## Fraction of peptides selected: 0.11
## Original dimension: 135427, new dimension: 33028, difference: 102399.
data_filtered_fdr <- filter_mscore_fdr(mscore_filtered, FFT=0.7,
                                       overall_protein_fdr_target=prot_score,
                                       upper_overall_peptide_fdr_limit=0.05)
## Target protein FDR: 0.000891250938133746
## Required overall m-score cutoff: 0.01
## achieving protein FDR: 0
## filter_mscore_fdr is filtering the data...
## finding m-score cutoff to achieve desired protein FDR in protein master list..
## finding m-score cutoff to achieve desired global peptide FDR..
## Target peptide FDR: 0.05
## Required overall m-score cutoff: 0.01
## Achieving peptide FDR: 0
## Proteins selected:
## Total proteins selected: 2999
## Final target proteins: 2999
## Final decoy proteins: 0
## Peptides mapping to these protein entries selected:
## Total mapping peptides: 20921
## Final target peptides: 20921
## Final decoy peptides: 0
## Total peptides selected from:
## Total peptides: 20921
## Final target peptides: 20921
## Final decoy peptides: 0
## Individual run FDR quality of the peptides was not calculated
## as not every run contains a decoy.
## The decoys have been removed from the returned data.
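Likewise for the peptide-distribution filters whose messages follow; a sketch (the peptide bounds are guesses matching the 'Before/After filtering' messages):
## Sketch: keep proteotypic-supported proteins, then bound the number of
## peptides per protein; only_minimum is the object used from here on.
proteotypic <- filter_proteotypic_peptides(data_filtered_fdr)
max_filtered <- filter_on_max_peptides(proteotypic, n_peptides=10)
only_minimum <- filter_on_min_peptides(max_filtered, n_peptides=2)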
## Number of proteins detected: 3016
## Protein identifiers: Rv2524c, Rv3716c, Rv1270c, Rv0724, Rv0161, Rv2535c
## Number of proteins detected that are supported by a proteotypic peptide: 2888
## Number of proteotypic peptides detected: 20772
## Number of proteins detected: 2890
## First 6 protein identifiers: Rv2524c, Rv3716c, Rv1270c, Rv0724, Rv0161, Rv2535c
## Before filtering:
## Number of proteins: 2888
## Number of peptides: 20772
##
## Percentage of peptides removed: 21.87%
##
## After filtering:
## Number of proteins: 2861
## Number of peptides: 16230
## Before filtering:
## Number of proteins: 2861
## Number of peptides: 16230
##
## Percentage of peptides removed: 0.04%
##
## After filtering:
## Number of proteins: 2603
## Number of peptides: 16223
Write out matrices of the results
swath2stats provides a couple of ways to print out its results, one in a format specifically intended for MSstats, and another as a more canonical matrix of rows = proteins, columns = samples.
## I think these matrices are probably smarter to use than the raw outmatrix from tric.
## But I am not a fan of rewriting the sample column names.
matrix_prefix <- file.path("results", "swath2stats", ver)
if (!file.exists(matrix_prefix)) {
  dir.create(matrix_prefix, recursive=TRUE)
}
protein_matrix_all <- write_matrix_proteins(
s2s_exp, write.csv=TRUE,
filename=file.path(matrix_prefix, "protein_all.csv"))
## Protein overview matrix results/swath2stats/20180913/protein_all.csv written to working folder.
## [1] 3873 13
protein_matrix_mscore <- write_matrix_proteins(
mscore_filtered, write.csv=TRUE,
filename=file.path(matrix_prefix, "protein_matrix_mscore.csv"))
## Protein overview matrix results/swath2stats/20180913/protein_matrix_mscore.csv written to working folder.
## [1] 2999 13
peptide_matrix_mscore <- write_matrix_peptides(
mscore_filtered, write.csv=TRUE,
filename=file.path(matrix_prefix, "peptide_matrix_mscore.csv"))
## Peptide overview matrix results/swath2stats/20180913/peptide_matrix_mscore.csv written to working folder.
## [1] 20921 13
protein_matrix_minimum <- write_matrix_proteins(
only_minimum, write.csv=TRUE,
filename=file.path(matrix_prefix, "protein_matrix_minimum.csv"))
## Protein overview matrix results/swath2stats/20180913/protein_matrix_minimum.csv written to working folder.
## [1] 2603 13
peptide_matrix_minimum <- write_matrix_peptides(
only_minimum, write.csv=TRUE,
filename=file.path(matrix_prefix, "peptide_matrix_minimum.csv"))
## Peptide overview matrix results/swath2stats/20180913/peptide_matrix_minimum.csv written to working folder.
## [1] 93860 13
rt_cor <- plot_correlation_between_samples(
only_minimum, column.values="intensity", fun.aggregate=sum, size=2)
## I have no effing clue what this plot means.
variation <- plot_variation(only_minimum, fun.aggregate=sum)
## The library contains 5 transitions per precursor.
## The data table was transformed into a table containing one row per transition.
## One or several columns required by MSstats were not in the data. The columns were created and filled with NAs.
## Missing columns: productcharge, isotopelabeltype
## isotopelabeltype was filled with light.
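The messages above look like SWATH2stats' disaggregate()/convert4MSstats() pair, whose code was clipped; presumably (a sketch, the input object is my guess):
## Sketch: expand to one row per transition, then reshape into MSstats' input format.
disaggregated <- disaggregate(only_minimum)
msstats_input <- convert4MSstats(disaggregated)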
Some new plots
In response to some interesting queries from Yan, I made a few little functions which query and plot data from the scored data provided by openswath/pyprophet. Let us look at their results here.
pyprophet_fun <- sm(extract_pyprophet_data(
metadata=paste0("sample_sheets/Mtb_dia_samples_", ver, ".xlsx"),
pyprophet_column="tuberculistscored"))
mass_plot <- sm(plot_pyprophet_distribution(pyprophet_fun, column="mass",
expt_names="figurename"))
mass_plot[["violin"]]
deltart_plot_all <- sm(plot_pyprophet_distribution(
pyprophet_fun, column="delta_rt", expt_names="figurename"))
deltart_plot_all[["violin"]]
deltart_plot_real <- sm(plot_pyprophet_distribution(
pyprophet_fun, expt_names="figurename",
column="delta_rt", keep_decoys=FALSE))
deltart_plot_real[["violin"]]
deltart_plot_decoys <- sm(plot_pyprophet_distribution(
pyprophet_fun, expt_names="figurename",
column="delta_rt", keep_real=FALSE))
deltart_plot_decoys[["violin"]]
MSstats
msstats.org seems to provide a complete solution for generating reasonable metrics of these data.
I am currently reading: http://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf
I made some moderately intrusive changes to MSstats to make it clearer, as well.
tt <- sm(devtools::load_all("~/scratch/git/MSstats"))
checkpoint <- paste0("msstats_dataprocess-v", ver, ".rda")
if (file.exists(checkpoint)) {
  load(file=checkpoint)
} else {
  msstats_quant <- sm(dataProcess(msstats_input))
  save(file=checkpoint, list=c("msstats_quant"))
}
msstats_plots <- sm(dataProcessPlots(msstats_quant, type="QCPLOT"))
my_levels <- levels(as.factor(msstats_input$condition))
my_levels
## [1] "delta_filtrate" "wt_filtrate" "delta_whole" "wt_whole"
comparisons <- make_simplified_contrast_matrix(
numerators=c("wt_filtrate", "delta_filtrate", "delta_filtrate", "delta_whole"),
denominators=c("wt_whole", "delta_whole", "wt_filtrate", "wt_whole"))
msstats_results <- list()
checkpoint <- paste0("msstats_group-v", ver, ".rda")
if (file.exists(checkpoint)) {
  load(file=checkpoint)
} else {
  for (i in seq_len(nrow(comparisons))) {
    name <- rownames(comparisons)[i]
    message("Starting ", name)
    comp <- t(as.matrix(comparisons[i, ]))
    rownames(comp) <- name
    msstats_results[[name]] <- sm(MSstats::groupComparison(contrast.matrix=comp,
                                                           data=msstats_quant))
    message("Finished ", name)
  }
  save(file=checkpoint, list=c("msstats_results"))
}
P/PE protein QC plots for Yan
Yan asked for the P/PE protein QC plots. OK. I changed dataProcessPlots to return something useful, so that should be possible now.
pe_genes <- read.table("reference/annotated_pe_genes.txt")[[1]]
## Unfortunately, the names did not get set in my changed version of dataProcessPlots...
plotlst <- msstats_plots$QCPLOT
available_plots <- gsub(pattern="^1/", replacement="",
x=levels(msstats_quant$ProcessedData$PROTEIN))
names(plotlst) <- available_plots
pe_in_avail_idx <- pe_genes %in% available_plots
pe_in_avail <- pe_genes[pe_in_avail_idx]
pe_plots <- plotlst[pe_in_avail]
pdf(file="pe_qc_plots.pdf")
for (p in 1:length(pe_plots)) {
plot(pe_plots[[p]])
}
## Warning: Removed 90 rows containing non-finite values (stat_boxplot).
## Warning: Removed 35 rows containing non-finite values (stat_boxplot).
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
## Warning: Removed 85 rows containing non-finite values (stat_boxplot).
## Warning: Removed 55 rows containing non-finite values (stat_boxplot).
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
## Warning: Removed 50 rows containing non-finite values (stat_boxplot).
## Warning: Removed 20 rows containing non-finite values (stat_boxplot).
## Warning: Removed 90 rows containing non-finite values (stat_boxplot).
## Warning: Removed 30 rows containing non-finite values (stat_boxplot).
## Warning: Removed 55 rows containing non-finite values (stat_boxplot).
## Warning: Removed 160 rows containing non-finite values (stat_boxplot).
## Warning: Removed 160 rows containing non-finite values (stat_boxplot).
## Warning: Removed 180 rows containing non-finite values (stat_boxplot).
## Warning: Removed 95 rows containing non-finite values (stat_boxplot).
## Warning: Removed 75 rows containing non-finite values (stat_boxplot).
## Warning: Removed 55 rows containing non-finite values (stat_boxplot).
## Warning: Removed 35 rows containing non-finite values (stat_boxplot).
## Warning: Removed 70 rows containing non-finite values (stat_boxplot).
## Warning: Removed 30 rows containing non-finite values (stat_boxplot).
## Warning: Removed 35 rows containing non-finite values (stat_boxplot).
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).
## Warning: Removed 55 rows containing non-finite values (stat_boxplot).
## Warning: Removed 55 rows containing non-finite values (stat_boxplot).
## Warning: Removed 65 rows containing non-finite values (stat_boxplot).
## png
## 2
## [1] 27
Create hpgltools expressionset
Since I am not certain I understand these data, I will take the intensities from SWATH2stats along with the metadata and annotation data, attempt to create a 'normal' expressionset, and poke at it to see what I can learn.
Massaging the metadata
I want to use the same metadata as were used for MSstats. It has one important difference from the requirements of hpgltools: I do not allow rownames/sample IDs to start with a number.
Massaging the intensity matrix
I do not want the '1/' prefix before the protein names; I already merged them into one entry per gene via SWATH2stats.
prot_mtrx <- read.csv(file.path("results", "swath2stats", ver, "protein_matrix_minimum.csv"))
rownames(prot_mtrx) <- gsub(pattern="^1\\/", replacement="", x=prot_mtrx[["proteinname"]])
prot_mtrx <- prot_mtrx[, -1]
## Important question: Did SWATH2stats reorder my data?
colnames(prot_mtrx) <- gsub(pattern="^(.*)(2018.*)$", replacement="s\\2", x=colnames(prot_mtrx))
Merge the pieces
Now we should have sufficient pieces to make an expressionset.
While here, I will also split the data into a cf and whole-cell pair of data structures.
## Drop the metadata not in the protein matrix:
## And ensure that they are the same order.
reordered <- colnames(prot_mtrx)
metadata <- sample_annot[reordered, ]
protein_expt <- sm(create_expt(metadata,
count_dataframe=prot_mtrx,
gene_info=mtb_annotations))
whole_expt <- subset_expt(protein_expt, subset="collectiontype=='whole'")
## There were 12, now there are 6 samples.
## There were 12, now there are 6 samples.
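The duplicated 12-to-6 message above is the culture-filtrate subset, whose line was lost in the copy; presumably (the factor level is a guess):
## Sketch: the matching culture-filtrate subset (source of the second message above).
cf_expt <- subset_expt(protein_expt, subset="collectiontype=='cf'")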
protein_write <- write_expt(protein_expt, expt_names="figurename", plot_legend=FALSE, violin=TRUE,
excel=paste0("excel/protein_expt2-v", ver, ".xlsx"))
## Writing the legend.
## Writing the raw reads.
## Hey, you merged the annotation data and did not reset the column names!
## Graphing the raw reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ (1|condition)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.2 min
## Warning in serialize(data, node$con): 'package:variancePartition' may not
## be available when loading
## Warning in serialize(data, node$con): 'package:variancePartition' may not
## be available when loading
## Warning in serialize(data, node$con): 'package:variancePartition' may not
## be available when loading
## Warning in serialize(data, node$con): 'package:variancePartition' may not
## be available when loading
## Warning in serialize(data, node$con): 'package:variancePartition' may not
## be available when loading
## Warning in serialize(data, node$con): 'package:variancePartition' may not
## be available when loading
## Placing factor: condition at the beginning of the model.
## Warning in merge.data.frame(tmp_annot, added_data, by = "row.names"):
## column name 'Row.names' is duplicated in the result
## Writing the normalized reads.
## Graphing the normalized reads.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ (1|condition)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.1 min
## Placing factor: condition at the beginning of the model.
## Warning in merge.data.frame(tmp_annot, added_data, by = "row.names"):
## column name 'Row.names' is duplicated in the result
## Writing the median reads by factor.
## The factor delta_filtrate has 3 rows.
## The factor delta_whole has 3 rows.
## The factor wt_filtrate has 3 rows.
## The factor wt_whole has 3 rows.
Metrics of the full data set
Generate some metrics of the full data set; later, the same will be done for the data subsets. There is a weird caveat: sva fails under some strange circumstances for these data. It seems that if the data are cpm-converted, sva sees the resulting matrix as containing singular values. I am guessing that a bunch of low values in the filtrate data get dropped to near 0 by cpm() and therefore trip up sva. There are two likely ways around this (see the sketch after this list):
- Do not cpm the data.
- Filter the data more aggressively so that there are no zero values after cpm().
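A sketch of the second workaround, reusing the filters this worksheet applies later (the threshold is illustrative):
## Sketch: filter low-intensity proteins hard enough that cpm() leaves no
## zero rows to produce a singular matrix for sva.
protein_norm <- normalize_expt(protein_expt, filter="cbcb", thresh=1000,
                               transform="log2", convert="cpm", norm="quant")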
Metrics of the whole-cell data set
Here are some of the promised subset data.
Variance partition
This is a potential argument against including only the newest technical replicate of the data: when all the technical replicates were included, these metrics looked much more sensible.
## varpart sees only 1 batch, adjusting the model accordingly.
## Attempting mixed linear model with: ~ (1|condition)
## Fitting the expressionset to the model, this is slow.
## Projected run time: ~ 0.1 min
## Placing factor: condition at the beginning of the model.
## Warning in merge.data.frame(tmp_annot, added_data, by = "row.names"):
## column name 'Row.names' is duplicated in the result
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
##tb_batch_cv <- plot_variance_coefficients(tb_protein_expt, x_axis="batch")
##tb_batch_cv$disp
genotype_cv <- plot_variance_coefficients(protein_expt, x_axis="genotype")
## Naively calculating coefficient of variation/dispersion with respect to genotype.
## Finished calculating dispersion estimates.
##genotype_cv$disp
##tb_prep_cv <- plot_variance_coefficients(tb_protein_expt, x_axis="prepdate")
##tb_prep_cv$disp
vio <- plot_boxplot(protein_expt, violin=TRUE)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, adding 1 to the data.
## Changed 11011 zero count features.
Metrics of the filtrate data set
Once again, here we have metrics of the subset data; this time of filtrate data.
cf_norm <- sm(normalize_expt(cf_expt, transform="log2", convert="cpm",
norm="quant", filter=TRUE))
cf_norm_metrics <- sm(graph_metrics(cf_norm))
cf_fsva <- sm(normalize_expt(cf_expt, transform="log2", convert="cpm",
batch="fsva", filter=TRUE))
cf_fsva_metrics <- sm(graph_metrics(cf_fsva))
## This function will replace the expt$expressionset slot with:
## hpgl(data)
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: hpgl
## Removing 1420 low-count genes (1183 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
## s2018_0817BrikenTrypsinDIA01 s2018_0817BrikenTrypsinDIA02
## Min. :0.00e+00 Min. :0.00e+00
## 1st Qu.:5.13e+05 1st Qu.:5.36e+05
## Median :2.54e+06 Median :2.50e+06
## Mean :3.41e+07 Mean :4.52e+07
## 3rd Qu.:1.24e+07 3rd Qu.:1.12e+07
## Max. :4.22e+09 Max. :1.88e+10
## s2018_0817BrikenTrypsinDIA03 s2018_0817BrikenTrypsinDIA07
## Min. :0.00e+00 Min. :0.00e+00
## 1st Qu.:7.09e+05 1st Qu.:0.00e+00
## Median :3.82e+06 Median :8.14e+05
## Mean :5.00e+07 Mean :2.89e+07
## 3rd Qu.:1.86e+07 3rd Qu.:4.91e+06
## Max. :5.61e+09 Max. :1.66e+10
## s2018_0817BrikenTrypsinDIA08 s2018_0817BrikenTrypsinDIA09
## Min. :0.00e+00 Min. :0.00e+00
## 1st Qu.:2.08e+05 1st Qu.:2.04e+05
## Median :1.35e+06 Median :1.47e+06
## Mean :3.98e+07 Mean :2.36e+07
## 3rd Qu.:7.45e+06 3rd Qu.:7.13e+06
## Max. :2.00e+10 Max. :3.81e+09
## This function will replace the expt$expressionset slot with:
## cbcb(data)
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 2055 low-count genes (548 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 173 zero count features.
cf_remaining <- normalize_expt(cf_expt, filter="cbcb", thresh=1000)
keeper_ids <- rownames(exprs(cf_remaining))
## This function will replace the expt$expressionset slot with:
## cbcb(data)
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 2443 low-count genes (160 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 29 zero count features.
Something fun for Najib
Plot some metrics
In the previous blocks, I generated the metrics; in this block I will print them, with the assumption that some of them will end up being included in whatever publication comes from this work. With that in mind, I should probably change these to svg or pdf or something other than png.
## Writing the image to: images/20180913_libsize.pdf and calling dev.off().
## It seems to me that the scale of the data is all within an order of magnitude or two.
## I cannot get used to these absurdly large numbers though.
pp(image=protein_norm_metrics$pcaplot,
file=file.path("images", paste0(ver, "_norm_pca.pdf")))
## Going to write the image to: images/20180913_norm_pca.pdf when dev.off() is called.
## There appears to be a nice split in the data, however the un-assayable batch
## effect is a problem.
pp(image=protein_fsva_metrics$pcaplot,
file=file.path("images", paste0(ver, "_fsva_pca.pdfg")))
## Defaulting to tiff.
## Going to write the image to: images/20180913_fsva_pca.pdfg when dev.off() is called.
## fsva seems to get some handle on the data, but I don't think we should rely
## upon it.
pp(image=protein_norm_metrics$corheat,
file=file.path("images", paste0(ver, "_norm_corheat.pdf")))
## Writing the image to: images/20180913_norm_corheat.pdf and calling dev.off().
## Once again, the whole-cell/culture-filtrate split is very large.
pp(image=protein_metrics$density,
file=file.path("images", paste0(ver, "_raw_density.pdf")))
## Writing the image to: images/20180913_raw_density.pdf and calling dev.off().
## There are two obvious distributions in the data, once again split between types.
pp(image=protein_metrics$boxplot,
file=file.path("images", paste0(ver, "_boxplot.pdf")))
## Writing the image to: images/20180913_boxplot.pdf and calling dev.off().
## This recapitulates the previous plot.
pp(image=whole_metrics$libsize,
file=file.path("images", paste0(ver, "_whole_libsize.pdf")))
## Writing the image to: images/20180913_whole_libsize.pdf and calling dev.off().
## Going to write the image to: images/20180913_whole_norm_pca.pdf when dev.off() is called.
## Going to write the image to: images/20180913_whole_fsva_pca.pdf when dev.off() is called.
pp(image=whole_norm_metrics$corheat,
file=file.path("images", paste0(ver, "_whole_norm_corheat.pdf")))
## Writing the image to: images/20180913_whole_norm_corheat.pdf and calling dev.off().
## Writing the image to: images/20180913_whole_raw_density.pdf and calling dev.off().
## Writing the image to: images/20180913_whole_boxplot.pdf and calling dev.off().
## Writing the image to: images/20180913_libsize.pdf and calling dev.off().
## Going to write the image to: images/20180913_norm_pca.pdf when dev.off() is called.
## Going to write the image to: images/20180913_fsva_pca.pdf when dev.off() is called.
## Writing the image to: images/20180913_norm_corheat.pdf and calling dev.off().
## Writing the image to: images/20180913_raw_density.pdf and calling dev.off().
## Writing the image to: images/20180913_boxplot.pdf and calling dev.off().
Perform hpgltools Differential Expression Analyses
Earlier MSstats was used to contrast the various conditions in this data. In this block the same will be performed using the limma/deseq/edger/ebseq/basic methods. As an aside, running all of these methods in serial takes ~ 1/5th the time of running any one step of MSstats.
Idea from Volker
Given the initial pairwise data, look at the wt edgeR results as the ratio filtrate/whole; then set a cutoff of log2fc >= 0.75. Proteins which survive this cutoff are then used in the ratio of ratios and the mutant/wt analyses.
The survivors of this initial cutoff are the set of secreted proteins. Compare this set with the sets of known secreted proteins from other papers (Cox; also Collins and Abesol, likely the same as the synthetic Mtb library).
Separate, simultaneous filter: Look at the distribution of the culture filtrate data and filter out the (relatively) low-intensity bump. Then take the surviving set and perform all analyses with it. In theory, these two filter methods should get us to a similar place.
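A sketch of how that low-intensity bump could be eyeballed to pick the cutoff (plain base-R density on the log intensities; the threshold is the judgment call):
## Sketch: look for the trough between the low- and high-intensity modes of the
## culture-filtrate data; thresh=1000 (used above) is one candidate cutoff.
cf_intensities <- exprs(cf_expt)
plot(density(log2(as.matrix(cf_intensities) + 1)))
abline(v=log2(1000), col="red")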
lfc cutoff
As mentioned above, Volker had an interesting cutoff suggestion: keep only those proteins with lfc >= 0.75 filtrate/whole.
## Starting edgeR pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## The condition+batch model failed. Does your experimental design support both condition and batch? Using only a conditional model.
## Choosing the non-intercept containing model.
## EdgeR step 1/9: Importing and normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## EdgeR step 5/9: Estimating GLM Trended dispersion.
## EdgeR step 6/9: Estimating GLM Tagged dispersion.
## EdgeR step 7/9: Running glmFit, switch to glmQLFit by changing the argument 'edger_test'.
## EdgeR step 8/9: Making pairwise contrasts.
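The call generating the edgeR messages above was lost in the copy; presumably something along the lines of (a sketch mirroring the edger_pairwise() invocations below):
## Sketch: the wt filtrate/whole edgeR comparison whose table is queried next.
test <- edger_pairwise(protein_expt, model_batch=FALSE)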
query <- test[["all_tables"]][["wt_whole_vs_wt_filtrate"]]
## This one of course put the desired factor on the bottom, so I will look for lfc <= -0.75
lfc_keeper_idx <- query[["logFC"]] <= -0.75
lfc_keeper_ids <- rownames(query[lfc_keeper_idx, ])
test_venn_lst <- list("cutoff" = keeper_ids,
"lfc" = lfc_keeper_ids)
test_venn <- Vennerable::Venn(Sets=test_venn_lst)
Vennerable::plot(test_venn)
## Here I perform the same cutoff as shown in the density plots above.
input <- exclude_genes_expt(protein_expt, method="keep", ids=keeper_ids)
## Before removal, there were 2603 entries.
## Now there are 548 entries.
## Percent kept: 98.068, 98.708, 98.132, 64.816, 67.348, 67.478, 99.143, 99.085, 98.417, 67.790, 67.178, 66.334
## Percent removed: 1.932, 1.292, 1.868, 35.184, 32.652, 32.522, 0.857, 0.915, 1.583, 32.210, 32.822, 33.666
## Before removal, there were 2603 entries.
## Now there are 298 entries.
## Percent kept: 77.606, 82.261, 71.273, 15.330, 16.345, 16.731, 87.346, 89.106, 80.236, 18.102, 16.575, 14.815
## Percent removed: 22.394, 17.739, 28.727, 84.670, 83.655, 83.269, 12.654, 10.894, 19.764, 81.898, 83.425, 85.185
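The second exclusion above (2603 down to 298 entries) corresponds to the lfc-based keeper list; the line was lost, but presumably:
## Sketch: keep only the proteins passing the lfc cutoff (second message block above).
lfc_input <- exclude_genes_expt(protein_expt, method="keep", ids=lfc_keeper_ids)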
lfc_filter_de <- sm(all_pairwise(
lfc_input, model_batch=FALSE, force=TRUE,
do_ebseq=TRUE, parallel=FALSE))
## Interesting, when I run this interactively, no error, but it appears that the report
## had problems with it.
extra <- "delta_cfwhole_vs_wt_cfwhole=(delta_filtrate-delta_whole)-(wt_filtrate-wt_whole)"
extra_filter_de <- sm(all_pairwise(
input, model_batch=FALSE, force=TRUE,
do_ebseq=TRUE, parallel=FALSE,
extra_contrasts=extra))
extra_lfc_filter_de <- sm(all_pairwise(
lfc_input, model_batch=FALSE, force=TRUE,
do_ebseq=TRUE, parallel=FALSE,
extra_contrasts=extra))
extra_keepers <- list(
"wt_cfwhole" = c("wt_filtrate", "wt_whole"),
"delta_cfwhole" = c("delta_filtrate", "delta_whole"),
"whole_deltawt" = c("delta_whole", "wt_whole"),
"cf_deltawt" = c("delta_filtrate", "delta_whole"),
"rofr" = c("delta_cfwhole", "wt_cfwhole"))
extra_filter_tables <- sm(combine_de_tables(
extra_filter_de,
keepers=extra_keepers,
excel=paste0("excel/combined_filter_contrasts-v", ver, ".xlsx")))
extra_lfc_filter_tables <- sm(combine_de_tables(
extra_lfc_filter_de,
keepers=extra_keepers,
excel=paste0("excel/combined_lfcfilter_contrasts-v", ver, ".xlsx")))
extra_filter_sig <- sm(extract_significant_genes(
extra_filter_tables, according_to="edger",
lfc=0.75,
excel=paste0("excel/sig_filter_contrasts-v", ver, ".xlsx")))
extra_lfc_filter_sig <- sm(extract_significant_genes(
extra_lfc_filter_tables, according_to="edger",
lfc=0.75,
excel=paste0("excel/sig_lfcfilter_contrasts-v", ver, ".xlsx")))
test_limma <- sm(limma_pairwise(input, model_batch=FALSE, extra_contrasts=extra))
test_edger <- sm(edger_pairwise(input, model_batch=FALSE, extra_contrasts=extra))
test_deseq <- sm(deseq_pairwise(input, model_batch=FALSE, extra_contrasts=extra, force=TRUE))
test_basic <- sm(basic_pairwise(input, model_batch=FALSE, extra_contrasts=extra, force=TRUE))
test_ebseq <- sm(ebseq_pairwise(input, model_batch=FALSE, extra_contrasts=extra, force=TRUE))
Show a few metrics from the hpgltools pairwise comparisons
## Writing the image to: images/de_heat.png and calling dev.off().
For each msstats run, do a DE table
wt_cf vs wt_whole
keepers <- list(
"wtcf_vs_wtwhole" = c("wt_filtrate", "wt_whole"))
ms_set_name <- "wt_filtrate_vs_wt_whole"
msstats_result <- msstats_results[[ms_set_name]][["ComparisonResult"]]
droppers <- c("undefined")
names(droppers) <- "log2fc"
## Make sure to set the rownames so it will merge into the excel file.
rownames(msstats_result) <- msstats_result[["Protein"]]
wtcf_nobatch_wtwhole_tables <- sm(combine_de_tables(
extra_lfc_filter_de, keepers=keepers, extra_annot=msstats_result,
excludes=droppers,
excel=paste0("excel/wtcf_vs_wtwhole_nobatch_tables-v", ver, ".xlsx")))
comp_table <- wtcf_nobatch_wtwhole_tables$data[[names(keepers)]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
## Warning in cor.test.default(comp_table$log2fc, comp_table$deseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: comp_table$log2fc and comp_table$deseq_logfc
## S = 830000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8083
## Warning in cor.test.default(comp_table$log2fc, comp_table$limma_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: comp_table$log2fc and comp_table$limma_logfc
## S = 1900000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5613
## Warning in cor.test.default(comp_table$log2fc, comp_table$ebseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: comp_table$log2fc and comp_table$ebseq_logfc
## S = 820000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8093
delta_cf vs delta_whole
keepers <- list(
"deltacf_vs_deltawhole" = c("delta_filtrate", "delta_whole"))
ms_set_name <- "delta_filtrate_vs_delta_whole"
msstats_result <- msstats_results[[ms_set_name]][["ComparisonResult"]]
## Make sure to set the rownames so it will merge into the excel file.
rownames(msstats_result) <- msstats_result[["Protein"]]
deltacf_deltawhole_tables <- sm(combine_de_tables(
extra_lfc_filter_de, keepers=keepers, extra_annot=msstats_result,
excel=paste0("excel/deltacf_vs_deltawhole_tables-v", ver, ".xlsx")))
comp_table <- deltacf_deltawhole_tables$data[[names(keepers)]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
## Warning in cor.test.default(comp_table$log2fc, comp_table$deseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: comp_table$log2fc and comp_table$deseq_logfc
## S = 940000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7814
## Could not create a linear model of the data.
## Going to perform a scatter plot without linear model.
delta_cf vs wt_cf
Something is weird about this particular contrast: I keep getting negative correlations!
keepers <- list(
"deltacf_vs_wtcf" = c("delta_filtrate", "wt_filtrate"))
ms_set_name <- "delta_filtrate_vs_wt_filtrate"
msstats_result <- msstats_results[[ms_set_name]][["ComparisonResult"]]
rownames(msstats_result) <- msstats_result[["Protein"]]
deltacf_wtcf_tables <- sm(combine_de_tables(
extra_lfc_filter_de, keepers=keepers, extra_annot=msstats_result,
excel=paste0("excel/deltacf_vs_wtcf_tables-v", ver, ".xlsx")))
comp_table <- deltacf_wtcf_tables$data[[names(keepers)]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
## Warning in cor.test.default(comp_table$log2fc, comp_table$deseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: comp_table$log2fc and comp_table$deseq_logfc
## S = 7800000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.7928
## Warning in cor.test.default(comp_table$log2fc, comp_table$limma_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: comp_table$log2fc and comp_table$limma_logfc
## S = 6800000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.552
## Warning in cor.test.default(comp_table$log2fc, comp_table$edger_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: comp_table$log2fc and comp_table$edger_logfc
## S = 7800000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.7926
## Warning in cor.test.default(comp_table$log2fc, comp_table$ebseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: comp_table$log2fc and comp_table$ebseq_logfc
## S = 7800000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.7952
## Could not create a linear model of the data.
## Going to perform a scatter plot without linear model.
## Drop NA fold changes, then clamp +/-Inf to +/-100 so Pearson's cor.test works.
filtered_idx <- ! is.na(comp_table$log2fc)
com_table <- comp_table[filtered_idx, ]
neg_inf_idx <- com_table$log2fc == -Inf
com_table[neg_inf_idx, "log2fc"] <- -100
pos_inf_idx <- com_table$log2fc == Inf
com_table[pos_inf_idx, "log2fc"] <- 100
cor.test(com_table$log2fc, com_table$deseq_logfc)
##
## Pearson's product-moment correlation
##
## data: com_table$log2fc and com_table$deseq_logfc
## t = -21, df = 300, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8125 -0.7191
## sample estimates:
## cor
## -0.7699
## Warning in plot_multihistogram(df): NAs introduced by coercion
## Used Bonferroni-corrected t test(s) between columns.
deltacf_wtcf_sig <- sm(extract_significant_genes(
deltacf_wtcf_tables,
excel=paste0("excel/deltacf_vs_wtcf_nobatch_sig-v", ver, ".xlsx")))
OK, I have been resisting this, but let's look more closely at the data for this contrast. One check for correctness: compute the mean numerators/denominators for this contrast and see if they agree with msstats or the others.
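A sketch of that sanity check, computing naive per-condition means straight from the protein matrix (column and condition names follow this worksheet, so treat the details as assumptions):
## Sketch: naive mean(log2 intensity) per condition -> log2 ratio, compared
## against the DESeq logFC for delta_filtrate vs. wt_filtrate.
log_mtrx <- log2(prot_mtrx + 1)
conds <- metadata[["condition"]]
naive_lfc <- rowMeans(log_mtrx[, conds == "delta_filtrate"]) -
  rowMeans(log_mtrx[, conds == "wt_filtrate"])
shared <- intersect(names(naive_lfc), rownames(comp_table))
cor.test(naive_lfc[shared], comp_table[shared, "deseq_logfc"], method="spearman")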
delta_whole vs wt_whole
keepers <- list(
"deltawhole_vs_wtwhole" = c("delta_whole", "wt_whole"))
ms_set_name <- "delta_whole_vs_wt_whole"
msstats_result <- msstats_results[[ms_set_name]][["ComparisonResult"]]
rownames(msstats_result) <- msstats_result[["Protein"]]
deltawhole_wtwhole_tables <- sm(combine_de_tables(
extra_lfc_filter_de, keepers=keepers, extra_annot=msstats_result,
excel=paste0("excel/deltawhole_vs_wtwhole_tables-v", ver, ".xlsx")))
comp_table <- deltawhole_wtwhole_tables$data[[names(keepers)]]
cor.test(comp_table$log2fc, comp_table$deseq_logfc, method="spearman")
## Warning in cor.test.default(comp_table$log2fc, comp_table$deseq_logfc,
## method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: comp_table$log2fc and comp_table$deseq_logfc
## S = 1100000, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7085
Venn of MSstats vs. others for wt whole/cf
Najib asked for the overlap in perceived significance.
start <- wtcf_nobatch_wtwhole_tables[["data"]][[1]]
na_idx <- is.na(start[["adjpvalue"]])
start[na_idx, "adjpvalue"] <- 1
ms_sig_idx <- start[["adjpvalue"]] <= 0.05
de_sig_idx <- start[["deseq_adjp"]] <= 0.05
ed_sig_idx <- start[["edger_adjp"]] <= 0.05
lm_sig_idx <- start[["limma_adjp"]] <= 0.05
eb_sig_idx <- start[["ebseq_ppee"]] <= 0.05
ms_sig <- start[ms_sig_idx, ]
de_sig <- start[de_sig_idx, ]
ed_sig <- start[ed_sig_idx, ]
lm_sig <- start[lm_sig_idx, ]
eb_sig <- start[eb_sig_idx, ]
ms_de_shared <- sum(rownames(ms_sig) %in% rownames(de_sig))
ms_solo <- ! rownames(ms_sig) %in% rownames(de_sig)
de_solo <- ! rownames(de_sig) %in% rownames(ms_sig)
ms_de_weights <- c(0, sum(ms_solo), sum(de_solo), sum(ms_de_shared))
ms_de_venn <- Vennerable::Venn(SetNames=c("ms", "de"), Weight=ms_de_weights)
Vennerable::plot(ms_de_venn, doWeights=FALSE)
ms_ed_shared <- sum(rownames(ms_sig) %in% rownames(ed_sig))
ms_solo <- ! rownames(ms_sig) %in% rownames(ed_sig)
ed_solo <- ! rownames(ed_sig) %in% rownames(ms_sig)
ms_ed_weights <- c(0, sum(ms_solo), sum(ed_solo), sum(ms_ed_shared))
ms_ed_venn <- Vennerable::Venn(SetNames=c("ms", "ed"), Weight=ms_ed_weights)
Vennerable::plot(ms_ed_venn, doWeights=FALSE)
ms_lm_shared <- sum(rownames(ms_sig) %in% rownames(lm_sig))
ms_solo <- ! rownames(ms_sig) %in% rownames(lm_sig)
lm_solo <- ! rownames(lm_sig) %in% rownames(ms_sig)
ms_lm_weights <- c(0, sum(ms_solo), sum(lm_solo), sum(ms_lm_shared))
ms_lm_venn <- Vennerable::Venn(SetNames=c("ms", "lm"), Weight=ms_lm_weights)
Vennerable::plot(ms_lm_venn, doWeights=FALSE)
ms_eb_shared <- sum(rownames(ms_sig) %in% rownames(eb_sig))
ms_solo <- ! rownames(ms_sig) %in% rownames(eb_sig)
eb_solo <- ! rownames(eb_sig) %in% rownames(ms_sig)
ms_eb_weights <- c(0, sum(ms_solo), sum(eb_solo), sum(ms_eb_shared))
ms_eb_venn <- Vennerable::Venn(SetNames=c("ms", "eb"), Weight=ms_eb_weights)
Vennerable::plot(ms_eb_venn, doWeights=FALSE)
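The four blocks above repeat the same operation; a small helper with the same Vennerable calls would tidy it (a sketch, not what actually ran):
## Sketch: pairwise significance Venn between MSstats' hits and another method's.
plot_sig_venn <- function(first_sig, second_sig, set_names=c("ms", "other")) {
  shared <- sum(rownames(first_sig) %in% rownames(second_sig))
  first_solo <- sum(! rownames(first_sig) %in% rownames(second_sig))
  second_solo <- sum(! rownames(second_sig) %in% rownames(first_sig))
  venn <- Vennerable::Venn(SetNames=set_names,
                           Weight=c(0, first_solo, second_solo, shared))
  Vennerable::plot(venn, doWeights=FALSE)
}
plot_sig_venn(ms_sig, de_sig, set_names=c("ms", "de"))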
Intersections
chosen_intersection <- intersect_significant(
wtcf_nobatch_wtwhole_tables,
selectors=c("limma", "msstats"),
fc_column="log2fc", p_column="adjpvalue",
excel="excel/testing_intersections_two.xlsx")
## The count is: 1 and the test is: alternate.
## Writing excel data according to alternate for wtcf_vs_wtwhole: 1/1.
## After (adj)p filter, the up genes table has 116 genes.
## After (adj)p filter, the down genes table has 94 genes.
## After fold change filter, the up genes table has 114 genes.
## After fold change filter, the down genes table has 63 genes.
chosen_intersection <- intersect_significant(
wtcf_nobatch_wtwhole_tables,
selectors=c("limma", "ebseq", "deseq"),
fc_column="log2fc", p_column="adjpvalue",
excel="excel/testing_intersections_three.xlsx")
chosen_intersection <- intersect_significant(
wtcf_nobatch_wtwhole_tables,
selectors=c("limma", "ebseq", "deseq", "edger"),
fc_column="log2fc", p_column="adjpvalue",
excel="excel/testing_intersections_four.xlsx")
Ontology tests
Wt CF vs Wt whole
## The species being downloaded is: Mycobacterium tuberculosis H37Rv and is being downloaded as 83332.tab.
colnames(mtb_go) <- c("ID", "GO")
mtb_lengths <- mtb_annotations[, c("ID", "width")]
## What sets of genes should goseq be performed on?
## Arbitrary decision time: let's use the deseq data for each contrast, once without sva, once with.
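## The *_sig objects used below were created off-worksheet; a sketch of the
## presumable calls, mirroring deltacf_wtcf_sig above:
wtcf_nobatch_wtwhole_sig <- sm(extract_significant_genes(
  wtcf_nobatch_wtwhole_tables,
  excel=paste0("excel/wtcf_vs_wtwhole_nobatch_sig-v", ver, ".xlsx")))
deltacf_deltawhole_sig <- sm(extract_significant_genes(
  deltacf_deltawhole_tables,
  excel=paste0("excel/deltacf_vs_deltawhole_sig-v", ver, ".xlsx")))
deltawhole_wtwhole_sig <- sm(extract_significant_genes(
  deltawhole_wtwhole_tables,
  excel=paste0("excel/deltawhole_vs_wtwhole_sig-v", ver, ".xlsx")))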
wtcf_wtwhole_up <- wtcf_nobatch_wtwhole_sig[["deseq"]][["ups"]][[1]]
wtcf_wtwhole_up_goseq_nb <- simple_goseq(
sig_genes=wtcf_wtwhole_up, go_db=mtb_go, length_db=mtb_lengths,
excel=paste0("excel/wtcf_wtwhole_up_deseq_nobatch_goseq-v", ver, ".xlsx"))
## Using the row names of your table.
## Found 26 genes out of 63 from the sig_genes in the go_db.
## Found 61 genes out of 63 from the sig_genes in the length_db.
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## Using GO.db to extract terms and categories.
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
## Writing data to: excel/wtcf_wtwhole_up_deseq_nobatch_goseq-v20180913.xlsx.
wtcf_wtwhole_down <- wtcf_nobatch_wtwhole_sig[["deseq"]][["downs"]][[1]]
wtcf_wtwhole_down_goseq_nb <- simple_goseq(
sig_genes=wtcf_wtwhole_down, go_db=mtb_go, length_db=mtb_lengths,
excel=paste0("excel/wtcf_wtwhole_down_deseq_nobatch_goseq-v", ver, ".xlsx"))
## Using the row names of your table.
## Found 9 genes out of 11 from the sig_genes in the go_db.
## Found 11 genes out of 11 from the sig_genes in the length_db.
## Warning in pcls(G): initial point very close to some inequality constraints
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## Using GO.db to extract terms and categories.
## simple_goseq(): There are no genes with an adj.p < 0.1 using: BH.
## simple_goseq(): Providing genes with raw pvalue < 0.1.
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
## Writing data to: excel/wtcf_wtwhole_down_deseq_nobatch_goseq-v20180913.xlsx.
delta CF vs delta whole
deltacf_deltawhole_up <- deltacf_deltawhole_sig[["deseq"]][["ups"]][[1]]
deltacf_deltawhole_up_goseq_nb <- simple_goseq(
sig_genes=deltacf_deltawhole_up, go_db=mtb_go, length_db=mtb_lengths,
excel=paste0("excel/deltacf_deltawhole_up_deseq_nobatch_goseq-v", ver, ".xlsx"))
## Using the row names of your table.
## Found 32 genes out of 85 from the sig_genes in the go_db.
## Found 82 genes out of 85 from the sig_genes in the length_db.
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## Using GO.db to extract terms and categories.
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
## Writing data to: excel/deltacf_deltawhole_up_deseq_nobatch_goseq-v20180913.xlsx.
deltacf_deltawhole_down <- deltacf_deltawhole_sig[["deseq"]][["downs"]][[1]]
deltacf_deltawhole_down_goseq_nb <- simple_goseq(
sig_genes=deltacf_deltawhole_down, go_db=mtb_go, length_db=mtb_lengths,
excel=paste0("excel/deltacf_deltawhole_down_deseq_nobatch_goseq-v", ver, ".xlsx"))
## Using the row names of your table.
## Found 27 genes out of 35 from the sig_genes in the go_db.
## Found 34 genes out of 35 from the sig_genes in the length_db.
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## Using GO.db to extract terms and categories.
## simple_goseq(): There are no genes with an adj.p < 0.1 using: BH.
## simple_goseq(): Providing genes with raw pvalue < 0.1.
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
## Writing data to: excel/deltacf_deltawhole_down_deseq_nobatch_goseq-v20180913.xlsx.
delta filtrate vs wt filtrate
deltacf_wtcf_up <- deltacf_wtcf_sig[["deseq"]][["ups"]][[1]]
deltacf_wtcf_up_goseq_nb <- simple_goseq(
sig_genes=deltacf_wtcf_up, go_db=mtb_go, length_db=mtb_lengths,
excel=paste0("excel/deltacf_wtcf_up_deseq_nobatch_goseq-v", ver, ".xlsx"))
## Using the row names of your table.
## Found 4 genes out of 7 from the sig_genes in the go_db.
## Found 6 genes out of 7 from the sig_genes in the length_db.
## Warning in pcls(G): initial point very close to some inequality constraints
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## Using GO.db to extract terms and categories.
## simple_goseq(): There are no genes with an adj.p < 0.1 using: BH.
## simple_goseq(): Providing genes with raw pvalue < 0.1.
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
## Writing data to: excel/deltacf_wtcf_up_deseq_nobatch_goseq-v20180913.xlsx.
deltacf_wtcf_down <- deltacf_wtcf_sig[["deseq"]][["downs"]][[1]]
deltacf_wtcf_down_goseq_nb <- simple_goseq(
sig_genes=deltacf_wtcf_down, go_db=mtb_go, length_db=mtb_lengths,
excel=paste0("excel/deltacf_wtcf_down_deseq_nobatch_goseq-v", ver, ".xlsx"))
## Using the row names of your table.
## Found 7 genes out of 15 from the sig_genes in the go_db.
## Found 14 genes out of 15 from the sig_genes in the length_db.
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## Using GO.db to extract terms and categories.
## simple_goseq(): There are no genes with an adj.p < 0.1 using: BH.
## simple_goseq(): Providing genes with raw pvalue < 0.1.
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
## Writing data to: excel/deltacf_wtcf_down_deseq_nobatch_goseq-v20180913.xlsx.
delta whole vs wt whole
deltawhole_wtwhole_up <- deltawhole_wtwhole_sig[["deseq"]][["ups"]][[1]]
deltawhole_wtwhole_up_goseq_nb <- simple_goseq(
sig_genes=deltawhole_wtwhole_up, go_db=mtb_go, length_db=mtb_lengths,
excel=paste0("excel/deltawhole_wtwhole_up_deseq_nobatch_goseq-v", ver, ".xlsx"))
## Using the row names of your table.
## Found 6 genes out of 7 from the sig_genes in the go_db.
## Found 7 genes out of 7 from the sig_genes in the length_db.
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## Using GO.db to extract terms and categories.
## simple_goseq(): There are no genes with an adj.p < 0.1 using: BH.
## simple_goseq(): Providing genes with raw pvalue < 0.1.
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
## Writing data to: excel/deltawhole_wtwhole_up_deseq_nobatch_goseq-v20180913.xlsx.
deltawhole_wtwhole_down <- deltawhole_wtwhole_sig[["deseq"]][["downs"]][[1]]
deltawhole_wtwhole_down_goseq_nb <- simple_goseq(
sig_genes=deltawhole_wtwhole_down, go_db=mtb_go, length_db=mtb_lengths,
excel=paste0("excel/deltawhole_wtwhole_down_deseq_nobatch_goseq-v", ver, ".xlsx"))
## Using the row names of your table.
## Found 5 genes out of 11 from the sig_genes in the go_db.
## Found 10 genes out of 11 from the sig_genes in the length_db.
## Warning in pcls(G): initial point very close to some inequality constraints
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## Using GO.db to extract terms and categories.
## simple_goseq(): There are no genes with an adj.p < 0.1 using: BH.
## simple_goseq(): Providing genes with raw pvalue < 0.1.
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
## Writing data to: excel/deltawhole_wtwhole_down_deseq_nobatch_goseq-v20180913.xlsx.
## NULL
circos
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Returning a df with 15 columns and 4093 rows.
rownames(gff_df) <- make.names(gff_df[["ID"]], unique=TRUE)
microbes_annot <- load_microbesonline_annotations(id=83332)
## The species being downloaded is: Mycobacterium tuberculosis H37Rv
## Warning: Setting row names on a tibble is deprecated.
coefficients <- extra_lfc_filter_de$deseq$coefficients
values <- merge(microbes_annot, coefficients, by="row.names")
cog_groups <- values[, c("Row.names", "start", "stop", "strand", "COGFun")]
colnames(cog_groups) <- c("seqnames", "start", "stop", "strand", "COGFun")
rownames(cog_groups) <- cog_groups[["seqnames"]]
wt_whole <- values[, c("Row.names", "wt_whole")]
rownames(wt_whole) <- wt_whole[["Row.names"]]
wtcf <- values[, c("Row.names", "wt_filtrate")]
rownames(wtcf) <- wtcf[["Row.names"]]
delta_whole <- values[, c("Row.names", "delta_whole")]
rownames(delta_whole) <- delta_whole[["Row.names"]]
delta_filtrate <- values[, c("Row.names", "delta_filtrate")]
rownames(delta_filtrate) <- delta_filtrate[["Row.names"]]
circos_mtb <- circos_prefix(name="mtb")
## This assumes you have a colors.conf in circos/colors/ and fonts.conf in circos/fonts/
## It also assumes you have conf/ideogram.conf, conf/ticks.conf, and conf/housekeeping.conf
## It will write circos/conf/mtb.conf with a reasonable first approximation config file.
## Wrote karyotype to circos/conf/ideograms/mtb.conf
## This should match the karyotype= line in mtb.conf
## Wrote ticks to circos/conf/ticks_mtb.conf
## Warning in file.symlink(from, to): cannot symlink 'conf/mtb.conf' to './
## mtb.conf', reason 'File exists'
## Wrote karyotype to circos/conf/karyotypes/mtb.conf
## This should match the karyotype= line in mtb.conf
## Writing data file: circos/data/mtb_plus_go.txt with the + strand GO data.
## Writing data file: circos/data/mtb_minus_go.txt with the - strand GO data.
## Wrote the +/- config files. Appending their inclusion to the master file.
## Returning the inner width: 0.84. Use it as the outer for the next ring.
mtb_wtwhole <- circos_heatmap(
wt_whole, gff_df, colname="wt_whole",
cfgout=circos_mtb, basename="wt_whole", outer=mtb_plus)
## Warning in merge.data.frame(df, annot_df, by = "row.names"): column name
## 'Row.names' is duplicated in the result
## Writing data file: circos/data/mtb_wt_wholewt_whole_heatmap.txt with the wt_wholewt_whole column.
mtb_deltawhole <- circos_heatmap(
delta_whole, gff_df, colname="delta_whole",
cfgout=circos_mtb, basename="delta_whole", outer=mtb_wtwhole)
## Warning in merge.data.frame(df, annot_df, by = "row.names"): column name
## 'Row.names' is duplicated in the result
## Writing data file: circos/data/mtb_delta_wholedelta_whole_heatmap.txt with the delta_wholedelta_whole column.
mtb_wtcf <- circos_heatmap(
wtcf, gff_df, colname="wt_filtrate",
cfgout=circos_mtb, basename="wt_filtrate", outer=mtb_deltawhole)
## Warning in merge.data.frame(df, annot_df, by = "row.names"): column name
## 'Row.names' is duplicated in the result
## Writing data file: circos/data/mtb_wt_filtratewt_filtrate_heatmap.txt with the wt_filtratewt_filtrate column.
mtb_deltacf <- circos_heatmap(
delta_filtrate, gff_df, colname="delta_filtrate",
cfgout=circos_mtb, basename="delta_filtrate", outer=mtb_wtcf)
## Warning in merge.data.frame(df, annot_df, by = "row.names"): column name
## 'Row.names' is duplicated in the result
## Writing data file: circos/data/mtb_delta_filtratedelta_filtrate_heatmap.txt with the delta_filtratedelta_filtrate column.
mtb_deltacf_heat <- circos_heatmap(
delta_filtrate, gff_df, colname="delta_filtrate",
cfgout=circos_mtb, basename="delta_filtrate_heat", outer=mtb_deltacf)
## Warning in merge.data.frame(df, annot_df, by = "row.names"): column name
## 'Row.names' is duplicated in the result
## Writing data file: circos/data/mtb_delta_filtrate_heatdelta_filtrate_heatmap.txt with the delta_filtrate_heatdelta_filtrate column.
TODO
- 2018-04-10: Make sure my invocations of SWATH2stats/MSstats are correct.