This document seeks to explore the differences in samples which did and did-not respond to drug treatment during infection.
Extract the antimonial samples from the experiment.
## Using a subset expression.
## There were 42, now there are 12 samples.
ant$colors <- gsub(pattern="CD960C", replacement="005500", x=ant$colors)
ant$colors <- gsub(pattern="AA7A1A", replacement="880000", x=ant$colors)
ant$colors <- gsub(pattern="886E3E", replacement="0000EE", x=ant$colors)
ant$colors <- gsub(pattern="666666", replacement="777777", x=ant$colors)
Lets take a look at the data before normalization.
Perform an initial sample normalization and see how they look.
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 6998 low-count genes (11849 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.
## Printing a color to condition legend.
Do these samples segregate in ways which make sense?
sampleid | condition | batch | batch_int | colors | labels | PC1 | PC2 | pc_1 | pc_2 | pc_3 | pc_4 | pc_5 | pc_6 | pc_7 | pc_8 | pc_9 | pc_10 | pc_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hpgl0315 | HPGL0315 | respond_nil | d202 | 1 | #8068AE | HPGL0315 | -0.0338 | -0.6025 | -0.0338 | -0.6025 | 0.1303 | -0.2138 | 0.1463 | -0.2340 | 0.1992 | -0.2157 | 0.2115 | -0.5258 | 0.0791 |
hpgl0316 | HPGL0316 | respond_inf | d202 | 1 | #D03792 | HPGL0316 | 0.0290 | -0.3744 | 0.0290 | -0.3744 | -0.4013 | -0.0489 | 0.2854 | -0.3512 | -0.0279 | 0.2744 | -0.2251 | 0.5296 | 0.0106 |
hpgl0317 | HPGL0317 | respond_nil | d209 | 2 | #8068AE | HPGL0317 | 0.1650 | -0.0435 | 0.1650 | -0.0435 | 0.3056 | 0.3768 | -0.3640 | -0.2161 | -0.3055 | 0.1443 | 0.1904 | 0.1009 | 0.5589 |
hpgl0318 | HPGL0318 | respond_inf | d209 | 2 | #D03792 | HPGL0318 | 0.2562 | 0.1285 | 0.2562 | 0.1285 | -0.1629 | 0.3698 | -0.4091 | -0.3222 | 0.3184 | -0.2234 | -0.1481 | -0.0842 | -0.4687 |
hpgl0319 | HPGL0319 | fail_nil | d218 | 3 | #A66753 | HPGL0319 | 0.0351 | -0.3389 | 0.0351 | -0.3389 | 0.1651 | 0.2846 | 0.1072 | 0.5343 | -0.3611 | -0.1955 | 0.1263 | 0.1934 | -0.4164 |
hpgl0320 | HPGL0320 | fail_inf | d218 | 3 | #7FA718 | HPGL0320 | 0.1647 | -0.0126 | 0.1647 | -0.0126 | -0.3753 | 0.1494 | -0.0171 | 0.5871 | 0.3742 | 0.2216 | -0.1444 | -0.2110 | 0.3561 |
hpgl0321 | HPGL0321 | fail_nil | d119 | 4 | #A66753 | HPGL0321 | 0.3916 | 0.1902 | 0.3916 | 0.1902 | 0.3407 | -0.3311 | 0.0911 | -0.0069 | -0.1481 | 0.5585 | -0.1577 | -0.2289 | -0.2863 |
hpgl0322 | HPGL0322 | fail_inf | d119 | 4 | #7FA718 | HPGL0322 | 0.4296 | 0.3207 | 0.4296 | 0.3207 | 0.0206 | -0.3733 | 0.1964 | 0.0381 | 0.0453 | -0.5583 | 0.1086 | 0.2799 | 0.2135 |
hpgl0640 | HPGL0640 | fail_nil | d1025 | 5 | #A66753 | HPGL0640 | -0.4513 | -0.0217 | -0.4513 | -0.0217 | 0.2727 | -0.3649 | -0.4382 | 0.1494 | 0.3733 | 0.1025 | 0.0788 | 0.3615 | -0.0628 |
hpgl0641 | HPGL0641 | fail_inf | d1025 | 5 | #7FA718 | HPGL0641 | -0.2927 | 0.1332 | -0.2927 | 0.1332 | -0.4542 | -0.2947 | -0.3076 | 0.0027 | -0.5700 | -0.1275 | -0.1183 | -0.2653 | 0.0050 |
hpgl0642 | HPGL0642 | respond_nil | d1018 | 6 | #8068AE | HPGL0642 | -0.4142 | 0.2290 | -0.4142 | 0.2290 | 0.3286 | 0.2494 | 0.3634 | -0.0673 | 0.0287 | -0.1770 | -0.5732 | -0.0942 | 0.1276 |
hpgl0643 | HPGL0643 | respond_inf | d1018 | 6 | #D03792 | HPGL0643 | -0.2793 | 0.3920 | -0.2793 | 0.3920 | -0.1698 | 0.1966 | 0.3462 | -0.1138 | 0.0735 | 0.1960 | 0.6513 | -0.0560 | -0.1165 |
write.csv(file=glue::glue("excel/antimonial_pca_norm_table-v{ver}.csv"),
norm_pca$table)
anorm_met$corheat
The batch effect is pretty severe, what happens if we try combat on it?
abatch_combat <- sm(normalize_expt(ant, filter=TRUE, norm="quant", batch="combat"))
abatch_scale <- sm(normalize_expt(ant, filter=TRUE, norm="quant", batch="combat_scale"))
abatch_sva <- sm(normalize_expt(ant, filter=TRUE, norm="quant", batch="svaseq", convert="cpm"))
abatch_pca <- plot_pca(abatch_scale)
abatch_pca$plot
sampleid | condition | batch | batch_int | colors | labels | PC1 | PC2 | pc_1 | pc_2 | pc_3 | pc_4 | pc_5 | pc_6 | pc_7 | pc_8 | pc_9 | pc_10 | pc_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hpgl0315 | HPGL0315 | respond_nil | d202 | 1 | #8068AE | HPGL0315 | -0.3774 | 0.2881 | -0.3774 | 0.2881 | -0.1487 | 0.4487 | 0.3389 | -0.0322 | -0.0374 | -0.1513 | -0.3670 | -0.3888 | -0.2044 |
hpgl0316 | HPGL0316 | respond_inf | d202 | 1 | #D03792 | HPGL0316 | -0.1377 | -0.2621 | -0.1377 | -0.2621 | 0.1236 | -0.4758 | -0.3688 | 0.0316 | 0.4082 | -0.0960 | -0.3680 | -0.3221 | -0.1882 |
hpgl0317 | HPGL0317 | respond_nil | d209 | 2 | #8068AE | HPGL0317 | -0.4103 | -0.4970 | -0.4103 | -0.4970 | 0.3046 | 0.0760 | 0.2029 | 0.1918 | -0.1735 | 0.3255 | 0.3140 | 0.1158 | -0.2769 |
hpgl0318 | HPGL0318 | respond_inf | d209 | 2 | #D03792 | HPGL0318 | -0.1065 | 0.4579 | -0.1065 | 0.4579 | -0.3220 | -0.1205 | -0.1105 | -0.2202 | 0.3602 | 0.4080 | 0.3574 | 0.1664 | -0.2552 |
hpgl0319 | HPGL0319 | fail_nil | d218 | 3 | #A66753 | HPGL0319 | 0.1198 | 0.2314 | 0.1198 | 0.2314 | 0.3218 | 0.2202 | -0.5008 | -0.0235 | -0.3830 | 0.3835 | -0.3525 | 0.1397 | 0.0888 |
hpgl0320 | HPGL0320 | fail_inf | d218 | 3 | #7FA718 | HPGL0320 | 0.3896 | -0.2161 | 0.3896 | -0.2161 | -0.2836 | -0.1892 | 0.5043 | 0.0106 | 0.0254 | 0.3283 | -0.4241 | 0.2195 | 0.1054 |
hpgl0321 | HPGL0321 | fail_nil | d119 | 4 | #A66753 | HPGL0321 | 0.1396 | -0.3143 | 0.1396 | -0.3143 | -0.4978 | 0.0592 | -0.2508 | -0.1714 | -0.3503 | 0.0126 | 0.2930 | -0.4535 | 0.2013 |
hpgl0322 | HPGL0322 | fail_inf | d119 | 4 | #7FA718 | HPGL0322 | 0.3777 | 0.2525 | 0.3777 | 0.2525 | 0.4887 | -0.0526 | 0.2697 | 0.1485 | 0.1602 | 0.0462 | 0.3115 | -0.4407 | 0.2343 |
hpgl0640 | HPGL0640 | fail_nil | d1025 | 5 | #A66753 | HPGL0640 | 0.1264 | 0.1341 | 0.1264 | 0.1341 | 0.1541 | -0.3528 | 0.1303 | -0.3912 | -0.4254 | -0.4681 | 0.0757 | 0.2027 | -0.3429 |
hpgl0641 | HPGL0641 | fail_inf | d1025 | 5 | #7FA718 | HPGL0641 | 0.3960 | -0.0986 | 0.3960 | -0.0986 | -0.1283 | 0.3975 | -0.1947 | 0.4460 | 0.2125 | -0.3522 | 0.0916 | 0.2524 | -0.3124 |
hpgl0642 | HPGL0642 | respond_nil | d1018 | 6 | #8068AE | HPGL0642 | -0.3682 | 0.2365 | -0.3682 | 0.2365 | -0.1727 | -0.3045 | -0.0131 | 0.5093 | -0.1530 | -0.2157 | 0.0411 | 0.2196 | 0.4725 |
hpgl0643 | HPGL0643 | respond_inf | d1018 | 6 | #D03792 | HPGL0643 | -0.1489 | -0.2125 | -0.1489 | -0.2125 | 0.1601 | 0.2939 | -0.0074 | -0.4994 | 0.3562 | -0.2207 | 0.0274 | 0.2890 | 0.4776 |
write.csv(file=glue::glue("excel/antimonial_pca_batch_table-v{ver}.csv"),
abatch_pca$table)
asva_pca <- plot_pca(abatch_sva)
asva_pca$plot
sampleid | condition | batch | batch_int | colors | labels | PC1 | PC2 | pc_1 | pc_2 | pc_3 | pc_4 | pc_5 | pc_6 | pc_7 | pc_8 | pc_9 | pc_10 | pc_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hpgl0315 | HPGL0315 | respond_nil | d202 | 1 | #8068AE | HPGL0315 | -0.1724 | -0.1313 | -0.1724 | -0.1313 | 0.0850 | 0.3823 | 0.2579 | 0.1408 | 0.1885 | 0.3931 | 0.3326 | -0.1472 | 0.5546 |
hpgl0316 | HPGL0316 | respond_inf | d202 | 1 | #D03792 | HPGL0316 | 0.1530 | -0.4233 | 0.1530 | -0.4233 | 0.1281 | -0.0442 | -0.3911 | -0.2939 | -0.3162 | -0.2511 | -0.2465 | -0.0906 | 0.4737 |
hpgl0317 | HPGL0317 | respond_nil | d209 | 2 | #8068AE | HPGL0317 | -0.3327 | 0.3810 | -0.3327 | 0.3810 | 0.4853 | -0.3070 | -0.4018 | -0.1349 | 0.1432 | 0.0344 | 0.2918 | 0.2111 | 0.0027 |
hpgl0318 | HPGL0318 | respond_inf | d209 | 2 | #D03792 | HPGL0318 | 0.4605 | 0.3580 | 0.4605 | 0.3580 | 0.4121 | 0.1045 | 0.4365 | -0.1383 | 0.0418 | 0.0189 | -0.3998 | 0.1530 | 0.0273 |
hpgl0319 | HPGL0319 | fail_nil | d218 | 3 | #A66753 | HPGL0319 | -0.1350 | 0.4275 | -0.1350 | 0.4275 | -0.4226 | 0.1166 | -0.2395 | 0.4566 | 0.1241 | -0.2387 | -0.3552 | 0.0406 | 0.2396 |
hpgl0320 | HPGL0320 | fail_inf | d218 | 3 | #7FA718 | HPGL0320 | 0.3112 | 0.1516 | 0.3112 | 0.1516 | -0.3119 | -0.3970 | 0.1437 | 0.1468 | -0.5456 | 0.1669 | 0.3898 | 0.1324 | 0.0684 |
hpgl0321 | HPGL0321 | fail_nil | d119 | 4 | #A66753 | HPGL0321 | -0.4615 | -0.4070 | -0.4615 | -0.4070 | 0.0722 | -0.0268 | 0.2408 | 0.2140 | -0.2039 | 0.0634 | -0.2710 | 0.5076 | -0.2274 |
hpgl0322 | HPGL0322 | fail_inf | d119 | 4 | #7FA718 | HPGL0322 | 0.2039 | -0.1102 | 0.2039 | -0.1102 | -0.4058 | 0.2749 | -0.0911 | -0.4357 | 0.3578 | -0.1094 | 0.2424 | 0.4482 | -0.1577 |
hpgl0640 | HPGL0640 | fail_nil | d1025 | 5 | #A66753 | HPGL0640 | -0.3903 | 0.1696 | -0.3903 | 0.1696 | -0.2955 | -0.1252 | 0.1962 | -0.5260 | -0.0806 | 0.1818 | -0.2153 | -0.4568 | -0.1512 |
hpgl0641 | HPGL0641 | fail_inf | d1025 | 5 | #7FA718 | HPGL0641 | 0.1354 | -0.3103 | 0.1354 | -0.3103 | 0.0088 | -0.5378 | 0.2112 | 0.2144 | 0.5261 | -0.2638 | 0.0486 | -0.2650 | -0.0562 |
hpgl0642 | HPGL0642 | respond_nil | d1018 | 6 | #8068AE | HPGL0642 | -0.0559 | 0.0370 | -0.0559 | 0.0370 | 0.1855 | 0.4169 | 0.0788 | 0.1290 | -0.2734 | -0.5351 | 0.3236 | -0.3073 | -0.3475 |
hpgl0643 | HPGL0643 | respond_inf | d1018 | 6 | #D03792 | HPGL0643 | 0.2839 | -0.1427 | 0.2839 | -0.1427 | 0.0588 | 0.1429 | -0.4415 | 0.2272 | 0.0382 | 0.5395 | -0.1410 | -0.2260 | -0.4264 |
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some data are negative. We are on log scale, setting them to 0.
## Changed 1929 negative features.
## Some entries are 0. We are on log scale, adding 1 to the data.
## Changed 1929 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some data are negative. We are on log scale, setting them to 0.5.
## Changed 1929 negative features.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.
## Printing a color to condition legend.
The PCA plot of the combat adjusted data suggests that it might actually work out ok. Let us look at the other metrics.
anti_host_combat_de$comparison$heat
limma_table <- anti_host_combat_de$limma$all_tables$respond_inf_vs_fail_inf
de_plots <- extract_de_plots(anti_host_combat_de, table="respond_inf_vs_fail_inf")
de_plots$ma$plot
antimony_contrasts <- list(
"fail_respond_inf" = c("fail_inf", "respond_inf"),
"fail_respond_nil" = c("fail_nil", "respond_nil"),
"fail" = c("fail_inf", "fail_nil"),
"respond" = c("respond_inf", "respond_nil"))
antimony_tables <- sm(combine_de_tables(
anti_host_combat_de,
excel=glue::glue("excel/antimony_combat_tables-v{ver}.xlsx"),
keepers=antimony_contrasts, annot_df=hs_annotations))
##antimony_tables$plots[["fail"]]
antimony_sig_tables <- sm(extract_significant_genes(
antimony_tables,
excel=glue::glue("excel/antimony_significant_genes-v{ver}.xlsx"),
according_to="all"))
anti_host_sva <- sm(all_pairwise(ant, model_batch="svaseq", filter=TRUE))
anti_host_tables <- sm(combine_de_tables(
anti_host_sva,
excel=glue::glue("excel/antimony_sva_tables-v{ver}.xlsx"),
keepers=antimony_contrasts))
anti_host_plots <- extract_de_plots(anti_host_tables, table="fail", type="deseq",
label=5)
anti_host_plots$volcano$plot
## Testing method: limma.
## Adding method: limma to the set.
## Testing method: deseq.
## Adding method: deseq to the set.
## Testing method: edger.
## Adding method: edger to the set.
## Testing method: ebseq.
## Adding method: ebseq to the set.
## Testing method: basic.
## Adding method: basic to the set.
## Warning in cor(x = fs[["x"]], y = fs[["y"]], method = cor_method): the
## standard deviation is zero
## Warning in cor(x = fs[["x"]], y = fs[["y"]], method = cor_method): the
## standard deviation is zero
## fail_respond_inf fail_respond_nil fail respond
## 0.2371 0.1971 0.6620 0.6337
upgenes_fail_respond_infected <- antimony_sig_tables[["deseq"]][["ups"]][["fail_respond_inf"]]
fail_respond_inf_up_gp <- simple_gprofiler(upgenes_fail_respond_infected, first_col="deseq_logfc")
## Performing gProfiler GO search of 622 genes against hsapiens.
## GO search found 187 hits.
## Performing gProfiler KEGG search of 622 genes against hsapiens.
## KEGG search found 7 hits.
## Performing gProfiler REAC search of 622 genes against hsapiens.
## REAC search found 31 hits.
## Performing gProfiler MI search of 622 genes against hsapiens.
## MI search found 0 hits.
## Performing gProfiler TF search of 622 genes against hsapiens.
## TF search found 84 hits.
## Performing gProfiler CORUM search of 622 genes against hsapiens.
## CORUM search found 1 hits.
## Performing gProfiler HP search of 622 genes against hsapiens.
## HP search found 1 hits.
dngenes_fail_respond_infected <- antimony_sig_tables[["deseq"]][["downs"]][["fail_respond_inf"]]
fail_respond_inf_dn_gp <- simple_gprofiler(dngenes_fail_respond_infected, first_col="deseq_logfc")
## Performing gProfiler GO search of 1329 genes against hsapiens.
## GO search found 213 hits.
## Performing gProfiler KEGG search of 1329 genes against hsapiens.
## KEGG search found 8 hits.
## Performing gProfiler REAC search of 1329 genes against hsapiens.
## REAC search found 12 hits.
## Performing gProfiler MI search of 1329 genes against hsapiens.
## MI search found 0 hits.
## Performing gProfiler TF search of 1329 genes against hsapiens.
## TF search found 409 hits.
## Performing gProfiler CORUM search of 1329 genes against hsapiens.
## CORUM search found 1 hits.
## Performing gProfiler HP search of 1329 genes against hsapiens.
## HP search found 18 hits.
upgenes_fail_respond_uninfected <- antimony_sig_tables[["deseq"]][["ups"]][["fail_respond_nil"]]
fail_respond_uninf_up_gp <- simple_gprofiler(upgenes_fail_respond_uninfected, first_col="deseq_logfc")
## Performing gProfiler GO search of 1033 genes against hsapiens.
## GO search found 366 hits.
## Performing gProfiler KEGG search of 1033 genes against hsapiens.
## KEGG search found 15 hits.
## Performing gProfiler REAC search of 1033 genes against hsapiens.
## REAC search found 44 hits.
## Performing gProfiler MI search of 1033 genes against hsapiens.
## MI search found 0 hits.
## Performing gProfiler TF search of 1033 genes against hsapiens.
## TF search found 135 hits.
## Performing gProfiler CORUM search of 1033 genes against hsapiens.
## CORUM search found 3 hits.
## Performing gProfiler HP search of 1033 genes against hsapiens.
## HP search found 2 hits.
dngenes_fail_respond_uninfected <- antimony_sig_tables[["deseq"]][["downs"]][["fail_respond_nil"]]
fail_respond_uninf_dn_gp <- simple_gprofiler(dngenes_fail_respond_uninfected, first_col="deseq_logfc")
## Performing gProfiler GO search of 658 genes against hsapiens.
## GO search found 136 hits.
## Performing gProfiler KEGG search of 658 genes against hsapiens.
## KEGG search found 5 hits.
## Performing gProfiler REAC search of 658 genes against hsapiens.
## REAC search found 7 hits.
## Performing gProfiler MI search of 658 genes against hsapiens.
## MI search found 0 hits.
## Performing gProfiler TF search of 658 genes against hsapiens.
## TF search found 151 hits.
## Performing gProfiler CORUM search of 658 genes against hsapiens.
## CORUM search found 0 hits.
## Performing gProfiler HP search of 658 genes against hsapiens.
## HP search found 0 hits.
## This function will replace the expt$expressionset slot with:
## cbcb(data)
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 6998 low-count genes (11849 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
fail_gsc <- make_gsc_from_ids(first_ids=rownames(upgenes_fail_respond_infected),
second_ids=rownames(dngenes_fail_respond_infected),
study_name="antimony_response",
category_name="response")
## Converting the rownames() of the expressionset to ENTREZID.
## Converting the rownames() of the expressionset to ENTREZID.
## Before conversion, the expressionset has 11849 entries.
## After conversion, the expressionset has 11900 entries.
## Mapping identifiers between gene sets and feature names
## Estimating GSVA scores for 2990 gene sets.
## Computing observed enrichment scores
## Estimating ECDFs with Poisson kernels
## Allocating cluster
## Estimating enrichment scores in parallel
## Taking diff of max KS.
## Cleaning up
expt <- fail_gsva$expt
ex <- expt$expressionset
reactome_idx <- grep(x=rownames(exprs(ex)), pattern="^REACTOME")
reactome_ids <- rownames(exprs(ex))[reactome_idx]
ex_rownames <- rownames(exprs(ex))
fd_rownames <- rownames(fData(ex))
extra <- fd_rownames %in% ex_rownames
fData(ex) <- fData(ex)[extra, ]
expt$expressionset <- ex
reactome <- exclude_genes_expt(expt, method="keep", ids=reactome_ids)
## Before removal, there were 2990 entries.
## Now there are 428 entries.
## Percent kept: 21.055, -38.837, 6.811, 19.056, 2.511, 5.844, 15.271, 12.715, -11.998, -24.376, -22.480, 18.239
## Percent removed: 78.945, 138.837, 93.189, 80.944, 97.489, 94.156, 84.729, 87.285, 111.998, 124.376, 122.480, 81.761
row_labels <- gsub(x=rownames(exprs(reactome)), pattern="^([[:alpha:]]+_)?(.*)$", replacement="\\2")
gsva_heat <- plot_sample_heatmap(reactome, row_label=row_labels)
## The biomart annotations file already exists, loading from it.
## xCell strongly perfers rpkm values, re-normalizing now.
## This function will replace the expt$expressionset slot with:
## rpkm(data)
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Filter is false, this should likely be set to something, good
## choices include cbcb, kofa, pofa (anything but FALSE). If you want this to
## stay FALSE, keep in mind that if other normalizations are performed, then the
## resulting libsizes are likely to be strange (potentially negative!)
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: not normalizing the data.
## Step 3: converting the data with rpkm.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
## Loading required namespace: xCell
## Reread the parasite sample sheet to play with other covariants
## Wow, from the parasite perspective, it seems they cluster much more tightly by strain
parasite_expt <- sm(create_expt(
"sample_sheets/old/all_samples-lpanamensis.xlsx",
gene_info=lp_annotations))
antipara_expt <- subset_expt(parasite_expt, subset="condition=='ant_good'|condition=='ant_bad'")
## Using a subset expression.
## There were 33, now there are 6 samples.
antipara_expt <- set_expt_batches(antipara_expt, fact="strain")
antipara_metrics <- sm(graph_metrics(antipara_expt))
antipara_norm <- normalize_expt(antipara_expt, transform="log2",
convert="cpm", filter=TRUE,
norm="quant")
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (8841 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 93 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## This function will replace the expt$expressionset slot with:
## log2(svaseq(simple(data)))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: simple
## Removing 114 low-count genes (8727 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: transforming the data with log2.
## transform_counts: Found 7670 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 40937 entries are x>1: 78.2%.
## batch_counts: Before batch/surrogate estimation, 7670 entries are x==0: 14.6%.
## The be method chose 1 surrogate variable(s).
## Attempting svaseq estimation with 1 surrogates.
## There are 15939 (30.4%) elements which are < 0 after batch correction.
human_expt <- create_expt(file="all_samples-hsapiens.xlsx")
antimonial_expt <- expt_subset(human_expt, subset="condition=='respond_nil'|condition=='respond_inf'|condition=='fail_nil'|condition=='fail_inf'")
antimonial_expt$colors <- c("blue","blue","red","red","black","black","green","green","red","green","blue","black")
antimonial_metrics <- graph_metrics(antimonial_expt)
antimonial_metrics$libsize
antimonial_norm <- normalize_expt(antimonial_expt, transform="log2", convert="cpm", filter_low=TRUE, norm="quant")
antimonialnorm_metrics <- graph_metrics(antimonial_norm)
antimonialnorm_metrics$pcaplot
antimonialnorm_metrics$corheat
antimonialnorm_metrics$disheat
antimonial_nb <- normalize_expt(antimonial_expt, transform="log2", convert="cpm", filter_low=TRUE, norm="quant", batch="combat")
antimonialnb_metrics <- graph_metrics(antimonial_nb)
antimonialnb_metrics$pcaplot
pp("images/antimonial_batch_corheat.png")
antimonialnb_metrics$corheat
dev.off()
## Reread the parasite sample sheet to play with other covariants
## Wow, from the parasite perspective, it seems they cluster much more tightly by strain
parasite_expt <- create_expt(file="all_samples-lpanamensis_patientbatch.xlsx")
antipara_expt <- expt_subset(parasite_expt, subset="condition=='ant_good'|condition=='ant_bad'")
antipara_metrics <- graph_metrics(antipara_expt)
antipara_metrics$pcaplot
antipara_norm <- normalize_expt(antipara_expt, transform="log2", convert="cpm", filter_low=TRUE, norm="quant", batch="combat")
antiparanorm_metrics <- graph_metrics(antipara_norm)
antiparanorm_metrics$pcaplot
antiparanorm_metrics$corheat
library(Biobase)
library(Heatplus)
infection_norm <- sm(normalize_expt(parasite_expt, transform="raw",
convert="cpm", filter="pofa",
norm="quant", batch="sva"))
infect_data <- exprs(infection_norm$expressionset)
correlations <- hpgl_cor(infect_data)
distances <- as.matrix(dist(t(infect_data), method="mink"))
design <- as.data.frame(infection_norm$design)
rownames(correlations) <- paste0(rownames(correlations), "_", as.character(design$condition))
colnames(correlations) <- paste0(colnames(correlations), "_", as.character(design$batch))
col_data <- as.data.frame(pData(infection_norm)[, c("strain", "condition")])
row_data <- design[, c("batch","condition")]
colnames(col_data) <- c("strain", "selfheal")
colnames(row_data) <- c("volunteer","selfheal")
row_data$selfheal <- as.numeric(row_data$selfheal)
row_data$selfheal <- ifelse(row_data$selfheal == 2, TRUE, FALSE)
myannot <- list(Col=list(data=col_data), Row=list(data=row_data))
mylabels <- list(Row=list(nrow=10), Col=list(nrow=10))
map1 <- annHeatmap2(distances,
cluster=list(cuth=7000),
ann=myannot,
labels=mylabels)
## 3 numbers, dendrogram, plot, annotation width/height
plot(map1)
words <- as.character(unlist(as.data.frame(read.csv("sample_sheets/testing.csv"))))
numbers <- gsub(x=words, pattern="[[:punct:]]", replacement="")
numbers <- gsub(x=numbers, pattern="[[:alpha:]]", replacement="")
numbers <- as.factor(as.numeric(unlist(strsplit(x=numbers, split=""))))
## Warning in strsplit(x = numbers, split = ""): input string 1914 is invalid
## UTF-8
## Warning in strsplit(x = numbers, split = ""): input string 1915 is invalid
## UTF-8
## Warning in strsplit(x = numbers, split = ""): input string 1916 is invalid
## UTF-8
## Warning in strsplit(x = numbers, split = ""): input string 1917 is invalid
## UTF-8
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset c2881f6d97e1ec981fd1481cf46d6bc875fac423
## This is hpgltools commit: Tue Oct 22 10:22:30 2019 -0400: c2881f6d97e1ec981fd1481cf46d6bc875fac423
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_antimonial_estimation_20190914-v20190914.rda.xz
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: Heatplus(v.2.31.0), GSVAdata(v.1.21.0), org.Hs.eg.db(v.3.8.2), foreach(v.1.4.7), ruv(v.0.9.7.1), hpgltools(v.1.0), Biobase(v.2.45.1) and BiocGenerics(v.0.31.6)
loaded via a namespace (and not attached): snow(v.0.4-3), backports(v.1.1.5), Hmisc(v.4.2-0), BiocFileCache(v.1.9.1), plyr(v.1.8.4), lazyeval(v.0.2.2), GSEABase(v.1.47.0), splines(v.3.6.1), BiocParallel(v.1.19.4), GenomeInfoDb(v.1.21.2), ggplot2(v.3.2.1), sva(v.3.33.1), digest(v.0.6.22), htmltools(v.0.4.0), gdata(v.2.18.0), magrittr(v.1.5), checkmate(v.1.9.4), memoise(v.1.1.0), cluster(v.2.1.0), doParallel(v.1.0.15), openxlsx(v.4.1.0.1), limma(v.3.41.18), fastcluster(v.1.1.25), Biostrings(v.2.53.2), annotate(v.1.63.0), matrixStats(v.0.55.0), askpass(v.1.1), prettyunits(v.1.0.2), colorspace(v.1.4-1), blob(v.1.2.0), rappdirs(v.0.3.1), ggrepel(v.0.8.1), xfun(v.0.10), dplyr(v.0.8.3), crayon(v.1.3.4), RCurl(v.1.95-4.12), graph(v.1.63.0), genefilter(v.1.67.1), zeallot(v.0.1.0), survival(v.2.44-1.1), iterators(v.1.0.12), glue(v.1.3.1), gtable(v.0.3.0), zlibbioc(v.1.31.0), XVector(v.0.25.0), DelayedArray(v.0.11.8), DEoptimR(v.1.0-8), scales(v.1.0.0), DBI(v.1.0.0), edgeR(v.3.27.13), Rcpp(v.1.0.2), xtable(v.1.8-4), progress(v.1.2.2), htmlTable(v.1.13.2), xCell(v.1.1.0), foreign(v.0.8-72), bit(v.1.1-14), OrganismDbi(v.1.27.1), preprocessCore(v.1.47.1), Formula(v.1.2-3), stats4(v.3.6.1), GSVA(v.1.33.4), htmlwidgets(v.1.5.1), httr(v.1.4.1), gplots(v.3.0.1.1), RColorBrewer(v.1.1-2), acepack(v.1.4.1), pkgconfig(v.2.0.3), XML(v.3.98-1.20), nnet(v.7.3-12), dbplyr(v.1.4.2), locfit(v.1.5-9.1), tidyselect(v.0.2.5), labeling(v.0.3), rlang(v.0.4.1), reshape2(v.1.4.3), later(v.1.0.0), AnnotationDbi(v.1.47.1), munsell(v.0.5.0), tools(v.3.6.1), RSQLite(v.2.1.2), evaluate(v.0.14), stringr(v.1.4.0), fastmap(v.1.0.1), yaml(v.2.2.0), knitr(v.1.25), bit64(v.0.9-7), pander(v.0.6.3), zip(v.2.0.4), robustbase(v.0.93-5), caTools(v.1.17.1.2), Vennerable(v.3.1.0.9000), purrr(v.0.3.3), RBGL(v.1.61.0), nlme(v.3.1-141), mime(v.0.7), pracma(v.2.2.5), biomaRt(v.2.41.9), shinythemes(v.1.1.2), compiler(v.3.6.1), rstudioapi(v.0.10), curl(v.4.2), tibble(v.2.1.3), geneplotter(v.1.63.0), stringi(v.1.4.3), highr(v.0.8), GenomicFeatures(v.1.37.4), lattice(v.0.20-38), Matrix(v.1.2-17), vctrs(v.0.2.0), pillar(v.1.4.2), BiocManager(v.1.30.8), data.table(v.1.12.6), bitops(v.1.0-6), corpcor(v.1.6.9), httpuv(v.1.5.2), rtracklayer(v.1.45.6), GenomicRanges(v.1.37.17), R6(v.2.4.0), latticeExtra(v.0.6-28), directlabels(v.2018.05.22), promises(v.1.1.0), KernSmooth(v.2.23-16), gridExtra(v.2.3), IRanges(v.2.19.17), codetools(v.0.2-16), MASS(v.7.3-51.4), gtools(v.3.8.1), assertthat(v.0.2.1), SummarizedExperiment(v.1.15.9), gProfileR(v.0.6.8), openssl(v.1.4.1), DESeq2(v.1.25.17), GenomicAlignments(v.1.21.7), Rsamtools(v.2.1.7), S4Vectors(v.0.23.25), GenomeInfoDbData(v.1.2.1), mgcv(v.1.8-29), hms(v.0.5.1), doSNOW(v.1.0.18), quadprog(v.1.5-7), grid(v.3.6.1), rpart(v.4.1-15), rmarkdown(v.1.16), Rtsne(v.0.15), shiny(v.1.4.0) and base64enc(v.0.1-3)