This document is concerned with analyzing RNAseq data of human and parasite during the infection of human macrophages. A few different strains of L. panamensis were used; the experiment is therefore segregated into groups named ‘self-healing’, ‘chronic’, and ‘uninfected’. Two separate sets of libraries were generated, one earlier set with greater coverage, and a later set including only 1 uninfected sample, and 2-3 chronic samples.
The following subset operation extract the samples used for the macrophage experiment. The next three lines then change the colors from the defaults.
## This was made in 01_annotation_20180822.Rmd
##hs_expt <- sm(create_expt(metadata="sample_sheets/all_samples-combined.xlsx",
## gene_info=hs_annotations,
## file_column="humanfile"))
hs_macr <- subset_expt(hs_expt, subset="experimentname=='macrophage'")
## There were 42, now there are 9 samples.
new_colors <- c("#009900", "#990000", "#000099")
names(new_colors) <- c("uninf", "chr", "sh")
hs_macr <- set_expt_colors(hs_macr, colors=new_colors)
## The new colors are a character, changing according to condition.
labels <- as.character(pData(hs_macr)[["label"]])
hs_macr <- set_expt_samplenames(hs_macr, labels)
hs_cds_expt <- exclude_genes_expt(hs_expt, column='Type', patterns="protein_coding", method="keep")
## The Type column is null, doing nothing.
## There were 42, now there are 9 samples.
## The new colors are a character, changing according to condition.
Figure S1 should include nice versions of the sample metrics. The normalization chosen is batch-in-model.
First, however, we will make some plots of the raw data.
Sample names are going to be ‘infectionstate_strainnumber’ : chr_7721
## We are only dealing with the CDS data from now on.
## fig_s1 <- sm(write_expt(
## hs_macr, norm="quant", violin=FALSE, convert="cpm",
## transform="log2", batch="pca", filter=TRUE,
## excel=paste0("excel/figure_s1_sample_estimation-v", ver, ".xlsx")))
fig_s1_cds <- sm(write_expt(
hs_cds_macr, norm="quant", violin=FALSE, convert="cpm",
transform="log2", batch="pca", filter=TRUE,
excel=paste0("excel/figure_s1_sample_estimation_cds_only-v", ver, ".xlsx")))
First look for clustering patterns in the raw data, in all data followed by only the CDS regions.
Once the experiment has been written with the various normalizations in place, we can use that to print and view some representative plots.
## Writing the image to: images/figure_s1a_cds.pdf and calling dev.off().
## Show the library sizes of cds features.
pp(file="images/figure_s1b_cds.pdf", image=fig_s1_cds$raw_disheat)
## Writing the image to: images/figure_s1b_cds.pdf and calling dev.off().
## Show the clustering by euclidean distance of the raw data cds features.
pp(file="images/figure_s1c_cds.pdf", image=fig_s1_cds$raw_scaled_pca)
## Writing the image to: images/figure_s1c_cds.pdf and calling dev.off().
## Do a PCA plot of the log2(normalized(data)) without any batch-in-model (cds features)
pp(file="images/figure_s1d_cds.pdf", image=fig_s1_cds$norm_corheat)
## Writing the image to: images/figure_s1d_cds.pdf and calling dev.off().
In a similar fashion to what is above, I am printing the figure 1 images.
## pp(file="images/figure_1a.pdf", fig_s1_cds$norm_disheat)
## Show the changes in clustering after adding batch to the model and normalizing.
pp(file="images/figure_1a_cds.pdf", image=fig_s1_cds$norm_disheat)
## Writing the image to: images/figure_1a_cds.pdf and calling dev.off().
## Observe that the clustering is nearly identical when only looking at the CDS features.
## pp(file="images/figure_1b_cds.pdf", image=fig_s1_cds$norm_pca)
## Visualize the clustering after normalization via PCA.
## knitr::kable(fig_s1_cds$norm_pca_table)
## write.csv(fig_s1_cds$norm_pca_table, file="images/fig_1b.csv")
pp(file="images/figure_1b_cds.pdf", image=fig_s1_cds$norm_pca)
## Writing the image to: images/figure_1b_cds.pdf and calling dev.off().
## Once again, see that it is mostly unchanged when using only CDS features.
knitr::kable(fig_s1_cds$norm_pca_table)
sampleid | condition | batch | batch_int | colors | labels | PC1 | PC2 | pc_1 | pc_2 | pc_3 | pc_4 | pc_5 | pc_6 | pc_7 | pc_8 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
uninf_1 | HPGL0241 | uninf | a | 1 | #009900 | HPGL0241 | -0.5952 | 0.3607 | -0.5952 | 0.3607 | -0.1009 | -0.2387 | 0.2915 | -0.3302 | -0.1950 | 0.3246 |
sh_2271 | HPGL0242 | sh | a | 1 | #000099 | HPGL0242 | -0.3238 | -0.0209 | -0.3238 | -0.0209 | 0.0803 | -0.0416 | -0.4957 | 0.6115 | 0.2547 | 0.3017 |
chr_5433 | HPGL0244 | chr | a | 1 | #990000 | HPGL0244 | 0.4501 | -0.2176 | 0.4501 | -0.2176 | -0.3954 | -0.4586 | 0.3924 | 0.2719 | 0.0319 | 0.2083 |
chr_1320 | HPGL0245 | chr | a | 1 | #990000 | HPGL0245 | 0.2480 | 0.0228 | 0.2480 | 0.0228 | 0.1995 | 0.4028 | -0.0250 | 0.1583 | -0.7538 | 0.1760 |
chr_5430 | HPGL0247 | chr | a | 1 | #990000 | HPGL0247 | 0.2095 | 0.1749 | 0.2095 | 0.1749 | 0.2866 | 0.4835 | 0.3558 | -0.1139 | 0.5660 | 0.1966 |
chr_5397 | HPGL0248 | chr | a | 1 | #990000 | HPGL0248 | 0.1995 | -0.3503 | 0.1995 | -0.3503 | -0.0287 | -0.1001 | -0.5192 | -0.6358 | 0.0699 | 0.1919 |
uninf | HPGL0637 | uninf | b | 2 | #009900 | HPGL0637 | 0.1589 | 0.5572 | 0.1589 | 0.5572 | -0.4927 | 0.1206 | -0.2616 | -0.0035 | 0.0477 | -0.4745 |
sh_2189 | HPGL0638 | sh | b | 2 | #000099 | HPGL0638 | 0.0584 | 0.0675 | 0.0584 | 0.0675 | 0.6498 | -0.4689 | 0.0419 | 0.0171 | -0.0212 | -0.4862 |
sh_1022 | HPGL0639 | sh | b | 2 | #000099 | HPGL0639 | -0.4055 | -0.5943 | -0.4055 | -0.5943 | -0.1985 | 0.3010 | 0.2199 | 0.0246 | -0.0002 | -0.4385 |
The following is mostly used to compare different methodologies and is therefore not likely to be useful for most.
The various metrics used to describe and examine the data come once before, and once after normalization.
Some other analyses performed suggest a possible switch between two samples. We can artifically see what the data would look like if that is true by switching those two samples in a separate experimental design.
PBMC: variance partition human, “Are there any genes with variance we can ascribe to condition?” That set of genes is the most interesting. – use the variance parition table and pull the top 200. Report back the % variance by condition for all of these genes.
In the following, I will attempt to find the variance associated with different experimental factors in the macrophage data with mappings against the human transcriptome.
parasite_expt <- sm(create_expt(
metadata="sample_sheets/all_samples-combined.xlsx",
gene_info=lp_annotations,
file_column="parasitefile"))
lp_macr <- subset_expt(parasite_expt, subset="experimentname=='macrophage'")
## There were 25, now there are 7 samples.
chosen_colors <- c("#990000", "#000099")
names(chosen_colors) <- c("chr", "sh")
lp_macr <- set_expt_colors(lp_macr, colors=chosen_colors)
## The new colors are a character, changing according to condition.
lp_testing <- sm(write_expt(
lp_macr, violin=TRUE,
excel=paste0("excel/macrophage_parasite_data-v", ver, ".xlsx")))
Now lets try removing some of the surrogates, the two primary candidates are the strains which I proxy with 3 columns: snpclade, snpcladev2, and snpcladev3; and the batch. In this data set batch is either a or b.
In theory, sva should pick up both of those in one invocation.
## Perform a 'normal' sva
lp_macr_sva <- sm(normalize_expt(lp_macr, batch="svaseq", filter=TRUE))
## Play with the normalization of it
lp_macr_sva_norm <- sm(normalize_expt(lp_macr_sva, filter=TRUE, convert="cpm",
norm="quant", transform="log2"))
lp_macr_sva_norm_metr <- sm(graph_metrics(lp_macr_sva_norm))
## Try limma's method
lp_macr_limma_fcqlr <- sm(normalize_expt(lp_macr, batch="limma", filter=TRUE,
batch2="snpcladev3", convert="cpm", norm="quant"))
Another method might be to try using limmaresid to explicitly pull both columns.
Another method might be to try pca on two separate invocations.
## frrpr rprrf
## raw(pca(raw(raw(filter(data)))))
## thus fcqsl would be log2(sva(quant(cpm(filter(data)))))
lp_macr_rprrf <- sm(normalize_expt(lp_macr, batch="pca", filter=TRUE))
lp_macr_rprrf <- set_expt_batches(expt=lp_macr_rprrf, fact="snpcladev3")
lp_macr_rprrf <- sm(normalize_expt(lp_macr_rprrf, batch="pca"))
lp_macr_lpqcf <- sm(normalize_expt(lp_macr_rprrf, convert="cpm",
norm="quant", transform="log2",
filter=TRUE))
lp_macr_lpqcf_metr <- sm(graph_metrics(lp_macr_lpqcf))
## NULL
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 7ec7a093f7ee8293e7a5326cb8f77479341502a7
## This is hpgltools commit: Thu Feb 7 16:10:02 2019 -0500: 7ec7a093f7ee8293e7a5326cb8f77479341502a7
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_estimation_macrophage_20190205-v20190205.rda.xz
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: variancePartition(v.1.12.1), ruv(v.0.9.7), bindrcpp(v.0.2.2), hpgltools(v.2018.11), Biobase(v.2.42.0) and BiocGenerics(v.0.28.0)
loaded via a namespace (and not attached): backports(v.1.1.3), fastmatch(v.1.1-0), plyr(v.1.8.4), igraph(v.1.2.2), lazyeval(v.0.2.1), splines(v.3.5.2), BiocParallel(v.1.16.5), usethis(v.1.4.0), GenomeInfoDb(v.1.18.1), ggplot2(v.3.1.0), sva(v.3.30.1), urltools(v.1.7.1), digest(v.0.6.18), foreach(v.1.4.4), htmltools(v.0.3.6), GOSemSim(v.2.8.0), viridis(v.0.5.1), GO.db(v.3.7.0), gdata(v.2.18.0), magrittr(v.1.5), memoise(v.1.1.0), doParallel(v.1.0.14), openxlsx(v.4.1.0), remotes(v.2.0.2), limma(v.3.38.3), Biostrings(v.2.50.2), annotate(v.1.60.0), matrixStats(v.0.54.0), enrichplot(v.1.2.0), prettyunits(v.1.0.2), colorspace(v.1.4-0), blob(v.1.1.1), ggrepel(v.0.8.0), xfun(v.0.4), dplyr(v.0.7.8), callr(v.3.1.1), crayon(v.1.3.4), RCurl(v.1.95-4.11), jsonlite(v.1.6), graph(v.1.60.0), genefilter(v.1.64.0), lme4(v.1.1-19), bindr(v.0.1.1), survival(v.2.43-3), iterators(v.1.0.10), glue(v.1.3.0), gtable(v.0.2.0), zlibbioc(v.1.28.0), XVector(v.0.22.0), UpSetR(v.1.3.3), DelayedArray(v.0.8.0), pkgbuild(v.1.0.2), scales(v.1.0.0), DOSE(v.3.8.2), edgeR(v.3.24.3), DBI(v.1.0.0), Rcpp(v.1.0.0), viridisLite(v.0.3.0), xtable(v.1.8-3), progress(v.1.2.0), units(v.0.6-2), gridGraphics(v.0.3-0), bit(v.1.1-14), europepmc(v.0.3), preprocessCore(v.1.45.0), OrganismDbi(v.1.24.0), stats4(v.3.5.2), httr(v.1.4.0), fgsea(v.1.8.0), gplots(v.3.0.1), RColorBrewer(v.1.1-2), pkgconfig(v.2.0.2), XML(v.3.98-1.16), farver(v.1.1.0), locfit(v.1.5-9.1), labeling(v.0.3), ggplotify(v.0.0.3), tidyselect(v.0.2.5), rlang(v.0.3.1), reshape2(v.1.4.3), AnnotationDbi(v.1.44.0), munsell(v.0.5.0), tools(v.3.5.2), cli(v.1.0.1), RSQLite(v.2.1.1), devtools(v.2.0.1), ggridges(v.0.5.1), evaluate(v.0.12), stringr(v.1.3.1), yaml(v.2.2.0), fs(v.1.2.6), processx(v.3.2.1), knitr(v.1.21), bit64(v.0.9-7), pander(v.0.6.3), zip(v.1.0.0), caTools(v.1.17.1.1), purrr(v.0.2.5), ggraph(v.1.0.2), packrat(v.0.5.0), RBGL(v.1.58.1), nlme(v.3.1-137), DO.db(v.2.9), xml2(v.1.2.0), biomaRt(v.2.38.0), rstudioapi(v.0.9.0), compiler(v.3.5.2), pbkrtest(v.0.4-7), testthat(v.2.0.1), tibble(v.2.0.1), tweenr(v.1.0.1), stringi(v.1.2.4), highr(v.0.7), ps(v.1.3.0), desc(v.1.2.0), GenomicFeatures(v.1.34.1), lattice(v.0.20-38), Matrix(v.1.2-15), nloptr(v.1.2.1), pillar(v.1.3.1), BiocManager(v.1.30.4), triebeard(v.0.3.0), corpcor(v.1.6.9), data.table(v.1.12.0), cowplot(v.0.9.4), bitops(v.1.0-6), rtracklayer(v.1.42.1), GenomicRanges(v.1.34.0), qvalue(v.2.14.1), colorRamps(v.2.3), directlabels(v.2018.05.22), R6(v.2.3.0), KernSmooth(v.2.23-15), gridExtra(v.2.3), IRanges(v.2.16.0), sessioninfo(v.1.1.1), codetools(v.0.2-16), pkgload(v.1.0.2), MASS(v.7.3-51.1), gtools(v.3.8.1), assertthat(v.0.2.0), SummarizedExperiment(v.1.12.0), rprojroot(v.1.3-2), withr(v.2.1.2), GenomicAlignments(v.1.18.1), Rsamtools(v.1.34.0), S4Vectors(v.0.20.1), GenomeInfoDbData(v.1.2.0), mgcv(v.1.8-26), hms(v.0.4.2), clusterProfiler(v.3.10.1), quadprog(v.1.5-5), grid(v.3.5.2), tidyr(v.0.8.2), minqa(v.1.2.4), rmarkdown(v.1.11), rvcheck(v.0.1.3), Rtsne(v.0.15), ggforce(v.0.1.3) and base64enc(v.0.1-3)