This process is not really optimized yet, I am not really certain what questions I want to ask nor how I want to answer them.
But whatever my questions are, the first thing I will need to do is collect the variant information and get a feeling for how many variants are in each sample.
## The scale difference between the smallest and largest
## libraries is > 10. Assuming a log10 scale is better, set scale=FALSE if not.
## This should be normalized against the library sizes of the original libraries so that we know
## if the numbers we are getting are an artifact of coverage.
This categorizes the possible variants so that we can cross reference them against experiment metadata later.
## Get all possible snp categories
snp_strain_sets <- sm(get_snp_sets(snp_strain_expt))
head(snp_strain_sets$medians)
## Lp001 s1022 s1320 s2189 s2271 s2272 s2504 s5397
## LPAL13_SCAF000001_1019_G_A 0 15.5 0 0 0 0 6 0
## LPAL13_SCAF000001_106_A_G 0 0.0 0 0 0 0 0 0
## LPAL13_SCAF000001_1092_A_G 0 17.0 0 0 0 0 13 0
## LPAL13_SCAF000001_1138_C_A 0 0.0 0 0 0 0 0 0
## LPAL13_SCAF000001_1149_G_A 0 0.0 0 0 0 0 0 0
## LPAL13_SCAF000001_1183_C_T 0 14.5 0 0 0 0 0 0
## s5430 s5433 chr
## LPAL13_SCAF000001_1019_G_A 0.0 10 SCAF000001
## LPAL13_SCAF000001_106_A_G 0.0 0 SCAF000001
## LPAL13_SCAF000001_1092_A_G 0.0 22 SCAF000001
## LPAL13_SCAF000001_1138_C_A 24.0 0 SCAF000001
## LPAL13_SCAF000001_1149_G_A 23.5 0 SCAF000001
## LPAL13_SCAF000001_1183_C_T 0.0 0 SCAF000001
A likely question we will ask: what genes have snps in specific strains:
Lets do the same thing, but this time against the self-healing/chronic etc conditions.
## xref against a specific metadatum
snp_cond_sets <- sm(get_snp_sets(snp_strain_expt, factor="state"))
snp_cond_summary <- snps_vs_genes(parasite_expt, snp_cond_sets)
## Oops, this is not what I thought it was, this is the set of positions in all
## samples, not a specific subset.
head(snp_cond_summary$summary_by_gene)
## exon_LPAL13_340060500.1 exon_LPAL13_250017600.1 exon_LPAL13_270011800.1
## 334 291 264
## exon_LPAL13_160011700.1 exon_LPAL13_330037000.1 exon_LPAL13_200020000.1
## 258 252 251
## exon_LPAL13_090013100.1 exon_LPAL13_340060500.1 exon_LPAL13_310019300.1
## 34 19 18
## exon_LPAL13_230022900.1 exon_LPAL13_270011800.1 exon_LPAL13_350020500.1
## 17 17 17
## exon_LPAL13_160006700.1 exon_LPAL13_230013700.1 exon_LPAL13_310036400.1
## 16 16 16
## exon_LPAL13_100010800.1 exon_LPAL13_280011700.1 exon_LPAL13_280037200.1
## 15 15 15
## exon_LPAL13_000030400.1 exon_LPAL13_270015100.1 exon_LPAL13_310013800.1
## 14 14 14
## exon_LPAL13_090014400.1 exon_LPAL13_130018000.1 exon_LPAL13_200020000.1
## 13 13 13
## exon_LPAL13_200051500.1 exon_LPAL13_240016500.1 exon_LPAL13_320031600.1
## 13 13 13
## exon_LPAL13_180009600.1 exon_LPAL13_190017200.1 exon_LPAL13_260015800.1
## 12 12 12
## exon_LPAL13_060013800.1 exon_LPAL13_060014200.1 exon_LPAL13_070008100.1
## 11 11 11
## exon_LPAL13_120012900.1 exon_LPAL13_140012900.1 exon_LPAL13_170011000.1
## 11 11 11
fun_names <- head(names(snp_cond_genes[["gene_summaries"]][["chronic"]]), n=30)
annot <- fData(parasite_expt)[fun_names, ]
annot[, c("seqnames", "description")]
## seqnames
## exon_LPAL13_090013100.1 LpaL13_09
## exon_LPAL13_340060500.1 LpaL13_34
## exon_LPAL13_310019300.1 LpaL13_31
## exon_LPAL13_230022900.1 LpaL13_23
## exon_LPAL13_270011800.1 LpaL13_27
## exon_LPAL13_350020500.1 LpaL13_35
## exon_LPAL13_160006700.1 LpaL13_16
## exon_LPAL13_230013700.1 LpaL13_23
## exon_LPAL13_310036400.1 LpaL13_31
## exon_LPAL13_100010800.1 LpaL13_10
## exon_LPAL13_280011700.1 LpaL13_28
## exon_LPAL13_280037200.1 LpaL13_28
## exon_LPAL13_000030400.1 LPAL13_SCAF000424
## exon_LPAL13_270015100.1 LpaL13_27
## exon_LPAL13_310013800.1 LpaL13_31
## exon_LPAL13_090014400.1 LpaL13_09
## exon_LPAL13_130018000.1 LpaL13_13
## exon_LPAL13_200020000.1 LpaL13_20.1
## exon_LPAL13_200051500.1 LpaL13_20.2
## exon_LPAL13_240016500.1 LpaL13_24
## exon_LPAL13_320031600.1 LpaL13_32
## exon_LPAL13_180009600.1 LpaL13_18
## exon_LPAL13_190017200.1 LpaL13_19
## exon_LPAL13_260015800.1 LpaL13_26
## exon_LPAL13_060013800.1 LpaL13_06
## exon_LPAL13_060014200.1 LpaL13_06
## exon_LPAL13_070008100.1 LpaL13_07
## exon_LPAL13_120012900.1 LpaL13_12
## exon_LPAL13_140012900.1 LpaL13_14
## exon_LPAL13_170011000.1 LpaL13_17
## description
## exon_LPAL13_090013100.1 hypothetical+protein,+conserved
## exon_LPAL13_340060500.1 SPRY+domain/HECT-domain+(ubiquitin-transferase),+putative
## exon_LPAL13_310019300.1 hypothetical+protein,+conserved
## exon_LPAL13_230022900.1 DNA+polymerase+zeta+catalytic+subunit,+putative
## exon_LPAL13_270011800.1 calpain-like+cysteine+peptidase,+putative
## exon_LPAL13_350020500.1 N-terminal+region+of+Chorein,+a+TM+vesicle-mediated+sorter,+putative
## exon_LPAL13_160006700.1 hypothetical+protein,+conserved
## exon_LPAL13_230013700.1 Galactose+oxidase,+central+domain+containing+protein,+putative
## exon_LPAL13_310036400.1 hypothetical+protein,+conserved
## exon_LPAL13_100010800.1 hypothetical+protein,+conserved
## exon_LPAL13_280011700.1 Cytoplasmic+dynein+2+heavy+chain+(DYNC2H1),+putative,dynein+heavy+chain,+putative
## exon_LPAL13_280037200.1 Cytoplasmic+dynein+2+heavy+chain+(DYNC2H1),+putative,dynein+heavy+chain,+putative
## exon_LPAL13_000030400.1 hypothetical+protein,+conserved
## exon_LPAL13_270015100.1 parallel+beta-helix+repeat,+putative
## exon_LPAL13_310013800.1 hypothetical+protein,+conserved
## exon_LPAL13_090014400.1 hypothetical+protein,+conserved
## exon_LPAL13_130018000.1 SPRY+domain/HECT-domain+(ubiquitin-transferase),+putative
## exon_LPAL13_200020000.1 N-terminal+region+of+Chorein,+a+TM+vesicle-mediated+sorter/Protein+of+unknown+function+(DUF1162),+putative
## exon_LPAL13_200051500.1 Midasin,+putative+(MDN1)
## exon_LPAL13_240016500.1 hypothetical+protein,+conserved
## exon_LPAL13_320031600.1 hypothetical+protein,+conserved
## exon_LPAL13_180009600.1 Flagellar-associated+PapD-like,+putative
## exon_LPAL13_190017200.1 Ankyrin+repeats+(3+copies),+putative
## exon_LPAL13_260015800.1 TROVE+domain/WD+domain,+G-beta+repeat,+putative
## exon_LPAL13_060013800.1 hypothetical+protein,+conserved
## exon_LPAL13_060014200.1 hypothetical+protein,+conserved
## exon_LPAL13_070008100.1 ubiquitin-protein+ligase-like,+putative
## exon_LPAL13_120012900.1 hypothetical+protein,+conserved
## exon_LPAL13_140012900.1 hypothetical+protein,+conserved
## exon_LPAL13_170011000.1 hypothetical+protein,+conserved
fun_names <- head(names(snp_cond_genes[["gene_summaries"]][["self_heal"]]), n=30)
annot <- fData(parasite_expt)[fun_names, ]
annot[, c("seqnames", "description")]
## seqnames
## exon_LPAL13_080010200.1 LpaL13_08
## exon_LPAL13_140008400.1 LpaL13_14
## exon_LPAL13_200013000.1 LpaL13_20.1
## exon_LPAL13_000035800.1 LPAL13_SCAF000500
## exon_LPAL13_000037900.1 LPAL13_SCAF000543
## exon_LPAL13_200008300.1 LpaL13_20.1
## exon_LPAL13_200012900.1 LpaL13_20.1
## exon_LPAL13_200014900.1 LpaL13_20.1
## exon_LPAL13_200016900.1 LpaL13_20.1
## exon_LPAL13_200017800.1 LpaL13_20.1
## exon_LPAL13_200019700.1 LpaL13_20.1
## exon_LPAL13_230015400.1 LpaL13_23
## exon_LPAL13_000015400.1 LPAL13_SCAF000124
## exon_LPAL13_000015500.1 LPAL13_SCAF000124
## exon_LPAL13_000029100.1 LPAL13_SCAF000388
## exon_LPAL13_040010300.1 LpaL13_04
## exon_LPAL13_060010800.1 LpaL13_06
## exon_LPAL13_070014200.1 LpaL13_07
## exon_LPAL13_080008300.1 LpaL13_08
## exon_LPAL13_080008800.1 LpaL13_08
## exon_LPAL13_080009300.1 LpaL13_08
## exon_LPAL13_100011600.1 LpaL13_10
## exon_LPAL13_100012700.1 LpaL13_10
## exon_LPAL13_200011500.1 LpaL13_20.1
## exon_LPAL13_200012200.1 LpaL13_20.1
## exon_LPAL13_200016400.1 LpaL13_20.1
## exon_LPAL13_200017400.1 LpaL13_20.1
## exon_LPAL13_200017900.1 LpaL13_20.1
## exon_LPAL13_200019500.1 LpaL13_20.1
## exon_LPAL13_200019600.1 LpaL13_20.1
## description
## exon_LPAL13_080010200.1 Amastin+surface+glycoprotein,+putative
## exon_LPAL13_140008400.1 Transmembrane+amino+acid+transporter+protein,+putative
## exon_LPAL13_200013000.1 Amastin+surface+glycoprotein,+putative
## exon_LPAL13_000035800.1 hypothetical+protein
## exon_LPAL13_000037900.1 Amastin+surface+glycoprotein,+putative
## exon_LPAL13_200008300.1 hypothetical+protein,+conserved
## exon_LPAL13_200012900.1 Amastin+surface+glycoprotein,+putative
## exon_LPAL13_200014900.1 EF+hand/EF-hand+domain+pair,+putative
## exon_LPAL13_200016900.1 hypothetical+protein,+conserved
## exon_LPAL13_200017800.1 hypothetical+protein,+conserved
## exon_LPAL13_200019700.1 hypothetical+protein,+conserved
## exon_LPAL13_230015400.1 hypothetical+protein,+conserved
## exon_LPAL13_000015400.1 hypothetical+protein,+conserved
## exon_LPAL13_000015500.1 ribose-phosphate+pyrophosphokinase,+putative
## exon_LPAL13_000029100.1 Amastin+surface+glycoprotein,+putative
## exon_LPAL13_040010300.1 serine/threonine+protein+kinase-like+protein
## exon_LPAL13_060010800.1 hypothetical+protein,+conserved
## exon_LPAL13_070014200.1 hypothetical+protein,+conserved
## exon_LPAL13_080008300.1 hypothetical+protein,+conserved
## exon_LPAL13_080008800.1 hypothetical+protein,+conserved
## exon_LPAL13_080009300.1 hypothetical+protein,+conserved
## exon_LPAL13_100011600.1 Chromatin+assembly+factor+1+subunit+A,+putative
## exon_LPAL13_100012700.1 hypothetical+protein,+conserved
## exon_LPAL13_200011500.1 flagellar+attachment+zone+protein,+putative
## exon_LPAL13_200012200.1 serine/threonine-protein+phosphatase+PP1,+putative
## exon_LPAL13_200016400.1 hypothetical+protein
## exon_LPAL13_200017400.1 hypothetical+protein,+conserved
## exon_LPAL13_200017900.1 RNA+editing+associated+helicase+2,+putative,ATP-dependent+RNA+helicase-like+protein+(REH2)
## exon_LPAL13_200019500.1 hypothetical+protein,+conserved
## exon_LPAL13_200019600.1 hypothetical+protein,+conserved
## GRanges object with 10 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## LpaL13_01_182906_C_T 01 182906-182907 +
## LpaL13_01_183131_G_C 01 183131-183132 +
## LpaL13_01_183471_A_G 01 183471-183472 +
## LpaL13_01_183498_C_G 01 183498-183499 +
## LpaL13_01_184202_G_A 01 184202-184203 +
## LpaL13_01_184517_C_T 01 184517-184518 +
## LpaL13_02_233734_A_G 02 233734-233735 +
## LpaL13_02_76980_T_C 02 76980-76981 +
## LpaL13_02_82011_A_G 02 82011-82012 +
## LpaL13_02_95074_G_A 02 95074-95075 +
## -------
## seqinfo: 100 sequences from an unspecified genome; no seqlengths
## [1] "chronic"
## [2] "failure"
## [3] "chronic, self_heal"
## [4] "failure, self_heal"
## [5] "chronic, failure, self_heal"
## [6] "failure, responder, self_heal"
## [7] "chronic, failure"
## [8] "failure, responder"
## [9] "self_heal"
## [10] "chronic, failure, responder, self_heal"
## [11] "chronic, responder, self_heal"
## [12] "responder"
## [13] "chronic, failure, responder"
## [14] "responder, self_heal"
## [15] "chronic, responder"
## [1] 305
## [1] 25
Ideally, the following heatmaps should provide what is essentially a phylogenetic tree describing the relationship of the various strains in the data according to the observed variant positions in the data.
## This function will replace the expt$expressionset slot with:
## log2(cpm(hpgl(data)))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: hpgl
## Removing 228809 low-count genes (210906 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 5653071 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
snp_strain_norm <- set_expt_conditions(snp_strain_norm, fact="pathogenstrain")
test <- plot_disheat(snp_strain_norm)
des <- snp_strain_norm$design
library(Heatplus)
##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(snp_strain_norm))
mydendro <- list(
"clustfun" = hclust,
"lwd" = 2.0)
col_data <- as.data.frame(des[, c("pathogenstrain")])
row_data <- as.data.frame(des[, c("alias")])
colnames(col_data) <- c("strain")
colnames(row_data) <- c("alias")
myannot <- list(
"Col" = list("data" = col_data),
"Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
"col" = BrewerClusterCol)
mylabs <- list(
"Row" = list("nrow" = 4),
"Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(333)
map2 <- annHeatmap2(
correlations,
dendrogram=mydendro,
annotation=myannot,
cluster=myclust,
labels=mylabs,
col=hmcols)
num_clusters <- max(map2[["cluster"]][["Row"]][["grp"]])
chosen_palette <- "Dark2"
new_colors <- grDevices::colorRampPalette(
RColorBrewer::brewer.pal(num_clusters, chosen_palette))(num_clusters)
## Warning in RColorBrewer::brewer.pal(num_clusters, chosen_palette): n too large, allowed maximum for palette Dark2 is 8
## Returning the palette you asked for with that many colors
myclust <- list("cuth" = 1.0,
"col" = new_colors)
map2 <- annHeatmap2(
correlations,
dendrogram=mydendro,
annotation=myannot,
cluster=myclust,
labels=mylabs,
scale="none",
col=hmcols)
## Warning in breakColors(breaks, col): more colors than classes: ignoring 25
## last colors
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 455a7f0db4618c2209c21a978a41441bd4891095
## This is hpgltools commit: Mon Nov 12 15:13:53 2018 -0500: 455a7f0db4618c2209c21a978a41441bd4891095
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 01_annotation_20180822-v20180822.rda.xz
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: Heatplus(v.2.26.0), foreach(v.1.4.4), hpgltools(v.2018.11), Biobase(v.2.40.0) and BiocGenerics(v.0.26.0)
loaded via a namespace (and not attached): snow(v.0.4-3), backports(v.1.1.2), fastmatch(v.1.1-0), plyr(v.1.8.4), igraph(v.1.2.2), lazyeval(v.0.2.1), splines(v.3.5.1), BiocParallel(v.1.14.2), usethis(v.1.4.0), GenomeInfoDb(v.1.16.0), ggplot2(v.3.1.0), sva(v.3.28.0), digest(v.0.6.18), BiocInstaller(v.1.30.0), htmltools(v.0.3.6), GOSemSim(v.2.6.2), viridis(v.0.5.1), GO.db(v.3.6.0), gdata(v.2.18.0), magrittr(v.1.5), memoise(v.1.1.0), doParallel(v.1.0.14), limma(v.3.36.5), remotes(v.2.0.2), Biostrings(v.2.48.0), annotate(v.1.58.0), matrixStats(v.0.54.0), enrichplot(v.1.0.2), prettyunits(v.1.0.2), colorspace(v.1.3-2), blob(v.1.1.1), ggrepel(v.0.8.0), dplyr(v.0.7.8), callr(v.3.0.0), crayon(v.1.3.4), RCurl(v.1.95-4.11), graph(v.1.58.2), genefilter(v.1.62.0), lme4(v.1.1-19), bindr(v.0.1.1), survival(v.2.43-1), iterators(v.1.0.10), glue(v.1.3.0), gtable(v.0.2.0), zlibbioc(v.1.26.0), XVector(v.0.20.0), UpSetR(v.1.3.3), DelayedArray(v.0.6.6), pkgbuild(v.1.0.2), scales(v.1.0.0), DOSE(v.3.6.1), edgeR(v.3.22.5), DBI(v.1.0.0), Rcpp(v.1.0.0), viridisLite(v.0.3.0), xtable(v.1.8-3), progress(v.1.2.0), units(v.0.6-1), bit(v.1.1-14), OrganismDbi(v.1.22.0), stats4(v.3.5.1), httr(v.1.3.1), fgsea(v.1.6.0), RColorBrewer(v.1.1-2), gplots(v.3.0.1), pkgconfig(v.2.0.2), XML(v.3.98-1.16), farver(v.1.0), locfit(v.1.5-9.1), tidyselect(v.0.2.5), rlang(v.0.3.0.1), reshape2(v.1.4.3), AnnotationDbi(v.1.42.1), munsell(v.0.5.0), tools(v.3.5.1), cli(v.1.0.1), RSQLite(v.2.1.1), devtools(v.2.0.1), ggridges(v.0.5.1), evaluate(v.0.12), stringr(v.1.3.1), yaml(v.2.2.0), processx(v.3.2.0), knitr(v.1.20), bit64(v.0.9-7), fs(v.1.2.6), pander(v.0.6.3), caTools(v.1.17.1.1), purrr(v.0.2.5), ggraph(v.1.0.2), bindrcpp(v.0.2.2), packrat(v.0.4.9-3), RBGL(v.1.56.0), nlme(v.3.1-137), DO.db(v.2.9), biomaRt(v.2.36.1), compiler(v.3.5.1), pbkrtest(v.0.4-7), rstudioapi(v.0.8), variancePartition(v.1.10.4), testthat(v.2.0.1), tibble(v.1.4.2), tweenr(v.1.0.0), stringi(v.1.2.4), ps(v.1.2.1), GenomicFeatures(v.1.32.3), desc(v.1.2.0), lattice(v.0.20-38), Matrix(v.1.2-15), nloptr(v.1.2.1), pillar(v.1.3.0), data.table(v.1.11.8), cowplot(v.0.9.3), bitops(v.1.0-6), rtracklayer(v.1.40.6), GenomicRanges(v.1.32.7), qvalue(v.2.12.0), colorRamps(v.2.3), R6(v.2.3.0), KernSmooth(v.2.23-15), gridExtra(v.2.3), IRanges(v.2.14.12), sessioninfo(v.1.1.1), codetools(v.0.2-15), MASS(v.7.3-51.1), gtools(v.3.8.1), assertthat(v.0.2.0), pkgload(v.1.0.2), SummarizedExperiment(v.1.10.1), rprojroot(v.1.3-2), withr(v.2.1.2), GenomicAlignments(v.1.16.0), Rsamtools(v.1.32.3), S4Vectors(v.0.18.3), GenomeInfoDbData(v.1.1.0), doSNOW(v.1.0.16), mgcv(v.1.8-25), hms(v.0.4.2), clusterProfiler(v.3.8.1), grid(v.3.5.1), tidyr(v.0.8.2), minqa(v.1.2.4), rmarkdown(v.1.10), rvcheck(v.0.1.1), ggforce(v.0.1.3) and base64enc(v.0.1-3)