1 Introduction

This document is intended to provide a general overview of the TMRC2 samples which have thus far been sequenced. In some cases, this includes only those samples starting in 2019; in other instances I am including our previous (2015-2016) samples.

In all cases the processing performed was:

  1. Default trimming was performed.
  2. Hisat2 was used to map the remaining reads against the Leishmania panamensis genome revision 36.
  3. The alignments from hisat2 were used to count reads/gene against the revision 36 annotations with htseq.
  4. These alignments were also passed to the pileup functionality of samtools and the vcf/bcf utilities in order to make a matrix of all observed differences between each sample with respect to the reference.

The analyses in this document use the matrices of counts/gene from #3 and variants/position from #4 in order to provide some images and metrics describing the samples we have sequenced so far.

2 Annotations

Everything which follows depends on the Existing TriTrypDB annotations revision 46, circa 2019. The following block loads a database of these annotations and turns it into a matrix where the rows are genes and columns are all the annotation types provided by TriTrypDB.

The same database was used to create a matrix of orthologous genes between L.panamensis and all of the other species in the TriTrypDB.

tt <- sm(library(EuPathDB))
tt <- sm(library(org.Lpanamensis.MHOMCOL81L13.v46.eg.db))
pan_db <- org.Lpanamensis.MHOMCOL81L13.v46.eg.db
all_fields <- columns(pan_db)

all_lp_annot <- sm(load_orgdb_annotations(
    pan_db,
    keytype = "gid",
    fields = c("annot_gene_entrez_id", "annot_gene_name",
               "annot_strand", "annot_chromosome", "annot_cds_length",
               "annot_gene_product")))$genes

lp_go <- sm(load_orgdb_go(pan_db))
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths)  <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
orthos <- sm(EuPathDB::extract_eupath_orthologs(db = pan_db))

hisat_annot <- all_lp_annot
## rownames(hisat_annot) <- paste0("exon_", rownames(hisat_annot), ".E1")

3 TODO:

Resequence samples: TMRC20002, TMRC20006, TMRC20004 (maybe TMRC20008 and TMRC20029)

4 Generate Expressionsets and Sample Estimation

The process of sample estimation takes two primary inputs:

  1. The sample sheet, which contains all the metadata we currently have on hand, including filenames for the outputs of #3 and #4 above.
  2. The gene annotations.

An expressionset is a data structure used in R to examine RNASeq data. It is comprised of annotations, metadata, and expression data. In the case of our processing pipeline, the location of the expression data is provided by the filenames in the metadata.

The first lines of the following block create the Expressionset. All of the following lines perform various normalizations and generate plots from it.

4.1 Notes

The following samples are much lower coverage:

  • TMRC20002
  • TMRC20006
  • TMRC20007
  • TMRC20008

4.2 TODO:

  1. Do the multi-gene family removal right here instead of way down at the bottom
  2. Add zymodeme snps to the annotation later.
  3. Start phylogenetic analysis of variant table.
sample_sheet <- glue::glue("sample_sheets/tmrc2_samples_20210601.xlsx")

sanitize_columns <- c("passagenumber", "clinicalresponse", "clinicalcategorical",
                      "zymodemecategorical", "phenotypiccharacteristics")
lp_expt <- sm(create_expt(sample_sheet,
                          gene_info = hisat_annot,
                          id_column = "hpglidentifier",
                          file_column = "lpanamensisv36hisatfile")) %>%
  set_expt_conditions(fact = "zymodemecategorical") %>%
  subset_expt(nonzero = 8600) %>%
  semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
                       semantic_column = "annot_gene_product") %>%
  sanitize_expt_metadata(columns = sanitize_columns) %>%
  set_expt_factors(columns = sanitize_columns, class = "factor")
## The samples (and read coverage) removed when filtering 8600 non-zero genes are:
## TMRC20002 TMRC20004 TMRC20006 TMRC20029 TMRC20008 
##  11681227    564812   6670348   1658096   6249790
## subset_expt(): There were 52, now there are 47 samples.
## semantic_expt_filter(): Removed 68 genes.
libsizes <- plot_libsize(lp_expt)
pp(file = "images/lp_expt_libsizes.png", image = libsizes$plot, width = 12, height = 9)

## I think samples 7,10 should be removed at minimum, probably also 9,11
nonzero <- plot_nonzero(lp_expt)
nonzero$plot
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

lp_box <- plot_boxplot(lp_expt)
## 3164 entries are 0.  We are on a log scale, adding 1 to the data.
pp(file = "images/lp_expt_boxplot.png", image = lp_box, width = 12, height = 9)

filter_plot <- plot_libsize_prepost(lp_expt)
filter_plot$lowgene_plot
## Warning: Using alpha for a discrete variable is not advised.

filter_plot$count_plot

4.3 Distribution Visualization

Najib’s favorite plots are of course the PCA/TNSE. These are nice to look at in order to get a sense of the relationships between samples. They also provide a good opportunity to see what happens when one applies different normalizations, surrogate analyses, filters, etc. In addition, one may set different experimental factors as the primary ‘condition’ (usually the color of plots) and surrogate ‘batches’.

4.4 By Susceptilibity

Column ‘Q’ in the sample sheet, make a categorical version of it with these parameters:

  • 0 <= x <= 35 is resistant
  • 36 <= x <= 48 is ambiguous
  • 49 <= x is sensitive
starting <- as.numeric(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvhistoricaldata"]])
sus_categorical <- starting
na_idx <- is.na(starting)
sus_categorical[na_idx] <- "unknown"

resist_idx <- starting <= 0.35
sus_categorical[resist_idx] <- "resistant"
indeterminant_idx <- starting >= 0.36 & starting <= 0.48
sus_categorical[indeterminant_idx] <- "ambiguous"
susceptible_idx <- starting >= 0.49
sus_categorical[susceptible_idx] <- "sensitive"

pData(lp_expt$expressionset)[["sus_category"]] <- sus_categorical
clinical_samples <- lp_expt %>%
  set_expt_batches(fact = sus_categorical)

clinical_norm <- sm(normalize_expt(clinical_samples, norm = "quant", transform = "log2",
                                   convert = "cpm", batch = FALSE, filter = TRUE))
zymo_pca <- plot_pca(clinical_norm, plot_title = "PCA of parasite expression values")
pp(file = "images/zymo_pca_sus_shape.png", image = zymo_pca$plot)

zymo_3dpca <- plot_3d_pca(zymo_pca)
zymo_3dpca$plot
clinical_n <- sm(normalize_expt(clinical_samples, transform = "log2",
                                convert = "cpm", batch = FALSE, filter = TRUE))
zymo_tsne <- plot_tsne(clinical_n, plot_title = "TSNE of parasite expression values")
zymo_tsne$plot

clinical_nb <- normalize_expt(clinical_samples, convert = "cpm", transform = "log2",
                         filter = TRUE, batch = "svaseq")
## Removing 145 low-count genes (8565 remaining).
## batch_counts: Before batch/surrogate estimation, 568 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 2107 entries are 0<x<1: 1%.
## Setting 129 low elements to zero.
## transform_counts: Found 129 values equal to 0, adding 1 to the matrix.
clinical_nb_pca <- plot_pca(clinical_nb, plot_title = "PCA of parasite expression values")
pp(file = "images/clinical_nb_pca_sus_shape.png", image = clinical_nb_pca$plot)

clinical_nb_tsne <- plot_tsne(clinical_nb, plot_title = "TSNE of parasite expression values")
clinical_nb_tsne$plot

corheat <- plot_corheat(clinical_norm, plot_title = "Correlation heatmap of parasite
                 expression values
")
corheat$plot

plot_sm(clinical_norm)$plot
## Performing correlation.

4.5 By Cure/Fail status

cf_expt <- set_expt_conditions(lp_expt, fact = "clinicalcategorical") %>%
  set_expt_batches(fact = sus_categorical)

cf_norm <- normalize_expt(cf_expt, convert = "cpm", transform = "log2",
                          norm = "quant", filter = TRUE)
## Removing 145 low-count genes (8565 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
start_cf <- plot_pca(cf_norm, plot_title = "PCA of parasite expression values")
pp(file = "images/cf_sus_shape.png", image = start_cf$plot)

cf_nb <- normalize_expt(cf_expt, convert = "cpm", transform = "log2",
                        norm = "quant", filter = TRUE, batch = "svaseq")
## Warning in normalize_expt(cf_expt, convert = "cpm", transform = "log2", :
## Quantile normalization and sva do not always play well together.
## Removing 145 low-count genes (8565 remaining).
## batch_counts: Before batch/surrogate estimation, 2 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 2573 entries are 0<x<1: 1%.
## Setting 67 low elements to zero.
## transform_counts: Found 67 values equal to 0, adding 1 to the matrix.
cf_nb_pca <- plot_pca(cf_nb, plot_title = "PCA of parasite expression values")
pp(file = "images/cf_sus_share_nb.png", image = cf_nb_pca$plot)
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

cf_norm <- normalize_expt(cf_expt, transform = "log2", convert = "cpm",
                          filter = TRUE, norm = "quant")
## Removing 145 low-count genes (8565 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
test <- pca_information(cf_norm,
                        expt_factors = c("clinicalcategorical", "zymodemecategorical",
                                         "pathogenstrain", "passagenumber"),
                        num_components = 6, plot_pcas = TRUE)
test$anova_p
##                           PC1     PC2       PC3     PC4    PC5    PC6
## clinicalcategorical 4.067e-01 0.91829 8.951e-04 0.45541 0.8572 0.3990
## zymodemecategorical 2.381e-06 0.02917 2.717e-01 0.72493 0.1045 0.4399
## pathogenstrain      6.072e-01 0.07173 5.510e-06 0.43202 0.4455 0.7410
## passagenumber       7.131e-03 0.09458 3.026e-01 0.00987 0.3880 0.8164
test$cor_heatmap

sus_expt <- set_expt_conditions(lp_expt, fact = "sus_category") %>%
  set_expt_batches(fact = "zymodemecategorical")

sus_norm <- normalize_expt(sus_expt, transform = "log2", convert = "cpm",
                           norm = "quant", filter = TRUE)
## Removing 145 low-count genes (8565 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
sus_pca <- plot_pca(sus_norm, plot_title = "PCA of parasite expression values")
sus_pca$plot

sus_nb <- normalize_expt(sus_expt, transform = "log2", convert = "cpm",
                         batch = "svaseq", filter = TRUE)
## Removing 145 low-count genes (8565 remaining).
## batch_counts: Before batch/surrogate estimation, 568 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 2107 entries are 0<x<1: 1%.
## Setting 110 low elements to zero.
## transform_counts: Found 110 values equal to 0, adding 1 to the matrix.
sus_nb_pca <- plot_pca(sus_nb, plot_title = "PCA of parasite expression values")
pp(file = "images/sus_nb_pca.png", image = sus_nb_pca$plot)

At this time, we do not have very many samples, so the set of metrics/plots is fairly limited. There is really only one factor in the metadata which we can use for performing differential expression analyses, the ‘zymodeme’.

5 Zymodeme analyses

The following sections perform a series of analyses which seek to elucidate differences between the zymodemes 2.2 and 2.3 either through differential expression or variant profiles.

5.1 Differential expression

5.1.1 With respect to zymodeme attribution

TODO: Do this with and without sva and compare the results.

zy_expt <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
## subset_expt(): There were 47, now there are 25 samples.
zy_norm <- normalize_expt(zy_expt, filter = TRUE, convert = "cpm", norm = "quant")
## Removing 167 low-count genes (8543 remaining).
zy_de_nobatch <- sm(all_pairwise(zy_expt, filter = TRUE, model_batch = "svaseq"))
zy_de <- sm(all_pairwise(zy_expt, filter = TRUE, model_batch = "svaseq"))
zy_table <- sm(combine_de_tables(zy_de, excel = glue::glue("excel/zy_tables-v{ver}.xlsx")))
zy_sig <- sm(extract_significant_genes(zy_table, excel = glue::glue("excel/zy_sig-v{ver}.xlsx")))

5.1.2 Images of zymodeme DE

zy_table[["plots"]][["z23_vs_z22"]][["deseq_ma_plots"]][["plot"]]

5.2 With respect to cure/failure

In contrast, we can search for genes which are differentially expressed with respect to cure/failure status.

cf_de <- sm(all_pairwise(cf_expt, filter = TRUE, model_batch = "svaseq"))
cf_table <- sm(combine_de_tables(cf_de, excel = glue::glue("excel/cf_tables-v{ver}.xlsx")))
cf_sig <- sm(extract_significant_genes(cf_table, excel = glue::glue("excel/cf_sig-v{ver}.xlsx")))

5.3 With respect to susceptibility

Finally, we can use our category of susceptibility and look for genes which change from sensitive to resistant. Keep in mind, though, that for the moment we have a lot of ambiguous and unknown strains.

sus_de <- sm(all_pairwise(sus_expt, filter = TRUE, model_batch = "svaseq"))
sus_table <- sm(combine_de_tables(sus_de, excel = glue::glue("excel/sus_tables-v{ver}.xlsx")))
sus_sig <- sm(extract_significant_genes(sus_table, excel = glue::glue("excel/sus_sig-v{ver}.xlsx")))
knitr::kable(head(sus_sig$deseq$ups$sensitive_vs_resistant, n = 20))

knitr::kable(head(sus_sig$deseq$downs$sensitive_vs_resistant, n = 20))

sus_ma <- sus_table[["plots"]][["sensitive_vs_resistant"]][["deseq_ma_plots"]][["plot"]]
pp(file = "images/sus_ma.png", image = sus_ma

## test <- ggplt(sus_ma)
## Error: <text>:9:0: unexpected end of input
## 7: 
## 8: ## test <- ggplt(sus_ma)
##   ^

5.4 Ontology searches

Now let us look for ontology categories which are increased in the 2.3 samples followed by the 2.2 samples.

## Gene categories more represented in the 2.3 group.
zy_go_up <- sm(simple_goseq(sig_genes = zy_sig[["deseq"]][["ups"]][[1]],
                            go_db = lp_go, length_db = lp_lengths))

## Gene categories more represented in the 2.2 group.
zy_go_down <- sm(simple_goseq(sig_genes = zy_sig[["deseq"]][["downs"]][[1]],
                              go_db = lp_go, length_db = lp_lengths))

5.4.1 A couple plots from the differential expression

5.4.1.1 Number of genes in agreement among DE methods, 2.3 more than 2.2

In the function ‘combined_de_tables()’ above, one of the tasks performed is to look at the agreement among DESeq2, limma, and edgeR. The following show a couple of these for the set of genes observed with a fold-change >= |2| and adjusted p-value <= 0.05.

zy_table[["venns"]][[1]][["p_lfc1"]][["up_noweight"]]

5.4.1.2 Number of genes in agreement among DE methods, 2.2 more than 2.3

zy_table[["venns"]][[1]][["p_lfc1"]][["down_noweight"]]

5.4.1.3 goseq ontology plots of groups of genes, 2.3 more than 2.2

zy_go_up$pvalue_plots$bpp_plot_over

5.4.1.4 goseq ontology plots of groups of genes, 2.2 more than 2.3

zy_go_down$pvalue_plots$bpp_plot_over

5.5 Zymodeme enzyme gene IDs

Najib read me an email listing off the gene names associated with the zymodeme classification. I took those names and cross referenced them against the Leishmania panamensis gene annotations and found the following:

They are:

  1. ALAT: LPAL13_120010900 – alanine aminotransferase
  2. ASAT: LPAL13_340013000 – aspartate aminotransferase
  3. G6PD: LPAL13_000054100 – glucase-6-phosphate 1-dehydrogenase
  4. NH: LPAL13_14006100, LPAL13_180018500 – inosine-guanine nucleoside hydrolase
  5. MPI: LPAL13_320022300 (maybe) – mannose phosphate isomerase (I chose phosphomannose isomerase)

Given these 6 gene IDs (NH has two gene IDs associated with it), I can do some looking for specific differences among the various samples.

5.5.1 Expression levels of zymodeme genes

The following creates a colorspace (red to green) heatmap showing the observed expression of these genes in every sample.

my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
              "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
              "other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")

zymo_expt <- exclude_genes_expt(zy_norm, ids = my_genes, method = "keep")
## Before removal, there were 8543 genes, now there are 6.
## There are 25 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20005 TMRC20039 TMRC20037 TMRC20038 TMRC20041 TMRC20015 TMRC20009 
##    0.1310    0.1318    0.1299    0.1100    0.1128    0.1179    0.1146    0.1135 
## TMRC20010 TMRC20016 TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018 
##    0.1098    0.1059    0.1101    0.1205    0.1205    0.1063    0.1089    0.1144 
## TMRC20021 TMRC20022 TMRC20053 TMRC20052 TMRC20064 TMRC20051 TMRC20050 TMRC20062 
##    0.1061    0.1305    0.1182    0.1104    0.1138    0.1280    0.1151    0.1283 
## TMRC20054 
##    0.1276
zymo_heatmap <- plot_sample_heatmap(zymo_expt, row_label = my_names)
zymo_heatmap

5.6 Empirically observed Zymodeme genes from differential expression analysis

In contrast, the following plots take the set of genes which are shared among all differential expression methods (|lfc| >= 1.0 and adjp <= 0.05) and use them to make categories of genes which are increased in 2.3 or 2.2.

shared_zymo <- intersect_significant(zy_table)
## Deleting the file excel/intersect_significant.xlsx before writing the tables.
up_shared <- shared_zymo[["ups"]][[1]][["data"]][["all"]]
rownames(up_shared)
##  [1] "LPAL13_000033300" "LPAL13_000012000" "LPAL13_310031300" "LPAL13_000038400"
##  [5] "LPAL13_000038500" "LPAL13_000012100" "LPAL13_340039600" "LPAL13_050005000"
##  [9] "LPAL13_310031000" "LPAL13_310039200" "LPAL13_210015500" "LPAL13_350063000"
## [13] "LPAL13_140019300" "LPAL13_270034100" "LPAL13_340039700" "LPAL13_180013900"
## [17] "LPAL13_170015400" "LPAL13_350013200" "LPAL13_330021800" "LPAL13_140019100"
## [21] "LPAL13_240009700" "LPAL13_330021900" "LPAL13_140019200" "LPAL13_000052700"
## [25] "LPAL13_250025700" "LPAL13_350073200" "LPAL13_310028500" "LPAL13_320038700"
## [29] "LPAL13_210005000" "LPAL13_300031600" "LPAL13_110015700" "LPAL13_000045100"
## [33] "LPAL13_230011200" "LPAL13_040007800" "LPAL13_230011400" "LPAL13_290016200"
## [37] "LPAL13_310032500" "LPAL13_230011500" "LPAL13_140019400" "LPAL13_000010600"
## [41] "LPAL13_100005800"
upshared_expt <- exclude_genes_expt(zy_norm, ids = rownames(up_shared), method = "keep")
## Before removal, there were 8543 genes, now there are 41.
## There are 25 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20005 TMRC20039 TMRC20037 TMRC20038 TMRC20041 TMRC20015 TMRC20009 
##    0.4245    0.1639    0.2311    0.6340    0.7125    0.2043    0.5332    0.1935 
## TMRC20010 TMRC20016 TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018 
##    0.4926    0.4036    0.2015    0.1613    0.4721    0.2591    0.2134    0.4470 
## TMRC20021 TMRC20022 TMRC20053 TMRC20052 TMRC20064 TMRC20051 TMRC20050 TMRC20062 
##    0.5088    0.1852    0.2650    0.5964    0.5949    0.8250    0.2773    0.8369 
## TMRC20054 
##    0.7197

We can plot a quick heatmap to get a sense of the differences observed between the genes which are different between the two zymodemes.

5.6.1 Heatmap of zymodeme gene expression increased in 2.3 vs. 2.2

high_23_heatmap <- plot_sample_heatmap(upshared_expt, row_label = rownames(up_shared))
high_23_heatmap

5.6.2 Heatmap of zymodeme gene expression increased in 2.2 vs. 2.3

down_shared <- shared_zymo[["downs"]][[1]][["data"]][["all"]]
downshared_expt <- exclude_genes_expt(zy_norm, ids = rownames(down_shared), method = "keep")
## Before removal, there were 8543 genes, now there are 63.
## There are 25 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20005 TMRC20039 TMRC20037 TMRC20038 TMRC20041 TMRC20015 TMRC20009 
##    0.2181    0.6764    0.6475    0.1938    0.1849    0.6785    0.1786    0.6274 
## TMRC20010 TMRC20016 TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018 
##    0.1628    0.2041    0.5594    0.5529    0.1608    0.6469    0.6368    0.1568 
## TMRC20021 TMRC20022 TMRC20053 TMRC20052 TMRC20064 TMRC20051 TMRC20050 TMRC20062 
##    0.1565    0.6738    0.5544    0.1747    0.1908    0.1781    0.6052    0.1779 
## TMRC20054 
##    0.1921
high_22_heatmap <- plot_sample_heatmap(downshared_expt, row_label = rownames(down_shared))
high_22_heatmap

6 SNP profiles

Now I will combine our previous samples and our new samples in the hopes of finding variant positions which help elucidate currently unknown aspects of either group via their clustering to known samples from the other group. In other words, we do not know the zymodeme annotations for the old samples nor the strain identities (or the shortcut ‘chronic vs. self-healing’) for the new samples. I hope to make educated guesses given the variant profiles. There are some differences in how the previous and current data sets were analyzed (though I have since redone the old samples so it should be trivial to remove those differences now).

I added our 2016 data to a specific TMRC2 sample sheet, dated 20191203. Thus I will load the data here. That previous data was mapped using tophat, so I will also need to make some changes to the gene names to accomodate the two mappings.

old_expt <- sm(create_expt("sample_sheets/tmrc2_samples_20191203.xlsx",
                           file_column = "tophat2file"))

tt <- lp_expt[["expressionset"]]
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.E1$", replacement = "", x = rownames(tt))
lp_expt$expressionset <- tt

tt <- old_expt$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
old_expt$expressionset <- tt
rm(tt)

6.1 Create the SNP expressionset

One other important caveat, we have a group of new samples which have not yet run through the variant search pipeline, so I need to remove them from consideration. Though it looks like they finished overnight…

## The next line drops the samples which are missing the SNP pipeline.
lp_snp <- subset_expt(lp_expt, subset="!is.na(pData(lp_expt)[['bcftable']])")
## subset_expt(): There were 47, now there are 46 samples.
new_snps <- sm(count_expt_snps(lp_snp, annot_column = "bcftable"))
old_snps <- sm(count_expt_snps(old_expt, annot_column = "bcftable", snp_column = 2))

both_snps <- combine_expts(new_snps, old_snps)
both_norm <- sm(normalize_expt(both_snps, transform = "log2", convert = "cpm", filter = TRUE))

## strains <- both_norm[["design"]][["strain"]]
both_strain <- set_expt_conditions(both_norm, fact = "strain")

The data structure ‘both_norm’ now contains our 2016 data along with the newer data collected since 2019.

6.2 Plot of SNP profiles for zymodemes

The following plot shows the SNP profiles of all samples (old and new) where the colors at the top show either the 2.2 strains (orange), 2.3 strains (green), the previous samples (purple), or the various lab strains (pink etc).

old_new_variant_heatmap <- plot_disheat(both_norm)
pp(file = "images/raw_snp_disheat.png", image = old_new_variant_heatmap,
   height = 12, width = 12)

The function get_snp_sets() takes the provided metadata factor (in this case ‘condition’) and looks for variants which are exclusive to each element in it. In this case, this is looking for differences between 2.2 and 2.3, as well as the set shared among them.

snp_sets <- get_snp_sets(both_snps, factor = "condition")
## The factor z2.3 has 14 rows.
## The factor z2.2 has 11 rows.
## The factor unknown has 21 rows.
## The factor sh has 13 rows.
## The factor chr has 14 rows.
## The factor inf has 6 rows.
## Iterating over 727 elements.
both_expt <- combine_expts(lp_expt, old_expt)

snp_genes <- sm(snps_vs_genes(both_expt, snp_sets, expt_name_col = "chromosome"))
## I think we have some metrics here we can plot...
snp_subset <- sm(snp_subset_genes(
  both_expt, both_snps,
  genes = c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
            "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300")))
zymo_heat <- plot_sample_heatmap(snp_subset, row_label = rownames(exprs(snp_subset)))
zymo_heat

Didn’t I create a set of densities by chromosome? Oh I think they come in from get_snp_sets()

6.3 SNPS associated with clinical response in the TMRC samples

clinical_sets <- get_snp_sets(new_snps, factor = "clinicalresponse")
## The factor cure has 17 rows.
## The factor failure has 15 rows.
## The factor laboratory line has only 1 row.
## The factor nd has 3 rows.
## The factor reference strain has 3 rows.
## The factor unknown has 7 rows.
## Iterating over 693 elements.
density_vec <- clinical_sets[["density"]]
chromosome_idx <- grep(pattern = "LpaL", x = names(density_vec))
density_df <- as.data.frame(density_vec[chromosome_idx])
density_df[["chr"]] <- rownames(density_df)
colnames(density_df) <- c("density_vec", "chr")
ggplot(density_df, aes_string(x = "chr", y = "density_vec")) +
  ggplot2::geom_col() +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 10, colour = "black"),
                 axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))

## clinical_written <- write_variants(new_snps)

6.3.1 Cross reference these variants by gene

clinical_genes <- sm(snps_vs_genes(lp_expt, clinical_sets, expt_name_col = "chromosome"))

snp_density <- merge(as.data.frame(clinical_genes[["summary_by_gene"]]),
                     as.data.frame(fData(lp_expt)),
                     by = "row.names")
snp_density <- snp_density[, c(1, 2, 4, 15)]
colnames(snp_density) <- c("name", "snps", "product", "length")
snp_density[["product"]] <- tolower(snp_density[["product"]])
snp_density[["length"]] <- as.numeric(snp_density[["length"]])
snp_density[["density"]] <- snp_density[["snps"]] / snp_density[["length"]]
snp_idx <- order(snp_density[["density"]], decreasing = TRUE)
snp_density <- snp_density[snp_idx, ]

removers <- c("amastin", "gp63", "leishmanolysin")
for (r in removers) {
  drop_idx <- grepl(pattern = r, x = snp_density[["product"]])
  snp_density <- snp_density[!drop_idx, ]
}
## Filter these for [A|a]mastin gp63 Leishmanolysin
clinical_snps <- snps_intersections(lp_expt, clinical_sets, chr_column = "chromosome")

as.data.frame(clinical_snps[["inters"]][["failure, reference strain"]])
##                                       seqnames  start    end width strand
## chr_LpaL13-10_pos_233490_ref_C_alt_G LpaL13-10 233490 233491     2      +
## chr_LpaL13-15_pos_42885_ref_A_alt_G  LpaL13-15  42885  42886     2      +
## chr_LpaL13-24_pos_163196_ref_C_alt_A LpaL13-24 163196 163197     2      +
## chr_LpaL13-31_pos_852703_ref_C_alt_A LpaL13-31 852703 852704     2      +
as.data.frame(clinical_snps[["inters"]][["cure"]])
##                                           seqnames   start     end width strand
## chr_LpaL13-01_pos_169299_ref_A_alt_G     LpaL13-01  169299  169300     2      +
## chr_LpaL13-08_pos_184791_ref_T_alt_A     LpaL13-08  184791  184792     2      +
## chr_LpaL13-10_pos_347757_ref_A_alt_C     LpaL13-10  347757  347758     2      +
## chr_LpaL13-11_pos_433123_ref_C_alt_T     LpaL13-11  433123  433124     2      +
## chr_LpaL13-15_pos_47170_ref_G_alt_C      LpaL13-15   47170   47171     2      +
## chr_LpaL13-16_pos_456493_ref_A_alt_G     LpaL13-16  456493  456494     2      +
## chr_LpaL13-20.1_pos_106634_ref_G_alt_A LpaL13-20.1  106634  106635     2      +
## chr_LpaL13-20.1_pos_112635_ref_A_alt_C LpaL13-20.1  112635  112636     2      +
## chr_LpaL13-20.1_pos_369935_ref_C_alt_T LpaL13-20.1  369935  369936     2      +
## chr_LpaL13-20.1_pos_370282_ref_C_alt_T LpaL13-20.1  370282  370283     2      +
## chr_LpaL13-20.1_pos_371356_ref_T_alt_C LpaL13-20.1  371356  371357     2      +
## chr_LpaL13-20.1_pos_380785_ref_A_alt_G LpaL13-20.1  380785  380786     2      +
## chr_LpaL13-20.1_pos_381107_ref_T_alt_C LpaL13-20.1  381107  381108     2      +
## chr_LpaL13-20.1_pos_382801_ref_A_alt_C LpaL13-20.1  382801  382802     2      +
## chr_LpaL13-20.1_pos_386522_ref_G_alt_A LpaL13-20.1  386522  386523     2      +
## chr_LpaL13-20.1_pos_386926_ref_G_alt_A LpaL13-20.1  386926  386927     2      +
## chr_LpaL13-20.1_pos_390908_ref_G_alt_A LpaL13-20.1  390908  390909     2      +
## chr_LpaL13-20.1_pos_391058_ref_C_alt_A LpaL13-20.1  391058  391059     2      +
## chr_LpaL13-20.1_pos_395411_ref_C_alt_G LpaL13-20.1  395411  395412     2      +
## chr_LpaL13-20.1_pos_412461_ref_C_alt_T LpaL13-20.1  412461  412462     2      +
## chr_LpaL13-20.1_pos_418289_ref_G_alt_A LpaL13-20.1  418289  418290     2      +
## chr_LpaL13-20.1_pos_433900_ref_C_alt_A LpaL13-20.1  433900  433901     2      +
## chr_LpaL13-20.1_pos_441730_ref_G_alt_C LpaL13-20.1  441730  441731     2      +
## chr_LpaL13-20.1_pos_455242_ref_G_alt_A LpaL13-20.1  455242  455243     2      +
## chr_LpaL13-20.1_pos_455533_ref_G_alt_C LpaL13-20.1  455533  455534     2      +
## chr_LpaL13-20.1_pos_460767_ref_T_alt_C LpaL13-20.1  460767  460768     2      +
## chr_LpaL13-20.1_pos_461944_ref_C_alt_T LpaL13-20.1  461944  461945     2      +
## chr_LpaL13-20.1_pos_465405_ref_T_alt_C LpaL13-20.1  465405  465406     2      +
## chr_LpaL13-20.1_pos_465754_ref_G_alt_A LpaL13-20.1  465754  465755     2      +
## chr_LpaL13-20.1_pos_465865_ref_G_alt_A LpaL13-20.1  465865  465866     2      +
## chr_LpaL13-20.1_pos_467343_ref_C_alt_T LpaL13-20.1  467343  467344     2      +
## chr_LpaL13-20.1_pos_534889_ref_C_alt_T LpaL13-20.1  534889  534890     2      +
## chr_LpaL13-20.1_pos_535544_ref_G_alt_A LpaL13-20.1  535544  535545     2      +
## chr_LpaL13-20.1_pos_537604_ref_T_alt_A LpaL13-20.1  537604  537605     2      +
## chr_LpaL13-20.1_pos_537764_ref_G_alt_A LpaL13-20.1  537764  537765     2      +
## chr_LpaL13-23_pos_296439_ref_A_alt_G     LpaL13-23  296439  296440     2      +
## chr_LpaL13-23_pos_296880_ref_C_alt_T     LpaL13-23  296880  296881     2      +
## chr_LpaL13-23_pos_296937_ref_G_alt_A     LpaL13-23  296937  296938     2      +
## chr_LpaL13-31_pos_1188862_ref_A_alt_G    LpaL13-31 1188862 1188863     2      +
## chr_LpaL13-31_pos_125653_ref_C_alt_T     LpaL13-31  125653  125654     2      +
## chr_LpaL13-33_pos_293184_ref_G_alt_A     LpaL13-33  293184  293185     2      +
head(clinical_snps[["gene_summaries"]][["failure, reference strain"]])
## LPAL13_100011200 LPAL13_150006200 LPAL13_240010100 LPAL13_310025800 
##                1                1                1                1 
## LPAL13_000005000 LPAL13_000005400 
##                0                0
head(clinical_snps[["gene_summaries"]][["cure"]], n = 30)
## LPAL13_200017900 LPAL13_200014600 LPAL13_230015000 LPAL13_200014900 
##                4                3                3                2 
## LPAL13_200015100 LPAL13_200015200 LPAL13_200017600 LPAL13_200017800 
##                2                2                2                2 
## LPAL13_200019500 LPAL13_200019600 LPAL13_010010900 LPAL13_080009800 
##                2                2                1                1 
## LPAL13_100014700 LPAL13_110015500 LPAL13_150006300 LPAL13_160017600 
##                1                1                1                1 
## LPAL13_200008300 LPAL13_200008400 LPAL13_200015000 LPAL13_200015300 
##                1                1                1                1 
## LPAL13_200016400 LPAL13_200016500 LPAL13_200016900 LPAL13_200017200 
##                1                1                1                1 
## LPAL13_310008900 LPAL13_310034900 LPAL13_330014300 LPAL13_000005000 
##                1                1                1                0 
## LPAL13_000005400 LPAL13_000005500 
##                0                0
annot <- fData(lp_expt)
clinical_interest <- as.data.frame(clinical_snps[["gene_summaries"]][["cure"]])
clinical_interest <- merge(clinical_interest,
                           as.data.frame(clinical_snps[["gene_summaries"]][["failure, reference strain"]]),
                           by = "row.names")
rownames(clinical_interest) <- clinical_interest[["Row.names"]]
clinical_interest[["Row.names"]] <- NULL
colnames(clinical_interest) <- c("cure_snps","fail_snps")
annot <- merge(annot, clinical_interest, by = "row.names")
rownames(annot) <- annot[["Row.names"]]
annot[["Row.names"]] <- NULL
fData(lp_expt$expressionset) <- annot

7 Zymodeme for new samples

The heatmap produced here should show the variants only for the zymodeme genes.

7.1 Hunt for snp clusters

I am thinking that if we find clusters of locations which are variant, that might provide some PCR testing possibilities.

new_sets <- get_snp_sets(new_snps, factor = "phenotypiccharacteristics")
## The factor 22 has 11 rows.
## The factor 23 has 14 rows.
## The factor laboratory line has only 1 row.
## The factor notapplicable has 17 rows.
## The factor reference strain has 3 rows.
## Iterating over 693 elements.
summary(new_sets)
##               Length Class      Mode     
## medians         6    data.frame list     
## possibilities   5    -none-     character
## intersections  31    -none-     list     
## chr_data      693    -none-     list     
## set_names      32    -none-     list     
## invert_names   32    -none-     list     
## density       693    -none-     numeric
## 1000000: 2.2
## 0100000: 2.3

summary(new_sets[["intersections"]][["10000"]])
##    Length     Class      Mode 
##       511 character character
summary(new_sets[["intersections"]][["01000"]])
##    Length     Class      Mode 
##     49790 character character

Thus we see that there are 511 variants associated with 2.2 and 49,790 associated with 2.3.

7.1.1 A small function for searching for potential PCR primers

The following function uses the positional data to look for sequential mismatches associated with zymodeme in the hopes that there will be some regions which would provide good potential targets for a PCR-based assay.

sequential_variants <- function(snp_sets, conditions = NULL, minimum = 3, maximum_separation = 3) {
  if (is.null(conditions)) {
    conditions <- 1
  }
  intersection_sets <- snp_sets[["intersections"]]
  intersection_names <- snp_sets[["set_names"]]
  chosen_intersection <- 1
  if (is.numeric(conditions)) {
    chosen_intersection <- conditions
  } else {
    intersection_idx <- intersection_names == conditions
    chosen_intersection <- names(intersection_names)[intersection_idx]
  }

  possible_positions <- intersection_sets[[chosen_intersection]]
  position_table <- data.frame(row.names = possible_positions)
  pat <- "^chr_(.+)_pos_(.+)_ref_.*$"
  position_table[["chr"]] <- gsub(pattern = pat, replacement = "\\1", x = rownames(position_table))
  position_table[["pos"]] <- as.numeric(gsub(pattern = pat, replacement = "\\2", x = rownames(position_table)))
  position_idx <- order(position_table[, "chr"], position_table[, "pos"])
  position_table <- position_table[position_idx, ]
  position_table[["dist"]] <- 0

  last_chr <- ""
  for (r in 1:nrow(position_table)) {
    this_chr <- position_table[r, "chr"]
    if (r == 1) {
      position_table[r, "dist"] <- position_table[r, "pos"]
      last_chr <- this_chr
      next
    }
    if (this_chr == last_chr) {
      position_table[r, "dist"] <- position_table[r, "pos"] - position_table[r - 1, "pos"]
    } else {
      position_table[r, "dist"] <- position_table[r, "pos"]
    }
    last_chr <- this_chr
  }

  sequentials <- position_table[["dist"]] <= maximum_separation
  message("There are ", sum(sequentials), " candidate regions.")

  ## The following can tell me how many runs of each length occurred, that is not quite what I want.
  ## Now use run length encoding to find the set of sequential sequentials!
  rle_result <- rle(sequentials)
  rle_values <- rle_result[["values"]]
  ## The following line is equivalent to just leaving values alone:
  ## true_values <- rle_result[["values"]] == TRUE
  rle_lengths <- rle_result[["lengths"]]
  true_sequentials <- rle_lengths[rle_values]
  rle_idx <- cumsum(rle_lengths)[which(rle_values)]

  position_table[["last_sequential"]] <- 0
  count <- 0
  for (r in rle_idx) {
    count <- count + 1
    position_table[r, "last_sequential"] <- true_sequentials[count]
  }
  message("The maximum sequential set is: ", max(position_table[["last_sequential"]]), ".")

  wanted_idx <- position_table[["last_sequential"]] >= minimum
  wanted <- position_table[wanted_idx, c("chr", "pos")]
  return(wanted)
}

zymo22_sequentials <- sequential_variants(new_sets, conditions = "22")
## There are 75 candidate regions.
## The maximum sequential set is: 3.
dim(zymo22_sequentials)
## [1] 7 2
## 7 candidate regions for zymodeme 2.2 -- thus I am betting that the reference strain is a 2.2
zymo23_sequentials <- sequential_variants(new_sets, conditions = "23",
                                          minimum = 1, maximum_separation = 3)
## There are 587 candidate regions.
## The maximum sequential set is: 1.
dim(zymo23_sequentials)
## [1] 587   2
## In contrast, there are lots (587) of interesting regions for 2.3!

7.2 Make a heatmap describing the clustering of variants

We can cross reference the variants against the zymodeme status and plot a heatmap of the results and hopefully see how they separate.

snp_genes <- sm(snps_vs_genes(lp_expt, new_sets, expt_name_col = "chromosome"))
new_zymo_norm  <- normalize_expt(new_snps, filter = TRUE, convert = "cpm", norm = "quant", transform = TRUE)
## Removing 0 low-count genes (558524 remaining).
## transform_counts: Found 11978651 values equal to 0, adding 1 to the matrix.
new_zymo_norm <- set_expt_conditions(new_zymo_norm, fact = "phenotypiccharacteristics")

zymo_heat <- plot_disheat(new_zymo_norm)
zymo_heat[["plot"]]

7.2.1 Annotated heatmap of variants

Now let us try to make a heatmap which includes some of the annotation data.

des <- both_norm[["design"]]
undef_idx <- is.na(des[["strain"]])
des[undef_idx, "strain"] <- "unknown"

##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(both_norm))

zymo_missing_idx <- is.na(des[["phenotypiccharacteristics"]])
des[["phenotypiccharacteristics"]] <- as.character(des[["phenotypiccharacteristics"]])
des[["clinicalcategorical"]] <- as.character(des[["clinicalcategorical"]])
des[zymo_missing_idx, "phenotypiccharacteristics"] <- "unknown"
mydendro <- list(
  "clustfun" = hclust,
  "lwd" = 2.0)
col_data <- as.data.frame(des[, c("phenotypiccharacteristics", "clinicalcategorical")])

unknown_clinical <- is.na(col_data[["clinicalcategorical"]])
row_data <- as.data.frame(des[, c("strain")])
colnames(col_data) <- c("zymodeme", "outcome")
col_data[unknown_clinical, "outcome"] <- "undefined"

colnames(row_data) <- c("strain")
myannot <- list(
  "Col" = list("data" = col_data),
  "Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
                "col" = BrewerClusterCol)
mylabs <- list(
  "Row" = list("nrow" = 4),
  "Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(240)
map1 <- annHeatmap2(
  correlations,
  dendrogram = mydendro,
  annotation = myannot,
  cluster = myclust,
  labels = mylabs,
  ## The following controls if the picture is symmetric
  scale = "none",
  col = hmcols)
## Warning in breakColors(breaks, col): more colors than classes: ignoring 29 last
## colors
pp(file = "images/dendro_heatmap.png", image = map1, height = 20, width = 20)
## annotated Heatmap
## 
## Rows: 'dendrogram' with 2 branches and 79 members total, at height 5.258 
##   11  annotation variable(s)
## Cols: 'dendrogram' with 2 branches and 79 members total, at height 5.258 
##   10  annotation variable(s)

Print the larger heatmap so that all the labels appear. Keep in mind that as we get more samples, this image needs to continue getting bigger.

big heatmap

pheno <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
## subset_expt(): There were 47, now there are 25 samples.
pheno <- subset_expt(pheno, subset="!is.na(pData(pheno)[['bcftable']])")
## subset_expt(): There were 25, now there are 25 samples.
pheno_snps <- sm(count_expt_snps(pheno, annot_column = "bcftable"))

xref_prop <- table(pheno_snps[["conditions"]])
pheno_snps$conditions
##  [1] "z2.3" "z2.2" "z2.2" "z2.3" "z2.3" "z2.2" "z2.3" "z2.2" "z2.3" "z2.3"
## [11] "z2.2" "z2.2" "z2.3" "z2.2" "z2.2" "z2.3" "z2.3" "z2.2" "z2.2" "z2.3"
## [21] "z2.3" "z2.3" "z2.2" "z2.3" "z2.3"
idx_tbl <- exprs(pheno_snps) > 5
new_tbl <- data.frame(row.names = rownames(exprs(pheno_snps)))
for (n in names(xref_prop)) {
  new_tbl[[n]] <- 0
  idx_cols <- which(pheno_snps[["conditions"]] == n)
  prop_col <- rowSums(idx_tbl[, idx_cols]) / xref_prop[n]
  new_tbl[n] <- prop_col
}
keepers <- grepl(x = rownames(new_tbl), pattern = "LpaL13")
new_tbl <- new_tbl[keepers, ]
new_tbl[["strong22"]] <- 1.001 - new_tbl[["z2.2"]]
new_tbl[["strong23"]] <- 1.001 - new_tbl[["z2.3"]]
s22_na <- new_tbl[["strong22"]] > 1
new_tbl[s22_na, "strong22"] <- 1
s23_na <- new_tbl[["strong23"]] > 1
new_tbl[s23_na, "strong23"] <- 1

new_tbl[["SNP"]] <- rownames(new_tbl)
new_tbl[["Chromosome"]] <- gsub(x = new_tbl[["SNP"]], pattern = "chr_(.*)_pos_.*", replacement = "\\1")
new_tbl[["Position"]] <- gsub(x = new_tbl[["SNP"]], pattern = ".*_pos_(\\d+)_.*", replacement = "\\1")
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "strong22", "strong23")]


library(CMplot)
## Much appreciate for using CMplot.
## Full description, Bug report, Suggestion and the latest codes:
## https://github.com/YinLiLin/CMplot
CMplot(new_tbl, bin.size = 100000)
##  SNP-Density Plotting.
##  Circular-Manhattan Plotting strong22.
##  Circular-Manhattan Plotting strong23.
##  Rectangular-Manhattan Plotting strong22.
##  Rectangular-Manhattan Plotting strong23.
##  QQ Plotting strong22.
##  QQ Plotting strong23.
##  Plots are stored in: /mnt/cbcb/fs01_abelew/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019
CMplot(new_tbl, plot.type="m", multracks=TRUE, threshold = c(0.01, 0.05),
       threshold.lwd=c(1,1), threshold.col=c("black","grey"),
       amplify=TRUE, bin.size=1e5,
       chr.den.col=c("darkgreen", "yellow", "red"),
       signal.col=c("red", "green", "blue"),
       signal.cex=1, file="jpg", memo="", dpi=300, file.output=TRUE, verbose=TRUE)
##  Multracks-Manhattan Plotting strong22.
##  Multracks-Manhattan Plotting strong23.
##  Multraits-Rectangular Plotting...(finished 78%)
 Multraits-Rectangular Plotting...(finished 79%)
 Multraits-Rectangular Plotting...(finished 80%)
 Multraits-Rectangular Plotting...(finished 81%)
 Multraits-Rectangular Plotting...(finished 82%)
 Multraits-Rectangular Plotting...(finished 83%)
 Multraits-Rectangular Plotting...(finished 84%)
 Multraits-Rectangular Plotting...(finished 85%)
 Multraits-Rectangular Plotting...(finished 86%)
 Multraits-Rectangular Plotting...(finished 87%)
 Multraits-Rectangular Plotting...(finished 88%)
 Multraits-Rectangular Plotting...(finished 89%)
 Multraits-Rectangular Plotting...(finished 90%)
 Multraits-Rectangular Plotting...(finished 91%)
 Multraits-Rectangular Plotting...(finished 92%)
 Multraits-Rectangular Plotting...(finished 93%)
 Multraits-Rectangular Plotting...(finished 94%)
 Multraits-Rectangular Plotting...(finished 95%)
 Multraits-Rectangular Plotting...(finished 96%)
 Multraits-Rectangular Plotting...(finished 97%)
 Multraits-Rectangular Plotting...(finished 98%)
 Multraits-Rectangular Plotting...(finished 99%)
 Multraits-Rectangular Plotting...(finished 100%)
##  Plots are stored in: /mnt/cbcb/fs01_abelew/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019

SNP Density Circular Manhattan Rectangular Manhattan QQ

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  message(paste0("Saving to ", savefile))
  tmp <- sm(saveme(filename = savefile))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 72947fcc6afe09da22d71967059edd84e3063341
## This is hpgltools commit: Tue Jun 1 15:57:56 2021 -0400: 72947fcc6afe09da22d71967059edd84e3063341
## Saving to tmrc2_02sample_estimation_v202106.rda.xz
tmp <- loadme(filename = savefile)
---
title: "TMRC2 Comprehensive Data Analysis: 202106"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include = FALSE}
library(hpgltools)
tt <- sm(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(progress = TRUE,
                     verbose = TRUE,
                     width = 90,
                     echo = TRUE)
knitr::opts_chunk$set(error = TRUE,
                      fig.width = 8,
                      fig.height = 8,
                      dpi = 96)
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "202106"
rundate <- format(Sys.Date(), format = "%Y%m%d")

## tmp <- try(sm(loadme(filename = gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = previous_file))))
rmd_file <- glue::glue("tmrc2_02sample_estimation_v{ver}.Rmd")
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)

library(Heatplus)
```

# Introduction

This document is intended to provide a general overview of the TMRC2 samples
which have thus far been sequenced.  In some cases, this includes only those
samples starting in 2019; in other instances I am including our previous
(2015-2016) samples.

In all cases the processing performed was:

1.  Default trimming was performed.
2.  Hisat2 was used to map the remaining reads against the Leishmania
    panamensis genome revision 36.
3.  The alignments from hisat2 were used to count reads/gene against the
    revision 36 annotations with htseq.
4.  These alignments were also passed to the pileup functionality of samtools
    and the vcf/bcf utilities in order to make a matrix of all observed
    differences between each sample with respect to the reference.

The analyses in this document use the matrices of counts/gene from #3 and
variants/position from #4 in order to provide some images and metrics describing
the samples we have sequenced so far.

# Annotations

Everything which follows depends on the Existing TriTrypDB annotations revision
46, circa 2019.  The following block loads a database of these annotations and
turns it into a matrix where the rows are genes and columns are all the
annotation types provided by TriTrypDB.

The same database was used to create a matrix of orthologous genes between
L.panamensis and all of the other species in the TriTrypDB.

```{r annot}
tt <- sm(library(EuPathDB))
tt <- sm(library(org.Lpanamensis.MHOMCOL81L13.v46.eg.db))
pan_db <- org.Lpanamensis.MHOMCOL81L13.v46.eg.db
all_fields <- columns(pan_db)

all_lp_annot <- sm(load_orgdb_annotations(
    pan_db,
    keytype = "gid",
    fields = c("annot_gene_entrez_id", "annot_gene_name",
               "annot_strand", "annot_chromosome", "annot_cds_length",
               "annot_gene_product")))$genes

lp_go <- sm(load_orgdb_go(pan_db))
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths)  <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
orthos <- sm(EuPathDB::extract_eupath_orthologs(db = pan_db))

hisat_annot <- all_lp_annot
## rownames(hisat_annot) <- paste0("exon_", rownames(hisat_annot), ".E1")
```

# TODO:

Resequence samples: TMRC20002, TMRC20006, TMRC20004 (maybe TMRC20008 and TMRC20029)

# Generate Expressionsets and Sample Estimation

The process of sample estimation takes two primary inputs:

1.  The sample sheet, which contains all the metadata we currently have on hand,
    including filenames for the outputs of #3 and #4 above.
2.  The gene annotations.

An expressionset is a data structure used in R to examine RNASeq data.  It
is comprised of annotations, metadata, and expression data.  In the case of our
processing pipeline, the location of the expression data is provided by the
filenames in the metadata.

The first lines of the following block create the Expressionset.  All of the
following lines perform various normalizations and generate plots from it.

## Notes

The following samples are much lower coverage:

* TMRC20002
* TMRC20006
* TMRC20007
* TMRC20008

## TODO:

1.  Do the multi-gene family removal right here instead of way down at the bottom
2.  Add zymodeme snps to the annotation later.
3.  Start phylogenetic analysis of variant table.


```{r new_samples_hisat}
sample_sheet <- glue::glue("sample_sheets/tmrc2_samples_20210601.xlsx")

sanitize_columns <- c("passagenumber", "clinicalresponse", "clinicalcategorical",
                      "zymodemecategorical", "phenotypiccharacteristics")
lp_expt <- sm(create_expt(sample_sheet,
                          gene_info = hisat_annot,
                          id_column = "hpglidentifier",
                          file_column = "lpanamensisv36hisatfile")) %>%
  set_expt_conditions(fact = "zymodemecategorical") %>%
  subset_expt(nonzero = 8600) %>%
  semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
                       semantic_column = "annot_gene_product") %>%
  sanitize_expt_metadata(columns = sanitize_columns) %>%
  set_expt_factors(columns = sanitize_columns, class = "factor")

libsizes <- plot_libsize(lp_expt)
pp(file = "images/lp_expt_libsizes.png", image = libsizes$plot, width = 12, height = 9)
## I think samples 7,10 should be removed at minimum, probably also 9,11
nonzero <- plot_nonzero(lp_expt)
nonzero$plot

lp_box <- plot_boxplot(lp_expt)
pp(file = "images/lp_expt_boxplot.png", image = lp_box, width = 12, height = 9)

filter_plot <- plot_libsize_prepost(lp_expt)
filter_plot$lowgene_plot
filter_plot$count_plot
```

## Distribution Visualization

Najib's favorite plots are of course the PCA/TNSE.  These are nice to look at in
order to get a sense of the relationships between samples.  They also provide a
good opportunity to see what happens when one applies different normalizations,
surrogate analyses, filters, etc.  In addition, one may set different
experimental factors as the primary 'condition' (usually the color of plots) and
surrogate 'batches'.

## By Susceptilibity

Column 'Q' in the sample sheet, make a categorical version of it with these parameters:

* 0 <= x <= 35 is resistant
* 36 <= x <= 48 is ambiguous
* 49 <= x is sensitive

```{r susceptibility}
starting <- as.numeric(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvhistoricaldata"]])
sus_categorical <- starting
na_idx <- is.na(starting)
sus_categorical[na_idx] <- "unknown"

resist_idx <- starting <= 0.35
sus_categorical[resist_idx] <- "resistant"
indeterminant_idx <- starting >= 0.36 & starting <= 0.48
sus_categorical[indeterminant_idx] <- "ambiguous"
susceptible_idx <- starting >= 0.49
sus_categorical[susceptible_idx] <- "sensitive"

pData(lp_expt$expressionset)[["sus_category"]] <- sus_categorical
```

```{r pre_questions}
clinical_samples <- lp_expt %>%
  set_expt_batches(fact = sus_categorical)

clinical_norm <- sm(normalize_expt(clinical_samples, norm = "quant", transform = "log2",
                                   convert = "cpm", batch = FALSE, filter = TRUE))
zymo_pca <- plot_pca(clinical_norm, plot_title = "PCA of parasite expression values")
pp(file = "images/zymo_pca_sus_shape.png", image = zymo_pca$plot)

zymo_3dpca <- plot_3d_pca(zymo_pca)
zymo_3dpca$plot

clinical_n <- sm(normalize_expt(clinical_samples, transform = "log2",
                                convert = "cpm", batch = FALSE, filter = TRUE))
zymo_tsne <- plot_tsne(clinical_n, plot_title = "TSNE of parasite expression values")
zymo_tsne$plot

clinical_nb <- normalize_expt(clinical_samples, convert = "cpm", transform = "log2",
                         filter = TRUE, batch = "svaseq")
clinical_nb_pca <- plot_pca(clinical_nb, plot_title = "PCA of parasite expression values")
pp(file = "images/clinical_nb_pca_sus_shape.png", image = clinical_nb_pca$plot)

clinical_nb_tsne <- plot_tsne(clinical_nb, plot_title = "TSNE of parasite expression values")
clinical_nb_tsne$plot

corheat <- plot_corheat(clinical_norm, plot_title = "Correlation heatmap of parasite
                 expression values
")
corheat$plot

plot_sm(clinical_norm)$plot
```

## By Cure/Fail status

```{r cf_status}
cf_expt <- set_expt_conditions(lp_expt, fact = "clinicalcategorical") %>%
  set_expt_batches(fact = sus_categorical)

cf_norm <- normalize_expt(cf_expt, convert = "cpm", transform = "log2",
                          norm = "quant", filter = TRUE)
start_cf <- plot_pca(cf_norm, plot_title = "PCA of parasite expression values")
pp(file = "images/cf_sus_shape.png", image = start_cf$plot)

cf_nb <- normalize_expt(cf_expt, convert = "cpm", transform = "log2",
                        norm = "quant", filter = TRUE, batch = "svaseq")
cf_nb_pca <- plot_pca(cf_nb, plot_title = "PCA of parasite expression values")
pp(file = "images/cf_sus_share_nb.png", image = cf_nb_pca$plot)

cf_norm <- normalize_expt(cf_expt, transform = "log2", convert = "cpm",
                          filter = TRUE, norm = "quant")

test <- pca_information(cf_norm,
                        expt_factors = c("clinicalcategorical", "zymodemecategorical",
                                         "pathogenstrain", "passagenumber"),
                        num_components = 6, plot_pcas = TRUE)
test$anova_p
test$cor_heatmap
```

```{r susceptibility_pca}
sus_expt <- set_expt_conditions(lp_expt, fact = "sus_category") %>%
  set_expt_batches(fact = "zymodemecategorical")

sus_norm <- normalize_expt(sus_expt, transform = "log2", convert = "cpm",
                           norm = "quant", filter = TRUE)
sus_pca <- plot_pca(sus_norm, plot_title = "PCA of parasite expression values")
sus_pca$plot

sus_nb <- normalize_expt(sus_expt, transform = "log2", convert = "cpm",
                         batch = "svaseq", filter = TRUE)
sus_nb_pca <- plot_pca(sus_nb, plot_title = "PCA of parasite expression values")
pp(file = "images/sus_nb_pca.png", image = sus_nb_pca$plot)
```

At this time, we do not have very many samples, so the set of metrics/plots is
fairly limited.  There is really only one factor in the metadata which we can
use for performing differential expression analyses, the 'zymodeme'.

# Zymodeme analyses

The following sections perform a series of analyses which seek to elucidate
differences between the zymodemes 2.2 and 2.3 either through differential
expression or variant profiles.

## Differential expression

### With respect to zymodeme attribution

TODO: Do this with and without sva and compare the results.

```{r zymo_de, fig.show = "hide"}
zy_expt <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
zy_norm <- normalize_expt(zy_expt, filter = TRUE, convert = "cpm", norm = "quant")
zy_de_nobatch <- sm(all_pairwise(zy_expt, filter = TRUE, model_batch = "svaseq"))
zy_de <- sm(all_pairwise(zy_expt, filter = TRUE, model_batch = "svaseq"))
zy_table <- sm(combine_de_tables(zy_de, excel = glue::glue("excel/zy_tables-v{ver}.xlsx")))
zy_sig <- sm(extract_significant_genes(zy_table, excel = glue::glue("excel/zy_sig-v{ver}.xlsx")))
```

### Images of zymodeme DE

```{r zymod_de_pictures}
zy_table[["plots"]][["z23_vs_z22"]][["deseq_ma_plots"]][["plot"]]
```

## With respect to cure/failure

In contrast, we can search for genes which are differentially
expressed with respect to cure/failure status.

```{r curefail_de, fig.show = "hide"}
cf_de <- sm(all_pairwise(cf_expt, filter = TRUE, model_batch = "svaseq"))
cf_table <- sm(combine_de_tables(cf_de, excel = glue::glue("excel/cf_tables-v{ver}.xlsx")))
cf_sig <- sm(extract_significant_genes(cf_table, excel = glue::glue("excel/cf_sig-v{ver}.xlsx")))
```

## With respect to susceptibility

Finally, we can use our category of susceptibility and look for genes
which change from sensitive to resistant.  Keep in mind, though, that
for the moment we have a lot of ambiguous and unknown strains.

```{r curefail_de, fig.show = "hide"}
sus_de <- sm(all_pairwise(sus_expt, filter = TRUE, model_batch = "svaseq"))
sus_table <- sm(combine_de_tables(sus_de, excel = glue::glue("excel/sus_tables-v{ver}.xlsx")))
sus_sig <- sm(extract_significant_genes(sus_table, excel = glue::glue("excel/sus_sig-v{ver}.xlsx")))
```

```{r zymod_de_pictures}
knitr::kable(head(sus_sig$deseq$ups$sensitive_vs_resistant, n = 20))

knitr::kable(head(sus_sig$deseq$downs$sensitive_vs_resistant, n = 20))

sus_ma <- sus_table[["plots"]][["sensitive_vs_resistant"]][["deseq_ma_plots"]][["plot"]]
pp(file = "images/sus_ma.png", image = sus_ma

## test <- ggplt(sus_ma)
```


## Ontology searches

Now let us look for ontology categories which are increased in the 2.3
samples followed by the 2.2 samples.

```{r go, sig.show = "hide"}
## Gene categories more represented in the 2.3 group.
zy_go_up <- sm(simple_goseq(sig_genes = zy_sig[["deseq"]][["ups"]][[1]],
                            go_db = lp_go, length_db = lp_lengths))

## Gene categories more represented in the 2.2 group.
zy_go_down <- sm(simple_goseq(sig_genes = zy_sig[["deseq"]][["downs"]][[1]],
                              go_db = lp_go, length_db = lp_lengths))
```

### A couple plots from the differential expression

#### Number of genes in agreement among DE methods, 2.3 more than 2.2

In the function 'combined_de_tables()' above, one of the tasks
performed is to look at the agreement among DESeq2, limma, and edgeR.
The following show a couple of these for the set of genes observed
with a fold-change >= |2| and adjusted p-value <= 0.05.

```{r de_plots}
zy_table[["venns"]][[1]][["p_lfc1"]][["up_noweight"]]
```

#### Number of genes in agreement among DE methods, 2.2 more than 2.3

```{r de_plots}
zy_table[["venns"]][[1]][["p_lfc1"]][["down_noweight"]]
```

#### goseq ontology plots of groups of genes, 2.3 more than 2.2

```{r goseq_up}
zy_go_up$pvalue_plots$bpp_plot_over
```

#### goseq ontology plots of groups of genes, 2.2 more than 2.3

```{r goseq_down}
zy_go_down$pvalue_plots$bpp_plot_over
```

## Zymodeme enzyme gene IDs

Najib read me an email listing off the gene names associated with the zymodeme
classification.  I took those names and cross referenced them against the
Leishmania panamensis gene annotations and found the following:

They are:

1. ALAT: LPAL13_120010900 -- alanine aminotransferase
2. ASAT: LPAL13_340013000 -- aspartate aminotransferase
3. G6PD: LPAL13_000054100 -- glucase-6-phosphate 1-dehydrogenase
4. NH: LPAL13_14006100, LPAL13_180018500 -- inosine-guanine nucleoside hydrolase
5. MPI: LPAL13_320022300 (maybe) -- mannose phosphate isomerase (I chose phosphomannose isomerase)

Given these 6 gene IDs (NH has two gene IDs associated with it), I can do some
looking for specific differences among the various samples.

### Expression levels of zymodeme genes

The following creates a colorspace (red to green) heatmap showing the observed
expression of these genes in every sample.

```{r zymodemes}
my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
              "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
              "other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")

zymo_expt <- exclude_genes_expt(zy_norm, ids = my_genes, method = "keep")
zymo_heatmap <- plot_sample_heatmap(zymo_expt, row_label = my_names)
zymo_heatmap
```

## Empirically observed Zymodeme genes from differential expression analysis

In contrast, the following plots take the set of genes which are shared among
all differential expression methods (|lfc| >= 1.0 and adjp <= 0.05) and use them
to make categories of genes which are increased in 2.3 or 2.2.

```{r zymodeme_genes_empirical}
shared_zymo <- intersect_significant(zy_table)
up_shared <- shared_zymo[["ups"]][[1]][["data"]][["all"]]
rownames(up_shared)
upshared_expt <- exclude_genes_expt(zy_norm, ids = rownames(up_shared), method = "keep")
```

We can plot a quick heatmap to get a sense of the differences observed
between the genes which are different between the two zymodemes.

### Heatmap of zymodeme gene expression increased in 2.3 vs. 2.2

```{r zymoempup}
high_23_heatmap <- plot_sample_heatmap(upshared_expt, row_label = rownames(up_shared))
high_23_heatmap
```

### Heatmap of zymodeme gene expression increased in 2.2 vs. 2.3

```{r zymoemdown}
down_shared <- shared_zymo[["downs"]][[1]][["data"]][["all"]]
downshared_expt <- exclude_genes_expt(zy_norm, ids = rownames(down_shared), method = "keep")
high_22_heatmap <- plot_sample_heatmap(downshared_expt, row_label = rownames(down_shared))
high_22_heatmap
```

# SNP profiles

Now I will combine our previous samples and our new samples in the
hopes of finding variant positions which help elucidate currently
unknown aspects of either group via their clustering to known samples
from the other group. In other words, we do not know the zymodeme
annotations for the old samples nor the strain identities (or the
shortcut 'chronic vs. self-healing') for the new samples. I hope to
make educated guesses given the variant profiles. There are some
differences in how the previous and current data sets were analyzed
(though I have since redone the old samples so it should be trivial to
remove those differences now).

I added our 2016 data to a specific TMRC2 sample sheet,
dated 20191203.  Thus I will load the data here.  That previous data
was mapped using tophat, so I will also need to make some changes to
the gene names to accomodate the two mappings.

```{r oldnew_variants}
old_expt <- sm(create_expt("sample_sheets/tmrc2_samples_20191203.xlsx",
                           file_column = "tophat2file"))

tt <- lp_expt[["expressionset"]]
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.E1$", replacement = "", x = rownames(tt))
lp_expt$expressionset <- tt

tt <- old_expt$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
old_expt$expressionset <- tt
rm(tt)
```

## Create the SNP expressionset

One other important caveat, we have a group of new samples which have
not yet run through the variant search pipeline, so I need to remove
them from consideration.  Though it looks like they finished overnight...

```{r count_expt_old_new}
## The next line drops the samples which are missing the SNP pipeline.
lp_snp <- subset_expt(lp_expt, subset="!is.na(pData(lp_expt)[['bcftable']])")
new_snps <- sm(count_expt_snps(lp_snp, annot_column = "bcftable"))
old_snps <- sm(count_expt_snps(old_expt, annot_column = "bcftable", snp_column = 2))

both_snps <- combine_expts(new_snps, old_snps)
both_norm <- sm(normalize_expt(both_snps, transform = "log2", convert = "cpm", filter = TRUE))

## strains <- both_norm[["design"]][["strain"]]
both_strain <- set_expt_conditions(both_norm, fact = "strain")
```

The data structure 'both_norm' now contains our 2016 data along with
the newer data collected since 2019.

## Plot of SNP profiles for zymodemes

The following plot shows the SNP profiles of all samples (old and new) where the
colors at the top show either the 2.2 strains (orange), 2.3 strains (green), the
previous samples (purple), or the various lab strains (pink etc).

```{r plotting_variants}
old_new_variant_heatmap <- plot_disheat(both_norm)
pp(file = "images/raw_snp_disheat.png", image = old_new_variant_heatmap,
   height = 12, width = 12)
```

The function get_snp_sets() takes the provided metadata factor (in
this case 'condition') and looks for variants which are exclusive to
each element in it.  In this case, this is looking for differences
between 2.2 and 2.3, as well as the set shared among them.

```{r get_snp_sets1}
snp_sets <- get_snp_sets(both_snps, factor = "condition")
both_expt <- combine_expts(lp_expt, old_expt)

snp_genes <- sm(snps_vs_genes(both_expt, snp_sets, expt_name_col = "chromosome"))
## I think we have some metrics here we can plot...
snp_subset <- sm(snp_subset_genes(
  both_expt, both_snps,
  genes = c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
            "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300")))
zymo_heat <- plot_sample_heatmap(snp_subset, row_label = rownames(exprs(snp_subset)))
zymo_heat
```

Didn't I create a set of densities by chromosome?
Oh I think they come in from get_snp_sets()

## SNPS associated with clinical response in the TMRC samples

```{r snp_clinical}
clinical_sets <- get_snp_sets(new_snps, factor = "clinicalresponse")

density_vec <- clinical_sets[["density"]]
chromosome_idx <- grep(pattern = "LpaL", x = names(density_vec))
density_df <- as.data.frame(density_vec[chromosome_idx])
density_df[["chr"]] <- rownames(density_df)
colnames(density_df) <- c("density_vec", "chr")
ggplot(density_df, aes_string(x = "chr", y = "density_vec")) +
  ggplot2::geom_col() +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 10, colour = "black"),
                 axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))

## clinical_written <- write_variants(new_snps)
```

### Cross reference these variants by gene

```{r snp_classifications}
clinical_genes <- sm(snps_vs_genes(lp_expt, clinical_sets, expt_name_col = "chromosome"))

snp_density <- merge(as.data.frame(clinical_genes[["summary_by_gene"]]),
                     as.data.frame(fData(lp_expt)),
                     by = "row.names")
snp_density <- snp_density[, c(1, 2, 4, 15)]
colnames(snp_density) <- c("name", "snps", "product", "length")
snp_density[["product"]] <- tolower(snp_density[["product"]])
snp_density[["length"]] <- as.numeric(snp_density[["length"]])
snp_density[["density"]] <- snp_density[["snps"]] / snp_density[["length"]]
snp_idx <- order(snp_density[["density"]], decreasing = TRUE)
snp_density <- snp_density[snp_idx, ]

removers <- c("amastin", "gp63", "leishmanolysin")
for (r in removers) {
  drop_idx <- grepl(pattern = r, x = snp_density[["product"]])
  snp_density <- snp_density[!drop_idx, ]
}
## Filter these for [A|a]mastin gp63 Leishmanolysin
```


```{r snp_intersections}
clinical_snps <- snps_intersections(lp_expt, clinical_sets, chr_column = "chromosome")

as.data.frame(clinical_snps[["inters"]][["failure, reference strain"]])
as.data.frame(clinical_snps[["inters"]][["cure"]])

head(clinical_snps[["gene_summaries"]][["failure, reference strain"]])
head(clinical_snps[["gene_summaries"]][["cure"]], n = 30)

annot <- fData(lp_expt)
clinical_interest <- as.data.frame(clinical_snps[["gene_summaries"]][["cure"]])
clinical_interest <- merge(clinical_interest,
                           as.data.frame(clinical_snps[["gene_summaries"]][["failure, reference strain"]]),
                           by = "row.names")
rownames(clinical_interest) <- clinical_interest[["Row.names"]]
clinical_interest[["Row.names"]] <- NULL
colnames(clinical_interest) <- c("cure_snps","fail_snps")
annot <- merge(annot, clinical_interest, by = "row.names")
rownames(annot) <- annot[["Row.names"]]
annot[["Row.names"]] <- NULL
fData(lp_expt$expressionset) <- annot
```

# Zymodeme for new samples

The heatmap produced here should show the variants only for the zymodeme genes.

## Hunt for snp clusters

I am thinking that if we find clusters of locations which are variant, that
might provide some PCR testing possibilities.

```{r new_zymo}
new_sets <- get_snp_sets(new_snps, factor = "phenotypiccharacteristics")
summary(new_sets)
## 1000000: 2.2
## 0100000: 2.3

summary(new_sets[["intersections"]][["10000"]])
summary(new_sets[["intersections"]][["01000"]])
```

Thus we see that there are 511 variants associated with 2.2 and 49,790 associated with 2.3.

### A small function for searching for potential PCR primers

The following function uses the positional data to look for sequential
mismatches associated with zymodeme in the hopes that there will be
some regions which would provide good potential targets for a
PCR-based assay.

```{r sequential_search}
sequential_variants <- function(snp_sets, conditions = NULL, minimum = 3, maximum_separation = 3) {
  if (is.null(conditions)) {
    conditions <- 1
  }
  intersection_sets <- snp_sets[["intersections"]]
  intersection_names <- snp_sets[["set_names"]]
  chosen_intersection <- 1
  if (is.numeric(conditions)) {
    chosen_intersection <- conditions
  } else {
    intersection_idx <- intersection_names == conditions
    chosen_intersection <- names(intersection_names)[intersection_idx]
  }

  possible_positions <- intersection_sets[[chosen_intersection]]
  position_table <- data.frame(row.names = possible_positions)
  pat <- "^chr_(.+)_pos_(.+)_ref_.*$"
  position_table[["chr"]] <- gsub(pattern = pat, replacement = "\\1", x = rownames(position_table))
  position_table[["pos"]] <- as.numeric(gsub(pattern = pat, replacement = "\\2", x = rownames(position_table)))
  position_idx <- order(position_table[, "chr"], position_table[, "pos"])
  position_table <- position_table[position_idx, ]
  position_table[["dist"]] <- 0

  last_chr <- ""
  for (r in 1:nrow(position_table)) {
    this_chr <- position_table[r, "chr"]
    if (r == 1) {
      position_table[r, "dist"] <- position_table[r, "pos"]
      last_chr <- this_chr
      next
    }
    if (this_chr == last_chr) {
      position_table[r, "dist"] <- position_table[r, "pos"] - position_table[r - 1, "pos"]
    } else {
      position_table[r, "dist"] <- position_table[r, "pos"]
    }
    last_chr <- this_chr
  }

  sequentials <- position_table[["dist"]] <= maximum_separation
  message("There are ", sum(sequentials), " candidate regions.")

  ## The following can tell me how many runs of each length occurred, that is not quite what I want.
  ## Now use run length encoding to find the set of sequential sequentials!
  rle_result <- rle(sequentials)
  rle_values <- rle_result[["values"]]
  ## The following line is equivalent to just leaving values alone:
  ## true_values <- rle_result[["values"]] == TRUE
  rle_lengths <- rle_result[["lengths"]]
  true_sequentials <- rle_lengths[rle_values]
  rle_idx <- cumsum(rle_lengths)[which(rle_values)]

  position_table[["last_sequential"]] <- 0
  count <- 0
  for (r in rle_idx) {
    count <- count + 1
    position_table[r, "last_sequential"] <- true_sequentials[count]
  }
  message("The maximum sequential set is: ", max(position_table[["last_sequential"]]), ".")

  wanted_idx <- position_table[["last_sequential"]] >= minimum
  wanted <- position_table[wanted_idx, c("chr", "pos")]
  return(wanted)
}

zymo22_sequentials <- sequential_variants(new_sets, conditions = "22")
dim(zymo22_sequentials)
## 7 candidate regions for zymodeme 2.2 -- thus I am betting that the reference strain is a 2.2
zymo23_sequentials <- sequential_variants(new_sets, conditions = "23",
                                          minimum = 1, maximum_separation = 3)
dim(zymo23_sequentials)
## In contrast, there are lots (587) of interesting regions for 2.3!
```

## Make a heatmap describing the clustering of variants

We can cross reference the variants against the zymodeme status and
plot a heatmap of the results and hopefully see how they separate.

```{r zymo_heatmaps}
snp_genes <- sm(snps_vs_genes(lp_expt, new_sets, expt_name_col = "chromosome"))
new_zymo_norm  <- normalize_expt(new_snps, filter = TRUE, convert = "cpm", norm = "quant", transform = TRUE)
new_zymo_norm <- set_expt_conditions(new_zymo_norm, fact = "phenotypiccharacteristics")

zymo_heat <- plot_disheat(new_zymo_norm)
zymo_heat[["plot"]]
```

### Annotated heatmap of variants

Now let us try to make a heatmap which includes some of the annotation data.

```{r zymo_heat_panel_genes}
des <- both_norm[["design"]]
undef_idx <- is.na(des[["strain"]])
des[undef_idx, "strain"] <- "unknown"

##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(both_norm))

zymo_missing_idx <- is.na(des[["phenotypiccharacteristics"]])
des[["phenotypiccharacteristics"]] <- as.character(des[["phenotypiccharacteristics"]])
des[["clinicalcategorical"]] <- as.character(des[["clinicalcategorical"]])
des[zymo_missing_idx, "phenotypiccharacteristics"] <- "unknown"
mydendro <- list(
  "clustfun" = hclust,
  "lwd" = 2.0)
col_data <- as.data.frame(des[, c("phenotypiccharacteristics", "clinicalcategorical")])

unknown_clinical <- is.na(col_data[["clinicalcategorical"]])
row_data <- as.data.frame(des[, c("strain")])
colnames(col_data) <- c("zymodeme", "outcome")
col_data[unknown_clinical, "outcome"] <- "undefined"

colnames(row_data) <- c("strain")
myannot <- list(
  "Col" = list("data" = col_data),
  "Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
                "col" = BrewerClusterCol)
mylabs <- list(
  "Row" = list("nrow" = 4),
  "Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(240)
map1 <- annHeatmap2(
  correlations,
  dendrogram = mydendro,
  annotation = myannot,
  cluster = myclust,
  labels = mylabs,
  ## The following controls if the picture is symmetric
  scale = "none",
  col = hmcols)
pp(file = "images/dendro_heatmap.png", image = map1, height = 20, width = 20)
```

Print the larger heatmap so that all the labels appear.  Keep in mind
that as we get more samples, this image needs to continue getting
bigger.

![big heatmap](images/dendro_heatmap.png)


```{r theresa_idea}
pheno <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
pheno <- subset_expt(pheno, subset="!is.na(pData(pheno)[['bcftable']])")
pheno_snps <- sm(count_expt_snps(pheno, annot_column = "bcftable"))

xref_prop <- table(pheno_snps[["conditions"]])
pheno_snps$conditions
idx_tbl <- exprs(pheno_snps) > 5
new_tbl <- data.frame(row.names = rownames(exprs(pheno_snps)))
for (n in names(xref_prop)) {
  new_tbl[[n]] <- 0
  idx_cols <- which(pheno_snps[["conditions"]] == n)
  prop_col <- rowSums(idx_tbl[, idx_cols]) / xref_prop[n]
  new_tbl[n] <- prop_col
}
keepers <- grepl(x = rownames(new_tbl), pattern = "LpaL13")
new_tbl <- new_tbl[keepers, ]
new_tbl[["strong22"]] <- 1.001 - new_tbl[["z2.2"]]
new_tbl[["strong23"]] <- 1.001 - new_tbl[["z2.3"]]
s22_na <- new_tbl[["strong22"]] > 1
new_tbl[s22_na, "strong22"] <- 1
s23_na <- new_tbl[["strong23"]] > 1
new_tbl[s23_na, "strong23"] <- 1

new_tbl[["SNP"]] <- rownames(new_tbl)
new_tbl[["Chromosome"]] <- gsub(x = new_tbl[["SNP"]], pattern = "chr_(.*)_pos_.*", replacement = "\\1")
new_tbl[["Position"]] <- gsub(x = new_tbl[["SNP"]], pattern = ".*_pos_(\\d+)_.*", replacement = "\\1")
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "strong22", "strong23")]


library(CMplot)
CMplot(new_tbl, bin.size = 100000)

CMplot(new_tbl, plot.type="m", multracks=TRUE, threshold = c(0.01, 0.05),
       threshold.lwd=c(1,1), threshold.col=c("black","grey"),
       amplify=TRUE, bin.size=1e5,
       chr.den.col=c("darkgreen", "yellow", "red"),
       signal.col=c("red", "green", "blue"),
       signal.cex=1, file="jpg", memo="", dpi=300, file.output=TRUE, verbose=TRUE)
```

![SNP Density](SNP-Density.ratio.jpg)
![Circular Manhattan](Circular-Manhattan.ratio.jpg)
![Rectangular Manhattan](Rectangular-Manhattan.ratio.jpg)
![QQ](QQplot.ratio.jpg)




```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  message(paste0("Saving to ", savefile))
  tmp <- sm(saveme(filename = savefile))
}
```

```{r loadme_after, eval = FALSE}
tmp <- loadme(filename = savefile)
```
