1 Introduction

This document is intended to provide a general overview of the TMRC2 samples which have thus far been sequenced. In some cases, this includes only those samples starting in 2019; in other instances I am including our previous (2015-2016) samples.

In all cases the processing performed was:

  1. Default trimming was performed.
  2. Hisat2 was used to map the remaining reads against the Leishmania panamensis genome revision 36.
  3. The alignments from hisat2 were used to count reads/gene against the revision 36 annotations with htseq.
  4. These alignments were also passed to the pileup functionality of samtools and the vcf/bcf utilities in order to make a matrix of all observed differences between each sample with respect to the reference.

The analyses in this document use the matrices of counts/gene from #3 and variants/position from #4 in order to provide some images and metrics describing the samples we have sequenced so far.
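As a rough illustration of what the counts/gene matrix from #3 looks like, the following sketch binds a few htseq-style count files into one genes x samples matrix. The file names and toy values are hypothetical, and it assumes every file lists the same genes in the same order; it is not how the pipeline itself assembles the data.

```r
## Hypothetical toy inputs: two htseq-count style files (gene<TAB>count).
tmp_dir <- tempfile("counts_")
dir.create(tmp_dir)
toy <- list(
  sampleA = data.frame(gene = c("g1", "g2", "g3"), count = c(10, 0, 5)),
  sampleB = data.frame(gene = c("g1", "g2", "g3"), count = c(7, 3, 0)))
for (s in names(toy)) {
  write.table(toy[[s]], file = file.path(tmp_dir, paste0(s, ".count")),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}

## Read each file into a named vector, then bind the vectors as columns.
files <- list.files(tmp_dir, pattern = "\\.count$", full.names = TRUE)
names(files) <- sub("\\.count$", "", basename(files))
per_sample <- lapply(files, function(f) {
  tbl <- read.table(f, sep = "\t", col.names = c("gene", "count"))
  setNames(tbl[["count"]], tbl[["gene"]])
})
count_matrix <- do.call(cbind, per_sample)
```

Rows are genes and columns are samples, which is the orientation assumed by the toy examples later in this document.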

2 Annotations

Everything which follows depends on the existing TriTrypDB annotations revision 46, circa 2019. The following block loads a database of these annotations and turns it into a matrix where the rows are genes and columns are all the annotation types provided by TriTrypDB.

The same database was used to create a matrix of orthologous genes between L. panamensis and all of the other species in the TriTrypDB.

tt <- sm(library(EuPathDB))
tt <- sm(library(org.Lpanamensis.MHOMCOL81L13.v46.eg.db))
pan_db <- org.Lpanamensis.MHOMCOL81L13.v46.eg.db
all_fields <- columns(pan_db)

all_lp_annot <- sm(load_orgdb_annotations(
    pan_db,
    keytype="gid",
    fields=c("annot_gene_entrez_id", "annot_gene_name",
             "annot_strand", "annot_chromosome", "annot_cds_length",
             "annot_gene_product")))$genes

lp_go <- sm(load_orgdb_go(pan_db))
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths)  <- c("ID", "length")

orthos <- sm(EuPathDB::extract_eupath_orthologs(db=pan_db))

hisat_annot <- all_lp_annot
## rownames(hisat_annot) <- paste0("exon_", rownames(hisat_annot), ".E1")

3 Sample Estimation

The process of sample estimation takes two primary inputs:

  1. The sample sheet, which contains all the metadata we currently have on hand, including filenames for the outputs of #3 and #4 above.
  2. The gene annotations.

An expressionset is the primary data structure used in R to examine RNASeq data. It is composed of annotations, metadata, and expression data. In the case of our processing pipeline, the location of the expression data is provided by the filenames in the metadata.
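To make the three parts concrete, here is a minimal sketch using a plain list as a stand-in for an expressionset. It is not the Biobase class; the element names and toy values are assumptions chosen to mirror the exprs/fData/pData accessors used later.

```r
## Toy stand-in for the three parts of an expressionset (a plain list).
counts <- matrix(c(10, 0, 5, 7, 3, 0), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_annot <- data.frame(product = c("kinase", "hypothetical", "hydrolase"),
                         row.names = c("g1", "g2", "g3"))
sample_meta <- data.frame(condition = c("z2.2", "z2.3"),
                          row.names = c("s1", "s2"))
toy_set <- list(exprs = counts, fData = gene_annot, pData = sample_meta)

## The parts only make sense together when their names line up:
## genes index the rows of exprs, samples index its columns.
aligned <- identical(rownames(toy_set$exprs), rownames(toy_set$fData)) &&
  identical(colnames(toy_set$exprs), rownames(toy_set$pData))
```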

3.1 Generate expressionsets

The first lines of the following block create the expressionset. The remaining lines perform various normalizations and generate plots from it.

sample_sheet <- glue::glue("sample_sheets/tmrc2_samples_{ver}.xlsx")
lp_expt <- sm(create_expt(sample_sheet,
                          gene_info=hisat_annot,
                          id_column="hpglidentifier",
                          file_column="lpanamensisv36hisatfile"))
lp_expt <- set_expt_conditions(lp_expt, fact="zymodemecategorical")

libsizes <- plot_libsize(lp_expt)
## The scale difference between the smallest and largest
## libraries is > 10. Assuming a log10 scale is better, set scale = FALSE if not.
libsizes$plot

## I think samples 7,10 should be removed at minimum, probably also 9,11
nonzero <- plot_nonzero(lp_expt)
nonzero$plot
## Warning: ggrepel: 11 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

plot_boxplot(lp_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, adding 1 to the data.
## Changed 2659 zero count features.
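The quantities behind these three plots are simple to compute by hand. The sketch below uses a made-up count matrix to show library sizes, the number of genes with nonzero counts, and the log2(x + 1) transform mentioned in the messages; it is an illustration in base R, not the hpgltools implementation.

```r
## Toy counts: 3 genes x 3 samples (hypothetical values).
counts <- matrix(c(100, 0, 50, 2000, 300, 0, 5, 0, 0), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3")))

## Library size: total reads counted per sample.
lib_sizes <- colSums(counts)
## Number of genes observed at least once per sample.
nonzero_genes <- colSums(counts > 0)
## A large ratio between the biggest and smallest library suggests a log axis.
size_ratio <- max(lib_sizes) / min(lib_sizes)
## Adding 1 before the log keeps zero counts finite.
log_counts <- log2(counts + 1)
```

Samples with both a small library and few observed genes are the ones worth a second look before any downstream analysis.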

3.2 Distribution Visualization

Najib’s favorite plots are of course the PCA/t-SNE. These are nice to look at in order to get a sense of the relationships between samples. They also provide a good opportunity to see what happens when one applies different normalizations, surrogate analyses, filters, etc. In addition, one may set different experimental factors as the primary ‘condition’ (usually the color of plots) and surrogate ‘batches’.

all_norm <- sm(normalize_expt(lp_expt, norm="quant", transform="log2", convert="cpm",
                              batch=FALSE, filter=TRUE))
plot_pca(all_norm, plot_title="PCA of parasite expression values")$plot
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

plot_corheat(all_norm, title="Correlation heatmap of parasite expression values
(Same legend as above)")$plot

plot_sm(all_norm)$plot
## Performing correlation.

sm(plot_variance_coefficients(all_norm))$plot

sm(plot_sample_cvheatmap(all_norm))$plot

## NULL
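For readers who want to see the moving parts, the following base R sketch walks through a simplified version of the same idea: filter, convert to counts per million, log2 transform, then run a PCA and compute pairwise sample correlations. The data are simulated, the filter threshold is arbitrary, and the quantile normalization step is deliberately left out, so this is only an approximation of what the calls above do.

```r
set.seed(1)
## Toy counts: 50 genes x 6 samples (simulated, not TMRC2 data).
counts <- matrix(rpois(300, lambda = 20), nrow = 50,
                 dimnames = list(paste0("g", 1:50), paste0("s", 1:6)))

## Drop genes with very few reads (arbitrary threshold for illustration).
filtered <- counts[rowSums(counts) >= 10, , drop = FALSE]
## Counts per million: scale every sample to a library size of 1e6.
cpm <- t(t(filtered) / colSums(filtered)) * 1e6
log_cpm <- log2(cpm + 1)

## PCA treats samples as observations, so transpose to samples x genes.
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

## Pairwise sample correlations, the usual input to a correlation heatmap.
sample_cor <- cor(log_cpm)
```

Changing any one step (the filter, the normalization, the transform) and replotting the first two columns of `pca$x` is a quick way to see how sensitive the sample relationships are to those choices.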

3.2.1 Notes

The following samples are much lower coverage:

  • TMRC20002
  • TMRC20006
  • TMRC20007
  • TMRC20008

At this time, we do not have very many samples, so the set of metrics/plots is fairly limited. There is really only one factor in the metadata which we can use for performing differential expression analyses, the ‘zymodeme’.

4 Zymodeme analyses

The following sections perform a series of analyses which seek to elucidate differences between the zymodemes 2.2 and 2.3 either through differential expression or variant profiles.

4.1 Differential expression

4.1.1 With respect to zymodeme attribution

zy_expt <- subset_expt(lp_expt, subset="condition=='z2.2'|condition=='z2.3'")
## Using a subset expression.
## There were 30, now there are 16 samples.
zy_de <- sm(all_pairwise(zy_expt, filter=TRUE, model_batch="svaseq"))
zy_table <- sm(combine_de_tables(zy_de, excel=glue::glue("excel/zy_tables-v{ver}.xlsx")))
zy_sig <- sm(extract_significant_genes(zy_table, excel=glue::glue("excel/zy_sig-v{ver}.xlsx")))
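The calls above hand the statistics to hpgltools. To show the general shape of a two-group comparison, here is a deliberately crude stand-in: a per-gene difference of means on log-scale data, a Welch t-test, and a Benjamini-Hochberg adjustment. This is not the method used above and should not be used for real count data; the data are simulated and all names are hypothetical.

```r
set.seed(2)
conditions <- rep(c("z2.2", "z2.3"), each = 4)
## Toy log2-scale expression: 10 genes x 8 samples.
log_expr <- matrix(rnorm(80, mean = 8), nrow = 10,
                   dimnames = list(paste0("g", 1:10), paste0("s", 1:8)))
is_23 <- conditions == "z2.3"
## Spike one gene upward in the z2.3 samples.
log_expr["g1", is_23] <- log_expr["g1", is_23] + 4

## Log fold change: mean of z2.3 minus mean of z2.2.
lfc <- rowMeans(log_expr[, is_23]) - rowMeans(log_expr[, !is_23])
## One Welch t-test per gene, then a multiple-testing adjustment.
pvals <- apply(log_expr, 1, function(x) t.test(x[is_23], x[!is_23])$p.value)
adjp <- p.adjust(pvals, method = "BH")
## Same style of cutoff used later in this document.
toy_sig <- names(which(abs(lfc) >= 1 & adjp <= 0.05))
```

With the sign convention used here, a positive fold change means higher expression in z2.3.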

4.2 With respect to cure/failure

cf_expt <- set_expt_conditions(lp_expt, fact="clinicalcategorical")
cf_norm <- sm(normalize_expt(cf_expt, filter=TRUE, norm="quant", transform="log2",
                             convert="cpm", batch="svaseq"))
plot_pca(cf_norm)$plot
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
cf_de <- sm(all_pairwise(cf_expt, filter=TRUE, model_batch="svaseq"))
cf_table <- sm(combine_de_tables(cf_de, excel=glue::glue("excel/cf_tables-v{ver}.xlsx")))
cf_sig <- sm(extract_significant_genes(cf_table, excel=glue::glue("excel/cf_sig-v{ver}.xlsx")))

4.3 Ontology searches

colnames(lp_go)
## [1] "GID"      "GO"       "EVIDENCE" "ONTOLOGY"
## The table returned above has a "GO" column, not "GOALL".
lp_go <- lp_go[, c("GID", "GO")]
colnames(lp_go) <- c("ID", "GO")

## Gene categories more represented in the 2.3 group.
zy_go_up <- sm(simple_goseq(sig_genes=zy_sig[["deseq"]][["ups"]][[1]],
                            go_db=lp_go, length_db=lp_lengths))

## Gene categories more represented in the 2.2 group.
zy_go_down <- sm(simple_goseq(sig_genes=zy_sig[["deseq"]][["downs"]][[1]],
                              go_db=lp_go, length_db=lp_lengths))
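The core question an ontology search asks is whether a category contains more significant genes than chance would predict. The sketch below answers that with a plain hypergeometric test on a toy gene-to-GO table shaped like `lp_go` (columns ID and GO). It is a simplification: goseq additionally corrects for gene-length bias, which this does not attempt, and the gene and category names are invented.

```r
## Toy gene-to-GO table with the two columns used above (ID, GO).
toy_go <- data.frame(
  ID = c("g1", "g2", "g3", "g4", "g5", "g6", "g1", "g2"),
  GO = c("GO:A", "GO:A", "GO:A", "GO:B", "GO:B", "GO:B", "GO:B", "GO:C"))
universe <- unique(toy_go$ID)
sig <- c("g1", "g2", "g3")  # hypothetical significant genes

enrich_p <- sapply(split(toy_go$ID, toy_go$GO), function(members) {
  in_cat <- length(unique(members))
  hits <- length(intersect(sig, members))
  ## P(X >= hits) when drawing length(sig) genes from the universe.
  phyper(hits - 1, in_cat, length(universe) - in_cat, length(sig),
         lower.tail = FALSE)
})
```

Here GO:A contains exactly the three significant genes out of six, so its p-value is 1/20.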

4.3.1 A couple plots from the differential expression

4.3.1.1 Number of genes in agreement among DE methods, 2.3 more than 2.2

zy_table[["venns"]][[1]][["p_lfc1"]][["up_noweight"]]

4.3.1.2 Number of genes in agreement among DE methods, 2.2 more than 2.3

zy_table[["venns"]][[1]][["p_lfc1"]][["down_noweight"]]

4.3.1.3 MA plot of the differential expression between the zymodemes.

zy_table$plots[[1]][["deseq_ma_plots"]][["plot"]]

4.3.1.4 goseq ontology plots of groups of genes, 2.3 more than 2.2

zy_go_up$pvalue_plots$bpp_plot_over

4.3.1.5 goseq ontology plots of groups of genes, 2.2 more than 2.3

zy_go_down$pvalue_plots$bpp_plot_over

4.4 Zymodeme enzyme gene IDs

Najib read me an email listing the gene names associated with the zymodeme classification. I cross-referenced those names against the Leishmania panamensis gene annotations and found the following:

  1. ALAT: LPAL13_120010900 – alanine aminotransferase
  2. ASAT: LPAL13_340013000 – aspartate aminotransferase
  3. G6PD: LPAL13_000054100 – glucose-6-phosphate 1-dehydrogenase
  4. NH: LPAL13_140006100, LPAL13_180018500 – inosine-guanine nucleoside hydrolase
  5. MPI: LPAL13_320022300 (maybe) – mannose phosphate isomerase (I chose phosphomannose isomerase)

Given these 6 gene IDs (NH has two gene IDs associated with it), I can do some looking for specific differences among the various samples.

4.4.1 Expression levels of zymodeme genes

The following creates a colorspace (red to green) heatmap showing the observed expression of these genes in every sample.

my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
              "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI")

zymo_expt <- exclude_genes_expt(all_norm, ids=my_genes, method="keep")
## Before removal, there were 8636 entries.
## Now there are 6 entries.
## Percent of the counts kept after filtering: 0.086, 0.081, 0.084, 0.083, 0.083, 0.085, 0.086, 0.083, 0.084, 0.087, 0.083, 0.084, 0.083, 0.084, 0.083, 0.083, 0.085, 0.085, 0.083, 0.083, 0.083, 0.083, 0.081, 0.081, 0.085, 0.085, 0.081, 0.082, 0.087, 0.081
## There are 30 samples which kept less than 90 percent counts.
##      TMRC20001 TMRC20002 TMRC20004 TMRC20005 TMRC20006 TMRC20029 TMRC20007
##      TMRC20008 TMRC20027 TMRC20028 TMRC20032 TMRC20015 TMRC20009 TMRC20010
##      TMRC20016 TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018
##      TMRC20019 TMRC20020 TMRC20021 TMRC20022 TMRC20025 TMRC20024 TMRC20033
##      TMRC20026 TMRC20031
test <- plot_sample_heatmap(zymo_expt, row_label=my_names)
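One thing to watch when relabeling rows is that the subset keeps the matrix's row order, which need not match the order in which the IDs were typed. A minimal sketch, assuming a plain matrix and a named lookup vector (the fourth gene ID is invented), keeps labels attached to the right rows:

```r
## Lookup from gene ID to enzyme name (three of the IDs listed above).
gene_labels <- c(LPAL13_120010900 = "ALAT", LPAL13_340013000 = "ASAT",
                 LPAL13_000054100 = "G6PD")
## Toy expression matrix whose row order differs from the lookup order.
expr <- matrix(1:8, nrow = 4,
               dimnames = list(c("LPAL13_340013000", "LPAL13_999999999",
                                 "LPAL13_120010900", "LPAL13_000054100"),
                               c("s1", "s2")))

## Keep only the wanted rows, in the order they occur in the matrix.
wanted <- intersect(rownames(expr), names(gene_labels))
subset_expr <- expr[wanted, , drop = FALSE]
## Look labels up by ID rather than relying on position.
rownames(subset_expr) <- unname(gene_labels[wanted])
```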

4.5 Empirically observed Zymodeme genes from differential expression analysis

In contrast, the following plots take the set of genes which are shared among all differential expression methods (|lfc| >= 1.0 and adjp <= 0.05) and use them to make categories of genes which are increased in 2.3 or 2.2.

shared_zymo <- intersect_significant(zy_table)
## Deleting the file excel/intersect_significant.xlsx before writing the tables.

up_shared <- shared_zymo[["ups"]][[1]][["data"]][["all"]]
rownames(up_shared)
##  [1] "LPAL13_000033300" "LPAL13_000012000" "LPAL13_310031300" "LPAL13_000038400"
##  [5] "LPAL13_340039600" "LPAL13_210015500" "LPAL13_050005000" "LPAL13_310039200"
##  [9] "LPAL13_000038500" "LPAL13_270034100" "LPAL13_170015400" "LPAL13_350044000"
## [13] "LPAL13_200013000" "LPAL13_340039700" "LPAL13_000041000" "LPAL13_330021800"
## [17] "LPAL13_240009700" "LPAL13_140019300" "LPAL13_140019100" "LPAL13_320038700"
## [21] "LPAL13_330021900" "LPAL13_260031400" "LPAL13_210005000" "LPAL13_280037900"
## [25] "LPAL13_350073200" "LPAL13_160014200" "LPAL13_000010600" "LPAL13_140019200"
## [29] "LPAL13_230011500" "LPAL13_230011200" "LPAL13_310028500" "LPAL13_230011300"
## [33] "LPAL13_250025700"
upshared_expt <- exclude_genes_expt(all_norm, ids=rownames(up_shared), method="keep")
## Before removal, there were 8636 entries.
## Now there are 33 entries.
## Percent of the counts kept after filtering: 0.322, 0.258, 0.232, 0.224, 0.246, 0.236, 0.230, 0.222, 0.255, 0.249, 0.269, 0.341, 0.229, 0.338, 0.333, 0.239, 0.220, 0.337, 0.237, 0.236, 0.337, 0.224, 0.228, 0.341, 0.221, 0.255, 0.247, 0.227, 0.233, 0.230
## There are 30 samples which kept less than 90 percent counts.
##      TMRC20001 TMRC20002 TMRC20004 TMRC20005 TMRC20006 TMRC20029 TMRC20007
##      TMRC20008 TMRC20027 TMRC20028 TMRC20032 TMRC20015 TMRC20009 TMRC20010
##      TMRC20016 TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018
##      TMRC20019 TMRC20020 TMRC20021 TMRC20022 TMRC20025 TMRC20024 TMRC20033
##      TMRC20026 TMRC20031
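The idea of keeping only genes on which every method agrees can be sketched with set intersections. The three result tables below are invented, and the column names `logFC` and `adjp` are assumptions; the cutoffs are the ones stated above (|lfc| >= 1.0 and adjp <= 0.05).

```r
## Hypothetical per-method result tables sharing the same gene IDs.
ids <- paste0("g", 1:5)
make_tbl <- function(lfc, adjp) data.frame(logFC = lfc, adjp = adjp, row.names = ids)
results <- list(
  method_a = make_tbl(c(2.0, 1.5, 0.2, -1.2, 1.1), c(0.010, 0.03, 0.5, 0.02, 0.20)),
  method_b = make_tbl(c(1.8, 0.9, 0.1, -1.5, 1.3), c(0.020, 0.04, 0.6, 0.01, 0.04)),
  method_c = make_tbl(c(2.2, 1.2, 0.3, -1.1, 1.0), c(0.001, 0.20, 0.4, 0.03, 0.03)))

## Significant genes per method, split by direction.
ups <- lapply(results, function(tbl) rownames(tbl)[tbl$logFC >= 1 & tbl$adjp <= 0.05])
downs <- lapply(results, function(tbl) rownames(tbl)[tbl$logFC <= -1 & tbl$adjp <= 0.05])

## Genes called by every method.
shared_up <- Reduce(intersect, ups)
shared_down <- Reduce(intersect, downs)
```

Requiring agreement is conservative: g5 passes in two of the three toy methods but is dropped from the shared set.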

4.5.1 Heatmap of zymodeme gene expression increased in 2.3 vs. 2.2

test <- plot_sample_heatmap(upshared_expt, row_label=rownames(up_shared))

4.5.2 Heatmap of zymodeme gene expression increased in 2.2 vs. 2.3

down_shared <- shared_zymo[["downs"]][[1]][["data"]][["all"]]
downshared_expt <- exclude_genes_expt(all_norm, ids=rownames(down_shared), method="keep")
## Before removal, there were 8636 entries.
## Now there are 85 entries.
## Percent of the counts kept after filtering: 0.574, 0.970, 0.881, 0.885, 0.935, 0.857, 0.881, 0.899, 0.571, 0.539, 0.790, 0.528, 0.922, 0.505, 0.552, 0.920, 0.876, 0.535, 0.909, 0.933, 0.539, 0.910, 0.918, 0.542, 0.952, 0.665, 0.930, 0.930, 0.904, 0.880
## There are 30 samples which kept less than 90 percent counts.
##      TMRC20001 TMRC20002 TMRC20004 TMRC20005 TMRC20006 TMRC20029 TMRC20007
##      TMRC20008 TMRC20027 TMRC20028 TMRC20032 TMRC20015 TMRC20009 TMRC20010
##      TMRC20016 TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018
##      TMRC20019 TMRC20020 TMRC20021 TMRC20022 TMRC20025 TMRC20024 TMRC20033
##      TMRC20026 TMRC20031
test <- plot_sample_heatmap(downshared_expt, row_label=rownames(down_shared))

5 SNP profiles

In this block, I am combining our previous samples and our new samples in the hopes of finding variant positions which help elucidate aspects of either the new or old samples. In other words, we do not know the zymodeme annotations for the old samples nor the strain identities (or the shortcut ‘chronic vs. self-healing’) for the new samples. We may be able to make educated guesses given the variant profiles. There are some differences in how the previous and current data sets were analyzed (though I have since redone the old samples so it should be trivial to remove those differences now).

old_expt <- sm(create_expt("sample_sheets/tmrc2_samples_20191203.xlsx",
                           file_column="tophat2file"))

tt <- lp_expt$expressionset
rownames(tt) <- gsub(pattern="^exon_", replacement="", x=rownames(tt))
rownames(tt) <- gsub(pattern="\\.E1$", replacement="", x=rownames(tt))
lp_expt$expressionset <- tt

tt <- old_expt$expressionset
rownames(tt) <- gsub(pattern="^exon_", replacement="", x=rownames(tt))
rownames(tt) <- gsub(pattern="\\.1$", replacement="", x=rownames(tt))
old_expt$expressionset <- tt

new_snps <- sm(count_expt_snps(lp_expt, annot_column="bcftable"))
## Error: 'preprocessing/tmrc20001/outputs/vcfutils_lpanamensis_v36/concatenated_lpanamensis_v36_count.txt' does not exist in current working directory ('/mnt/sshfs_10186/cbcbsub00/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019').
old_snps <- sm(count_expt_snps(old_expt, annot_column="bcftable", snp_column=2))

both_snps <- combine_expts(new_snps, old_snps)
## Error in combine_expts(new_snps, old_snps): object 'new_snps' not found
both_norm <- sm(normalize_expt(both_snps, transform="log2", convert="cpm", filter=TRUE))
## Error in normalize_expt(both_snps, transform = "log2", convert = "cpm", : object 'both_snps' not found
## strains <- both_norm[["design"]][["strain"]]
both_norm <- set_expt_conditions(both_norm, fact="strain")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'both_norm' not found
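Conceptually, combining two variant tables means placing them on the union of observed positions. The sketch below does this for two small matrices and fills positions missing from one set with zero. How combine_expts() treats such positions is not shown in this document, so the zero-fill is an assumption, as are the position names, which only imitate the chr_..._pos_... pattern parsed later on.

```r
## Hypothetical variant positions and toy read counts.
pos_a <- "chr_LpaL13-01_pos_100_ref_A_alt_G"
pos_b <- "chr_LpaL13-01_pos_250_ref_C_alt_T"
pos_c <- "chr_LpaL13-02_pos_900_ref_G_alt_A"
new_mat <- matrix(c(12, 0, 7, 3), nrow = 2,
                  dimnames = list(c(pos_a, pos_b), c("new1", "new2")))
old_mat <- matrix(c(5, 9), nrow = 2,
                  dimnames = list(c(pos_b, pos_c), "old1"))

## Start from an all-zero matrix over the union of positions and samples.
all_pos <- union(rownames(new_mat), rownames(old_mat))
all_samples <- c(colnames(new_mat), colnames(old_mat))
combined <- matrix(0, nrow = length(all_pos), ncol = length(all_samples),
                   dimnames = list(all_pos, all_samples))
## Copy each input into its own block by name.
combined[rownames(new_mat), colnames(new_mat)] <- new_mat
combined[rownames(old_mat), colnames(old_mat)] <- old_mat
```

Note that a zero here conflates "no variant" with "no coverage", which matters when the two data sets were sequenced to different depths.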

5.1 Plot of SNP profiles for zymodemes

The following plot shows the SNP profiles of all samples (old and new) where the colors at the top show either the 2.2 strains (orange), 2.3 strains (green), the previous samples (purple), or the various lab strains (pink etc).

tt <- plot_disheat(both_norm)
## Error in plot_heatmap(expt_data, expt_colors = expt_colors, expt_design = expt_design, : object 'both_norm' not found
snp_sets <- get_snp_sets(both_snps, factor="condition")
## Error in get_snp_sets(both_snps, factor = "condition"): object 'both_snps' not found
both_expt <- combine_expts(lp_expt, old_expt)
snp_genes <- sm(snps_vs_genes(both_expt, snp_sets, expt_name_col="chromosome"))
## Error in snps_vs_genes(both_expt, snp_sets, expt_name_col = "chromosome"): object 'snp_sets' not found
snp_subset <- sm(snp_subset_genes(
  both_expt, both_snps,
  genes=c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
          "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300")))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'fData': object 'both_snps' not found
## zymo_heat <- plot_sample_heatmap(snp_subset, row_label=rownames(exprs(snp_subset)))

6 Clinical response for new samples

clinical_sets <- get_snp_sets(new_snps, factor="clinicalresponse")
## Error in get_snp_sets(new_snps, factor = "clinicalresponse"): object 'new_snps' not found
clinical_genes <- sm(snps_vs_genes(lp_expt, clinical_sets, expt_name_col="chromosome"))
## Error in snps_vs_genes(lp_expt, clinical_sets, expt_name_col = "chromosome"): object 'clinical_sets' not found
clinical_snps <- snps_intersections(lp_expt, clinical_sets, chr_column="chromosome")
## Error in snps_intersections(lp_expt, clinical_sets, chr_column = "chromosome"): object 'clinical_sets' not found
head(as.data.frame(clinical_snps$inters[["Failure"]]))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'clinical_snps' not found
head(as.data.frame(clinical_snps$inters[["Cure"]]))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'clinical_snps' not found
head(clinical_snps$gene_summaries$Failure)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'clinical_snps' not found
head(clinical_snps$gene_summaries$Cure)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'clinical_snps' not found
annot <- fData(lp_expt)
clinical_interest <- as.data.frame(clinical_snps[["gene_summaries"]][["Cure"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'clinical_snps' not found
clinical_interest <- merge(clinical_interest, as.data.frame(clinical_snps[["gene_summaries"]][["Failure"]]), by="row.names")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'merge': object 'clinical_interest' not found
rownames(clinical_interest) <- clinical_interest[["Row.names"]]
## Error in eval(expr, envir, enclos): object 'clinical_interest' not found
clinical_interest[["Row.names"]] <- NULL
## Error in clinical_interest[["Row.names"]] <- NULL: object 'clinical_interest' not found
colnames(clinical_interest) <- c("cure_snps","fail_snps")
## Error in colnames(clinical_interest) <- c("cure_snps", "fail_snps"): object 'clinical_interest' not found
annot <- merge(annot, clinical_interest, by="row.names")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'y' in selecting a method for function 'merge': object 'clinical_interest' not found
rownames(annot) <- annot[["Row.names"]]
annot[["Row.names"]] <- NULL
fData(lp_expt$expressionset) <- annot
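The merge idiom used above is easier to follow on toy data. In this sketch, two per-gene summaries are joined by row name, renamed, and then attached to an annotation table. The values are invented. Unlike the code above, the second merge uses `all.x=TRUE`, which is my assumption about what is wanted: it keeps genes that have no variant counts instead of dropping them.

```r
## Toy per-gene variant counts for each outcome.
cure <- data.frame(count = c(3, 1), row.names = c("g1", "g2"))
fail <- data.frame(count = c(0, 4), row.names = c("g1", "g2"))

## merge(by = "row.names") returns the key in a column named Row.names.
interest <- merge(cure, fail, by = "row.names")
rownames(interest) <- interest[["Row.names"]]
interest[["Row.names"]] <- NULL
colnames(interest) <- c("cure_snps", "fail_snps")

## Attach to the annotations, keeping every annotated gene.
annot <- data.frame(product = c("kinase", "hydrolase", "hypothetical"),
                    row.names = c("g1", "g2", "g3"))
annot <- merge(annot, interest, by = "row.names", all.x = TRUE)
rownames(annot) <- annot[["Row.names"]]
annot[["Row.names"]] <- NULL
```

Keeping every gene matters if the table is written back as feature data, since its rows then still match the rows of the expression matrix.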

7 Zymodeme for new samples

The heatmap produced here should show the variants only for the zymodeme genes.

new_sets <- get_snp_sets(new_snps, factor="phenotypiccharacteristics")
## Error in get_snp_sets(new_snps, factor = "phenotypiccharacteristics"): object 'new_snps' not found
snp_genes <- sm(snps_vs_genes(lp_expt, new_sets, expt_name_col="chromosome"))
## Error in snps_vs_genes(lp_expt, new_sets, expt_name_col = "chromosome"): object 'new_sets' not found
new_zymo_norm <- normalize_expt(new_snps, filter=TRUE, convert="cpm", norm="quant", transform="log2")
## Error in normalize_expt(new_snps, filter = TRUE, convert = "cpm", norm = "quant", : object 'new_snps' not found
new_zymo_norm <- set_expt_conditions(new_zymo_norm, fact="phenotypiccharacteristics")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'new_zymo_norm' not found
zymo_heat <- plot_disheat(new_zymo_norm)
## Error in plot_heatmap(expt_data, expt_colors = expt_colors, expt_design = expt_design, : object 'new_zymo_norm' not found
zymo_subset <- snp_subset_genes(lp_expt, new_snps,
                                genes=c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
                                        "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300"))
## Error: subscript contains invalid names
zymo_subset <- set_expt_conditions(zymo_subset, fact="phenotypiccharacteristics")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'zymo_subset' not found
## zymo_heat <- plot_sample_heatmap(zymo_subset, row_label=rownames(exprs(snp_subset)))

des <- both_norm$design
## Error in eval(expr, envir, enclos): object 'both_norm' not found
undef_idx <- is.na(des[["strain"]])
## Error in eval(expr, envir, enclos): object 'des' not found
des[undef_idx, "strain"] <- "unknown"
## Error in des[undef_idx, "strain"] <- "unknown": object 'des' not found
##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(both_norm))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'both_norm' not found
zymo_missing_idx <- is.na(des[["phenotypiccharacteristics"]])
## Error in eval(expr, envir, enclos): object 'des' not found
des[zymo_missing_idx, "phenotypiccharacteristics"] <- "unknown"
## Error in des[zymo_missing_idx, "phenotypiccharacteristics"] <- "unknown": object 'des' not found
mydendro <- list(
  "clustfun" = hclust,
  "lwd" = 2.0)
col_data <- as.data.frame(des[, c("phenotypiccharacteristics", "clinicalcategorical")])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'des' not found
unknown_clinical <- is.na(col_data[["clinicalcategorical"]])
## Error in eval(expr, envir, enclos): object 'col_data' not found
row_data <- as.data.frame(des[, c("strain")])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'des' not found
colnames(col_data) <- c("zymodeme", "outcome")
## Error in colnames(col_data) <- c("zymodeme", "outcome"): object 'col_data' not found
col_data[unknown_clinical, "outcome"] <- "undefined"
## Error in col_data[unknown_clinical, "outcome"] <- "undefined": object 'col_data' not found
colnames(row_data) <- c("strain")
## Error in colnames(row_data) <- c("strain"): object 'row_data' not found
myannot <- list(
  "Col" = list("data" = col_data),
  "Row" = list("data" = row_data))
## Error in eval(expr, envir, enclos): object 'col_data' not found
myclust <- list("cuth" = 1.0,
                "col" = BrewerClusterCol)
mylabs <- list(
  "Row" = list("nrow" = 4),
  "Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(170)
map1 <- annHeatmap2(
  correlations,
  dendrogram=mydendro,
  annotation=myannot,
  cluster=myclust,
  labels=mylabs,
  col=hmcols)
## Error in annHeatmap2(correlations, dendrogram = mydendro, annotation = myannot, : object 'correlations' not found
plot(map1)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'map1' not found
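Since the block above could not run, here is a small base R sketch of the underlying idea: correlate the samples' profiles, turn correlation into a distance, and cluster. The profiles are simulated and the use of 1 - correlation as the distance is one common choice, not necessarily what annHeatmap2 does internally.

```r
set.seed(3)
## Two hypothetical underlying profiles, each observed twice with noise.
base_a <- rnorm(200)
base_b <- rnorm(200)
profiles <- cbind(a1 = base_a + rnorm(200, sd = 0.1),
                  a2 = base_a + rnorm(200, sd = 0.1),
                  b1 = base_b + rnorm(200, sd = 0.1),
                  b2 = base_b + rnorm(200, sd = 0.1))

## Pairwise sample correlation, then 1 - correlation as a distance.
sample_cor <- cor(profiles)
tree <- hclust(as.dist(1 - sample_cor))
## Cut the tree into two groups.
groups <- cutree(tree, k = 2)
```

If samples of unknown strain fall cleanly into a group with samples of known strain, that is the kind of educated guess this section is after.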

8 Using Variant profiles to make guesses about strains and chronic/self-healing

The following uses the same information to make some guesses about the strains used in the new samples.

des <- both_norm$design
## Error in eval(expr, envir, enclos): object 'both_norm' not found
undef_idx <- is.na(des[["strain"]])
## Error in eval(expr, envir, enclos): object 'des' not found
des[undef_idx, "strain"] <- "unknown"
## Error in des[undef_idx, "strain"] <- "unknown": object 'des' not found
##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(both_norm))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'both_norm' not found
mydendro <- list(
  "clustfun" = hclust,
  "lwd" = 2.0)
col_data <- as.data.frame(des[, c("condition")])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'des' not found
row_data <- as.data.frame(des[, c("strain")])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'des' not found
colnames(col_data) <- c("condition")
## Error in colnames(col_data) <- c("condition"): object 'col_data' not found
colnames(row_data) <- c("strain")
## Error in colnames(row_data) <- c("strain"): object 'row_data' not found
myannot <- list(
  "Col" = list("data" = col_data),
  "Row" = list("data" = row_data))
## Error in eval(expr, envir, enclos): object 'col_data' not found
myclust <- list("cuth" = 1.0,
                "col" = BrewerClusterCol)
mylabs <- list(
  "Row" = list("nrow" = 4),
  "Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(170)
map1 <- annHeatmap2(
  correlations,
  dendrogram=mydendro,
  annotation=myannot,
  cluster=myclust,
  labels=mylabs,
  col=hmcols)
## Error in annHeatmap2(correlations, dendrogram = mydendro, annotation = myannot, : object 'correlations' not found
plot(map1)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'map1' not found
pheno <- subset_expt(lp_expt, subset="condition=='z2.2'|condition=='z2.3'")
## Using a subset expression.
## There were 30, now there are 16 samples.
pheno_snps <- sm(count_expt_snps(pheno, annot_column="bcftable"))
## Error: 'preprocessing/tmrc20001/outputs/vcfutils_lpanamensis_v36/concatenated_lpanamensis_v36_count.txt' does not exist in current working directory ('/mnt/sshfs_10186/cbcbsub00/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019').
xref_prop <- table(pheno_snps$conditions)
## Error in eval(quote(list(...)), env): object 'pheno_snps' not found
pheno_snps$conditions
## Error in eval(expr, envir, enclos): object 'pheno_snps' not found
idx_tbl <- exprs(pheno_snps) > 5
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'pheno_snps' not found
new_tbl <- data.frame(row.names=rownames(exprs(pheno_snps)))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'pheno_snps' not found
for (n in names(xref_prop)) {
  new_tbl[[n]] <- 0
  idx_cols <- which(pheno_snps[["conditions"]] == n)
  prop_col <- rowSums(idx_tbl[, idx_cols]) / xref_prop[n]
  new_tbl[n] <- prop_col
}
## Error in eval(expr, envir, enclos): object 'xref_prop' not found
new_tbl[["ratio"]] <- (new_tbl[["z2.2"]] - new_tbl[["z2.3"]])
## Error in eval(expr, envir, enclos): object 'new_tbl' not found
keepers <- grepl(x=rownames(new_tbl), pattern="LpaL13")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'grepl': error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'new_tbl' not found
new_tbl <- new_tbl[keepers, ]
## Error in eval(expr, envir, enclos): object 'new_tbl' not found
new_tbl[["SNP"]] <- rownames(new_tbl)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'new_tbl' not found
new_tbl[["Chromosome"]] <- gsub(x=new_tbl[["SNP"]], pattern="chr_(.*)_pos_.*", replacement="\\1")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'gsub': object 'new_tbl' not found
new_tbl[["Position"]] <- gsub(x=new_tbl[["SNP"]], pattern=".*_pos_(\\d+)_.*", replacement="\\1")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'gsub': object 'new_tbl' not found
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "ratio")]
## Error in eval(expr, envir, enclos): object 'new_tbl' not found
library(CMplot)
## Much appreciate for using CMplot.
## Full description, Bug report, Suggestion and the latest codes:
## https://github.com/YinLiLin/CMplot
CMplot(new_tbl)
## Error in is.data.frame(x): object 'new_tbl' not found
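Because the inputs were missing, the loop above never ran. The following sketch reproduces its logic on a toy matrix: for each position, compute the proportion of samples per zymodeme with more than 5 supporting reads, take the difference, and parse chromosome and position out of the row names. The row names are hypothetical imitations of the chr_..._pos_... pattern, and converting the position to a number is my addition.

```r
## Toy variant counts: 3 positions x 4 samples.
snp_counts <- matrix(
  c(10, 0, 8, 12, 1, 0, 0, 9, 7, 2, 20, 6), nrow = 3,
  dimnames = list(c("chr_LpaL13-01_pos_100_ref_A_alt_G",
                    "chr_LpaL13-02_pos_2500_ref_C_alt_T",
                    "chr_LpaL13-02_pos_77_ref_G_alt_A"),
                  c("s1", "s2", "s3", "s4")))
conditions <- c("z2.2", "z2.2", "z2.3", "z2.3")

## A position counts as observed in a sample when it has > 5 reads.
observed <- snp_counts > 5
## Proportion of samples per condition in which each position is observed.
prop <- sapply(unique(conditions), function(cond) {
  rowMeans(observed[, conditions == cond, drop = FALSE])
})

result <- data.frame(
  SNP = rownames(snp_counts),
  Chromosome = gsub("chr_(.*)_pos_.*", "\\1", rownames(snp_counts)),
  Position = as.numeric(gsub(".*_pos_(\\d+)_.*", "\\1", rownames(snp_counts))),
  difference = prop[, "z2.2"] - prop[, "z2.3"])
```

A value of 1 marks a position seen in every z2.2 sample and no z2.3 sample, and -1 the reverse. The column is called "ratio" in the code above, but it is a difference of proportions.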
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  message(paste0("Saving to ", savefile))
  tmp <- sm(saveme(filename=savefile))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 3866d0ef3d5bf766f01b092108ec06406921447c
## This is hpgltools commit: Mon Mar 22 15:33:04 2021 -0400: 3866d0ef3d5bf766f01b092108ec06406921447c
## Saving to tmrc2_02sample_estimation_v202103.rda.xz
tmp <- loadme(filename=savefile)
---
title: "TMRC2 Comprehensive Data Analysis: 202103"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
library(hpgltools)
tt <- sm(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=90,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      fig.width=8,
                      fig.height=8,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=12))
ver <- "202103"
rundate <- format(Sys.Date(), format="%Y%m%d")

## tmp <- try(sm(loadme(filename=gsub(pattern="\\.Rmd", replace="\\.rda\\.xz", x=previous_file))))
rmd_file <- "tmrc2_02sample_estimation_v202103.Rmd"
savefile <- gsub(pattern="\\.Rmd", replace="\\.rda\\.xz", x=rmd_file)

library(Heatplus)
```

# Introduction

This document is intended to provide a general overview of the TMRC2 samples
which have thus far been sequenced.  In some cases, this includes only those
samples starting in 2019; in other instances I am including our previous
(2015-2016) samples.

In all cases the processing performed was:

1.  Default trimming was performed.
2.  Hisat2 was used to map the remaining reads against the Leishmania
    panamensis genome revision 36.
3.  The alignments from hisat2 were used to count reads/gene against the
    revision 36 annotations with htseq.
4.  These alignments were also passed to the pileup functionality of samtools
    and the vcf/bcf utilities in order to make a matrix of all observed
    differences between each sample with respect to the reference.

The analyses in this document use the matrices of counts/gene from #3 and
variants/position from #4 in order to provide some images and metrics describing
the samples we have sequenced so far.

# Annotations

Everything which follows depends on the existing TriTrypDB annotations revision
46, circa 2019.  The following block loads a database of these annotations and
turns it into a matrix where the rows are genes and columns are all the
annotation types provided by TriTrypDB.

The same database was used to create a matrix of orthologous genes between
L. panamensis and all of the other species in the TriTrypDB.

```{r annot}
tt <- sm(library(EuPathDB))
tt <- sm(library(org.Lpanamensis.MHOMCOL81L13.v46.eg.db))
pan_db <- org.Lpanamensis.MHOMCOL81L13.v46.eg.db
all_fields <- columns(pan_db)

all_lp_annot <- sm(load_orgdb_annotations(
    pan_db,
    keytype="gid",
    fields=c("annot_gene_entrez_id", "annot_gene_name",
             "annot_strand", "annot_chromosome", "annot_cds_length",
             "annot_gene_product")))$genes

lp_go <- sm(load_orgdb_go(pan_db))
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths)  <- c("ID", "length")

orthos <- sm(EuPathDB::extract_eupath_orthologs(db=pan_db))

hisat_annot <- all_lp_annot
## rownames(hisat_annot) <- paste0("exon_", rownames(hisat_annot), ".E1")
```

# Sample Estimation

The process of sample estimation takes two primary inputs:

1.  The sample sheet, which contains all the metadata we currently have on hand,
    including filenames for the outputs of #3 and #4 above.
2.  The gene annotations.

An expressionset is the primary data structure used in R to examine RNASeq data.  It
is comprised of annotations, metadata, and expression data.  In the case of our
processing pipeline, the location of the expression data is provided by the
filenames in the metadata.
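
The following is a minimal base-R sketch of those three pieces and of the
naming rules which tie them together.  It does not use Biobase or hpgltools;
the gene names, sample names, and values are all hypothetical.

```r
## Toy stand-ins for the three parts of an expressionset (all names hypothetical).
counts <- matrix(c(10, 0, 25,
                   12, 3, 30),
                 nrow=3,
                 dimnames=list(c("gene_a", "gene_b", "gene_c"),
                               c("sample_1", "sample_2")))
sample_metadata <- data.frame(
  condition=c("z2.2", "z2.3"),
  row.names=c("sample_1", "sample_2"))
gene_annotations <- data.frame(
  product=c("hypothetical A", "hypothetical B", "hypothetical C"),
  row.names=c("gene_a", "gene_b", "gene_c"))

## The invariants which hold the three tables together:
## expression columns are samples, expression rows are genes.
stopifnot(identical(colnames(counts), rownames(sample_metadata)))
stopifnot(identical(rownames(counts), rownames(gene_annotations)))

## Because of this, a set of samples chosen from the metadata
## selects the matching expression columns.
z23_samples <- rownames(sample_metadata)[sample_metadata[["condition"]] == "z2.3"]
z23_counts <- counts[, z23_samples, drop=FALSE]
z23_counts
```

Subsetting by a metadata factor, as is done later for the zymodemes, relies on
exactly this correspondence between sample names and expression columns.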

## Generate expressionsets

The first lines of the following block create the Expressionset.  All of the
following lines perform various normalizations and generate plots from it.

```{r new_samples_hisat}
sample_sheet <- glue::glue("sample_sheets/tmrc2_samples_{ver}.xlsx")
lp_expt <- sm(create_expt(sample_sheet,
                          gene_info=hisat_annot,
                          id_column="hpglidentifier",
                          file_column="lpanamensisv36hisatfile"))
lp_expt <- set_expt_conditions(lp_expt, fact="zymodemecategorical")

libsizes <- plot_libsize(lp_expt)
libsizes$plot
## I think samples 7,10 should be removed at minimum, probably also 9,11
nonzero <- plot_nonzero(lp_expt)
nonzero$plot
plot_boxplot(lp_expt)
```

## Distribution Visualization

Najib's favorite plots are of course the PCA/tSNE.  These are nice to look at in
order to get a sense of the relationships between samples.  They also provide a
good opportunity to see what happens when one applies different normalizations,
surrogate analyses, filters, etc.  In addition, one may set different
experimental factors as the primary 'condition' (usually the color of plots) and
surrogate 'batches'.

```{r pre_questions}
all_norm <- sm(normalize_expt(lp_expt, norm="quant", transform="log2", convert="cpm",
                              batch=FALSE, filter=TRUE))
plot_pca(all_norm, plot_title="PCA of parasite expression values")$plot

plot_corheat(all_norm, title="Correlation heatmap of parasite expression values
(Same legend as above)")$plot
plot_sm(all_norm)$plot
sm(plot_variance_coefficients(all_norm))$plot
sm(plot_sample_cvheatmap(all_norm))$plot
```
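
To make the individual operations concrete, here is a minimal base-R sketch of
a low-count filter, quantile normalization, cpm conversion, log2
transformation, and PCA on a toy matrix.  The data are simulated, the filter
threshold and pseudocount are arbitrary, and the order of operations is an
assumption; this illustrates the steps themselves, not how hpgltools
implements them.

```r
set.seed(1)
## Toy count matrix: 200 hypothetical genes by 6 hypothetical samples.
toy_counts <- matrix(rnbinom(200 * 6, mu=100, size=2), nrow=200,
                     dimnames=list(paste0("gene_", 1:200), paste0("s", 1:6)))

## Low-count filter: keep genes with at least 2 counts on average.
kept <- toy_counts[rowMeans(toy_counts) >= 2, , drop=FALSE]

## Quantile normalization: every sample gets the same distribution,
## the mean of the sorted columns, assigned back by rank.
quantile_normalize <- function(mtrx) {
  ranks <- apply(mtrx, 2, rank, ties.method="first")
  target <- rowMeans(apply(mtrx, 2, sort))
  normed <- apply(ranks, 2, function(r) target[r])
  dimnames(normed) <- dimnames(mtrx)
  normed
}
qn <- quantile_normalize(kept)

## Counts per million, then log2 with a pseudocount to avoid log2(0).
cpm <- t(t(qn) / colSums(qn)) * 1e6
log_cpm <- log2(cpm + 1)

## PCA treats samples as observations, hence the transpose.
pca <- prcomp(t(log_cpm), center=TRUE, scale.=FALSE)
variance_explained <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
head(pca$x[, 1:2])
```

Swapping or dropping any one of these steps and re-plotting the first two
components is a quick way to see how much a given normalization changes the
apparent relationships between samples.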

### Notes

The following samples are much lower coverage:

* TMRC20002
* TMRC20006
* TMRC20007
* TMRC20008

At this time, we do not have very many samples, so the set of metrics/plots is
fairly limited.  There is really only one factor in the metadata which we can
use for performing differential expression analyses, the 'zymodeme'.

# Zymodeme analyses

The following sections perform a series of analyses which seek to elucidate
differences between the zymodemes 2.2 and 2.3 either through differential
expression or variant profiles.

## Differential expression

### With respect to zymodeme attribution

```{r zymo_de, fig.show="hide"}
zy_expt <- subset_expt(lp_expt, subset="condition=='z2.2'|condition=='z2.3'")
zy_de <- sm(all_pairwise(zy_expt, filter=TRUE, model_batch="svaseq"))
zy_table <- sm(combine_de_tables(zy_de, excel=glue::glue("excel/zy_tables-v{ver}.xlsx")))
zy_sig <- sm(extract_significant_genes(zy_table, excel=glue::glue("excel/zy_sig-v{ver}.xlsx")))
```

### With respect to cure/failure

```{r curefail_de, fig.show="hide"}
cf_expt <- set_expt_conditions(lp_expt, fact="clinicalcategorical")
cf_norm <- sm(normalize_expt(cf_expt, filter=TRUE, norm="quant", transform="log2",
                             convert="cpm", batch="svaseq"))
plot_pca(cf_norm)$plot

cf_de <- sm(all_pairwise(cf_expt, filter=TRUE, model_batch="svaseq"))
cf_table <- sm(combine_de_tables(cf_de, excel=glue::glue("excel/cf_tables-v{ver}.xlsx")))
cf_sig <- sm(extract_significant_genes(cf_table, excel=glue::glue("excel/cf_sig-v{ver}.xlsx")))
```

## Ontology searches

```{r go, fig.show="hide"}
colnames(lp_go)
lp_go <- lp_go[, c("GID", "GOALL")]
colnames(lp_go) <- c("ID", "GO")

## Gene categories more represented in the 2.3 group.
zy_go_up <- sm(simple_goseq(sig_genes=zy_sig[["deseq"]][["ups"]][[1]],
                            go_db=lp_go, length_db=lp_lengths))

## Gene categories more represented in the 2.2 group.
zy_go_down <- sm(simple_goseq(sig_genes=zy_sig[["deseq"]][["downs"]][[1]],
                              go_db=lp_go, length_db=lp_lengths))
```

### A couple plots from the differential expression

#### Number of genes in agreement among DE methods, 2.3 more than 2.2

```{r de_plots}
zy_table[["venns"]][[1]][["p_lfc1"]][["up_noweight"]]
```

#### Number of genes in agreement among DE methods, 2.2 more than 2.3

```{r de_plots_down}
zy_table[["venns"]][[1]][["p_lfc1"]][["down_noweight"]]
```

#### MA plot of the differential expression between the zymodemes.

```{r other_plots}
zy_table$plots[[1]][["deseq_ma_plots"]][["plot"]]
```

#### goseq ontology plots of groups of genes, 2.3 more than 2.2

```{r goseq_up}
zy_go_up$pvalue_plots$bpp_plot_over
```

#### goseq ontology plots of groups of genes, 2.2 more than 2.3

```{r goseq_down}
zy_go_down$pvalue_plots$bpp_plot_over
```

## Zymodeme enzyme gene IDs

Najib read me an email listing off the gene names associated with the zymodeme
classification.  I took those names and cross referenced them against the
Leishmania panamensis gene annotations and found the following:

1. ALAT: LPAL13_120010900 -- alanine aminotransferase
2. ASAT: LPAL13_340013000 -- aspartate aminotransferase
3. G6PD: LPAL13_000054100 -- glucose-6-phosphate 1-dehydrogenase
4. NH: LPAL13_140006100, LPAL13_180018500 -- inosine-guanine nucleoside hydrolase
5. MPI: LPAL13_320022300 (maybe) -- mannose phosphate isomerase (I chose phosphomannose isomerase)

Given these 6 gene IDs (NH has two gene IDs associated with it), I can do some
looking for specific differences among the various samples.

### Expression levels of zymodeme genes

The following creates a colorspace (red to green) heatmap showing the observed
expression of these genes in every sample.

```{r zymodemes}
my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
              "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
              "other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")

zymo_expt <- exclude_genes_expt(all_norm, ids=my_genes, method="keep")
test <- plot_sample_heatmap(zymo_expt, row_label=my_names)
```
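
When a vector of IDs is paired with a vector of display names, the labels only
line up if the subsetted rows come back in the same order and none of the IDs
are missing.  Whether `exclude_genes_expt()` preserves the order of `ids` has
not been checked here, so the following base-R sketch (with hypothetical IDs
and values) shows one way to look labels up by ID instead of by position.

```r
## Hypothetical normalized matrix; the IDs and values are made up for illustration.
toy_exprs <- matrix(1:12, nrow=4,
                    dimnames=list(c("gene_d", "gene_b", "gene_a", "gene_c"),
                                  c("s1", "s2", "s3")))
wanted_ids <- c("gene_a", "gene_b", "gene_x")  ## gene_x is absent from the matrix.
wanted_names <- c("A", "B", "X")

## Keep only the rows which were requested and are actually present.
keep_idx <- rownames(toy_exprs) %in% wanted_ids
subset_exprs <- toy_exprs[keep_idx, , drop=FALSE]

## Look up each surviving row's label by ID rather than by position, so that
## a missing ID or a different row order cannot shift the labels.
row_labels <- wanted_names[match(rownames(subset_exprs), wanted_ids)]
row_labels
```

Here the rows come back in matrix order (`gene_b`, then `gene_a`) and the
labels follow them, while the absent `gene_x` contributes no label at all.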

## Empirically observed Zymodeme genes from differential expression analysis

In contrast, the following plots take the set of genes which are shared among
all differential expression methods (|lfc| >= 1.0 and adjp <= 0.05) and use them
to make categories of genes which are increased in 2.3 or 2.2.

```{r zymodeme_genes_empirical}
shared_zymo <- intersect_significant(zy_table)
up_shared <- shared_zymo[["ups"]][[1]][["data"]][["all"]]
rownames(up_shared)
upshared_expt <- exclude_genes_expt(all_norm, ids=rownames(up_shared), method="keep")
```
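
The idea of "shared among all methods" can be sketched in base R as a
threshold filter applied per method followed by an intersection of the
surviving gene IDs.  The tables, column names (`logFC`, `adj_p`), and the
helper `significant_ids()` below are hypothetical and are not the structure
returned by `intersect_significant()`.

```r
## Hypothetical result tables from three DE methods; every value is made up.
gene_ids <- c("g1", "g2", "g3", "g4")
toy_results <- list(
  "method_a" = data.frame(logFC=c(2.1, -1.5, 0.4, 1.2),
                          adj_p=c(0.01, 0.03, 0.20, 0.04), row.names=gene_ids),
  "method_b" = data.frame(logFC=c(1.8, -1.1, 1.3, 0.9),
                          adj_p=c(0.02, 0.04, 0.01, 0.03), row.names=gene_ids),
  "method_c" = data.frame(logFC=c(1.1, -2.0, 0.2, 1.5),
                          adj_p=c(0.04, 0.001, 0.50, 0.10), row.names=gene_ids))

## Return the IDs passing |lfc| >= 1.0 and adjp <= 0.05 in one direction.
significant_ids <- function(tbl, direction=c("up", "down"), lfc=1.0, p=0.05) {
  direction <- match.arg(direction)
  passes_lfc <- if (direction == "up") {
    tbl[["logFC"]] >= lfc
  } else {
    tbl[["logFC"]] <= -lfc
  }
  rownames(tbl)[passes_lfc & tbl[["adj_p"]] <= p]
}

## A gene is 'shared' only when every method calls it significant.
shared_up <- Reduce(intersect, lapply(toy_results, significant_ids, direction="up"))
shared_down <- Reduce(intersect, lapply(toy_results, significant_ids, direction="down"))
shared_up
shared_down
```

In this toy example `g4` is significant in one method but fails the fold-change
cutoff in the second and the p-value cutoff in the third, so it is excluded
from the shared set.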

### Heatmap of zymodeme gene expression increased in 2.3 vs. 2.2

```{r zymoempup}
test <- plot_sample_heatmap(upshared_expt, row_label=rownames(up_shared))
```

### Heatmap of zymodeme gene expression increased in 2.2 vs. 2.3

```{r zymoempdown}
down_shared <- shared_zymo[["downs"]][[1]][["data"]][["all"]]
downshared_expt <- exclude_genes_expt(all_norm, ids=rownames(down_shared), method="keep")
test <- plot_sample_heatmap(downshared_expt, row_label=rownames(down_shared))
```

# SNP profiles

In this block, I am combining our previous samples and our new samples in the
hopes of finding variant positions which help elucidate aspects of either the
new or old samples.  In other words, we do not know the zymodeme annotations for
the old samples nor the strain identities (or the shortcut 'chronic
vs. self-healing') for the new samples.  We may be able to make educated guesses
given the variant profiles.  There are some differences in how the previous and
current data sets were analyzed (though I have since redone the old samples so
it should be trivial to remove those differences now).

```{r oldnew_variants}
old_expt <- sm(create_expt("sample_sheets/tmrc2_samples_20191203.xlsx",
                           file_column="tophat2file"))

tt <- lp_expt$expressionset
rownames(tt) <- gsub(pattern="^exon_", replacement="", x=rownames(tt))
rownames(tt) <- gsub(pattern="\\.E1$", replacement="", x=rownames(tt))
lp_expt$expressionset <- tt

tt <- old_expt$expressionset
rownames(tt) <- gsub(pattern="^exon_", replacement="", x=rownames(tt))
rownames(tt) <- gsub(pattern="\\.1$", replacement="", x=rownames(tt))
old_expt$expressionset <- tt

new_snps <- sm(count_expt_snps(lp_expt, annot_column="bcftable"))
old_snps <- sm(count_expt_snps(old_expt, annot_column="bcftable", snp_column=2))

both_snps <- combine_expts(new_snps, old_snps)
both_norm <- sm(normalize_expt(both_snps, transform="log2", convert="cpm", filter=TRUE))

## strains <- both_norm[["design"]][["strain"]]
both_norm <- set_expt_conditions(both_norm, fact="strain")
```

## Plot of SNP profiles for zymodemes

The following plot shows the SNP profiles of all samples (old and new) where the
colors at the top show either the 2.2 strains (orange), 2.3 strains (green), the
previous samples (purple), or the various lab strains (pink etc).

```{r plotting_variants}
tt <- plot_disheat(both_norm)

snp_sets <- get_snp_sets(both_snps, factor="condition")
both_expt <- combine_expts(lp_expt, old_expt)
snp_genes <- sm(snps_vs_genes(both_expt, snp_sets, expt_name_col="chromosome"))

snp_subset <- sm(snp_subset_genes(
  both_expt, both_snps,
  genes=c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
          "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300")))
## zymo_heat <- plot_sample_heatmap(snp_subset, row_label=rownames(exprs(snp_subset)))
```

# Clinical response for new samples

```{r snp_clinical}
clinical_sets <- get_snp_sets(new_snps, factor="clinicalresponse")
clinical_genes <- sm(snps_vs_genes(lp_expt, clinical_sets, expt_name_col="chromosome"))
clinical_snps <- snps_intersections(lp_expt, clinical_sets, chr_column="chromosome")
head(as.data.frame(clinical_snps$inters[["Failure"]]))
head(as.data.frame(clinical_snps$inters[["Cure"]]))

head(clinical_snps$gene_summaries$Failure)
head(clinical_snps$gene_summaries$Cure)

annot <- fData(lp_expt)
clinical_interest <- as.data.frame(clinical_snps[["gene_summaries"]][["Cure"]])
clinical_interest <- merge(clinical_interest, as.data.frame(clinical_snps[["gene_summaries"]][["Failure"]]), by="row.names")
rownames(clinical_interest) <- clinical_interest[["Row.names"]]
clinical_interest[["Row.names"]] <- NULL
colnames(clinical_interest) <- c("cure_snps","fail_snps")
annot <- merge(annot, clinical_interest, by="row.names")
rownames(annot) <- annot[["Row.names"]]
annot[["Row.names"]] <- NULL
fData(lp_expt$expressionset) <- annot
```

# Zymodeme for new samples

The heatmap produced here should show the variants only for the zymodeme genes.

```{r new_zymo}
new_sets <- get_snp_sets(new_snps, factor="phenotypiccharacteristics")
snp_genes <- sm(snps_vs_genes(lp_expt, new_sets, expt_name_col="chromosome"))
new_zymo_norm <- normalize_expt(new_snps, filter=TRUE, convert="cpm", norm="quant", transform="log2")
new_zymo_norm <- set_expt_conditions(new_zymo_norm, fact="phenotypiccharacteristics")
zymo_heat <- plot_disheat(new_zymo_norm)

zymo_subset <- snp_subset_genes(lp_expt, new_snps,
                                genes=c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
                                        "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300"))

zymo_subset <- set_expt_conditions(zymo_subset, fact="phenotypiccharacteristics")
## zymo_heat <- plot_sample_heatmap(zymo_subset, row_label=rownames(exprs(snp_subset)))

des <- both_norm$design
undef_idx <- is.na(des[["strain"]])
des[undef_idx, "strain"] <- "unknown"

##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(both_norm))

zymo_missing_idx <- is.na(des[["phenotypiccharacteristics"]])
des[zymo_missing_idx, "phenotypiccharacteristics"] <- "unknown"
mydendro <- list(
  "clustfun" = hclust,
  "lwd" = 2.0)
col_data <- as.data.frame(des[, c("phenotypiccharacteristics", "clinicalcategorical")])
unknown_clinical <- is.na(col_data[["clinicalcategorical"]])
row_data <- as.data.frame(des[, c("strain")])
colnames(col_data) <- c("zymodeme", "outcome")
col_data[unknown_clinical, "outcome"] <- "undefined"

colnames(row_data) <- c("strain")
myannot <- list(
  "Col" = list("data" = col_data),
  "Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
                "col" = BrewerClusterCol)
mylabs <- list(
  "Row" = list("nrow" = 4),
  "Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(170)
map1 <- annHeatmap2(
  correlations,
  dendrogram=mydendro,
  annotation=myannot,
  cluster=myclust,
  labels=mylabs,
  col=hmcols)
plot(map1)
```

# Using Variant profiles to make guesses about strains and chronic/self-healing

The following uses the same information to make some guesses about the strains
used in the new samples.

```{r old_and_new_chronic}
des <- both_norm$design
undef_idx <- is.na(des[["strain"]])
des[undef_idx, "strain"] <- "unknown"
##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(both_norm))

mydendro <- list(
  "clustfun" = hclust,
  "lwd" = 2.0)
col_data <- as.data.frame(des[, c("condition")])
row_data <- as.data.frame(des[, c("strain")])
colnames(col_data) <- c("condition")
colnames(row_data) <- c("strain")
myannot <- list(
  "Col" = list("data" = col_data),
  "Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
                "col" = BrewerClusterCol)
mylabs <- list(
  "Row" = list("nrow" = 4),
  "Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(170)
map1 <- annHeatmap2(
  correlations,
  dendrogram=mydendro,
  annotation=myannot,
  cluster=myclust,
  labels=mylabs,
  col=hmcols)
plot(map1)
```
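
The reasoning behind these guesses is that samples with similar variant
profiles should correlate strongly and therefore cluster together, so an
unlabeled sample can borrow the label of its neighbors.  Below is a minimal
base-R sketch of that logic on simulated data; the sample names are
hypothetical, Pearson correlation is an assumption about what `hpgl_cor()`
computes, and `cutree(k=2)` stands in for the height cut used by the heatmap.

```r
set.seed(2)
## Two hypothetical underlying variant profiles, each shared by three samples.
profile_a <- rnorm(300)
profile_b <- rnorm(300)
toy_variants <- cbind(
  sapply(1:3, function(i) profile_a + rnorm(300, sd=0.3)),
  sapply(1:3, function(i) profile_b + rnorm(300, sd=0.3)))
colnames(toy_variants) <- c("known_a1", "known_a2", "unknown_1",
                            "known_b1", "known_b2", "unknown_2")

## Sample-by-sample correlation, converted to a distance for clustering.
sample_cor <- cor(toy_variants, method="pearson")
sample_tree <- hclust(as.dist(1 - sample_cor), method="complete")

## Cut the tree into two groups and see which known samples
## each unknown sample lands with.
groups <- cutree(sample_tree, k=2)
split(names(groups), groups)
```

Any label assigned this way is only a guess; it is as good as the assumption
that variant similarity tracks the strain or phenotype of interest.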

```{r theresa_idea}
pheno <- subset_expt(lp_expt, subset="condition=='z2.2'|condition=='z2.3'")
pheno_snps <- sm(count_expt_snps(pheno, annot_column="bcftable"))

xref_prop <- table(pheno_snps$conditions)
pheno_snps$conditions
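## A variant is considered observed in a sample when its count exceeds 5.
idx_tbl <- exprs(pheno_snps) > 5
new_tbl <- data.frame(row.names=rownames(exprs(pheno_snps)))
for (n in names(xref_prop)) {
  idx_cols <- which(pheno_snps[["conditions"]] == n)
  ## Proportion of samples in this zymodeme in which each variant was observed;
  ## drop=FALSE keeps a matrix even when a condition has only one sample.
  prop_col <- rowSums(idx_tbl[, idx_cols, drop=FALSE]) / as.numeric(xref_prop[n])
  new_tbl[[n]] <- prop_col
}
## Despite its name, 'ratio' is the difference between the two proportions.
new_tbl[["ratio"]] <- (new_tbl[["z2.2"]] - new_tbl[["z2.3"]])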
keepers <- grepl(x=rownames(new_tbl), pattern="LpaL13")
new_tbl <- new_tbl[keepers, ]
new_tbl[["SNP"]] <- rownames(new_tbl)
new_tbl[["Chromosome"]] <- gsub(x=new_tbl[["SNP"]], pattern="chr_(.*)_pos_.*", replacement="\\1")
new_tbl[["Position"]] <- gsub(x=new_tbl[["SNP"]], pattern=".*_pos_(\\d+)_.*", replacement="\\1")
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "ratio")]
library(CMplot)
CMplot(new_tbl)
```

```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  message(paste0("Saving to ", savefile))
  tmp <- sm(saveme(filename=savefile))
}
```

```{r loadme_after, eval=FALSE}
tmp <- loadme(filename=savefile)
```
