1 Introduction

This document visualizes the TMRC2 samples before the differential expression and variant analyses are performed, in order to understand how the samples relate to one another.

libsizes <- plot_libsize(lp_expt)
libsizes
## Library sizes of 101 samples, 
## ranging from 6,141,368 to 135,385,347.

dev <- pp("images/lp_expt_libsizes.png", width = 18, height = 9)
libsizes$plot
closed <- dev.off()

Library sizes of the protein coding gene counts observed per sample. The samples were mapped against EuPathDB revision 36 of the Leishmania (Viannia) panamensis strain MHOM/COL/81/L13 genome; the alignments were sorted, indexed, and counted via htseq using the gene features, and non-protein coding features were excluded. The per-sample sums of the remaining matrix were plotted to check that the relative sample coverage is sufficient and not too divergent across samples. Bars are colored according to strain/zymodeme annotation: zymodeme 2.3 is red; zymodeme 2.2 is blue; the Leishmania braziliensis-like strains b2904, z1.0, and z1.5 are purple; z2.4, the zymodeme most similar to 2.3, is light brown; and the zymodemes most similar to 2.2 (z3.0, z2.0, z2.1, and z3.2) are light gray, dark gray, dark brown, and gray respectively.
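As a minimal sketch of what "per-sample sums" means here, the following base R snippet computes library sizes from a small made-up count matrix. The matrix, its names, and the ratio check are illustrative assumptions and are not part of plot_libsize().

```r
## Toy count matrix: rows are genes, columns are samples (values are made up).
counts <- matrix(c(10,  0,   5,
                   200, 30,  70,
                   15,  5,   0,
                   75,  65,  25),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

## The library size of a sample is the sum of its column.
lib_sizes <- colSums(counts)

## Ratio of the largest to the smallest library: a quick check for divergent coverage.
size_ratio <- max(lib_sizes) / min(lib_sizes)

print(lib_sizes)
print(size_ratio)
```

A large ratio suggests that some samples are much more deeply sequenced than others, which is the property the bar plot is meant to expose.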

## I think samples 7,10 should be removed at minimum, probably also 9,11
nonzero <- plot_nonzero(lp_expt, cutoff = 0.7)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
nonzero
## A non-zero genes plot of 101 samples.
## These samples have an average 28.97 CPM coverage and 8629 genes observed, ranging from 8500 to
## 8682.
## Warning: ggrepel: 84 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

dev <- pp(file = "images/lp_nonzero.png", width=9, height = 9)
nonzero$plot
## Warning: ggrepel: 81 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
closed <- dev.off()

Differences in relative gene content with respect to sequencing coverage. The per-sample number of observed genes was plotted with respect to the relative CPM coverage in order to check that the samples are sufficiently and similarly diverse. Many samples were observed near or at the putative asymptote of likely gene content; no samples were observed with fewer than 65% of the Leishmania panamensis genes included. Note that the range of genes observed is quite small (8500 <= x < 8700 genes); however, this was plotted after excluding the 2 samples with fewer than 8500 genes observed and the 2 samples with fewer than 5 million protein coding mapped reads (both of which had more than 8500 genes observed despite their low coverage).
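The quantities behind this kind of plot can be sketched in base R as follows. This assumes that a gene counts as "observed" when it has at least one read; the exact definition and cutoff used by plot_nonzero() may differ, and the matrix is made up.

```r
## Toy count matrix: rows are genes, columns are samples (values are made up).
counts <- matrix(c(12, 0, 3,
                   0,  0, 8,
                   40, 9, 0,
                   7,  1, 5),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

## Number of genes with at least one read in each sample.
genes_observed <- colSums(counts > 0)

## Fraction of all annotated genes observed in each sample.
observed_fraction <- genes_observed / nrow(counts)

## Sequencing depth in millions of reads, for the x-axis.
lib_millions <- colSums(counts) / 1e6

print(data.frame(genes_observed, observed_fraction, lib_millions))
```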

lp_box <- plot_boxplot(lp_expt)
## 8122 entries are 0.  We are on a log scale, adding 1 to the data.
dev <- pp(file = "images/lp_expt_boxplot.png", width = 16, height = 9)
lp_box
closed <- dev.off()
lp_box

The distribution of observed counts per gene for all samples was plotted as a boxplot on the log2 scale (the axis looks like log10, but I checked that it is log2). In contrast to the host transcriptome distribution, the parasite distribution of reads per gene is log-normal.

filter_plot <- plot_libsize_prepost(lp_expt)
filter_plot$lowgene_plot
## Warning: Using alpha for a discrete variable is not advised.

filter_plot$count_plot

The number of genes removed by low-count filtering is drastically lower in the parasite samples than in the human samples. Thus, even though the coverage of the parasite samples ranges from near 0 to ~150 CPM, the number of genes removed by the default low-count filter ranges only from 40 to 129, and the number of reads associated with them ranges only from 100 to 3168.

table(pData(lp_expt)[["zymodemecategorical"]])
## 
##   b2904 unknown     z10     z15     z20     z21     z22     z23     z24     z30 
##       1       2       1       1       1       7      43      41       2       1 
##     z32 
##       1
table(pData(lp_expt)[["clinicalresponse"]])
## 
##                                  cure                               failure 
##                                    40                                    36 
##                   failure miltefosine                       laboratory line 
##                                     1                                     1 
## laboratory line miltefosine resistant                                    nd 
##                                     1                                    18 
##                      reference strain 
##                                     4

1.1 Distribution Visualizations

Najib’s favorite plots are of course the PCA/TSNE. These are nice to look at in order to get a sense of the relationships between samples. They also provide a good opportunity to see what happens when one applies different normalizations, surrogate analyses, filters, etc. In addition, one may set different experimental factors as the primary ‘condition’ (usually the color of plots) and surrogate ‘batches’.

1.2 By Susceptibility

Column ‘Q’ in the sample sheet holds the susceptibility values; make a categorical version of it with these parameters:

  • 0 <= x <= 35 is resistant
  • 36 <= x <= 48 is ambiguous
  • 49 <= x is sensitive
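The thresholds above can be sketched with base R's cut(). The input vector is hypothetical, and treating missing values as "unknown" is an assumption based on the shape legend printed with the PCA; it is not taken from the sample sheet code.

```r
## Hypothetical susceptibility values standing in for column 'Q'.
q_values <- c(0, 12, 35, 36, 48, 49, 97, NA)

## Intervals are [0, 35], (35, 48], and (48, Inf); for whole numbers this matches
## 0 <= x <= 35, 36 <= x <= 48, and 49 <= x.
sus_category <- cut(q_values,
                    breaks = c(0, 35, 48, Inf),
                    labels = c("resistant", "ambiguous", "sensitive"),
                    include.lowest = TRUE, right = TRUE)

## Assumption: samples without a measurement are labeled "unknown".
sus_category <- as.character(sus_category)
sus_category[is.na(sus_category)] <- "unknown"

print(sus_category)
```

Note that the listed ranges leave non-integer values such as 35.5 undefined; this sketch places them in the "ambiguous" group.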
strain_norm <- normalize_expt(lp_strain, norm = "quant", transform = "log2",
                              convert = "cpm", filter = TRUE)
## Removing 134 low-count genes (8576 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
zymo_pca <- plot_pca(strain_norm, plot_title = "PCA of parasite expression values",
                     plot_labels = FALSE)
zymo_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by b2904, unknown, z1.0, z1.5, z2.0, z2.1, z2.2, z2.3, z2.4, z3.0, z3.2
## Shapes are defined by resistant, sensitive, unknown.

dev <- pp(file = "images/promastigote_zymocol_sensshape.png")
sp(zymo_pca$plot)
closed <- dev.off()
lp_strain_knn <- set_expt_conditions(lp_strain, fact = "knnv2classification")
## The numbers of samples by condition are:
## 
## z10 z21 z22 z23 z24 z32 
##   3   6  47  41   2   2
strain_norm_knn <- normalize_expt(lp_strain_knn, norm = "quant", transform = "log2",
                                  convert = "cpm", filter = TRUE)
## Removing 134 low-count genes (8576 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
zymo_pca_knn <- plot_pca(strain_norm_knn, plot_title = "PCA of parasite expression values",
                         plot_labels = FALSE)
sp(plotly::ggplotly(zymo_pca_knn$plot))
dev <- pp(file = "images/promastigote_zymocol_sensshape_knnv2.png")
sp(zymo_pca_knn$plot)
closed <- dev.off()
sp(zymo_pca_knn$plot)

strain_nobatch <- set_expt_batches(strain_norm, fact="sourcelab")
## The number of samples by batch are:
## 
## MAG 
## 101
zymo_pcav2 <- plot_pca(strain_nobatch, plot_title = "PCA of parasite expression values",
                       plot_labels = FALSE)
dev <- pp(file = "images/promastigote_zymocol_nobatch.png")
sp(zymo_pcav2$plot)
closed <- dev.off()
sp(zymo_pcav2$plot)

strain_nb <- normalize_expt(lp_strain, convert = "cpm", transform = "log2",
                            filter = TRUE, batch = "svaseq")
## Removing 134 low-count genes (8576 remaining).
## Setting 738 low elements to zero.
## transform_counts: Found 738 values equal to 0, adding 1 to the matrix.
strain_nb_pca <- plot_pca(strain_nb, plot_title = "PCA of parasite expression values",
                          plot_labels = FALSE)
dev <- pp(file = "images/clinical_nb_pca_sus_shape.png")
sp(strain_nb_pca$plot)
closed <- dev.off()
strain_nb_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by b2904, unknown, z1.0, z1.5, z2.0, z2.1, z2.2, z2.3, z2.4, z3.0, z3.2
## Shapes are defined by resistant, sensitive, unknown.

Add explicit labels for a few reference strains. The following candidates cannot be labeled:

  • TMRC20023: Excluded due to coverage (only 7k reads)
  • TMRC20006: This one has 19,815,673 reads, but a weirdly small number of genes and got excluded.
  • TMRC20029: This has 1,946,986 reads and so was excluded.
  • TMRC20034: Not sequenced
##samples_to_label <- tolower(c("TMRC20023", "TMRC20006", "TMRC20029", "TMRC20007", "TMRC20034",
##                              "TMRC20008", "TMRC20027", "TMRC20028", "TMRC20032", "TMRC20040"))
samples_to_label <- tolower(c("TMRC20007", "TMRC20008", "TMRC20027",
                              "TMRC20028", "TMRC20032", "TMRC20040"))

label_entries <- zymo_pca$table[samples_to_label, ]
## aes_string() is deprecated in recent ggplot2 releases; aes() is used instead.
sp(zymo_pca$plot +
     geom_text(mapping = aes(x = PC1, y = PC2, label = sampleid),
               data = label_entries))

Some likely text for a figure legend might include something like the following (paraphrased from Najib’s 2016 dual transcriptome profiling paper (10.1128/mBio.00027-16)):

Expression profiles of the promastigote samples across multiple strains. Each glyph represents one sample; colors delineate the various strains, which fall into two primary clades. Red samples are zymodeme 2.3, blue samples are zymodeme 2.2. The difference between these two primary groups makes up approximately 17% of the variance in the PCA. Purple samples are Leishmania braziliensis or zymodeme 1.0/1.5 samples, orange are z2.4, browns and greys are z2.1, z2.0, z3.0, and z3.2 respectively. This analysis was performed following a low-count filter, cpm conversion, quantile normalization, and a log2 transformation. No batch factor was used, nor was a surrogate variable estimation performed.
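The processing steps named in this legend can be sketched in base R on simulated counts. This is only an illustration: the filter threshold, the pseudocount, the helper quantile_normalize(), and the order of operations (taken from the order listed in the legend) are assumptions and may not match what normalize_expt() and plot_pca() do internally.

```r
set.seed(1)
## Simulated counts: 200 genes by 6 samples (purely illustrative).
counts <- matrix(rnbinom(200 * 6, mu = 100, size = 2), nrow = 200,
                 dimnames = list(paste0("g", 1:200), paste0("s", 1:6)))

## 1. Low-count filter: keep genes with a row sum of at least 10 (arbitrary threshold).
kept <- counts[rowSums(counts) >= 10, , drop = FALSE]

## 2. Counts per million.
cpm <- t(t(kept) / colSums(kept)) * 1e6

## 3. Quantile normalization: every sample receives the same value distribution.
quantile_normalize <- function(mtrx) {
  sorted_means <- unname(rowMeans(apply(mtrx, 2, sort)))
  normed <- apply(mtrx, 2, function(column) {
    sorted_means[rank(column, ties.method = "first")]
  })
  dimnames(normed) <- dimnames(mtrx)
  normed
}
qnorm_cpm <- quantile_normalize(cpm)

## 4. log2 transformation with a pseudocount of 1.
log_mtrx <- log2(qnorm_cpm + 1)

## PCA with samples as observations; report the percent variance per component.
pca <- prcomp(t(log_mtrx), center = TRUE, scale. = FALSE)
pct_variance <- 100 * pca$sdev^2 / sum(pca$sdev^2)
print(round(pct_variance, 2))
```

The first element of pct_variance corresponds to the kind of figure quoted in the legend for the separation along PC1.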

Some interpretation for this figure might include:

When PCA was performed on the promastigote samples, the dominant component (which still explains a relatively small amount of variance) coincided with the two primary strain groups, zymodeme 2.2 and 2.3. With the exception of some Leishmania braziliensis samples, all promastigote samples assayed fell into one of these two categories.

When surrogate variable estimation was performed on the entire set of samples, it increased the apparent strain-dependent variance, but had some potentially problematic effects for a couple of samples (one z2.3 sample now lies with the z2.2 samples). It is assumed that this is because sva attempted to estimate surrogate values for the less-represented strains, with some unintended consequences for sample TMRC20095 (which, along with TMRC20008, is one of the two least covered samples by a significant margin). This hypothesis may be tested by excluding the braziliensis and non-z2.2/2.3 samples and repeating the analysis; when this is performed later in the document, the difference between the two primary clades increases to 49.33% of the variance and there are no odd samples.

zymo_tsne <- plot_tsne(strain_norm, plot_title = "TSNE of parasite expression values")
sp(zymo_tsne$plot)

strain_nb_tsne <- plot_tsne(strain_nb, plot_title = "TSNE of parasite expression values")
sp(strain_nb_tsne$plot)

corheat <- plot_corheat(strain_norm,
                        plot_title = "Correlation heatmap of parasite expression values")
corheat$plot

disheat <- plot_disheat(strain_norm,
                        plot_title = "Distance heatmap of parasite expression values")
disheat$plot

plot_sm(strain_norm)$plot

Potential start for a figure legend:

Global relationships among the promastigote transcriptional profiles. Pairwise Pearson correlations and Euclidean distances were calculated using the normalized expression matrices. Colors along the top row delineate the experimental conditions (same colors as the PCA). Samples were clustered by nearest neighbor clustering and each colored tile describes either one correlation value between two samples (red to white delineates Pearson correlation values of the 8,710 normalized gene values between two samples ranging from <= 0.7 to >= 1.0) or the Euclidean distance between two samples (dark blue to white delineates identical to a normalized Euclidean distance of >= 110).
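The two pairwise metrics in this legend can be sketched in base R as below. The expression matrix is simulated, and reading "nearest neighbor clustering" as single-linkage hierarchical clustering is an assumption; the heatmap functions used in this document may cluster differently.

```r
set.seed(2)
## Simulated normalized expression: 50 genes by 4 samples (purely illustrative).
expr <- matrix(rnorm(50 * 4, mean = 8), nrow = 50,
               dimnames = list(paste0("g", 1:50), paste0("s", 1:4)))

## Pairwise Pearson correlation between samples (columns).
sample_cor <- cor(expr, method = "pearson")

## Pairwise Euclidean distance between samples; dist() works on rows, hence t().
sample_dist <- as.matrix(dist(t(expr), method = "euclidean"))

## Assumption: single linkage stands in for "nearest neighbor" clustering.
sample_tree <- hclust(as.dist(sample_dist), method = "single")

print(round(sample_cor, 3))
print(round(sample_dist, 3))
print(sample_tree$order)
```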

Some interpretation for this figure might include:

When the global relationships among the samples were distilled down to individual Euclidean distances or Pearson correlation coefficients between pairs of samples, the primary clustering among samples observed was according to strain. The primary significant outlier sample (TMRC20095) is explicitly due to low coverage. The other outlier strains are either braziliensis (purple) or a series of strains which, when viewed in IGV, appear to have genetic variants which bridge the differences between the two primary zymodemes, particularly on the known aneuploid chromosomes.

1.3 Limit to three strains: 2.1/2.2/2.3

only_three_types <- subset_expt(lp_strain, subset = "condition=='z2.1'|condition=='z2.3'|condition=='z2.2'")
## subset_expt(): There were 101, now there are 91 samples.
only_three_norm <- normalize_expt(only_three_types, norm = "quant", transform = "log2",
                                  convert = "cpm", batch = FALSE, filter = TRUE) %>%
  set_expt_batches(fact="phase")
## Removing 151 low-count genes (8559 remaining).
## transform_counts: Found 109 values equal to 0, adding 1 to the matrix.
## The number of samples by batch are:
## 
## Stationary 
##         91
onlythree_pca <- plot_pca(only_three_norm, plot_title = "PCA of z2.1, z2.2 and z2.3 parasite expression values",
                          plot_labels = FALSE)
pp(file="images/promastigote_threetypes_zymocol_noshape.png")
sp(onlythree_pca$plot)
dev.off()
## png 
##   2
sp(onlythree_pca$plot)

1.4 Limit to just two strains: 2.2/2.3

lp_two_strains_norm <- sm(normalize_expt(lp_two_strains, norm = "quant", transform = "log2",
                                         convert = "cpm", batch = FALSE, filter = TRUE))
onlytwo_pca <- plot_pca(lp_two_strains_norm, plot_title = "PCA of z2.2 and z2.3 parasite expression values",
                        plot_labels = FALSE)
dev <- pp(file = "images/zymo_z2.2_z2.3_pca_sus_shape.pdf")
sp(onlytwo_pca$plot)
closed <- dev.off()
sp(onlytwo_pca$plot)

lp_two_strains_nb <- sm(normalize_expt(lp_two_strains, norm = "quant", transform = "log2",
                                       convert = "cpm", batch = "svaseq", filter = TRUE))
onlytwo_pca_nb <- plot_pca(lp_two_strains_nb, plot_title = "PCA of z2.2 and z2.3 parasite expression values",
                           plot_labels = FALSE)
dev <- pp(file = "images/zymo_z2.2_z2.3_pca_sus_shape_nb.pdf")
sp(onlytwo_pca_nb$plot)
closed <- dev.off()
sp(onlytwo_pca_nb$plot)

1.5 By Cure/Fail status

This is by far the most problematic comparison. I think the only interpretation of the following images is that the parasite has little effect on the likelihood that a person will successfully end treatment. There does appear to be some variance associated with cure/fail, but only in a few samples (visible in ~10 fail samples and perhaps ~8 cure samples when sva is applied to the data).

cf_norm <- normalize_expt(lp_cf, convert = "cpm", transform = "log2",
                          norm = "quant", filter = TRUE)
## Removing 134 low-count genes (8576 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
start_cf <- plot_pca(cf_norm, plot_title = "PCA of parasite expression values",
                     plot_labels = FALSE)
dev <- pp(file = "images/cf_sus_shape.png")
sp(start_cf$plot)
closed <- dev.off()
sp(start_cf$plot)

cf_nb <- normalize_expt(lp_cf_known, convert = "cpm", transform = "log2",
                        filter = TRUE, batch = "svaseq")
## Removing 162 low-count genes (8548 remaining).
## Setting 118 low elements to zero.
## transform_counts: Found 118 values equal to 0, adding 1 to the matrix.
cf_nb_pca <- plot_pca(cf_nb, plot_title = "PCA of parasite expression values",
                      plot_labels = FALSE)
dev <- pp(file = "images/cf_sus_share_nb.png")
sp(cf_nb_pca$plot)
closed <- dev.off()
sp(cf_nb_pca$plot)

cf_norm <- normalize_expt(lp_cf, transform = "log2", convert = "cpm",
                          filter = TRUE, norm = "quant")
## Removing 134 low-count genes (8576 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
test <- pca_information(cf_norm,
                        expt_factors = c("clinicalcategorical", "zymodemecategorical",
                                         "pathogenstrain", "passagenumber"),
                        num_components = 6, plot_pcas = TRUE)
test$anova_p
##                           PC1       PC2    PC3       PC4       PC5       PC6
## clinicalcategorical 7.122e-01 0.0002604 0.2263 9.377e-01 1.288e-03 2.114e-02
## zymodemecategorical 4.787e-07 0.0016215 0.5959 5.970e-02 3.966e-05 5.040e-01
## pathogenstrain      4.747e-01 0.8703328 0.6433 5.629e-05 1.889e-02 2.316e-01
## passagenumber       9.502e-01 0.1744479 0.4657 3.136e-02 8.602e-01 5.429e-06
test$cor_heatmap
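In the anova_p table above, the smallest value for PC1 belongs to zymodemecategorical (4.787e-07) and the smallest for PC2 to clinicalcategorical (0.0002604). How pca_information() derives these values is not shown in this document; one plausible reading is a one-way ANOVA of each component's sample scores against each metadata factor. The following base R sketch illustrates that idea on simulated scores and should not be taken as the function's actual implementation.

```r
set.seed(3)
## Simulated scores for one principal component across 20 samples,
## with a shift between two hypothetical groups.
group <- factor(rep(c("z2.2", "z2.3"), each = 10))
pc1 <- c(rnorm(10, mean = -2), rnorm(10, mean = 2))

## One-way ANOVA: does the component score differ by group?
fit <- aov(pc1 ~ group)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

print(p_value)
```

A small p-value indicates that the factor explains a meaningful share of the variation along that component.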

1.6 By Current drug sensitivity assay data

We have two competing metrics of antimonial sensitivity: one historical and one current. In both cases there is a reasonable expectation that resistant strains tend to be zymodeme 2.3 and sensitive strains tend to be zymodeme 2.2. There appear to be more exceptions to this rule of thumb in the current data than in the historical data.

sus_norm <- normalize_expt(lp_susceptibility, transform = "log2", convert = "cpm",
                           norm = "quant", filter = TRUE)
## Removing 134 low-count genes (8576 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
sus_pca <- plot_pca(sus_norm, plot_title = "PCA of parasite expression values",
                    plot_labels = FALSE)
dev <- pp(file = "images/sus_norm_pca.png")
sp(sus_pca[["plot"]])
closed <- dev.off()
sp(sus_pca[["plot"]])

sus_nb <- normalize_expt(lp_susceptibility, transform = "log2", convert = "cpm",
                         batch = "svaseq", filter = TRUE)
## Removing 134 low-count genes (8576 remaining).
## Setting 636 low elements to zero.
## transform_counts: Found 636 values equal to 0, adding 1 to the matrix.
sus_nb_pca <- plot_pca(sus_nb, plot_title = "PCA of parasite expression values",
                       plot_labels = FALSE)
dev <- pp(file = "images/sus_nb_pca.png")
sp(sus_nb_pca[["plot"]])
closed <- dev.off()
sp(sus_nb_pca[["plot"]])

1.7 By Historical drug sensitivity assay data

sus_hist_norm <- normalize_expt(lp_susceptibility_historical, transform = "log2", convert = "cpm",
                                norm = "quant", filter = TRUE)
## Removing 134 low-count genes (8576 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
sus_hist_pca <- plot_pca(sus_hist_norm, plot_title = "PCA of parasite expression values",
                         plot_labels = FALSE)
dev <- pp(file = "images/sus_hist_norm_pca.png")
sp(sus_hist_pca[["plot"]])
closed <- dev.off()
sp(sus_hist_pca[["plot"]])

sus_hist_nb <- normalize_expt(lp_susceptibility_historical, transform = "log2", convert = "cpm",
                              batch = "svaseq", filter = TRUE)
## Removing 134 low-count genes (8576 remaining).
## Setting 374 low elements to zero.
## transform_counts: Found 374 values equal to 0, adding 1 to the matrix.
sus_hist_nb_pca <- plot_pca(sus_hist_nb, plot_title = "PCA of parasite expression values",
                            plot_labels = FALSE)
dev <- pp(file = "images/sus_hist_nb_pca.png")
sp(sus_hist_nb_pca[["plot"]])
closed <- dev.off()
sp(sus_hist_nb_pca[["plot"]])

1.8 Zymodeme enzyme gene IDs

Najib read me an email listing the gene names associated with the zymodeme classification. I cross referenced those names against the Leishmania panamensis gene annotations and found the following:

  1. ALAT: LPAL13_120010900 – alanine aminotransferase
  2. ASAT: LPAL13_340013000 – aspartate aminotransferase
  3. G6PD: LPAL13_000054100 – glucose-6-phosphate 1-dehydrogenase
  4. NH: LPAL13_140006100, LPAL13_180018500 – inosine-guanine nucleoside hydrolase
  5. MPI: LPAL13_320022300 (maybe) – mannose phosphate isomerase (I chose phosphomannose isomerase)

Given these 6 gene IDs (NH has two gene IDs associated with it), I can do some looking for specific differences among the various samples.

1.8.1 Expression levels of zymodeme genes

The following creates a red-to-green heatmap showing the observed expression of these genes in every sample.

my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
              "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
              "other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")

zymo_expt <- exclude_genes_expt(strain_norm, ids = my_genes, method = "keep")
## Note, I renamed this to subset_genes().
## remove_genes_expt(), before removal, there were 8576 genes, now there are 6.
## There are 101 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20065 TMRC20005 TMRC20007 TMRC20008 TMRC20027 TMRC20028 TMRC20032 
##   0.08652   0.08512   0.08414   0.08695   0.08365   0.08470   0.08796   0.08394 
## TMRC20040 TMRC20066 TMRC20039 TMRC20037 TMRC20038 TMRC20067 TMRC20068 TMRC20041 
##   0.08260   0.08191   0.08481   0.08204   0.08359   0.08402   0.08449   0.08315 
## TMRC20015 TMRC20009 TMRC20010 TMRC20016 TMRC20011 TMRC20012 TMRC20013 TMRC20017 
##   0.08490   0.08382   0.08432   0.08365   0.08356   0.08550   0.08577   0.08344 
## TMRC20014 TMRC20018 TMRC20019 TMRC20070 TMRC20020 TMRC20021 TMRC20022 TMRC20025 
##   0.08400   0.08355   0.08372   0.08410   0.08220   0.08198   0.08548   0.08592 
## TMRC20024 TMRC20036 TMRC20069 TMRC20033 TMRC20026 TMRC20031 TMRC20076 TMRC20073 
##   0.08229   0.08273   0.08271   0.08278   0.08754   0.08204   0.08331   0.08490 
## TMRC20055 TMRC20079 TMRC20071 TMRC20078 TMRC20094 TMRC20042 TMRC20058 TMRC20072 
##   0.08446   0.08525   0.08434   0.08392   0.08409   0.08430   0.08318   0.08411 
## TMRC20059 TMRC20048 TMRC20057 TMRC20088 TMRC20056 TMRC20060 TMRC20077 TMRC20074 
##   0.08360   0.08241   0.08607   0.08494   0.08475   0.08320   0.08402   0.08375 
## TMRC20063 TMRC20053 TMRC20052 TMRC20064 TMRC20075 TMRC20051 TMRC20050 TMRC20049 
##   0.08251   0.08292   0.08267   0.08314   0.08374   0.08448   0.08262   0.08544 
## TMRC20062 TMRC20110 TMRC20080 TMRC20043 TMRC20083 TMRC20054 TMRC20085 TMRC20046 
##   0.08427   0.08519   0.08222   0.08343   0.08444   0.08488   0.08429   0.08544 
## TMRC20093 TMRC20089 TMRC20047 TMRC20090 TMRC20044 TMRC20045 TMRC20061 TMRC20105 
##   0.08460   0.08355   0.08430   0.08171   0.08531   0.08388   0.08348   0.08449 
## TMRC20108 TMRC20109 TMRC20098 TMRC20096 TMRC20097 TMRC20101 TMRC20092 TMRC20082 
##   0.08313   0.08458   0.08489   0.08363   0.08338   0.08366   0.08318   0.08277 
## TMRC20102 TMRC20099 TMRC20100 TMRC20091 TMRC20084 TMRC20087 TMRC20103 TMRC20104 
##   0.08338   0.08468   0.08324   0.08503   0.08319   0.08445   0.08440   0.08415 
## TMRC20086 TMRC20107 TMRC20081 TMRC20106 TMRC20095 
##   0.08366   0.08155   0.08221   0.08079   0.07790
zymo_heatmap <- plot_sample_heatmap(zymo_expt, row_label = my_names)
zymo_heatmap
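The subset above keeps 6 genes, while my_genes and my_names each carry a seventh placeholder entry ("other"), and the row order of the subset is not guaranteed to follow the order of my_genes. A small base R sketch of one way to check for unmatched IDs and to derive labels in the matrix's own row order is shown below. The toy matrix and the two filler row names are hypothetical; the gene IDs are the ones from the my_genes vector.

```r
## Map gene IDs to display names (IDs as they appear in my_genes).
gene_names <- c("LPAL13_120010900" = "ALAT", "LPAL13_340013000" = "ASAT",
                "LPAL13_000054100" = "G6PD", "LPAL13_140006100" = "NHv1",
                "LPAL13_180018500" = "NHv2", "LPAL13_320022300" = "MPI")

## Toy expression matrix whose rows are in a different order than gene_names;
## "filler_gene_1" and "filler_gene_2" are hypothetical extra rows.
toy_ids <- rev(c(names(gene_names), "filler_gene_1", "filler_gene_2"))
expr <- matrix(seq_len(length(toy_ids) * 3), nrow = length(toy_ids),
               dimnames = list(toy_ids, c("s1", "s2", "s3")))

## Requested IDs, including the placeholder that matches no gene.
wanted <- c(names(gene_names), "other")
found <- intersect(rownames(expr), wanted)
missing_ids <- setdiff(wanted, rownames(expr))

## Subset, then look up labels by row name so that they cannot drift out of order.
subset_expr <- expr[found, , drop = FALSE]
row_labels <- unname(gene_names[rownames(subset_expr)])

print(missing_ids)
print(row_labels)
```

Looking labels up by row name avoids relying on the positional correspondence between two separately maintained vectors.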

A recent suggestion asked how our amastigote TMRC2 samples, which resulted from infecting a set of macrophages, relate to these promastigote samples.

So far, we have kept these two experiments separate; now let us merge them.

tmrc2_macrophage_norm <- normalize_expt(lp_macrophage, transform="log2", convert="cpm",
                                        norm="quant", filter=TRUE)
## Removing 0 low-count genes (8710 remaining).
## transform_counts: Found 3735 values equal to 0, adding 1 to the matrix.
## Hey you, this annotation call should be made automatic for the container!
annotation(lp_expt) <- "org.Lpanamensis.MHOMCOL81L13.v46.eg.db"
annotation(lp_macrophage) <- "org.Lpanamensis.MHOMCOL81L13.v46.eg.db"
all_tmrc2 <- combine_expts(lp_expt, lp_macrophage)
## The numbers of samples by condition are:
## 
##   b2904 unknown    z1.0    z1.5    z2.0    z2.1    z2.2    z2.3    z2.4    z3.0 
##       1       2       1       1       1       7      64      70       2       1 
##    z3.2 
##       1

Before we can use the combined data, we must reconcile a few aspects of it; notably, we need to specify which samples are amastigotes and which are promastigotes.

all_nosb <- all_tmrc2
pData(all_nosb)[["stage"]] <- "promastigote"
na_idx <- is.na(pData(all_nosb)[["macrophagetreatment"]])
pData(all_nosb)[na_idx, "macrophagetreatment"] <- "undefined"
all_nosb <- subset_expt(all_nosb, subset="macrophagetreatment!='inf_sb'")
## subset_expt(): There were 151, now there are 130 samples.
ama_idx <- pData(all_nosb)[["macrophagetreatment"]] == "inf"
pData(all_nosb)[ama_idx, "stage" ] <- "amastigote"

## Make sure that the zymodeme does not have the inf_ prefix.
zymodeme_char <- gsub(x = pData(all_nosb)[["condition"]], pattern = "^inf_", replacement = "")
pData(all_nosb)[["condition"]] <- zymodeme_char

pData(all_nosb)[["batch"]] <- pData(all_nosb)[["stage"]]
all_norm <- normalize_expt(all_nosb, convert="cpm", norm="quant", transform="log2", filter=TRUE)
## Removing 85 low-count genes (8625 remaining).
## transform_counts: Found 11 values equal to 0, adding 1 to the matrix.
pro_ama_pca <- plot_pca(all_norm)
sp(pro_ama_pca[["plot"]])

I think the above picture is sort of the opposite of what we want to compare in a DE analysis for this set of data, e.g. we want to compare promastigotes to amastigotes?

two_nosb <- set_expt_batches(all_nosb, fact="condition") %>%
  set_expt_conditions(fact="stage") %>%
  subset_expt(subset="batch=='z2.2'|batch=='z2.3'")
## The number of samples by batch are:
## 
##   b2904 unknown    z1.0    z1.5    z2.0    z2.1    z2.2    z2.3    z2.4    z3.0 
##       1       2       1       1       1       7      57      56       2       1 
##    z3.2 
##       1
## The numbers of samples by condition are:
## 
##   amastigote promastigote 
##           29          101
## subset_expt(): There were 130, now there are 113 samples.
two_norm <- normalize_expt(two_nosb, convert="cpm", norm="quant",
                           transform="log2", filter=TRUE)
## Removing 97 low-count genes (8613 remaining).
## transform_counts: Found 81 values equal to 0, adding 1 to the matrix.
pro_ama_two_pca <- plot_pca(two_norm)
sp(pro_ama_two_pca[["plot"]])

zy_stage_factor <- paste0(pData(two_nosb)[["batch"]], "_",
                          pData(two_nosb)[["stage"]])
pData(two_nosb)[["zystage"]] <- zy_stage_factor
zystage <- set_expt_conditions(two_nosb, fact = "zystage")
## The numbers of samples by condition are:
## 
##   z2.2_amastigote z2.2_promastigote   z2.3_amastigote z2.3_promastigote 
##                14                43                15                41
zystage_norm <- normalize_expt(zystage, filter = TRUE, norm = "quant",
                               convert = "cpm", transform = "log2")
## Removing 97 low-count genes (8613 remaining).
## transform_counts: Found 81 values equal to 0, adding 1 to the matrix.
plot_pca(zystage_norm)$plot

zystage_keepers <- list(
  "z2322_ama" = c("z23amastigote", "z22amastigote"),
  "z2322_pro" = c("z23promastigote", "z22promastigote"),
  "proama_z23" = c("z23amastigote", "z23promastigote"),
  "proama_z22" = c("z22amastigote", "z22promastigote"))

zystage_de <- all_pairwise(zystage, filter = TRUE, model_batch = "svaseq")
## 
##   z2.2_amastigote z2.2_promastigote   z2.3_amastigote z2.3_promastigote 
##                14                43                15                41
## Removing 0 low-count genes (8613 remaining).
## Setting 1736 low elements to zero.
## transform_counts: Found 1736 values equal to 0, adding 1 to the matrix.

zystage_tables <- combine_de_tables(
  zystage_de, keepers = zystage_keepers,
  excel = glue("excel/zymodeme_stage_table-v{ver}.xlsx"))
## Adding venn plots for z2322_ama.
## Adding venn plots for z2322_pro.
## Adding venn plots for proama_z23.
## Adding venn plots for proama_z22.

2 Gene expression with respect to chromosome

I want to make a plot where the x-axis is the number of genes on a chromosome and the y-axis is the mean of the expression of those genes.

exprs_vs_chromosome <- function(expt, scaffolds = TRUE, min_genes = 10) {
  start <- data.frame(row.names = unique(fData(expt)[["chromosome"]]))
  start[["genes"]] <- 0
  start[["exprs_mean"]] <- 0
  start[["exprs_stdev"]] <- 0
  start[["exprs_var"]] <- 0
  start[["exprs_min"]] <- 0
  start[["exprs_qt1"]] <- 0
  start[["exprs_qt3"]] <- 0
  start[["exprs_median"]] <- 0
  start[["exprs_max"]] <- 0
  for (ch in rownames(start)) {
    gene_id_idx <- fData(expt)[["chromosome"]] == ch
    gene_ids <- rownames(fData(expt))[gene_id_idx]
    start[ch, "genes"] <- length(gene_ids)
    subset_exprs <- exprs(expt)[gene_ids, ]
    if (length(gene_ids) == 1) {
      start[ch, "exprs_mean"] <- mean(subset_exprs)
      start[ch, "exprs_stdev"] <- stats::sd(subset_exprs)
      start[ch, "exprs_min"] <- min(subset_exprs)
      start[ch, "exprs_max"] <- max(subset_exprs)
      start[ch, "exprs_median"] <- median(subset_exprs)
      start[ch, "exprs_var"] <- stats::var(subset_exprs)
      start[ch, "exprs_qt1"] <- as.numeric(summary(subset_exprs))[2]
      start[ch, "exprs_qt3"] <- as.numeric(summary(subset_exprs))[5]
    } else {
      start[ch, "exprs_mean"] <- mean(rowMeans(subset_exprs))
      start[ch, "exprs_stdev"] <- stats::sd(rowMeans(subset_exprs))
      start[ch, "exprs_min"] <- min(rowMeans(subset_exprs))
      start[ch, "exprs_max"] <- max(rowMeans(subset_exprs))
      start[ch, "exprs_median"] <- median(rowMeans(subset_exprs))
      start[ch, "exprs_var"] <- stats::var(rowMeans(subset_exprs))
      start[ch, "exprs_qt1"] <- as.numeric(summary(rowMeans(subset_exprs)))[2]
      start[ch, "exprs_qt3"] <- as.numeric(summary(rowMeans(subset_exprs)))[5]
    }

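  }

  ## Filter and plot only after every chromosome has been summarized.
  min_idx <- start[["genes"]] >= min_genes
  start <- start[min_idx, ]

  mean_plot <- ggplot(start, aes(y = exprs_mean, x = genes)) +
    geom_point() +
    scale_y_log10()
  var_plot <- ggplot(start, aes(y = exprs_var, x = genes)) +
    geom_point()
  ## Return the summary table and both plots instead of discarding them.
  retlist <- list(
    "table" = start,
    "mean_plot" = mean_plot,
    "var_plot" = var_plot)
  return(retlist)
}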
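For comparison, the core of the per-chromosome summary (gene count and mean expression, followed by a minimum-gene filter) can be sketched with base R's tapply() on a made-up table. The data frame, its column names, and the threshold below are illustrative assumptions and do not come from the expressionset.

```r
## Toy annotation: hypothetical chromosome assignments and per-gene mean expression.
gene_info <- data.frame(
  chromosome = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr3"),
  mean_exprs = c(10, 20, 30, 5, 15, 100))

## Number of genes and mean expression per chromosome.
genes_per_chr <- tapply(gene_info$mean_exprs, gene_info$chromosome, length)
mean_per_chr <- tapply(gene_info$mean_exprs, gene_info$chromosome, mean)

chr_summary <- data.frame(genes = as.numeric(genes_per_chr),
                          exprs_mean = as.numeric(mean_per_chr),
                          row.names = names(genes_per_chr))

## Drop chromosomes/scaffolds with too few genes (threshold chosen for the toy data).
min_genes <- 2
chr_summary <- chr_summary[chr_summary$genes >= min_genes, ]

print(chr_summary)
```

The resulting table has one row per chromosome, so plotting genes on the x-axis against exprs_mean on the y-axis gives the figure described at the start of this section.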

3 SNP profiles

One potentially interesting aspect of the variant data: it may be able to help us define the zymodeme state of previous, untested samples.

In order to test this, I am loading some of the 2016 data alongside the new TMRC2 data to see if they fit together.

This uses an older dataset which I am not sure we have permission to include in the container, so I am turning it off for now.

old_expt <- create_expt("sample_sheets/tmrc2_samples_20191203.xlsx",
                        file_column = "tophat2file")

tt <- old_expt$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
old_expt$expressionset <- tt
rm(tt)

3.1 Create the SNP expressionset

One other important caveat: we have a group of new samples which have not yet run through the variant search pipeline, so I need to remove them from consideration. Though it looks like they finished overnight…

In the non-containerized version of this document, the following block combines an older dataset with the current data.

both_norm <- normalize_expt(lp_snps, transform = "log2", norm = "quant") %>%
  set_expt_conditions(fact = "strain")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'expt' in selecting a method for function 'state': object 'lp_snps' not found

The data structure ‘both_norm’ now contains our 2016 data along with the newer data collected since 2019.

3.2 Plot of SNP profiles for zymodemes

The following plot shows the SNP profiles of all samples (old and new) where the colors at the top show either the 2.2 strains (orange), 2.3 strains (green), the previous samples (purple), or the various lab strains (pink etc).

new_variant_heatmap <- plot_disheat(new_snps)
dev <- pp(file = "images/raw_snp_disheat.png", height=12, width=12)
new_variant_heatmap$plot
closed <- dev.off()
new_variant_heatmap$plot

The function get_snp_sets() takes the provided metadata factor (in this case ‘condition’) and looks for variants which are exclusive to each element in it. In this case, this is looking for differences between 2.2 and 2.3, as well as the set shared among them.

snp_sets <- get_snp_sets(new_snps, factor = "condition")
## The samples represent the following categories:
## 
##   b2904 unknown    z1.0    z1.5    z2.0    z2.1    z2.2    z2.3    z2.4    z3.0 
##       1       2       1       1       1       7      43      41       2       1 
##    z3.2 
##       1
## Using a proportion of observed variants, converting the data to binary observations.
## The factor b2904 has only 1 row.
## The factor unknown has 2 rows.
## The factor z1.0 has only 1 row.
## The factor z1.5 has only 1 row.
## The factor z2.0 has only 1 row.
## The factor z2.1 has 7 rows.
## The factor z2.2 has 43 rows.
## The factor z2.3 has 41 rows.
## The factor z2.4 has 2 rows.
## The factor z3.0 has only 1 row.
## The factor z3.2 has only 1 row.
## Finished iterating over the chromosomes.
snp_sets
## A set of variants observed when cross referencing all variants against
## the samples associated with each metadata factor: condition.  11
## categories and 1514127 variants were observed with 640
## combinations among them.  726 chromosomes/scaffolds were observed with a
## density of variants ranging from 0.000652315720808871 to 0.151111111111111.
##Biobase::annotation(old_expt$expressionset) = Biobase::annotation(lp_expt$expressionset)
##both_expt <- combine_expts(lp_expt, old_expt)
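To make the idea of 'variants exclusive to each element of a factor' concrete, here is a toy sketch in base R. It does not reproduce the internals of get_snp_sets(); the matrix, the sample names, and the cutoff of 'observed in every sample of the group' are all assumptions made for illustration.

```r
## Toy binary matrix: rows are variant positions, columns are samples.
## All names and values here are invented for illustration.
obs <- matrix(c(1, 1, 0, 0,
                0, 0, 1, 1,
                1, 1, 1, 1,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("var", 1:4), paste0("s", 1:4)))
groups <- c(s1 = "z2.2", s2 = "z2.2", s3 = "z2.3", s4 = "z2.3")

## Proportion of samples in each group in which each variant was observed.
prop_by_group <- sapply(unique(groups), function(g) {
  rowMeans(obs[, groups == g, drop = FALSE])
})

## Call a variant 'present' in a group when every sample has it
## (an assumed cutoff; the real function may use a different proportion).
present <- prop_by_group >= 1

only_z22 <- rownames(present)[present[, "z2.2"] & !present[, "z2.3"]]
only_z23 <- rownames(present)[present[, "z2.3"] & !present[, "z2.2"]]
shared   <- rownames(present)[present[, "z2.2"] & present[, "z2.3"]]
```

In this toy case var4 is seen in only half of the z2.2 samples, so it falls into none of the three sets; where to draw that line is the main design choice.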

snp_genes <- sm(snps_vs_genes(lp_expt, snp_sets, expt_name_col = "chromosome"))
snp_genes
## When the variants observed were cross referenced against annotated genes,
## 8662 genes were observed with at least 1 variant.  LPAL13_250017600 had the most variants, with
## 1297.
## I think we have some metrics here we can plot...
snp_subset <- snp_subset_genes(
  lp_expt, new_snps,
  genes = c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
            "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300"))
## Note, I renamed this to subset_genes().
## remove_genes_expt(), before removal, there were 1514127 genes, now there are 179.
## There are 101 samples which kept less than 90 percent counts.
## tmrc20001 tmrc20065 tmrc20005 tmrc20007 tmrc20008 tmrc20027 tmrc20028 tmrc20032 
## 0.0363994 0.0284342 0.0446300 0.0678958 0.0000000 0.0594351 0.0753021 0.0353544 
## tmrc20040 tmrc20066 tmrc20039 tmrc20037 tmrc20038 tmrc20067 tmrc20068 tmrc20041 
## 0.0204152 0.0244539 0.0218095 0.0228205 0.0244650 0.0259861 0.0275633 0.0084708 
## tmrc20015 tmrc20009 tmrc20010 tmrc20016 tmrc20011 tmrc20012 tmrc20013 tmrc20017 
## 0.0249880 0.0000000 0.0278667 0.0232143 0.0243409 0.0778398 0.0294979 0.0102837 
## tmrc20014 tmrc20018 tmrc20019 tmrc20070 tmrc20020 tmrc20021 tmrc20022 tmrc20025 
## 0.0191370 0.0239034 0.0282985 0.0274939 0.0235432 0.0286477 0.0000000 0.0624989 
## tmrc20024 tmrc20036 tmrc20069 tmrc20033 tmrc20026 tmrc20031 tmrc20076 tmrc20073 
## 0.0212984 0.0089603 0.0270318 0.0019682 0.0352553 0.0199682 0.0270505 0.0282364 
## tmrc20055 tmrc20079 tmrc20071 tmrc20078 tmrc20094 tmrc20042 tmrc20058 tmrc20072 
## 0.0395199 0.0280768 0.0247556 0.0177539 0.0279169 0.0398656 0.0256906 0.0158464 
## tmrc20059 tmrc20048 tmrc20057 tmrc20088 tmrc20056 tmrc20060 tmrc20077 tmrc20074 
## 0.0251221 0.0238737 0.0062348 0.0349161 0.0003009 0.0294943 0.0340198 0.0282781 
## tmrc20063 tmrc20053 tmrc20052 tmrc20064 tmrc20075 tmrc20051 tmrc20050 tmrc20049 
## 0.0017690 0.0202252 0.0274156 0.0280939 0.0236363 0.0297070 0.0324800 0.0338522 
## tmrc20062 tmrc20110 tmrc20080 tmrc20043 tmrc20083 tmrc20054 tmrc20085 tmrc20046 
## 0.0314505 0.0350400 0.0288995 0.0273341 0.0121428 0.0298779 0.0251593 0.0054459 
## tmrc20093 tmrc20089 tmrc20047 tmrc20090 tmrc20044 tmrc20045 tmrc20061 tmrc20105 
## 0.0065508 0.0269888 0.0299476 0.0273432 0.0316706 0.0052405 0.0263012 0.0303013 
## tmrc20108 tmrc20109 tmrc20098 tmrc20096 tmrc20097 tmrc20101 tmrc20092 tmrc20082 
## 0.0267368 0.0184037 0.0265627 0.0202590 0.0107502 0.0282691 0.0024627 0.0012027 
## tmrc20102 tmrc20099 tmrc20100 tmrc20091 tmrc20084 tmrc20087 tmrc20103 tmrc20104 
## 0.0250142 0.0291960 0.0266043 0.0274472 0.0063494 0.0291851 0.0065616 0.0265140 
## tmrc20086 tmrc20107 tmrc20081 tmrc20106 tmrc20095 
## 0.0251956 0.0200527 0.0133293 0.0134896 0.0136510
zymo_heat <- plot_sample_heatmap(snp_subset, row_label = rownames(exprs(snp_subset)))
zymo_heat

3.3 Compare variants to DE genes

Najib has asked a few times about the relationship between variants and DE genes. In subsequent conversations I figured out that what he really wants to learn about is variants in the UTR (most likely 5’) which might affect the expression of genes. The following explicitly does not address that question, but asks a parallel one: is there a relationship between variants in the CDS and differential expression?

3.3.1 Collect DE data

In order to do this comparison, we need to reload some of the DE results.

These blocks need to be moved to post-differential analyses

rda <- glue("rda/zymo_tables_sva-v{ver}.rda")
varname <- gsub(x = basename(rda), pattern = "\\.rda", replacement = "")
loaded <- load(file = rda)
zy_df <- get0(varname)[["data"]][["zymodeme"]]
vars_df <- data.frame(ID = names(snp_genes$summary_by_gene), variants = as.numeric(snp_genes$summary_by_gene))
vars_df[["variants"]] <- log2(vars_df[["variants"]] + 1)
vars_by_de_gene <- merge(zy_df, vars_df, by.x="row.names", by.y="ID")
cor.test(vars_by_de_gene$deseq_logfc, vars_by_de_gene$variants)
variants_wrt_logfc <- plot_linear_scatter(vars_by_de_gene[, c("deseq_logfc", "variants")])
variants_wrt_logfc$scatter
## It looks like there might be some genes of interest, even though this is not actually
## the question of interest.
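The comparison above boils down to a merge on gene ID followed by a correlation test. A self-contained sketch on simulated data follows; the gene names, the counts, and the choice of Spearman rather than the cor.test() default are all assumptions of mine, not part of the analysis.

```r
set.seed(1)
## Hypothetical inputs: a DE table keyed by gene ID and per-gene variant counts.
de_df <- data.frame(deseq_logfc = rnorm(50),
                    row.names = sprintf("gene%02d", 1:50))
variant_counts <- setNames(rpois(40, lambda = 5), sprintf("gene%02d", 11:50))

vars_toy <- data.frame(ID = names(variant_counts),
                       variants = log2(as.numeric(variant_counts) + 1))

## merge() on "row.names" keeps only the genes present in both tables.
merged <- merge(de_df, vars_toy, by.x = "row.names", by.y = "ID")

## Spearman is used here as an assumption because counts tend to be skewed;
## the block above uses the Pearson default of cor.test().
toy_test <- cor.test(merged[["deseq_logfc"]], merged[["variants"]],
                     method = "spearman", exact = FALSE)
toy_test[["estimate"]]
```

Because the simulated values are independent, the estimate should be near zero; the point is only the shape of the inputs and the merge.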

Didn’t I create a set of densities by chromosome? Oh I think they come in from get_snp_sets()

3.4 SNPS associated with clinical response in the TMRC samples

clinical_sets <- get_snp_sets(new_snps, factor = "clinicalresponse")
## The samples represent the following categories:
## 
##                                  cure                               failure 
##                                    40                                    36 
##                   failure miltefosine                       laboratory line 
##                                     1                                     1 
## laboratory line miltefosine resistant                                    nd 
##                                     1                                    18 
##                      reference strain 
##                                     4
## Using a proportion of observed variants, converting the data to binary observations.
## The factor cure has 40 rows.
## The factor failure has 36 rows.
## The factor failure miltefosine has only 1 row.
## The factor laboratory line has only 1 row.
## The factor laboratory line miltefosine resistant has only 1 row.
## The factor nd has 18 rows.
## The factor reference strain has 4 rows.
## Finished iterating over the chromosomes.
clinical_sets
## A set of variants observed when cross referencing all variants against
## the samples associated with each metadata factor: clinicalresponse.  7
## categories and 1514127 variants were observed with 95
## combinations among them.  726 chromosomes/scaffolds were observed with a
## density of variants ranging from 0.000652315720808871 to 0.151111111111111.
density_vec <- clinical_sets[["density"]]
chromosome_idx <- grep(pattern = "LpaL", x = names(density_vec))
density_df <- as.data.frame(density_vec[chromosome_idx])
density_df[["chr"]] <- rownames(density_df)
colnames(density_df) <- c("density_vec", "chr")
ggplot(density_df, aes_string(x = "chr", y = "density_vec")) +
  ggplot2::geom_col() +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 10, colour = "black"),
                 axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))

## clinical_written <- write_variants(new_snps)

3.4.1 Cross reference these variants by gene

clinical_genes <- snps_vs_genes(lp_expt, clinical_sets, expt_name_col = "chromosome")
## The snp grange data has 1514127 elements.
## There are 678944 overlapping variants and genes.
snp_density <- merge(as.data.frame(clinical_genes[["summary"]]),
                     as.data.frame(fData(lp_expt)),
                     by = "row.names")
snp_density <- snp_density[, c(1, 2, 4, 15)]
colnames(snp_density) <- c("name", "snps", "product", "length")
snp_density[["product"]] <- tolower(snp_density[["product"]])
snp_density[["length"]] <- as.numeric(snp_density[["length"]])
snp_density[["density"]] <- as.numeric(snp_density[["snps"]]) / snp_density[["length"]]
snp_idx <- order(snp_density[["density"]], decreasing = TRUE)
snp_density <- snp_density[snp_idx, ]

removers <- c("amastin", "gp63", "leishmanolysin")
for (r in removers) {
  drop_idx <- grepl(pattern = r, x = snp_density[["product"]])
  snp_density <- snp_density[!drop_idx, ]
}
## Filter these for [A|a]mastin gp63 Leishmanolysin
clinical_snps <- snps_intersections(lp_expt, clinical_sets, chr_column = "chromosome")

fail_ref_snps <- as.data.frame(clinical_snps[["inters"]][["failure, reference strain"]])
fail_ref_snps <- rbind(fail_ref_snps,
                       as.data.frame(clinical_snps[["inters"]][["failure"]]))
cure_snps <- as.data.frame(clinical_snps[["inters"]][["cure"]])

head(fail_ref_snps)
## [1] seqnames start    end      width    strand  
## <0 rows> (or 0-length row.names)
head(cure_snps)
##                                        seqnames   start     end width strand
## chr_LpaL13-34_pos_1370748_ref_C_alt_G LpaL13-34 1370748 1370749     2      +
write.csv(file="excel/cure_variants.txt", x = rownames(cure_snps))
write.csv(file="excel/fail_variants.txt", x = rownames(fail_ref_snps))

annot <- fData(lp_expt)
clinical_interest <- as.data.frame(clinical_snps[["gene_summaries"]][["cure"]])
clinical_interest <- merge(clinical_interest,
                           as.data.frame(clinical_snps[["gene_summaries"]][["failure, reference strain"]]),
                           by = "row.names")
rownames(clinical_interest) <- clinical_interest[["Row.names"]]
clinical_interest[["Row.names"]] <- NULL
colnames(clinical_interest) <- c("cure_snps","fail_snps")
annot <- merge(annot, clinical_interest, by = "row.names")
rownames(annot) <- annot[["Row.names"]]
annot[["Row.names"]] <- NULL
fData(lp_expt$expressionset) <- annot

4 Zymodeme for new samples

The heatmap produced here should show the variants only for the zymodeme genes.

4.1 Hunt for snp clusters

I am thinking that if we find clusters of locations which are variant, that might provide some PCR testing possibilities.

## Drop the 2.1, 2.4, unknown, and null
pruned_snps <- subset_expt(new_snps, subset = "condition=='z2.2'|condition=='z2.3'")
## subset_expt(): There were 101, now there are 84 samples.
new_sets <- get_snp_sets(pruned_snps, factor = "zymodemecategorical")
## The samples represent the following categories:
## 
## z22 z23 
##  43  41
## Using a proportion of observed variants, converting the data to binary observations.
## The factor z22 has 43 rows.
## The factor z23 has 41 rows.
## Finished iterating over the chromosomes.
summary(new_sets)
##               Length Class      Mode     
## factor          1    -none-     character
## values          2    data.frame list     
## observations    3    data.frame list     
## possibilities   2    -none-     character
## intersections   3    -none-     list     
## chr_data      726    -none-     list     
## set_names       4    -none-     list     
## invert_names    4    -none-     list     
## density       726    -none-     numeric
## 1000000: 2.2
## 0100000: 2.3

summary(new_sets[["intersections"]][["10"]])
##    Length     Class      Mode 
##       932 character character
write.csv(file = "excel/variants_22.csv", x = new_sets[["intersections"]][["10"]])
summary(new_sets[["intersections"]][["01"]])
##    Length     Class      Mode 
##     66865 character character
write.csv(file = "excel/variants_23.csv", x = new_sets[["intersections"]][["01"]])

Thus we see that there are 932 variants associated with 2.2 and 66,865 associated with 2.3.

4.1.1 A small function for searching for potential PCR primers

The following function uses the positional data to look for sequential mismatches associated with zymodeme in the hopes that there will be some regions which would provide good potential targets for a PCR-based assay.

sequential_variants <- function(snp_sets, conditions = NULL, minimum = 3, maximum_separation = 3) {
  if (is.null(conditions)) {
    conditions <- 1
  }
  intersection_sets <- snp_sets[["intersections"]]
  intersection_names <- snp_sets[["set_names"]]
  chosen_intersection <- 1
  if (is.numeric(conditions)) {
    chosen_intersection <- conditions
  } else {
    intersection_idx <- intersection_names == conditions
    chosen_intersection <- names(intersection_names)[intersection_idx]
  }

  possible_positions <- intersection_sets[[chosen_intersection]]
  position_table <- data.frame(row.names = possible_positions)
  pat <- "^chr_(.+)_pos_(.+)_ref_.*$"
  position_table[["chr"]] <- gsub(pattern = pat, replacement = "\\1", x = rownames(position_table))
  position_table[["pos"]] <- as.numeric(gsub(pattern = pat, replacement = "\\2", x = rownames(position_table)))
  position_idx <- order(position_table[, "chr"], position_table[, "pos"])
  position_table <- position_table[position_idx, ]
  position_table[["dist"]] <- 0

  last_chr <- ""
  for (r in seq_len(nrow(position_table))) {
    this_chr <- position_table[r, "chr"]
    if (r == 1) {
      position_table[r, "dist"] <- position_table[r, "pos"]
      last_chr <- this_chr
      next
    }
    if (this_chr == last_chr) {
      position_table[r, "dist"] <- position_table[r, "pos"] - position_table[r - 1, "pos"]
    } else {
      position_table[r, "dist"] <- position_table[r, "pos"]
    }
    last_chr <- this_chr
  }

  ## Working interactively here.

  doubles <- position_table[["dist"]] == 1
  doubles <- position_table[doubles, ]
  write.csv(doubles, "doubles.csv")

  one_away <- position_table[["dist"]] == 2
  one_away <- position_table[one_away, ]
  write.csv(one_away, "one_away.csv")

  two_away <- position_table[["dist"]] == 3
  two_away <- position_table[two_away, ]
  write.csv(two_away, "two_away.csv")

  combined <- rbind(doubles, one_away)
  combined <- rbind(combined, two_away)
  position_idx <- order(combined[, "chr"], combined[, "pos"])
  combined <- combined[position_idx, ]

  last_chr <- ""
  for (r in seq_len(nrow(combined))) {
    this_chr <- combined[r, "chr"]
    if (r == 1) {
      combined[r, "dist_pair"] <- combined[r, "pos"]
      last_chr <- this_chr
      next
    }
    if (this_chr == last_chr) {
      combined[r, "dist_pair"] <- combined[r, "pos"] - combined[r - 1, "pos"]
    } else {
      combined[r, "dist_pair"] <- combined[r, "pos"]
    }
    last_chr <- this_chr
  }

  dist_pair_maximum <- 1000
  dist_pair_minimum <- 200
  dist_pair_idx <- combined[["dist_pair"]] <= dist_pair_maximum &
    combined[["dist_pair"]] >= dist_pair_minimum
  remaining <- combined[dist_pair_idx, ]
  no_weak_idx <- grepl(pattern="ref_(G|C)", x=rownames(remaining))
  remaining <- remaining[no_weak_idx, ]

  print(head(table(position_table[["dist"]])))
  sequentials <- position_table[["dist"]] <= maximum_separation
  message("There are ", sum(sequentials), " candidate regions.")

  ## The following can tell me how many runs of each length occurred, that is not quite what I want.
  ## Now use run length encoding to find the set of sequential sequentials!
  rle_result <- rle(sequentials)
  rle_values <- rle_result[["values"]]
  ## The following line is equivalent to just leaving values alone:
  ## true_values <- rle_result[["values"]] == TRUE
  rle_lengths <- rle_result[["lengths"]]
  true_sequentials <- rle_lengths[rle_values]
  rle_idx <- cumsum(rle_lengths)[which(rle_values)]

  position_table[["last_sequential"]] <- 0
  count <- 0
  for (r in rle_idx) {
    count <- count + 1
    position_table[r, "last_sequential"] <- true_sequentials[count]
  }
  message("The maximum sequential set is: ", max(position_table[["last_sequential"]]), ".")

  wanted_idx <- position_table[["last_sequential"]] >= minimum
  wanted <- position_table[wanted_idx, c("chr", "pos")]
  return(wanted)
}

zymo22_sequentials <- sequential_variants(new_sets, conditions = "z22", minimum=1, maximum_separation=2)
dim(zymo22_sequentials)
## 7 candidate regions for zymodeme 2.2 -- thus I am betting that the reference strain is a 2.2
zymo23_sequentials <- sequential_variants(new_sets, conditions = "z23",
                                          minimum = 2, maximum_separation = 2)
dim(zymo23_sequentials)
## In contrast, there are lots (587) of interesting regions for 2.3!
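The core of sequential_variants() is two steps: the distance from each variant to the previous one on the same chromosome, then run-length encoding to find stretches of closely spaced variants. The sketch below shows those two steps on an invented table using ave() and rle() instead of explicit loops; the chromosome names and coordinates are made up.

```r
## Toy positions; chromosome names and coordinates are invented.
toy <- data.frame(chr = c("chrA", "chrA", "chrA", "chrA", "chrB", "chrB"),
                  pos = c(100, 101, 103, 500, 20, 21))
toy <- toy[order(toy[["chr"]], toy[["pos"]]), ]

## Distance to the previous variant on the same chromosome; the first
## variant of each chromosome gets its own position, as in the loop above.
toy[["dist"]] <- ave(toy[["pos"]], toy[["chr"]],
                     FUN = function(p) c(p[1], diff(p)))

## Flag variants within 'maximum_separation' of their predecessor and
## use run-length encoding to measure each consecutive run of flags.
maximum_separation <- 3
close_to_previous <- toy[["dist"]] <= maximum_separation
runs <- rle(close_to_previous)
run_ends <- cumsum(runs[["lengths"]])[runs[["values"]]]
run_sizes <- runs[["lengths"]][runs[["values"]]]

## Record the size of each run on its final row.
toy[["last_sequential"]] <- 0
toy[run_ends, "last_sequential"] <- run_sizes
toy
```

One caveat carried over from the loop version: a variant sitting within the first few bases of a chromosome would be flagged as 'close' because its distance is its own position.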

4.1.2 Extract a promising region from the genome

The first 4 candidate regions from my set of remaining:

  • Chr Pos. Distance
  • LpaL13-15 238433 448
  • LpaL13-18 142844 613
  • LpaL13-29 830342 252
  • LpaL13-33 1331507 843

Let's define a couple of terms:

  • Third: Each of the 4 above positions.
  • Second: Third - Distance
  • End: Third + PrimerLen
  • Start: Second - PrimerLen

In each instance, these are the last positions, so we want to grab three things:

  • The entire region from Start -> End, this way we can have a quick sanity check.
  • Start -> Second.
  • (Third -> End) <- Reverse complemented
## * LpaL13-15 238433 448
first_candidate_chr <- lp_genome[["LpaL13_15"]]
primer_length <- 22
amplicon_length <- 448
first_candidate_third <- 238433
first_candidate_second <- first_candidate_third - amplicon_length
first_candidate_start <- first_candidate_second - primer_length
first_candidate_end <- first_candidate_third + primer_length
first_candidate_region <- subseq(first_candidate_chr, first_candidate_start, first_candidate_end)
first_candidate_region
first_candidate_5p <- subseq(first_candidate_chr, first_candidate_start, first_candidate_second)
as.character(first_candidate_5p)
first_candidate_3p <- spgs::reverseComplement(subseq(first_candidate_chr, first_candidate_third, first_candidate_end))
first_candidate_3p

## * LpaL13-18 142844 613
second_candidate_chr <- lp_genome[["LpaL13_18"]]
primer_length <- 22
amplicon_length <- 613
second_candidate_third <- 142844
second_candidate_second <- second_candidate_third - amplicon_length
second_candidate_start <- second_candidate_second - primer_length
second_candidate_end <- second_candidate_third + primer_length
second_candidate_region <- subseq(second_candidate_chr, second_candidate_start, second_candidate_end)
second_candidate_region
second_candidate_5p <- subseq(second_candidate_chr, second_candidate_start, second_candidate_second)
as.character(second_candidate_5p)
second_candidate_3p <- spgs::reverseComplement(subseq(second_candidate_chr, second_candidate_third, second_candidate_end))
second_candidate_3p


## * LpaL13-29 830342 252
third_candidate_chr <- lp_genome[["LpaL13_29"]]
primer_length <- 22
amplicon_length <- 252
third_candidate_third <- 830342
third_candidate_second <- third_candidate_third - amplicon_length
third_candidate_start <- third_candidate_second - primer_length
third_candidate_end <- third_candidate_third + primer_length
third_candidate_region <- subseq(third_candidate_chr, third_candidate_start, third_candidate_end)
third_candidate_region
third_candidate_5p <- subseq(third_candidate_chr, third_candidate_start, third_candidate_second)
as.character(third_candidate_5p)
third_candidate_3p <- spgs::reverseComplement(subseq(third_candidate_chr, third_candidate_third, third_candidate_end))
third_candidate_3p
## You are a garbage polypyrimidine tract.
## Which is actually interesting if the mutations mess it up.


## * LpaL13-33 1331507 843
fourth_candidate_chr <- lp_genome[["LpaL13_33"]]
primer_length <- 22
amplicon_length <- 843
fourth_candidate_third <- 1331507
fourth_candidate_second <- fourth_candidate_third - amplicon_length
fourth_candidate_start <- fourth_candidate_second - primer_length
fourth_candidate_end <- fourth_candidate_third + primer_length
fourth_candidate_region <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_end)
fourth_candidate_region
fourth_candidate_5p <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_second)
as.character(fourth_candidate_5p)
fourth_candidate_3p <- spgs::reverseComplement(subseq(fourth_candidate_chr, fourth_candidate_third, fourth_candidate_end))
fourth_candidate_3p
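The four blocks above repeat the same coordinate arithmetic, so it could be wrapped in a small helper. The sketch below is hypothetical: primer_window() and reverse_complement() are names I made up, and it works on a plain character string with base R rather than on the genome object, so that it runs without any Bioconductor packages.

```r
## Hypothetical helper: reverse complement of a DNA string in base R.
reverse_complement <- function(dna) {
  complemented <- chartr("ACGTacgt", "TGCAtgca", dna)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

## Given a chromosome sequence as a plain string, the position of the last
## variant ('third') and the distance back to the previous cluster, return
## the full window and both primer candidates.
primer_window <- function(chr_seq, third, distance, primer_length = 22) {
  second <- third - distance
  start <- second - primer_length
  end <- third + primer_length
  stopifnot(start >= 1, end <= nchar(chr_seq))
  list(
    region = substr(chr_seq, start, end),
    forward = substr(chr_seq, start, second),
    reverse = reverse_complement(substr(chr_seq, third, end)))
}

## Toy sequence standing in for a chromosome; this is not real genome data.
set.seed(42)
toy_chr <- paste(sample(c("A", "C", "G", "T"), 2000, replace = TRUE),
                 collapse = "")
candidate <- primer_window(toy_chr, third = 1500, distance = 448)
nchar(candidate[["region"]])
```

Note that the coordinates are inclusive at both ends, so each primer candidate is primer_length + 1 bases long, the same as in the subseq() calls above.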

4.2 Go hunting for Sanger sequencing regions

I made a fun little function which should find regions which have lots of variants associated with a given experimental factor.

pheno <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
pheno <- subset_expt(pheno, subset = "!is.na(pData(pheno)[['bcftable']])")
## pheno_snps <- count_expt_snps(pheno, annot_column = "freebayessummary", snp_column="PAIRED")
pheno_snps <- sm(count_expt_snps(pheno, annot_column = "bcftable"))

fun_stuff <- snp_density_primers(
  pheno_snps,
  bsgenome = "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53",
  gff = "reference/TriTrypDB-53_LpanamensisMHOMCOL81L13.gff")
drop_scaffolds <- grepl(x = rownames(fun_stuff$favorites), pattern = "SCAF")
favorite_primer_regions <- fun_stuff[["favorites"]][!drop_scaffolds, ]
favorite_primer_regions[["bin"]] <- rownames(favorite_primer_regions)
library(dplyr)
favorite_primer_regions <- favorite_primer_regions %>%
  relocate(bin)

4.3 Combine this table with 2.2/2.3 genes

Here is my note from our meeting:

Cross reference primers to DE genes of 2.2/2.3 and/or resistance/susceptible, add a column to the primer spreadsheet with the DE genes (in retrospect I am guessing this actually means to put the logFC as a column).

One nice thing, I did a semantic removal on the lp_expt, so the set of logFC/pvalues should not have any of the offending types; thus I should be able to automagically get rid of them in the merge.

This block needs to go after differential expression analyses.

logfc <- zy_table_sva[["data"]][["z23_vs_z22"]]
logfc_columns <- logfc[, c("deseq_logfc", "deseq_adjp")]
colnames(logfc_columns) <- c("z23_logfc", "z23_adjp")
new_table <- merge(favorite_primer_regions, logfc_columns,
                   by.x = "closest_gene_before_id", by.y = "row.names")
sus <- sus_table_sva[["data"]][["sensitive_vs_resistant"]]
sus_columns <- sus[, c("deseq_logfc", "deseq_adjp")]
colnames(sus_columns) <- c("sus_logfc", "sus_adjp")
new_table <- merge(new_table, sus_columns,
                   by.x = "closest_gene_before_id", by.y = "row.names") %>%
  relocate(bin)
written <- write_xlsx(data = new_table,
                      excel = "excel/favorite_primers_xref_zy_sus.xlsx")
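The merges above are inner joins, which is what silently drops any primer region whose closest gene is missing from the DE table. A toy illustration follows; the bins, gene IDs, and values are invented, and all.x = TRUE is shown only as the alternative if one wanted to keep those rows.

```r
## Invented primer regions and DE results for illustration only.
primers <- data.frame(bin = c("chrA_1", "chrA_2", "chrB_1"),
                      closest_gene_before_id = c("g1", "g2", "g9"))
de <- data.frame(z23_logfc = c(1.5, -0.3),
                 z23_adjp = c(0.01, 0.6),
                 row.names = c("g1", "g2"))

## Inner join: the region next to g9 disappears because g9 has no DE row.
inner <- merge(primers, de,
               by.x = "closest_gene_before_id", by.y = "row.names")

## Left join: every primer region is kept and missing DE values become NA.
left <- merge(primers, de,
              by.x = "closest_gene_before_id", by.y = "row.names",
              all.x = TRUE)
nrow(inner)
nrow(left)
```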

4.4 Make a heatmap describing the clustering of variants

We can cross reference the variants against the zymodeme status and plot a heatmap of the results and hopefully see how they separate.

snp_genes <- sm(snps_vs_genes(lp_expt, new_sets, expt_name_col = "chromosome"))

clinical_colors_v2 <- list(
  "z22" = "#0000cc",
  "z23" = "#cc0000")
new_zymo_norm <- normalize_expt(pruned_snps, normq = "quant") %>%
  set_expt_conditions(fact = "zymodemecategorical") %>%
  set_expt_colors(clinical_colors_v2)
## The numbers of samples by condition are:
## 
## z22 z23 
##  43  41
zymo_heat <- plot_disheat(new_zymo_norm)
dev <- pp(file = "images/onlyz22_z23_snp_heatmap.pdf", width=12, height=12)
zymo_heat[["plot"]]
closed <- dev.off()
zymo_heat[["plot"]]

4.4.1 Annotated heatmap of variants

Now let us try to make a heatmap which includes some of the annotation data.

des <- both_norm[["design"]]
## Error in eval(expr, envir, enclos): object 'both_norm' not found
undef_idx <- is.na(des[["strain"]])
## Error in eval(expr, envir, enclos): object 'des' not found
des[undef_idx, "strain"] <- "unknown"
## Error: object 'des' not found
##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(both_norm))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'both_norm' not found
na_idx <- is.na(correlations)
## Error in eval(expr, envir, enclos): object 'correlations' not found
correlations[na_idx] <- 0
## Error: object 'correlations' not found
zymo_missing_idx <- is.na(des[["zymodemecategorical"]])
## Error in eval(expr, envir, enclos): object 'des' not found
des[["zymodemecategorical"]] <- as.character(des[["zymodemecategorical"]])
## Error in eval(expr, envir, enclos): object 'des' not found
des[["clinicalcategorical"]] <- as.character(des[["clinicalcategorical"]])
## Error in eval(expr, envir, enclos): object 'des' not found
des[zymo_missing_idx, "zymodemecategorical"] <- "unknown"
## Error: object 'des' not found
mydendro <- list(
  "clustfun" = hclust,
  "lwd" = 2.0)
col_data <- as.data.frame(des[, c("zymodemecategorical", "clinicalcategorical")])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'des' not found
unknown_clinical <- is.na(col_data[["clinicalcategorical"]])
## Error in eval(expr, envir, enclos): object 'col_data' not found
row_data <- as.data.frame(des[, c("strain")])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'des' not found
colnames(col_data) <- c("zymodeme", "outcome")
## Error: object 'col_data' not found
col_data[unknown_clinical, "outcome"] <- "undefined"
## Error: object 'col_data' not found
colnames(row_data) <- c("strain")
## Error: object 'row_data' not found
myannot <- list(
  "Col" = list("data" = col_data),
  "Row" = list("data" = row_data))
## Error in eval(expr, envir, enclos): object 'col_data' not found
myclust <- list("cuth" = 1.0,
                "col" = BrewerClusterCol)
mylabs <- list(
  "Row" = list("nrow" = 4),
  "Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(240)
zymo_annot_heat <- annHeatmap2(
  correlations,
  dendrogram = mydendro,
  annotation = myannot,
  cluster = myclust,
  labels = mylabs,
  ## The following controls if the picture is symmetric
  scale = "none",
  col = hmcols)
## Error in eval(expr, envir, enclos): object 'correlations' not found
dev <- pp(file = "images/dendro_heatmap.png", height = 20, width = 20)
plot(zymo_annot_heat)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'zymo_annot_heat' not found
closed <- dev.off()
plot(zymo_annot_heat)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'zymo_annot_heat' not found

Print the larger heatmap so that all the labels appear. Keep in mind that as we get more samples, this image needs to continue getting bigger.

I cannot run the following block until/unless I install cmplot in the container.

xref_prop <- table(pheno_snps[["conditions"]])
pheno_snps$conditions
idx_tbl <- exprs(pheno_snps) > 5
new_tbl <- data.frame(row.names = rownames(exprs(pheno_snps)))
for (n in names(xref_prop)) {
  new_tbl[[n]] <- 0
  idx_cols <- which(pheno_snps[["conditions"]] == n)
  prop_col <- rowSums(idx_tbl[, idx_cols]) / xref_prop[n]
  new_tbl[n] <- prop_col
}
keepers <- grepl(x = rownames(new_tbl), pattern = "LpaL13")
new_tbl <- new_tbl[keepers, ]
new_tbl[["strong22"]] <- 1.001 - new_tbl[["z2.2"]]
new_tbl[["strong23"]] <- 1.001 - new_tbl[["z2.3"]]
s22_na <- new_tbl[["strong22"]] > 1
new_tbl[s22_na, "strong22"] <- 1
s23_na <- new_tbl[["strong23"]] > 1
new_tbl[s23_na, "strong23"] <- 1

new_tbl[["SNP"]] <- rownames(new_tbl)
new_tbl[["Chromosome"]] <- gsub(x = new_tbl[["SNP"]], pattern = "chr_(.*)_pos_.*", replacement = "\\1")
new_tbl[["Position"]] <- gsub(x = new_tbl[["SNP"]], pattern = ".*_pos_(\\d+)_.*", replacement = "\\1")
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "strong22", "strong23")]

library(CMplot)
simplify <- new_tbl
simplify[["strong22"]] <- NULL

CMplot(simplify, bin.size = 100000)

CMplot(new_tbl, plot.type="m", multracks=TRUE, threshold = c(0.01, 0.05),
       threshold.lwd=c(1,1), threshold.col=c("black","grey"),
       amplify=TRUE, bin.size=10000,
       chr.den.col=c("darkgreen", "yellow", "red"),
       signal.col=c("red", "green", "blue"),
       signal.cex=1, file="jpg", memo="", dpi=300, file.output=TRUE, verbose=TRUE)
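The SNP, Chromosome, and Position columns used above are parsed out of the variant row names. The sketch below does the same with a single anchored pattern and converts the position to a number, on the assumption that downstream plotting prefers numeric positions. The first identifier appears earlier in this document; the second is invented.

```r
## The first ID is taken from the output above; the second is made up.
snp_ids <- c("chr_LpaL13-34_pos_1370748_ref_C_alt_G",
             "chr_LpaL13-15_pos_238433_ref_G_alt_A")
pat <- "^chr_(.+)_pos_(\\d+)_ref_([ACGT]+)_alt_([ACGT]+)$"

## Pull each captured group into its own column.
parsed <- data.frame(
  SNP = snp_ids,
  Chromosome = sub(pat, "\\1", snp_ids),
  Position = as.numeric(sub(pat, "\\2", snp_ids)),
  Ref = sub(pat, "\\3", snp_ids),
  Alt = sub(pat, "\\4", snp_ids))
parsed
```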

4.5 Try out MatrixEQTL

This tool looks a little opaque, but provides sample data with things that make sense to me and should be pretty easy to recapitulate in our data.

  1. covariates.txt: Columns are samples, rows are things from pData – the most likely ones of interest for our data would be zymodeme, sensitivity
  2. geneloc.txt: columns are ‘geneid’, ‘chr’, ‘left’, ‘right’. I guess I can assume left and right are start/stop; in which case this is trivially acquirable from fData.
  3. ge.txt: This appears to be a log(rpkm/cpm) table with rows as genes and columns as samples
  4. snpsloc.txt: columns are ‘snpid’, ‘chr’, ‘pos’
  5. snps.txt: columns are samples, rows are the ids from snpsloc, values are 0, 1, 2. I assume 0 is identical and 1..12 are the various A->TGC T->AGC C->AGT G->ACT
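To keep the five inputs straight, here is a mock-up of their shapes in base R. Every value is invented, the genotypes are coded 0/1 for absent/present to match the binary conversion used below, and the 'cis' check at the end is only my simplified reading of what a distance cutoff means, not MatrixEQTL's own code.

```r
samples <- c("s1", "s2", "s3")

## 1. covariates: rows are covariates, columns are samples (numeric codes).
covariates <- rbind(zymodeme = c(1, 2, 2), outcome = c(1, 1, 2))
colnames(covariates) <- samples

## 2. geneloc: one row per gene with its chromosome and boundaries.
geneloc <- data.frame(geneid = c("g1", "g2"), chr = c("chrA", "chrA"),
                      left = c(100, 900), right = c(500, 1500))

## 3. ge: log-scale expression, genes in rows and samples in columns.
ge <- matrix(c(5.1, 4.8, 6.0,
               2.2, 2.9, 2.5),
             nrow = 2, byrow = TRUE,
             dimnames = list(geneloc[["geneid"]], samples))

## 4. snpsloc: one row per variant with its chromosome and position.
snpsloc <- data.frame(snpid = c("v1", "v2"), chr = c("chrA", "chrA"),
                      pos = c(150, 1200))

## 5. snps: variants in rows and samples in columns, coded 0/1 here.
snps <- matrix(c(0, 1, 1,
                 1, 0, 0),
               nrow = 2, byrow = TRUE,
               dimnames = list(snpsloc[["snpid"]], samples))

## The sample columns must line up across all three matrices.
stopifnot(identical(colnames(snps), colnames(ge)),
          identical(colnames(ge), colnames(covariates)))

## Simplified 'cis' check: is each variant within cis_dist of each gene?
## (Chromosome matching is skipped because both toy genes share chrA.)
cis_dist <- 200
is_cis <- outer(snpsloc[["pos"]], geneloc[["left"]] - cis_dist, ">=") &
  outer(snpsloc[["pos"]], geneloc[["right"]] + cis_dist, "<=")
dimnames(is_cis) <- list(snpsloc[["snpid"]], geneloc[["geneid"]])
is_cis
```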
## For this, let us use the 'new_snps' data structure.
## Caveat here: these need to be coerced to numbers.
my_covariates <- pData(new_snps)[, c("zymodemecategorical", "clinicalcategorical")]
for (col in colnames(my_covariates)) {
  my_covariates[[col]] <- as.numeric(as.factor(my_covariates[[col]]))
}
my_covariates <- t(my_covariates)

my_geneloc <- fData(lp_expt)[, c("gid", "chromosome", "start", "end")]
colnames(my_geneloc) <- c("geneid", "chr", "left", "right")

my_ge <- exprs(normalize_expt(lp_expt, transform = "log2", filter = TRUE, convert = "cpm"))
used_samples <- tolower(colnames(my_ge)) %in% colnames(exprs(new_snps))
my_ge <- my_ge[, used_samples]

## Use row.names so that the variant IDs become the row names which are parsed below.
my_snpsloc <- data.frame(row.names = rownames(exprs(new_snps)))
## Oh, caveat here: Because of the way I stored the data,
## I could have duplicate rows which presumably will make matrixEQTL sad
## The pattern needs the underscore after 'pos' to match the variant IDs.
my_snpsloc[["chr"]] <- gsub(pattern = "^chr_(.+)_pos_(.+)_ref_.*$", replacement = "\\1",
                            x = rownames(my_snpsloc))
my_snpsloc[["pos"]] <- gsub(pattern = "^chr_(.+)_pos_(.+)_ref_.*$", replacement = "\\2",
                            x = rownames(my_snpsloc))
test <- duplicated(my_snpsloc)
## Each duplicated row would be another variant at that position;
## so in theory we would do a rle to number them I am guessing
## However, I do not have different variants so I think I can ignore this for the moment
## but will need to make my matrix either 0 or 1.
if (sum(test) > 0) {
  message("There are: ", sum(test), " duplicated entries.")
  keep_idx <- ! test
  my_snpsloc <- my_snpsloc[keep_idx, ]
}

my_snps <- exprs(new_snps)
one_idx <- my_snps > 0
my_snps[one_idx] <- 1

## Ok, at this point I think I have all the pieces which this method wants...
## Oh, no I guess not; it actually wants the data as a set of filenames...
library(MatrixEQTL)
write.table(my_snps, "eqtl/snps.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snps, "eqtl/snps.tsv", )
write.table(my_snpsloc, "eqtl/snpsloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snpsloc, "eqtl/snpsloc.tsv")
write.table(as.data.frame(my_ge), "eqtl/ge.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_ge), "eqtl/ge.tsv")
write.table(as.data.frame(my_geneloc), "eqtl/geneloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_geneloc), "eqtl/geneloc.tsv")
write.table(as.data.frame(my_covariates), "eqtl/covariates.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_covariates), "eqtl/covariates.tsv")
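
The `write.table()` calls above produce files with one header row of sample names and one leading column of row labels, which is the layout the `fileSkipRows = 1` and `fileSkipColumns = 1` settings used later assume. A minimal sketch with a toy matrix (hypothetical names, written to a temporary file) shows what ends up on disk and that it reads back with the same labels:

```r
## Toy genotype matrix; the row and column names are hypothetical.
toy <- matrix(c(0, 1, 1, 0, NA, 1), nrow = 3,
              dimnames = list(c("snp_a", "snp_b", "snp_c"), c("s1", "s2")))
toy_file <- tempfile(fileext = ".tsv")
write.table(toy, toy_file, na = "NA", col.names = TRUE, row.names = TRUE,
            sep = "\t", quote = TRUE)

## Look at the raw lines: the header has one field fewer than the data rows.
readLines(toy_file)

## Because of that shorter header, read.table() treats the first column
## as row names without being told to.
back <- as.matrix(read.table(toy_file, header = TRUE, sep = "\t"))
stopifnot(identical(rownames(back), rownames(toy)),
          identical(colnames(back), colnames(toy)))
```

The same behavior is what turns the SNP and gene location tables back into three-column data frames when they are read with `read.table(..., header = TRUE)`.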

useModel = modelLINEAR # modelANOVA, modelLINEAR, or modelLINEAR_CROSS

# Genotype file name
SNP_file_name = "eqtl/snps.tsv"
snps_location_file_name = "eqtl/snpsloc.tsv"
expression_file_name = "eqtl/ge.tsv"
gene_location_file_name = "eqtl/geneloc.tsv"
covariates_file_name = "eqtl/covariates.tsv"
# Output file name
output_file_name_cis = tempfile()
output_file_name_tra = tempfile()
# Only associations significant at this level will be saved
pvOutputThreshold_cis = 0.1
pvOutputThreshold_tra = 0.1
# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric()
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");
# Distance for local gene-SNP pairs
cisDist = 1e6
## Load genotype data
snps = SlicedData$new()
snps$fileDelimiter = "\t"      # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1          # one row of column labels
snps$fileSkipColumns = 1       # one column of row labels
snps$fileSliceSize = 2000      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)
## Load gene expression data
gene = SlicedData$new()
gene$fileDelimiter = "\t"      # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1          # one row of column labels
gene$fileSkipColumns = 1       # one column of row labels
gene$fileSliceSize = 2000      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)
## Load covariates
cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t"      # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1          # one row of column labels
cvrt$fileSkipColumns = 1       # one column of row labels
if(length(covariates_file_name) > 0) {
  cvrt$LoadFile(covariates_file_name)
}
## Run the analysis
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE)
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE)

me = Matrix_eQTL_main(
  snps = snps,
  gene = gene,
  cvrt = cvrt,
  output_file_name = output_file_name_tra,
  pvOutputThreshold = pvOutputThreshold_tra,
  useModel = useModel,
  errorCovariance = errorCovariance,
  verbose = TRUE,
  output_file_name.cis = output_file_name_cis,
  pvOutputThreshold.cis = pvOutputThreshold_cis,
  snpspos = snpspos,
  genepos = genepos,
  cisDist = cisDist,
  pvalue.hist = "qqplot",
  min.pv.by.genesnp = FALSE,
  noFDRsaveMemory = FALSE)
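
The run above uses `modelLINEAR`, which tests for an additive linear effect of genotype on expression after accounting for the covariates. For a single gene/variant pair this is comparable to an ordinary `lm()` fit. The sketch below uses simulated data only; the effect sizes, sample count, and variable names are made up for illustration and say nothing about the TMRC2 samples.

```r
## Simulated data for one gene and one presence/absence variant.
set.seed(1)
n <- 40
genotype <- rep(c(0, 1), each = n / 2)  # 0/1 coding, as in my_snps
covariate <- rnorm(n)                   # a single made-up covariate
expression <- 5 + 2 * genotype + 0.5 * covariate + rnorm(n, sd = 0.5)

## Additive linear model: expression ~ genotype + covariates.
fit <- lm(expression ~ genotype + covariate)

## Estimate, standard error, t statistic, and p-value for the genotype term.
pair_stats <- summary(fit)$coefficients["genotype", ]
pair_stats
```

MatrixEQTL performs the equivalent test for every gene/variant pair at once, which is why it wants the inputs as matrices with matching sample columns.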
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  message(paste0("Saving to ", savefile))
  ## tmp <- sm(saveme(filename = savefile))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 996769878223fb869c8fa3e1496bffec3a7de7f6
## This is hpgltools commit: Mon Jan 8 16:36:00 2024 -0500: 996769878223fb869c8fa3e1496bffec3a7de7f6
## Saving to 02pre_visualization.rda.xz
tmp <- loadme(filename = savefile)
