1 Changelog

  • Set input data to the new 202212 dataset. Looking for some messed up colors.
  • Reasonably certain I figured out the color discrepancy. I was letting the eosinophil dataset choose its own colors rather than forcing them to match the other cell types, even though I thought I had explicitly set them to match. I think the changes I made in datasets.Rmd fixed this, so I regenerated the rda/etc in that document and am now testing the colors here.

2 Introduction

Moving all of the visualization and diagnostic tasks to this document. The metadata and gene annotation data collection tasks are therefore in tmrc3_data_structures.Rmd. The reasons for creating some of the data structures in that document are made clear here, but the creation itself is all performed there.

3 Notes

  1. Lesion vs Ulcer: The ulcer is the base of the crater of the observed lesion. The lesion comprises the ulcer, its border, and any region with signs of inflammation. It is not known if these metrics are equivalent, or if one is better than the other. Some people do not have ulcers, and in those cases we can only really consider the lesion size. E.g. most people in Colombia have ulcers, which are the cratered sores; however, a few people have a ‘plaque’ or some other smaller, less intrusive presentation, and these are still cutaneous.

Thus the lesion size is the more inclusive metric, but potentially ulcer size is more informative? Any inflammation in the skin causes the person to be defined as failure.

  2. Note from Maria Adelaida: Some chemokines are suggestive of Eosinophil recruitment.

3.1 Goals

These samples are from patients who either successfully cleared a Leishmania panamensis infection following treatment, or did not. They include biopsies from each patient along with purifications for Monocytes, Neutrophils, and Eosinophils. When possible, this process was repeated over three visits, but some patients did not return for the second or third visit.

The over-arching goal is to look for attributes (most likely genes) which distinguish patients who do and do not cure the infection after treatment. If possible, these will be apparent on the first visit.

plot_legend(hs_expt)$plot
## plot labels was not set and there are more than 100 samples, disabling it.

all_nz <- plot_nonzero(hs_expt)
## The following samples have less than 12949.95 genes.
##  [1] "TMRC30010" "TMRC30140" "TMRC30280" "TMRC30284" "TMRC30050" "TMRC30056"
##  [7] "TMRC30052" "TMRC30058" "TMRC30031" "TMRC30038" "TMRC30265"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
all_nz$plot
## Warning: ggrepel: 195 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

3.2 Figure XX + 1: Non-zero genes after sample filtering

The following plot is essentially identical to the previous with two exceptions:

  1. The samples with too few genes (11,000 currently) are gone. In the current iteration of the datasets Rmd, this comprises either two or three samples.
  2. The samples are colored by cure (purple) or fail (yellow).

nz_post <- plot_nonzero(tc_valid)
## The following samples have less than 12949.95 genes.
## [1] "TMRC30140" "TMRC30280" "TMRC30284" "TMRC30056" "TMRC30058" "TMRC30031"
## [7] "TMRC30265"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
nz_post$plot
## Warning: ggrepel: 163 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
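
The sample filter described above can be sketched in base R. This is only an illustration on a made-up count matrix: the object names (counts, min_genes, kept, dropped) and the cutoff of 5 are assumptions chosen for the toy data, and it does not reproduce what plot_nonzero() or the datasets document actually do.

```r
# Toy count matrix: rows are genes, columns are samples (values are invented).
set.seed(1)
counts <- matrix(rpois(40, lambda = 2), nrow = 10,
                 dimnames = list(paste0("gene", 1:10), paste0("sample", 1:4)))
counts[1:8, "sample4"] <- 0  # make one sample deliberately sparse

# Number of genes with at least one read in each sample.
nonzero_genes <- colSums(counts > 0)

# Hypothetical cutoff for the toy data; the text above mentions roughly
# 11,000 genes for the real samples.
min_genes <- 5
kept <- counts[, nonzero_genes >= min_genes, drop = FALSE]
dropped <- names(nonzero_genes)[nonzero_genes < min_genes]
```

Printing dropped lists the samples that would be excluded, which is the same information reported by the "samples have less than ... genes" messages.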

3.3 Quick picture before removing miltefosine samples

Maria Adelaida’s quote: “I would like one picture of all samples including the miltefosine so that I can keep in my mind why we removed them.”

4 PCA with both drugs

tc_expt_norm <- normalize_expt(hs_expt, filter=TRUE, norm="quant",
                               convert="cpm", transform="log2") %>%
  set_expt_batches(fact="drug")
## Removing 5149 low-count genes (14774 remaining).
## transform_counts: Found 855 values equal to 0, adding 1 to the matrix.
## 
##    antimony miltefosine 
##         202           8
tc_expt_drug_pca <- plot_pca(tc_expt_norm, cis=NULL)
## plot labels was not set and there are more than 100 samples, disabling it.
tc_expt_drug_pca <- plot_pca(tc_expt_norm)
## plot labels was not set and there are more than 100 samples, disabling it.
tc_expt_drug_pca$plot

tc_expt_nb <- normalize_expt(hs_expt, filter=TRUE, convert="cpm",
                             transform="log2", batch="svaseq") %>%
  set_expt_batches(fact="drug")
## Removing 5149 low-count genes (14774 remaining).
## Setting 35565 low elements to zero.
## transform_counts: Found 35565 values equal to 0, adding 1 to the matrix.
## 
##    antimony miltefosine 
##         202           8
tc_expt_drug_nb_pca <- plot_pca(tc_expt_nb)
## plot labels was not set and there are more than 100 samples, disabling it.
tc_expt_drug_nb_pca$plot

t_expt_drug <- subset_expt(hs_expt, subset="clinic=='Tumaco'")
## subset_expt(): There were 210, now there are 143 samples.
t_expt_norm <- normalize_expt(t_expt_drug, filter=TRUE, norm="quant",
                              convert="cpm", transform="log2") %>%
  set_expt_batches(fact="drug")
## Removing 5676 low-count genes (14247 remaining).
## transform_counts: Found 388 values equal to 0, adding 1 to the matrix.
## 
##    antimony miltefosine 
##         135           8
t_expt_drug_pca <- plot_pca(t_expt_norm)
## plot labels was not set and there are more than 100 samples, disabling it.
t_expt_drug_pca$plot

t_expt_nb <- normalize_expt(t_expt_drug, filter=TRUE, convert="cpm",
                             transform="log2", batch="svaseq") %>%
  set_expt_batches(fact="drug")
## Removing 5676 low-count genes (14247 remaining).
## Setting 18820 low elements to zero.
## transform_counts: Found 18820 values equal to 0, adding 1 to the matrix.
## 
##    antimony miltefosine 
##         135           8
t_expt_drug_nb_pca <- plot_pca(t_expt_nb)
## plot labels was not set and there are more than 100 samples, disabling it.
t_expt_drug_nb_pca$plot

4.1 Summarize: Tally samples after filtering

We need to keep track of how many samples of each type are lost when we apply our various filters. Thus I am repeating the same set of tallies. This will likely happen one more time, following the removal of samples which came from Cali.
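
Since the same tallies are repeated after each filter, one option is to wrap them in a small helper. The sketch below is hypothetical: tally_factors() is not part of any package used here, and the metadata is a toy data frame standing in for pData(tc_valid).

```r
# Toy metadata standing in for pData(tc_valid); the values are invented.
meta <- data.frame(
  clinic = c("Cali", "Tumaco", "Tumaco", "Cali", "Tumaco"),
  finaloutcome = c("cure", "cure", "failure", "cure", "failure"),
  typeofcells = c("biopsy", "monocytes", "monocytes", "neutrophils", "biopsy"),
  stringsAsFactors = FALSE)

# Hypothetical helper: one table() per requested column, as a named list.
tally_factors <- function(meta, columns) {
  stats::setNames(lapply(columns, function(col) table(meta[[col]])), columns)
}

wanted <- c("clinic", "finaloutcome", "typeofcells")
before <- tally_factors(meta, wanted)
after <- tally_factors(meta[meta$clinic == "Tumaco", ], wanted)
```

Comparing before and after element by element shows how many samples of each category a filter removed.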

table(pData(tc_valid)$drug)
## 
## antimony 
##      184
table(pData(tc_valid)$clinic)
## 
##   Cali Tumaco 
##     61    123
table(pData(tc_valid)$finaloutcome)
## 
##    cure failure 
##     122      62
table(pData(tc_valid)$typeofcells)
## 
##      biopsy eosinophils   monocytes neutrophils 
##          18          41          63          62
table(pData(tc_valid)$visit)
## 
##  3  2  1 
## 51 50 83
summary(as.numeric(pData(tc_valid)$eb_lc_tiempo_evolucion))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    4.00    6.00    8.19   12.00   21.00
summary(as.numeric(pData(tc_valid)$eb_lc_tto_mcto_glucan_dosis))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    13.0    14.8    19.0    17.5    20.0    20.0
summary(as.numeric(pData(tc_valid)$v3_lc_ejey_lesion_mm_1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     7.2    32.0   303.4   999.0   999.0
summary(as.numeric(pData(tc_valid)$v3_lc_lesion_area_1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     226     999    2328    2448   16965
summary(as.numeric(pData(tc_valid)$v3_lc_ejex_ulcera_mm_1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0    12.5   295.9   999.0   999.0
table(pData(tc_valid)$eb_lc_sexo)
## 
##   1   2 
## 156  28
table(pData(tc_valid)$eb_lc_etnia)
## 
##  1  2  3 
## 91 46 47
summary(as.numeric(pData(tc_valid)$edad))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    25.0    28.5    30.7    36.0    51.0
table(pData(tc_valid)$eb_lc_peso)
## 
##  53.9  57.9    58  58.1  58.3  58.6    59  59.6    62    63    67  69.4    72 
##     9     2     6     7    10     3     8     1     6     6     6    10     9 
##    75  76.5    77    78  79.2    82  83.3  83.4  86.4    87    89  93.3   100 
##     2     3    18    10    10     9     4    10     9     3     9     7     5 
## 100.8 
##     2
table(pData(tc_valid)$eb_lc_estatura)
## 
## 152 154 155 156 158 159 160 163 164 165 166 167 169 172 173 174 176 177 182 183 
##   1  10   9   6  15   2   3   9  15  12  19   3   2  10   9  32   1   7   9  10
length(unique(pData(tc_valid)[["codigo_paciente"]]))
## [1] 29

5 Host Distributions/Visualizations of interest

The sets of samples used to visualize the data will also comprise the sets used when later performing the various differential expression analyses.

5.1 Global metrics

Start out with some initial metrics of all samples. The most obvious are plots of the numbers of non-zero genes observed, heatmaps showing the relative relationships among the samples, the relative library sizes, and some PCA. It might be smart to split the library sizes up across subsets of the data, because they have expanded too far to see well on a computer screen.

The most likely factors to query when considering the entire dataset are cure/fail, visit, and cell type. This is the level at which we will choose samples to exclude from future analyses.

plot_legend(tc_biopsies)$plot

plot_libsize(tc_biopsies)$plot

plot_nonzero(tc_biopsies)$plot
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

There is a (relatively) new function in the following block. plot_libsize_prepost attempts to provide an idea of how much data is lost when low-count filtering the data.

The first plot it produces is a barplot of the number of reads removed by the filter from each sample. The second plot has two bars per sample: the top bar is labeled with the number of low-count genes before the filter, and the lower bar represents the number after the filter, which is assumed to be quite low.
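
To make those two quantities concrete, here is a base-R sketch on a toy matrix. The filter rule (mean count of at least 2) and the definition of a low-count gene (fewer than 2 reads in a sample) are assumptions for illustration and are not necessarily what plot_libsize_prepost uses internally.

```r
set.seed(2)
counts <- matrix(rpois(60, lambda = 5), nrow = 15,
                 dimnames = list(paste0("gene", 1:15), paste0("sample", 1:4)))
counts[1:5, ] <- rpois(20, lambda = 0.2)  # five weakly expressed genes

# Assumed filter: keep genes whose mean count across samples is at least 2.
keep <- rowMeans(counts) >= 2
filtered <- counts[keep, , drop = FALSE]

# Quantity behind the first plot: reads removed from each sample by the filter.
reads_lost <- colSums(counts) - colSums(filtered)

# Quantity behind the second plot: low-count genes per sample, before and after.
low_threshold <- 2  # hypothetical definition of "low-count"
low_before <- colSums(counts < low_threshold)
low_after <- colSums(filtered < low_threshold)
```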

biopsy_prepost <- plot_libsize_prepost(tc_biopsies)
biopsy_prepost$count_plot

biopsy_prepost$lowgene_plot
## Warning: Using alpha for a discrete variable is not advised.

## Minimum number of biopsy genes: ~ 14,000

plot_libsize(tc_eosinophils)$plot

plot_nonzero(tc_eosinophils)$plot
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Warning: ggrepel: 18 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

eosinophil_prepost <- plot_libsize_prepost(tc_eosinophils)
eosinophil_prepost$count_plot

eosinophil_prepost$lowgene_plot
## Warning: Using alpha for a discrete variable is not advised.

## Minimum number of eosinophil genes: ~ 13,500

plot_libsize(tc_monocytes)$plot

plot_nonzero(tc_monocytes)$plot
## The following samples have less than 12949.95 genes.
## [1] "TMRC30056"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Warning: ggrepel: 48 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

monocyte_prepost <- plot_libsize_prepost(tc_monocytes)
monocyte_prepost$count_plot

monocyte_prepost$lowgene_plot
## Warning: Using alpha for a discrete variable is not advised.

## Minimum number of monocyte genes: ~ 7,500 before setting the minimum.

plot_libsize(tc_neutrophils)$plot

plot_nonzero(tc_neutrophils)$plot
## The following samples have less than 12949.95 genes.
## [1] "TMRC30140" "TMRC30280" "TMRC30284" "TMRC30058" "TMRC30031" "TMRC30265"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Warning: ggrepel: 41 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

neutrophil_prepost <- plot_libsize_prepost(tc_neutrophils)
neutrophil_prepost$count_plot

neutrophil_prepost$lowgene_plot
## Warning: Using alpha for a discrete variable is not advised.

## Minimum number of neutrophil genes: ~ 10,000 before setting minimum coverage.

The above block just repeats the same two plots on a per-celltype basis: the number of reads observed per sample and a plot of observed genes with respect to coverage. I made some comments with my observations about the number of genes.

5.2 Global views of all cell types

Now that those ‘global’ metrics are out of the way, let's look at some global metrics of the data following normalization; the most likely plots are of course PCA but also a couple of heatmaps.

5.2.1 Figure 1

In the google doc TMRC3_Aug18_2021, there is an example of an image for the first figure:

“Transcriptomic profiles of primary innate cells of CL patients show unique transcriptional signatures - Remove PBMCs and M0, maybe biopsies as well (but Remove WT samples)”

While we were talking in a meeting however, it sounded like there was some desire to keep all cell types. Therefore the following block has one image with everything and one following the above.

tc_type <- set_expt_conditions(tc_valid, fact="typeofcells") %>%
  set_expt_batches(fact="finaloutcome") %>%
  set_expt_colors(color_choices[["type"]])
## 
##      biopsy eosinophils   monocytes neutrophils 
##          18          41          63          62 
## 
##    cure failure 
##     122      62
tc_norm <- sm(normalize_expt(tc_type, transform="log2", norm="quant",
                             convert="cpm", filter=TRUE))

tc_pca <- plot_pca(tc_norm, plot_labels=FALSE,
                    plot_title="PCA - Cell type", size_column="visitnumber")
dev <- pp(file=glue("images/tmrc3_pca_nolabels-v{ver}.svg"))
tc_pca$plot
## Error in `palette()`:
## ! Insufficient values in manual scale. 4 needed but only 3 provided.
closed <- dev.off()
tc_pca$plot
## Error in `palette()`:
## ! Insufficient values in manual scale. 4 needed but only 3 provided.
tc_pca_nosize <- plot_pca(tc_norm, plot_labels=FALSE)
tc_pca_nosize$plot

write.csv(tc_pca$table, file="coords/tc_donor_pca_coords.csv")
tc_cf_norm <- set_expt_batches(tc_norm, fact="visitnumber")
## 
##  3  2  1 
## 51 50 83
tc_cf_corheat <- plot_corheat(tc_cf_norm, plot_title="Hierarchical clustering:
         cell types")

dev <- pp(file=glue("images/tmrc3_corheat_cf-v{ver}.svg"), height=12, width=12)
tc_cf_corheat$plot
closed <- dev.off()
tc_cf_corheat$plot

tc_cf_disheat <- plot_disheat(tc_cf_norm, plot_title="Hierarchical clustering:
         cell types")
dev <- pp(file=glue("images/tmrc3_disheat_cf-v{ver}.png"), height=12, width=12)
tc_cf_disheat$plot
closed <- dev.off()
tc_cf_disheat$plot

5.3 Figure 1B: Transcriptomic profiles of primary innate cells

A potential figure legend for the following images might include:

The observed counts per gene for all of the clinical samples were filtered, log transformed, cpm converted, and quantile normalized. The colors were defined by cell types and shapes by patient visit. When the first two principal components were plotted, clustering was observed by cell type. The biopsy samples separated clearly from the innate immune cell types.
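
The steps named in that legend can be sketched in base R on simulated counts. This is not the normalize_expt() implementation: the order of operations (filter, quantile normalize, cpm, log2 with a pseudocount of 1), the filter rule, and the helper quantile_normalize() are all assumptions made for illustration.

```r
set.seed(3)
n_genes <- 100
n_samples <- 6
counts <- matrix(rnbinom(n_genes * n_samples, mu = 50, size = 2), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(n_samples))))

# 1. Low-count filter (assumed rule: mean count of at least 2).
filtered <- counts[rowMeans(counts) >= 2, , drop = FALSE]

# 2. Quantile normalization: give every sample the same empirical distribution.
quantile_normalize <- function(mat) {
  ranks <- apply(mat, 2, rank, ties.method = "first")
  reference <- rowMeans(apply(mat, 2, sort))
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(mat)
  normalized
}
normalized <- quantile_normalize(filtered)

# 3. Counts per million, then 4. log2 with a pseudocount of 1.
cpm <- t(t(normalized) / colSums(normalized)) * 1e6
logged <- log2(cpm + 1)

# PCA with samples as observations; the first two columns are PC1 and PC2.
pca <- prcomp(t(logged), center = TRUE, scale. = FALSE)
pc_coords <- pca$x[, 1:2]
```

Plotting pc_coords colored by cell type would give the kind of picture the legend describes; with random toy data no clustering is expected.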

fig1v2_norm <- normalize_expt(tc_type, transform="log2",
                              convert="cpm", norm="quant", filter=TRUE)
## Removing 5633 low-count genes (14290 remaining).
## transform_counts: Found 675 values equal to 0, adding 1 to the matrix.
fig1v2_pca <- plot_pca(fig1v2_norm, cis=FALSE)
## plot labels was not set and there are more than 100 samples, disabling it.
dev <- pp(file=glue("images/tmrc3_fig1v2.png"))
fig1v2_pca$plot
closed <- dev.off()
fig1v2_pca$plot

fig1v3_norm <- normalize_expt(tc_type, transform="log2",
                              convert="cpm", norm="quant", filter=TRUE)
## Removing 5633 low-count genes (14290 remaining).
## transform_counts: Found 675 values equal to 0, adding 1 to the matrix.
fig1v3_pca <- plot_pca(fig1v3_norm, cis=FALSE)
## plot labels was not set and there are more than 100 samples, disabling it.
dev <- pp(file="images/tmrc3_fig1v3.png")
fig1v3_pca$plot
closed <- dev.off()
fig1v3_pca$plot

6 Compare samples by clinic

Spoiler alert: This section will eventually suggest pretty strongly that we will not easily be able to use the Cali samples. Thus, after finishing it, we will likely exclude those samples.

Take a moment to view the biopsy samples. We separated them by clinic (Cali or Tumaco), and this view of the samples is the only one which does not suggest a strong difference between the two clinics. However, it also suggests that the biopsy samples will not prove very helpful.

6.1 Biopsies by clinic

tc_biopsies_norm <- normalize_expt(tc_biopsies, transform="log2",
                                   convert="cpm", norm="quant", filter=TRUE)
## Removing 6315 low-count genes (13608 remaining).
## transform_counts: Found 206 values equal to 0, adding 1 to the matrix.
tc_biopsies_pca <- plot_pca(tc_biopsies_norm, plot_labels=FALSE)
dev <- pp(file="images/biopsy_place.svg")
tc_biopsies_pca$plot
closed <- dev.off()
tc_biopsies_pca$plot

tc_biopsies_nb <- normalize_expt(tc_biopsies, transform="log2",
                                 convert="cpm", batch="svaseq", filter=TRUE)
## Removing 6315 low-count genes (13608 remaining).
## Setting 290 low elements to zero.
## transform_counts: Found 290 values equal to 0, adding 1 to the matrix.
tc_biopsies_nb_pca <- plot_pca(tc_biopsies_nb, plot_labels=FALSE)
dev <- pp(file="images/biopsy_place_nb.svg")
tc_biopsies_nb_pca$plot
closed <- dev.off()
tc_biopsies_nb_pca$plot

6.2 Patient Race and clinic?

Here is the relevant field for ethnicity from the codebook:

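  • 1 = Afrocolombiana (Afro-Colombian)
  • 2 = Indígena (Indigenous)
  • 3 = Mestiza (Mestizo)
  • 4 = Blanca (White)
  • 5 = Mulata (Mulatto)
  • 6 = Otra (Other)
  • 8 = NO DATO (no data)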

etnia_expt <- set_expt_conditions(tc_valid, fact="clinic_etnia")
## 
##   Cali_ Tumaco_ 
##      61     123
etnia_norm <- normalize_expt(etnia_expt, transform = "log2", convert = "cpm",
                             filter = TRUE, norm = "quant")
## Removing 5633 low-count genes (14290 remaining).
## transform_counts: Found 675 values equal to 0, adding 1 to the matrix.
plot_pca(etnia_norm)$plot
## plot labels was not set and there are more than 100 samples, disabling it.

tc_eo_ec <- set_expt_conditions(tc_eosinophils, fact="clinic_etnia")
## 
##   Cali_ Tumaco_ 
##      15      26
etnia_eo_norm <- normalize_expt(tc_eo_ec, transform = "log2", convert = "cpm",
                                filter = TRUE, norm = "quant")
## Removing 9059 low-count genes (10864 remaining).
## transform_counts: Found 5 values equal to 0, adding 1 to the matrix.
plot_pca(etnia_eo_norm)$plot

6.3 Eosinophils by clinic

In contrast, the Eosinophil samples do have a substantial amount of variance which discriminates the two clinics. At the time of this writing, there are fewer eosinophil samples than monocyte or neutrophil samples; as a result there are no samples which failed from Cali. This is somewhat limiting if we wish to look for differences between the cure and fail samples which came from the two clinics.

tc_eosinophils_norm <- normalize_expt(tc_eosinophils, transform="log2",
                                      convert="cpm", norm="quant", filter=TRUE)
## Removing 9059 low-count genes (10864 remaining).
## transform_counts: Found 5 values equal to 0, adding 1 to the matrix.
tc_eosinophils_pca <- plot_pca(tc_eosinophils_norm, plot_labels=FALSE)
dev <- pp(file="images/eosinophil_place.svg")
tc_eosinophils_pca$plot
closed <- dev.off()
tc_eosinophils_pca$plot

tc_eosinophils_nb <- normalize_expt(tc_eosinophils, transform="log2",
                                    convert="cpm", batch="svaseq", filter=TRUE)
## Removing 9059 low-count genes (10864 remaining).
## Setting 1043 low elements to zero.
## transform_counts: Found 1043 values equal to 0, adding 1 to the matrix.
tc_eosinophils_nb_pca <- plot_pca(tc_eosinophils_nb, plot_labels=FALSE)
dev <- pp(file="images/eosinophil_place_nb.svg")
tc_eosinophils_nb_pca$plot
closed <- dev.off()
tc_eosinophils_nb_pca$plot

6.4 Monocytes by clinic

In contrast with the eosinophil samples, we have one patient’s monocyte and neutrophil samples which did not cure. As we will see, there is one person from Cali who did not cure; with respect to the transcriptome, this person is not different from the other people from Cali.

tc_monocytes_norm <- normalize_expt(tc_monocytes, transform="log2",
                                       convert="cpm", norm="quant", filter=TRUE)
## Removing 8819 low-count genes (11104 remaining).
## transform_counts: Found 12 values equal to 0, adding 1 to the matrix.
tc_monocytes_pca <- plot_pca(tc_monocytes_norm, plot_labels=FALSE)
dev <- pp(file="images/monocytes_place.svg")
tc_monocytes_pca$plot
closed <- dev.off()
tc_monocytes_pca$plot

tc_monocytes_nb <- normalize_expt(tc_monocytes, transform="log2",
                                  convert="cpm", batch="svaseq", filter=TRUE)
## Removing 8819 low-count genes (11104 remaining).
## Setting 1447 low elements to zero.
## transform_counts: Found 1447 values equal to 0, adding 1 to the matrix.
tc_monocytes_nb_pca <- plot_pca(tc_monocytes_nb, plot_labels=FALSE)
dev <- pp(file="images/monocytes_place_nb.svg")
tc_monocytes_nb_pca$plot
closed <- dev.off()
tc_monocytes_nb_pca$plot

6.5 Neutrophils by clinic

Finally, in the neutrophils that same person does appear to differ from the others from Cali.

tc_neutrophils_norm <- normalize_expt(tc_neutrophils, transform="log2",
                                      convert="cpm", norm="quant", filter=TRUE)
## Removing 10681 low-count genes (9242 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
tc_neutrophils_pca <- plot_pca(tc_neutrophils_norm, plot_labels=FALSE)
dev <- pp(file="images/neutrophil_place.svg")
tc_neutrophils_pca$plot
closed <- dev.off()
tc_neutrophils_pca$plot

tc_neutrophils_nb <- normalize_expt(tc_neutrophils, transform="log2",
                                    convert="cpm", batch="svaseq", filter=TRUE)
## Removing 10681 low-count genes (9242 remaining).
## Setting 1541 low elements to zero.
## transform_counts: Found 1541 values equal to 0, adding 1 to the matrix.
tc_neutrophils_nb_pca <- plot_pca(tc_neutrophils_nb, plot_labels=FALSE)
dev <- pp(file="images/neutrophil_place_nb.svg")
tc_neutrophils_nb_pca$plot
closed <- dev.off()
tc_neutrophils_nb_pca$plot

6.6 PCA: Compare clinics

Now that we have these various subsets, perform an explicit comparison of the samples which came from the two clinics.

tc_clinic_type <- tc_valid %>%
  set_expt_conditions(fact="clinic") %>%
  set_expt_batches(fact="typeofcells")
## 
##   Cali Tumaco 
##     61    123 
## 
##      biopsy eosinophils   monocytes neutrophils 
##          18          41          63          62
tc_clinic_type_norm <- normalize_expt(tc_clinic_type, transform="log2", convert="cpm",
                                      norm="quant", filter=TRUE)
## Removing 5633 low-count genes (14290 remaining).
## transform_counts: Found 675 values equal to 0, adding 1 to the matrix.
tc_clinic_type_pca <- plot_pca(tc_clinic_type_norm)
## plot labels was not set and there are more than 100 samples, disabling it.
tc_clinic_type_pca$plot

tc_clinic_type_nb <- normalize_expt(tc_clinic_type, transform="log2", convert="cpm",
                                    batch="svaseq", filter=TRUE)
## Removing 5633 low-count genes (14290 remaining).
## Setting 31271 low elements to zero.
## transform_counts: Found 31271 values equal to 0, adding 1 to the matrix.
tc_clinic_type_nb_pca <- plot_pca(tc_clinic_type_nb)
## plot labels was not set and there are more than 100 samples, disabling it.
tc_clinic_type_nb_pca$plot

tc_clinical_norm <- sm(normalize_expt(tc_clinical, filter="simple", transform="log2",
                                      norm="quant", convert="cpm"))
clinical_pca <- plot_pca(tc_clinical_norm, plot_labels=FALSE,
                         cis=NULL,
                         plot_title="PCA - clinical samples")
dev <- pp(file=glue("images/all_clinical_nobatch_pca-v{ver}.png"), height=8, width=16)
clinical_pca$plot
closed <- dev.off()
clinical_pca$plot

tc_clinical_nb <- normalize_expt(tc_clinical, filter="simple", transform="log2",
                                 batch="svaseq", convert="cpm")
## Removing 1872 low-count genes (18051 remaining).
## Setting 156640 low elements to zero.
## transform_counts: Found 156640 values equal to 0, adding 1 to the matrix.
tc_clinical_nb_pca <- plot_pca(tc_clinical_nb)
## plot labels was not set and there are more than 100 samples, disabling it.
dev <- pp(file=glue("images/all_clinical_svaseqbatch_pca-v{ver}.png"), height=6, width=8)
tc_clinical_nb_pca$plot
closed <- dev.off()
tc_clinical_nb_pca$plot

clinical_pca_info <- pca_information(
    tc_clinical_norm, plot_pcas=TRUE, num_components = 30,
    expt_factors=c("visitnumber", "typeofcells", "finaloutcome",
                   "clinic", "donor"))
## plot labels was not set and there are more than 100 samples, disabling it.
dev <- pp(file="images/clinical_samples_neglogp_pcs.png")
clinical_pca_info$anova_neglogp_heatmap
closed <- dev.off()
clinical_pca_info$anova_neglogp_heatmap

clinical_pca_info$pca_plots$PC4_PC7
## Warning: ggrepel: 114 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

clinical_scores <- pca_highscores(tc_clinical_norm)
clinical_scores[["highest"]][,"Comp.4"]
##  [1] "15.73:ENSG00000168329" "14.96:ENSG00000133574" "14.03:ENSG00000204389"
##  [4] "14.02:ENSG00000171115" "13.89:ENSG00000163563" "13.47:ENSG00000179144"
##  [7] "13.17:ENSG00000004799" "13.11:ENSG00000180871" "13:ENSG00000172086"   
## [10] "12.77:ENSG00000091106" "12.61:ENSG00000121858" "12.37:ENSG00000123405"
## [13] "12.36:ENSG00000175538" "12.04:ENSG00000138449" "12.01:ENSG00000109971"
## [16] "11.84:ENSG00000165118" "11.6:ENSG00000088986"  "11.59:ENSG00000135828"
## [19] "11.37:ENSG00000038274" "11.17:ENSG00000130150"

6.7 Iterative SVA followed by PCA

Another way to explore the effect of SVA is to iteratively increase the number of SVs removed by it and look at some simple plots of the resulting data. Ideally, this should complement the methods employed by Theresa.
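
The idea of removing an increasing number of surrogates and re-checking the PCA can be illustrated on simulated data. The sketch below is a simplified stand-in and is NOT the svaseq algorithm: it takes the top k right singular vectors of the model residuals as surrogates and subtracts their fitted contribution. The function remove_surrogates() and all of the simulated values are assumptions made for the example.

```r
set.seed(4)
n_genes <- 200
n_samples <- 12
condition <- factor(rep(c("cure", "failure"), each = 6))
hidden <- rep(c(0, 1), times = 6)  # unmodeled factor, e.g. an unrecorded batch
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
expr[1:50, ] <- expr[1:50, ] +
  matrix(2 * hidden, nrow = 50, ncol = n_samples, byrow = TRUE)

# Simplified surrogate removal (not svaseq): surrogates are the top k right
# singular vectors of the residuals left after fitting the known condition.
remove_surrogates <- function(expr, condition, k) {
  design <- model.matrix(~ condition)
  residuals <- t(lm.fit(design, t(expr))$residuals)
  surrogates <- svd(residuals)$v[, seq_len(k), drop = FALSE]
  full_fit <- lm.fit(cbind(design, surrogates), t(expr))
  surrogate_rows <- ncol(design) + seq_len(k)
  expr - t(surrogates %*% full_fit$coefficients[surrogate_rows, , drop = FALSE])
}

# Repeat for an increasing number of surrogates and track how strongly PC1
# still follows the hidden factor.
pc1_vs_hidden <- sapply(0:3, function(k) {
  cleaned <- if (k == 0) expr else remove_surrogates(expr, condition, k)
  abs(cor(prcomp(t(cleaned))$x[, 1], hidden))
})
names(pc1_vs_hidden) <- paste0("k=", 0:3)
```

In this toy setting the hidden factor dominates PC1 before any removal and should fade once the first surrogate is taken out, which is the kind of change the repeated blocks below look for in the real data.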

first <- normalize_expt(tc_clinical, transform="log2", convert="cpm",
                        filter = TRUE, batch="svaseq", surrogates=1)
## Removing 5633 low-count genes (14290 remaining).
## Setting 192779 low elements to zero.
## transform_counts: Found 192779 values equal to 0, adding 1 to the matrix.
first_info <- pca_information(
    first, plot_pcas=TRUE, num_components = 30,
    expt_factors=c("visitnumber", "typeofcells",
                   "finaloutcome", "clinic"))
## plot labels was not set and there are more than 100 samples, disabling it.
first_info$anova_neglogp_heatmap

first_info$pca_plots[["PC1_PC2"]]
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
## Warning: ggrepel: 176 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

second <- normalize_expt(tc_clinical, transform="log2", convert="cpm",
                         filter = TRUE, batch="svaseq", surrogates=2) %>%
  set_expt_batches(fact="clinic")
## Removing 5633 low-count genes (14290 remaining).
## Setting 31218 low elements to zero.
## transform_counts: Found 31218 values equal to 0, adding 1 to the matrix.
## 
##   Cali Tumaco 
##     61    123
second_info <- pca_information(
    second, plot_pcas=TRUE, num_components = 30,
    expt_factors=c("visitnumber", "typeofcells",
                   "finaloutcome", "clinic"))
## plot labels was not set and there are more than 100 samples, disabling it.
second_info$anova_neglogp_heatmap

third <- normalize_expt(tc_clinical, transform="log2", convert="cpm",
                        filter = TRUE, batch="svaseq", surrogates=3) %>%
  set_expt_batches(fact="clinic")
## Removing 5633 low-count genes (14290 remaining).
## Setting 27267 low elements to zero.
## transform_counts: Found 27267 values equal to 0, adding 1 to the matrix.
## 
##   Cali Tumaco 
##     61    123
third_info <- pca_information(
    third, plot_pcas=TRUE, num_components = 30,
    expt_factors=c("visitnumber", "typeofcells",
                   "finaloutcome", "clinic"))
## plot labels was not set and there are more than 100 samples, disabling it.
third_info$anova_neglogp_heatmap

fourth <- normalize_expt(tc_clinical, transform="log2", convert="cpm",
                         filter = TRUE, batch="svaseq", surrogates=4) %>%
  set_expt_batches(fact="clinic")
## Removing 5633 low-count genes (14290 remaining).
## Setting 25946 low elements to zero.
## transform_counts: Found 25946 values equal to 0, adding 1 to the matrix.
## 
##   Cali Tumaco 
##     61    123
fourth_info <- pca_information(
    fourth, plot_pcas=TRUE, num_components = 30,
    expt_factors=c("visitnumber", "typeofcells",
                   "finaloutcome", "clinic"))
## plot labels was not set and there are more than 100 samples, disabling it.
fourth_info$anova_neglogp_heatmap

fourth_info[["pca_plots"]][["PC1_PC2"]]
## Warning: ggrepel: 109 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

fifth <- normalize_expt(tc_clinical, transform="log2", convert="cpm",
                        filter = TRUE, batch="svaseq", surrogates=5) %>%
  set_expt_batches(fact="clinic")
## Removing 5633 low-count genes (14290 remaining).
## Setting 27033 low elements to zero.
## transform_counts: Found 27033 values equal to 0, adding 1 to the matrix.
## 
##   Cali Tumaco 
##     61    123
fifth_info <- pca_information(
    fifth, plot_pcas=TRUE, num_components = 30,
    expt_factors=c("visitnumber", "typeofcells",
                   "finaloutcome", "clinic"))
## plot labels was not set and there are more than 100 samples, disabling it.
fifth_info$anova_neglogp_heatmap

fifth_info[["pca_plots"]][["PC1_PC12"]]
## Warning: ggrepel: 112 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

sixth <- normalize_expt(tc_clinical, transform="log2", convert="cpm",
                        filter = TRUE, batch="svaseq", surrogates=6) %>%
  set_expt_batches(fact="clinic")
## Removing 5633 low-count genes (14290 remaining).
## Setting 23957 low elements to zero.
## transform_counts: Found 23957 values equal to 0, adding 1 to the matrix.
## 
##   Cali Tumaco 
##     61    123
sixth_info <- pca_information(
    sixth, plot_pcas=TRUE, num_components = 30,
    expt_factors=c("visitnumber", "typeofcells",
                   "finaloutcome", "clinic"))
## plot labels was not set and there are more than 100 samples, disabling it.
sixth_info$anova_neglogp_heatmap

seventh <- normalize_expt(tc_clinical, transform="log2", convert="cpm",
                          filter = TRUE, batch="svaseq", surrogates=7) %>%
  set_expt_batches(fact="clinic")
## Removing 5633 low-count genes (14290 remaining).
## Setting 24476 low elements to zero.
## transform_counts: Found 24476 values equal to 0, adding 1 to the matrix.
## 
##   Cali Tumaco 
##     61    123
seventh_info <- pca_information(
    seventh, plot_pcas=TRUE, num_components = 30,
    expt_factors=c("visitnumber", "typeofcells",
                   "finaloutcome", "clinic"))
## plot labels was not set and there are more than 100 samples, disabling it.
seventh_info$anova_neglogp_heatmap

eighth <- normalize_expt(tc_clinical, transform="log2", convert="cpm",
                        filter = TRUE, batch="svaseq", surrogates=8)
## Removing 5633 low-count genes (14290 remaining).
## Setting 24108 low elements to zero.
## transform_counts: Found 24108 values equal to 0, adding 1 to the matrix.
eighth_info <- pca_information(
    eighth, plot_pcas=TRUE, num_components = 30,
    expt_factors=c("visitnumber", "typeofcells",
                   "finaloutcome", "clinic"))
## plot labels was not set and there are more than 100 samples, disabling it.
eighth_info$anova_neglogp_heatmap
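
The name `anova_neglogp_heatmap` suggests a table of -log10 p-values from testing each principal component against each metadata factor. I have not checked the internals of `pca_information()`, so the following is only a base R sketch of that general idea on simulated data. Every name in it (`toy_meta`, `toy_expr`, `neglogp`) is hypothetical.

```r
set.seed(1)
n_samples <- 24
## Simulated metadata: three cell types, outcome balanced within each.
toy_meta <- data.frame(
  typeofcells = factor(rep(c("monocytes", "neutrophils", "eosinophils"), each = 8)),
  finaloutcome = factor(rep(c("cure", "failure"), times = 12)))

## Simulated log-scale expression: 200 genes, with a cell type shift on the first 50.
toy_expr <- matrix(rnorm(200 * n_samples), nrow = 200)
shift <- as.numeric(toy_meta[["typeofcells"]]) * 2
toy_expr[1:50, ] <- sweep(toy_expr[1:50, ], 2, shift, "+")

## PCA with samples as rows; keep the first five components.
pcs <- prcomp(t(toy_expr), center = TRUE, scale. = FALSE)[["x"]][, 1:5]

## One-way ANOVA of each PC against each factor, stored as -log10(p).
neglogp <- sapply(colnames(toy_meta), function(fact) {
  apply(pcs, 2, function(pc) {
    fit <- summary(aov(pc ~ toy_meta[[fact]]))
    -log10(fit[[1]][["Pr(>F)"]][1])
  })
})
round(neglogp, 2)
```

Large values mark components that track a factor. In this toy case PC1 should follow cell type and not outcome, which is the pattern to look for when comparing the different surrogate counts.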

6.8 Summarize: Collect Tumaco sample numbers.

At least in theory, everything that follows will use the above ‘clinical’ data structure. Thus, let us count it up and get a sense of what we will work with.

table(pData(t_clinical)$drug)
## 
## antimony 
##      123
table(pData(t_clinical)$clinic)
## 
## Tumaco 
##    123
table(pData(t_clinical)$finaloutcome)
## 
##    cure failure 
##      67      56
table(pData(t_clinical)$typeofcells)
## 
##      biopsy eosinophils   monocytes neutrophils 
##          14          26          42          41
table(pData(t_clinical)$visit)
## 
##  3  2  1 
## 34 35 54
summary(as.numeric(pData(t_clinical)$eb_lc_tiempo_evolucion))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    4.00    4.00    7.03    8.00   21.00
summary(as.numeric(pData(t_clinical)$eb_lc_tto_mcto_glucan_dosis))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      13      14      17      17      20      20
summary(as.numeric(pData(t_clinical)$v3_lc_ejey_lesion_mm_1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     7.2    31.3   389.6   999.0   999.0
summary(as.numeric(pData(t_clinical)$v3_lc_lesion_area_1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      46     222     999    1089     999    5055
summary(as.numeric(pData(t_clinical)$v3_lc_ejex_ulcera_mm_1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0     383     999     999
table(pData(t_clinical)$eb_lc_sexo)
## 
##   1   2 
## 101  22
table(pData(t_clinical)$eb_lc_etnia)
## 
##  1  2  3 
## 76 19 28
summary(as.numeric(pData(t_clinical)$edad))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    23.0    25.0    28.5    34.0    51.0
table(pData(t_clinical)$eb_lc_peso)
## 
##  53.9  57.9  58.1  58.3  58.6    59  59.6    62    63  69.4    77    78  79.2 
##     9     2     7    10     3     8     1     6     6    10     9    10    10 
##  83.3  83.4  86.4  93.3 100.8 
##     4    10     9     7     2
table(pData(t_clinical)$eb_lc_estatura)
## 
## 152 154 158 159 163 164 165 166 172 173 174 176 177 182 183 
##   1  10  15   2   9  15  12  10  10   4   8   1   7   9  10
length(unique(pData(t_clinical)[["codigo_paciente"]]))
## [1] 19
only_cure <- pData(t_clinical)[["finaloutcome"]] == "cure"
c_meta <- pData(t_clinical)[only_cure, ]
length(unique(c_meta[["codigo_paciente"]]))
## [1] 10
only_fail <- pData(t_clinical)[["finaloutcome"]] == "failure"
f_meta <- pData(t_clinical)[only_fail, ]
length(unique(f_meta[["codigo_paciente"]]))
## [1] 9
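
Several of the lesion and ulcer summaries above have a third quartile or maximum of exactly 999, which suggests that 999 may be a placeholder for a missing measurement and not a real size. That is an assumption on my part and should be checked against the clinical data dictionary. If it holds, a minimal sketch for recoding the placeholder before summarizing might look like this; `recode_sentinel()` and the toy vector are hypothetical and are not part of the analysis above.

```r
## Hypothetical helper: convert a sentinel code (assumed here to be 999) to NA.
recode_sentinel <- function(x, sentinel = 999) {
  x <- suppressWarnings(as.numeric(x))
  x[!is.na(x) & x == sentinel] <- NA
  x
}

## Toy values standing in for a column such as v3_lc_ejey_lesion_mm_1.
toy_lesion_mm <- c("3.0", "7.2", "31.3", "999", "999")
cleaned <- recode_sentinel(toy_lesion_mm)
summary(cleaned)
```

With the placeholder removed, the mean and quartiles describe only the measured lesions, and `summary()` reports the number of missing values separately.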

6.9 Visualize: Repeat plots using only the Tumaco samples

6.9.1 All samples

t_clinical_nobiop_norm <- normalize_expt(t_clinical_nobiop, filter=TRUE, norm="quant",
                                         convert="cpm", transform="log2")
## Removing 8016 low-count genes (11907 remaining).
## transform_counts: Found 93 values equal to 0, adding 1 to the matrix.
t_clinical_nobiop_pca <- plot_pca(t_clinical_nobiop_norm, plot_labels=FALSE)
dev <- pp(file="images/clinical_nobiopsys_tumaco_norm_pca.png")
t_clinical_nobiop_pca$plot
closed <- dev.off()
t_clinical_nobiop_pca$plot

t_clinical_nobiop_nb <- normalize_expt(t_clinical_nobiop, filter=TRUE, convert="cpm",
                                       transform="log2", batch="svaseq")
## Removing 8016 low-count genes (11907 remaining).
## Setting 9578 low elements to zero.
## transform_counts: Found 9578 values equal to 0, adding 1 to the matrix.
t_clinical_nobiop_nb_pca <- plot_pca(t_clinical_nobiop_nb, plot_labels=FALSE)
dev <- pp(file="images/clinical_nobiopsys_tumaco_nb_pca.png")
t_clinical_nobiop_nb_pca$plot
closed <- dev.off()
t_clinical_nobiop_nb_pca$plot

Now we have a new, smaller set of primary samples which are categorized by cell type.

6.9.2 Visualize: Biopsy samples only Tumaco

Sadly, the biopsy samples remain basically impenetrable. This is disappointing; it would be particularly nice if we could judge cure/fail from a visit 1 biopsy.

t_biopsies_norm <- normalize_expt(t_biopsies, transform="log2", convert="cpm",
  norm="quant", filter=TRUE)
## Removing 6417 low-count genes (13506 remaining).
## transform_counts: Found 136 values equal to 0, adding 1 to the matrix.
t_biopsies_pca <- plot_pca(t_biopsies_norm,
  plot_labels=FALSE)
dev <- pp(file="images/biopsys_tumaco_norm.png")
t_biopsies_pca$plot
closed <- dev.off()
t_biopsies_pca$plot

t_biopsies_nb <- normalize_expt(t_biopsies, transform="log2", convert="cpm",
                                batch="svaseq", filter=TRUE)
## Removing 6417 low-count genes (13506 remaining).
## Setting 145 low elements to zero.
## transform_counts: Found 145 values equal to 0, adding 1 to the matrix.
t_biopsies_nb_pca <- plot_pca(t_biopsies_nb, plot_labels=FALSE)
dev <- pp(file="images/biopsys_tumaco_norm_sva.png")
t_biopsies_nb_pca$plot
closed <- dev.off()
t_biopsies_nb_pca$plot

6.9.3 Visualize: Monocyte samples only Tumaco

In contrast, I suspect that we can get meaningful data from the other cell types. The monocyte samples are still a bit messy.

t_monocyte_norm <- normalize_expt(t_monocytes, transform="log2", convert="cpm",
                                  norm="quant", filter=TRUE)
## Removing 9064 low-count genes (10859 remaining).
## transform_counts: Found 5 values equal to 0, adding 1 to the matrix.
t_monocyte_pca <- plot_pca(t_monocyte_norm,
  plot_labels=FALSE)
dev <- pp(file="images/monocytes_tumaco_norm.png")
t_monocyte_pca$plot
closed <- dev.off()
t_monocyte_pca$plot

t_monocyte_nb <- normalize_expt(t_monocytes, transform="log2", convert="cpm",
                                batch="svaseq", filter=TRUE)
## Removing 9064 low-count genes (10859 remaining).
## Setting 730 low elements to zero.
## transform_counts: Found 730 values equal to 0, adding 1 to the matrix.
t_monocyte_nb_pca <- plot_pca(t_monocyte_nb, plot_labels=FALSE)
dev <- pp(file="images/monocytes_tumaco_norm_sva.png")
t_monocyte_nb_pca$plot
closed <- dev.off()
t_monocyte_nb_pca$plot

6.9.4 Visualize: Neutrophil samples only Tumaco

Well, really all the cell types remain pretty messy. There is always at least one person in one visit or another who really does not fit well with the rest of the cohort.
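
One rough way to put a name on the samples that do not fit is to measure how far each one sits from the middle of the cohort in the first two principal components. The following is a base R sketch on simulated data with one planted odd sample. The cutoff of median plus three MADs is an arbitrary choice of mine, and all names here are hypothetical.

```r
set.seed(2)
## Simulated log-scale expression: 500 genes by 12 samples.
toy_expr <- matrix(rnorm(500 * 12), nrow = 500,
                   dimnames = list(NULL, paste0("s", 1:12)))
## Plant one sample which does not fit with the rest.
toy_expr[, "s12"] <- toy_expr[, "s12"] + rep(c(4, -4), length.out = 500)

## Sample coordinates in the first two principal components.
pcs <- prcomp(t(toy_expr))[["x"]][, 1:2]

## Distance of each sample from the per-component median.
center <- apply(pcs, 2, median)
dists <- sqrt(rowSums(sweep(pcs, 2, center)^2))

## Flag samples beyond an arbitrary robust cutoff.
cutoff <- median(dists) + 3 * mad(dists)
flagged <- names(dists)[dists > cutoff]
flagged
```

This only flags candidates for a closer look; it says nothing about why a sample is unusual.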

t_neutrophil_norm <- normalize_expt(t_neutrophils, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE)
## Removing 10824 low-count genes (9099 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
t_neutrophil_pca <- plot_pca(t_neutrophil_norm,
                             plot_labels=FALSE)
dev <- pp(file="images/neutrophils_tumaco_norm.png")
t_neutrophil_pca$plot
closed <- dev.off()
t_neutrophil_pca$plot

t_neutrophil_nb <- normalize_expt(t_neutrophils, transform="log2", convert="cpm",
                                     batch="svaseq", filter=TRUE)
## Removing 10824 low-count genes (9099 remaining).
## Setting 750 low elements to zero.
## transform_counts: Found 750 values equal to 0, adding 1 to the matrix.
t_neutrophil_nb_pca <- plot_pca(t_neutrophil_nb, plot_labels=FALSE)
dev <- pp(file="images/neutrophils_tumaco_norm_sva.png")
t_neutrophil_nb_pca$plot
closed <- dev.off()
t_neutrophil_nb_pca$plot

6.9.5 Visualize: Eosinophil samples only Tumaco

t_eosinophil_norm <- normalize_expt(t_eosinophils, transform="log2", convert="cpm",
                                    norm="quant", filter=TRUE)
## Removing 9393 low-count genes (10530 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
t_eosinophil_pca <- plot_pca(t_eosinophil_norm,
                             plot_labels=FALSE)
dev <- pp(file="images/eosinophils_tumaco_norm.png")
t_eosinophil_pca$plot
closed <- dev.off()
t_eosinophil_pca$plot

t_eosinophil_nb <- normalize_expt(t_eosinophils, transform="log2", convert="cpm",
                                  batch="svaseq", filter=TRUE)
## Removing 9393 low-count genes (10530 remaining).
## Setting 325 low elements to zero.
## transform_counts: Found 325 values equal to 0, adding 1 to the matrix.
t_eosinophil_nb_pca <- plot_pca(t_eosinophil_nb, plot_labels=FALSE)
dev <- pp(file="images/eosinophils_tumaco_norm_sva.png")
t_eosinophil_nb_pca$plot
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
closed <- dev.off()
t_eosinophil_nb_pca$plot
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

6.9.6 Visualize: Look at Cell types C/F by visit

6.9.6.1 Monocytes, Visit 1

t_monocyte_v1 <- subset_expt(t_monocytes, subset = "visitnumber=='1'")
## subset_expt(): There were 42, now there are 16 samples.
t_monocyte_v1_norm <- normalize_expt(t_monocyte_v1, norm = "quant", convert = "cpm",
                                     transform = "log2", filter = TRUE)
## Removing 9444 low-count genes (10479 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
t_monocyte_v1_pca <- plot_pca(t_monocyte_v1_norm, plot_labels = FALSE)
dev <- pp(file="images/monocytes_v1_cf_norm_pca.png")
t_monocyte_v1_pca$plot
closed <- dev.off()
t_monocyte_v1_pca$plot

t_monocyte_v1_nb <- normalize_expt(t_monocyte_v1, convert = "cpm",
                                   transform = "log2", filter = TRUE, batch = "svaseq")
## Removing 9444 low-count genes (10479 remaining).
## Setting 187 low elements to zero.
## transform_counts: Found 187 values equal to 0, adding 1 to the matrix.
t_monocyte_v1_nb_pca <- plot_pca(t_monocyte_v1_nb, plot_labels = FALSE)
dev <- pp(file="images/monocytes_v1_cf_norm_sva_pca.png")
t_monocyte_v1_nb_pca$plot
closed <- dev.off()
t_monocyte_v1_nb_pca$plot

6.9.6.2 Monocytes, Visit 2

t_monocyte_v2 <- subset_expt(t_monocytes, subset = "visitnumber=='2'")
## subset_expt(): There were 42, now there are 13 samples.
t_monocyte_v2_norm <- normalize_expt(t_monocyte_v2, norm = "quant", convert = "cpm",
                                     transform = "log2", filter = TRUE)
## Removing 9403 low-count genes (10520 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
t_monocyte_v2_pca <- plot_pca(t_monocyte_v2_norm, plot_labels = FALSE)
dev <- pp(file="images/monocytes_v2_cf_norm_pca.png")
t_monocyte_v2_pca$plot
closed <- dev.off()
t_monocyte_v2_pca$plot

t_monocyte_v2_nb <- normalize_expt(t_monocyte_v2, convert = "cpm",
                                   transform = "log2", filter = TRUE, batch = "svaseq")
## Removing 9403 low-count genes (10520 remaining).
## Setting 115 low elements to zero.
## transform_counts: Found 115 values equal to 0, adding 1 to the matrix.
t_monocyte_v2_nb_pca <- plot_pca(t_monocyte_v2_nb, plot_labels = FALSE)
dev <- pp(file="images/monocytes_v2_cf_norm_sva_pca.png")
t_monocyte_v2_nb_pca$plot
closed <- dev.off()
t_monocyte_v2_nb_pca$plot

6.9.6.3 Monocytes, Visit 3

t_monocyte_v3 <- subset_expt(t_monocytes, subset = "visitnumber=='3'")
## subset_expt(): There were 42, now there are 13 samples.
t_monocyte_v3_norm <- normalize_expt(t_monocyte_v3, norm = "quant", convert = "cpm",
                                   transform = "log2", filter = TRUE)
## Removing 9549 low-count genes (10374 remaining).
## transform_counts: Found 16 values equal to 0, adding 1 to the matrix.
t_monocyte_v3_pca <- plot_pca(t_monocyte_v3_norm, plot_labels = FALSE)
dev <- pp(file="images/monocytes_v3_cf_norm_pca.png")
t_monocyte_v3_pca$plot
closed <- dev.off()
t_monocyte_v3_pca$plot

t_monocyte_v3_nb <- normalize_expt(t_monocyte_v3, convert = "cpm",
                                   transform = "log2", filter = TRUE, batch = "svaseq")
## Removing 9549 low-count genes (10374 remaining).
## Setting 55 low elements to zero.
## transform_counts: Found 55 values equal to 0, adding 1 to the matrix.
t_monocyte_v3_nb_pca <- plot_pca(t_monocyte_v3_nb, plot_labels = FALSE)
dev <- pp(file="images/monocytes_v3_cf_norm_sva_pca.png")
t_monocyte_v3_nb_pca$plot
closed <- dev.off()
t_monocyte_v3_nb_pca$plot

6.9.6.4 Neutrophils, Visit 1

t_neutrophil_v1 <- subset_expt(t_neutrophils, subset = "visitnumber=='1'")
## subset_expt(): There were 41, now there are 16 samples.
t_neutrophil_v1_norm <- normalize_expt(t_neutrophil_v1, norm = "quant", convert = "cpm",
                                   transform = "log2", filter = TRUE)
## Removing 11208 low-count genes (8715 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
t_neutrophil_v1_pca <- plot_pca(t_neutrophil_v1_norm, plot_labels = FALSE)
dev <- pp(file="images/neutrophils_v1_cf_norm_pca.png")
t_neutrophil_v1_pca$plot
closed <- dev.off()
t_neutrophil_v1_pca$plot

t_neutrophil_v1_nb <- normalize_expt(t_neutrophil_v1, convert = "cpm",
                                     transform = "log2", filter = TRUE, batch = "ruvg")
## Removing 11208 low-count genes (8715 remaining).
## Warning in RUVSeq::RUVg(linear_mtrx, ruv_controls, k = chosen_surrogates): The expression matrix does not contain counts.
## Please, pass a matrix of counts (not logged) or set isLog to TRUE to skip the log transformation
## Setting 192 low elements to zero.
## transform_counts: Found 192 values equal to 0, adding 1 to the matrix.
t_neutrophil_v1_nb_pca <- plot_pca(t_neutrophil_v1_nb, plot_labels = FALSE)
dev <- pp(file="images/neutrophils_v1_cf_norm_sva_pca.png")
t_neutrophil_v1_nb_pca$plot
closed <- dev.off()
t_neutrophil_v1_nb_pca$plot

6.9.6.5 Neutrophils, Visit 2

t_neutrophil_v2 <- subset_expt(t_neutrophils, subset = "visitnumber=='2'")
## subset_expt(): There were 41, now there are 13 samples.
t_neutrophil_v2_norm <- normalize_expt(t_neutrophil_v2, norm = "quant", convert = "cpm",
                                   transform = "log2", filter = TRUE)
## Removing 11473 low-count genes (8450 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
t_neutrophil_v2_pca <- plot_pca(t_neutrophil_v2_norm, plot_labels = FALSE)
dev <- pp(file="images/neutrophils_v2_cf_norm_pca.png")
t_neutrophil_v2_pca$plot
closed <- dev.off()
t_neutrophil_v2_pca$plot

t_neutrophil_v2_nb <- normalize_expt(t_neutrophil_v2, convert = "cpm",
                                     transform = "log2", filter = TRUE, batch = "svaseq")
## Removing 11473 low-count genes (8450 remaining).
## Setting 78 low elements to zero.
## transform_counts: Found 78 values equal to 0, adding 1 to the matrix.
t_neutrophil_v2_nb_pca <- plot_pca(t_neutrophil_v2_nb, plot_labels = FALSE)
dev <- pp(file="images/neutrophils_v2_cf_norm_sva_pca.png")
t_neutrophil_v2_nb_pca$plot
closed <- dev.off()
t_neutrophil_v2_nb_pca$plot

6.9.6.6 Neutrophils, Visit 3

t_neutrophil_v3 <- subset_expt(t_neutrophils, subset = "visitnumber=='3'")
## subset_expt(): There were 41, now there are 12 samples.
t_neutrophil_v3_norm <- normalize_expt(t_neutrophil_v3, norm = "quant", convert = "cpm",
                                       transform = "log2", filter = TRUE)
## Removing 11420 low-count genes (8503 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
t_neutrophil_v3_pca <- plot_pca(t_neutrophil_v3_norm, plot_labels = FALSE)
dev <- pp(file="images/neutrophils_v3_cf_norm_pca.png")
t_neutrophil_v3_pca$plot
closed <- dev.off()
t_neutrophil_v3_pca$plot

t_neutrophil_v3_nb <- normalize_expt(t_neutrophil_v3, convert = "cpm",
                                     transform = "log2", filter = TRUE, batch = "svaseq")
## Removing 11420 low-count genes (8503 remaining).
## Setting 83 low elements to zero.
## transform_counts: Found 83 values equal to 0, adding 1 to the matrix.
t_neutrophil_v3_nb_pca <- plot_pca(t_neutrophil_v3_nb, plot_labels = FALSE)
dev <- pp(file="images/neutrophils_v3_cf_norm_sva_pca.png")
t_neutrophil_v3_nb_pca$plot
closed <- dev.off()
t_neutrophil_v3_nb_pca$plot

6.9.6.7 Eosinophils, Visit 1

t_eosinophil_v1 <- subset_expt(t_eosinophils, subset = "visitnumber=='1'")
## subset_expt(): There were 26, now there are 8 samples.
t_eosinophil_v1_norm <- normalize_expt(t_eosinophil_v1, norm = "quant", convert = "cpm",
                                   transform = "log2", filter = TRUE)
## Removing 9946 low-count genes (9977 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
t_eosinophil_v1_pca <- plot_pca(t_eosinophil_v1_norm, plot_labels = FALSE)
dev <- pp(file="images/eosinophils_v1_cf_norm_pca.png")
t_eosinophil_v1_pca$plot
closed <- dev.off()
t_eosinophil_v1_pca$plot

t_eosinophil_v1_nb <- normalize_expt(t_eosinophil_v1, convert = "cpm",
                                     transform = "log2", filter = TRUE, batch = "svaseq")
## Removing 9946 low-count genes (9977 remaining).
## Setting 57 low elements to zero.
## transform_counts: Found 57 values equal to 0, adding 1 to the matrix.
t_eosinophil_v1_nb_pca <- plot_pca(t_eosinophil_v1_nb, plot_labels = FALSE)
dev <- pp(file="images/eosinophils_v1_cf_norm_sva_pca.png")
t_eosinophil_v1_nb_pca$plot
closed <- dev.off()
t_eosinophil_v1_nb_pca$plot

6.9.6.8 Eosinophils, Visit 2

t_eosinophil_v2 <- subset_expt(t_eosinophils, subset = "visitnumber=='2'")
## subset_expt(): There were 26, now there are 9 samples.
t_eosinophil_v2_norm <- normalize_expt(t_eosinophil_v2, norm = "quant", convert = "cpm",
                                   transform = "log2", filter = TRUE)
## Removing 9808 low-count genes (10115 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
t_eosinophil_v2_pca <- plot_pca(t_eosinophil_v2_norm, plot_labels = FALSE)
dev <- pp(file="images/eosinophils_v2_cf_norm_pca.png")
t_eosinophil_v2_pca$plot
closed <- dev.off()
t_eosinophil_v2_pca$plot

t_eosinophil_v2_nb <- normalize_expt(t_eosinophil_v2, convert = "cpm",
                                     transform = "log2", filter = TRUE, batch = "svaseq")
## Removing 9808 low-count genes (10115 remaining).
## Setting 90 low elements to zero.
## transform_counts: Found 90 values equal to 0, adding 1 to the matrix.
t_eosinophil_v2_nb_pca <- plot_pca(t_eosinophil_v2_nb, plot_labels = FALSE)
dev <- pp(file="images/eosinophils_v2_cf_norm_sva_pca.png")
t_eosinophil_v2_nb_pca$plot
closed <- dev.off()
t_eosinophil_v2_nb_pca$plot

6.9.6.9 Eosinophils, Visit 3

t_eosinophil_v3 <- subset_expt(t_eosinophils, subset = "visitnumber=='3'")
## subset_expt(): There were 26, now there are 9 samples.
t_eosinophil_v3_norm <- normalize_expt(t_eosinophil_v3, norm = "quant", convert = "cpm",
                                       transform = "log2", filter = TRUE)
## Removing 9845 low-count genes (10078 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
t_eosinophil_v3_pca <- plot_pca(t_eosinophil_v3_norm, plot_labels = FALSE)
dev <- pp(file="images/eosinophils_v3_cf_norm_pca.png")
t_eosinophil_v3_pca$plot
closed <- dev.off()
t_eosinophil_v3_pca$plot

t_eosinophil_v3_nb <- normalize_expt(t_eosinophil_v3, convert = "cpm",
                                     transform = "log2", filter = TRUE, batch = "svaseq")
## Removing 9845 low-count genes (10078 remaining).
## Setting 48 low elements to zero.
## transform_counts: Found 48 values equal to 0, adding 1 to the matrix.
t_eosinophil_v3_nb_pca <- plot_pca(t_eosinophil_v3_nb, plot_labels = FALSE)
dev <- pp(file="images/eosinophils_v3_cf_norm_sva_pca.png")
t_eosinophil_v3_nb_pca$plot
closed <- dev.off()
t_eosinophil_v3_nb_pca$plot
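
The nine cell type by visit blocks above differ only in the input data structure, the subset string, and the output file names. If one wanted to drive them from a table instead, the bookkeeping might be sketched as follows. The `plan` data frame is hypothetical; the subset strings and file name patterns are copied from the blocks above, and the actual `subset_expt()`, `normalize_expt()`, and `plot_pca()` calls are left out so that the block runs without hpgltools.

```r
## Hypothetical bookkeeping table: one row per cell type and visit.
plan <- expand.grid(visit = 1:3,
                    cell = c("monocyte", "neutrophil", "eosinophil"),
                    stringsAsFactors = FALSE)
## Name of the input data structure, e.g. t_monocytes.
plan[["expt_name"]] <- paste0("t_", plan[["cell"]], "s")
## Subset string in the form used above.
plan[["subset"]] <- paste0("visitnumber=='", plan[["visit"]], "'")
## Output files for the quantile-normalized and svaseq-adjusted PCA plots.
plan[["norm_png"]] <- sprintf("images/%ss_v%d_cf_norm_pca.png",
                              plan[["cell"]], plan[["visit"]])
plan[["sva_png"]] <- sprintf("images/%ss_v%d_cf_norm_sva_pca.png",
                             plan[["cell"]], plan[["visit"]])
plan[, c("expt_name", "subset", "norm_png")]
```

Each row then carries everything one iteration of a loop would need, which makes it harder for a single block to drift away from the others.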

6.10 Recategorize: Concatenate cure/fail and cell type

In the following block the experimental condition was reset to the concatenation of clinical outcome and type of cells. There are too few biopsy samples for them to be useful in this visualization, so they are excluded.

desired_levels <- c("cure_biopsy", "failure_biopsy", "cure_eosinophils", "failure_eosinophils",
                    "cure_monocytes", "failure_monocytes", "cure_neutrophils", "failure_neutrophils")
new_fact <- factor(
    paste0(pData(t_clinical)[["condition"]], "_",
           pData(t_clinical)[["batch"]]),
    levels=desired_levels)

t_clinical_concat <- set_expt_conditions(t_clinical, fact = new_fact) %>%
  set_expt_batches(fact = "visitnumber") %>%
  set_expt_colors(color_choices[["cf_type"]]) %>%
  subset_expt(subset="typeofcells!='biopsy'")
## 
##         cure_biopsy      failure_biopsy    cure_eosinophils failure_eosinophils 
##                   9                   5                  17                   9 
##      cure_monocytes   failure_monocytes    cure_neutrophils failure_neutrophils 
##                  21                  21                  20                  21 
## 
##  3  2  1 
## 34 35 54
## subset_expt(): There were 123, now there are 109 samples.
## Try to ensure that the levels stay in the order I want
meta <- pData(t_clinical_concat) %>%
  mutate(condition = fct_relevel(
    condition, intersect(desired_levels, as.character(condition))))
pData(t_clinical_concat) <- meta
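
Once the biopsy samples are removed, two of the eight desired levels no longer have any samples. A base R sketch of building a combined factor with a fixed level order and then dropping the empty levels is shown here on toy data. The `toy` data frame and `wanted` vector are hypothetical stand-ins for the metadata and `desired_levels`.

```r
## Toy metadata with one row per sample.
toy <- data.frame(
  finaloutcome = c("cure", "failure", "cure", "failure", "cure", "failure"),
  typeofcells = c("biopsy", "biopsy", "monocytes", "monocytes",
                  "neutrophils", "eosinophils"))
wanted <- c("cure_biopsy", "failure_biopsy",
            "cure_eosinophils", "failure_eosinophils",
            "cure_monocytes", "failure_monocytes",
            "cure_neutrophils", "failure_neutrophils")

## Combined outcome_celltype factor with an explicit level order.
toy[["condition"]] <- factor(
  paste0(toy[["finaloutcome"]], "_", toy[["typeofcells"]]),
  levels = wanted)

## Remove biopsies, then drop the levels which no longer occur.
no_biopsy <- toy[toy[["typeofcells"]] != "biopsy", ]
no_biopsy[["condition"]] <- droplevels(no_biopsy[["condition"]])
levels(no_biopsy[["condition"]])
```

`droplevels()` keeps the remaining levels in their original relative order, so the ordering chosen in `wanted` survives the subset.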

6.10.1 Visualize: Look at Tumaco-only samples by cell type and cure/fail

The following block is pretty wild to my eyes; it seems to me that the variances introduced by cell type basically wipe out the apparent differences between cure/fail that we were able to see previously.

I suppose this is not entirely surprising, but when we had the Cali samples it at least looked like there were differences which were explicitly between cure/fail across cell types. I suppose this means those differences were actually coming from the imbalance between the two clinics rather than from cure/fail itself.

t_clinical_concat_norm <- normalize_expt(t_clinical_concat, transform = "log2", convert = "cpm",
                                       norm = "quant", filter = TRUE)
## Removing 8016 low-count genes (11907 remaining).
## transform_counts: Found 93 values equal to 0, adding 1 to the matrix.
t_clinical_concat_norm_pca <- plot_pca(t_clinical_concat_norm)
## plot labels was not set and there are more than 100 samples, disabling it.
dev <- pp(file=glue("images/clinical_concatenated_normalized_pca-v{ver}.png"), height=6, width=10)
t_clinical_concat_norm_pca$plot
closed <- dev.off()
t_clinical_concat_norm_pca$plot

t_clinical_concat_nb <- normalize_expt(t_clinical_concat, transform = "log2", convert = "cpm",
                                     batch = "svaseq", filter = TRUE)
## Removing 8016 low-count genes (11907 remaining).
## Setting 9896 low elements to zero.
## transform_counts: Found 9896 values equal to 0, adding 1 to the matrix.
t_clinical_concat_nb_pca <- plot_pca(t_clinical_concat_nb)
## plot labels was not set and there are more than 100 samples, disabling it.
dev <- pp(file=glue("images/clinical_concatenated_svaseqbatch_pca-v{ver}.png"), height=6, width=12)
t_clinical_concat_nb_pca$plot
closed <- dev.off()
t_clinical_concat_nb_pca$plot

7 Visit comparisons

Let us shift the focus from cell type and/or Cure/Fail to the visit number. As you are likely aware, the three visits are spread well apart in time, following the clinical treatment of each patient. Thus we will now separate the samples by visit in order to more easily see what new patterns emerge.

7.1 Recategorize: All visits together

Now let us shift the view slightly to focus on changes observed over time.

t_visit_expt <- set_expt_conditions(t_clinical, fact = "visitnumber") %>%
  set_expt_batches(fact = "finaloutcome") %>%
  subset_expt(subset="typeofcells!='biopsy'")
## 
##  3  2  1 
## 34 35 54 
## 
##    cure failure 
##      67      56
## subset_expt(): There were 123, now there are 109 samples.
t_visit_norm <- normalize_expt(t_visit_expt, transform="log2", convert="cpm",
                             norm="quant", filter=TRUE)
## Removing 8016 low-count genes (11907 remaining).
## transform_counts: Found 93 values equal to 0, adding 1 to the matrix.
plot_pca(t_visit_norm)$plot
## plot labels was not set and there are more than 100 samples, disabling it.

t_visit_nb <- normalize_expt(t_visit_expt, transform = "log2", convert="cpm",
                             filter = TRUE, batch = "svaseq")
## Removing 8016 low-count genes (11907 remaining).
## Setting 9614 low elements to zero.
## transform_counts: Found 9614 values equal to 0, adding 1 to the matrix.
t_visit_nb_pca <- plot_pca(t_visit_nb)
## plot labels was not set and there are more than 100 samples, disabling it.
dev <- pp(file=glue("images/visit_svaseqbatch_pca-v{ver}.png"), height=7, width=9)
t_visit_nb_pca$plot
closed <- dev.off()
t_visit_nb_pca$plot

When looking at all cell types, it is quite difficult to see differences among the three visits.

7.2 Visualize: C/F for only the visit 1 samples

When we had both Cali and Tumaco samples, it looked like there was variance suggesting differences between cure and fail for visit 1. I think the following block will suggest pretty strongly that this was not true.

tv1_norm <- normalize_expt(tv1_samples, transform="log2", convert="cpm",
                          norm="quant", filter=TRUE)
## Removing 5907 low-count genes (14016 remaining).
## transform_counts: Found 272 values equal to 0, adding 1 to the matrix.
plot_pca(tv1_norm)$plot
## Warning: ggrepel: 38 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

tv1_nb <- normalize_expt(tv1_samples, transform = "log2", convert = "cpm",
                        filter = TRUE, batch = "svaseq")
## Removing 5907 low-count genes (14016 remaining).
## Setting 7615 low elements to zero.
## transform_counts: Found 7615 values equal to 0, adding 1 to the matrix.
plot_pca(tv1_nb, plot_labels = FALSE)$plot

7.3 Visualize: C/F for only the visit 2 samples

tv2_clinical <- subset_expt(tv2_samples, subset="visitnumber=='2'") %>%
  set_expt_conditions(fact = "finaloutcome") %>%
  set_expt_batches(fact = "typeofcells")
## subset_expt(): There were 35, now there are 35 samples.
## 
##    cure failure 
##      20      15 
## 
## eosinophils   monocytes neutrophils 
##           9          13          13
tv2_nb <- normalize_expt(tv2_clinical, transform = "log2", convert = "cpm", norm = "quant",
                        filter = TRUE, batch = "svaseq")
## Warning in normalize_expt(tv2_clinical, transform = "log2", convert = "cpm", :
## Quantile normalization and sva do not always play well together.
## Removing 8364 low-count genes (11559 remaining).
## Setting 1786 low elements to zero.
## transform_counts: Found 1786 values equal to 0, adding 1 to the matrix.
plot_pca(tv2_nb, plot_labels = FALSE)$plot

7.4 Visualize: C/F for only the visit 3 samples

tv3_clinical <- subset_expt(tv3_samples, subset="visitnumber=='3'") %>%
  set_expt_conditions(fact = "finaloutcome") %>%
  set_expt_batches(fact = "typeofcells")
## subset_expt(): There were 34, now there are 34 samples.
## 
##    cure failure 
##      17      17 
## 
## eosinophils   monocytes neutrophils 
##           9          13          12
tv3_nb <- normalize_expt(tv3_clinical, transform = "log2", convert = "cpm", norm = "quant",
                        filter = TRUE, batch = "svaseq")
## Warning in normalize_expt(tv3_clinical, transform = "log2", convert = "cpm", :
## Quantile normalization and sva do not always play well together.
## Removing 8474 low-count genes (11449 remaining).
## Setting 1481 low elements to zero.
## transform_counts: Found 1481 values equal to 0, adding 1 to the matrix.
plot_pca(tv3_nb, plot_labels = FALSE)$plot

7.4.1 Visualize: Comparing 3 visits by cell type

Separate the samples by cell type in order to more easily observe patterns with respect to visit and clinical outcome.

7.4.1.1 Monocytes across visits

t_visitcf_monocyte_norm <- normalize_expt(t_visitcf_monocyte, norm = "quant", convert = "cpm",
                                transform = "log2", filter = TRUE)
## Removing 9064 low-count genes (10859 remaining).
## transform_counts: Found 5 values equal to 0, adding 1 to the matrix.
t_visitcf_monocyte_pca <- plot_pca(t_visitcf_monocyte_norm, plot_labels = FALSE)
dev <- pp(file="images/visit_monocytes_cf_norm_pca.png")
t_visitcf_monocyte_pca$plot
closed <- dev.off()
t_visitcf_monocyte_pca$plot

t_visitcf_monocyte_disheat <- plot_disheat(t_visitcf_monocyte_norm)
dev <- pp(file="images/visit_monocytes_cf_norm_disheat.png")
t_visitcf_monocyte_disheat$plot
closed <- dev.off()
t_visitcf_monocyte_disheat$plot

t_visitcf_monocyte_nb <- normalize_expt(t_visitcf_monocyte, convert = "cpm",
                                    transform = "log2", filter = TRUE, batch = "svaseq")
## Removing 9064 low-count genes (10859 remaining).
## Setting 688 low elements to zero.
## transform_counts: Found 688 values equal to 0, adding 1 to the matrix.
t_visitcf_monocyte_nb_pca <- plot_pca(t_visitcf_monocyte_nb, plot_labels = FALSE)
dev <- pp(file="images/monocytes_cf_norm_sva_pca.png")
t_visitcf_monocyte_nb_pca$plot
closed <- dev.off()
t_visitcf_monocyte_nb_pca$plot

8 Persistence

8.0.1 Take a look

See if there are any patterns which look usable.

## All
t_persistence_norm <- normalize_expt(t_persistence, transform = "log2", convert = "cpm",
                                   norm = "quant", filter = TRUE)
## Removing 8537 low-count genes (11386 remaining).
## transform_counts: Found 15 values equal to 0, adding 1 to the matrix.
plot_pca(t_persistence_norm)$plot
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

t_persistence_nb <- normalize_expt(t_persistence, transform = "log2", convert = "cpm",
                                 batch = "svaseq", filter = TRUE)
## Removing 8537 low-count genes (11386 remaining).
## Setting 1538 low elements to zero.
## transform_counts: Found 1538 values equal to 0, adding 1 to the matrix.
plot_pca(t_persistence_nb)$plot

## Biopsies
##persistence_biopsy_norm <- normalize_expt(persistence_biopsy, transform = "log2", convert = "cpm",
##                                   norm = "quant", filter = TRUE)
##plot_pca(persistence_biopsy_norm)$plot
## Insufficient data

## Monocytes
t_persistence_monocyte_norm <- normalize_expt(t_persistence_monocyte, transform = "log2", convert = "cpm",
                                              norm = "quant", filter = TRUE)
## Removing 9597 low-count genes (10326 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
plot_pca(t_persistence_monocyte_norm)$plot

t_persistence_monocyte_nb <- normalize_expt(t_persistence_monocyte, transform = "log2", convert = "cpm",
                                 batch = "svaseq", filter = TRUE)
## Removing 9597 low-count genes (10326 remaining).
## Setting 46 low elements to zero.
## transform_counts: Found 46 values equal to 0, adding 1 to the matrix.
plot_pca(t_persistence_monocyte_nb)$plot

## Neutrophils
t_persistence_neutrophil_norm <- normalize_expt(t_persistence_neutrophil, transform = "log2", convert = "cpm",
                                                norm = "quant", filter = TRUE)
## Removing 11531 low-count genes (8392 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
plot_pca(t_persistence_neutrophil_norm)$plot

t_persistence_neutrophil_nb <- normalize_expt(t_persistence_neutrophil, transform = "log2", convert = "cpm",
                                              batch = "svaseq", filter = TRUE)
## Removing 11531 low-count genes (8392 remaining).
## Setting 46 low elements to zero.
## transform_counts: Found 46 values equal to 0, adding 1 to the matrix.
plot_pca(t_persistence_neutrophil_nb)$plot

## Eosinophils
t_persistence_eosinophil_norm <- normalize_expt(t_persistence_eosinophil, transform = "log2", convert = "cpm",
                                                norm = "quant", filter = TRUE)
## Removing 9895 low-count genes (10028 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
plot_pca(t_persistence_eosinophil_norm)$plot

t_persistence_eosinophil_nb <- normalize_expt(t_persistence_eosinophil, transform = "log2", convert = "cpm",
                                              batch = "svaseq", filter = TRUE)
## Removing 9895 low-count genes (10028 remaining).
## Setting 25 low elements to zero.
## transform_counts: Found 25 values equal to 0, adding 1 to the matrix.
plot_pca(t_persistence_eosinophil_nb)$plot

9 Classify me!

I wrote all of the z2.2- and z2.3-specific variants out to a pair of files; I want to see whether I can classify a human sample as infected with z2.2 or z2.3.

z22 <- read.csv("csv/variants_22.csv")
z23 <- read.csv("csv/variants_23.csv")
cure <- read.csv("csv/cure_variants.txt")
fail <- read.csv("csv/fail_variants.txt")
z22_vec <- gsub(pattern="\\-", replacement="_", x=z22[["x"]])
z23_vec <- gsub(pattern="\\-", replacement="_", x=z23[["x"]])
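# read.csv() returns a data frame, so pull out the ID column before gsub().
# Assumption: these files share the single "x" column layout of the z22/z23 files.
cure_vec <- gsub(pattern="\\-", replacement="_", x=cure[["x"]])
fail_vec <- gsub(pattern="\\-", replacement="_", x=fail[["x"]])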

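# Report how many of a sample's variant positions fall in the z2.2- and
# z2.3-specific sets, each expressed relative to the size of that set.
classify_zymo <- function(sample) {
  # sm() silences the column-specification chatter from readr.
  arbitrary_tags <- sm(readr::read_tsv(sample))
  arbitrary_ids <- arbitrary_tags[["position"]]
  message("Length: ", length(arbitrary_ids), ", z22: ",
          sum(arbitrary_ids %in% z22_vec) / (length(z22_vec)), " z23: ",
          sum(arbitrary_ids %in% z23_vec) / (length(z23_vec)))
}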

arbitrary_sample <- "preprocessing/TMRC30156/outputs/40freebayes_lpanamensis_v36/all_tags.txt.xz"
classify_zymo(arbitrary_sample)
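
classify_zymo() only prints the two fractions, so here is a rough sketch of how the same overlap idea could return a value that is easier to tabulate across samples. It is not what this document runs: classify_by_overlap() is a hypothetical helper, the toy identifiers are invented for illustration (the real format of the position column is not shown here), and the tie rule is an arbitrary choice. It also de-duplicates the identifiers first, so the fractions cannot exceed 1.

```r
# Hypothetical helper: compare a sample's variant IDs against the
# z2.2- and z2.3-specific sets and return the fractions plus a call.
classify_by_overlap <- function(sample_ids, z22_ids, z23_ids) {
  sample_ids <- unique(sample_ids)
  z22_ids <- unique(z22_ids)
  z23_ids <- unique(z23_ids)
  # Empty reference sets would give a 0/0 fraction, so refuse them.
  stopifnot(length(z22_ids) > 0, length(z23_ids) > 0)
  # Fraction of each reference set that is observed in the sample.
  fractions <- c(
    "z2.2" = sum(z22_ids %in% sample_ids) / length(z22_ids),
    "z2.3" = sum(z23_ids %in% sample_ids) / length(z23_ids))
  # Arbitrary rule for this sketch: equal fractions are left uncalled.
  best <- if (fractions[["z2.2"]] == fractions[["z2.3"]]) {
    "ambiguous"
  } else {
    names(fractions)[which.max(fractions)]
  }
  list(fractions = fractions, call = best)
}

# Toy identifiers, invented purely for illustration.
toy_z22 <- c("chr1_100_A_G", "chr1_200_C_T", "chr2_50_G_A", "chr2_75_T_C")
toy_z23 <- c("chr1_150_A_C", "chr3_10_G_T", "chr3_90_C_A", "chr4_5_T_G")
toy_sample <- c("chr1_100_A_G", "chr1_200_C_T", "chr2_50_G_A",
                "chr3_10_G_T", "chr9_1_A_T")

result <- classify_by_overlap(toy_sample, toy_z22, toy_z23)
result$call
```

With these toy vectors the sample carries three of the four z2.2 IDs and one of the four z2.3 IDs, so the call is "z2.2". A real threshold for trusting the call would need to come from samples of known type.
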
##if (!isTRUE(get0("skip_load"))) {
##  pander::pander(sessionInfo())
##  message("This is hpgltools commit: ", get_git_commit())
##  message("Saving to ", savefile)
##  tmp <- sm(saveme(filename=savefile))
##}
tmp <- loadme(filename=savefile)
