1 Introduction

This worksheet is intended to provide the analyses for the TMRC3 project. As of ~ 20211013 we have a shared directory into which to place various results. For the moment I think we will primarily attempt to put csv files. In each directory of that shared tree, we will attempt to put the set of results which are defined by each subdivision.

2 Sample sheet

I think there have been some new/improved annotations in the online sample sheet, so let us go download a fresh copy and see.

samplesheet <- "sample_sheets/tmrc3_samples_20211012.xlsx"
crf_metadata <- "sample_sheets/20210825_EXP_ESPECIAL_TMRC3_VERSION_2.xlsx"

3 Annotation

We take the annotation data from ensembl’s biomart instance. The genome which was used to map the data was hg38 revision 100. My default when using biomart is to load the data from 1 year before the current date.

hs_annot <- sm(load_biomart_annotations(year="2020"))
hs_annot <- hs_annot[["annotation"]]
hs_annot[["transcript"]] <- paste0(rownames(hs_annot), ".", hs_annot[["version"]])
rownames(hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique=TRUE)
tx_gene_map <- hs_annot[, c("transcript", "ensembl_gene_id")]

summary(hs_annot)
##  ensembl_transcript_id ensembl_gene_id       version     transcript_version
##  Length:227921         Length:227921      Min.   : 1.0   Min.   : 1.00     
##  Class :character      Class :character   1st Qu.: 6.0   1st Qu.: 1.00     
##  Mode  :character      Mode  :character   Median :12.0   Median : 1.00     
##                                           Mean   :10.7   Mean   : 3.08     
##                                           3rd Qu.:16.0   3rd Qu.: 5.00     
##                                           Max.   :29.0   Max.   :17.00     
##                                                                            
##  hgnc_symbol        description        gene_biotype         cds_length    
##  Length:227921      Length:227921      Length:227921      Min.   :     3  
##  Class :character   Class :character   Class :character   1st Qu.:   357  
##  Mode  :character   Mode  :character   Mode  :character   Median :   694  
##                                                           Mean   :  1139  
##                                                           3rd Qu.:  1446  
##                                                           Max.   :107976  
##                                                           NA's   :127343  
##  chromosome_name       strand          start_position      end_position     
##  Length:227921      Length:227921      Min.   :5.77e+02   Min.   :6.47e+02  
##  Class :character   Class :character   1st Qu.:3.11e+07   1st Qu.:3.12e+07  
##  Mode  :character   Mode  :character   Median :6.04e+07   Median :6.06e+07  
##                                        Mean   :7.41e+07   Mean   :7.42e+07  
##                                        3rd Qu.:1.09e+08   3rd Qu.:1.09e+08  
##                                        Max.   :2.49e+08   Max.   :2.49e+08  
##                                                                             
##   transcript       
##  Length:227921     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
hs_go <- sm(load_biomart_go()[["go"]])
hs_length <- hs_annot[, c("ensembl_gene_id", "cds_length")]
colnames(hs_length) <- c("ID", "length")

4 Introduction

This document is intended to provide an overview of TMRC3 samples which have been sequenced. It includes some plots and analyses showing the relationships among the samples as well as some differential analyses when possible.

5 Sample Estimation

5.1 Generate expressionsets

The sample sheet is copied from our shared online sheet and updated with each release of sequencing data.

5.1.1 Hisat2 expressionsets

The first thing to note is the large range in coverage. There are multiple samples with coverage which is too low to use. These will be removed shortly.

In the following block I immediately exclude any non-coding reads as well.

## Create the expressionset and immediately pass it to a filter
## removing the non protein coding genes.
sanitize_columns <- c("visitnumber", "clinicaloutcome", "donor",
                      "typeofcells", "clinicalpresentation",
                      "condition", "batch")
hs_expt <- create_expt(samplesheet,
                       file_column="hg38100hisatfile",
                       savefile=glue::glue("rda/hs_expt_all-v{ver}.rda"),
                       gene_info=hs_annot) %>%
  exclude_genes_expt(column="gene_biotype", method="keep",
                     pattern="protein_coding", meta_column="ncrna_lost") %>%
  sanitize_expt_metadata(columns=sanitize_columns) %>%
  set_expt_factors(columns=sanitize_columns, class="factor")
## Reading the sample metadata.
## Dropped 54 rows from the sample metadata because they were blank.
## The sample definitions comprises: 238 rows(samples) and 75 columns(metadata fields).
## Warning in create_expt(samplesheet, file_column = "hg38100hisatfile", savefile =
## glue::glue("rda/hs_expt_all-v{ver}.rda"), : Some samples were removed when cross
## referencing the samples against the count data.
## Matched 21452 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 21481 rows and 205 columns.
## Before removal, there were 21481 genes, now there are 19941.
## There are 19 samples which kept less than 90 percent counts.
## TMRC30015 TMRC30017 TMRC30019 TMRC30044 TMRC30045 TMRC30154 TMRC30097 TMRC30075 
##     79.24     85.72     89.75     80.34     73.33     83.20     89.90     86.97 
## TMRC30087 TMRC30101 TMRC30104 TMRC30114 TMRC30126 TMRC30127 TMRC30120 TMRC30128 
##     83.63     88.41     80.29     87.62     84.52     89.49     79.16     82.53 
## TMRC30141 TMRC30131 TMRC30073 
##     89.40     86.82     89.26
levels(pData(hs_expt[["expressionset"]])[["visitnumber"]]) <- list(
    '0'="notapplicable", '1'=1, '2'=2, '3'=3)

5.1.2 Add Clinical metadata

Let us also merge in the clinician’s metadata. I worry a little that this might not be allowed for dbGap data, but if it is a problem I suspect we can just remove the bad columns from it. Also note that I rarely use the join function, but it is somewhat required here because I do not want to risk shuffling the metadata when I add the new metadata, which comes from a spreadsheet sorted by patient, not sample. In doing this I therefore just created a new column ‘join’ which contains the shared information, e.g. the patient ID from the existing metadata and the same ID from the CRF file which has been coerced into lowercase.

hs_pd <- pData(hs_expt)
start <- rownames(hs_pd)
hs_crf <- openxlsx::read.xlsx(crf_metadata)
hs_crf[["join"]] <- tolower(hs_crf[["codigo_paciente"]])
hs_pd[["join"]] <- hs_pd[["tubelabelorigin"]]
test <- plyr::join(hs_pd, hs_crf, by="join")
test[["join"]] <- NULL
rownames(test) <- rownames(hs_pd)
na_idx <- is.na(test)
test[na_idx] <- "undefined"
pData(hs_expt) <- test

This added a bunch of new columns, of which there are a few which Theresa showed are of likely interest:

  • eb_lc_sexo
  • eb_lc_etnia
  • edad eb_lc_tiempo_evolucion
  • eb_lc_num_lc_activas
  • eb_lc_ulcera_area_1
  • eb_lc_ejex_lesion_mm_1
  • eb_lc_lesion_area_1
  • v2_lc_ejex_lesion_mm_1
  • v2_lc_lesion_area_1
  • v2_lc_ejex_ulcera_mm_1
  • v2_lc_ejey_ulcera_mm_1
  • v2_lc_ulcera_area_1
  • v3_lc_ejex_lesion_mm_1
  • v3_lc_ejey_lesion_mm_1
  • v3_lc_lesion_area_1
  • v3_lc_ejex_ulcera_mm_1
  • v3_lc_ejey_ulcera_mm_1
  • v3_lc_ulcera_area_1
  • eb_lc_tto_mcto_prescrito
  • eb_lc_tto_mcto_glucan_dosis
  • v3_lc_num_amp_cap_tto
  • adherencia_tto

5.1.3 Consider lcRNA (currently unused)

Split this data into CDS and lncRNA. Oh crap in order to do that I need to recount the data. Running now (20210518)

## lnc_expt <- create_expt(samplesheet,
##                         file_column="hg38100lncfile",
##                         gene_info=hs_annot)

5.1.3.1 Initial metrics

Once the data was loaded, there are a couple of metrics which may be plotted immediately.

nonzero <- plot_nonzero(hs_expt)
nonzero$plot
## Warning: ggrepel: 182 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

ncrna_lost_df <- as.data.frame(pData(hs_expt)[["ncrna_lost"]])
rownames(ncrna_lost_df) <- rownames(pData(hs_expt))
colnames(ncrna_lost_df) <- "ncrna_lost"

tmpdf <- merge(nonzero$table, ncrna_lost_df, by="row.names")
rownames(tmpdf) <- tmpdf[["Row.names"]]
tmpdf[["Row.names"]] <- NULL

ggplot(tmpdf, aes(x=ncrna_lost, y=nonzero_genes)) +
  ggplot2::geom_point() +
  ggplot2::ggtitle("Nonzero genes with respect to percent counts
lost when ncRNA was removed.")

Najib doesn’t want this plot, but I am using it to check new samples, so will hide it from general use.

libsize <- plot_libsize(hs_expt)
libsize$plot

5.2 Minimum coverage sample filtering

I arbitrarily chose 11,000 non-zero genes as a minimum. We may want this to be higher.

hs_valid <- subset_expt(hs_expt, nonzero=11000)
## The samples (and read coverage) removed when filtering 11000 non-zero genes are:
## TMRC30010 TMRC30050 TMRC30052 
##     52471    808149   3087347
## subset_expt(): There were 205, now there are 202 samples.
## valid_write <- sm(write_expt(hs_valid, excel=glue("excel/hs_valid-v{ver}-{rundate}.xlsx")))

5.3 Upload ‘raw’ CPM

The cpm directory in the shared tree has a place to drop the counts for the full dataset along with some subsets by cell type and time. The following block should make that process rather easier.

all_cpm <- hs_valid %>%
  normalize_expt(hs_valid, filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=file=glue::glue("upload/CPM/all_cpm-v{ver}-d{rundate}.csv"))

biopsy_cpm <- hs_valid %>%
  subset_expt(subset="typeofcells=='biopsy'") %>%
  normalize_expt(filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=glue::glue("upload/CPM/Biopsies/biopsy_cpm-v{ver}-d{rundate}.csv"))

eosinophils_cpm <- hs_valid %>%
  subset_expt(subset="typeofcells=='eosinophils'") %>%
  normalize_expt(filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=glue::glue("upload/CPM/Eosinophils/eosinophil_cpm-v{ver}-d{rundate}.csv"))

monocyte_cpm <- hs_valid %>%
  subset_expt(subset="typeofcells=='monocytes'") %>%
  normalize_expt(filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=glue::glue("upload/CPM/Monocytes/monocyte_cpm-v{ver}-d{rundate}.csv"))

neutrophil_cpm <- hs_valid %>%
  subset_expt(subset="typeofcells=='neutrophils'") %>%
  normalize_expt(filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=glue::glue("upload/CPM/Neutrophils/neutrophil_cpm-v{ver}-d{rundate}.csv"))

times <- c("1", "2", "3")
for (time in times) {
  time_subset <- paste0("visitnumber=='", time, "'")
  filename <- glue::glue("upload/CPM/visit{time}-v{ver}-d{rundate}.csv")
  written <- hs_valid %>%
    subset_expt(subset=time_subset) %>%
    exprs() %>%
    write.csv(file=filename)
}

types <- c("neutrophils", "monocytes", "eosinophils", "biopsy")
for (type in types) {
  type_subset <- paste0("typeofcells=='", type, "'")
  for (time in times) {
    time_subset <- paste0("visitnumber=='", time, "'")
    filename <- glue::glue("upload/CPM/{type}_visit{time}-v{ver}-d{rundate}.csv")
    written <- hs_valid %>%
      subset_expt(subset=time_subset) %>%
      subset_expt(subset=type_subset) %>%
      exprs() %>%
      write.csv(file=filename)
  }
}
## Error: <text>:4:22: unexpected '='
## 3:   exprs() %>%
## 4:   write.csv(file=file=
##                         ^

6 Project Aims

The project seeks to determine the relationship of the innate immune response and inflammatory signaling to the clinical outcome of antileishmanial drug treatment. We will test the hypothesis that the profile of innate immune cell activation and their dynamics through the course of treatment differ between CL patients with prospectively determined therapeutic cure or failure.

This will be achieved through the characterization of the in vivo dynamics of blood-derived monocyte, neutrophil and eosinophil transcriptome before, during and at the end of treatment in CL patients. Cell-type specific transcriptomes, composite signatures and time-response expression profiles will be contrasted among patients with therapeutic cure or failure.

6.1 Preparation

To address these, I added to the end of the sample sheet columns named ‘condition’, ‘batch’, ‘donor’, and ‘time’. These are filled in with shorthand values according to the above.

6.2 Global view

Before addressing the questions explicitly by subsetting the data, I want to get a look at the samples as they are.

new_names <- pData(hs_valid)[["samplename"]]
hs_valid <- hs_valid %>%
  set_expt_batches(fact="cellssource") %>%
  set_expt_conditions(fact="typeofcells") %>%
  set_expt_samplenames(newnames=new_names)

all_norm <- sm(normalize_expt(hs_valid, transform="log2", norm="quant",
                              convert="cpm", filter=TRUE))

all_pca <- plot_pca(all_norm, plot_labels=FALSE,
                    plot_title="PCA - Cell type", size_column="visitnumber")
pp(file=glue("images/tmrc3_pca_nolabels-v{ver}.png"), image=all_pca$plot)

write.csv(all_pca$table, file="coords/hs_donor_pca_coords.csv")
plot_corheat(all_norm, plot_title="Heirarchical clustering:
         cell types")$plot

6.3 Examine samples relevant to clinical outcome

Now let us consider only the samples for which we have a clinical outcome. These fall primarily into either ‘cured’ or ‘failed’, but some people have not yet returned to the clinic after the first or second visit. These are deemed ‘lost’.

hs_clinical <- hs_valid %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="typeofcells") %>%
  subset_expt(subset="typeofcells!='pbmcs'&typeofcells!='macrophages'")
## subset_expt(): There were 202, now there are 182 samples.
chosen_colors <- c("#D95F02", "#7570B3", "#1B9E77", "#FF0000", "#FF0000")
names(chosen_colors) <- c("cure", "failure", "lost", "null", "notapplicable")
hs_clinical <- set_expt_colors(hs_clinical, colors=chosen_colors)
## Warning in set_expt_colors(hs_clinical, colors = chosen_colors): Colors for the
## following categories are not being used: null.
newnames <- make.names(pData(hs_clinical)[["samplename"]], unique=TRUE)
hs_clinical <- set_expt_samplenames(hs_clinical, newnames=newnames)

hs_clinical_norm <- sm(normalize_expt(hs_clinical, filter=TRUE, transform="log2",
                                      convert="cpm", norm="quant"))
clinical_pca <- plot_pca(hs_clinical_norm, plot_labels=FALSE,
                         size_column="visitnumber", cis=NULL,
                         plot_title="PCA - clinical samples")
pp(file=glue("images/all_clinical_nobatch_pca-v{ver}.png"), image=clinical_pca$plot,
   height=8, width=20)

6.3.1 Repeat without the biopsy samples

hs_clinical_nobiop <- hs_clinical %>%
  subset_expt(subset="typeofcells!='biopsy'") %>%
  subset_expt(subset="condition=='lost'|condition=='cure'|condition=='failure'")
## subset_expt(): There were 182, now there are 130 samples.
## subset_expt(): There were 130, now there are 128 samples.
hs_clinical_nobiop_norm <- sm(normalize_expt(hs_clinical_nobiop, filter=TRUE, transform="log2",
                                             convert="cpm", norm="quant"))
clinical_nobiop_pca <- plot_pca(hs_clinical_nobiop_norm, plot_labels=FALSE, cis=NULL,
                                plot_title="PCA - clinical samples without biopsies")
pp(file=glue("images/all_clinical_nobiop_nobatch_pca-v{ver}.png"),
   image=clinical_nobiop_pca$plot)

6.3.2 Attempt to correct for the surrogate variables

At this time we have two primary data structures of interest: hs_clinical and hs_clinical_nobiop

hs_clinical_nb <- normalize_expt(hs_clinical, filter=TRUE, batch="svaseq",
                                 transform="log2", convert="cpm")
## Removing 5216 low-count genes (14725 remaining).
## batch_counts: Before batch/surrogate estimation, 198077 entries are x==0: 7%.
## batch_counts: Before batch/surrogate estimation, 528392 entries are 0<x<1: 20%.
## Setting 27938 low elements to zero.
## transform_counts: Found 27938 values equal to 0, adding 1 to the matrix.
clinical_batch_pca <- plot_pca(hs_clinical_nb, plot_labels=FALSE, cis=NULL,
                               size_column="visitnumber", plot_title="PCA - clinical samples")
clinical_batch_pca$plot

hs_clinical_nobiop_nb <- sm(normalize_expt(hs_clinical_nobiop, filter=TRUE, batch="svaseq",
                                           transform="log2", convert="cpm"))
clinical_nobiop_batch_pca <- plot_pca(hs_clinical_nobiop_nb,
                                      plot_title="PCA - clinical samples without biopsies",
                                      plot_labels=FALSE)
pp(file="images/clinical_batch.png", image=clinical_nobiop_batch_pca$plot)

test <- plot_pca(hs_clinical_nobiop_nb, size_column="visitnumber",
                 plot_title="PCA - clinical samples without biopsies",
                 plot_labels=FALSE)
test$plot

clinical_nobiop_batch_tsne <- plot_tsne(hs_clinical_nobiop_nb,
                                        plot_title="tSNE - clinical samples without biopsies",
                                        plot_labels=FALSE)
clinical_nobiop_batch_tsne$plot

6.4 Perform DE of the clinical samples cure vs. fail

individual_celltypes <- subset_expt(hs_clinical_nobiop, subset="condition!='lost'")
## subset_expt(): There were 128, now there are 115 samples.
hs_clinic_de <- sm(all_pairwise(individual_celltypes, model_batch="svaseq", filter=TRUE))
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
hs_clinic_table <- sm(combine_de_tables(
    hs_clinic_de,
    excel=glue::glue("excel/individual_celltypes_table-v{ver}.xlsx")))
## Error in combine_de_tables(hs_clinic_de, excel = glue::glue("excel/individual_celltypes_table-v{ver}.xlsx")): object 'hs_clinic_de' not found
hs_clinic_sig <- sm(extract_significant_genes(
    hs_clinic_table,
    excel=glue::glue("excel/individual_celltypes_sig-v{ver}.xlsx")))
## Error in extract_significant_genes(hs_clinic_table, excel = glue::glue("excel/individual_celltypes_sig-v{ver}.xlsx")): object 'hs_clinic_table' not found
hs_clinic_sig[["summary_df"]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_sig' not found
hs_clinic_de[["comparison"]][["heat"]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_de' not found

6.5 Random aside for my own education

I was just doing some reading about concordance statistics. I want to try something with them.

test_df <- hs_clinic_table[["data"]][["failure_vs_cure"]][, c("deseq_logfc", "edger_logfc")]
## Error in eval(expr, envir, enclos): object 'hs_clinic_table' not found
test_lm <- glm(deseq_logfc ~ edger_logfc, data=test_df)
## Error in is.data.frame(data): object 'test_df' not found
con <- survival::concordance(test_lm)
## Error in survival::concordance(test_lm): object 'test_lm' not found
con
## Error in eval(expr, envir, enclos): object 'con' not found

To calculate the first of these metrics, the area under the concordance curve (AUCC), we ranked genes in both the single-cell and bulk datasets in descending order by the statistical significance of their differential expression. Then, we created lists of the top-ranked genes in each dataset of matching size, up to some maximum size k. For each of these lists (that is, for the top-1 genes, top-2 genes, top-3 genes, and so on), we computed the size of the intersection between the single-cell and bulk DE genes. This procedure yielded a curve relating the number of shared genes between datasets to the number of top-ranked genes considered. The area under this curve was computed by summing the size of all intersections, and normalized to the range [0, 1] by dividing it by its maximum possible value, k × ( k +1) / 2. To evaluate the concordance of DE analysis, we used k =500 except where otherwise noted, but found our results were insensitive to the precise value of k. To compute the second metric, the transcriptome-wide rank correlation, we multiplied the absolute value of the test statistic for each gene by the sign of its log-fold change between conditions, and then computed the Spearman correlation over genes between them.

calculate_aucc <- function(tbl, px="deseq_adjp", py="edger_adjp",
                           lx="deseq_logfc", ly="edger_logfc",
                           topn=0.1) {
  ## If the topn argument is an integer, the just ask for that number.
  ## If it is a floating point 0<x<1, then set topn to that proportion
  ## of the number of genes.
  if (topn <= 0) {
    stop("topn need to be either a float from 0-1 or an interger bigger than 100.")
  } else if (topn > 1 & topn <= 100) {
    stop("topn need to be either a float from 0-1 or an interger bigger than 100.")
  } else if (topn < 1) {
    topn <- ceiling(nrow(tbl) * topn)
  }

  x_df <- tbl[, c(px, lx)]
  y_df <- tbl[, c(py, ly)]
  ## curve (AUCC), we ranked genes in both the single-cell and bulk datasets in
  ## descending order by the statistical significance of their differential expression.
  x_idx <- order(x_df[[1]], decreasing=FALSE)
  x_df <- x_df[x_idx, ]
  y_idx <- order(y_df[[1]], decreasing=FALSE)
  y_df <- y_df[y_idx, ]

  ## Then, we created lists of the top-ranked genes in each dataset of
  ## matching size, up to some maximum size k. For each of these lists
  ## (that is, for the top-1 genes, top-2 genes, top-3 genes, and so
  ## on), we computed the size of the intersection between the
  ## single-cell and bulk DE genes. This procedure yielded a curve
  ## relating the number of shared genes between datasets to the
  ## number of top-ranked genes considered.

  intersections <- rep(0, topn)
  for (i in 1:topn) {
    if (i == 1) {
      x_intersections[i] <- rownames(x_df)[i] == rownames(y_df)[i]
    } else {
      x_set <- rownames(x_df)[1:i]
      y_set <- rownames(y_df)[1:i]
      intersections[i] <- sum(x_set %in% y_set)
    }
  }

  ## The area under this curve was computed by summing the size of all
  ## intersections, and normalized to the range [0, 1] by dividing it
  ## by its maximum possible value, k × ( k +1) / 2. To evaluate the
  ## concordance of DE analysis, we used k =500 except where otherwise
  ## noted, but found our results were insensitive to the
  sumint <- sum(intersections)
  norm <- (topn * (topn + 1)) /2
  aucc <- sumint / norm
  return(aucc)
}

6.5.1 Perform LRT with the clinical samples

I am not sure if we have enough samples across the three visit to completely work as well as we would like, but there is only 1 way to find out! Now that I think about it, one thing which might be awesome is to use cell type as an interacting factor…

6.5.1.1 With biopsy samples

I figure this might be a place where the biopsy samples might prove useful.

clinical_nolost <- subset_expt(hs_clinical, subset="condition!='lost'&condition!='notapplicable'")
## subset_expt(): There were 182, now there are 165 samples.
## Currently (202109), this is not returning any significant hits.
## In previous iterations it did.
lrt_visit_clinical_test <- sm(deseq_lrt(clinical_nolost, transform="vst",
                                        interactor_column="visitnumber",
                                        interest_column="clinicaloutcome"))
summary(lrt_visit_clinical_test[["deseq_table"]])
##      gene              baseMean      log2FoldChange     lfcSE     
##  Length:19941       Min.   :     0   Min.   :-4.5   Min.   :0.0   
##  Class :character   1st Qu.:     6   1st Qu.:-0.1   1st Qu.:0.4   
##  Mode  :character   Median :   186   Median : 0.1   Median :0.6   
##                     Mean   :  1771   Mean   : 0.1   Mean   :1.0   
##                     3rd Qu.:  1107   3rd Qu.: 0.3   3rd Qu.:1.0   
##                     Max.   :355344   Max.   : 8.0   Max.   :7.0   
##                                      NA's   :1624   NA's   :1624  
##       stat          pvalue          padj     
##  Min.   :-1.6   Min.   :0.0    Min.   :0.4   
##  1st Qu.: 0.1   1st Qu.:0.6    1st Qu.:1.0   
##  Median : 0.4   Median :0.8    Median :1.0   
##  Mean   : 0.7   Mean   :0.8    Mean   :1.0   
##  3rd Qu.: 0.9   3rd Qu.:0.9    3rd Qu.:1.0   
##  Max.   :21.6   Max.   :1.0    Max.   :1.0   
##  NA's   :1624   NA's   :1624   NA's   :1625
written <- write_xlsx(data=as.data.frame(lrt_visit_clinical_test[["deseq_table"]]),
                      excel=glue::glue("excel/lrt_clinical_visit-v{ver}.xlsx"))

lrt_celltype_clinical_test <- sm(deseq_lrt(clinical_nolost, transform="vst",
                                           interactor_column="typeofcells",
                                           interest_column="clinicaloutcome"))
## Going to attempt to install: Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
##   there is no package called 'Cairo'
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## '?repositories' for details
## 
## replacement repositories:
##     CRAN: http://cran.r-project.org
## Bioconductor version 3.13 (BiocManager 1.30.16), R 4.1.0 (2021-05-18)
## Installing package(s) 'Error in loadNamespace(j <- i[[1L]], c(lib.loc,
##   .libPaths()), versionCheck = vI[[j]]) : there is no package called 'Cairo''
## Warning in .inet_warning(msg): package 'Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
##   there is no package called 'Cairo'' is not available for Bioconductor version '3.13'
## 
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
## Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]): there is no package called 'Cairo'
summary(lrt_celltype_clinical_test[["favorite_genes"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': object 'lrt_celltype_clinical_test' not found
favorite_lrt_df <- merge(hs_annot, lrt_celltype_clinical_test[["favorite_genes"]], all.y=TRUE,
                     by="row.names")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'y' in selecting a method for function 'merge': object 'lrt_celltype_clinical_test' not found
written <- write_xlsx(data=favorite_lrt_df,
                      excel=glue::glue("excel/lrt_clinical_celltype_favorites-v{ver}.xlsx"))
## Error in write_xlsx(data = favorite_lrt_df, excel = glue::glue("excel/lrt_clinical_celltype_favorites-v{ver}.xlsx")): object 'favorite_lrt_df' not found
deseq_lrt_df <- merge(hs_annot, as.data.frame(lrt_celltype_clinical_test[["deseq_table"]]), all.y=TRUE,
                      by="row.names")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'y' in selecting a method for function 'merge': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'lrt_celltype_clinical_test' not found
rownames(deseq_lrt_df) <- deseq_lrt_df[["Row.names"]]
## Error in eval(expr, envir, enclos): object 'deseq_lrt_df' not found
deseq_lrt_df[["Row.names"]] <- NULL
## Error in deseq_lrt_df[["Row.names"]] <- NULL: object 'deseq_lrt_df' not found
written <- write_xlsx(data=deseq_lrt_df,
                      excel=glue::glue("excel/lrt_clinical_celltype-v{ver}.xlsx"))
## Error in write_xlsx(data = deseq_lrt_df, excel = glue::glue("excel/lrt_clinical_celltype-v{ver}.xlsx")): object 'deseq_lrt_df' not found
lrt_celltype_clinical_test$cluster_data$plot
## Error in eval(expr, envir, enclos): object 'lrt_celltype_clinical_test' not found
cluster_23_genes_idx <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][["cluster"]] == 23
## Error in eval(expr, envir, enclos): object 'lrt_celltype_clinical_test' not found
cluster_23_genes <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][cluster_23_genes_idx, ]
## Error in eval(expr, envir, enclos): object 'lrt_celltype_clinical_test' not found
cluster_23_genes
## Error in eval(expr, envir, enclos): object 'cluster_23_genes' not found
cluster_24_genes_idx <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][["cluster"]] == 24
## Error in eval(expr, envir, enclos): object 'lrt_celltype_clinical_test' not found
cluster_24_genes <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][cluster_24_genes_idx, ]
## Error in eval(expr, envir, enclos): object 'lrt_celltype_clinical_test' not found
cluster_24_genes
## Error in eval(expr, envir, enclos): object 'cluster_24_genes' not found
cluster_22_genes_idx <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][["cluster"]] == 22
## Error in eval(expr, envir, enclos): object 'lrt_celltype_clinical_test' not found
cluster_22_genes <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][cluster_22_genes_idx, ]
## Error in eval(expr, envir, enclos): object 'lrt_celltype_clinical_test' not found
cluster_22_genes
## Error in eval(expr, envir, enclos): object 'cluster_22_genes' not found

6.5.2 Look at only the differential genes

A good suggestion from Theresa was to examine only the most variant genes from failure vs. cure and see how they change the clustering/etc results. This is my attempt to address this query.

hs_clinic_topn <- sm(extract_significant_genes(hs_clinic_table, n=100))
## Error in extract_significant_genes(hs_clinic_table, n = 100): object 'hs_clinic_table' not found
table <- "failure_vs_cure"
wanted <- rbind(hs_clinic_topn[["deseq"]][["ups"]][[table]],
                hs_clinic_topn[["deseq"]][["downs"]][[table]])
## Error in eval(quote(list(...)), env): object 'hs_clinic_topn' not found
small_expt <- exclude_genes_expt(hs_clinical_nobiop, ids=rownames(wanted), method="keep")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'wanted' not found
small_norm <- sm(normalize_expt(small_expt, transform="log2", convert="cpm",
                                norm="quant", filter=TRUE))
## Error in normalize_expt(small_expt, transform = "log2", convert = "cpm", : object 'small_expt' not found
plot_pca(small_norm)$plot
## Error in plot_pca(small_norm): object 'small_norm' not found
small_nb <- normalize_expt(small_expt, transform="log2", convert="cpm",
                           batch="svaseq", norm="quant", filter=TRUE)
## Error in normalize_expt(small_expt, transform = "log2", convert = "cpm", : object 'small_expt' not found
plot_pca(small_nb)$plot
## Error in plot_pca(small_nb): object 'small_nb' not found
## DESeq2 MA plot of failure / cure
hs_clinic_table[["plots"]][["failure_vs_cure"]][["deseq_ma_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'hs_clinic_table' not found
## DESeq2 Volcano plot of failure / cure
hs_clinic_table[["plots"]][["failure_vs_cure"]][["deseq_vol_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'hs_clinic_table' not found

6.5.3 g:Profiler results using the significant up and down genes

ups <- hs_clinic_sig[["deseq"]][["ups"]][[1]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_sig' not found
downs <- hs_clinic_sig[["deseq"]][["downs"]][[1]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_sig' not found
hs_clinic_gprofiler_ups <- simple_gprofiler(ups)
## Error in simple_gprofiler(ups): object 'ups' not found
hs_clinic_gprofiler_ups[["pvalue_plots"]][["bpp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_gprofiler_ups' not found
hs_clinic_gprofiler_ups[["pvalue_plots"]][["mfp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_gprofiler_ups' not found
hs_clinic_gprofiler_ups[["pvalue_plots"]][["reactome_plot_over"]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_gprofiler_ups' not found
##hs_try2 <- simple_gprofiler2(ups)

hs_clinic_gprofiler_downs <- simple_gprofiler(downs)
## Error in simple_gprofiler(downs): object 'downs' not found
hs_clinic_gprofiler_downs[["pvalue_plots"]][["bpp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_gprofiler_downs' not found
hs_clinic_gprofiler_downs[["pvalue_plots"]][["mfp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_gprofiler_downs' not found
hs_clinic_gprofiler_downs[["pvalue_plots"]][["reactome_plot_over"]]
## Error in eval(expr, envir, enclos): object 'hs_clinic_gprofiler_downs' not found

6.6 Perform GSVA on the clinical samples

hs_celltype_gsva_c2 <- sm(simple_gsva(individual_celltypes))
hs_celltype_gsva_c2_sig <- sm(get_sig_gsva_categories(
    hs_celltype_gsva_c2,
    excel="excel/individual_celltypes_gsva_c2.xlsx"))

## broad_c7 <- GSEABase::getGmt("reference/msigdb/c7.all.v7.2.entrez.gmt",
##                              collectionType=GSEABase::BroadCollection(category="c7"),
##                              geneIdType=GSEABase::EntrezIdentifier())
## hs_celltype_gsva_c7 <- sm(simple_gsva(individual_celltypes, signatures=broad_c7,
##                                       msig_xml="reference/msigdb_v7.2.xml", cores=10))
hs_celltype_gsva_c7 <- simple_gsva(individual_celltypes,
                                   signatures="reference/msigdb/c7.all.v7.2.entrez.gmt",
                                   signature_category="c7",
                                   msig_xml="reference/msigdb_v7.2.xml",
                                   cores=10)
## Converting the rownames() of the expressionset to ENTREZID.
## 583 ENSEMBL ID's didn't have a matching ENTEREZ ID. Dropping them now.
## Before conversion, the expressionset has 19941 entries.
## After conversion, the expressionset has 19519 entries.
## Adding annotations from reference/msigdb_v7.2.xml.
## Warning: `xml_nodes()` was deprecated in rvest 1.0.0.
## Please use `html_elements()` instead.
hs_celltype_gsva_c7_sig <- sm(get_sig_gsva_categories(
    hs_celltype_gsva_c7,
    excel="excel/individual_celltypes_gsva_c7.xlsx"))

7 Some queries from 20210816

Upon returning from Maine, I sat with Najib, Theresa, and Maria Adelaida. A whole host of potential ideas were considered; here are a couple of them:

7.1 Visualize variance in the TMRC3 data

My TODO text says: "Variance partition of TMRC3 samples with at least time, cell-type, and cure/fail. I mixed this idea with the visit 1 previously.

vp_factors <- c("condition", "batch", "visitnumber", "eb_lc_ulcera_area_1")

v1_vp <- simple_varpart(hs_clinical, factors=vp_factors, do_fit=TRUE)
## Error in checkModelStatus(fit, showWarnings = showWarnings, colinearityCutoff = colinearityCutoff) : 
##   The variables specified in this model are redundant,
## so the design matrix is not full rank
## Retrying with only condition in the model.
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## 
## Total:116 s
## 
## Total:52 s
pp(file="images/v1_vp.png", image=v1_vp$partition_plot)

7.2 Visit 1 comparisons

Here is the note from my TODO list: “Using v1, perform cell-type comparisons/signatures as well as a cure/fail differential expression”.

Implicit in this query is the idea that we should do an explicit visualization of the visit 1 samples without the subsequent visits.

## For the moment, let us ignore the lost/NA samples
v1_expt <- subset_expt(hs_clinical, subset="visitnumber=='1'") %>%
  subset_expt(subset="condition=='cure'|condition=='failure'") %>%
  subset_expt(subset="batch!='biopsy'")
## subset_expt(): There were 182, now there are 76 samples.
## subset_expt(): There were 76, now there are 68 samples.
## subset_expt(): There were 68, now there are 42 samples.
v1_norm <- normalize_expt(v1_expt, filter=TRUE, transform="log2", convert="cpm",
                          batch="svaseq")
## Removing 8361 low-count genes (11580 remaining).
## batch_counts: Before batch/surrogate estimation, 8014 entries are x==0: 2%.
## batch_counts: Before batch/surrogate estimation, 75129 entries are 0<x<1: 15%.
## Setting 2645 low elements to zero.
## transform_counts: Found 2645 values equal to 0, adding 1 to the matrix.
plot_pca(v1_norm)$plot

v1_de <- all_pairwise(v1_expt, filter=TRUE, model_batch="svaseq")
## batch_counts: Before batch/surrogate estimation, 8014 entries are x==0: 2%.
## Plotting a PCA before surrogate/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Removing 0 low-count genes (11580 remaining).
## batch_counts: Before batch/surrogate estimation, 8014 entries are x==0: 2%.
## batch_counts: Before batch/surrogate estimation, 75129 entries are 0<x<1: 15%.
## Setting 2645 low elements to zero.
## transform_counts: Found 2645 values equal to 0, adding 1 to the matrix.
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
v1_table <- combine_de_tables(v1_de, excel=glue::glue("excel/v1_tables-v{ver}.xlsx"))
## Error in combine_de_tables(v1_de, excel = glue::glue("excel/v1_tables-v{ver}.xlsx")): object 'v1_de' not found

7.2.1 Visit 1 variance partition

It might be interesting to add ‘parasitemappingrate’ here but I have not filled it in for all samples yet.

samplecollectiondate should be useful too but is not complete.

This is also a great place to pull in some of the clinical information. I need to ask Maria Adelaida for it…

v1_vp <- simple_varpart(v1_expt, do_fit=TRUE)
## 
## Total:107 s
## 
## Total:56 s
v1_vp$partition_plot

## condition is cure/fail batch is cell type.

7.2.2 Visit 1 separated by cell type

v1_eo <- subset_expt(v1_expt, subset="batch=='eosinophils'")
## subset_expt(): There were 42, now there are 10 samples.
v1_mono <- subset_expt(v1_expt, subset="batch=='monocytes'")
## subset_expt(): There were 42, now there are 16 samples.
v1_neut <- subset_expt(v1_expt, subset="batch=='neutrophils'")
## subset_expt(): There were 42, now there are 16 samples.
v1_eo_norm <- normalize_expt(v1_eo, transform="log2", convert="cpm",
                             filter=TRUE, norm="quant")
## Removing 9789 low-count genes (10152 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
plot_pca(v1_eo_norm)$plot

v1_eo_nb <- normalize_expt(v1_eo, transform="log2", convert="cpm",
                           filter=TRUE, batch="svaseq")
## Removing 9789 low-count genes (10152 remaining).
## batch_counts: Before batch/surrogate estimation, 93 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 3050 entries are 0<x<1: 3%.
## Setting 106 low elements to zero.
## transform_counts: Found 106 values equal to 0, adding 1 to the matrix.
plot_pca(v1_eo_nb)$plot

v1_mono_norm <- normalize_expt(v1_mono, transform="log2", convert="cpm",
                               filter=TRUE, norm="quant")
## Removing 9329 low-count genes (10612 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
plot_pca(v1_mono_norm)$plot

v1_mono_nb <- normalize_expt(v1_mono, transform="log2", convert="cpm",
                             filter=TRUE, batch="svaseq")
## Removing 9329 low-count genes (10612 remaining).
## batch_counts: Before batch/surrogate estimation, 226 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 6672 entries are 0<x<1: 4%.
## Setting 205 low elements to zero.
## transform_counts: Found 205 values equal to 0, adding 1 to the matrix.
plot_pca(v1_mono_nb)$plot

v1_neut_norm <- normalize_expt(v1_neut, transform="log2", convert="cpm",
                               filter=TRUE, norm="quant")
## Removing 11186 low-count genes (8755 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
plot_pca(v1_neut_norm)$plot

v1_neut_nb <- normalize_expt(v1_neut, transform="log2", convert="cpm",
                             filter=TRUE, batch="svaseq")
## Removing 11186 low-count genes (8755 remaining).
## batch_counts: Before batch/surrogate estimation, 138 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 6557 entries are 0<x<1: 5%.
## Setting 167 low elements to zero.
## transform_counts: Found 167 values equal to 0, adding 1 to the matrix.
plot_pca(v1_neut_nb)$plot

8 Compare visits

Compare visits independent of cure/fail

time_expt <- hs_clinical %>%
  set_expt_conditions(fact="visitnumber")
pData(time_expt)[["condition"]] <- paste0("v", pData(time_expt)[["condition"]])

time_norm <- normalize_expt(time_expt, filter=TRUE, norm="quant", convert="cpm",
                            transform="log2")
## Removing 5216 low-count genes (14725 remaining).
## transform_counts: Found 107 values equal to 0, adding 1 to the matrix.
time_nb <- normalize_expt(time_expt, filter=TRUE, batch="svaseq", convert="cpm",
                          transform="log2")
## Removing 5216 low-count genes (14725 remaining).
## batch_counts: Before batch/surrogate estimation, 198077 entries are x==0: 7%.
## batch_counts: Before batch/surrogate estimation, 528392 entries are 0<x<1: 20%.
## Setting 33769 low elements to zero.
## transform_counts: Found 33769 values equal to 0, adding 1 to the matrix.
plot_pca(time_nb)$plot
## plot labels was not set and there are more than 100 samples, disabling it.

time_de <- all_pairwise(time_expt, filter=TRUE, model_batch="svaseq")
## batch_counts: Before batch/surrogate estimation, 198077 entries are x==0: 7%.
## Plotting a PCA before surrogate/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Removing 0 low-count genes (14725 remaining).
## batch_counts: Before batch/surrogate estimation, 198077 entries are x==0: 7%.
## batch_counts: Before batch/surrogate estimation, 528392 entries are 0<x<1: 20%.
## Setting 33769 low elements to zero.
## transform_counts: Found 33769 values equal to 0, adding 1 to the matrix.
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
keepers <- list(
    "v2v1"=c("v2", "v1"),
    "v3v1"=c("v3", "v1"),
    "v3v2"=c("v3", "v2"))
time_tables <- combine_de_tables(time_de, keepers=keepers,
                                 excel=glue::glue("excel/compare_visits-v{ver}.xlsx"))
## Error in combine_de_tables(time_de, keepers = keepers, excel = glue::glue("excel/compare_visits-v{ver}.xlsx")): object 'time_de' not found

8.1 Separate visits by cell type

Now let us consider the 3 visits from the lens of cell type.

time_type_factor <- paste0(pData(time_expt)[["condition"]], "_", pData(time_expt)[["batch"]])
time_type_expt <- set_expt_conditions(time_expt, fact=time_type_factor)

time_type_biopsy <- subset_expt(time_type_expt, subset="batch=='biopsy'")
## subset_expt(): There were 182, now there are 52 samples.
time_type_neutrophil <- subset_expt(time_type_expt, subset="batch=='neutrophils'")
## subset_expt(): There were 182, now there are 50 samples.
time_type_eosinophil <- subset_expt(time_type_expt, subset="batch=='eosinophils'")
## subset_expt(): There were 182, now there are 30 samples.
time_type_monocyte <- subset_expt(time_type_expt, subset="batch=='monocytes'")
## subset_expt(): There were 182, now there are 50 samples.

8.1.1 Biopsy

time_biopsy <- normalize_expt(time_type_biopsy, filter=TRUE, norm="quant", convert="cpm",
                              transform="log2")
## Removing 5738 low-count genes (14203 remaining).
## transform_counts: Found 16 values equal to 0, adding 1 to the matrix.
time_biopsy_nb <- normalize_expt(time_type_biopsy, filter=TRUE, batch="svaseq", convert="cpm",
                                 transform="log2")
## Removing 5738 low-count genes (14203 remaining).
## batch_counts: Before batch/surrogate estimation, 5504 entries are x==0: 1%.
## batch_counts: Before batch/surrogate estimation, 44637 entries are 0<x<1: 6%.
## Setting 1749 low elements to zero.
## transform_counts: Found 1749 values equal to 0, adding 1 to the matrix.
plot_pca(time_biopsy)$plot

plot_pca(time_biopsy_nb)$plot

time_biopsy_de <- all_pairwise(time_type_biopsy, model_batch="svaseq", filter=TRUE)
## batch_counts: Before batch/surrogate estimation, 5504 entries are x==0: 1%.
## Plotting a PCA before surrogate/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Removing 0 low-count genes (14203 remaining).
## batch_counts: Before batch/surrogate estimation, 5504 entries are x==0: 1%.
## batch_counts: Before batch/surrogate estimation, 44637 entries are 0<x<1: 6%.
## Setting 1749 low elements to zero.
## transform_counts: Found 1749 values equal to 0, adding 1 to the matrix.
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
keepers <- list(
    "v2v1"=c("v2_biopsy", "v1_biopsy"),
    "v3v1"=c("v3_biopsy", "v1_biopsy"),
    "v3v2"=c("v3_biopsy", "v2_biopsy"))
time_biopsy_tables <- combine_de_tables(
    time_biopsy_de, keepers=keepers,
    excel=glue::glue("excel/time_biopsy_tables-v{ver}.xlsx"))
## Error in combine_de_tables(time_biopsy_de, keepers = keepers, excel = glue::glue("excel/time_biopsy_tables-v{ver}.xlsx")): object 'time_biopsy_de' not found

8.1.2 Neutrophils

time_neutrophil <- normalize_expt(time_type_neutrophil, filter=TRUE, norm="quant", convert="cpm",
                              transform="log2")
## Removing 10599 low-count genes (9342 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
time_neutrophil_nb <- normalize_expt(time_type_neutrophil, filter=TRUE, batch="svaseq", convert="cpm",
                                 transform="log2")
## Removing 10599 low-count genes (9342 remaining).
## batch_counts: Before batch/surrogate estimation, 1669 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 42558 entries are 0<x<1: 9%.
## Setting 775 low elements to zero.
## transform_counts: Found 775 values equal to 0, adding 1 to the matrix.
plot_pca(time_neutrophil)$plot

plot_pca(time_neutrophil_nb)$plot

time_neutrophil_de <- all_pairwise(time_type_neutrophil, model_batch="svaseq", filter=TRUE)
## batch_counts: Before batch/surrogate estimation, 1669 entries are x==0: 0%.
## Plotting a PCA before surrogate/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Removing 0 low-count genes (9342 remaining).
## batch_counts: Before batch/surrogate estimation, 1669 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 42558 entries are 0<x<1: 9%.
## Setting 829 low elements to zero.
## transform_counts: Found 829 values equal to 0, adding 1 to the matrix.
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
keepers <- list(
    "v2v1"=c("v2_neutrophils", "v1_neutrophils"),
    "v3v1"=c("v3_neutrophils", "v1_neutrophils"),
    "v3v2"=c("v3_neutrophils", "v2_neutrophils"))
time_neutrophil_tables <- combine_de_tables(
    time_neutrophil_de, keepers=keepers,
    excel=glue::glue("excel/time_neutrophil_tables-v{ver}.xlsx"))
## Error in combine_de_tables(time_neutrophil_de, keepers = keepers, excel = glue::glue("excel/time_neutrophil_tables-v{ver}.xlsx")): object 'time_neutrophil_de' not found

8.1.3 Eosinophils

time_eosinophil <- normalize_expt(time_type_eosinophil, filter=TRUE, norm="quant", convert="cpm",
                              transform="log2")
## Removing 9190 low-count genes (10751 remaining).
## transform_counts: Found 5 values equal to 0, adding 1 to the matrix.
time_eosinophil_nb <- normalize_expt(time_type_eosinophil, filter=TRUE, batch="svaseq", convert="cpm",
                                 transform="log2")
## Removing 9190 low-count genes (10751 remaining).
## batch_counts: Before batch/surrogate estimation, 785 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 18837 entries are 0<x<1: 6%.
## Setting 428 low elements to zero.
## transform_counts: Found 428 values equal to 0, adding 1 to the matrix.
plot_pca(time_eosinophil)$plot

plot_pca(time_eosinophil_nb)$plot

time_eosinophil_de <- all_pairwise(time_type_eosinophil, model_batch="svaseq", filter=TRUE)
## batch_counts: Before batch/surrogate estimation, 785 entries are x==0: 0%.
## Plotting a PCA before surrogate/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Removing 0 low-count genes (10751 remaining).
## batch_counts: Before batch/surrogate estimation, 785 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 18837 entries are 0<x<1: 6%.
## Setting 428 low elements to zero.
## transform_counts: Found 428 values equal to 0, adding 1 to the matrix.
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
keepers <- list(
    "v2v1"=c("v2_eosinophils", "v1_eosinophils"),
    "v3v1"=c("v3_eosinophils", "v1_eosinophils"),
    "v3v2"=c("v3_eosinophils", "v2_eosinophils"))
time_eosinophil_tables <- combine_de_tables(
    time_eosinophil_de, keepers=keepers,
    excel=glue::glue("excel/time_eosinophil_tables-v{ver}.xlsx"))
## Error in combine_de_tables(time_eosinophil_de, keepers = keepers, excel = glue::glue("excel/time_eosinophil_tables-v{ver}.xlsx")): object 'time_eosinophil_de' not found

8.1.4 Monocyte

time_monocyte <- normalize_expt(time_type_monocyte, filter=TRUE, norm="quant", convert="cpm",
                              transform="log2")
## Removing 8776 low-count genes (11165 remaining).
## transform_counts: Found 11 values equal to 0, adding 1 to the matrix.
time_monocyte_nb <- normalize_expt(time_type_monocyte, filter=TRUE, batch="svaseq", convert="cpm",
                                 transform="log2")
## Removing 8776 low-count genes (11165 remaining).
## batch_counts: Before batch/surrogate estimation, 2380 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 36412 entries are 0<x<1: 7%.
## Setting 1029 low elements to zero.
## transform_counts: Found 1029 values equal to 0, adding 1 to the matrix.
plot_pca(time_monocyte)$plot

plot_pca(time_monocyte_nb)$plot

time_monocyte_de <- all_pairwise(time_type_monocyte, model_batch="svaseq", filter=TRUE)
## batch_counts: Before batch/surrogate estimation, 2380 entries are x==0: 0%.
## Plotting a PCA before surrogate/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Removing 0 low-count genes (11165 remaining).
## batch_counts: Before batch/surrogate estimation, 2380 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 36412 entries are 0<x<1: 7%.
## Setting 1029 low elements to zero.
## transform_counts: Found 1029 values equal to 0, adding 1 to the matrix.
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
keepers <- list(
    "v2v1"=c("v2_monocytes", "v1_monocytes"),
    "v3v1"=c("v3_monocytes", "v1_monocytes"),
    "v3v2"=c("v3_monocytes", "v2_monocytes"))
time_monocyte_tables <- combine_de_tables(
    time_monocyte_de, keepers=keepers,
    excel=glue::glue("excel/time_monocyte_tables-v{ver}.xlsx"))
## Error in combine_de_tables(time_monocyte_de, keepers = keepers, excel = glue::glue("excel/time_monocyte_tables-v{ver}.xlsx")): object 'time_monocyte_de' not found

8.2 Time and clinicaloutcome

time_cf_factor <- paste0(pData(time_type_biopsy)[["condition"]], "_", pData(time_type_biopsy)[["clinicaloutcome"]])
biopsy_time_cf <- set_expt_conditions(time_type_biopsy, fact=time_cf_factor)

time_cf_biopsy <- normalize_expt(biopsy_time_cf, filter=TRUE, norm="quant", convert="cpm",
                                 transform="log2")
## Removing 5738 low-count genes (14203 remaining).
## transform_counts: Found 16 values equal to 0, adding 1 to the matrix.
time_cf_biopsy_nb <- normalize_expt(biopsy_time_cf, filter=TRUE, batch="svaseq", convert="cpm",
                                    transform="log2")
## Removing 5738 low-count genes (14203 remaining).
## batch_counts: Before batch/surrogate estimation, 5504 entries are x==0: 1%.
## batch_counts: Before batch/surrogate estimation, 44637 entries are 0<x<1: 6%.
## Setting 1721 low elements to zero.
## transform_counts: Found 1721 values equal to 0, adding 1 to the matrix.
plot_pca(time_cf_biopsy)$plot

plot_pca(time_cf_biopsy_nb)$plot

biopsy_time_de <- all_pairwise(time_type_biopsy, filter=TRUE, model_batch="svaseq")
## batch_counts: Before batch/surrogate estimation, 5504 entries are x==0: 1%.
## Plotting a PCA before surrogate/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Removing 0 low-count genes (14203 remaining).
## batch_counts: Before batch/surrogate estimation, 5504 entries are x==0: 1%.
## batch_counts: Before batch/surrogate estimation, 44637 entries are 0<x<1: 6%.
## Setting 1749 low elements to zero.
## transform_counts: Found 1749 values equal to 0, adding 1 to the matrix.
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
biopsy_time_tables <- combine_de_tables(time_type_de, keepers=keepers,
                                        excel=glue::glue("excel/compare_biopsy_visits-v{ver}.xlsx"))
## Error in combine_de_tables(time_type_de, keepers = keepers, excel = glue::glue("excel/compare_biopsy_visits-v{ver}.xlsx")): object 'time_type_de' not found

9 Individual Cell types

The following blocks split the samples into a few groups by sample type and look at the distributions between them.

9.1 Implementation details

Get top/bottom n genes for each cell type, using clinical outcome as the factor of interest. For the moment, use sva for the DE analysis. Provide cpms for the top/bottom n genes.

Start with top/bottom 200. Perform default logFC and p-value as well.

9.1.1 Shared contrasts

Here is the contrast we will use throughput, I am leaving open the option to add more.

keepers <- list(
  "fail_vs_cure"=c("failure", "cure"))

9.2 Monocytes

9.2.1 Evaluate Monocyte samples

mono <- subset_expt(hs_valid, subset="typeofcells=='monocytes'") %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="visitnumber") %>%
  set_expt_colors(colors=chosen_colors)
## subset_expt(): There were 202, now there are 50 samples.
## Warning in set_expt_colors(., colors = chosen_colors): Colors for the following
## categories are not being used: null.
## FIXME set_expt_colors should speak up if there are mismatches here!!!

save_result <- save(mono, file="rda/monocyte_expt.rda")
mono_norm <- normalize_expt(mono, convert="cpm", filter=TRUE,
                            transform="log2", norm="quant")
## Removing 8776 low-count genes (11165 remaining).
## transform_counts: Found 11 values equal to 0, adding 1 to the matrix.
plt <- plot_pca(mono_norm, plot_labels=FALSE)$plot
pp(file=glue("images/mono_pca_normalized-v{ver}.pdf"), image=plt)

mono_nb <- normalize_expt(mono, convert="cpm", filter=TRUE,
                          transform="log2", batch="svaseq")
## Removing 8776 low-count genes (11165 remaining).
## batch_counts: Before batch/surrogate estimation, 2380 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 36412 entries are 0<x<1: 7%.
## Setting 1302 low elements to zero.
## transform_counts: Found 1302 values equal to 0, adding 1 to the matrix.
plt <- plot_pca(mono_nb, plot_labels=FALSE)$plot
pp(file=glue("images/mono_pca_normalized_batch-v{ver}.pdf"), image=plt)

9.2.2 DE of Monocyte samples

9.2.2.1 Without sva

mono_de <- sm(all_pairwise(mono, model_batch=FALSE, filter=TRUE))
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
mono_tables <- sm(combine_de_tables(
    mono_de, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_mono_vall_de_curevsfail_nobatch.xlsx")))
## Error in combine_de_tables(mono_de, keepers = keepers, excel = glue::glue("upload/DE_Cure_vs_Fail/{ver}_mono_vall_de_curevsfail_nobatch.xlsx")): object 'mono_de' not found
written <- write_xlsx(data=mono_tables[["data"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_table-v{ver}.xlsx"))
## Error in write_xlsx(data = mono_tables[["data"]][[1]], excel = glue::glue("excel/monocyte_clinical_table-v{ver}.xlsx")): object 'mono_tables' not found
mono_sig <- sm(extract_significant_genes(mono_tables, according_to="deseq"))
## Error in extract_significant_genes(mono_tables, according_to = "deseq"): object 'mono_tables' not found
written <- write_xlsx(data=mono_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_sigup-v{ver}.xlsx"))
## Error in write_xlsx(data = mono_sig[["deseq"]][["ups"]][[1]], excel = glue::glue("excel/monocyte_clinical_sigup-v{ver}.xlsx")): object 'mono_sig' not found
written <- write_xlsx(data=mono_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_sigdown-v{ver}.xlsx"))
## Error in write_xlsx(data = mono_sig[["deseq"]][["downs"]][[1]], excel = glue::glue("excel/monocyte_clinical_sigdown-v{ver}.xlsx")): object 'mono_sig' not found
mono_pct_sig <- sm(extract_significant_genes(mono_tables, n=200,
                                             lfc=NULL, p=NULL, according_to="deseq"))
## Error in extract_significant_genes(mono_tables, n = 200, lfc = NULL, p = NULL, : object 'mono_tables' not found
written <- write_xlsx(data=mono_pct_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_sigup_pct-v{ver}.xlsx"))
## Error in write_xlsx(data = mono_pct_sig[["deseq"]][["ups"]][[1]], excel = glue::glue("excel/monocyte_clinical_sigup_pct-v{ver}.xlsx")): object 'mono_pct_sig' not found
written <- write_xlsx(data=mono_pct_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_sigdown_pct-v{ver}.xlsx"))
## Error in write_xlsx(data = mono_pct_sig[["deseq"]][["downs"]][[1]], excel = glue::glue("excel/monocyte_clinical_sigdown_pct-v{ver}.xlsx")): object 'mono_pct_sig' not found
mono_sig$summary_df
## Error in eval(expr, envir, enclos): object 'mono_sig' not found
## Print out a table of the cpm values for other explorations.
mono_cpm <- sm(normalize_expt(mono, convert="cpm"))
written <- write_xlsx(data=exprs(mono_cpm),
                      excel=glue::glue("excel/monocyte_cpm_before_batch-v{ver}.xlsx"))
mono_bcpm <- sm(normalize_expt(mono, filter=TRUE, convert="cpm", batch="svaseq"))
written <- write_xlsx(data=exprs(mono_bcpm),
                      excel=glue::glue("excel/monocyte_cpm_after_batch-v{ver}.xlsx"))

9.2.2.2 With sva

mono_de_sva <- sm(all_pairwise(mono, model_batch="svaseq", filter=TRUE))
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
mono_sva_tables <- sm(combine_de_tables(
    mono_de_sva, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_mono_vall_de_curevsfail_svabatch.xlsx")))
## Error in combine_de_tables(mono_de_sva, keepers = keepers, excel = glue::glue("upload/DE_Cure_vs_Fail/{ver}_mono_vall_de_curevsfail_svabatch.xlsx")): object 'mono_de_sva' not found
mono_sig_sva <- sm(extract_significant_genes(
    mono_tables_sva,
    excel=glue::glue("excel/monocyte_clinical_sig_tables_sva-v{ver}.xlsx"),
    according_to="deseq"))
## Error in extract_significant_genes(mono_tables_sva, excel = glue::glue("excel/monocyte_clinical_sig_tables_sva-v{ver}.xlsx"), : object 'mono_tables_sva' not found

9.2.2.3 Monocyte DE plots

First print out the DE plots without and then with sva estimates.

## DESeq2 MA plot of failure / cure
mono_tables[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'mono_tables' not found
## DESeq2 Volcano plot of failure / cure
mono_tables[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'mono_tables' not found
## DESeq2 MA plot of failure / cure with svaseq
mono_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'mono_tables_sva' not found
## DESeq2 Volcano plot of failure / cure with svaseq
mono_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'mono_tables_sva' not found

9.2.2.5 Monocyte MSigDB query

## broad_c7 <- GSEABase::getGmt("reference/msigdb/c7.all.v7.2.entrez.gmt",
##                              collectionType=GSEABase::BroadCollection(category="c7"),
##                              geneIdType=GSEABase::EntrezIdentifier())
broad_c7 <- load_gmt_signatures(signatures="reference/msigdb/c7.all.v7.2.entrez.gmt",
                                signature_category="c7")
mono_up_goseq_msig <- goseq_msigdb(sig_genes=ups, signatures=broad_c7,
                                   signature_category="c7", length_db=hs_length)
## Error in rownames(sig_genes): object 'ups' not found
mono_down_goseq_msig <- goseq_msigdb(sig_genes=downs, signatures=broad_c7,
                                     signature_category="c7", length_db=hs_length)
## Error in rownames(sig_genes): object 'downs' not found

9.2.2.6 Plot of similar experiments

## Monocyte genes with increased expression in the failed samples
## share genes with the following experiments
mono_up_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'mono_up_goseq_msig' not found
## Monocyte genes with increased expression in the cured samples
## share genes with the following experiments
mono_down_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'mono_down_goseq_msig' not found

9.2.3 Evaluate Neutrophil samples

neut <- subset_expt(hs_valid, subset="typeofcells=='neutrophils'") %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="visitnumber") %>%
  set_expt_colors(colors=chosen_colors)
## subset_expt(): There were 202, now there are 50 samples.
## Warning in set_expt_colors(., colors = chosen_colors): Colors for the following
## categories are not being used: null.
save_result <- save(neut, file="rda/neutrophil_expt.rda")

neut_norm <- sm(normalize_expt(neut, convert="cpm", filter=TRUE, transform="log2"))
plt <- plot_pca(neut_norm, plot_labels=FALSE)$plot
pp(file=glue("images/neut_pca_normalized-v{ver}.pdf"), image=plt)

neut_nb <- sm(normalize_expt(neut, convert="cpm", filter=TRUE,
                             transform="log2", batch="svaseq"))
plt <- plot_pca(neut_nb, plot_labels=FALSE)$plot
pp(file=glue("images/neut_pca_normalized_svaseq-v{ver}.pdf"), image=plt)

9.2.4 DE of Netrophil samples

9.2.4.1 Without sva

neut_de <- sm(all_pairwise(neut, model_batch=FALSE, filter=TRUE))
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
neut_tables <- sm(combine_de_tables(
    neut_de, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_neut_vall_de_curevsfail_nobatch.xlsx")))
## Error in combine_de_tables(neut_de, keepers = keepers, excel = glue::glue("upload/DE_Cure_vs_Fail/{ver}_neut_vall_de_curevsfail_nobatch.xlsx")): object 'neut_de' not found
written <- write_xlsx(data=neut_tables[["data"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_table-v{ver}.xlsx"))
## Error in write_xlsx(data = neut_tables[["data"]][[1]], excel = glue::glue("excel/neutrophil_clinical_table-v{ver}.xlsx")): object 'neut_tables' not found
neut_sig <- sm(extract_significant_genes(neut_tables, according_to="deseq"))
## Error in extract_significant_genes(neut_tables, according_to = "deseq"): object 'neut_tables' not found
written <- write_xlsx(data=neut_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_sigup-v{ver}.xlsx"))
## Error in write_xlsx(data = neut_sig[["deseq"]][["ups"]][[1]], excel = glue::glue("excel/neutrophil_clinical_sigup-v{ver}.xlsx")): object 'neut_sig' not found
written <- write_xlsx(data=neut_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_sigdown-v{ver}.xlsx"))
## Error in write_xlsx(data = neut_sig[["deseq"]][["downs"]][[1]], excel = glue::glue("excel/neutrophil_clinical_sigdown-v{ver}.xlsx")): object 'neut_sig' not found
neut_pct_sig <- sm(extract_significant_genes(neut_tables, n=200, lfc=NULL,
                                             p=NULL, according_to="deseq"))
## Error in extract_significant_genes(neut_tables, n = 200, lfc = NULL, p = NULL, : object 'neut_tables' not found
written <- write_xlsx(data=neut_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_sigup_pct-v{ver}.xlsx"))
## Error in write_xlsx(data = neut_sig[["deseq"]][["ups"]][[1]], excel = glue::glue("excel/neutrophil_clinical_sigup_pct-v{ver}.xlsx")): object 'neut_sig' not found
written <- write_xlsx(data=neut_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_sigdown_pct-v{ver}.xlsx"))
## Error in write_xlsx(data = neut_sig[["deseq"]][["downs"]][[1]], excel = glue::glue("excel/neutrophil_clinical_sigdown_pct-v{ver}.xlsx")): object 'neut_sig' not found
neut_cpm <- sm(normalize_expt(neut, convert="cpm"))
written <- write_xlsx(data=exprs(neut_cpm),
                      excel=glue::glue("excel/neutrophil_cpm_before_batch-v{ver}.xlsx"))
neut_bcpm <- sm(normalize_expt(neut, filter=TRUE, batch="svaseq", convert="cpm"))
written <- write_xlsx(data=exprs(neut_bcpm),
                      excel=glue::glue("excel/neutrophil_cpm_after_batch-v{ver}.xlsx"))

9.2.4.2 With sva

neut_de_sva <- sm(all_pairwise(neut, model_batch="svaseq", filter=TRUE))
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
neut_tables_sva <- sm(combine_de_tables(
    neut_de_sva, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_neut_vall_de_curevsfail_svabatch.xlsx")))
## Error in combine_de_tables(neut_de_sva, keepers = keepers, excel = glue::glue("upload/DE_Cure_vs_Fail/{ver}_neut_vall_de_curevsfail_svabatch.xlsx")): object 'neut_de_sva' not found
neut_sig_sva <- sm(extract_significant_genes(
    neut_tables_sva,
    excel=glue::glue("excel/neutrophil_clinical_sig_tables_sva-v{ver}.xlsx"),
    according_to="deseq"))
## Error in extract_significant_genes(neut_tables_sva, excel = glue::glue("excel/neutrophil_clinical_sig_tables_sva-v{ver}.xlsx"), : object 'neut_tables_sva' not found

9.2.4.3 Neutrophil DE plots

## DESeq2 MA plot of failure / cure
neut_tables[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'neut_tables' not found
## DESeq2 Volcano plot of failure / cure
neut_tables[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'neut_tables' not found
## DESeq2 MA plot of failure / cure with sva
neut_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'neut_tables_sva' not found
## DESeq2 Volcano plot of failure / cure with sva
neut_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'neut_tables_sva' not found

9.2.4.5 Neutrophil GSVA query

neut_up_goseq_msig <- goseq_msigdb(sig_genes=ups, signatures=broad_c7,
                                   signature_category="c7", length_db=hs_length)
## Error in rownames(sig_genes): object 'ups' not found
neut_down_goseq_msig <- goseq_msigdb(sig_genes=downs, signatures=broad_c7,
                                     signature_category="c7", length_db=hs_length)
## Error in rownames(sig_genes): object 'downs' not found

9.2.4.6 Plot of similar experiments

## Neutrophil genes with increased expression in the failed samples
## share genes with the following experiments
neut_up_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'neut_up_goseq_msig' not found
## Neutrophil genes with increased expression in the cured samples
## share genes with the following experiments
neut_down_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'neut_down_goseq_msig' not found

9.3 Eosinophils

9.3.1 Evaluate Eosinophil samples

eo <- subset_expt(hs_valid, subset="typeofcells=='eosinophils'") %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="visitnumber") %>%
  set_expt_colors(colors=chosen_colors)
## subset_expt(): There were 202, now there are 30 samples.
## Warning in set_expt_colors(., colors = chosen_colors): Colors for the following
## categories are not being used: nullnotapplicable.
save_result <- save(eo, file="rda/eosinophil_expt.rda")
eo_norm <- sm(normalize_expt(eo, convert="cpm", transform="log2",
                             norm="quant", filter=TRUE))
plt <- plot_pca(eo_norm, plot_labels=FALSE)$plot
pp(file=glue("images/eo_pca_normalized-v{ver}.pdf"), image=plt)

eo_nb <- sm(normalize_expt(eo, convert="cpm", transform="log2",
                           filter=TRUE, batch="svaseq"))
plt <- plot_pca(eo_nb, plot_labels=FALSE)$plot
plt

9.3.2 DE of Eosinophil samples

9.3.2.1 Withouth sva

eo_de <- sm(all_pairwise(eo, model_batch=FALSE, filter=TRUE))
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
eo_tables <- sm(combine_de_tables(
    eo_de, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_eo_vall_de_curevsfail_nobatch.xlsx")))
## Error in combine_de_tables(eo_de, keepers = keepers, excel = glue::glue("upload/DE_Cure_vs_Fail/{ver}_eo_vall_de_curevsfail_nobatch.xlsx")): object 'eo_de' not found
written <- write_xlsx(data=eo_tables[["data"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_table-v{ver}.xlsx"))
## Error in write_xlsx(data = eo_tables[["data"]][[1]], excel = glue::glue("excel/eosinophil_clinical_table-v{ver}.xlsx")): object 'eo_tables' not found
eo_sig <- sm(extract_significant_genes(eo_tables, according_to="deseq"))
## Error in extract_significant_genes(eo_tables, according_to = "deseq"): object 'eo_tables' not found
written <- write_xlsx(data=eo_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_sigup-v{ver}.xlsx"))
## Error in write_xlsx(data = eo_sig[["deseq"]][["ups"]][[1]], excel = glue::glue("excel/eosinophil_clinical_sigup-v{ver}.xlsx")): object 'eo_sig' not found
written <- write_xlsx(data=eo_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_sigdown-v{ver}.xlsx"))
## Error in write_xlsx(data = eo_sig[["deseq"]][["downs"]][[1]], excel = glue::glue("excel/eosinophil_clinical_sigdown-v{ver}.xlsx")): object 'eo_sig' not found
eo_pct_sig <- sm(extract_significant_genes(eo_tables, n=200,
                                           lfc=NULL, p=NULL, according_to="deseq"))
## Error in extract_significant_genes(eo_tables, n = 200, lfc = NULL, p = NULL, : object 'eo_tables' not found
written <- write_xlsx(data=eo_pct_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_sigup_pct-v{ver}.xlsx"))
## Error in write_xlsx(data = eo_pct_sig[["deseq"]][["ups"]][[1]], excel = glue::glue("excel/eosinophil_clinical_sigup_pct-v{ver}.xlsx")): object 'eo_pct_sig' not found
written <- write_xlsx(data=eo_pct_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_sigdown_pct-v{ver}.xlsx"))
## Error in write_xlsx(data = eo_pct_sig[["deseq"]][["downs"]][[1]], excel = glue::glue("excel/eosinophil_clinical_sigdown_pct-v{ver}.xlsx")): object 'eo_pct_sig' not found
eo_cpm <- sm(normalize_expt(eo, convert="cpm"))
written <- write_xlsx(data=exprs(eo_cpm),
                      excel=glue::glue("excel/eosinophil_cpm_before_batch-v{ver}.xlsx"))
eo_bcpm <- sm(normalize_expt(eo, filter=TRUE, batch="svaseq", convert="cpm"))
written <- write_xlsx(data=exprs(eo_bcpm),
                      excel=glue::glue("excel/eosinophil_cpm_after_batch-v{ver}.xlsx"))

9.3.2.2 With sva

eo_de_sva <- sm(all_pairwise(eo, model_batch="svaseq", filter=TRUE))
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
eo_tables_sva <- sm(combine_de_tables(
    eo_de_sva, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_eo_vall_de_curevsfail_svabatch.xlsx")))
## Error in combine_de_tables(eo_de_sva, keepers = keepers, excel = glue::glue("upload/DE_Cure_vs_Fail/{ver}_eo_vall_de_curevsfail_svabatch.xlsx")): object 'eo_de_sva' not found
eo_sig_sva <- sm(extract_significant_genes(
    eo_tables_sva,
    excel=glue::glue("excel/eosinophil_clinical_sig_tables_sva-v{ver}.xlsx"),
    according_to="deseq"))
## Error in extract_significant_genes(eo_tables_sva, excel = glue::glue("excel/eosinophil_clinical_sig_tables_sva-v{ver}.xlsx"), : object 'eo_tables_sva' not found

9.3.2.3 Eosinophil DE plots

## DESeq2 MA plot of failure / cure
eo_tables[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'eo_tables' not found
## DESeq2 Volcano plot of failure / cure
eo_tables[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'eo_tables' not found
## DESeq2 MA plot of failure / cure with sva
eo_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'eo_tables_sva' not found
## DESeq2 Volcano plot of failure / cure with sva
eo_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'eo_tables_sva' not found

9.3.2.5 Eosinophil MSigDB query

eo_up_goseq_msig <- goseq_msigdb(sig_genes=ups, signatures=broad_c7,
                                 signature_category="c7", length_db=hs_length)
## Error in rownames(sig_genes): object 'ups' not found
eo_down_goseq_msig <- goseq_msigdb(sig_genes=downs, signatures=broad_c7,
                                   signature_category="c7", length_db=hs_length)
## Error in rownames(sig_genes): object 'downs' not found

9.3.2.6 Plot of similar experiments

## Eosinophil genes with increased expression in the failed samples
## share genes with the following experiments
eo_up_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'eo_up_goseq_msig' not found
## Eosinophil genes with increased expression in the cured samples
## share genes with the following experiments
eo_down_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]
## Error in eval(expr, envir, enclos): object 'eo_down_goseq_msig' not found

9.4 Biopsies

9.4.1 Evaluate Biopsy samples

biop <- subset_expt(hs_valid, subset="typeofcells=='biopsy'") %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="visitnumber") %>%
  set_expt_colors(colors=chosen_colors)
## subset_expt(): There were 202, now there are 52 samples.
## Warning in set_expt_colors(., colors = chosen_colors): Colors for the following
## categories are not being used: nullnotapplicable.
save_result <- save(biop, file="rda/biopsy_expt.rda")
biop_norm <- normalize_expt(biop, filter=TRUE, convert="cpm",
                            transform="log2", norm="quant")
## Removing 5738 low-count genes (14203 remaining).
## transform_counts: Found 16 values equal to 0, adding 1 to the matrix.
plt <- plot_pca(biop_norm, plot_labels=FALSE)$plot
pp(file=glue("images/biop_pca_normalized-v{ver}.pdf"), image=plt)

biop_nb <- sm(normalize_expt(biop, convert="cpm", filter=TRUE,
                             transform="log2", batch="svaseq"))
plt <- plot_pca(biop_nb, plot_labels=FALSE)$plot
pp(file=glue("images/biop_pca_normalized_svaseq-v{ver}.pdf"), image=plt)

9.4.2 DE of Biopsy samples

9.4.2.1 Without sva

biop_de <- sm(all_pairwise(biop, model_batch=FALSE, filter=TRUE))
biop_tables <- combine_de_tables(
    biop_de, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_biopsy_vall_de_curevsfail_nobatch.xlsx")))

written <- write_xlsx(data=biop_tables[["data"]][[1]],
                      excel=glue::glue("excel/biopsy_clinical_table-v{ver}.xlsx"))
biop_sig <- extract_significant_genes(biop_tables, according_to="deseq")
##written <- write_xlsx(data=biop_sig[["deseq"]][["ups"]][[1]],
##                      excel=glue::glue("excel/biopsy_clinical_sigup-v{ver}.xlsx"))
written <- write_xlsx(data=biop_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/biopsy_clinical_sigdown-v{ver}.xlsx"))
biop_pct_sig <- extract_significant_genes(biop_tables, n=200, lfc=NULL, p=NULL, according_to="deseq")
written <- write_xlsx(data=biop_pct_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/biopsy_clinical_sigup_pct-v{ver}.xlsx"))
written <- write_xlsx(data=biop_pct_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/biopsy_clinical_sigdown_pct-v{ver}.xlsx"))

biop_cpm <- sm(normalize_expt(biop, convert="cpm"))
written <- write_xlsx(data=exprs(biop_cpm),
                      excel=glue::glue("excel/biopsy_cpm_before_batch-v{ver}.xlsx"))
biop_bcpm <- sm(normalize_expt(biop, filter=TRUE, batch="svaseq", convert="cpm"))
written <- write_xlsx(data=exprs(biop_bcpm),
                      excel=glue::glue("excel/biopsy_cpm_after_batch-v{ver}.xlsx"))
## Error: <text>:4:93: unexpected ')'
## 3:     biop_de, keepers=keepers,
## 4:     excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_biopsy_vall_de_curevsfail_nobatch.xlsx")))
##                                                                                                ^

9.4.2.2 with sva

biop_de_sva <- sm(all_pairwise(biop, model_batch="svaseq", filter=TRUE))
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
biop_tables_sva <- sm(combine_de_tables(
    biop_de_sva, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_biopsy_vall_de_curevsfail_svabatch.xlsx")))
## Error in combine_de_tables(biop_de_sva, keepers = keepers, excel = glue::glue("upload/DE_Cure_vs_Fail/{ver}_biopsy_vall_de_curevsfail_svabatch.xlsx")): object 'biop_de_sva' not found
biop_sig_sva <- sm(extract_significant_genes(
    biop_tables_sva,
    excel=glue::glue("excel/biopsy_clinical_sig_tables_sva-v{ver}.xlsx"),
    according_to="deseq"))
## Error in extract_significant_genes(biop_tables_sva, excel = glue::glue("excel/biopsy_clinical_sig_tables_sva-v{ver}.xlsx"), : object 'biop_tables_sva' not found

9.4.2.3 Biopsy DE plots

## DESeq2 MA plot of failure / cure
biop_tables[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'biop_tables' not found
## DESeq2 Volcano plot of failure / cure
biop_tables[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'biop_tables' not found
## DESeq2 MA plot of failure / cure
biop_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'biop_tables_sva' not found
## DESeq2 Volcano plot of failure / cure
biop_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
## Error in eval(expr, envir, enclos): object 'biop_tables_sva' not found

10 Look for shared genes among Monocytes/Neutrophils/Eosinophils

We have three variables containing the ‘significant’ DE genes for the three cell types. For this I am choosing (for the moment) to use the sva data.

## mono_sig_sva, neut_sig_sva, eo_sig_sva
sig_vectors <- list(
    "monocytes"=c(rownames(mono_sig_sva[["deseq"]][["ups"]][["fail_vs_cure"]]),
                    rownames(mono_sig_sva[["deseq"]][["downs"]][["fail_vs_cure"]])),
    "neutrophils"=c(rownames(neut_sig_sva[["deseq"]][["ups"]][["fail_vs_cure"]]),
                      rownames(neut_sig_sva[["deseq"]][["downs"]][["fail_vs_cure"]])),
    "eosinophils"= c(rownames(eo_sig_sva[["deseq"]][["ups"]][["fail_vs_cure"]]),
                       rownames(eo_sig_sva[["deseq"]][["downs"]][["fail_vs_cure"]])))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'mono_sig_sva' not found
shared_vector <- Vennerable::Venn(Sets=sig_vectors)
## Error in Vennerable::Venn(Sets = sig_vectors): object 'sig_vectors' not found
Vennerable::plot(shared_vector, doWeights=FALSE)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'shared_vector' not found
shared_ids <- shared_vector@IntersectionSets[["111"]]
## Error in eval(expr, envir, enclos): object 'shared_vector' not found
shared_expt <- exclude_genes_expt(hs_clinical, ids=shared_ids, method="keep")
## Error in exclude_genes_expt(hs_clinical, ids = shared_ids, method = "keep"): object 'shared_ids' not found
shared_written <- write_expt(shared_expt,
                             excel=glue::glue("excel/shared_across_celltypes-v{ver}.xlsx"))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'shared_expt' not found

11 Monocytes by visit

  1. Can you please share with us a PCA (SVA and non-SVA) of the monocytes of the TMRC3 project, but labeling them based on the visit (V1, V2, V3)?
  2. Can you please share DE lists of V1 vs V2, V1 vs V3, V1 vs. V2+V3 and V2 vs V3?
visit_colors <- chosen_colors <- c("#D95F02", "#7570B3", "#1B9E77")
names(visit_colors) <- c(1, 2, 3)
mono_visit <- subset_expt(hs_valid, subset="typeofcells=='monocytes'") %>%
  set_expt_conditions(fact="visitnumber") %>%
  set_expt_batches(fact="clinicaloutcome") %>%
  set_expt_colors(colors=chosen_colors)
## subset_expt(): There were 202, now there are 50 samples.
mono_visit_norm <- normalize_expt(mono_visit, filter=TRUE, norm="quant", convert="cpm",
                                  transform="log2")
## Removing 8776 low-count genes (11165 remaining).
## transform_counts: Found 11 values equal to 0, adding 1 to the matrix.
mono_visit_pca <- plot_pca(mono_visit_norm)
pp(file="images/monocyte_by_visit.png", image=mono_visit_pca$plot)

mono_visit_nb <- normalize_expt(mono_visit, filter=TRUE, convert="cpm",
                                batch="svaseq", transform="log2")
## Removing 8776 low-count genes (11165 remaining).
## batch_counts: Before batch/surrogate estimation, 2380 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 36412 entries are 0<x<1: 7%.
## Setting 1029 low elements to zero.
## transform_counts: Found 1029 values equal to 0, adding 1 to the matrix.
mono_visit_nb_pca <- plot_pca(mono_visit_nb)
pp(file="images/monocyte_by_visit_nb.png", image=mono_visit_nb_pca$plot)

table(pData(mono_visit_norm)$batch)
## 
##          cure       failure          lost notapplicable 
##            19            25             5             1
keepers <- list(
    "second_vs_first"=c("c2", "c1"),
    "third_vs_second"=c("c3", "c2"),
    "third_vs_first"=c("c3", "c1"))
mono_visit_de <- all_pairwise(mono_visit, model_batch="svaseq", filter=TRUE)
## batch_counts: Before batch/surrogate estimation, 2380 entries are x==0: 0%.
## Plotting a PCA before surrogate/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Removing 0 low-count genes (11165 remaining).
## batch_counts: Before batch/surrogate estimation, 2380 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 36412 entries are 0<x<1: 7%.
## Setting 1029 low elements to zero.
## transform_counts: Found 1029 values equal to 0, adding 1 to the matrix.
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
mono_visit_tables <- combine_de_tables(
    mono_visit_de,
    keepers=keepers,
    excel=glue::glue("excel/mono_visit_tables-v{ver}.xlsx"))
## Error in combine_de_tables(mono_visit_de, keepers = keepers, excel = glue::glue("excel/mono_visit_tables-v{ver}.xlsx")): object 'mono_visit_de' not found
new_factor <- as.character(pData(mono_visit)[["visitnumber"]])
not_one_idx <- new_factor != 1
new_factor[not_one_idx] <- "not_1"
mono_one_vs <- set_expt_conditions(mono_visit, new_factor)

mono_one_vs_de <- all_pairwise(mono_one_vs, model_batch="svaseq", filter=TRUE)
## batch_counts: Before batch/surrogate estimation, 2380 entries are x==0: 0%.
## Plotting a PCA before surrogate/batch inclusion.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Removing 0 low-count genes (11165 remaining).
## batch_counts: Before batch/surrogate estimation, 2380 entries are x==0: 0%.
## batch_counts: Before batch/surrogate estimation, 36412 entries are 0<x<1: 7%.
## Setting 936 low elements to zero.
## transform_counts: Found 936 values equal to 0, adding 1 to the matrix.
## Error in checkForRemoteErrors(val): 5 nodes produced errors; first error: c("Error in if (grep(x = eval_name, pattern = \"^([[:digit:]]|[[:punct:]])\")) { : \n  argument is of length zero\n", "basic")
mono_one_vs_tables <- combine_de_tables(
    mono_one_vs_de,
    excel=glue::glue("excel/mono_one_vs_tables-v{ver}.xlsx"))
## Error in combine_de_tables(mono_one_vs_de, excel = glue::glue("excel/mono_one_vs_tables-v{ver}.xlsx")): object 'mono_one_vs_de' not found

12 Test TSP

In writing the following, I quickly realized that tspair was not joking when it said it is intended for small numbers of genes. For a full expressionset of human data it is struggling. I like the idea, it may prove worth while to spend some time optimizing the package so that it is more usable.

expt <- hs_clinical_nobiop

simple_tsp <- function(expt, column="condition") {
  facts <- levels(as.factor(pData(expt)[[column]]))
  retlist <- list()
  if (length(facts) < 2) {
    stop("This requires factors with at least 2 levels.")
  } else if (length(facts) == 2) {
    retlist <- simple_tsp_pair(expt, column=column)
  } else {
    for (first in 1:(length(facts) - 1)) {
      for (second in 2:(length(facts))) {
        if (first < second) {
          name <- glue::glue("{facts[first]}_vs_{facts[second]}")
          message("Starting ", name, ".")
          substring <- glue::glue("{column}=='{facts[first]}'|{column}=='{facts[second]}'")
          subby <- subset_expt(expt, subset=as.character(substring))
          retlist[[name]] <- simple_tsp_pair(subby, column=column)
        }
      }
    }
  }
}

simple_tsp_pair <- function(subby, column="condition", repetitions=50) {
  tsp_input <- subby[["expressionset"]]
  tsp_output <- tspcalc(tsp_input, column)
  tsp_scores <- tspsig(tsp_input, column, B=repetitions)
}

tsp1 <- tspcalc(tsp_input, "condition")
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  message(paste0("Saving to ", savefile))
  tmp <- sm(saveme(filename=savefile))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 841bed6443e2a8e27851bef94a1a6aacc0bae5cf
## This is hpgltools commit: Fri Oct 29 16:01:32 2021 -0400: 841bed6443e2a8e27851bef94a1a6aacc0bae5cf
## Saving to tmrc3_02sample_estimation_v202110.rda.xz
tmp <- loadme(filename=savefile)
---
title: "TMRC3 Comprehensive Data Analysis: 202110"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include = FALSE}
library(hpgltools)
tt <- sm(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=120,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      fig.width=12,
                      fig.height=12,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=12))
ver <- "202110"
rundate <- format(Sys.Date(), format="%Y%m%d")

rmd_file <- glue::glue("tmrc3_02sample_estimation_v{ver}.Rmd")
savefile <- gsub(pattern="\\.Rmd", replace="\\.rda\\.xz", x=rmd_file)
```

# Introduction

This worksheet is intended to provide the analyses for the TMRC3
project.  As of ~ 20211013 we have a shared directory into which to
place various results.  For the moment I think we will primarily
attempt to put csv files.  In each directory of that shared tree, we
will attempt to put the set of results which are defined by each
subdivision.

# Sample sheet

I think there have been some new/improved annotations in the online sample sheet,
so let us go download a fresh copy and see.

```{r samplesheet}
samplesheet <- "sample_sheets/tmrc3_samples_20211012.xlsx"
crf_metadata <- "sample_sheets/20210825_EXP_ESPECIAL_TMRC3_VERSION_2.xlsx"
```

# Annotation

We take the annotation data from ensembl's biomart instance.  The genome which
was used to map the data was hg38 revision 100.  My default when using biomart is
to load the data from 1 year before the current date.

```{r hs_annot}
hs_annot <- sm(load_biomart_annotations(year="2020"))
hs_annot <- hs_annot[["annotation"]]
hs_annot[["transcript"]] <- paste0(rownames(hs_annot), ".", hs_annot[["version"]])
rownames(hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique=TRUE)
tx_gene_map <- hs_annot[, c("transcript", "ensembl_gene_id")]

summary(hs_annot)
```

```{r hs_go}
hs_go <- sm(load_biomart_go()[["go"]])
hs_length <- hs_annot[, c("ensembl_gene_id", "cds_length")]
colnames(hs_length) <- c("ID", "length")
```

# Introduction

This document is intended to provide an overview of TMRC3 samples which have
been sequenced.  It includes some plots and analyses showing the relationships
among the samples as well as some differential analyses when possible.

# Sample Estimation

## Generate expressionsets

The sample sheet is copied from our shared online sheet and updated with each release
of sequencing data.

### Hisat2 expressionsets

The first thing to note is the large range in coverage.  There are multiple
samples with coverage which is too low to use.  These will be removed shortly.

In the following block I immediately exclude any non-coding reads as well.

```{r all_new_hisat2}
## Create the expressionset and immediately pass it to a filter
## removing the non protein coding genes.
sanitize_columns <- c("visitnumber", "clinicaloutcome", "donor",
                      "typeofcells", "clinicalpresentation",
                      "condition", "batch")
hs_expt <- create_expt(samplesheet,
                       file_column="hg38100hisatfile",
                       savefile=glue::glue("rda/hs_expt_all-v{ver}.rda"),
                       gene_info=hs_annot) %>%
  exclude_genes_expt(column="gene_biotype", method="keep",
                     pattern="protein_coding", meta_column="ncrna_lost") %>%
  sanitize_expt_metadata(columns=sanitize_columns) %>%
  set_expt_factors(columns=sanitize_columns, class="factor")

levels(pData(hs_expt[["expressionset"]])[["visitnumber"]]) <- list(
    '0'="notapplicable", '1'=1, '2'=2, '3'=3)
```

### Add Clinical metadata

Let us also merge in the clinician's metadata.  I worry a little that
this might not be allowed for dbGap data, but if it is a problem I
suspect we can just remove the bad columns from it.  Also note that I
rarely use the join function, but it is somewhat required here because
I do not want to risk shuffling the metadata when I add the new
metadata, which comes from a spreadsheet sorted by patient, not
sample.  In doing this I therefore just created a new column 'join'
which contains the shared information, e.g. the patient ID
from the existing metadata and the same ID from the CRF file which has
been coerced into lowercase.

```{r merge_crf}
hs_pd <- pData(hs_expt)
start <- rownames(hs_pd)
hs_crf <- openxlsx::read.xlsx(crf_metadata)
hs_crf[["join"]] <- tolower(hs_crf[["codigo_paciente"]])
hs_pd[["join"]] <- hs_pd[["tubelabelorigin"]]
test <- plyr::join(hs_pd, hs_crf, by="join")
test[["join"]] <- NULL
rownames(test) <- rownames(hs_pd)
na_idx <- is.na(test)
test[na_idx] <- "undefined"
pData(hs_expt) <- test
```

This added a bunch of new columns, of which there are a few which
Theresa showed are of likely interest:

* eb_lc_sexo
* eb_lc_etnia
* edad eb_lc_tiempo_evolucion
* eb_lc_num_lc_activas
* eb_lc_ulcera_area_1
* eb_lc_ejex_lesion_mm_1
* eb_lc_lesion_area_1
* v2_lc_ejex_lesion_mm_1
* v2_lc_lesion_area_1
* v2_lc_ejex_ulcera_mm_1
* v2_lc_ejey_ulcera_mm_1
* v2_lc_ulcera_area_1
* v3_lc_ejex_lesion_mm_1
* v3_lc_ejey_lesion_mm_1
* v3_lc_lesion_area_1
* v3_lc_ejex_ulcera_mm_1
* v3_lc_ejey_ulcera_mm_1
* v3_lc_ulcera_area_1
* eb_lc_tto_mcto_prescrito
* eb_lc_tto_mcto_glucan_dosis
* v3_lc_num_amp_cap_tto
* adherencia_tto

### Consider lcRNA (currently unused)

Split this data into CDS and lncRNA.  Oh crap in order to do that I need to recount the data.
Running now (20210518)

```{r lnc_cds}
## lnc_expt <- create_expt(samplesheet,
##                         file_column="hg38100lncfile",
##                         gene_info=hs_annot)
```

#### Initial metrics

Once the data was loaded, there are a couple of metrics which may be plotted immediately.

```{r initial_metrics}
nonzero <- plot_nonzero(hs_expt)
nonzero$plot

ncrna_lost_df <- as.data.frame(pData(hs_expt)[["ncrna_lost"]])
rownames(ncrna_lost_df) <- rownames(pData(hs_expt))
colnames(ncrna_lost_df) <- "ncrna_lost"

tmpdf <- merge(nonzero$table, ncrna_lost_df, by="row.names")
rownames(tmpdf) <- tmpdf[["Row.names"]]
tmpdf[["Row.names"]] <- NULL

ggplot(tmpdf, aes(x=ncrna_lost, y=nonzero_genes)) +
  ggplot2::geom_point() +
  ggplot2::ggtitle("Nonzero genes with respect to percent counts
lost when ncRNA was removed.")
```

Najib doesn't want this plot, but I am using it to check new samples,
so will hide it from general use.

```{r libsize}
libsize <- plot_libsize(hs_expt)
libsize$plot
```

## Minimum coverage sample filtering

I arbitrarily chose 11,000 non-zero genes as a minimum.  We may
want this to be higher.

```{r hisat2_write, fig.show="hide"}
hs_valid <- subset_expt(hs_expt, nonzero=11000)

## valid_write <- sm(write_expt(hs_valid, excel=glue("excel/hs_valid-v{ver}-{rundate}.xlsx")))
```

## Upload 'raw' CPM

The cpm directory in the shared tree has a place to drop the counts
for the full dataset along with some subsets by cell type and time.
The following block should make that process rather easier.

```{r cpm_csv, results=FALSE, message=FALSE}
all_cpm <- hs_valid %>%
  normalize_expt(hs_valid, filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=file=glue::glue("upload/CPM/all_cpm-v{ver}-d{rundate}.csv"))

biopsy_cpm <- hs_valid %>%
  subset_expt(subset="typeofcells=='biopsy'") %>%
  normalize_expt(filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=glue::glue("upload/CPM/Biopsies/biopsy_cpm-v{ver}-d{rundate}.csv"))

eosinophils_cpm <- hs_valid %>%
  subset_expt(subset="typeofcells=='eosinophils'") %>%
  normalize_expt(filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=glue::glue("upload/CPM/Eosinophils/eosinophil_cpm-v{ver}-d{rundate}.csv"))

monocyte_cpm <- hs_valid %>%
  subset_expt(subset="typeofcells=='monocytes'") %>%
  normalize_expt(filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=glue::glue("upload/CPM/Monocytes/monocyte_cpm-v{ver}-d{rundate}.csv"))

neutrophil_cpm <- hs_valid %>%
  subset_expt(subset="typeofcells=='neutrophils'") %>%
  normalize_expt(filter=TRUE, convert="cpm") %>%
  exprs() %>%
  write.csv(file=glue::glue("upload/CPM/Neutrophils/neutrophil_cpm-v{ver}-d{rundate}.csv"))

times <- c("1", "2", "3")
for (time in times) {
  time_subset <- paste0("visitnumber=='", time, "'")
  filename <- glue::glue("upload/CPM/visit{time}-v{ver}-d{rundate}.csv")
  written <- hs_valid %>%
    subset_expt(subset=time_subset) %>%
    exprs() %>%
    write.csv(file=filename)
}

types <- c("neutrophils", "monocytes", "eosinophils", "biopsy")
for (type in types) {
  type_subset <- paste0("typeofcells=='", type, "'")
  for (time in times) {
    time_subset <- paste0("visitnumber=='", time, "'")
    filename <- glue::glue("upload/CPM/{type}_visit{time}-v{ver}-d{rundate}.csv")
    written <- hs_valid %>%
      subset_expt(subset=time_subset) %>%
      subset_expt(subset=type_subset) %>%
      exprs() %>%
      write.csv(file=filename)
  }
}
```

# Project Aims

The project seeks to determine the relationship of the innate immune response
and inflammatory signaling to the clinical outcome of antileishmanial drug
treatment. We will test the hypothesis that the profile of innate immune cell
activation and their dynamics through the course of treatment differ between CL
patients with prospectively determined therapeutic cure or failure.

This will be achieved through the characterization of the in vivo dynamics of
blood-derived monocyte, neutrophil and eosinophil transcriptome before, during
and at the end of treatment in CL patients. Cell-type specific transcriptomes,
composite signatures and time-response expression profiles will be contrasted
among patients with therapeutic cure or failure.

## Preparation

To address these, I added to the end of the sample sheet columns named
'condition', 'batch', 'donor', and 'time'.  These are filled in with shorthand
values according to the above.

## Global view

Before addressing the questions explicitly by subsetting the data, I want to get
a look at the samples as they are.

```{r pre_questions}
new_names <- pData(hs_valid)[["samplename"]]
hs_valid <- hs_valid %>%
  set_expt_batches(fact="cellssource") %>%
  set_expt_conditions(fact="typeofcells") %>%
  set_expt_samplenames(newnames=new_names)

all_norm <- sm(normalize_expt(hs_valid, transform="log2", norm="quant",
                              convert="cpm", filter=TRUE))

all_pca <- plot_pca(all_norm, plot_labels=FALSE,
                    plot_title="PCA - Cell type", size_column="visitnumber")
pp(file=glue("images/tmrc3_pca_nolabels-v{ver}.png"), image=all_pca$plot)

write.csv(all_pca$table, file="coords/hs_donor_pca_coords.csv")
plot_corheat(all_norm, plot_title="Heirarchical clustering:
         cell types")$plot
```

## Examine samples relevant to clinical outcome

Now let us consider only the samples for which we have a clinical outcome.
These fall primarily into either 'cured' or 'failed', but some people have not
yet returned to the clinic after the first or second visit.  These are deemed
'lost'.

```{r all_clinical}
hs_clinical <- hs_valid %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="typeofcells") %>%
  subset_expt(subset="typeofcells!='pbmcs'&typeofcells!='macrophages'")

chosen_colors <- c("#D95F02", "#7570B3", "#1B9E77", "#FF0000", "#FF0000")
names(chosen_colors) <- c("cure", "failure", "lost", "null", "notapplicable")
hs_clinical <- set_expt_colors(hs_clinical, colors=chosen_colors)

newnames <- make.names(pData(hs_clinical)[["samplename"]], unique=TRUE)
hs_clinical <- set_expt_samplenames(hs_clinical, newnames=newnames)

hs_clinical_norm <- sm(normalize_expt(hs_clinical, filter=TRUE, transform="log2",
                                      convert="cpm", norm="quant"))
clinical_pca <- plot_pca(hs_clinical_norm, plot_labels=FALSE,
                         size_column="visitnumber", cis=NULL,
                         plot_title="PCA - clinical samples")
pp(file=glue("images/all_clinical_nobatch_pca-v{ver}.png"), image=clinical_pca$plot,
   height=8, width=20)
```

### Repeat without the biopsy samples

```{r ibid_nobiopsy}
hs_clinical_nobiop <- hs_clinical %>%
  subset_expt(subset="typeofcells!='biopsy'") %>%
  subset_expt(subset="condition=='lost'|condition=='cure'|condition=='failure'")

hs_clinical_nobiop_norm <- sm(normalize_expt(hs_clinical_nobiop, filter=TRUE, transform="log2",
                                             convert="cpm", norm="quant"))
clinical_nobiop_pca <- plot_pca(hs_clinical_nobiop_norm, plot_labels=FALSE, cis=NULL,
                                plot_title="PCA - clinical samples without biopsies")
pp(file=glue("images/all_clinical_nobiop_nobatch_pca-v{ver}.png"),
   image=clinical_nobiop_pca$plot)
```

### Attempt to correct for the surrogate variables

At this time we have two primary data structures of interest: hs_clinical and hs_clinical_nobiop

```{r clinical_sva}
hs_clinical_nb <- normalize_expt(hs_clinical, filter=TRUE, batch="svaseq",
                                 transform="log2", convert="cpm")
clinical_batch_pca <- plot_pca(hs_clinical_nb, plot_labels=FALSE, cis=NULL,
                               size_column="visitnumber", plot_title="PCA - clinical samples")
clinical_batch_pca$plot

hs_clinical_nobiop_nb <- sm(normalize_expt(hs_clinical_nobiop, filter=TRUE, batch="svaseq",
                                           transform="log2", convert="cpm"))
clinical_nobiop_batch_pca <- plot_pca(hs_clinical_nobiop_nb,
                                      plot_title="PCA - clinical samples without biopsies",
                                      plot_labels=FALSE)
pp(file="images/clinical_batch.png", image=clinical_nobiop_batch_pca$plot)
test <- plot_pca(hs_clinical_nobiop_nb, size_column="visitnumber",
                 plot_title="PCA - clinical samples without biopsies",
                 plot_labels=FALSE)
test$plot

clinical_nobiop_batch_tsne <- plot_tsne(hs_clinical_nobiop_nb,
                                        plot_title="tSNE - clinical samples without biopsies",
                                        plot_labels=FALSE)
clinical_nobiop_batch_tsne$plot
```

## Perform DE of the clinical samples cure vs. fail

```{r clinical_de, fig.show="hide"}
individual_celltypes <- subset_expt(hs_clinical_nobiop, subset="condition!='lost'")
hs_clinic_de <- sm(all_pairwise(individual_celltypes, model_batch="svaseq", filter=TRUE))
hs_clinic_table <- sm(combine_de_tables(
    hs_clinic_de,
    excel=glue::glue("excel/individual_celltypes_table-v{ver}.xlsx")))
hs_clinic_sig <- sm(extract_significant_genes(
    hs_clinic_table,
    excel=glue::glue("excel/individual_celltypes_sig-v{ver}.xlsx")))

hs_clinic_sig[["summary_df"]]
```

```{r de_heatmap}
hs_clinic_de[["comparison"]][["heat"]]
```

## Random aside for my own education

I was just doing some reading about concordance statistics.  I want to try something with them.

```{r concordance_test}
test_df <- hs_clinic_table[["data"]][["failure_vs_cure"]][, c("deseq_logfc", "edger_logfc")]
test_lm <- glm(deseq_logfc ~ edger_logfc, data=test_df)
con <- survival::concordance(test_lm)
con
```

To calculate the first of these metrics, the area under the concordance
curve (AUCC), we ranked genes in both the single-cell and bulk datasets in
descending order by the statistical significance of their differential expression.
Then, we created lists of the top-ranked genes in each dataset of matching size, up
to some maximum size k. For each of these lists (that is, for the top-1 genes, top-2
genes, top-3 genes, and so on), we computed the size of the intersection between
the single-cell and bulk DE genes. This procedure yielded a curve relating the
number of shared genes between datasets to the number of top-ranked genes
considered. The area under this curve was computed by summing the size of all
intersections, and normalized to the range [0, 1] by dividing it by its maximum
possible value, k × ( k +1) / 2. To evaluate the concordance of DE analysis, we used
k =500 except where otherwise noted, but found our results were insensitive to the
precise value of k. To compute the second metric, the transcriptome-wide rank
correlation, we multiplied the absolute value of the test statistic for each gene by the
sign of its log-fold change between conditions, and then computed the Spearman
correlation over genes between them.

```{r aucc}
calculate_aucc <- function(tbl, px="deseq_adjp", py="edger_adjp",
                           lx="deseq_logfc", ly="edger_logfc",
                           topn=0.1) {
  ## If the topn argument is an integer, the just ask for that number.
  ## If it is a floating point 0<x<1, then set topn to that proportion
  ## of the number of genes.
  if (topn <= 0) {
    stop("topn need to be either a float from 0-1 or an interger bigger than 100.")
  } else if (topn > 1 & topn <= 100) {
    stop("topn need to be either a float from 0-1 or an interger bigger than 100.")
  } else if (topn < 1) {
    topn <- ceiling(nrow(tbl) * topn)
  }

  x_df <- tbl[, c(px, lx)]
  y_df <- tbl[, c(py, ly)]
  ## curve (AUCC), we ranked genes in both the single-cell and bulk datasets in
  ## descending order by the statistical significance of their differential expression.
  x_idx <- order(x_df[[1]], decreasing=FALSE)
  x_df <- x_df[x_idx, ]
  y_idx <- order(y_df[[1]], decreasing=FALSE)
  y_df <- y_df[y_idx, ]

  ## Then, we created lists of the top-ranked genes in each dataset of
  ## matching size, up to some maximum size k. For each of these lists
  ## (that is, for the top-1 genes, top-2 genes, top-3 genes, and so
  ## on), we computed the size of the intersection between the
  ## single-cell and bulk DE genes. This procedure yielded a curve
  ## relating the number of shared genes between datasets to the
  ## number of top-ranked genes considered.

  intersections <- rep(0, topn)
  for (i in 1:topn) {
    if (i == 1) {
      x_intersections[i] <- rownames(x_df)[i] == rownames(y_df)[i]
    } else {
      x_set <- rownames(x_df)[1:i]
      y_set <- rownames(y_df)[1:i]
      intersections[i] <- sum(x_set %in% y_set)
    }
  }

  ## The area under this curve was computed by summing the size of all
  ## intersections, and normalized to the range [0, 1] by dividing it
  ## by its maximum possible value, k × ( k +1) / 2. To evaluate the
  ## concordance of DE analysis, we used k =500 except where otherwise
  ## noted, but found our results were insensitive to the
  sumint <- sum(intersections)
  norm <- (topn * (topn + 1)) /2
  aucc <- sumint / norm
  return(aucc)
}
```


### Perform LRT with the clinical samples

I am not sure if we have enough samples across the three visit to
completely work as well as we would like, but there is only 1 way to
find out!  Now that I think about it, one thing which might be awesome
is to use cell type as an interacting factor...

#### With biopsy samples

I figure this might be a place where the biopsy samples might prove useful.

```{r lrt_test}
clinical_nolost <- subset_expt(hs_clinical, subset="condition!='lost'&condition!='notapplicable'")

## Currently (202109), this is not returning any significant hits.
## In previous iterations it did.
lrt_visit_clinical_test <- sm(deseq_lrt(clinical_nolost, transform="vst",
                                        interactor_column="visitnumber",
                                        interest_column="clinicaloutcome"))
summary(lrt_visit_clinical_test[["deseq_table"]])
written <- write_xlsx(data=as.data.frame(lrt_visit_clinical_test[["deseq_table"]]),
                      excel=glue::glue("excel/lrt_clinical_visit-v{ver}.xlsx"))

lrt_celltype_clinical_test <- sm(deseq_lrt(clinical_nolost, transform="vst",
                                           interactor_column="typeofcells",
                                           interest_column="clinicaloutcome"))
summary(lrt_celltype_clinical_test[["favorite_genes"]])
favorite_lrt_df <- merge(hs_annot, lrt_celltype_clinical_test[["favorite_genes"]], all.y=TRUE,
                     by="row.names")
written <- write_xlsx(data=favorite_lrt_df,
                      excel=glue::glue("excel/lrt_clinical_celltype_favorites-v{ver}.xlsx"))
deseq_lrt_df <- merge(hs_annot, as.data.frame(lrt_celltype_clinical_test[["deseq_table"]]), all.y=TRUE,
                      by="row.names")
rownames(deseq_lrt_df) <- deseq_lrt_df[["Row.names"]]
deseq_lrt_df[["Row.names"]] <- NULL
written <- write_xlsx(data=deseq_lrt_df,
                      excel=glue::glue("excel/lrt_clinical_celltype-v{ver}.xlsx"))
lrt_celltype_clinical_test$cluster_data$plot

cluster_23_genes_idx <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][["cluster"]] == 23
cluster_23_genes <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][cluster_23_genes_idx, ]
cluster_23_genes

cluster_24_genes_idx <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][["cluster"]] == 24
cluster_24_genes <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][cluster_24_genes_idx, ]
cluster_24_genes

cluster_22_genes_idx <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][["cluster"]] == 22
cluster_22_genes <- lrt_celltype_clinical_test[["cluster_data"]][["df"]][cluster_22_genes_idx, ]
cluster_22_genes
```

### Look at only the differential genes

A good suggestion from Theresa was to examine only the most variant
genes from failure vs. cure and see how they change the clustering/etc
results.  This is my attempt to address this query.

```{r small_explore}
hs_clinic_topn <- sm(extract_significant_genes(hs_clinic_table, n=100))
table <- "failure_vs_cure"
wanted <- rbind(hs_clinic_topn[["deseq"]][["ups"]][[table]],
                hs_clinic_topn[["deseq"]][["downs"]][[table]])

small_expt <- exclude_genes_expt(hs_clinical_nobiop, ids=rownames(wanted), method="keep")
small_norm <- sm(normalize_expt(small_expt, transform="log2", convert="cpm",
                                norm="quant", filter=TRUE))
plot_pca(small_norm)$plot

small_nb <- normalize_expt(small_expt, transform="log2", convert="cpm",
                           batch="svaseq", norm="quant", filter=TRUE)
plot_pca(small_nb)$plot
```

```{r clinical_plot}
## DESeq2 MA plot of failure / cure
hs_clinic_table[["plots"]][["failure_vs_cure"]][["deseq_ma_plots"]]$plot

## DESeq2 Volcano plot of failure / cure
hs_clinic_table[["plots"]][["failure_vs_cure"]][["deseq_vol_plots"]]$plot
```

### g:Profiler results using the significant up and down genes

```{r perform_gprofiler}
ups <- hs_clinic_sig[["deseq"]][["ups"]][[1]]
downs <- hs_clinic_sig[["deseq"]][["downs"]][[1]]

hs_clinic_gprofiler_ups <- simple_gprofiler(ups)
hs_clinic_gprofiler_ups[["pvalue_plots"]][["bpp_plot_over"]]
hs_clinic_gprofiler_ups[["pvalue_plots"]][["mfp_plot_over"]]
hs_clinic_gprofiler_ups[["pvalue_plots"]][["reactome_plot_over"]]

##hs_try2 <- simple_gprofiler2(ups)

hs_clinic_gprofiler_downs <- simple_gprofiler(downs)
hs_clinic_gprofiler_downs[["pvalue_plots"]][["bpp_plot_over"]]
hs_clinic_gprofiler_downs[["pvalue_plots"]][["mfp_plot_over"]]
hs_clinic_gprofiler_downs[["pvalue_plots"]][["reactome_plot_over"]]
```

## Perform GSVA on the clinical samples

```{r gsva, fig.show="hide"}
hs_celltype_gsva_c2 <- sm(simple_gsva(individual_celltypes))
hs_celltype_gsva_c2_sig <- sm(get_sig_gsva_categories(
    hs_celltype_gsva_c2,
    excel="excel/individual_celltypes_gsva_c2.xlsx"))

## broad_c7 <- GSEABase::getGmt("reference/msigdb/c7.all.v7.2.entrez.gmt",
##                              collectionType=GSEABase::BroadCollection(category="c7"),
##                              geneIdType=GSEABase::EntrezIdentifier())
## hs_celltype_gsva_c7 <- sm(simple_gsva(individual_celltypes, signatures=broad_c7,
##                                       msig_xml="reference/msigdb_v7.2.xml", cores=10))
hs_celltype_gsva_c7 <- simple_gsva(individual_celltypes,
                                   signatures="reference/msigdb/c7.all.v7.2.entrez.gmt",
                                   signature_category="c7",
                                   msig_xml="reference/msigdb_v7.2.xml",
                                   cores=10)

hs_celltype_gsva_c7_sig <- sm(get_sig_gsva_categories(
    hs_celltype_gsva_c7,
    excel="excel/individual_celltypes_gsva_c7.xlsx"))
```

### Print some plots of the GSVA outputs

```{r gsva_plots}
## The raw heatmap of the C2 values
hs_celltype_gsva_c2_sig[["raw_plot"]]
## The 'significance' scores of the C2 values
hs_celltype_gsva_c2_sig[["score_plot"]]
## The subset of scores for categories deemed significantly different.
hs_celltype_gsva_c2_sig[["subset_plot"]]

## The raw heatmap of the C7 values
hs_celltype_gsva_c7_sig[["raw_plot"]]
## The 'significance' scores of the C7 values
hs_celltype_gsva_c7_sig[["score_plot"]]
## The subset of scores for categories deemed significantly different.
hs_celltype_gsva_c7_sig[["subset_plot"]]
```

# Some queries from 20210816

Upon returning from Maine, I sat with Najib, Theresa, and Maria
Adelaida.  A whole host of potential ideas were considered; here are a
couple of them:

## Visualize variance in the TMRC3 data

My TODO text says: "Variance partition of TMRC3 samples with at least
time, cell-type, and cure/fail.  I mixed this idea with the visit 1 previously.

```{r clinical_varpart}
vp_factors <- c("condition", "batch", "visitnumber", "eb_lc_ulcera_area_1")

v1_vp <- simple_varpart(hs_clinical, factors=vp_factors, do_fit=TRUE)
pp(file="images/v1_vp.png", image=v1_vp$partition_plot)
```

## Visit 1 comparisons

Here is the note from my TODO list:  "Using v1, perform cell-type
comparisons/signatures as well as a cure/fail differential
expression".

Implicit in this query is the idea that we should do an explicit
visualization of the visit 1 samples without the subsequent visits.

```{r visit1_subset}
## For the moment, let us ignore the lost/NA samples
v1_expt <- subset_expt(hs_clinical, subset="visitnumber=='1'") %>%
  subset_expt(subset="condition=='cure'|condition=='failure'") %>%
  subset_expt(subset="batch!='biopsy'")

v1_norm <- normalize_expt(v1_expt, filter=TRUE, transform="log2", convert="cpm",
                          batch="svaseq")
plot_pca(v1_norm)$plot

v1_de <- all_pairwise(v1_expt, filter=TRUE, model_batch="svaseq")
v1_table <- combine_de_tables(v1_de, excel=glue::glue("excel/v1_tables-v{ver}.xlsx"))
```

### Visit 1 variance partition

It might be interesting to add 'parasitemappingrate' here but I have
not filled it in for all samples yet.

samplecollectiondate should be useful too but is not complete.

This is also a great place to pull in some of the clinical
information.  I need to ask Maria Adelaida for it...

```{r visit1_vp}
v1_vp <- simple_varpart(v1_expt, do_fit=TRUE)
v1_vp$partition_plot
## condition is cure/fail batch is cell type.
```

### Visit 1 separated by cell type

```{r visit1_celltypes}
v1_eo <- subset_expt(v1_expt, subset="batch=='eosinophils'")
v1_mono <- subset_expt(v1_expt, subset="batch=='monocytes'")
v1_neut <- subset_expt(v1_expt, subset="batch=='neutrophils'")

v1_eo_norm <- normalize_expt(v1_eo, transform="log2", convert="cpm",
                             filter=TRUE, norm="quant")
plot_pca(v1_eo_norm)$plot
v1_eo_nb <- normalize_expt(v1_eo, transform="log2", convert="cpm",
                           filter=TRUE, batch="svaseq")
plot_pca(v1_eo_nb)$plot

v1_mono_norm <- normalize_expt(v1_mono, transform="log2", convert="cpm",
                               filter=TRUE, norm="quant")
plot_pca(v1_mono_norm)$plot
v1_mono_nb <- normalize_expt(v1_mono, transform="log2", convert="cpm",
                             filter=TRUE, batch="svaseq")
plot_pca(v1_mono_nb)$plot

v1_neut_norm <- normalize_expt(v1_neut, transform="log2", convert="cpm",
                               filter=TRUE, norm="quant")
plot_pca(v1_neut_norm)$plot
v1_neut_nb <- normalize_expt(v1_neut, transform="log2", convert="cpm",
                             filter=TRUE, batch="svaseq")
plot_pca(v1_neut_nb)$plot
```

# Compare visits

Compare visits independent of cure/fail

```{r visits}
time_expt <- hs_clinical %>%
  set_expt_conditions(fact="visitnumber")
pData(time_expt)[["condition"]] <- paste0("v", pData(time_expt)[["condition"]])

time_norm <- normalize_expt(time_expt, filter=TRUE, norm="quant", convert="cpm",
                            transform="log2")
time_nb <- normalize_expt(time_expt, filter=TRUE, batch="svaseq", convert="cpm",
                          transform="log2")
plot_pca(time_nb)$plot

time_de <- all_pairwise(time_expt, filter=TRUE, model_batch="svaseq")
keepers <- list(
    "v2v1"=c("v2", "v1"),
    "v3v1"=c("v3", "v1"),
    "v3v2"=c("v3", "v2"))
time_tables <- combine_de_tables(time_de, keepers=keepers,
                                 excel=glue::glue("excel/compare_visits-v{ver}.xlsx"))
```

## Separate visits by cell type

Now let us consider the 3 visits from the lens of cell type.

```{r visit_cell_type}
time_type_factor <- paste0(pData(time_expt)[["condition"]], "_", pData(time_expt)[["batch"]])
time_type_expt <- set_expt_conditions(time_expt, fact=time_type_factor)

time_type_biopsy <- subset_expt(time_type_expt, subset="batch=='biopsy'")
time_type_neutrophil <- subset_expt(time_type_expt, subset="batch=='neutrophils'")
time_type_eosinophil <- subset_expt(time_type_expt, subset="batch=='eosinophils'")
time_type_monocyte <- subset_expt(time_type_expt, subset="batch=='monocytes'")
```

### Biopsy

```{r biopsy_time_type}
time_biopsy <- normalize_expt(time_type_biopsy, filter=TRUE, norm="quant", convert="cpm",
                              transform="log2")
time_biopsy_nb <- normalize_expt(time_type_biopsy, filter=TRUE, batch="svaseq", convert="cpm",
                                 transform="log2")
plot_pca(time_biopsy)$plot
plot_pca(time_biopsy_nb)$plot
time_biopsy_de <- all_pairwise(time_type_biopsy, model_batch="svaseq", filter=TRUE)
keepers <- list(
    "v2v1"=c("v2_biopsy", "v1_biopsy"),
    "v3v1"=c("v3_biopsy", "v1_biopsy"),
    "v3v2"=c("v3_biopsy", "v2_biopsy"))
time_biopsy_tables <- combine_de_tables(
    time_biopsy_de, keepers=keepers,
    excel=glue::glue("excel/time_biopsy_tables-v{ver}.xlsx"))
```

### Neutrophils

```{r neutrophil_time_type}
time_neutrophil <- normalize_expt(time_type_neutrophil, filter=TRUE, norm="quant", convert="cpm",
                              transform="log2")
time_neutrophil_nb <- normalize_expt(time_type_neutrophil, filter=TRUE, batch="svaseq", convert="cpm",
                                 transform="log2")
plot_pca(time_neutrophil)$plot
plot_pca(time_neutrophil_nb)$plot
time_neutrophil_de <- all_pairwise(time_type_neutrophil, model_batch="svaseq", filter=TRUE)
keepers <- list(
    "v2v1"=c("v2_neutrophils", "v1_neutrophils"),
    "v3v1"=c("v3_neutrophils", "v1_neutrophils"),
    "v3v2"=c("v3_neutrophils", "v2_neutrophils"))
time_neutrophil_tables <- combine_de_tables(
    time_neutrophil_de, keepers=keepers,
    excel=glue::glue("excel/time_neutrophil_tables-v{ver}.xlsx"))
```

### Eosinophils

```{r eosinophil_time_type}
time_eosinophil <- normalize_expt(time_type_eosinophil, filter=TRUE, norm="quant", convert="cpm",
                              transform="log2")
time_eosinophil_nb <- normalize_expt(time_type_eosinophil, filter=TRUE, batch="svaseq", convert="cpm",
                                 transform="log2")
plot_pca(time_eosinophil)$plot
plot_pca(time_eosinophil_nb)$plot
time_eosinophil_de <- all_pairwise(time_type_eosinophil, model_batch="svaseq", filter=TRUE)
keepers <- list(
    "v2v1"=c("v2_eosinophils", "v1_eosinophils"),
    "v3v1"=c("v3_eosinophils", "v1_eosinophils"),
    "v3v2"=c("v3_eosinophils", "v2_eosinophils"))
time_eosinophil_tables <- combine_de_tables(
    time_eosinophil_de, keepers=keepers,
    excel=glue::glue("excel/time_eosinophil_tables-v{ver}.xlsx"))
```

### Monocyte

```{r monocyte_time_type}
time_monocyte <- normalize_expt(time_type_monocyte, filter=TRUE, norm="quant", convert="cpm",
                              transform="log2")
time_monocyte_nb <- normalize_expt(time_type_monocyte, filter=TRUE, batch="svaseq", convert="cpm",
                                 transform="log2")
plot_pca(time_monocyte)$plot
plot_pca(time_monocyte_nb)$plot
time_monocyte_de <- all_pairwise(time_type_monocyte, model_batch="svaseq", filter=TRUE)
keepers <- list(
    "v2v1"=c("v2_monocytes", "v1_monocytes"),
    "v3v1"=c("v3_monocytes", "v1_monocytes"),
    "v3v2"=c("v3_monocytes", "v2_monocytes"))
time_monocyte_tables <- combine_de_tables(
    time_monocyte_de, keepers=keepers,
    excel=glue::glue("excel/time_monocyte_tables-v{ver}.xlsx"))
```

## Time and clinicaloutcome

```{r time_clinicaloutcome}
time_cf_factor <- paste0(pData(time_type_biopsy)[["condition"]], "_", pData(time_type_biopsy)[["clinicaloutcome"]])
biopsy_time_cf <- set_expt_conditions(time_type_biopsy, fact=time_cf_factor)

time_cf_biopsy <- normalize_expt(biopsy_time_cf, filter=TRUE, norm="quant", convert="cpm",
                                 transform="log2")
time_cf_biopsy_nb <- normalize_expt(biopsy_time_cf, filter=TRUE, batch="svaseq", convert="cpm",
                                    transform="log2")
plot_pca(time_cf_biopsy)$plot
plot_pca(time_cf_biopsy_nb)$plot

biopsy_time_de <- all_pairwise(time_type_biopsy, filter=TRUE, model_batch="svaseq")
biopsy_time_tables <- combine_de_tables(time_type_de, keepers=keepers,
                                        excel=glue::glue("excel/compare_biopsy_visits-v{ver}.xlsx"))

```
# Individual Cell types

The following blocks split the samples into a few groups by sample type and look
at the distributions between them.

## Implementation details

Get top/bottom n genes for each cell type, using clinical outcome as the factor of interest.
For the moment, use sva for the DE analysis.
Provide cpms for the top/bottom n genes.

Start with top/bottom 200.
Perform default logFC and p-value as well.

### Shared contrasts

Here is the contrast we will use throughput, I am leaving open the option to add more.

```{r keepers}
keepers <- list(
  "fail_vs_cure"=c("failure", "cure"))
```

## Monocytes

### Evaluate Monocyte samples

```{r monocytes}
mono <- subset_expt(hs_valid, subset="typeofcells=='monocytes'") %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="visitnumber") %>%
  set_expt_colors(colors=chosen_colors)
## FIXME set_expt_colors should speak up if there are mismatches here!!!

save_result <- save(mono, file="rda/monocyte_expt.rda")
mono_norm <- normalize_expt(mono, convert="cpm", filter=TRUE,
                            transform="log2", norm="quant")
plt <- plot_pca(mono_norm, plot_labels=FALSE)$plot
pp(file=glue("images/mono_pca_normalized-v{ver}.pdf"), image=plt)

mono_nb <- normalize_expt(mono, convert="cpm", filter=TRUE,
                          transform="log2", batch="svaseq")
plt <- plot_pca(mono_nb, plot_labels=FALSE)$plot
pp(file=glue("images/mono_pca_normalized_batch-v{ver}.pdf"), image=plt)
```

### DE of Monocyte samples

#### Without sva

```{r de_monocyte, fig.show="hide"}
mono_de <- sm(all_pairwise(mono, model_batch=FALSE, filter=TRUE))
mono_tables <- sm(combine_de_tables(
    mono_de, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_mono_vall_de_curevsfail_nobatch.xlsx")))

written <- write_xlsx(data=mono_tables[["data"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_table-v{ver}.xlsx"))
mono_sig <- sm(extract_significant_genes(mono_tables, according_to="deseq"))
written <- write_xlsx(data=mono_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_sigup-v{ver}.xlsx"))
written <- write_xlsx(data=mono_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_sigdown-v{ver}.xlsx"))

mono_pct_sig <- sm(extract_significant_genes(mono_tables, n=200,
                                             lfc=NULL, p=NULL, according_to="deseq"))
written <- write_xlsx(data=mono_pct_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_sigup_pct-v{ver}.xlsx"))
written <- write_xlsx(data=mono_pct_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/monocyte_clinical_sigdown_pct-v{ver}.xlsx"))
mono_sig$summary_df

## Print out a table of the cpm values for other explorations.
mono_cpm <- sm(normalize_expt(mono, convert="cpm"))
written <- write_xlsx(data=exprs(mono_cpm),
                      excel=glue::glue("excel/monocyte_cpm_before_batch-v{ver}.xlsx"))
mono_bcpm <- sm(normalize_expt(mono, filter=TRUE, convert="cpm", batch="svaseq"))
written <- write_xlsx(data=exprs(mono_bcpm),
                      excel=glue::glue("excel/monocyte_cpm_after_batch-v{ver}.xlsx"))
```

#### With sva

```{r de_mono_sva, fig.show="hide"}
mono_de_sva <- sm(all_pairwise(mono, model_batch="svaseq", filter=TRUE))
mono_sva_tables <- sm(combine_de_tables(
    mono_de_sva, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_mono_vall_de_curevsfail_svabatch.xlsx")))

mono_sig_sva <- sm(extract_significant_genes(
    mono_tables_sva,
    excel=glue::glue("excel/monocyte_clinical_sig_tables_sva-v{ver}.xlsx"),
    according_to="deseq"))
```

#### Monocyte DE plots

First print out the DE plots without and then with sva estimates.

```{r mono_de_plots}
## DESeq2 MA plot of failure / cure
mono_tables[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot

## DESeq2 Volcano plot of failure / cure
mono_tables[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot

## DESeq2 MA plot of failure / cure with svaseq
mono_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot

## DESeq2 Volcano plot of failure / cure with svaseq
mono_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
```

#### Monocyte ontology search

```{r mono_gprofiler}
ups <- mono_sig[["deseq"]][["ups"]][["fail_vs_cure"]]
downs <- mono_sig[["deseq"]][["downs"]][["fail_vs_cure"]]

mono_up_gp <- simple_gprofiler(sig_genes=ups)
mono_up_gp[["pvalue_plots"]][["bpp_plot_over"]]
mono_up_gp[["pvalue_plots"]][["mfp_plot_over"]]
mono_up_gp[["pvalue_plots"]][["reactome_plot_over"]]

mono_down_gp <- simple_gprofiler(sig_genes=downs)
mono_down_gp[["pvalue_plots"]][["bpp_plot_over"]]
mono_down_gp[["pvalue_plots"]][["mfp_plot_over"]]
mono_down_gp[["pvalue_plots"]][["reactome_plot_over"]]
```

#### Monocyte MSigDB query

```{r msig_mono_goseq, fig.show="hide"}
## broad_c7 <- GSEABase::getGmt("reference/msigdb/c7.all.v7.2.entrez.gmt",
##                              collectionType=GSEABase::BroadCollection(category="c7"),
##                              geneIdType=GSEABase::EntrezIdentifier())
broad_c7 <- load_gmt_signatures(signatures="reference/msigdb/c7.all.v7.2.entrez.gmt",
                                signature_category="c7")
mono_up_goseq_msig <- goseq_msigdb(sig_genes=ups, signatures=broad_c7,
                                   signature_category="c7", length_db=hs_length)
mono_down_goseq_msig <- goseq_msigdb(sig_genes=downs, signatures=broad_c7,
                                     signature_category="c7", length_db=hs_length)
```

#### Plot of similar experiments

```{r msig_plots}
## Monocyte genes with increased expression in the failed samples
## share genes with the following experiments
mono_up_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]

## Monocyte genes with increased expression in the cured samples
## share genes with the following experiments
mono_down_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]
```

### Evaluate Neutrophil samples

```{r neutrophils}
neut <- subset_expt(hs_valid, subset="typeofcells=='neutrophils'") %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="visitnumber") %>%
  set_expt_colors(colors=chosen_colors)

save_result <- save(neut, file="rda/neutrophil_expt.rda")

neut_norm <- sm(normalize_expt(neut, convert="cpm", filter=TRUE, transform="log2"))
plt <- plot_pca(neut_norm, plot_labels=FALSE)$plot
pp(file=glue("images/neut_pca_normalized-v{ver}.pdf"), image=plt)

neut_nb <- sm(normalize_expt(neut, convert="cpm", filter=TRUE,
                             transform="log2", batch="svaseq"))
plt <- plot_pca(neut_nb, plot_labels=FALSE)$plot
pp(file=glue("images/neut_pca_normalized_svaseq-v{ver}.pdf"), image=plt)
```

### DE of Netrophil samples

#### Without sva

```{r neutrophil_de, fig.show="hide"}
neut_de <- sm(all_pairwise(neut, model_batch=FALSE, filter=TRUE))
neut_tables <- sm(combine_de_tables(
    neut_de, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_neut_vall_de_curevsfail_nobatch.xlsx")))

written <- write_xlsx(data=neut_tables[["data"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_table-v{ver}.xlsx"))
neut_sig <- sm(extract_significant_genes(neut_tables, according_to="deseq"))
written <- write_xlsx(data=neut_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_sigup-v{ver}.xlsx"))
written <- write_xlsx(data=neut_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_sigdown-v{ver}.xlsx"))

neut_pct_sig <- sm(extract_significant_genes(neut_tables, n=200, lfc=NULL,
                                             p=NULL, according_to="deseq"))
written <- write_xlsx(data=neut_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_sigup_pct-v{ver}.xlsx"))
written <- write_xlsx(data=neut_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/neutrophil_clinical_sigdown_pct-v{ver}.xlsx"))
neut_cpm <- sm(normalize_expt(neut, convert="cpm"))
written <- write_xlsx(data=exprs(neut_cpm),
                      excel=glue::glue("excel/neutrophil_cpm_before_batch-v{ver}.xlsx"))
neut_bcpm <- sm(normalize_expt(neut, filter=TRUE, batch="svaseq", convert="cpm"))
written <- write_xlsx(data=exprs(neut_bcpm),
                      excel=glue::glue("excel/neutrophil_cpm_after_batch-v{ver}.xlsx"))
```

#### With sva

```{r de_neut_sva, fig.show="hide"}
neut_de_sva <- sm(all_pairwise(neut, model_batch="svaseq", filter=TRUE))
neut_tables_sva <- sm(combine_de_tables(
    neut_de_sva, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_neut_vall_de_curevsfail_svabatch.xlsx")))

neut_sig_sva <- sm(extract_significant_genes(
    neut_tables_sva,
    excel=glue::glue("excel/neutrophil_clinical_sig_tables_sva-v{ver}.xlsx"),
    according_to="deseq"))
```

#### Neutrophil DE plots

```{r neut_de_plots}
## DESeq2 MA plot of failure / cure
neut_tables[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot

## DESeq2 Volcano plot of failure / cure
neut_tables[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot

## DESeq2 MA plot of failure / cure with sva
neut_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot

## DESeq2 Volcano plot of failure / cure with sva
neut_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
```

#### Neutrophil ontology search

```{r neut_gp}
ups <- neut_sig[["deseq"]][["ups"]][["fail_vs_cure"]]
downs <- neut_sig[["deseq"]][["downs"]][["fail_vs_cure"]]

neut_up_gp <- simple_gprofiler(sig_genes=ups)
neut_up_gp[["pvalue_plots"]][["bpp_plot_over"]]
neut_up_gp[["pvalue_plots"]][["mfp_plot_over"]]
neut_up_gp[["pvalue_plots"]][["reactome_plot_over"]]

neut_down_gp <- simple_gprofiler(downs)
neut_down_gp[["pvalue_plots"]][["bpp_plot_over"]]
neut_down_gp[["pvalue_plots"]][["mfp_plot_over"]]
neut_down_gp[["pvalue_plots"]][["reactome_plot_over"]]
```

#### Neutrophil GSVA query

```{r msig_neut_goseq, fig.show="hide"}
neut_up_goseq_msig <- goseq_msigdb(sig_genes=ups, signatures=broad_c7,
                                   signature_category="c7", length_db=hs_length)

neut_down_goseq_msig <- goseq_msigdb(sig_genes=downs, signatures=broad_c7,
                                     signature_category="c7", length_db=hs_length)
```

#### Plot of similar experiments

```{r msig_plots_neut}
## Neutrophil genes with increased expression in the failed samples
## share genes with the following experiments
neut_up_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]

## Neutrophil genes with increased expression in the cured samples
## share genes with the following experiments
neut_down_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]
```

## Eosinophils

### Evaluate Eosinophil samples

```{r eosinophils}
eo <- subset_expt(hs_valid, subset="typeofcells=='eosinophils'") %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="visitnumber") %>%
  set_expt_colors(colors=chosen_colors)

save_result <- save(eo, file="rda/eosinophil_expt.rda")
eo_norm <- sm(normalize_expt(eo, convert="cpm", transform="log2",
                             norm="quant", filter=TRUE))
plt <- plot_pca(eo_norm, plot_labels=FALSE)$plot
pp(file=glue("images/eo_pca_normalized-v{ver}.pdf"), image=plt)

eo_nb <- sm(normalize_expt(eo, convert="cpm", transform="log2",
                           filter=TRUE, batch="svaseq"))
plt <- plot_pca(eo_nb, plot_labels=FALSE)$plot
plt
```

### DE of Eosinophil samples

#### Withouth sva

```{r eosinophil_de, fig.show="hide"}
eo_de <- sm(all_pairwise(eo, model_batch=FALSE, filter=TRUE))
eo_tables <- sm(combine_de_tables(
    eo_de, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_eo_vall_de_curevsfail_nobatch.xlsx")))

written <- write_xlsx(data=eo_tables[["data"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_table-v{ver}.xlsx"))
eo_sig <- sm(extract_significant_genes(eo_tables, according_to="deseq"))
written <- write_xlsx(data=eo_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_sigup-v{ver}.xlsx"))
written <- write_xlsx(data=eo_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_sigdown-v{ver}.xlsx"))

eo_pct_sig <- sm(extract_significant_genes(eo_tables, n=200,
                                           lfc=NULL, p=NULL, according_to="deseq"))
written <- write_xlsx(data=eo_pct_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_sigup_pct-v{ver}.xlsx"))
written <- write_xlsx(data=eo_pct_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/eosinophil_clinical_sigdown_pct-v{ver}.xlsx"))

eo_cpm <- sm(normalize_expt(eo, convert="cpm"))
written <- write_xlsx(data=exprs(eo_cpm),
                      excel=glue::glue("excel/eosinophil_cpm_before_batch-v{ver}.xlsx"))
eo_bcpm <- sm(normalize_expt(eo, filter=TRUE, batch="svaseq", convert="cpm"))
written <- write_xlsx(data=exprs(eo_bcpm),
                      excel=glue::glue("excel/eosinophil_cpm_after_batch-v{ver}.xlsx"))
```

#### With sva

```{r de_eo_sva, fig.show="hide"}
eo_de_sva <- sm(all_pairwise(eo, model_batch="svaseq", filter=TRUE))
eo_tables_sva <- sm(combine_de_tables(
    eo_de_sva, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_eo_vall_de_curevsfail_svabatch.xlsx")))

eo_sig_sva <- sm(extract_significant_genes(
    eo_tables_sva,
    excel=glue::glue("excel/eosinophil_clinical_sig_tables_sva-v{ver}.xlsx"),
    according_to="deseq"))
```

#### Eosinophil DE plots

```{r eo_de_plots}
## DESeq2 MA plot of failure / cure
eo_tables[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot

## DESeq2 Volcano plot of failure / cure
eo_tables[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot

## DESeq2 MA plot of failure / cure with sva
eo_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot

## DESeq2 Volcano plot of failure / cure with sva
eo_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
```

#### Eosinophil ontology search

```{r eo_gprofiler}
ups <- eo_sig[["deseq"]][["ups"]][["fail_vs_cure"]]
downs <- eo_sig[["deseq"]][["downs"]][["fail_vs_cure"]]

eo_up_gp <- simple_gprofiler(sig_genes=ups)
eo_up_gp[["pvalue_plots"]][["bpp_plot_over"]]
eo_up_gp[["pvalue_plots"]][["mfp_plot_over"]]
eo_up_gp[["pvalue_plots"]][["reactome_plot_over"]]

eo_down_gp <- simple_gprofiler(downs)
eo_down_gp[["pvalue_plots"]][["bpp_plot_over"]]
eo_down_gp[["pvalue_plots"]][["mfp_plot_over"]]
eo_down_gp[["pvalue_plots"]][["reactome_plot_over"]]
```

#### Eosinophil MSigDB query

```{r msig_eo_goseq, fig.show="hide"}
eo_up_goseq_msig <- goseq_msigdb(sig_genes=ups, signatures=broad_c7,
                                 signature_category="c7", length_db=hs_length)

eo_down_goseq_msig <- goseq_msigdb(sig_genes=downs, signatures=broad_c7,
                                   signature_category="c7", length_db=hs_length)
```

#### Plot of similar experiments

```{r msig_plots_eo}
## Eosinophil genes with increased expression in the failed samples
## share genes with the following experiments
eo_up_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]

## Eosinophil genes with increased expression in the cured samples
## share genes with the following experiments
eo_down_goseq_msig[["pvalue_plots"]][["mfp_plot_over"]]
```

## Biopsies

### Evaluate Biopsy samples

```{r biopsies}
biop <- subset_expt(hs_valid, subset="typeofcells=='biopsy'") %>%
  set_expt_conditions(fact="clinicaloutcome") %>%
  set_expt_batches(fact="visitnumber") %>%
  set_expt_colors(colors=chosen_colors)

save_result <- save(biop, file="rda/biopsy_expt.rda")
biop_norm <- normalize_expt(biop, filter=TRUE, convert="cpm",
                            transform="log2", norm="quant")
plt <- plot_pca(biop_norm, plot_labels=FALSE)$plot
pp(file=glue("images/biop_pca_normalized-v{ver}.pdf"), image=plt)

biop_nb <- sm(normalize_expt(biop, convert="cpm", filter=TRUE,
                             transform="log2", batch="svaseq"))
plt <- plot_pca(biop_nb, plot_labels=FALSE)$plot
pp(file=glue("images/biop_pca_normalized_svaseq-v{ver}.pdf"), image=plt)
```

### DE of Biopsy samples

#### Without sva

```{r de_biopsy, fig.show="hide"}
biop_de <- sm(all_pairwise(biop, model_batch=FALSE, filter=TRUE))
biop_tables <- combine_de_tables(
    biop_de, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_biopsy_vall_de_curevsfail_nobatch.xlsx")))

written <- write_xlsx(data=biop_tables[["data"]][[1]],
                      excel=glue::glue("excel/biopsy_clinical_table-v{ver}.xlsx"))
biop_sig <- extract_significant_genes(biop_tables, according_to="deseq")
##written <- write_xlsx(data=biop_sig[["deseq"]][["ups"]][[1]],
##                      excel=glue::glue("excel/biopsy_clinical_sigup-v{ver}.xlsx"))
written <- write_xlsx(data=biop_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/biopsy_clinical_sigdown-v{ver}.xlsx"))
biop_pct_sig <- extract_significant_genes(biop_tables, n=200, lfc=NULL, p=NULL, according_to="deseq")
written <- write_xlsx(data=biop_pct_sig[["deseq"]][["ups"]][[1]],
                      excel=glue::glue("excel/biopsy_clinical_sigup_pct-v{ver}.xlsx"))
written <- write_xlsx(data=biop_pct_sig[["deseq"]][["downs"]][[1]],
                      excel=glue::glue("excel/biopsy_clinical_sigdown_pct-v{ver}.xlsx"))

biop_cpm <- sm(normalize_expt(biop, convert="cpm"))
written <- write_xlsx(data=exprs(biop_cpm),
                      excel=glue::glue("excel/biopsy_cpm_before_batch-v{ver}.xlsx"))
biop_bcpm <- sm(normalize_expt(biop, filter=TRUE, batch="svaseq", convert="cpm"))
written <- write_xlsx(data=exprs(biop_bcpm),
                      excel=glue::glue("excel/biopsy_cpm_after_batch-v{ver}.xlsx"))
```

#### with sva

```{r de_biopsy_sva, fig.show="hide"}
biop_de_sva <- sm(all_pairwise(biop, model_batch="svaseq", filter=TRUE))
biop_tables_sva <- sm(combine_de_tables(
    biop_de_sva, keepers=keepers,
    excel=glue::glue("upload/DE_Cure_vs_Fail/{ver}_biopsy_vall_de_curevsfail_svabatch.xlsx")))

biop_sig_sva <- sm(extract_significant_genes(
    biop_tables_sva,
    excel=glue::glue("excel/biopsy_clinical_sig_tables_sva-v{ver}.xlsx"),
    according_to="deseq"))
```

#### Biopsy DE plots

```{r biop_de_plots}
## DESeq2 MA plot of failure / cure
biop_tables[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot

## DESeq2 Volcano plot of failure / cure
biop_tables[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot

## DESeq2 MA plot of failure / cure
biop_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]$plot

## DESeq2 Volcano plot of failure / cure
biop_tables_sva[["plots"]][["fail_vs_cure"]][["deseq_vol_plots"]]$plot
```

# Look for shared genes among Monocytes/Neutrophils/Eosinophils

We have three variables containing the 'significant' DE genes for the
three cell types.  For this I am choosing (for the moment) to use the
sva data.

```{r shared_by_type}
## mono_sig_sva, neut_sig_sva, eo_sig_sva
sig_vectors <- list(
    "monocytes"=c(rownames(mono_sig_sva[["deseq"]][["ups"]][["fail_vs_cure"]]),
                    rownames(mono_sig_sva[["deseq"]][["downs"]][["fail_vs_cure"]])),
    "neutrophils"=c(rownames(neut_sig_sva[["deseq"]][["ups"]][["fail_vs_cure"]]),
                      rownames(neut_sig_sva[["deseq"]][["downs"]][["fail_vs_cure"]])),
    "eosinophils"= c(rownames(eo_sig_sva[["deseq"]][["ups"]][["fail_vs_cure"]]),
                       rownames(eo_sig_sva[["deseq"]][["downs"]][["fail_vs_cure"]])))

shared_vector <- Vennerable::Venn(Sets=sig_vectors)
Vennerable::plot(shared_vector, doWeights=FALSE)

shared_ids <- shared_vector@IntersectionSets[["111"]]
shared_expt <- exclude_genes_expt(hs_clinical, ids=shared_ids, method="keep")
shared_written <- write_expt(shared_expt,
                             excel=glue::glue("excel/shared_across_celltypes-v{ver}.xlsx"))
```

# Monocytes by visit

 1. Can you please share with us a PCA (SVA and non-SVA) of the
    monocytes of the TMRC3 project, but labeling them based on the visit
    (V1, V2, V3)?
 2. Can you please share DE lists of V1 vs V2, V1 vs V3, V1 vs. V2+V3
    and V2 vs V3?

```{r monocytes_by_visit}
visit_colors <- chosen_colors <- c("#D95F02", "#7570B3", "#1B9E77")
names(visit_colors) <- c(1, 2, 3)
mono_visit <- subset_expt(hs_valid, subset="typeofcells=='monocytes'") %>%
  set_expt_conditions(fact="visitnumber") %>%
  set_expt_batches(fact="clinicaloutcome") %>%
  set_expt_colors(colors=chosen_colors)

mono_visit_norm <- normalize_expt(mono_visit, filter=TRUE, norm="quant", convert="cpm",
                                  transform="log2")
mono_visit_pca <- plot_pca(mono_visit_norm)
pp(file="images/monocyte_by_visit.png", image=mono_visit_pca$plot)

mono_visit_nb <- normalize_expt(mono_visit, filter=TRUE, convert="cpm",
                                batch="svaseq", transform="log2")
mono_visit_nb_pca <- plot_pca(mono_visit_nb)
pp(file="images/monocyte_by_visit_nb.png", image=mono_visit_nb_pca$plot)

table(pData(mono_visit_norm)$batch)
```

```{r mono_visit_de, fig.show="hide"}
keepers <- list(
    "second_vs_first"=c("c2", "c1"),
    "third_vs_second"=c("c3", "c2"),
    "third_vs_first"=c("c3", "c1"))
mono_visit_de <- all_pairwise(mono_visit, model_batch="svaseq", filter=TRUE)

mono_visit_tables <- combine_de_tables(
    mono_visit_de,
    keepers=keepers,
    excel=glue::glue("excel/mono_visit_tables-v{ver}.xlsx"))
```

```{r v1_vs_all}
new_factor <- as.character(pData(mono_visit)[["visitnumber"]])
not_one_idx <- new_factor != 1
new_factor[not_one_idx] <- "not_1"
mono_one_vs <- set_expt_conditions(mono_visit, new_factor)

mono_one_vs_de <- all_pairwise(mono_one_vs, model_batch="svaseq", filter=TRUE)

mono_one_vs_tables <- combine_de_tables(
    mono_one_vs_de,
    excel=glue::glue("excel/mono_one_vs_tables-v{ver}.xlsx"))
```

# Test TSP

In writing the following, I quickly realized that tspair was not
joking when it said it is intended for small numbers of genes.  For a
full expressionset of human data it is struggling.  I like the idea,
it may prove worth while to spend some time optimizing the package so
that it is more usable.

```{r tsp, eval=FALSE}
expt <- hs_clinical_nobiop

simple_tsp <- function(expt, column="condition") {
  facts <- levels(as.factor(pData(expt)[[column]]))
  retlist <- list()
  if (length(facts) < 2) {
    stop("This requires factors with at least 2 levels.")
  } else if (length(facts) == 2) {
    retlist <- simple_tsp_pair(expt, column=column)
  } else {
    for (first in 1:(length(facts) - 1)) {
      for (second in 2:(length(facts))) {
        if (first < second) {
          name <- glue::glue("{facts[first]}_vs_{facts[second]}")
          message("Starting ", name, ".")
          substring <- glue::glue("{column}=='{facts[first]}'|{column}=='{facts[second]}'")
          subby <- subset_expt(expt, subset=as.character(substring))
          retlist[[name]] <- simple_tsp_pair(subby, column=column)
        }
      }
    }
  }
}

simple_tsp_pair <- function(subby, column="condition", repetitions=50) {
  tsp_input <- subby[["expressionset"]]
  tsp_output <- tspcalc(tsp_input, column)
  tsp_scores <- tspsig(tsp_input, column, B=repetitions)
}

tsp1 <- tspcalc(tsp_input, "condition")

```

```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  message(paste0("Saving to ", savefile))
  tmp <- sm(saveme(filename=savefile))
}
```

```{r loadme_after, eval=FALSE}
tmp <- loadme(filename=savefile)
```
