1 Estimates!

## The scale difference between the smallest and largest
## libraries is > 10. Assuming a log10 scale is better, set scale=FALSE if not.

## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(simple(data))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: simple
## Removing 212 low-count genes (19417 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 6752 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.

## plot labels was not set and there are more than 100 samples, disabling it.
## Not putting labels on the plot.

## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(quant(simple(data)))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Warning in normalize_expt(hs_expt, transform = "log2", convert = "cpm", :
## Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: simple
## Removing 212 low-count genes (19417 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 6752 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self:  If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 1022588 entries are x>1: 51.6%.
## batch_counts: Before batch/surrogate estimation, 6752 entries are x==0: 0.341%.
## batch_counts: Before batch/surrogate estimation, 951194 entries are 0<x<1: 48.0%.
## The be method chose 10 surrogate variable(s).
## Attempting svaseq estimation with 10 surrogates.
## There are 30781 (1.55%) elements which are < 0 after batch correction.
## plot labels was not set and there are more than 100 samples, disabling it.
## Not putting labels on the plot.

## This function will replace the expt$expressionset slot with:
## log2(svaseq(quant(simple(data))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data unconverted.  It is often advisable to cpm/rpkm
##  the data to normalize for sampling differences, keep in mind though that rpkm
##  has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
##  will try to detect this).
## Warning in normalize_expt(hs_expt, transform = "log2", norm = "quant",
## filter = "simple", : Quantile normalization and sva do not always play well
## together.
## Step 1: performing count filter with option: simple
## Removing 212 low-count genes (19417 remaining).
## Step 2: normalizing the data with quant.
## Step 3: not converting the data.
## Step 4: transforming the data with log2.
## transform_counts: Found 6752 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self:  If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 1473620 entries are x>1: 74.4%.
## batch_counts: Before batch/surrogate estimation, 6752 entries are x==0: 0.341%.
## batch_counts: Before batch/surrogate estimation, 500162 entries are 0<x<1: 25.3%.
## The be method chose 11 surrogate variable(s).
## Attempting svaseq estimation with 11 surrogates.
## There are 4732 (0.239%) elements which are < 0 after batch correction.
## plot labels was not set and there are more than 100 samples, disabling it.
## Not putting labels on the plot.

2 Add our data

Najib asked about adding the various data provided by our work. The expressionset which contains this information live in ‘../multiple_leishmania_2018’, more explicitly, the expressionset may be loaded via Hs_M0Lm4h.rda

## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (19629 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 9876 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## plot labels was not set and there are more than 100 samples, disabling it.
## Not putting labels on the plot.
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(quant(cbcb(data)))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Warning in normalize_expt(all_expt, filter = TRUE, norm = "quant", convert
## = "cpm", : Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (19629 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 9876 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self:  If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 4233025 entries are x>1: 58.4%.
## batch_counts: Before batch/surrogate estimation, 9876 entries are x==0: 0.136%.
## batch_counts: Before batch/surrogate estimation, 3000200 entries are 0<x<1: 41.4%.
## The be method chose 25 surrogate variable(s).
## Attempting svaseq estimation with 25 surrogates.
## There are 211468 (2.92%) elements which are < 0 after batch correction.
## plot labels was not set and there are more than 100 samples, disabling it.
## Not putting labels on the plot.

3 Break apart the expt

Najib wants to set the colors/sizes/shapes for the set of all samples, including our data from previous work . In order to do that, I am going to need to break apart the expt into its component pieces, edit the experimental design in order to match it with his query, and recreate the expressionset.

So, lets do it.

I edited the metadata sheet, filling in numeric values for time, (partially) standardizing the host cell types, and filling in the infection state. Oh! But I think a bunch of the ones which currently say uninfected are actually stimulated! Lets go back through and check that before moving forward.

## Reading the sample metadata.
## The sample definitions comprises: 369 rows(samples) and 61 columns(metadata fields).
## Matched 19629 annotations and counts.
## Bringing together the count matrix and gene information.
## The final expressionset has 19629 rows and 369 columns.

3.1 Play with the re-created expressionset.

## There were 369, now there are 296 samples.
## There were 296, now there are 271 samples.
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(quant(cbcb(data)))))
## It will save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep libsizes in mind
##  when invoking limma.  The appropriate libsize is non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Warning in normalize_expt(newer_expt, filter = "cbcb", convert = "cpm", :
## Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (19629 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 6269 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self:  If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 2925747 entries are x>1: 55.0%.
## batch_counts: Before batch/surrogate estimation, 6269 entries are x==0: 0.118%.
## batch_counts: Before batch/surrogate estimation, 2387443 entries are 0<x<1: 44.9%.
## The be method chose 45 surrogate variable(s).
## Attempting svaseq estimation with 45 surrogates.
## There are 180019 (3.38%) elements which are < 0 after batch correction.
## plot labels was not set and there are more than 100 samples, disabling it.
## Not putting labels on the plot.

---
title: "Downloaded data sets, sample estimation."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
</style>

```{r options, include=FALSE}
library("hpgltools")
tt <- devtools::load_all("/data/hpgltools")
knitr::opts_knit$set(width=120,
                     progress=TRUE,
                     verbose=TRUE,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
rundate <- format(Sys.Date(), format="%Y%m%d")

ver <- "20190701"
previous_file <- paste0("01_annotation_", ver, ".Rmd")

tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "02_estimation_20190701.Rmd"
```

# Estimates!

```{r estimate}
hs_expt <- set_expt_conditions(hs_expt, fact="infectstate")
hs_expt <- set_expt_batches(hs_expt, fact="studypmid")

plot_libsize(hs_expt)$plot
hs_norm <- normalize_expt(hs_expt, transform="log2", convert="cpm",
                          norm="quant", filter="simple")
hs_cor <- plot_corheat(hs_norm)
hs_pca <- plot_pca(hs_norm)
hs_pca$plot

hs_nb <- normalize_expt(hs_expt, transform="log2", convert="cpm",
                          norm="quant", filter="simple", batch="svaseq")

hs_nb_pca <- plot_pca(hs_nb)
hs_nb_pca$plot

hs_expt <- set_expt_conditions(hs_expt, fact="expttime")

hs_norm <- normalize_expt(hs_expt, transform="log2",
                          norm="quant", filter="simple", batch="svaseq")
hs_pca <- plot_pca(hs_norm)
hs_pca$plot
```

# Add our data

Najib asked about adding the various data provided by our work.  The
expressionset which contains this information live in
'../multiple_leishmania_2018', more explicitly, the expressionset may be loaded
via Hs_M0Lm4h.rda

```{r add_our_data}
load("../multiple_leishmania_2018/Hs_M0Lm4h.rda")

all_expt <- combine_expts(hs_expt, expt, merge_meta=TRUE)

all_expt <- set_expt_conditions(all_expt, fact="infectstate")
all_n <- normalize_expt(all_expt, filter=TRUE, norm="quant", convert="cpm",
                        transform="log2")

##test_pca <- pca_information(all_expt, num_pcs=4, plot_pcas=TRUE,
##                            expt_factors=c("condition", "batch", "hostcellsource",
##                                           "pathogenspecies", "expttime"))

plot_pca(all_n)$plot

all_nb <- normalize_expt(all_expt, filter=TRUE, norm="quant", convert="cpm",
                           transform="log2", batch="svaseq")
all_pca <- plot_pca(all_nb)
all_pca$plot
```

# Break apart the expt

Najib wants to set the colors/sizes/shapes for the set of all samples, including
our data from previous work .  In order to do that, I am going to need to break
apart the expt into its component pieces, edit the experimental design in order
to match it with his query, and recreate the expressionset.

So, lets do it.

```{r break_expt_setup}
## This should really only be run manually, because I will be using this
## to recreate the full set of experimental design across multiple revisions of
## my metadata.
version_id <- "201907"
count_sheet <- glue::glue("sample_sheets/all_counts_{version_id}.xlsx")
all_counts <- as.data.frame(exprs(all_expt))
annot_sheet <- glue::glue("sample_sheets/all_annot_{version_id}.xlsx")
all_annot <- as.data.frame(fData(all_expt))
meta_sheet <- glue::glue("sample_sheets/all_meta_{version_id}.xlsx")
all_meta <- as.data.frame(pData(all_expt))
```

```{r break_expt, eval=FALSE}
tt <- write_xls(data=all_meta, excel=meta_sheet)
tt <- write_xls(data=all_annot, excel=annot_sheet)
tt <- write_xls(data=all_counts, excel=count_sheet)
```

I edited the metadata sheet, filling in numeric values for time, (partially)
standardizing the host cell types, and filling in the infection state.  Oh!  But
I think a bunch of the ones which currently say uninfected are actually
stimulated! Lets go back through and check that before moving forward.

```{r replot}
new_counts <- openxlsx::readWorkbook(xlsxFile=count_sheet)
rownames(new_counts) <- new_counts[["row.names"]]
new_counts[["row.names"]] <- NULL
colnames(new_counts) <- gsub(pattern="^(.*)_.*$", replacement="\\1", x=colnames(new_counts))
new_meta <- openxlsx::readWorkbook(xlsxFile=meta_sheet)
rownames(new_meta) <- new_meta[["row.names"]]
new_meta[["row.names"]] <- NULL
colnames(new_meta) <- gsub(pattern="^(.*)_.*$", replacement="\\1", x=colnames(new_meta))
new_annot <- openxlsx::readWorkbook(xlsxFile=annot_sheet)
rownames(new_annot) <- new_annot[["row.names"]]
new_annot[["row.names"]] <- NULL
colnames(new_annot) <- gsub(pattern="^(.*)_.*$", replacement="\\1", x=colnames(new_annot))

new_expt <- create_expt(metadata=new_meta,
                        count_dataframe=new_counts,
                        gene_info=new_annot)
```

## Play with the re-created expressionset.

```{r play}
newer_expt <- subset_expt(new_expt, subset="timehours!='undef'")
newer_expt <- subset_expt(newer_expt, subset="hostcellsource!='skin'")
newer_expt <- set_expt_conditions(newer_expt, fact="pathogenspecies")
newer_expt <- set_expt_batches(newer_expt, fact="hostcellsource")
newer_norm <- normalize_expt(newer_expt, filter="cbcb", convert="cpm",
                             transform="log2", norm="quant", batch="svaseq")

newer_pc <- plot_pca(newer_norm, size_column="timehours", size_order=c(0,4,24,50,120))
newer_pc$plot
```

```{r saveme, eval=FALSE}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
```

```{r loadme, eval=FALSE, include=FALSE}
tmp <- loadme(filename=this_save)
```
