These documents are a series of Leishmania panamensis RNASeq analyses.

This data is in fact in 4 sub-experiments. Lets address them one at a time.

1 TODO

1.1 2017-08-24

  1. Define analyses.
  2. Set up supplementals.
    1. Same panels as per Santuza’s paper:
    • Coverage
    • Boxplot
    • correlation
    • PCA
    • distance
    • SMC

1.2 2017-07-27

  1. Find overlaps between the three patients and the groups of ch/un + sh/un

1.3 2016-11-29

  • Do the following:
    1. Limma with batch in model, report # genes fc>0, p<=0.05; fc<0, p<=0.05; fc>1.5, p<=0.05; fc<-1.5, p<=0.05
    2. Limma with combat, report # genes fc>0, p<=0.05; fc<0, p<=0.05, etc.
    3. Limma without b, report same
    4. Limma without b+sva, report same
  • Take the 4 results and compare them:
    1. Correlate various fold changes
    2. Find # genes of 1 in 2, 1 in 3, 1 in 4, 2 in 3, 2 in 4, 3 in 4.

1.4 2016-11-03

Do in lexical order, send the first 4 by email asap.

  • For macrophage, remove batch b and re-perform metrics.
  • For macrophage, look at pca information after adjustment with combat.
  • Compare de lists with and without batch b (using combat for batch b containing work)
  • PBMC: variance partition human, “Are there any genes with variance we can ascribe to condition?” That set of genes is the most interesting. – use the variance parition table and pull the top 200. Report back the % variance by condition for all of these genes.

  • Simplify snp plot for keith.
  • Make separate snp plot of only macrophage and only infection.

1.5 2016-09-23

  1. Add to existing vcfutils pipeline a check that all samples have sufficient coverage. Rsamtools, calculate coverage on all positions stored as gRanges[]. Cross reference these against each snp position and percentage.
    1. For each existing snp position, make a data structure including the coverage and percentage for all samples
    2. Given this data structure, keep only those rows with sufficient coverage across all samples. (If any sample has < n reads at position x && at no one sample has it > 0.95 then remove the row)
    3. The remaining positions should all have 0<=x<=1 percentage. Use them for the following analyses.
  2. Subset the full data set into a separate plot for each experiment. Plot these heatmaps accordingly.
    1. If there are interesting clades within these subsets, reuse them for variancePartition/models.
  3. Assuming the same result applies, use PCA or something to query self/chronic for all samples and see if the funny blue clade is split or not.

1.6 2016-09-15

  1. (2)Re-perform vcfutils pipeline and have it include in its summary a number of coverage, then when merging keep only those positions which have coverage > x for all samples. (done except I didn’t do coverage > x for all samples properly, but the results I think make it a moot point)
  2. Perform with a set of monocladous samples
  3. (4)Perform an explicit set of mappings of only the non-human reads and collect mapping stats for keith ;p
  4. (1)Make 100+% certain that all violin plots are on human data (I think this is true now, done by performing plots of both for all).
  5. (3)Conversely/simultaneously, query the snp profiles (focus on hpgl0635) to seek snps which define: (Goal is to find positions driving the clustering observed)
  6. Make a heatmap of snps in the columns and strains in rows and use the % to score. How to choose n out of 92k snps
  1. For each snp, use snp3 clade as outcome and perform an f-test for each snp and extract the highest n f-statistic snps, then plot the heatmap at i.
  2. Compare clustering of gene expression vs. SNP genotype.
  3. Measure variance explained by SNP vs. variance explained by strain by doing varPartition of:
  4. donor + strain
  5. donor + snp(3/4/5) and snpclade
  6. Assuming this works out, fit a model without strain, but instead snp.

1.7 Misc

  1. VCF tools in bioconductor (I am inclined to suggest that these tools sort of suck)
  2. Exclude <0.9 percentages from: (done – there is no significant difference between the two, the correlations/distances are nearly identical.) DO ME FIRST: newb.
  3. Step1: Perform varpart and see that variance by snp > variance by strain
  4. Perform this on the human data and not parasite
  1. Perform this for both snp3 and snp5 clades
  2. Make sure to plot donor+strain+healing and donor+snp+healing for snp3/snp5
  3. Assuming snp3 is appropriate, fit a model of y=donor+snp3+healing
  1. Assuming it functions properly, we have a set of genes with differential expression taking into account both donor and clustering.
  1. Bioconductor package ‘variancePartition’
  1. (1|factor) are random effects ergo; model <- as.formula(“~ (1|donor) + (1|strain) + (1|batch)”)
  2. Make violin plot of this result, plot residuals vs. healing,
  3. Repeat above with introduced + (1|healing)
  1. Copy chr/sh pca plot down to the relabel one sample section to make it easier to follow
  2. Figure out a way to automatically compare every sample against the SNP profiles generated for all strains.
  1. Record a set of canonical snps for each strain
  2. For each sample, use something like bam_whateveritwas to extract those regions
  3. For each snp profile, count the identities/differences for every snp position and heatmap them?
  4. Then if they match it should come up the terminal color I think
  1. Try the same PCA plots with just dropping hpgl0635
  2. Print full set of plots for the parasite samples before getting into the PCA.

1.8 Macrophage tasks

  1. Re-perform DE analyses with no-sva/ruv but condition+batch (done)
  2. Collect IGV images for strains ch/sh (done)
  3. For SVs, plot SV vs. technical variables: %rna, snps (done)
  4. Take the % of reads mapped to parasite and add it as an experimental parameter (done)
  5. Extend SV analysis to include condition + batch (not done)
  6. For the SV methods, plot log2 genes vs. log2fc (?)
  7. Run RUV/SVA condition + nobatch (done)
  1. Take SVs and add to the model
  2. Perform DE
  3. Plot MA, correlate log2FCs across analyses
  1. Provide a PCA plot of our final decision for this data and provide a table of coordinates
  2. Provide DE lists with condition+batch in the model vs. combat adjusted and do comparison
  3. Provide some ontology/KEGG results

1.9 Infection tasks

  1. Add ‘coordinates’ for all the PCA plots for Maria Adelaida to pick up. done by adding them to all PCA plots
  2. Send along some PCA coordinates (from the email:) done
  1. The coordinates of the PCA of PBMCs without batch correction – showing the strong batch effect by patient Ibid #0
  2. The coordinates of the PCAs of PBMCs after batch correction Ibid #0
  1. Amend PCA coordinates: first without uninfected, then with
  1. without batch, patient, strain Ibid #0 – actually did I do them all, I think so
  1. Will we call the 6 isolates be called ‘strains’, ‘isolates’, ‘cultures’?
  1. The reason for this question is that they do not necessarily follow well defined rules vis a vis the similarity to the characterized genome.
  1. Make sure the actual experimental batch comparison is maintained I believe this is correct
  2. Make sure that it is clear the difference between ‘experimental batch effect’ and ‘donor effect’ I think this is clear, I have a specific call before each comparison now, is there a best way to ensure this?
  3. Report correlation between factors and components. Done for the case of samples without uninfected
  4. Fit linear model after removing each factor and observe variance remaining to directly assess %var in each factor. ‘Most elegant method is the mixed effect model’
  5. Terms agreed to:
    1. “donor”: (patient) bob jane alice
    2. “isolate”: (strain) 2171 2172 etc
    3. “status”: (self healing, chronic, uninfected)
    4. “batch”: (a b) The two library/RNA isolation dates
    5. “snp”: (x y z) Clade top (few), clade middle (some), clade bottom (many)
  6. Make a model of ~ 0 + isolate + donor + batch (maybe not batch) Fit to the data, take residuals, make PCA of the residuals, color by sh/ch
  7. Compare parasite data clustering to see if some other surrogate factor to make some sense of this a) Goal: attempt to identify the cluster of strains which has peculiar human expression.
  8. See what happens if I relabel 635 as strain ch2504 Done for some values of ‘done’ I am not certain of the result
  9. Learn best method

2 Installation and setup

These are rmarkdown documents which make heavy use of the hpgltools package. The following section demonstrates how to set that up in a clean R environment.

## Use R's install.packages to install devtools.
install.packages("devtools")
## Use devtools to install hpgltools.
devtools::install_github("elsayedlab/hpgltools")
## Load hpgltools into the R environment.
library(hpgltools)
## Use hpgltools' autoloads_all() function to install the many packages used by hpgltools.
autoloads_all()

This document is rather short. In some others I did the preprocessing, shared, etc steps as ‘child’ documents of this and they would be appended here. In this case I did not.

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 2a0661d6e37f8a3d8831eb3bbd6347c0d9c4b3b7
## R> packrat::restore()
## This is hpgltools commit: Thu Mar 29 16:59:07 2018 -0400: 2a0661d6e37f8a3d8831eb3bbd6347c0d9c4b3b7
## Saving to index-v20171110.rda.xz
---
title: "L.panamensis 2016: RNAseq of L.panamensis in human macrophages, during infection, in biopsies, and with antimonial treatment."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
if (!isTRUE(get0("skip_load"))) {
  library(hpgltools)
  tt <- devtools::load_all("~/hpgltools")
  knitr::opts_knit$set(progress=TRUE,
                       verbose=TRUE,
                       width=90,
                       echo=TRUE)
  knitr::opts_chunk$set(error=TRUE,
                        fig.width=8,
                        fig.height=8,
                        dpi=96)
  old_options <- options(digits=4,
                         stringsAsFactors=FALSE,
                         knitr.duplicate.label="allow")
  ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
  ver <- "20171110"
  previous_file <- "index.Rmd"

  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))
  rmd_file <- "index.Rmd"
}
```

These documents are a series of Leishmania panamensis RNASeq analyses.

This data is in fact in 4 sub-experiments.  Lets address them one at a time.

* [Preprocessing](preprocessing.html)  The steps performed to preprocess the data.
* [Annotation](annotation.html)  Data shared by all experiments (annotations, genomes, etc).
* [Snp clustering](snp_cluster.html)  Cluster samples by snp percentages.
* Macrophage Analyses:  Responses of macrophages from 1 donor and 3 strains.
    * [Macrophage Sample Estimation: Human](macrophage_estimation_human.html)  Macrophage responses: hpgl0241-0248
      and hpgl0636-0639.
    * [Macrophage Sample Estimation: Parasite](macrophage_estimation_parasite.html)  Ibid except the parasite mappings.
    * [Macrophage Differential Expression: Human](macrophage_expression_human.html)  Macrophage responses: human
    * [Macrophage Differential Expression: Parasite](macrophage_expression_parasite.html)  Macrophage responses: parasite
    * [Macrophage Gene Ontology: Human](macrophage_ontology_human.html)  Gene ontology searches: human
    * [Macrophage Gene Ontology: Parasite](macrophage_ontology_parasite.html)  Gene ontology searches: parasite
* Infection Analyses:  Responses of PBMC cells to strains identified as 'self-healing' and 'chronic'.
    * [Infection Sample Estimation: Human](infection_estimation_human.html)  Infection responses: hpgl0241-0248
      and hpgl0636-0639.
    * [Infection Sample Estimation: Parasite](infection_estimation_parasite.html)  Ibid except the parasite mappings.
    * [Infection Differential Expression: Human](infection_expression_human.html)  Infection responses: human
    * [Infection Differential Expression: Parasite](infection_expression_parasite.html)  Infection responses: parasite
    * [Infection Gene Ontology: Human](infection_ontology_human.html)  Gene ontology searches: human
    * [Infection Gene Ontology: Parasite](infection_ontology_parasite.html)  Gene ontology searches: parasite
*  Biopsy Analyses:  PBMC samples pre and post-biopsy.
    * [Biopsy Sample Estimation: Human](biopsy_estimation_human.html)  Biopsy responses: hpgl0241-0248
    and hpgl0636-0639.
    * [Biopsy Sample Estimation: Parasite](biopsy_estimation_parasite.html)  Ibid except the parasite mappings.
    * [Biopsy Differential Expression: Human](biopsy_expression_human.html)  Biopsy responses: human
    * [Biopsy Differential Expression: Parasite](biopsy_expression_parasite.html)  Biopsy responses: parasite
    * [Biopsy Gene Ontology: Human](biopsy_ontology_human.html)  Gene ontology searches: human
    * [Biopsy Gene Ontology: Parasite](biopsy_ontology_parasite.html)  Gene ontology searches: parasite
*  Antimonial Treatments:  A search for differences between responsive and unresponsive samples
    * [Antimonial Sample Estimation: Human](antimonial_estimation_human.html)  Antimonial responses: hpgl0241-0248
    and hpgl0636-0639.
    * [Antimonial Sample Estimation: Parasite](antimonial_estimation_parasite.html)  Ibid except the parasite mappings.
    * [Antimonial Differential Expression: Human](antimonial_expression_human.html)  Antimonial responses: human
    * [Antimonial Differential Expression: Parasite](antimonial_expression_parasite.html)  Antimonial responses: parasite
    * [Antimonial Gene Ontology: Human](antimonial_ontology_human.html)  Gene ontology searches: human
    * [Antimonial Gene Ontology: Parasite](antimonial_ontology_parasite.html)  Gene ontology searches: parasite

# TODO

## 2017-08-24

1.  Define analyses.
2.  Set up supplementals.
    a.  Same panels as per Santuza's paper:
      *  Coverage
      *  Boxplot
      *  correlation
      *  PCA
      *  distance
      *  SMC



## 2017-07-27

1.  Find overlaps between the three patients and the groups of ch/un + sh/un

## 2016-11-29

*  Do the following:
    1.  Limma with batch in model, report # genes fc>0, p<=0.05;  fc<0, p<=0.05; fc>1.5, p<=0.05; fc<-1.5, p<=0.05
    2.  Limma with combat, report # genes fc>0, p<=0.05; fc<0, p<=0.05, etc.
    3.  Limma without b, report same
    4.  Limma without b+sva, report same
*  Take the 4 results and compare them:
    1.  Correlate various fold changes
    2.  Find # genes of 1 in 2, 1 in 3, 1 in 4, 2 in 3, 2 in 4, 3 in 4.

## 2016-11-03

Do in lexical order, send the first 4 by email asap.

*    For macrophage, remove batch b and re-perform metrics.
*    For macrophage, look at pca information after adjustment with combat.
*    Compare de lists with and without batch b (using combat for batch b containing work)
*    PBMC: variance partition human, "Are there any genes with variance we can ascribe to condition?"
    That set of genes is the most interesting. -- use the variance parition table and pull the top 200.
    Report back the % variance by condition for all of these genes.


*    Simplify snp plot for keith.
*    Make separate snp plot of only macrophage and only infection.

## 2016-09-23

1.  Add to existing vcfutils pipeline a check that _all_ samples have sufficient coverage.
Rsamtools, calculate coverage on all positions stored as gRanges[].  Cross reference these against
each snp position and percentage.
    a.  For each existing snp position, make a data structure including the coverage and percentage for all samples
    b.  Given this data structure, keep only those rows with sufficient coverage across _all_
    samples.  (If any sample has < n reads at position x && at no one sample has it > 0.95 then
    remove the row)
    c.  The remaining positions should all have 0<=x<=1 percentage.  Use them for the following
    analyses.
2.  Subset the full data set into a separate plot for each experiment.  Plot these heatmaps
    accordingly.
    a.  If there are interesting clades within these subsets, reuse them for variancePartition/models.
3.  Assuming the same result applies, use PCA or something to query self/chronic for all samples and
    see if the funny blue clade is split or not.

## 2016-09-15

a.  (2)Re-perform vcfutils pipeline and have it include in its summary a number of coverage,  then when merging keep only those positions which have coverage > x for _all_ samples.  (done except I didn't do coverage > x for all samples properly, but the results I think make it a moot point)
 i. Perform with a set of monocladous samples
b.  (4)Perform an explicit set of mappings of only the non-human reads and collect mapping stats for keith ;p
a.  (1)Make 100+% certain that all violin plots are on human data (I think this is true now, done by performing plots of both for all).
b.  (3)Conversely/simultaneously, query the snp profiles (focus on hpgl0635) to
    seek snps which define:   (Goal is to find positions driving the clustering observed)
  i.  Make a heatmap of snps in the columns and strains in rows and use the
    % to score.  How to choose n out of 92k snps
  ii.  For each snp, use snp3 clade as outcome and perform an f-test for each snp and extract the
    highest n f-statistic snps, then plot the heatmap at i.
c.  Compare clustering of gene expression vs. SNP genotype.
d.  Measure variance explained by SNP vs. variance explained by strain by doing varPartition of:
  i.   donor + strain
  ii.  donor + snp(3/4/5) and snpclade
  iii. Assuming this works out, fit a model without strain, but instead snp.

## Misc

a.  VCF tools in bioconductor (I am inclined to suggest that these tools sort of suck)
b.  Exclude <0.9 percentages from:  (done -- there is no significant difference between the two, the correlations/distances are nearly identical.)
DO ME FIRST: newb.
c.  Step1:  Perform varpart and see that variance by snp > variance by strain
   i.  Perform this on the human data and not parasite
   ii.  Perform this for both snp3 and snp5 clades
   iii. Make sure to plot donor+strain+healing and donor+snp+healing for snp3/snp5
d.  Assuming snp3 is appropriate, fit a model of y=donor+snp3+healing
e.  Assuming it functions properly, we have a set of genes with
    differential expression taking into account both donor and clustering.


1.  Bioconductor package 'variancePartition'
  a.  (1|factor) are random effects  ergo; model <- as.formula("~ (1|donor) + (1|strain) + (1|batch)")
  b.  Make violin plot of this result, plot residuals vs. healing,
  c.  Repeat above with introduced + (1|healing)
2.  Copy chr/sh pca plot down to the relabel one sample section to make it easier to follow
3.  Figure out a way to automatically compare every sample against the SNP profiles generated for all strains.
  a.  Record a set of canonical snps for each strain
  b.  For each sample, use something like bam_whateveritwas to extract those regions
  c.  For each snp profile, count the identities/differences for every snp position and heatmap them?
  d.  Then if they match it should come up the terminal color I think
4.  Try the same PCA plots with just dropping hpgl0635
5.  Print full set of plots for the parasite samples before getting into the PCA.

## Macrophage tasks

1.  Re-perform DE analyses with no-sva/ruv but condition+batch (done)
2.  Collect IGV images for strains ch/sh (done)
3.  For SVs, plot SV vs. technical variables: %rna, snps (done)
4.  Take the % of reads mapped to parasite and add it as an experimental parameter (done)
5.  Extend SV analysis to include condition + batch (not done)
6.  For the SV methods, plot log2 genes vs. log2fc (?)
7.  Run RUV/SVA condition + nobatch (done)
  a.  Take SVs and add to the model
  b.  Perform DE
  c.  Plot MA, correlate log2FCs across analyses
8.  Provide a PCA plot of our final decision for this data and provide a table of coordinates
9.  Provide DE lists with condition+batch in the model vs. combat adjusted and do comparison
10. Provide some ontology/KEGG results

## Infection tasks

0. Add 'coordinates' for all the PCA plots for Maria Adelaida to pick up.
      *done by adding them to all PCA plots*
1. Send along some PCA coordinates (from the email:)
      *done*
  a) The coordinates of the PCA of PBMCs without batch correction –
      showing the strong batch effect by patient
      *Ibid #0*
  b) The coordinates of the PCAs of PBMCs after batch correction
      *Ibid #0*
2.  Amend PCA coordinates:  first without uninfected, then with
  a)  without batch, patient, strain
      *Ibid #0 -- actually did I do them all, I think so*
3.  Will we call the 6 isolates be called 'strains', 'isolates', 'cultures'?
  a)  The reason for this question is that they do not necessarily follow well defined rules vis a
      vis the similarity to the characterized genome.

4.  Make sure the actual experimental batch comparison is maintained
      *I believe this is correct*
5.  Make sure that it is clear the difference between 'experimental batch effect' and 'donor effect'
      *I think this is clear, I have a specific call before each comparison now, is there a best way to ensure this?*
6.  Report correlation between factors and components.
      *Done for the case of samples without uninfected*
7.  Fit linear model after removing each factor and observe variance remaining to directly assess
    %var in each factor.  'Most elegant method is the mixed effect model'
8.  Terms agreed to:
    a)  "donor":  (patient) bob jane alice
    b)  "isolate":  (strain) 2171 2172 etc
    c)  "status":   (self healing, chronic, uninfected)
    d)  "batch":    (a b) The two library/RNA isolation dates
    e)  "snp":      (x y z) Clade top (few), clade middle (some), clade bottom (many)
9.  Make a model of ~ 0 + isolate + donor + batch (maybe not batch)
    Fit to the data, take residuals, make PCA of the residuals, color by sh/ch
10. Compare parasite data clustering to see if some other surrogate factor to make some sense of
    this    a)  Goal: attempt to identify the cluster of strains which has peculiar human expression.
11. See what happens if I relabel 635 as strain ch2504
      *Done for some values of 'done' I am not certain of the result*
12. Learn best method

# Installation and setup

These are rmarkdown documents which make heavy use of the hpgltools package.  The following section
demonstrates how to set that up in a clean R environment.

```{r setup, eval=FALSE}
## Use R's install.packages to install devtools.
install.packages("devtools")
## Use devtools to install hpgltools.
devtools::install_github("elsayedlab/hpgltools")
## Load hpgltools into the R environment.
library(hpgltools)
## Use hpgltools' autoloads_all() function to install the many packages used by hpgltools.
autoloads_all()
```

This document is rather short.  In some others I did the preprocessing, shared, etc steps as 'child'
documents of this and they would be appended here.  In this case I did not.

```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}
```
