index.html preprocessing.html

1 Annotation version: 20170703

There are a few methods of importing annotation data into R. The following are two attempts, the second is currently being used in these analyses.

1.1 AnnotationHub

AnnotationHub is a newer service and has promise to be an excellent top-level resource for gathering annotation data.

## This doesn't quite work yet
tt <- sm(library(AnnotationHub))
ah <- AnnotationHub()
orgdbs <- query(ah, "OrgDb")
leishmanias <- query(ah, c("OrgDB", "Leishmania"))

1.2 OrganismDbi

AnnotationHub is the new and fancier version of what OrganismDb does. Keith already made these for the parasites though, lets try and use one of those.

The OrganismDb packages are installable via Keith’s builder: https://github.com/elsayed-lab/eupathdb-organismdb

I did a git pull of it, changed a couple small things and ran ‘make lbraziliensis’. This pulls the annotations down from a mix of Najib’s directory on the cluster and the TriTrypdb. As a result, it currently gets the version 27 of the TriTrypDB data. It may be useful for me to download a newer revision (but I am reasonably certain not much (or anything) has changed for braziliensis in a while).

After 5 or so minutes a brand new package ‘Leishmania.braziliensis’ appeared in my R environment.

testing_panamensis <- make_tritrypdb_organismdbi("lpanamensis")
testing_braziliensis <- make_tritrypdb_organismdbi("lbraziliensis")
testing_donovani <- make_tritrypdb_organismdbi("ldonovani")
testing_mexicana <- make_tritrypdb_organismdbi("lmexicana")
testing_major <- make_tritrypdb_organismdbi("lmajor")
testing_donovani_eu <- make_eupath_organismdbi("donovani")
tmp <- sm(try(library("Leishmania.major.Friedlin")))
if (class(tmp) == "try-error") {
  tmp <- make_tritrypdb_organismdbi("lmajor")
}
tmp <- sm(library("org.Lmajor.friedlin.eg.db"))
tmp <- sm(library("TxDb.Lmajor.friedlin.TriTrypDB28"))
Leishmania.major.Friedlin
## OrganismDb Object:
## # Includes GODb Object:  GO.db 
## # With data about:  Gene Ontology 
## # Includes OrgDb Object:  org.Lmajor.friedlin.eg.db 
## # Gene data about:  Leishmania major.friedlin 
## # Taxonomy Id:   
## # Includes TxDb Object:  TxDb.Lmajor.friedlin.TriTrypDB28 
## # Transcriptome data about:  Leishmania major 
## # Based on genome:  NA 
## # The OrgDb gene id GID is mapped to the TxDb gene id GENEID .
lm_org <- sm(load_orgdb_annotations(Leishmania.major.Friedlin, keytype="geneid"))
lm_annotations <- lm_org[["genes"]]

1.3 Read a gff file

In this instance, it might be smarter to actually read the annotations from a gff file, primarily because I modified the gff file to include an entry for the LmjF.26.0890 3’ UTR.

lm_gff_annotations <- load_gff_annotations(gff="reference/lmajor_modified.gff", type="gene")
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Warning in as.data.frame(mcols(x), ...): Arguments in '...' ignored

## Warning in as.data.frame(mcols(x), ...): Arguments in '...' ignored
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Warning in as.data.frame(mcols(x), ...): Arguments in '...' ignored
## Returning a df with 24 columns and 9379 rows.
head(lm_gff_annotations)
##    seqnames start   end width strand    source type score phase           ID         Name
## 37  LmjF.01  3704  4702   999      - TriTrypDB gene    NA    NA LmjF.01.0010 LmjF.01.0010
## 41  LmjF.01  5790  7439  1650      - TriTrypDB gene    NA    NA LmjF.01.0020 LmjF.01.0020
## 45  LmjF.01  9061 11067  2007      - TriTrypDB gene    NA    NA LmjF.01.0030 LmjF.01.0030
## 49  LmjF.01 12073 12642   570      - TriTrypDB gene    NA    NA LmjF.01.0040 LmjF.01.0040
## 53  LmjF.01 15025 17022  1998      - TriTrypDB gene    NA    NA LmjF.01.0050 LmjF.01.0050
## 57  LmjF.01 18137 18886   750      - TriTrypDB gene    NA    NA LmjF.01.0060 LmjF.01.0060
##                               description size       web_id molecule_type organism_name
## 37 hypothetical+protein,+unknown+function  999 LmjF.01.0010          <NA>          <NA>
## 41        hypothetical+protein,+conserved 1650 LmjF.01.0020          <NA>          <NA>
## 45       Kinesin-13+1,+putative+(KIN13-1) 2007 LmjF.01.0030          <NA>          <NA>
## 49 hypothetical+protein,+unknown+function  570 LmjF.01.0040          <NA>          <NA>
## 53                  carboxylase,+putative 1998 LmjF.01.0050          <NA>          <NA>
## 57        hypothetical+protein,+conserved  750 LmjF.01.0060          <NA>          <NA>
##    translation_table topology localization Dbxref    locus_tag        Alias Parent
## 37              <NA>     <NA>         <NA>        LmjF.01.0010 32143805....       
## 41              <NA>     <NA>         <NA>        LmjF.01.0020 32143805....       
## 45              <NA>     <NA>         <NA>        LmjF.01.0030 KIN13-1,....       
## 49              <NA>     <NA>         <NA>        LmjF.01.0040 32143805....       
## 53              <NA>     <NA>         <NA>        LmjF.01.0050 32143805....       
## 57              <NA>     <NA>         <NA>        LmjF.01.0060 32143805....       
##    Ontology_term
## 37              
## 41              
## 45              
## 49              
## 53              
## 57
rownames(lm_gff_annotations) <- paste0("exon_", make.names(lm_gff_annotations[["Name"]], unique=TRUE), ".1")

In contrast, it is possible to load most annotations of interest directly from the gff files used in the alignments. More in-depth information for the human transcriptome may be extracted from biomart.

1.4 Getting ontology data

## While testing, I called this desc, that will need to change.
lm_tooltips <- make_tooltips(annotations=lm_gff_annotations, desc_col="description")

##rownames(lm_annotations) <- paste0("exon_", rownames(lm_annotations), ".1")
##lm_lengths <- lm_annotations[, c("geneid", "width")]
##colnames(lm_lengths) <- c("ID", "width")

lm_goids <- as.data.frame(load_orgdb_go(Leishmania.major.Friedlin, keytype="TXID",
                                        columns=c("TXNAME","GO")))
## This is an organismdbi, that should be ok.
lm_goids <- lm_goids[, c("TXNAME", "GO", "ONTOLOGY", "TERM")]
colnames(lm_goids) <- c("ID", "GO", "ont", "name")
new_ids <- gsub(pattern="-1$", replacement="\\.1", x=lm_goids[["ID"]])
new_ids <- paste0("exon_", new_ids)
lm_goids[["ID"]] <- new_ids

2 Putting the pieces together

The macrophage experiment has samples across 2 contexts, the host and parasite. The following block sets up one experiment for each. If you open the all_samples-species.xlsx files, you will note immediately that a few different attempts were made at ascertaining the most likely experimental factors that contributed to the readily apparent batch effects.

Question: Are there any human reads in this data? I assume no, but will happily map human if so.

2.1 The parasite transcriptome mappings

parasite_expt <- create_expt("sample_sheet/all_samples.xlsx", gene_info=lm_gff_annotations)
## wt_1/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows.
## wt_2/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## wt_3/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## odd3_1/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## odd3_2/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## odd3_3/processed/outputs/tophat_lmajor/accepted_paired.count.xz contains 9470 rows and merges to 9470 rows.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
chosen_colors <- list(
    "wt" = "#AA3939",
    "odd3" = "#067300")
    
parasite_expt <- set_expt_colors(expt=parasite_expt, colors=chosen_colors)

At this point, we should have everything necessary to perform some analyses.

index.html

3 The Experiment Design

3.1 Parasite

knitr::kable(parasite_expt$design)
sampleid condition batch file
wt_1 wt_1 wt null wt_1/processed/outputs/tophat_lmajor/accepted_paired.count.xz
wt_2 wt_2 wt null wt_2/processed/outputs/tophat_lmajor/accepted_paired.count.xz
wt_3 wt_3 wt null wt_3/processed/outputs/tophat_lmajor/accepted_paired.count.xz
odd_1 odd_1 odd3 null odd3_1/processed/outputs/tophat_lmajor/accepted_paired.count.xz
odd_2 odd_2 odd3 null odd3_2/processed/outputs/tophat_lmajor/accepted_paired.count.xz
odd_3 odd_3 odd3 null odd3_3/processed/outputs/tophat_lmajor/accepted_paired.count.xz

4 Changelog

pander::pander(sessionInfo())

R version 3.4.1 (2017-06-30)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: hpgltools(v.2017.01), TxDb.Lmajor.friedlin.TriTrypDB28(v.28.0), org.Lmajor.friedlin.eg.db(v.28.0), Leishmania.major.Friedlin(v.28.0), GO.db(v.3.4.1), OrganismDbi(v.1.18.0), GenomicFeatures(v.1.28.4), GenomicRanges(v.1.28.4), GenomeInfoDb(v.1.12.2), AnnotationDbi(v.1.38.2), IRanges(v.2.10.3), S4Vectors(v.0.14.3), Biobase(v.2.36.2) and BiocGenerics(v.0.22.0)

loaded via a namespace (and not attached): bit64(v.0.9-7), foreach(v.1.4.3), assertthat(v.0.2.0), highr(v.0.6), pander(v.0.6.1), RBGL(v.1.52.0), blob(v.1.1.0), GenomeInfoDbData(v.0.99.0), Rsamtools(v.1.28.0), yaml(v.2.1.14), backports(v.1.1.0), RSQLite(v.2.0), lattice(v.0.20-35), glue(v.1.1.1), digest(v.0.6.12), RColorBrewer(v.1.1-2), XVector(v.0.16.0), colorspace(v.1.3-2), htmltools(v.0.3.6), Matrix(v.1.2-11), plyr(v.1.8.4), XML(v.3.98-1.9), pkgconfig(v.2.0.1), devtools(v.1.13.3), biomaRt(v.2.32.1), zlibbioc(v.1.22.0), scales(v.0.5.0), openxlsx(v.4.0.17), BiocParallel(v.1.10.1), tibble(v.1.3.4), ggplot2(v.2.2.1), withr(v.2.0.0), SummarizedExperiment(v.1.6.3), lazyeval(v.0.2.0), magrittr(v.1.5), crayon(v.1.3.2), evaluate(v.0.10.1), memoise(v.1.1.0), xml2(v.1.1.1), graph(v.1.54.0), BiocInstaller(v.1.26.1), tools(v.3.4.1), data.table(v.1.10.4), matrixStats(v.0.52.2), stringr(v.1.2.0), munsell(v.0.4.3), DelayedArray(v.0.2.7), bindrcpp(v.0.2), Biostrings(v.2.44.2), compiler(v.3.4.1), rlang(v.0.1.2), grid(v.3.4.1), RCurl(v.1.95-4.8), iterators(v.1.0.8), base64enc(v.0.1-3), rmarkdown(v.1.6), bitops(v.1.0-6), testthat(v.1.0.2), gtable(v.0.2.0), codetools(v.0.2-15), DBI(v.0.7), roxygen2(v.6.0.1), R6(v.2.2.2), GenomicAlignments(v.1.12.2), knitr(v.1.17), dplyr(v.0.7.2), rtracklayer(v.1.36.4), bit(v.1.1-12), rprojroot(v.1.2), bindr(v.0.1), commonmark(v.1.4), stringi(v.1.1.5) and Rcpp(v.0.12.12)

message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 739e4a10345a89efb17b37e7de07c5811491ccea
## R> packrat::restore()
## This is hpgltools commit: Thu Aug 31 11:24:06 2017 -0400: 739e4a10345a89efb17b37e7de07c5811491ccea
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to: ", this_save))
## Saving to: 01_annotation-v20170703.rda.xz
tmp <- sm(saveme(filename=this_save))
---
title: "RNASeq o L.major odd3 vs. wt: Collecting annotation."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
}
</style>

```{r options, include=FALSE}
library("hpgltools")
tt <- sm(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=90,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      fig.width=8,
                      fig.height=8,
                      dpi=96)
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
set.seed(1)
ver <- "20170703"
previous_file <- "index.Rmd"

tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))

rmd_file <- "01_annotation.Rmd"
```

```{r rendering, include=FALSE, eval=FALSE}
rmarkdown::render(rmd_file)

rmarkdown::render(rmd_file, output_format="pdf_document")
```

[index.html](index.html) [preprocessing.html](preprocessing.html)

# Annotation version: `r ver`

There are a few methods of importing annotation data into R.  The following are two attempts, the
second is currently being used in these analyses.

## AnnotationHub

AnnotationHub is a newer service and has promise to be an excellent top-level resource for gathering
annotation data.

```{r data_input_genome, eval=FALSE}
## This doesn't quite work yet
tt <- sm(library(AnnotationHub))
ah <- AnnotationHub()
orgdbs <- query(ah, "OrgDb")
leishmanias <- query(ah, c("OrgDB", "Leishmania"))
```

## OrganismDbi

AnnotationHub is the new and fancier version of what OrganismDb does.  Keith already made these for the
parasites though, lets try and use one of those.

The OrganismDb packages are installable via Keith's builder:
https://github.com/elsayed-lab/eupathdb-organismdb

I did a git pull of it, changed a couple small things and ran 'make lbraziliensis'.
This pulls the annotations down from a mix of Najib's directory on the cluster and the TriTrypdb.
As a result, it currently gets the version 27 of the TriTrypDB data.  It may be useful for me
to download a newer revision (but I am reasonably certain not much (or anything) has changed
for braziliensis in a while).

After 5 or so minutes a brand new package 'Leishmania.braziliensis' appeared in my R environment.

```{r make_orgdb, eval=FALSE}
testing_panamensis <- make_tritrypdb_organismdbi("lpanamensis")
testing_braziliensis <- make_tritrypdb_organismdbi("lbraziliensis")
testing_donovani <- make_tritrypdb_organismdbi("ldonovani")
testing_mexicana <- make_tritrypdb_organismdbi("lmexicana")
testing_major <- make_tritrypdb_organismdbi("lmajor")
```

```{r make_eupathdb, eval=FALSE}
testing_donovani_eu <- make_eupath_organismdbi("donovani")
```

```{r lm_orgdb}
tmp <- sm(try(library("Leishmania.major.Friedlin")))
if (class(tmp) == "try-error") {
  tmp <- make_tritrypdb_organismdbi("lmajor")
}
tmp <- sm(library("org.Lmajor.friedlin.eg.db"))
tmp <- sm(library("TxDb.Lmajor.friedlin.TriTrypDB28"))
Leishmania.major.Friedlin
lm_org <- sm(load_orgdb_annotations(Leishmania.major.Friedlin, keytype="geneid"))
lm_annotations <- lm_org[["genes"]]
```

## Read a gff file

In this instance, it might be smarter to actually read the annotations from a gff file,
primarily because I modified the gff file to include an entry for the LmjF.26.0890 3' UTR.

```{r lm_from_gff}
lm_gff_annotations <- load_gff_annotations(gff="reference/lmajor_modified.gff", type="gene")
head(lm_gff_annotations)
rownames(lm_gff_annotations) <- paste0("exon_", make.names(lm_gff_annotations[["Name"]], unique=TRUE), ".1")
```

In contrast, it is possible to load most annotations of interest directly from the gff files used in
the alignments.  More in-depth information for the human transcriptome may be extracted from biomart.

## Getting ontology data

```{r ontology}
## While testing, I called this desc, that will need to change.
lm_tooltips <- make_tooltips(annotations=lm_gff_annotations, desc_col="description")

##rownames(lm_annotations) <- paste0("exon_", rownames(lm_annotations), ".1")
##lm_lengths <- lm_annotations[, c("geneid", "width")]
##colnames(lm_lengths) <- c("ID", "width")

lm_goids <- as.data.frame(load_orgdb_go(Leishmania.major.Friedlin, keytype="TXID",
                                        columns=c("TXNAME","GO")))
lm_goids <- lm_goids[, c("TXNAME", "GO", "ONTOLOGY", "TERM")]
colnames(lm_goids) <- c("ID", "GO", "ont", "name")
new_ids <- gsub(pattern="-1$", replacement="\\.1", x=lm_goids[["ID"]])
new_ids <- paste0("exon_", new_ids)
lm_goids[["ID"]] <- new_ids
```

# Putting the pieces together

The macrophage experiment has samples across 2 contexts, the host and parasite.  The following block
sets up one experiment for each.  If you open the all_samples-species.xlsx files, you will note
immediately that a few different attempts were made at ascertaining the most likely experimental
factors that contributed to the readily apparent batch effects.

Question:  Are there any human reads in this data?  I assume no, but will happily map human if so.

## The parasite transcriptome mappings

```{r parasite_expt}
parasite_expt <- create_expt("sample_sheet/all_samples.xlsx", gene_info=lm_gff_annotations)
chosen_colors <- list(
    "wt" = "#AA3939",
    "odd3" = "#067300")
    
parasite_expt <- set_expt_colors(expt=parasite_expt, colors=chosen_colors)
```

At this point, we should have everything necessary to perform some analyses.

[index.html](index.html)

# The Experiment Design

## Parasite

```{r kable_parasite}
knitr::kable(parasite_expt$design)
```

# Changelog

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to: ", this_save))
tmp <- sm(saveme(filename=this_save))
```
