Analyses of two S. cerevisiae strains: wild type and a pseudouridylation mutant (CBF5)

The goal of this project is to look for changes in the yeast transcriptome resulting from mutation(s) in the CBF5 gene. CBF5 is responsible for the pseudouridylation of important ribosomal bases, and mutating it lowers the fidelity of the yeast ribosome with respect to programmed -1 ribosomal frameshifting (among other things). This document is intended to make it easier to reproduce and improve the analyses performed during an RNA sequencing experiment on two yeast strains.

  1. preprocessing.html: The steps performed to preprocess the data.
  2. annotation.html: Data shared by all the downstream analyses.
  3. sample_estimation.html: Check the samples for suitability.
  4. differential_expression.html: Perform the DE analyses.
  5. ontology.html: Perform ontology searches.

TODO list

The following are some requests I have received and whether or not I think I did them.

  1. I need a contrast table as mut-wt (currently is wt-mut, so plus and minus signs are reversed) (done)
  2. Need volcano plot (x=FC, y=adj p value). The goal is to put a line through p=0.05 and to color genes up in D95A red and down in D95A green (my early version has the opposite due to the file being wt-mut). Is there an easy way of doing this in R? Otherwise I’ll figure out how to color specific dots with Excel.
  3. Distribute all of those up and down genes (separately) into pathways/processes for easy presentation – Jon hinted he wants pie charts (I’ll show you the example from a future paper from Sergey). I would want to know how many genes (and what they are) fell in and how many total are in a category. Briggs and I were talking about only including categories that have at least 3 messages represented…
  4. Need to pick top DEGs from up and down list to show as table, but have to see what that’d look like. Should it be balanced – top 10 or 20 on each side– or by a cut off of FC (this is rather imbalanced no matter what cut-off I pick)? May also depend on how much space I have to play with at that point.
  5. FC=1.5 (that is logfc = 0.5849)
  6. FC=2 (that is logfc = 1) – what I’m most interested in, if you can only do 1
  7. FC=4 (that is logfc = 2) – this may not be too informative, since it only leaves 18 up-regulated genes, and 46 down-regulated genes. Is it even worth doing?
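
Since a few of the requests above hinge on fold-change cutoffs and on coloring a volcano plot, here is a minimal sketch of how that might look in base R. It is not the code used in these analyses: the helper names (classify_genes, plot_volcano), the toy data frame, and its column names are all hypothetical, and the sketch assumes the contrast is mut-wt, so that positive logFC means up in the mutant.

```r
## Fold-change cutoffs from the request list, converted to the log2 scale.
fc_cutoffs <- c(1.5, 2, 4)
logfc_cutoffs <- log2(fc_cutoffs)
names(logfc_cutoffs) <- paste0("FC=", fc_cutoffs)
print(round(logfc_cutoffs, 4))

## Hypothetical helper: label each gene as up, down, or unchanged in the mutant.
## Assumes logfc is (mut - wt) on the log2 scale and adj_p is an adjusted p-value.
classify_genes <- function(logfc, adj_p, logfc_cutoff = 1, p_cutoff = 0.05) {
  status <- rep("unchanged", length(logfc))
  significant <- !is.na(adj_p) & adj_p <= p_cutoff
  status[significant & logfc >= logfc_cutoff] <- "up"
  status[significant & logfc <= -logfc_cutoff] <- "down"
  factor(status, levels = c("down", "unchanged", "up"))
}

## Hypothetical helper: volcano plot with up genes in red and down genes in green,
## plus a dashed horizontal line at the adjusted p-value cutoff.
plot_volcano <- function(tbl, p_cutoff = 0.05) {
  status_colors <- c(down = "green", unchanged = "grey", up = "red")
  plot(tbl$logFC, -log10(tbl$adj_p),
       col = status_colors[as.character(tbl$status)], pch = 19,
       xlab = "log2 fold change (mut - wt)",
       ylab = "-log10(adjusted p-value)")
  abline(h = -log10(p_cutoff), lty = 2)
}

## Made-up values purely for illustration; these are not results from this experiment.
toy <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  logFC = c(2.3, -1.4, 0.2, 1.1, -2.5),
  adj_p = c(0.001, 0.010, 0.800, 0.200, 0.0004)
)
toy$status <- classify_genes(toy$logFC, toy$adj_p)
print(table(toy$status))
```

Calling plot_volcano(toy) draws the plot on the current graphics device. Rerunning classify_genes() with logfc_cutoff set to each value in logfc_cutoffs gives the up and down counts for each cutoff, which may help decide whether the FC=4 lists are worth carrying forward.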

Installation and setup

These are rmarkdown documents which make heavy use of the hpgltools package. The following section demonstrates how to set that up in a clean R environment.

## Use R's install.packages to install devtools.
install.packages("devtools")
## Use devtools to install hpgltools.
devtools::install_github("elsayedlab/hpgltools")
## Load hpgltools into the R environment.
library(hpgltools)
## Use hpgltools' autoloads_all() function to install the many packages used by hpgltools.
autoloads_all()

Download genome/annotation

For some projects, I have been relying heavily on the Illumina iGenomes. They seem to me to be fairly consistent and well-annotated data sets for the species I have worked with so far.

http://support.illumina.com/sequencing/sequencing_software/igenome.html

I left a copy of the Ensembl data set in $LAB/ref_data/illumina/ and made symbolic links into $LAB/ref_data/scerevisiae/

Some tools I use ask for .gff files while others look for .gtf files. I have a converter (gff_convert.pl) which I am copying to the local bin/ directory.

In addition, I have recently been using a mix of the Bioconductor OrganismDbi/TxDb/OrgDb interfaces along with Ensembl’s BioMart. The triumvirate of gff annotations, BioMart, and extant OrgDb instances provides a powerful, if confusing, combination.

library('pander')
pander(sessionInfo())

R version 3.3.1 (2016-06-21)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: pander(v.0.6.0) and hpgltools(v.2016.02)

loaded via a namespace (and not attached): Rcpp(v.0.12.6), formatR(v.1.4), compiler(v.3.3.1), RColorBrewer(v.1.1-2), plyr(v.1.8.4), iterators(v.1.0.8), tools(v.3.3.1), testthat(v.1.0.2), digest(v.0.6.9), lattice(v.0.20-33), preprocessCore(v.1.35.0), evaluate(v.0.9), memoise(v.1.0.0), gtable(v.0.2.0), openxlsx(v.3.0.0), foreach(v.1.4.3), yaml(v.2.1.13), ggrepel(v.0.5), parallel(v.3.3.1), withr(v.1.0.2), stringr(v.1.0.0), roxygen2(v.5.0.1), knitr(v.1.13), gtools(v.3.5.0), devtools(v.1.12.0), locfit(v.1.5-9.1), grid(v.3.3.1), data.table(v.1.9.6), Biobase(v.2.33.0), R6(v.2.1.2), rmarkdown(v.1.0), limma(v.3.29.14), edgeR(v.3.15.2), ggplot2(v.2.1.0), reshape2(v.1.4.1), corpcor(v.1.6.8), magrittr(v.1.5), scales(v.0.4.0), codetools(v.0.2-14), htmltools(v.0.3.5), matrixStats(v.0.50.2), BiocGenerics(v.0.19.2), colorspace(v.1.2-6), labeling(v.0.3), stringi(v.1.1.1), munsell(v.0.4.3), chron(v.2.3-47) and crayon(v.1.3.2)