1 Preprocessing Saccharomyces cerevisiae samples

The preprocessing of data consists of a few steps, which I will outline below. Most(all) of these steps are handled by a shell script (bin/process_all.sh) which calls a perl script (bin/process.pl) which in turn uses a perl library (bin/lib/HPGL.pm). These scripts help make sure I don’t make any mistakes when processing the data directories. Without them, I would likely make some boneheaded type-o’s and/or reverse filenames/directories and not realize it for hours/days. It is worth noting that I play with these constantly, and the copies in bin/ are likely not going to stay current for very long. For example, I have plans to make a version which simultaneously aligns using sailfish, tophat, and bwa at the same time.

process.pl creates some directories to help make this process a bit more transparent. The scripts/ directory contains a copy of every script before it is submitted to the compute nodes. I will copy an example of each script run into bin/. The outputs/ directory contains all the output from each script. The sequences/ directory contains a compressed copy of the important intermediate sequence files.

1.1 Data storage

I like to put a copy of the raw reads in the umiacs object store called ‘ceph’. I do this with the script ‘ceph_upload’ which has been copied into bin/

1.2 Data trimming

The purpose of data trimming is to remove bases which have low sequence quality and the portions of any reads for which the library adapters were included.

This is handled in process.pl. The actual tool used is trimomatic. I am copying one of the trimomatic submission scripts to bin/. It will be something like bin/trim-hpgl0566.sh

1.3 Sequence quality graphs

There are some tools which will examine the quality scores of all the bases of all the reads. I use a mix of fastqc and biopieces to generate some graphs to show that (hopefully) the raw sequence quality is good.

This is handled in process.pl. A copy will reside in bin/fastqc-something.sh

1.4 Alignment against the yeast genome

There are quite a few tools to align against the reference genome. The reference genome I am using is the Ensembl R64-1-1 release. Given that yeast has very few introns, tophat may not be needed, but I will likely use that anyway just because I’ve used it recently a lot.

If there is a preference, process.pl can currently handle bowtie1, bowtie2, and bwa. I will copy a script it generated into bin/tophat-something.sh

1.5 Counting reads

Assuming I don’t use tophat, process.pl will invoke samtools to convert the text alignments into a compressed, sorted, indexed binary format. If I do use tophat, the accepted_hits.bam is already sorted, so I will merely call htseq-count. process.pl does this and I will therefore leave a copy of its scripts in bin/htseq-something.sh

Note to self: the correct htseq-count invocation looks like: My original invocation had -t gene -i ID which is incorrect for yeast. So if I am going to rerun these, make sure to set the htseq options in process.pl.

htseq-count -q -f bam -s no -t CDS -i transcript_name
bowtie_out/hpgl0564_forward-trimmed-v0M1.bam
/cbcbhomes/abelew/libraries/genome/scerevisiae.gtf 1>bowtie_out/hpgl0564_forward-trimmed-v0M1.count

1.6 Cleaning up

Once these tasks are complete, the working directories for bowtie/tophat/whatever are no longer needed and so deleted, the raw sequences are compressed, and the trimmed sequences are saved to sequences/. The count tables are left behind with names like: hpgl-0xxx.count.gz. I will then copy them into the working tree for this project in the preprocessing/ directory and they will be used for the analyses which follow.

2 Copy reads to the working tree

I prefer to concatenate the multiple sequence files into a single file for each the forward(R1) and reverse(R2) sequences. Interestingly, for this data I decided to handle all steps in one shot with a single script which is invoked with 1 argument for each sample. The text of that script ‘copy_process.sh’ follows:

cd preprocessing
for i in $(/bin/ls hpgl*); do
  ./copy_process.sh ${i}
done

## startdir=$(pwd)
## for i in "$@"; do
##   cd $startdir
##   cyoa --task rnaseq --method copyraw --raw_dir "/cbcb/lab/nelsayed/raw_data/dinman" --hpgl ${i}
##   cd $i
##   inputs="${i}_forward.fastq.gz"
##   trimmed="${i}_forward-trimmed.fastq.gz"
##   cyoa --task rnaseq --method fastqc --input ${inputs}
##   cyoa --task rnaseq --method trimomatic --input ${inputs}
##   trim_id=$(cat last_job.txt)
##   cyoa --task rnaseq --method tophat --input ${trimmed} -s scerevisiae --depends "${trim_id}"
##   cd $startdir
## done

This process is therefore done in 6 steps: 1. Copy the raw data from Najib’s raw_data directory 2. Perform fastqc on the raw reads 3. Perform trimomatic 4. Align with tophat against the L.panamensis genome 5. Align with tophat against the L.braziliensis genome 6. Align with tophat against the H.sapiens genome

One small caveat, these sequences were created in two separate groups, hpgl0241-hpgl0322 and hpgl0630-hpgl0663.

3 Fastqc

The cyoa tool writes out a script which looks like the following:

## This FastQC run is against rnaseq data and is used for
## an initial estimation of the overall sequencing quality.
mkdir -p outputs/fastqc && \
  fastqc --extract -o outputs/fastqc hpgl0241_forward-trimmed.fastq \
  2>outputs/fastqc.out 1>&2

4 Trimomatic

The trimomatic script looks like: It is worth noting that in some cases, the illumina adapter removal results in a java exception. If that happens, rather than lose the data I made my cyoa script re-invoke trimomatic without the illumina clipping. In addition, if things need to be redone, then the sequences may in fact have been moved into the sequences/ directory and compressed with xz -9e, thus the if [[]] at the beginning testing for those files.

## Trimomatic_Pairwise: In case a trimming needs to be redone...
if [[ ! -r "hpgl0663_forward.fastq.gz" ]]; then
  if [[ -r "sequences/hpgl0663_forward.fastq.xz" ]]; then
    mv sequences/hpgl0663_forward.fastq.xz . && pxz -d hpgl0663_forward.fastq.xz && pigz hpgl0663_forward.fastq && mv sequences/hpgl0663_reverse.fastq.xz . &&
 pxz -d hpgl0663_reverse.fastq.xz && pigz hpgl0663_reverse.fastq
  else
    echo "Missing files. Did not find hpgl0663_forward.fastq.gz nor sequences/hpgl0663_forward.fastq.xz"
    exit 1
  fi
fi
trimomatic PE -threads 1 -phred33 hpgl0663_forward.fastq.gz hpgl0663_reverse.fastq.gz hpgl0663_forward-trimmed_paired.fastq.gz hpgl0663_forward-trimmed_unpair
ed.fastq.gz hpgl0663_reverse-trimmed_paired.fastq.gz hpgl0663_reverse-trimmed_unpaired.fastq.gz ILLUMINACLIP:/cbcbhomes/abelew/libraries/adapters.fa:2:20:4 SL
IDINGWINDOW:4:25 1>outputs/hpgl0663-trimomatic.out 2>&1
excepted=$(grep "Exception" outputs/hpgl0663-trimomatic.out)
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
  trimomatic PE -threads 1 -phred33 hpgl0663_forward.fastq.gz hpgl0663_reverse.fastq.gz hpgl0663_forward-trimmed_paired.fastq.gz hpgl0663_forward-trimmed_unpa
ired.fastq.gz hpgl0663_reverse-trimmed_paired.fastq.gz hpgl0663_reverse-trimmed_unpaired.fastq.gz SLIDINGWINDOW:4:25 1>>outputs/hpgl0663-trimomatic.out 2>&1
fi
sleep 10
mv hpgl0663_forward-trimmed_paired.fastq.gz hpgl0663_forward-trimmed.fastq.gz && mv hpgl0663_reverse-trimmed_paired.fastq.gz hpgl0663_reverse-trimmed.fastq.gz

The end result should be a set of files with trimmed, paired sequences separated from unpaired sequences.

5 Tophat

The tophat process is actually a couple of steps rolled into one, the trimmed/paired reads are fed to tophat, then the accepted_hits.bam is tested for properly paired reads and those reads are separated into accepted_paired.bam. These 2 bam files are sorted and indexed, and finally passed to htseq. Thus the tophat scripts look like:

These scripts however, do not include the additional processing step which split the accepted_hits.bam into accepted_paired.bam I am not sure why, but these scripts are missing the following step which was performed but not properly logged:

if [ -r "${tophat_dir}/accepted_hits.bam" ]; then
  samtools view -b -f 2 ${tophat_dir}/accepted_hits.bam > ${tophat_dir}/accepted_paired.bam && samtools index ${tophat_dir}/accepted_paired.bam
fi

I know this step was performed, because the files accepted_paired.bam exists for each directory, but the option -f 2 says that the ‘2’ flag must be set, which is the flag for an alignment which is paired.

mkdir -p outputs/tophat_scerevisiae && tophat  -g 1  \
  -G /cbcbhomes/abelew/libraries/genome/scerevisiae.gff \
  --b2-very-sensitive -p 4 -o outputs/tophat_scerevisiae \
/cbcbhomes/abelew/libraries/genome/indexes/scerevisiae \
  hpgl0663_forward-trimmed.fastq.gz hpgl0663_reverse-trimmed.fastq.gz && \
  samtools sort -l 9 -n outputs/tophat_scerevisiae/accepted_hits.bam outputs/tophat_scerevisiae/accepted_sorted && \
  mv outputs/tophat_scerevisiae/accepted_sorted.bam outputs/tophat_scerevisiae/accepted_hits.bam && \
  samtools index outputs/tophat_scerevisiae/accepted_hits.bam && \
  samtools sort -l 9 -n outputs/tophat_scerevisiae/unmapped.bam outputs/tophat_scerevisiae/unmapped_sorted && \
  mv outputs/tophat_scerevisiae/unmapped_sorted.bam outputs/tophat_scerevisiae/unmapped.bam && \
  samtools index outputs/tophat_scerevisiae/unmapped.bam

6 Run htseq

The above scripts create sorted accepted_hits.bam files, I think that I manually made the accepted_paired.bam files and added those steps to the cyoa script post-facto, but the end result is the same and so I needed to run htseq on both:

## scripts/th_scerevisiae-hpgl0241.sh
## Counting the number of hits in outputs/tophat_scerevisiae/accepted_hits.bam for each feature found in /cbcbhomes/abelew/libraries/genome/scerevisiae.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union
htseq-count -q -f bam -s no  -i ID  \
  outputs/tophat_scerevisiae/accepted_hits.bam /cbcbhomes/abelew/libraries/genome/scerevisiae.gff \
  1>outputs/tophat_scerevisiae/accepted_hits.count 2>outputs/tophat_scerevisiae/accepted_hits.error && \
    xz -9e outputs/tophat_scerevisiae/accepted_hits.count
## I am not going to belabor the point and print the htseq commands for all species.
---
title: "S.cerevisiae 2015: Preprocessing RNAseq samples."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
if (!isTRUE(get0("skip_load"))) {
  library(hpgltools)
  tt <- devtools::load_all("~/hpgltools")
  knitr::opts_knit$set(progress=TRUE,
                       verbose=TRUE,
                       width=90,
                       echo=TRUE)
  knitr::opts_chunk$set(error=TRUE,
                        fig.width=8,
                        fig.height=8,
                        dpi=96)
  old_options <- options(digits=4,
                         stringsAsFactors=FALSE,
                         knitr.duplicate.label="allow")
  ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
  ver <- "20151102"
  previous_file <- "index.Rmd"

  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))
  rmd_file <- "preprocessing.Rmd"
}
```

Preprocessing Saccharomyces cerevisiae samples
==============================================

The preprocessing of data consists of a few steps, which I will
outline below.  Most(all) of these steps are handled by a shell
script (bin/process_all.sh) which calls a perl script (bin/process.pl) which
in turn uses a perl library (bin/lib/HPGL.pm).  These scripts help make sure I
don't make any mistakes when processing the data directories.  Without
them, I would likely make some boneheaded type-o's and/or reverse
filenames/directories and not realize it for hours/days.  It is worth
noting that I play with these constantly, and the copies in bin/ are
likely not going to stay current for very long.  For example, I have
plans to make a version which simultaneously aligns using sailfish,
tophat, and bwa at the same time.

process.pl creates some directories to help make this process a bit
more transparent.  The scripts/ directory contains a copy of every
script before it is submitted to the compute nodes.  I will copy an
example of each script run into bin/.  The outputs/ directory contains
all the output from each script.  The sequences/ directory contains a
compressed copy of the important intermediate sequence files.

## Data storage

I like to put a copy of the raw reads in the umiacs object store
called 'ceph'.  I do this with the script 'ceph_upload' which has been
copied into bin/

## Data trimming

The purpose of data trimming is to remove bases which have low
sequence quality and the portions of any reads for which the library
adapters were included.

This is handled in process.pl.  The actual tool used is trimomatic.
I am copying one of the trimomatic submission scripts to bin/.  It
will be something like bin/trim-hpgl0566.sh

## Sequence quality graphs

There are some tools which will examine the quality scores of all the
bases of all the reads.  I use a mix of fastqc and biopieces to
generate some graphs to show that (hopefully) the raw sequence quality
is good.

This is handled in process.pl.  A copy will reside in
bin/fastqc-something.sh

## Alignment against the yeast genome

There are quite a few tools to align against the reference genome.
The reference genome I am using is the Ensembl R64-1-1 release.
Given that yeast has very few introns, tophat may not be needed, but I
will likely use that anyway just because I've used it recently a lot.

If there is a preference, process.pl can currently handle bowtie1,
bowtie2, and bwa.  I will copy a script it generated into
bin/tophat-something.sh

## Counting reads

Assuming I don't use tophat, process.pl will invoke samtools to
convert the text alignments into a compressed, sorted, indexed binary
format.  If I do use tophat, the accepted_hits.bam is already sorted,
so I will merely call htseq-count.  process.pl does this and I will
therefore leave a copy of its scripts in bin/htseq-something.sh

Note to self:  the correct htseq-count invocation looks like:
My original invocation had -t gene -i ID which is incorrect for yeast.
So if I am going to rerun these, make sure to set the htseq options in process.pl.

> htseq-count -q -f bam -s no -t CDS -i transcript_name \
> bowtie_out/hpgl0564_forward-trimmed-v0M1.bam \
> /cbcbhomes/abelew/libraries/genome/scerevisiae.gtf 1>bowtie_out/hpgl0564_forward-trimmed-v0M1.count


## Cleaning up

Once these tasks are complete, the working directories for
bowtie/tophat/whatever are no longer needed and so deleted, the raw
sequences are compressed, and the trimmed sequences are saved to
sequences/.  The count tables are left behind with names like:
hpgl-0xxx.count.gz.  I will then copy them into the working tree for
this project in the preprocessing/ directory and they will be used for
the analyses which follow.


# Copy reads to the working tree

I prefer to concatenate the multiple sequence files into a single file for each the forward(R1) and
reverse(R2) sequences.  Interestingly, for this data I decided to handle all steps in one shot with
a single script which is invoked with 1 argument for each sample.  The text of that script
'copy_process.sh' follows:

```{r mkdirs, engine='bash', eval=FALSE}
cd preprocessing
for i in $(/bin/ls hpgl*); do
  ./copy_process.sh ${i}
done

## startdir=$(pwd)
## for i in "$@"; do
##   cd $startdir
##   cyoa --task rnaseq --method copyraw --raw_dir "/cbcb/lab/nelsayed/raw_data/dinman" --hpgl ${i}
##   cd $i
##   inputs="${i}_forward.fastq.gz"
##   trimmed="${i}_forward-trimmed.fastq.gz"
##   cyoa --task rnaseq --method fastqc --input ${inputs}
##   cyoa --task rnaseq --method trimomatic --input ${inputs}
##   trim_id=$(cat last_job.txt)
##   cyoa --task rnaseq --method tophat --input ${trimmed} -s scerevisiae --depends "${trim_id}"
##   cd $startdir
## done
```

This process is therefore done in 6 steps:
1.  Copy the raw data from Najib's raw_data directory
2.  Perform fastqc on the raw reads
3.  Perform trimomatic
4.  Align with tophat against the L.panamensis genome
5.  Align with tophat against the L.braziliensis genome
6.  Align with tophat against the H.sapiens genome

One small caveat, these sequences were created in two separate groups, hpgl0241-hpgl0322 and
hpgl0630-hpgl0663.

# Fastqc

The cyoa tool writes out a script which looks like the following:

```{r fastqc_script, engine='bash', eval=FALSE}
## This FastQC run is against rnaseq data and is used for
## an initial estimation of the overall sequencing quality.
mkdir -p outputs/fastqc && \
  fastqc --extract -o outputs/fastqc hpgl0241_forward-trimmed.fastq \
  2>outputs/fastqc.out 1>&2
```

# Trimomatic

The trimomatic script looks like:
It is worth noting that in some cases, the illumina adapter removal results in a java exception.  If
that happens, rather than lose the data I made my cyoa script re-invoke trimomatic without the
illumina clipping.  In addition, if things need to be redone, then the sequences may in fact have
been moved into the sequences/ directory and compressed with xz -9e, thus the if [[]] at the
beginning testing for those files.

```{r trimomatic_script, engine='bash', eval=FALSE}
## Trimomatic_Pairwise: In case a trimming needs to be redone...
if [[ ! -r "hpgl0663_forward.fastq.gz" ]]; then
  if [[ -r "sequences/hpgl0663_forward.fastq.xz" ]]; then
    mv sequences/hpgl0663_forward.fastq.xz . && pxz -d hpgl0663_forward.fastq.xz && pigz hpgl0663_forward.fastq && mv sequences/hpgl0663_reverse.fastq.xz . &&
 pxz -d hpgl0663_reverse.fastq.xz && pigz hpgl0663_reverse.fastq
  else
    echo "Missing files. Did not find hpgl0663_forward.fastq.gz nor sequences/hpgl0663_forward.fastq.xz"
    exit 1
  fi
fi
trimomatic PE -threads 1 -phred33 hpgl0663_forward.fastq.gz hpgl0663_reverse.fastq.gz hpgl0663_forward-trimmed_paired.fastq.gz hpgl0663_forward-trimmed_unpair
ed.fastq.gz hpgl0663_reverse-trimmed_paired.fastq.gz hpgl0663_reverse-trimmed_unpaired.fastq.gz ILLUMINACLIP:/cbcbhomes/abelew/libraries/adapters.fa:2:20:4 SL
IDINGWINDOW:4:25 1>outputs/hpgl0663-trimomatic.out 2>&1
excepted=$(grep "Exception" outputs/hpgl0663-trimomatic.out)
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
  trimomatic PE -threads 1 -phred33 hpgl0663_forward.fastq.gz hpgl0663_reverse.fastq.gz hpgl0663_forward-trimmed_paired.fastq.gz hpgl0663_forward-trimmed_unpa
ired.fastq.gz hpgl0663_reverse-trimmed_paired.fastq.gz hpgl0663_reverse-trimmed_unpaired.fastq.gz SLIDINGWINDOW:4:25 1>>outputs/hpgl0663-trimomatic.out 2>&1
fi
sleep 10
mv hpgl0663_forward-trimmed_paired.fastq.gz hpgl0663_forward-trimmed.fastq.gz && mv hpgl0663_reverse-trimmed_paired.fastq.gz hpgl0663_reverse-trimmed.fastq.gz
```

The end result should be a set of files with trimmed, paired sequences separated from unpaired
sequences.

# Tophat

The tophat process is actually a couple of steps rolled into one, the trimmed/paired reads are fed
to tophat, then the accepted_hits.bam is tested for properly paired reads and those reads are
separated into accepted_paired.bam.  These 2 bam files are sorted and indexed, and finally passed to
htseq.  Thus the tophat scripts look like:

These scripts however, do not include the additional processing step which split the
accepted_hits.bam into accepted_paired.bam  I am not sure why, but these scripts are missing the
following step which was performed but not properly logged:

```{r paired_hits, engine='bash', eval=FALSE}
if [ -r "${tophat_dir}/accepted_hits.bam" ]; then
  samtools view -b -f 2 ${tophat_dir}/accepted_hits.bam > ${tophat_dir}/accepted_paired.bam && samtools index ${tophat_dir}/accepted_paired.bam
fi
```

I know this step was performed, because the files accepted_paired.bam exists for each directory, but
the option -f 2 says that the '2' flag must be set, which is the flag for an alignment which is
paired.

```{r tophat_runs, engine='bash', eval=FALSE}
mkdir -p outputs/tophat_scerevisiae && tophat  -g 1  \
  -G /cbcbhomes/abelew/libraries/genome/scerevisiae.gff \
  --b2-very-sensitive -p 4 -o outputs/tophat_scerevisiae \
/cbcbhomes/abelew/libraries/genome/indexes/scerevisiae \
  hpgl0663_forward-trimmed.fastq.gz hpgl0663_reverse-trimmed.fastq.gz && \
  samtools sort -l 9 -n outputs/tophat_scerevisiae/accepted_hits.bam outputs/tophat_scerevisiae/accepted_sorted && \
  mv outputs/tophat_scerevisiae/accepted_sorted.bam outputs/tophat_scerevisiae/accepted_hits.bam && \
  samtools index outputs/tophat_scerevisiae/accepted_hits.bam && \
  samtools sort -l 9 -n outputs/tophat_scerevisiae/unmapped.bam outputs/tophat_scerevisiae/unmapped_sorted && \
  mv outputs/tophat_scerevisiae/unmapped_sorted.bam outputs/tophat_scerevisiae/unmapped.bam && \
  samtools index outputs/tophat_scerevisiae/unmapped.bam
```

# Run htseq

The above scripts create sorted accepted_hits.bam files, I think that I manually made the
accepted_paired.bam files and added those steps to the cyoa script post-facto, but the end result is
the same and so I needed to run htseq on both:

```{r htseq_invocations, engine='bash', eval=FALSE}
## scripts/th_scerevisiae-hpgl0241.sh
## Counting the number of hits in outputs/tophat_scerevisiae/accepted_hits.bam for each feature found in /cbcbhomes/abelew/libraries/genome/scerevisiae.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union
htseq-count -q -f bam -s no  -i ID  \
  outputs/tophat_scerevisiae/accepted_hits.bam /cbcbhomes/abelew/libraries/genome/scerevisiae.gff \
  1>outputs/tophat_scerevisiae/accepted_hits.count 2>outputs/tophat_scerevisiae/accepted_hits.error && \
    xz -9e outputs/tophat_scerevisiae/accepted_hits.count
## I am not going to belabor the point and print the htseq commands for all species.
```
