1 Copy reads to the working tree

I prefer to concatenate the multiple sequence files into a single file for each the forward(R1) and reverse(R2) sequences. Interestingly, for this data I decided to handle all steps in one shot with a single script which is invoked with 1 argument for each sample. The text of that script ‘copy_process.sh’ follows:

cd preprocessing
for i in $(/bin/ls hpgl*); do
  ./copy_process.sh ${i}
done

## startdir=$(pwd)
## for i in "$@"; do
##   cd $startdir
##   cyoa --task rnaseq --method copyraw --raw_dir "/cbcb/lab/nelsayed/raw_data/fenselau" --hpgl ${i}
##   cd $i
##   inputs="${i}_forward.fastq.gz"
##   trimmed="${i}_forward-trimmed.fastq.gz"
##   cyoa --task rnaseq --method fastqc --input ${inputs}
##   cyoa --task rnaseq --method trimomatic --input ${inputs}
##   trim_id=$(cat last_job.txt)
##   cyoa --task rnaseq --method tophat --input ${trimmed} -s mmusculus --depends "${trim_id}"
##   lp_id=$(cat last_job.txt)
##   cyoa --task rnaseq --method bt_multi --input ${trimmed} -s mmusculus --depends "${trim_id}"
##   hs_id=$(cat last_job.txt)
##   rm last_job.txt
##   cd $startdir
## done

This process is therefore done in 6 steps: 1. Copy the raw data from Najib’s raw_data directory 2. Perform fastqc on the raw reads 3. Perform trimomatic 4. Align with tophat against the Mus musculus genome for the polyA samples. 5. Align with bowtie against the Mus musculus genome for the small-RNA samples.

2 Fastqc

The cyoa tool writes out a script which looks like the following:

## This FastQC run is against rnaseq data and is used for
## an initial estimation of the overall sequencing quality.
mkdir -p outputs/fastqc && \
  fastqc --extract -o outputs/fastqc hpgl0673_forward-trimmed.fastq hpgl0673_reverse-trimmed.fastq \
  2>outputs/fastqc.out 1>&2

A relevant plot from fastqc is (from sample hpgl0673):

qualities: quality for hpgl0673 nucleotide distribution for hpgl0673

3 Trimomatic

The trimomatic script looks like: It is worth noting that in some cases, the illumina adapter removal results in a java exception. If that happens, rather than lose the data I made my cyoa script re-invoke trimomatic without the illumina clipping. In addition, if things need to be redone, then the sequences may in fact have been moved into the sequences/ directory and compressed with xz -9e, thus the if [[]] at the beginning testing for those files.

## Trimomatic_Pairwise: In case a trimming needs to be redone...
if [[ ! -r "hpgl0673_forward.fastq.gz" ]]; then
  if [[ -r "sequences/hpgl0673_forward.fastq.xz" ]]; then
    mv sequences/hpgl0673_forward.fastq.xz . && pxz -d hpgl0673_forward.fastq.xz && pigz hpgl0673_forward.fastq
  else
    echo "Missing files. Did not find hpgl0673_forward.fastq.gz nor sequences/hpgl0673_forward.fastq.xz"
    exit 1
  fi
fi
trimomatic SE -threads 1 -phred33 hpgl0673_forward.fastq.gz hpgl0673_forward-trimmed_paired.fastq.gz \
  hpgl0673_forward-trimmed_unpaired.fastq.gz ILLUMINACLIP:/cbcbhomes/abelew/libraries/adapters.fa:2:20:4 \
  SLIDINGWINDOW:4:25 1>outputs/hpgl0673-trimomatic.out 2>&1
excepted=$(grep "Exception" outputs/hpgl0673-trimomatic.out)
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
trimomatic SE -threads 1 -phred33 hpgl0673_forward.fastq.gz hpgl0673_forward-trimmed_paired.fastq.gz \
  hpgl0673_forward-trimmed_unpaired.fastq.gz SLIDINGWINDOW:4:25 1>>outputs/hpgl0673-trimomatic.out 2>&1
fi
sleep 10
mv hpgl0673_forward-trimmed_paired.fastq.gz hpgl0673_forward-trimmed.fastq.gz

The end result should be a set of files with trimmed, paired sequences separated from unpaired sequences.

4 Tophat

The tophat process is actually a couple of steps rolled into one, the trimmed/paired reads are fed to tophat, then the accepted_hits.bam is tested for properly paired reads and those reads are separated into accepted_paired.bam. These 2 bam files are sorted and indexed, and finally passed to htseq. Thus the tophat scripts look like:

These scripts however, do not include the additional processing step which split the accepted_hits.bam into accepted_paired.bam I am not sure why, but these scripts are missing the following step which was performed but not properly logged:

if [ -r "${tophat_dir}/accepted_hits.bam" ]; then
  samtools view -b -f 2 ${tophat_dir}/accepted_hits.bam > ${tophat_dir}/accepted_paired.bam && samtools index ${tophat_dir}/accepted_paired.bam
fi

I know this step was performed, because the files accepted_paired.bam exists for each directory, but the option -f 2 says that the ‘2’ flag must be set, which is the flag for an alignment which is paired.

## scripts/th_lbraziliensis-hpgl0673.sh
mkdir -p outputs/tophat_mmusculus && tophat  -g 1  \
  -G /cbcbhomes/abelew/libraries/genome/mmusculus.gff \
  --b2-very-sensitive -p 4 -o outputs/tophat_mmusculus \
/cbcbhomes/abelew/libraries/genome/indexes/mmusculus \
  hpgl0673_forward-trimmed.fastq.gz && \
  samtools sort -l 9 -n outputs/tophat_mmusculus/accepted_hits.bam outputs/tophat_mmusculus/accepted_sorted && \
  mv outputs/tophat_mmusculus/accepted_sorted.bam outputs/tophat_mmusculus/accepted_hits.bam && \
  samtools index outputs/tophat_mmusculus/accepted_hits.bam && \
  samtools sort -l 9 -n outputs/tophat_mmusculus/unmapped.bam outputs/tophat_mmusculus/unmapped_sorted && \
  mv outputs/tophat_mmusculus/unmapped_sorted.bam outputs/tophat_mmusculus/unmapped.bam && \
  samtools index outputs/tophat_mmusculus/unmapped.bam

5 Run htseq

The above scripts create sorted accepted_hits.bam files, I think that I manually made the accepted_paired.bam files and added those steps to the cyoa script post-facto, but the end result is the same and so I needed to run htseq on both:

## scripts/th_mmusculus-hpgl0673.sh
## Counting the number of hits in outputs/tophat_mmusculus/accepted_hits.bam for each feature found in /cbcbhomes/abelew/libraries/genome/mmusculus.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union
htseq-count -q -f bam -s no  -i ID  \
  outputs/tophat_mmusculus/accepted_hits.bam /cbcbhomes/abelew/libraries/genome/mmusculus.gff \
  1>outputs/tophat_mmusculus/accepted_hits.count 2>outputs/tophat_mmusculus/accepted_hits.error && \
    xz -9e outputs/tophat_mmusculus/accepted_hits.count
## I am not going to belabor the point and print the htseq commands for all types.

6 Run Bowtie

Bowtie was used for the small RNA libraries and follows a similar path.

## This is a bowtie1 alignment of hpgl0673_forward-trimmed.fastq against
## /cbcbhomes/abelew/libraries/genome/indexes/mmusculus using arguments:  --best -v 0 -M 1 .
## This jobs depended on:
mkdir -p outputs/bowtie && sleep 10 && bowtie /cbcbhomes/abelew/libraries/genome/indexes/mmusculus  --best -v 0 -M 1  \
  -p 1 \
  -q hpgl0673_forward-trimmed.fastq \
  --un outputs/bowtie/hpgl0673_forward-trimmed-v0M1_unaligned_mmusculus.fastq \
  --al outputs/bowtie/hpgl0673_forward-trimmed-v0M1_aligned_mmusculus.fastq \
  -S outputs/bowtie/hpgl0673_forward-trimmed-v0M1.sam \
  2>outputs/bowtie/hpgl0673_forward-trimmed-v0M1.err \
  1>outputs/bowtie/hpgl0673_forward-trimmed-v0M1.out

## Counting the number of hits in outputs/bowtie/hpgl0673_forward-trimmed-v0M1.bam for each feature found in /cbcbhomes/abelew/libraries/genome/mmusculus_mi.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union

htseq-count -q -f bam -s no  -i ID  -t exon \
  outputs/bowtie/hpgl0673_forward-trimmed-v0M1.bam /cbcbhomes/abelew/libraries/genome/mmusculus_mi.gff \
  1>outputs/bowtie/hpgl0673_forward-trimmed-v0M1_mi.count 2>outputs/bowtie/hpgl0673_forward-trimmed-v0M1.error && \
    xz -9e outputs/bowtie/hpgl0673_forward-trimmed-v0M1_mi.count

index.html

---
title: "Preprocessing M.musculus samples."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  <!-- Document prelude revision 2017-02 -->
  body .main-container {
    max-width: 1600px;
}
</style>

```{r options, include=FALSE}
## These are the options I tend to favor
library("hpgltools")
tt <- sm(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(
    progress = TRUE,
    verbose = TRUE,
    width = 90,
    echo = TRUE)
knitr::opts_chunk$set(
    error = TRUE,
    fig.width = 8,
    fig.height = 8,
    dpi = 96)
old_options <- options(
    digits = 4,
    stringsAsFactors = FALSE,
    knitr.duplicate.label = "allow")
set.seed(1)
previous_file <- "index.Rmd"
rmd_file <- "preprocessing.Rmd"
ver <- "20170220"
previous_save <- paste0(gsub(pattern="\\.Rmd", replace=".rda.xz", x=previous_file))
this_save <- paste0(gsub(pattern="\\.Rmd", replace=".rda.xz", x=rmd_file))
```

```{r rendering, include=FALSE, eval=FALSE}
## This block is used to render a document from within it.
rmarkdown::render(rmd_file)

## An extra renderer for pdf outputr
markdown::render(rmd_file, output_format="pdf_document", output_options=c("skip_html"))
## Or to save/load large Rdata files.
hpgltools:::saveme()
hpgltools:::loadme()
rm(list=ls())
```

```{r loadme, include=FALSE, eval=FALSE}
tmp <- sm(loadme(filename=previous_save))
```


# Copy reads to the working tree

I prefer to concatenate the multiple sequence files into a single file for each the forward(R1) and
reverse(R2) sequences.  Interestingly, for this data I decided to handle all steps in one shot with
a single script which is invoked with 1 argument for each sample.  The text of that script
'copy_process.sh' follows:

```{r mkdirs, engine='bash', eval=FALSE}
cd preprocessing
for i in $(/bin/ls hpgl*); do
  ./copy_process.sh ${i}
done

## startdir=$(pwd)
## for i in "$@"; do
##   cd $startdir
##   cyoa --task rnaseq --method copyraw --raw_dir "/cbcb/lab/nelsayed/raw_data/fenselau" --hpgl ${i}
##   cd $i
##   inputs="${i}_forward.fastq.gz"
##   trimmed="${i}_forward-trimmed.fastq.gz"
##   cyoa --task rnaseq --method fastqc --input ${inputs}
##   cyoa --task rnaseq --method trimomatic --input ${inputs}
##   trim_id=$(cat last_job.txt)
##   cyoa --task rnaseq --method tophat --input ${trimmed} -s mmusculus --depends "${trim_id}"
##   lp_id=$(cat last_job.txt)
##   cyoa --task rnaseq --method bt_multi --input ${trimmed} -s mmusculus --depends "${trim_id}"
##   hs_id=$(cat last_job.txt)
##   rm last_job.txt
##   cd $startdir
## done
```

This process is therefore done in 6 steps:
1.  Copy the raw data from Najib's raw_data directory
2.  Perform fastqc on the raw reads
3.  Perform trimomatic
4.  Align with tophat against the Mus musculus genome for the polyA samples.
5.  Align with bowtie against the Mus musculus genome for the small-RNA samples.

# Fastqc

The cyoa tool writes out a script which looks like the following:

```{r fastqc_script, engine='bash', eval=FALSE}
## This FastQC run is against rnaseq data and is used for
## an initial estimation of the overall sequencing quality.
mkdir -p outputs/fastqc && \
  fastqc --extract -o outputs/fastqc hpgl0673_forward-trimmed.fastq hpgl0673_reverse-trimmed.fastq \
  2>outputs/fastqc.out 1>&2
```

A relevant plot from fastqc is (from sample hpgl0673):

qualities:
![quality for hpgl0673](preprocessing/small_exo/hpgl0673/outputs/fastqc/hpgl0673_forward-trimmed_fastqc/Images/per_base_quality.png)
![nucleotide distribution for hpgl0673](preprocessing/small_exo/hpgl0673/outputs/biopieces/hpgl0673_forward-trimmed_ntdist.svg)

# Trimomatic

The trimomatic script looks like:
It is worth noting that in some cases, the illumina adapter removal results in a java exception.  If
that happens, rather than lose the data I made my cyoa script re-invoke trimomatic without the
illumina clipping.  In addition, if things need to be redone, then the sequences may in fact have
been moved into the sequences/ directory and compressed with xz -9e, thus the if [[]] at the
beginning testing for those files.

```{r trimomatic_script, engine='bash', eval=FALSE}
## Trimomatic_Pairwise: In case a trimming needs to be redone...
if [[ ! -r "hpgl0673_forward.fastq.gz" ]]; then
  if [[ -r "sequences/hpgl0673_forward.fastq.xz" ]]; then
    mv sequences/hpgl0673_forward.fastq.xz . && pxz -d hpgl0673_forward.fastq.xz && pigz hpgl0673_forward.fastq
  else
    echo "Missing files. Did not find hpgl0673_forward.fastq.gz nor sequences/hpgl0673_forward.fastq.xz"
    exit 1
  fi
fi
trimomatic SE -threads 1 -phred33 hpgl0673_forward.fastq.gz hpgl0673_forward-trimmed_paired.fastq.gz \
  hpgl0673_forward-trimmed_unpaired.fastq.gz ILLUMINACLIP:/cbcbhomes/abelew/libraries/adapters.fa:2:20:4 \
  SLIDINGWINDOW:4:25 1>outputs/hpgl0673-trimomatic.out 2>&1
excepted=$(grep "Exception" outputs/hpgl0673-trimomatic.out)
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
trimomatic SE -threads 1 -phred33 hpgl0673_forward.fastq.gz hpgl0673_forward-trimmed_paired.fastq.gz \
  hpgl0673_forward-trimmed_unpaired.fastq.gz SLIDINGWINDOW:4:25 1>>outputs/hpgl0673-trimomatic.out 2>&1
fi
sleep 10
mv hpgl0673_forward-trimmed_paired.fastq.gz hpgl0673_forward-trimmed.fastq.gz
```

The end result should be a set of files with trimmed, paired sequences separated from unpaired
sequences.

# Tophat

The tophat process is actually a couple of steps rolled into one, the trimmed/paired reads are fed
to tophat, then the accepted_hits.bam is tested for properly paired reads and those reads are
separated into accepted_paired.bam.  These 2 bam files are sorted and indexed, and finally passed to
htseq.  Thus the tophat scripts look like:

These scripts however, do not include the additional processing step which split the
accepted_hits.bam into accepted_paired.bam  I am not sure why, but these scripts are missing the
following step which was performed but not properly logged:

```{r paired_hits, engine='bash', eval=FALSE}
if [ -r "${tophat_dir}/accepted_hits.bam" ]; then
  samtools view -b -f 2 ${tophat_dir}/accepted_hits.bam > ${tophat_dir}/accepted_paired.bam && samtools index ${tophat_dir}/accepted_paired.bam
fi
```

I know this step was performed, because the files accepted_paired.bam exists for each directory, but
the option -f 2 says that the '2' flag must be set, which is the flag for an alignment which is
paired.

```{r tophat_runs, engine='bash', eval=FALSE}
## scripts/th_lbraziliensis-hpgl0673.sh
mkdir -p outputs/tophat_mmusculus && tophat  -g 1  \
  -G /cbcbhomes/abelew/libraries/genome/mmusculus.gff \
  --b2-very-sensitive -p 4 -o outputs/tophat_mmusculus \
/cbcbhomes/abelew/libraries/genome/indexes/mmusculus \
  hpgl0673_forward-trimmed.fastq.gz && \
  samtools sort -l 9 -n outputs/tophat_mmusculus/accepted_hits.bam outputs/tophat_mmusculus/accepted_sorted && \
  mv outputs/tophat_mmusculus/accepted_sorted.bam outputs/tophat_mmusculus/accepted_hits.bam && \
  samtools index outputs/tophat_mmusculus/accepted_hits.bam && \
  samtools sort -l 9 -n outputs/tophat_mmusculus/unmapped.bam outputs/tophat_mmusculus/unmapped_sorted && \
  mv outputs/tophat_mmusculus/unmapped_sorted.bam outputs/tophat_mmusculus/unmapped.bam && \
  samtools index outputs/tophat_mmusculus/unmapped.bam
```

# Run htseq

The above scripts create sorted accepted_hits.bam files, I think that I manually made the
accepted_paired.bam files and added those steps to the cyoa script post-facto, but the end result is
the same and so I needed to run htseq on both:

```{r htseq_invocations, engine='bash', eval=FALSE}
## scripts/th_mmusculus-hpgl0673.sh
## Counting the number of hits in outputs/tophat_mmusculus/accepted_hits.bam for each feature found in /cbcbhomes/abelew/libraries/genome/mmusculus.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union
htseq-count -q -f bam -s no  -i ID  \
  outputs/tophat_mmusculus/accepted_hits.bam /cbcbhomes/abelew/libraries/genome/mmusculus.gff \
  1>outputs/tophat_mmusculus/accepted_hits.count 2>outputs/tophat_mmusculus/accepted_hits.error && \
    xz -9e outputs/tophat_mmusculus/accepted_hits.count
## I am not going to belabor the point and print the htseq commands for all types.
```

# Run Bowtie

Bowtie was used for the small RNA libraries and follows a similar path.

```{r bowtie_invocation, engine='bash', eval=FALSE}
## This is a bowtie1 alignment of hpgl0673_forward-trimmed.fastq against
## /cbcbhomes/abelew/libraries/genome/indexes/mmusculus using arguments:  --best -v 0 -M 1 .
## This jobs depended on:
mkdir -p outputs/bowtie && sleep 10 && bowtie /cbcbhomes/abelew/libraries/genome/indexes/mmusculus  --best -v 0 -M 1  \
  -p 1 \
  -q hpgl0673_forward-trimmed.fastq \
  --un outputs/bowtie/hpgl0673_forward-trimmed-v0M1_unaligned_mmusculus.fastq \
  --al outputs/bowtie/hpgl0673_forward-trimmed-v0M1_aligned_mmusculus.fastq \
  -S outputs/bowtie/hpgl0673_forward-trimmed-v0M1.sam \
  2>outputs/bowtie/hpgl0673_forward-trimmed-v0M1.err \
  1>outputs/bowtie/hpgl0673_forward-trimmed-v0M1.out

## Counting the number of hits in outputs/bowtie/hpgl0673_forward-trimmed-v0M1.bam for each feature found in /cbcbhomes/abelew/libraries/genome/mmusculus_mi.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union

htseq-count -q -f bam -s no  -i ID  -t exon \
  outputs/bowtie/hpgl0673_forward-trimmed-v0M1.bam /cbcbhomes/abelew/libraries/genome/mmusculus_mi.gff \
  1>outputs/bowtie/hpgl0673_forward-trimmed-v0M1_mi.count 2>outputs/bowtie/hpgl0673_forward-trimmed-v0M1.error && \
    xz -9e outputs/bowtie/hpgl0673_forward-trimmed-v0M1_mi.count
```

[index.html](index.html)
