index.html

1 Fastqc sample estimations

Fastqc is a quick and easy tool to ensure that the data is suitable for further analysis.

cd preprocessing && ./fastqc.sh  ## The script contains the following:
## start=$(pwd)
## for i in $(/bin/ls -d hpgl*); do
##   echo "cd into ${i}"
##   cd ${i}
##   input=$(/bin/ls *.fastq.gz)
##   cyoa --input ${input} --task rnaseq --method fastqc
##   cd ${start}
## done
## A representative script generated looks like this: (without the logging)
## mkdir -p outputs/fastqc &&\
##   fastqc --extract -o outputs/fastqc hpgl0689_forward.fastq.gz \
##   2>outputs/hpgl0689_forward-unfiltered_fastqc.out 1>&2

The results from fastqc are generally not of much interest unless they find something wrong in the data. The most commonly queried result from it is the score distribution with respect to nucleotide.

quality distribution.

quality distribution.

2 Sample trimming

We usually use trimomatic to remove sequence adapters and trim away poor quality bases. It is worth noting that for this data I created three test directories into which I copied 100,000 reads each in order to test each task.

cd preprocessing && ./trim.sh
## start=$(pwd)
## for i in $(/bin/ls -d hpgl*); do
##   echo "cd into ${i}"
##   cd ${i}
##   input=$(/bin/ls *.fastq.gz)
##   cyoa --input ${input} --task rnaseq --method trim
##   cd ${start}
##done
## A representative script looks like this:

## This call to trimomatic removes illumina and epicentre adapters from hpgl0689_forward.fastq.gz.
## It also performs a sliding window removal of anything with quality <25;
## cutadapt provides an alternative to this tool.
## The original sequence data is recompressed and saved in the sequences/ directory.
## Trimomatic_Single: In case a trimming needs to be redone...
## if [[ ! -r "hpgl0689_forward.fastq.gz" ]]; then
##   if [[ -r "sequences/hpgl0689_forward.fastq.gz.xz" ]]; then
##     mv sequences/hpgl0689_forward.fastq.gz.xz . && pxz -d hpgl0689_forward.fastq.gz.xz
##   else
##     echo "Missing files. Did not find hpgl0689_forward.fastq.gz nor sequences/hpgl0689_forward.fastq.gz.xz"
##     exit 1
##   fi
## fi
## trimomatic SE -phred33 hpgl0689_forward.fastq.gz hpgl0689_forward-trimmed.fastq ILLUMINACLIP:/cbcbhomes/abelew/libraries/adapters.fa:2:20:4 SLIDINGWINDOW:4:25 1>outputs/hpgl0689_forward-trimomatic.out 2>&1

3 Sample estimation with biopieces

The biopieces framework provides some pretty metrics.

cd preprocessing && ./biopieces.sh
## start=$(pwd)
## for i in $(/bin/ls -d hpgl*); do
##   echo "cd into ${i}"
##   cd ${i}
##   input=$(/bin/ls *trimmed.fastq)
##   cyoa --input ${input} --task rnaseq --method biopieces
##   cd ${start}
## done
## A representative script looks like this:
## This script uses biopieces to draw some simple graphs of the sequence.
## Do not forget that _only_ the last command in a biopieces string is allowed to have the -x.
## mkdir -p outputs/biopieces
## less hpgl0689_forward-trimmed.fastq | read_fastq -i - -e base_33 |\
##  plot_scores -T 'Quality Scores' -t svg -o outputs/biopieces/hpgl0689_forward-trimmed_quality_scores.svg |\
##  plot_nucleotide_distribution -T 'NT. Distribution' -t svg -o outputs/biopieces/hpgl0689_forward-trimmed_ntdist.svg |\
##  plot_lendist -T 'Length Distribution' -k SEQ_LEN -t svg -o outputs/biopieces/hpgl0689_forward-trimmed_lendist.svg |\
##  analyze_gc |\
##      bin_vals -b 20 -k 'GC%' |\
##      plot_distribution -k 'GC%_BIN' -t svg -o outputs/biopieces/hpgl0689_forward-trimmed_gc_dist.svg |\
##  analyze_gc |\
##      mean_vals -k 'GC%' -o outputs/biopieces/hpgl0689_forward-trimmed_gc.txt |\
##  count_records -o outputs/biopieces/hpgl0689_forward-trimmed_count.txt -x

Length distribution of a sample: length distribution. Quality scores of a sample: quality distribution. Nucleotide distributions of a sample: quality distribution. GC distribution of a sample: quality distribution.

4 Tophat alignments

For these intron rich genomes, we use tophat and/or kallisto. As you can see at the end of the tophat command, I make an attempt to convert all the generated samfiles as quickly as possible into sorted, indexed bamfiles.

cd preprocessing && ./tophat.sh
## start=$(pwd)
## for i in $(/bin/ls -d hpgl*); do
##   echo "cd into ${i}"
##   cd ${i}
##   input=$(/bin/ls *-trimmed.fastq.gz)
##   cyoa --input ${input} --task rnaseq --method tophat --species iscapularis --htseq_type mRNA
##   cd ${start}
## done
## A representative script looks like this:
## However, I know that -g 1 will allow only 1 hit in the case of multihits, but randomly place it
## From the manual:  "If there are more alignments with the same score than this
## number, TopHat will randomly report only this many alignments"
## -N 1 will discard anything with >1 mismatch (default is 2)
## -r adjusts the allowable mean distance between the paired reads
## --mate-std-dev sets the deviation of -r
## --microexon-search will tell it to search short exons for reads >=50
## mkdir -p outputs/tophat_iscapularis && tophat  -g 1  \
##   -G /cbcbhomes/abelew/libraries/genome/iscapularis.gff \
##   --b2-very-sensitive -p 4 -o outputs/tophat_iscapularis \
##   /cbcbhomes/abelew/libraries/genome/indexes/iscapularis \
##   hpgl0689_forward-trimmed.fastq.gz 2>outputs/tophat.out 1>&2 && \
##  samtools sort -l 9 -n outputs/tophat_iscapularis/accepted_hits.bam outputs/tophat_iscapularis/accepted_sorted && \
##  mv outputs/tophat_iscapularis/accepted_sorted.bam outputs/tophat_iscapularis/accepted_hits.bam && \
##  samtools index outputs/tophat_iscapularis/accepted_hits.bam && \
##  samtools sort -l 9 -n outputs/tophat_iscapularis/unmapped.bam outputs/tophat_iscapularis/unmapped_sorted && \
##  mv outputs/tophat_iscapularis/unmapped_sorted.bam outputs/tophat_iscapularis/unmapped.bam && \
## samtools index outputs/tophat_iscapularis/unmapped.bam

5 Counting reads

HTSeq is the most common tool for this task. The invocation of this is actually handled in the same command as tophat above, the following is a relevant script generated by it in order to count against the Ixodes scapularis annotations.

## Counting the number of hits in outputs/tophat_iscapularis/accepted_hits.bam for each feature found in /cbcbhomes/abelew/libraries/genome/iscapularis.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union
htseq-count -q -f bam -s no  -i ID  -t mRNA \
  outputs/tophat_iscapularis/accepted_hits.bam /cbcbhomes/abelew/libraries/genome/iscapularis.gff \
  1>outputs/tophat_iscapularis/accepted_hits.count 2>outputs/tophat_iscapularis/accepted_hits_htseq.err && \
    xz -9e outputs/tophat_iscapularis/accepted_hits.count

The result of this process is the creation of files named ‘accepted_hits.count.xz’ in each sample’s outputs/tophat_iscapularis/ directory. After finishing this, I decided to have the cyoa script put these count tables in their own htseq_(species) directory, so if they disappear that is where they went. (The thinking is that this way I can try out some different counters.) At this point, the data is ready for analysis, I merely need to update the all_samples spreadsheet to reflect the locations of the new count tables in a column named ‘file’. Also, I modified it slightly, as R gets angry at the word ‘naive’ with the umlaut over the i.

With that I can move on to reading in annotations and performing sample estimations.

library('pander')
pander(sessionInfo())

R version 3.3.3 (2017-03-06)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: RMySQL(v.0.10.11), DBI(v.0.6-1), hpgltools(v.2017.01) and pander(v.0.6.0)

loaded via a namespace (and not attached): SummarizedExperiment(v.1.4.0), lattice(v.0.20-35), colorspace(v.1.3-2), testthat(v.1.0.2), htmltools(v.0.3.5), stats4(v.3.3.3), rtracklayer(v.1.34.2), yaml(v.2.1.14), base64enc(v.0.1-3), XML(v.3.98-1.6), withr(v.1.0.2), BiocParallel(v.1.8.2), BiocGenerics(v.0.20.0), foreach(v.1.4.3), plyr(v.1.8.4), stringr(v.1.2.0), zlibbioc(v.1.20.0), Biostrings(v.2.42.1), munsell(v.0.4.3), commonmark(v.1.2), gtable(v.0.2.0), devtools(v.1.12.0), codetools(v.0.2-15), evaluate(v.0.10), memoise(v.1.1.0), Biobase(v.2.34.0), knitr(v.1.15.1), IRanges(v.2.8.2), GenomeInfoDb(v.1.10.3), parallel(v.3.3.3), Rcpp(v.0.12.10), backports(v.1.0.5), scales(v.0.4.1), S4Vectors(v.0.12.2), XVector(v.0.14.1), Rsamtools(v.1.26.2), ggplot2(v.2.2.1), digest(v.0.6.12), stringi(v.1.1.5), GenomicRanges(v.1.26.4), grid(v.3.3.3), rprojroot(v.1.2), tools(v.3.3.3), bitops(v.1.0-6), magrittr(v.1.5), lazyeval(v.0.2.0), RCurl(v.1.95-4.8), tibble(v.1.3.0), crayon(v.1.3.2), Matrix(v.1.2-8), data.table(v.1.10.4), xml2(v.1.1.1), rmarkdown(v.1.4), roxygen2(v.6.0.1), iterators(v.1.0.8), R6(v.2.2.0), GenomicAlignments(v.1.10.1) and compiler(v.3.3.3)

---
title: "Preprocessing Ixodes scapularis raw data."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: tango
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: cosmo
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  <!-- Document prelude revision 2016-10 -->
  body .main-container {
    max-width: 1600px;
}
</style>

```{r options, include=FALSE}
## These are the options I tend to favor
library("hpgltools")
knitr::opts_knit$set(
    progress = TRUE,
    verbose = TRUE,
    width = 90,
    echo = TRUE)
knitr::opts_chunk$set(
    error = TRUE,
    fig.width = 8,
    fig.height = 8,
    dpi = 96)
options(
    digits = 4,
    stringsAsFactors = FALSE,
    knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
set.seed(1)
previous_file <- "index.Rmd"
rmd_file <- "annotation.Rmd"
ver <- "20170424"
previous_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
```

[index.html](index.html)

```{r rendering, include=FALSE, eval=FALSE}
## This block is used to render a document from within it.
rmarkdown::render(rmd_file)

## An extra renderer for pdf output
rmarkdown::render(rmd_file, output_format="pdf_document", output_options=c("skip_html"))
## Or to save/load large Rdata files.
hpgltools:::saveme()
hpgltools:::loadme()
rm(list=ls())
```

# Fastqc sample estimations

Fastqc is a quick and easy tool to ensure that the data is suitable for further analysis.

```{r fastqc, engine='bash', eval=FALSE}
cd preprocessing && ./fastqc.sh  ## The script contains the following:
## start=$(pwd)
## for i in $(/bin/ls -d hpgl*); do
##   echo "cd into ${i}"
##   cd ${i}
##   input=$(/bin/ls *.fastq.gz)
##   cyoa --input ${input} --task rnaseq --method fastqc
##   cd ${start}
## done
## A representative script generated looks like this: (without the logging)
## mkdir -p outputs/fastqc &&\
##   fastqc --extract -o outputs/fastqc hpgl0689_forward.fastq.gz \
##   2>outputs/hpgl0689_forward-unfiltered_fastqc.out 1>&2
```

The results from fastqc are generally not of much interest unless they find something wrong in the
data.  The most commonly queried result from it is the score distribution with respect to
nucleotide.

![quality distribution.](preprocessing/hpgl0689/outputs/fastqc/hpgl0689_forward_fastqc/Images/per_base_quality.png)

# Sample trimming

We usually use trimomatic to remove sequence adapters and trim away poor quality bases.  It is worth
noting that for this data I created three test directories into which I copied 100,000 reads each in
order to test each task.

```{r trimomatic, engine='bash', eval=FALSE}
cd preprocessing && ./trim.sh
## start=$(pwd)
## for i in $(/bin/ls -d hpgl*); do
##   echo "cd into ${i}"
##   cd ${i}
##   input=$(/bin/ls *.fastq.gz)
##   cyoa --input ${input} --task rnaseq --method trim
##   cd ${start}
##done
## A representative script looks like this:

## This call to trimomatic removes illumina and epicentre adapters from hpgl0689_forward.fastq.gz.
## It also performs a sliding window removal of anything with quality <25;
## cutadapt provides an alternative to this tool.
## The original sequence data is recompressed and saved in the sequences/ directory.
## Trimomatic_Single: In case a trimming needs to be redone...
## if [[ ! -r "hpgl0689_forward.fastq.gz" ]]; then
##   if [[ -r "sequences/hpgl0689_forward.fastq.gz.xz" ]]; then
##     mv sequences/hpgl0689_forward.fastq.gz.xz . && pxz -d hpgl0689_forward.fastq.gz.xz
##   else
##     echo "Missing files. Did not find hpgl0689_forward.fastq.gz nor sequences/hpgl0689_forward.fastq.gz.xz"
##     exit 1
##   fi
## fi
## trimomatic SE -phred33 hpgl0689_forward.fastq.gz hpgl0689_forward-trimmed.fastq ILLUMINACLIP:/cbcbhomes/abelew/libraries/adapters.fa:2:20:4 SLIDINGWINDOW:4:25 1>outputs/hpgl0689_forward-trimomatic.out 2>&1
```

# Sample estimation with biopieces

The biopieces framework provides some pretty metrics.

```{r biopieces, engine='bash', eval=FALSE}
cd preprocessing && ./biopieces.sh
## start=$(pwd)
## for i in $(/bin/ls -d hpgl*); do
##   echo "cd into ${i}"
##   cd ${i}
##   input=$(/bin/ls *trimmed.fastq)
##   cyoa --input ${input} --task rnaseq --method biopieces
##   cd ${start}
## done
## A representative script looks like this:
## This script uses biopieces to draw some simple graphs of the sequence.
## Do not forget that _only_ the last command in a biopieces string is allowed to have the -x.
## mkdir -p outputs/biopieces
## less hpgl0689_forward-trimmed.fastq | read_fastq -i - -e base_33 |\
##  plot_scores -T 'Quality Scores' -t svg -o outputs/biopieces/hpgl0689_forward-trimmed_quality_scores.svg |\
##  plot_nucleotide_distribution -T 'NT. Distribution' -t svg -o outputs/biopieces/hpgl0689_forward-trimmed_ntdist.svg |\
##  plot_lendist -T 'Length Distribution' -k SEQ_LEN -t svg -o outputs/biopieces/hpgl0689_forward-trimmed_lendist.svg |\
##  analyze_gc |\
##      bin_vals -b 20 -k 'GC%' |\
##      plot_distribution -k 'GC%_BIN' -t svg -o outputs/biopieces/hpgl0689_forward-trimmed_gc_dist.svg |\
##  analyze_gc |\
##      mean_vals -k 'GC%' -o outputs/biopieces/hpgl0689_forward-trimmed_gc.txt |\
##  count_records -o outputs/biopieces/hpgl0689_forward-trimmed_count.txt -x
```

Length distribution of a sample:
![length distribution.](preprocessing/hpgl0689/outputs/biopieces/hpgl0689_forward-trimmed_lendist.svg)
Quality scores of a sample:
![quality distribution.](preprocessing/hpgl0689/outputs/biopieces/hpgl0689_forward-trimmed_quality_scores.svg)
Nucleotide distributions of a sample:
![quality distribution.](preprocessing/hpgl0689/outputs/biopieces/hpgl0689_forward-trimmed_ntdist.svg)
GC distribution of a sample:
![quality distribution.](preprocessing/hpgl0689/outputs/biopieces/hpgl0689_forward-trimmed_gcdist.svg)

# Tophat alignments

For these intron rich genomes, we use tophat and/or kallisto.  As you can see at the end of the
tophat command, I make an attempt to convert all the generated samfiles as quickly as possible into
sorted, indexed bamfiles.

```{r tophat, engine='bash', eval=FALSE}
cd preprocessing && ./tophat.sh
## start=$(pwd)
## for i in $(/bin/ls -d hpgl*); do
##   echo "cd into ${i}"
##   cd ${i}
##   input=$(/bin/ls *-trimmed.fastq.gz)
##   cyoa --input ${input} --task rnaseq --method tophat --species iscapularis --htseq_type mRNA
##   cd ${start}
## done
## A representative script looks like this:
## However, I know that -g 1 will allow only 1 hit in the case of multihits, but randomly place it
## From the manual:  "If there are more alignments with the same score than this
## number, TopHat will randomly report only this many alignments"
## -N 1 will discard anything with >1 mismatch (default is 2)
## -r adjusts the allowable mean distance between the paired reads
## --mate-std-dev sets the deviation of -r
## --microexon-search will tell it to search short exons for reads >=50
## mkdir -p outputs/tophat_iscapularis && tophat  -g 1  \
##   -G /cbcbhomes/abelew/libraries/genome/iscapularis.gff \
##   --b2-very-sensitive -p 4 -o outputs/tophat_iscapularis \
##   /cbcbhomes/abelew/libraries/genome/indexes/iscapularis \
##   hpgl0689_forward-trimmed.fastq.gz 2>outputs/tophat.out 1>&2 && \
##  samtools sort -l 9 -n outputs/tophat_iscapularis/accepted_hits.bam outputs/tophat_iscapularis/accepted_sorted && \
##  mv outputs/tophat_iscapularis/accepted_sorted.bam outputs/tophat_iscapularis/accepted_hits.bam && \
##  samtools index outputs/tophat_iscapularis/accepted_hits.bam && \
##  samtools sort -l 9 -n outputs/tophat_iscapularis/unmapped.bam outputs/tophat_iscapularis/unmapped_sorted && \
##  mv outputs/tophat_iscapularis/unmapped_sorted.bam outputs/tophat_iscapularis/unmapped.bam && \
## samtools index outputs/tophat_iscapularis/unmapped.bam
```

# Counting reads

HTSeq is the most common tool for this task.  The invocation of this is actually handled in the same
command as tophat above, the following is a relevant script generated by it in order to count
against the Ixodes scapularis annotations.

```{r htseq, engine='bash', eval=FALSE}
## Counting the number of hits in outputs/tophat_iscapularis/accepted_hits.bam for each feature found in /cbcbhomes/abelew/libraries/genome/iscapularis.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union
htseq-count -q -f bam -s no  -i ID  -t mRNA \
  outputs/tophat_iscapularis/accepted_hits.bam /cbcbhomes/abelew/libraries/genome/iscapularis.gff \
  1>outputs/tophat_iscapularis/accepted_hits.count 2>outputs/tophat_iscapularis/accepted_hits_htseq.err && \
    xz -9e outputs/tophat_iscapularis/accepted_hits.count
```

The result of this process is the creation of files named 'accepted_hits.count.xz' in each sample's
outputs/tophat_iscapularis/ directory.  After finishing this, I decided to have the cyoa script put
these count tables in their own htseq_(species) directory, so if they disappear that is where they
went.  (The thinking is that this way I can try out some different counters.)  At this point, the
data is ready for analysis, I merely need to update the all_samples spreadsheet to reflect the
locations of the new count tables in a column named 'file'.  Also, I modified it slightly, as R gets
angry at the word 'naive' with the umlaut over the i.

With that I can move on to reading in annotations and performing sample estimations.

* [Annotation](annotation.html)
* [Sample Estimation](sample_estimation.html)

```{r sysinfo, results='asis'}
library('pander')
pander(sessionInfo())
```
