This document is intended to describe every operation performed from raw data up until the creation of count tables suitable for use in R and differential expression analyses. Most (all) of these steps are performed via a command line utility ‘cyoa’ which may be found at http://github.com/abelew/cyoa/.

1 Setting up

In discussions with Najib and Santuza, it became clear to me that it may prove useful to have fresh mappings for the strains: CLBrener, CL14, Y, and perhaps G and Sylvio. I therefore created a new working tree in Najib’s scratch directory:

cd ~abelew/scratch/rnaseq
mkdir -p tcruzi_cl14-brener-sylvio/preprocessing/parasite/cl14
mkdir -p tcruzi_cl14-brener-sylvio/preprocessing/parasite/clbrener
mkdir -p tcruzi_cl14-brener-sylvio/preprocessing/parasite/sylvio
mkdir -p tcruzi_cl14-brener-sylvio/preprocessing/parasite/y
mkdir -p tcruzi_cl14-brener-sylvio/preprocessing/parasite/g
mkdir -p tcruzi_cl14-brener-sylvio/preprocessing/human

2 Copying the raw data

I prefer to concatenate the sequence files into two files, one for the R1 reads (forward) and one for the R2 reads (reverse). The cyoa utility has some functionality to handle this process for me and therefore hopefully avoid any typeographical errors.

The following command will look in the cl14 raw data directory and concatenate every file by sample into the two files (per sample): hpgl0xxx_forward.fastq.gz and hpgl0xxx_reverse.fastq.gz.

cd ~abelew/scratch/rnaseq/preprocessing/parasite/cl14
for i in $(/bin/ls -d /cbcb/lab/nelsayed/raw_data/cl14/HP*); do
  id=$(basename ${i})
  cyoa --task prepare --method copy --raw_dir /cbcb/lab/nelsayed/raw_data/cl14 --hpgl ${id}
done

3 Make testing data

Before running any commands on the computer cluster, I would like to run test them on a small data set. The following gives me 10,000 reads forward and reverse for testing purposes.

cd ~abelew/scratch/rnaseq/preprocessing/parasite/cl14
mkdir test
less ../hpgl0110/hpgl0110_forward.fastq.gz | head -n 40000 | gzip > test_forward.fastq.gz
less ../hpgl0110/hpgl0110_reverse.fastq.gz | head -n 40000 | gzip > test_reverse.fastq.gz

4 Run Fastqc

Everyone loves fastqc except me. Oh well. First I will make sure it looks ok on the testing data.

cd ~abelew/scratch/rnaseq/preprocessing/parasite/cl14/test  ## BTW I copy/paste this
cyoa --task rnaseq --method fastqc --input test_forward.fastq.gz:test_reverse.fastq.gz

The cyoa utility creates the directories ‘output/’ and ‘scripts/.’ The former will receive the shell scripts used to run each step along the way and the latter will get the output(s). Thus in this case, the scripts directory gets 3 files: fqc-test.sh which runs the fastqc command with options suitable for this paired-end data. The fqcst_forward-test.sh and _reverse files gather some simple statistics from fastqc and drop them into a csv file in output/.

fastqc quality for the forward test data.

Thus we see that the sequencing quality for this data is reasonably good. The other plots from fastqc are not likely to be of any interest unless something turns out to be wrong with the data.

5 Run Trimomatic

I like using trimomatic and cutadapt for trimming/adapter removal. Trimomatic is faster but cutadapt is more flexible. Given that this is pretty generic RNASeq data, the flexibility of cutadapt is not likely required.

cd ~abelew/scratch/rnaseq/preprocessing/parasite/cl14/test  ## BTW I copy/paste this
cyoa --task rnaseq --method trim --input test_forward.fastq.gz:test_reverse.fastq.gz

This creates 2 jobs/sample on the cluster, one to do the trimming and one to collect some statistics. The output from this command however is left in test/ as test_forward-trimmed.fastq.gz and test_reverse-trimmed.fastq.gz. Since this is paired end data, it also creates 2 files _unpaired.fastq.gz which are the reads in each category which do not have proper pairs.

6 Biopieces graphs

Biopieces has a few pretty graphs of the total data which I like to look at on occasion.

cd ~abelew/scratch/rnaseq/preprocessing/parasite/cl14/test  ## BTW I copy/paste this
cyoa --task rnaseq --method biopi --input test_forward-trimmed.fastq.gz:test_reverse-trimmed.fastq.gz

Length distribution of the forward testing data: length distribution for forward test data. Length distribution of the reverse testing data: length distribution for reverse test data. Quality scores of the forward testing data: quality scores for forward test data. Quality scores of the reverse testing data: quality scores for reverse test data. Nucleotide distributions of the forward testing data: Nucleotide distributions of the reverse testing data:

7 Map against esmeraldo/nonesmeraldo/clbrener with tophat and kallisto

cd ~abelew/scratch/rnaseq/preprocessing/parasite/cl14/test  ## BTW I copy/paste this
cyoa --task rnaseq --method tophat --input test_forward-trimmed.fastq.gz:test_reverse-trimmed.fastq.gz --species esmer --htseq_type mRNA
cyoa --task rnaseq --method tophat --input test_forward-trimmed.fastq.gz:test_reverse-trimmed.fastq.gz --species nonesmer --htseq_type mRNA
cyoa --task rnaseq --method tophat --input test_forward-trimmed.fastq.gz:test_reverse-trimmed.fastq.gz --species clbrener --htseq_type mRNA
cyoa --task rnaseq --method kallisto --input test_forward-trimmed.fastq.gz:test_reverse-trimmed.fastq.gz --species esmer
cyoa --task rnaseq --method kallisto --input test_forward-trimmed.fastq.gz:test_reverse-trimmed.fastq.gz --species nonesmer
cyoa --task rnaseq --method kallisto --input test_forward-trimmed.fastq.gz:test_reverse-trimmed.fastq.gz --species clbrener

index.html

RNASeq of CL14/CLBrener samples: preprocessing raw data.

atb abelew@gmail.com

2017-03-02