Preprocessing some small RNAs from Pseudomonas aeruginosa

The big caveat is that I make a lot of typographical errors; to avoid them I wrote a small command-line utility to handle many of the repetitive tasks performed when preprocessing new data. It may be found here:

http://github.com/abelew/CYOA

It invokes all the various commands for me and passes them to the computer cluster. In doing this, it writes the commands to a series of shell scripts to (hopefully) make it easier to see what was actually done.

With that in mind, here is what I have done so far.

TODO

  1. Range of data to play with: 4-10, but start with only 4mers; after limiting the data to only the 4mers, perform for all samples
  2. Count the set of 256 4mers AAAA -> UUUU and perform some form of contrast of the normalized counts
    1. Keep in mind the size exclusions performed and also the wt/orn comparison.

Copy reads to my working tree

Najib has kindly provided us with a large playspace. I created the following directories:

The identifiers hpgl701 through hpgl708 were provided in the sample sheet csv file Najib emailed to me. I copied it to the sample_sheets/ directory. I further copied it to all_samples.csv and in that file simplified a couple of column names to make them easier for the computer to interpret.
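The directory creation itself is not shown in this copy; a minimal sketch of what I believe the layout looks like, assuming the preprocessing/hpgl####/ naming used later, would be:

```shell
# Sketch only: directory names follow the preprocessing/hpgl####/ pattern
# used later in this document; sample_sheets/ and test/ are named in the text.
mkdir -p sample_sheets test
for num in $(seq 701 708); do
    mkdir -p "preprocessing/hpgl0${num}"
done
```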

Because I always make stupid mistakes, I also created a test directory into which I will copy a small number of reads and perform all my commands once there to make sure everything is sane before running on the actual data.

I then ran the following command:

This goes into the raw data directories from Najib and concatenates all the files by sample into preprocessing/hpgl####/hpgl####_forward.fastq.gz.
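The command itself did not survive in this copy, but a hedged sketch of the concatenation would look like this (the raw-data location is a placeholder, and gzip streams may be concatenated directly without decompressing):

```shell
# Sketch: concatenate each sample's raw lane files into one gzipped fastq.
# RAW_DIR is a placeholder for the raw-data location Najib provided.
raw="${RAW_DIR:-/path/to/raw_data}"
for num in $(seq 701 708); do
    sample="hpgl0${num}"
    mkdir -p "preprocessing/${sample}"
    # gzip members concatenate cleanly, so no decompression is needed here
    if ls "${raw}/${sample}"/*.fastq.gz >/dev/null 2>&1; then
        cat "${raw}/${sample}"/*.fastq.gz \
            > "preprocessing/${sample}/${sample}_forward.fastq.gz"
    fi
done
```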

Create testing set

I will then make a small test data set:

This command grabs the first 20,000 sequences from an arbitrarily chosen sample, drops them into the test/ directory, then compresses the result. If you are new to the shell, don’t be concerned by the &&; I think of it as ‘andand’ and use it when I want to chain multiple commands, with the caveat that if an earlier command fails, the later commands will not run.
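Something along these lines (20,000 fastq records are 80,000 lines; the file names are my guesses, not necessarily what was actually typed):

```shell
# Sketch: pull the first 20,000 reads (4 lines per fastq record) from one
# arbitrarily chosen sample into test/, then compress the result.
mkdir -p test
src="preprocessing/hpgl0701/hpgl0701_forward.fastq.gz"   # assumed name
if [ -r "${src}" ]; then
    gzip -dc "${src}" | head -n 80000 > test/test.fastq && gzip -f test/test.fastq
fi
```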

Ok, so what is my goal? I forgot. Well, while I think about that I will glance at the raw sequence to see if there are any obvious problems:

@HWI-1KL118:196:HJVNFBCXX:1:1101:1659:2141 1:N:0:ATCACG
AGGATTGagatcgGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGC
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIHI
@HWI-1KL118:196:HJVNFBCXX:1:1101:1729:2164 1:N:0:ATCACG
CAGAACTGGCagatcgGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTC
+
DDDDDIIIIIIIIIIIIIGIIIHIIIIIIIIIIIIIIE@GHIIIIIIIIIIIIIIHIIIIIIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:1595:2191 1:N:0:ATCACG
AGGACGTagatcgGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGC
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIII

It looks like there might be the very slightest weakening of signal at around position 30 (the E@GH) but in reality the quality of the data looks beautiful. I think I will run fastqc to make sure that is true:
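The fastqc invocation would be something like the following (the output directory name is my guess):

```shell
# Sketch: run fastqc on the test reads; guarded so this illustration
# does nothing when fastqc is not installed.
mkdir -p outputs/fastqc
if command -v fastqc >/dev/null 2>&1; then
    fastqc -o outputs/fastqc test/test.fastq.gz
fi
```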

Adapter removal

Now let’s remove the adapters. I’ve looked at a fair number of sequencing libraries and there is a pattern I reflexively look for: ‘AGATCG….’ If you look again at the sequences above, you will find that I lower-cased this string in each of them; note that the adapter starts at positions 7-10. We usually use ‘trimmomatic’ to handle the removal of sequencing adapters, but this may be more than it is prepared to handle; if that proves to be the case, I will use a second tool, ‘cutadapt’… Let’s see.
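Since CYOA writes its commands out to shell scripts, the script for a single-end trimmomatic run on the test data might look roughly like this; the adapter file and step parameters here are assumptions, not what was actually run:

```shell
# Sketch only: write the hypothetical trim script, CYOA-style.
mkdir -p test
cat > test/01trim_test.sh <<'EOF'
#!/usr/bin/env bash
# ILLUMINACLIP parameters and the adapters.fa file are guesses;
# MINLEN:4 keeps the very short reads this experiment cares about.
trimmomatic SE -phred33 test/test.fastq.gz test/test-trimmed.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 MINLEN:4
EOF
chmod +x test/01trim_test.sh
```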

Here are a few reads from the result:
@HWI-1KL118:196:HJVNFBCXX:1:1101:1501:2205 1:N:0:ATCACG
ATTCGGACCTCGGCGCCTCCGCTCCGGCGGGGCGCCACAGGCCCTGA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10706:2176 1:N:0:ATCACG
TGACTTGTTGAACA
+
DDDDDIIIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10571:2247 1:N:0:ATCACG
AGAAGACTCC
+
DDDDDIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10977:2147 1:N:0:ATCACG
AATGACTGGG
+
DDDDDIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10935:2155 1:N:0:ATCACG
TCGGTACCG
+
DDDDDIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10805:2229 1:N:0:ATCACG
CATGTGGCTTCG
+
DDDDDIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:11095:2183 1:N:0:ATCACG
TGTTGGAGTTG
+
DDDDDIIIIII

That first read stands out: it appears not to have the adapter but was left in anyway. That is a case where cutadapt might be a better choice, as it may be told to keep only the reads with the adapter. So I am thinking I will do separate trimming runs with trimmomatic and cutadapt.

Trimmomatic

Oh yeah, my toy also makes trimmomatic provide some statistics like fastqc did. From those I see that 89.4% of the hpgl0701 sample survived, which is cool. It creates a .csv file of these statistics, so if you need a table of this information later, no worries.

It finished while I was typing the above, so I will move the output files to some other name so that I may run again with cutadapt.

Now redo with cutadapt and see if there is a significant difference. In previous work, I have only used cutadapt with ribosome profiling and/or TNSeq experiments, so its option is not to be found in the RNASeq menu…

Run cutadapt

Of the 3 adapters passed to cutadapt, the last one (-b AGAT…) is the one which should do the work. The way I used cutadapt separates its outputs into a few piles: The primary output is test-trimmed.fastq which contains all reads which had the adapter and are 16<=x<=42 nucleotides long. These are interesting, but perhaps not what we care about. Instead, we might want to look at the reads in outputs/cutadapt/test_tooshort.fastq.xz, it contains the reads which are 0<=x<=16 nucleotides long. Glancing at that file, I see stuff like:
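The first two adapters are elided in this copy, so a sketch can only show the working -b one; its sequence here is taken from the lower-cased adapter visible in the reads above, and the length bounds follow the 16..42 window just described. Written out as a script, CYOA-style:

```shell
# Sketch of the cutadapt invocation described above; file names are guesses.
mkdir -p outputs/cutadapt
cat > outputs/cutadapt/cutadapt_test.sh <<'EOF'
#!/usr/bin/env bash
cutadapt -b AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
    -m 16 -M 42 \
    --too-short-output outputs/cutadapt/test_tooshort.fastq \
    -o outputs/cutadapt/test-trimmed.fastq \
    test/test.fastq.gz
EOF
```

If one wanted to keep only the reads that carried the adapter, cutadapt’s --discard-untrimmed flag does exactly that.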

@HWI-1KL118:196:HJVNFBCXX:1:1101:21177:2206 1:N:0:ATCACG
GGGGGCTC
+
DDDDDIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:21267:2125 1:N:0:ATCACG
AGCTGGGACTG
+
DDDDDIIIHII
@HWI-1KL118:196:HJVNFBCXX:1:1101:21293:2163 1:N:0:ATCACG

+

@HWI-1KL118:196:HJVNFBCXX:1:1101:21294:2191 1:N:0:ATCACG
CCGGCTGTTC
+
DDDDDIIIHI

Thus there are reads which consisted of nothing but the sequencing adapter. These are going to be a problem, as most tools fail catastrophically when faced with an empty sequence; but no worries, we can do a sed ‘s/^$/N/g’ and fill them with a single unknown nucleotide.
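Applied to the xz-compressed too-short file, that fix looks like this; note that an empty quality line gets an ‘N’ as well, which keeps each record four lines long (the output file name is my invention):

```shell
# Replace empty sequence/quality lines with a single N so downstream
# tools do not choke on zero-length records.
infile="outputs/cutadapt/test_tooshort.fastq.xz"
if [ -r "${infile}" ]; then
    xz -dc "${infile}" | sed 's/^$/N/g' \
        | xz > "outputs/cutadapt/test_tooshort_filled.fastq.xz"
fi
```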

So I will do that, but first run cutadapt on all the sequences: I bet you can guess what the script looks like!!

Biopieces sample estimation

Biopieces may be used to graph a few sample estimates; unfortunately, it assumes one wants the data in the ‘normal’ range, which is > 15 nt. As a result these graphs are not very useful. TODO: rerun for the small pieces.

Length distribution of the first sample: length distribution. Quality scores of the first sample: quality distribution. Nucleotide distributions of the first sample: nucleotide distribution. GC distribution of the first sample: GC distribution.

Note that the following sections work with all reads in the data; however in discussions with Vince and Mona, it was decided to work specifically first with 4mers and later with potentially other lengths. But the following provided an initial assessment.

Ok, while that is running, I try running jellyfish on the limited data in test/. First I will replace blank lines with ‘N’, then try out some arbitrary jellyfish commands which I will likely decide are stupid in a few minutes and change…
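Those first-pass jellyfish commands might look roughly like this (the hash size and file names are guesses; -m 8 matches the 8-mers discussed next):

```shell
# Sketch: written out as a script, CYOA-style; it needs the real test
# data to run, so it is only generated here.
mkdir -p test
cat > test/jellyfish_test.sh <<'EOF'
#!/usr/bin/env bash
jellyfish count -m 8 -s 10M -o test/test_8mers.jf test/test.fastq
jellyfish histo test/test_8mers.jf > test/test_8mers.hist
jellyfish dump -c test/test_8mers.jf > test/test_8mers.counts
EOF
```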

Test jellyfish

Glancing at the result file, I see that there are a few highly overrepresented 8-mers: CGAGACGA, for example, happened 48 times out of 18,720 sequences. Looking at the histogram, the spread goes from 1 8-mer which happened 1,695 times to 8,415 8-mers which happened 1 time. I think that is probably encouraging, though not perfect given my preconceptions.

Let’s see what happens when we run this on all the sequences, so I write a short script: run_jellyfish.sh

That is going to take a while and I want to get a drink from the co-op. … Still running; I think I will grab some annotation information for Pseudomonas aeruginosa PA14 while waiting.

Run jellyfish on 4mers

Redo the above with some modifications to make it specific to n-mers of a given length (4 at the time of writing). Note that I moved all my scripts into the bin/ directory. The script does the following:

  1. Use biopieces to extract n-mers.
  2. Run jellyfish on the resulting n-mer file.
  3. Make jellyfish histograms and counted n-mers.
  4. Make a count-table of this output.
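The four steps above can be sketched for one sample like so. I stand in an awk filter for the biopieces extraction (keeping only reads of exactly the target length); the real pipeline uses biopieces, and all file names here are hypothetical:

```shell
# Step 1 as an awk stand-in for biopieces: keep fastq records whose
# sequence is exactly `len` nucleotides long.
len=4
keep_len='NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
NR%4==0{ if (length(s)==len) print h "\n" s "\n" p "\n" $0 }'
mkdir -p bin
# Steps 2-4 written out as a script, CYOA-style, since jellyfish needs
# the real data; hash size and file names are guesses.
cat > bin/run_jellyfish_nmers.sh <<EOF
#!/usr/bin/env bash
gzip -dc preprocessing/hpgl0701/hpgl0701_forward.fastq.gz \\
    | awk -v len=${len} '${keep_len}' > hpgl0701_${len}mers.fastq
jellyfish count -m ${len} -s 1M -o hpgl0701_${len}mers.jf hpgl0701_${len}mers.fastq
jellyfish histo hpgl0701_${len}mers.jf > hpgl0701_${len}mers.hist
jellyfish dump -c hpgl0701_${len}mers.jf > hpgl0701_${len}mers.counts
EOF
```

The dump -c output is already a two-column n-mer/count table, so the count-table step is mostly a matter of joining those files across samples.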