The big caveat is that I make a lot of typographical errors; to avoid that I wrote a small command-line utility to handle many of the repetitive tasks performed when preprocessing new data. It may be found here:
It invokes all the various commands for me and passes them to the computer cluster. In doing this, it writes the commands to a series of shell scripts to (hopefully) make it easier to see what was actually done.
With that in mind, here is what I have done so far.
Najib has kindly provided us with a large playspace. I created the following directories:
root="~abelew/scratch/small_rna/paeruginosa"
mkdir -p ${root}/preprocessing/hpgl70[1-8]/
mkdir -p ${root}/sample_sheets/
mkdir -p ${root}/preprocessing/test/
hpgl701 through hpgl708 are the sample numbers provided in the sample sheet csv file Najib emailed to me. I copied that file to the sample_sheets/ directory, copied it again to all_samples.csv, and in that copy simplified a couple of column names to make them easier for the computer to interpret.
Because I always make stupid mistakes, I also created a test directory into which I will copy a small number of reads; I will first perform every command there to make sure everything is sane before running on the actual data.
I then ran the following command:
cd preprocessing && cyoa
## Number 5 is 'Prepare', then Number 2 is to copy the raw data.
## It then asks me to provide the input directory and I typed:
## /cbcb/lab/nelsayed/raw_data/vtlee
## It finally asked me for a filename of samples to read: ../all_samples.csv
## It helpfully then reminds me that I may invoke the same thing with:
## cyoa --task prepare --method copyraw --csv ../all_samples.csv
This goes into the raw data directories from Najib and concatenates all the files by sample into preprocessing/hpgl####/hpgl####_forward.fastq.gz.
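Conceptually, the copyraw step amounts to something like the following sketch. The per-sample layout of the raw directory is an assumption here; cyoa actually reads the sample-to-file mapping from the csv rather than hard-coding it.
## Hypothetical illustration only; the real filenames under
## /cbcb/lab/nelsayed/raw_data/vtlee will differ.
for sample in hpgl0701 hpgl0702; do
    cat /cbcb/lab/nelsayed/raw_data/vtlee/${sample}/*.fastq.gz \
        > preprocessing/${sample}/${sample}_forward.fastq.gz
done
(Concatenating gzip files directly is fine; the result is still a valid gzip stream.)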
I will then make a small test data set:
cd preprocessing/
zcat hpgl0701/hpgl0701_forward.fastq.gz | head -n 80000 > test/test.fastq && gzip test/test.fastq
This command grabs the first 20,000 sequences (80,000 lines at 4 lines per fastq record) from an arbitrarily chosen sample, drops them into the test/ directory, and compresses the result. If you are new to the shell, don’t be concerned by the &&; I think of it as ‘andand’ and use it when I want to chain multiple commands, with the caveat that if an earlier command fails then the later commands will not run.
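A contrived example of the && behavior:
mkdir /tmp/new_dir && cd /tmp/new_dir  ## cd runs only if mkdir succeeded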
Ok, so what is my goal? I forgot. Well, while I think about that, I will glance at the raw sequence to see if there are any obvious problems:
@HWI-1KL118:196:HJVNFBCXX:1:1101:1659:2141 1:N:0:ATCACG
AGGATTGagatcgGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGC
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIHI
@HWI-1KL118:196:HJVNFBCXX:1:1101:1729:2164 1:N:0:ATCACG
CAGAACTGGCagatcgGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTC
+
DDDDDIIIIIIIIIIIIIGIIIHIIIIIIIIIIIIIIE@GHIIIIIIIIIIIIIIHIIIIIIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:1595:2191 1:N:0:ATCACG
AGGACGTagatcgGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGC
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIII
It looks like there might be a very slight weakening of signal at around position 30 (the E@GH), but in reality the quality of the data looks beautiful. I think I will run fastqc to make sure that is true:
cd preprocessing/test
cyoa --task rnaseq --method fastqc --input test.fastq.gz
## This spins off 2 jobs on the cluster, one to run fastqc, and one to collect the statistics from it.
## It finished after about 10 seconds and created files in preprocessing/test/outputs/fastqc including
## fastqc_stats.csv which I glanced at and saw that it found 20,000 sequences of which 0 failed the
## quality checks. fastqc also found a high percentage of adapter content! In this case that is
## a good sign: small RNA inserts are shorter than the read length, so the sequencer reads
## through the insert and into the adapter.
Ok, so I will run fastqc on all the samples. I have a 3 line shell script which invokes cyoa on every sample in turn; it looks like this:
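(It is essentially the same loop as the trim.sh shown further down, but with the fastqc method:)
### start=$(pwd)
### for i in $(/bin/ls -d hpgl*); do
###   echo "cd into ${i}"
###   cd ${i}
###   input=$(/bin/ls *.fastq.gz)
###   cyoa --input ${input} --task rnaseq --method fastqc
###   cd ${start}
### done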
Now let’s remove the adapters. I’ve looked at a fair number of sequencing libraries and there is a pattern I reflexively look for: ‘AGATCG….’ If you look again at the sequence above, you will find that I lower-cased this string in each of the sequences; note that the adapter starts at positions 7-10 in them. We usually use ‘trimmomatic’ to handle the removal of sequencing adapters; this may be more than it is prepared to handle, and if that proves to be the case, I will use a second tool, ‘cutadapt’… Let’s see.
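If you want a quick, crude count of reads containing that adapter prefix, something like this works (a sketch, run from the test/ directory; the 2~4 step address needs GNU sed):
## Print only the sequence lines (every 4th line, starting at line 2)
## and count those containing the start of the adapter.
zcat test.fastq.gz | sed -n '2~4p' | grep -c AGATCGGAAGAG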
cd preprocessing/test && cyoa
## hit number 6 for rnaseq, number 3 for trimming
## then tell it the input file, it will helpfully remind me that I could have done the same with:
## cyoa --task rnaseq --method trim --input test.fastq.gz
@HWI-1KL118:196:HJVNFBCXX:1:1101:1501:2205 1:N:0:ATCACG
ATTCGGACCTCGGCGCCTCCGCTCCGGCGGGGCGCCACAGGCCCTGA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10706:2176 1:N:0:ATCACG
TGACTTGTTGAACA
+
DDDDDIIIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10571:2247 1:N:0:ATCACG
AGAAGACTCC
+
DDDDDIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10977:2147 1:N:0:ATCACG
AATGACTGGG
+
DDDDDIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10935:2155 1:N:0:ATCACG
TCGGTACCG
+
DDDDDIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10805:2229 1:N:0:ATCACG
CATGTGGCTTCG
+
DDDDDIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:11095:2183 1:N:0:ATCACG
TGTTGGAGTTG
+
DDDDDIIIIII
That first read stands out: it appears not to have the adapter, yet it was left in. That is a case where cutadapt might be a better choice, as it may be told to keep only the reads which contained the adapter. So I am thinking I will perform separate trimmings with trimmomatic and with cutadapt.
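For reference, cutadapt’s --discard-untrimmed flag does exactly this (a minimal sketch, not the invocation cyoa builds):
## Keep only reads in which the adapter was actually found and removed.
zcat test.fastq.gz | cutadapt - -a AGATCGGAAGAGCACACGTCTGAAC --discard-untrimmed > test_adapter_only.fastq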
bin/trim.sh
## trim.sh contains the following:
### start=$(pwd)
### for i in $(/bin/ls -d hpgl*); do
###   echo "cd into ${i}"
###   cd ${i}
###   input=$(/bin/ls *.fastq.gz)
###   cyoa --input ${input} --task rnaseq --method trim
###   cd ${start}
### done
Oh yeah, my toy also makes trimmomatic provide some statistics like fastqc did. From those I see that 89.4% of the hpgl0701 sample survived trimming, which is cool. It creates a .csv file of these statistics, so if a table of this information is needed later, no worries.
It finished while I was typing the above, so I will rename the output files so that I may run again with cutadapt.
for oldname in $(find . -name '*-trimmed.fastq'); do
newname=$(echo ${oldname} | sed 's/_forward-trimmed/_trimomatic/g')
mv ${oldname} ${newname}
done
Now I will redo the trimming with cutadapt and see if there is a significant difference. In previous work I have only used cutadapt for ribosome profiling and/or TNSeq experiments, so its option is not to be found in the RNASeq menu…
cd preprocessing/test && cyoa
## hit number 7 for riboseq, then 2 for cutadapt, then provide the input file.
## wait about 20 seconds...
## Annnnd realize that it assumed I was doing tnseq because I had a typographical error. TNSeq has
## completely different adapters.
## Fix my speeling(sic) and rerun.
## once I fixed that, the command that ends up being run looks like:
### mkdir -p /cbcb/nelsayed-scratch/atb/small_rna/lee/preprocessing/test/outputs/cutadapt && \
###   less test.fastq.xz | cutadapt - \
###     -a ACAGGTTGGATGATAAGTCCCCGGTCTGACACATCTCCCTAT \
###     -a AGATCGGAAGAGCACACGTCTGAAC \
###     -b AGATCGGAAGAGCACACGTCTGAAC \
###     -e 0.1 -n 3 -m 16 -M 42 \
###     --too-short-output=/cbcb/nelsayed-scratch/atb/small_rna/lee/preprocessing/test/outputs/cutadapt/test_tooshort.fastq \
###     --too-long-output=/cbcb/nelsayed-scratch/atb/small_rna/lee/preprocessing/test/outputs/cutadapt/test_toolong.fastq \
###     --untrimmed-output=/cbcb/nelsayed-scratch/atb/small_rna/lee/preprocessing/test/outputs/cutadapt/test_untrimmed.fastq \
###     2>outputs/cutadapt.err 1>test-trimmed.fastq
Of the 3 adapter arguments passed to cutadapt, the last one (-b AGAT…) is the one which should do the work. The way I used cutadapt separates its outputs into a few piles. The primary output is test-trimmed.fastq, which contains all reads which had the adapter and are 16<=x<=42 nucleotides long. These are interesting, but perhaps not what we care about. Instead, we might want to look at the reads in outputs/cutadapt/test_tooshort.fastq.xz, which contains the reads shorter than 16 nucleotides (0<=x<16). Glancing at that file, I see stuff like:
@HWI-1KL118:196:HJVNFBCXX:1:1101:21177:2206 1:N:0:ATCACG
GGGGGCTC
+
DDDDDIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:21267:2125 1:N:0:ATCACG
AGCTGGGACTG
+
DDDDDIIIHII
@HWI-1KL118:196:HJVNFBCXX:1:1101:21293:2163 1:N:0:ATCACG

+

@HWI-1KL118:196:HJVNFBCXX:1:1101:21294:2191 1:N:0:ATCACG
CCGGCTGTTC
+
DDDDDIIIHI
Thus some reads consisted entirely of sequencing adapter, leaving an empty sequence after trimming. These are going to be a problem, as most tools fail catastrophically when faced with an empty sequence, but no worries: we can do a sed ‘s/^$/N/g’ and fill each empty line with a single unknown nucleotide (this fills the empty quality line too, which conveniently keeps sequence and quality the same length).
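As a quick sanity check on how the reads were divided, one can count records in each pile (a sketch, assuming the outputs were xz-compressed as above):
## Count fastq records (4 lines each) in each cutadapt output pile.
for f in outputs/cutadapt/*.fastq.xz; do
    lines=$(xz -dc ${f} | wc -l)
    echo "${f}: $((lines / 4)) reads"
done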
So I will do that, but first run cutadapt on all the sequences: I bet you can guess what the script looks like!!
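(The same loop as before; the command-line form of the cyoa call shown here is inferred from the menu choices above, so the method name is an assumption:)
### start=$(pwd)
### for i in $(/bin/ls -d hpgl*); do
###   echo "cd into ${i}"
###   cd ${i}
###   input=$(/bin/ls *.fastq.gz)
###   ## the --task/--method names are assumptions, mirroring the riboseq menu above
###   cyoa --input ${input} --task riboseq --method cutadapt
###   cd ${start}
### done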
Biopieces may be used to graph a few sample estimates; unfortunately, it assumes one wants the data in the ‘normal’ range, which is > 15 nt. As a result these graphs are not very useful. TODO: rerun for the small pieces.
[Figure: length distribution of the first sample]
[Figure: quality scores of the first sample]
[Figure: nucleotide distributions of the first sample]
[Figure: GC distribution of the first sample]
Note that the following sections work with all reads in the data; however, in discussions with Vince and Mona it was decided to work first with 4-mers and potentially with other lengths later. The following nonetheless provides an initial assessment.
Ok, while that is running, I will try running jellyfish on the limited data in test/. First I will replace blank lines with ‘N’ and try out some arbitrary jellyfish commands, which I will likely decide are stupid in a few minutes and change…
cd test/outputs/cutadapt
less test_tooshort.fastq.xz | sed 's/^$/N/g' > test_tooshort_n.fastq
module add jellyfish
jellyfish count -m 8 -o jelly_test -c 3 -s 1000000 -t 32 test_tooshort_n.fastq
jellyfish info jelly_test
jellyfish histo jelly_test
jellyfish dump jelly_test > 8mers.fasta
Glancing at the result file, I see a few highly overrepresented 8-mers: CGAGACGA, for example, happened 48 times out of 18,720 sequences. Looking at the histogram, the spread runs from one 8-mer which happened 1,695 times down to 8,415 8-mers which happened only once. I think that is probably encouraging, though not perfect given my preconceptions.
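Since jellyfish dump writes fasta in which each header line is the count, ranking the k-mers is a one-liner (a sketch):
## Pair each ">count" header with its k-mer on the following line,
## strip the '>', and sort numerically by count, highest first.
paste - - < 8mers.fasta | tr -d '>' | sort -k1,1 -rn | head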
Let’s see what happens when we run this on all the sequences, so I write a short script: run_jellyfish.sh
bin/run_jellyfish.sh
## copy/pasted from the cutadapt loop above, with some modifications
### start=$(pwd)
### for i in $(/bin/ls -d hpgl*); do
###   echo "cd into ${i}"
###   cd ${i}
###   export INPUT=$(/bin/ls outputs/cutadapt/*tooshort.fastq.xz)
###   cat <<"EOF" | qsub -V -d `pwd` -
### less ${INPUT} | sed 's/^$/N/g' > short_reads.fastq
### module add jellyfish
### jellyfish count -m 8 -o jellyfish.out -c 3 -s 100000 -t 32 short_reads.fastq
### jellyfish info jellyfish.out > jellyfish_information.txt
### jellyfish histo jellyfish.out > jellyfish_histogram.txt
### jellyfish dump jellyfish.out > 8mers.fasta
### EOF
###   cd ${start}
### done
That is going to take a while and I want to get a drink from the co-op. … Still running; I think I will grab some annotation information for Pseudomonas aeruginosa PA14 while waiting.
Redo the above with some modifications to make it specific to n-mers of a given length (4 at the time of writing). Note that I moved all my scripts into the bin/ directory. The script shown below is invoked once per length:
bin/run_jellyfish_nmer.sh 2
bin/run_jellyfish_nmer.sh 3
bin/run_jellyfish_nmer.sh 4
bin/run_jellyfish_nmer.sh 5
bin/run_jellyfish_nmer.sh 6
bin/run_jellyfish_nmer.sh 7
bin/run_jellyfish_nmer.sh 8
bin/run_jellyfish_nmer.sh 9
bin/run_jellyfish_nmer.sh 10
bin/run_jellyfish_nmer.sh 11
bin/run_jellyfish_nmer.sh 12
## #!/usr/bin/env bash
## start=$(pwd)
## export NMER=${1:-4} ## Default to 4-mers
## for i in $(/bin/ls -d hpgl*); do
##   echo "cd into ${i}"
##   cd ${i}
##   cat <<"EOF" | qsub -V -d `pwd` -q throughput -l walltime=04:00:00 -
## module add jellyfish
## less short_reads.fastq.gz | read_fastq -e base_33 -i - | grab -e "SEQ_LEN==${NMER}" | write_fastq -o ${NMER}mers.fastq -x
## COUNT_OUT="${NMER}mers_jellyfish.out"
## jellyfish count -m ${NMER} -o ${COUNT_OUT} -c 3 -s 100000 -t 32 ${NMER}mers.fastq
## jellyfish info ${COUNT_OUT} > ${NMER}mers_jellyfish_information.txt
## jellyfish histo ${COUNT_OUT} > ${NMER}mers_jellyfish_histogram.txt
## jellyfish dump ${COUNT_OUT} > ${NMER}mers_by_count.fasta
## ${start}/bin/count_mers.pl ${NMER}mers_by_count.fasta
## gzip -f ${NMER}mers.fastq ${NMER}mers_by_count.fasta
## EOF
##   cd ${start}
## done
I already wrote one sample sheet (all_samples.csv, which refers to the 4mers files), so really I am just changing the filenames to use the new counts for each n-mer. I started with 4-mers, so I will skip that number.
for n in 2 3 5 6 7 8 9 10 11 12; do ## 4mers already exists, so 4 is skipped
    cp all_samples.csv all_samples${n}.csv
    sed -i "s/4mers/${n}mers/g" all_samples${n}.csv
done
Using the tool make(1), it is possible to automate all of these tasks. Keep in mind that tabs are important for Makefiles, so I am not sure what will happen if one attempts to copy/paste the following.
make
## Here is the current Makefile
##.PHONY: clean all
##all: copy index.html preprocessing.html 2mers.html 3mers.html 4mers.html 5mers.html 6mers.html 7mers.html 8mers.html 9mers.html 10mers.html
##
##%.html: %.Rmd
## @rm -f $@ && echo "Compiling $<" ;
## @Rscript -e "setwd('$(dir $<)');\
## library('rmarkdown');\
## render('$(notdir $<)', output_format='html_document')" 2>$<.out 1>&2
##
##clean:
## rm -f *mers.Rmd *mers.html
##
##NUMBERS = 2 3 4 5 6 7 8 9 10
##copy: nmers.Rmd
## $(foreach var,$(NUMBERS),cp nmers.Rmd $(var)mers.Rmd && sed -i "s/mers <- n/mers <- $(var)/g" $(var)mers.Rmd;)
##
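One quick way to check that the recipe lines survived copy/pasting with their tabs intact:
## cat -A prints tabs as ^I; every recipe line (the rm, Rscript, and foreach
## lines) must begin with one, or make will complain about a missing separator.
cat -A Makefile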