Preprocessing some small RNAs from Pseudomonas aeruginosa

The big caveat is that I make a lot of typographical errors; to avoid them I wrote a small command-line utility to handle many of the repetitive tasks performed when preprocessing new data. It may be found here:

http://github.com/abelew/CYOA

It invokes all the various commands for me and passes them to the computer cluster. In doing this, it writes the commands to a series of shell scripts to (hopefully) make it easier to see what was actually done.

With that in mind, here is what I have done so far.

TODO

  1. Range of data to play with: 4-10, but start with only 4mers; after limiting the data to only the 4mers, perform for all samples
  2. Count the set of 256 4mers AAAA -> UUUU and perform some form of contrast of the normalized counts
    1. Keep in mind the size exclusions performed and also the wt/orn comparison.

Copy reads to my working tree

Najib has kindly provided us with a large playspace. I created the following directories:

The identifiers hpgl701 through hpgl708 were provided in the sample sheet csv file Najib emailed to me. I copied it to the sample_sheets/ directory. I further copied it to all_samples.csv and in that file simplified a couple of column names to make them easier for the computer to interpret.
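The directory creation itself is not shown in this copy; a minimal sketch of what I believe the layout looks like, assuming the preprocessing/hpgl####/ naming used later, would be:

```shell
# Sketch only: directory names follow the preprocessing/hpgl####/ pattern
# used later in this document; sample_sheets/ and test/ are named in the text.
mkdir -p sample_sheets test
for num in $(seq 701 708); do
    mkdir -p "preprocessing/hpgl0${num}"
done
```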

Because I always make stupid mistakes, I also created a test directory into which I will copy a small number of reads and perform all my commands once there to make sure everything is sane before running on the actual data.

I then ran the following command:

This goes into the raw data directories from Najib and concatenates all the files by sample into preprocessing/hpgl####/hpgl####_forward.fastq.gz.
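The command itself did not survive in this copy, but a hedged sketch of the concatenation would look like this (the raw-data location is a placeholder, and gzip streams may be concatenated directly without decompressing):

```shell
# Sketch: concatenate each sample's raw lane files into one gzipped fastq.
# RAW_DIR is a placeholder for the raw-data location Najib provided.
raw="${RAW_DIR:-/path/to/raw_data}"
for num in $(seq 701 708); do
    sample="hpgl0${num}"
    mkdir -p "preprocessing/${sample}"
    # gzip members concatenate cleanly, so no decompression is needed here
    if ls "${raw}/${sample}"/*.fastq.gz >/dev/null 2>&1; then
        cat "${raw}/${sample}"/*.fastq.gz \
            > "preprocessing/${sample}/${sample}_forward.fastq.gz"
    fi
done
```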

Create testing set

I will then make a small test data set:

This command grabs the first 20,000 sequences from an arbitrarily chosen sample, drops them into the test/ directory, then compresses the result. If you are new to the shell, don’t be concerned by the &&; I think of it as ‘andand’ and use it when I want to chain multiple commands, with the caveat that if an earlier command fails, the later commands will not run.
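Something along these lines (20,000 fastq records are 80,000 lines; the file names are my guesses, not necessarily what was actually typed):

```shell
# Sketch: pull the first 20,000 reads (4 lines per fastq record) from one
# arbitrarily chosen sample into test/, then compress the result.
mkdir -p test
src="preprocessing/hpgl0701/hpgl0701_forward.fastq.gz"   # assumed name
if [ -r "${src}" ]; then
    gzip -dc "${src}" | head -n 80000 > test/test.fastq && gzip -f test/test.fastq
fi
```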

Ok, so what is my goal? I forgot. Well, while I think about that I will glance at the raw sequence to see if there are any obvious problems:

@HWI-1KL118:196:HJVNFBCXX:1:1101:1659:2141 1:N:0:ATCACG
AGGATTGagatcgGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGC
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIHI
@HWI-1KL118:196:HJVNFBCXX:1:1101:1729:2164 1:N:0:ATCACG
CAGAACTGGCagatcgGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTC
+
DDDDDIIIIIIIIIIIIIGIIIHIIIIIIIIIIIIIIE@GHIIIIIIIIIIIIIIHIIIIIIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:1595:2191 1:N:0:ATCACG
AGGACGTagatcgGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGC
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIII

It looks like there might be the very slightest weakening of signal at around position 30 (the E@GH) but in reality the quality of the data looks beautiful. I think I will run fastqc to make sure that is true:
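The fastqc invocation would be something like the following (the output directory name is my guess):

```shell
# Sketch: run fastqc on the test reads; guarded so this illustration
# does nothing when fastqc is not installed.
mkdir -p outputs/fastqc
if command -v fastqc >/dev/null 2>&1; then
    fastqc -o outputs/fastqc test/test.fastq.gz
fi
```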

Adapter removal

Now let’s remove the adapters. I’ve looked at a fair number of sequencing libraries and there is a pattern I reflexively look for: ‘AGATCG….’ If you look again at the sequences above, you will find that I lower-cased this string in each of them; note that the adapter starts at positions 7-10. We usually use ‘trimmomatic’ to handle the removal of sequencing adapters, but this may be more than it is prepared to handle; if that proves to be the case, I will use a second tool, ‘cutadapt’… Let’s see.
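Since CYOA writes its commands out to shell scripts, the script for a single-end trimmomatic run on the test data might look roughly like this; the adapter file and step parameters here are assumptions, not what was actually run:

```shell
# Sketch only: write the hypothetical trim script, CYOA-style.
mkdir -p test
cat > test/01trim_test.sh <<'EOF'
#!/usr/bin/env bash
# ILLUMINACLIP parameters and the adapters.fa file are guesses;
# MINLEN:4 keeps the very short reads this experiment cares about.
trimmomatic SE -phred33 test/test.fastq.gz test/test-trimmed.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 MINLEN:4
EOF
chmod +x test/01trim_test.sh
```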

Here are a few reads from the result:
@HWI-1KL118:196:HJVNFBCXX:1:1101:1501:2205 1:N:0:ATCACG
ATTCGGACCTCGGCGCCTCCGCTCCGGCGGGGCGCCACAGGCCCTGA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10706:2176 1:N:0:ATCACG
TGACTTGTTGAACA
+
DDDDDIIIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10571:2247 1:N:0:ATCACG
AGAAGACTCC
+
DDDDDIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10977:2147 1:N:0:ATCACG
AATGACTGGG
+
DDDDDIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10935:2155 1:N:0:ATCACG
TCGGTACCG
+
DDDDDIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:10805:2229 1:N:0:ATCACG
CATGTGGCTTCG
+
DDDDDIIIIIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:11095:2183 1:N:0:ATCACG
TGTTGGAGTTG
+
DDDDDIIIIII

That first read stands out: it appears not to have the adapter but was left in anyway. That is a case where cutadapt might be a better choice, as it may be told to keep only the reads with the adapter. So I am thinking I will do separate trimming runs with trimmomatic and cutadapt.

Trimmomatic

Oh yeah, my toy also makes trimmomatic provide some statistics like fastqc did. From those I see that 89.4% of the hpgl0701 sample survived, which is cool. It creates a .csv file of these statistics, so if you need a table of this information later, no worries.

It finished while I was typing the above, so I will move the output files to some other name so that I may run again with cutadapt.

Now redo with cutadapt and see if there is a significant difference. In previous work, I have only used cutadapt with ribosome profiling and/or TNSeq experiments, so its option is not to be found in the RNASeq menu…

Run cutadapt

Of the 3 adapters passed to cutadapt, the last one (-b AGAT…) is the one which should do the work. The way I used cutadapt separates its outputs into a few piles: The primary output is test-trimmed.fastq which contains all reads which had the adapter and are 16<=x<=42 nucleotides long. These are interesting, but perhaps not what we care about. Instead, we might want to look at the reads in outputs/cutadapt/test_tooshort.fastq.xz, it contains the reads which are 0<=x<=16 nucleotides long. Glancing at that file, I see stuff like:
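The first two adapters are elided in this copy, so a sketch can only show the working -b one; its sequence here is taken from the lower-cased adapter visible in the reads above, and the length bounds follow the 16..42 window just described. Written out as a script, CYOA-style:

```shell
# Sketch of the cutadapt invocation described above; file names are guesses.
mkdir -p outputs/cutadapt
cat > outputs/cutadapt/cutadapt_test.sh <<'EOF'
#!/usr/bin/env bash
cutadapt -b AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
    -m 16 -M 42 \
    --too-short-output outputs/cutadapt/test_tooshort.fastq \
    -o outputs/cutadapt/test-trimmed.fastq \
    test/test.fastq.gz
EOF
```

If one wanted to keep only the reads that carried the adapter, cutadapt’s --discard-untrimmed flag does exactly that.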

@HWI-1KL118:196:HJVNFBCXX:1:1101:21177:2206 1:N:0:ATCACG
GGGGGCTC
+
DDDDDIII
@HWI-1KL118:196:HJVNFBCXX:1:1101:21267:2125 1:N:0:ATCACG
AGCTGGGACTG
+
DDDDDIIIHII
@HWI-1KL118:196:HJVNFBCXX:1:1101:21293:2163 1:N:0:ATCACG

+

@HWI-1KL118:196:HJVNFBCXX:1:1101:21294:2191 1:N:0:ATCACG
CCGGCTGTTC
+
DDDDDIIIHI

Thus there are reads which consisted of nothing but the sequencing adapter. These are going to be a problem, as most tools fail catastrophically when faced with an empty sequence; but no worries, we can do a sed ‘s/^$/N/g’ and fill them with a single unknown nucleotide.
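Applied to the xz-compressed too-short file, that fix looks like this; note that an empty quality line gets an ‘N’ as well, which keeps each record four lines long (the output file name is my invention):

```shell
# Replace empty sequence/quality lines with a single N so downstream
# tools do not choke on zero-length records.
infile="outputs/cutadapt/test_tooshort.fastq.xz"
if [ -r "${infile}" ]; then
    xz -dc "${infile}" | sed 's/^$/N/g' \
        | xz > "outputs/cutadapt/test_tooshort_filled.fastq.xz"
fi
```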

So I will do that, but first run cutadapt on all the sequences: I bet you can guess what the script looks like!!

Biopieces sample estimation

Biopieces may be used to graph a few sample estimates; unfortunately, it assumes one wants the data in the ‘normal’ range, which is > 15 nt. As a result these graphs are not very useful. TODO: rerun for the small pieces.

Length distribution of the first sample: length distribution. Quality scores of the first sample: quality distribution. Nucleotide distributions of the first sample: nucleotide distribution. GC distribution of the first sample: GC distribution.

Note that the following sections work with all reads in the data; however in discussions with Vince and Mona, it was decided to work specifically first with 4mers and later with potentially other lengths. But the following provided an initial assessment.

Ok, while that is running, I try running jellyfish on the limited data in test/. First I will replace blank lines with ‘N’, then try out some arbitrary jellyfish commands which I will likely decide are stupid in a few minutes and change…
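Those first-pass jellyfish commands might look roughly like this (the hash size and file names are guesses; -m 8 matches the 8-mers discussed next):

```shell
# Sketch: written out as a script, CYOA-style; it needs the real test
# data to run, so it is only generated here.
mkdir -p test
cat > test/jellyfish_test.sh <<'EOF'
#!/usr/bin/env bash
jellyfish count -m 8 -s 10M -o test/test_8mers.jf test/test.fastq
jellyfish histo test/test_8mers.jf > test/test_8mers.hist
jellyfish dump -c test/test_8mers.jf > test/test_8mers.counts
EOF
```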

Test jellyfish

Glancing at the result file, I see that there are a few highly overrepresented 8-mers: CGAGACGA, for example, happened 48 times out of 18,720 sequences. Looking at the histogram, the spread goes from 1 8-mer which happened 1,695 times to 8,415 8-mers which happened 1 time. I think that is probably encouraging, though not perfect given my preconceptions.

Let’s see what happens when we run this on all the sequences, so I write a short script: run_jellyfish.sh

That is going to take a while and I want to get a drink from the co-op. … Still running; I think I will grab some annotation information for Pseudomonas aeruginosa PA14 while waiting.

Run jellyfish on 4mers

Redo the above with some modifications to make it specific to n-mers of a given length (4 at the time of writing). Note that I moved all my scripts into the bin/ directory. The script does the following:

  1. Use biopieces to extract n-mers.
  2. Run jellyfish on the resulting n-mer file.
  3. Make jellyfish histograms and counted n-mers.
  4. Make a count-table of this output.
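The four steps above can be sketched for one sample like so. I stand in an awk filter for the biopieces extraction (keeping only reads of exactly the target length); the real pipeline uses biopieces, and all file names here are hypothetical:

```shell
# Step 1 as an awk stand-in for biopieces: keep fastq records whose
# sequence is exactly `len` nucleotides long.
len=4
keep_len='NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
NR%4==0{ if (length(s)==len) print h "\n" s "\n" p "\n" $0 }'
mkdir -p bin
# Steps 2-4 written out as a script, CYOA-style, since jellyfish needs
# the real data; hash size and file names are guesses.
cat > bin/run_jellyfish_nmers.sh <<EOF
#!/usr/bin/env bash
gzip -dc preprocessing/hpgl0701/hpgl0701_forward.fastq.gz \\
    | awk -v len=${len} '${keep_len}' > hpgl0701_${len}mers.fastq
jellyfish count -m ${len} -s 1M -o hpgl0701_${len}mers.jf hpgl0701_${len}mers.fastq
jellyfish histo hpgl0701_${len}mers.jf > hpgl0701_${len}mers.hist
jellyfish dump -c hpgl0701_${len}mers.jf > hpgl0701_${len}mers.counts
EOF
```

The dump -c output is already a two-column n-mer/count table, so the count-table step is mostly a matter of joining those files across samples.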