1 Estimate RT error rates through sequencing

First I want to figure out what is in this data. To do that, I think I will need to do the following:

  1. Combine the read1/read2 pairs.
  2. Do simple grep searches for the end of the template sequences to see how many reads have them.
  3. Count the 6nt indexes.

1.1 Combine read pairs

I am using FLASH for this. I copied the raw data into preprocessing/s1, s2, and s3 (the directory names are arbitrary).

Each directory now contains a file named out.extendedFrags.fastq.

Let us find out how many reads have the end of the template.

Reading the template in the forward direction, we expect to find: TTGTAATACGACTCAC

In the reverse direction, we expect to find: GTGAGTCGTATTACAA
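The count described above could be sketched in Python roughly as follows. This is not the grep search actually used, only an illustration of the same idea; the function names `reverse_complement` and `count_template_reads` are hypothetical, and the sketch assumes the merged reads are in standard four-line FASTQ records.

```python
# Template end as given in the text, reading the template forward.
FORWARD = "TTGTAATACGACTCAC"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def count_template_reads(fastq_handle, forward=FORWARD):
    """Count reads containing the template end in either orientation.

    ASSUMPTION: four lines per FASTQ record, so the sequence is always
    the second line of each record.
    """
    reverse = reverse_complement(forward)
    counts = {"total": 0, "forward": 0, "reverse": 0, "neither": 0}
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 1:  # skip header, '+' and quality lines
            continue
        seq = line.strip().upper()
        counts["total"] += 1
        if forward in seq:
            # A read containing both orientations is counted as forward.
            counts["forward"] += 1
        elif reverse in seq:
            counts["reverse"] += 1
        else:
            counts["neither"] += 1
    return counts
```

Calling `count_template_reads` on an open handle to one of the out.extendedFrags.fastq files would return the four counts for that directory. Computing the reverse sequence with `reverse_complement` rather than typing it by hand also guards against transcription errors.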

2 Extract index-containing sequences

I wrote a short Perl script that attempts to pull out the sequences that contain the template. For each such sequence, it writes the read to a FASTA file in which the ID of each sequence is its 14 nt index. We should then be able to scan the actual sequences for mismatches.
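The Perl script itself is not reproduced here. As a rough Python illustration of the kind of extraction described, the sketch below makes one strong assumption that the text does not confirm: that the index is the stretch of bases immediately following the template end once the read is in the forward orientation. The names `extract_indexed_reads` and `to_fasta` are hypothetical, and `index_len` should be adjusted to the real layout.

```python
TEMPLATE_END = "TTGTAATACGACTCAC"  # template end, forward direction (from the text)
INDEX_LEN = 14                     # index length stated in the text


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def extract_indexed_reads(sequences, template_end=TEMPLATE_END, index_len=INDEX_LEN):
    """Yield (index, forward_oriented_read) for reads containing the template end.

    ASSUMPTION (not from the original script): the index is the `index_len`
    bases directly after the template end in the forward orientation.
    """
    rc_end = reverse_complement(template_end)
    for seq in sequences:
        seq = seq.strip().upper()
        # Flip reads that only match in the reverse direction.
        if template_end not in seq and rc_end in seq:
            seq = reverse_complement(seq)
        pos = seq.find(template_end)
        if pos < 0:
            continue  # template end not found in either orientation
        start = pos + len(template_end)
        index = seq[start:start + index_len]
        if len(index) == index_len:  # drop reads with a truncated index
            yield index, seq


def to_fasta(records):
    """Format (index, read) pairs as FASTA text with the index as the ID."""
    return "".join(f">{index}\n{read}\n" for index, read in records)
```

Because every record carries its index as the ID, reads sharing an index can later be grouped and compared base by base to look for mismatches.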

The three commands above also write a file, ‘idx_count.txt’, that counts how many times each random index was observed. Let us plot that information.
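Before plotting, it can help to reduce the per-index counts to the distribution a histogram would show. The following is a minimal Python sketch; it does not read ‘idx_count.txt’ (whose exact layout is not shown here) but works from a plain list of observed indexes, and the function names are hypothetical.

```python
from collections import Counter


def index_counts(indexes):
    """Count how many times each index was observed."""
    return Counter(indexes)


def count_distribution(counts):
    """Map 'times observed' to the number of distinct indexes seen that often."""
    return dict(sorted(Counter(counts.values()).items()))
```

For example, `count_distribution(index_counts(["AAA", "AAA", "CCC"]))` reports one index seen once and one index seen twice. Indexes observed only once cannot be cross-checked against other reads, so the lower end of this distribution is worth inspecting.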

## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 30 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 7 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

