Estimate RT error rates through sequencing
First, I want to figure out what is in this data. To do that, I think I will need to do the following:
- Combine the read1/read2 pairs.
- Do simple grep searches for the end of the template sequences to see how many reads have them.
- Count the 6nt indexes.
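The index-counting step could be sketched as below. This is only an illustration: it ASSUMES the index is the first 6 nt of each merged read, which these notes do not state, and the function name `count_indexes` is hypothetical.

```shell
# Tally index sequences in a FASTQ file.
# ASSUMPTION (not from the notes): the index is the first n nt of each read.
count_indexes() {
  fastq="$1"
  n="${2:-6}"   # index length, defaulting to the 6 nt mentioned above
  # Sequence lines are line 2 of every 4-line FASTQ record.
  awk -v n="${n}" 'NR % 4 == 2 { print substr($0, 1, n) }' "${fastq}" \
    | sort | uniq -c | sort -k1,1nr \
    | awk '{ print $2, $1 }'   # emit "index count", most frequent first
}
```

Calling `count_indexes out.extendedFrags.fastq` would then print one index per line with its number of observations.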
Combine read pairs
I am using flash for this. I copied the raw data into preprocessing/s1, s2, and s3; which sample went into which directory was arbitrary.
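The merge itself can be sketched roughly as follows. I am assuming that `flash` is on the PATH, that each directory holds exactly one `*_1.fastq.gz` / `*_2.fastq.gz` pair, and that flash reads the gzipped input directly; the helper name `merge_sample` is hypothetical.

```shell
# Merge the read1/read2 pair found in one sample directory.
# ASSUMPTION: exactly one *_1.fastq.gz and one *_2.fastq.gz per directory.
merge_sample() {
  dir="$1"
  if ! command -v flash >/dev/null 2>&1; then
    echo "flash not found on PATH; skipping ${dir}" >&2
    return 1
  fi
  if [ ! -d "${dir}" ]; then
    echo "no such directory: ${dir}" >&2
    return 1
  fi
  # Run in a subshell so the caller's working directory is unchanged.
  # With its default output prefix, flash writes out.extendedFrags.fastq here.
  (cd "${dir}" && flash *_1.fastq.gz *_2.fastq.gz)
}

for sample in s1 s2 s3; do
  merge_sample "preprocessing/${sample}" || echo "skipped ${sample}"
done
```

Running each merge inside its own directory keeps the three default-named outputs from overwriting one another.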
Now each directory contains a file named out.extendedFrags.fastq.
Let us find out how many reads have the end of the template.
Reading the template in the forward direction, we expect to find: TTGTAATACGACTCAC
In the reverse direction, we expect to find: GTGAGTCGTATTACAA
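As a sanity check, the reverse-direction string should be the reverse complement of the forward one. A minimal sketch, assuming the standard `rev` and `tr` utilities are available (the variable name `rc` is mine):

```shell
fwd="TTGTAATACGACTCAC"
# Reverse the string, then complement each base (A<->T, C<->G).
rc=$(printf '%s\n' "${fwd}" | rev | tr 'ACGT' 'TGCA')
echo "${rc}"
## GTGAGTCGTATTACAA
```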
fwd="TTGTAATACGACTCAC"
rev="GTGAGTCGTATTACAA"
cd preprocessing/s1
reads=$(( $(wc -l out.extendedFrags.fastq | awk '{print $1}') / 4 ))
echo "There are ${reads} reads"
## There are 4,667,506 reads
fwd_reads=$(grep ${fwd} out.extendedFrags.fastq | wc -l)
echo $fwd_reads
## 2,285,256
rev_reads=$(grep ${rev} out.extendedFrags.fastq | wc -l)
echo $rev_reads
## 2,136,744
fwd_plus_rev=$(( ${fwd_reads} + ${rev_reads} ))
missing_reads=$(( ${reads} - ${fwd_plus_rev} ))
echo ${missing_reads}
## 245,506
cd ../s2
reads=$(( $(wc -l out.extendedFrags.fastq | awk '{print $1}') / 4 ))
echo "There are ${reads} reads"
## There are 1,847,635 reads
fwd_reads=$(grep ${fwd} out.extendedFrags.fastq | wc -l)
echo $fwd_reads
## 988,931
rev_reads=$(grep ${rev} out.extendedFrags.fastq | wc -l)
echo $rev_reads
## 747,696
fwd_plus_rev=$(( ${fwd_reads} + ${rev_reads} ))
missing_reads=$(( ${reads} - ${fwd_plus_rev} ))
echo ${missing_reads}
## 111,008
cd ../s3
reads=$(( $(wc -l out.extendedFrags.fastq | awk '{print $1}') / 4 ))
echo "There are ${reads} reads"
## There are 6,150,081 reads
fwd_reads=$(grep ${fwd} out.extendedFrags.fastq | wc -l)
echo $fwd_reads
## 2,688,628
rev_reads=$(grep ${rev} out.extendedFrags.fastq | wc -l)
echo $rev_reads
## 3,030,038
fwd_plus_rev=$(( ${fwd_reads} + ${rev_reads} ))
missing_reads=$(( ${reads} - ${fwd_plus_rev} ))
echo ${missing_reads}
## 431,415
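One caveat with the counts above: grep looks at every line of the FASTQ file, and a read that contains both strings is counted in both totals and therefore subtracted twice. A possible alternative, sketched here under the assumption of standard 4-line FASTQ records, classifies each sequence line exactly once; the function name `count_template_reads` is hypothetical.

```shell
fwd="TTGTAATACGACTCAC"
rev="GTGAGTCGTATTACAA"

# Print five numbers: total reads, forward-only, reverse-only, both, neither.
# ASSUMPTION: every record is exactly 4 lines, with the sequence on line 2.
count_template_reads() {
  awk -v f="${fwd}" -v r="${rev}" '
    NR % 4 == 2 {
      total++
      has_f = (index($0, f) > 0)
      has_r = (index($0, r) > 0)
      if (has_f && has_r) n_both++
      else if (has_f)     n_fwd++
      else if (has_r)     n_rev++
      else                n_none++
    }
    END { printf "%d %d %d %d %d\n", total, n_fwd, n_rev, n_both, n_none }
  ' "$1"
}

for sample in s1 s2 s3; do
  fq="preprocessing/${sample}/out.extendedFrags.fastq"
  if [ -f "${fq}" ]; then
    echo "${sample}: $(count_template_reads "${fq}")"
  fi
done
```

If the "both" column is zero and the other columns agree with the grep numbers, the simpler counts above can be taken at face value.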