1 Calculating error rates.

I wrote the function ‘create_matrices()’ to collect mutation counts. In principle, its results should be able to address most questions about the counts of mutations observed in the data.
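As a rough illustration of the kind of bookkeeping described here, the following is a minimal sketch, not the actual ‘create_matrices()’ code. It is written in Python purely for illustration (errRt itself is an R package), and the record layout, the helper names `filter_mutations()` and `count_by_position()`, and the toy data are all hypothetical. It assumes the position limits are inclusive and, as a simplification, counts reads per index among the mutated reads only, whereas the log below reports indexes gathered across all the data.

```python
from collections import Counter, defaultdict

# Hypothetical record layout: (index, position, mutation_type, ref_nt, hit_nt).
# The real input files and their columns are not shown in this document.
MUTATIONS = {
    "s1": [
        ("AAAC", 30, "mis", "A", "G"),
        ("AAAC", 30, "mis", "A", "G"),
        ("AAAC", 30, "mis", "A", "G"),
        ("AAAG", 12, "mis", "C", "T"),  # before min_position: dropped
        ("AAAT", 50, "mis", "T", "N"),  # 'N' as the hit: dropped
        ("AACA", 90, "ins", "", "C"),   # index seen only once: dropped
    ],
    "s2": [
        ("CCCA", 30, "mis", "A", "G"),
        ("CCCA", 30, "mis", "A", "G"),
        ("CCCA", 30, "mis", "A", "G"),
        ("CCCA", 100, "mis", "G", "A"),
    ],
}


def filter_mutations(rows, min_position=24, max_position=176, min_reads=3):
    """Apply position, 'N', and reads-per-index filters (assumed semantics)."""
    kept = [
        row for row in rows
        if min_position <= row[1] <= max_position and row[4] != "N"
    ]
    reads_per_index = Counter(row[0] for row in kept)
    return [row for row in kept if reads_per_index[row[0]] >= min_reads]


def count_by_position(samples, **filters):
    """Return {position: {sample: mismatch read count}} across all samples."""
    table = defaultdict(lambda: dict.fromkeys(samples, 0))
    for sample, rows in samples.items():
        for _index, position, kind, _ref, _hit in filter_mutations(rows, **filters):
            if kind == "mis":
                table[position][sample] += 1
    return dict(table)


matrix = count_by_position(MUTATIONS)
for position in sorted(matrix):
    print(position, matrix[position])
```

Each row of the resulting table is a position and each column a sample, which is the general shape suggested by names such as miss_reads_by_position in the output below.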

1.1 Categorize the data with at least 3 indexes per mutant

## Loading errRt
## Loading required package: dplyr
## Loading required package: tidyr
## Starting sample: 1.
## Reading the file containing mutations: preprocessing/s1/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s1/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 1156535 reads.
## Mutation data: after min-position pruning, there are: 1037310 reads: 119225 lost or 10.31%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1037310 reads.
## Mutation data: after max-position pruning, there are: 968161 reads: 69149 lost or 6.67%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 953181 reads: 14980 lost or 1.55%.
## Mutation data: all filters removed 203354 reads, or 17.58%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1742165 indexes in all the data.
## After reads/index pruning, there are: 837608 indexes: 904557 lost or 51.92%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 953181 changed reads.
## All data: before reads/index pruning, there are: 4681501 identical reads.
## All data: after index pruning, there are: 491995 changed reads: 51.62%.
## All data: after index pruning, there are: 3663004 identical reads: 78.24%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3663004 identical reads.
## Before classification, there are 491995 reads with mutations.
## After classification, there are 2738199 reads/indexes which are only identical.
## After classification, there are 11023 reads/indexes which are strictly sequencer.
## After classification, there are 26963 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 7018785 forward reads and 7148314 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 2.
## Reading the file containing mutations: preprocessing/s2/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s2/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 3421203 reads.
## Mutation data: after min-position pruning, there are: 1758479 reads: 1662724 lost or 48.60%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1758479 reads.
## Mutation data: after max-position pruning, there are: 1667302 reads: 91177 lost or 5.18%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1642969 reads: 24333 lost or 1.46%.
## Mutation data: all filters removed 1778234 reads, or 51.98%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1261478 indexes in all the data.
## After reads/index pruning, there are: 693725 indexes: 567753 lost or 45.01%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 1642969 changed reads.
## All data: before reads/index pruning, there are: 5230976 identical reads.
## All data: after index pruning, there are: 814407 changed reads: 49.57%.
## All data: after index pruning, there are: 4834092 identical reads: 92.41%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 4834092 identical reads.
## Before classification, there are 814407 reads with mutations.
## After classification, there are 2802107 reads/indexes which are only identical.
## After classification, there are 111708 reads/indexes which are strictly sequencer.
## After classification, there are 126921 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 11803361 forward reads and 12275547 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 3.
## Reading the file containing mutations: preprocessing/s3/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s3/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 4309681 reads.
## Mutation data: after min-position pruning, there are: 1564155 reads: 2745526 lost or 63.71%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1564155 reads.
## Mutation data: after max-position pruning, there are: 1482559 reads: 81596 lost or 5.22%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1452047 reads: 30512 lost or 2.06%.
## Mutation data: all filters removed 2857634 reads, or 66.31%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 884042 indexes in all the data.
## After reads/index pruning, there are: 463445 indexes: 420597 lost or 47.58%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 1452047 changed reads.
## All data: before reads/index pruning, there are: 3583390 identical reads.
## All data: after index pruning, there are: 730397 changed reads: 50.30%.
## All data: after index pruning, there are: 3332136 identical reads: 92.99%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3332136 identical reads.
## Before classification, there are 730397 reads with mutations.
## After classification, there are 1851177 reads/indexes which are only identical.
## After classification, there are 90341 reads/indexes which are strictly sequencer.
## After classification, there are 244494 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 9104237 forward reads and 9257103 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Making a matrix of miss_reads_by_position.
## Making a matrix of miss_indexes_by_position.
## Making a matrix of miss_sequencer_by_position.
## Making a matrix of miss_reads_by_string.
## Making a matrix of miss_indexes_by_string.
## Making a matrix of miss_sequencer_by_string.
## Making a matrix of miss_reads_by_ref_nt.
## Making a matrix of miss_indexes_by_ref_nt.
## Making a matrix of miss_sequencer_by_ref_nt.
## Making a matrix of miss_reads_by_hit_nt.
## Making a matrix of miss_indexes_by_hit_nt.
## Making a matrix of miss_sequencer_by_hit_nt.
## Making a matrix of miss_reads_by_type.
## Making a matrix of miss_indexes_by_type.
## Making a matrix of miss_sequencer_by_type.
## Making a matrix of miss_reads_by_trans.
## Making a matrix of miss_indexes_by_trans.
## Making a matrix of miss_sequencer_by_trans.
## Making a matrix of miss_reads_by_strength.
## Making a matrix of miss_indexes_by_strength.
## Making a matrix of miss_sequencer_by_strength.
## Making a matrix of insert_reads_by_position.
## Making a matrix of insert_indexes_by_position.
## Making a matrix of insert_sequencer_by_position.
## Making a matrix of insert_reads_by_nt.
## Making a matrix of insert_indexes_by_nt.
## Making a matrix of insert_sequencer_by_nt.
## Making a matrix of delete_reads_by_position.
## Making a matrix of delete_indexes_by_position.
## Making a matrix of delete_sequencer_by_position.
## Making a matrix of delete_reads_by_nt.
## Making a matrix of delete_indexes_by_nt.
## Making a matrix of delete_sequencer_by_nt.
## Skipping table: miss_reads_by_ref_nt
## Skipping table: miss_indexes_by_ref_nt
## Skipping table: miss_sequencer_by_ref_nt
## Skipping table: miss_reads_by_hit_nt
## Skipping table: miss_indexes_by_hit_nt
## Skipping table: miss_sequencer_by_hit_nt
## Skipping table: delete_reads_by_position
## Skipping table: delete_indexes_by_position
## Skipping table: delete_sequencer_by_position
## Skipping table: delete_reads_by_nt
## Skipping table: delete_indexes_by_nt
## Skipping table: delete_sequencer_by_nt
##                      Length Class  Mode   
## samples               3     -none- list   
## reads_per_sample      3     -none- numeric
## indexes_per_sample    3     -none- numeric
## matrices             33     -none- list   
## matrices_by_counts   33     -none- list   
## normalized           33     -none- list   
## normalized_by_counts 33     -none- list
## Starting sample: 1.
## Reading the file containing mutations: preprocessing/s1/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s1/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 1156535 reads.
## Mutation data: after min-position pruning, there are: 1037310 reads: 119225 lost or 10.31%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1037310 reads.
## Mutation data: after max-position pruning, there are: 968161 reads: 69149 lost or 6.67%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 953181 reads: 14980 lost or 1.55%.
## Mutation data: removing reads with greater than 10 mutations.
## Mutation data: after max_mutation pruning, there are: 799403 reads: 153778 lost or 16.13%.
## Mutation data: all filters removed 357132 reads, or 30.88%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1733789 indexes in all the data.
## After reads/index pruning, there are: 836838 indexes: 896951 lost or 51.73%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 799403 changed reads.
## All data: before reads/index pruning, there are: 4681501 identical reads.
## All data: after index pruning, there are: 441562 changed reads: 55.24%.
## All data: after index pruning, there are: 3661605 identical reads: 78.21%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3661605 identical reads.
## Before classification, there are 441562 reads with mutations.
## After classification, there are 2748736 reads/indexes which are only identical.
## After classification, there are 9916 reads/indexes which are strictly sequencer.
## After classification, there are 26403 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 7049093 forward reads and 7175885 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 2.
## Reading the file containing mutations: preprocessing/s2/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s2/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 3421203 reads.
## Mutation data: after min-position pruning, there are: 1758479 reads: 1662724 lost or 48.60%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1758479 reads.
## Mutation data: after max-position pruning, there are: 1667302 reads: 91177 lost or 5.18%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1642969 reads: 24333 lost or 1.46%.
## Mutation data: removing reads with greater than 10 mutations.
## Mutation data: after max_mutation pruning, there are: 1232741 reads: 410228 lost or 24.97%.
## Mutation data: all filters removed 2188462 reads, or 63.97%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1231605 indexes in all the data.
## After reads/index pruning, there are: 693381 indexes: 538224 lost or 43.70%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 1232741 changed reads.
## All data: before reads/index pruning, there are: 5230976 identical reads.
## All data: after index pruning, there are: 720963 changed reads: 58.48%.
## All data: after index pruning, there are: 4833605 identical reads: 92.40%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 4833605 identical reads.
## Before classification, there are 720963 reads with mutations.
## After classification, there are 2832509 reads/indexes which are only identical.
## After classification, there are 98387 reads/indexes which are strictly sequencer.
## After classification, there are 123178 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 11930745 forward reads and 12406826 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 3.
## Reading the file containing mutations: preprocessing/s3/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s3/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 4309681 reads.
## Mutation data: after min-position pruning, there are: 1564155 reads: 2745526 lost or 63.71%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1564155 reads.
## Mutation data: after max-position pruning, there are: 1482559 reads: 81596 lost or 5.22%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1452047 reads: 30512 lost or 2.06%.
## Mutation data: removing reads with greater than 10 mutations.
## Mutation data: after max_mutation pruning, there are: 1110089 reads: 341958 lost or 23.55%.
## Mutation data: all filters removed 3199592 reads, or 74.24%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 857851 indexes in all the data.
## After reads/index pruning, there are: 463161 indexes: 394690 lost or 46.01%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 1110089 changed reads.
## All data: before reads/index pruning, there are: 3583390 identical reads.
## All data: after index pruning, there are: 662025 changed reads: 59.64%.
## All data: after index pruning, there are: 3331914 identical reads: 92.98%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3331914 identical reads.
## Before classification, there are 662025 reads with mutations.
## After classification, there are 1873630 reads/indexes which are only identical.
## After classification, there are 79142 reads/indexes which are strictly sequencer.
## After classification, there are 237111 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 9205882 forward reads and 9355117 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Making a matrix of miss_reads_by_position.
## Making a matrix of miss_indexes_by_position.
## Making a matrix of miss_sequencer_by_position.
## Making a matrix of miss_reads_by_string.
## Making a matrix of miss_indexes_by_string.
## Making a matrix of miss_sequencer_by_string.
## Making a matrix of miss_reads_by_ref_nt.
## Making a matrix of miss_indexes_by_ref_nt.
## Making a matrix of miss_sequencer_by_ref_nt.
## Making a matrix of miss_reads_by_hit_nt.
## Making a matrix of miss_indexes_by_hit_nt.
## Making a matrix of miss_sequencer_by_hit_nt.
## Making a matrix of miss_reads_by_type.
## Making a matrix of miss_indexes_by_type.
## Making a matrix of miss_sequencer_by_type.
## Making a matrix of miss_reads_by_trans.
## Making a matrix of miss_indexes_by_trans.
## Making a matrix of miss_sequencer_by_trans.
## Making a matrix of miss_reads_by_strength.
## Making a matrix of miss_indexes_by_strength.
## Making a matrix of miss_sequencer_by_strength.
## Making a matrix of insert_reads_by_position.
## Making a matrix of insert_indexes_by_position.
## Making a matrix of insert_sequencer_by_position.
## Making a matrix of insert_reads_by_nt.
## Making a matrix of insert_indexes_by_nt.
## Making a matrix of insert_sequencer_by_nt.
## Making a matrix of delete_reads_by_position.
## Making a matrix of delete_indexes_by_position.
## Making a matrix of delete_sequencer_by_position.
## Making a matrix of delete_reads_by_nt.
## Making a matrix of delete_indexes_by_nt.
## Making a matrix of delete_sequencer_by_nt.
## Skipping table: miss_reads_by_ref_nt
## Skipping table: miss_indexes_by_ref_nt
## Skipping table: miss_sequencer_by_ref_nt
## Skipping table: miss_reads_by_hit_nt
## Skipping table: miss_indexes_by_hit_nt
## Skipping table: miss_sequencer_by_hit_nt
## Skipping table: delete_reads_by_position
## Skipping table: delete_indexes_by_position
## Skipping table: delete_sequencer_by_position
## Skipping table: delete_reads_by_nt
## Skipping table: delete_indexes_by_nt
## Skipping table: delete_sequencer_by_nt
##                      Length Class  Mode   
## samples               3     -none- list   
## reads_per_sample      3     -none- numeric
## indexes_per_sample    3     -none- numeric
## matrices             33     -none- list   
## matrices_by_counts   33     -none- list   
## normalized           33     -none- list   
## normalized_by_counts 33     -none- list
## Starting sample: 1.
## Reading the file containing mutations: preprocessing/s1/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s1/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 1156535 reads.
## Mutation data: after min-position pruning, there are: 1037310 reads: 119225 lost or 10.31%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1037310 reads.
## Mutation data: after max-position pruning, there are: 968161 reads: 69149 lost or 6.67%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 953181 reads: 14980 lost or 1.55%.
## Mutation data: removing reads with greater than 5 mutations.
## Mutation data: after max_mutation pruning, there are: 608429 reads: 344752 lost or 36.17%.
## Mutation data: all filters removed 548106 reads, or 47.39%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1713933 indexes in all the data.
## After reads/index pruning, there are: 834821 indexes: 879112 lost or 51.29%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 608429 changed reads.
## All data: before reads/index pruning, there are: 4681501 identical reads.
## All data: after index pruning, there are: 379603 changed reads: 62.39%.
## All data: after index pruning, there are: 3657910 identical reads: 78.14%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3657910 identical reads.
## Before classification, there are 379603 reads with mutations.
## After classification, there are 2777271 reads/indexes which are only identical.
## After classification, there are 8544 reads/indexes which are strictly sequencer.
## After classification, there are 25485 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 7127863 forward reads and 7254038 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 2.
## Reading the file containing mutations: preprocessing/s2/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s2/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 3421203 reads.
## Mutation data: after min-position pruning, there are: 1758479 reads: 1662724 lost or 48.60%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1758479 reads.
## Mutation data: after max-position pruning, there are: 1667302 reads: 91177 lost or 5.18%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1642969 reads: 24333 lost or 1.46%.
## Mutation data: removing reads with greater than 5 mutations.
## Mutation data: after max_mutation pruning, there are: 807185 reads: 835784 lost or 50.87%.
## Mutation data: all filters removed 2614018 reads, or 76.41%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1179116 indexes in all the data.
## After reads/index pruning, there are: 692307 indexes: 486809 lost or 41.29%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 807185 changed reads.
## All data: before reads/index pruning, there are: 5230976 identical reads.
## All data: after index pruning, there are: 585835 changed reads: 72.58%.
## All data: after index pruning, there are: 4832196 identical reads: 92.38%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 4832196 identical reads.
## Before classification, there are 585835 reads with mutations.
## After classification, there are 2934376 reads/indexes which are only identical.
## After classification, there are 79902 reads/indexes which are strictly sequencer.
## After classification, there are 116271 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 12365004 forward reads and 12844113 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 3.
## Reading the file containing mutations: preprocessing/s3/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s3/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 4309681 reads.
## Mutation data: after min-position pruning, there are: 1564155 reads: 2745526 lost or 63.71%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1564155 reads.
## Mutation data: after max-position pruning, there are: 1482559 reads: 81596 lost or 5.22%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1452047 reads: 30512 lost or 2.06%.
## Mutation data: removing reads with greater than 5 mutations.
## Mutation data: after max_mutation pruning, there are: 746662 reads: 705385 lost or 48.58%.
## Mutation data: all filters removed 3563019 reads, or 82.67%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 808995 indexes in all the data.
## After reads/index pruning, there are: 461997 indexes: 346998 lost or 42.89%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 746662 changed reads.
## All data: before reads/index pruning, there are: 3583390 identical reads.
## All data: after index pruning, there are: 555226 changed reads: 74.36%.
## All data: after index pruning, there are: 3330970 identical reads: 92.96%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3330970 identical reads.
## Before classification, there are 555226 reads with mutations.
## After classification, there are 1957637 reads/indexes which are only identical.
## After classification, there are 63014 reads/indexes which are strictly sequencer.
## After classification, there are 223250 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 9578873 forward reads and 9724531 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Making a matrix of miss_reads_by_position.
## Making a matrix of miss_indexes_by_position.
## Making a matrix of miss_sequencer_by_position.
## Making a matrix of miss_reads_by_string.
## Making a matrix of miss_indexes_by_string.
## Making a matrix of miss_sequencer_by_string.
## Making a matrix of miss_reads_by_ref_nt.
## Making a matrix of miss_indexes_by_ref_nt.
## Making a matrix of miss_sequencer_by_ref_nt.
## Making a matrix of miss_reads_by_hit_nt.
## Making a matrix of miss_indexes_by_hit_nt.
## Making a matrix of miss_sequencer_by_hit_nt.
## Making a matrix of miss_reads_by_type.
## Making a matrix of miss_indexes_by_type.
## Making a matrix of miss_sequencer_by_type.
## Making a matrix of miss_reads_by_trans.
## Making a matrix of miss_indexes_by_trans.
## Making a matrix of miss_sequencer_by_trans.
## Making a matrix of miss_reads_by_strength.
## Making a matrix of miss_indexes_by_strength.
## Making a matrix of miss_sequencer_by_strength.
## Making a matrix of insert_reads_by_position.
## Making a matrix of insert_indexes_by_position.
## Making a matrix of insert_sequencer_by_position.
## Making a matrix of insert_reads_by_nt.
## Making a matrix of insert_indexes_by_nt.
## Making a matrix of insert_sequencer_by_nt.
## Making a matrix of delete_reads_by_position.
## Making a matrix of delete_indexes_by_position.
## Making a matrix of delete_sequencer_by_position.
## Making a matrix of delete_reads_by_nt.
## Making a matrix of delete_indexes_by_nt.
## Making a matrix of delete_sequencer_by_nt.
## Skipping table: miss_reads_by_ref_nt
## Skipping table: miss_indexes_by_ref_nt
## Skipping table: miss_sequencer_by_ref_nt
## Skipping table: miss_reads_by_hit_nt
## Skipping table: miss_indexes_by_hit_nt
## Skipping table: miss_sequencer_by_hit_nt
## Skipping table: delete_reads_by_position
## Skipping table: delete_indexes_by_position
## Skipping table: delete_sequencer_by_position
## Skipping table: delete_reads_by_nt
## Skipping table: delete_indexes_by_nt
## Skipping table: delete_sequencer_by_nt
##                      Length Class  Mode   
## samples               3     -none- list   
## reads_per_sample      3     -none- numeric
## indexes_per_sample    3     -none- numeric
## matrices             33     -none- list   
## matrices_by_counts   33     -none- list   
## normalized           33     -none- list   
## normalized_by_counts 33     -none- list
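The percentages in the log lines above can be checked by hand. The sketch below (Python, for illustration only; the helper name `percent()` is hypothetical) uses the sample 1 numbers from the first run in this section. It suggests that each "lost" percentage is relative to the read count entering that step, that the "all filters" percentage is relative to the starting count, and that the percentages on the "after index pruning" lines are the fraction retained rather than lost.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Sample 1, first run (no max-mutation filter); numbers copied from the log.
start = 1156535
after_min_position = 1037310
after_max_position = 968161
after_n = 953181

lost_min = percent(start - after_min_position, start)
lost_max = percent(after_min_position - after_max_position, after_min_position)
lost_n = percent(after_max_position - after_n, after_max_position)
lost_all = percent(start - after_n, start)

# The index-pruning lines report what remains, not what was removed.
kept_changed = percent(491995, after_n)
kept_identical = percent(3663004, 4681501)

print(lost_min, lost_max, lost_n, lost_all)  # 10.31 6.67 1.55 17.58
print(kept_changed, kept_identical)          # 51.62 78.24
```

These values match the log, so when comparing runs it is worth keeping in mind that the "lost" and "after index pruning" percentages point in opposite directions.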

1.2 Categorize the data with at least 5 indexes per mutant

## Starting sample: 1.
## Reading the file containing mutations: preprocessing/s1/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s1/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 1156535 reads.
## Mutation data: after min-position pruning, there are: 1037310 reads: 119225 lost or 10.31%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1037310 reads.
## Mutation data: after max-position pruning, there are: 968161 reads: 69149 lost or 6.67%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 953181 reads: 14980 lost or 1.55%.
## Mutation data: all filters removed 203354 reads, or 17.58%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1742165 indexes in all the data.
## After reads/index pruning, there are: 837608 indexes: 904557 lost or 51.92%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 953181 changed reads.
## All data: before reads/index pruning, there are: 4681501 identical reads.
## All data: after index pruning, there are: 491995 changed reads: 51.62%.
## All data: after index pruning, there are: 3663004 identical reads: 78.24%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3663004 identical reads.
## Before classification, there are 491995 reads with mutations.
## After classification, there are 2738199 reads/indexes which are only identical.
## After classification, there are 11023 reads/indexes which are strictly sequencer.
## After classification, there are 26963 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 7018785 forward reads and 7148314 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 2.
## Reading the file containing mutations: preprocessing/s2/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s2/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 3421203 reads.
## Mutation data: after min-position pruning, there are: 1758479 reads: 1662724 lost or 48.60%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1758479 reads.
## Mutation data: after max-position pruning, there are: 1667302 reads: 91177 lost or 5.18%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1642969 reads: 24333 lost or 1.46%.
## Mutation data: all filters removed 1778234 reads, or 51.98%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1261478 indexes in all the data.
## After reads/index pruning, there are: 693725 indexes: 567753 lost or 45.01%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 1642969 changed reads.
## All data: before reads/index pruning, there are: 5230976 identical reads.
## All data: after index pruning, there are: 814407 changed reads: 49.57%.
## All data: after index pruning, there are: 4834092 identical reads: 92.41%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 4834092 identical reads.
## Before classification, there are 814407 reads with mutations.
## After classification, there are 2802107 reads/indexes which are only identical.
## After classification, there are 111708 reads/indexes which are strictly sequencer.
## After classification, there are 126921 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 11803361 forward reads and 12275547 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 3.
## Reading the file containing mutations: preprocessing/s3/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s3/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 4309681 reads.
## Mutation data: after min-position pruning, there are: 1564155 reads: 2745526 lost or 63.71%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1564155 reads.
## Mutation data: after max-position pruning, there are: 1482559 reads: 81596 lost or 5.22%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1452047 reads: 30512 lost or 2.06%.
## Mutation data: all filters removed 2857634 reads, or 66.31%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 884042 indexes in all the data.
## After reads/index pruning, there are: 463445 indexes: 420597 lost or 47.58%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 1452047 changed reads.
## All data: before reads/index pruning, there are: 3583390 identical reads.
## All data: after index pruning, there are: 730397 changed reads: 50.30%.
## All data: after index pruning, there are: 3332136 identical reads: 92.99%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3332136 identical reads.
## Before classification, there are 730397 reads with mutations.
## After classification, there are 1851177 reads/indexes which are only identical.
## After classification, there are 90341 reads/indexes which are strictly sequencer.
## After classification, there are 244494 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 9104237 forward reads and 9257103 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Making a matrix of miss_reads_by_position.
## Making a matrix of miss_indexes_by_position.
## Making a matrix of miss_sequencer_by_position.
## Making a matrix of miss_reads_by_string.
## Making a matrix of miss_indexes_by_string.
## Making a matrix of miss_sequencer_by_string.
## Making a matrix of miss_reads_by_ref_nt.
## Making a matrix of miss_indexes_by_ref_nt.
## Making a matrix of miss_sequencer_by_ref_nt.
## Making a matrix of miss_reads_by_hit_nt.
## Making a matrix of miss_indexes_by_hit_nt.
## Making a matrix of miss_sequencer_by_hit_nt.
## Making a matrix of miss_reads_by_type.
## Making a matrix of miss_indexes_by_type.
## Making a matrix of miss_sequencer_by_type.
## Making a matrix of miss_reads_by_trans.
## Making a matrix of miss_indexes_by_trans.
## Making a matrix of miss_sequencer_by_trans.
## Making a matrix of miss_reads_by_strength.
## Making a matrix of miss_indexes_by_strength.
## Making a matrix of miss_sequencer_by_strength.
## Making a matrix of insert_reads_by_position.
## Making a matrix of insert_indexes_by_position.
## Making a matrix of insert_sequencer_by_position.
## Making a matrix of insert_reads_by_nt.
## Making a matrix of insert_indexes_by_nt.
## Making a matrix of insert_sequencer_by_nt.
## Making a matrix of delete_reads_by_position.
## Making a matrix of delete_indexes_by_position.
## Making a matrix of delete_sequencer_by_position.
## Making a matrix of delete_reads_by_nt.
## Making a matrix of delete_indexes_by_nt.
## Making a matrix of delete_sequencer_by_nt.
## Skipping table: miss_reads_by_ref_nt
## Skipping table: miss_indexes_by_ref_nt
## Skipping table: miss_sequencer_by_ref_nt
## Skipping table: miss_reads_by_hit_nt
## Skipping table: miss_indexes_by_hit_nt
## Skipping table: miss_sequencer_by_hit_nt
## Skipping table: delete_reads_by_position
## Skipping table: delete_indexes_by_position
## Skipping table: delete_sequencer_by_position
## Skipping table: delete_reads_by_nt
## Skipping table: delete_indexes_by_nt
## Skipping table: delete_sequencer_by_nt
##                      Length Class  Mode   
## samples               3     -none- list   
## reads_per_sample      3     -none- numeric
## indexes_per_sample    3     -none- numeric
## matrices             33     -none- list   
## matrices_by_counts   33     -none- list   
## normalized           33     -none- list   
## normalized_by_counts 33     -none- list
## Starting sample: 1.
## Reading the file containing mutations: preprocessing/s1/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s1/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 1156535 reads.
## Mutation data: after min-position pruning, there are: 1037310 reads: 119225 lost or 10.31%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1037310 reads.
## Mutation data: after max-position pruning, there are: 968161 reads: 69149 lost or 6.67%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 953181 reads: 14980 lost or 1.55%.
## Mutation data: removing reads with greater than 10 mutations.
## Mutation data: after max_mutation pruning, there are: 799403 reads: 153778 lost or 16.13%.
## Mutation data: all filters removed 357132 reads, or 30.88%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1733789 indexes in all the data.
## After reads/index pruning, there are: 836838 indexes: 896951 lost or 51.73%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 799403 changed reads.
## All data: before reads/index pruning, there are: 4681501 identical reads.
## All data: after index pruning, there are: 441562 changed reads: 55.24%.
## All data: after index pruning, there are: 3661605 identical reads: 78.21%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3661605 identical reads.
## Before classification, there are 441562 reads with mutations.
## After classification, there are 2748736 reads/indexes which are only identical.
## After classification, there are 9916 reads/indexes which are strictly sequencer.
## After classification, there are 26403 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 7049093 forward reads and 7175885 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 2.
## Reading the file containing mutations: preprocessing/s2/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s2/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 3421203 reads.
## Mutation data: after min-position pruning, there are: 1758479 reads: 1662724 lost or 48.60%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1758479 reads.
## Mutation data: after max-position pruning, there are: 1667302 reads: 91177 lost or 5.18%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1642969 reads: 24333 lost or 1.46%.
## Mutation data: removing reads with greater than 10 mutations.
## Mutation data: after max_mutation pruning, there are: 1232741 reads: 410228 lost or 24.97%.
## Mutation data: all filters removed 2188462 reads, or 63.97%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1231605 indexes in all the data.
## After reads/index pruning, there are: 693381 indexes: 538224 lost or 43.70%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 1232741 changed reads.
## All data: before reads/index pruning, there are: 5230976 identical reads.
## All data: after index pruning, there are: 720963 changed reads: 58.48%.
## All data: after index pruning, there are: 4833605 identical reads: 92.40%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 4833605 identical reads.
## Before classification, there are 720963 reads with mutations.
## After classification, there are 2832509 reads/indexes which are only identical.
## After classification, there are 98387 reads/indexes which are strictly sequencer.
## After classification, there are 123178 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 11930745 forward reads and 12406826 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 3.
## Reading the file containing mutations: preprocessing/s3/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s3/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 4309681 reads.
## Mutation data: after min-position pruning, there are: 1564155 reads: 2745526 lost or 63.71%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1564155 reads.
## Mutation data: after max-position pruning, there are: 1482559 reads: 81596 lost or 5.22%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1452047 reads: 30512 lost or 2.06%.
## Mutation data: removing reads with greater than 10 mutations.
## Mutation data: after max_mutation pruning, there are: 1110089 reads: 341958 lost or 23.55%.
## Mutation data: all filters removed 3199592 reads, or 74.24%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 857851 indexes in all the data.
## After reads/index pruning, there are: 463161 indexes: 394690 lost or 46.01%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 1110089 changed reads.
## All data: before reads/index pruning, there are: 3583390 identical reads.
## All data: after index pruning, there are: 662025 changed reads: 59.64%.
## All data: after index pruning, there are: 3331914 identical reads: 92.98%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3331914 identical reads.
## Before classification, there are 662025 reads with mutations.
## After classification, there are 1873630 reads/indexes which are only identical.
## After classification, there are 79142 reads/indexes which are strictly sequencer.
## After classification, there are 237111 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 9205882 forward reads and 9355117 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Making a matrix of miss_reads_by_position.
## Making a matrix of miss_indexes_by_position.
## Making a matrix of miss_sequencer_by_position.
## Making a matrix of miss_reads_by_string.
## Making a matrix of miss_indexes_by_string.
## Making a matrix of miss_sequencer_by_string.
## Making a matrix of miss_reads_by_ref_nt.
## Making a matrix of miss_indexes_by_ref_nt.
## Making a matrix of miss_sequencer_by_ref_nt.
## Making a matrix of miss_reads_by_hit_nt.
## Making a matrix of miss_indexes_by_hit_nt.
## Making a matrix of miss_sequencer_by_hit_nt.
## Making a matrix of miss_reads_by_type.
## Making a matrix of miss_indexes_by_type.
## Making a matrix of miss_sequencer_by_type.
## Making a matrix of miss_reads_by_trans.
## Making a matrix of miss_indexes_by_trans.
## Making a matrix of miss_sequencer_by_trans.
## Making a matrix of miss_reads_by_strength.
## Making a matrix of miss_indexes_by_strength.
## Making a matrix of miss_sequencer_by_strength.
## Making a matrix of insert_reads_by_position.
## Making a matrix of insert_indexes_by_position.
## Making a matrix of insert_sequencer_by_position.
## Making a matrix of insert_reads_by_nt.
## Making a matrix of insert_indexes_by_nt.
## Making a matrix of insert_sequencer_by_nt.
## Making a matrix of delete_reads_by_position.
## Making a matrix of delete_indexes_by_position.
## Making a matrix of delete_sequencer_by_position.
## Making a matrix of delete_reads_by_nt.
## Making a matrix of delete_indexes_by_nt.
## Making a matrix of delete_sequencer_by_nt.
## Skipping table: miss_reads_by_ref_nt
## Skipping table: miss_indexes_by_ref_nt
## Skipping table: miss_sequencer_by_ref_nt
## Skipping table: miss_reads_by_hit_nt
## Skipping table: miss_indexes_by_hit_nt
## Skipping table: miss_sequencer_by_hit_nt
## Skipping table: delete_reads_by_position
## Skipping table: delete_indexes_by_position
## Skipping table: delete_sequencer_by_position
## Skipping table: delete_reads_by_nt
## Skipping table: delete_indexes_by_nt
## Skipping table: delete_sequencer_by_nt
##                      Length Class  Mode   
## samples               3     -none- list   
## reads_per_sample      3     -none- numeric
## indexes_per_sample    3     -none- numeric
## matrices             33     -none- list   
## matrices_by_counts   33     -none- list   
## normalized           33     -none- list   
## normalized_by_counts 33     -none- list
## Starting sample: 1.
## Reading the file containing mutations: preprocessing/s1/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s1/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 1156535 reads.
## Mutation data: after min-position pruning, there are: 1037310 reads: 119225 lost or 10.31%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1037310 reads.
## Mutation data: after max-position pruning, there are: 968161 reads: 69149 lost or 6.67%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 953181 reads: 14980 lost or 1.55%.
## Mutation data: removing reads with greater than 5 mutations.
## Mutation data: after max_mutation pruning, there are: 608429 reads: 344752 lost or 36.17%.
## Mutation data: all filters removed 548106 reads, or 47.39%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1713933 indexes in all the data.
## After reads/index pruning, there are: 834821 indexes: 879112 lost or 51.29%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 608429 changed reads.
## All data: before reads/index pruning, there are: 4681501 identical reads.
## All data: after index pruning, there are: 379603 changed reads: 62.39%.
## All data: after index pruning, there are: 3657910 identical reads: 78.14%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3657910 identical reads.
## Before classification, there are 379603 reads with mutations.
## After classification, there are 2777271 reads/indexes which are only identical.
## After classification, there are 8544 reads/indexes which are strictly sequencer.
## After classification, there are 25485 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 7127863 forward reads and 7254038 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 2.
## Reading the file containing mutations: preprocessing/s2/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s2/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 3421203 reads.
## Mutation data: after min-position pruning, there are: 1758479 reads: 1662724 lost or 48.60%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1758479 reads.
## Mutation data: after max-position pruning, there are: 1667302 reads: 91177 lost or 5.18%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1642969 reads: 24333 lost or 1.46%.
## Mutation data: removing reads with greater than 5 mutations.
## Mutation data: after max_mutation pruning, there are: 807185 reads: 835784 lost or 50.87%.
## Mutation data: all filters removed 2614018 reads, or 76.41%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 1179116 indexes in all the data.
## After reads/index pruning, there are: 692307 indexes: 486809 lost or 41.29%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 807185 changed reads.
## All data: before reads/index pruning, there are: 5230976 identical reads.
## All data: after index pruning, there are: 585835 changed reads: 72.58%.
## All data: after index pruning, there are: 4832196 identical reads: 92.38%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 4832196 identical reads.
## Before classification, there are 585835 reads with mutations.
## After classification, there are 2934376 reads/indexes which are only identical.
## After classification, there are 79902 reads/indexes which are strictly sequencer.
## After classification, there are 116271 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 12365004 forward reads and 12844113 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: 3.
## Reading the file containing mutations: preprocessing/s3/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s3/step2_identical_reads.txt.xz
## Mutation data: removing any differences before position: 24.
## Mutation data: before pruning, there are: 4309681 reads.
## Mutation data: after min-position pruning, there are: 1564155 reads: 2745526 lost or 63.71%.
## Mutation data: removing any differences after position: 176.
## Mutation data: before pruning, there are: 1564155 reads.
## Mutation data: after max-position pruning, there are: 1482559 reads: 81596 lost or 5.22%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 1452047 reads: 30512 lost or 2.06%.
## Mutation data: removing reads with greater than 5 mutations.
## Mutation data: after max_mutation pruning, there are: 746662 reads: 705385 lost or 48.58%.
## Mutation data: all filters removed 3563019 reads, or 82.67%.
## All data: gathering information about the indexes observed, this is slow.
## Before reads/index pruning, there are: 808995 indexes in all the data.
## After reads/index pruning, there are: 461997 indexes: 346998 lost or 42.89%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 746662 changed reads.
## All data: before reads/index pruning, there are: 3583390 identical reads.
## All data: after index pruning, there are: 555226 changed reads: 74.36%.
## All data: after index pruning, there are: 3330970 identical reads: 92.96%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 3330970 identical reads.
## Before classification, there are 555226 reads with mutations.
## After classification, there are 1957637 reads/indexes which are only identical.
## After classification, there are 63014 reads/indexes which are strictly sequencer.
## After classification, there are 223250 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 9578873 forward reads and 9724531 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Making a matrix of miss_reads_by_position.
## Making a matrix of miss_indexes_by_position.
## Making a matrix of miss_sequencer_by_position.
## Making a matrix of miss_reads_by_string.
## Making a matrix of miss_indexes_by_string.
## Making a matrix of miss_sequencer_by_string.
## Making a matrix of miss_reads_by_ref_nt.
## Making a matrix of miss_indexes_by_ref_nt.
## Making a matrix of miss_sequencer_by_ref_nt.
## Making a matrix of miss_reads_by_hit_nt.
## Making a matrix of miss_indexes_by_hit_nt.
## Making a matrix of miss_sequencer_by_hit_nt.
## Making a matrix of miss_reads_by_type.
## Making a matrix of miss_indexes_by_type.
## Making a matrix of miss_sequencer_by_type.
## Making a matrix of miss_reads_by_trans.
## Making a matrix of miss_indexes_by_trans.
## Making a matrix of miss_sequencer_by_trans.
## Making a matrix of miss_reads_by_strength.
## Making a matrix of miss_indexes_by_strength.
## Making a matrix of miss_sequencer_by_strength.
## Making a matrix of insert_reads_by_position.
## Making a matrix of insert_indexes_by_position.
## Making a matrix of insert_sequencer_by_position.
## Making a matrix of insert_reads_by_nt.
## Making a matrix of insert_indexes_by_nt.
## Making a matrix of insert_sequencer_by_nt.
## Making a matrix of delete_reads_by_position.
## Making a matrix of delete_indexes_by_position.
## Making a matrix of delete_sequencer_by_position.
## Making a matrix of delete_reads_by_nt.
## Making a matrix of delete_indexes_by_nt.
## Making a matrix of delete_sequencer_by_nt.
## Skipping table: miss_reads_by_ref_nt
## Skipping table: miss_indexes_by_ref_nt
## Skipping table: miss_sequencer_by_ref_nt
## Skipping table: miss_reads_by_hit_nt
## Skipping table: miss_indexes_by_hit_nt
## Skipping table: miss_sequencer_by_hit_nt
## Skipping table: delete_reads_by_position
## Skipping table: delete_indexes_by_position
## Skipping table: delete_sequencer_by_position
## Skipping table: delete_reads_by_nt
## Skipping table: delete_indexes_by_nt
## Skipping table: delete_sequencer_by_nt
##                      Length Class  Mode   
## samples               3     -none- list   
## reads_per_sample      3     -none- numeric
## indexes_per_sample    3     -none- numeric
## matrices             33     -none- list   
## matrices_by_counts   33     -none- list   
## normalized           33     -none- list   
## normalized_by_counts 33     -none- list

2 Questions from Dr. DeStefano

I think what is best is to get the number of recovered mutations of each type from each data set. That would be A to T, A to G, A to C; T to A, T to G, T to C; G to A, G to C, G to T; and C to A, C to G, C to T; as well as deletions and insertions. I would then need the total number of reads that met all our criteria (i.e. at least 3 good recovered reads for that 14 nt index). Each set of 3 or more would count as “1” read of that particular index, so I would need the total with this in mind. I also need to know the total number of nucleotides that were in the region we decided to consider in the analysis. We may want to try this for 3 or more and 5 or more recovered indexes if it is not hard. This information does not include specific positions on the template where errors occurred, but we can look at that later. Right now I just want to get a general error rate and type of error. It would basically be calculated by dividing the number of recovered mutations of a particular type by the total number of reads times the number of nucleotides screened in the template. As it ends up, this number does not really have a lot of meaning, but it can be used to calculate the overall mutation rate as well as the rates for transversions, transitions, deletions, and insertions.
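The calculation requested here can be sketched in a few lines of R. This is only an illustration: error_rate() is a hypothetical helper, not a function from the package used in this worksheet, and the window size assumes that positions 24 through 176 (the limits reported in the log output) are both included in the screened region. The counts passed in at the end are toy numbers, not results.

```r
# Hypothetical helper: mutations of one type divided by
# (number of qualifying indexes x nucleotides screened per index).
error_rate <- function(n_mutations, n_indexes, n_nt) {
  n_mutations / (n_indexes * n_nt)
}

# Assumption: the screened window runs from position 24 to 176 inclusive.
window_nt <- 176 - 24 + 1

# Toy input values, chosen only to show the arithmetic.
toy_rate <- error_rate(n_mutations = 153, n_indexes = 1000, n_nt = window_nt)
print(toy_rate)
```

Summing the per-type rates gives an overall rate; summing only the transition or transversion types gives the rate for that class.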

3 Answers

To address those queries, I invoked create_matrices() with minimum index counts of 3 and 5. Note that this is not the same as requiring 3 or 5 reads per index; in both cases I require 3 reads per index.
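The difference between the two thresholds can be shown with a toy example. The sketch below is not the code inside create_matrices(); filter_mutations(), the column names, and the data are all hypothetical, and it assumes for simplicity that every read within an index carries the same mutation.

```r
# Hypothetical toy data: one row per read, with its index and its mutation.
reads <- data.frame(
  index = c("i1", "i1", "i1", "i2", "i2", "i2", "i3", "i3",
            "i4", "i4", "i4", "i5", "i5", "i5"),
  mutation = c(rep("50_A_G", 11), rep("77_C_T", 3)),
  stringsAsFactors = FALSE
)

filter_mutations <- function(reads, min_reads = 3, min_indexes = 3) {
  # Threshold 1: keep only indexes supported by at least min_reads reads.
  reads_per_index <- table(reads$index)
  good_indexes <- names(reads_per_index)[reads_per_index >= min_reads]
  kept <- reads[reads$index %in% good_indexes, ]
  # Each surviving index counts once per mutation, however many reads it has.
  per_index <- unique(kept[, c("index", "mutation")])
  # Threshold 2: keep only mutations seen in at least min_indexes indexes.
  indexes_per_mutation <- table(per_index$mutation)
  indexes_per_mutation[indexes_per_mutation >= min_indexes]
}

at_least_3 <- filter_mutations(reads, min_reads = 3, min_indexes = 3)
at_least_5 <- filter_mutations(reads, min_reads = 3, min_indexes = 5)
print(at_least_3)
print(at_least_5)
```

In this toy set, index i3 is dropped by the reads-per-index threshold in both calls; only the second threshold changes between them.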

3.1 Recovered mutations of each type

I am interpreting this question as the number of indexes recovered for each mutation type. I collect this information in two ways: the indexes by type which are deemed to come from the RT, and those deemed to come from the sequencer. In addition, I calculate a normalized (cpm) version of this information, which may be used to look for changes across samples.

3.1.1 Mutations by RT index

The following block should print out tables of the numbers of mutant indexes observed for each type for the RT and the sequencer. One would hope that the sequencer will be consistent for all samples, but I think the results will instead suggest that my metric is not yet stringent enough.

s1 s2 s3
A_C 1226 4078 8324
A_G 687 14428 50666
A_T 212 2050 4514
C_A 9115 28661 33332
C_G 329 3690 9533
C_T 2108 17340 59479
G_A 1617 29449 35634
G_C 268 1549 2843
G_T 9304 11694 14377
T_A 178 4752 7492
T_C 805 3995 8312
T_G 1044 5090 9203
s1 s2 s3
A_C 1216 4078 8324
A_G 675 14428 50666
A_T 202 2050 4514
C_A 9115 28661 33332
C_G 305 3686 9533
C_T 2104 17340 59479
G_A 1613 29449 35634
G_C 243 1545 2839
G_T 9304 11694 14377
T_A 161 4752 7492
T_C 797 3995 8312
T_G 1044 5084 9203
s1 s2 s3
A_C 2265 17215 14189
A_G 623 7106 5588
A_T 170 2701 2256
C_A 1163 14161 11148
C_G 561 5632 4067
C_T 695 7037 5431
G_A 519 4839 3979
G_C 452 6119 5518
G_T 966 8307 6799
T_A 372 3370 2702
T_C 916 9795 8136
T_G 2227 25324 20436
s1 s2 s3
A_C 2258 17215 14189
A_G 623 7106 5588
A_T 148 2701 2256
C_A 1139 14161 11148
C_G 551 5632 4067
C_T 676 7037 5431
G_A 508 4839 3979
G_C 427 6119 5518
G_T 954 8307 6799
T_A 351 3370 2702
T_C 912 9795 8136
T_G 2214 25324 20436
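
Since the original request also asks about transitions and transversions, the twelve mutation types can be collapsed into those two classes. The sketch below copies the first of the four count tables above into a matrix and sums rows by class; the variable names are illustrative only, and the worksheet's own miss_*_by_trans matrices may be built differently.

```r
# Counts copied from the first table above (rows: ref_hit, columns: samples).
counts <- matrix(
  c(1226,  4078,  8324,
     687, 14428, 50666,
     212,  2050,  4514,
    9115, 28661, 33332,
     329,  3690,  9533,
    2108, 17340, 59479,
    1617, 29449, 35634,
     268,  1549,  2843,
    9304, 11694, 14377,
     178,  4752,  7492,
     805,  3995,  8312,
    1044,  5090,  9203),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("A_C", "A_G", "A_T", "C_A", "C_G", "C_T",
      "G_A", "G_C", "G_T", "T_A", "T_C", "T_G"),
    c("s1", "s2", "s3"))
)

# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
transitions <- c("A_G", "G_A", "C_T", "T_C")
mutation_class <- ifelse(rownames(counts) %in% transitions,
                         "transition", "transversion")

# Sum the rows belonging to each class, separately for every sample.
by_class <- rowsum(counts, group = mutation_class)
print(by_class)
```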

Plots of this information

This suggests to me that this information needs to be normalized in some more sensible fashion. Thus the following:

3.1.2 Mutations by RT index, post normalization

The same numbers may be expressed relative to the number of indexes observed per sample and/or as cpm across samples. In the first instance one can look at the apparent error rate for each sample; in the second, one may look for relative changes in apparent error rate across samples.
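As a sketch of the cpm step, each column of a count matrix can be divided by its own total and multiplied by one million. The worksheet does not show how create_matrices() defines library size, so the use of column totals is an assumption. Applied to the first count table in 3.1.1, it reproduces values such as 45588 for A_C in s1.

```r
# Counts copied from the first table in 3.1.1.
counts <- matrix(
  c(1226,  4078,  8324,
     687, 14428, 50666,
     212,  2050,  4514,
    9115, 28661, 33332,
     329,  3690,  9533,
    2108, 17340, 59479,
    1617, 29449, 35634,
     268,  1549,  2843,
    9304, 11694, 14377,
     178,  4752,  7492,
     805,  3995,  8312,
    1044,  5090,  9203),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("A_C", "A_G", "A_T", "C_A", "C_G", "C_T",
      "G_A", "G_C", "G_T", "T_A", "T_C", "T_G"),
    c("s1", "s2", "s3"))
)

# Assumption: the "library size" of a sample is the column total of this table.
library_sizes <- colSums(counts)

# Counts per million: scale every column so that it sums to 1e6.
cpm <- sweep(counts, 2, library_sizes, "/") * 1e6
print(round(cpm))
```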

3.1.2.1 Rewriting the matrices as cpm to account for library sizes.

s1 s2 s3
A_C 45588 32167 34155
A_G 25546 113807 207895
A_T 7883 16170 18522
C_A 338936 226076 136770
C_G 12234 29106 39116
C_T 78385 136777 244057
G_A 60127 232292 146215
G_C 9965 12218 11666
G_T 345964 92241 58992
T_A 6619 37483 30742
T_C 29933 31512 34106
T_G 38821 40150 37762
s1 s2 s3
A_C 45409 32171 34156
A_G 25206 113820 207899
A_T 7543 16172 18522
C_A 340379 226101 136772
C_G 11390 29078 39117
C_T 78569 136792 244061
G_A 60234 232317 146218
G_C 9074 12188 11649
G_T 347436 92252 58993
T_A 6012 37488 30742
T_C 29762 31516 34107
T_G 38986 40107 37763
s1 s2 s3
A_C 207247 154248 157221
A_G 57004 63670 61918
A_T 15555 24201 24998
C_A 106414 126884 123525
C_G 51331 50463 45064
C_T 63592 63052 60178
G_A 47488 43358 44089
G_C 41358 54827 61142
G_T 88389 74431 75336
T_A 34038 30196 29939
T_C 83814 87764 90151
T_G 203770 226905 226440
s1 s2 s3
A_C 209832 154248 157221
A_G 57894 63670 61918
A_T 13753 24201 24998
C_A 105845 126884 123525
C_G 51203 50463 45064
C_T 62819 63052 60178
G_A 47208 43358 44089
G_C 39680 54827 61142
G_T 88653 74431 75336
T_A 32618 30196 29939
T_C 84750 87764 90151
T_G 205743 226905 226440

3.1.2.2 Rewriting the matrices by dividing by all indexes

I think this starts to address the latter part of your query.
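One detail worth care in this step is how R divides a matrix by a per-sample vector. The sketch below takes three rows of the first count table and the index totals reported after reads/index pruning in the first run's log output, and compares plain division with sweep(). The variable names are illustrative. Several values printed below appear to match the recycled form rather than the per-column form (for example, 687 / 693725 is about 0.0010 for A_G in s1), so it may be worth confirming which was intended.

```r
# Three rows of the first count table in 3.1.1.
counts <- matrix(
  c(1226,  4078,  8324,
     687, 14428, 50666,
     212,  2050,  4514),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("A_C", "A_G", "A_T"), c("s1", "s2", "s3"))
)

# Index totals after reads/index pruning, taken from the first run's log.
indexes_per_sample <- c(s1 = 837608, s2 = 693725, s3 = 463445)

# Plain division recycles the vector down the rows (column-major order):
# row 1 is divided by the s1 total, row 2 by the s2 total, row 3 by the s3 total.
recycled <- counts / indexes_per_sample

# sweep() over MARGIN = 2 divides each column by its own sample total.
per_index <- sweep(counts, 2, indexes_per_sample, "/")

print(round(recycled, 4))
print(round(per_index, 4))
```

The two results agree only where a row's position happens to line up with its sample (A_C in s1, A_G in s2, A_T in s3).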

s1 s2 s3
A_C 0.0015 0.0049 0.0099
A_G 0.0010 0.0208 0.0730
A_T 0.0005 0.0044 0.0097
C_A 0.0109 0.0342 0.0398
C_G 0.0005 0.0053 0.0137
C_T 0.0045 0.0374 0.1283
G_A 0.0019 0.0352 0.0425
G_C 0.0004 0.0022 0.0041
G_T 0.0201 0.0252 0.0310
T_A 0.0002 0.0057 0.0089
T_C 0.0012 0.0058 0.0120
T_G 0.0023 0.0110 0.0199
s1 s2 s3
A_C 0.0015 0.0049 0.0099
A_G 0.0010 0.0208 0.0730
A_T 0.0004 0.0044 0.0097
C_A 0.0109 0.0342 0.0398
C_G 0.0004 0.0053 0.0137
C_T 0.0045 0.0374 0.1283
G_A 0.0019 0.0352 0.0425
G_C 0.0004 0.0022 0.0041
G_T 0.0201 0.0252 0.0310
T_A 0.0002 0.0057 0.0089
T_C 0.0011 0.0058 0.0120
T_G 0.0023 0.0110 0.0199
s1 s2 s3
A_C 0.0027 0.0206 0.0169
A_G 0.0009 0.0102 0.0081
A_T 0.0004 0.0058 0.0049
C_A 0.0014 0.0169 0.0133
C_G 0.0008 0.0081 0.0059
C_T 0.0015 0.0152 0.0117
G_A 0.0006 0.0058 0.0048
G_C 0.0007 0.0088 0.0080
G_T 0.0021 0.0179 0.0147
T_A 0.0004 0.0040 0.0032
T_C 0.0013 0.0141 0.0117
T_G 0.0048 0.0546 0.0441
s1 s2 s3
A_C 0.0027 0.0206 0.0169
A_G 0.0009 0.0102 0.0081
A_T 0.0003 0.0058 0.0049
C_A 0.0014 0.0169 0.0133
C_G 0.0008 0.0081 0.0059
C_T 0.0015 0.0152 0.0117
G_A 0.0006 0.0058 0.0048
G_C 0.0006 0.0088 0.0080
G_T 0.0021 0.0179 0.0147
T_A 0.0004 0.0040 0.0032
T_C 0.0013 0.0141 0.0117
T_G 0.0048 0.0546 0.0441

3.1.2.3 Rewriting the matrices by dividing by all indexes and cpm

I think this might prove to be where we get the most meaningful results.

The nicest thing in it is that after accounting for library sizes and total indexes observed, we finally see that the sequencer error is mostly consistent across all samples and mutation types – with a couple of notable exceptions.

By the same token, for the mutations which are identical for the sequencer, we have some which are decidedly different for the non-sequencer data. The most notable examples, I think, are A to G (but not G to A) and C to T.

s1 s2 s3
A_C 0.0544 0.0384 0.0408
A_G 0.0368 0.1641 0.2997
A_T 0.0170 0.0349 0.0400
C_A 0.4046 0.2699 0.1633
C_G 0.0176 0.0420 0.0564
C_T 0.1691 0.2951 0.5266
G_A 0.0718 0.2773 0.1746
G_C 0.0144 0.0176 0.0168
G_T 0.7465 0.1990 0.1273
T_A 0.0079 0.0448 0.0367
T_C 0.0431 0.0454 0.0492
T_G 0.0838 0.0866 0.0815
s1 s2 s3
A_C 0.0542 0.0384 0.0408
A_G 0.0363 0.1641 0.2997
A_T 0.0163 0.0349 0.0400
C_A 0.4064 0.2699 0.1633
C_G 0.0164 0.0419 0.0564
C_T 0.1695 0.2952 0.5266
G_A 0.0719 0.2774 0.1746
G_C 0.0131 0.0176 0.0168
G_T 0.7497 0.1991 0.1273
T_A 0.0072 0.0448 0.0367
T_C 0.0429 0.0454 0.0492
T_G 0.0841 0.0865 0.0815
s1 s2 s3
A_C 0.2474 0.1842 0.1877
A_G 0.0822 0.0918 0.0893
A_T 0.0336 0.0522 0.0539
C_A 0.1270 0.1515 0.1475
C_G 0.0740 0.0727 0.0650
C_T 0.1372 0.1361 0.1298
G_A 0.0567 0.0518 0.0526
G_C 0.0596 0.0790 0.0881
G_T 0.1907 0.1606 0.1626
T_A 0.0406 0.0360 0.0357
T_C 0.1208 0.1265 0.1300
T_G 0.4397 0.4896 0.4886
s1 s2 s3
A_C 0.2505 0.1842 0.1877
A_G 0.0835 0.0918 0.0893
A_T 0.0297 0.0522 0.0539
C_A 0.1264 0.1515 0.1475
C_G 0.0738 0.0727 0.0650
C_T 0.1355 0.1361 0.1298
G_A 0.0564 0.0518 0.0526
G_C 0.0572 0.0790 0.0881
G_T 0.1913 0.1606 0.1626
T_A 0.0389 0.0360 0.0357
T_C 0.1222 0.1265 0.1300
T_G 0.4439 0.4896 0.4886

3.1.3 Indels by RT index

The following blocks repeat the above for insertions. These data do not contain enough deletions to count them properly.

s1 s2 s3
A 0 25 382
C 0 23 69
G 0 31 89
T 0 48 221
s1 s2 s3
A 0 25 382
C 0 20 65
G 0 27 89
T 0 48 217
s1 s2 s3
A 0 3 8
C 0 24 25
G 0 14 16
T 0 0 3
s1 s2 s3
A 0 0 5
C 0 17 15
G 0 10 5

Plots of this information

3.1.4 Insertions by RT index, post normalization

3.1.4.1 Rewriting the matrices as cpm to account for library sizes.

s2 s3
A 196850 501971
C 181102 90670
G 244094 116951
T 377953 290407
s2 s3
A 208333 507304
C 166667 86321
G 225000 118194
T 400000 288181
s2 s3
A 73171 153846
C 585366 480769
G 341463 307692
T 0 57692
s2 s3
A 0 2e+05
C 629630 6e+05
G 370370 2e+05

3.1.4.2 Rewriting the matrices by dividing by all indexes

I think there are few enough insertion events that these values become unreliable. I will double-check the logic, but that is my initial guess given how few insertions I saw when reading the outputs manually. Unfortunately, this means that I also cannot provide a cpm measurement for these.

s1 s2 s3
A 0 0e+00 8e-04
C 0 0e+00 1e-04
G 0 0e+00 1e-04
T 0 1e-04 5e-04
s1 s2 s3
A 0 0e+00 8e-04
C 0 0e+00 1e-04
G 0 0e+00 1e-04
T 0 1e-04 5e-04
s1 s2 s3
A 0 0e+00 0
C 0 1e-04 0
G 0 0e+00 0
T 0 0e+00 0
s1 s2 s3
A 0 0 0
C 0 0 0
G 0 0 0

The following is my previous version of this worksheet, which simply dumped the various tables.

