1 Calculating error rates.

I wrote the function ‘create_matrices()’ to collect mutation counts. At least in theory the results from it should be able to address most/any question regarding the counts of mutations observed in the data.

## Loading errRt
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:hpgltools':
## 
##     combine
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: tidyr
## Starting sample: 1.
## Reading the file containing mutations: preprocessing/s1/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s1/step2_identical_reads.txt.xz
## Removing any differences before position: 24.
## Before pruning, there are: 1156535 reads.
## After position pruning, there are: 1037310 reads.
## Removing any reads with 'N' as the hit.
## Before N pruning, there are: 1037310 reads.
## After N pruning, there are: 1021066 reads.
## Gathering information about the indexes observed, this is slow.
## Before read pruning, there are: 1742743 indexes.
## After read pruning, there are: 838158 indexes.
## Removing indexes with fewer than 3 indexes.
## Before index pruning, there are: 1021066 changed reads.
## Before index pruning, there are: 4681501 identical reads.
## After index pruning, there are: 532571 changed reads.
## After index pruning, there are: 3663814 identical reads.
## Gathering identical, mutant, and sequencer reads/indexes.
## Counting by direction.
## Counting by string.
## Subsetting based on mutations with at least 3 indexes.
## Counting by reference position.
## Counting by identity string.
## Counting by reference nucleotide.
## Counting by product nucleotide.
## Counting by mutation type.
## Counting by transitions/transversion.
## Counting strong/weak.
## Counting insertions by position.
## Counting insertions by nucleotide
## Counting deletions by position.
## Counting deletions by nucleotide.
## Starting sample: 2.
## Reading the file containing mutations: preprocessing/s2/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s2/step2_identical_reads.txt.xz
## Removing any differences before position: 24.
## Before pruning, there are: 3421203 reads.
## After position pruning, there are: 1758479 reads.
## Removing any reads with 'N' as the hit.
## Before N pruning, there are: 1758479 reads.
## After N pruning, there are: 1732605 reads.
## Gathering information about the indexes observed, this is slow.
## Before read pruning, there are: 1263563 indexes.
## After read pruning, there are: 694479 indexes.
## Removing indexes with fewer than 3 indexes.
## Before index pruning, there are: 1732605 changed reads.
## Before index pruning, there are: 5230976 identical reads.
## After index pruning, there are: 874781 changed reads.
## After index pruning, there are: 4834367 identical reads.
## Gathering identical, mutant, and sequencer reads/indexes.
## Counting by direction.
## Counting by string.
## Subsetting based on mutations with at least 3 indexes.
## Counting by reference position.
## Counting by identity string.
## Counting by reference nucleotide.
## Counting by product nucleotide.
## Counting by mutation type.
## Counting by transitions/transversion.
## Counting strong/weak.
## Counting insertions by position.
## Counting insertions by nucleotide
## Counting deletions by position.
## Counting deletions by nucleotide.
## Starting sample: 3.
## Reading the file containing mutations: preprocessing/s3/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s3/step2_identical_reads.txt.xz
## Removing any differences before position: 24.
## Before pruning, there are: 4309681 reads.
## After position pruning, there are: 1564155 reads.
## Removing any reads with 'N' as the hit.
## Before N pruning, there are: 1564155 reads.
## After N pruning, there are: 1532016 reads.
## Gathering information about the indexes observed, this is slow.
## Before read pruning, there are: 887982 indexes.
## After read pruning, there are: 465046 indexes.
## Removing indexes with fewer than 3 indexes.
## Before index pruning, there are: 1532016 changed reads.
## Before index pruning, there are: 3583390 identical reads.
## After index pruning, there are: 783358 changed reads.
## After index pruning, there are: 3332425 identical reads.
## Gathering identical, mutant, and sequencer reads/indexes.
## Counting by direction.
## Counting by string.
## Subsetting based on mutations with at least 3 indexes.
## Counting by reference position.
## Counting by identity string.
## Counting by reference nucleotide.
## Counting by product nucleotide.
## Counting by mutation type.
## Counting by transitions/transversion.
## Counting strong/weak.
## Counting insertions by position.
## Counting insertions by nucleotide
## Counting deletions by position.
## Counting deletions by nucleotide.
## Working on miss_reads_by_position.
## Working on miss_indexes_by_position.
## Working on miss_sequencer_by_position.
## Working on miss_reads_by_string.
## Warning in order(as.numeric(rownames(matrices[[t]]))): NAs introduced by
## coercion
## Working on miss_indexes_by_string.
## Warning in order(as.numeric(rownames(matrices[[t]]))): NAs introduced by
## coercion
## Working on miss_sequencer_by_string.
## Warning in order(as.numeric(rownames(matrices[[t]]))): NAs introduced by
## coercion
## Working on miss_reads_by_ref_nt.
## Working on miss_indexes_by_ref_nt.
## Working on miss_sequencer_by_ref_nt.
## Working on miss_reads_by_hit_nt.
## Working on miss_indexes_by_hit_nt.
## Working on miss_sequencer_by_hit_nt.
## Working on miss_reads_by_type.
## Working on miss_indexes_by_type.
## Working on miss_sequencer_by_type.
## Working on miss_reads_by_trans.
## Working on miss_indexes_by_trans.
## Working on miss_sequencer_by_trans.
## Working on miss_reads_by_strength.
## Working on miss_indexes_by_strength.
## Working on miss_sequencer_by_strength.
## Working on insert_reads_by_position.
## Working on insert_indexes_by_position.
## Working on insert_sequencer_by_position.
## Working on insert_reads_by_nt.
## Working on insert_indexes_by_nt.
## Working on insert_sequencer_by_nt.
## Working on delete_reads_by_position.
## Working on delete_indexes_by_position.
## Working on delete_sequencer_by_position.
## Working on delete_reads_by_nt.
## Working on delete_indexes_by_nt.
## Working on delete_sequencer_by_nt.
## Skipping table: delete_reads_by_position
## Skipping table: delete_indexes_by_position
## Skipping table: delete_sequencer_by_position
## Skipping table: delete_reads_by_nt
## Skipping table: delete_indexes_by_nt
## Skipping table: delete_sequencer_by_nt
##                      Length Class  Mode   
## samples               3     -none- list   
## reads_per_sample      3     -none- numeric
## indexes_per_sample    3     -none- numeric
## matrices             33     -none- list   
## matrices_by_counts   33     -none- list   
## normalized           33     -none- list   
## normalized_by_counts 33     -none- list
## Starting sample: 1.
## Reading the file containing mutations: preprocessing/s1/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s1/step2_identical_reads.txt.xz
## Removing any differences before position: 24.
## Before pruning, there are: 1156535 reads.
## After position pruning, there are: 1037310 reads.
## Removing any reads with 'N' as the hit.
## Before N pruning, there are: 1037310 reads.
## After N pruning, there are: 1021066 reads.
## Gathering information about the indexes observed, this is slow.
## Before read pruning, there are: 1742743 indexes.
## After read pruning, there are: 838158 indexes.
## Removing indexes with fewer than 5 indexes.
## Before index pruning, there are: 1021066 changed reads.
## Before index pruning, there are: 4681501 identical reads.
## After index pruning, there are: 532571 changed reads.
## After index pruning, there are: 3663814 identical reads.
## Gathering identical, mutant, and sequencer reads/indexes.
## Counting by direction.
## Counting by string.
## Subsetting based on mutations with at least 5 indexes.
## Counting by reference position.
## Counting by identity string.
## Counting by reference nucleotide.
## Counting by product nucleotide.
## Counting by mutation type.
## Counting by transitions/transversion.
## Counting strong/weak.
## Counting insertions by position.
## Counting insertions by nucleotide
## Counting deletions by position.
## Counting deletions by nucleotide.
## Starting sample: 2.
## Reading the file containing mutations: preprocessing/s2/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s2/step2_identical_reads.txt.xz
## Removing any differences before position: 24.
## Before pruning, there are: 3421203 reads.
## After position pruning, there are: 1758479 reads.
## Removing any reads with 'N' as the hit.
## Before N pruning, there are: 1758479 reads.
## After N pruning, there are: 1732605 reads.
## Gathering information about the indexes observed, this is slow.
## Before read pruning, there are: 1263563 indexes.
## After read pruning, there are: 694479 indexes.
## Removing indexes with fewer than 5 indexes.
## Before index pruning, there are: 1732605 changed reads.
## Before index pruning, there are: 5230976 identical reads.
## After index pruning, there are: 874781 changed reads.
## After index pruning, there are: 4834367 identical reads.
## Gathering identical, mutant, and sequencer reads/indexes.
## Counting by direction.
## Counting by string.
## Subsetting based on mutations with at least 5 indexes.
## Counting by reference position.
## Counting by identity string.
## Counting by reference nucleotide.
## Counting by product nucleotide.
## Counting by mutation type.
## Counting by transitions/transversion.
## Counting strong/weak.
## Counting insertions by position.
## Counting insertions by nucleotide
## Counting deletions by position.
## Counting deletions by nucleotide.
## Starting sample: 3.
## Reading the file containing mutations: preprocessing/s3/step4.txt.xz
## Reading the file containing the identical reads: preprocessing/s3/step2_identical_reads.txt.xz
## Removing any differences before position: 24.
## Before pruning, there are: 4309681 reads.
## After position pruning, there are: 1564155 reads.
## Removing any reads with 'N' as the hit.
## Before N pruning, there are: 1564155 reads.
## After N pruning, there are: 1532016 reads.
## Gathering information about the indexes observed, this is slow.
## Before read pruning, there are: 887982 indexes.
## After read pruning, there are: 465046 indexes.
## Removing indexes with fewer than 5 indexes.
## Before index pruning, there are: 1532016 changed reads.
## Before index pruning, there are: 3583390 identical reads.
## After index pruning, there are: 783358 changed reads.
## After index pruning, there are: 3332425 identical reads.
## Gathering identical, mutant, and sequencer reads/indexes.
## Counting by direction.
## Counting by string.
## Subsetting based on mutations with at least 5 indexes.
## Counting by reference position.
## Counting by identity string.
## Counting by reference nucleotide.
## Counting by product nucleotide.
## Counting by mutation type.
## Counting by transitions/transversion.
## Counting strong/weak.
## Counting insertions by position.
## Counting insertions by nucleotide
## Counting deletions by position.
## Counting deletions by nucleotide.
## Working on miss_reads_by_position.
## Working on miss_indexes_by_position.
## Working on miss_sequencer_by_position.
## Working on miss_reads_by_string.
## Warning in order(as.numeric(rownames(matrices[[t]]))): NAs introduced by
## coercion
## Working on miss_indexes_by_string.
## Warning in order(as.numeric(rownames(matrices[[t]]))): NAs introduced by
## coercion
## Working on miss_sequencer_by_string.
## Warning in order(as.numeric(rownames(matrices[[t]]))): NAs introduced by
## coercion
## Working on miss_reads_by_ref_nt.
## Working on miss_indexes_by_ref_nt.
## Working on miss_sequencer_by_ref_nt.
## Working on miss_reads_by_hit_nt.
## Working on miss_indexes_by_hit_nt.
## Working on miss_sequencer_by_hit_nt.
## Working on miss_reads_by_type.
## Working on miss_indexes_by_type.
## Working on miss_sequencer_by_type.
## Working on miss_reads_by_trans.
## Working on miss_indexes_by_trans.
## Working on miss_sequencer_by_trans.
## Working on miss_reads_by_strength.
## Working on miss_indexes_by_strength.
## Working on miss_sequencer_by_strength.
## Working on insert_reads_by_position.
## Working on insert_indexes_by_position.
## Working on insert_sequencer_by_position.
## Working on insert_reads_by_nt.
## Working on insert_indexes_by_nt.
## Working on insert_sequencer_by_nt.
## Working on delete_reads_by_position.
## Working on delete_indexes_by_position.
## Working on delete_sequencer_by_position.
## Working on delete_reads_by_nt.
## Working on delete_indexes_by_nt.
## Working on delete_sequencer_by_nt.
## Skipping table: delete_reads_by_position
## Skipping table: delete_indexes_by_position
## Skipping table: delete_sequencer_by_position
## Skipping table: delete_reads_by_nt
## Skipping table: delete_indexes_by_nt
## Skipping table: delete_sequencer_by_nt
##                      Length Class  Mode   
## samples               3     -none- list   
## reads_per_sample      3     -none- numeric
## indexes_per_sample    3     -none- numeric
## matrices             33     -none- list   
## matrices_by_counts   33     -none- list   
## normalized           33     -none- list   
## normalized_by_counts 33     -none- list

2 Questions from Dr. DeStefano

I think what is best is to get the number of recovered mutations of each type from each data set. That would be A to T, A to G, A to C; T to A, T to G, T to C; G to A, G to C, G to T; and C to A, C to G, C to T; as well as deletions and insertions. I would then need the sum number of the reads that met all our criteria (i.e. at least 3 good recovered reads for that 14 nt index). Each set of 3 or more would ct as “1” read of that particular index so I would need the total with this in mind. I also need to know the total number of nucleotides that were in the region we decided to consider in the analysis. We may want to try this for 3 or more and 5 or more recovered indexes if it is not hard. This information does not include specific positions on the template where errors occurred but we can look at that latter. Right now I just want to get a general error rate and type of error. It would basically be calculated by dividing the number of recovered mutations of a particular type by sum number of the reads times the number of nucleotides screened in the template. As it ends up, this number does not really have a lot of meaning but it can be used to calculate the overall mutation rate as well as the rate for transversions, transitions, and deletions and insertions.

3 Answers

In order to address those queries, I invoked create_matrices() with a minimum index count of 3 and 5. It should be noted that this is not the same as requiring 3 or 5 reads per index. In both cases I require 3 reads per index.

3.1 Recovered mutations of each type

I am interpreting this question as the number of indexes recovered for each mutation type. I collect this information in 2 ways of interest: the indexes by type which are deemed to be from the RT and from the sequencer. In addition, I calculate a normalized (cpm) version of this information which may be used to look for changes across samples.

3.1.1 Mutations by RT index

This following block should print out tables of the numbers of mutant indexes observed for each type for the RT and the sequencer. One would hope that the sequencer will be consistent for all samples, but I think the results will instead suggest that my metric is not yet stringent enough.

s1 s2 s3
A_C 1294 4516 9418
A_G 728 14896 53401
A_T 235 2243 5053
C_A 9305 29486 34851
C_G 342 3848 10038
C_T 2133 18061 61965
G_A 1685 30686 37225
G_C 283 1683 3087
G_T 9630 12295 15293
T_A 222 5119 8532
T_C 854 4466 9521
T_G 1195 5972 11188
s1 s2 s3
A_C 2381 18295 15010
A_G 659 7440 5942
A_T 224 3157 2592
C_A 1278 15254 11862
C_G 601 6072 4442
C_T 732 7343 5703
G_A 555 5215 4268
G_C 476 6334 5631
G_T 997 8701 7109
T_A 453 4280 3403
T_C 997 10482 8692
T_G 2471 27537 22137
s1 s2 s3
A_C 1281 4516 9418
A_G 708 14896 53401
A_T 215 2243 5053
C_A 9305 29486 34851
C_G 318 3844 10038
C_T 2129 18061 61965
G_A 1677 30686 37225
G_C 255 1679 3087
G_T 9627 12295 15293
T_A 198 5119 8532
T_C 842 4466 9521
T_G 1195 5969 11188
s1 s2 s3
A_C 2374 18295 15010
A_G 651 7440 5942
A_T 202 3157 2592
C_A 1254 15254 11862
C_G 591 6072 4442
C_T 713 7343 5703
G_A 545 5215 4268
G_C 445 6334 5631
G_T 977 8701 7109
T_A 424 4280 3403
T_C 990 10482 8692
T_G 2458 27537 22137

Plots of this information

This suggests to me that this information needs to be normalized in some more sensible fashion. Thus the following:

3.1.2 Mutations by RT index, post normalization

The same numbers may be expressed in the context of the number of indexes observed / sample and/or as a cpm across samples. Thus in the first instance one can look at the apparent error rate for each sample, and in the second instance one may look for relative changes in apparent error rate across samples.

3.1.2.1 Rewriting the matrices as cpm to account for library sizes.

s1 s2 s3
A_C 46370 33886 36283
A_G 26088 111772 205727
A_T 8421 16830 19467
C_A 333441 221248 134263
C_G 12255 28873 38671
C_T 76435 135521 238720
G_A 60381 230253 143409
G_C 10141 12628 11893
G_T 345087 92256 58916
T_A 7955 38410 32869
T_C 30603 33511 36680
T_G 42822 44811 43102
s1 s2 s3
A_C 201370 152319 155076
A_G 55734 61943 61390
A_T 18945 26284 26779
C_A 108085 127000 122553
C_G 50829 50554 45893
C_T 61908 61136 58921
G_A 46938 43419 44095
G_C 40257 52735 58177
G_T 84320 72442 73447
T_A 38312 35634 35158
T_C 84320 87270 89802
T_G 208982 229265 228709
s1 s2 s3
A_C 46162 33889 36283
A_G 25514 111781 205727
A_T 7748 16832 19467
C_A 335315 221267 134263
C_G 11459 28846 38671
C_T 76721 135532 238720
G_A 60432 230272 143409
G_C 9189 12599 11893
G_T 346919 92263 58916
T_A 7135 38414 32869
T_C 30342 33513 36680
T_G 43063 44792 43102
s1 s2 s3
A_C 204233 152319 155076
A_G 56005 61943 61390
A_T 17378 26284 26779
C_A 107880 127000 122553
C_G 50843 50554 45893
C_T 61339 61136 58921
G_A 46886 43419 44095
G_C 38283 52735 58177
G_T 84050 72442 73447
T_A 36476 35634 35158
T_C 85169 87270 89802
T_G 211459 229265 228709

3.1.2.2 Rewriting the matrices by dividing by all indexes

This I think starts to address the later text in your query.

s1 s2 s3
A_C 0.0015 0.0054 0.0112
A_G 0.0010 0.0214 0.0769
A_T 0.0005 0.0048 0.0109
C_A 0.0111 0.0352 0.0416
C_G 0.0005 0.0055 0.0145
C_T 0.0046 0.0388 0.1332
G_A 0.0020 0.0366 0.0444
G_C 0.0004 0.0024 0.0044
G_T 0.0207 0.0264 0.0329
T_A 0.0003 0.0061 0.0102
T_C 0.0012 0.0064 0.0137
T_G 0.0026 0.0128 0.0241
s1 s2 s3
A_C 0.0028 0.0218 0.0179
A_G 0.0009 0.0107 0.0086
A_T 0.0005 0.0068 0.0056
C_A 0.0015 0.0182 0.0142
C_G 0.0009 0.0087 0.0064
C_T 0.0016 0.0158 0.0123
G_A 0.0007 0.0062 0.0051
G_C 0.0007 0.0091 0.0081
G_T 0.0021 0.0187 0.0153
T_A 0.0005 0.0051 0.0041
T_C 0.0014 0.0151 0.0125
T_G 0.0053 0.0592 0.0476
s1 s2 s3
A_C 0.0015 0.0054 0.0112
A_G 0.0010 0.0214 0.0769
A_T 0.0005 0.0048 0.0109
C_A 0.0111 0.0352 0.0416
C_G 0.0005 0.0055 0.0145
C_T 0.0046 0.0388 0.1332
G_A 0.0020 0.0366 0.0444
G_C 0.0004 0.0024 0.0044
G_T 0.0207 0.0264 0.0329
T_A 0.0002 0.0061 0.0102
T_C 0.0012 0.0064 0.0137
T_G 0.0026 0.0128 0.0241
s1 s2 s3
A_C 0.0028 0.0218 0.0179
A_G 0.0009 0.0107 0.0086
A_T 0.0004 0.0068 0.0056
C_A 0.0015 0.0182 0.0142
C_G 0.0009 0.0087 0.0064
C_T 0.0015 0.0158 0.0123
G_A 0.0007 0.0062 0.0051
G_C 0.0006 0.0091 0.0081
G_T 0.0021 0.0187 0.0153
T_A 0.0005 0.0051 0.0041
T_C 0.0014 0.0151 0.0125
T_G 0.0053 0.0592 0.0476

3.1.2.3 Rewriting the matrices by dividing by all indexes and cpm

I think this might prove to be where we get the most meaningful results.

The nicest thing in it is that after accounting for library sizes and total indexes observed, we finally see that the sequencer error is mostly consistent across all samples and mutation types – with a couple of notable exceptions.

By the same token, for the mutations which are identical for the sequencer, we have some which are decidedly different for the non-sequencer data. The most notable examples I think are A to G but _not G to A; and C to T.

s1 s2 s3
A_C 0.0553 0.0404 0.0433
A_G 0.0376 0.1609 0.2962
A_T 0.0181 0.0362 0.0419
C_A 0.3978 0.2640 0.1602
C_G 0.0176 0.0416 0.0557
C_T 0.1644 0.2914 0.5133
G_A 0.0720 0.2747 0.1711
G_C 0.0146 0.0182 0.0171
G_T 0.7420 0.1984 0.1267
T_A 0.0095 0.0458 0.0392
T_C 0.0441 0.0483 0.0528
T_G 0.0921 0.0964 0.0927
s1 s2 s3
A_C 0.2403 0.1817 0.1850
A_G 0.0803 0.0892 0.0884
A_T 0.0407 0.0565 0.0576
C_A 0.1290 0.1515 0.1462
C_G 0.0732 0.0728 0.0661
C_T 0.1331 0.1315 0.1267
G_A 0.0560 0.0518 0.0526
G_C 0.0580 0.0759 0.0838
G_T 0.1813 0.1558 0.1579
T_A 0.0457 0.0425 0.0419
T_C 0.1214 0.1257 0.1293
T_G 0.4494 0.4930 0.4918
s1 s2 s3
A_C 0.0551 0.0404 0.0433
A_G 0.0367 0.1610 0.2962
A_T 0.0167 0.0362 0.0419
C_A 0.4001 0.2640 0.1602
C_G 0.0165 0.0415 0.0557
C_T 0.1650 0.2914 0.5133
G_A 0.0721 0.2747 0.1711
G_C 0.0132 0.0181 0.0171
G_T 0.7460 0.1984 0.1267
T_A 0.0085 0.0458 0.0392
T_C 0.0437 0.0483 0.0528
T_G 0.0926 0.0963 0.0927
s1 s2 s3
A_C 0.2437 0.1817 0.1850
A_G 0.0806 0.0892 0.0884
A_T 0.0374 0.0565 0.0576
C_A 0.1287 0.1515 0.1462
C_G 0.0732 0.0728 0.0661
C_T 0.1319 0.1315 0.1267
G_A 0.0559 0.0518 0.0526
G_C 0.0551 0.0759 0.0838
G_T 0.1807 0.1558 0.1579
T_A 0.0435 0.0425 0.0419
T_C 0.1226 0.1257 0.1293
T_G 0.4547 0.4930 0.4918

3.1.3 Indels by RT index

The following blocks will repeat the above, but looking for insertions. This data does not observe sufficient deletions to make a proper count for them.

s1 s2 s3
A 0 25 382
C 0 23 70
G 0 31 89
T 0 51 221
s1 s2 s3
A 0 3 8
C 0 23 24
G 0 14 16
T 0 0 3
s1 s2 s3
A 0 25 382
C 0 20 66
G 0 27 89
T 0 48 217
s1 s2 s3
A 0 0 5
C 0 16 14
G 0 10 5

Plots of this information

3.1.4 Insertions by RT index, post normalization

3.1.4.1 Rewriting the matrices as cpm to account for library sizes.

s2 s3
A 192308 501312
C 176923 91864
G 238462 116798
T 392308 290026
s2 s3
A 75000 156863
C 575000 470588
G 350000 313725
T 0 58824
s2 s3
A 208333 506631
C 166667 87533
G 225000 118037
T 400000 287798
s2 s3
A 0 208333
C 615385 583333
G 384615 208333

3.1.4.2 Rewriting the matrices by dividing by all indexes

I think that there are few enough insertion events that this gets a bit messed up. I will double check the logic of this, but that is my initial guess given how few insertions I was seeing when reading the outputs manually. Unfortunately, this means that for these I also cannot provide a cpm measurement.

s1 s2 s3
A 0 0e+00 8e-04
C 0 0e+00 1e-04
G 0 0e+00 1e-04
T 0 1e-04 5e-04
s1 s2 s3
A 0 0 0
C 0 0 0
G 0 0 0
T 0 0 0
s1 s2 s3
A 0 0e+00 8e-04
C 0 0e+00 1e-04
G 0 0e+00 1e-04
T 0 1e-04 5e-04
s1 s2 s3
A 0 0 0
C 0 0 0
G 0 0 0

The following is my previous writing of this worksheet which just dumped the various tables.

---
title: "Counting RT mutations from illumina sequencing data."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
</style>

```{r options, include=FALSE}
library("hpgltools")
tt <- devtools::load_all("/data/hpgltools")
knitr::opts_knit$set(width=120,
                     progress=TRUE,
                     verbose=TRUE,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
rundate <- format(Sys.Date(), format="%Y%m%d")
previous_file <- "index.Rmd"
ver <- "20191201"

##tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
##rmd_file <- "03_expression_infection_20180822.Rmd"
```

# Calculating error rates.

I wrote the function 'create_matrices()' to collect mutation counts.  At least
in theory the results from it should be able to address most/any question
regarding the counts of mutations observed in the data.

```{r testing}
devtools::load_all("errRt")

triplets <- create_matrices(sample_sheet="sample_sheets/all_samples.xlsx",
                            ident_column="identtable", mut_column="mutationtable",
                            min_reads=3, min_indexes=3, min_sequencer=10,
                            min_position=24, prune_n=TRUE, verbose=TRUE)
triplet_plots <- barplot_matrices(triplets)
summary(triplets)
quintuplets <- create_matrices(sample_sheet="sample_sheets/all_samples.xlsx",
                               ident_column="identtable", mut_column="mutationtable",
                               min_reads=3, min_indexes=5, min_sequencer=10,
                               min_position=24, prune_n=TRUE, verbose=TRUE)
quintuplet_plots <- barplot_matrices(quintuplets)
summary(triplets)
```

# Questions from Dr. DeStefano

I think what is best is to get the number of recovered mutations of each type
from each data set.  That would be A to T, A to G, A to C; T to A, T to G, T to
C; G to A, G to C, G to T; and C to A, C to G, C to T; as well as deletions and
insertions.  I would then need the sum number of the reads that met all our
criteria (i.e. at least 3 good recovered reads for that 14 nt index).  Each set
of 3 or more would ct as "1" read of that particular index so I would need the
total with this in mind.  I also need to know the total number of nucleotides
that were in the region we decided to consider in the analysis.  We may want to
try this for 3 or more and 5 or more recovered indexes if it is not hard.  This
information does not include specific positions on the template where errors
occurred but we can look at that latter.  Right now I just want to get a general
error rate and type of error.  It would basically be calculated by dividing the
number of recovered mutations of a particular type by sum number of the reads
times the number of nucleotides screened in the template.  As it ends up, this
number does not really have a lot of meaning but it can be used to calculate the
overall mutation rate as well as the rate for transversions, transitions, and
deletions and insertions.

# Answers

In order to address those queries, I invoked create_matrices() with a minimum
index count of 3 and 5.  It should be noted that this is not the same as
requiring 3 or 5 reads per index.  In both cases I require 3 reads per index.

## Recovered mutations of each type

I am interpreting this question as the number of indexes recovered for each
mutation type.  I collect this information in 2 ways of interest: the indexes by
type which are deemed to be from the RT and from the sequencer.  In addition, I
calculate a normalized (cpm) version of this information which may be used to look for
changes across samples.

### Mutations by RT index

This following block should print out tables of the numbers of mutant indexes
observed for each type for the RT and the sequencer.  One would hope that the
sequencer will be consistent for all samples, but I think the results will
instead suggest that my metric is not yet stringent enough.

```{r mutation_index_count, results='asis'}
knitr::kable(triplets[["matrices"]][["miss_indexes_by_type"]])

knitr::kable(triplets[["matrices"]][["miss_sequencer_by_type"]])

knitr::kable(quintuplets[["matrices"]][["miss_indexes_by_type"]])

knitr::kable(quintuplets[["matrices"]][["miss_sequencer_by_type"]])
```

Plots of this information

```{r mutation_index_count_plots}
triplet_plots[["matrices"]][["miss_indexes_by_type"]]
triplet_plots[["normal"]][["miss_indexes_by_type"]]

quintuplet_plots[["matrices"]][["miss_indexes_by_type"]]
quintuplet_plots[["normal"]][["miss_indexes_by_type"]]
```

This suggests to me that this information needs to be normalized in some more
sensible fashion.  Thus the following:

### Mutations by RT index, post normalization

The same numbers may be expressed in the context of the number of indexes
observed / sample and/or as a cpm across samples.  Thus in the first instance
one can look at the apparent error rate for each sample, and in the second
instance one may look for relative changes in apparent error rate across
samples.

#### Rewriting the matrices as cpm to account for library sizes.

```{r mutation_index_normalized, results='asis'}
knitr::kable(triplets[["normalized"]][["miss_indexes_by_type"]])

knitr::kable(triplets[["normalized"]][["miss_sequencer_by_type"]])

knitr::kable(quintuplets[["normalized"]][["miss_indexes_by_type"]])

knitr::kable(quintuplets[["normalized"]][["miss_sequencer_by_type"]])
```

#### Rewriting the matrices by dividing by all indexes

This I think starts to address the later text in your query.

```{r mutation_index_normalized_by_counts, results='asis'}
knitr::kable(triplets[["matrices_by_counts"]][["miss_indexes_by_type"]])

knitr::kable(triplets[["matrices_by_counts"]][["miss_sequencer_by_type"]])

knitr::kable(quintuplets[["matrices_by_counts"]][["miss_indexes_by_type"]])

knitr::kable(quintuplets[["matrices_by_counts"]][["miss_sequencer_by_type"]])
```

#### Rewriting the matrices by dividing by all indexes and cpm

I think this might prove to be where we get the most meaningful results.

The nicest thing in it is that after accounting for library sizes and total
indexes observed, we finally see that the sequencer error is mostly consistent
across all samples and mutation types -- with a couple of notable exceptions.

By the same token, for the mutations which _are_ identical for the sequencer, we
have some which are decidedly different for the non-sequencer data.  The most
notable examples I think are A to G but _not G to A; and C to T.

```{r mutation_index_cpm_by_counts, results='asis'}
knitr::kable(triplets[["normalized_by_counts"]][["miss_indexes_by_type"]])

knitr::kable(triplets[["normalized_by_counts"]][["miss_sequencer_by_type"]])

knitr::kable(quintuplets[["normalized_by_counts"]][["miss_indexes_by_type"]])

knitr::kable(quintuplets[["normalized_by_counts"]][["miss_sequencer_by_type"]])
```

### Indels by RT index

The following blocks will repeat the above, but looking for insertions.
This data does not observe sufficient deletions to make a proper count for them.

```{r insert_index_count, results='asis'}
knitr::kable(triplets[["matrices"]][["insert_indexes_by_nt"]])

knitr::kable(triplets[["matrices"]][["insert_sequencer_by_nt"]])

knitr::kable(quintuplets[["matrices"]][["insert_indexes_by_nt"]])

knitr::kable(quintuplets[["matrices"]][["insert_sequencer_by_nt"]])
```

Plots of this information

```{r insert_index_count_plots}
triplet_plots[["matrices"]][["insert_indexes_by_nt"]]
triplet_plots[["normal"]][["insert_indexes_by_nt"]]

quintuplet_plots[["matrices"]][["insert_indexes_by_nt"]]
quintuplet_plots[["matrices"]][["insert_sequencer_by_nt"]]
quintuplet_plots[["normal"]][["insert_indexes_by_nt"]]
quintuplet_plots[["normal"]][["insert_sequencer_by_nt"]]
```

### Insertions by RT index, post normalization

#### Rewriting the matrices as cpm to account for library sizes.

```{r insert_index_normalized, results='asis'}
knitr::kable(triplets[["normalized"]][["insert_indexes_by_nt"]])

knitr::kable(triplets[["normalized"]][["insert_sequencer_by_nt"]])

knitr::kable(quintuplets[["normalized"]][["insert_indexes_by_nt"]])

knitr::kable(quintuplets[["normalized"]][["insert_sequencer_by_nt"]])
```

#### Rewriting the matrices by dividing by all indexes

I think that there are few enough insertion events that this gets a bit messed
up.  I will double check the logic of this, but that is my initial guess given
how few insertions I was seeing when reading the outputs manually.
Unfortunately, this means that for these I also cannot provide a cpm measurement.

```{r insert_index_normalized_by_counts, results='asis'}
knitr::kable(triplets[["matrices_by_counts"]][["insert_indexes_by_nt"]])

knitr::kable(triplets[["matrices_by_counts"]][["insert_sequencer_by_nt"]])

knitr::kable(quintuplets[["matrices_by_counts"]][["insert_indexes_by_nt"]])

knitr::kable(quintuplets[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
```

The following is my previous writing of this worksheet which just dumped the
various tables.

# Print raw tables

```{r raw, results='asis'}
for (t in 1:length(triplets[["matrices"]])) {
  table_name <- names(triplets[["matrices"]])[t]
  message("Raw table: ", table_name, ".")
  print(knitr::kable(triplets[["matrices"]][t]))
}
```

# Print raw plots

```{r raw_plots}
for (t in 1:length(triplets[["matrices"]])) {
  message("Raw table: ", table_name, ".")
  print(data_plots[["matrices"]][t])
}
```

# Print normalized tables

```{r norm, results='asis'}
for (t in 1:length(triplets[["normalized"]])) {
  table_name <- names(triplets[["normalized"]])[t]
  message("Normalized table: ", table_name, ".")
  print(knitr::kable(triplets[["normalized"]][t]))
}
```

# Print normalized plots

```{r norm_plots}
for (t in 1:length(triplets[["normalized"]])) {
  message("Normalized table: ", table_name, ".")
  print(data_plots[["normal"]][t])
}
```

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
```


```{r loadme, eval=FALSE}
loadme(filename=this_save)
```
