1 Calculating error rates.

I wrote the function ‘create_matrices()’ to collect mutation counts. At least in theory the results from it should be able to address most/any question regarding the counts of mutations observed in the data.

1.1 Categorize the data with at least 3 indexes per mutant

devtools::load_all("Rerrrt")
## Loading Rerrrt
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:hpgltools':
## 
##     combine
## The following object is masked from 'package:testthat':
## 
##     matches
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: tidyr
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:testthat':
## 
##     matches
sample_sheet <- "sample_sheets/new_samples.xlsx"
ident_column <- "identtable"
mut_column <- "mutationtable"
min_reads <- 3
min_indexes <- 3
min_sequencer <- 6
min_position <- 5
max_position <- 200
max_mutations_per_read <- NULL
prune_n <- TRUE
verbose <- TRUE
excel <- glue::glue("excel/{rundate}_new_triples-v{ver}.xlsx")
triples <- create_matrices(sample_sheet=sample_sheet,
                           ident_column=ident_column, mut_column=mut_column,
                           min_reads=min_reads, min_indexes=min_indexes,
                           min_sequencer=min_sequencer,
                           min_position=min_position, max_position=max_position,
                           prune_n=prune_n, verbose=verbose, excel=excel)
## Dropped 6 rows from the sample metadata because they were blank.
## Starting sample: control_dna.
##   Reading the file containing mutations: preprocessing/control_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 344009 reads.
## Mutation data: after min-position pruning, there are: 344007 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 344007 reads.
## Mutation data: after max-position pruning, there are: 344007 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 337124 reads: 6883 lost or 2.00%.
##   Mutation data: all filters removed 6885 reads, or 2.00%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1392352 indexes in all the data.
## After reads/index pruning, there are: 258638 indexes: 1133714 lost or 81.42%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 337124 changed reads.
## All data: before reads/index pruning, there are: 2240589 identical reads.
## All data: after index pruning, there are: 117053 changed reads: 34.72%.
## All data: after index pruning, there are: 870547 identical reads: 38.85%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 870547 identical reads.
## Before classification, there are 117053 reads with mutations.
## After classification, there are 741108 reads/indexes which are only identical.
## After classification, there are 6056 reads/indexes which are strictly sequencer.
## After classification, there are 12445 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1336553 forward reads and 1515986 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_dna.
##   Reading the file containing mutations: preprocessing/low_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 418833 reads.
## Mutation data: after min-position pruning, there are: 418830 reads: 3 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 418830 reads.
## Mutation data: after max-position pruning, there are: 418830 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 411131 reads: 7699 lost or 1.84%.
##   Mutation data: all filters removed 7702 reads, or 1.84%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1950160 indexes in all the data.
## After reads/index pruning, there are: 177404 indexes: 1772756 lost or 90.90%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 411131 changed reads.
## All data: before reads/index pruning, there are: 2566212 identical reads.
## All data: after index pruning, there are: 85285 changed reads: 20.74%.
## All data: after index pruning, there are: 540399 identical reads: 21.06%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 540399 identical reads.
## Before classification, there are 85285 reads with mutations.
## After classification, there are 468326 reads/indexes which are only identical.
## After classification, there are 1201 reads/indexes which are strictly sequencer.
## After classification, there are 20730 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 771013 forward reads and 901147 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_dna.
##   Reading the file containing mutations: preprocessing/high_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 548445 reads.
## Mutation data: after min-position pruning, there are: 548444 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 548444 reads.
## Mutation data: after max-position pruning, there are: 548444 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 533484 reads: 14960 lost or 2.73%.
##   Mutation data: all filters removed 14961 reads, or 2.73%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1829434 indexes in all the data.
## After reads/index pruning, there are: 212532 indexes: 1616902 lost or 88.38%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 533484 changed reads.
## All data: before reads/index pruning, there are: 2421460 identical reads.
## All data: after index pruning, there are: 137380 changed reads: 25.75%.
## All data: after index pruning, there are: 629940 identical reads: 26.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 629940 identical reads.
## Before classification, there are 137380 reads with mutations.
## After classification, there are 542896 reads/indexes which are only identical.
## After classification, there are 1940 reads/indexes which are strictly sequencer.
## After classification, there are 60340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 985552 forward reads and 1114299 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: control_rna.
##   Reading the file containing mutations: preprocessing/control_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after min-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after max-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 261209 reads: 5805 lost or 2.17%.
##   Mutation data: all filters removed 5805 reads, or 2.17%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1155581 indexes in all the data.
## After reads/index pruning, there are: 210352 indexes: 945229 lost or 81.80%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 261209 changed reads.
## All data: before reads/index pruning, there are: 1854668 identical reads.
## All data: after index pruning, there are: 88407 changed reads: 33.85%.
## All data: after index pruning, there are: 711770 identical reads: 38.38%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 711770 identical reads.
## Before classification, there are 88407 reads with mutations.
## After classification, there are 609535 reads/indexes which are only identical.
## After classification, there are 4778 reads/indexes which are strictly sequencer.
## After classification, there are 9929 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1101670 forward reads and 1250588 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_rna.
##   Reading the file containing mutations: preprocessing/low_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 387777 reads.
## Mutation data: after min-position pruning, there are: 387776 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 387776 reads.
## Mutation data: after max-position pruning, there are: 387776 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 376092 reads: 11684 lost or 3.01%.
##   Mutation data: all filters removed 11685 reads, or 3.01%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1306310 indexes in all the data.
## After reads/index pruning, there are: 255514 indexes: 1050796 lost or 80.44%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 376092 changed reads.
## All data: before reads/index pruning, there are: 2083008 identical reads.
## All data: after index pruning, there are: 136149 changed reads: 36.20%.
## All data: after index pruning, there are: 833439 identical reads: 40.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 833439 identical reads.
## Before classification, there are 136149 reads with mutations.
## After classification, there are 709121 reads/indexes which are only identical.
## After classification, there are 5313 reads/indexes which are strictly sequencer.
## After classification, there are 36973 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1343128 forward reads and 1445434 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_rna.
##   Reading the file containing mutations: preprocessing/high_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 558061 reads.
## Mutation data: after min-position pruning, there are: 558059 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 558059 reads.
## Mutation data: after max-position pruning, there are: 558059 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 544407 reads: 13652 lost or 2.45%.
##   Mutation data: all filters removed 13654 reads, or 2.45%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1224839 indexes in all the data.
## After reads/index pruning, there are: 254401 indexes: 970438 lost or 79.23%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 544407 changed reads.
## All data: before reads/index pruning, there are: 1855921 identical reads.
## All data: after index pruning, there are: 216201 changed reads: 39.71%.
## All data: after index pruning, there are: 771788 identical reads: 41.59%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 771788 identical reads.
## Before classification, there are 216201 reads with mutations.
## After classification, there are 644468 reads/indexes which are only identical.
## After classification, there are 5761 reads/indexes which are strictly sequencer.
## After classification, there are 108340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1312147 forward reads and 1454177 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Deleting the file excel/20200908_new_triples-v20200314.xlsx before writing the tables.
##   Writing a legend.
## Plotting Index density for mutant reads before filtering.
## Plotting Index density for identical reads before filtering.
## Plotting Index density for all reads before filtering.
## Plotting Index density for mutant reads after filtering.
## Plotting Index density for identical reads after filtering.
## Plotting Index density for all reads after filtering.
##   Writing raw data.
##   Writing cpm data.
##   Writing data normalized by reads/indexes.
##   Writing data normalized by reads/indexes and length.
##   Writing data normalized by cpm(reads/indexes) and length.
max_mutations_per_read <- 10
excel <- glue::glue("excel/{rundate}_triples_tenmpr-v{ver}.xlsx")
triples_tenmpr <- create_matrices(sample_sheet=sample_sheet,
                                  ident_column=ident_column, mut_column=mut_column,
                                  min_reads=min_reads, min_indexes=min_indexes,
                                  min_sequencer=min_sequencer,
                                  min_position=min_position, max_position=max_position,
                                  prune_n=prune_n, verbose=verbose, excel=excel)
## Dropped 6 rows from the sample metadata because they were blank.
## Starting sample: control_dna.
##   Reading the file containing mutations: preprocessing/control_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 344009 reads.
## Mutation data: after min-position pruning, there are: 344007 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 344007 reads.
## Mutation data: after max-position pruning, there are: 344007 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 337124 reads: 6883 lost or 2.00%.
##   Mutation data: all filters removed 6885 reads, or 2.00%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1392352 indexes in all the data.
## After reads/index pruning, there are: 258638 indexes: 1133714 lost or 81.42%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 337124 changed reads.
## All data: before reads/index pruning, there are: 2240589 identical reads.
## All data: after index pruning, there are: 117053 changed reads: 34.72%.
## All data: after index pruning, there are: 870547 identical reads: 38.85%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 870547 identical reads.
## Before classification, there are 117053 reads with mutations.
## After classification, there are 741108 reads/indexes which are only identical.
## After classification, there are 6056 reads/indexes which are strictly sequencer.
## After classification, there are 12445 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1336553 forward reads and 1515986 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_dna.
##   Reading the file containing mutations: preprocessing/low_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 418833 reads.
## Mutation data: after min-position pruning, there are: 418830 reads: 3 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 418830 reads.
## Mutation data: after max-position pruning, there are: 418830 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 411131 reads: 7699 lost or 1.84%.
##   Mutation data: all filters removed 7702 reads, or 1.84%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1950160 indexes in all the data.
## After reads/index pruning, there are: 177404 indexes: 1772756 lost or 90.90%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 411131 changed reads.
## All data: before reads/index pruning, there are: 2566212 identical reads.
## All data: after index pruning, there are: 85285 changed reads: 20.74%.
## All data: after index pruning, there are: 540399 identical reads: 21.06%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 540399 identical reads.
## Before classification, there are 85285 reads with mutations.
## After classification, there are 468326 reads/indexes which are only identical.
## After classification, there are 1201 reads/indexes which are strictly sequencer.
## After classification, there are 20730 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 771013 forward reads and 901147 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_dna.
##   Reading the file containing mutations: preprocessing/high_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 548445 reads.
## Mutation data: after min-position pruning, there are: 548444 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 548444 reads.
## Mutation data: after max-position pruning, there are: 548444 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 533484 reads: 14960 lost or 2.73%.
##   Mutation data: all filters removed 14961 reads, or 2.73%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1829434 indexes in all the data.
## After reads/index pruning, there are: 212532 indexes: 1616902 lost or 88.38%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 533484 changed reads.
## All data: before reads/index pruning, there are: 2421460 identical reads.
## All data: after index pruning, there are: 137380 changed reads: 25.75%.
## All data: after index pruning, there are: 629940 identical reads: 26.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 629940 identical reads.
## Before classification, there are 137380 reads with mutations.
## After classification, there are 542896 reads/indexes which are only identical.
## After classification, there are 1940 reads/indexes which are strictly sequencer.
## After classification, there are 60340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 985552 forward reads and 1114299 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: control_rna.
##   Reading the file containing mutations: preprocessing/control_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after min-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after max-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 261209 reads: 5805 lost or 2.17%.
##   Mutation data: all filters removed 5805 reads, or 2.17%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1155581 indexes in all the data.
## After reads/index pruning, there are: 210352 indexes: 945229 lost or 81.80%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 261209 changed reads.
## All data: before reads/index pruning, there are: 1854668 identical reads.
## All data: after index pruning, there are: 88407 changed reads: 33.85%.
## All data: after index pruning, there are: 711770 identical reads: 38.38%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 711770 identical reads.
## Before classification, there are 88407 reads with mutations.
## After classification, there are 609535 reads/indexes which are only identical.
## After classification, there are 4778 reads/indexes which are strictly sequencer.
## After classification, there are 9929 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1101670 forward reads and 1250588 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_rna.
##   Reading the file containing mutations: preprocessing/low_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 387777 reads.
## Mutation data: after min-position pruning, there are: 387776 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 387776 reads.
## Mutation data: after max-position pruning, there are: 387776 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 376092 reads: 11684 lost or 3.01%.
##   Mutation data: all filters removed 11685 reads, or 3.01%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1306310 indexes in all the data.
## After reads/index pruning, there are: 255514 indexes: 1050796 lost or 80.44%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 376092 changed reads.
## All data: before reads/index pruning, there are: 2083008 identical reads.
## All data: after index pruning, there are: 136149 changed reads: 36.20%.
## All data: after index pruning, there are: 833439 identical reads: 40.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 833439 identical reads.
## Before classification, there are 136149 reads with mutations.
## After classification, there are 709121 reads/indexes which are only identical.
## After classification, there are 5313 reads/indexes which are strictly sequencer.
## After classification, there are 36973 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1343128 forward reads and 1445434 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_rna.
##   Reading the file containing mutations: preprocessing/high_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 558061 reads.
## Mutation data: after min-position pruning, there are: 558059 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 558059 reads.
## Mutation data: after max-position pruning, there are: 558059 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 544407 reads: 13652 lost or 2.45%.
##   Mutation data: all filters removed 13654 reads, or 2.45%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1224839 indexes in all the data.
## After reads/index pruning, there are: 254401 indexes: 970438 lost or 79.23%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 544407 changed reads.
## All data: before reads/index pruning, there are: 1855921 identical reads.
## All data: after index pruning, there are: 216201 changed reads: 39.71%.
## All data: after index pruning, there are: 771788 identical reads: 41.59%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 771788 identical reads.
## Before classification, there are 216201 reads with mutations.
## After classification, there are 644468 reads/indexes which are only identical.
## After classification, there are 5761 reads/indexes which are strictly sequencer.
## After classification, there are 108340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1312147 forward reads and 1454177 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
##   Writing a legend.
## Plotting Index density for mutant reads before filtering.
## Plotting Index density for identical reads before filtering.
## Plotting Index density for all reads before filtering.
## Plotting Index density for mutant reads after filtering.
## Plotting Index density for identical reads after filtering.
## Plotting Index density for all reads after filtering.
##   Writing raw data.
##   Writing cpm data.
##   Writing data normalized by reads/indexes.
##   Writing data normalized by reads/indexes and length.
##   Writing data normalized by cpm(reads/indexes) and length.
max_mutations_per_read <- 5
excel <- glue::glue("excel/{rundate}_triples_fivempr-v{ver}.xlsx")
triples_fivempr <- create_matrices(sample_sheet=sample_sheet,
                                   ident_column=ident_column, mut_column=mut_column,
                                   min_reads=min_reads, min_indexes=min_indexes,
                                   min_sequencer=min_sequencer,
                                   min_position=min_position, max_position=max_position,
                                   prune_n=prune_n, verbose=verbose, excel=excel)
## Dropped 6 rows from the sample metadata because they were blank.
## Starting sample: control_dna.
##   Reading the file containing mutations: preprocessing/control_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 344009 reads.
## Mutation data: after min-position pruning, there are: 344007 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 344007 reads.
## Mutation data: after max-position pruning, there are: 344007 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 337124 reads: 6883 lost or 2.00%.
##   Mutation data: all filters removed 6885 reads, or 2.00%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1392352 indexes in all the data.
## After reads/index pruning, there are: 258638 indexes: 1133714 lost or 81.42%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 337124 changed reads.
## All data: before reads/index pruning, there are: 2240589 identical reads.
## All data: after index pruning, there are: 117053 changed reads: 34.72%.
## All data: after index pruning, there are: 870547 identical reads: 38.85%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 870547 identical reads.
## Before classification, there are 117053 reads with mutations.
## After classification, there are 741108 reads/indexes which are only identical.
## After classification, there are 6056 reads/indexes which are strictly sequencer.
## After classification, there are 12445 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1336553 forward reads and 1515986 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_dna.
##   Reading the file containing mutations: preprocessing/low_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 418833 reads.
## Mutation data: after min-position pruning, there are: 418830 reads: 3 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 418830 reads.
## Mutation data: after max-position pruning, there are: 418830 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 411131 reads: 7699 lost or 1.84%.
##   Mutation data: all filters removed 7702 reads, or 1.84%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1950160 indexes in all the data.
## After reads/index pruning, there are: 177404 indexes: 1772756 lost or 90.90%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 411131 changed reads.
## All data: before reads/index pruning, there are: 2566212 identical reads.
## All data: after index pruning, there are: 85285 changed reads: 20.74%.
## All data: after index pruning, there are: 540399 identical reads: 21.06%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 540399 identical reads.
## Before classification, there are 85285 reads with mutations.
## After classification, there are 468326 reads/indexes which are only identical.
## After classification, there are 1201 reads/indexes which are strictly sequencer.
## After classification, there are 20730 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 771013 forward reads and 901147 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_dna.
##   Reading the file containing mutations: preprocessing/high_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 548445 reads.
## Mutation data: after min-position pruning, there are: 548444 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 548444 reads.
## Mutation data: after max-position pruning, there are: 548444 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 533484 reads: 14960 lost or 2.73%.
##   Mutation data: all filters removed 14961 reads, or 2.73%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1829434 indexes in all the data.
## After reads/index pruning, there are: 212532 indexes: 1616902 lost or 88.38%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 533484 changed reads.
## All data: before reads/index pruning, there are: 2421460 identical reads.
## All data: after index pruning, there are: 137380 changed reads: 25.75%.
## All data: after index pruning, there are: 629940 identical reads: 26.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 629940 identical reads.
## Before classification, there are 137380 reads with mutations.
## After classification, there are 542896 reads/indexes which are only identical.
## After classification, there are 1940 reads/indexes which are strictly sequencer.
## After classification, there are 60340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 985552 forward reads and 1114299 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: control_rna.
##   Reading the file containing mutations: preprocessing/control_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after min-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after max-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 261209 reads: 5805 lost or 2.17%.
##   Mutation data: all filters removed 5805 reads, or 2.17%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1155581 indexes in all the data.
## After reads/index pruning, there are: 210352 indexes: 945229 lost or 81.80%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 261209 changed reads.
## All data: before reads/index pruning, there are: 1854668 identical reads.
## All data: after index pruning, there are: 88407 changed reads: 33.85%.
## All data: after index pruning, there are: 711770 identical reads: 38.38%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 711770 identical reads.
## Before classification, there are 88407 reads with mutations.
## After classification, there are 609535 reads/indexes which are only identical.
## After classification, there are 4778 reads/indexes which are strictly sequencer.
## After classification, there are 9929 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1101670 forward reads and 1250588 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_rna.
##   Reading the file containing mutations: preprocessing/low_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 387777 reads.
## Mutation data: after min-position pruning, there are: 387776 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 387776 reads.
## Mutation data: after max-position pruning, there are: 387776 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 376092 reads: 11684 lost or 3.01%.
##   Mutation data: all filters removed 11685 reads, or 3.01%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1306310 indexes in all the data.
## After reads/index pruning, there are: 255514 indexes: 1050796 lost or 80.44%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 376092 changed reads.
## All data: before reads/index pruning, there are: 2083008 identical reads.
## All data: after index pruning, there are: 136149 changed reads: 36.20%.
## All data: after index pruning, there are: 833439 identical reads: 40.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 833439 identical reads.
## Before classification, there are 136149 reads with mutations.
## After classification, there are 709121 reads/indexes which are only identical.
## After classification, there are 5313 reads/indexes which are strictly sequencer.
## After classification, there are 36973 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1343128 forward reads and 1445434 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_rna.
##   Reading the file containing mutations: preprocessing/high_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 558061 reads.
## Mutation data: after min-position pruning, there are: 558059 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 558059 reads.
## Mutation data: after max-position pruning, there are: 558059 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 544407 reads: 13652 lost or 2.45%.
##   Mutation data: all filters removed 13654 reads, or 2.45%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1224839 indexes in all the data.
## After reads/index pruning, there are: 254401 indexes: 970438 lost or 79.23%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 544407 changed reads.
## All data: before reads/index pruning, there are: 1855921 identical reads.
## All data: after index pruning, there are: 216201 changed reads: 39.71%.
## All data: after index pruning, there are: 771788 identical reads: 41.59%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 771788 identical reads.
## Before classification, there are 216201 reads with mutations.
## After classification, there are 644468 reads/indexes which are only identical.
## After classification, there are 5761 reads/indexes which are strictly sequencer.
## After classification, there are 108340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1312147 forward reads and 1454177 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various queries.
##   Writing a legend.
## Plotting Index density for mutant reads before filtering.
## Plotting Index density for identical reads before filtering.
## Plotting Index density for all reads before filtering.
## Plotting Index density for mutant reads after filtering.
## Plotting Index density for identical reads after filtering.
## Plotting Index density for all reads after filtering.
##   Writing raw data.
##   Writing cpm data.
##   Writing data normalized by reads/indexes.
##   Writing data normalized by reads/indexes and length.
##   Writing data normalized by cpm(reads/indexes) and length.

1.2 Categorize the data with at least 5 indexes per mutant

min_indexes <- 5
max_mutations_per_read <- NULL
excel <- glue::glue("excel/{rundate}_quints-v{ver}.xlsx")
quints <- create_matrices(sample_sheet=sample_sheet,
                          ident_column=ident_column, mut_column=mut_column,
                          min_reads=min_reads, min_indexes=min_indexes,
                          min_sequencer=min_sequencer,
                          min_position=min_position, max_position=max_position,
                          prune_n=prune_n, verbose=verbose, excel=excel)
## Dropped 6 rows from the sample metadata because they were blank.
## Starting sample: control_dna.
##   Reading the file containing mutations: preprocessing/control_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 344009 reads.
## Mutation data: after min-position pruning, there are: 344007 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 344007 reads.
## Mutation data: after max-position pruning, there are: 344007 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 337124 reads: 6883 lost or 2.00%.
##   Mutation data: all filters removed 6885 reads, or 2.00%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1392352 indexes in all the data.
## After reads/index pruning, there are: 258638 indexes: 1133714 lost or 81.42%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 337124 changed reads.
## All data: before reads/index pruning, there are: 2240589 identical reads.
## All data: after index pruning, there are: 117053 changed reads: 34.72%.
## All data: after index pruning, there are: 870547 identical reads: 38.85%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 870547 identical reads.
## Before classification, there are 117053 reads with mutations.
## After classification, there are 741108 reads/indexes which are only identical.
## After classification, there are 6056 reads/indexes which are strictly sequencer.
## After classification, there are 12445 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1336553 forward reads and 1515986 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_dna.
##   Reading the file containing mutations: preprocessing/low_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 418833 reads.
## Mutation data: after min-position pruning, there are: 418830 reads: 3 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 418830 reads.
## Mutation data: after max-position pruning, there are: 418830 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 411131 reads: 7699 lost or 1.84%.
##   Mutation data: all filters removed 7702 reads, or 1.84%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1950160 indexes in all the data.
## After reads/index pruning, there are: 177404 indexes: 1772756 lost or 90.90%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 411131 changed reads.
## All data: before reads/index pruning, there are: 2566212 identical reads.
## All data: after index pruning, there are: 85285 changed reads: 20.74%.
## All data: after index pruning, there are: 540399 identical reads: 21.06%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 540399 identical reads.
## Before classification, there are 85285 reads with mutations.
## After classification, there are 468326 reads/indexes which are only identical.
## After classification, there are 1201 reads/indexes which are strictly sequencer.
## After classification, there are 20730 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 771013 forward reads and 901147 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_dna.
##   Reading the file containing mutations: preprocessing/high_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 548445 reads.
## Mutation data: after min-position pruning, there are: 548444 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 548444 reads.
## Mutation data: after max-position pruning, there are: 548444 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 533484 reads: 14960 lost or 2.73%.
##   Mutation data: all filters removed 14961 reads, or 2.73%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1829434 indexes in all the data.
## After reads/index pruning, there are: 212532 indexes: 1616902 lost or 88.38%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 533484 changed reads.
## All data: before reads/index pruning, there are: 2421460 identical reads.
## All data: after index pruning, there are: 137380 changed reads: 25.75%.
## All data: after index pruning, there are: 629940 identical reads: 26.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 629940 identical reads.
## Before classification, there are 137380 reads with mutations.
## After classification, there are 542896 reads/indexes which are only identical.
## After classification, there are 1940 reads/indexes which are strictly sequencer.
## After classification, there are 60340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 985552 forward reads and 1114299 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: control_rna.
##   Reading the file containing mutations: preprocessing/control_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after min-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after max-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 261209 reads: 5805 lost or 2.17%.
##   Mutation data: all filters removed 5805 reads, or 2.17%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1155581 indexes in all the data.
## After reads/index pruning, there are: 210352 indexes: 945229 lost or 81.80%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 261209 changed reads.
## All data: before reads/index pruning, there are: 1854668 identical reads.
## All data: after index pruning, there are: 88407 changed reads: 33.85%.
## All data: after index pruning, there are: 711770 identical reads: 38.38%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 711770 identical reads.
## Before classification, there are 88407 reads with mutations.
## After classification, there are 609535 reads/indexes which are only identical.
## After classification, there are 4778 reads/indexes which are strictly sequencer.
## After classification, there are 9929 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1101670 forward reads and 1250588 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_rna.
##   Reading the file containing mutations: preprocessing/low_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 387777 reads.
## Mutation data: after min-position pruning, there are: 387776 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 387776 reads.
## Mutation data: after max-position pruning, there are: 387776 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 376092 reads: 11684 lost or 3.01%.
##   Mutation data: all filters removed 11685 reads, or 3.01%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1306310 indexes in all the data.
## After reads/index pruning, there are: 255514 indexes: 1050796 lost or 80.44%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 376092 changed reads.
## All data: before reads/index pruning, there are: 2083008 identical reads.
## All data: after index pruning, there are: 136149 changed reads: 36.20%.
## All data: after index pruning, there are: 833439 identical reads: 40.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 833439 identical reads.
## Before classification, there are 136149 reads with mutations.
## After classification, there are 709121 reads/indexes which are only identical.
## After classification, there are 5313 reads/indexes which are strictly sequencer.
## After classification, there are 36973 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1343128 forward reads and 1445434 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_rna.
##   Reading the file containing mutations: preprocessing/high_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 558061 reads.
## Mutation data: after min-position pruning, there are: 558059 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 558059 reads.
## Mutation data: after max-position pruning, there are: 558059 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 544407 reads: 13652 lost or 2.45%.
##   Mutation data: all filters removed 13654 reads, or 2.45%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1224839 indexes in all the data.
## After reads/index pruning, there are: 254401 indexes: 970438 lost or 79.23%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 544407 changed reads.
## All data: before reads/index pruning, there are: 1855921 identical reads.
## All data: after index pruning, there are: 216201 changed reads: 39.71%.
## All data: after index pruning, there are: 771788 identical reads: 41.59%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 771788 identical reads.
## Before classification, there are 216201 reads with mutations.
## After classification, there are 644468 reads/indexes which are only identical.
## After classification, there are 5761 reads/indexes which are strictly sequencer.
## After classification, there are 108340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1312147 forward reads and 1454177 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
##   Writing a legend.
## Plotting Index density for mutant reads before filtering.
## Plotting Index density for identical reads before filtering.
## Plotting Index density for all reads before filtering.
## Plotting Index density for mutant reads after filtering.
## Plotting Index density for identical reads after filtering.
## Plotting Index density for all reads after filtering.
##   Writing raw data.
##   Writing cpm data.
##   Writing data normalized by reads/indexes.
##   Writing data normalized by reads/indexes and length.
##   Writing data normalized by cpm(reads/indexes) and length.
max_mutations_per_read <- 10
excel <- glue::glue("excel/{rundate}_quints_tenmpr-v{ver}.xlsx")
quints_tenmpr <- create_matrices(sample_sheet=sample_sheet,
                                 ident_column=ident_column, mut_column=mut_column,
                                 min_reads=min_reads, min_indexes=min_indexes,
                                 min_sequencer=min_sequencer,
                                 min_position=min_position, max_position=max_position,
                                 prune_n=prune_n, verbose=verbose, excel=excel)
## Dropped 6 rows from the sample metadata because they were blank.
## Starting sample: control_dna.
##   Reading the file containing mutations: preprocessing/control_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 344009 reads.
## Mutation data: after min-position pruning, there are: 344007 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 344007 reads.
## Mutation data: after max-position pruning, there are: 344007 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 337124 reads: 6883 lost or 2.00%.
##   Mutation data: all filters removed 6885 reads, or 2.00%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1392352 indexes in all the data.
## After reads/index pruning, there are: 258638 indexes: 1133714 lost or 81.42%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 337124 changed reads.
## All data: before reads/index pruning, there are: 2240589 identical reads.
## All data: after index pruning, there are: 117053 changed reads: 34.72%.
## All data: after index pruning, there are: 870547 identical reads: 38.85%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 870547 identical reads.
## Before classification, there are 117053 reads with mutations.
## After classification, there are 741108 reads/indexes which are only identical.
## After classification, there are 6056 reads/indexes which are strictly sequencer.
## After classification, there are 12445 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1336553 forward reads and 1515986 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_dna.
##   Reading the file containing mutations: preprocessing/low_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 418833 reads.
## Mutation data: after min-position pruning, there are: 418830 reads: 3 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 418830 reads.
## Mutation data: after max-position pruning, there are: 418830 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 411131 reads: 7699 lost or 1.84%.
##   Mutation data: all filters removed 7702 reads, or 1.84%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1950160 indexes in all the data.
## After reads/index pruning, there are: 177404 indexes: 1772756 lost or 90.90%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 411131 changed reads.
## All data: before reads/index pruning, there are: 2566212 identical reads.
## All data: after index pruning, there are: 85285 changed reads: 20.74%.
## All data: after index pruning, there are: 540399 identical reads: 21.06%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 540399 identical reads.
## Before classification, there are 85285 reads with mutations.
## After classification, there are 468326 reads/indexes which are only identical.
## After classification, there are 1201 reads/indexes which are strictly sequencer.
## After classification, there are 20730 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 771013 forward reads and 901147 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_dna.
##   Reading the file containing mutations: preprocessing/high_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 548445 reads.
## Mutation data: after min-position pruning, there are: 548444 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 548444 reads.
## Mutation data: after max-position pruning, there are: 548444 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 533484 reads: 14960 lost or 2.73%.
##   Mutation data: all filters removed 14961 reads, or 2.73%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1829434 indexes in all the data.
## After reads/index pruning, there are: 212532 indexes: 1616902 lost or 88.38%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 533484 changed reads.
## All data: before reads/index pruning, there are: 2421460 identical reads.
## All data: after index pruning, there are: 137380 changed reads: 25.75%.
## All data: after index pruning, there are: 629940 identical reads: 26.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 629940 identical reads.
## Before classification, there are 137380 reads with mutations.
## After classification, there are 542896 reads/indexes which are only identical.
## After classification, there are 1940 reads/indexes which are strictly sequencer.
## After classification, there are 60340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 985552 forward reads and 1114299 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: control_rna.
##   Reading the file containing mutations: preprocessing/control_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after min-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after max-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 261209 reads: 5805 lost or 2.17%.
##   Mutation data: all filters removed 5805 reads, or 2.17%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1155581 indexes in all the data.
## After reads/index pruning, there are: 210352 indexes: 945229 lost or 81.80%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 261209 changed reads.
## All data: before reads/index pruning, there are: 1854668 identical reads.
## All data: after index pruning, there are: 88407 changed reads: 33.85%.
## All data: after index pruning, there are: 711770 identical reads: 38.38%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 711770 identical reads.
## Before classification, there are 88407 reads with mutations.
## After classification, there are 609535 reads/indexes which are only identical.
## After classification, there are 4778 reads/indexes which are strictly sequencer.
## After classification, there are 9929 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1101670 forward reads and 1250588 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_rna.
##   Reading the file containing mutations: preprocessing/low_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 387777 reads.
## Mutation data: after min-position pruning, there are: 387776 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 387776 reads.
## Mutation data: after max-position pruning, there are: 387776 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 376092 reads: 11684 lost or 3.01%.
##   Mutation data: all filters removed 11685 reads, or 3.01%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1306310 indexes in all the data.
## After reads/index pruning, there are: 255514 indexes: 1050796 lost or 80.44%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 376092 changed reads.
## All data: before reads/index pruning, there are: 2083008 identical reads.
## All data: after index pruning, there are: 136149 changed reads: 36.20%.
## All data: after index pruning, there are: 833439 identical reads: 40.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 833439 identical reads.
## Before classification, there are 136149 reads with mutations.
## After classification, there are 709121 reads/indexes which are only identical.
## After classification, there are 5313 reads/indexes which are strictly sequencer.
## After classification, there are 36973 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1343128 forward reads and 1445434 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_rna.
##   Reading the file containing mutations: preprocessing/high_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 558061 reads.
## Mutation data: after min-position pruning, there are: 558059 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 558059 reads.
## Mutation data: after max-position pruning, there are: 558059 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 544407 reads: 13652 lost or 2.45%.
##   Mutation data: all filters removed 13654 reads, or 2.45%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1224839 indexes in all the data.
## After reads/index pruning, there are: 254401 indexes: 970438 lost or 79.23%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 544407 changed reads.
## All data: before reads/index pruning, there are: 1855921 identical reads.
## All data: after index pruning, there are: 216201 changed reads: 39.71%.
## All data: after index pruning, there are: 771788 identical reads: 41.59%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 771788 identical reads.
## Before classification, there are 216201 reads with mutations.
## After classification, there are 644468 reads/indexes which are only identical.
## After classification, there are 5761 reads/indexes which are strictly sequencer.
## After classification, there are 108340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1312147 forward reads and 1454177 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
##   Writing a legend.
## Plotting Index density for mutant reads before filtering.
## Plotting Index density for identical reads before filtering.
## Plotting Index density for all reads before filtering.
## Plotting Index density for mutant reads after filtering.
## Plotting Index density for identical reads after filtering.
## Plotting Index density for all reads after filtering.
##   Writing raw data.
##   Writing cpm data.
##   Writing data normalized by reads/indexes.
##   Writing data normalized by reads/indexes and length.
##   Writing data normalized by cpm(reads/indexes) and length.
max_mutations_per_read <- 5
excel <- glue::glue("excel/{rundate}_quints_fivempr-v{ver}.xlsx")
quints_fivempr <- create_matrices(sample_sheet=sample_sheet,
                                  ident_column=ident_column, mut_column=mut_column,
                                  min_reads=min_reads, min_indexes=min_indexes,
                                  min_sequencer=min_sequencer,
                                  min_position=min_position, max_position=max_position,
                                  prune_n=prune_n, verbose=verbose, excel=excel)
## Dropped 6 rows from the sample metadata because they were blank.
## Starting sample: control_dna.
##   Reading the file containing mutations: preprocessing/control_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 344009 reads.
## Mutation data: after min-position pruning, there are: 344007 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 344007 reads.
## Mutation data: after max-position pruning, there are: 344007 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 337124 reads: 6883 lost or 2.00%.
##   Mutation data: all filters removed 6885 reads, or 2.00%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1392352 indexes in all the data.
## After reads/index pruning, there are: 258638 indexes: 1133714 lost or 81.42%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 337124 changed reads.
## All data: before reads/index pruning, there are: 2240589 identical reads.
## All data: after index pruning, there are: 117053 changed reads: 34.72%.
## All data: after index pruning, there are: 870547 identical reads: 38.85%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 870547 identical reads.
## Before classification, there are 117053 reads with mutations.
## After classification, there are 741108 reads/indexes which are only identical.
## After classification, there are 6056 reads/indexes which are strictly sequencer.
## After classification, there are 12445 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1336553 forward reads and 1515986 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_dna.
##   Reading the file containing mutations: preprocessing/low_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 418833 reads.
## Mutation data: after min-position pruning, there are: 418830 reads: 3 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 418830 reads.
## Mutation data: after max-position pruning, there are: 418830 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 411131 reads: 7699 lost or 1.84%.
##   Mutation data: all filters removed 7702 reads, or 1.84%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1950160 indexes in all the data.
## After reads/index pruning, there are: 177404 indexes: 1772756 lost or 90.90%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 411131 changed reads.
## All data: before reads/index pruning, there are: 2566212 identical reads.
## All data: after index pruning, there are: 85285 changed reads: 20.74%.
## All data: after index pruning, there are: 540399 identical reads: 21.06%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 540399 identical reads.
## Before classification, there are 85285 reads with mutations.
## After classification, there are 468326 reads/indexes which are only identical.
## After classification, there are 1201 reads/indexes which are strictly sequencer.
## After classification, there are 20730 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 771013 forward reads and 901147 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_dna.
##   Reading the file containing mutations: preprocessing/high_dna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_dna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 548445 reads.
## Mutation data: after min-position pruning, there are: 548444 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 548444 reads.
## Mutation data: after max-position pruning, there are: 548444 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 533484 reads: 14960 lost or 2.73%.
##   Mutation data: all filters removed 14961 reads, or 2.73%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1829434 indexes in all the data.
## After reads/index pruning, there are: 212532 indexes: 1616902 lost or 88.38%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 533484 changed reads.
## All data: before reads/index pruning, there are: 2421460 identical reads.
## All data: after index pruning, there are: 137380 changed reads: 25.75%.
## All data: after index pruning, there are: 629940 identical reads: 26.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 629940 identical reads.
## Before classification, there are 137380 reads with mutations.
## After classification, there are 542896 reads/indexes which are only identical.
## After classification, there are 1940 reads/indexes which are strictly sequencer.
## After classification, there are 60340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 985552 forward reads and 1114299 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: control_rna.
##   Reading the file containing mutations: preprocessing/control_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/control_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after min-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 267014 reads.
## Mutation data: after max-position pruning, there are: 267014 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 261209 reads: 5805 lost or 2.17%.
##   Mutation data: all filters removed 5805 reads, or 2.17%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1155581 indexes in all the data.
## After reads/index pruning, there are: 210352 indexes: 945229 lost or 81.80%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 261209 changed reads.
## All data: before reads/index pruning, there are: 1854668 identical reads.
## All data: after index pruning, there are: 88407 changed reads: 33.85%.
## All data: after index pruning, there are: 711770 identical reads: 38.38%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 711770 identical reads.
## Before classification, there are 88407 reads with mutations.
## After classification, there are 609535 reads/indexes which are only identical.
## After classification, there are 4778 reads/indexes which are strictly sequencer.
## After classification, there are 9929 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1101670 forward reads and 1250588 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: low_rna.
##   Reading the file containing mutations: preprocessing/low_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/low_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 387777 reads.
## Mutation data: after min-position pruning, there are: 387776 reads: 1 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 387776 reads.
## Mutation data: after max-position pruning, there are: 387776 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 376092 reads: 11684 lost or 3.01%.
##   Mutation data: all filters removed 11685 reads, or 3.01%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1306310 indexes in all the data.
## After reads/index pruning, there are: 255514 indexes: 1050796 lost or 80.44%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 376092 changed reads.
## All data: before reads/index pruning, there are: 2083008 identical reads.
## All data: after index pruning, there are: 136149 changed reads: 36.20%.
## All data: after index pruning, there are: 833439 identical reads: 40.01%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 833439 identical reads.
## Before classification, there are 136149 reads with mutations.
## After classification, there are 709121 reads/indexes which are only identical.
## After classification, there are 5313 reads/indexes which are strictly sequencer.
## After classification, there are 36973 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1343128 forward reads and 1445434 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
## Starting sample: high_rna.
##   Reading the file containing mutations: preprocessing/high_rna/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/high_rna/step2_identical_reads.txt.xz
##   Counting indexes before filtering.
## Mutation data: removing any differences before position: 5.
## Mutation data: before pruning, there are: 558061 reads.
## Mutation data: after min-position pruning, there are: 558059 reads: 2 lost or 0.00%.
## Mutation data: removing any differences after position: 200.
## Mutation data: before pruning, there are: 558059 reads.
## Mutation data: after max-position pruning, there are: 558059 reads: 0 lost or 0.00%.
## Mutation data: removing any reads with 'N' as the hit.
## Mutation data: after N pruning, there are: 544407 reads: 13652 lost or 2.45%.
##   Mutation data: all filters removed 13654 reads, or 2.45%.
## Gathering information about the number of reads per index.
## Before reads/index pruning, there are: 1224839 indexes in all the data.
## After reads/index pruning, there are: 254401 indexes: 970438 lost or 79.23%.
## All data: removing indexes with fewer than 3 reads/index.
## All data: before reads/index pruning, there are: 544407 changed reads.
## All data: before reads/index pruning, there are: 1855921 identical reads.
## All data: after index pruning, there are: 216201 changed reads: 39.71%.
## All data: after index pruning, there are: 771788 identical reads: 41.59%.
## Gathering identical, mutant, and sequencer reads/indexes.
## Before classification, there are 771788 identical reads.
## Before classification, there are 216201 reads with mutations.
## After classification, there are 644468 reads/indexes which are only identical.
## After classification, there are 5761 reads/indexes which are strictly sequencer.
## After classification, there are 108340 reads/indexes which are deemed from reverse transcriptase.
## Counted by direction: 1312147 forward reads and 1454177 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various queries.
##   Writing a legend.
## Plotting Index density for mutant reads before filtering.
## Plotting Index density for identical reads before filtering.
## Plotting Index density for all reads before filtering.
## Plotting Index density for mutant reads after filtering.
## Plotting Index density for identical reads after filtering.
## Plotting Index density for all reads after filtering.
##   Writing raw data.
##   Writing cpm data.
##   Writing data normalized by reads/indexes.
##   Writing data normalized by reads/indexes and length.
##   Writing data normalized by cpm(reads/indexes) and length.

2 Questions from Dr. DeStefano

I think what is best is to get the number of recovered mutations of each type from each data set. That would be A to T, A to G, A to C; T to A, T to G, T to C; G to A, G to C, G to T; and C to A, C to G, C to T; as well as deletions and insertions. I would then need the sum number of the reads that met all our criteria (i.e. at least 3 good recovered reads for that 14 nt index). Each set of 3 or more would ct as “1” read of that particular index so I would need the total with this in mind. I also need to know the total number of nucleotides that were in the region we decided to consider in the analysis. We may want to try this for 3 or more and 5 or more recovered indexes if it is not hard. This information does not include specific positions on the template where errors occurred but we can look at that latter. Right now I just want to get a general error rate and type of error. It would basically be calculated by dividing the number of recovered mutations of a particular type by sum number of the reads times the number of nucleotides screened in the template. As it ends up, this number does not really have a lot of meaning but it can be used to calculate the overall mutation rate as well as the rate for transversions, transitions, and deletions and insertions.

3 Answers

In order to address those queries, I invoked create_matrices() with a minimum index count of 3 and 5. It should be noted that this is not the same as requiring 3 or 5 reads per index. In both cases I require 3 reads per index.

3.1 Recovered mutations of each type

I am interpreting this question as the number of indexes recovered for each mutation type. I collect this information in 2 ways of interest: the indexes by type which are deemed to be from the RT and from the sequencer. In addition, I calculate a normalized (cpm) version of this information which may be used to look for changes across samples.

3.1.1 Mutations by RT index

This following block should print out tables of the numbers of mutant indexes observed for each type for the RT and the sequencer. One would hope that the sequencer will be consistent for all samples, but I think the results will instead suggest that my metric is not yet stringent enough.

knitr::kable(triples[["matrices"]][["miss_indexes_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 415 751 1565 345 1483 2903
A_G 376 1550 8974 257 4220 11027
A_T 191 366 1109 108 1176 2196
C_A 3501 3381 6207 3622 7168 12117
C_G 282 1179 8557 284 2108 4604
C_T 773 1830 12891 583 4868 23471
G_A 755 5387 8187 396 4251 23361
G_C 179 279 407 78 411 789
G_T 4783 4228 8160 3275 3305 4373
T_A 131 337 767 104 1050 1683
T_C 376 528 1362 247 1917 14252
T_G 559 817 1998 511 1508 3997
knitr::kable(triples_tenmpr[["matrices"]][["miss_indexes_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 415 751 1565 345 1483 2903
A_G 376 1550 8974 257 4220 11027
A_T 191 366 1109 108 1176 2196
C_A 3501 3381 6207 3622 7168 12117
C_G 282 1179 8557 284 2108 4604
C_T 773 1830 12891 583 4868 23471
G_A 755 5387 8187 396 4251 23361
G_C 179 279 407 78 411 789
G_T 4783 4228 8160 3275 3305 4373
T_A 131 337 767 104 1050 1683
T_C 376 528 1362 247 1917 14252
T_G 559 817 1998 511 1508 3997
knitr::kable(triples_fivempr[["matrices"]][["miss_indexes_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 415 751 1565 345 1483 2903
A_G 376 1550 8974 257 4220 11027
A_T 191 366 1109 108 1176 2196
C_A 3501 3381 6207 3622 7168 12117
C_G 282 1179 8557 284 2108 4604
C_T 773 1830 12891 583 4868 23471
G_A 755 5387 8187 396 4251 23361
G_C 179 279 407 78 411 789
G_T 4783 4228 8160 3275 3305 4373
T_A 131 337 767 104 1050 1683
T_C 376 528 1362 247 1917 14252
T_G 559 817 1998 511 1508 3997
knitr::kable(quints[["matrices"]][["miss_indexes_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 408 742 1561 320 1480 2903
A_G 349 1536 8970 224 4216 11027
A_T 150 335 1109 59 1170 2193
C_A 3495 3378 6207 3612 7168 12117
C_G 244 1169 8554 254 2108 4604
C_T 761 1819 12891 561 4865 23471
G_A 744 5384 8187 371 4251 23361
G_C 153 254 378 54 390 789
G_T 4783 4228 8160 3271 3305 4373
T_A 93 312 757 71 1043 1677
T_C 357 517 1350 215 1917 14252
T_G 544 799 1987 482 1497 3990
knitr::kable(quints_tenmpr[["matrices"]][["miss_indexes_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 408 742 1561 320 1480 2903
A_G 349 1536 8970 224 4216 11027
A_T 150 335 1109 59 1170 2193
C_A 3495 3378 6207 3612 7168 12117
C_G 244 1169 8554 254 2108 4604
C_T 761 1819 12891 561 4865 23471
G_A 744 5384 8187 371 4251 23361
G_C 153 254 378 54 390 789
G_T 4783 4228 8160 3271 3305 4373
T_A 93 312 757 71 1043 1677
T_C 357 517 1350 215 1917 14252
T_G 544 799 1987 482 1497 3990
knitr::kable(quints_fivempr[["matrices"]][["miss_indexes_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 408 742 1561 320 1480 2903
A_G 349 1536 8970 224 4216 11027
A_T 150 335 1109 59 1170 2193
C_A 3495 3378 6207 3612 7168 12117
C_G 244 1169 8554 254 2108 4604
C_T 761 1819 12891 561 4865 23471
G_A 744 5384 8187 371 4251 23361
G_C 153 254 378 54 390 789
G_T 4783 4228 8160 3271 3305 4373
T_A 93 312 757 71 1043 1677
T_C 357 517 1350 215 1917 14252
T_G 544 799 1987 482 1497 3990
knitr::kable(triples[["matrices"]][["miss_sequencer_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 892 161 295 717 801 801
A_G 420 43 98 351 370 445
A_T 122 3 20 100 111 137
C_A 536 108 151 434 540 621
C_G 505 68 182 348 392 495
C_T 228 17 25 205 257 219
G_A 228 27 36 190 180 222
G_C 150 16 30 121 122 146
G_T 425 55 111 323 353 430
T_A 193 9 55 122 106 148
T_C 368 26 100 329 309 309
T_G 1785 382 509 1313 1573 1594
knitr::kable(triples_tenmpr[["matrices"]][["miss_sequencer_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 892 161 295 717 801 801
A_G 420 43 98 351 370 445
A_T 122 3 20 100 111 137
C_A 536 108 151 434 540 621
C_G 505 68 182 348 392 495
C_T 228 17 25 205 257 219
G_A 228 27 36 190 180 222
G_C 150 16 30 121 122 146
G_T 425 55 111 323 353 430
T_A 193 9 55 122 106 148
T_C 368 26 100 329 309 309
T_G 1785 382 509 1313 1573 1594
knitr::kable(triples_fivempr[["matrices"]][["miss_sequencer_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 892 161 295 717 801 801
A_G 420 43 98 351 370 445
A_T 122 3 20 100 111 137
C_A 536 108 151 434 540 621
C_G 505 68 182 348 392 495
C_T 228 17 25 205 257 219
G_A 228 27 36 190 180 222
G_C 150 16 30 121 122 146
G_T 425 55 111 323 353 430
T_A 193 9 55 122 106 148
T_C 368 26 100 329 309 309
T_G 1785 382 509 1313 1573 1594
knitr::kable(quints[["matrices"]][["miss_sequencer_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 860 151 279 694 781 787
A_G 392 20 81 313 336 426
A_T 93 0 0 64 67 116
C_A 513 94 129 412 507 603
C_G 487 51 167 324 380 473
C_T 171 0 8 124 185 176
G_A 212 10 15 144 138 184
G_C 109 13 23 94 93 108
G_T 387 35 98 273 315 397
T_A 149 5 36 88 82 115
T_C 347 5 79 308 269 273
T_G 1772 364 500 1294 1552 1574
knitr::kable(quints_tenmpr[["matrices"]][["miss_sequencer_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 860 151 279 694 781 787
A_G 392 20 81 313 336 426
A_T 93 0 0 64 67 116
C_A 513 94 129 412 507 603
C_G 487 51 167 324 380 473
C_T 171 0 8 124 185 176
G_A 212 10 15 144 138 184
G_C 109 13 23 94 93 108
G_T 387 35 98 273 315 397
T_A 149 5 36 88 82 115
T_C 347 5 79 308 269 273
T_G 1772 364 500 1294 1552 1574
knitr::kable(quints_fivempr[["matrices"]][["miss_sequencer_by_type"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A_C 860 151 279 694 781 787
A_G 392 20 81 313 336 426
A_T 93 0 0 64 67 116
C_A 513 94 129 412 507 603
C_G 487 51 167 324 380 473
C_T 171 0 8 124 185 176
G_A 212 10 15 144 138 184
G_C 109 13 23 94 93 108
G_T 387 35 98 273 315 397
T_A 149 5 36 88 82 115
T_C 347 5 79 308 269 273
T_G 1772 364 500 1294 1552 1574

Plots of this information

triples[["plots"]][["counts"]][["miss_indexes_by_type"]]
## NULL
triples_tenmpr[["plots"]][["counts"]][["miss_indexes_by_type"]]
## NULL
triples_fivempr[["plots"]][["counts"]][["miss_indexes_by_type"]]
## NULL
quints[["plots"]][["counts"]][["miss_indexes_by_type"]]
## NULL
quints_tenmpr[["plots"]][["counts"]][["miss_indexes_by_type"]]
## NULL
quints_fivempr[["plots"]][["counts"]][["miss_indexes_by_type"]]
## NULL

This suggests to me that this information needs to be normalized in some more sensible fashion. Thus the following:

3.1.2 Mutations by RT index, post normalization

The same numbers may be expressed in the context of the number of indexes observed / sample and/or as a cpm across samples. Thus in the first instance one can look at the apparent error rate for each sample, and in the second instance one may look for relative changes in apparent error rate across samples.

3.1.2.1 Rewriting the matrices as cpm to account for library sizes.

knitr::kable(triples[["normalized"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_tenmpr[["normalized"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_fivempr[["normalized"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["normalized"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_tenmpr[["normalized"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_fivempr[["normalized"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples[["normalized"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_tenmpr[["normalized"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_fivempr[["normalized"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["normalized"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_tenmpr[["normalized"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_fivempr[["normalized"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

3.1.2.2 Rewriting the matrices by dividing by all indexes

This I think starts to address the later text in your query.

knitr::kable(triples[["matrices_by_counts"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["matrices_by_counts"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples[["matrices_by_counts"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["matrices_by_counts"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

3.1.2.3 Rewriting the matrices by dividing by all indexes and cpm

I think this might prove to be where we get the most meaningful results.

The nicest thing in it is that after accounting for library sizes and total indexes observed, we finally see that the sequencer error is mostly consistent across all samples and mutation types – with a couple of notable exceptions.

By the same token, for the mutations which are identical for the sequencer, we have some which are decidedly different for the non-sequencer data. The most notable examples I think are A to G but _not G to A; and C to T.

knitr::kable(triples[["normalized_by_counts"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_tenmpr[["normalized_by_counts"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_fivempr[["normalized_by_counts"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["normalized_by_counts"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_tenmpr[["normalized_by_counts"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_fivempr[["normalized_by_counts"]][["miss_indexes_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples[["normalized_by_counts"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_tenmpr[["normalized_by_counts"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_fivempr[["normalized_by_counts"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["normalized_by_counts"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_tenmpr[["normalized_by_counts"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_fivempr[["normalized_by_counts"]][["miss_sequencer_by_type"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

3.1.3 Indels by RT index

The following blocks will repeat the above, but looking for insertions. This data does not observe sufficient deletions to make a proper count for them.

knitr::kable(triples[["matrices"]][["insert_indexes_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A 6 16 64 0 474 522
C 0 6 13 0 217 349
G 0 0 13 0 55 38
T 0 22 29 3 2726 2640
knitr::kable(triples_tenmpr[["matrices"]][["insert_indexes_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A 6 16 64 0 474 522
C 0 6 13 0 217 349
G 0 0 13 0 55 38
T 0 22 29 3 2726 2640
knitr::kable(triples_fivempr[["matrices"]][["insert_indexes_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A 6 16 64 0 474 522
C 0 6 13 0 217 349
G 0 0 13 0 55 38
T 0 22 29 3 2726 2640
knitr::kable(quints[["matrices"]][["insert_indexes_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A 0 16 54 0 447 506
C 0 6 6 0 213 333
G 0 0 10 0 31 23
T 0 22 29 0 2692 2605
knitr::kable(quints_tenmpr[["matrices"]][["insert_indexes_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A 0 16 54 0 447 506
C 0 6 6 0 213 333
G 0 0 10 0 31 23
T 0 22 29 0 2692 2605
knitr::kable(quints_fivempr[["matrices"]][["insert_indexes_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
A 0 16 54 0 447 506
C 0 6 6 0 213 333
G 0 0 10 0 31 23
T 0 22 29 0 2692 2605
knitr::kable(triples[["matrices"]][["insert_sequencer_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
knitr::kable(triples_tenmpr[["matrices"]][["insert_sequencer_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
knitr::kable(triples_fivempr[["matrices"]][["insert_sequencer_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
knitr::kable(quints[["matrices"]][["insert_sequencer_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
knitr::kable(quints_tenmpr[["matrices"]][["insert_sequencer_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna
knitr::kable(quints_fivempr[["matrices"]][["insert_sequencer_by_nt"]])
names control_dna low_dna high_dna control_rna low_rna high_rna

Plots of this information

triples[["plots"]][["counts"]][["insert_indexes_by_nt"]]
## NULL
triples_tenmpr[["plots"]][["counts"]][["insert_indexes_by_nt"]]
## NULL
triples_fivempr[["plots"]][["counts"]][["insert_indexes_by_nt"]]
## NULL
quints[["plots"]][["counts"]][["insert_indexes_by_nt"]]
## NULL
quints_tenmpr[["plots"]][["counts"]][["insert_indexes_by_nt"]]
## NULL
quints_fivempr[["plots"]][["counts"]][["insert_indexes_by_nt"]]
## NULL

3.1.4 Insertions by RT index, post normalization

3.1.4.1 Rewriting the matrices as cpm to account for library sizes.

knitr::kable(triples[["normalized"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_tenmpr[["normalized"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_fivempr[["normalized"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["normalized"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_tenmpr[["normalized"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_fivempr[["normalized"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples[["normalized"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_tenmpr[["normalized"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_fivempr[["normalized"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["normalized"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_tenmpr[["normalized"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_fivempr[["normalized"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

3.1.4.2 Rewriting the matrices by dividing by all indexes

I think that there are few enough insertion events that this gets a bit messed up. I will double check the logic of this, but that is my initial guess given how few insertions I was seeing when reading the outputs manually. Unfortunately, this means that for these I also cannot provide a cpm measurement.

knitr::kable(triples[["matrices_by_counts"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_tenmpr[["matrices_by_counts"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_fivempr[["matrices_by_counts"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["matrices_by_counts"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_tenmpr[["matrices_by_counts"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_fivempr[["matrices_by_counts"]][["insert_indexes_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_tenmpr[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(triples_fivempr[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_tenmpr[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

knitr::kable(quints_fivempr[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
## Warning in kable_pipe(x = structure(character(0), .Dim = c(0L, 0L), .Dimnames =
## list(: The table should have a header (column names)

|| || || ||

The following is my previous writing of this worksheet which just dumped the various tables.

---
title: "Counting RT mutations from illumina sequencing data."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: tango
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
 font-size: 16px
}
</style>

```{r options, include=FALSE}
library("hpgltools")
tt <- devtools::load_all("/data/hpgltools")
knitr::opts_knit$set(width=120,
                     progress=TRUE,
                     verbose=TRUE,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
rundate <- format(Sys.Date(), format="%Y%m%d")
previous_file <- "index.Rmd"
ver <- "20200314"

##tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "error_quant_new.Rmd"
```

# Calculating error rates.

I wrote the function 'create_matrices()' to collect mutation counts.  At least
in theory the results from it should be able to address most/any question
regarding the counts of mutations observed in the data.

## Categorize the data with at least 3 indexes per mutant

```{r triples}
devtools::load_all("Rerrrt")
sample_sheet <- "sample_sheets/new_samples.xlsx"
ident_column <- "identtable"
mut_column <- "mutationtable"
min_reads <- 3
min_indexes <- 3
min_sequencer <- 6
min_position <- 5
max_position <- 200
max_mutations_per_read <- NULL
prune_n <- TRUE
verbose <- TRUE
excel <- glue::glue("excel/{rundate}_new_triples-v{ver}.xlsx")
triples <- create_matrices(sample_sheet=sample_sheet,
                           ident_column=ident_column, mut_column=mut_column,
                           min_reads=min_reads, min_indexes=min_indexes,
                           min_sequencer=min_sequencer,
                           min_position=min_position, max_position=max_position,
                           prune_n=prune_n, verbose=verbose, excel=excel)

max_mutations_per_read <- 10
excel <- glue::glue("excel/{rundate}_triples_tenmpr-v{ver}.xlsx")
triples_tenmpr <- create_matrices(sample_sheet=sample_sheet,
                                  ident_column=ident_column, mut_column=mut_column,
                                  min_reads=min_reads, min_indexes=min_indexes,
                                  min_sequencer=min_sequencer,
                                  min_position=min_position, max_position=max_position,
                                  prune_n=prune_n, verbose=verbose, excel=excel)
max_mutations_per_read <- 5
excel <- glue::glue("excel/{rundate}_triples_fivempr-v{ver}.xlsx")
triples_fivempr <- create_matrices(sample_sheet=sample_sheet,
                                   ident_column=ident_column, mut_column=mut_column,
                                   min_reads=min_reads, min_indexes=min_indexes,
                                   min_sequencer=min_sequencer,
                                   min_position=min_position, max_position=max_position,
                                   prune_n=prune_n, verbose=verbose, excel=excel)
```

## Categorize the data with at least 5 indexes per mutant

```{r quints}
min_indexes <- 5
max_mutations_per_read <- NULL
excel <- glue::glue("excel/{rundate}_quints-v{ver}.xlsx")
quints <- create_matrices(sample_sheet=sample_sheet,
                          ident_column=ident_column, mut_column=mut_column,
                          min_reads=min_reads, min_indexes=min_indexes,
                          min_sequencer=min_sequencer,
                          min_position=min_position, max_position=max_position,
                          prune_n=prune_n, verbose=verbose, excel=excel)
max_mutations_per_read <- 10
excel <- glue::glue("excel/{rundate}_quints_tenmpr-v{ver}.xlsx")
quints_tenmpr <- create_matrices(sample_sheet=sample_sheet,
                                 ident_column=ident_column, mut_column=mut_column,
                                 min_reads=min_reads, min_indexes=min_indexes,
                                 min_sequencer=min_sequencer,
                                 min_position=min_position, max_position=max_position,
                                 prune_n=prune_n, verbose=verbose, excel=excel)
max_mutations_per_read <- 5
excel <- glue::glue("excel/{rundate}_quints_fivempr-v{ver}.xlsx")
quints_fivempr <- create_matrices(sample_sheet=sample_sheet,
                                  ident_column=ident_column, mut_column=mut_column,
                                  min_reads=min_reads, min_indexes=min_indexes,
                                  min_sequencer=min_sequencer,
                                  min_position=min_position, max_position=max_position,
                                  prune_n=prune_n, verbose=verbose, excel=excel)
```

# Questions from Dr. DeStefano

I think what is best is to get the number of recovered mutations of each type
from each data set.  That would be A to T, A to G, A to C; T to A, T to G, T to
C; G to A, G to C, G to T; and C to A, C to G, C to T; as well as deletions and
insertions.  I would then need the sum number of the reads that met all our
criteria (i.e. at least 3 good recovered reads for that 14 nt index).  Each set
of 3 or more would ct as "1" read of that particular index so I would need the
total with this in mind.  I also need to know the total number of nucleotides
that were in the region we decided to consider in the analysis.  We may want to
try this for 3 or more and 5 or more recovered indexes if it is not hard.  This
information does not include specific positions on the template where errors
occurred but we can look at that latter.  Right now I just want to get a general
error rate and type of error.  It would basically be calculated by dividing the
number of recovered mutations of a particular type by sum number of the reads
times the number of nucleotides screened in the template.  As it ends up, this
number does not really have a lot of meaning but it can be used to calculate the
overall mutation rate as well as the rate for transversions, transitions, and
deletions and insertions.

# Answers

In order to address those queries, I invoked create_matrices() with a minimum
index count of 3 and 5.  It should be noted that this is not the same as
requiring 3 or 5 reads per index.  In both cases I require 3 reads per index.

## Recovered mutations of each type

I am interpreting this question as the number of indexes recovered for each
mutation type.  I collect this information in 2 ways of interest: the indexes by
type which are deemed to be from the RT and from the sequencer.  In addition, I
calculate a normalized (cpm) version of this information which may be used to look for
changes across samples.

### Mutations by RT index

This following block should print out tables of the numbers of mutant indexes
observed for each type for the RT and the sequencer.  One would hope that the
sequencer will be consistent for all samples, but I think the results will
instead suggest that my metric is not yet stringent enough.

```{r mutation_index_count, results='asis'}
knitr::kable(triples[["matrices"]][["miss_indexes_by_type"]])
knitr::kable(triples_tenmpr[["matrices"]][["miss_indexes_by_type"]])
knitr::kable(triples_fivempr[["matrices"]][["miss_indexes_by_type"]])
knitr::kable(quints[["matrices"]][["miss_indexes_by_type"]])
knitr::kable(quints_tenmpr[["matrices"]][["miss_indexes_by_type"]])
knitr::kable(quints_fivempr[["matrices"]][["miss_indexes_by_type"]])

knitr::kable(triples[["matrices"]][["miss_sequencer_by_type"]])
knitr::kable(triples_tenmpr[["matrices"]][["miss_sequencer_by_type"]])
knitr::kable(triples_fivempr[["matrices"]][["miss_sequencer_by_type"]])
knitr::kable(quints[["matrices"]][["miss_sequencer_by_type"]])
knitr::kable(quints_tenmpr[["matrices"]][["miss_sequencer_by_type"]])
knitr::kable(quints_fivempr[["matrices"]][["miss_sequencer_by_type"]])
```

Plots of this information

```{r mutation_index_count_plots}
triples[["plots"]][["counts"]][["miss_indexes_by_type"]]
triples_tenmpr[["plots"]][["counts"]][["miss_indexes_by_type"]]
triples_fivempr[["plots"]][["counts"]][["miss_indexes_by_type"]]

quints[["plots"]][["counts"]][["miss_indexes_by_type"]]
quints_tenmpr[["plots"]][["counts"]][["miss_indexes_by_type"]]
quints_fivempr[["plots"]][["counts"]][["miss_indexes_by_type"]]
```

This suggests to me that this information needs to be normalized in some more
sensible fashion.  Thus the following:

### Mutations by RT index, post normalization

The same numbers may be expressed in the context of the number of indexes
observed / sample and/or as a cpm across samples.  Thus in the first instance
one can look at the apparent error rate for each sample, and in the second
instance one may look for relative changes in apparent error rate across
samples.

#### Rewriting the matrices as cpm to account for library sizes.

```{r mutation_index_normalized, results='asis'}
knitr::kable(triples[["normalized"]][["miss_indexes_by_type"]])
knitr::kable(triples_tenmpr[["normalized"]][["miss_indexes_by_type"]])
knitr::kable(triples_fivempr[["normalized"]][["miss_indexes_by_type"]])
knitr::kable(quints[["normalized"]][["miss_indexes_by_type"]])
knitr::kable(quints_tenmpr[["normalized"]][["miss_indexes_by_type"]])
knitr::kable(quints_fivempr[["normalized"]][["miss_indexes_by_type"]])

knitr::kable(triples[["normalized"]][["miss_sequencer_by_type"]])
knitr::kable(triples_tenmpr[["normalized"]][["miss_sequencer_by_type"]])
knitr::kable(triples_fivempr[["normalized"]][["miss_sequencer_by_type"]])
knitr::kable(quints[["normalized"]][["miss_sequencer_by_type"]])
knitr::kable(quints_tenmpr[["normalized"]][["miss_sequencer_by_type"]])
knitr::kable(quints_fivempr[["normalized"]][["miss_sequencer_by_type"]])
```

#### Rewriting the matrices by dividing by all indexes

This I think starts to address the later text in your query.

```{r mutation_index_normalized_by_counts, results='asis'}
knitr::kable(triples[["matrices_by_counts"]][["miss_indexes_by_type"]])
knitr::kable(quints[["matrices_by_counts"]][["miss_indexes_by_type"]])

knitr::kable(triples[["matrices_by_counts"]][["miss_sequencer_by_type"]])
knitr::kable(quints[["matrices_by_counts"]][["miss_sequencer_by_type"]])
```

#### Rewriting the matrices by dividing by all indexes and cpm

I think this might prove to be where we get the most meaningful results.

The nicest thing in it is that after accounting for library sizes and total
indexes observed, we finally see that the sequencer error is mostly consistent
across all samples and mutation types -- with a couple of notable exceptions.

By the same token, for the mutations which _are_ identical for the sequencer, we
have some which are decidedly different for the non-sequencer data.  The most
notable examples I think are A to G but _not G to A; and C to T.

```{r mutation_index_cpm_by_counts, results='asis'}
knitr::kable(triples[["normalized_by_counts"]][["miss_indexes_by_type"]])
knitr::kable(triples_tenmpr[["normalized_by_counts"]][["miss_indexes_by_type"]])
knitr::kable(triples_fivempr[["normalized_by_counts"]][["miss_indexes_by_type"]])
knitr::kable(quints[["normalized_by_counts"]][["miss_indexes_by_type"]])
knitr::kable(quints_tenmpr[["normalized_by_counts"]][["miss_indexes_by_type"]])
knitr::kable(quints_fivempr[["normalized_by_counts"]][["miss_indexes_by_type"]])

knitr::kable(triples[["normalized_by_counts"]][["miss_sequencer_by_type"]])
knitr::kable(triples_tenmpr[["normalized_by_counts"]][["miss_sequencer_by_type"]])
knitr::kable(triples_fivempr[["normalized_by_counts"]][["miss_sequencer_by_type"]])
knitr::kable(quints[["normalized_by_counts"]][["miss_sequencer_by_type"]])
knitr::kable(quints_tenmpr[["normalized_by_counts"]][["miss_sequencer_by_type"]])
knitr::kable(quints_fivempr[["normalized_by_counts"]][["miss_sequencer_by_type"]])
```

### Indels by RT index

The following blocks will repeat the above, but looking for insertions.
This data does not observe sufficient deletions to make a proper count for them.

```{r insert_index_count, results='asis'}
knitr::kable(triples[["matrices"]][["insert_indexes_by_nt"]])
knitr::kable(triples_tenmpr[["matrices"]][["insert_indexes_by_nt"]])
knitr::kable(triples_fivempr[["matrices"]][["insert_indexes_by_nt"]])
knitr::kable(quints[["matrices"]][["insert_indexes_by_nt"]])
knitr::kable(quints_tenmpr[["matrices"]][["insert_indexes_by_nt"]])
knitr::kable(quints_fivempr[["matrices"]][["insert_indexes_by_nt"]])

knitr::kable(triples[["matrices"]][["insert_sequencer_by_nt"]])
knitr::kable(triples_tenmpr[["matrices"]][["insert_sequencer_by_nt"]])
knitr::kable(triples_fivempr[["matrices"]][["insert_sequencer_by_nt"]])
knitr::kable(quints[["matrices"]][["insert_sequencer_by_nt"]])
knitr::kable(quints_tenmpr[["matrices"]][["insert_sequencer_by_nt"]])
knitr::kable(quints_fivempr[["matrices"]][["insert_sequencer_by_nt"]])
```

Plots of this information

```{r insert_index_count_plots}
triples[["plots"]][["counts"]][["insert_indexes_by_nt"]]
triples_tenmpr[["plots"]][["counts"]][["insert_indexes_by_nt"]]
triples_fivempr[["plots"]][["counts"]][["insert_indexes_by_nt"]]

quints[["plots"]][["counts"]][["insert_indexes_by_nt"]]
quints_tenmpr[["plots"]][["counts"]][["insert_indexes_by_nt"]]
quints_fivempr[["plots"]][["counts"]][["insert_indexes_by_nt"]]
```

### Insertions by RT index, post normalization

#### Rewriting the matrices as cpm to account for library sizes.

```{r insert_index_normalized, results='asis'}
knitr::kable(triples[["normalized"]][["insert_indexes_by_nt"]])
knitr::kable(triples_tenmpr[["normalized"]][["insert_indexes_by_nt"]])
knitr::kable(triples_fivempr[["normalized"]][["insert_indexes_by_nt"]])
knitr::kable(quints[["normalized"]][["insert_indexes_by_nt"]])
knitr::kable(quints_tenmpr[["normalized"]][["insert_indexes_by_nt"]])
knitr::kable(quints_fivempr[["normalized"]][["insert_indexes_by_nt"]])

knitr::kable(triples[["normalized"]][["insert_sequencer_by_nt"]])
knitr::kable(triples_tenmpr[["normalized"]][["insert_sequencer_by_nt"]])
knitr::kable(triples_fivempr[["normalized"]][["insert_sequencer_by_nt"]])
knitr::kable(quints[["normalized"]][["insert_sequencer_by_nt"]])
knitr::kable(quints_tenmpr[["normalized"]][["insert_sequencer_by_nt"]])
knitr::kable(quints_fivempr[["normalized"]][["insert_sequencer_by_nt"]])
```

#### Rewriting the matrices by dividing by all indexes

I think that there are few enough insertion events that this gets a bit messed
up.  I will double check the logic of this, but that is my initial guess given
how few insertions I was seeing when reading the outputs manually.
Unfortunately, this means that for these I also cannot provide a cpm measurement.

```{r insert_index_normalized_by_counts, results='asis'}
knitr::kable(triples[["matrices_by_counts"]][["insert_indexes_by_nt"]])
knitr::kable(triples_tenmpr[["matrices_by_counts"]][["insert_indexes_by_nt"]])
knitr::kable(triples_fivempr[["matrices_by_counts"]][["insert_indexes_by_nt"]])
knitr::kable(quints[["matrices_by_counts"]][["insert_indexes_by_nt"]])
knitr::kable(quints_tenmpr[["matrices_by_counts"]][["insert_indexes_by_nt"]])
knitr::kable(quints_fivempr[["matrices_by_counts"]][["insert_indexes_by_nt"]])

knitr::kable(triples[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
knitr::kable(triples_tenmpr[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
knitr::kable(triples_fivempr[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
knitr::kable(quints[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
knitr::kable(quints_tenmpr[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
knitr::kable(quints_fivempr[["matrices_by_counts"]][["insert_sequencer_by_nt"]])
```

The following is my previous writing of this worksheet which just dumped the
various tables.

# Print raw tables

```{r raw, results='asis'}
for (t in 1:length(triples[["matrices"]])) {
  table_name <- names(triples[["matrices"]])[t]
  message("Raw table: ", table_name, ".")
  print(knitr::kable(triples[["matrices"]][t]))
}
```

# Print raw plots

```{r raw_plots}
for (t in 1:length(triples[["plots"]][["matrices"]])) {
  message("Raw table: ", table_name, ".")
  print(triples[["plots"]][["matrices"]][t])
}
```

# Print normalized tables

```{r norm, results='asis'}
for (t in 1:length(triples[["matrices_counts"]])) {
  table_name <- names(triples[["matrices_counts"]])[t]
  message("Normalized table: ", table_name, ".")
  print(knitr::kable(triples[["matrices_counts"]][t]))
}
```

# Print normalized plots

```{r norm_plots}
for (t in 1:length(triples[["plots"]][["counts"]])) {
  message("Normalized table: ", table_name, ".")
  print(triples[["plots"]][["counts"]][t])
}
```

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
```


```{r loadme, eval=FALSE}
loadme(filename=this_save)
```
