1 Calculating error rates.

I wrote the function ‘create_matrices()’ to collect mutation counts. In principle, its results should be able to answer most questions about the counts of mutations observed in the data.
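As a rough illustration of the steps named in the log messages of the runs that follow, here is a minimal Python sketch. It is not the Rerrrt implementation, which is written in R. It covers three steps: dropping indexes with too few reads, classifying each remaining index, and keeping only mutations seen in a minimum number of indexes. The classification rule is an assumption inferred from the log wording: an index whose reads all carry the same mutation is attributed to reverse transcriptase, and an index with mixed reads is attributed to the sequencer. The function names, the mutation string format, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def classify_indexes(reads, min_reads=3):
    """Label each index as 'identical', 'rt', or 'sequencer'.

    `reads` is a list of (index, mutation) pairs; mutation is None for a
    read that is identical to the template. Indexes with fewer than
    `min_reads` reads are dropped. The labelling rule is an assumption,
    not taken from the Rerrrt source.
    """
    by_index = defaultdict(list)
    for index, mutation in reads:
        by_index[index].append(mutation)

    labels = {}
    for index, mutations in by_index.items():
        if len(mutations) < min_reads:
            continue  # reads/index pruning
        distinct = set(mutations)
        if distinct == {None}:
            labels[index] = ("identical", None)
        elif len(distinct) == 1:
            # Every read of this index carries the same mutation.
            labels[index] = ("rt", mutations[0])
        else:
            # Reads of one index disagree with each other.
            labels[index] = ("sequencer", None)
    return labels


def mutations_with_min_indexes(labels, min_indexes=3):
    """Count indexes per RT-attributed mutation and apply the threshold."""
    counts = Counter(mut for kind, mut in labels.values() if kind == "rt")
    return {mut: n for mut, n in counts.items() if n >= min_indexes}


# Hypothetical toy data; mutation strings use a made-up position_ref_hit form.
toy_reads = (
    [("idx1", "50_A_G")] * 3
    + [("idx2", "50_A_G")] * 4
    + [("idx3", "50_A_G")] * 3
    + [("idx4", None)] * 5
    + [("idx5", "90_C_T"), ("idx5", None), ("idx5", None)]
    + [("idx6", "120_G_A")] * 2  # dropped: fewer than 3 reads
    + [("idx7", "77_T_C")] * 3
)

labels = classify_indexes(toy_reads, min_reads=3)
kept = mutations_with_min_indexes(labels, min_indexes=3)
print(kept)
```

Raising `min_indexes` from 3 to 5 in this sketch corresponds to the difference between the two categorization runs below: the per-index filtering is unchanged and only the final subsetting becomes stricter.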

1.1 Categorize the data with at least 3 indexes per mutant

## Loading Rerrrt
## Loading required package: dplyr
## Loading required package: tidyr
## Dropped 3 rows from the sample metadata because they were blank.
## Starting sample: s4.
##   Reading the file containing mutations: preprocessing/s4/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/s4/step2_identical_reads.txt.xz
## Warning: 1 parsing failure.
## row    col expected   actual                                            file
##   1 readid a number read_num 'preprocessing/s4/step2_identical_reads.txt.xz'
##   Counting indexes before filtering.
##     Mutation data: removing any differences before position: 24.
##     Mutation data: before pruning, there are: 2239016 reads.
##     Mutation data: after min-position pruning, there are: 2183747 reads: 55269 lost or 2.47%.
##     Mutation data: removing any differences after position: 176.
##     Mutation data: before pruning, there are: 2183747 reads.
##     Mutation data: after max-position pruning, there are: 837996 reads: 1345751 lost or 61.63%.
##     Mutation data: removing any reads with 'N' as the hit.
##     Mutation data: after N pruning, there are: 832549 reads: 5447 lost or 0.65%.
##   Mutation data: all filters removed 1406467 reads, or 62.82%.
##     Gathering information about the number of reads per index.
##     Before reads/index pruning, there are: 973940 indexes in all the data.
##     After reads/index pruning, there are: 354890 indexes: 619050 lost or 63.56%.
##     All data: removing indexes with fewer than 3 reads/index.
##     All data: before reads/index pruning, there are: 832549 changed reads.
##     All data: before reads/index pruning, there are: 1975296 identical reads.
##     All data: after index pruning, there are: 454061 changed reads: 54.54%.
##     All data: after index pruning, there are: 1351672 identical reads: 68.43%.
##   Gathering identical, mutant, and sequencer reads/indexes.
##   Before classification, there are 1351672 identical reads.
##   Before classification, there are 454061 reads with mutations.
##   After classification, there are 779988 reads/indexes which are only identical.
##   After classification, there are 3025 reads/indexes which are strictly sequencer.
##   After classification, there are 38821 reads/indexes which are deemed from reverse transcriptase.
##   Counted by direction: 1634797 forward reads and 2107382 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various criteria.
## Starting sample: s5.
##   Reading the file containing mutations: preprocessing/s5/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/s5/step2_identical_reads.txt.xz
## Warning: 1 parsing failure.
## row    col expected   actual                                            file
##   1 readid a number read_num 'preprocessing/s5/step2_identical_reads.txt.xz'
##   Counting indexes before filtering.
##     Mutation data: removing any differences before position: 24.
##     Mutation data: before pruning, there are: 2259996 reads.
##     Mutation data: after min-position pruning, there are: 2204009 reads: 55987 lost or 2.48%.
##     Mutation data: removing any differences after position: 176.
##     Mutation data: before pruning, there are: 2204009 reads.
##     Mutation data: after max-position pruning, there are: 825004 reads: 1379005 lost or 62.57%.
##     Mutation data: removing any reads with 'N' as the hit.
##     Mutation data: after N pruning, there are: 817693 reads: 7311 lost or 0.89%.
##   Mutation data: all filters removed 1442303 reads, or 63.82%.
##     Gathering information about the number of reads per index.
##     Before reads/index pruning, there are: 965529 indexes in all the data.
##     After reads/index pruning, there are: 316034 indexes: 649495 lost or 67.27%.
##     All data: removing indexes with fewer than 3 reads/index.
##     All data: before reads/index pruning, there are: 817693 changed reads.
##     All data: before reads/index pruning, there are: 1764804 identical reads.
##     All data: after index pruning, there are: 420259 changed reads: 51.40%.
##     All data: after index pruning, there are: 1113889 identical reads: 63.12%.
##   Gathering identical, mutant, and sequencer reads/indexes.
##   Before classification, there are 1113889 identical reads.
##   Before classification, there are 420259 reads with mutations.
##   After classification, there are 650056 reads/indexes which are only identical.
##   After classification, there are 1554 reads/indexes which are strictly sequencer.
##   After classification, there are 63275 reads/indexes which are deemed from reverse transcriptase.
##   Counted by direction: 1429423 forward reads and 1637306 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various criteria.
## Starting sample: s6.
##   Reading the file containing mutations: preprocessing/s6/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/s6/step2_identical_reads.txt.xz
## Warning: 1 parsing failure.
## row    col expected   actual                                            file
##   1 readid a number read_num 'preprocessing/s6/step2_identical_reads.txt.xz'
##   Counting indexes before filtering.
##     Mutation data: removing any differences before position: 24.
##     Mutation data: before pruning, there are: 2012128 reads.
##     Mutation data: after min-position pruning, there are: 1966909 reads: 45219 lost or 2.25%.
##     Mutation data: removing any differences after position: 176.
##     Mutation data: before pruning, there are: 1966909 reads.
##     Mutation data: after max-position pruning, there are: 747617 reads: 1219292 lost or 61.99%.
##     Mutation data: removing any reads with 'N' as the hit.
##     Mutation data: after N pruning, there are: 734242 reads: 13375 lost or 1.79%.
##   Mutation data: all filters removed 1277886 reads, or 63.51%.
##     Gathering information about the number of reads per index.
##     Before reads/index pruning, there are: 844764 indexes in all the data.
##     After reads/index pruning, there are: 206713 indexes: 638051 lost or 75.53%.
##     All data: removing indexes with fewer than 3 reads/index.
##     All data: before reads/index pruning, there are: 734242 changed reads.
##     All data: before reads/index pruning, there are: 1254041 identical reads.
##     All data: after index pruning, there are: 316103 changed reads: 43.05%.
##     All data: after index pruning, there are: 627064 identical reads: 50.00%.
##   Gathering identical, mutant, and sequencer reads/indexes.
##   Before classification, there are 627064 identical reads.
##   Before classification, there are 316103 reads with mutations.
##   After classification, there are 375136 reads/indexes which are only identical.
##   After classification, there are 270 reads/indexes which are strictly sequencer.
##   After classification, there are 71568 reads/indexes which are deemed from reverse transcriptase.
##   Counted by direction: 727996 forward reads and 950480 reverse_reads.
## Subsetting based on mutations with at least 3 indexes.
## Classified mutation strings according to various criteria.
## Plotting index densities.
## Warning in max(nchar(as.character(matrix_melted[["category"]]))): no non-missing
## arguments to max; returning -Inf
## Warning in dir.create(excel_dir, recursive = TRUE): cannot create dir 'excel',
## reason 'Transport endpoint is not connected'
##   Writing a legend.
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/pre_chng_density.pdf'
## Warning: Removed 12000 rows containing non-finite values (stat_density).
## Warning: Removed 12000 rows containing non-finite values (stat_density).
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/pre_ident_density.pdf'
## Warning: Removed 11672 rows containing non-finite values (stat_density).
## Warning: Removed 11672 rows containing non-finite values (stat_density).
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/pre_all_index_density.pdf'
## Warning: Removed 30533 rows containing non-finite values (stat_density).
## Warning: Removed 30533 rows containing non-finite values (stat_density).
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/post_chng_density.pdf'
## Warning: Removed 4037 rows containing non-finite values (stat_density).
## Warning: Removed 4037 rows containing non-finite values (stat_density).
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/post_ident_density.pdf'
## Warning: Removed 1431 rows containing non-finite values (stat_density).
## Warning: Removed 1431 rows containing non-finite values (stat_density).
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/post_all_index_density.pdf'
## Warning: Removed 3916 rows containing non-finite values (stat_density).
## Warning: Removed 3916 rows containing non-finite values (stat_density).
##   Writing raw data.
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_indexes_by_refnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_sequencer_by_refnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_indexes_by_hitnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_sequencer_by_hitnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_indexes_by_type.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_sequencer_by_type.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_indexes_by_trans.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_sequencer_by_trans.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_indexes_by_strength.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_sequencer_by_strength.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_sequencer_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_insert_indexes_by_nt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_sequencer_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_insert_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_indexes_by_string.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_miss_sequencer_by_string.pdf'
##   Writing cpm data.
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_indexes_by_refnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_sequencer_by_refnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_indexes_by_hitnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_sequencer_by_hitnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_indexes_by_type.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_sequencer_by_type.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_indexes_by_trans.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_sequencer_by_trans.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_indexes_by_strength.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_sequencer_by_strength.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_sequencer_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_insert_indexes_by_nt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_sequencer_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_insert_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_indexes_by_string.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_cpm_miss_sequencer_by_string.pdf'
##   Writing data normalized by reads/indexes.
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_indexes_by_refnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_sequencer_by_refnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_indexes_by_hitnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_sequencer_by_hitnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_indexes_by_type.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_sequencer_by_type.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_indexes_by_trans.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_sequencer_by_trans.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_indexes_by_strength.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_sequencer_by_strength.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_sequencer_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_insert_indexes_by_nt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_sequencer_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_insert_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_indexes_by_string.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_counts_miss_sequencer_by_string.pdf'
##   Writing data normalized by reads/indexes and length.
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_indexes_by_refnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_sequencer_by_refnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_indexes_by_hitnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_sequencer_by_hitnt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_indexes_by_type.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_sequencer_by_type.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_indexes_by_trans.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_sequencer_by_trans.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_indexes_by_strength.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_sequencer_by_strength.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_sequencer_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_insert_indexes_by_nt.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_sequencer_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_insert_indexes_by_position.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_indexes_by_string.pdf'
## Warning in dir.create(savedir, recursive = TRUE): cannot create dir
## 'saved_plots', reason 'Transport endpoint is not connected'
## Error in pdf(file = high_quality) : 
##   cannot open file 'saved_plots/matrices_countslength_miss_sequencer_by_string.pdf'
## Error in setwd(wd): character argument expected
## Error in read.xlsx.default(xlsxFile = file, sheet = 1) : 
##   File does not exist.
## Error in read_metadata(meta_file, ...): Unable to read the metadata file: sample_sheets/all_samples.xlsx
## Error in read.xlsx.default(xlsxFile = file, sheet = 1) : 
##   File does not exist.
## Error in read_metadata(meta_file, ...): Unable to read the metadata file: sample_sheets/all_samples.xlsx
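The pruning figures logged above for sample s4 can be checked with a few lines of arithmetic. The following Python sketch recomputes the "lost" counts and percentages from the before/after read counts copied from the log; the helper name `pruning_summary` is hypothetical and this is only a sanity check, not part of the pipeline.

```python
def pruning_summary(before, after):
    """Return (reads lost, percent lost rounded to 2 places)."""
    lost = before - after
    return lost, round(100.0 * lost / before, 2)


# (step name, reads before, reads after), copied from the s4 log.
s4_steps = [
    ("min-position pruning", 2239016, 2183747),
    ("max-position pruning", 2183747, 837996),
    ("N pruning", 837996, 832549),
]

for name, before, after in s4_steps:
    lost, pct = pruning_summary(before, after)
    print(f"{name}: {after} reads remain, {lost} lost or {pct}%")

# All filters together: first 'before' count against the last 'after' count.
total_lost, total_pct = pruning_summary(2239016, 832549)
print(f"all filters: {total_lost} lost or {total_pct}%")

# Fraction of changed reads retained by the reads/index pruning.
retained_pct = round(100.0 * 454061 / 832549, 2)
print(f"changed reads retained after index pruning: {retained_pct}%")
```

The recomputed values agree with the log. Note that the "after index pruning" lines report the fraction of reads retained (454061 of 832549 changed reads), whereas the position and N pruning lines report the fraction lost.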

1.2 Categorize the data with at least 5 indexes per mutant

## Dropped 3 rows from the sample metadata because they were blank.
## Starting sample: s4.
##   Reading the file containing mutations: preprocessing/s4/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/s4/step2_identical_reads.txt.xz
## Warning: 1 parsing failure.
## row    col expected   actual                                            file
##   1 readid a number read_num 'preprocessing/s4/step2_identical_reads.txt.xz'
##   Counting indexes before filtering.
##     Mutation data: removing any differences before position: 24.
##     Mutation data: before pruning, there are: 2239016 reads.
##     Mutation data: after min-position pruning, there are: 2183747 reads: 55269 lost or 2.47%.
##     Mutation data: removing any differences after position: 176.
##     Mutation data: before pruning, there are: 2183747 reads.
##     Mutation data: after max-position pruning, there are: 837996 reads: 1345751 lost or 61.63%.
##     Mutation data: removing any reads with 'N' as the hit.
##     Mutation data: after N pruning, there are: 832549 reads: 5447 lost or 0.65%.
##   Mutation data: all filters removed 1406467 reads, or 62.82%.
##     Gathering information about the number of reads per index.
##     Before reads/index pruning, there are: 973940 indexes in all the data.
##     After reads/index pruning, there are: 354890 indexes: 619050 lost or 63.56%.
##     All data: removing indexes with fewer than 3 reads/index.
##     All data: before reads/index pruning, there are: 832549 changed reads.
##     All data: before reads/index pruning, there are: 1975296 identical reads.
##     All data: after index pruning, there are: 454061 changed reads: 54.54%.
##     All data: after index pruning, there are: 1351672 identical reads: 68.43%.
##   Gathering identical, mutant, and sequencer reads/indexes.
##   Before classification, there are 1351672 identical reads.
##   Before classification, there are 454061 reads with mutations.
##   After classification, there are 779988 reads/indexes which are only identical.
##   After classification, there are 3025 reads/indexes which are strictly sequencer.
##   After classification, there are 38821 reads/indexes which are deemed from reverse transcriptase.
##   Counted by direction: 1634797 forward reads and 2107382 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various criteria.
## Starting sample: s5.
##   Reading the file containing mutations: preprocessing/s5/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/s5/step2_identical_reads.txt.xz
## Warning: 1 parsing failure.
## row    col expected   actual                                            file
##   1 readid a number read_num 'preprocessing/s5/step2_identical_reads.txt.xz'
##   Counting indexes before filtering.
##     Mutation data: removing any differences before position: 24.
##     Mutation data: before pruning, there are: 2259996 reads.
##     Mutation data: after min-position pruning, there are: 2204009 reads: 55987 lost or 2.48%.
##     Mutation data: removing any differences after position: 176.
##     Mutation data: before pruning, there are: 2204009 reads.
##     Mutation data: after max-position pruning, there are: 825004 reads: 1379005 lost or 62.57%.
##     Mutation data: removing any reads with 'N' as the hit.
##     Mutation data: after N pruning, there are: 817693 reads: 7311 lost or 0.89%.
##   Mutation data: all filters removed 1442303 reads, or 63.82%.
##     Gathering information about the number of reads per index.
##     Before reads/index pruning, there are: 965529 indexes in all the data.
##     After reads/index pruning, there are: 316034 indexes: 649495 lost or 67.27%.
##     All data: removing indexes with fewer than 3 reads/index.
##     All data: before reads/index pruning, there are: 817693 changed reads.
##     All data: before reads/index pruning, there are: 1764804 identical reads.
##     All data: after index pruning, there are: 420259 changed reads: 51.40%.
##     All data: after index pruning, there are: 1113889 identical reads: 63.12%.
##   Gathering identical, mutant, and sequencer reads/indexes.
##   Before classification, there are 1113889 identical reads.
##   Before classification, there are 420259 reads with mutations.
##   After classification, there are 650056 reads/indexes which are only identical.
##   After classification, there are 1554 reads/indexes which are strictly sequencer.
##   After classification, there are 63275 reads/indexes which are deemed from reverse transcriptase.
##   Counted by direction: 1429423 forward reads and 1637306 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various criteria.
## Starting sample: s6.
##   Reading the file containing mutations: preprocessing/s6/step4.txt.xz
##   Reading the file containing the identical reads: preprocessing/s6/step2_identical_reads.txt.xz
## Warning: 1 parsing failure.
## row    col expected   actual                                            file
##   1 readid a number read_num 'preprocessing/s6/step2_identical_reads.txt.xz'
##   Counting indexes before filtering.
##     Mutation data: removing any differences before position: 24.
##     Mutation data: before pruning, there are: 2012128 reads.
##     Mutation data: after min-position pruning, there are: 1966909 reads: 45219 lost or 2.25%.
##     Mutation data: removing any differences after position: 176.
##     Mutation data: before pruning, there are: 1966909 reads.
##     Mutation data: after max-position pruning, there are: 747617 reads: 1219292 lost or 61.99%.
##     Mutation data: removing any reads with 'N' as the hit.
##     Mutation data: after N pruning, there are: 734242 reads: 13375 lost or 1.79%.
##   Mutation data: all filters removed 1277886 reads, or 63.51%.
##     Gathering information about the number of reads per index.
##     Before reads/index pruning, there are: 844764 indexes in all the data.
##     After reads/index pruning, there are: 206713 indexes: 638051 lost or 75.53%.
##     All data: removing indexes with fewer than 3 reads/index.
##     All data: before reads/index pruning, there are: 734242 changed reads.
##     All data: before reads/index pruning, there are: 1254041 identical reads.
##     All data: after index pruning, there are: 316103 changed reads: 43.05%.
##     All data: after index pruning, there are: 627064 identical reads: 50.00%.
##   Gathering identical, mutant, and sequencer reads/indexes.
##   Before classification, there are 627064 identical reads.
##   Before classification, there are 316103 reads with mutations.
##   After classification, there are 375136 reads/indexes which are only identical.
##   After classification, there are 270 reads/indexes which are strictly sequencer.
##   After classification, there are 71568 reads/indexes which are deemed from reverse transcriptase.
##   Counted by direction: 727996 forward reads and 950480 reverse_reads.
## Subsetting based on mutations with at least 5 indexes.
## Classified mutation strings according to various criteria.
## Plotting index densities.
## Warning in max(nchar(as.character(matrix_melted[["category"]]))): no non-missing
## arguments to max; returning -Inf
## Warning in max(nchar(as.character(matrix_melted[["category"]]))): no non-missing
## arguments to max; returning -Inf

## Warning in max(nchar(as.character(matrix_melted[["category"]]))): no non-missing
## arguments to max; returning -Inf

## Warning in max(nchar(as.character(matrix_melted[["category"]]))): no non-missing
## arguments to max; returning -Inf

## Warning in max(nchar(as.character(matrix_melted[["category"]]))): no non-missing
## arguments to max; returning -Inf

## Warning in max(nchar(as.character(matrix_melted[["category"]]))): no non-missing
## arguments to max; returning -Inf

## Warning in max(nchar(as.character(matrix_melted[["category"]]))): no non-missing
## arguments to max; returning -Inf

## Warning in max(nchar(as.character(matrix_melted[["category"]]))): no non-missing
## arguments to max; returning -Inf
##   Writing a legend.
## Warning: Removed 12000 rows containing non-finite values (stat_density).
## Warning: Removed 12000 rows containing non-finite values (stat_density).
## Warning: Removed 11672 rows containing non-finite values (stat_density).

## Warning: Removed 11672 rows containing non-finite values (stat_density).
## Warning: Removed 30533 rows containing non-finite values (stat_density).

## Warning: Removed 30533 rows containing non-finite values (stat_density).
## Warning: Removed 4037 rows containing non-finite values (stat_density).

## Warning: Removed 4037 rows containing non-finite values (stat_density).
## Warning: Removed 1431 rows containing non-finite values (stat_density).

## Warning: Removed 1431 rows containing non-finite values (stat_density).
## Warning: Removed 3916 rows containing non-finite values (stat_density).

## Warning: Removed 3916 rows containing non-finite values (stat_density).
##   Writing raw data.
##   Writing cpm data.
##   Writing data normalized by reads/indexes.
##   Writing data normalized by reads/indexes and length.

2 Questions from Dr. DeStefano

I think what is best is to get the number of recovered mutations of each type from each data set. That would be A to T, A to G, A to C; T to A, T to G, T to C; G to A, G to C, G to T; and C to A, C to G, C to T; as well as deletions and insertions. I would then need the sum number of the reads that met all our criteria (i.e. at least 3 good recovered reads for that 14 nt index). Each set of 3 or more would count as “1” read of that particular index, so I would need the total with this in mind. I also need to know the total number of nucleotides that were in the region we decided to consider in the analysis. We may want to try this for 3 or more and 5 or more recovered indexes if it is not hard. This information does not include specific positions on the template where errors occurred, but we can look at that later. Right now I just want to get a general error rate and type of error. It would basically be calculated by dividing the number of recovered mutations of a particular type by the sum number of the reads times the number of nucleotides screened in the template. As it ends up, this number does not really have a lot of meaning, but it can be used to calculate the overall mutation rate as well as the rates for transversions, transitions, deletions, and insertions.
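The calculation described above can be sketched in R as follows. This is a minimal illustration, not the code inside create_matrices(): the helper names error_rate() and mutation_class() are hypothetical, the counts are the s4 column of the RT table printed in section 3.1.1, the index total is taken from the s4 log line "After reads/index pruning", and the region width assumes positions 24 through 176 inclusive. Whether those are the right denominators is an assumption. Insertions and deletions are left out because the printed tables only contain substitutions.

```r
# Hypothetical helper: mutations per index per nucleotide screened.
error_rate <- function(mutations, indexes, nucleotides) {
  mutations / (indexes * nucleotides)
}

# Hypothetical helper: label "X_Y" substitution types as transition or transversion.
mutation_class <- function(types) {
  transitions <- c("A_G", "G_A", "C_T", "T_C")
  ifelse(types %in% transitions, "transition", "transversion")
}

# s4 counts by type, copied from the RT table in section 3.1.1.
s4 <- c(A_C = 1148, A_G = 689, A_T = 1597, C_A = 21676, C_G = 1545, C_T = 1565,
        G_A = 1440, G_C = 635, G_T = 5632, T_A = 930, T_C = 382, T_G = 1381)

indexes <- 354890            # assumption: s4 indexes left after reads/index pruning
nucleotides <- 176 - 24 + 1  # assumption: positions 24..176, inclusive

# Rate for each substitution type.
per_type <- error_rate(s4, indexes, nucleotides)

# Sum the counts by class, then compute class-level and overall rates.
per_class <- tapply(s4, mutation_class(names(s4)), sum)
class_rate <- error_rate(per_class, indexes, nucleotides)
overall <- error_rate(sum(s4), indexes, nucleotides)
```

Because every rate shares the same denominator, the transition and transversion rates add up to the overall substitution rate.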

3 Answers

To address those queries, I invoked create_matrices() with a minimum index count of 3 and of 5. Note that this is not the same as requiring 3 or 5 reads per index: in both cases I require at least 3 reads per index, and the minimum index count is the number of indexes which must support a given mutation.
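The difference between the two thresholds can be illustrated with a toy example. This is only a sketch in base R: the data frame, its column names, and the variables min_reads and min_indexes are hypothetical and are not arguments of create_matrices().

```r
# Toy data: one row per read, with its index and the mutation it carries.
reads <- data.frame(
  index = c("i1", "i1", "i1", "i2", "i2", "i3", "i3", "i3", "i4", "i4", "i4"),
  mutation = c("A_G", "A_G", "A_G", "A_G", "A_G", "A_G", "A_G", "A_G",
               "C_T", "C_T", "C_T"),
  stringsAsFactors = FALSE
)

min_reads <- 3    # reads required per index (3 in both runs of the report)
min_indexes <- 2  # indexes required per mutation (3 or 5 in the report; 2 for this toy data)

# Step 1: keep only indexes supported by at least min_reads reads.
reads_per_index <- table(reads$index)
kept_indexes <- names(reads_per_index)[as.vector(reads_per_index) >= min_reads]

# Collapse the surviving reads so that each index counts once per mutation.
kept <- unique(reads[reads$index %in% kept_indexes, c("index", "mutation")])

# Step 2: keep only mutations observed in at least min_indexes surviving indexes.
indexes_per_mutation <- table(kept$mutation)
kept_mutations <- names(indexes_per_mutation)[as.vector(indexes_per_mutation) >= min_indexes]
```

Here index i2 is dropped in the first step because it has only 2 reads, and C_T is dropped in the second step because only one surviving index carries it.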

3.1 Recovered mutations of each type

I am interpreting this question as the number of indexes recovered for each mutation type. I collect this information in two ways of interest: the indexes by type which are deemed to come from the RT, and those deemed to come from the sequencer. In addition, I calculate a normalized (cpm) version of this information, which may be used to look for changes across samples.

3.1.1 Mutations by RT index

The following block should print tables of the number of mutant indexes observed for each mutation type, for the RT and for the sequencer. One would hope that the sequencer errors are consistent across all samples, but I think the results will instead suggest that my metric is not yet stringent enough.

## Error in knitr::kable(triples[["matrices"]][["miss_indexes_by_type"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["matrices"]][["miss_indexes_by_type"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["matrices"]][["miss_indexes_by_type"]]): object 'triples_fivempr' not found
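| mutation | s4 | s5 | s6 |
|---|---|---|---|
| A_C | 1148 | 1390 | 1930 |
| A_G | 689 | 3474 | 8702 |
| A_T | 1597 | 2141 | 2774 |
| C_A | 21676 | 21121 | 18344 |
| C_G | 1545 | 8123 | 8239 |
| C_T | 1565 | 4569 | 10301 |
| G_A | 1440 | 11668 | 9513 |
| G_C | 635 | 584 | 497 |
| G_T | 5632 | 5387 | 4898 |
| T_A | 930 | 1790 | 2275 |
| T_C | 382 | 852 | 1305 |
| T_G | 1381 | 1952 | 2574 |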
## Error in knitr::kable(triples[["matrices"]][["miss_sequencer_by_type"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["matrices"]][["miss_sequencer_by_type"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["matrices"]][["miss_sequencer_by_type"]]): object 'triples_fivempr' not found
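| mutation | s4 | s5 | s6 |
|---|---|---|---|
| A_C | 177 | 120 | 0 |
| A_T | 233 | 128 | 15 |
| C_A | 1247 | 635 | 90 |
| C_G | 121 | 36 | 0 |
| C_T | 10 | 5 | 0 |
| G_A | 13 | 0 | 0 |
| G_C | 7 | 0 | 0 |
| G_T | 219 | 79 | 0 |
| T_A | 243 | 108 | 5 |
| T_C | 16 | 5 | 0 |
| T_G | 372 | 190 | 12 |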

Plots of this information

## Error in eval(expr, envir, enclos): object 'triples' not found
## Error in eval(expr, envir, enclos): object 'triples_tenmpr' not found
## Error in eval(expr, envir, enclos): object 'triples_fivempr' not found

This suggests to me that this information needs to be normalized in some more sensible fashion. Thus the following:

3.1.2 Mutations by RT index, post normalization

The same numbers may be expressed in the context of the number of indexes observed / sample and/or as a cpm across samples. Thus in the first instance one can look at the apparent error rate for each sample, and in the second instance one may look for relative changes in apparent error rate across samples.

3.1.2.1 Rewriting the matrices as cpm to account for library sizes.
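As a rough illustration of what a cpm rewrite looks like, the sketch below scales each sample column to counts per million. It uses the RT counts by type printed in section 3.1.1 and assumes the library size of a sample is simply the column total of that matrix; the library sizes actually used by create_matrices() may be defined differently.

```r
# RT mutant indexes by type, copied from the table in section 3.1.1.
counts <- matrix(
  c(1148,  1390,  1930,
    689,   3474,  8702,
    1597,  2141,  2774,
    21676, 21121, 18344,
    1545,  8123,  8239,
    1565,  4569,  10301,
    1440,  11668, 9513,
    635,   584,   497,
    5632,  5387,  4898,
    930,   1790,  2275,
    382,   852,   1305,
    1381,  1952,  2574),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("A_C", "A_G", "A_T", "C_A", "C_G", "C_T",
      "G_A", "G_C", "G_T", "T_A", "T_C", "T_G"),
    c("s4", "s5", "s6")
  )
)

# Assumption: the library size of each sample is its column total.
library_sizes <- colSums(counts)

# Divide every column by its library size and rescale to one million.
cpm <- sweep(counts, 2, library_sizes, "/") * 1e6
```

After this scaling every column sums to one million, so values can be compared across samples with different depths.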

## Error in knitr::kable(triples[["normalized"]][["miss_indexes_by_type"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["normalized"]][["miss_indexes_by_type"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["normalized"]][["miss_indexes_by_type"]]): object 'triples_fivempr' not found
## Warning in kable_markdown(x, padding = padding, ...): The table should have a
## header (column names)

|| || || ||

## Warning in kable_markdown(x, padding = padding, ...): The table should have a
## header (column names)

|| || || ||

## Warning in kable_markdown(x, padding = padding, ...): The table should have a
## header (column names)

|| || || ||

## Error in knitr::kable(triples[["normalized"]][["miss_sequencer_by_type"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["normalized"]][["miss_sequencer_by_type"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["normalized"]][["miss_sequencer_by_type"]]): object 'triples_fivempr' not found

3.1.2.2 Rewriting the matrices by dividing by all indexes

I think this starts to address the later text in your query.

## Error in knitr::kable(triples[["matrices_by_counts"]][["miss_indexes_by_type"]]): object 'triples' not found

## Error in knitr::kable(triples[["matrices_by_counts"]][["miss_sequencer_by_type"]]): object 'triples' not found

3.1.2.3 Rewriting the matrices by dividing by all indexes and cpm

I think this might prove to be where we get the most meaningful results.

The nicest result here is that, after accounting for library sizes and total indexes observed, we finally see that the sequencer error is mostly consistent across all samples and mutation types, with a couple of notable exceptions.

By the same token, some mutation types which look identical in the sequencer data are decidedly different in the non-sequencer data. The most notable examples, I think, are A to G (but not G to A) and C to T.
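One way to make this kind of directional asymmetry explicit is to compare each mutation type against its reverse. The following is only a sketch: the `rates` values are hypothetical numbers invented for illustration (not results from this data), and `reverse_type()` is an illustrative helper, not a function from Rerrrt.

```r
# HYPOTHETICAL normalized rates, named in the same from_to style as the tables.
rates <- c(A_G = 40, G_A = 5, C_T = 30, T_C = 4, A_C = 6, C_A = 7)

# Illustrative helper: turn "A_G" into "G_A".
reverse_type <- function(type) {
  parts <- strsplit(type, "_", fixed = TRUE)
  vapply(parts, function(p) paste(rev(p), collapse = "_"), character(1))
}

# log2 ratio of each type against its reverse; 0 means symmetric,
# large positive values mean the forward direction dominates.
log2_ratio <- log2(rates / rates[reverse_type(names(rates))])

print(round(log2_ratio, 2))
```

A type whose reverse is absent from the vector would come out as NA here, which is probably the safest behavior when one direction was never observed.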

## Error in knitr::kable(triples[["normalized_by_counts"]][["miss_indexes_by_type"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["normalized_by_counts"]][["miss_indexes_by_type"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["normalized_by_counts"]][["miss_indexes_by_type"]]): object 'triples_fivempr' not found

## Error in knitr::kable(triples[["normalized_by_counts"]][["miss_sequencer_by_type"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["normalized_by_counts"]][["miss_sequencer_by_type"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["normalized_by_counts"]][["miss_sequencer_by_type"]]): object 'triples_fivempr' not found

3.1.3 Indels by RT index

The following blocks repeat the above, but look at insertions. These data do not contain enough deletions to make a proper count of them.

## Error in knitr::kable(triples[["matrices"]][["insert_indexes_by_nt"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["matrices"]][["insert_indexes_by_nt"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["matrices"]][["insert_indexes_by_nt"]]): object 'triples_fivempr' not found
s4 s5 s6
A 0 34 82
C 0 12 12
G 0 0 8
T 0 81 35
## Error in knitr::kable(triples[["matrices"]][["insert_sequencer_by_nt"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["matrices"]][["insert_sequencer_by_nt"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["matrices"]][["insert_sequencer_by_nt"]]): object 'triples_fivempr' not found

Plots of this information

## Error in eval(expr, envir, enclos): object 'triples' not found
## Error in eval(expr, envir, enclos): object 'triples_tenmpr' not found
## Error in eval(expr, envir, enclos): object 'triples_fivempr' not found

3.1.4 Insertions by RT index, post normalization

3.1.4.1 Rewriting the matrices as cpm to account for library sizes.

## Error in knitr::kable(triples[["normalized"]][["insert_indexes_by_nt"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["normalized"]][["insert_indexes_by_nt"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["normalized"]][["insert_indexes_by_nt"]]): object 'triples_fivempr' not found

## Error in knitr::kable(triples[["normalized"]][["insert_sequencer_by_nt"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["normalized"]][["insert_sequencer_by_nt"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["normalized"]][["insert_sequencer_by_nt"]]): object 'triples_fivempr' not found

3.1.4.2 Rewriting the matrices by dividing by all indexes

I think there are few enough insertion events that this normalization breaks down. I will double-check the logic, but that is my initial guess given how few insertions I saw when reading the outputs manually. Unfortunately, this means that I also cannot provide a cpm measurement for these.
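To illustrate why so few events make the rates shaky, here is a rough sketch which puts exact Poisson confidence intervals around the s6 insertion counts from the table above, using `poisson.test()` from the stats package. The `s6_indexes` total is a hypothetical placeholder, and treating the counts as Poisson is my assumption for the sake of the example.

```r
# Insertion counts for sample s6, copied from the insertion table above.
s6_inserts <- c(A = 82, C = 12, G = 8, T = 35)

# HYPOTHETICAL number of indexes observed in s6.
s6_indexes <- 50000

# Rate and exact 95% Poisson interval for one count over a given total.
interval_for <- function(x, total) {
  test <- poisson.test(x, T = total)
  c(rate = x / total, lower = test$conf.int[1], upper = test$conf.int[2])
}

intervals <- t(sapply(s6_inserts, interval_for, total = s6_indexes))

# Interval width relative to the estimated rate: larger means less reliable.
rel_width <- (intervals[, "upper"] - intervals[, "lower"]) / intervals[, "rate"]

print(signif(intervals, 3))
print(round(rel_width, 2))
```

The relative width does not depend on the placeholder total, only on the count itself, so the comparison between nucleotides holds whatever the real number of indexes is.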

## Error in knitr::kable(triples[["matrices_by_counts"]][["insert_indexes_by_nt"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["matrices_by_counts"]][["insert_indexes_by_nt"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["matrices_by_counts"]][["insert_indexes_by_nt"]]): object 'triples_fivempr' not found

## Error in knitr::kable(triples[["matrices_by_counts"]][["insert_sequencer_by_nt"]]): object 'triples' not found
## Error in knitr::kable(triples_tenmpr[["matrices_by_counts"]][["insert_sequencer_by_nt"]]): object 'triples_tenmpr' not found
## Error in knitr::kable(triples_fivempr[["matrices_by_counts"]][["insert_sequencer_by_nt"]]): object 'triples_fivempr' not found

The following is my previous version of this worksheet, which just dumped the various tables.

