---
title: "TMRC2 ML Classification of strains using variants: 202301"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include = FALSE}
library(hpgltools)
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(progress = TRUE,
                     verbose = TRUE,
                     width = 90,
                     echo = TRUE)
knitr::opts_chunk$set(error = TRUE,
                      fig.width = 8,
                      fig.height = 8,
                      dpi = 96)
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "202301"
rundate <- format(Sys.Date(), format = "%Y%m%d")

## tmp <- try(sm(loadme(filename = gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = previous_file))))
rmd_file <- glue::glue("tmrc2_genome_classifier_{ver}.Rmd")
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
loaded <- load(file=glue::glue("rda/both_snps-v{ver}.rda"))
library(caret)
```

# Changelog

* 20230117: Trying to understand memory/time usage when I add back in
  variables.  E.g. my last iteration limited the model to a very small
  portion of the most-strain-determining variant positions in the
  data.  In order to predict attributes which are less obvious, I will
  need to be able to use a larger portion of the data.
* Same day, later: I manually reclassified some of the tricky samples
  using IGV and comparisons against my changing panel of 'canonical'
  strains.  In this iteration I reclassified a couple of my
  references, which I hope will clarify future runs.

# Creating a strain classifier using our data

I have not yet implemented a real ML classifier, so I will follow the
caret documentation to write a trainer and classifier for the TMRC2
strains.

# Splitting the data based on outcome

I am taking this pretty much directly from:

https://topepo.github.io/caret/data-splitting.html#simple-splitting-based-on-the-outcome

In order to perform the split by outcome, I think I need to slightly
reformat the structure of our variant data to more closely match the
example datasets (etitanic, iris, Sonar).

## Define the outcome factor

I am setting this to character here because I will be messing with the
factor levels and would rather not have to relevel multiple times.

Note that in the previous revision of this, I used the column
'zymodemecategorical', but now I am using knnv2classification.  If
things continue to change, I may make a v3 column.

```{r classify_snps}
ref_col <- "knnhclusttogethercall"
outcome_factor <- as.character(pData(both_snps)[[ref_col]])
```

## Figure out why old samples were dropped

In my previous attempt at this, I noticed that the older samples were
getting dropped by na.omit; I want to take a moment here to figure out
why.

```{r check_old_samples}
dim(exprs(both_snps))
both_norm <- normalize_expt(both_snps, transform = "log2", norm = "quant")
```

## Filter the data using near-zero variance

Stealing now from:

https://github.com/compgenomr/book/blob/master/05-supervisedLearning.Rmd

Multiple sections of the above document provide nice ways to remove
redundant predictor variables.  For my real use case I would like to
keep all of the predictors, because I hope to apply the same model to
a dataset which is vastly sparser than this one.  For the sake of
learning, though, I will apply each of these methods and use the
result on the rest of the TMRC2 data.

In hopes of using more of the data, I will remove the SD filter.

```{r nzv}
texprs <- t(exprs(both_norm))
## Given this large amount of data, this step is slow, taking > 10 minutes.
nzv <- preProcess(texprs, method = "nzv", uniqueCut = 15)
nzv_texprs <- predict(nzv, texprs)
dim(nzv_texprs)
```
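As a sanity check on what the "nzv" step is doing: caret flags a predictor
as near-zero variance when the ratio of its two most frequent values is
large and its percentage of unique values is small.  A minimal base-R
sketch of that logic (the toy matrix and the `near_zero_var` helper are
invented for illustration; caret's defaults are freqCut = 95/5 and
uniqueCut = 10, and both conditions must hold):

```r
## Toy matrix: v2 is nearly constant and should be flagged.
toy <- cbind(v1 = 1:100,
             v2 = c(rep(0, 99), 1),
             v3 = rep(c(1, 2), 50))

## Flag a predictor when the ratio of its two most frequent values is
## large AND its percentage of unique values is small.
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  unique_pct <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && unique_pct < unique_cut
}

flagged <- apply(toy, 2, near_zero_var)
kept <- toy[, !flagged, drop = FALSE]
colnames(kept)  # v2 is dropped, v1 and v3 survive
```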

## Filtering to the highest standard deviation variables

For the moment, I am excluding the following block in order to see how
much time/memory keeping these variables costs.  If I recall properly,
the model of the top-2k variant positions cost ~1-4 GB of memory.  I
hope that this scales linearly, but I suspect it might not.

```{r exclude_sds, eval=FALSE}
standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:2000]
nzv_texprs <- nzv_texprs[, top_predictors]
```
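For scale, a rough back-of-envelope: a dense double-precision matrix
costs 8 bytes per entry, so the full 134 x 97426 filtered matrix is only
on the order of 100 MB.  The multi-gigabyte cost I saw presumably comes
from copies made during model fitting, which would also scale with the
number of predictors:

```r
## Dense double-precision storage for the 134 x 97426 filtered matrix.
n_samples <- 134
n_predictors <- 97426
size_mb <- (n_samples * n_predictors * 8) / 2^20  # 8 bytes per double
round(size_mb)
```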

## Center the data

I think centering may not be needed for this data, but here is how:

```{r center}
nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)
```
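For reference, the "center" method is just per-column mean subtraction;
base R's scale() with scale = FALSE does the same thing as the
preProcess step above (toy matrix invented):

```r
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3)
centered <- scale(m, center = TRUE, scale = FALSE)
colMeans(centered)  # both column means are now 0
```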

## Drop correlated

In the same fashion, I want to leave this off because later
applications of this model will include low-coverage samples which may
not have every variant represented.

```{r correlated, eval=FALSE}
nzv_correlated <- preProcess(nzv_texprs, method = "corr", cutoff = 0.9)
nzv_texprs <- predict(nzv_correlated, nzv_texprs)
dim(nzv_texprs)
anyNA(nzv_texprs)
```
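Even though I am skipping it here, the idea behind the "corr" filter is
simple: compute the pairwise correlation matrix and drop one member of
each pair correlated above the cutoff.  The sketch below is a greedy
simplification of caret's findCorrelation(), which is smarter about
choosing which member of a pair to drop (it considers mean absolute
correlation); the toy data are invented:

```r
set.seed(1)
## v2 is an almost exact copy of v1 and should be dropped; v3 is independent.
v1 <- rnorm(50)
mat <- cbind(v1 = v1, v2 = v1 + rnorm(50, sd = 0.01), v3 = rnorm(50))

## Greedy filter: drop the later member of each pair correlated above cutoff.
drop_correlated <- function(x, cutoff = 0.9) {
  cors <- abs(cor(x))
  drop <- rep(FALSE, ncol(x))
  for (i in seq_len(ncol(x) - 1)) {
    if (drop[i]) next
    for (j in seq(i + 1, ncol(x))) {
      if (!drop[j] && cors[i, j] > cutoff) drop[j] <- TRUE
    }
  }
  x[, !drop, drop = FALSE]
}

kept <- drop_correlated(mat)
colnames(kept)  # v2 is dropped
```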

## Reformat the metadata to be used by caret

```{r reformat_data}
na_idx <- is.na(outcome_factor)
outcome_factor[na_idx] <- "unknown"
meta_factors <- pData(both_snps)[, c(ref_col, "clinicalcategorical", "sus_category_current")]
meta_factors[[ref_col]] <- outcome_factor
```

## Add the metadata to the filtered variables

In addition, I want to make sure that no unknown samples are used in training.

```{r merge_pieces}
snp_df <- cbind(meta_factors, nzv_texprs)
dim(snp_df)
snp_train_idx <- createDataPartition(outcome_factor, p = 0.4,
                                     list = FALSE,
                                     times = 1)
summary(outcome_factor[snp_train_idx])
snp_train_start <- rownames(snp_df)[snp_train_idx]
train_known_idx <- meta_factors[snp_train_start, ref_col] != "unknown"
train_rownames <- snp_train_start[train_known_idx]
not_train_idx <- ! rownames(snp_df) %in% train_rownames
not_train_rownames <- rownames(snp_df)[not_train_idx]

train_df <- snp_df[train_rownames, ]
train_df[["clinicalcategorical"]] <- NULL
train_df[["sus_category_current"]] <- NULL
train_df[[1]] <- as.factor(as.character(train_df[[1]]))
test_df <- snp_df[not_train_rownames, ]
test_df[[ref_col]] <- NULL
test_df[["clinicalcategorical"]] <- NULL
test_df[["sus_category_current"]] <- NULL
```
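The reason for createDataPartition() rather than a plain sample() is
that it samples within each outcome level, so the class proportions
survive the split.  A base-R sketch of that stratified behavior (the
labels and the helper are invented):

```r
set.seed(42)
labels <- c(rep("z22", 10), rep("z23", 10), rep("z21", 5))

## Stratified sampling: take a proportion p of the indices within each class.
stratified_partition <- function(y, p = 0.4) {
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    sort(sample(idx, ceiling(length(idx) * p)))
  }), use.names = FALSE)
}

train_idx <- stratified_partition(labels, p = 0.4)
table(labels[train_idx])  # 2 of 5 z21, 4 of 10 z22, 4 of 10 z23
```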

## KNN

Perform the fit and test the training set against itself.

```{r knn}
knn_fit <- knn3(x = train_df[, -1],
                y = train_df[[ref_col]],
                k = 5)
train_predict <- predict(knn_fit, train_df[, -1], type = "class")
## confusionMatrix() expects data = predictions and reference = known labels.
confusionMatrix(data = train_predict, reference = train_df[[1]])

annotations <- train_df[[1]]
names(annotations) <- rownames(train_df)
names(train_predict) <- rownames(train_df)
summary(annotations == train_predict)
```
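The core of the confusionMatrix() output boils down to a
cross-tabulation: base R's table() reproduces the matrix and the
diagonal gives the agreement (toy truth/prediction vectors invented):

```r
truth <- factor(c("z22", "z22", "z23", "z23", "z21", "z10"))
predicted <- factor(c("z22", "z23", "z23", "z23", "z21", "z10"),
                    levels = levels(truth))

## Rows = predictions, columns = reference, matching confusionMatrix()'s layout.
cmat <- table(Prediction = predicted, Reference = truth)
accuracy <- sum(diag(cmat)) / sum(cmat)
accuracy  # 5 of 6 agree
```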

It looks like my knn model has some disagreements in the already
difficult cases of poorly represented strains, so I will call it a
win.  E.g. it appears to fail on a z1.0, b2904, z3.0, z2.0, a 2.1, and
a 2.4.

Now let's try it out on the rest of the data.

```{r knn_test}
test_predict <- predict(knn_fit, test_df, type = "class")
names(test_predict) <- rownames(test_df)
test_predict

ref_result <- pData(both_snps)[[ref_col]]
names(ref_result) <- rownames(pData(both_snps))

shared <- names(ref_result) %in% names(test_predict)
compare <- ref_result[shared]
compare <- compare[!is.na(compare)]
test_predict[compare != test_predict[names(compare)]]
```

# Look for genomic susceptibility markers

```{r reformat_data_sus}
ref_col <- "sus_category_current"
outcome_factor <- as.character(pData(both_snps)[[ref_col]])
na_idx <- is.na(outcome_factor)
outcome_factor[na_idx] <- "unknown"
meta_factors <- pData(both_snps)[, c(ref_col, "clinicalcategorical", "sus_category_current")]
meta_factors[[ref_col]] <- as.factor(outcome_factor)

snp_df <- cbind(meta_factors, nzv_texprs)
dim(snp_df)
snp_train_idx <- createDataPartition(outcome_factor, p = 0.4,
                                     list = FALSE,
                                     times = 1)
summary(outcome_factor[snp_train_idx])
snp_train_start <- rownames(snp_df)[snp_train_idx]
train_known_idx <- meta_factors[snp_train_start, ref_col] != "unknown"
train_rownames <- snp_train_start[train_known_idx]
not_train_idx <- ! rownames(snp_df) %in% train_rownames
not_train_rownames <- rownames(snp_df)[not_train_idx]

train_df <- snp_df[train_rownames, ]
train_df[["clinicalcategorical"]] <- NULL
train_df[["sus_category_current.1"]] <- NULL
train_df[[1]] <- as.factor(as.character(train_df[[1]]))
test_df <- snp_df[not_train_rownames, ]
test_df[[ref_col]] <- NULL
test_df[["clinicalcategorical"]] <- NULL
test_df[["sus_category_current.1"]] <- NULL
test_df[["sus_category_current"]] <- NULL

knn_fit <- knn3(x = train_df[, -1],
                y = train_df[[ref_col]],
                k = 5)
train_predict <- predict(knn_fit, train_df[, -1], type = "class")
## confusionMatrix() expects data = predictions and reference = known labels.
confusionMatrix(data = train_predict, reference = train_df[[1]])

annotations <- train_df[[1]]
names(annotations) <- rownames(train_df)
names(train_predict) <- rownames(train_df)
summary(annotations == train_predict)

test_predict <- predict(knn_fit, test_df, type = "class")
names(test_predict) <- rownames(test_df)

ref_result <- pData(both_snps)[[ref_col]]
names(ref_result) <- rownames(pData(both_snps))
ref_xref <- names(ref_result) %in% names(test_predict)
ref_comp <- ref_result[ref_xref]
ref_comp_idx <- ref_comp != "unknown"
ref_comp <- ref_comp[ref_comp_idx]
ref_comp_idx <- !is.na(ref_comp)
ref_comp <- ref_comp[ref_comp_idx]

test_comp_idx <- names(test_predict) %in% names(ref_comp)
test_comp <- test_predict[test_comp_idx]

agreement <- as.character(ref_comp) == as.character(test_comp)
agree_num <- sum(agreement)
agree_prop <- agree_num / length(ref_comp)
agree_prop
```
