1 Creating a strain classifier using our data

I have not yet implemented a real ML classifier, so I will use some of the caret documentation in order to write a trainer and classifier of the TMRC2 strains.

2 Splitting the data based on outcome

I am taking this pretty much directly from:

https://topepo.github.io/caret/data-splitting.html#simple-splitting-based-on-the-outcome

In order to perform the split by outcome, I think I need to slightly reformat the structure of our variant data to more closely match what I see in the datasets: etitanic/iris/Sonar.

2.1 Define the outcome factor

Setting this to character here because I will be messing with the factor levels and I would rather not have to relevel multiple times.

outcome_factor <- as.character(pData(both_snps)[["zymodemecategorical"]])

2.2 Figure out why old samples were dropped.

In my previous attempt at this, I noticed that the older samples were getting dropped by na.omit, I want to take a moment here to figure out why.

dim(exprs(both_snps))
## [1] 1611844     134
both_norm <- normalize_expt(both_snps, transform="log2", norm="quant")
## transform_counts: Found 207502544 values equal to 0, adding 1 to the matrix.

2.3 Filter the data using near-zero variance

Stealing now from:

https://github.com/compgenomr/book/blob/master/05-supervisedLearning.Rmd

Multiple sections of the above document provide nice ways to remove redundant predictor variables. For my purposes, I would like to keep all of them because I am hoping to make use of the same model in a dataset which is vastly sparser than this. However, for the purposes of me learning, I will apply all of these methods and use it on the rest of the TMRC2 data.

texprs <- t(exprs(both_norm))
nzv <- caret::preProcess(texprs, method="nzv", uniqueCut=15)
nzv_texprs <- predict(nzv, texprs)
dim(nzv_texprs)
## [1]   134 95016
standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:2000]
nzv_texprs <- nzv_texprs[, top_predictors]

2.4 Center the data

I think centering may not be needed for this data, but here is how:

nzv_center <- preProcess(nzv_texprs, method = "center")
## Error in preProcess(nzv_texprs, method = "center"): could not find function "preProcess"
nzv_texprs <- predict(nzv_center, nzv_texprs)
## Error in predict(nzv_center, nzv_texprs): object 'nzv_center' not found

2.5 Drop correlated

nzv_correlated <- preProcess(nzv_texprs, method = "corr", cutoff = 0.9)
## Error in preProcess(nzv_texprs, method = "corr", cutoff = 0.9): could not find function "preProcess"
nzv_texprs <- predict(nzv_correlated, nzv_texprs)
## Error in predict(nzv_correlated, nzv_texprs): object 'nzv_correlated' not found
dim(nzv_texprs)
## [1]  134 2000
anyNA(nzv_texprs)
## [1] FALSE

2.6 Reformat the metadata to be used by caret

na_idx <- is.na(outcome_factor)
outcome_factor[na_idx] <- "unknown"
meta_factors <- pData(both_snps)[, c("zymodemecategorical", "clinicalcategorical", "sus_category_current")]
meta_factors[["zymodemecategorical"]] <- outcome_factor

2.7 Add the metadata to the filtered variables

In addition, I want make sure that no unknown samples are used in training.

snp_df <- cbind(meta_factors, nzv_texprs)
dim(snp_df)
## [1]  134 2003
snp_train_idx <- createDataPartition(outcome_factor, p = 0.4,
                                     list = FALSE,
                                     times = 1)
## Error in createDataPartition(outcome_factor, p = 0.4, list = FALSE, times = 1): could not find function "createDataPartition"
summary(outcome_factor[snp_train_idx])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': object 'snp_train_idx' not found
snp_train_start <- rownames(snp_df)[snp_train_idx]
## Error in eval(expr, envir, enclos): object 'snp_train_idx' not found
train_known_idx <- meta_factors[snp_train_start, "zymodemecategorical"] != "unknown"
## Error in `[.data.frame`(meta_factors, snp_train_start, "zymodemecategorical"): object 'snp_train_start' not found
train_rownames <- snp_train_start[train_known_idx]
## Error in eval(expr, envir, enclos): object 'snp_train_start' not found
not_train_idx <- ! rownames(snp_df) %in% train_rownames
## Error in h(simpleError(msg, call)): error in evaluating the argument 'table' in selecting a method for function '%in%': object 'train_rownames' not found
not_train_rownames <- rownames(snp_df)[not_train_idx]
## Error in eval(expr, envir, enclos): object 'not_train_idx' not found
train_df <- snp_df[train_rownames, ]
## Error in `[.data.frame`(snp_df, train_rownames, ): object 'train_rownames' not found
train_df[["clinicalcategorical"]] <- NULL
## Error in train_df[["clinicalcategorical"]] <- NULL: object 'train_df' not found
train_df[["sus_category_current"]] <- NULL
## Error in train_df[["sus_category_current"]] <- NULL: object 'train_df' not found
train_df[[1]] <- as.factor(as.character(train_df[[1]]))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.factor': object 'train_df' not found
test_df <- snp_df[not_train_rownames, ]
## Error in `[.data.frame`(snp_df, not_train_rownames, ): object 'not_train_rownames' not found
test_df[["zymodemecategorical"]] <- NULL
## Error in test_df[["zymodemecategorical"]] <- NULL: object 'test_df' not found
test_df[["clinicalcategorical"]] <- NULL
## Error in test_df[["clinicalcategorical"]] <- NULL: object 'test_df' not found
test_df[["sus_category_current"]] <- NULL
## Error in test_df[["sus_category_current"]] <- NULL: object 'test_df' not found

2.8 KNN

Perform the fit and test the training set against itself.

knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["zymodemecategorical"]],
                k = 5)
## Error in knn3(x = train_df[, -1], y = train_df[["zymodemecategorical"]], : could not find function "knn3"
train_predict <- predict(knn_fit, train_df[, -1], type = "class")
## Error in predict(knn_fit, train_df[, -1], type = "class"): object 'knn_fit' not found
confusionMatrix(data = train_df[[1]], reference = train_predict)
## Error in confusionMatrix(data = train_df[[1]], reference = train_predict): could not find function "confusionMatrix"
train_df[[1]]
## Error in eval(expr, envir, enclos): object 'train_df' not found
train_predict
## Error in eval(expr, envir, enclos): object 'train_predict' not found

It looks like my knn model has some disagreements in the already difficult cases of poorly represented strains, so I will call it a win. E.g it appears to fail on a z1.0, b2904, z3.0, z2.0, a 2.1, and 2.4.

Now lets try it out on the rest of the data.

test_predict <- predict(knn_fit, test_df, type = "class")
## Error in predict(knn_fit, test_df, type = "class"): object 'knn_fit' not found
names(test_predict) <- rownames(test_df)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'test_df' not found
test_predict
## Error in eval(expr, envir, enclos): object 'test_predict' not found

Now let us manually compare these to the sample sheet and see how we did…

|———-|———-|—–|——|—-| |Sample |Annotation|MyML |Agree?|IGV | |TMRC20001 |z2.3 |z1.0 |No |I agree it is not a 2.3, but to me it doesn’t look like a 1.0 | |TMRC20007 |unknown |z2.3 | | | |TMRC20008 |unknown |z2.3 | | | |TMRC20039 |z2.2 |z2.2 |Yes | | |TMRC20037 |z2.3 |z2.3 |Yes | | |TMRC20038 |z2.3 |z2.3 |Yes | | |TMRC20068 |z2.3 |z2.3 |Yes | | |TMRC20041 |z2.2 |z2.1 |No |maybe more like 2.2? It doesn’t look like either to me | |TMRC20009 |z2.2 |z2.2 |Yes | | |TMRC20010 |z2.3 |z2.3 | Yes | | |TMRC20016 |z2.3 |z2.3 | Yes || |TMRC20014 |z2.2|z2.2|Yes|| |TMRC20019 |z2.2 |z2.2|Yes|| |TMRC20070 |z2.3 |z2.3|Yes|| |TMRC20021 |z2.3 |z2.3|Yes|| |TMRC20022 |z2.2 |z2.2|Yes|| |TMRC20036 |z2.2 |z2.1|No| It is definitely an odd 2.2; I can see why it said it is 2.1| |TMRC20031 |z2.2 |z2.2|Yes|| |TMRC20076 |z2.2|z2.2|Yes|| |TMRC20073 |z2.3|z2.3|Yes|| |TMRC20078 |z2.2|z2.2|Yes|| |TMRC20042 |z2.2|z2.2|Yes|| |TMRC20059 |z2.3|z2.3|Yes|| |TMRC20048 |z2.3|z2.3|Yes|| |TMRC20088 |z2.2|z2.2|Yes|| |TMRC20077 |z2.2|z2.2|Yes|| |TMRC20074 |z2.2|z2.2|Yes|| |TMRC20063 |z2.2|z2.2|Yes|| |TMRC20052 |z2.3|z2.3|Yes|| |TMRC20064 |z2.3|z2.3|Yes|| |TMRC20075 |z2.3|z2.3|Yes|| |TMRC20050 |z2.2|z2.2|Yes|| |TMRC20049 |z2.2|z2.2|Yes|| |TMRC20062 |z2.3|z2.3|Yes|| |TMRC20110 |z2.2|z2.2|Yes|| |TMRC20080 |z2.3|z2.3|Yes|| |TMRC20043 |z2.3|z2.3|Yes|| |TMRC20085 |z2.3|z2.3|Yes|| |TMRC20046 |z2.2|z2.1|No|This looks almost exactly like TMRC20036| |TMRC20093 |z2.1|z2.1|Yes|Interestingly, this does not look like TMRC20046 nor TMRC20036 in all places| |TMRC20047 |z2.4|z2.3|No|I see places where this looks like it uniquely matches both strains, they are crazy similar| |TMRC20109 |z2.2|z2.2|Yes|| |TMRC20096 |z2.2|z2.2|Yes|| |TMRC20097 |z2.2|z2.1|No|I bet it looks like TMRC20046/TMRC20036 before it opens, it does!| |TMRC20101 |z2.2|z2.2|Yes|| |TMRC20102 |z2.3|z2.3|Yes|| |TMRC20099 |z2.3|z2.3|Yes|| |TMRC20100 |z2.3|z2.3|Yes|| |TMRC20091 |z2.1|z2.2|No|This looks like what I think of as the ‘canonical’ z2.2 and not like TMRC20047.| |TMRC20084 |z2.1|z2.1|Yes|| |TMRC20087 |z2.2|z2.2|Yes|| |TMRC20103 |z2.1|z2.1|Yes|| |TMRC20104 |z2.3|z2.3|Yes|| |TMRC20086 |z2.2|z2.2|Yes|| |TMRC20107 |z2.3|z2.3|Yes|| |TMRC20095 |z2.3|z2.3|Yes||

---
title: "TMRC2 ML Classification of strains: 202212"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include = FALSE}
library(hpgltools)
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(progress = TRUE,
                     verbose = TRUE,
                     width = 90,
                     echo = TRUE)
knitr::opts_chunk$set(error = TRUE,
                      fig.width = 8,
                      fig.height = 8,
                      dpi = 96)
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "202212"
rundate <- format(Sys.Date(), format = "%Y%m%d")

## tmp <- try(sm(loadme(filename = gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = previous_file))))
rmd_file <- glue::glue("tmrc2_classifier_{ver}.Rmd")
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
loaded <- load(file=glue::glue("rda/both_snps-v{ver}.rda"))
```

# Creating a strain classifier using our data

I have not yet implemented a real ML classifier, so I will use some of
the caret documentation in order to write a trainer and classifier of
the TMRC2 strains.

# Splitting the data based on outcome

I am taking this pretty much directly from:

https://topepo.github.io/caret/data-splitting.html#simple-splitting-based-on-the-outcome

In order to perform the split by outcome, I think I need to slightly
reformat the structure of our variant data to more closely match what
I see in the datasets: etitanic/iris/Sonar.

## Define the outcome factor

Setting this to character here because I will be messing with the
factor levels and I would rather not have to relevel multiple times.

```{r classify_snps}
outcome_factor <- as.character(pData(both_snps)[["zymodemecategorical"]])
```

## Figure out why old samples were dropped.

In my previous attempt at this, I noticed that the older samples were
getting dropped by na.omit, I want to take a moment here to figure out
why.

```{r check_old_samples}
dim(exprs(both_snps))
both_norm <- normalize_expt(both_snps, transform="log2", norm="quant")
```

## Filter the data using near-zero variance

Stealing now from:

https://github.com/compgenomr/book/blob/master/05-supervisedLearning.Rmd

Multiple sections of the above document provide nice ways to remove
redundant predictor variables.  For my purposes, I would like to keep
all of them because I am hoping to make use of the same model in a
dataset which is vastly sparser than this.  However, for the purposes
of me learning, I will apply all of these methods and use it on the
rest of the TMRC2 data.

```{r nzv}
texprs <- t(exprs(both_norm))
nzv <- caret::preProcess(texprs, method="nzv", uniqueCut=15)
nzv_texprs <- predict(nzv, texprs)
dim(nzv_texprs)

standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:2000]
nzv_texprs <- nzv_texprs[, top_predictors]
```

## Center the data

I think centering may not be needed for this data, but here is how:

```{r center}
nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)
```

## Drop correlated

```{r correlated}
nzv_correlated <- preProcess(nzv_texprs, method = "corr", cutoff = 0.9)
nzv_texprs <- predict(nzv_correlated, nzv_texprs)
dim(nzv_texprs)
anyNA(nzv_texprs)
```

## Reformat the metadata to be used by caret

```{r reformat_data}
na_idx <- is.na(outcome_factor)
outcome_factor[na_idx] <- "unknown"
meta_factors <- pData(both_snps)[, c("zymodemecategorical", "clinicalcategorical", "sus_category_current")]
meta_factors[["zymodemecategorical"]] <- outcome_factor
```

## Add the metadata to the filtered variables

In addition, I want make sure that no unknown samples are used in training.

```{r merge_pieces}
snp_df <- cbind(meta_factors, nzv_texprs)
dim(snp_df)
snp_train_idx <- createDataPartition(outcome_factor, p = 0.4,
                                     list = FALSE,
                                     times = 1)
summary(outcome_factor[snp_train_idx])
snp_train_start <- rownames(snp_df)[snp_train_idx]
train_known_idx <- meta_factors[snp_train_start, "zymodemecategorical"] != "unknown"
train_rownames <- snp_train_start[train_known_idx]
not_train_idx <- ! rownames(snp_df) %in% train_rownames
not_train_rownames <- rownames(snp_df)[not_train_idx]

train_df <- snp_df[train_rownames, ]
train_df[["clinicalcategorical"]] <- NULL
train_df[["sus_category_current"]] <- NULL
train_df[[1]] <- as.factor(as.character(train_df[[1]]))
test_df <- snp_df[not_train_rownames, ]
test_df[["zymodemecategorical"]] <- NULL
test_df[["clinicalcategorical"]] <- NULL
test_df[["sus_category_current"]] <- NULL
```

## KNN

Perform the fit and test the training set against itself.

```{r knn}
knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["zymodemecategorical"]],
                k = 5)
train_predict <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = train_predict)

train_df[[1]]
train_predict
```

It looks like my knn model has some disagreements in the already
difficult cases of poorly represented strains, so I will call it a
win.  E.g it appears to fail on a z1.0, b2904, z3.0, z2.0, a 2.1, and
2.4.

Now lets try it out on the rest of the data.

```{r knn_test}
test_predict <- predict(knn_fit, test_df, type = "class")
names(test_predict) <- rownames(test_df)
test_predict
```

Now let us manually compare these to the sample sheet and see how we
did...

|----------|----------|-----|------|----|
|Sample    |Annotation|MyML |Agree?|IGV |
|TMRC20001 |z2.3      |z1.0 |No    |I agree it is not a 2.3, but to me it doesn't look like a 1.0    |
|TMRC20007 |unknown   |z2.3 |      |    |
|TMRC20008 |unknown   |z2.3 |      |    |
|TMRC20039 |z2.2      |z2.2 |Yes   |    |
|TMRC20037 |z2.3      |z2.3 |Yes   |    |
|TMRC20038 |z2.3      |z2.3 |Yes   |    |
|TMRC20068 |z2.3      |z2.3 |Yes   |    |
|TMRC20041 |z2.2      |z2.1 |No    |maybe more like 2.2?  It doesn't look like either to me    |
|TMRC20009 |z2.2      |z2.2 |Yes   |    |
|TMRC20010 |z2.3      |z2.3 | Yes  |    |
|TMRC20016 |z2.3      |z2.3 | Yes  ||
|TMRC20014 |z2.2|z2.2|Yes||
|TMRC20019 |z2.2 |z2.2|Yes||
|TMRC20070 |z2.3 |z2.3|Yes||
|TMRC20021 |z2.3 |z2.3|Yes||
|TMRC20022 |z2.2 |z2.2|Yes||
|TMRC20036 |z2.2 |z2.1|No| It is definitely an odd 2.2; I can see why it said it is 2.1|
|TMRC20031 |z2.2 |z2.2|Yes||
|TMRC20076 |z2.2|z2.2|Yes||
|TMRC20073 |z2.3|z2.3|Yes||
|TMRC20078 |z2.2|z2.2|Yes||
|TMRC20042 |z2.2|z2.2|Yes||
|TMRC20059 |z2.3|z2.3|Yes||
|TMRC20048 |z2.3|z2.3|Yes||
|TMRC20088 |z2.2|z2.2|Yes||
|TMRC20077 |z2.2|z2.2|Yes||
|TMRC20074 |z2.2|z2.2|Yes||
|TMRC20063 |z2.2|z2.2|Yes||
|TMRC20052 |z2.3|z2.3|Yes||
|TMRC20064 |z2.3|z2.3|Yes||
|TMRC20075 |z2.3|z2.3|Yes||
|TMRC20050 |z2.2|z2.2|Yes||
|TMRC20049 |z2.2|z2.2|Yes||
|TMRC20062 |z2.3|z2.3|Yes||
|TMRC20110 |z2.2|z2.2|Yes||
|TMRC20080 |z2.3|z2.3|Yes||
|TMRC20043 |z2.3|z2.3|Yes||
|TMRC20085 |z2.3|z2.3|Yes||
|TMRC20046 |z2.2|z2.1|No|This looks almost exactly like TMRC20036|
|TMRC20093 |z2.1|z2.1|Yes|Interestingly, this does not look like TMRC20046 nor TMRC20036 in all places|
|TMRC20047 |z2.4|z2.3|No|I see places where this looks like it uniquely matches both strains, they are crazy similar|
|TMRC20109 |z2.2|z2.2|Yes||
|TMRC20096 |z2.2|z2.2|Yes||
|TMRC20097 |z2.2|z2.1|No|I bet it looks like TMRC20046/TMRC20036 before it opens, it does!|
|TMRC20101 |z2.2|z2.2|Yes||
|TMRC20102 |z2.3|z2.3|Yes||
|TMRC20099 |z2.3|z2.3|Yes||
|TMRC20100 |z2.3|z2.3|Yes||
|TMRC20091 |z2.1|z2.2|No|This looks like what I think of as the 'canonical' z2.2 and _not_ like TMRC20047.|
|TMRC20084 |z2.1|z2.1|Yes||
|TMRC20087 |z2.2|z2.2|Yes||
|TMRC20103 |z2.1|z2.1|Yes||
|TMRC20104 |z2.3|z2.3|Yes||
|TMRC20086 |z2.2|z2.2|Yes||
|TMRC20107 |z2.3|z2.3|Yes||
|TMRC20095 |z2.3|z2.3|Yes||
