I have not yet implemented a real ML classifier, so I will follow the caret documentation to write a trainer and classifier for the TMRC2 strains.
I am taking this pretty much directly from:
https://topepo.github.io/caret/data-splitting.html#simple-splitting-based-on-the-outcome
In order to perform the split by outcome, I think I need to slightly reformat our variant data to more closely match the structure of the example datasets (etitanic/iris/Sonar): samples as rows, predictors as columns, plus an outcome column.
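Before touching the real data, here is the split-by-outcome idiom from that page run on iris; a minimal sketch, with an arbitrary seed and split proportion:
library(caret)
set.seed(3456)
## createDataPartition() samples within each level of the outcome, so the
## class proportions survive the split.
iris_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE, times = 1)
iris_train <- iris[iris_idx, ]
iris_test <- iris[-iris_idx, ]
table(iris_train$Species)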
Setting this to character here because I will be messing with the factor levels and I would rather not have to relevel multiple times.
Note that in the previous revision of this, I used the column ‘zymodemecategorical’; now I am using the ‘knnhclusttogethercall’ column set below. If things continue to change, I may make a v3 column.
ref_col <- "knnhclusttogethercall"
outcome_factor <- as.character(pData(both_snps)[[ref_col]])
In my previous attempt at this, I noticed that the older samples were getting dropped by na.omit, so I want to take a moment here to figure out why.
dim(exprs(both_snps))
## [1] 1611844 134
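A quick way to see what na.omit would remove is to count the missing outcome calls directly; a small sketch using the outcome defined above:
## How many samples lack a call, and which ones are they?
sum(is.na(outcome_factor))
rownames(pData(both_snps))[is.na(outcome_factor)]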
both_norm <- normalize_expt(both_snps, transform="log2", norm="quant")
## transform_counts: Found 207410954 values equal to 0, adding 1 to the matrix.
Stealing now from:
https://github.com/compgenomr/book/blob/master/05-supervisedLearning.Rmd
Multiple sections of the above document provide nice ways to remove redundant predictor variables. For my purposes, I would like to keep all of them, because I am hoping to reuse the same model on a dataset which is vastly sparser than this one. However, for the sake of learning, I will apply these methods here and use the results on the rest of the TMRC2 data.
In hopes of using more of the data, I will remove the SD filter.
texprs <- t(exprs(both_norm))
## Given this large amount of data, this step is slow, taking > 10 minutes.
nzv <- preProcess(texprs, method="nzv", uniqueCut=15)
nzv_texprs <- predict(nzv, texprs)
dim(nzv_texprs)
## [1] 134 97426
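The compgenomr chapter also drops highly correlated predictors. I am not running that here, because the correlation matrix of ~97k predictors is likely to exhaust memory, but it would look something like this sketch (0.9 is caret's default cutoff):
corr_filter <- preProcess(nzv_texprs, method = "corr", cutoff = 0.9)
decorr_texprs <- predict(corr_filter, nzv_texprs)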
For the moment, I am excluding the following block in order to see how much time and memory keeping these variables costs. If I recall properly, the model built on the top-2k variant positions took ~1-4 GB of memory. I hope that this scales linearly, but I suspect it might not.
standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:2000]
nzv_texprs <- nzv_texprs[, top_predictors]
I think centering may not be needed for this data, but here is how:
nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)
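knn is distance-based, so if scaling ever turns out to matter, preProcess() can center and scale in one pass; a sketch:
cs_pre <- preProcess(nzv_texprs, method = c("center", "scale"))
cs_texprs <- predict(cs_pre, nzv_texprs)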
## Give the samples without a call an explicit "unknown" level rather than NA,
## so they are not silently dropped.
na_idx <- is.na(outcome_factor)
outcome_factor[na_idx] <- "unknown"
meta_factors <- pData(both_snps)[, c(ref_col, "clinicalcategorical", "sus_category_current")]
meta_factors[[ref_col]] <- outcome_factor
In addition, I want to make sure that no unknown samples are used in training.
snp_df <- cbind(meta_factors, nzv_texprs)
dim(snp_df)
## [1] 134 97429
snp_train_idx <- createDataPartition(outcome_factor, p = 0.4,
                                     list = FALSE,
                                     times = 1)
summary(outcome_factor[snp_train_idx])
## Length Class Mode
## 56 character character
snp_train_start <- rownames(snp_df)[snp_train_idx]
## Keep only training samples with a known outcome.
train_known_idx <- meta_factors[snp_train_start, ref_col] != "unknown"
train_rownames <- snp_train_start[train_known_idx]
## Everything else, including the unknowns, becomes the test set.
not_train_idx <- ! rownames(snp_df) %in% train_rownames
not_train_rownames <- rownames(snp_df)[not_train_idx]
train_df <- snp_df[train_rownames, ]
train_df[["clinicalcategorical"]] <- NULL
train_df[["sus_category_current"]] <- NULL
## Re-factor the outcome so unused levels (e.g. "unknown") are dropped.
train_df[[1]] <- as.factor(as.character(train_df[[1]]))
test_df <- snp_df[not_train_rownames, ]
test_df[[ref_col]] <- NULL
test_df[["clinicalcategorical"]] <- NULL
test_df[["sus_category_current"]] <- NULL
Perform the fit and test the training set against itself.
knn_fit <- knn3(x = train_df[, -1],
                y = train_df[[ref_col]],
                k = 5)
train_predict <- predict(knn_fit, train_df[, -1], type = "class")
## Note: caret's convention is data = predictions, reference = truth; with the
## arguments swapped as below, the table reads annotation-by-prediction.
confusionMatrix(data = train_df[[1]], reference = train_predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction z10 z21 z22 z23 z32
## z10 2 0 0 0 0
## z21 0 5 0 0 0
## z22 0 0 16 0 0
## z23 0 0 0 17 0
## z32 0 1 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.976
## 95% CI : (0.871, 0.999)
## No Information Rate : 0.415
## P-Value [Acc > NIR] : 1.24e-14
##
## Kappa : 0.963
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: z10 Class: z21 Class: z22 Class: z23 Class: z32
## Sensitivity 1.0000 0.833 1.00 1.000 NA
## Specificity 1.0000 1.000 1.00 1.000 0.9756
## Pos Pred Value 1.0000 1.000 1.00 1.000 NA
## Neg Pred Value 1.0000 0.972 1.00 1.000 NA
## Prevalence 0.0488 0.146 0.39 0.415 0.0000
## Detection Rate 0.0488 0.122 0.39 0.415 0.0000
## Detection Prevalence 0.0488 0.122 0.39 0.415 0.0244
## Balanced Accuracy 1.0000 0.917 1.00 1.000 NA
annotations <- train_df[[1]]
names(annotations) <- rownames(train_df)
names(train_predict) <- rownames(train_df)
summary(annotations == train_predict)
## Mode FALSE TRUE
## logical 1 40
It looks like my knn model has only a single disagreement, and that in the already difficult case of a poorly represented strain: the one sample annotated z3.2 was called z2.1. I will call that a win.
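Since both vectors are named, pulling out the disagreeing sample(s) is a one-liner; a sketch using the objects above:
annotations[annotations != train_predict]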
Now let's try it out on the rest of the data.
test_predict <- predict(knn_fit, test_df, type = "class")
names(test_predict) <- rownames(test_df)
test_predict
## tmrc20001 tmrc20065 tmrc20005 tmrc20008 tmrc20028 tmrc20066 tmrc20039 tmrc20037
## z23 z23 z22 z22 z10 z23 z22 z23
## tmrc20038 tmrc20067 tmrc20041 tmrc20009 tmrc20010 tmrc20016 tmrc20012 tmrc20014
## z23 z23 z21 z22 z23 z23 z22 z22
## tmrc20020 tmrc20024 tmrc20036 tmrc20033 tmrc20026 tmrc20076 tmrc20055 tmrc20071
## z22 z22 z21 z22 z22 z22 z22 z23
## tmrc20094 tmrc20042 tmrc20072 tmrc20059 tmrc20048 tmrc20088 tmrc20056 tmrc20063
## z23 z22 z21 z23 z23 z22 z22 z22
## tmrc20053 tmrc20052 tmrc20064 tmrc20075 tmrc20051 tmrc20049 tmrc20080 tmrc20083
## z22 z23 z23 z23 z23 z22 z23 z22
## tmrc20054 tmrc20085 tmrc20046 tmrc20047 tmrc20061 tmrc20105 tmrc20108 tmrc20096
## z23 z23 z21 z23 z21 z23 z23 z22
## tmrc20092 tmrc20102 tmrc20099 tmrc20100 tmrc20091 tmrc20084 tmrc20087 tmrc20103
## z22 z23 z23 z23 z22 z21 z22 z21
## tmrc20086 tmrc20081 tmrc20106 tmrc20095 hpgl0242 hpgl0243 hpgl0244 hpgl0245
## z22 z22 z22 z23 z22 z22 z23 z21
## hpgl0246 hpgl0247 hpgl0248 hpgl0316 hpgl0318 hpgl0320 hpgl0322 hpgl0631
## z23 z22 z22 z22 z22 z22 z22 z22
## hpgl0632 hpgl0633 hpgl0634 hpgl0635 hpgl0636 hpgl0638 hpgl0639 hpgl0641
## z22 z23 z22 z23 z22 z22 z23 z22
## hpgl0643 hpgl0651 hpgl0652 hpgl0653 hpgl0654 hpgl0655 hpgl0656 hpgl0658
## z22 z22 z22 z23 z22 z23 z22 z10
## hpgl0659 hpgl0660 hpgl0661 hpgl0662 hpgl0663
## z22 z23 z22 z23 z22
## Levels: z10 z21 z22 z23 z32
ref_result <- pData(both_snps)[[ref_col]]
names(ref_result) <- rownames(pData(both_snps))
## Restrict the comparison to test samples which already have a call.
shared <- names(ref_result) %in% names(test_predict)
compare <- ref_result[shared]
compare <- compare[!is.na(compare)]
## Which predictions disagree with the existing calls?
test_predict[compare != test_predict[names(compare)]]
## tmrc20085
## z23
## Levels: z10 z21 z22 z23 z32
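So only tmrc20085 disagrees with its existing call. For symmetry with the susceptibility comparison below, here is the same agreement proportion for the zymodeme predictions; a sketch using the objects above:
mean(as.character(compare) == as.character(test_predict[names(compare)]))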
ref_col <- "sus_category_current"
outcome_factor <- as.character(pData(both_snps)[[ref_col]])
na_idx <- is.na(outcome_factor)
outcome_factor[na_idx] <- "unknown"
## Note: ref_col is now "sus_category_current", so this selection picks up that
## column twice; the duplicate comes back named "sus_category_current.1".
meta_factors <- pData(both_snps)[, c(ref_col, "clinicalcategorical", "sus_category_current")]
meta_factors[[ref_col]] <- as.factor(outcome_factor)
snp_df <- cbind(meta_factors, nzv_texprs)
dim(snp_df)
## [1] 134 97429
snp_train_idx <- createDataPartition(outcome_factor, p = 0.4,
                                     list = FALSE,
                                     times = 1)
summary(outcome_factor[snp_train_idx])
## Length Class Mode
## 55 character character
snp_train_start <- rownames(snp_df)[snp_train_idx]
train_known_idx <- meta_factors[snp_train_start, ref_col] != "unknown"
train_rownames <- snp_train_start[train_known_idx]
not_train_idx <- ! rownames(snp_df) %in% train_rownames
not_train_rownames <- rownames(snp_df)[not_train_idx]
train_df <- snp_df[train_rownames, ]
train_df[["clinicalcategorical"]] <- NULL
train_df[["sus_category_current.1"]] <- NULL
train_df[[1]] <- as.factor(as.character(train_df[[1]]))
test_df <- snp_df[not_train_rownames, ]
test_df[[ref_col]] <- NULL
test_df[["clinicalcategorical"]] <- NULL
test_df[["sus_category_current.1"]] <- NULL
test_df[["sus_category_current"]] <- NULL
knn_fit <- knn3(x = train_df[, -1],
                y = train_df[[ref_col]],
                k = 5)
train_predict <- predict(knn_fit, train_df[, -1], type = "class")
## (Same caveat as above about the data/reference argument order.)
confusionMatrix(data = train_df[[1]], reference = train_predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ambiguous resistant sensitive
## ambiguous 3 0 3
## resistant 2 1 2
## sensitive 1 1 21
##
## Overall Statistics
##
## Accuracy : 0.735
## 95% CI : (0.556, 0.871)
## No Information Rate : 0.765
## P-Value [Acc > NIR] : 0.735
##
## Kappa : 0.402
##
## Mcnemar's Test P-Value : 0.343
##
## Statistics by Class:
##
## Class: ambiguous Class: resistant Class: sensitive
## Sensitivity 0.5000 0.5000 0.808
## Specificity 0.8929 0.8750 0.750
## Pos Pred Value 0.5000 0.2000 0.913
## Neg Pred Value 0.8929 0.9655 0.545
## Prevalence 0.1765 0.0588 0.765
## Detection Rate 0.0882 0.0294 0.618
## Detection Prevalence 0.1765 0.1471 0.676
## Balanced Accuracy 0.6964 0.6875 0.779
annotations <- train_df[[1]]
names(annotations) <- rownames(train_df)
names(train_predict) <- rownames(train_df)
summary(annotations == train_predict)
## Mode FALSE TRUE
## logical 9 25
test_predict <- predict(knn_fit, test_df, type = "class")
names(test_predict) <- rownames(test_df)
ref_result <- pData(both_snps)[[ref_col]]
names(ref_result) <- rownames(pData(both_snps))
## Restrict the comparison to test samples with a known susceptibility call.
ref_xref <- names(ref_result) %in% names(test_predict)
ref_comp <- ref_result[ref_xref]
ref_comp_idx <- ref_comp != "unknown"
ref_comp <- ref_comp[ref_comp_idx]
ref_comp_idx <- !is.na(ref_comp)
ref_comp <- ref_comp[ref_comp_idx]
test_comp_idx <- names(test_predict) %in% names(ref_comp)
test_comp <- test_predict[test_comp_idx]
agreement <- as.character(ref_comp) == as.character(test_comp)
agree_num <- sum(agreement)
agree_prop <- agree_num / length(ref_comp)
agree_prop
## [1] 0.7083
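These numbers come from a single arbitrary 40% split, so I will not read too much into them. When I make this real, caret's train() can handle the resampling for me; a minimal sketch assuming 5-fold cross-validation and an untuned handful of k values:
ctrl <- trainControl(method = "cv", number = 5)
knn_cv <- train(x = train_df[, -1], y = train_df[[1]],
                method = "knn", trControl = ctrl,
                tuneGrid = data.frame(k = c(3, 5, 7)))
knn_cv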