1 Introduction

I had some success in classifying the TMRC2 samples by strain via ML and want to try something more difficult. Thus I will use the normalized gene expression data and try to classify the samples by cure/fail outcome.

ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(tc_clinical)[[ref_col]]))

2 Starter data

In the strain classifier I used the normalized variants. Here I am thinking to use normalized expression instead, which explicitly limits me to ~20k variables (down considerably from the ~1.6M variant positions).

In addition, caret expects the data as (rows == samples) and (columns == variables) where each element is one observation. Thus we will need to transpose the expression matrix.

tc_norm <- normalize_expt(tc_clinical, transform = "log2", convert = "cpm")
## transform_counts: Found 1005383 values equal to 0, adding 1 to the matrix.
texprs <- t(exprs(tc_norm))

3 Filtering

The ML text I am reading provides some neat examples of how one might filter the data to make it more suitable for model creation.

3.1 Near zero variance, or genefilter’s cv

The first filter I was introduced to is quite familiar from our sequencing data: the removal of features with near-zero variance. Indeed, I am pretty certain that normalize_expt() can do this equivalently and significantly faster than caret::preProcess().

system.time({
  equivalent <- normalize_expt(tc_norm, filter = "cv", cv_min = 0.1)
})
## Removing 3174 low-count genes (16749 remaining).
##    user  system elapsed 
##   4.712   0.565   5.309
dim(exprs(equivalent))
## [1] 16749   184
## Given this large amount of data, this step is slow, taking > 10 minutes.
## Yeah seriously, the following three lines get 16,723 genes in 10 minutes while
## the normalize_expt() call above gets 16,749 genes in 2.4 seconds.
#system.time({
#  nzv <- preProcess(texprs, method = "nzv", uniqueCut = 15)
#  nzv_texprs <- predict(nzv, texprs)
#  dim(nzv_texprs)
#})
nzv_texprs <- t(exprs(equivalent))

3.2 Filtering to the highest standard deviation variables

I am a bit confused by this filter; one would think that the nzv filter above, if applied correctly, should give exactly this result.

For the moment, I am excluding the following block in order to see how much time/memory keeping these variables costs. If I recall properly, the model of the top-2k variant positions cost ~1-4G of memory. I hope this scales linearly, but I suspect it might not.

standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:3000]
nzv_texprs <- nzv_texprs[, top_predictors]

3.3 Center the data

I think centering may not be needed for this data, but here is how:

nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)

3.4 Drop correlated

This is a filter which does not correspond to any of those we use on sequencing data, because genes which are highly correlated are likely to be of immediate interest.

For the same reason, I want to leave this off: later applications of this model will include low-coverage samples which may not have every variant represented.

## This step takes a while...
system.time({
  nzv_correlated <- preProcess(nzv_texprs, method = "corr", cutoff = 0.95)
  nzv_uncorr <- predict(nzv_correlated, nzv_texprs)
})
##    user  system elapsed 
## 1095.19   16.98 1131.66
dim(nzv_uncorr)
## [1]   184 12080

4 Merge the appropriate metadata

There are a few metadata factors which might prove of interest for classification. The most obvious are outcome, clinic, donor, visit, and cell type. For the moment I am likely to focus only on outcome. AFAICT I can only include one of these at a time in the data, which is a shame.

interesting_meta <- pData(tc_clinical)[, c("finaloutcome", "donor", "persistence",
                                           "visitnumber", "selectionmethod",
                                           "typeofcells", "time", "clinic")]

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
dim(ml_df)
## [1]   184 12081

5 Split the data into training/testing

caret provides nice functionality for splitting up the data. I suspect there are many more fun knobs I can play with for instances where I need to exclude some levels of a factor and such. In this case I just want to split by outcome.

5.1 Via data splitting

## The variable outcome_factor was created at the top of this document,
## which is a little awkward here.
train_idx <- createDataPartition(outcome_factor, p = 0.6,
                                 list = FALSE,
                                 times = 1)
summary(ml_df[train_idx, "outcome_factor"])
##    cure failure 
##      74      38
## Training set has 112 samples.

train_rownames <- rownames(ml_df)[train_idx]
not_train_idx <- ! rownames(ml_df) %in% train_rownames
summary(ml_df[not_train_idx, "outcome_factor"])
##    cure failure 
##      48      24
not_train_rownames <- rownames(ml_df)[not_train_idx]

train_df <- as.data.frame(ml_df[train_rownames, ])
dim(train_df)
## [1]   112 12081
test_df <- as.data.frame(ml_df[not_train_rownames, ])
dim(test_df)
## [1]    72 12081
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL

5.2 Via sampling

There are a few likely sampling methods: cross-validation, bootstrapping, and jackknifing; a sketch of each follows, and I will try them out in earnest later.
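
For reference, here is a minimal sketch of how each scheme is expressed as a caret trainControl() object; the object names here are placeholders, and jackknifing corresponds to caret's leave-one-out CV.

cv_ctrl <- trainControl(method = "cv", number = 10)
boot_ctrl <- trainControl(method = "boot", number = 20)
## Jackknife-style resampling: leave-one-out cross-validation.
loo_ctrl <- trainControl(method = "LOOCV")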

6 Try out training and prediction methods

My goals from here on will be to get the beginnings of a sense of the various methods I can use to create the models from the training data and predict the outcome on the test data. I am hoping also to pick up some idea of what the various arguments mean while I am at it.

6.1 Try out KNN

k-nearest neighbors is somewhat similar to a k-means estimate; thus its primary argument is ‘k’, the number of neighbors consulted for each classification.

6.1.1 Model creation and performance

knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      69       5
##    failure    4      34
##                                         
##                Accuracy : 0.92          
##                  95% CI : (0.853, 0.963)
##     No Information Rate : 0.652         
##     P-Value [Acc > NIR] : 3.5e-11       
##                                         
##                   Kappa : 0.822         
##                                         
##  Mcnemar's Test P-Value : 1             
##                                         
##             Sensitivity : 0.945         
##             Specificity : 0.872         
##          Pos Pred Value : 0.932         
##          Neg Pred Value : 0.895         
##              Prevalence : 0.652         
##          Detection Rate : 0.616         
##    Detection Prevalence : 0.661         
##       Balanced Accuracy : 0.909         
##                                         
##        'Positive' Class : cure          
## 

As the confusion matrix shows, this failed for a few samples. Let us change k and see if it improves.

Here is a table of false positives/negatives for a few values of ‘k’. In this context, a false positive is calling a known cure a failure, and a false negative is calling a known failure a cure.

| k | fp | fn |
|---|----|----|
| 2 | 0  | 8  |
| 3 | 5  | 5  |
| 4 | 8  | 9  |
| 5 | 11 | 7  |
| 6 | 15 | 8  |

Note: this depends on the luck of the random split, so the above numbers shift moderately from one run to the next. Thus I think I will just use 2 or 3.
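
For the record, here is a sketch of the loop behind that table; it is a reconstruction (fit, pred, truth, and k_fp_fn are names invented here), not necessarily the code originally used.

k_fp_fn <- data.frame(k = 2:6, fp = NA, fn = NA)
for (i in seq_len(nrow(k_fp_fn))) {
  fit <- knn3(x = train_df[, -1], y = train_df[["outcome_factor"]],
              k = k_fp_fn[i, "k"])
  pred <- predict(fit, train_df[, -1], type = "class")
  truth <- train_df[["outcome_factor"]]
  ## fp: known cures called failures; fn: known failures called cures.
  k_fp_fn[i, "fp"] <- sum(pred == "failure" & truth == "cure")
  k_fp_fn[i, "fn"] <- sum(pred == "cure" & truth == "failure")
}
k_fp_fn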

knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      69       5
##    failure    4      34
##                                         
##                Accuracy : 0.92          
##                  95% CI : (0.853, 0.963)
##     No Information Rate : 0.652         
##     P-Value [Acc > NIR] : 3.5e-11       
##                                         
##                   Kappa : 0.822         
##                                         
##  Mcnemar's Test P-Value : 1             
##                                         
##             Sensitivity : 0.945         
##             Specificity : 0.872         
##          Pos Pred Value : 0.932         
##          Neg Pred Value : 0.895         
##              Prevalence : 0.652         
##          Detection Rate : 0.616         
##    Detection Prevalence : 0.661         
##       Balanced Accuracy : 0.909         
##                                         
##        'Positive' Class : cure          
## 

6.1.2 Visualize the accuracy with a ROC curve

names(knn_predict_trained) <- rownames(train_df)
summary(knn_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
##    Mode   FALSE    TRUE 
## logical       9     103
tt <- predict(knn_fit, train_df[, -1])
curve <- pROC::roc(response = train_df[, 1], predictor = tt[, 1],
                   levels = rev(levels(outcome_factor)))
## Setting direction: controls < cases
plot(curve)

pROC::auc(curve)
## Area under the curve: 0.967

6.1.3 Predict the rest of the data with this model.

## Return the classifications
knn_predict_test <- predict(knn_fit, test_df, type = "class")
## and the values, which will be used for our ROC curve.
knn_predict_test_values <- predict(knn_fit, test_df)

names(knn_predict_test) <- rownames(test_df)
knn_agreement <- knn_predict_test == test_outcome
names(knn_agreement) <- rownames(test_df)
summary(knn_agreement)
##    Mode   FALSE    TRUE 
## logical      16      56
names(knn_agreement)[knn_agreement == FALSE]
##  [1] "TMRC30018" "TMRC30210" "TMRC30105" "TMRC30164" "TMRC30119" "TMRC30103" "TMRC30026" "TMRC30165" "TMRC30044" "TMRC30166" "TMRC30200"
## [12] "TMRC30177" "TMRC30208" "TMRC30218" "TMRC30220" "TMRC30264"
curve <- pROC::roc(response = test_outcome,
                   predictor = knn_predict_test_values[, 1])
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases
plot(curve)

pROC::auc(curve)
## Area under the curve: 0.803
## Find the samples which (dis)agree
ref_col <- "finaloutcome"
ref_result <- pData(tc_clinical)[[ref_col]]
names(ref_result) <- rownames(pData(tc_clinical))
ref_xref <- names(ref_result) %in% names(knn_predict_test)
ref_comp <- ref_result[ref_xref]
ref_comp_idx <- ref_comp != "unknown"
ref_comp <- ref_comp[ref_comp_idx]
ref_comp_idx <- !is.na(ref_comp)
ref_comp <- ref_comp[ref_comp_idx]

test_comp_idx <- names(knn_predict_test) %in% names(ref_comp)
test_comp <- knn_predict_test[test_comp_idx]

agreement <- as.character(ref_comp) == as.character(test_comp)
agree_num <- sum(agreement)
agree_prop <- agree_num / length(ref_comp)
agree_prop
## [1] 0.7778
disagree <- as.character(ref_comp) != as.character(test_comp)
test_comp[disagree]
## TMRC30018 TMRC30210 TMRC30105 TMRC30164 TMRC30119 TMRC30103 TMRC30026 TMRC30165 TMRC30044 TMRC30166 TMRC30200 TMRC30177 TMRC30208 TMRC30218 
##   failure   failure      cure   failure      cure   failure      cure   failure   failure   failure      cure      cure      cure      cure 
## TMRC30220 TMRC30264 
##      cure      cure 
## Levels: cure failure

6.2 Perform cross-validation to estimate k

The cross-validation method of repeatedly resampling the data is handled entirely within the train() function. With that in mind, here it is operating with the knn method.

6.2.1 CV with knn

When train() is called with trControl and tuneGrid arguments, we can control how the knn training is repeated; in this case it will iterate over k from 1 to 10.

cv_control <- trainControl(method = "cv", number = 10)
knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]
##   k
## 1 1
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")

6.2.2 Bootstrap with knn

boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = boot_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]
##   k
## 1 1
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")

6.2.3 Explain the important variables

In this instance we will search for genes which were important for the model’s creation.

The DALEX package provides the function feature_importance(), which uses a series of methods to extract the features (in this case, genes) which do a good job of explaining the result produced by the model. With a dataset like this one, which has thousands of features, that does not appear to end well.

explainer_knn <- explain(knn_fit, label = "knn",
                         data = train_df[, -1],
                         y = as.numeric(train_df[, 1]))
## Preparation of a new explainer is initiated
##   -> model label       :  knn 
##   -> data              :  112  rows  12080  cols 
##   -> target variable   :  112  values 
##   -> predict function  :  yhat.default will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package Model of class: knn3 package unrecognized , ver. Unknown , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  0 , mean =  0.3988 , max =  1  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  0.3333 , mean =  0.9405 , max =  1.667  
##   A new explainer has been created!
## AFAICT the following will take forever unless we drastically reduce the complexity of the model.
## features <- feature_importance(explainer_knn, n_sample = 50, type = "difference")
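
One workaround, sketched here but not run (all of the top50/small_* names are invented for illustration): rebuild the explainer on a drastically reduced feature set, e.g. the 50 highest-variance genes, and then call feature_importance() on that.

top50 <- order(apply(train_df[, -1], 2, sd), decreasing = TRUE)[1:50]
small_df <- train_df[, c(1, top50 + 1)]
small_fit <- knn3(x = small_df[, -1], y = small_df[["outcome_factor"]], k = 3)
explainer_small <- explain(small_fit, label = "knn_small",
                           data = small_df[, -1],
                           y = as.numeric(small_df[, 1]))
small_features <- feature_importance(explainer_small, n_sample = 50,
                                     type = "difference")
plot(small_features)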

6.3 Random Forest

The parameter ‘mtry’ is often important; if I read the text correctly, it controls how many variables to sample at each split of the tree. Higher numbers should presumably make the model more specific, at the risk of overfitting.

Setting min.node.size sets the minimum size of the terminal nodes in each tree; each increase speeds up the algorithm.

I am going to use my boot control trainer from above and see how it goes.
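
Before committing to fixed values, here is a sketch (not run in this document; rf_grid and rf_sweep are invented names) of letting train() choose among a few candidate settings rather than the single point used below.

rf_grid <- expand.grid(mtry = c(50, 200, 1000),
                       min.node.size = c(1, 5),
                       splitrule = "gini")
rf_sweep <- train(outcome_factor ~ ., data = train_df,
                  method = "ranger", trControl = boot_control,
                  importance = "permutation",
                  tuneGrid = rf_grid, verbose = TRUE)
rf_sweep[["bestTune"]]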

library(ranger)
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)

rf_train_fit <- train(outcome_factor ~ ., data = train_df,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
rf_train_fit[["finalModel"]][["prediction.error"]]
## [1] 0.1964
variable_importance <- varImp(rf_train_fit)
plot(variable_importance, top = 15)

rf_predict_trained <- predict(rf_train_fit, train_df)
names(rf_predict_trained) <- rownames(train_df)
summary(rf_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
##    Mode    TRUE 
## logical     112

6.3.0.1 Digression: do the genes provided by varImp mean anything?

Let us take a moment and see if the top-n genes returned by varImp() have some meaning which jumps out. One might assume, given our extant Differential Expression results, that the interleukin response will be a likely candidate.

test_importance <- varImp(rf_train_fit, scale = TRUE)
test_idx <- order(test_importance[["importance"]][["Overall"]], decreasing = TRUE)
test_imp <- test_importance[["importance"]][test_idx, ]
names(test_imp) <- rownames(test_importance[["importance"]])[test_idx]
head(test_imp, n = 200)
## ENSG00000173702 ENSG00000103642 ENSG00000183741 ENSG00000248405 ENSG00000164303 ENSG00000108774 ENSG00000181404 ENSG00000172426 ENSG00000126709 
##          100.00           84.95           83.39           76.51           68.64           66.18           62.33           62.17           61.17 
## ENSG00000137965 ENSG00000134758 ENSG00000185304 ENSG00000180720 ENSG00000187608 ENSG00000160201 ENSG00000187950 ENSG00000081479 ENSG00000129295 
##           60.16           59.37           54.80           54.13           53.95           50.57           50.49           48.42           47.02 
## ENSG00000256870 ENSG00000237440 ENSG00000168280 ENSG00000106009 ENSG00000160932 ENSG00000172613 ENSG00000197826 ENSG00000174898 ENSG00000059378 
##           46.33           45.91           44.97           44.19           43.11           42.47           42.25           41.59           41.24 
## ENSG00000204632 ENSG00000182253 ENSG00000142396 ENSG00000126003 ENSG00000167608 ENSG00000073734 ENSG00000103226 ENSG00000140104 ENSG00000115155 
##           39.88           39.29           38.36           38.30           37.33           37.10           35.40           34.21           33.74 
## ENSG00000163435 ENSG00000138642 ENSG00000138035 ENSG00000167098 ENSG00000173193 ENSG00000136206 ENSG00000090339 ENSG00000166920 ENSG00000226288 
##           33.73           33.60           33.11           33.09           32.59           32.42           32.13           31.51           31.12 
## ENSG00000101605 ENSG00000196345 ENSG00000236444 ENSG00000284776 ENSG00000181315 ENSG00000187626 ENSG00000184979 ENSG00000184508 ENSG00000176945 
##           31.12           31.05           30.78           30.68           30.40           30.30           30.22           29.79           29.57 
## ENSG00000155530 ENSG00000130813 ENSG00000260272 ENSG00000204428 ENSG00000167962 ENSG00000168255 ENSG00000108830 ENSG00000174132 ENSG00000198870 
##           29.30           29.25           28.72           28.68           28.36           27.99           27.67           27.57           27.48 
## ENSG00000284194 ENSG00000104691 ENSG00000179933 ENSG00000286169 ENSG00000186654 ENSG00000151005 ENSG00000135365 ENSG00000130876 ENSG00000146809 
##           27.45           27.38           27.12           26.98           26.93           26.90           26.86           26.79           26.71 
## ENSG00000095015 ENSG00000096060 ENSG00000144560 ENSG00000196369 ENSG00000106436 ENSG00000185347 ENSG00000197822 ENSG00000148572 ENSG00000125821 
##           26.55           26.55           26.47           26.33           26.29           26.28           26.11           26.07           25.98 
## ENSG00000178084 ENSG00000144744 ENSG00000087303 ENSG00000176222 ENSG00000173698 ENSG00000176390 ENSG00000127314 ENSG00000166295 ENSG00000177875 
##           25.98           25.86           25.79           25.79           25.77           25.64           25.49           25.49           25.29 
## ENSG00000100523 ENSG00000161640 ENSG00000253626 ENSG00000054179 ENSG00000214360 ENSG00000179455 ENSG00000165164 ENSG00000198189 ENSG00000150455 
##           25.18           25.11           24.94           24.91           24.91           24.69           24.58           24.45           24.42 
## ENSG00000121058 ENSG00000163116 ENSG00000110768 ENSG00000136574 ENSG00000134326 ENSG00000105248 ENSG00000188157 ENSG00000148339 ENSG00000179407 
##           24.42           24.41           24.41           24.39           24.39           24.27           24.22           24.22           24.05 
## ENSG00000136244 ENSG00000205857 ENSG00000162627 ENSG00000183454 ENSG00000260001 ENSG00000269028 ENSG00000206560 ENSG00000204519 ENSG00000137628 
##           24.00           23.78           23.73           23.68           23.65           23.50           23.47           23.43           23.42 
## ENSG00000232629 ENSG00000084734 ENSG00000158477 ENSG00000185361 ENSG00000154654 ENSG00000179008 ENSG00000196839 ENSG00000168958 ENSG00000107736 
##           23.32           23.21           23.20           23.01           22.99           22.97           22.96           22.95           22.93 
## ENSG00000284505 ENSG00000172456 ENSG00000188981 ENSG00000137331 ENSG00000188039 ENSG00000104856 ENSG00000164007 ENSG00000006634 ENSG00000110906 
##           22.93           22.91           22.91           22.84           22.80           22.80           22.72           22.61           22.61 
## ENSG00000119522 ENSG00000168671 ENSG00000128872 ENSG00000169427 ENSG00000186714 ENSG00000172986 ENSG00000103375 ENSG00000134690 ENSG00000163645 
##           22.55           22.52           22.50           22.43           22.37           22.29           22.29           22.29           22.29 
## ENSG00000185038 ENSG00000153498 ENSG00000009790 ENSG00000160867 ENSG00000121653 ENSG00000133106 ENSG00000174145 ENSG00000053770 ENSG00000146007 
##           22.29           22.25           22.25           22.25           22.18           22.18           22.09           22.09           22.09 
## ENSG00000146205 ENSG00000155158 ENSG00000162972 ENSG00000165688 ENSG00000179344 ENSG00000241149 ENSG00000178150 ENSG00000096006 ENSG00000111450 
##           22.09           22.09           22.09           22.09           22.09           22.09           22.02           22.01           21.92 
## ENSG00000183793 ENSG00000166971 ENSG00000170807 ENSG00000161594 ENSG00000232268 ENSG00000241553 ENSG00000133169 ENSG00000120217 ENSG00000260097 
##           21.90           21.90           21.86           21.85           21.83           21.82           21.82           21.82           21.76 
## ENSG00000055332 ENSG00000158865 ENSG00000162714 ENSG00000166133 ENSG00000143156 ENSG00000137204 ENSG00000104267 ENSG00000105662 ENSG00000174243 
##           21.75           21.71           21.71           21.71           21.71           21.63           21.54           21.54           21.54 
## ENSG00000070761 ENSG00000085871 ENSG00000162775 ENSG00000137802 ENSG00000152767 ENSG00000158806 ENSG00000204001 ENSG00000125900 ENSG00000010818 
##           21.54           21.54           21.53           21.42           21.42           21.42           21.39           21.38           21.38 
## ENSG00000125355 ENSG00000149124 ENSG00000182348 ENSG00000115419 ENSG00000102271 ENSG00000241399 ENSG00000112419 ENSG00000152779 ENSG00000169994 
##           21.38           21.38           21.33           21.31           21.22           21.22           21.22           21.22           21.22 
## ENSG00000004948 ENSG00000145920 
##           21.09           21.09
most_important_ids <- names(head(test_imp, n = 200))
convert_ids(most_important_ids, from = "ENSEMBL", to = "SYMBOL")
## Before conversion: 200, after conversion: 197.
##             ENSEMBL          SYMBOL
## 1   ENSG00000173702           MUC13
## 2   ENSG00000103642           LACTB
## 3   ENSG00000183741            CBX6
## 4   ENSG00000248405    PRR5-ARHGAP8
## 5   ENSG00000164303           ENPP6
## 6   ENSG00000108774           RAB5C
## 7   ENSG00000181404          WASHC1
## 8   ENSG00000172426           RSPH9
## 9   ENSG00000126709            IFI6
## 10  ENSG00000137965           IFI44
## 11  ENSG00000134758          RNF138
## 12  ENSG00000185304           RGPD2
## 13  ENSG00000180720           CHRM4
## 14  ENSG00000187608           ISG15
## 15  ENSG00000160201           U2AF1
## 16  ENSG00000160201    LOC102724594
## 17  ENSG00000187950           OVCH1
## 18  ENSG00000081479            LRP2
## 19  ENSG00000129295         DNAAF11
## 20  ENSG00000256870          SLC5A8
## 21  ENSG00000237440          ZNF737
## 22  ENSG00000168280           KIF5C
## 23  ENSG00000106009           BRAT1
## 24  ENSG00000160932            LY6E
## 25  ENSG00000172613           RAD9A
## 26  ENSG00000197826         CFAP299
## 27  ENSG00000174898        CATSPERD
## 28  ENSG00000059378          PARP12
## 29  ENSG00000204632           HLA-G
## 30  ENSG00000182253            SYNM
## 31  ENSG00000142396         ERVK3-1
## 32  ENSG00000126003          PLAGL2
## 33  ENSG00000167608            TMC4
## 34  ENSG00000073734          ABCB11
## 35  ENSG00000103226           NOMO3
## 36  ENSG00000140104           CLBA1
## 37  ENSG00000115155            OTOF
## 38  ENSG00000163435            ELF3
## 39  ENSG00000138642           HERC6
## 40  ENSG00000138035           PNPT1
## 41  ENSG00000167098            SUN5
## 42  ENSG00000173193          PARP14
## 43  ENSG00000136206          SPDYE1
## 44  ENSG00000090339           ICAM1
## 45  ENSG00000166920        C15orf48
## 46  ENSG00000226288          OR52I2
## 47  ENSG00000101605           MYOM1
## 48  ENSG00000196345         ZKSCAN7
## 49  ENSG00000236444          UBE2L5
## 51  ENSG00000181315          ZNF322
## 52  ENSG00000187626         ZKSCAN4
## 53  ENSG00000184979           USP18
## 54  ENSG00000184508           HDDC3
## 55  ENSG00000176945           MUC20
## 56  ENSG00000155530           LRGUK
## 57  ENSG00000130813            SHFL
## 59  ENSG00000204428          LY6G5C
## 60  ENSG00000167962          ZNF598
## 61  ENSG00000168255 POLR2J3-UPK3BL2
## 62  ENSG00000108830            RND2
## 63  ENSG00000174132         FAM174A
## 64  ENSG00000198870          STKLD1
## 65  ENSG00000284194            SCO2
## 66  ENSG00000104691           UBXN8
## 67  ENSG00000179933       C14orf119
## 69  ENSG00000186654            PRR5
## 70  ENSG00000151005           TKTL2
## 71  ENSG00000135365          PHF21A
## 72  ENSG00000130876         SLC7A10
## 73  ENSG00000146809           ASB15
## 74  ENSG00000095015          MAP3K1
## 75  ENSG00000096060           FKBP5
## 76  ENSG00000144560           VGLL4
## 77  ENSG00000196369         SRGAP2B
## 78  ENSG00000106436           MYL10
## 79  ENSG00000185347           TEDC1
## 80  ENSG00000197822            OCLN
## 81  ENSG00000148572           NRBF2
## 82  ENSG00000125821            DTD1
## 83  ENSG00000178084           HTR3C
## 84  ENSG00000144744            UBA3
## 85  ENSG00000087303            NID2
## 86  ENSG00000176222          ZNF404
## 87  ENSG00000173698          ADGRG2
## 88  ENSG00000176390           CRLF3
## 89  ENSG00000127314           RAP1B
## 90  ENSG00000166295         ANAPC16
## 91  ENSG00000177875         CCDC184
## 92  ENSG00000100523           DDHD1
## 93  ENSG00000161640        SIGLEC11
## 94  ENSG00000253626         EIF5AL1
## 95  ENSG00000054179          ENTPD2
## 96  ENSG00000214360          EFCAB9
## 97  ENSG00000179455           MKRN3
## 98  ENSG00000165164          CFAP47
## 99  ENSG00000198189        HSD17B11
## 100 ENSG00000150455           TIRAP
## 101 ENSG00000121058            COIL
## 102 ENSG00000163116           STPG2
## 103 ENSG00000110768          GTF2H1
## 104 ENSG00000136574           GATA4
## 105 ENSG00000134326           CMPK2
## 106 ENSG00000105248            YJU2
## 107 ENSG00000188157            AGRN
## 108 ENSG00000148339        SLC25A25
## 109 ENSG00000179407          DNAJB8
## 110 ENSG00000136244             IL6
## 111 ENSG00000205857         NANOGNB
## 112 ENSG00000162627            SNX7
## 113 ENSG00000183454          GRIN2A
## 114 ENSG00000260001         TGFBR3L
## 116 ENSG00000206560         ANKRD28
## 117 ENSG00000204519          ZNF551
## 118 ENSG00000137628           DDX60
## 119 ENSG00000232629        HLA-DQB2
## 120 ENSG00000084734            GCKR
## 121 ENSG00000158477            CD1A
## 122 ENSG00000185361       TNFAIP8L1
## 123 ENSG00000154654           NCAM2
## 124 ENSG00000179008        C14orf39
## 125 ENSG00000196839             ADA
## 126 ENSG00000168958             MFF
## 127 ENSG00000107736           CDH23
## 128 ENSG00000284505    LYNX1-SLURP2
## 129 ENSG00000172456            FGGY
## 130 ENSG00000188981         MSANTD1
## 131 ENSG00000137331            IER3
## 132 ENSG00000188039            NWD1
## 133 ENSG00000104856            RELB
## 134 ENSG00000164007          CLDN19
## 135 ENSG00000006634            DBF4
## 136 ENSG00000110906          KCTD10
## 137 ENSG00000119522         DENND1A
## 138 ENSG00000168671          UGT3A2
## 139 ENSG00000128872           TMOD2
## 140 ENSG00000169427           KCNK9
## 141 ENSG00000186714          CCDC73
## 142 ENSG00000172986          GXYLT2
## 143 ENSG00000103375            AQP8
## 144 ENSG00000134690           CDCA8
## 145 ENSG00000163645          ERICH6
## 146 ENSG00000185038          MROH2A
## 147 ENSG00000153498          SPACA7
## 148 ENSG00000009790        TRAF3IP3
## 149 ENSG00000160867           FGFR4
## 150 ENSG00000121653        MAPK8IP1
## 151 ENSG00000133106          EPSTI1
## 152 ENSG00000174145            NWD2
## 153 ENSG00000053770           AP5M1
## 154 ENSG00000146007           ZMAT2
## 155 ENSG00000146205            ANO7
## 156 ENSG00000155158          TTC39B
## 157 ENSG00000162972           MAIP1
## 158 ENSG00000165688           PMPCA
## 159 ENSG00000179344        HLA-DQB1
## 160 ENSG00000241149          ZNF722
## 161 ENSG00000178150          ZNF114
## 162 ENSG00000096006          CRISP3
## 163 ENSG00000111450            STX2
## 164 ENSG00000183793          NPIPA5
## 165 ENSG00000166971           AKTIP
## 166 ENSG00000170807           LMOD2
## 167 ENSG00000161594          KLHL10
## 168 ENSG00000232268          OR52I1
## 169 ENSG00000241553           ARPC4
## 170 ENSG00000133169            BEX1
## 171 ENSG00000120217           CD274
## 172 ENSG00000260097          SPDYE6
## 173 ENSG00000055332         EIF2AK2
## 174 ENSG00000158865         SLC5A11
## 175 ENSG00000162714          ZNF496
## 176 ENSG00000166133          RPUSD2
## 177 ENSG00000143156            NME7
## 178 ENSG00000137204         SLC22A7
## 179 ENSG00000104267             CA2
## 180 ENSG00000105662           CRTC1
## 181 ENSG00000174243           DDX23
## 182 ENSG00000070761          CFAP20
## 183 ENSG00000085871           MGST2
## 184 ENSG00000162775           RBM15
## 185 ENSG00000137802         MAPKBP1
## 186 ENSG00000152767           FARP1
## 187 ENSG00000158806            NPM2
## 188 ENSG00000204001            LCN8
## 189 ENSG00000125900           SIRPD
## 190 ENSG00000010818          HIVEP2
## 191 ENSG00000125355        TMEM255A
## 192 ENSG00000149124           GLYAT
## 193 ENSG00000182348         ZNF804B
## 194 ENSG00000115419             GLS
## 195 ENSG00000102271           KLHL4
## 196 ENSG00000241399           CD302
## 197 ENSG00000112419         PHACTR2
## 198 ENSG00000152779        SLC16A12
## 199 ENSG00000169994           MYO7B
## 200 ENSG00000004948           CALCR
## 201 ENSG00000145920           CPLX2
importance_gp <- simple_gprofiler(names(head(test_imp, n = 60)))
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## (the two lines above repeated eight times, once per gProfiler source)
importance_gp$pvalue_plots$BP
## NULL

6.3.1 Now the random forest testers!

rf_predict_test <- predict(rf_train_fit, test_df)
names(rf_predict_test) <- rownames(test_df)
rf_agreement <- rf_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(rf_agreement) <- rownames(test_df)
summary(rf_agreement)
##    Mode   FALSE    TRUE 
## logical      17      55
names(rf_agreement)[rf_agreement == FALSE]
##  [1] "TMRC30018" "TMRC30056" "TMRC30080" "TMRC30119" "TMRC30103" "TMRC30122" "TMRC30026" "TMRC30166" "TMRC30048" "TMRC30200" "TMRC30177"
## [12] "TMRC30238" "TMRC30208" "TMRC30218" "TMRC30220" "TMRC30264" "TMRC30265"

6.4 GLM, or Logistic regression and regularization

Logistic regression is a statistical method for binary responses, though it can handle multiple classes as well. The general idea is to find parameters which maximize the likelihood that the observed data were sampled from the statistical distribution of interest. The transformations and linear-regression-like machinery are confusing, but the core task is to set the model's (fitting) parameters to values under which the statistical model resembles the actual training data as closely as possible, so that new samples return similar values. The distribution one most commonly fits is the Gaussian, in which case we transform/normalize the mean/variance of our variables so they resemble the normal distribution being used. Logistic regression instead uses a binomial distribution (like our raw sequencing data!) with values bounded between 0 and 1.
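
To make the link function concrete: the fitted model pushes a linear predictor through a sigmoid so the output lands in (0, 1) and can be read as a class probability. A two-line illustration (purely illustrative, not part of the analysis):

sigmoid <- function(z) 1 / (1 + exp(-z))
curve(sigmoid, from = -6, to = 6,
      xlab = "linear predictor (b0 + b1 * expression)",
      ylab = "P(cure)")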

6.4.1 Using a single gene

Let us take the most important gene observed in one of our previous training sets: ENSG00000248405 (PRR5-ARHGAP8).

gene_id <- "ENSG00000248405"
single_fit <- train(
    outcome_factor ~ ENSG00000248405, data = train_df,
    method = "glm", family = "binomial", trControl = trainControl("none"))

tt <- data.frame("ENSG00000248405" = seq(min(train_df[[gene_id]]),
                                         max(train_df[[gene_id]]), len = 100))
## predict probabilities for the simulated data
tt$subtype <- predict(single_fit, newdata = tt, type = "prob")[, 1]
## plot the sigmoid curve and the training data
plot(ifelse(outcome_factor == "cure", 1, 0) ~ ENSG00000248405,
     data = train_df, col = "red4",
     ylab = "CF as 0 or 1", xlab = "favorite gene expression")
lines(subtype ~ ENSG00000248405, tt, col = "green4", lwd = 2)

Having tried with one gene, let us extend this to all genes. On my first try, this took a long time.

glm_fit <- train(outcome_factor ~ ., data = train_df,
                 trControl = boot_control,
                 method = "glm", family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == : prediction from a rank-deficient fit may be misleading
## (this pair of warnings repeated for every bootstrap resample)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## Loaded glmnet 4.1-7
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
## train_method <- trainControl(method = "cv", number = 10)
glm_fit <- train(outcome_factor ~ ., data = train_df, method = "glmnet",
                 trControl = boot_control, importance = "permutation",
                 tuneGrid = data.frame(
                   alpha = 0.5,
                   lambda = seq(0.1, 0.7, 0.05)),
                 verbose = TRUE)
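
For reference: glmnet's alpha parameter mixes the two penalty types (alpha = 0 is pure ridge, alpha = 1 is pure lasso), so alpha = 0.5 is an even elastic-net blend, while lambda sets the overall penalty strength; the tuneGrid above fixes the blend and lets train() pick the strength.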

glm_predict_trained <- predict(glm_fit, train_df)
names(glm_predict_trained) <- rownames(train_df)
summary(glm_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
##    Mode    TRUE 
## logical     112
confusionMatrix(data = train_df[, 1], reference = glm_predict_trained)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      74       0
##    failure    0      38
##                                     
##                Accuracy : 1         
##                  95% CI : (0.968, 1)
##     No Information Rate : 0.661     
##     P-Value [Acc > NIR] : <2e-16    
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##              Prevalence : 0.661     
##          Detection Rate : 0.661     
##    Detection Prevalence : 0.661     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : cure      
## 

6.4.2 Now the GLM testers!

glm_predict_test <- predict(glm_fit, test_df)
names(glm_predict_test) <- rownames(test_df)
glm_predict_test
## TMRC30186 TMRC30253 TMRC30140 TMRC30176 TMRC30226 TMRC30227 TMRC30016 TMRC30231 TMRC30233 TMRC30018 TMRC30210 TMRC30213 TMRC30216 TMRC30214 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30274 TMRC30254 TMRC30277 TMRC30240 TMRC30280 TMRC30281 TMRC30284 TMRC30282 TMRC30285 TMRC30056 TMRC30105 TMRC30164 TMRC30080 TMRC30094 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure   failure   failure      cure   failure   failure 
## TMRC30119 TMRC30082 TMRC30103 TMRC30122 TMRC30022 TMRC30029 TMRC30083 TMRC30028 TMRC30026 TMRC30165 TMRC30044 TMRC30166 TMRC30195 TMRC30048 
##   failure      cure      cure   failure      cure      cure   failure      cure   failure      cure      cure      cure      cure   failure 
## TMRC30045 TMRC30055 TMRC30041 TMRC30139 TMRC30160 TMRC30183 TMRC30167 TMRC30078 TMRC30076 TMRC30159 TMRC30129 TMRC30088 TMRC30142 TMRC30168 
##      cure   failure      cure      cure      cure      cure      cure   failure   failure      cure      cure   failure      cure      cure 
## TMRC30146 TMRC30199 TMRC30200 TMRC30204 TMRC30177 TMRC30155 TMRC30206 TMRC30136 TMRC30238 TMRC30217 TMRC30208 TMRC30218 TMRC30220 TMRC30264 
##      cure   failure      cure   failure   failure      cure   failure      cure      cure   failure   failure   failure      cure      cure 
## TMRC30144 TMRC30265 
##      cure      cure 
## Levels: cure failure
glm_agreement <- glm_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(glm_agreement) <- rownames(test_df)
summary(glm_agreement)
##    Mode   FALSE    TRUE 
## logical       6      66
names(glm_agreement)[glm_agreement == FALSE]
## [1] "TMRC30080" "TMRC30200" "TMRC30238" "TMRC30220" "TMRC30264" "TMRC30265"

6.5 Gradient Booster
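
xgboost builds an ensemble of gradient-boosted trees. In the tuneGrid below I hold the tree shape and subsampling fixed and let train() choose among a few learning rates (eta).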

library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:IRanges':
## 
##     slice
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

gb_fit <- train(outcome_factor ~ ., data = train_df,
                method = "xgbTree", trControl = train_method,
                tuneGrid = data.frame(
                    nrounds = 200,
                    eta = c(0.05, 0.1, 0.3),
                    max_depth = 4,
                    gamma = 0,
                    colsample_bytree = 1,
                    subsample = 0.5,
                    min_child_weight = 1),
                verbose = TRUE)

gb_predict_trained <- predict(gb_fit, train_df)
names(gb_predict_trained) <- rownames(train_df)
summary(gb_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
##    Mode    TRUE 
## logical     112

6.5.1 Now the GB testers!

gb_predict_test <- predict(gb_fit, test_df)
names(gb_predict_test) <- rownames(test_df)
gb_predict_test
## TMRC30186 TMRC30253 TMRC30140 TMRC30176 TMRC30226 TMRC30227 TMRC30016 TMRC30231 TMRC30233 TMRC30018 TMRC30210 TMRC30213 TMRC30216 TMRC30214 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30274 TMRC30254 TMRC30277 TMRC30240 TMRC30280 TMRC30281 TMRC30284 TMRC30282 TMRC30285 TMRC30056 TMRC30105 TMRC30164 TMRC30080 TMRC30094 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure   failure   failure      cure   failure   failure 
## TMRC30119 TMRC30082 TMRC30103 TMRC30122 TMRC30022 TMRC30029 TMRC30083 TMRC30028 TMRC30026 TMRC30165 TMRC30044 TMRC30166 TMRC30195 TMRC30048 
##   failure      cure      cure   failure      cure      cure   failure      cure   failure      cure      cure   failure      cure   failure 
## TMRC30045 TMRC30055 TMRC30041 TMRC30139 TMRC30160 TMRC30183 TMRC30167 TMRC30078 TMRC30076 TMRC30159 TMRC30129 TMRC30088 TMRC30142 TMRC30168 
##      cure   failure      cure      cure      cure      cure      cure   failure   failure      cure      cure   failure      cure      cure 
## TMRC30146 TMRC30199 TMRC30200 TMRC30204 TMRC30177 TMRC30155 TMRC30206 TMRC30136 TMRC30238 TMRC30217 TMRC30208 TMRC30218 TMRC30220 TMRC30264 
##      cure   failure   failure   failure   failure      cure   failure      cure      cure   failure   failure   failure      cure      cure 
## TMRC30144 TMRC30265 
##      cure      cure 
## Levels: cure failure
gb_agreement <- gb_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(gb_agreement) <- rownames(test_df)
summary(gb_agreement)
##    Mode   FALSE    TRUE 
## logical       6      66
names(gb_agreement)[gb_agreement == FALSE]
## [1] "TMRC30080" "TMRC30166" "TMRC30238" "TMRC30220" "TMRC30264" "TMRC30265"

6.6 SVM
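
kernlab's svmRadial method fits a support vector machine with a radial basis kernel; the tuneGrid below sweeps the cost parameter C while holding the kernel width sigma fixed.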

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## The following object is masked from 'package:BiocGenerics':
## 
##     type
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

svm_fit <- train(outcome_factor ~ ., data = train_df,
                method = "svmRadial", trControl = train_method,
                tuneGrid = data.frame(
                    C = c(0.25, 0.5, 1),
                    sigma = 1),
                verbose = TRUE)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
## (this warning repeated for every resample/tuning combination)
svm_predict_trained <- predict(svm_fit, train_df)
names(svm_predict_trained) <- rownames(train_df)
svm_predict_trained
## TMRC30156 TMRC30185 TMRC30178 TMRC30179 TMRC30221 TMRC30222 TMRC30223 TMRC30224 TMRC30269 TMRC30148 TMRC30149 TMRC30150 TMRC30138 TMRC30153 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30151 TMRC30234 TMRC30235 TMRC30270 TMRC30225 TMRC30228 TMRC30229 TMRC30230 TMRC30017 TMRC30232 TMRC30209 TMRC30211 TMRC30212 TMRC30215 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30271 TMRC30273 TMRC30275 TMRC30272 TMRC30276 TMRC30255 TMRC30256 TMRC30239 TMRC30278 TMRC30279 TMRC30257 TMRC30019 TMRC30258 TMRC30283 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30071 TMRC30020 TMRC30113 TMRC30058 TMRC30169 TMRC30093 TMRC30107 TMRC30170 TMRC30032 TMRC30096 TMRC30115 TMRC30118 TMRC30180 TMRC30014 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30121 TMRC30196 TMRC30030 TMRC30021 TMRC30037 TMRC30031 TMRC30027 TMRC30194 TMRC30054 TMRC30046 TMRC30070 TMRC30049 TMRC30047 TMRC30191 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30053 TMRC30068 TMRC30171 TMRC30192 TMRC30042 TMRC30158 TMRC30132 TMRC30157 TMRC30123 TMRC30181 TMRC30072 TMRC30133 TMRC30043 TMRC30116 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30184 TMRC30172 TMRC30134 TMRC30174 TMRC30137 TMRC30161 TMRC30175 TMRC30145 TMRC30143 TMRC30197 TMRC30182 TMRC30198 TMRC30201 TMRC30203 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30202 TMRC30205 TMRC30152 TMRC30154 TMRC30241 TMRC30237 TMRC30207 TMRC30074 TMRC30077 TMRC30219 TMRC30079 TMRC30135 TMRC30173 TMRC30147 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## Levels: cure failure
summary(svm_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
##    Mode   FALSE    TRUE 
## logical      38      74

6.6.1 Now the SVM testers!

svm_predict_test <- predict(svm_fit, test_df)
names(svm_predict_test) <- rownames(test_df)
svm_predict_test
## TMRC30186 TMRC30253 TMRC30140 TMRC30176 TMRC30226 TMRC30227 TMRC30016 TMRC30231 TMRC30233 TMRC30018 TMRC30210 TMRC30213 TMRC30216 TMRC30214 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30274 TMRC30254 TMRC30277 TMRC30240 TMRC30280 TMRC30281 TMRC30284 TMRC30282 TMRC30285 TMRC30056 TMRC30105 TMRC30164 TMRC30080 TMRC30094 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30119 TMRC30082 TMRC30103 TMRC30122 TMRC30022 TMRC30029 TMRC30083 TMRC30028 TMRC30026 TMRC30165 TMRC30044 TMRC30166 TMRC30195 TMRC30048 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30045 TMRC30055 TMRC30041 TMRC30139 TMRC30160 TMRC30183 TMRC30167 TMRC30078 TMRC30076 TMRC30159 TMRC30129 TMRC30088 TMRC30142 TMRC30168 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30146 TMRC30199 TMRC30200 TMRC30204 TMRC30177 TMRC30155 TMRC30206 TMRC30136 TMRC30238 TMRC30217 TMRC30208 TMRC30218 TMRC30220 TMRC30264 
##      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure      cure 
## TMRC30144 TMRC30265 
##      cure      cure 
## Levels: cure failure
svm_agreement <- svm_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(svm_agreement) <- rownames(test_df)
summary(svm_agreement)
##    Mode   FALSE    TRUE 
## logical      24      48
names(svm_agreement)[svm_agreement == FALSE]
##  [1] "TMRC30056" "TMRC30105" "TMRC30094" "TMRC30119" "TMRC30122" "TMRC30083" "TMRC30026" "TMRC30048" "TMRC30055" "TMRC30078" "TMRC30076"
## [12] "TMRC30088" "TMRC30199" "TMRC30200" "TMRC30204" "TMRC30177" "TMRC30206" "TMRC30238" "TMRC30217" "TMRC30208" "TMRC30218" "TMRC30220"
## [23] "TMRC30264" "TMRC30265"

7 Reapply to persistence
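
This section reuses the pipeline from above, but classifies the persistence column (Y/N) rather than cure/fail; samples lacking a Y/N value are dropped first.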

ref_col <- "persistence"
outcome_factor <- as.factor(as.character(pData(tc_clinical)[[ref_col]]))

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
only_yn <- ml_df[["outcome_factor"]] == "Y" | ml_df[["outcome_factor"]] == "N"
outcome_factor <- as.factor(as.character(outcome_factor[only_yn]))


ml_df <- ml_df[only_yn, ]
ml_df[["outcome_factor"]] <- as.factor(as.character(ml_df[["outcome_factor"]]))
summary(ml_df[["outcome_factor"]])
##   N   Y 
##  55 101
train_idx <- createDataPartition(outcome_factor, p = 0.5,
                                 list = FALSE,
                                 times = 1)
summary(ml_df[train_idx, "outcome_factor"])
##  N  Y 
## 28 51
## Training set has 79 samples.

train_rownames <- rownames(ml_df)[train_idx]
not_train_idx <- ! rownames(ml_df) %in% train_rownames
summary(ml_df[not_train_idx, "outcome_factor"])
##  N  Y 
## 27 50
not_train_rownames <- rownames(ml_df)[not_train_idx]

train_df <- as.data.frame(ml_df[train_rownames, ])
dim(train_df)
## [1]    79 12081
test_df <- as.data.frame(ml_df[not_train_rownames, ])
dim(test_df)
## [1]    77 12081
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL

train_fit <- train(outcome_factor ~ ., data = train_df,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
train_fit[["finalModel"]][["prediction.error"]]
## [1] 0.2532
variable_importance <- varImp(train_fit)
plot(variable_importance, top = 15)

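## Note: the predict() calls below mistakenly reuse rf_train_fit (the
## cure/fail model from section 6.3) rather than the new train_fit; its
## cure/failure labels cannot overlap the Y/N persistence levels, which is
## why confusionMatrix() errors out and the agreement below is uniformly FALSE.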
rf_predict_trained <- predict(rf_train_fit, train_df)
confusionMatrix(data = train_df[[1]], reference = rf_predict_trained)
## Error in confusionMatrix.default(data = train_df[[1]], reference = rf_predict_trained): The data must contain some levels that overlap the reference.
rf_predict_test <- predict(rf_train_fit, test_df)
names(rf_predict_test) <- rownames(test_df)
used_names <- names(rf_predict_test)
used_pdata <- pData(tc_norm)[used_names, ]

rf_agreement <- as.character(rf_predict_test) == used_pdata[[ref_col]]
names(rf_agreement) <- rownames(test_df)
summary(rf_agreement)
##    Mode   FALSE 
## logical      77
names(rf_agreement)[rf_agreement == FALSE]
##  [1] "TMRC30156" "TMRC30178" "TMRC30223" "TMRC30138" "TMRC30153" "TMRC30151" "TMRC30234" "TMRC30235" "TMRC30227" "TMRC30231" "TMRC30232"
## [12] "TMRC30233" "TMRC30209" "TMRC30216" "TMRC30214" "TMRC30273" "TMRC30275" "TMRC30276" "TMRC30255" "TMRC30256" "TMRC30240" "TMRC30279"
## [23] "TMRC30280" "TMRC30283" "TMRC30285" "TMRC30071" "TMRC30056" "TMRC30105" "TMRC30058" "TMRC30103" "TMRC30122" "TMRC30107" "TMRC30096"
## [34] "TMRC30115" "TMRC30121" "TMRC30166" "TMRC30048" "TMRC30054" "TMRC30070" "TMRC30049" "TMRC30055" "TMRC30053" "TMRC30068" "TMRC30158"
## [45] "TMRC30132" "TMRC30157" "TMRC30123" "TMRC30181" "TMRC30133" "TMRC30184" "TMRC30159" "TMRC30129" "TMRC30172" "TMRC30137" "TMRC30142"
## [56] "TMRC30143" "TMRC30197" "TMRC30182" "TMRC30203" "TMRC30202" "TMRC30205" "TMRC30177" "TMRC30154" "TMRC30241" "TMRC30237" "TMRC30206"
## [67] "TMRC30238" "TMRC30074" "TMRC30217" "TMRC30208" "TMRC30219" "TMRC30218" "TMRC30220" "TMRC30173" "TMRC30264" "TMRC30147" "TMRC30265"

8 Try something a little different

I redid the mappings with a combined host/parasite transcriptome in the hope that it might provide some interesting variance for these classifiers. Let us poke at it and see if anything is relevant, or if it was a bad idea. (Note: the hslp_gene_expt object was not available when this document was rendered, so the chunks below error out.)

wanted_idx <- grepl(x = rownames(exprs(hslp_gene_expt)), pattern = "^LP")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'grepl': error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'hslp_gene_expt' not found
wanted_ids <- rownames(exprs(hslp_gene_expt))[wanted_idx]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'hslp_gene_expt' not found
test_lp <- exclude_genes_expt(hslp_gene_expt, ids = wanted_ids, method = "keep")
## Note, I renamed this to subset_genes().
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'fData': object 'hslp_gene_expt' not found
tt <- plot_libsize(test_lp)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'plot_libsize': object 'test_lp' not found
tt$plot
## NULL

I am going to mostly copy/paste the code from above, changing the inputs to fit the new combined human/parasite data structure.

ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(hslp_gene_expt)[[ref_col]]))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.factor': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hslp_gene_expt' not found
hslp_norm <- normalize_expt(hslp_gene_expt, transform = "log2", convert = "cpm")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'state': object 'hslp_gene_expt' not found
hslp_norm <- normalize_expt(hslp_norm, filter = "cv", cv_min = 0.1)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'state': object 'hslp_norm' not found
retained_lp <- grepl(x = rownames(exprs(hslp_norm)), pattern = "^LP")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'grepl': error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'hslp_norm' not found
sum(retained_lp)
## Error in eval(expr, envir, enclos): object 'retained_lp' not found
nzv_thslp <- t(exprs(hslp_norm))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 't': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'hslp_norm' not found
nzv_hslp_center <- preProcess(nzv_thslp, method = "center")
## Error in preProcess(nzv_thslp, method = "center"): object 'nzv_thslp' not found
nzv_thslp <- predict(nzv_hslp_center, nzv_thslp)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'predict': object 'nzv_hslp_center' not found
nzv_thslp_correlated <- preProcess(nzv_thslp, method = "corr", cutoff = 0.95)
## Error in preProcess(nzv_thslp, method = "corr", cutoff = 0.95): object 'nzv_thslp' not found
nzv_thslp_uncorr <- predict(nzv_thslp_correlated, nzv_thslp)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'predict': object 'nzv_thslp_correlated' not found
interesting_meta <- pData(hslp_norm)[, c("finaloutcome", "donor",
                                         "visitnumber", "selectionmethod",
                                         "typeofcells", "time", "clinic")]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hslp_norm' not found
hslp_ml_df <- as.data.frame(cbind(outcome_factor,
                                  as.data.frame(nzv_thslp_uncorr)))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'nzv_thslp_uncorr' not found
hslp_ml_df[["outcome_factor"]] <- as.factor(hslp_ml_df[["outcome_factor"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.factor': object 'hslp_ml_df' not found
dim(hslp_ml_df)
## Error in eval(expr, envir, enclos): object 'hslp_ml_df' not found

9 Split the data for training/testing

## The variable outcome_factor was created at the top of this document,
## which is a little awkward here.
train_idx <- createDataPartition(outcome_factor, p = 0.6,
                                 list = FALSE,
                                 times = 1)
summary(hslp_ml_df[train_idx, "outcome_factor"])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': object 'hslp_ml_df' not found

train_rownames <- rownames(hslp_ml_df)[train_idx]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'hslp_ml_df' not found
not_train_idx <- ! rownames(hslp_ml_df) %in% train_rownames
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function '%in%': error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'hslp_ml_df' not found
summary(hslp_ml_df[not_train_idx, "outcome_factor"])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': object 'hslp_ml_df' not found
not_train_rownames <- rownames(hslp_ml_df)[not_train_idx]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'hslp_ml_df' not found
train_df <- as.data.frame(hslp_ml_df[train_rownames, ])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'hslp_ml_df' not found
dim(train_df)
## [1]    79 12081
test_df <- as.data.frame(hslp_ml_df[not_train_rownames, ])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'hslp_ml_df' not found
dim(test_df)
## [1]    77 12080
## (These dims are the leftover train/test frames from the persistence
## section; the hslp assignments above failed, so they were never replaced.)
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL
boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  Y
##          N 24  4
##          Y  4 47
##                                        
##                Accuracy : 0.899        
##                  95% CI : (0.81, 0.955)
##     No Information Rate : 0.646        
##     P-Value [Acc > NIR] : 2.59e-07     
##                                        
##                   Kappa : 0.779        
##                                        
##  Mcnemar's Test P-Value : 1            
##                                        
##             Sensitivity : 0.857        
##             Specificity : 0.922        
##          Pos Pred Value : 0.857        
##          Neg Pred Value : 0.922        
##              Prevalence : 0.354        
##          Detection Rate : 0.304        
##    Detection Prevalence : 0.354        
##       Balanced Accuracy : 0.889        
##                                        
##        'Positive' Class : N            
## 
names(knn_predict_trained) <- rownames(train_df)
summary(knn_predict_trained == pData(hslp_gene_expt)[train_idx, "finaloutcome"])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hslp_gene_expt' not found
knn_predict_test <- predict(knn_fit, test_df, type = "class")
## and the values, which will be used for our ROC curve.
knn_predict_test_values <- predict(knn_fit, test_df)

names(knn_predict_test) <- rownames(test_df)
knn_agreement <- knn_predict_test == test_outcome
names(knn_agreement) <- rownames(test_df)
## Error in names(knn_agreement) <- rownames(test_df): 'names' attribute [77] must be the same length as the vector [0]
## (test_outcome was NULL here: the leftover test_df had already had its
## outcome column removed, so the comparison returned logical(0).)
summary(knn_agreement)
##    Mode 
## logical
names(knn_agreement)[knn_agreement == FALSE]
## NULL
cv_control <- trainControl(method = "cv", number = 10)
knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]
##   k
## 3 3
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")

---
title: "TMRC3 ML Classification of outcome: 202302"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: tango
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include = FALSE}
library(hpgltools)
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(progress = TRUE,
                     verbose = TRUE,
                     width = 90,
                     echo = TRUE)
knitr::opts_chunk$set(error = TRUE,
                      fig.width = 8,
                      fig.height = 8,
                      dpi = 96)
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "202301"
rundate <- format(Sys.Date(), format = "%Y%m%d")

## tmp <- try(sm(loadme(filename = gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = previous_file))))
rmd_file <- glue::glue("tmrc3_classifier_{ver}.Rmd")
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
loaded <- load(file = glue::glue("rda/tmrc3_data_structures-v{ver}.rda"))
library(caret)
library(pROC)
library(DALEX)
```

# Introduction

I had some success in classifying the TMRC2 samples by strain via ML
and want to try something more difficult.  Thus I will use the
normalized gene expression data and try classifying it by cure/fail.

```{r classify_cf}
ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(tc_clinical)[[ref_col]]))
```

# Starter data

In the strain classifier I used normalized variants.  I am thinking to
use normalized expression here and therefore explicitly limit myself
to ~ 20k variables (significantly down from the 1.6M).

In addition, caret expects the data as (rows == samples) and
(columns == variables) where each element is one observation.
Thus we will need to transpose the expression matrix.

```{r starter}
tc_norm <- normalize_expt(tc_clinical, transform = "log2", convert = "cpm")
texprs <- t(exprs(tc_norm))
```

# Filtering

The ML text I am reading provides some neat examples of how one might
filter the data to make it more suitable for model creation.

## Near zero variance, or genefilter's cv

The first filter I was introduced to is quite familiar from our
sequencing data, the removal of features with near-zero-variance.
Indeed, I am pretty certain that normalize_expt() could do this
equivalently and significantly faster than caret::preProcess().

```{r nzv}
system.time({
  equivalent <- normalize_expt(tc_norm, filter = "cv", cv_min = 0.1)
})
dim(exprs(equivalent))

## Given this large amount of data, this step is slow, taking > 10 minutes.
## Yeah seriously, the following three lines get 16,723 genes in 10 minutes while
## the normalize_expt() call above gets 16,749 genes in 2.4 seconds.
#system.time({
#  nzv <- preProcess(texprs, method="nzv", uniqueCut=15)
#  nzv_texprs <- predict(nzv, texprs)
#  dim(nzv_texprs)
#}
nzv_texprs <- t(exprs(equivalent))
```

## Filtering to the highest standard deviation variables

I think I am a bit confused by this filter; one would think that the
nzv filter above, if applied correctly, should give exactly this.

For the moment, I am excluding the following block in order to see how
much time/memory keeping these variables costs.  If I recall properly,
the model of the top-2k variant positions cost ~ 1-4G of memory.  I
hope that this scales linearly, but I am thinking it might not.

```{r exclude_sds, eval=FALSE}
standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:3000]
nzv_texprs <- nzv_texprs[, top_predictors]
```

## Center the data

I think centering may not be needed for this data, but here is how:

```{r center}
nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)
```

## Drop correlated

This is a filter which does not correspond to any of those we use in
sequencing data because genes which are highly correlated are
likely to be of immediate interest.

In the same fashion, I want to leave this off because later
applications of this model will include low coverage samples which may
not have every variant represented.

```{r correlated}
## This step takes a while...
system.time({
  nzv_correlated <- preProcess(nzv_texprs, method = "corr", cutoff = 0.95)
  nzv_uncorr <- predict(nzv_correlated, nzv_texprs)
})
dim(nzv_uncorr)
```

# Merge the appropriate metadata

There are a few metadata factors which might prove of interest for
classification.  The most obvious are of course outcome, clinic,
donor, visit, celltype.  I am, for the moment, only likely to focus on
outcome.  AFAICT I can only include one of these at a time in the
data, which is a shame.

```{r merge_pieces}
interesting_meta <- pData(tc_clinical)[, c("finaloutcome", "donor", "persistence",
                                           "visitnumber", "selectionmethod",
                                           "typeofcells", "time", "clinic")]

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
dim(ml_df)
```

# Split the data into training/testing

caret provides nice functionality for splitting up the data.  I
suspect there are many more fun knobs I can play with for instances
where I need to exclude some levels of a factor and such.  In this
case I just want to split by outcome.

## Via data splitting

```{r split_train_test}
## The variable outcome_factor was created at the top of this document,
## which is a little awkward here.
train_idx <- createDataPartition(outcome_factor, p = 0.6,
                                 list = FALSE,
                                 times = 1)
summary(ml_df[train_idx, "outcome_factor"])
## Training set has 112 samples.

train_rownames <- rownames(ml_df)[train_idx]
not_train_idx <- ! rownames(ml_df) %in% train_rownames
summary(ml_df[not_train_idx, "outcome_factor"])
not_train_rownames <- rownames(ml_df)[not_train_idx]

train_df <- as.data.frame(ml_df[train_rownames, ])
dim(train_df)
test_df <- as.data.frame(ml_df[not_train_rownames, ])
dim(test_df)
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL
```

## Via sampling

There are a few likely sampling methods: cross-validation,
bootstrapping, and jackknifing.  I will try those out later; a quick
sketch of the corresponding trainControl() calls follows.
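
All three map onto caret's trainControl().  Here is a minimal sketch
(not evaluated in this document), using leave-one-out CV as the
closest analog of the jackknife:

```{r sampling_sketch, eval=FALSE}
## 10-fold cross-validation.
cv_ctl <- trainControl(method = "cv", number = 10)
## 20 bootstrap resamples (sampling with replacement).
boot_ctl <- trainControl(method = "boot", number = 20)
## Leave-one-out CV, i.e. the jackknife analog.
loo_ctl <- trainControl(method = "LOOCV")
```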

# Try out training and prediction methods

My goals from here on will be to get the beginnings of a sense of the
various methods I can use to create the models from the training data
and predict the outcome on the test data.  I am hoping also to pick up
some idea of what the various arguments mean while I am at it.

## Try out KNN

k-nearest neighbors is somewhat similar to a k-means estimate; thus
the primary argument is 'k'.

### Model creation and performance

```{r knn}
knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)
```

As the confusion matrix shows, this failed for a few samples.  Let us
change k and see if it improves.

Here is a table of false positives/negatives for a few values of 'k';
in this context a false positive is calling a known cure a failure
and a false negative is calling a known failure a cure.

|k  |fp |fn |
|---|---|---|
|2  |0  |8  |
|3  |5  |5  |
|4  |8  |9  |
|5  |11 |7  |
|6  |15 |8  |

Note: this depends on the luck of the random seed, so the above
numbers shift moderately from one run to the next.  Thus I think I
will just use 2 or 3.  A sketch for regenerating this table follows.
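
```{r knn_k_scan, eval=FALSE}
## A minimal sketch (not run above) to regenerate the fp/fn table: refit
## knn3 for a few values of k and print the training confusion tables;
## the off-diagonal cells are the fp/fn counts.
for (k in 2:6) {
  fit <- knn3(x = train_df[, -1], y = train_df[["outcome_factor"]], k = k)
  pred <- predict(fit, train_df[, -1], type = "class")
  print(table(known = train_df[["outcome_factor"]], predicted = pred))
}
```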

```{r knn_fit2}
knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)
```

### Visualize the accuracy with a ROC curve

```{r knn_trained}
names(knn_predict_trained) <- rownames(train_df)
summary(knn_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])

tt <- predict(knn_fit, train_df[, -1])
curve <- pROC::roc(response = train_df[, 1], predictor = tt[, 1],
                   levels = rev(levels(outcome_factor)))
plot(curve)
pROC::auc(curve)
```

### Predict the rest of the data with this model

```{r knn_test}
## Return the classifications
knn_predict_test <- predict(knn_fit, test_df, type = "class")
## and the values, which will be used for our ROC curve.
knn_predict_test_values <- predict(knn_fit, test_df)

names(knn_predict_test) <- rownames(test_df)
knn_agreement <- knn_predict_test == test_outcome
names(knn_agreement) <- rownames(test_df)
summary(knn_agreement)
names(knn_agreement)[knn_agreement == FALSE]

curve <- pROC::roc(response = test_outcome,
                   predictor = knn_predict_test_values[, 1])
plot(curve)
pROC::auc(curve)

## Find the samples which (dis)agree
ref_col <- "finaloutcome"
ref_result <- pData(tc_clinical)[[ref_col]]
names(ref_result) <- rownames(pData(tc_clinical))
ref_xref <- names(ref_result) %in% names(knn_predict_test)
ref_comp <- ref_result[ref_xref]
ref_comp_idx <- ref_comp != "unknown"
ref_comp <- ref_comp[ref_comp_idx]
ref_comp_idx <- !is.na(ref_comp)
ref_comp <- ref_comp[ref_comp_idx]

test_comp_idx <- names(knn_predict_test) %in% names(ref_comp)
test_comp <- knn_predict_test[test_comp_idx]

agreement <- as.character(ref_comp) == as.character(test_comp)
agree_num <- sum(agreement)
agree_prop <- agree_num / length(ref_comp)
agree_prop

disagree <- as.character(ref_comp) != as.character(test_comp)
test_comp[disagree]
```

## Perform cross-validation to estimate k

The cross-validation method of repeatedly resampling the data is
handled entirely within the train() function.  With that in mind, here
it is operating with the knn method.

### CV with knn

When train() is called with trControl and tuneGrid, we can control how
the knn training is repeated; in this case it will iterate over k from
1 to 10.

```{r cv_knn}
cv_control <- trainControl(method = "cv", number = 10)
knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]

plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")
```

### Bootstrap with knn
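
The bootstrap repeatedly samples the training data with replacement;
method = "boot" with number = 20 repeats that 20 times, and
returnResamp = "all" keeps the results from every resample.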

```{r boot_knn}
boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = boot_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]

plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")
```

### Explain the important variables

In this instance we will search for genes which were important for the
model's creation.

The DALEX package provides a function, feature_importance(), which
uses a series of other methods to extract features (in this case,
genes) which do a good job of explaining the result produced by the
model.  In the case of this dataset, with thousands of features, this
does not appear to end well.

```{r explain_knn}
explainer_knn <- explain(knn_fit, label = "knn",
                         data = train_df[, -1],
                         y = as.numeric(train_df[, 1]))

## AFAICT the following will take forever unless we drastically reduce the complexity of the model.
## features <- feature_importance(explainer_knn, n_sample = 50, type = "difference")
```

## Random Forest

The parameter 'mtry' is often important; if I read the text correctly,
it controls how many variables to sample in each split of the tree.
Thus higher numbers should presumably make it more specific at the
risk of overfitting.

min.node.size sets the minimum size of terminal nodes in each tree;
each increment up speeds the algorithm.

I am going to use my boot control trainer from above and see how it goes.
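
Before fixing mtry at a single value, here is a sketch of what
actually tuning it might look like; the grid values are arbitrary
guesses on my part, not recommendations.

```{r rf_grid_sketch, eval=FALSE}
## Let train() pick among a few mtry/min.node.size combinations using
## the bootstrap controller defined above.
rf_grid <- expand.grid(mtry = c(50, 200, 1000),
                       min.node.size = c(1, 5),
                       splitrule = "gini")
rf_grid_fit <- train(outcome_factor ~ ., data = train_df,
                     method = "ranger", trControl = boot_control,
                     importance = "permutation", tuneGrid = rf_grid)
rf_grid_fit[["bestTune"]]
```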

```{r rf_test}
library(ranger)
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)

rf_train_fit <- train(outcome_factor ~ ., data = train_df,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
rf_train_fit[["finalModel"]][["prediction.error"]]
variable_importance <- varImp(rf_train_fit)
plot(variable_importance, top = 15)

rf_predict_trained <- predict(rf_train_fit, train_df)
names(rf_predict_trained) <- rownames(train_df)
summary(rf_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
```

#### Digression: do the genes provided by varImp() mean anything?

Let us take a moment and see if the top-n genes returned by varImp()
have some meaning which jumps out.  One might assume, given our extant
Differential Expression results, that the interleukin response will be
a likely candidate.

```{r varimp_gp}
test_importance <- varImp(rf_train_fit, scale = TRUE)
test_idx <- order(test_importance[["importance"]][["Overall"]], decreasing = TRUE)
test_imp <- test_importance[["importance"]][test_idx, ]
names(test_imp) <- rownames(test_importance[["importance"]])[test_idx]
head(test_imp, n = 200)
most_important_ids <- names(head(test_imp, n = 200))
convert_ids(most_important_ids, from = "ENSEMBL", to = "SYMBOL")

importance_gp <- simple_gprofiler(names(head(test_imp, n = 60)))
importance_gp$pvalue_plots$BP
```

### Now the random forest testers!

```{r rf_predict_test}
rf_predict_test <- predict(rf_train_fit, test_df)
names(rf_predict_test) <- rownames(test_df)
rf_agreement <- rf_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(rf_agreement) <- rownames(test_df)
summary(rf_agreement)
names(rf_agreement)[rf_agreement == FALSE]
```

## GLM, or Logistic regression and regularization

Logistic regression is a statistical method for binary responses.
However, it is able to work with multiple classes as well.  The
general idea of this method is to find parameters which increase the
likelihood that the observed data is sampled from a statistical
distribution of interest.  The transformations and linear
regression-esque tasks performed are confusing, but once those are
performed, the task becomes setting the model's (fitting) parameters
to values which increase the probability that the statistical model
looks like the actual dataset given the training data, and that when
samples, will return values which are similar.  The most likely
statistical distributions one will want to fit are the Gaussian, in
which case we want to transform/normalize the mean/variance of our
variables so they look whatever normal distribution we are using.
Conversely, logistic regression uses a binnomial distribution (like
our raw sequencing data!) but which is from 0-1.
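
Concretely, logistic regression models the probability of one class as
the sigmoid of a linear predictor:

$$ p(\mathrm{cure} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}} $$

and fitting amounts to choosing the $\beta$ values which maximize the
likelihood of the observed outcomes.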

### Using a single gene

Let us take the most important gene observed in one of our previous
training sets: ENSG00000248405 PRR5-ARHGAP8

```{r glm_regression}
gene_id <- "ENSG00000248405"
single_fit <- train(
    outcome_factor ~ ENSG00000248405, data = train_df,
    method = "glm", family = "binomial", trControl = trainControl("none"))

tt <- data.frame("ENSG00000248405" = seq(min(train_df[[gene_id]]),
                                         max(train_df[[gene_id]]), len = 100))
## predict probabilities for the simulated data
tt$subtype <- predict(single_fit, newdata = tt, type = "prob")[, 1]
## plot the sigmoid curve and the training data
plot(ifelse(outcome_factor == "cure", 1, 0) ~ ENSG00000248405,
     data = train_df, col = "red4",
     ylab = "CF as 0 or 1", xlab = "favorite gene expression")
lines(subtype ~ ENSG00000248405, tt, col = "green4", lwd = 2)
```

Having tried with 1 gene, let us extend this to all genes.  In my
first try of this, it took a long time.

```{r glm_all}
glm_fit <- train(outcome_factor ~ ., data = train_df,
                 trControl = boot_control,
                 method = "glm", family = "binomial")
```
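
glmnet adds regularization to the GLM: alpha mixes the ridge (0) and
lasso (1) penalties, so the 0.5 below is an even elastic net, while
lambda scales the overall penalty strength; the grid holds alpha fixed
and scans lambda.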

```{r glm_test}
library(glmnet)
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
## train_method <- trainControl(method = "cv", number = 10)
glm_fit <- train(outcome_factor ~ ., data = train_df, method = "glmnet",
                 trControl = boot_control, importance = "permutation",
                 tuneGrid = data.frame(
                   alpha = 0.5,
                   lambda = seq(0.1, 0.7, 0.05)),
                 verbose = TRUE)

glm_predict_trained <- predict(glm_fit, train_df)
names(glm_predict_trained) <- rownames(train_df)
summary(glm_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
confusionMatrix(data = train_df[, 1], reference = glm_predict_trained)
```

### Now the GLM testers!

```{r glm_predict_test}
glm_predict_test <- predict(glm_fit, test_df)
names(glm_predict_test) <- rownames(test_df)
glm_predict_test
glm_agreement <- glm_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(glm_agreement) <- rownames(test_df)
summary(glm_agreement)
names(glm_agreement)[glm_agreement == FALSE]
```

## Gradient Booster
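
A quick note on the xgboost knobs used below: nrounds is the number of
boosting iterations, eta the learning rate (the one value scanned
here), max_depth the per-tree depth, and subsample the fraction of
samples drawn for each tree.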

```{r gb_test}
library(xgboost)
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

gb_fit <- train(outcome_factor ~ ., data = train_df,
                method = "xgbTree", trControl = train_method,
                tuneGrid = data.frame(
                    nrounds = 200,
                    eta = c(0.05, 0.1, 0.3),
                    max_depth = 4,
                    gamma = 0,
                    colsample_bytree = 1,
                    subsample = 0.5,
                    min_child_weight = 1),
                verbose = TRUE)

gb_predict_trained <- predict(gb_fit, train_df)
names(gb_predict_trained) <- rownames(train_df)
summary(gb_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
```

### Now the GB testers!

```{r gb_predict_test}
gb_predict_test <- predict(gb_fit, test_df)
names(gb_predict_test) <- rownames(test_df)
gb_predict_test
gb_agreement <- gb_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(gb_agreement) <- rownames(test_df)
summary(gb_agreement)
names(gb_agreement)[gb_agreement == FALSE]
```

## SVM
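
For the radial-basis SVM below, C is the cost of constraint violations
(the value scanned here) and sigma the width of the RBF kernel.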

```{r svm_test}
library(kernlab)
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

svm_fit <- train(outcome_factor ~ ., data = train_df,
                method = "svmRadial", trControl = train_method,
                tuneGrid = data.frame(
                    C = c(0.25, 0.5, 1),
                    sigma = 1),
                verbose = TRUE)

svm_predict_trained <- predict(svm_fit, train_df)
names(svm_predict_trained) <- rownames(train_df)
svm_predict_trained
summary(svm_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
```

### Now the SVM testers!

```{r svm_predict_test}
svm_predict_test <- predict(svm_fit, test_df)
names(svm_predict_test) <- rownames(test_df)
svm_predict_test
svm_agreement <- svm_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(svm_agreement) <- rownames(test_df)
summary(svm_agreement)
names(svm_agreement)[svm_agreement == FALSE]
```

# Reapply to persistence
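
This reuses the random forest workflow from above, but with
persistence (Y/N) as the label rather than final outcome; samples
without a Y/N persistence call are dropped first.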

```{r persistence}
ref_col <- "persistence"
outcome_factor <- as.factor(as.character(pData(tc_clinical)[[ref_col]]))

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
only_yn <- ml_df[["outcome_factor"]] == "Y" | ml_df[["outcome_factor"]] == "N"
outcome_factor <- as.factor(as.character(outcome_factor[only_yn]))

ml_df <- ml_df[only_yn, ]
ml_df[["outcome_factor"]] <- as.factor(as.character(ml_df[["outcome_factor"]]))
summary(ml_df[["outcome_factor"]])

train_idx <- createDataPartition(outcome_factor, p = 0.5,
                                 list = FALSE,
                                 times = 1)
summary(ml_df[train_idx, "outcome_factor"])
## Training set has 79 samples here.

train_rownames <- rownames(ml_df)[train_idx]
not_train_idx <- ! rownames(ml_df) %in% train_rownames
summary(ml_df[not_train_idx, "outcome_factor"])
not_train_rownames <- rownames(ml_df)[not_train_idx]

train_df <- as.data.frame(ml_df[train_rownames, ])
dim(train_df)
test_df <- as.data.frame(ml_df[not_train_rownames, ])
dim(test_df)
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL

train_fit <- train(outcome_factor ~ ., data = train_df,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
train_fit[["finalModel"]][["prediction.error"]]
variable_importance <- varImp(train_fit)
plot(variable_importance, top = 15)

## Use the persistence fit created above ('train_fit'), not the earlier
## cure/fail model 'rf_train_fit'; mixing them is what produced the
## level-overlap error recorded in the rendered output.
rf_predict_trained <- predict(train_fit, train_df)
confusionMatrix(data = train_df[[1]], reference = rf_predict_trained)

rf_predict_test <- predict(train_fit, test_df)
names(rf_predict_test) <- rownames(test_df)
used_names <- names(rf_predict_test)
used_pdata <- pData(tc_norm)[used_names, ]

rf_agreement <- as.character(rf_predict_test) == used_pdata[[ref_col]]
names(rf_agreement) <- rownames(test_df)
summary(rf_agreement)
names(rf_agreement)[rf_agreement == FALSE]
```

# Try something a little different

I redid the mappings with a combined host/parasite transcriptome in
the hopes that it might provide some interesting variance for these
classifiers.  Let us poke at it and see whether anything is relevant
or whether it was a bad idea.

```{r poke_combined}
wanted_idx <- grepl(x = rownames(exprs(hslp_gene_expt)), pattern = "^LP")
wanted_ids <- rownames(exprs(hslp_gene_expt))[wanted_idx]
test_lp <- exclude_genes_expt(hslp_gene_expt, ids = wanted_ids, method = "keep")

tt <- plot_libsize(test_lp)
tt$plot
```

I am going to mostly copy/paste the code from above, changing the
inputs to fit the new combined human/parasite data structure.

```{r classify_hpcf}
ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(hslp_gene_expt)[[ref_col]]))

hslp_norm <- normalize_expt(hslp_gene_expt, transform = "log2", convert = "cpm")
hslp_norm <- normalize_expt(hslp_norm, filter = "cv", cv_min = 0.1)
retained_lp <- grepl(x = rownames(exprs(hslp_norm)), pattern = "^LP")
sum(retained_lp)

nzv_thslp <- t(exprs(hslp_norm))
nzv_hslp_center <- preProcess(nzv_thslp, method = "center")
nzv_thslp <- predict(nzv_hslp_center, nzv_thslp)

nzv_thslp_correlated <- preProcess(nzv_thslp, method = "corr", cutoff = 0.95)
nzv_thslp_uncorr <- predict(nzv_thslp_correlated, nzv_thslp)

interesting_meta <- pData(hslp_norm)[, c("finaloutcome", "donor",
                                         "visitnumber", "selectionmethod",
                                         "typeofcells", "time", "clinic")]

hslp_ml_df <- as.data.frame(cbind(outcome_factor,
                                  as.data.frame(nzv_thslp_uncorr)))
hslp_ml_df[["outcome_factor"]] <- as.factor(hslp_ml_df[["outcome_factor"]])
dim(hslp_ml_df)
```

# Split the data for training/testing

```{r split_train_test_hslp}
## The variable outcome_factor was created at the top of this document,
## which is a little awkward here.
train_idx <- createDataPartition(outcome_factor, p = 0.6,
                                 list = FALSE,
                                 times = 1)
summary(hslp_ml_df[train_idx, "outcome_factor"])

train_rownames <- rownames(hslp_ml_df)[train_idx]
not_train_idx <- ! rownames(hslp_ml_df) %in% train_rownames
summary(hslp_ml_df[not_train_idx, "outcome_factor"])
not_train_rownames <- rownames(hslp_ml_df)[not_train_idx]

train_df <- as.data.frame(hslp_ml_df[train_rownames, ])
dim(train_df)
test_df <- as.data.frame(hslp_ml_df[not_train_rownames, ])
dim(test_df)
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL
```

```{r hslp_knn}
boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)

names(knn_predict_trained) <- rownames(train_df)
summary(knn_predict_trained == pData(hslp_gene_expt)[train_idx, "finaloutcome"])

knn_predict_test <- predict(knn_fit, test_df, type = "class")
## and the values, which will be used for our ROC curve.
knn_predict_test_values <- predict(knn_fit, test_df)

names(knn_predict_test) <- rownames(test_df)
knn_agreement <- knn_predict_test == test_outcome
names(knn_agreement) <- rownames(test_df)
summary(knn_agreement)
names(knn_agreement)[knn_agreement == FALSE]

cv_control <- trainControl(method = "cv", number = 10)
knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]

plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")

```
