1 Introduction

I had some success in classifying the TMRC2 samples by strain via ML and want to try something more difficult. Thus I will use the normalized gene expression data and try classifying it by cure/fail.

2 Starter data

In the strain classifier I used normalized variants. I am thinking of using normalized expression here, which explicitly limits me to ~ 20k variables (significantly down from the 1.6M).

In addition, caret expects the data as (rows == samples) and (columns == variables) where each element is one observation. Thus we will need to transpose the expression matrix.
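A toy base-R example of that orientation (made-up gene/sample names):

```r
# Expression matrices arrive as (genes x samples); caret wants (samples x genes).
exprs_mat <- matrix(1:6, nrow = 3,
                    dimnames = list(c("geneA", "geneB", "geneC"),
                                    c("s1", "s2")))
dim(exprs_mat)   # 3 genes x 2 samples
texprs <- t(exprs_mat)
dim(texprs)      # 2 samples x 3 genes
texprs["s1", "geneB"]  # a single observation: geneB measured in sample s1
```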

input_data <- subset_expt(tc_clinical, subset="batch!='biopsy'")
## subset_expt(): There were 184, now there are 166 samples.
tc_norm <- normalize_expt(input_data, transform = "log2", convert = "cpm")
## transform_counts: Found 929232 values equal to 0, adding 1 to the matrix.
texprs <- t(exprs(tc_norm))

ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(input_data)[[ref_col]]))

3 Filtering

The ML text I am reading provides some neat examples of how one might filter the data to make it more suitable for model creation.

3.1 Near zero variance, or genefilter’s cv

The first filter I was introduced to is quite familiar from our sequencing data, the removal of features with near-zero-variance. Indeed, I am pretty certain that normalize_expt() could do this equivalently and significantly faster than caret::preProcess().
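For reference, the heart of a cv filter is a one-liner in base R; a toy two-gene sketch (hypothetical data, cv_min = 0.1 as in the call below):

```r
# The cv filter keeps features whose coefficient of variation (sd/mean)
# clears a threshold.  Two fake genes, one flat and one variable:
set.seed(5)
mat <- rbind(flat = rep(10, 8) + rnorm(8, sd = 0.01),  # near-zero variance
             vary = rnorm(8, mean = 10, sd = 3))        # plenty of variance
cvs <- apply(mat, 1, function(x) sd(x) / mean(x))
keep <- cvs >= 0.1
rownames(mat)[keep]  # only the variable gene survives the filter
```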

system.time({
  equivalent <- normalize_expt(tc_norm, filter = "cv", cv_min = 0.1)
})
## Removing 3612 low-count genes (16311 remaining).
##    user  system elapsed 
##   2.705   0.240   2.946
dim(exprs(equivalent))
## [1] 16311   166
## Given this large amount of data, this step is slow, taking > 10 minutes.
## Yeah seriously, the following three lines get 16,723 genes in 10 minutes while
## the normalize_expt() call above gets 16,749 genes in 2.4 seconds.
#system.time({
#  nzv <- preProcess(texprs, method="nzv", uniqueCut=15)
#  nzv_texprs <- predict(nzv, texprs)
#  dim(nzv_texprs)
#})
nzv_texprs <- t(exprs(equivalent))

3.2 Filtering to the highest standard deviation variables

I am a bit confused by this filter: one would think that the nzv filter above, if applied correctly, would give exactly this.

For the moment, I am excluding the following block in order to see how much time/memory keeping these variables costs. If I recall properly, the model of the top-2k variant positions cost ~ 1-4G of memory. I hope that this scales linearly, but I am thinking it might not.

standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:3000]
nzv_texprs <- nzv_texprs[, top_predictors]

3.3 Center the data

I think centering may not be needed for this data, but here is how:

nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)

3.4 Drop correlated

This is a filter which does not correspond to any of those we use in sequencing data because genes which are highly correlated are likely to be of immediate interest.

In the same fashion, I want to leave this off, because later applications of this model will include low-coverage samples which may not have every gene represented.

## This step takes a while...
system.time({
  nzv_correlated <- preProcess(nzv_texprs, method = "corr", cutoff = 0.95)
  nzv_uncorr <- predict(nzv_correlated, nzv_texprs)
})
##    user  system elapsed 
## 593.023   6.466 599.562
dim(nzv_uncorr)
## [1]   166 13091

4 Merge the appropriate metadata

There are a few metadata factors which might prove of interest for classification. The most obvious are of course outcome, clinic, donor, visit, celltype. I am, for the moment, only likely to focus on outcome. AFAICT I can only include one of these at a time in the data, which is a shame.

interesting_meta <- pData(input_data)[, c("finaloutcome", "donor", "persistence",
                                           "visitnumber", "selectionmethod",
                                           "typeofcells", "time", "clinic")]

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
dim(ml_df)
## [1]   166 13092

5 Split the data into training/testing

caret provides nice functionality for splitting up the data. I suspect there are many more fun knobs I can play with for instances where I need to exclude some levels of a factor and such. In this case I just want to split by outcome.

5.1 Via data splitting

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))

datasets <- create_partitions(nzv_uncorr, interesting_meta,
                              outcome_factor = outcome_factor)
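create_partitions() is from our local helper code; the stratified split it performs can be sketched in base R (hypothetical outcome vector, arbitrary 0.7 training proportion):

```r
set.seed(42)
outcome <- factor(rep(c("cure", "failure"), times = c(100, 66)))
# Sample 70% within each outcome level so both classes keep their proportions.
train_idx <- unlist(lapply(split(seq_along(outcome), outcome), function(idx) {
  sample(idx, size = floor(0.7 * length(idx)))
}))
table(outcome[train_idx])    # 70 cure, 46 failure
table(outcome[-train_idx])   # the held-out test set
```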

5.2 Via sampling

There are a few likely sampling methods: cross-validation, bootstrapping, and jackknifing. I will try those out later.
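For intuition, the three schemes differ mainly in how they draw resampled indices; a base-R sketch on n = 12 samples:

```r
n <- 12
set.seed(1)
# Cross-validation: partition into k disjoint folds; each round holds one fold out.
cv_folds <- split(sample(n), rep(1:3, length.out = n))
# Bootstrap: draw n indices with replacement; indices never drawn are "out of bag".
boot_idx <- sample(n, replace = TRUE)
oob_idx <- setdiff(seq_len(n), boot_idx)
# Jackknife: leave exactly one sample out at a time.
jack_sets <- lapply(seq_len(n), function(i) setdiff(seq_len(n), i))
lengths(cv_folds)  # three folds of 4
```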

6 Try out training and prediction methods

My goals from here on will be to get the beginnings of a sense of the various methods I can use to create the models from the training data and predict the outcome on the test data. I am hoping also to pick up some idea of what the various arguments mean while I am at it.

6.1 Try out KNN

k-nearest neighbors is somewhat similar to a k-means estimate; thus the primary argument is ‘k’, the number of neighbors consulted for each call.
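The mechanics are simple enough to sketch in base R: for a query point, take the k training points at the smallest distance and let them vote (toy one-dimensional data, hypothetical labels):

```r
train_x <- c(1.0, 1.2, 1.4, 5.0, 5.2, 5.4)
train_y <- factor(c("cure", "cure", "cure", "failure", "failure", "failure"))
knn_vote <- function(query, x, y, k = 3) {
  nearest <- order(abs(x - query))[1:k]   # indices of the k closest points
  names(which.max(table(y[nearest])))     # majority label among them
}
knn_vote(1.1, train_x, train_y, k = 3)  # "cure"
knn_vote(4.9, train_x, train_y, k = 3)  # "failure"
```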

6.1.1 Model creation and performance

split <- 1
train_all <- datasets[["trainers"]][[split]]
train_df <- datasets[["trainers_stripped"]][[split]]
train_idx <- datasets[["train_idx"]][[split]]
train_outcomes <- datasets[["trainer_outcomes"]][[split]]
test_df <- datasets[["testers"]][[split]]
test_idx <- datasets[["test_idx"]][[split]]
test_outcomes <- datasets[["tester_outcomes"]][[split]]

knn_fit <- knn3(x = train_df,
                y = train_outcomes,
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df, type = "prob")

knn_train_evaluated <- self_evaluate_model(knn_predict_trained, datasets,
                                           which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases

knn_train_evaluated
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      17      84
## The missed samples are:
##  [1] "TMRC30178" "TMRC30113" "TMRC30164" "TMRC30094" "TMRC30169" "TMRC30170" "TMRC30032" "TMRC30096" "TMRC30180" "TMRC30166" "TMRC30068"
## [12] "TMRC30072" "TMRC30161" "TMRC30175" "TMRC30146" "TMRC30264" "TMRC30265"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      56      10
##    failure    7      28
##                                         
##                Accuracy : 0.832         
##                  95% CI : (0.744, 0.899)
##     No Information Rate : 0.624         
##     P-Value [Acc > NIR] : 4.32e-06      
##                                         
##                   Kappa : 0.636         
##                                         
##  Mcnemar's Test P-Value : 0.628         
##                                         
##             Sensitivity : 0.889         
##             Specificity : 0.737         
##          Pos Pred Value : 0.848         
##          Neg Pred Value : 0.800         
##               Precision : 0.848         
##                  Recall : 0.889         
##                      F1 : 0.868         
##              Prevalence : 0.624         
##          Detection Rate : 0.554         
##    Detection Prevalence : 0.653         
##       Balanced Accuracy : 0.813         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.920779220779221.

As the confusion matrix shows, the model missed a handful of samples. Let us change k and see if it improves.

Here is a table of false positives/negatives for a few values of ‘k’; in this context a false positive is calling a known cure a failure, and a false negative is calling a known failure a cure.

|k |fp |fn |
|--|---|---|
|2 |0 |8 |
|3 |5 |5 |
|4 |8 |9 |
|5 |11 |7 |
|6 |15 |8 |

Note: this depends on the luck of rand(), so the above numbers shift moderately from one run to the next. Thus I think I will just use 2 or 3.
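Under that convention, the counts fall straight out of the confusion table; a base-R sketch with hypothetical calls:

```r
truth <- factor(c("cure", "cure", "cure", "failure", "failure"))
calls <- factor(c("cure", "failure", "cure", "failure", "cure"),
                levels = levels(truth))
tab <- table(Prediction = calls, Reference = truth)
fp <- tab["failure", "cure"]    # known cures called failure
fn <- tab["cure", "failure"]    # known failures called cure
c(fp = fp, fn = fn)             # fp = 1, fn = 1
```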

knn_fit2 <- knn3(x = train_df,
                y = train_outcomes,
                k = 5)
knn_predict_trained2 <- predict(knn_fit2, train_df, type = "prob")

knn_train_evaluated2 <- self_evaluate_model(knn_predict_trained2, datasets,
                                            which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases

knn_train_evaluated2
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      19      82
## The missed samples are:
##  [1] "TMRC30178" "TMRC30179" "TMRC30224" "TMRC30164" "TMRC30094" "TMRC30119" "TMRC30170" "TMRC30032" "TMRC30096" "TMRC30180" "TMRC30166"
## [12] "TMRC30157" "TMRC30072" "TMRC30129" "TMRC30161" "TMRC30175" "TMRC30146" "TMRC30147" "TMRC30265"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      55      11
##    failure    8      27
##                                         
##                Accuracy : 0.812         
##                  95% CI : (0.722, 0.883)
##     No Information Rate : 0.624         
##     P-Value [Acc > NIR] : 3.43e-05      
##                                         
##                   Kappa : 0.593         
##                                         
##  Mcnemar's Test P-Value : 0.646         
##                                         
##             Sensitivity : 0.873         
##             Specificity : 0.711         
##          Pos Pred Value : 0.833         
##          Neg Pred Value : 0.771         
##               Precision : 0.833         
##                  Recall : 0.873         
##                      F1 : 0.853         
##              Prevalence : 0.624         
##          Detection Rate : 0.545         
##    Detection Prevalence : 0.653         
##       Balanced Accuracy : 0.792         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.895021645021645.

6.1.2 Predict the rest of the data with this model.

knn_predict_test <- predict(knn_fit, test_df)

knn_test_evaluated <- self_evaluate_model(knn_predict_test, datasets,
                                     which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases

knn_test_evaluated
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      16      49
## The missed samples are:
##  [1] "TMRC30221" "TMRC30223" "TMRC30071" "TMRC30105" "TMRC30058" "TMRC30029" "TMRC30028" "TMRC30118" "TMRC30121" "TMRC30037" "TMRC30165"
## [12] "TMRC30042" "TMRC30158" "TMRC30123" "TMRC30078" "TMRC30145"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      36       7
##    failure    9      13
##                                         
##                Accuracy : 0.754         
##                  95% CI : (0.631, 0.852)
##     No Information Rate : 0.692         
##     P-Value [Acc > NIR] : 0.174         
##                                         
##                   Kappa : 0.438         
##                                         
##  Mcnemar's Test P-Value : 0.803         
##                                         
##             Sensitivity : 0.800         
##             Specificity : 0.650         
##          Pos Pred Value : 0.837         
##          Neg Pred Value : 0.591         
##               Precision : 0.837         
##                  Recall : 0.800         
##                      F1 : 0.818         
##              Prevalence : 0.692         
##          Detection Rate : 0.554         
##    Detection Prevalence : 0.662         
##       Balanced Accuracy : 0.725         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.80708245243129.
knn_predict_test2 <- predict(knn_fit2, test_df)
knn_test_evaluated2 <- self_evaluate_model(knn_predict_test2, datasets,
                                           which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases

knn_test_evaluated2
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      19      46
## The missed samples are:
##  [1] "TMRC30221" "TMRC30222" "TMRC30223" "TMRC30071" "TMRC30105" "TMRC30058" "TMRC30028" "TMRC30118" "TMRC30121" "TMRC30037" "TMRC30042"
## [12] "TMRC30160" "TMRC30167" "TMRC30123" "TMRC30043" "TMRC30184" "TMRC30143" "TMRC30173" "TMRC30144"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      33      10
##    failure    9      13
##                                         
##                Accuracy : 0.708         
##                  95% CI : (0.582, 0.814)
##     No Information Rate : 0.646         
##     P-Value [Acc > NIR] : 0.183         
##                                         
##                   Kappa : 0.354         
##                                         
##  Mcnemar's Test P-Value : 1.000         
##                                         
##             Sensitivity : 0.786         
##             Specificity : 0.565         
##          Pos Pred Value : 0.767         
##          Neg Pred Value : 0.591         
##               Precision : 0.767         
##                  Recall : 0.786         
##                      F1 : 0.776         
##              Prevalence : 0.646         
##          Detection Rate : 0.508         
##    Detection Prevalence : 0.662         
##       Balanced Accuracy : 0.675         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.784883720930233.

6.2 Perform cross-validation to estimate k

Cross-validation’s repeated resampling of the data is handled entirely within the train() function. With that in mind, here it is operating with the knn method.

6.2.1 CV with knn

When train() is called with trControl and tuneGrid arguments, we can control how the knn training is repeated; in this case it will iterate over k from 1 to 10. (Note: the call below errors because train_df is the stripped matrix and does not contain the outcome column; the bootstrap example in the next section uses train_all, which retains it.)

cv_control <- trainControl(method = "cv", number = 10)

knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
## Error in model.frame.default(form = outcome_factor ~ ., data = train_df, : variable lengths differ (found for 'ENSG00000000003')
knn_train_fit[["bestTune"]]
## Error in eval(expr, envir, enclos): object 'knn_train_fit' not found
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
## Error in xy.coords(x, y, xlabel, ylabel, log): object 'knn_train_fit' not found
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2],degree = 2),
      col = "#CC0000")
## Error in loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2): object 'knn_train_fit' not found

6.2.2 Bootstrap with knn

boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_train_fit <- train(outcome ~ ., data = train_all,
                       method = "knn",
                       trControl = boot_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]
##   k
## 1 1
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2],degree = 2),
      col = "#CC0000")

6.2.3 Explain the important variables

In this instance we will search for genes which were important for the model’s creation.

The DALEX package provides a function: feature_importance() which seeks to use a series of other methods to extract (in this case, genes) features which do a good job of explaining the result produced by the model. In the case of this dataset, which has thousands of features, this does not appear to end well.

explainer_knn <- DALEX::explain(knn_fit, label = "knn",
                                data = train_df,
                                y = as.numeric(train_outcomes))
## Preparation of a new explainer is initiated
##   -> model label       :  knn 
##   -> data              :  101  rows  13091  cols 
##   -> target variable   :  101  values 
##   -> predict function  :  yhat.default will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package Model of class: knn3 package unrecognized , ver. Unknown , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  0 , mean =  0.3927 , max =  1  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  0.3333 , mean =  0.9538 , max =  1.667  
##   A new explainer has been created!
## AFAICT the following will take forever unless we drastically reduce the complexity of the model.
## yeah, I let it run for a week.
## features <- feature_importance(explainer_knn, n_sample = 50, type = "difference")

6.3 Random Forest

The parameter ‘mtry’ is often important; if I read the text correctly, it controls how many candidate variables are sampled at each split of the tree. Higher numbers should presumably make it more specific, at the risk of overfitting.

Setting min.node.size sets the minimum size of terminal nodes in each tree; each increment up speeds the algorithm.

I am going to use my boot control trainer from above and see how it goes.

rf_train_fit <- train(outcome ~ ., data = train_all,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
rf_train_fit[["finalModel"]][["prediction.error"]]
## [1] 0.1782
variable_importance <- varImp(rf_train_fit)
plot(variable_importance, top = 15)

rf_variables <- variable_importance[["importance"]] %>%
  arrange(desc(Overall))

rf_predict_trained <- predict(rf_train_fit, train_df)
rf_predict_evaluated <- self_evaluate_model(rf_predict_trained, datasets,
                                            which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

rf_predict_evaluated
## The summary of the (in)correct calls is:
##    Mode    TRUE 
## logical     101
## The missed samples are:
## character(0)
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      66       0
##    failure    0      35
##                                     
##                Accuracy : 1         
##                  95% CI : (0.964, 1)
##     No Information Rate : 0.653     
##     P-Value [Acc > NIR] : <2e-16    
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##               Precision : 1.000     
##                  Recall : 1.000     
##                      F1 : 1.000     
##              Prevalence : 0.653     
##          Detection Rate : 0.653     
##    Detection Prevalence : 0.653     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : cure      
## 
## The ROC AUC is: 1.

6.4 Compare topn important genes to DE genes

Given that we have separated the various analyses, it will take me a minute to figure out where I saved the relevant differential expression analysis. I do not actually save the various DE results to rda files by default, instead opting to send them to xlsx files to share. Recall if you will, that the data that I think might be used for the paper also does not go into the default excel directory but instead mirrors the box organization scheme.

Thus, I think the most relevant file is: “analyses/4_tumaco/DE_Cure_vs_Fail/All_Samples/t_cf_clinical_tables_sva-v202207.xlsx”

all_de_cf <- openxlsx::readWorkbook(
  "analyses/4_tumaco/DE_Cure_vs_Fail/t_all_visitcf_tables_sva-v202305.xlsx",
  sheet = 2, startRow = 2)
rownames(all_de_cf) <- all_de_cf[["row.names"]]
all_de_cf[["row.names"]] <- NULL
deseq_de_cf <- all_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]

6.4.1 What would be shared between DESeq2 and the ML classifier?

Presumably DESeq and the models should both be responding to variance in the data, for which I think the logFC values, p-values, mean values, and standard errors are the most likely proxies to which I have easy access. So let us pull the top/bottom n genes vis-à-vis each of those categories and see what happens.

top_100_lfc <- all_de_cf %>%
  arrange(desc(deseq_logfc)) %>%
  top_n(n = 100, wt = deseq_logfc)
bottom_100_lfc <- all_de_cf %>%
  arrange(deseq_logfc) %>%
  top_n(n = -100, wt = deseq_logfc)
top_bottom_ids <- c(rownames(top_100_lfc), rownames(bottom_100_lfc))

top_400_ml <- rownames(head(rf_variables, n = 400))

comparison <- list("de" = top_bottom_ids, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
top_400_adjp <- all_de_cf %>%
  top_n(n = -400, wt = deseq_adjp)
lowest_adjp <- rownames(top_400_adjp)
comparison <- list("de" = lowest_adjp, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
top_400_exprs <- all_de_cf %>%
  top_n(n = 400, wt = deseq_basemean)
highest_exprs <- rownames(top_400_exprs)
comparison <- list("de" = highest_exprs, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
tt <- merge(all_de_cf, rf_variables, by = "row.names")
rownames(tt) <- tt[["Row.names"]]
tt[["Row.names"]] <- NULL
cor.test(tt[["deseq_logfc"]], tt[["Overall"]])
## 
##  Pearson's product-moment correlation
## 
## data:  tt[["deseq_logfc"]] and tt[["Overall"]]
## t = 9.3, df = 9369, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07544 0.11557
## sample estimates:
##     cor 
## 0.09554

6.4.2 Compare to the WGCNA results

A couple of months ago I spent a little time attempting to recapitulate Alejandro’s WGCNA results. I think I did so by mostly copy/pasting his work, adding some commentary, and tweaking parts of it so that it was easier for me to read/understand. In the process, I generated a series of modules which looked similar/identical to his. Unfortunately, I did not add sections to record the genes/modules to output files. I am therefore going back to that now and doing so, in the hopes that I can compare those modules to the results produced by the classifiers.

wgcna_result <- openxlsx::readWorkbook(glue("excel/wgcna_interesting_genes-v{ver}.xlsx"))
rownames(wgcna_result) <- wgcna_result[["row.names"]]
wgcna_result[["row.names"]] <- NULL

top_200_ml <- rownames(head(rf_variables, n = 200))
comparison <- list("wgcna" = rownames(wgcna_result), "ml" = top_200_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
##                 ensembl_gene_id ensembl_transcript_id version transcript_version hgnc_symbol
## ENSG00000059378 ENSG00000059378       ENST00000489383      13                  1      PARP12
## ENSG00000152689 ENSG00000152689       ENST00000484909      18                  5     RASGRP3
##                                                                                      description   gene_biotype cds_length chromosome_name strand
## ENSG00000059378 poly(ADP-ribose) polymerase family member 12 [Source:HGNC Symbol;Acc:HGNC:21919] protein_coding  undefined               7      -
## ENSG00000152689               RAS guanyl releasing protein 3 [Source:HGNC Symbol;Acc:HGNC:14545] protein_coding  undefined               2      +
##                 start_position end_position        transcript     mean_cds_len
## ENSG00000059378      140023749    140062951 ENSG00000059378.1            851.6
## ENSG00000152689       33436324     33564750 ENSG00000152689.5 906.555555555556
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

6.4.2.1 Digression: do the genes provided by varImp() mean anything?

Let us take a moment and see if the top-n genes returned by varImp() have some meaning which jumps out. One might assume, given our extant Differential Expression results, that the interleukin response will be a likely candidate.

importance_gp <- simple_gprofiler(rownames(head(rf_variables, n = 300)))
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
importance_gp
## A set of ontologies produced by gprofiler using 300
## genes against the hsapiens annotations and significance cutoff 0.05.
## There are 8 GO hits, 0, KEGG hits, 0 reactome hits, 0 wikipathway hits, 2 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category BP is the most populated with 3 hits.

6.4.3 Now the random forest testers!

rf_predict_test <- predict(rf_train_fit, test_df)

rf_predict_test_evaluated <- self_evaluate_model(rf_predict_test, datasets,
                                     which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

rf_predict_test_evaluated
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical       8      57
## The missed samples are:
## [1] "TMRC30221" "TMRC30222" "TMRC30223" "TMRC30056" "TMRC30058" "TMRC30118" "TMRC30121" "TMRC30078"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      43       0
##    failure    8      14
##                                         
##                Accuracy : 0.877         
##                  95% CI : (0.772, 0.945)
##     No Information Rate : 0.785         
##     P-Value [Acc > NIR] : 0.0423        
##                                         
##                   Kappa : 0.698         
##                                         
##  Mcnemar's Test P-Value : 0.0133        
##                                         
##             Sensitivity : 0.843         
##             Specificity : 1.000         
##          Pos Pred Value : 1.000         
##          Neg Pred Value : 0.636         
##               Precision : 1.000         
##                  Recall : 0.843         
##                      F1 : 0.915         
##              Prevalence : 0.785         
##          Detection Rate : 0.662         
##    Detection Prevalence : 0.662         
##       Balanced Accuracy : 0.922         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.818181818181818.

6.5 GLM, or Logistic regression and regularization

Logistic regression is a statistical method for binary responses, though it can be extended to multiple classes as well. The general idea is to find parameters which maximize the likelihood that the observed data were sampled from a statistical distribution of interest. The transformations and linear-regression-like machinery are confusing, but once past those, the task becomes setting the model’s parameters (fitting) so that the statistical model resembles the actual dataset given the training data, and so that new samples return similar values. When the distribution of choice is Gaussian, we want to transform/normalize the mean/variance of our variables so they resemble the normal distribution we are using. Logistic regression, in contrast, uses a binomial distribution (like our raw sequencing data!) with predictions mapped onto the 0-1 interval.
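As a self-contained illustration of that 0-1 mapping, a toy glm() fit on simulated data (one fake “gene”, not our expression matrix):

```r
set.seed(3)
n <- 200
# Simulate one gene whose expression is shifted between the two outcomes.
expr <- c(rnorm(n / 2, mean = 2), rnorm(n / 2, mean = 4))
outcome <- factor(rep(c("cure", "failure"), each = n / 2))
fit <- glm(outcome ~ expr, family = binomial)
# Fitted probabilities are squeezed through the logistic function into (0, 1).
probs <- predict(fit, type = "response")
range(probs)
# Higher expression should push predictions toward the second level ("failure").
mean(probs[outcome == "failure"]) > mean(probs[outcome == "cure"])  # TRUE
```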

6.5.1 Using a single gene

Let us take the most important gene observed in one of our previous training sets: ENSG00000248405 PRR5-ARHGAP8

gene_id <- "ENSG00000248405"
single_fit <- train(
    outcome ~ ENSG00000248405, data = train_all,
    method = "glm", family = "binomial", trControl = trainControl("none"))

tt <- data.frame("ENSG00000248405" = seq(min(train_df[[gene_id]]),
                                         max(train_df[[gene_id]]),len=100))
## predict probabilities for the simulated data
tt$subtype = predict(single_fit, newdata=tt, type="prob")[, 1]
## plot the sigmoid curve and the training data
plot(ifelse(outcome == "cure", 1, 0) ~ ENSG00000248405,
     data = train_all, col = "red4",
     ylab = "CF as 0 or 1", xlab = "favorite gene expression")
lines(subtype ~ ENSG00000248405, tt, col = "green4", lwd = 2)

plot_df <- train_all[, c("outcome", "ENSG00000248405")]
ggbetweenstats(plot_df, "outcome", "ENSG00000248405")

Having tried with 1 gene, let us extend this to all genes. In my first try of this, it took a long time.

glm_train_fit <- train(outcome ~ ., data = train_all,
                 trControl = boot_control,
                 method = "glm", family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == : prediction from a rank-deficient fit may be misleading
## (the previous two warnings repeat for each bootstrap resample; repeats omitted)
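
With ~13k predictors and only ~100 samples, the design matrix cannot have full rank, which is exactly what those warnings complain about. A tiny base-R reproduction (synthetic data, nothing from the dataset above):

```r
## When p > n, glm() can only estimate as many coefficients as it has rows;
## the rest come back NA (aliased), and predictions from such a fit are the
## "rank-deficient" warnings seen above.
set.seed(1)
n <- 10
p <- 20                                   # more predictors than samples
dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))
dat$y <- rbinom(n, 1, 0.5)
fit <- suppressWarnings(glm(y ~ ., data = dat, family = binomial))
sum(is.na(coef(fit)))                     # aliased coefficients are NA
```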

6.6 Compare GLM and WGCNA/DE

glm_variable_importance <- varImp(glm_train_fit)
## Oh, this only produces 100 entries -- so me getting the top 400 is silly.
glm_variables <- glm_variable_importance[["importance"]] %>%
  arrange(desc(Overall))
plot(glm_variable_importance, top = 15)

simple_gprofiler(rownames(head(glm_variables, n = 100)))
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## (the previous two messages repeat several times; repeats omitted)
## A set of ontologies produced by gprofiler using 100
## genes against the hsapiens annotations and significance cutoff 0.05.
## There are 6 GO hits, 0, KEGG hits, 0 reactome hits, 0 wikipathway hits, 4 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category BP is the most populated with 4 hits.

top_glm <- rownames(glm_variables)
comparison <- list("de" = top_bottom_ids, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
##                 ensembl_gene_id ensembl_transcript_id version transcript_version hgnc_symbol
## ENSG00000003137 ENSG00000003137       ENST00000412253       8                  1     CYP26B1
##                                                                                        description   gene_biotype cds_length chromosome_name
## ENSG00000003137 cytochrome P450 family 26 subfamily B member 1 [Source:HGNC Symbol;Acc:HGNC:20581] protein_coding        966               2
##                 strand start_position end_position        transcript mean_cds_len
## ENSG00000003137      -       72129238     72148038 ENSG00000003137.1        976.6
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = lowest_adjp, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
##                 ensembl_gene_id ensembl_transcript_id version transcript_version hgnc_symbol
## ENSG00000002016 ENSG00000002016       ENST00000488642      18                  6       RAD52
## ENSG00000004660 ENSG00000004660       ENST00000381769      15                  6      CAMKK1
## ENSG00000005243 ENSG00000005243       ENST00000583414      10                  5       COPZ2
## ENSG00000005486 ENSG00000005486       ENST00000476218      17                  1      RHBDD2
## ENSG00000006704 ENSG00000006704       ENST00000476977      10                  5    GTF2IRD1
##                                                                                             description   gene_biotype cds_length chromosome_name
## ENSG00000002016                    RAD52 homolog, DNA repair protein [Source:HGNC Symbol;Acc:HGNC:9824] protein_coding  undefined              12
## ENSG00000004660 calcium/calmodulin dependent protein kinase kinase 1 [Source:HGNC Symbol;Acc:HGNC:1469] protein_coding       1599              17
## ENSG00000005243             coatomer protein complex subunit zeta 2 [Source:HGNC Symbol;Acc:HGNC:19356] protein_coding        138              17
## ENSG00000005486                        rhomboid domain containing 2 [Source:HGNC Symbol;Acc:HGNC:23082] protein_coding  undefined               7
## ENSG00000006704                     GTF2I repeat domain containing 1 [Source:HGNC Symbol;Acc:HGNC:4661] protein_coding       2883               7
##                 strand start_position end_position        transcript     mean_cds_len
## ENSG00000002016      -         911736       990053 ENSG00000002016.6 612.636363636364
## ENSG00000004660      -        3860315      3894891 ENSG00000004660.6             1560
## ENSG00000005243      -       48026167     48038030 ENSG00000005243.5 283.571428571429
## ENSG00000005486      +       75842602     75888926 ENSG00000005486.1           669.75
## ENSG00000006704      +       74453790     74602604 ENSG00000006704.5           2470.6
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = highest_exprs, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
##  [1] ensembl_gene_id       ensembl_transcript_id version               transcript_version    hgnc_symbol           description          
##  [7] gene_biotype          cds_length            chromosome_name       strand                start_position        end_position         
## [13] transcript            mean_cds_len         
## <0 rows> (or 0-length row.names)
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("wgcna" = rownames(wgcna_result), "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
##                 ensembl_gene_id ensembl_transcript_id version transcript_version hgnc_symbol
## ENSG00000002549 ENSG00000002549       ENST00000508497      12                  5        LAP3
## ENSG00000003400 ENSG00000003400       ENST00000374650      15                  7      CASP10
## ENSG00000004468 ENSG00000004468       ENST00000511430      13                  1        CD38
##                                                                  description   gene_biotype cds_length chromosome_name strand start_position
## ENSG00000002549 leucine aminopeptidase 3 [Source:HGNC Symbol;Acc:HGNC:18449] protein_coding  undefined               4      +       17577192
## ENSG00000003400                caspase 10 [Source:HGNC Symbol;Acc:HGNC:1500] protein_coding        744               2      +      201182898
## ENSG00000004468             CD38 molecule [Source:HGNC Symbol;Acc:HGNC:1667] protein_coding  undefined               4      +       15778275
##                 end_position        transcript     mean_cds_len
## ENSG00000002549     17607972 ENSG00000002549.5           1114.4
## ENSG00000003400    201229406 ENSG00000003400.7         1162.875
## ENSG00000004468     15853232 ENSG00000004468.1 575.666666666667
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
## train_method <- trainControl(method = "cv", number = 10)
glm_fit <- train(outcome ~ ., data = train_all, method = "glmnet",
                 trControl = boot_control, importance = "permutation",
                 tuneGrid = data.frame(
                   alpha = 0.5,
                   lambda = seq(0.1, 0.7, 0.05)),
                 verbose = TRUE)
glm_fit
## glmnet 
## 
##   101 samples
## 13091 predictors
##     2 classes: 'cure', 'failure' 
## 
## No pre-processing
## Resampling: Bootstrapped (20 reps) 
## Summary of sample sizes: 101, 101, 101, 101, 101, 101, ... 
## Resampling results across tuning parameters:
## 
##   lambda  Accuracy  Kappa  
##   0.10    0.7700    0.46805
##   0.15    0.7569    0.42844
##   0.20    0.7483    0.38923
##   0.25    0.7320    0.33040
##   0.30    0.7139    0.26046
##   0.35    0.7124    0.22226
##   0.40    0.6782    0.07448
##   0.45    0.6771    0.02441
##   0.50    0.6781    0.01350
##   0.55    0.6766    0.00000
##   0.60    0.6766    0.00000
##   0.65    0.6766    0.00000
##   0.70    0.6766    0.00000
## 
## Tuning parameter 'alpha' was held constant at a value of 0.5
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.1.
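The lambda sweep above shows accuracy collapsing to the no-information rate (kappa 0) once lambda reaches ~0.55: the elastic-net penalty has zeroed out every coefficient. Per coordinate, the L1 part of that penalty reduces to the soft-thresholding operator, sketched here in base R:

```r
## Soft-thresholding: shrink each coefficient toward zero by gamma, and set it
## exactly to zero once gamma exceeds its magnitude. This is why large lambda
## values empty the glmnet model entirely.
soft_threshold <- function(z, gamma) {
  sign(z) * pmax(abs(z) - gamma, 0)
}
soft_threshold(c(-2, -0.3, 0.1, 1.5), gamma = 0.5)
## small coefficients are zeroed, large ones shrink by gamma
```
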
glm_predict_trained <- predict(glm_fit, train_df)

glm_train_eval <- self_evaluate_model(glm_predict_trained, datasets,
                                      which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

glm_train_eval
## The summary of the (in)correct calls is:
##    Mode    TRUE 
## logical     101
## The missed samples are:
## character(0)
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      66       0
##    failure    0      35
##                                     
##                Accuracy : 1         
##                  95% CI : (0.964, 1)
##     No Information Rate : 0.653     
##     P-Value [Acc > NIR] : <2e-16    
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##               Precision : 1.000     
##                  Recall : 1.000     
##                      F1 : 1.000     
##              Prevalence : 0.653     
##          Detection Rate : 0.653     
##    Detection Prevalence : 0.653     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : cure      
## 
## The ROC AUC is: 1.

6.6.1 Now the GLM testers!

glm_predict_test <- predict(glm_fit, test_df)

glm_fit_eval_test <- self_evaluate_model(glm_predict_test, datasets,
                                         which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

glm_fit_eval_test
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical       9      56
## The missed samples are:
## [1] "TMRC30221" "TMRC30222" "TMRC30223" "TMRC30105" "TMRC30058" "TMRC30042" "TMRC30123" "TMRC30116" "TMRC30144"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      41       2
##    failure    7      15
##                                         
##                Accuracy : 0.862         
##                  95% CI : (0.753, 0.935)
##     No Information Rate : 0.738         
##     P-Value [Acc > NIR] : 0.0132        
##                                         
##                   Kappa : 0.673         
##                                         
##  Mcnemar's Test P-Value : 0.1824        
##                                         
##             Sensitivity : 0.854         
##             Specificity : 0.882         
##          Pos Pred Value : 0.953         
##          Neg Pred Value : 0.682         
##               Precision : 0.953         
##                  Recall : 0.854         
##                      F1 : 0.901         
##              Prevalence : 0.738         
##          Detection Rate : 0.631         
##    Detection Prevalence : 0.662         
##       Balanced Accuracy : 0.868         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.817653276955602.

6.6.2 Compare again vs DE/WGCNA

variable_importance <- varImp(glm_fit)
plot(variable_importance, top = 15)

top_400_ml <- rownames(head(variable_importance$importance, n = 400))
top_100_ml <- rownames(head(variable_importance$importance, n = 100))

comparison <- list("de" = top_bottom_ids, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = lowest_adjp, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = highest_exprs, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("wgcna" = rownames(wgcna_result), "ml" = top_100_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

6.7 Gradient Booster

##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

gb_fit <- train(outcome ~ ., data = train_all,
                method = "xgbTree", trControl = train_method,
                tuneGrid = data.frame(
                    nrounds = 200,
                    eta = c(0.05, 0.1, 0.3),
                    max_depth = 4,
                    gamma = 0,
                    colsample_bytree = 1,
                    subsample = 0.5,
                    min_child_weight = 1),
                verbose = TRUE)

gb_predict_trained <- predict(gb_fit, train_df)
gb_predict_trained
##   [1] cure    cure    failure failure failure cure    cure    cure    cure    cure    cure    cure    cure    cure    cure    cure    cure   
##  [18] cure    cure    cure    cure    cure    cure    cure    cure    cure    cure    cure    cure    cure    cure    cure    cure    cure   
##  [35] cure    cure    failure failure cure    cure    failure cure    cure    cure    cure    failure failure cure    cure    cure    cure   
##  [52] cure    cure    cure    cure    failure failure failure failure failure cure    failure cure    failure cure    cure    cure    cure   
##  [69] cure    cure    failure cure    cure    failure cure    cure    cure    cure    cure    failure cure    cure    failure failure failure
##  [86] failure failure failure failure failure failure failure failure failure failure failure failure cure    failure cure    failure
## Levels: cure failure
gb_train_eval <- self_evaluate_model(gb_predict_trained, datasets,
                                     which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

gb_train_eval
## The summary of the (in)correct calls is:
##    Mode    TRUE 
## logical     101
## The missed samples are:
## character(0)
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      66       0
##    failure    0      35
##                                     
##                Accuracy : 1         
##                  95% CI : (0.964, 1)
##     No Information Rate : 0.653     
##     P-Value [Acc > NIR] : <2e-16    
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##               Precision : 1.000     
##                  Recall : 1.000     
##                      F1 : 1.000     
##              Prevalence : 0.653     
##          Detection Rate : 0.653     
##    Detection Prevalence : 0.653     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : cure      
## 
## The ROC AUC is: 1.
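Gradient boosting in one idea: repeatedly fit a weak learner to the current residuals and add a damped (eta-scaled) copy of it to the ensemble. A minimal base-R sketch on synthetic 1-D regression; the names here are illustrative, not xgboost's API.

```r
set.seed(7)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)
## a depth-1 "tree": return fitted values for the best single split of x on r
stump_fit <- function(x, r) {
  best_fit <- rep(mean(r), length(r))
  best_sse <- sum((r - best_fit)^2)
  for (s in quantile(x, probs = seq(0.05, 0.95, 0.05))) {
    left <- x <= s
    fit <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - fit)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best_fit <- fit
    }
  }
  best_fit
}
eta <- 0.1                       # learning rate, as in the eta grid above
pred <- rep(mean(y), length(y))  # start from the constant model
for (i in 1:100) {
  pred <- pred + eta * stump_fit(x, y - pred)  # fit residuals, add damped
}
mean((y - pred)^2)               # far below var(y), the constant-model error
```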

6.7.1 Now the GB testers!

gb_predict_test <- predict(gb_fit, test_df)

gb_predict_test_evaluated <- self_evaluate_model(gb_predict_test, datasets,
                                                 which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

gb_predict_test_evaluated
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical       8      57
## The missed samples are:
## [1] "TMRC30221" "TMRC30222" "TMRC30223" "TMRC30056" "TMRC30121" "TMRC30123" "TMRC30116" "TMRC30144"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      42       1
##    failure    7      15
##                                         
##                Accuracy : 0.877         
##                  95% CI : (0.772, 0.945)
##     No Information Rate : 0.754         
##     P-Value [Acc > NIR] : 0.0113        
##                                         
##                   Kappa : 0.706         
##                                         
##  Mcnemar's Test P-Value : 0.0771        
##                                         
##             Sensitivity : 0.857         
##             Specificity : 0.938         
##          Pos Pred Value : 0.977         
##          Neg Pred Value : 0.682         
##               Precision : 0.977         
##                  Recall : 0.857         
##                      F1 : 0.913         
##              Prevalence : 0.754         
##          Detection Rate : 0.646         
##    Detection Prevalence : 0.662         
##       Balanced Accuracy : 0.897         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.829281183932347.

6.8 SVM

##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

svm_train_fit <- train(outcome ~ ., data = train_all,
                       method = "svmRadial", trControl = train_method,
                       tuneGrid = data.frame(
                         C = c(0.25, 0.5, 1),
                         sigma = 1),
                       verbose = TRUE)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
## (the previous warning repeats for each resample; repeats omitted)
svm_train_fit
## Support Vector Machines with Radial Basis Function Kernel 
## 
##   101 samples
## 13091 predictors
##     2 classes: 'cure', 'failure' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 91, 91, 91, 90, 91, 92, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy  Kappa
##   0.25  0.6542    0    
##   0.50  0.6542    0    
##   1.00  0.6542    0    
## 
## Tuning parameter 'sigma' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 1 and C = 0.25.
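Identical accuracy across every C with kappa 0 means the SVM is just calling the majority class. A plausible culprit: with sigma = 1 and ~13k predictors, squared distances between samples are enormous, so the RBF kernel exp(-sigma * d^2) is effectively zero for every pair of distinct points and the kernel matrix degenerates toward the identity. A small numeric illustration (fake data, dimensions matching our predictor count):

```r
## Two fake 13091-dimensional "samples" on a log2(cpm)-like scale.
set.seed(3)
a <- rnorm(13091, mean = 5, sd = 2)
b <- rnorm(13091, mean = 5, sd = 2)
d2 <- sum((a - b)^2)      # squared Euclidean distance: tens of thousands
exp(-1 * d2)              # the RBF kernel value: numerically zero
## caret's default svmRadial grid instead estimates sigma with
## kernlab::sigest(), which scales with the data and avoids this collapse.
```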
svm_predict_trained <- predict(svm_train_fit, train_df)

svm_predict_evaluated <- self_evaluate_model(svm_predict_trained, datasets,
                                             which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

svm_predict_evaluated
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      35      66
## The missed samples are:
##  [1] "TMRC30178" "TMRC30179" "TMRC30224" "TMRC30094" "TMRC30119" "TMRC30122" "TMRC30096" "TMRC30115" "TMRC30048" "TMRC30054" "TMRC30046"
## [12] "TMRC30070" "TMRC30055" "TMRC30053" "TMRC30068" "TMRC30072" "TMRC30088" "TMRC30197" "TMRC30198" "TMRC30201" "TMRC30200" "TMRC30203"
## [23] "TMRC30202" "TMRC30204" "TMRC30237" "TMRC30206" "TMRC30238" "TMRC30074" "TMRC30217" "TMRC30077" "TMRC30219" "TMRC30218" "TMRC30220"
## [34] "TMRC30264" "TMRC30265"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      66       0
##    failure   35       0
##                                         
##                Accuracy : 0.653         
##                  95% CI : (0.552, 0.745)
##     No Information Rate : 1             
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : 9.08e-09      
##                                         
##             Sensitivity : 0.653         
##             Specificity :    NA         
##          Pos Pred Value :    NA         
##          Neg Pred Value :    NA         
##               Precision : 1.000         
##                  Recall : 0.653         
##                      F1 : 0.790         
##              Prevalence : 1.000         
##          Detection Rate : 0.653         
##    Detection Prevalence : 0.653         
##       Balanced Accuracy :    NA         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.5.

6.8.1 Now the SVM testers!

svm_predict_test <- predict(svm_train_fit, test_df)

svm_predict_test_evaluated <- self_evaluate_model(svm_predict_test, datasets,
                                                  which = split, type = "test")

svm_predict_test_evaluated

7 Reapply to persistence

Note that this attempt fails from the start: nzv_uncorr was built from the 166-sample subset (biopsies removed), while pData(tc_clinical) still contains all 184 samples, so the cbind() below mismatches rows and every downstream step errors out.

ref_col <- "persistence"
outcome_factor <- as.factor(as.character(pData(tc_clinical)[[ref_col]]))

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': arguments imply differing number of rows: 184, 166
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
only_yn <- ml_df[["outcome_factor"]] == "Y" | ml_df[["outcome_factor"]] == "N"
outcome_factor <- as.factor(as.character(outcome_factor[only_yn]))


ml_df <- ml_df[only_yn, ]
ml_df[["outcome_factor"]] <- as.factor(as.character(ml_df[["outcome_factor"]]))
summary(ml_df[["outcome_factor"]])
## integer(0)
train_idx <- createDataPartition(outcome_factor, p = 0.5,
                                 list = FALSE,
                                 times = 1)
## Error in createDataPartition(outcome_factor, p = 0.5, list = FALSE, times = 1): y must have at least 2 data points
summary(ml_df[train_idx, "outcome_factor"])
## NA's 
##  101
## Training set has 112 samples.

train_rownames <- rownames(ml_df)[train_idx]
not_train_idx <- ! rownames(ml_df) %in% train_rownames
summary(ml_df[not_train_idx, "outcome_factor"])
## integer(0)
not_train_rownames <- rownames(ml_df)[not_train_idx]

train_df <- as.data.frame(ml_df[train_rownames, ])
dim(train_df)
## [1]   101 13092
test_df <- as.data.frame(ml_df[not_train_rownames, ])
dim(test_df)
## [1]     0 13092
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL

train_fit <- train(outcome_factor ~ ., data = train_df,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
## Error in na.fail.default(structure(list(outcome_factor = structure(c(NA_integer_, : missing values in object
train_fit[["finalModel"]][["prediction.error"]]
## Error in eval(expr, envir, enclos): object 'train_fit' not found
variable_importance <- varImp(train_fit)
## Error in varImp(train_fit): object 'train_fit' not found
plot(variable_importance, top = 15)

rf_predict_trained <- predict(rf_train_fit, train_df)
## Error in predict.ranger.forest(forest, data, predict.all, num.trees, type, : User interrupt or internal error.
confusionMatrix(data = train_df[[1]], reference = rf_predict_trained)
## Error in confusionMatrix.default(data = train_df[[1]], reference = rf_predict_trained): The data must contain some levels that overlap the reference.
rf_predict_test <- predict(rf_train_fit, test_df)
## Error in predict.ranger.forest(forest, data, predict.all, num.trees, type, : User interrupt or internal error.
names(rf_predict_test) <- rownames(test_df)
used_names <- names(rf_predict_test)
used_pdata <- pData(tc_norm)[used_names, ]

rf_agreement <- as.character(rf_predict_test) == used_pdata[[ref_col]]
names(rf_agreement) <- rownames(test_df)
summary(rf_agreement)
##    Mode    NA's 
## logical      65
names(rf_agreement)[rf_agreement == FALSE]
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [48] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

8 Try something a little different

I redid the mappings with a combined host/parasite transcriptome in the hope that it might provide some interesting variance for these classifiers. Let us poke at it and see whether anything is relevant or whether it was a bad idea.

wanted_idx <- grepl(x = rownames(exprs(hslp_gene_expt)), pattern = "^LP")
wanted_ids <- rownames(exprs(hslp_gene_expt))[wanted_idx]
test_lp <- exclude_genes_expt(hslp_gene_expt, ids = wanted_ids, method = "keep")
## Note, I renamed this to subset_genes().
## remove_genes_expt(), before removal, there were 28567 genes, now there are 8632.
## There are 210 samples which kept less than 90 percent counts.
## TMRC30015 TMRC30016 TMRC30017 TMRC30018 TMRC30019 TMRC30020 TMRC30022 TMRC30025 TMRC30026 TMRC30044 TMRC30045 TMRC30152 TMRC30154 TMRC30155 
## 2.423e-02 3.550e+00 3.315e+00 3.249e-03 4.734e-02 2.356e-02 5.704e-01 6.508e-03 4.101e-02 8.361e-01 6.937e-03 3.771e-03 6.344e-03 8.391e-03 
## TMRC30156 TMRC30177 TMRC30241 TMRC30242 TMRC30253 TMRC30269 TMRC30270 TMRC30023 TMRC30028 TMRC30029 TMRC30032 TMRC30033 TMRC30036 TMRC30043 
## 6.375e-01 1.926e+00 5.777e-01 1.055e-01 4.958e-01 6.739e-03 2.953e-03 1.361e-03 5.899e-04 2.999e-02 6.606e-05 1.514e-04 7.416e-05 7.652e-03 
## TMRC30048 TMRC30054 TMRC30070 TMRC30071 TMRC30074 TMRC30077 TMRC30079 TMRC30113 TMRC30119 TMRC30122 TMRC30135 TMRC30136 TMRC30138 TMRC30144 
## 6.121e-02 5.516e-02 5.497e-02 6.451e-02 1.504e-02 1.626e-02 1.778e-02 6.376e-02 3.685e-03 3.160e-02 2.010e-02 1.835e-02 2.762e-03 1.667e-03 
## TMRC30147 TMRC30151 TMRC30159 TMRC30161 TMRC30164 TMRC30168 TMRC30173 TMRC30180 TMRC30182 TMRC30190 TMRC30196 TMRC30211 TMRC30216 TMRC30227 
## 7.091e-04 3.200e-03 4.034e-03 3.406e-03 3.413e-03 1.076e-01 9.201e-03 2.422e-02 3.973e-03 7.041e-03 5.445e-03 5.996e-04 8.780e-04 1.493e-03 
## TMRC30230 TMRC30233 TMRC30254 TMRC30257 TMRC30271 TMRC30272 TMRC30277 TMRC30278 TMRC30281 TMRC30282 TMRC30010 TMRC30012 TMRC30013 TMRC30014 
## 5.921e-04 8.393e-04 6.035e-03 5.280e-03 3.948e-03 1.830e-03 3.849e-03 2.850e-03 9.388e-03 3.011e-03 0.000e+00 0.000e+00 0.000e+00 2.158e-05 
## TMRC30024 TMRC30030 TMRC30034 TMRC30037 TMRC30038 TMRC30041 TMRC30046 TMRC30049 TMRC30050 TMRC30055 TMRC30056 TMRC30072 TMRC30078 TMRC30080 
## 6.469e-04 0.000e+00 6.100e-05 8.953e-05 9.267e-03 4.422e-03 5.532e-02 5.443e-02 1.270e-01 5.668e-02 2.472e-01 2.826e-02 1.078e-02 1.102e-02 
## TMRC30082 TMRC30096 TMRC30105 TMRC30107 TMRC30115 TMRC30123 TMRC30129 TMRC30132 TMRC30139 TMRC30142 TMRC30145 TMRC30148 TMRC30150 TMRC30157 
## 1.045e-02 4.088e-02 2.508e-02 1.865e-02 1.922e-02 2.376e-03 1.459e-02 1.090e-02 1.712e-03 8.774e-04 1.202e-03 5.905e-03 3.198e-03 1.010e-01 
## TMRC30165 TMRC30169 TMRC30171 TMRC30172 TMRC30174 TMRC30176 TMRC30178 TMRC30183 TMRC30184 TMRC30185 TMRC30188 TMRC30191 TMRC30194 TMRC30197 
## 2.533e-03 1.923e-02 3.299e-03 3.387e-03 5.936e-03 4.178e-03 3.450e-03 5.765e-02 9.120e-02 1.243e-03 1.332e-01 1.101e-02 1.748e-03 1.736e-03 
## TMRC30199 TMRC30201 TMRC30203 TMRC30205 TMRC30207 TMRC30209 TMRC30212 TMRC30214 TMRC30217 TMRC30219 TMRC30221 TMRC30223 TMRC30225 TMRC30228 
## 5.007e-02 2.957e-03 3.439e-03 2.451e-03 1.150e-03 1.355e-03 5.024e-04 4.655e-04 1.200e-03 1.511e-03 9.979e-04 1.077e-03 6.754e-04 5.335e-04 
## TMRC30231 TMRC30234 TMRC30237 TMRC30239 TMRC30255 TMRC30258 TMRC30260 TMRC30262 TMRC30264 TMRC30273 TMRC30274 TMRC30279 TMRC30283 TMRC30009 
## 3.311e-04 5.569e-04 1.613e-02 2.330e-03 4.405e-02 4.952e-02 4.195e-03 1.054e-02 7.074e-03 1.321e-03 7.386e-04 2.169e-03 7.736e-04 3.719e-05 
## TMRC30011 TMRC30021 TMRC30027 TMRC30031 TMRC30035 TMRC30039 TMRC30040 TMRC30042 TMRC30047 TMRC30052 TMRC30053 TMRC30058 TMRC30068 TMRC30076 
## 0.000e+00 5.979e-04 6.366e-04 7.509e-05 8.564e-04 1.285e-02 1.226e-02 9.144e-03 1.017e-01 1.635e-01 7.250e-02 1.598e-01 4.955e-02 1.604e-02 
## TMRC30083 TMRC30088 TMRC30093 TMRC30094 TMRC30103 TMRC30116 TMRC30118 TMRC30121 TMRC30133 TMRC30134 TMRC30137 TMRC30140 TMRC30143 TMRC30146 
## 3.835e-02 3.482e-02 6.272e-02 5.251e-02 3.066e-02 6.987e-02 4.443e-03 1.129e+00 1.639e-02 1.017e-01 1.485e-02 2.779e-03 1.657e-03 1.097e-03 
## TMRC30149 TMRC30153 TMRC30158 TMRC30160 TMRC30166 TMRC30167 TMRC30170 TMRC30175 TMRC30179 TMRC30181 TMRC30186 TMRC30189 TMRC30192 TMRC30195 
## 2.869e-02 9.937e-03 3.498e-03 4.037e-03 3.535e-03 1.072e-02 2.627e-02 1.200e-02 5.508e-03 4.397e-03 7.468e-03 4.325e-03 5.187e-03 4.102e-03 
## TMRC30198 TMRC30200 TMRC30202 TMRC30204 TMRC30206 TMRC30208 TMRC30210 TMRC30213 TMRC30215 TMRC30218 TMRC30220 TMRC30222 TMRC30224 TMRC30226 
## 1.932e-03 2.613e-03 3.130e-03 1.200e-02 3.668e-03 2.316e-03 1.077e-03 6.854e-04 8.871e-04 2.791e-03 3.228e-03 1.450e-03 2.239e-03 8.040e-04 
## TMRC30229 TMRC30232 TMRC30235 TMRC30238 TMRC30240 TMRC30256 TMRC30261 TMRC30263 TMRC30265 TMRC30275 TMRC30276 TMRC30280 TMRC30284 TMRC30285 
## 6.279e-04 6.287e-04 1.360e-03 4.140e-03 3.408e-03 1.016e-01 5.074e-03 8.926e-03 1.255e-02 1.679e-03 1.146e-03 3.511e-03 4.669e-03 7.622e-04
tt <- plot_libsize(test_lp)
tt$plot
## Warning: Transformation introduced infinite values in continuous y-axis
## Transformation introduced infinite values in continuous y-axis
## Warning: Removed 5 rows containing missing values (`geom_bar()`).

I am going to mostly copy/paste the code from above, changing the inputs to fit the new combined data structure of human and parasite data.

ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(hslp_gene_expt)[[ref_col]]))

hslp_norm <- normalize_expt(hslp_gene_expt, transform = "log2", convert = "cpm")
## transform_counts: Found 2958956 values equal to 0, adding 1 to the matrix.
hslp_norm <- normalize_expt(hslp_norm, filter = "cv", cv_min = 0.1)
## Removing 2106 low-count genes (26461 remaining).
retained_lp <- grepl(x = rownames(exprs(hslp_norm)), pattern = "^LP")
sum(retained_lp)
## [1] 8536
nzv_thslp <- t(exprs(hslp_norm))
nzv_hslp_center <- preProcess(nzv_thslp, method = "center")
nzv_thslp <- predict(nzv_hslp_center, nzv_thslp)

nzv_thslp_correlated <- preProcess(nzv_thslp, method = "corr", cutoff = 0.95)
nzv_thslp_uncorr <- predict(nzv_thslp_correlated, nzv_thslp)

interesting_meta <- pData(hslp_norm)[, c("finaloutcome", "donor",
                                         "visitnumber", "selectionmethod",
                                         "typeofcells", "time", "clinic")]

hslp_ml_df <- as.data.frame(cbind(outcome_factor,
                                  as.data.frame(nzv_thslp_uncorr)))
hslp_ml_df[["outcome_factor"]] <- as.factor(hslp_ml_df[["outcome_factor"]])
dim(hslp_ml_df)
## [1]   210 16871

9 Split the data for training/testing

## The variable outcome_factor was created at the top of this document,
## which is a little awkward here.
train_idx <- createDataPartition(outcome_factor, p = 0.6,
                                 list = FALSE,
                                 times = 1)
summary(hslp_ml_df[train_idx, "outcome_factor"])
##    cure failure    lost 
##      78      39      10
## Training set has 127 samples.

train_rownames <- rownames(hslp_ml_df)[train_idx]
not_train_idx <- ! rownames(hslp_ml_df) %in% train_rownames
summary(hslp_ml_df[not_train_idx, "outcome_factor"])
##    cure failure    lost 
##      52      25       6
not_train_rownames <- rownames(hslp_ml_df)[not_train_idx]

train_df <- as.data.frame(hslp_ml_df[train_rownames, ])
dim(train_df)
## [1]   127 16871
test_df <- as.data.frame(hslp_ml_df[not_train_rownames, ])
dim(test_df)
## [1]    83 16871
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL
boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure lost
##    cure      69       9    0
##    failure    5      34    0
##    lost       3       0    7
## 
## Overall Statistics
##                                        
##                Accuracy : 0.866        
##                  95% CI : (0.794, 0.92)
##     No Information Rate : 0.606        
##     P-Value [Acc > NIR] : 1.13e-10     
##                                        
##                   Kappa : 0.742        
##                                        
##  Mcnemar's Test P-Value : NA           
## 
## Statistics by Class:
## 
##                      Class: cure Class: failure Class: lost
## Sensitivity                0.896          0.791      1.0000
## Specificity                0.820          0.940      0.9750
## Pos Pred Value             0.885          0.872      0.7000
## Neg Pred Value             0.837          0.898      1.0000
## Prevalence                 0.606          0.339      0.0551
## Detection Rate             0.543          0.268      0.0551
## Detection Prevalence       0.614          0.307      0.0787
## Balanced Accuracy          0.858          0.866      0.9875
names(knn_predict_trained) <- rownames(train_df)
summary(knn_predict_trained == pData(hslp_gene_expt)[train_idx, "finaloutcome"])
##    Mode   FALSE    TRUE 
## logical      17     110
knn_predict_test <- predict(knn_fit, test_df, type = "class")
## and the values, which will be used for our ROC curve.
knn_predict_test_values <- predict(knn_fit, test_df)

names(knn_predict_test) <- rownames(test_df)
knn_agreement <- knn_predict_test == test_outcome
names(knn_agreement) <- rownames(test_df)
summary(knn_agreement)
##    Mode   FALSE    TRUE 
## logical      25      58
names(knn_agreement)[knn_agreement == FALSE]
##  [1] "TMRC30015" "TMRC30016" "TMRC30019" "TMRC30177" "TMRC30023" "TMRC30071" "TMRC30119" "TMRC30144" "TMRC30164" "TMRC30173" "TMRC30012"
## [12] "TMRC30024" "TMRC30030" "TMRC30150" "TMRC30194" "TMRC30262" "TMRC30058" "TMRC30068" "TMRC30094" "TMRC30118" "TMRC30121" "TMRC30179"
## [23] "TMRC30189" "TMRC30222" "TMRC30263"
cv_control <- trainControl(method = "cv", number = 10)
knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
## Error: protect(): protection stack overflow
knn_train_fit[["bestTune"]]
##   k
## 1 1
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")

---
title: "TMRC3 ML Classification of outcome: 202305"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: tango
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include = FALSE}
library(hpgltools)
library(caret)
library(dplyr)
library(pROC)
library(DALEX)
library(glmnet)
library(glue)
library(kernlab)
library(ranger)
library(xgboost)
library(ggstatsplot)
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(
  progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  fig.pos = "t", fig.align = "center", dpi = if (knitr::is_latex_output()) 72 else 300,
  out.width = "100%", dev = "png",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "202305"
rundate <- format(Sys.Date(), format = "%Y%m%d")
previous_file <- ""
rmd_file <- glue("tmrc3_classifier_{ver}.Rmd")
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
loaded <- load(file = glue("rda/tmrc3_data_structures-v{ver}.rda"))
```

# Introduction

I had some success in classifying the TMRC2 samples by strain via ML
and want to try something more difficult.  Thus I will use the
normalized gene expression data and try classifying it by cure/fail.

# Starter data

In the strain classifier I used normalized variants.  I am thinking to
use normalized expression here and therefore explicitly limit myself
to ~ 20k variables (significantly down from the 1.6M).

In addition, caret expects the data as (rows == samples) and
(columns == variables) where each element is one observation.
Thus we will need to transpose the expression matrix.

```{r starter}
input_data <- subset_expt(tc_clinical, subset="batch!='biopsy'")
tc_norm <- normalize_expt(input_data, transform = "log2", convert = "cpm")
texprs <- t(exprs(tc_norm))

ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(input_data)[[ref_col]]))
```

# Filtering

The ML text I am reading provides some neat examples of how one might
filter the data to make it more suitable for model creation.

## Near zero variance, or genefilter's cv

The first filter I was introduced to is quite familiar from our
sequencing data, the removal of features with near-zero-variance.
Indeed, I am pretty certain that normalize_expt() could do this
equivalently and significantly faster than caret::preProcess().

```{r nzv}
system.time({
  equivalent <- normalize_expt(tc_norm, filter = "cv", cv_min = 0.1)
})
dim(exprs(equivalent))

## Given this large amount of data, this step is slow, taking > 10 minutes.
## Yeah seriously, the following three lines get 16,723 genes in 10 minutes while
## the normalize_expt() call above gets 16,749 genes in 2.4 seconds.
#system.time({
#  nzv <- preProcess(texprs, method="nzv", uniqueCut=15)
#  nzv_texprs <- predict(nzv, texprs)
#  dim(nzv_texprs)
#})
nzv_texprs <- t(exprs(equivalent))
```

## Filtering to the highest standard deviation variables

I am a bit confused by this filter; one would think that the nzv
filter above, if applied correctly, should give exactly this.

For the moment, I am excluding the following block in order to see how
much time/memory keeping these variables costs.  If I recall properly,
the model of the top-2k variant positions cost ~ 1-4G of memory.  I
hope that this scales linearly, but I am thinking it might not.

```{r exclude_sds, eval=FALSE}
standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:3000]
nzv_texprs <- nzv_texprs[, top_predictors]
```

## Center the data

I think centering may not be needed for this data, but here is how:

```{r center}
nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)
```

## Drop correlated

This is a filter which does not correspond to any of those we use in
sequencing data because genes which are highly correlated are
likely to be of immediate interest.

In the same fashion, I want to leave this off because later
applications of this model will include low coverage samples which may
not have every variant represented.

```{r correlated}
## This step takes a while...
system.time({
  nzv_correlated <- preProcess(nzv_texprs, method = "corr", cutoff = 0.95)
  nzv_uncorr <- predict(nzv_correlated, nzv_texprs)
})
dim(nzv_uncorr)
```

# Merge the appropriate metadata

There are a few metadata factors which might prove of interest for
classification.  The most obvious are of course outcome, clinic,
donor, visit, celltype.  I am, for the moment, only likely to focus on
outcome.  AFAICT I can only include one of these at a time in the
data, which is a shame.

```{r merge_pieces}
interesting_meta <- pData(input_data)[, c("finaloutcome", "donor", "persistence",
                                           "visitnumber", "selectionmethod",
                                           "typeofcells", "time", "clinic")]

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
dim(ml_df)
```

# Split the data into training/testing

caret provides nice functionality for splitting up the data.  I
suspect there are many more fun knobs I can play with for instances
where I need to exclude some levels of a factor and such.  In this
case I just want to split by outcome.

## Via data splitting

```{r split_train_test}
ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))

datasets <- create_partitions(nzv_uncorr, interesting_meta,
                              outcome_factor = outcome_factor)
```

## Via sampling

There are a few likely sampling methods: cross-validation,
bootstrapping, and jackknifing.  I will try those out later.
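
For reference, each of these resampling schemes is just a different
trainControl() configuration; a sketch of the three (the method names
are caret's, the numbers are placeholders):

```{r sampling_sketch, eval=FALSE}
## Not run: the three resampling flavors as trainControl() configurations.
cv_control <- trainControl(method = "cv", number = 10)      ## 10-fold cross-validation
boot_control <- trainControl(method = "boot", number = 20)  ## 20 bootstrap resamples
loo_control <- trainControl(method = "LOOCV")               ## leave-one-out, jackknife-like
```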

# Try out training and prediction methods

My goals from here on will be to get the beginnings of a sense of the
various methods I can use to create the models from the training data
and predict the outcome on the test data.  I am hoping also to pick up
some idea of what the various arguments mean while I am at it.

## Try out KNN

k-nearest neighbors is somewhat similar to a k-means estimate; thus
the primary argument is 'k'.
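
As a minimal illustration of what 'k' does, here is a toy sketch using
the classic knn() from the class package (which ships with R); this is
on iris and entirely unrelated to our expression data:

```{r knn_sketch, eval=FALSE}
## Not run: toy illustration of the effect of k, on iris rather than our data.
library(class)  ## recommended package, ships with R
set.seed(1)
idx <- sample(nrow(iris), 100)
## Each test sample gets the majority class among its k nearest training
## samples; small k follows the training data closely, large k smooths
## the decision boundary.
pred_k1 <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
               cl = iris[idx, 5], k = 1)
pred_k9 <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
               cl = iris[idx, 5], k = 9)
table(pred_k9, truth = iris[-idx, 5])
```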

### Model creation and performance

```{r knn}
split <- 1
train_all <- datasets[["trainers"]][[split]]
train_df <- datasets[["trainers_stripped"]][[split]]
train_idx <- datasets[["train_idx"]][[split]]
train_outcomes <- datasets[["trainer_outcomes"]][[split]]
test_df <- datasets[["testers"]][[split]]
test_idx <- datasets[["test_idx"]][[split]]
test_outcomes <- datasets[["tester_outcomes"]][[split]]

knn_fit <- knn3(x = train_df,
                y = train_outcomes,
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df, type = "prob")

knn_train_evaluated <- self_evaluate_model(knn_predict_trained, datasets,
                                           which = split, type = "train")
knn_train_evaluated
```

As the confusion matrix shows, this failed for a few samples.  Perhaps
let us change k and see if it improves.

Here is a table of false positives/negatives for a few values of 'k'.
In this context a false positive is calling a known cure a failure,
and a false negative is calling a known failure a cure.

| k | fp | fn |
|---|----|----|
| 2 | 0  | 8  |
| 3 | 5  | 5  |
| 4 | 8  | 9  |
| 5 | 11 | 7  |
| 6 | 15 | 8  |

Note: this depends on the luck of rand(), so the above numbers shift
moderately from one run to the next.  Thus I think I will just use 2
or 3.
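
The table above was filled in by hand; here is a sketch of a loop that
would regenerate it, assuming the train_df/train_outcomes objects from
the splitting chunk and treating 'cure' as the negative class:

```{r knn_k_scan, eval=FALSE}
## Not run: scan k and count miscalls on the training data.
for (k in 2:6) {
  fit <- knn3(x = train_df, y = train_outcomes, k = k)
  pred <- predict(fit, train_df, type = "class")
  fp <- sum(pred == "failure" & train_outcomes == "cure")  ## cure called failure
  fn <- sum(pred == "cure" & train_outcomes == "failure")  ## failure called cure
  message("k = ", k, ": fp = ", fp, ", fn = ", fn)
}
```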

```{r knn_fit2}
knn_fit2 <- knn3(x = train_df,
                 y = train_outcomes,
                 k = 5)
knn_predict_trained2 <- predict(knn_fit2, train_df, type = "prob")

knn_train_evaluated2 <- self_evaluate_model(knn_predict_trained2, datasets,
                                            which = split, type = "train")
knn_train_evaluated2
```

### Predict the rest of the data with this model.

```{r knn_test}
knn_predict_test <- predict(knn_fit, test_df)

knn_test_evaluated <- self_evaluate_model(knn_predict_test, datasets,
                                     which = split, type = "test")
knn_test_evaluated

knn_predict_test2 <- predict(knn_fit2, test_df)
knn_test_evaluated2 <- self_evaluate_model(knn_predict_test2, datasets,
                                           which = split, type = "test")
knn_test_evaluated2
```

## Perform cross-validation to estimate k

The cross validation method of repeated sampling the data is all done
within the train() function.  With that in mind, here it is operating
with the knn method.

### CV with knn

When train() is called with the trControl and tuneGrid, we can control
how the knn training is repeated, in this case it will iterate over k
from 1 to 10.

```{r cv_knn}
cv_control <- trainControl(method = "cv", number = 10)

knn_train_fit <- train(outcome ~ ., data = train_all,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]

plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")
```

### Bootstrap with knn

```{r boot_knn}
boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_train_fit <- train(outcome ~ ., data = train_all,
                       method = "knn",
                       trControl = boot_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]

plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")
```

### Explain the important variables

In this instance we will search for genes which were important for the
model's creation.

The DALEX package provides a function: feature_importance() which
seeks to use a series of other methods to extract (in this case,
genes) features which do a good job of explaining the result produced
by the model.  In the case of this dataset, which has thousands of
features, this does not appear to end well.

```{r explain_knn}
explainer_knn <- DALEX::explain(knn_fit, label = "knn",
                                data = train_df,
                                y = as.numeric(train_outcomes))

## AFAICT the following will take forever unless we drastically reduce the complexity of the model.
## yeah, I let it run for a week.
## features <- feature_importance(explainer_knn, n_sample = 50, type = "difference")
```

## Random Forest

The parameter 'mtry' is often important, if I read the text correctly
it controls how many variables to sample in each split of the tree.
Thus higher numbers should presumably make it more specific at the
risk of overfitting.

The min.node.size parameter sets the minimum size of terminal nodes in
each tree; each increment up speeds the algorithm.
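
For scale, ranger's default mtry for classification is the floored
square root of the number of predictors, so mtry = 200 samples many
more variables per split than the default would. A sketch, assuming
the train_df from the splitting chunk:

```{r mtry_sketch, eval=FALSE}
## Not run: ranger's classification default is floor(sqrt(p)).
p <- ncol(train_df)
default_mtry <- floor(sqrt(p))
default_mtry  ## roughly 130 for ~17k retained genes, vs. the 200 used below.
```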

I am going to use my boot control trainer from above and see how it goes.

```{r rf_test}
rf_train_fit <- train(outcome ~ ., data = train_all,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
rf_train_fit[["finalModel"]][["prediction.error"]]
variable_importance <- varImp(rf_train_fit)
plot(variable_importance, top = 15)
rf_variables <- variable_importance[["importance"]] %>%
  arrange(desc(Overall))

rf_predict_trained <- predict(rf_train_fit, train_df)
rf_predict_evaluated <- self_evaluate_model(rf_predict_trained, datasets,
                                            which = split, type = "train")
rf_predict_evaluated
```

## Compare topn important genes to DE genes

Given that we have separated the various analyses, it will take me a
minute to figure out where I saved the relevant differential
expression analysis.  I do not actually save the various DE results to
rda files by default, instead opting to send them to xlsx files to
share.  Recall if you will, that the data that I think might be used
for the paper also does not go into the default excel directory but
instead mirrors the box organization scheme.

Thus, I think the most relevant file is:
"analyses/4_tumaco/DE_Cure_vs_Fail/All_Samples/t_cf_clinical_tables_sva-v202207.xlsx"

```{r compare_ml_de_cf}
all_de_cf <- openxlsx::readWorkbook(
  "analyses/4_tumaco/DE_Cure_vs_Fail/t_all_visitcf_tables_sva-v202305.xlsx",
  sheet = 2, startRow = 2)
rownames(all_de_cf) <- all_de_cf[["row.names"]]
all_de_cf[["row.names"]] <- NULL
deseq_de_cf <- all_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]
```

### What would be shared between DESeq2 and the ML classifier?

Presumably DESeq and the models should be responding to variance in
the data, for which I think the logFC values, p-values, mean values,
or standard errors are the most likely proxies to which I have easy
access.  So, let us pull the top/bottom n genes vis-à-vis each of
those categories and see what happens.

```{r topn_deseq_vs_ml}
top_100_lfc <- all_de_cf %>%
  arrange(desc(deseq_logfc)) %>%
  top_n(n = 100, wt = deseq_logfc)
bottom_100_lfc <- all_de_cf %>%
  arrange(deseq_logfc) %>%
  top_n(n = -100, wt = deseq_logfc)
top_bottom_ids <- c(rownames(top_100_lfc), rownames(bottom_100_lfc))

top_400_ml <- rownames(head(rf_variables, n = 400))

comparison <- list("de" = top_bottom_ids, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

top_400_adjp <- all_de_cf %>%
  top_n(n = -400, wt = deseq_adjp)
lowest_adjp <- rownames(top_400_adjp)
comparison <- list("de" = lowest_adjp, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

top_400_exprs <- all_de_cf %>%
  top_n(n = 400, wt = deseq_basemean)
highest_exprs <- rownames(top_400_exprs)
comparison <- list("de" = highest_exprs, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

tt <- merge(all_de_cf, rf_variables, by = "row.names")
rownames(tt) <- tt[["Row.names"]]
tt[["Row.names"]] <- NULL
cor.test(tt[["deseq_logfc"]], tt[["Overall"]])
```

### Compare to the WGCNA results

A couple months ago I spent a little time attempting to recapitulate
Alejandro's WGCNA results.  I think I did so by mostly copy/pasting
his work and adding some commentary and tweaking parts of it so that
it was easier for me to read/understand.  In the process, I generated
a series of modules which looked similar/identical to his.
Unfortunately, I did not add some sections to record the genes/modules
to some output files.  I am therefore going back to that now and doing
so in the hopes that I can compare those modules to the results
produced by the classifiers.

```{r compare_wgcna}
wgcna_result <- openxlsx::readWorkbook(glue("excel/wgcna_interesting_genes-v{ver}.xlsx"))
rownames(wgcna_result) <- wgcna_result[["row.names"]]
wgcna_result[["row.names"]] <- NULL

top_100_ml <- rownames(head(rf_variables, n = 100))
comparison <- list("wgcna" = rownames(wgcna_result), "ml" = top_100_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
```

#### Digression: do the genes provided by varImp() mean anything?

Let us take a moment and see if the top-n genes returned by varImp()
have some meaning which jumps out.  One might assume, given our extant
Differential Expression results, that the interleukin response will be
a likely candidate.

```{r varimp_gp}
importance_gp <- simple_gprofiler(rownames(head(rf_variables, n = 300)))
importance_gp
```

### Now the random forest testers!

```{r rf_predict_test}
rf_predict_test <- predict(rf_train_fit, test_df)

rf_predict_test_evaluated <- self_evaluate_model(rf_predict_test, datasets,
                                     which = split, type = "test")
rf_predict_test_evaluated
```

## GLM, or Logistic regression and regularization

Logistic regression is a statistical method for binary responses.
However, it is able to work with multiple classes as well.  The
general idea of this method is to find parameters which increase the
likelihood that the observed data is sampled from a statistical
distribution of interest.  The transformations and linear
regression-esque tasks performed are confusing, but once those are
performed, the task becomes setting the model's (fitting) parameters
to values which increase the probability that the statistical model
looks like the actual dataset given the training data, and that when
samples, will return values which are similar.  The most likely
statistical distributions one will want to fit are the Gaussian, in
which case we want to transform/normalize the mean/variance of our
variables so they look whatever normal distribution we are using.
Conversely, logistic regression uses a binnomial distribution (like
our raw sequencing data!) but which is from 0-1.
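
To make the link function concrete, here is a self-contained base-R
sketch on simulated data: glm() fits the log-odds as a linear function
of x, and the predicted probability is just the inverse logit
(plogis) of that linear predictor:

```{r logit_sketch, eval=FALSE}
## Not run: simulate a binary outcome whose log-odds are linear in x.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))
fit <- glm(y ~ x, family = binomial)
## predict(type = "response") is plogis() applied to the linear predictor.
all.equal(predict(fit, type = "response"),
          plogis(predict(fit, type = "link")))
```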

### Using a single gene

Let us take the most important gene observed in one of our previous
training sets: ENSG00000248405 PRR5-ARHGAP8

```{r glm_regression}
gene_id <- "ENSG00000248405"
single_fit <- train(
    outcome ~ ENSG00000248405, data = train_all,
    method = "glm", family = "binomial", trControl = trainControl("none"))

tt <- data.frame("ENSG00000248405" = seq(min(train_df[[gene_id]]),
                                         max(train_df[[gene_id]]),
                                         length.out = 100))
## predict probabilities for the simulated data
tt[["subtype"]] <- predict(single_fit, newdata = tt, type = "prob")[, 1]
## plot the sigmoid curve and the training data
plot(ifelse(outcome == "cure", 1, 0) ~ ENSG00000248405,
     data = train_all, col = "red4",
     ylab = "CF as 0 or 1", xlab = "favorite gene expression")
lines(subtype ~ ENSG00000248405, tt, col = "green4", lwd = 2)

plot_df <- train_all[, c("outcome", "ENSG00000248405")]
ggbetweenstats(plot_df, "outcome", "ENSG00000248405")
```

Having tried with 1 gene, let us extend this to all genes.  In my
first try of this, it took a long time.

```{r glm_all}
glm_train_fit <- train(outcome ~ ., data = train_all,
                 trControl = boot_control,
                 method = "glm", family = "binomial")
```

## Compare GLM and WGCNA/DE

```{r compare_glm_wgcna_de}
glm_variable_importance <- varImp(glm_train_fit)
## Oh, this only produces 100 entries -- so me getting the top 400 is silly.
glm_variables <- glm_variable_importance[["importance"]] %>%
  arrange(desc(Overall))
plot(glm_variable_importance, top = 15)
simple_gprofiler(rownames(head(glm_variables, n = 100)))

top_glm <- rownames(glm_variables)
comparison <- list("de" = top_bottom_ids, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = lowest_adjp, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = highest_exprs, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("wgcna" = rownames(wgcna_result), "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
```

```{r glm_test}
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
## train_method <- trainControl(method = "cv", number = 10)
glm_fit <- train(outcome ~ ., data = train_all, method = "glmnet",
                 trControl = boot_control, importance = "permutation",
                 tuneGrid = data.frame(
                   alpha = 0.5,
                   lambda = seq(0.1, 0.7, 0.05)),
                 verbose = TRUE)
glm_fit

glm_predict_trained <- predict(glm_fit, train_df)

glm_train_eval <- self_evaluate_model(glm_predict_trained, datasets,
                                      which = split, type = "train")
glm_train_eval
```

### Now the GLM testers!

```{r glm_predict_test}
glm_predict_test <- predict(glm_fit, test_df)

glm_fit_eval_test <- self_evaluate_model(glm_predict_test, datasets,
                                         which = split, type = "test")
glm_fit_eval_test
```

### Compare again vs DE/WGCNA

```{r glmv2_vs_de_wgcna}
variable_importance <- varImp(glm_fit)
plot(variable_importance, top = 15)

top_400_ml <- rownames(head(variable_importance$importance, n = 400))
top_100_ml <- rownames(head(variable_importance$importance, n = 100))

comparison <- list("de" = top_bottom_ids, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = lowest_adjp, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = highest_exprs, "ml" = top_400_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("wgcna" = rownames(wgcna_result), "ml" = top_100_ml)
comparison_venn <- Vennerable::Venn(comparison)
fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
```

## Gradient Booster

```{r gb_test}
train_method <- trainControl(method = "cv", number = 10)

gb_fit <- train(outcome ~ ., data = train_all,
                method = "xgbTree", trControl = train_method,
                tuneGrid = data.frame(
                    nrounds = 200,
                    eta = c(0.05, 0.1, 0.3),
                    max_depth = 4,
                    gamma = 0,
                    colsample_bytree = 1,
                    subsample = 0.5,
                    min_child_weight = 1),
                verbose = TRUE)

gb_predict_trained <- predict(gb_fit, train_df)
gb_predict_trained

gb_train_eval <- self_evaluate_model(gb_predict_trained, datasets,
                                     which = split, type = "train")
gb_train_eval
```

### Now the GB testers!

```{r gb_predict_test}
gb_predict_test <- predict(gb_fit, test_df)

gb_predict_test_evaluated <- self_evaluate_model(gb_predict_test, datasets,
                                                 which = split, type = "test")
gb_predict_test_evaluated
```

## SVM

```{r svm_test}
train_method <- trainControl(method = "cv", number = 10)

svm_train_fit <- train(outcome ~ ., data = train_all,
                       method = "svmRadial", trControl = train_method,
                       tuneGrid = data.frame(
                         C = c(0.25, 0.5, 1),
                         sigma = 1),
                       verbose = TRUE)
svm_train_fit

svm_predict_trained <- predict(svm_train_fit, train_df)

svm_predict_evaluated <- self_evaluate_model(svm_predict_trained, datasets,
                                             which = split, type = "train")
svm_predict_evaluated
```
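Hard-coding `sigma = 1` above is a guess; by default caret's `svmRadial` derives its sigma grid from the data with `kernlab::sigest()`. A minimal sketch on synthetic data (not the TMRC2 matrix) shows what a data-driven sigma range looks like:

```r
library(kernlab)
set.seed(42)
## 100 fake samples x 20 fake features.
x <- matrix(rnorm(100 * 20), nrow = 100)
## sigest() returns the 0.1, 0.5, and 0.9 quantiles of an estimate of the
## inverse kernel width; caret uses these to anchor its default sigma.
sig <- sigest(x, scaled = FALSE)
sig
```

If the hard-coded `sigma = 1` sits far outside this range on the real data, the radial kernel is likely too narrow or too wide for the tuning grid to recover.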

### Now the SVM testers!

```{r svm_predict_test}
svm_predict_test <- predict(svm_train_fit, test_df)

svm_predict_test_evaluated <- self_evaluate_model(svm_predict_test, datasets,
                                                  which = split, type = "test")
svm_predict_test_evaluated
```

# Reapply to persistence

```{r persistence}
ref_col <- "persistence"
## nzv_uncorr was built from the biopsy-excluded subset, so pull the metadata
## from input_data (166 samples) rather than the full tc_clinical (184).
outcome_factor <- as.factor(as.character(pData(input_data)[[ref_col]]))

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
only_yn <- ml_df[["outcome_factor"]] == "Y" | ml_df[["outcome_factor"]] == "N"
outcome_factor <- as.factor(as.character(outcome_factor[only_yn]))


ml_df <- ml_df[only_yn, ]
ml_df[["outcome_factor"]] <- as.factor(as.character(ml_df[["outcome_factor"]]))
summary(ml_df[["outcome_factor"]])

train_idx <- createDataPartition(outcome_factor, p = 0.5,
                                 list = FALSE,
                                 times = 1)
summary(ml_df[train_idx, "outcome_factor"])

train_rownames <- rownames(ml_df)[train_idx]
not_train_idx <- ! rownames(ml_df) %in% train_rownames
summary(ml_df[not_train_idx, "outcome_factor"])
not_train_rownames <- rownames(ml_df)[not_train_idx]

train_df <- as.data.frame(ml_df[train_rownames, ])
dim(train_df)
test_df <- as.data.frame(ml_df[not_train_rownames, ])
dim(test_df)
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL

train_fit <- train(outcome_factor ~ ., data = train_df,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
train_fit[["finalModel"]][["prediction.error"]]
variable_importance <- varImp(train_fit)
plot(variable_importance, top = 15)

rf_predict_trained <- predict(train_fit, train_df)
confusionMatrix(data = rf_predict_trained, reference = train_df[["outcome_factor"]])

rf_predict_test <- predict(train_fit, test_df)
names(rf_predict_test) <- rownames(test_df)
used_names <- names(rf_predict_test)
used_pdata <- pData(tc_norm)[used_names, ]

rf_agreement <- as.character(rf_predict_test) == used_pdata[[ref_col]]
names(rf_agreement) <- rownames(test_df)
summary(rf_agreement)
names(rf_agreement)[rf_agreement == FALSE]
```

# Try something a little different

I redid the mappings with a combined host/parasite transcriptome in
the hopes that it might provide some interesting variance for these
classifiers. Let us poke at it and see whether anything is relevant
or whether it was a bad idea.

```{r poke_combined}
wanted_idx <- grepl(x = rownames(exprs(hslp_gene_expt)), pattern = "^LP")
wanted_ids <- rownames(exprs(hslp_gene_expt))[wanted_idx]
test_lp <- exclude_genes_expt(hslp_gene_expt, ids = wanted_ids, method = "keep")

tt <- plot_libsize(test_lp)
tt$plot
```

I am going to mostly copy/paste the code from above, changing the
inputs to fit the new combined data structure of human and parasite data.

```{r classify_hpcf}
ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(hslp_gene_expt)[[ref_col]]))

hslp_norm <- normalize_expt(hslp_gene_expt, transform = "log2", convert = "cpm")
hslp_norm <- normalize_expt(hslp_norm, filter = "cv", cv_min = 0.1)
retained_lp <- grepl(x = rownames(exprs(hslp_norm)), pattern = "^LP")
sum(retained_lp)

nzv_thslp <- t(exprs(hslp_norm))
nzv_hslp_center <- preProcess(nzv_thslp, method = "center")
nzv_thslp <- predict(nzv_hslp_center, nzv_thslp)

nzv_thslp_correlated <- preProcess(nzv_thslp, method = "corr", cutoff = 0.95)
nzv_thslp_uncorr <- predict(nzv_thslp_correlated, nzv_thslp)

interesting_meta <- pData(hslp_norm)[, c("finaloutcome", "donor",
                                         "visitnumber", "selectionmethod",
                                         "typeofcells", "time", "clinic")]

hslp_ml_df <- as.data.frame(cbind(outcome_factor,
                                  as.data.frame(nzv_thslp_uncorr)))
hslp_ml_df[["outcome_factor"]] <- as.factor(hslp_ml_df[["outcome_factor"]])
dim(hslp_ml_df)
```

# Split the data for training/testing

```{r split_train_test}
## outcome_factor was redefined in the classify_hpcf chunk above for the
## combined host/parasite data.
train_idx <- createDataPartition(outcome_factor, p = 0.6,
                                 list = FALSE,
                                 times = 1)
summary(hslp_ml_df[train_idx, "outcome_factor"])

train_rownames <- rownames(hslp_ml_df)[train_idx]
not_train_idx <- ! rownames(hslp_ml_df) %in% train_rownames
summary(hslp_ml_df[not_train_idx, "outcome_factor"])
not_train_rownames <- rownames(hslp_ml_df)[not_train_idx]

train_df <- as.data.frame(hslp_ml_df[train_rownames, ])
dim(train_df)
test_df <- as.data.frame(hslp_ml_df[not_train_rownames, ])
dim(test_df)
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL
```
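Note that `createDataPartition()` does stratified sampling: it splits within each outcome level, so train and test keep the cure/fail proportions, which a plain `sample()` would not guarantee. A toy example:

```r
library(caret)
set.seed(42)
## A deliberately imbalanced 3:1 outcome.
toy <- factor(c(rep("cure", 30), rep("fail", 10)))
idx <- createDataPartition(toy, p = 0.6, list = FALSE, times = 1)
## Both splits keep roughly the original 3:1 ratio.
table(toy[idx])
table(toy[-idx])
```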

```{r knn_hslp}
boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = knn_predict_trained, reference = train_df[["outcome_factor"]])

names(knn_predict_trained) <- rownames(train_df)
summary(knn_predict_trained == pData(hslp_gene_expt)[train_idx, "finaloutcome"])

knn_predict_test <- predict(knn_fit, test_df, type = "class")
## and the values, which will be used for our ROC curve.
knn_predict_test_values <- predict(knn_fit, test_df)

names(knn_predict_test) <- rownames(test_df)
knn_agreement <- knn_predict_test == test_outcome
names(knn_agreement) <- rownames(test_df)
summary(knn_agreement)
names(knn_agreement)[knn_agreement == FALSE]

cv_control <- trainControl(method = "cv", number = 10)
knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]

plot(x = 1:10, y = 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, y = 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")

```
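The class probabilities saved in `knn_predict_test_values` above are what feed a ROC curve. A minimal sketch with `pROC` (an assumption; any ROC package would do) on synthetic probabilities:

```r
library(pROC)
set.seed(42)
truth <- factor(rep(c("cure", "fail"), each = 25))
## Fake P(fail) scores: failures tend to score higher, with some overlap.
prob_fail <- c(runif(25, 0, 0.6), runif(25, 0.4, 1))
roc_obj <- roc(response = truth, predictor = prob_fail,
               levels = c("cure", "fail"), direction = "<")
auc(roc_obj)
plot(roc_obj)
```

On the real data, the predictor would be one column of `knn_predict_test_values` and the response would be `test_outcome`.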
