I had some success in classifying the TMRC2 samples by strain via ML and want to try something more difficult. Thus I will use the normalized gene expression data and try classifying it by cure/fail.
<- "finaloutcome"
ref_col <- as.factor(as.character(pData(tc_clinical)[[ref_col]])) outcome_factor
In the strain classifier I used normalized variants. Here I plan to use normalized expression, which explicitly limits me to ~20k variables (down significantly from the 1.6M variant positions).
In addition, caret expects the data as (rows == samples) and (columns == variables) where each element is one observation. Thus we will need to transpose the expression matrix.
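As a minimal illustration of that orientation, using a toy matrix (not the TMRC data):

```r
## Hypothetical toy matrix: genes in rows, samples in columns,
## as exprs() returns it.
expr_mtrx <- matrix(1:6, nrow = 3,
                    dimnames = list(paste0("gene", 1:3), paste0("sample", 1:2)))
dim(expr_mtrx)   ## 3 genes x 2 samples
## caret wants samples in rows and variables in columns, so transpose:
t_mtrx <- t(expr_mtrx)
dim(t_mtrx)      ## 2 samples x 3 genes
```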
tc_norm <- normalize_expt(tc_clinical, transform = "log2", convert = "cpm")
## transform_counts: Found 1005383 values equal to 0, adding 1 to the matrix.
texprs <- t(exprs(tc_norm))
The ML text I am reading provides some neat examples of how one might filter the data to make it more suitable for model creation.
The first filter I was introduced to is quite familiar from our sequencing data: the removal of features with near-zero variance. Indeed, I am pretty certain that normalize_expt() can do this equivalently and significantly faster than caret::preProcess().
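The coefficient-of-variation filter is easy to sketch in base R; this is only an illustration of the idea (per-gene sd/mean compared against a cutoff), not the normalize_expt() implementation:

```r
set.seed(1)
## Toy matrix: 100 genes x 10 samples; make the first 10 genes nearly constant.
expr_mtrx <- matrix(rnorm(1000, mean = 10), nrow = 100)
expr_mtrx[1:10, ] <- 10 + rnorm(100, sd = 0.001)
## Coefficient of variation per gene; drop genes below the threshold.
cv <- apply(expr_mtrx, 1, function(x) sd(x) / mean(x))
keep <- cv >= 0.01
filtered <- expr_mtrx[keep, ]
dim(filtered)   ## the near-constant genes are gone
```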
system.time({
  equivalent <- normalize_expt(tc_norm, filter = "cv", cv_min = 0.1)
})
## Removing 3174 low-count genes (16749 remaining).
## user system elapsed
## 4.712 0.565 5.309
dim(exprs(equivalent))
## [1] 16749 184
## Given this large amount of data, this step is slow, taking > 10 minutes.
## Yeah seriously, the following three lines get 16,723 genes in 10 minutes while
## the normalize_expt() call above gets 16,749 genes in 2.4 seconds.
#system.time({
#  nzv <- preProcess(texprs, method = "nzv", uniqueCut = 15)
#  nzv_texprs <- predict(nzv, texprs)
#  dim(nzv_texprs)
#})
nzv_texprs <- t(exprs(equivalent))
I think I am a bit confused by this filter; one would think that the nzv filter above, if applied correctly, should give exactly this result.
For the moment, I am excluding the following block in order to see how much time/memory keeping these variables costs. If I recall properly, the model of the top-2k variant positions cost ~ 1-4G of memory. I hope that this scales linearly, but I am thinking it might not.
standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:3000]
nzv_texprs <- nzv_texprs[, top_predictors]
I think centering may not be needed for this data, but here is how:
nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)
There are a few metadata factors which might prove of interest for classification. The most obvious are of course outcome, clinic, donor, visit, and cell type. For the moment I will focus only on outcome. AFAICT I can only include one of these at a time in the data, which is a shame.
<- pData(tc_clinical)[, c("finaloutcome", "donor", "persistence",
interesting_meta "visitnumber", "selectionmethod",
"typeofcells", "time", "clinic")]
<- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df "outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
ml_df[[dim(ml_df)
## [1] 184 12081
caret provides nice functionality for splitting up the data. I suspect there are many more fun knobs I can play with for instances where I need to exclude some levels of a factor and such. In this case I just want to split by outcome.
## The variable outcome_factor was created at the top of this document,
## which is a little awkward here.
train_idx <- createDataPartition(outcome_factor, p = 0.6,
                                 list = FALSE,
                                 times = 1)
summary(ml_df[train_idx, "outcome_factor"])
## cure failure
## 74 38
## Training set has 112 samples.
train_rownames <- rownames(ml_df)[train_idx]
not_train_idx <- ! rownames(ml_df) %in% train_rownames
summary(ml_df[not_train_idx, "outcome_factor"])
## cure failure
## 48 24
not_train_rownames <- rownames(ml_df)[not_train_idx]
train_df <- as.data.frame(ml_df[train_rownames, ])
dim(train_df)
## [1] 112 12081
test_df <- as.data.frame(ml_df[not_train_rownames, ])
dim(test_df)
## [1] 72 12081
## Remove the outcome factor from the test data, just in case.
<- test_df[["outcome_factor"]]
test_outcome "outcome_factor"]] <- NULL test_df[[
There are a few likely sampling methods: cross-validation, bootstrapping, and jackknifing. I will try those out later.
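The three schemes differ only in how the training indices are drawn; here is an illustrative base-R sketch of each (the idea only, not what trainControl() does internally):

```r
set.seed(42)
n <- 12
## Bootstrap: draw n indices with replacement; out-of-bag samples are the holdout.
boot_idx <- sample(n, n, replace = TRUE)
oob <- setdiff(seq_len(n), boot_idx)
## k-fold cross-validation: partition the indices into k disjoint folds.
k <- 3
folds <- split(sample(n), rep(seq_len(k), length.out = n))
## Jackknife (leave-one-out): each iteration drops exactly one sample.
loo <- lapply(seq_len(n), function(i) setdiff(seq_len(n), i))
lengths(folds)   ## three folds of 4 samples each
length(loo)      ## 12 leave-one-out training sets
```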
My goals from here on will be to get the beginnings of a sense of the various methods I can use to create the models from the training data and predict the outcome on the test data. I am hoping also to pick up some idea of what the various arguments mean while I am at it.
k-nearest neighbors is somewhat similar to a k-means estimate, so the primary argument is ‘k’, the number of neighbors consulted for each classification.
knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cure failure
## cure 69 5
## failure 4 34
##
## Accuracy : 0.92
## 95% CI : (0.853, 0.963)
## No Information Rate : 0.652
## P-Value [Acc > NIR] : 3.5e-11
##
## Kappa : 0.822
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.945
## Specificity : 0.872
## Pos Pred Value : 0.932
## Neg Pred Value : 0.895
## Prevalence : 0.652
## Detection Rate : 0.616
## Detection Prevalence : 0.661
## Balanced Accuracy : 0.909
##
## 'Positive' Class : cure
##
As the confusion matrix shows, this failed for a few samples. Let us change k and see if it improves.
Here is a table of false positives/negatives for a few values of ‘k’; in this context a false positive is calling a known cure a failure and a false negative is calling a known failure a cure.
| k | fp | fn |
|---|----|----|
| 2 | 0  | 8  |
| 3 | 5  | 5  |
| 4 | 8  | 9  |
| 5 | 11 | 7  |
| 6 | 15 | 8  |
Note: this depends on the luck of rand(), so the above numbers shift moderately from one run to the next. Thus I think I will just use 2 or 3.
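A table like the one above can be regenerated with a loop along these lines; this sketch uses class::knn() (which ships with R) and simulated data shaped like the training set so it runs standalone:

```r
library(class)   ## ships with R; knn() does the classification
set.seed(3)
## Simulated stand-in for train_df: 112 samples x 50 genes with a weak signal
## separating "cure" from "failure" in the first 5 genes.
x <- matrix(rnorm(112 * 50), nrow = 112)
y <- factor(rep(c("cure", "failure"), times = c(74, 38)))
x[y == "failure", 1:5] <- x[y == "failure", 1:5] + 1
for (k in 2:6) {
  pred <- knn(train = x, test = x, cl = y, k = k)
  fp <- sum(pred == "failure" & y == "cure")   ## known cure called a failure
  fn <- sum(pred == "cure" & y == "failure")   ## known failure called a cure
  message("k = ", k, ": fp = ", fp, ", fn = ", fn)
}
```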
knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cure failure
## cure 69 5
## failure 4 34
##
## Accuracy : 0.92
## 95% CI : (0.853, 0.963)
## No Information Rate : 0.652
## P-Value [Acc > NIR] : 3.5e-11
##
## Kappa : 0.822
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.945
## Specificity : 0.872
## Pos Pred Value : 0.932
## Neg Pred Value : 0.895
## Prevalence : 0.652
## Detection Rate : 0.616
## Detection Prevalence : 0.661
## Balanced Accuracy : 0.909
##
## 'Positive' Class : cure
##
names(knn_predict_trained) <- rownames(train_df)
summary(knn_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
## Mode FALSE TRUE
## logical 9 103
tt <- predict(knn_fit, train_df[, -1])
curve <- pROC::roc(response = train_df[, 1], predictor = tt[, 1],
                   levels = rev(levels(outcome_factor)))
## Setting direction: controls < cases
plot(curve)
pROC::auc(curve)
## Area under the curve: 0.967
## Return the classifications
knn_predict_test <- predict(knn_fit, test_df, type = "class")
## and the values, which will be used for our ROC curve.
knn_predict_test_values <- predict(knn_fit, test_df)
names(knn_predict_test) <- rownames(test_df)
knn_agreement <- knn_predict_test == test_outcome
names(knn_agreement) <- rownames(test_df)
summary(knn_agreement)
## Mode FALSE TRUE
## logical 16 56
names(knn_agreement)[knn_agreement == FALSE]
## [1] "TMRC30018" "TMRC30210" "TMRC30105" "TMRC30164" "TMRC30119" "TMRC30103" "TMRC30026" "TMRC30165" "TMRC30044" "TMRC30166" "TMRC30200"
## [12] "TMRC30177" "TMRC30208" "TMRC30218" "TMRC30220" "TMRC30264"
curve <- pROC::roc(response = test_outcome,
                   predictor = knn_predict_test_values[, 1])
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases
plot(curve)
pROC::auc(curve)
## Area under the curve: 0.803
## Find the samples which (dis)agree
<- "finaloutcome"
ref_col <- pData(tc_clinical)[[ref_col]]
ref_result names(ref_result) <- rownames(pData(tc_clinical))
<- names(ref_result) %in% names(knn_predict_test)
ref_xref <- ref_result[ref_xref]
ref_comp <- ref_comp != "unknown"
ref_comp_idx <- ref_comp[ref_comp_idx]
ref_comp <- !is.na(ref_comp)
ref_comp_idx <- ref_comp[ref_comp_idx]
ref_comp
<- names(knn_predict_test) %in% names(ref_comp)
test_comp_idx <- knn_predict_test[test_comp_idx]
test_comp
agreement <- as.character(ref_comp) == as.character(test_comp)
agree_num <- sum(agreement)
agree_prop <- agree_num / length(ref_comp)
agree_prop
## [1] 0.7778
disagree <- as.character(ref_comp) != as.character(test_comp)
test_comp[disagree]
## TMRC30018 TMRC30210 TMRC30105 TMRC30164 TMRC30119 TMRC30103 TMRC30026 TMRC30165 TMRC30044 TMRC30166 TMRC30200 TMRC30177 TMRC30208 TMRC30218
## failure failure cure failure cure failure cure failure failure failure cure cure cure cure
## TMRC30220 TMRC30264
## cure cure
## Levels: cure failure
The cross-validation method of repeatedly resampling the data is handled within the train() function. With that in mind, here it is operating with the knn method.
When train() is called with trControl and tuneGrid, we can control how the knn training is repeated; in this case it will iterate over k from 1 to 10.
<- trainControl(method = "cv", number = 10)
cv_control <- train(outcome_factor ~ ., data = train_df,
knn_train_fit method = "knn",
trControl = cv_control,
tuneGrid = data.frame(k = 1:10))
"bestTune"]] knn_train_fit[[
## k
## 1 1
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")
<- trainControl(method = "boot", number = 20,
boot_control returnResamp = "all")
<- train(outcome_factor ~ ., data = train_df,
knn_train_fit method = "knn",
trControl = boot_control,
tuneGrid = data.frame(k = 1:10))
"bestTune"]] knn_train_fit[[
## k
## 1 1
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")
In this instance we will search for genes which were important for the model’s creation.
The DALEX package provides a function, feature_importance(), which uses a series of methods to extract the features (in this case, genes) which do a good job of explaining the result produced by the model. With this dataset, which has thousands of features, this does not appear to end well.
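The principle behind feature_importance() is permutation importance: shuffle one feature at a time, re-measure the model's loss, and rank features by how much the loss degrades. A tiny base-R sketch of that principle on a linear model (illustrative only, not the DALEX implementation):

```r
set.seed(7)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 + 0.2 * df$x2 + rnorm(n)   ## x1 matters, x3 does not
fit <- lm(y ~ ., data = df)
base_loss <- mean((predict(fit, df) - df$y)^2)
## Shuffle one feature at a time and measure how much the loss degrades.
perm_importance <- sapply(c("x1", "x2", "x3"), function(v) {
  shuffled <- df
  shuffled[[v]] <- sample(shuffled[[v]])   ## break this feature's association
  mean((predict(fit, shuffled) - shuffled$y)^2) - base_loss
})
sort(perm_importance, decreasing = TRUE)   ## x1 should dominate
```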
explainer_knn <- explain(knn_fit, label = "knn",
                         data = train_df[, -1],
                         y = as.numeric(train_df[, 1]))
## Preparation of a new explainer is initiated
## -> model label : knn
## -> data : 112 rows 12080 cols
## -> target variable : 112 values
## -> predict function : yhat.default will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package Model of class: knn3 package unrecognized , ver. Unknown , task regression ( default )
## -> predicted values : numerical, min = 0 , mean = 0.3988 , max = 1
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = 0.3333 , mean = 0.9405 , max = 1.667
## A new explainer has been created!
## AFAICT the following will take forever unless we drastically reduce the complexity of the model.
## features <- feature_importance(explainer_knn, n_sample = 50, type = "difference")
The parameter ‘mtry’ is often important; if I read the text correctly, it controls how many variables to sample in each split of the tree. Higher numbers should presumably make it more specific at the risk of overfitting.
Setting min.node.size sets the minimum size of the terminal nodes in each tree. Each increment up speeds the algorithm.
I am going to use my boot control trainer from above and see how it goes.
library(ranger)
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
rf_train_fit <- train(outcome_factor ~ ., data = train_df,
                      method = "ranger", trControl = boot_control,
                      importance = "permutation",
                      tuneGrid = data.frame(
                        mtry = 200,
                        min.node.size = 1,
                        splitrule = "gini"),
                      verbose = TRUE)
rf_train_fit[["finalModel"]][["prediction.error"]]
## [1] 0.1964
variable_importance <- varImp(rf_train_fit)
plot(variable_importance, top = 15)
rf_predict_trained <- predict(rf_train_fit, train_df)
names(rf_predict_trained) <- rownames(train_df)
summary(rf_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
## Mode TRUE
## logical 112
Let us take a moment and see if the top-n genes returned by varImp() have some meaning which jumps out. One might assume, given our extant Differential Expression results, that the interleukin response will be a likely candidate.
test_importance <- varImp(rf_train_fit, scale = TRUE)
test_idx <- order(test_importance[["importance"]][["Overall"]], decreasing = TRUE)
test_imp <- test_importance[["importance"]][test_idx, ]
names(test_imp) <- rownames(test_importance[["importance"]])[test_idx]
head(test_imp, n = 200)
## ENSG00000173702 ENSG00000103642 ENSG00000183741 ENSG00000248405 ENSG00000164303 ENSG00000108774 ENSG00000181404 ENSG00000172426 ENSG00000126709
## 100.00 84.95 83.39 76.51 68.64 66.18 62.33 62.17 61.17
## ENSG00000137965 ENSG00000134758 ENSG00000185304 ENSG00000180720 ENSG00000187608 ENSG00000160201 ENSG00000187950 ENSG00000081479 ENSG00000129295
## 60.16 59.37 54.80 54.13 53.95 50.57 50.49 48.42 47.02
## ENSG00000256870 ENSG00000237440 ENSG00000168280 ENSG00000106009 ENSG00000160932 ENSG00000172613 ENSG00000197826 ENSG00000174898 ENSG00000059378
## 46.33 45.91 44.97 44.19 43.11 42.47 42.25 41.59 41.24
## ENSG00000204632 ENSG00000182253 ENSG00000142396 ENSG00000126003 ENSG00000167608 ENSG00000073734 ENSG00000103226 ENSG00000140104 ENSG00000115155
## 39.88 39.29 38.36 38.30 37.33 37.10 35.40 34.21 33.74
## ENSG00000163435 ENSG00000138642 ENSG00000138035 ENSG00000167098 ENSG00000173193 ENSG00000136206 ENSG00000090339 ENSG00000166920 ENSG00000226288
## 33.73 33.60 33.11 33.09 32.59 32.42 32.13 31.51 31.12
## ENSG00000101605 ENSG00000196345 ENSG00000236444 ENSG00000284776 ENSG00000181315 ENSG00000187626 ENSG00000184979 ENSG00000184508 ENSG00000176945
## 31.12 31.05 30.78 30.68 30.40 30.30 30.22 29.79 29.57
## ENSG00000155530 ENSG00000130813 ENSG00000260272 ENSG00000204428 ENSG00000167962 ENSG00000168255 ENSG00000108830 ENSG00000174132 ENSG00000198870
## 29.30 29.25 28.72 28.68 28.36 27.99 27.67 27.57 27.48
## ENSG00000284194 ENSG00000104691 ENSG00000179933 ENSG00000286169 ENSG00000186654 ENSG00000151005 ENSG00000135365 ENSG00000130876 ENSG00000146809
## 27.45 27.38 27.12 26.98 26.93 26.90 26.86 26.79 26.71
## ENSG00000095015 ENSG00000096060 ENSG00000144560 ENSG00000196369 ENSG00000106436 ENSG00000185347 ENSG00000197822 ENSG00000148572 ENSG00000125821
## 26.55 26.55 26.47 26.33 26.29 26.28 26.11 26.07 25.98
## ENSG00000178084 ENSG00000144744 ENSG00000087303 ENSG00000176222 ENSG00000173698 ENSG00000176390 ENSG00000127314 ENSG00000166295 ENSG00000177875
## 25.98 25.86 25.79 25.79 25.77 25.64 25.49 25.49 25.29
## ENSG00000100523 ENSG00000161640 ENSG00000253626 ENSG00000054179 ENSG00000214360 ENSG00000179455 ENSG00000165164 ENSG00000198189 ENSG00000150455
## 25.18 25.11 24.94 24.91 24.91 24.69 24.58 24.45 24.42
## ENSG00000121058 ENSG00000163116 ENSG00000110768 ENSG00000136574 ENSG00000134326 ENSG00000105248 ENSG00000188157 ENSG00000148339 ENSG00000179407
## 24.42 24.41 24.41 24.39 24.39 24.27 24.22 24.22 24.05
## ENSG00000136244 ENSG00000205857 ENSG00000162627 ENSG00000183454 ENSG00000260001 ENSG00000269028 ENSG00000206560 ENSG00000204519 ENSG00000137628
## 24.00 23.78 23.73 23.68 23.65 23.50 23.47 23.43 23.42
## ENSG00000232629 ENSG00000084734 ENSG00000158477 ENSG00000185361 ENSG00000154654 ENSG00000179008 ENSG00000196839 ENSG00000168958 ENSG00000107736
## 23.32 23.21 23.20 23.01 22.99 22.97 22.96 22.95 22.93
## ENSG00000284505 ENSG00000172456 ENSG00000188981 ENSG00000137331 ENSG00000188039 ENSG00000104856 ENSG00000164007 ENSG00000006634 ENSG00000110906
## 22.93 22.91 22.91 22.84 22.80 22.80 22.72 22.61 22.61
## ENSG00000119522 ENSG00000168671 ENSG00000128872 ENSG00000169427 ENSG00000186714 ENSG00000172986 ENSG00000103375 ENSG00000134690 ENSG00000163645
## 22.55 22.52 22.50 22.43 22.37 22.29 22.29 22.29 22.29
## ENSG00000185038 ENSG00000153498 ENSG00000009790 ENSG00000160867 ENSG00000121653 ENSG00000133106 ENSG00000174145 ENSG00000053770 ENSG00000146007
## 22.29 22.25 22.25 22.25 22.18 22.18 22.09 22.09 22.09
## ENSG00000146205 ENSG00000155158 ENSG00000162972 ENSG00000165688 ENSG00000179344 ENSG00000241149 ENSG00000178150 ENSG00000096006 ENSG00000111450
## 22.09 22.09 22.09 22.09 22.09 22.09 22.02 22.01 21.92
## ENSG00000183793 ENSG00000166971 ENSG00000170807 ENSG00000161594 ENSG00000232268 ENSG00000241553 ENSG00000133169 ENSG00000120217 ENSG00000260097
## 21.90 21.90 21.86 21.85 21.83 21.82 21.82 21.82 21.76
## ENSG00000055332 ENSG00000158865 ENSG00000162714 ENSG00000166133 ENSG00000143156 ENSG00000137204 ENSG00000104267 ENSG00000105662 ENSG00000174243
## 21.75 21.71 21.71 21.71 21.71 21.63 21.54 21.54 21.54
## ENSG00000070761 ENSG00000085871 ENSG00000162775 ENSG00000137802 ENSG00000152767 ENSG00000158806 ENSG00000204001 ENSG00000125900 ENSG00000010818
## 21.54 21.54 21.53 21.42 21.42 21.42 21.39 21.38 21.38
## ENSG00000125355 ENSG00000149124 ENSG00000182348 ENSG00000115419 ENSG00000102271 ENSG00000241399 ENSG00000112419 ENSG00000152779 ENSG00000169994
## 21.38 21.38 21.33 21.31 21.22 21.22 21.22 21.22 21.22
## ENSG00000004948 ENSG00000145920
## 21.09 21.09
most_important_ids <- names(head(test_imp, n = 200))
convert_ids(most_important_ids, from = "ENSEMBL", to = "SYMBOL")
## Before conversion: 200, after conversion: 197.
## ENSEMBL SYMBOL
## 1 ENSG00000173702 MUC13
## 2 ENSG00000103642 LACTB
## 3 ENSG00000183741 CBX6
## 4 ENSG00000248405 PRR5-ARHGAP8
## 5 ENSG00000164303 ENPP6
## 6 ENSG00000108774 RAB5C
## 7 ENSG00000181404 WASHC1
## 8 ENSG00000172426 RSPH9
## 9 ENSG00000126709 IFI6
## 10 ENSG00000137965 IFI44
## 11 ENSG00000134758 RNF138
## 12 ENSG00000185304 RGPD2
## 13 ENSG00000180720 CHRM4
## 14 ENSG00000187608 ISG15
## 15 ENSG00000160201 U2AF1
## 16 ENSG00000160201 LOC102724594
## 17 ENSG00000187950 OVCH1
## 18 ENSG00000081479 LRP2
## 19 ENSG00000129295 DNAAF11
## 20 ENSG00000256870 SLC5A8
## 21 ENSG00000237440 ZNF737
## 22 ENSG00000168280 KIF5C
## 23 ENSG00000106009 BRAT1
## 24 ENSG00000160932 LY6E
## 25 ENSG00000172613 RAD9A
## 26 ENSG00000197826 CFAP299
## 27 ENSG00000174898 CATSPERD
## 28 ENSG00000059378 PARP12
## 29 ENSG00000204632 HLA-G
## 30 ENSG00000182253 SYNM
## 31 ENSG00000142396 ERVK3-1
## 32 ENSG00000126003 PLAGL2
## 33 ENSG00000167608 TMC4
## 34 ENSG00000073734 ABCB11
## 35 ENSG00000103226 NOMO3
## 36 ENSG00000140104 CLBA1
## 37 ENSG00000115155 OTOF
## 38 ENSG00000163435 ELF3
## 39 ENSG00000138642 HERC6
## 40 ENSG00000138035 PNPT1
## 41 ENSG00000167098 SUN5
## 42 ENSG00000173193 PARP14
## 43 ENSG00000136206 SPDYE1
## 44 ENSG00000090339 ICAM1
## 45 ENSG00000166920 C15orf48
## 46 ENSG00000226288 OR52I2
## 47 ENSG00000101605 MYOM1
## 48 ENSG00000196345 ZKSCAN7
## 49 ENSG00000236444 UBE2L5
## 51 ENSG00000181315 ZNF322
## 52 ENSG00000187626 ZKSCAN4
## 53 ENSG00000184979 USP18
## 54 ENSG00000184508 HDDC3
## 55 ENSG00000176945 MUC20
## 56 ENSG00000155530 LRGUK
## 57 ENSG00000130813 SHFL
## 59 ENSG00000204428 LY6G5C
## 60 ENSG00000167962 ZNF598
## 61 ENSG00000168255 POLR2J3-UPK3BL2
## 62 ENSG00000108830 RND2
## 63 ENSG00000174132 FAM174A
## 64 ENSG00000198870 STKLD1
## 65 ENSG00000284194 SCO2
## 66 ENSG00000104691 UBXN8
## 67 ENSG00000179933 C14orf119
## 69 ENSG00000186654 PRR5
## 70 ENSG00000151005 TKTL2
## 71 ENSG00000135365 PHF21A
## 72 ENSG00000130876 SLC7A10
## 73 ENSG00000146809 ASB15
## 74 ENSG00000095015 MAP3K1
## 75 ENSG00000096060 FKBP5
## 76 ENSG00000144560 VGLL4
## 77 ENSG00000196369 SRGAP2B
## 78 ENSG00000106436 MYL10
## 79 ENSG00000185347 TEDC1
## 80 ENSG00000197822 OCLN
## 81 ENSG00000148572 NRBF2
## 82 ENSG00000125821 DTD1
## 83 ENSG00000178084 HTR3C
## 84 ENSG00000144744 UBA3
## 85 ENSG00000087303 NID2
## 86 ENSG00000176222 ZNF404
## 87 ENSG00000173698 ADGRG2
## 88 ENSG00000176390 CRLF3
## 89 ENSG00000127314 RAP1B
## 90 ENSG00000166295 ANAPC16
## 91 ENSG00000177875 CCDC184
## 92 ENSG00000100523 DDHD1
## 93 ENSG00000161640 SIGLEC11
## 94 ENSG00000253626 EIF5AL1
## 95 ENSG00000054179 ENTPD2
## 96 ENSG00000214360 EFCAB9
## 97 ENSG00000179455 MKRN3
## 98 ENSG00000165164 CFAP47
## 99 ENSG00000198189 HSD17B11
## 100 ENSG00000150455 TIRAP
## 101 ENSG00000121058 COIL
## 102 ENSG00000163116 STPG2
## 103 ENSG00000110768 GTF2H1
## 104 ENSG00000136574 GATA4
## 105 ENSG00000134326 CMPK2
## 106 ENSG00000105248 YJU2
## 107 ENSG00000188157 AGRN
## 108 ENSG00000148339 SLC25A25
## 109 ENSG00000179407 DNAJB8
## 110 ENSG00000136244 IL6
## 111 ENSG00000205857 NANOGNB
## 112 ENSG00000162627 SNX7
## 113 ENSG00000183454 GRIN2A
## 114 ENSG00000260001 TGFBR3L
## 116 ENSG00000206560 ANKRD28
## 117 ENSG00000204519 ZNF551
## 118 ENSG00000137628 DDX60
## 119 ENSG00000232629 HLA-DQB2
## 120 ENSG00000084734 GCKR
## 121 ENSG00000158477 CD1A
## 122 ENSG00000185361 TNFAIP8L1
## 123 ENSG00000154654 NCAM2
## 124 ENSG00000179008 C14orf39
## 125 ENSG00000196839 ADA
## 126 ENSG00000168958 MFF
## 127 ENSG00000107736 CDH23
## 128 ENSG00000284505 LYNX1-SLURP2
## 129 ENSG00000172456 FGGY
## 130 ENSG00000188981 MSANTD1
## 131 ENSG00000137331 IER3
## 132 ENSG00000188039 NWD1
## 133 ENSG00000104856 RELB
## 134 ENSG00000164007 CLDN19
## 135 ENSG00000006634 DBF4
## 136 ENSG00000110906 KCTD10
## 137 ENSG00000119522 DENND1A
## 138 ENSG00000168671 UGT3A2
## 139 ENSG00000128872 TMOD2
## 140 ENSG00000169427 KCNK9
## 141 ENSG00000186714 CCDC73
## 142 ENSG00000172986 GXYLT2
## 143 ENSG00000103375 AQP8
## 144 ENSG00000134690 CDCA8
## 145 ENSG00000163645 ERICH6
## 146 ENSG00000185038 MROH2A
## 147 ENSG00000153498 SPACA7
## 148 ENSG00000009790 TRAF3IP3
## 149 ENSG00000160867 FGFR4
## 150 ENSG00000121653 MAPK8IP1
## 151 ENSG00000133106 EPSTI1
## 152 ENSG00000174145 NWD2
## 153 ENSG00000053770 AP5M1
## 154 ENSG00000146007 ZMAT2
## 155 ENSG00000146205 ANO7
## 156 ENSG00000155158 TTC39B
## 157 ENSG00000162972 MAIP1
## 158 ENSG00000165688 PMPCA
## 159 ENSG00000179344 HLA-DQB1
## 160 ENSG00000241149 ZNF722
## 161 ENSG00000178150 ZNF114
## 162 ENSG00000096006 CRISP3
## 163 ENSG00000111450 STX2
## 164 ENSG00000183793 NPIPA5
## 165 ENSG00000166971 AKTIP
## 166 ENSG00000170807 LMOD2
## 167 ENSG00000161594 KLHL10
## 168 ENSG00000232268 OR52I1
## 169 ENSG00000241553 ARPC4
## 170 ENSG00000133169 BEX1
## 171 ENSG00000120217 CD274
## 172 ENSG00000260097 SPDYE6
## 173 ENSG00000055332 EIF2AK2
## 174 ENSG00000158865 SLC5A11
## 175 ENSG00000162714 ZNF496
## 176 ENSG00000166133 RPUSD2
## 177 ENSG00000143156 NME7
## 178 ENSG00000137204 SLC22A7
## 179 ENSG00000104267 CA2
## 180 ENSG00000105662 CRTC1
## 181 ENSG00000174243 DDX23
## 182 ENSG00000070761 CFAP20
## 183 ENSG00000085871 MGST2
## 184 ENSG00000162775 RBM15
## 185 ENSG00000137802 MAPKBP1
## 186 ENSG00000152767 FARP1
## 187 ENSG00000158806 NPM2
## 188 ENSG00000204001 LCN8
## 189 ENSG00000125900 SIRPD
## 190 ENSG00000010818 HIVEP2
## 191 ENSG00000125355 TMEM255A
## 192 ENSG00000149124 GLYAT
## 193 ENSG00000182348 ZNF804B
## 194 ENSG00000115419 GLS
## 195 ENSG00000102271 KLHL4
## 196 ENSG00000241399 CD302
## 197 ENSG00000112419 PHACTR2
## 198 ENSG00000152779 SLC16A12
## 199 ENSG00000169994 MYO7B
## 200 ENSG00000004948 CALCR
## 201 ENSG00000145920 CPLX2
importance_gp <- simple_gprofiler(names(head(test_imp, n = 60)))
## No results to show
## Please make sure that the organism is correct or set significant = FALSE
## (the preceding pair of messages repeats for each ontology queried)
importance_gp$pvalue_plots$BP
## NULL
rf_predict_test <- predict(rf_train_fit, test_df)
names(rf_predict_test) <- rownames(test_df)
rf_agreement <- rf_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(rf_agreement) <- rownames(test_df)
summary(rf_agreement)
summary(rf_agreement)
## Mode FALSE TRUE
## logical 17 55
names(rf_agreement)[rf_agreement == FALSE]
## [1] "TMRC30018" "TMRC30056" "TMRC30080" "TMRC30119" "TMRC30103" "TMRC30122" "TMRC30026" "TMRC30166" "TMRC30048" "TMRC30200" "TMRC30177"
## [12] "TMRC30238" "TMRC30208" "TMRC30218" "TMRC30220" "TMRC30264" "TMRC30265"
Logistic regression is a statistical method for binary responses, though it can be extended to multiple classes. The general idea is to find parameters which maximize the likelihood that the observed data were sampled from a statistical distribution of interest. The transformations and linear-regression-like steps involved are confusing, but once past them the task becomes setting the model's fitting parameters so that the statistical model resembles the actual training data as closely as possible, and so that new samples return similar values. When fitting a Gaussian, we would transform/normalize the mean and variance of our variables so that they match the normal distribution in use. Logistic regression instead uses a binomial distribution (like our raw sequencing data!) whose response is bounded between 0 and 1.
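A minimal base-R sketch of a single-predictor logistic fit, on simulated data (the gene values and outcomes here are made up, not the TMRC values):

```r
set.seed(11)
n <- 100
gene <- rnorm(n)
## Simulate outcomes whose log-odds depend linearly on expression.
prob_fail <- 1 / (1 + exp(-(0.5 + 2 * gene)))
outcome <- factor(ifelse(runif(n) < prob_fail, "failure", "cure"))
## glm() models the probability of the second factor level ("failure").
fit <- glm(outcome ~ gene, family = binomial)
## Fitted probabilities trace the sigmoid curve across the expression range.
grid <- data.frame(gene = seq(min(gene), max(gene), length.out = 100))
grid$prob <- predict(fit, newdata = grid, type = "response")
range(grid$prob)   ## stays within (0, 1)
```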
Let us take the most important gene observed in one of our previous training sets: ENSG00000248405 PRR5-ARHGAP8
<- "ENSG00000248405"
gene_id <- train(
single_fit ~ ENSG00000248405, data = train_df,
outcome_factor method = "glm", family = "binomial", trControl = trainControl("none"))
<- data.frame("ENSG00000248405" = seq(min(train_df[[gene_id]]),
tt max(train_df[[gene_id]]),len=100))
## predict probabilities for the simulated data
$subtype = predict(single_fit, newdata=tt, type="prob")[, 1]
tt## plot the sigmoid curve and the training data
plot(ifelse(outcome_factor == "cure", 1, 0) ~ ENSG00000248405,
data=train_df, col="red4",
ylab="CF as 0 or 1", xlab="favorite gene expression")
lines(subtype ~ ENSG00000248405, tt, col="green4", lwd=2)
Having tried with 1 gene, let us extend this to all genes. In my first try of this, it took a long time.
glm_fit <- train(outcome_factor ~ ., data = train_df,
                 trControl = boot_control,
                 method = "glm", family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == : prediction from a rank-deficient fit may be misleading
## (the preceding pair of warnings repeats for each bootstrap resample)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
## Loaded glmnet 4.1-7
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
## train_method <- trainControl(method = "cv", number = 10)
<- train(outcome_factor ~ ., data = train_df, method = "glmnet",
glm_fit trControl = boot_control, importance = "permutation",
tuneGrid = data.frame(
alpha = 0.5,
lambda = seq(0.1, 0.7, 0.05)),
verbose = TRUE)
glm_predict_trained <- predict(glm_fit, train_df)
names(glm_predict_trained) <- rownames(train_df)
summary(glm_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
## Mode TRUE
## logical 112
confusionMatrix(data = train_df[, 1], reference = glm_predict_trained)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cure failure
## cure 74 0
## failure 0 38
##
## Accuracy : 1
## 95% CI : (0.968, 1)
## No Information Rate : 0.661
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.661
## Detection Rate : 0.661
## Detection Prevalence : 0.661
## Balanced Accuracy : 1.000
##
## 'Positive' Class : cure
##
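A training-set accuracy of 1 mostly reflects memorization, so caret's bootstrap resampling estimates are a fairer internal check. A minimal sketch, assuming the `glm_fit` object from above (caret stores per-tuning-value resample summaries in `$results`):

```r
## Sketch: resampled accuracy/kappa for each lambda tried, which should be
## noticeably lower than the perfect training-set accuracy above.
glm_fit$results[, c("alpha", "lambda", "Accuracy", "Kappa")]
```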
glm_predict_test <- predict(glm_fit, test_df)
names(glm_predict_test) <- rownames(test_df)
glm_predict_test
## TMRC30186 TMRC30253 TMRC30140 TMRC30176 TMRC30226 TMRC30227 TMRC30016 TMRC30231 TMRC30233 TMRC30018 TMRC30210 TMRC30213 TMRC30216 TMRC30214
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30274 TMRC30254 TMRC30277 TMRC30240 TMRC30280 TMRC30281 TMRC30284 TMRC30282 TMRC30285 TMRC30056 TMRC30105 TMRC30164 TMRC30080 TMRC30094
## cure cure cure cure cure cure cure cure cure failure failure cure failure failure
## TMRC30119 TMRC30082 TMRC30103 TMRC30122 TMRC30022 TMRC30029 TMRC30083 TMRC30028 TMRC30026 TMRC30165 TMRC30044 TMRC30166 TMRC30195 TMRC30048
## failure cure cure failure cure cure failure cure failure cure cure cure cure failure
## TMRC30045 TMRC30055 TMRC30041 TMRC30139 TMRC30160 TMRC30183 TMRC30167 TMRC30078 TMRC30076 TMRC30159 TMRC30129 TMRC30088 TMRC30142 TMRC30168
## cure failure cure cure cure cure cure failure failure cure cure failure cure cure
## TMRC30146 TMRC30199 TMRC30200 TMRC30204 TMRC30177 TMRC30155 TMRC30206 TMRC30136 TMRC30238 TMRC30217 TMRC30208 TMRC30218 TMRC30220 TMRC30264
## cure failure cure failure failure cure failure cure cure failure failure failure cure cure
## TMRC30144 TMRC30265
## cure cure
## Levels: cure failure
glm_agreement <- glm_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(glm_agreement) <- rownames(test_df)
summary(glm_agreement)
## Mode FALSE TRUE
## logical 6 66
names(glm_agreement)[glm_agreement == FALSE]
## [1] "TMRC30080" "TMRC30200" "TMRC30238" "TMRC30220" "TMRC30264" "TMRC30265"
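Since the training-set confusion matrix is trivially perfect, the held-out samples give the honest view. A sketch, reusing `glm_predict_test`, `tc_norm`, and `not_train_idx` from above; `test_labels` is a name introduced here for illustration:

```r
## Sketch: score the glmnet predictions against the held-out labels
## instead of the training labels.
test_labels <- as.factor(as.character(pData(tc_norm)[not_train_idx, "finaloutcome"]))
confusionMatrix(data = glm_predict_test, reference = test_labels)
```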
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:IRanges':
##
## slice
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

gb_fit <- train(outcome_factor ~ ., data = train_df,
                method = "xgbTree", trControl = train_method,
                tuneGrid = data.frame(
                  nrounds = 200,
                  eta = c(0.05, 0.1, 0.3),
                  max_depth = 4,
                  gamma = 0,
                  colsample_bytree = 1,
                  subsample = 0.5,
                  min_child_weight = 1),
                verbose = TRUE)
gb_predict_trained <- predict(gb_fit, train_df)
names(gb_predict_trained) <- rownames(train_df)
summary(gb_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
## Mode TRUE
## logical 112
gb_predict_test <- predict(gb_fit, test_df)
names(gb_predict_test) <- rownames(test_df)
gb_predict_test
## TMRC30186 TMRC30253 TMRC30140 TMRC30176 TMRC30226 TMRC30227 TMRC30016 TMRC30231 TMRC30233 TMRC30018 TMRC30210 TMRC30213 TMRC30216 TMRC30214
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30274 TMRC30254 TMRC30277 TMRC30240 TMRC30280 TMRC30281 TMRC30284 TMRC30282 TMRC30285 TMRC30056 TMRC30105 TMRC30164 TMRC30080 TMRC30094
## cure cure cure cure cure cure cure cure cure failure failure cure failure failure
## TMRC30119 TMRC30082 TMRC30103 TMRC30122 TMRC30022 TMRC30029 TMRC30083 TMRC30028 TMRC30026 TMRC30165 TMRC30044 TMRC30166 TMRC30195 TMRC30048
## failure cure cure failure cure cure failure cure failure cure cure failure cure failure
## TMRC30045 TMRC30055 TMRC30041 TMRC30139 TMRC30160 TMRC30183 TMRC30167 TMRC30078 TMRC30076 TMRC30159 TMRC30129 TMRC30088 TMRC30142 TMRC30168
## cure failure cure cure cure cure cure failure failure cure cure failure cure cure
## TMRC30146 TMRC30199 TMRC30200 TMRC30204 TMRC30177 TMRC30155 TMRC30206 TMRC30136 TMRC30238 TMRC30217 TMRC30208 TMRC30218 TMRC30220 TMRC30264
## cure failure failure failure failure cure failure cure cure failure failure failure cure cure
## TMRC30144 TMRC30265
## cure cure
## Levels: cure failure
gb_agreement <- gb_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(gb_agreement) <- rownames(test_df)
summary(gb_agreement)
## Mode FALSE TRUE
## logical 6 66
names(gb_agreement)[gb_agreement == FALSE]
## [1] "TMRC30080" "TMRC30166" "TMRC30238" "TMRC30220" "TMRC30264" "TMRC30265"
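It might also be worth checking which genes the boosted trees lean on; a sketch, assuming the `gb_fit` object from above (caret's `varImp()` wraps xgboost's importance scores):

```r
## Sketch: per-gene importance for the xgbTree fit, plotting the top 15,
## analogous to the ranger importance plot later in this document.
gb_importance <- varImp(gb_fit)
plot(gb_importance, top = 15)
```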
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:BiocGenerics':
##
## type
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

svm_fit <- train(outcome_factor ~ ., data = train_df,
                 method = "svmRadial", trControl = train_method,
                 tuneGrid = data.frame(
                   C = c(0.25, 0.5, 1),
                   sigma = 1),
                 verbose = TRUE)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
svm_predict_trained <- predict(svm_fit, train_df)
names(svm_predict_trained) <- rownames(train_df)
svm_predict_trained
## TMRC30156 TMRC30185 TMRC30178 TMRC30179 TMRC30221 TMRC30222 TMRC30223 TMRC30224 TMRC30269 TMRC30148 TMRC30149 TMRC30150 TMRC30138 TMRC30153
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30151 TMRC30234 TMRC30235 TMRC30270 TMRC30225 TMRC30228 TMRC30229 TMRC30230 TMRC30017 TMRC30232 TMRC30209 TMRC30211 TMRC30212 TMRC30215
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30271 TMRC30273 TMRC30275 TMRC30272 TMRC30276 TMRC30255 TMRC30256 TMRC30239 TMRC30278 TMRC30279 TMRC30257 TMRC30019 TMRC30258 TMRC30283
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30071 TMRC30020 TMRC30113 TMRC30058 TMRC30169 TMRC30093 TMRC30107 TMRC30170 TMRC30032 TMRC30096 TMRC30115 TMRC30118 TMRC30180 TMRC30014
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30121 TMRC30196 TMRC30030 TMRC30021 TMRC30037 TMRC30031 TMRC30027 TMRC30194 TMRC30054 TMRC30046 TMRC30070 TMRC30049 TMRC30047 TMRC30191
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30053 TMRC30068 TMRC30171 TMRC30192 TMRC30042 TMRC30158 TMRC30132 TMRC30157 TMRC30123 TMRC30181 TMRC30072 TMRC30133 TMRC30043 TMRC30116
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30184 TMRC30172 TMRC30134 TMRC30174 TMRC30137 TMRC30161 TMRC30175 TMRC30145 TMRC30143 TMRC30197 TMRC30182 TMRC30198 TMRC30201 TMRC30203
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30202 TMRC30205 TMRC30152 TMRC30154 TMRC30241 TMRC30237 TMRC30207 TMRC30074 TMRC30077 TMRC30219 TMRC30079 TMRC30135 TMRC30173 TMRC30147
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## Levels: cure failure
summary(svm_predict_trained == pData(tc_norm)[train_idx, "finaloutcome"])
## Mode FALSE TRUE
## logical 38 74
svm_predict_test <- predict(svm_fit, test_df)
names(svm_predict_test) <- rownames(test_df)
svm_predict_test
## TMRC30186 TMRC30253 TMRC30140 TMRC30176 TMRC30226 TMRC30227 TMRC30016 TMRC30231 TMRC30233 TMRC30018 TMRC30210 TMRC30213 TMRC30216 TMRC30214
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30274 TMRC30254 TMRC30277 TMRC30240 TMRC30280 TMRC30281 TMRC30284 TMRC30282 TMRC30285 TMRC30056 TMRC30105 TMRC30164 TMRC30080 TMRC30094
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30119 TMRC30082 TMRC30103 TMRC30122 TMRC30022 TMRC30029 TMRC30083 TMRC30028 TMRC30026 TMRC30165 TMRC30044 TMRC30166 TMRC30195 TMRC30048
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30045 TMRC30055 TMRC30041 TMRC30139 TMRC30160 TMRC30183 TMRC30167 TMRC30078 TMRC30076 TMRC30159 TMRC30129 TMRC30088 TMRC30142 TMRC30168
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30146 TMRC30199 TMRC30200 TMRC30204 TMRC30177 TMRC30155 TMRC30206 TMRC30136 TMRC30238 TMRC30217 TMRC30208 TMRC30218 TMRC30220 TMRC30264
## cure cure cure cure cure cure cure cure cure cure cure cure cure cure
## TMRC30144 TMRC30265
## cure cure
## Levels: cure failure
svm_agreement <- svm_predict_test == pData(tc_norm)[not_train_idx, "finaloutcome"]
names(svm_agreement) <- rownames(test_df)
summary(svm_agreement)
## Mode FALSE TRUE
## logical 24 48
names(svm_agreement)[svm_agreement == FALSE]
## [1] "TMRC30056" "TMRC30105" "TMRC30094" "TMRC30119" "TMRC30122" "TMRC30083" "TMRC30026" "TMRC30048" "TMRC30055" "TMRC30078" "TMRC30076"
## [12] "TMRC30088" "TMRC30199" "TMRC30200" "TMRC30204" "TMRC30177" "TMRC30206" "TMRC30238" "TMRC30217" "TMRC30208" "TMRC30218" "TMRC30220"
## [23] "TMRC30264" "TMRC30265"
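The radial SVM predicted 'cure' for every test sample, which suggests sigma = 1 is far too large a kernel width for this many variables. A sketch of a possible fix, assuming the same `train_df` and `train_method` as above: estimate a data-driven sigma with `kernlab::sigest()` (its median estimate is the second element) and let caret tune C around it.

```r
## Sketch: sigest() returns three quantiles of a reasonable inverse kernel
## width; reuse its median rather than the arbitrary sigma = 1 above.
sig_est <- kernlab::sigest(outcome_factor ~ ., data = train_df)
svm_tuned <- train(outcome_factor ~ ., data = train_df,
                   method = "svmRadial", trControl = train_method,
                   tuneGrid = expand.grid(
                     C = c(0.25, 0.5, 1, 2, 4),
                     sigma = sig_est[2]))
```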
ref_col <- "persistence"
outcome_factor <- as.factor(as.character(pData(tc_clinical)[[ref_col]]))

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
only_yn <- ml_df[["outcome_factor"]] == "Y" | ml_df[["outcome_factor"]] == "N"
outcome_factor <- as.factor(as.character(outcome_factor[only_yn]))

ml_df <- ml_df[only_yn, ]
ml_df[["outcome_factor"]] <- as.factor(as.character(ml_df[["outcome_factor"]]))
summary(ml_df[["outcome_factor"]])
## N Y
## 55 101
train_idx <- createDataPartition(outcome_factor, p = 0.5,
                                 list = FALSE,
                                 times = 1)
summary(ml_df[train_idx, "outcome_factor"])
## N Y
## 28 51
## Training set has 79 samples.
train_rownames <- rownames(ml_df)[train_idx]
not_train_idx <- ! rownames(ml_df) %in% train_rownames
summary(ml_df[not_train_idx, "outcome_factor"])
## N Y
## 27 50
not_train_rownames <- rownames(ml_df)[not_train_idx]

train_df <- as.data.frame(ml_df[train_rownames, ])
dim(train_df)
## [1] 79 12081
test_df <- as.data.frame(ml_df[not_train_rownames, ])
dim(test_df)
## [1] 77 12081
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL
train_fit <- train(outcome_factor ~ ., data = train_df,
                   method = "ranger", trControl = boot_control,
                   importance = "permutation",
                   tuneGrid = data.frame(
                     mtry = 200,
                     min.node.size = 1,
                     splitrule = "gini"),
                   verbose = TRUE)
train_fit[["finalModel"]][["prediction.error"]]
## [1] 0.2532
variable_importance <- varImp(train_fit)
plot(variable_importance, top = 15)
rf_predict_trained <- predict(rf_train_fit, train_df)
confusionMatrix(data = train_df[[1]], reference = rf_predict_trained)
## Error in confusionMatrix.default(data = train_df[[1]], reference = rf_predict_trained): The data must contain some levels that overlap the reference.
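That error is caret complaining that the predictions and the reference share no factor levels, which usually means the two vectors were built from different encodings of the outcome. A sketch of a guard, assuming the objects above; `obs` and `pred` are names introduced here for illustration:

```r
## Sketch: coerce both sides to factors over the same level set before
## calling confusionMatrix(); obs/pred are illustrative names.
obs <- factor(as.character(train_df[["outcome_factor"]]))
pred <- factor(as.character(rf_predict_trained), levels = levels(obs))
confusionMatrix(data = pred, reference = obs)
```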
rf_predict_test <- predict(rf_train_fit, test_df)
names(rf_predict_test) <- rownames(test_df)
used_names <- names(rf_predict_test)
used_pdata <- pData(tc_norm)[used_names, ]

rf_agreement <- as.character(rf_predict_test) == used_pdata[[ref_col]]
names(rf_agreement) <- rownames(test_df)
summary(rf_agreement)
## Mode FALSE
## logical 77
names(rf_agreement)[rf_agreement == FALSE]
## [1] "TMRC30156" "TMRC30178" "TMRC30223" "TMRC30138" "TMRC30153" "TMRC30151" "TMRC30234" "TMRC30235" "TMRC30227" "TMRC30231" "TMRC30232"
## [12] "TMRC30233" "TMRC30209" "TMRC30216" "TMRC30214" "TMRC30273" "TMRC30275" "TMRC30276" "TMRC30255" "TMRC30256" "TMRC30240" "TMRC30279"
## [23] "TMRC30280" "TMRC30283" "TMRC30285" "TMRC30071" "TMRC30056" "TMRC30105" "TMRC30058" "TMRC30103" "TMRC30122" "TMRC30107" "TMRC30096"
## [34] "TMRC30115" "TMRC30121" "TMRC30166" "TMRC30048" "TMRC30054" "TMRC30070" "TMRC30049" "TMRC30055" "TMRC30053" "TMRC30068" "TMRC30158"
## [45] "TMRC30132" "TMRC30157" "TMRC30123" "TMRC30181" "TMRC30133" "TMRC30184" "TMRC30159" "TMRC30129" "TMRC30172" "TMRC30137" "TMRC30142"
## [56] "TMRC30143" "TMRC30197" "TMRC30182" "TMRC30203" "TMRC30202" "TMRC30205" "TMRC30177" "TMRC30154" "TMRC30241" "TMRC30237" "TMRC30206"
## [67] "TMRC30238" "TMRC30074" "TMRC30217" "TMRC30208" "TMRC30219" "TMRC30218" "TMRC30220" "TMRC30173" "TMRC30264" "TMRC30147" "TMRC30265"
I redid the mappings with a combined host/parasite transcriptome in the hope that it might provide some interesting variance for these classifiers. Let us poke at it and see whether anything is relevant or whether it was a bad idea.
wanted_idx <- grepl(x = rownames(exprs(hslp_gene_expt)), pattern = "^LP")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'grepl': error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'hslp_gene_expt' not found
wanted_ids <- rownames(exprs(hslp_gene_expt))[wanted_idx]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'hslp_gene_expt' not found
test_lp <- exclude_genes_expt(hslp_gene_expt, ids = wanted_ids, method = "keep")
## Note, I renamed this to subset_genes().
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'fData': object 'hslp_gene_expt' not found
tt <- plot_libsize(test_lp)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'plot_libsize': object 'test_lp' not found
tt$plot
## NULL
I am mostly going to copy and paste the code from above, changing the inputs to fit the new combined human/parasite data structure.
ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(hslp_gene_expt)[[ref_col]]))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.factor': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hslp_gene_expt' not found
hslp_norm <- normalize_expt(hslp_gene_expt, transform = "log2", convert = "cpm")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'state': object 'hslp_gene_expt' not found
hslp_norm <- normalize_expt(hslp_norm, filter = "cv", cv_min = 0.1)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'state': object 'hslp_norm' not found
retained_lp <- grepl(x = rownames(exprs(hslp_norm)), pattern = "^LP")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'grepl': error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'hslp_norm' not found
sum(retained_lp)
## Error in eval(expr, envir, enclos): object 'retained_lp' not found
nzv_thslp <- t(exprs(hslp_norm))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 't': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'hslp_norm' not found
nzv_hslp_center <- preProcess(nzv_thslp, method = "center")
## Error in preProcess(nzv_thslp, method = "center"): object 'nzv_thslp' not found
nzv_thslp <- predict(nzv_hslp_center, nzv_thslp)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'predict': object 'nzv_hslp_center' not found
nzv_thslp_correlated <- preProcess(nzv_thslp, method = "corr", cutoff = 0.95)
## Error in preProcess(nzv_thslp, method = "corr", cutoff = 0.95): object 'nzv_thslp' not found
nzv_thslp_uncorr <- predict(nzv_thslp_correlated, nzv_thslp)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'predict': object 'nzv_thslp_correlated' not found
interesting_meta <- pData(hslp_norm)[, c("finaloutcome", "donor",
                                         "visitnumber", "selectionmethod",
                                         "typeofcells", "time", "clinic")]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hslp_norm' not found
hslp_ml_df <- as.data.frame(cbind(outcome_factor,
                                  as.data.frame(nzv_thslp_uncorr)))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'nzv_thslp_uncorr' not found
hslp_ml_df[["outcome_factor"]] <- as.factor(hslp_ml_df[["outcome_factor"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.factor': object 'hslp_ml_df' not found
dim(hslp_ml_df)
## Error in eval(expr, envir, enclos): object 'hslp_ml_df' not found
## The variable outcome_factor was created at the top of this document,
## which is a little awkward here.
train_idx <- createDataPartition(outcome_factor, p = 0.6,
                                 list = FALSE,
                                 times = 1)
summary(hslp_ml_df[train_idx, "outcome_factor"])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': object 'hslp_ml_df' not found
## Training set has 112 samples.
train_rownames <- rownames(hslp_ml_df)[train_idx]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'hslp_ml_df' not found
not_train_idx <- ! rownames(hslp_ml_df) %in% train_rownames
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function '%in%': error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'hslp_ml_df' not found
summary(hslp_ml_df[not_train_idx, "outcome_factor"])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': object 'hslp_ml_df' not found
not_train_rownames <- rownames(hslp_ml_df)[not_train_idx]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'hslp_ml_df' not found
train_df <- as.data.frame(hslp_ml_df[train_rownames, ])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'hslp_ml_df' not found
dim(train_df)
## [1] 79 12081
test_df <- as.data.frame(hslp_ml_df[not_train_rownames, ])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'hslp_ml_df' not found
dim(test_df)
## [1] 77 12080
## Remove the outcome factor from the test data, just in case.
test_outcome <- test_df[["outcome_factor"]]
test_df[["outcome_factor"]] <- NULL
boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")
knn_fit <- knn3(x = train_df[, -1],
                y = train_df[["outcome_factor"]],
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df[, -1], type = "class")
confusionMatrix(data = train_df[[1]], reference = knn_predict_trained)
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 24 4
## Y 4 47
##
## Accuracy : 0.899
## 95% CI : (0.81, 0.955)
## No Information Rate : 0.646
## P-Value [Acc > NIR] : 2.59e-07
##
## Kappa : 0.779
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.857
## Specificity : 0.922
## Pos Pred Value : 0.857
## Neg Pred Value : 0.922
## Prevalence : 0.354
## Detection Rate : 0.304
## Detection Prevalence : 0.354
## Balanced Accuracy : 0.889
##
## 'Positive' Class : N
##
names(knn_predict_trained) <- rownames(train_df)
summary(knn_predict_trained == pData(hslp_gene_expt)[train_idx, "finaloutcome"])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hslp_gene_expt' not found
knn_predict_test <- predict(knn_fit, test_df, type = "class")
## and the values, which will be used for our ROC curve.
knn_predict_test_values <- predict(knn_fit, test_df)
names(knn_predict_test) <- rownames(test_df)
knn_agreement <- knn_predict_test == test_outcome
names(knn_agreement) <- rownames(test_df)
## Error in names(knn_agreement) <- rownames(test_df): 'names' attribute [77] must be the same length as the vector [0]
summary(knn_agreement)
## Mode
## logical
names(knn_agreement)[knn_agreement == FALSE]
## NULL
cv_control <- trainControl(method = "cv", number = 10)
knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]
## k
## 3 3
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")
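The class probabilities captured in `knn_predict_test_values` can feed the ROC curve mentioned above. A sketch, assuming the pROC package is installed and that "Y" is the positive-class column of the probability matrix (both assumptions, not established in this document):

```r
## Sketch: ROC and AUC from the knn class probabilities against the
## held-out labels stored in test_outcome.
library(pROC)
knn_roc <- roc(response = test_outcome,
               predictor = knn_predict_test_values[, "Y"])
plot(knn_roc)
auc(knn_roc)
```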