Feature Selection for Machine Learning in R (1)

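Feature selection is an important step in practical machine learning. Datasets often come with more features than are useful for model building, so working out which ones are actually informative deserves attention.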

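This article uses the caret package and its recursive feature elimination function, rfe. The main arguments of rfe are: x, a matrix or data frame of predictors; y, a vector of outcomes (numeric or factor); sizes, an integer vector of the specific subset sizes to test; and rfeControl, a list of options specifying the prediction model and the resampling method.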

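Several predefined function sets can be passed as the functions argument of rfeControl: linear regression (lmFuncs), random forest (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs), and functions that work with caret's train function (caretFuncs).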
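As a minimal sketch of how these pieces fit together: each predefined set is an ordinary list of functions, and switching the model used by rfe only means passing a different set to rfeControl. The variable names ctrl_rf and ctrl_bag are illustrative, not from the original article.

```r
library(caret)

# Each predefined set is a plain list of the functions rfe calls at every step
names(rfFuncs)

# Same resampling scheme, two different underlying models
ctrl_rf  <- rfeControl(functions = rfFuncs,      method = "cv", number = 10)
ctrl_bag <- rfeControl(functions = treebagFuncs, method = "cv", number = 10)
```

Either control object can then be supplied as the rfeControl argument of rfe.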

1 Remove redundant features by removing highly correlated ones

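# ensure results are repeatable
set.seed(1234)
# load the libraries
library(mlbench)
library(caret)
library(Hmisc)
# load the dataset and keep the 8 predictor columns
data(PimaIndiansDiabetes)
Matrix <- PimaIndiansDiabetes[,1:8]
# flatten the upper triangle of a correlation matrix into (row, column, cor) rows
up_CorMatrix <- function(cor) {
  ut <- upper.tri(cor)
  data.frame(row = rownames(cor)[row(cor)[ut]],
             column = rownames(cor)[col(cor)[ut]],
             cor = cor[ut])
}
# pairwise Pearson correlations between the predictors
res <- rcorr(as.matrix(Matrix))
cor_data <- up_CorMatrix(res$r)
# keep pairs whose absolute correlation exceeds 0.5
cor_data <- subset(cor_data, abs(cor) > 0.5)
cor_data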
        row column       cor
22 pregnant    age 0.5443412
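The listing above only reports the correlated pair. As a hedged sketch of the removal step itself, caret's findCorrelation can suggest which column of each highly correlated pair to drop. The 0.5 cutoff mirrors the threshold used above; the names predictors_df, drop_idx and reduced are illustrative.

```r
library(mlbench)
library(caret)

data(PimaIndiansDiabetes)
predictors_df <- PimaIndiansDiabetes[, 1:8]

# Pearson correlation matrix of the predictors
cor_matrix <- cor(predictors_df)

# Column indices suggested for removal at an absolute correlation above 0.5
drop_idx <- findCorrelation(cor_matrix, cutoff = 0.5)
names(predictors_df)[drop_idx]

# Drop the suggested columns, leaving the data untouched if none are flagged
reduced <- if (length(drop_idx) > 0) {
  predictors_df[, -drop_idx, drop = FALSE]
} else {
  predictors_df
}
```

Whether a correlation of about 0.54 is high enough to justify dropping a feature is a judgment call; a stricter cutoff such as 0.75 is also common.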

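2 Rank features by importance

Feature importance can be estimated by building a model. Some models, such as decision trees, have a built-in mechanism for reporting feature importance. For other models, the importance of each feature is estimated with a ROC curve analysis. The example below loads the Pima Indians Diabetes dataset and builds a Learning Vector Quantization (LVQ) model; varImp is then used to obtain the feature importances. The results show that glucose, mass and age are the three most important features, and insulin is the least important.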

# ensure results are repeatable
set.seed(1234)
# load the library
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)
ROC curve variable importance

         Importance
glucose      0.7881
mass         0.6876
age          0.6869
pregnant     0.6195
pedigree     0.6062
pressure     0.5865
triceps      0.5536
insulin      0.5379
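LVQ has no built-in importance measure, so the scores above come from a per-feature ROC analysis. As a sketch of that idea without training any model, caret's filterVarImp computes the area under the ROC curve for each predictor taken on its own. The name roc_imp is illustrative, and the result is expected to be close to, but not necessarily formatted like, the varImp output above.

```r
library(mlbench)
library(caret)

data(PimaIndiansDiabetes)

# Area under the ROC curve for each predictor, evaluated one at a time
roc_imp <- filterVarImp(x = PimaIndiansDiabetes[, 1:8],
                        y = PimaIndiansDiabetes$diabetes)

# Sort from most to least important (the per-class columns hold the same AUC
# for a two-class problem, so the first column is enough for ordering)
roc_imp[order(roc_imp[, 1], decreasing = TRUE), , drop = FALSE]
```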

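3 Feature selection

Automatic feature selection methods build many models on different subsets of the features and identify which features help produce an accurate model and which do not. A popular automatic method is Recursive Feature Elimination (RFE).

The example below applies RFE to the Pima Indians Diabetes dataset. A random forest is used at each iteration to evaluate the model. The run is configured to evaluate every subset size from 1 to 8, not every possible combination of features. In the output, the 5-feature subset is selected with the highest cross-validated accuracy (0.7604), and the 7- and 8-feature subsets perform almost identically.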

# ensure the results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot the results
plot(results, type=c("g", "o"))
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.6926 0.2653    0.04916 0.10925
         2   0.7343 0.3906    0.04725 0.10847
         3   0.7356 0.4058    0.05105 0.11126
         4   0.7513 0.4435    0.04222 0.09472
         5   0.7604 0.4539    0.05007 0.11691        *
         6   0.7499 0.4364    0.04327 0.09967
         7   0.7603 0.4574    0.04052 0.09838
         8   0.7590 0.4549    0.04804 0.10781
The top 5 variables (out of 5):
   glucose, mass, age, pregnant, insulin
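Since several subset sizes score within a fraction of a percentage point of each other, one might prefer the smallest subset whose accuracy is close enough to the best. The sketch below applies that rule to the accuracy figures printed above; the 1.5 percent tolerance is an arbitrary illustrative choice, and the names perf, tol and smallest_ok are hypothetical. caret also provides a pickSizeTolerance helper built around the same idea.

```r
# Resampled accuracy by subset size, copied from the rfe output above
perf <- data.frame(
  Variables = 1:8,
  Accuracy  = c(0.6926, 0.7343, 0.7356, 0.7513, 0.7604, 0.7499, 0.7603, 0.7590)
)

tol <- 1.5  # assumed tolerance: acceptable accuracy loss, in percent of the best

# Percentage loss of each subset size relative to the best accuracy
best <- max(perf$Accuracy)
pct_loss <- (best - perf$Accuracy) / best * 100

# Smallest subset size whose loss stays within the tolerance
smallest_ok <- min(perf$Variables[pct_loss <= tol])
smallest_ok
```

With these figures and this tolerance the rule picks 4 features, trading roughly one percentage point of accuracy for a smaller model.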

Source: http://www.jianshu.com/p/3ce79b6f371a
