Random Forest
Randomly sample both the training examples and the features, then train multiple decision trees on these samples; this helps prevent overfitting and improves generalization.
Typical characteristics of a random forest:
1 Sampling with replacement (so the dataset used to grow each tree will contain duplicates),
2 Each node is split on the best split found among the candidate features
Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n, by sampling from D uniformly and with replacement. This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
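The bootstrap-and-average procedure described above can be sketched in a few lines. This is a minimal illustration with a stand-in base learner (a model that just predicts the mean target, in place of a decision tree); all names here are made up for the example:

```python
import random

def bootstrap_sample(data, rng):
    # n draws with replacement from a dataset of size n -- duplicates are expected
    return [rng.choice(data) for _ in data]

def bagging_fit(data, m, fit_one, seed=0):
    # fit m models, each on its own bootstrap sample D_i of D
    rng = random.Random(seed)
    return [fit_one(bootstrap_sample(data, rng)) for _ in range(m)]

def bagging_predict(models, x):
    # regression: combine the m outputs by averaging
    return sum(f(x) for f in models) / len(models)

# toy base learner standing in for a decision tree: predicts the mean target
def fit_mean(sample):
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y

data = [(float(x), 2.0 * x) for x in range(10)]   # y = 2x, so mean(y) = 9.0
models = bagging_fit(data, m=25, fit_one=fit_mean)
pred = bagging_predict(models, 5.0)
print(round(pred, 1))
```

Because each bootstrap sample differs, the 25 base predictions vary slightly, and averaging them smooths out that variance, which is the point of bagging.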
The ExtraTrees algorithm adds one more layer of randomness: when choosing the split value for a continuous feature, it does not evaluate every candidate split value to find the best one.
Instead, for each feature it draws a single random split value within that feature's value range, and then compares the features at these random thresholds to decide which feature to split on.
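A minimal sketch of this split rule, assuming a regression setting with variance (sum of squared errors) reduction as the impurity score; the function names and the toy dataset are my own for illustration:

```python
import random

def score_split(xs, ys, j, t):
    # variance reduction when splitting feature j at threshold t
    left  = [y for x, y in zip(xs, ys) if x[j] <  t]
    right = [y for x, y in zip(xs, ys) if x[j] >= t]
    def sse(v):
        if not v:
            return 0.0
        mu = sum(v) / len(v)
        return sum((y - mu) ** 2 for y in v)
    return sse(ys) - sse(left) - sse(right)

def extra_trees_split(xs, ys, rng):
    # ExtraTrees-style split: ONE random threshold per feature, drawn
    # uniformly from that feature's value range, then the features are
    # compared by score -- no exhaustive search over all split values
    best = None
    for j in range(len(xs[0])):
        vals = [x[j] for x in xs]
        t = rng.uniform(min(vals), max(vals))
        s = score_split(xs, ys, j, t)
        if best is None or s > best[0]:
            best = (s, j, t)
    return best[1], best[2]   # chosen feature index and its random threshold

# feature 0 perfectly separates the targets; feature 1 is constant noise
xs = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
ys = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
j, t = extra_trees_split(xs, ys, random.Random(0))
print(j)  # feature 0 wins, since any threshold on it separates the targets
```

Drawing a single random threshold per feature is much cheaper than scanning all candidate split values, and the extra randomness further decorrelates the trees.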
- Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks (where n_features is the number of features in the data).
- In addition, note that in random forests, bootstrap samples are used by default (bootstrap=True) while the default strategy for extra-trees is to use the whole dataset (bootstrap=False).
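The defaults quoted above can be checked directly in scikit-learn; the dataset and hyperparameter values below are arbitrary for the demo:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# toy classification dataset (parameters are illustrative, not tuned)
X, y = make_classification(n_samples=200, n_features=16, random_state=0)

# random forest: bootstrap=True by default (sampling with replacement)
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
# extra-trees: bootstrap=False by default (each tree sees the whole dataset)
et = ExtraTreesClassifier(n_estimators=50, max_features="sqrt", random_state=0)

rf.fit(X, y)
et.fit(X, y)
print(rf.bootstrap, et.bootstrap)  # True False
```

Here `max_features="sqrt"` makes the classification default explicit; each split then considers only sqrt(16) = 4 randomly chosen features.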
Source: http://www.bubuko.com/infodetail-2546379.html