R Language - Data Processing - Splitting a Sample Set

> library(caret)
> library(Hmisc)   # provides describe(), used below
> sIndex<-createDataPartition(outp$V1,p=0.7,list=FALSE)
> outpTrain<-outp[sIndex]
> outpTest<-outp[-sIndex]
> describe(outpTrain)
 outpTrain
        n  missing distinct     Info     Mean      Gmd      .05      .10
      139        0      125        1    21.45    3.894    16.11    17.41
      .25      .50      .75      .90      .95
    19.19    21.66    23.54    25.62    27.20
 lowest : 12.04 12.62 13.03 14.45 14.61, highest: 27.70 27.95 28.16 29.45 31.30
> describe(outpTest)
 outpTest
        n  missing distinct     Info     Mean      Gmd      .05      .10
       56        0       55        1    21.75    3.586    16.99    17.48
      .25      .50      .75      .90      .95
    19.39    21.66    23.50    24.91    27.08
 lowest : 15.75 16.03 16.78 17.06 17.41, highest: 26.15 26.97 27.41 28.58 32.30

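PS: The data is partitioned according to the values of the dependent variable. In outp$V1, outp is the object holding the dependent variable and V1 is the name of the column the split is based on.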

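With p=0.7, the training set gets about 70% of the samples and the test set about 30% (here 139 and 56 out of 195). Running describe on both subsets shows: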

Training set mean: 21.45. Test set mean: 21.75.
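The split and the comparison above can be reproduced with a minimal sketch like the following. The data here is synthetic (the original outp is not available), the second column V2 is made up for illustration, and set.seed is added only so the result is repeatable. It also uses row indexing (outp[sIndex, ]), which is the usual form when outp is a data frame with more than one column.

```r
library(caret)

set.seed(42)  # fix the random seed so the split is reproducible

# Synthetic stand-in for the article's data (an assumption, not the real outp)
outp <- data.frame(V1 = rnorm(195, mean = 21.5, sd = 3.5),
                   V2 = runif(195))  # hypothetical extra column

# Row indices of the training set, stratified on the values of V1
sIndex <- createDataPartition(outp$V1, p = 0.7, list = FALSE)

# Row indexing keeps every column in both subsets
outpTrain <- outp[sIndex, ]
outpTest  <- outp[-sIndex, ]

# Share of rows that went to the training set (should be close to 0.7)
nrow(outpTrain) / nrow(outp)

# Compare the distribution of V1 in the two subsets
summary(outpTrain$V1)
summary(outpTest$V1)
```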

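One open question, though: the five smallest values in the training set (12.04 to 14.61) are all below the smallest value in the test set (15.75). How can the split be made more even?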
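One possible direction, offered as a sketch rather than a definitive answer: for a numeric y, createDataPartition bins the values into quantile-based groups and samples within each group, and the groups argument (default min(5, length(y))) controls how fine that binning is. Raising it should make the two subsets follow each other more closely across the range. Note that any single split puts each observation in only one subset, so the extreme values can never appear in both, and a test set of 56 will usually have less extreme tails than a training set of 139. The data below is synthetic and the value 20 is an arbitrary choice for illustration.

```r
library(caret)

set.seed(42)  # reproducible example

# Synthetic stand-in for outp$V1 (an assumption)
y <- rnorm(195, mean = 21.5, sd = 3.5)

# Default-like coarse stratification versus a finer one
idx5  <- createDataPartition(y, p = 0.7, list = FALSE, groups = 5)
idx20 <- createDataPartition(y, p = 0.7, list = FALSE, groups = 20)

# Compare selected quantiles of training and test sets for both settings
probs <- c(0, 0.05, 0.25, 0.5, 0.75, 0.95, 1)
cmp <- rbind(
  train_g5  = quantile(y[idx5],   probs),
  test_g5   = quantile(y[-idx5],  probs),
  train_g20 = quantile(y[idx20],  probs),
  test_g20  = quantile(y[-idx20], probs)
)
round(cmp, 2)
```

If the rows for groups = 20 line up more closely than those for groups = 5, the finer binning is helping; the 0% and 100% columns will still differ between the training and test sets.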

Source: http://www.bubuko.com/infodetail-3045009.html
