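Case Background and Mining Objectives
Input data:
Statistical yearbooks of a certain city (1995-2014)
Mining objectives:
Sort out and analyze the key features that affect local fiscal revenue, and identify a feature-selection model for them
Building on the factor analysis from the first objective, forecast the city's total fiscal revenue and its revenue by category for 2015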
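Analysis Method and Process (Selection Principles)
Earlier analyses of fiscal revenue typically used a multiple linear regression model, with least squares used to estimate the regression coefficients. Results obtained this way depend heavily on the data, often reach only a locally optimal solution, and can leave the subsequent tests without their intended meaning.
This case therefore uses the Adaptive-Lasso variable selection method instead.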
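To make the idea concrete, here is a minimal NumPy sketch of Adaptive-Lasso under stated assumptions: features are standardized, the initial estimate comes from a lightly ridge-regularized least-squares fit, and the weighted L1 problem is solved by rescaling the columns and running a plain coordinate-descent Lasso. The function names (`lasso_cd`, `adaptive_lasso`), the penalty `lam`, the exponent `gamma` and the synthetic data are all illustrative choices, not the settings used in this case.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Plain Lasso, minimizing (1/2n)*||y - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            resid += X[:, j] * beta[j]          # remove feature j's current contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]          # add the updated contribution back
    return beta

def adaptive_lasso(X, y, lam, gamma=1.0, ridge=1e-3):
    """Adaptive-Lasso coefficients on the standardized feature scale."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    n, p = Xs.shape
    # Step 1: initial estimate (slightly ridge-regularized least squares).
    beta_init = np.linalg.solve(Xs.T @ Xs / n + ridge * np.eye(p), Xs.T @ yc / n)
    # Step 2: per-feature penalty weights; small initial coefficients get large penalties.
    w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)
    # Step 3: weighted L1 problem == plain Lasso on columns divided by their weights.
    beta_star = lasso_cd(Xs / w, yc, lam)
    return beta_star / w

# Synthetic check: only features 0 and 2 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 2] + 0.1 * rng.normal(size=200)
beta_demo = adaptive_lasso(X_demo, y_demo, lam=0.05)
print(np.round(beta_demo, 3))
```

Features whose coefficient comes back exactly zero would be the ones dropped; in practice `lam` is usually chosen by cross-validation or an information criterion rather than fixed by hand.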
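Subtask Plan
Collect the city's fiscal revenue and revenue by category from the municipal statistics bureau website and the statistical yearbooks
Build the Adaptive-Lasso variable selection model
Feed the selected features into the trained artificial neural network model to obtain the 2015 forecast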
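Experiment
Understand Adaptive-Lasso variable selection and the neural network prediction model
Analyze the data, identify the key features, and screen them with the Adaptive-Lasso variable selection method
Use the GM(1,1) grey prediction method to forecast the 2014 and 2015 values of the selected key factors
Feed these values into the neural network model to obtain the 2014 and 2015 forecasts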
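For reference, the GM(1,1) model named in the steps above can be written out as follows. This is the standard textbook formulation, stated here so that the symbols $a$, $b$, $B$ and $Y_n$ have a definition; $x^{(0)}$ denotes the original series of length $n$.

```latex
\begin{aligned}
x^{(1)}(k) &= \sum_{i=1}^{k} x^{(0)}(i), \qquad k = 1, \dots, n
  && \text{(1-AGO accumulated series)} \\
z^{(1)}(k) &= \tfrac{1}{2}\bigl(x^{(1)}(k) + x^{(1)}(k-1)\bigr), \qquad k = 2, \dots, n
  && \text{(mean of consecutive neighbours)} \\
x^{(0)}(k) + a\, z^{(1)}(k) &= b
  && \text{(grey differential equation)} \\
\begin{bmatrix} a \\ b \end{bmatrix} &= (B^{\top} B)^{-1} B^{\top} Y_n, \quad
B = \begin{bmatrix} -z^{(1)}(2) & 1 \\ \vdots & \vdots \\ -z^{(1)}(n) & 1 \end{bmatrix}, \quad
Y_n = \begin{bmatrix} x^{(0)}(2) \\ \vdots \\ x^{(0)}(n) \end{bmatrix}
  && \text{(least squares)} \\
\hat{x}^{(1)}(k) &= \Bigl(x^{(0)}(1) - \tfrac{b}{a}\Bigr) e^{-a(k-1)} + \tfrac{b}{a}
  && \text{(time response)} \\
\hat{x}^{(0)}(k) &= \hat{x}^{(1)}(k) - \hat{x}^{(1)}(k-1)
  = \Bigl(x^{(0)}(1) - \tfrac{b}{a}\Bigr)\bigl(e^{-a(k-1)} - e^{-a(k-2)}\bigr)
  && \text{(restored forecast)}
\end{aligned}
```

The last line is the quantity a GM(1,1) routine returns for step $k$; forecasting one or two years ahead means evaluating it at $k = n+1$ and $k = n+2$.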
Code archive:
- import numpy as np
- import pandas as pd
- import os
- # Summarize the data
- dpath = './demo/data/data1.csv'
- input_data = pd.read_csv(dpath)
- r = [input_data.min(),input_data.max(),input_data.mean(),input_data.std()]
- r = pd.DataFrame(r, index=['Min','Max','Mean','Std'])
- r = np.round(r,2)
- print(r)
| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min | 3831732.00 | 181.54 | 448.19 | 7571.00 | 6212.70 | 6370241.00 | 525.71 | 985.31 | 60.62 | 65.66 | 97.50 | 1.03 | 5321.00 | 64.87 |
Max | 7599295.00 | 2110.78 | 6882.85 | 42049.14 | 33156.83 | 8323096.00 | 4454.55 | 15420.14 | 228.46 | 852.56 | 120.00 | 1.91 | 41972.00 | 2088.14 |
Mean | 5579519.95 | 765.04 | 2370.83 | 19644.69 | 15870.95 | 7350513.60 | 1712.24 | 5705.80 | 129.49 | 340.22 | 103.31 | 1.42 | 17273.80 | 618.08 |
Std | 1262194.72 | 595.70 | 1919.17 | 10203.02 | 8199.77 | 621341.85 | 1184.71 | 4478.40 | 50.51 | 251.58 | 5.51 | 0.25 | 11109.19 | 609.25 |
- # Compute the Pearson correlation coefficients
- np.round(input_data.corr(method='pearson'),2)
| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x1 | 1.00 | 0.95 | 0.95 | 0.97 | 0.97 | 0.99 | 0.95 | 0.97 | 0.98 | 0.98 | -0.29 | 0.94 | 0.96 | 0.94 |
x2 | 0.95 | 1.00 | 1.00 | 0.99 | 0.99 | 0.92 | 0.99 | 0.99 | 0.98 | 0.98 | -0.13 | 0.89 | 1.00 | 0.98 |
x3 | 0.95 | 1.00 | 1.00 | 0.99 | 0.99 | 0.92 | 1.00 | 0.99 | 0.98 | 0.99 | -0.15 | 0.89 | 1.00 | 0.99 |
x4 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 0.95 | 0.99 | 1.00 | 0.99 | 1.00 | -0.19 | 0.91 | 1.00 | 0.99 |
x5 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 0.95 | 0.99 | 1.00 | 0.99 | 1.00 | -0.18 | 0.90 | 0.99 | 0.99 |
x6 | 0.99 | 0.92 | 0.92 | 0.95 | 0.95 | 1.00 | 0.93 | 0.95 | 0.97 | 0.96 | -0.34 | 0.95 | 0.94 | 0.91 |
x7 | 0.95 | 0.99 | 1.00 | 0.99 | 0.99 | 0.93 | 1.00 | 0.99 | 0.98 | 0.99 | -0.15 | 0.89 | 1.00 | 0.99 |
x8 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 0.95 | 0.99 | 1.00 | 0.99 | 1.00 | -0.15 | 0.90 | 1.00 | 0.99 |
x9 | 0.98 | 0.98 | 0.98 | 0.99 | 0.99 | 0.97 | 0.98 | 0.99 | 1.00 | 0.99 | -0.23 | 0.91 | 0.99 | 0.98 |
x10 | 0.98 | 0.98 | 0.99 | 1.00 | 1.00 | 0.96 | 0.99 | 1.00 | 0.99 | 1.00 | -0.17 | 0.90 | 0.99 | 0.99 |
x11 | -0.29 | -0.13 | -0.15 | -0.19 | -0.18 | -0.34 | -0.15 | -0.15 | -0.23 | -0.17 | 1.00 | -0.43 | -0.16 | -0.12 |
x12 | 0.94 | 0.89 | 0.89 | 0.91 | 0.90 | 0.95 | 0.89 | 0.90 | 0.91 | 0.90 | -0.43 | 1.00 | 0.90 | 0.87 |
x13 | 0.96 | 1.00 | 1.00 | 1.00 | 0.99 | 0.94 | 1.00 | 1.00 | 0.99 | 0.99 | -0.16 | 0.90 | 1.00 | 0.99 |
y | 0.94 | 0.98 | 0.99 | 0.99 | 0.99 | 0.91 | 0.99 | 0.99 | 0.98 | 0.99 | -0.12 | 0.87 | 0.99 | 1.00 |
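The results show that only x11 is negatively correlated with y, and only weakly (r = -0.12); every other variable is positively correlated with y (r between 0.87 and 0.99).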
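The matrix also shows many feature pairs with coefficients at or near 1.00, which is the kind of multicollinearity that motivates variable selection. A small sketch for listing such pairs programmatically; the helper name `high_corr_pairs`, the 0.95 threshold and the toy data are assumptions made for illustration:

```python
import numpy as np

def high_corr_pairs(X, names, threshold=0.95):
    """Return (name_i, name_j, r) for every column pair with |Pearson r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)   # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return pairs

# Toy data: 'b' is almost a rescaled copy of 'a'; 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
toy = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=50), rng.normal(size=50)])
pairs = high_corr_pairs(toy, ['a', 'b', 'c'])
print(pairs)
```

With the case data, the same call could be made as `high_corr_pairs(input_data.values, list(input_data.columns))`.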
- # Fit a Lasso model (scikit-learn ships no AdaptiveLasso estimator, so plain Lasso is used here)
- from sklearn import linear_model
- model = linear_model.Lasso(alpha=1)
- model.fit(input_data.iloc[:,0:13], input_data['y'])
- model.coef_
- /Users/januswing/Library/Python/3.6/lib/python/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
- ConvergenceWarning)
- array([-1.85085555e-04, -3.15519378e-01, 4.32896206e-01, -3.15753523e-02,
- 7.58007814e-02, 4.03145358e-04, 2.41255896e-01, -3.70482514e-02,
- -2.55448330e+00, 4.41363280e-01, 5.69277642e+00, -0.00000000e+00,
- -3.98946837e-02])
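The ConvergenceWarning above, together with the very different column scales (x1 is in the millions while x12 is close to 1), suggests standardizing the features and raising `max_iter` before reading anything into the coefficients. A minimal sketch on synthetic data, assuming scikit-learn is installed; the column scales, the `alpha` value and the data are illustrative assumptions, not the author's settings:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 60
big = rng.normal(5e6, 1e6, size=n)       # a column on roughly the scale of x1
small = rng.normal(1.4, 0.25, size=n)    # a column on roughly the scale of x12
noise = rng.normal(size=n)               # an irrelevant column
X_demo = np.column_stack([big, small, noise])
y_demo = 2e-6 * big + 4.0 * small + rng.normal(scale=0.1, size=n)

# Standardize first so the L1 penalty treats every feature on the same footing.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=100_000))
pipe.fit(X_demo, y_demo)
coefs = pipe.named_steps['lasso'].coef_   # coefficients on the standardized scale
print(np.round(coefs, 3))
```

On the standardized scale the coefficient magnitudes are directly comparable, and the irrelevant column is shrunk to exactly zero.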
- def GM11(x0):  # custom grey prediction function; x0 is a 1-D NumPy array
-     x1 = x0.cumsum()  # 1-AGO (accumulated) sequence
-     z1 = (x1[:len(x1)-1] + x1[1:]) / 2.0  # mean sequence of consecutive neighbours
-     z1 = z1.reshape((len(z1), 1))
-     B = np.append(-z1, np.ones_like(z1), axis=1)
-     Yn = x0[1:].reshape((len(x0)-1, 1))
-     [[a], [b]] = np.dot(np.dot(np.linalg.inv(np.dot(B.T, B)), B.T), Yn)  # least-squares estimate of a and b
-     f = lambda k: (x0[0]-b/a)*np.exp(-a*(k-1)) - (x0[0]-b/a)*np.exp(-a*(k-2))  # restored (de-accumulated) value
-     delta = np.abs(x0 - np.array([f(i) for i in range(1, len(x0)+1)]))
-     C = delta.std() / x0.std()
-     P = 1.0*(np.abs(delta - delta.mean()) < 0.6745*x0.std()).sum() / len(x0)
-     return f, a, b, x0[0], C, P  # prediction function, a, b, first term, posterior variance ratio, small-error probability
- inputfile = './demo/data/data1.csv'  # input data file
- outputfile = './demo/tmp/data1_GM11.xls'  # where the grey-prediction results are saved
- data = pd.read_csv(inputfile)  # read the data
- data.index = range(1994, 2014)
- data.loc[2014] = np.nan
- data.loc[2015] = np.nan
- l = ['x1', 'x2', 'x3', 'x4', 'x5', 'x7']  # selected key features
- for i in l:
-     f = GM11(data[i].iloc[:20].values)[0]  # fit on the 20 observed years (as_matrix() was removed from pandas)
-     data.loc[2014, i] = f(len(data)-1)  # 2014 forecast
-     data.loc[2015, i] = f(len(data))  # 2015 forecast
-     data[i] = data[i].round(2)  # keep two decimal places
- data[l+['y']].to_excel(outputfile)  # write the results
- data[l+['y']]
| | x1 | x2 | x3 | x4 | x5 | x7 | y |
---|---|---|---|---|---|---|---|
1994 | 3831732.00 | 181.54 | 448.19 | 7571.00 | 6212.70 | 525.71 | 64.87 |
1995 | 3913824.00 | 214.63 | 549.97 | 9038.16 | 7601.73 | 618.25 | 99.75 |
1996 | 3928907.00 | 239.56 | 686.44 | 9905.31 | 8092.82 | 638.94 | 88.11 |
1997 | 4282130.00 | 261.58 | 802.59 | 10444.60 | 8767.98 | 656.58 | 106.07 |
1998 | 4453911.00 | 283.14 | 904.57 | 11255.70 | 9422.33 | 758.83 | 137.32 |
1999 | 4548852.00 | 308.58 | 1000.69 | 12018.52 | 9751.44 | 878.26 | 188.14 |
2000 | 4962579.00 | 348.09 | 1121.13 | 13966.53 | 11349.47 | 923.67 | 219.91 |
2001 | 5029338.00 | 387.81 | 1248.29 | 14694.00 | 11467.35 | 978.21 | 271.91 |
2002 | 5070216.00 | 453.49 | 1370.68 | 13380.47 | 10671.78 | 1009.24 | 269.10 |
2003 | 5210706.00 | 533.55 | 1494.27 | 15002.59 | 11570.58 | 1175.17 | 300.55 |
2004 | 5407087.00 | 598.33 | 1677.77 | 16884.16 | 13120.83 | 1348.93 | 338.45 |
2005 | 5744550.00 | 665.32 | 1905.84 | 18287.24 | 14468.24 | 1519.16 | 408.86 |
2006 | 5994973.00 | 738.97 | 2199.14 | 19850.66 | 15444.93 | 1696.38 | 476.72 |
2007 | 6236312.00 | 877.07 | 2624.24 | 22469.22 | 18951.32 | 1863.34 | 838.99 |
2008 | 6529045.00 | 1005.37 | 3187.39 | 25316.72 | 20835.95 | 2105.54 | 843.14 |
2009 | 6791495.00 | 1118.03 | 3615.77 | 27609.59 | 22820.89 | 2659.85 | 1107.67 |
2010 | 7110695.00 | 1304.48 | 4476.38 | 30658.49 | 25011.61 | 3263.57 | 1399.16 |
2011 | 7431755.00 | 1700.87 | 5243.03 | 34438.08 | 28209.74 | 3412.21 | 1535.14 |
2012 | 7512997.00 | 1969.51 | 5977.27 | 38053.52 | 30490.44 | 3758.39 | 1579.68 |
2013 | 7599295.00 | 2110.78 | 6882.85 | 42049.14 | 33156.83 | 4454.55 | 2088.14 |
2014 | 8142148.24 | 2239.29 | 7042.31 | 43611.84 | 35046.63 | 4600.40 | NaN |
2015 | 8460489.28 | 2581.14 | 8166.92 | 47792.22 | 38384.22 | 5214.78 | NaN |
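Before trusting these extrapolated values it helps to sanity-check a GM(1,1) routine on a series whose behaviour is known, and to read the posterior variance ratio C and the small-error probability P against an accuracy scale. The sketch below is a standalone re-implementation (named `gm11` to keep it apart from the `GM11` function above) that uses `np.linalg.lstsq` for the parameter fit. The `grade` thresholds are the commonly quoted ones for grey models and should be treated as an assumption here.

```python
import numpy as np

def gm11(x0):
    """GM(1,1) fit; returns (predict, a, b, C, P) for a 1-D positive series."""
    x0 = np.asarray(x0, dtype=float)
    x1 = x0.cumsum()                          # 1-AGO accumulated series
    z1 = (x1[:-1] + x1[1:]) / 2.0             # mean of consecutive neighbours
    B = np.column_stack([-z1, np.ones_like(z1)])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]

    def predict(k):
        # Restored value for step k (k = 1 is the first observation).
        return (x0[0] - b / a) * (np.exp(-a * (k - 1)) - np.exp(-a * (k - 2)))

    fitted = np.array([predict(k) for k in range(1, len(x0) + 1)])
    delta = np.abs(x0 - fitted)
    C = delta.std() / x0.std()                                           # posterior variance ratio
    P = float((np.abs(delta - delta.mean()) < 0.6745 * x0.std()).mean()) # small-error probability
    return predict, a, b, C, P

def grade(C, P):
    """Accuracy grade from C and P (thresholds are an assumed, commonly quoted scale)."""
    if C < 0.35 and P > 0.95:
        return 'good'
    if C < 0.50 and P > 0.80:
        return 'qualified'
    if C < 0.65 and P > 0.70:
        return 'barely qualified'
    return 'unqualified'

# Known series: 10% growth per step, so the next value should be close to 100 * 1.1**10.
series = 100 * 1.1 ** np.arange(10)
predict, a, b, C, P = gm11(series)
next_value = predict(len(series) + 1)
print(round(float(next_value), 2), grade(C, P))
```

GM(1,1) assumes roughly exponential behaviour, so a series like this is its best case; real indicators that change direction will fit less well, which is what C and P are meant to reveal.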
- import pandas as pd
- inputfile = './tmp/data1_GM11.xls'  # file saved by the grey-prediction step
- outputfile = './data/revenue.xls'  # where the neural network forecasts are saved
- modelfile = './tmp/1-net.model'  # where the model weights are saved
- data = pd.read_excel(inputfile, index_col=0)  # read the data, using the year column as the index
- feature = ['x1', 'x2', 'x3', 'x4', 'x5', 'x7']  # feature columns
- data_train = data.loc[range(1994, 2014)].copy()  # model on the data before 2014
- data_mean = data_train.mean()
- data_std = data_train.std()
- data_train = (data_train - data_mean)/data_std  # standardize the data
- x_train = data_train[feature].values  # feature data
- y_train = data_train['y'].values  # label data
- from keras.models import Sequential
- from keras.layers import Dense, Activation
- model = Sequential()  # build the model
- model.add(Dense(input_dim=6, units=12))  # Keras 2 argument names (units replaces output_dim)
- model.add(Activation('relu'))  # use ReLU as the activation function, which can improve accuracy considerably
- model.add(Dense(input_dim=12, units=1))
- model.compile(loss='mean_squared_error', optimizer='adam')  # compile the model
- model.fit(x_train, y_train, epochs=10000, batch_size=16, verbose=0)  # train for 10,000 epochs (epochs replaces nb_epoch)
- model.save_weights(modelfile)  # save the model weights
- # Predict, then undo the standardization.
- x = ((data[feature] - data_mean[feature])/data_std[feature]).values
- data['y_pred'] = model.predict(x).ravel() * data_std['y'] + data_mean['y']
- data.to_excel(outputfile)
- import matplotlib.pyplot as plt  # plot the forecast results
- p = data[['y','y_pred']].plot(subplots = True, style=['b-o','r-*'])
- plt.show()
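A plot shows the overall fit, but a few error numbers make it easier to compare runs. The helper below is a sketch: the name `forecast_errors` and the toy numbers are illustrative, and rows without an actual value (such as the forecast years) are skipped. Because the network was trained on the same years it is scored on, these are in-sample errors and will look better than true forecast errors.

```python
import numpy as np

def forecast_errors(actual, predicted):
    """MAE, RMSE and MAPE (%) over the rows where an actual value exists."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = ~np.isnan(actual)                  # ignore years that have no observed y
    err = predicted[mask] - actual[mask]
    return {
        'mae': float(np.mean(np.abs(err))),
        'rmse': float(np.sqrt(np.mean(err ** 2))),
        'mape': float(np.mean(np.abs(err / actual[mask])) * 100),
    }

# Toy example: two observed years and one forecast-only year.
errors = forecast_errors([100.0, 200.0, np.nan], [110.0, 180.0, 250.0])
print(errors)
```

With the case data the call would be `forecast_errors(data['y'], data['y_pred'])`.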
Open questions:
What other methods are there for identifying key features? Which of them can be used in PyTorch?
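One possible answer, offered as a sketch rather than a recommendation: permutation importance is model-agnostic, so it works with any fitted predictor, and the `predict` callable below could just as well wrap a PyTorch model's forward pass. The function name, the MSE loss and the toy model are assumptions made for illustration.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column is shuffled; larger means more important."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between column j and y
            scores[j] += np.mean((predict(Xp) - y) ** 2) - base
    return scores / n_repeats

# Toy model that uses column 0 strongly, column 1 weakly, and ignores column 2.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
y_toy = 3 * X_toy[:, 0] + 0.5 * X_toy[:, 1]

def toy_predict(X):
    return 3 * X[:, 0] + 0.5 * X[:, 1]

imp = permutation_importance(toy_predict, X_toy, y_toy)
print(np.round(imp, 3))
```

With strongly correlated features, as in this case, shuffling one column at a time can understate importance because the other columns still carry the same information, so the scores are best read alongside the correlation matrix.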
Source: http://www.bubuko.com/infodetail-2718558.html