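Preparation
We use the IMDB dataset that ships with Keras. Each sample is a movie review encoded as a sequence of word indices, paired with a sentiment label (0 = negative, 1 = positive).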
- import pandas as pd
- from keras.preprocessing import sequence
- from keras.models import Sequential
- from keras.layers import Dense,Dropout,Activation
- from keras.layers import Embedding
- from keras.layers import Conv1D,GlobalAveragePooling1D
- from keras.datasets import imdb
- from sklearn.metrics import accuracy_score,classification_report
- # Parameters: keep only the 6000 most frequent words; cap each review at 400 tokens
- max_features = 6000
- max_length = 400
- (x_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)
- print(len(x_train),'train observations')
- print(len(x_test),'test observations')
- wind = imdb.get_word_index() # maps word -> frequency rank (1 = most frequent)
- # load_data() shifts every rank by 3 (indices 0-3 are reserved), so invert with that offset
- revind = dict((v + 3, k) for k, v in wind.items())
- print(x_train[0])
- print(y_train[0])
- def decode(sent_list): # inverse lookup: word indices => words
-     new_words = []
-     for i in sent_list:
-         new_words.append(revind.get(i, '?')) # '?' marks reserved indices
-     comb_words = " ".join(new_words)
-     return comb_words
- print(decode(x_train[0]))
Output:
- 25000 train observations
- 25000 test observations
- [1, 14, 22, 16, 43, 530, 973, 1622, 1385, ...]
- 1
- ? this film was just brilliant casting location scenery ...
How to implement it
1. Preprocessing: pad the data to a fixed length
2. Build and validate the 1D CNN model
3. Evaluate the model
Code
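Step 1 relies on `sequence.pad_sequences`, which by default pads with zeros on the left and, for sequences longer than `maxlen`, drops tokens from the front. The block below is a minimal pure-Python sketch of that default behaviour, not the Keras implementation; the helper name `pad_pre` and the toy input are made up for illustration.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to exactly `maxlen` items.

    Hypothetical helper that mimics the default padding='pre',
    truncating='pre' behaviour of pad_sequences (maxlen must be > 0).
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                 # keep the last maxlen tokens
        padding = [value] * (maxlen - len(kept))   # zeros go on the left
        padded.append(padding + kept)
    return padded


# Toy data: one short "review" and one that is too long for maxlen=4.
toy = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

With `maxlen=400` this is why every review ends up as a row of exactly 400 indices, whatever its original length.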
- import pandas as pd
- from keras.preprocessing import sequence
- from keras.models import Sequential
- from keras.layers import Dense,Dropout,Activation
- from keras.layers import Embedding
- from keras.layers import Conv1D,GlobalAveragePooling1D
- from keras.datasets import imdb
- from sklearn.metrics import accuracy_score,classification_report
- # Parameters: keep only the 6000 most frequent words; cap each review at 400 tokens
- max_features = 6000
- max_length = 400
- (x_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)
- # print(x_train) # a list of reviews, each a list of word indices
- # print(y_train) # labels, each 0 or 1
- # print(len(x_train),'train observations')
- # print(len(x_test),'test observations')
- wind = imdb.get_word_index() # maps word -> frequency rank (1 = most frequent)
- # load_data() shifts every rank by 3 (indices 0-3 are reserved), so invert with that offset
- revind = dict((v + 3, k) for k, v in wind.items())
- # print(x_train[0])
- # print(y_train[0])
- def decode(sent_list): # inverse lookup: word indices => words
-     new_words = []
-     for i in sent_list:
-         new_words.append(revind.get(i, '?')) # '?' marks reserved indices
-     comb_words = " ".join(new_words)
-     return comb_words
- # print(decode(x_train[0]))
- # Pad (or truncate) every review to max_length = 400 so all inputs have the same length
- x_train = sequence.pad_sequences(x_train,maxlen=max_length)
- x_test = sequence.pad_sequences(x_test,maxlen=max_length)
- print('x_train.shape:',x_train.shape)
- print('x_test.shape:',x_test.shape)
- ## 1D CNN model built with Keras
- # Hyperparameters
- batch_size = 32
- embedding_dims = 60
- num_kernels = 260
- kernel_size = 3
- hidden_dims = 300
- epochs = 3
- # Build the model
- model = Sequential()
- model.add(Embedding(max_features,embedding_dims,input_length=max_length))
- model.add(Dropout(0.2))
- model.add(Conv1D(num_kernels,kernel_size,padding='valid',activation='relu',strides=1))
- model.add(GlobalAveragePooling1D())
- model.add(Dense(hidden_dims))
- model.add(Dropout(0.5))
- model.add(Activation('relu'))
- model.add(Dense(1))
- model.add(Activation('sigmoid'))
- model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
- print(model.summary())
- model.fit(x_train,y_train,batch_size=batch_size,epochs=epochs,validation_split=0.2)
- # Predict class labels by thresholding the sigmoid output at 0.5
- # (Sequential.predict_classes is not available in recent Keras releases)
- y_train_predclass = (model.predict(x_train,batch_size=batch_size) > 0.5).astype('int32').ravel()
- y_test_preclass = (model.predict(x_test,batch_size=batch_size) > 0.5).astype('int32').ravel()
- print('\n\nCNN 1D - Train accuracy:',round(accuracy_score(y_train,y_train_predclass),3))
- print('\nCNN 1D of Training data\n',classification_report(y_train,y_train_predclass))
- print('\nCNN 1D - Train Confusion Matrix\n\n',pd.crosstab(y_train,y_train_predclass,
- rownames=['Actuall'],colnames=['Predicted']))
- print('\nCNN 1D - Test accuracy:',round(accuracy_score(y_test,y_test_preclass),3))
- print('\nCNN 1D of Test data\n',classification_report(y_test,y_test_preclass))
- print('\nCNN 1D - Test Confusion Matrix\n\n',pd.crosstab(y_test,y_test_preclass,
- rownames=['Actuall'],colnames=['Predicted']))
Output:
- Using TensorFlow backend.
- x_train.shape: (25000, 400)
- x_test.shape: (25000, 400)
- WARNING:tensorflow:From
- D:\Python37\Lib\site-packages\tensorflow\python\framework\op_def_library.py:263:
- colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a
- future version.
- Instructions for updating:
- Colocations handled automatically by placer.
- WARNING:tensorflow:From
- D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3445: calling dropout
- (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a
- future version.
- Instructions for updating:
- Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
- _________________________________________________________________
- Layer (type)                 Output Shape              Param #
- =================================================================
- embedding_1 (Embedding)      (None, 400, 60)           360000
- _________________________________________________________________
- dropout_1 (Dropout)          (None, 400, 60)           0
- _________________________________________________________________
- conv1d_1 (Conv1D)            (None, 398, 260)          47060
- _________________________________________________________________
- global_average_pooling1d_1 ( (None, 260)               0
- _________________________________________________________________
- dense_1 (Dense)              (None, 300)               78300
- _________________________________________________________________
- dropout_2 (Dropout)          (None, 300)               0
- _________________________________________________________________
- activation_1 (Activation)    (None, 300)               0
- _________________________________________________________________
- dense_2 (Dense)              (None, 1)                 301
- _________________________________________________________________
- activation_2 (Activation)    (None, 1)                 0
- =================================================================
- Total params: 485,661
- Trainable params: 485,661
- Non-trainable params: 0
- _________________________________________________________________
- None
- WARNING:tensorflow:From
- D:\Python37\Lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from
- tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
- Instructions for updating:
- Use tf.cast instead.
- Train on 20000 samples, validate on 5000 samples
- Epoch 1/3
- 2019-07-07 15:27:37.848057: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU
- supports instructions that this TensorFlow binary was not compiled to use: AVX2
-    32/20000 [..............................] - ETA: 7:03 - loss: 0.6929 - acc: 0.5000
-    64/20000 [..............................] - ETA: 4:13 - loss: 0.6927 - acc: 0.5156
-    96/20000 [..............................] - ETA: 3:19 - loss: 0.6933 - acc: 0.5000
-   128/20000 [..............................] - ETA: 2:50 - loss: 0.6935 - acc: 0.4844
-   160/20000 [..............................] - ETA: 2:32 - loss: 0.6931 - acc: 0.4813
(The per-batch progress output for the remaining steps and epochs is omitted here.)
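The parameter counts and the `(None, 398, 260)` shape in the model summary above can be reproduced by hand from the hyperparameters in the listing. The following is a small arithmetic sketch, using only the values defined in the code; the intermediate variable names are chosen here for illustration.

```python
# Hyperparameters copied from the listing.
max_features, embedding_dims = 6000, 60
max_length = 400
num_kernels, kernel_size = 260, 3
hidden_dims = 300

# Embedding: one 60-dimensional vector per vocabulary entry.
embedding_params = max_features * embedding_dims

# Conv1D: each kernel spans kernel_size steps x embedding_dims channels, plus one bias.
conv_params = kernel_size * embedding_dims * num_kernels + num_kernels

# With padding='valid' and strides=1 the sequence shrinks by kernel_size - 1.
conv_steps = max_length - kernel_size + 1

# Dense layers: weights plus biases (pooling, dropout and activation add none).
dense_hidden_params = num_kernels * hidden_dims + hidden_dims
dense_out_params = hidden_dims * 1 + 1

total = embedding_params + conv_params + dense_hidden_params + dense_out_params
print(embedding_params, conv_params, dense_hidden_params, dense_out_params)
print("conv output steps:", conv_steps)
print("total params:", total)
```

The embedding layer holds roughly three quarters of all weights, so the vocabulary size (`max_features`) is the main lever on model size in this setup.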
- CNN 1D - Train accuracy: 0.949
- CNN 1D of Training data
-               precision    recall  f1-score   support
-            0       0.94      0.96      0.95     12500
-            1       0.95      0.94      0.95     12500
-     accuracy                           0.95     25000
-    macro avg       0.95      0.95      0.95     25000
- weighted avg       0.95      0.95      0.95     25000
- CNN 1D - Train Confusion Matrix
- Predicted      0      1
- Actuall
- 0          11938    562
- 1            715  11785
- CNN 1D - Test accuracy: 0.876
- CNN 1D of Test data
-               precision    recall  f1-score   support
-            0       0.86      0.89      0.88     12500
-            1       0.89      0.86      0.87     12500
-     accuracy                           0.88     25000
-    macro avg       0.88      0.88      0.88     25000
- weighted avg       0.88      0.88      0.88     25000
- CNN 1D - Test Confusion Matrix
- Predicted      0      1
- Actuall
- 0          11144   1356
- 1           1744  10756
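The accuracy, precision and recall figures in the reports follow directly from the two confusion matrices (rows are actual labels, columns are predicted labels). As a sanity check, here is a short sketch that recomputes them from the four cell counts; the helper `summarize` is hypothetical and only mirrors the standard definitions, it is not part of scikit-learn.

```python
def summarize(tn, fp, fn, tp):
    """Recompute headline metrics from a 2x2 confusion matrix.

    tn/fp are the row for actual class 0, fn/tp the row for actual class 1.
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": round((tn + tp) / total, 3),
        "precision_1": round(tp / (tp + fp), 2),  # of predicted positives, how many are right
        "recall_1": round(tp / (tp + fn), 2),     # of actual positives, how many are found
    }


# Cell counts copied from the train and test confusion matrices above.
train = summarize(tn=11938, fp=562, fn=715, tp=11785)
test = summarize(tn=11144, fp=1356, fn=1744, tp=10756)
print("train:", train)
print("test: ", test)
```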
Source: http://www.bubuko.com/infodetail-3116657.html