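Project Introduction
In the article "CNN 大战验证码" (CNN vs. CAPTCHAs), https://www.jianshu.com/p/0287bcd24f78 , we used TensorFlow to build a simple CNN model that cracks the CAPTCHAs of a certain website. A sample CAPTCHA is shown below:
In this article, we use Keras to build a slightly more complex CNN model to crack the same CAPTCHAs.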
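Dataset
The preprocessing of the CAPTCHA images is not repeated here; interested readers can refer to the article "CNN 大战验证码" (CNN vs. CAPTCHAs), https://www.jianshu.com/p/0287bcd24f78 .
The dataset for this project contains 1668 samples. Each sample is a single-character image of size 16*20 (16 pixels wide, 20 pixels high). The features of a sample are the pixels of its character image, where 0 stands for white and 1 for black, so each sample has 320 binary features named v1 to v320. The class label of a sample is the character itself. Part of the dataset is shown below: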
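To make the v1 to v320 layout concrete, here is a minimal sketch of how a 20-row by 16-column character image maps onto the 320 features. It assumes row-major ordering, which is what the reshape to (20, 16, 1) in the training script implies but which the original text does not state outright. The helper names (feature_name, to_features, to_grid) and the sample image are hypothetical.

```python
HEIGHT, WIDTH = 20, 16  # rows and columns of one character image


def feature_name(row, col):
    """Feature name for pixel (row, col), assuming row-major order."""
    return 'v%d' % (row * WIDTH + col + 1)


def to_features(image):
    """Flatten a HEIGHT x WIDTH grid of 0/1 pixels into a 320-element list."""
    return [pixel for row in image for pixel in row]


def to_grid(features):
    """Inverse of to_features: rebuild the HEIGHT x WIDTH grid."""
    return [features[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]


# Hypothetical sample: a vertical black stroke (value 1) in column 5
image = [[1 if col == 5 else 0 for col in range(WIDTH)] for _ in range(HEIGHT)]
features = to_features(image)

print(len(features))       # number of features per sample
print(feature_name(0, 5))  # feature holding the first stroke pixel
```

Under this assumption, v1 to v16 hold the first image row, v17 to v32 the second, and so on.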
CNN Model
Keras makes it quick and convenient to build a CNN model. The CNN model built in this article is as follows:
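As a rough aid for reading the architecture, the sketch below traces how the feature-map shape changes through the network. It assumes the layer stack used in the training script in this section: three blocks, each made of two 3x3 convolutions with padding='same' followed by 2x2 max pooling with padding='same' and the default stride. The function names are hypothetical, and the arithmetic is done by hand rather than by Keras.

```python
import math


def same_pool(size, pool=2):
    """Output length of pooling with padding='same' and stride equal to the pool size."""
    return math.ceil(size / pool)


def trace_shapes(height=20, width=16, filters=(32, 64, 128)):
    """Return the (height, width, channels) shape after each conv/pool block."""
    shapes = []
    for n_filters in filters:
        # 3x3 convolutions with padding='same' leave height and width unchanged,
        # so only the pooling layer shrinks the feature map.
        height, width = same_pool(height), same_pool(width)
        shapes.append((height, width, n_filters))
    return shapes


shapes = trace_shapes()
h, w, c = shapes[-1]
flat = h * w * c  # input size of the first Dense layer after Flatten

for shape in shapes:
    print(shape)
print(flat)
```

Under these assumptions, the 20x16 input shrinks to 10x8, then 5x4, then 3x2, so the Flatten layer passes 3*2*128 = 768 values to the fully connected layers.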
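The dataset is split into a training set and a test set at a ratio of 7:3 (test_size=0.3 in the code, which gives 1167 training samples and 501 test samples). The training code is as follows: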
- # -*- coding: utf-8 -*-
- import numpy as np
- import pandas as pd
- from sklearn.model_selection import train_test_split
- from matplotlib import pyplot as plt
- from keras.utils import np_utils, plot_model
- from keras.models import Sequential
- from keras.layers.core import Dense, Dropout, Activation, Flatten
- from keras.callbacks import EarlyStopping
- from keras.layers import Conv2D, MaxPooling2D
- # Load the data
- df = pd.read_csv('F://verifycode_data/data.csv')
- # Label values
- vals = range(31)
- keys = ['1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','J','K','L','N','P','Q','R','S','T','U','V','X','Y','Z']
- label_dict = dict(zip(keys, vals))
- x_data = df[['v'+str(i+1) for i in range(320)]]
- y_data = pd.DataFrame({'label':df['label']})
- y_data['class'] = y_data['label'].apply(lambda x: label_dict[x])
- # Split the data into a training set and a test set
- X_train, X_test, Y_train, Y_test = train_test_split(x_data, y_data['class'], test_size=0.3, random_state=42)
- x_train = np.array(X_train).reshape((1167, 20, 16, 1))
- x_test = np.array(X_test).reshape((501, 20, 16, 1))
- # One-hot encode the labels
- n_classes = 31
- y_train = np_utils.to_categorical(Y_train, n_classes)
- y_val = np_utils.to_categorical(Y_test, n_classes)
- input_shape = x_train[0].shape
- # CNN model
- model = Sequential()
- # Convolution and pooling layers
- model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape, padding='same'))
- model.add(Activation('relu'))
- model.add(Conv2D(32, kernel_size=(3, 3), padding='same'))
- model.add(Activation('relu'))
- model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
- # Dropout layer
- model.add(Dropout(0.25))
- model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
- model.add(Activation('relu'))
- model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
- model.add(Activation('relu'))
- model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
- model.add(Dropout(0.25))
- model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
- model.add(Activation('relu'))
- model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
- model.add(Activation('relu'))
- model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
- model.add(Dropout(0.25))
- model.add(Flatten())
- # Fully connected layers
- model.add(Dense(256, activation='relu'))
- model.add(Dropout(0.5))
- model.add(Dense(128, activation='relu'))
- model.add(Dense(n_classes, activation='softmax'))
- model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
- # plot model
- plot_model(model, to_file=r'./model.png', show_shapes=True)
- # Train the model
- callbacks = [EarlyStopping(monitor='val_acc', patience=5, verbose=1)]
- batch_size = 64
- n_epochs = 100
- history = model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epochs, \
- verbose=1, validation_data=(x_test, y_val), callbacks=callbacks)
- mp = 'F://verifycode_data/verifycode_Keras.h5'
- model.save(mp)
- # Plot the accuracy curve on the validation set
- val_acc = history.history['val_acc']
- plt.plot(range(len(val_acc)), val_acc, label='CNN model')
- plt.title('Validation accuracy on verifycode dataset')
- plt.xlabel('epochs')
- plt.ylabel('accuracy')
- plt.legend()
- plt.show()
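In the code above, we use early stopping when training the model. Early stopping is a callback that halts training before the maximum number of epochs is reached. Specifically, training stops once the monitored metric no longer improves. Here the callback monitors the validation accuracy (val_acc) and stops training after 5 consecutive epochs without improvement (patience=5).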
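The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras implementation; the function name early_stopping_epoch and the accuracy history are made up for the example.

```python
def early_stopping_epoch(val_acc, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the monitored value has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float('-inf')
    wait = 0
    for epoch, acc in enumerate(val_acc, start=1):
        if acc - min_delta > best:
            best = acc  # new best value: reset the counter
            wait = 0
        else:
            wait += 1   # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation accuracies, one value per epoch
history = [0.80, 0.90, 0.95, 0.94, 0.95, 0.93, 0.94, 0.95, 0.96]
stop = early_stopping_epoch(history)
print(stop)
```

In this made-up history the best value is reached at epoch 3, and the following five epochs never exceed it, so the run stops at epoch 8 and the final value of 0.96 is never computed.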
Model Training
Running the training code above produces the following output:
- ...... (earlier output omitted)
- Epoch 22/100
- 64/1167 [>.............................] - ETA: 3s - loss: 0.0399 - acc: 1.0000
- 128/1167 [==>...........................] - ETA: 3s - loss: 0.1195 - acc: 0.9844
- 192/1167 [===>..........................] - ETA: 2s - loss: 0.1085 - acc: 0.9792
- 256/1167 [=====>........................] - ETA: 2s - loss: 0.1132 - acc: 0.9727
- 320/1167 [=======>......................] - ETA: 2s - loss: 0.1045 - acc: 0.9750
- 384/1167 [========>.....................] - ETA: 2s - loss: 0.1006 - acc: 0.9740
- 448/1167 [==========>...................] - ETA: 2s - loss: 0.1522 - acc: 0.9643
- 512/1167 [============>.................] - ETA: 1s - loss: 0.1450 - acc: 0.9648
- 576/1167 [=============>................] - ETA: 1s - loss: 0.1368 - acc: 0.9653
- 640/1167 [===============>..............] - ETA: 1s - loss: 0.1353 - acc: 0.9641
- 704/1167 [=================>............] - ETA: 1s - loss: 0.1280 - acc: 0.9659
- 768/1167 [==================>...........] - ETA: 1s - loss: 0.1243 - acc: 0.9674
- 832/1167 [====================>.........] - ETA: 0s - loss: 0.1577 - acc: 0.9639
- 896/1167 [======================>.......] - ETA: 0s - loss: 0.1488 - acc: 0.9665
- 960/1167 [=======================>......] - ETA: 0s - loss: 0.1488 - acc: 0.9656
- 1024/1167 [=========================>....] - ETA: 0s - loss: 0.1427 - acc: 0.9668
- 1088/1167 [==========================>...] - ETA: 0s - loss: 0.1435 - acc: 0.9669
- 1152/1167 [============================>.] - ETA: 0s - loss: 0.1383 - acc: 0.9688
- 1167/1167 [==============================] - 4s 3ms/step - loss: 0.1380 - acc: 0.9683 - val_loss: 0.0835 - val_acc: 0.9760
- Epoch 00022: early stopping
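As the log shows, training stopped after 22 epochs. After the final epoch, the accuracy was 96.83% on the training set and 97.60% on the test set. The accuracy curve on the test set is shown in the figure below: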
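Model Prediction
Once the model is trained, we use it to predict new CAPTCHAs. The 100 new CAPTCHAs are shown in the figure below:
Using the trained CNN model, we predict these new CAPTCHAs with the following Python code: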
- # -*- coding: utf-8 -*-
- import os
- import cv2
- import numpy as np
- def split_picture(imagepath):
-     # Read the image in grayscale mode
-     gray = cv2.imread(imagepath, 0)
-     # Set the border pixels of the image to white
-     height, width = gray.shape
-     for i in range(width):
-         gray[0, i] = 255
-         gray[height-1, i] = 255
-     for j in range(height):
-         gray[j, 0] = 255
-         gray[j, width-1] = 255
-     # Median filter
-     blur = cv2.medianBlur(gray, 3)  # 3*3 kernel
-     # Binarize
-     ret, thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)
-     # Extract the individual characters
-     chars_list = []
-     # findContours returns 3 values in OpenCV 3.x and 2 in OpenCV 4.x; [-2] selects the contours in both
-     contours = cv2.findContours(thresh1, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)[-2]
-     for cnt in contours:
-         # Bounding rectangle of the contour
-         x, y, w, h = cv2.boundingRect(cnt)
-         if x != 0 and y != 0 and w*h >= 100:
-             chars_list.append((x, y, w, h))
-     # Order the characters from left to right
-     sorted_chars_list = sorted(chars_list, key=lambda x: x[0])
-     for i, item in enumerate(sorted_chars_list):
-         x, y, w, h = item
-         cv2.imwrite('F://test_verifycode/chars/%d.jpg' % (i+1), thresh1[y:y+h, x:x+w])
- def remove_edge_picture(imagepath):
-     image = cv2.imread(imagepath, 0)
-     height, width = image.shape
-     corner_list = [image[0, 0] < 127,
-                    image[height-1, 0] < 127,
-                    image[0, width-1] < 127,
-                    image[height-1, width-1] < 127
-                    ]
-     if sum(corner_list) >= 3:
-         os.remove(imagepath)
- def resplit_with_parts(imagepath, parts):
-     image = cv2.imread(imagepath, 0)
-     os.remove(imagepath)
-     height, width = image.shape
-     file_name = imagepath.split('/')[-1].split(r'.')[0]
-     # Split the image again into `parts` pieces of equal width
-     step = width//parts  # step size
-     start = 0  # start position
-     for i in range(parts):
-         cv2.imwrite('F://test_verifycode/chars/%s.jpg' % (file_name+'-'+str(i)), \
-                     image[:, start:start+step])
-         start += step
- def resplit(imagepath):
-     image = cv2.imread(imagepath, 0)
-     height, width = image.shape
-     if width >= 64:
-         resplit_with_parts(imagepath, 4)
-     elif width >= 48:
-         resplit_with_parts(imagepath, 3)
-     elif width >= 26:
-         resplit_with_parts(imagepath, 2)
- # rename and convert to 16*20 size
- def convert(dir, file):
-     imagepath = dir+'/'+file
-     # Read the image
-     image = cv2.imread(imagepath, 0)
-     # Binarize
-     ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
-     img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
-     # Save the image
-     cv2.imwrite('%s/%s' % (dir, file), img)
- # Read the image data and convert it to 0-1 values
- def Read_Data(dir, file):
-     imagepath = dir+'/'+file
-     # Read the image
-     image = cv2.imread(imagepath, 0)
-     # Binarize
-     ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
-     # Map pixels equal to 255 to 1 and all other pixels to 0
-     bin_values = [1 if pixel==255 else 0 for pixel in thresh.ravel()]
-     return bin_values
- def predict(VerifyCodePath):
-     dir = 'F://test_verifycode/chars'
-     files = os.listdir(dir)
-     # Remove any existing files
-     if files:
-         for file in files:
-             os.remove(dir + '/' + file)
-     split_picture(VerifyCodePath)
-     files = os.listdir(dir)
-     if not files:
-         print('The folder is empty!')
-     else:
-         # Remove noise images
-         for file in files:
-             remove_edge_picture(dir + '/' + file)
-         # Re-split images that contain touching characters
-         for file in os.listdir(dir):
-             resplit(dir + '/' + file)
-         # Resize all images to 16*20
-         for file in os.listdir(dir):
-             convert(dir, file)
-         # Feature vectors of the characters in the image
-         files = sorted(os.listdir(dir), key=lambda x: x[0])
-         table = np.array([Read_Data(dir, file) for file in files]).reshape(-1,20,16,1)
-         # Path of the saved model
-         mp = 'F://verifycode_data/verifycode_Keras.h5'
-         # Load the model
-         from keras.models import load_model
-         cnn = load_model(mp)
-         # Predict
-         y_pred = cnn.predict(table)
-         predictions = np.argmax(y_pred, axis=1)
-         # Label dictionary
-         keys = range(31)
-         vals = ['1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'N',
-                 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z']
-         label_dict = dict(zip(keys, vals))
-         return ''.join([label_dict[pred] for pred in predictions])
- def main():
-     dir = 'F://VerifyCode/'
-     correct = 0
-     for i, file in enumerate(os.listdir(dir)):
-         true_label = file.split('.')[0]
-         VerifyCodePath = dir+file
-         pred = predict(VerifyCodePath)
-         if true_label == pred:
-             correct += 1
-         print(i+1, (true_label, pred), true_label == pred, correct)
-     total = len(os.listdir(dir))
-     print('\nTotal images: %d\nCorrectly recognized: %d\nRecognition accuracy: %.2f%%.'\
-           %(total, correct, correct*100/total))
- main()
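The re-splitting rule for touching characters can be looked at in isolation. The sketch below reuses the width thresholds (64, 48, 26) and the integer step from resplit and resplit_with_parts, but works on plain numbers instead of image files; the helper names parts_for_width and column_ranges are hypothetical.

```python
def parts_for_width(width):
    """Number of characters a segment of this pixel width is assumed to contain."""
    if width >= 64:
        return 4
    if width >= 48:
        return 3
    if width >= 26:
        return 2
    return 1  # narrow enough to be a single character: left untouched


def column_ranges(width, parts):
    """Column slices (start, stop) of equal width used to cut the segment."""
    step = width // parts
    return [(i * step, (i + 1) * step) for i in range(parts)]


for width in (20, 30, 50, 70):
    parts = parts_for_width(width)
    print(width, parts, column_ranges(width, parts))
```

Because the step is an integer division, any leftover columns on the right are dropped: a 50-pixel segment is cut into three 16-pixel slices and the last two columns are discarded.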
The prediction results of this CNN model are as follows:
- Using TensorFlow backend.
- -10-25 15:13:50.390130: I C:\tf_jenkins\workspace\rel-win\M\Windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
- ('ZK6N', 'ZK6N') True 1
- ('4JPX', '4JPX') True 2
- ('5GP5', '5GP5') True 3
- ('5RQ8', '5RQ8') True 4
- ('5TQP', '5TQP') True 5
- ('7S62', '7S62') True 6
- ('8R2Z', '8R2Z') True 7
- ('8RFV', '8RFV') True 8
- ('9BBT', '9BBT') True 9
- ('9LNE', '9LNE') True 10
- ('67UH', '67UH') True 11
- ('74UK', '74UK') True 12
- ('A5T2', 'A5T2') True 13
- ('AHYV', 'AHYV') True 14
- ('ASEY', 'ASEY') True 15
- ('B371', 'B371') True 16
- ('CCQL', 'CCQL') True 17
- ('CFD5', 'GFD5') False 17
- ('CJLJ', 'CJLJ') True 18
- ('D4QV', 'D4QV') True 19
- ('DFQ8', 'DFQ8') True 20
- ('DP18', 'DP18') True 21
- ('E3HC', 'E3HC') True 22
- ('E8VB', 'E8VB') True 23
- ('DE1U', 'DE1U') True 24
- ('FK1R', 'FK1R') True 25
- ('FK91', 'FK91') True 26
- ('FSKP', 'FSKP') True 27
- ('FVZP', 'FVZP') True 28
- ('GC6H', 'GC6H') True 29
- ('GH62', 'GH62') True 30
- ('H9FQ', 'H9FQ') True 31
- ('H67Q', 'H67Q') True 32
- ('HEKC', 'HEKC') True 33
- ('HV2B', 'HV2B') True 34
- ('J65Z', 'J65Z') True 35
- ('JZCX', 'JZCX') True 36
- ('KH5D', 'KH5D') True 37
- ('KXD2', 'KXD2') True 38
- ('1GDH', '1GDH') True 39
- ('LCL3', 'LCL3') True 40
- ('LNZR', 'LNZR') True 41
- ('LZU5', 'LZU5') True 42
- ('N5AK', 'N5AK') True 43
- ('N5Q3', 'N5Q3') True 44
- ('N96Z', 'N96Z') True 45
- ('NCDG', 'NCDG') True 46
- ('NELS', 'NELS') True 47
- ('P96U', 'P96U') True 48
- ('PD42', 'PD42') True 49
- ('PECG', 'PEQG') False 49
- ('PPZF', 'PPZF') True 50
- ('PUUL', 'PUUL') True 51
- ('Q2DN', 'D2DN') False 51
- ('QCQ9', 'QCQ9') True 52
- ('QDB1', 'QDBJ') False 52
- ('QZUD', 'QZUD') True 53
- ('R3T5', 'R3T5') True 54
- ('S1YT', 'S1YT') True 55
- ('SP7L', 'SP7L') True 56
- ('SR2K', 'SR2K') True 57
- ('SUP5', 'SVP5') False 57
- ('T2SP', 'T2SP') True 58
- ('U6V9', 'U6V9') True 59
- ('UC9P', 'UC9P') True 60
- ('UFYD', 'UFYD') True 61
- ('V9NJ', 'V9NH') False 61
- ('V35X', 'V35X') True 62
- ('V98F', 'V98F') True 63
- ('VD28', 'VD28') True 64
- ('YGHE', 'YGHE') True 65
- ('YNKD', 'YNKD') True 66
- ('YVXV', 'YVXV') True 67
- ('ZFBS', 'ZFBS') True 68
- ('ET6X', 'ET6X') True 69
- ('TKVC', 'TKVC') True 70
- ('2UCU', '2UCU') True 71
- ('HNBK', 'HNBK') True 72
- ('X8FD', 'X8FD') True 73
- ('ZGNX', 'ZGNX') True 74
- ('LQCU', 'LQCU') True 75
- ('JNZY', 'JNZVY') False 75
- ('RX34', 'RX34') True 76
- ('811E', '811E') True 77
- ('ETDX', 'ETDX') True 78
- ('4CPR', '4CPR') True 79
- ('FE91', 'FE91') True 80
- ('B7XH', 'B7XH') True 81
- ('1RUA', '1RUA') True 82
- ('UBCX', 'UBCX') True 83
- ('KVT5', 'KVT5') True 84
- ('HZ3A', 'HZ3A') True 85
- ('3XLR', '3XLR') True 86
- ('VC7T', 'VC7T') True 87
- ('7PG1', '7PQ1') False 87
- ('4F21', '4F21') True 88
- ('3HLJ', '3HLJ') True 89
- ('1KT7', '1KT7') True 90
- ('1RHE', '1RHE') True 91
- ('1TTA', '1TTA') True 92
Total images: 100
Correctly recognized: 92
Recognition accuracy: 92.00%.
As shown, the trained CNN model recognizes 92 of the 100 new CAPTCHAs correctly, an accuracy above 90%.
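As a rough sanity check, this figure can be compared with the per-character accuracy. A CAPTCHA here has four characters and counts as correct only if all four are right. The sketch below assumes that character errors are independent and that segmentation always yields exactly four characters; both are simplifying assumptions that the results do not guarantee (one prediction above has five characters). The function name is hypothetical.

```python
def whole_captcha_accuracy(char_accuracy, n_chars=4):
    """Expected whole-CAPTCHA accuracy if character errors are independent."""
    return char_accuracy ** n_chars


# 0.9760 is the val_acc reported in the last line of the training log
estimate = whole_captcha_accuracy(0.9760)
observed = 92 / 100  # result on the 100 new CAPTCHAs

print('estimated: %.4f' % estimate)
print('observed:  %.4f' % observed)
```

Under these assumptions the estimate is roughly 0.91, which is in the same range as the observed 92%.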
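Summary
In the article "CNN 大战验证码" (CNN vs. CAPTCHAs), https://www.jianshu.com/p/0287bcd24f78 , the author built a CNN model with TensorFlow; the code was fairly long and training took more than two hours. Building the model with Keras gives concise code, and early stopping shortens the training time while preserving the model's accuracy, which illustrates the advantages of Keras.
This project is open source. The GitHub repository is: https://github.com/percent4/CNN_4_Verifycode .
Source: https://www.cnblogs.com/jclian91/p/9958164.html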