Getting Started with Keras (3): Building a CNN Model to Crack a Website's CAPTCHA

Project Overview

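In the article "CNN 大战验证码" (CNN vs. CAPTCHA, https://www.jianshu.com/p/0287bcd24f78 ), we used TensorFlow to build a simple CNN model to crack the CAPTCHA of a certain website. The CAPTCHA looks like this: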

In this article, we use Keras to build a slightly more complex CNN model to crack the same CAPTCHA.

Dataset

The preprocessing of the CAPTCHA images is not described in detail here; interested readers can refer to the article "CNN 大战验证码" (https://www.jianshu.com/p/0287bcd24f78 ).

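This project currently has 1668 samples. Each sample is a single-character image of size 16*20 (16 pixels wide, 20 pixels high). The features are the pixels of the character image, where 0 represents white and 1 represents black, so each sample has 320 features taking the value 0 or 1. The feature variables are named v1 through v320, and the class label of a sample is the character itself. Part of the dataset is shown below: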
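
As a minimal sketch of this layout, the snippet below reshapes one 320-value feature row into the 20-row by 16-column, single-channel array that the CNN consumes. The row is synthetic, the helper name row_to_image is an assumption made for illustration, and the sketch assumes v1 to v320 are stored in row-major order, as the reshape in the training code implies.

```python
import numpy as np

def row_to_image(row):
    """Reshape 320 binary pixel features into a (20, 16, 1) array."""
    arr = np.asarray(row, dtype=np.uint8)
    if arr.size != 320:
        raise ValueError("expected 320 features, got %d" % arr.size)
    return arr.reshape(20, 16, 1)

# Synthetic row: only the first 16 features (the top image row) are set to 1.
row = [1] * 16 + [0] * 304
image = row_to_image(row)
print(image.shape)
print(int(image[0].sum()), int(image[1:].sum()))
```

The trailing dimension of size 1 is the channel axis that Conv2D expects for a grayscale image.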

CNN Model

Keras makes it quick and convenient to build a CNN model. The CNN model built in this article is as follows:

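The dataset is split into a training set and a test set with a ratio of 7:3 (test_size=0.3, which gives 1167 training samples and 501 test samples). The training code is as follows: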

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.callbacks import EarlyStopping
from keras.layers import Conv2D, MaxPooling2D
# Load the data
df = pd.read_csv('F://verifycode_data/data.csv')
# Map each character label to an integer class index
vals = range(31)
keys = ['1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','J','K','L','N','P','Q','R','S','T','U','V','X','Y','Z']
label_dict = dict(zip(keys, vals))
x_data = df[['v'+str(i+1) for i in range(320)]]
y_data = pd.DataFrame({'label':df['label']})
y_data['class'] = y_data['label'].apply(lambda x: label_dict[x])
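# Split the data into training and test sets (70% / 30%)
X_train, X_test, Y_train, Y_test = train_test_split(x_data, y_data['class'], test_size=0.3, random_state=42)
# Reshape each 320-feature row into a 20x16 single-channel image;
# -1 lets NumPy infer the sample count (1167 training, 501 test)
x_train = np.array(X_train).reshape((-1, 20, 16, 1))
x_test = np.array(X_test).reshape((-1, 20, 16, 1))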
# One-hot encode the labels
n_classes = 31
y_train = np_utils.to_categorical(Y_train, n_classes)
y_val = np_utils.to_categorical(Y_test, n_classes)
input_shape = x_train[0].shape
# CNN model
model = Sequential()
# Convolution and pooling layers
model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape, padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(32, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
# Dropout layer
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(0.25))
model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(0.25))
model.add(Flatten())
# Fully connected layers
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# plot model
plot_model(model, to_file=r'./model.png', show_shapes=True)
# Train the model
callbacks = [EarlyStopping(monitor='val_acc', patience=5, verbose=1)]
batch_size = 64
n_epochs = 100
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epochs, \
                    verbose=1, validation_data=(x_test, y_val), callbacks=callbacks)
mp = 'F://verifycode_data/verifycode_Keras.h5'
model.save(mp)
# Plot the accuracy curve on the validation set
val_acc = history.history['val_acc']
plt.plot(range(len(val_acc)), val_acc, label='CNN model')
plt.title('Validation accuracy on verifycode dataset')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()
plt.show()
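
To see what the stacked layers do to the 20*16 input, the sketch below traces the feature-map shape through the model above. It relies on two properties of Keras layers: Conv2D with padding='same' and the default stride of 1 keeps height and width unchanged, and MaxPooling2D with pool_size=(2, 2) and padding='same' halves each dimension and rounds up. The helper name trace_shapes is hypothetical.

```python
import math

def trace_shapes(height, width, filters_per_block):
    """Return the (height, width, channels) shape after each conv/pool block."""
    shapes = []
    for filters in filters_per_block:
        # The two 3x3 'same' convolutions leave height and width unchanged.
        # 2x2 max pooling with padding='same' yields ceil(size / 2).
        height = math.ceil(height / 2)
        width = math.ceil(width / 2)
        shapes.append((height, width, filters))
    return shapes

shapes = trace_shapes(20, 16, [32, 64, 128])
for shape in shapes:
    print(shape)

h, w, c = shapes[-1]
print("Flatten output size:", h * w * c)
```

Under these assumptions the three blocks produce 10*8*32, 5*4*64, and 3*2*128 feature maps, so the Flatten layer passes 768 values to the first Dense layer.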

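The code above trains the model with early stopping. EarlyStopping is a Keras callback that halts training before the maximum number of epochs is reached. Specifically, it stops training once the monitored metric no longer improves (that is, the improvement is smaller than some threshold), for example when the loss no longer decreases. In the code above the monitored metric is the validation accuracy val_acc, and patience=5 means training stops after 5 consecutive epochs without improvement.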
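
The patience rule can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation: the function name stopping_epoch and the accuracy values are made up for the example, and the metric is assumed to be one where higher is better, such as validation accuracy.

```python
def stopping_epoch(values, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs pass without the
    monitored value exceeding the best value seen so far by more than
    `min_delta`.
    """
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(values, start=1):
        if value - min_delta > best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies: the best value is first reached at epoch 3.
val_acc = [0.80, 0.90, 0.95, 0.94, 0.95, 0.93, 0.94, 0.95]
print(stopping_epoch(val_acc, patience=5))
```

In this example, matching the best value again does not count as an improvement, so the counter keeps running after epoch 3 and training stops at epoch 8.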

Model Training

Running the training code above produces the following output:

......(earlier output omitted)
Epoch 22/100
  64/1167 [>.............................] - ETA: 3s - loss: 0.0399 - acc: 1.0000
 128/1167 [==>...........................] - ETA: 3s - loss: 0.1195 - acc: 0.9844
 192/1167 [===>..........................] - ETA: 2s - loss: 0.1085 - acc: 0.9792
 256/1167 [=====>........................] - ETA: 2s - loss: 0.1132 - acc: 0.9727
 320/1167 [=======>......................] - ETA: 2s - loss: 0.1045 - acc: 0.9750
 384/1167 [========>.....................] - ETA: 2s - loss: 0.1006 - acc: 0.9740
 448/1167 [==========>...................] - ETA: 2s - loss: 0.1522 - acc: 0.9643
 512/1167 [============>.................] - ETA: 1s - loss: 0.1450 - acc: 0.9648
 576/1167 [=============>................] - ETA: 1s - loss: 0.1368 - acc: 0.9653
 640/1167 [===============>..............] - ETA: 1s - loss: 0.1353 - acc: 0.9641
 704/1167 [=================>............] - ETA: 1s - loss: 0.1280 - acc: 0.9659
 768/1167 [==================>...........] - ETA: 1s - loss: 0.1243 - acc: 0.9674
 832/1167 [====================>.........] - ETA: 0s - loss: 0.1577 - acc: 0.9639
 896/1167 [======================>.......] - ETA: 0s - loss: 0.1488 - acc: 0.9665
 960/1167 [=======================>......] - ETA: 0s - loss: 0.1488 - acc: 0.9656
1024/1167 [=========================>....] - ETA: 0s - loss: 0.1427 - acc: 0.9668
1088/1167 [==========================>...] - ETA: 0s - loss: 0.1435 - acc: 0.9669
1152/1167 [============================>.] - ETA: 0s - loss: 0.1383 - acc: 0.9688
1167/1167 [==============================] - 4s 3ms/step - loss: 0.1380 - acc: 0.9683 - val_loss: 0.0835 - val_acc: 0.9760
Epoch 00022: early stopping

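As the output shows, training stopped after 22 epochs. After the last epoch, the accuracy was 96.83% on the training set and 97.60% on the test set. The accuracy curve on the test set is shown in the figure below: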

Model Prediction

After training, we use the model to predict new CAPTCHAs. The 100 new CAPTCHAs are shown in the figure below:

Using the trained CNN model, we predict these new CAPTCHAs with the following Python code:

# -*- coding: utf-8 -*-
import os
import cv2
import numpy as np
def split_picture(imagepath):
    # Read the image in grayscale mode
    gray = cv2.imread(imagepath, 0)
    # Set the image border to white
    height, width = gray.shape
    for i in range(width):
        gray[0, i] = 255
        gray[height-1, i] = 255
    for j in range(height):
        gray[j, 0] = 255
        gray[j, width-1] = 255
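    # Median filter
    blur = cv2.medianBlur(gray, 3)  # 3*3 kernel
    # Binarization
    ret, thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)
    # Extract the individual characters
    chars_list = []
    # findContours returns 3 values in OpenCV 3.x but 2 in OpenCV 4.x;
    # taking the last two works with both
    contours, hierarchy = cv2.findContours(thresh1, 2, 2)[-2:]
    for cnt in contours:
        # Upright bounding rectangle of the contour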
        x, y, w, h = cv2.boundingRect(cnt)
        if x != 0 and y != 0 and w*h>= 100:
            chars_list.append((x,y,w,h))
    sorted_chars_list = sorted(chars_list, key=lambda x:x[0])
    for i,item in enumerate(sorted_chars_list):
        x, y, w, h = item
        cv2.imwrite('F://test_verifycode/chars/%d.jpg'%(i+1), thresh1[y:y+h, x:x+w])
def remove_edge_picture(imagepath):
    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    corner_list = [image[0,0] <127,
                   image[height-1, 0] < 127,
                   image[0, width-1]<127,
                   image[ height-1, width-1] < 127
                   ]
    if sum(corner_list)>= 3:
        os.remove(imagepath)
def resplit_with_parts(imagepath, parts):
    image = cv2.imread(imagepath, 0)
    os.remove(imagepath)
    height, width = image.shape
    file_name = imagepath.split('/')[-1].split(r'.')[0]
    # Split the image again into `parts` equal-width pieces
    step = width//parts     # step size
    start = 0             # starting position
    for i in range(parts):
        cv2.imwrite('F://test_verifycode/chars/%s.jpg'%(file_name+'-'+str(i)), \
                    image[:, start:start+step])
        start += step
def resplit(imagepath):
    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    if width>= 64:
        resplit_with_parts(imagepath, 4)
    elif width>= 48:
        resplit_with_parts(imagepath, 3)
    elif width>= 26:
        resplit_with_parts(imagepath, 2)
# rename and convert to 16*20 size
def convert(dir, file):
    imagepath = dir+'/'+file
    # Read the image
    image = cv2.imread(imagepath, 0)
    # Binarization
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # Save the image
    cv2.imwrite('%s/%s' % (dir, file), img)
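# Read the image data and convert it to 0-1 values
def Read_Data(dir, file):
    imagepath = dir+'/'+file
    # Read the image
    image = cv2.imread(imagepath, 0)
    # Binarization
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # Map pixels with value 255 to 1 and all other pixels to 0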
    bin_values = [1 if pixel==255 else 0 for pixel in thresh.ravel()]
    return bin_values
def predict(VerifyCodePath):
    dir = 'F://test_verifycode/chars'
    files = os.listdir(dir)
    # Remove any existing files
    if files:
        for file in files:
            os.remove(dir + '/' + file)
    split_picture(VerifyCodePath)
    files = os.listdir(dir)
    if not files:
        print('The folder is empty!')
    else:
        # Remove noise images
        for file in files:
            remove_edge_picture(dir + '/' + file)
        # Re-split images that contain touching characters
        for file in os.listdir(dir):
            resplit(dir + '/' + file)
        # Resize all images to 16*20
        for file in os.listdir(dir):
            convert(dir, file)
        # Feature vectors of the characters in the image
        files = sorted(os.listdir(dir), key=lambda x: x[0])
        table = np.array([Read_Data(dir, file) for file in files]).reshape(-1,20,16,1)
        # Path of the saved model
        mp = 'F://verifycode_data/verifycode_Keras.h5'
        # Load the model
        from keras.models import load_model
        cnn = load_model(mp)
        # Predict with the model
        y_pred = cnn.predict(table)
        predictions = np.argmax(y_pred, axis=1)
        # Label dictionary
        keys = range(31)
        vals = ['1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'N',
                'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z']
        label_dict = dict(zip(keys, vals))
        return ''.join([label_dict[pred] for pred in predictions])
def main():
    dir = 'F://VerifyCode/'
    correct = 0
    for i, file in enumerate(os.listdir(dir)):
        true_label = file.split('.')[0]
        VerifyCodePath = dir+file
        pred = predict(VerifyCodePath)
        if true_label == pred:
            correct += 1
        print(i+1, (true_label, pred), true_label == pred, correct)
    total = len(os.listdir(dir))
    print('\nTotal images: %d\nCorrect: %d\nAccuracy: %.2f%%.'\
          %(total, correct, correct*100/total))
main()
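
The re-splitting rule for touching characters can be isolated into small pure functions. The sketch below mirrors the width thresholds used in resplit and the equal-width slicing in resplit_with_parts; the function names parts_for_width and column_slices are hypothetical, introduced here only for illustration.

```python
def parts_for_width(width):
    """Number of characters a segment is assumed to contain, by pixel width."""
    if width >= 64:
        return 4
    if width >= 48:
        return 3
    if width >= 26:
        return 2
    return 1  # narrow enough to be treated as a single character

def column_slices(width, parts):
    """Column ranges [start, stop) produced by equal-width splitting."""
    step = width // parts
    return [(i * step, (i + 1) * step) for i in range(parts)]

for width in (20, 30, 50, 70):
    parts = parts_for_width(width)
    print(width, parts, column_slices(width, parts))
```

Because the step is width // parts, any remainder columns on the right edge are dropped, for example the last two columns of a 50-pixel-wide segment that is split into three parts.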

The prediction results of the CNN model are as follows:

Using TensorFlow backend.
-10-25 15:13:50.390130: I C:\tf_jenkins\workspace\rel-win\M\Windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
 ('ZK6N', 'ZK6N') True 1
 ('4JPX', '4JPX') True 2
 ('5GP5', '5GP5') True 3
 ('5RQ8', '5RQ8') True 4
 ('5TQP', '5TQP') True 5
 ('7S62', '7S62') True 6
 ('8R2Z', '8R2Z') True 7
 ('8RFV', '8RFV') True 8
 ('9BBT', '9BBT') True 9
 ('9LNE', '9LNE') True 10
 ('67UH', '67UH') True 11
 ('74UK', '74UK') True 12
 ('A5T2', 'A5T2') True 13
 ('AHYV', 'AHYV') True 14
 ('ASEY', 'ASEY') True 15
 ('B371', 'B371') True 16
 ('CCQL', 'CCQL') True 17
 ('CFD5', 'GFD5') False 17
 ('CJLJ', 'CJLJ') True 18
 ('D4QV', 'D4QV') True 19
 ('DFQ8', 'DFQ8') True 20
 ('DP18', 'DP18') True 21
 ('E3HC', 'E3HC') True 22
 ('E8VB', 'E8VB') True 23
 ('DE1U', 'DE1U') True 24
 ('FK1R', 'FK1R') True 25
 ('FK91', 'FK91') True 26
 ('FSKP', 'FSKP') True 27
 ('FVZP', 'FVZP') True 28
 ('GC6H', 'GC6H') True 29
 ('GH62', 'GH62') True 30
 ('H9FQ', 'H9FQ') True 31
 ('H67Q', 'H67Q') True 32
 ('HEKC', 'HEKC') True 33
 ('HV2B', 'HV2B') True 34
 ('J65Z', 'J65Z') True 35
 ('JZCX', 'JZCX') True 36
 ('KH5D', 'KH5D') True 37
 ('KXD2', 'KXD2') True 38
 ('1GDH', '1GDH') True 39
 ('LCL3', 'LCL3') True 40
 ('LNZR', 'LNZR') True 41
 ('LZU5', 'LZU5') True 42
 ('N5AK', 'N5AK') True 43
 ('N5Q3', 'N5Q3') True 44
 ('N96Z', 'N96Z') True 45
 ('NCDG', 'NCDG') True 46
 ('NELS', 'NELS') True 47
 ('P96U', 'P96U') True 48
 ('PD42', 'PD42') True 49
 ('PECG', 'PEQG') False 49
 ('PPZF', 'PPZF') True 50
 ('PUUL', 'PUUL') True 51
 ('Q2DN', 'D2DN') False 51
 ('QCQ9', 'QCQ9') True 52
 ('QDB1', 'QDBJ') False 52
 ('QZUD', 'QZUD') True 53
 ('R3T5', 'R3T5') True 54
 ('S1YT', 'S1YT') True 55
 ('SP7L', 'SP7L') True 56
 ('SR2K', 'SR2K') True 57
 ('SUP5', 'SVP5') False 57
 ('T2SP', 'T2SP') True 58
 ('U6V9', 'U6V9') True 59
 ('UC9P', 'UC9P') True 60
 ('UFYD', 'UFYD') True 61
 ('V9NJ', 'V9NH') False 61
 ('V35X', 'V35X') True 62
 ('V98F', 'V98F') True 63
 ('VD28', 'VD28') True 64
 ('YGHE', 'YGHE') True 65
 ('YNKD', 'YNKD') True 66
 ('YVXV', 'YVXV') True 67
 ('ZFBS', 'ZFBS') True 68
 ('ET6X', 'ET6X') True 69
 ('TKVC', 'TKVC') True 70
 ('2UCU', '2UCU') True 71
 ('HNBK', 'HNBK') True 72
 ('X8FD', 'X8FD') True 73
 ('ZGNX', 'ZGNX') True 74
 ('LQCU', 'LQCU') True 75
 ('JNZY', 'JNZVY') False 75
 ('RX34', 'RX34') True 76
 ('811E', '811E') True 77
 ('ETDX', 'ETDX') True 78
 ('4CPR', '4CPR') True 79
 ('FE91', 'FE91') True 80
 ('B7XH', 'B7XH') True 81
 ('1RUA', '1RUA') True 82
 ('UBCX', 'UBCX') True 83
 ('KVT5', 'KVT5') True 84
 ('HZ3A', 'HZ3A') True 85
 ('3XLR', '3XLR') True 86
 ('VC7T', 'VC7T') True 87
 ('7PG1', '7PQ1') False 87
 ('4F21', '4F21') True 88
 ('3HLJ', '3HLJ') True 89
 ('1KT7', '1KT7') True 90
 ('1RHE', '1RHE') True 91
 ('1TTA', '1TTA') True 92

Total images: 100
Correct: 92
Accuracy: 92.00%.

As shown, the trained CNN model recognizes 92 of the 100 new CAPTCHAs correctly, an accuracy above 90%.
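
As a rough sanity check, the per-character accuracy can be related to the whole-CAPTCHA accuracy. The sketch below assumes that each CAPTCHA has four characters, that segmentation is always correct, and that character errors are independent. None of these assumptions is verified here, so the result is only an approximation, and the function names are hypothetical.

```python
def captcha_accuracy(char_accuracy, n_chars=4):
    """Probability that all n_chars characters are right, assuming independence."""
    return char_accuracy ** n_chars

def implied_char_accuracy(captcha_acc, n_chars=4):
    """Per-character accuracy implied by a whole-CAPTCHA accuracy."""
    return captcha_acc ** (1.0 / n_chars)

# 0.9760 is the val_acc reported after the last training epoch;
# 0.92 is the measured accuracy on the 100 new CAPTCHAs.
print("expected CAPTCHA accuracy: %.4f" % captcha_accuracy(0.9760))
print("implied character accuracy: %.4f" % implied_char_accuracy(0.92))
```

Under these assumptions, a character accuracy of 97.60% predicts a CAPTCHA accuracy of roughly 91%, which is in line with the measured 92%.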

Summary

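In the article "CNN 大战验证码" (https://www.jianshu.com/p/0287bcd24f78 ), the author built a CNN model with TensorFlow; the code was long and training took more than two hours. Building the model with Keras yields concise code, and early stopping shortens the training time while preserving the model's accuracy, which shows the advantages of Keras.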

This project is open source. GitHub repository: https://github.com/percent4/CNN_4_Verifycode .

Source: https://www.cnblogs.com/jclian91/p/9958164.html
