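The output data is therefore a one-hot encoding of each word. It represents an idealized probability distribution with a value of 0 at every word position except the position of the actual word, which has a value of 1.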
- # create sequences of images, input sequences and output words for an image
- # note: relies on a module-level vocab_size, defined in the complete example
- def create_sequences(tokenizer, max_length, descriptions, photos):
-     X1, X2, y = list(), list(), list()
-     # walk through each image identifier
-     for key, desc_list in descriptions.items():
-         # walk through each description for the image
-         for desc in desc_list:
-             # encode the sequence
-             seq = tokenizer.texts_to_sequences([desc])[0]
-             # split one sequence into multiple X,y pairs
-             for i in range(1, len(seq)):
-                 # split into input and output pair
-                 in_seq, out_seq = seq[:i], seq[i]
-                 # pad input sequence
-                 in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
-                 # encode output sequence
-                 out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
-                 # store
-                 X1.append(photos[key][0])
-                 X2.append(in_seq)
-                 y.append(out_seq)
-     return array(X1), array(X2), array(y)
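To make the splitting concrete, here is a minimal, dependency-free sketch of what happens to a single encoded caption. The helper name `split_caption` and the toy integer sequence are assumptions made for illustration. Left-padding with zeros mirrors the default behaviour of `pad_sequences`, and the hand-written one-hot step stands in for `to_categorical`.

```python
def split_caption(seq, max_length, vocab_size):
    """Split one integer-encoded caption into (padded input, one-hot output) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        in_seq, out_word = seq[:i], seq[i]
        # left-pad the input with zeros up to max_length
        padded = [0] * (max_length - len(in_seq)) + in_seq
        # one-hot encode the target word
        one_hot = [0] * vocab_size
        one_hot[out_word] = 1
        pairs.append((padded, one_hot))
    return pairs


# toy caption: a hypothetical encoding of "startseq dog runs endseq"
pairs = split_caption([1, 4, 3, 2], max_length=5, vocab_size=6)
for padded, one_hot in pairs:
    print(padded, '->', one_hot.index(1))
```

A caption of n tokens produces n - 1 training pairs, which is why the number of samples ends up far larger than the number of photos.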
We also need to calculate the maximum number of words in the longest description. The short helper function max_length() below does this.
- # calculate the length of the description with the most words
- def max_length(descriptions):
-     lines = to_lines(descriptions)
-     return max(len(d.split()) for d in lines)
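We can now load the data for the training and development datasets and transform the loaded data into input-output pairs for fitting a deep learning model.
Defining the Model
We will define a deep learning model based on the "merge-model" described by Marc Tanti, et al. in their 2017 paper.
The authors of the paper provide a simplified schematic of the model, shown below:
We will describe the model in three parts:
The image feature extractor model expects input image features as a vector of 4,096 elements. These are processed by a fully connected (Dense) layer to produce a 256-element representation of the image.
The sequence processor model expects input sequences of a predefined length (34 words). These are fed into an Embedding layer that uses a mask to ignore padded values, followed by an LSTM layer with 256 memory units.
Both input models produce a 256-element vector. In addition, both use regularization in the form of 50% dropout. This is intended to reduce overfitting of the training dataset, because this model configuration learns very fast.
The decoder model merges the vectors from the two input models using an addition operation. The result is fed to a Dense layer of 256 neurons and then to a final output Dense layer, which makes a softmax prediction over the entire output vocabulary for the next word in the sequence.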
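The two operations at either end of the decoder, the element-wise addition merge and the softmax output, can be sketched in plain Python. This is only an illustration under simplifying assumptions: the vectors have 4 elements instead of 256, the Dense layers between the merge and the softmax are skipped, and the function names `add_merge` and `softmax` are chosen here rather than taken from Keras.

```python
import math


def add_merge(image_vec, text_vec):
    """Element-wise sum of two vectors, which must have the same length."""
    if len(image_vec) != len(text_vec):
        raise ValueError('vectors must have the same length to be added')
    return [a + b for a, b in zip(image_vec, text_vec)]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# toy 4-element stand-ins for the two 256-element vectors
merged = add_merge([0.5, 0.0, 1.0, 0.2], [0.1, 0.3, 0.0, 0.8])
# softmax applied directly to the merged vector, purely to show the output form
probs = softmax(merged)
print(merged)
print(sum(probs))
```

Because the merge is an addition, both input models must output vectors of the same size, which is why each ends in 256 elements.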
The define_model() function below defines and returns the model, ready to be fit.
- # define the captioning model
- def define_model(vocab_size, max_length):
-     # feature extractor model
-     inputs1 = Input(shape=(4096,))
-     fe1 = Dropout(0.5)(inputs1)
-     fe2 = Dense(256, activation='relu')(fe1)
-     # sequence model
-     inputs2 = Input(shape=(max_length,))
-     se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
-     se2 = Dropout(0.5)(se1)
-     se3 = LSTM(256)(se2)
-     # decoder model
-     decoder1 = add([fe2, se3])
-     decoder2 = Dense(256, activation='relu')(decoder1)
-     outputs = Dense(vocab_size, activation='softmax')(decoder2)
-     # tie it together [image, seq] [word]
-     model = Model(inputs=[inputs1, inputs2], outputs=outputs)
-     model.compile(loss='categorical_crossentropy', optimizer='adam')
-     # summarize model (summary() prints the table itself)
-     model.summary()
-     plot_model(model, to_file='model.png', show_shapes=True)
-     return model
To get a sense of the structure of the model, and in particular the shapes of the layers, see the summary listed below.
- ____________________________________________________________________________________________________
- Layer (type)                     Output Shape          Param #     Connected to
- ====================================================================================================
- input_2 (InputLayer)             (None, 34)            0
- ____________________________________________________________________________________________________
- input_1 (InputLayer)             (None, 4096)          0
- ____________________________________________________________________________________________________
- embedding_1 (Embedding)          (None, 34, 256)       1940224     input_2[0][0]
- ____________________________________________________________________________________________________
- dropout_1 (Dropout)              (None, 4096)          0           input_1[0][0]
- ____________________________________________________________________________________________________
- dropout_2 (Dropout)              (None, 34, 256)       0           embedding_1[0][0]
- ____________________________________________________________________________________________________
- dense_1 (Dense)                  (None, 256)           1048832     dropout_1[0][0]
- ____________________________________________________________________________________________________
- lstm_1 (LSTM)                    (None, 256)           525312      dropout_2[0][0]
- ____________________________________________________________________________________________________
- add_1 (Add)                      (None, 256)           0           dense_1[0][0]
-                                                                    lstm_1[0][0]
- ____________________________________________________________________________________________________
- dense_2 (Dense)                  (None, 256)           65792       add_1[0][0]
- ____________________________________________________________________________________________________
- dense_3 (Dense)                  (None, 7579)          1947803     dense_2[0][0]
- ====================================================================================================
- Total params: 5,527,963
- Trainable params: 5,527,963
- Non-trainable params: 0
- ____________________________________________________________________________________________________
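The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes, using the standard formulas for Embedding, Dense and LSTM layers. The variable names are chosen here for illustration.

```python
vocab_size = 7579   # size of the output vocabulary
embed_dim = 256     # embedding vector size
feat_dim = 4096     # photo feature vector size
units = 256         # Dense and LSTM layer width

embedding_1 = vocab_size * embed_dim                 # one 256-element vector per word
dense_1 = feat_dim * units + units                   # weights plus biases
lstm_1 = 4 * (units * (embed_dim + units) + units)   # four weight sets: input, forget, cell, output
dense_2 = units * units + units
dense_3 = units * vocab_size + vocab_size

total = embedding_1 + dense_1 + lstm_1 + dense_2 + dense_3
print(embedding_1, dense_1, lstm_1, dense_2, dense_3, total)
```

Most of the parameters sit in the embedding layer and the final output layer, both of which scale with the vocabulary size.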
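We also create a plot to visualize the structure of the network, which helps to make the two input streams clearer.
Plot of the deep learning model for image caption generation.
Fitting the Model
Now that we know how to define the model, we can fit it on the training dataset.
The model learns fast and quickly overfits the training dataset. For this reason, we monitor the skill of the trained model on the holdout development dataset. Whenever the skill of the model on the development dataset improves at the end of an epoch, we save the whole model to file.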
At the end of the run, we can then use the saved model with the best skill on the development dataset as our final model.
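We can do this by defining a ModelCheckpoint in Keras and specifying that it monitor the minimum loss on the validation dataset. The model is then saved to a file whose name includes both the training loss and the validation loss.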
- # define checkpoint callback
- filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
- checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
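The effect of `save_best_only=True` with `mode='min'` can be illustrated with a small simulation. This is a sketch of the decision rule only, not the Keras implementation. The function name `files_saved` and the loss values are hypothetical, and epochs are numbered from 1, as they appear in the saved filenames.

```python
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'


def files_saved(history):
    """Return the filenames that would be written for a run.

    history is a list of (loss, val_loss) tuples, one per epoch.
    """
    best = float('inf')
    saved = []
    for epoch, (loss, val_loss) in enumerate(history, start=1):
        # save only when the monitored value improves on the best so far
        if val_loss < best:
            best = val_loss
            saved.append(filepath.format(epoch=epoch, loss=loss, val_loss=val_loss))
    return saved


# hypothetical loss values, for illustration only
history = [(4.5, 4.0), (3.7, 3.8), (3.4, 3.9), (3.2, 4.1)]
print(files_saved(history))
```

In this made-up run the training loss keeps falling, but nothing is saved after the second epoch because the validation loss stops improving.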
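We can then specify the checkpoint in the call to fit() via the callbacks argument. We must also specify the development dataset in fit() via the validation_data argument.
We will fit the model for only 20 epochs. Given the amount of training data, each epoch may take about 30 minutes on typical hardware.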
- # fit model
- model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
Complete Example
The complete example for fitting the model on the training data is listed below.
- from numpy import array
- from pickle import load
- from keras.preprocessing.text import Tokenizer
- from keras.preprocessing.sequence import pad_sequences
- from keras.utils import to_categorical
- from keras.utils import plot_model
- from keras.models import Model
- from keras.layers import Input
- from keras.layers import Dense
- from keras.layers import LSTM
- from keras.layers import Embedding
- from keras.layers import Dropout
- from keras.layers import add
- from keras.callbacks import ModelCheckpoint
- # load doc into memory
- def load_doc(filename):
-     # open the file as read only; it is closed automatically on exit
-     with open(filename, 'r') as file:
-         # read all text
-         text = file.read()
-     return text
- # load a pre-defined list of photo identifiers
- def load_set(filename):
-     doc = load_doc(filename)
-     dataset = list()
-     # process line by line
-     for line in doc.split('\n'):
-         # skip empty lines
-         if len(line) < 1:
-             continue
-         # get the image identifier
-         identifier = line.split('.')[0]
-         dataset.append(identifier)
-     return set(dataset)
- # load clean descriptions into memory
- def load_clean_descriptions(filename, dataset):
-     # load document
-     doc = load_doc(filename)
-     descriptions = dict()
-     for line in doc.split('\n'):
-         # split line by white space
-         tokens = line.split()
-         # skip empty or malformed lines
-         if len(tokens) < 2:
-             continue
-         # split id from description
-         image_id, image_desc = tokens[0], tokens[1:]
-         # skip images not in the set
-         if image_id in dataset:
-             # create list
-             if image_id not in descriptions:
-                 descriptions[image_id] = list()
-             # wrap description in tokens
-             desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
-             # store
-             descriptions[image_id].append(desc)
-     return descriptions
- # load photo features
- def load_photo_features(filename, dataset):
-     # load all features
-     with open(filename, 'rb') as file:
-         all_features = load(file)
-     # filter features
-     features = {k: all_features[k] for k in dataset}
-     return features
- # convert a dictionary of clean descriptions to a list of descriptions
- def to_lines(descriptions):
-     all_desc = list()
-     for key in descriptions.keys():
-         all_desc.extend(descriptions[key])
-     return all_desc
- # fit a tokenizer given caption descriptions
- def create_tokenizer(descriptions):
-     lines = to_lines(descriptions)
-     tokenizer = Tokenizer()
-     tokenizer.fit_on_texts(lines)
-     return tokenizer
- # calculate the length of the description with the most words
- def max_length(descriptions):
-     lines = to_lines(descriptions)
-     return max(len(d.split()) for d in lines)
- # create sequences of images, input sequences and output words for an image
- # note: relies on the module-level vocab_size defined further down
- def create_sequences(tokenizer, max_length, descriptions, photos):
-     X1, X2, y = list(), list(), list()
-     # walk through each image identifier
-     for key, desc_list in descriptions.items():
-         # walk through each description for the image
-         for desc in desc_list:
-             # encode the sequence
-             seq = tokenizer.texts_to_sequences([desc])[0]
-             # split one sequence into multiple X,y pairs
-             for i in range(1, len(seq)):
-                 # split into input and output pair
-                 in_seq, out_seq = seq[:i], seq[i]
-                 # pad input sequence
-                 in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
-                 # encode output sequence
-                 out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
-                 # store
-                 X1.append(photos[key][0])
-                 X2.append(in_seq)
-                 y.append(out_seq)
-     return array(X1), array(X2), array(y)
- # define the captioning model
- def define_model(vocab_size, max_length):
-     # feature extractor model
-     inputs1 = Input(shape=(4096,))
-     fe1 = Dropout(0.5)(inputs1)
-     fe2 = Dense(256, activation='relu')(fe1)
-     # sequence model
-     inputs2 = Input(shape=(max_length,))
-     se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
-     se2 = Dropout(0.5)(se1)
-     se3 = LSTM(256)(se2)
-     # decoder model
-     decoder1 = add([fe2, se3])
-     decoder2 = Dense(256, activation='relu')(decoder1)
-     outputs = Dense(vocab_size, activation='softmax')(decoder2)
-     # tie it together [image, seq] [word]
-     model = Model(inputs=[inputs1, inputs2], outputs=outputs)
-     model.compile(loss='categorical_crossentropy', optimizer='adam')
-     # summarize model (summary() prints the table itself)
-     model.summary()
-     plot_model(model, to_file='model.png', show_shapes=True)
-     return model
- # train dataset
- # load training dataset (6K)
- filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
- train = load_set(filename)
- print('Dataset: %d' % len(train))
- # descriptions
- train_descriptions = load_clean_descriptions('descriptions.txt', train)
- print('Descriptions: train=%d' % len(train_descriptions))
- # photo features
- train_features = load_photo_features('features.pkl', train)
- print('Photos: train=%d' % len(train_features))
- # prepare tokenizer
- tokenizer = create_tokenizer(train_descriptions)
- vocab_size = len(tokenizer.word_index) + 1
- print('Vocabulary Size: %d' % vocab_size)
- # determine the maximum sequence length
- # (this rebinds the name max_length from the function to its integer result)
- max_length = max_length(train_descriptions)
- print('Description Length: %d' % max_length)
- # prepare sequences
- X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)
- # dev dataset
- # load the development set (named "test" in the variables below)
- filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
- test = load_set(filename)
- print('Dataset: %d' % len(test))
- # descriptions
- test_descriptions = load_clean_descriptions('descriptions.txt', test)
- print('Descriptions: test=%d' % len(test_descriptions))
- # photo features
- test_features = load_photo_features('features.pkl', test)
- print('Photos: test=%d' % len(test_features))
- # prepare sequences
- X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)
- # fit model
- # define the model
- model = define_model(vocab_size, max_length)
- # define checkpoint callback
- filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
- checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
- # fit model
- model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
Running the example first prints a summary of the loaded training and development datasets.
- Dataset: 6,000
- Descriptions: train=6,000
- Photos: train=6,000
- Vocabulary Size: 7,579
- Description Length: 34
- Dataset: 1,000
- Descriptions: test=1,000
- Photos: test=1,000
After that, we get an idea of the total number of training and validation (development) input-output pairs.
- Train on 306,404 samples, validate on 50,903 samples
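These sample counts also give a rough idea of how much memory the training arrays need. The sketch below is a back-of-the-envelope estimate only. It assumes that every array stores 4-byte values (float32 or int32), which is an assumption about the array types rather than something measured, and the helper name `gib` is chosen here for illustration.

```python
train_samples = 306404   # number of training input-output pairs
vocab_size = 7579        # length of each one-hot output vector
max_length = 34          # length of each padded input sequence
feature_dim = 4096       # length of each photo feature vector
bytes_per_value = 4      # assumption: float32 / int32 arrays


def gib(n_values):
    """Convert a number of stored values to gibibytes."""
    return n_values * bytes_per_value / 2 ** 30


x1 = gib(train_samples * feature_dim)   # photo features, repeated per pair
x2 = gib(train_samples * max_length)    # padded word sequences
y = gib(train_samples * vocab_size)     # one-hot encoded next words
print('X1: %.2f GiB, X2: %.2f GiB, y: %.2f GiB, total: %.2f GiB' % (x1, x2, y, x1 + x2 + y))
```

Under these assumptions the training arrays alone come to roughly 13 GiB, most of it in the one-hot outputs and the repeated photo features, so the example needs a machine with plenty of RAM.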
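The model then runs, saving the best model to .h5 files along the way.
In my run, the model with the best validation results was saved at the end of epoch 2, with a loss of 3.245 on the training dataset and a loss of 3.612 on the development dataset.
Your specific results will vary. If you ran the example on AWS, copy the model file back to your current working directory.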
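Because the validation loss is part of each filename, the best checkpoint can be picked out of a directory listing without loading any models. The sketch below assumes the `filepath` pattern used above. The helper name `best_checkpoint` is hypothetical, the first filename is what that pattern yields for epoch 2 with the losses reported above, and the second filename is made up for illustration.

```python
import re


def best_checkpoint(filenames):
    """Return the filename with the lowest val_loss encoded in its name."""
    pattern = re.compile(r'val_loss(\d+\.\d+)\.h5$')
    scored = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            scored.append((float(match.group(1)), name))
    if not scored:
        raise ValueError('no checkpoint files found')
    # tuples compare by val_loss first, so min() picks the lowest loss
    return min(scored)[1]


files = [
    'model-ep001-loss4.500-val_loss3.900.h5',   # hypothetical earlier checkpoint
    'model-ep002-loss3.245-val_loss3.612.h5',   # epoch 2, losses as reported above
]
print(best_checkpoint(files))
```

The selected file could then be passed to `load_model` from `keras.models` to restore the full model for evaluation or caption generation.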
Original article: How to Develop a Deep Learning Caption Generation Model in Keras from Scratch
Translation: 机器之心
Author: Jason Brownlee
Source: https://sdk.cn/news/7850