从头开始在 Python 中开发深度学习字幕生成模型（下）

因此，输出数据是每个单词的 one-hot 编码，它表示一种理想化的概率分布，即除了实际词位置之外所有词位置的值都为 0，实际词位置的值为 1。

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos):
	X1, X2, y = list(), list(), list()
	# walk through each image identifier
	for key, desc_list in descriptions.items():
		# walk through each description for the image
		for desc in desc_list:
			# encode the sequence
			seq = tokenizer.texts_to_sequences([desc])[0]
			# split one sequence into multiple X,y pairs
			for i in range(1, len(seq)):
				# split into input and output pair
				in_seq, out_seq = seq[:i], seq[i]
				# pad input sequence
				in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
				# encode output sequence
				out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
				# store
				X1.append(photos[key][0])
				X2.append(in_seq)
				y.append(out_seq)
	return array(X1), array(X2), array(y)

我们需要计算最长描述中单词的最大数量。下面是一个有帮助的函数 max_length()。

# calculate the length of the description with the most words
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)

现在我们可以为训练和开发数据集加载数据，并将加载数据转换成输入 - 输出对来拟合深度学习模型。

定义模型

我们将根据 Marc Tanti, et al. 在 2017 年论文中描述的「merge-model」定义深度学习模型。

Where to put the Image in an Image Caption Generator，2017
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?，2017

论文作者提供了该模型的简图，如下所示：

我们将从三部分描述该模型：

图像特征提取器：这是一个在 ImageNet 数据集上预训练的 16 层 VGG 模型。我们已经使用 VGG 模型（没有输出层）对图像进行预处理，并将使用该模型预测的提取特征作为输入。
序列处理器：合适一个词嵌入层，用于处理文本输入，后面是长短期记忆（LSTM）循环神经网络层。
解码器：特征提取器和序列处理器输出一个固定长度向量。这些向量由密集层（Dense layer）融合和处理，来进行最终预测。

图像特征提取器模型的输入图像特征是维度为 4096 的向量，这些向量经过全连接层处理并生成图像的 256 元素表征。

序列处理器模型期望馈送至嵌入层的预定义长度（34 个单词）输入序列使用掩码来忽略 padded 值。之后是具备 256 个循环单元的 LSTM 层。

两个输入模型均输出 256 元素的向量。此外，输入模型以 50% 的 dropout 率使用正则化，旨在减少训练数据集的过拟合情况，因为该模型配置学习非常快。

解码器模型使用额外的操作融合来自两个输入模型的向量。然后将其馈送至 256 个神经元的密集层，然后输送至最终输出密集层，从而在所有输出词汇上对序列中的下一个单词进行 softmax 预测。

下面的 define_model() 函数定义和返回要拟合的模型。

# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor model
	inputs1 = Input(shape=(4096,))
	fe1 = Dropout(0.5)(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	# sequence model
	inputs2 = Input(shape=(max_length,))
	se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
	se2 = Dropout(0.5)(se1)
	se3 = LSTM(256)(se2)
	# decoder model
	decoder1 = add([fe2, se3])
	decoder2 = Dense(256, activation='relu')(decoder1)
	outputs = Dense(vocab_size, activation='softmax')(decoder2)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam')
	# summarize model
	print(model.summary())
	plot_model(model, to_file='model.png', show_shapes=True)
	return model

要了解模型结构，特别是层的形状，请参考下表中的总结。

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_2 (InputLayer)             (None, 34)            0
____________________________________________________________________________________________________
input_1 (InputLayer)             (None, 4096)          0
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 34, 256)       1940224     input_2[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 4096)          0           input_1[0][0]
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 34, 256)       0           embedding_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 256)           1048832     dropout_1[0][0]
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 256)           525312      dropout_2[0][0]
____________________________________________________________________________________________________
add_1 (Add)                      (None, 256)           0           dense_1[0][0]
                                                                   lstm_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 256)           65792       add_1[0][0]
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 7579)          1947803     dense_2[0][0]
====================================================================================================
Total params: 5,527,963
Trainable params: 5,527,963
Non-trainable params: 0
____________________________________________________________________________________________________

我们还创建了一幅图来可视化网络结构，帮助理解两个输入流。

图像字幕生成深度学习模型示意图。

拟合模型

现在我们已经了解如何定义模型了，那么接下来我们要在训练数据集上拟合模型。

该模型学习速度快，很快就会对训练数据集产生过拟合。因此，我们需要在留出的开发数据集上监控训练模型的泛化情况。如果模型在开发数据集上的技能在每个 epoch 结束时有所提升，则我们将整个模型保存至文件。

在运行结束时，我们能够使用训练数据集上具备最优技能的模型作为最终模型。

通过在 Keras 中定义 ModelCheckpoint，使之监控验证数据集上的最小损失，我们可以实现以上目的。然后将该模型保存至文件名中包含训练损失和验证损失的文件中。

# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

之后，通过 fit() 中的 callbacks 参数指定检查点。我们还需要 fit() 中的 validation_data 参数指定开发数据集。

我们仅拟合模型 20 epoch，给出一定量的训练数据，在一般硬件上每个 epoch 可能需要 30 分钟。

# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))

完成示例

在训练数据上拟合模型的完整示例如下：

from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint
 
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text
 
# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)
 
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions
 
# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features
 
# covert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc
 
# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer
 
# calculate the length of the description with the most words
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)
 
# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos):
	X1, X2, y = list(), list(), list()
	# walk through each image identifier
	for key, desc_list in descriptions.items():
		# walk through each description for the image
		for desc in desc_list:
			# encode the sequence
			seq = tokenizer.texts_to_sequences([desc])[0]
			# split one sequence into multiple X,y pairs
			for i in range(1, len(seq)):
				# split into input and output pair
				in_seq, out_seq = seq[:i], seq[i]
				# pad input sequence
				in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
				# encode output sequence
				out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
				# store
				X1.append(photos[key][0])
				X2.append(in_seq)
				y.append(out_seq)
	return array(X1), array(X2), array(y)
 
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor model
	inputs1 = Input(shape=(4096,))
	fe1 = Dropout(0.5)(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	# sequence model
	inputs2 = Input(shape=(max_length,))
	se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
	se2 = Dropout(0.5)(se1)
	se3 = LSTM(256)(se2)
	# decoder model
	decoder1 = add([fe2, se3])
	decoder2 = Dense(256, activation='relu')(decoder1)
	outputs = Dense(vocab_size, activation='softmax')(decoder2)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam')
	# summarize model
	print(model.summary())
	plot_model(model, to_file='model.png', show_shapes=True)
	return model
 
# train dataset
 
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)
 
# dev dataset
 
# load test set
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)
 
# fit model
 
# define the model
model = define_model(vocab_size, max_length)
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))

运行该示例首先打印加载训练和开发数据集的摘要。

Dataset: 6,
000 Descriptions: train = 6,
000 Photos: train = 6,
000 Vocabulary Size: 7,
579 Description Length: 34 Dataset: 1,
000 Descriptions: test = 1,
000 Photos: test = 1,
000

之后，我们可以了解训练和验证（开发）输入 - 输出对的整体数量。

Train on 
306
,
404
 samples, validate on 
50
,
903
 samples

然后运行模型，将最优模型保存至. h5 文件。

在运行过程中，我把最优验证结果的模型保存至文件中：

model-ep002-loss3.245-val_loss3.612.h5

该模型在第 2 个 epoch 中结束时被保存，在训练数据集上的损失为 3.245，在开发数据集上的损失为 3.612，每个人的具体结果不同。如果你在 AWS 中运行上述示例，那么将模型文件复制回你当前的工作文件夹。

原    文： How to Develop a Deep Learning Caption Generation Model in Keras from Scratch
译    文：机器之心
作    者：Jason Brownlee

来源: https://sdk.cn/news/7850

与本文相关文章

暂无,快来抢沙发吧！