Abstract: How to use high-level frameworks such as TFLearn and Keras to learn to automatically generate Shakespearean plays or Nietzschean philosophical prose
High-Level Frameworks: TFLearn and Keras
In the previous section we learned about TensorFlow's high-level API wrappers, which can produce a DNN classifier for the MNIST handwritten-digit problem in just a few simple steps.
TensorFlow keeps pushing its Estimator API forward, but that is not the whole toolbox. Beyond TensorFlow's official APIs, we have other powerful tools, such as TFLearn and Keras.
In this section we will put the armory on display and see what powerful functionality has been wrapped up for us by TFLearn, a high-level framework built specifically for TensorFlow, and by Keras, which runs across several backends including TensorFlow and CNTK.
Having a Machine Write Shakespeare's Plays
Earlier we gave a brief introduction to RNNs, which are powerful for handling sequence data. An important advantage of RNNs over other networks is that, after learning from sequence data, they can generate new sequences on their own.
For example, an RNN that studies the Three Hundred Tang Poems can write poetry, and one that studies the Linux kernel source can write C code (even if most of it won't compile).
Let's start with an example that writes Shakespearean drama automatically.
Before we look at the code, a few words of caution. Deep learning is fairly demanding about data volume: for a generation task like this, you generally need on the order of millions to tens of millions of training samples to get good results. Training on just a few novels will only produce incoherent output; after all, not even a human can learn to write poetry from a handful of poems.
The other point is that as the training data grows, the demands on time and compute climb steeply.
For example, training on Shakespeare's plays: the dataset is not especially large, about 160,000 lines, yet training on a CPU is not something you finish in an hour or two; it takes on the order of days.
For the image and video examples later on, CPU training measured in months would not be surprising.
So how complex is the code for this roughly day-long training run? The core is barely a dozen lines, and with the data handling and test code included it comes to only about 50 lines.
from __future__ import absolute_import, division, print_function
import os
import pickle
from six.moves import urllib
import tflearn
from tflearn.data_utils import *

path = "shakespeare_input.txt"
char_idx_file = 'char_idx.pickle'

# Download the corpus if it is not already on disk.
if not os.path.isfile(path):
    urllib.request.urlretrieve("https://raw.githubusercontent.com/tflearn/tflearn.github.io/master/resources/shakespeare_input.txt", path)

maxlen = 25

# Reuse the character-to-index mapping from a previous run if available.
char_idx = None
if os.path.isfile(char_idx_file):
    print('Loading previous char_idx')
    char_idx = pickle.load(open(char_idx_file, 'rb'))

# Cut the text into semi-redundant, one-hot-encoded sequences of maxlen
# characters, stepping 3 characters at a time.
X, Y, char_idx = \
    textfile_to_semi_redundant_sequences(path, seq_maxlen=maxlen, redun_step=3,
                                         pre_defined_char_idx=char_idx)

pickle.dump(char_idx, open(char_idx_file, 'wb'))

# Three stacked LSTM layers, each followed by dropout, then a softmax
# over the character vocabulary.
g = tflearn.input_data([None, maxlen, len(char_idx)])
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512)
g = tflearn.dropout(g, 0.5)
g = tflearn.fully_connected(g, len(char_idx), activation='softmax')
g = tflearn.regression(g, optimizer='adam', loss='categorical_crossentropy',
                       learning_rate=0.001)

m = tflearn.SequenceGenerator(g, dictionary=char_idx,
                              seq_maxlen=maxlen,
                              clip_gradients=5.0,
                              checkpoint_path='model_shakespeare')

# Train for 50 epochs; after each epoch, sample some text at two temperatures.
for i in range(50):
    seed = random_sequence_from_textfile(path, maxlen)
    m.fit(X, Y, validation_set=0.1, batch_size=128,
          n_epoch=1, run_id='shakespeare')
    print("-- TESTING...")
    print("-- Test with temperature of 1.0 --")
    print(m.generate(600, temperature=1.0, seq_seed=seed))
    print("-- Test with temperature of 0.5 --")
    print(m.generate(600, temperature=0.5, seq_seed=seed))
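Since a full run takes about a day, it is worth knowing that the model can be persisted. A hedged sketch (the method names follow TFLearn's general model API, which SequenceGenerator shares; treat the file name as hypothetical):

# Save the trained weights to disk.
m.save('model_shakespeare_final.tfl')

# Later, after rebuilding the same graph `g` and SequenceGenerator `m`,
# restore the weights instead of retraining for another day.
m.load('model_shakespeare_final.tfl')
print(m.generate(600, temperature=1.0, seq_seed=seed))

TFLearn also writes periodic checkpoints to the checkpoint_path we passed in ('model_shakespeare'), so an interrupted run need not start over from zero.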
The example above requires the TFLearn framework, which can be installed with
pip install tflearn
TFLearn is a high-level API framework developed specifically for TensorFlow.
The main benefit of the TFLearn API is readability. Take the core code we just saw:
g = tflearn.input_data([None, maxlen, len(char_idx)])
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512)
g = tflearn.dropout(g, 0.5)
g = tflearn.fully_connected(g, len(char_idx), activation='softmax')
g = tflearn.regression(g, optimizer='adam', loss='categorical_crossentropy',
                       learning_rate=0.001)
m = tflearn.SequenceGenerator(g, dictionary=char_idx,
                              seq_maxlen=maxlen,
                              clip_gradients=5.0,
                              checkpoint_path='model_shakespeare')
It reads from top to bottom: the input data, three LSTM layers each followed by dropout, and finally a fully connected softmax layer.
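For contrast, here is a rough, minimal sketch of what just one of those lines, tflearn.lstm(g, 512), plus the softmax head would look like against the low-level TensorFlow 1.x API (num_chars stands in for len(char_idx) and is assumed to be defined; this is an illustration, not a drop-in equivalent):

import tensorflow as tf

# One-hot character input: [batch, time, vocabulary].
x = tf.placeholder(tf.float32, [None, maxlen, num_chars])

# A single 512-unit LSTM layer, unrolled over time.
cell = tf.nn.rnn_cell.BasicLSTMCell(512)
outputs, state = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)

# Softmax over the vocabulary, using only the last time step.
logits = tf.layers.dense(outputs[:, -1, :], num_chars)
probs = tf.nn.softmax(logits)

And that is before defining the loss, the optimizer, gradient clipping, checkpointing, and the training loop, all of which the TFLearn version expresses in the regression and SequenceGenerator calls.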
Now let's look at the structure of a network that predicts survival probability on the Titanic:
# Build neural network
net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)

# Define model
model = tflearn.DNN(net)

# Start training (apply gradient descent algorithm)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
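The snippet assumes data (six features per passenger) and labels already exist. As a hedged sketch of how TFLearn's quickstart obtains them (treat the details as illustrative):

from tflearn.data_utils import load_csv
from tflearn.datasets import titanic

# Download the Titanic passenger CSV that ships with TFLearn's tutorials.
titanic.download_dataset('titanic_dataset.csv')

# Column 0 is the 'survived' label; encode it as a 2-class one-hot target.
data, labels = load_csv('titanic_dataset.csv', target_column=0,
                        categorical_labels=True, n_classes=2)

After that, non-numeric fields such as name and sex still have to be dropped or converted to numbers before the six-feature input layer above will accept the rows.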
Starting with Generating City Names
Your Shakespeare model is probably still training at this point, so while we wait, let's look at the generation process through a simpler example.
Again we take an official TFLearn example: reading a list of major US city names and generating some new ones.
Take the cities starting with Z as a sample:
- Zachary
- Zafra
- Zag
- Zahl
- Zaleski
- Zalma
- Zama
- Zanesfield
- Zanesville
- Zap
- Zapata
- Zarah
- Zavalla
- Zearing
- Zebina
- Zebulon
- Zeeland
- Zeigler
- Zela
- Zelienople
- Zell
- Zellwood
- Zemple
- Zena
- Zenda
- Zenith
- Zephyr
- Zephyr Cove
- Zephyrhills
- Zia Pueblo
- Zillah
- Zilwaukee
- Zim
- Zimmerman
- Zinc
- Zion
- Zionsville
- Zita
- Zoar
- Zolfo Springs
- Zona
- Zumbro Falls
- Zumbrota
- Zuni
- Zurich
- Zwingle
- Zwolle
There are 20,580 cities in total. This one trains much faster: on a pure CPU, one epoch takes about 5 to 6 minutes.
The code is below, and it follows the same pattern as the Shakespeare example above:
from __future__ import absolute_import, division, print_function
import os
from six import moves
import ssl
import tflearn
from tflearn.data_utils import *

path = "US_Cities.txt"

# Download the city-name list if needed. Note: urlretrieve() does not accept
# a context argument, so disable certificate verification globally instead.
if not os.path.isfile(path):
    ssl._create_default_https_context = ssl._create_unverified_context
    moves.urllib.request.urlretrieve("https://raw.githubusercontent.com/tflearn/tflearn.github.io/master/resources/US_Cities.txt", path)

maxlen = 20

# Read the corpus as text (Python 3: open with an explicit encoding).
string_utf8 = open(path, "r", encoding="utf-8").read()

X, Y, char_idx = \
    string_to_semi_redundant_sequences(string_utf8, seq_maxlen=maxlen, redun_step=3)

# A smaller corpus, so two LSTM layers instead of three.
g = tflearn.input_data(shape=[None, maxlen, len(char_idx)])
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512)
g = tflearn.dropout(g, 0.5)
g = tflearn.fully_connected(g, len(char_idx), activation='softmax')
g = tflearn.regression(g, optimizer='adam', loss='categorical_crossentropy',
                       learning_rate=0.001)

m = tflearn.SequenceGenerator(g, dictionary=char_idx,
                              seq_maxlen=maxlen,
                              clip_gradients=5.0,
                              checkpoint_path='model_us_cities')

# Train for 40 epochs; after each, sample 30 characters at three temperatures.
for i in range(40):
    seed = random_sequence_from_string(string_utf8, maxlen)
    m.fit(X, Y, validation_set=0.1, batch_size=128,
          n_epoch=1, run_id='us_cities')
    print("-- TESTING...")
    print("-- Test with temperature of 1.2 --")
    print(m.generate(30, temperature=1.2, seq_seed=seed))
    print("-- Test with temperature of 1.0 --")
    print(m.generate(30, temperature=1.0, seq_seed=seed))
    print("-- Test with temperature of 0.5 --")
    print(m.generate(30, temperature=0.5, seq_seed=seed))
Here are the city names generated after the first training epoch:
- t and Shoot
- Cuthbertd
- Lettfrecv
- El
- Ceoneel Sutd
- Sa
After the second epoch:
- stle
- Finchford
- Finch Dasthond
- madloogd
- Wlaycoyarfw
After the third epoch:
- averal
- Cape Carteret
- Acbiropa Heowar Sor Dittoy
- Do
After the tenth epoch:
- hoenchen
- Schofield
- Stcojos
- Schabell
- StcaKnerum Cri
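Both examples sample at several temperatures. Temperature rescales the model's predicted character distribution before sampling: low temperatures sharpen it toward the most likely characters (safer, more repetitive names), while high temperatures flatten it (more adventurous output, more gibberish). A minimal numpy sketch of the idea, mirroring what such generators do internally:

import numpy as np

def sample_with_temperature(preds, temperature=1.0):
    # Rescale log-probabilities by the temperature, then re-normalize.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one character index from the rescaled distribution.
    return np.argmax(np.random.multinomial(1, preds, 1))

# Toy distribution over 3 characters:
p = [0.7, 0.2, 0.1]
# temperature 0.5 makes index 0 even more dominant;
# temperature 1.2 gives indices 1 and 2 a better chance.

This is exactly the helper the Keras example below defines as sample().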
A Cross-Backend High-Level API: Keras, Generating Nietzsche's Prose
Keras is an API that runs on top of multiple backends, including TensorFlow and Microsoft's CNTK.
Keras can be installed with
pip install keras
With TensorFlow already installed, Keras will select TensorFlow as its backend.
Let's look at a text-generation example in Keras as well. The official example generates Nietzsche-style sentences.
The core is just six statements:
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
Below is the complete code; give it a run. If Nietzsche isn't your thing, you can swap in another text (see the short sketch after the listing). But note, as the comment in the script says: change the corpus freely, but keep it above 100k characters, and ideally above 1M.
'''Example script to generate text from Nietzsche's writings.
At least 20 epochs are required before the generated text
starts sounding coherent.
It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.
If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io

path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])
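To train on your own corpus instead of Nietzsche, only the loading step needs to change. A minimal sketch (the filename is hypothetical):

import io

# Replace the get_file(...) download with a local file of your own.
path = 'my_corpus.txt'  # should be at least ~100k characters, ideally ~1M
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()

Everything downstream (the char/index tables, the vectorization, and the model's input shape) adapts automatically, since it is all derived from text.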
Source: http://www.jianshu.com/p/485c4dc7f54b