10.13 Update: A new state-of-the-art pretrained model has just come out; see:
李入魔, "[NLP] Google BERT 详解" (zhuanlan.zhihu.com)
1. Introduction
For a long time, word vectors have been the dominant representation technique in NLP tasks. With a series of breakthroughs in late 2017 and early 2018, research has confirmed that pretrained language representations, after fine-tuning, can achieve better performance on many NLP tasks. There are currently two approaches to pretraining:
Feature-based: the pretrained representations are used as features for the downstream task; word vectors, sentence vectors, paragraph vectors, and document vectors all follow this pattern. The newer ELMo also belongs to this category, except that the input representations have to be recomputed after transfer.
Fine-tuning: borrowed mainly from CV: add a few task-specific layers on top of the pretrained model and fine-tune the last layers. The newer ULMFiT and OpenAI GPT belong to this category.
This article gives a brief introduction to three pretrained language models: ELMo, ULMFiT, and OpenAI GPT.
2. ELMo
2.1 Model principle and architecture
ELMo is an embedding extracted from a bidirectional language model (biLM). Training uses a biLSTM; given N tokens $(t_1, t_2, \ldots, t_N)$, the objective is to maximize:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$

For each token $t_k$, an L-layer biLM computes 2L+1 representations:

$$R_k = \{\, x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \,\} = \{\, h_{k,j}^{LM} \mid j = 0, \ldots, L \,\}$$

where $x_k^{LM} = h_{k,0}^{LM}$ is the direct, context-independent encoding of the token (here, characters encoded by a CNN),
and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM};\ \overleftarrow{h}_{k,j}^{LM}]$ is the output of the j-th biLSTM layer.
For downstream use, all layer outputs $R_k$ are collapsed into a single vector $ELMo_k = E(R_k; \Theta_e)$. The simplest way is to take the top layer as the token representation, $E(R_k) = h_{k,L}^{LM}$; the more general approach combines all layers through learned parameters:

$$ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}$$

where $s_j^{task}$ are softmax-normalized weights and $\gamma^{task}$ is a task-specific scale parameter that matters during optimization; because each biLM layer has a different output distribution, it also acts as a kind of normalisation across layers.
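To make the layer-mixing formula concrete, here is a minimal NumPy sketch; the 2-layer / 1024-dimension setup mirrors the pretrained biLM described below, and all values are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

L, dim = 2, 1024                    # 2 biLSTM layers -> 2L+1 = 3 representations per token
h = np.random.randn(L + 1, dim)     # h[0]: char-CNN token encoding, h[1:]: biLSTM layer outputs
s_task = softmax(np.zeros(L + 1))   # learned scalar weights s_j^task (softmax-normalized)
gamma_task = 1.0                    # learned task-specific scale gamma^task

elmo_k = gamma_task * (s_task[:, None] * h).sum(axis=0)   # ELMo_k^task, shape (1024,)
print(elmo_k.shape)
```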
The pretrained biLM used in the paper is a modified version of the CNN-BIG-LSTM in Jozefowicz et al. (https://arxiv.org/abs/1602.02410). The final model is a 2-layer biLSTM (4096 units, 512-dimension projections) with a residual connection added between the first and second layers. Tokens are encoded at the character level, context-independently, by a CNN followed by two Highway layers. As a result, the model outputs three layers of vector representations for every token.
2.2 Notes on model training
- Regularization:
1. Dropout
2. Add a weight penalty term $\lambda \lVert w \rVert_2^2$ to the loss (experiments show ELMo works best with a relatively small $\lambda$)
- Notes on the TensorFlow source code (a minimal loading sketch follows this list):
1. The model architecture lives mainly in the LanguageModel class of the training module and is built in two steps: first the word/character Embedding layer (CNN + Highway), then the BiLSTM layers.
2. Loading a pretrained model goes through the BidirectionalLanguageModel class in the model module.
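As a rough illustration of how these classes are typically used for inference with the allenai/bilm-tf package (the file paths are placeholders; check the repository README for the exact API):

```python
import tensorflow as tf
from bilm import Batcher, BidirectionalLanguageModel, weight_layers

# Placeholder paths to the pretrained model files
options_file = 'elmo_options.json'
weight_file = 'elmo_weights.hdf5'
vocab_file = 'vocab.txt'

batcher = Batcher(vocab_file, 50)   # converts tokenized sentences to character ids (max 50 chars per token)
character_ids = tf.placeholder('int32', shape=(None, None, 50))

bilm = BidirectionalLanguageModel(options_file, weight_file)
embeddings_op = bilm(character_ids)                    # ops producing all biLM layer outputs
elmo_input = weight_layers('input', embeddings_op, l2_coef=0.0)['weighted_op']  # task-weighted ELMo vectors
```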
2.3 Using the model
Concatenate the ELMo vector $ELMo_k^{task}$ with the conventional word vector $x_k$ into $[x_k; ELMo_k^{task}]$ and feed the result into the RNN of the downstream task.
Alternatively, use ELMo on the output side: concatenate it with the task RNN's output $h_k$ into $[h_k; ELMo_k^{task}]$. (Both options are sketched below.)
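Both options amount to a simple concatenation; a toy sketch with purely illustrative dimensions:

```python
import numpy as np

x_k = np.random.randn(300)       # conventional word vector for token k (e.g. a 300-d GloVe vector)
elmo_k = np.random.randn(1024)   # ELMo vector for the same token
h_k = np.random.randn(512)       # task RNN hidden state for token k

enhanced_input = np.concatenate([x_k, elmo_k])    # option 1: [x_k; ELMo_k] fed into the task RNN
enhanced_output = np.concatenate([h_k, elmo_k])   # option 2: [h_k; ELMo_k] on the output side
```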
Keras code example: https://github.com/strongio/keras-elmo
```python
import tensorflow as tf
import tensorflow_hub as hub  # the pretrained ELMo module is loaded from TF Hub
from keras import backend as K
import keras.layers as layers
from keras.models import Model

# Initialize session
sess = tf.Session()
K.set_session(sess)

# Instantiate the elmo model
elmo_model = hub.Module("https://tfhub.dev/google/elmo/1", trainable=True)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

# We create a function to integrate the tensorflow model with a Keras model
# This requires explicitly casting the tensor to a string, because of a Keras quirk
def ElmoEmbedding(x):
    return elmo_model(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]

input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(ElmoEmbedding, output_shape=(1024,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```
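A hypothetical way to fit the model above on toy data (the sentences and labels here are made up; in the original tutorial the text comes from a sentiment dataset reshaped to one string per row):

```python
import numpy as np

# Made-up toy data: one raw sentence per sample, binary sentiment labels
train_text = np.array(['a great movie', 'terrible and boring'], dtype=object).reshape(-1, 1)
train_labels = np.array([1, 0])

model.fit(train_text, train_labels, epochs=1, batch_size=2)
```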
2.4 Pros and cons
Pros:
Strong results: ELMo improves over traditional models on most tasks, and experiments confirm that it captures both syntactic and semantic information better than plain word vectors.
Traditional pretrained word vectors provide only a single layer of representation and are limited to a fixed vocabulary; ELMo's representations are character-level, so there is no vocabulary restriction.
Cons:
Relatively slow: every token has to be encoded by running it through the language model.
2.5 Suitable tasks
- Question Answering
- Textual entailment
- Semantic role labeling
- Coreference resolution
- Named entity extraction
- Sentiment analysis
3. ULMFiT
3.1 Model principle and architecture
ULMFiT fine-tunes a pretrained language model in three stages: general-domain LM pretraining, target-task LM fine-tuning, and target-task classifier fine-tuning. Fine-tuning uses a slanted triangular learning rate (STLR) schedule, whose parameters are listed below (a sketch of the schedule follows the list):
- T: number of training iterations
- cut_frac: fraction of iterations we increase the LR
- cut: the iteration when we switch from increasing to decreasing the LR
- p: the fraction of the number of iterations we have increased or will decrease the LR respectively
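These parameters define the STLR schedule from the ULMFiT paper. A minimal sketch, using the default values reported in the paper (cut_frac = 0.1, ratio = 32, maximum learning rate 0.01):

```python
import math

def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular learning rate at iteration t (Howard & Ruder, 2018)."""
    cut = math.floor(T * cut_frac)  # iteration at which we switch from increasing to decreasing the LR
    if t < cut:
        p = t / cut                                      # fraction of the increase phase completed
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # fraction of the decrease phase remaining
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: the LR ramps up sharply over the first 10% of 1000 iterations, then decays slowly
schedule = [stlr(t, T=1000) for t in range(1000)]
```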
```python
# location: fastai/lm_rnn.py

def get_language_model(n_tok, emb_sz, n_hid, n_layers, pad_token,
                       dropout=0.4, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5,
                       tie_weights=True, qrnn=False, bias=False):
    """Returns a SequentialRNN model.

    A RNN_Encoder layer is instantiated using the parameters provided.
    This is followed by the creation of a LinearDecoder layer.
    Also by default (i.e. tie_weights = True), the embedding matrix used in the RNN_Encoder
    is used to instantiate the weights for the LinearDecoder layer.
    The SequentialRNN layer is the native torch's Sequential wrapper that puts the RNN_Encoder and
    LinearDecoder layers sequentially in the model.

    Args:
        n_tok (int): number of unique vocabulary words (or tokens) in the source dataset
        emb_sz (int): the embedding size to use to encode each token
        n_hid (int): number of hidden activation per LSTM layer
        n_layers (int): number of LSTM layers to use in the architecture
        pad_token (int): the int value used for padding text.
        dropout (float): dropout to apply to the LinearDecoder (output) layer.
        dropouth (float): dropout to apply to the activations going from one LSTM layer to another
        dropouti (float): dropout to apply to the input layer.
        dropoute (float): dropout to apply to the embedding layer.
        wdrop (float): dropout used for a LSTM's internal (or hidden) recurrent weights.
        tie_weights (bool): decide if the weights of the embedding matrix in the RNN encoder should be tied to the
            weights of the LinearDecoder layer.
        qrnn (bool): decide if the model is composed of LSTMS (False) or QRNNs (True).
        bias (bool): decide if the decoder should have a bias layer or not.
    Returns:
        A SequentialRNN model
    """
    rnn_enc = RNN_Encoder(n_tok, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,
                          dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    enc = rnn_enc.encoder if tie_weights else None
    return SequentialRNN(rnn_enc, LinearDecoder(n_tok, emb_sz, dropout, tie_encoder=enc, bias=bias))


def get_rnn_classifier(bptt, max_seq, n_class, n_tok, emb_sz, n_hid, n_layers, pad_token, layers, drops,
                       bidir=False, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, qrnn=False):
    # Same encoder as the language model, but processing long documents in chunks of bptt tokens
    rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token=pad_token, bidir=bidir,
                            dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    # Classifier head: pooling over the encoder outputs followed by linear layers
    return SequentialRNN(rnn_enc, PoolingLinearClassifier(layers, drops))
```
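A hypothetical instantiation of the language model, using the AWD-LSTM hyperparameters reported in the ULMFiT paper rather than values taken from this listing:

```python
# ~30k-token vocabulary, 400-d embeddings, 1150 hidden units per layer, 3 LSTM layers, pad token id 1
lm = get_language_model(n_tok=30000, emb_sz=400, n_hid=1150, n_layers=3, pad_token=1)
```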
Suitable tasks:
- Classification
- Sequence labeling
4. OpenAI GPT
OpenAI GPT also follows the fine-tuning approach, but replaces the LSTM with a multi-layer Transformer decoder that is pretrained as a language model and then fine-tuned on each task together with an auxiliary LM objective. The core of the fine-tuning model from the released code:
```python
# location: finetune-transformer-lm/train.py

def model(X, M, Y, train=False, reuse=False):
    with tf.variable_scope('model', reuse=reuse):
        # n_special = 3: number of extra special tokens (_start_, _delimiter_, _classify_)
        # n_ctx: context length (number of positions), with one learned embedding per position
        we = tf.get_variable("we", [n_vocab+n_special+n_ctx, n_embd],
                             initializer=tf.random_normal_initializer(stddev=0.02))
        we = dropout(we, embd_pdrop, train)

        X = tf.reshape(X, [-1, n_ctx, 2])
        M = tf.reshape(M, [-1, n_ctx])

        # 1. Embedding (token + position)
        h = embed(X, we)

        # 2. Transformer blocks
        for layer in range(n_layer):
            h = block(h, 'h%d'%layer, train=train, scale=True)

        # 3. Language-model loss
        lm_h = tf.reshape(h[:, :-1], [-1, n_embd])
        lm_logits = tf.matmul(lm_h, we, transpose_b=True)
        lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=lm_logits, labels=tf.reshape(X[:, 1:, 0], [-1]))
        lm_losses = tf.reshape(lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
        lm_losses = tf.reduce_sum(lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

        # 4. Classifier loss (the hidden state at the _classify_ token position is used for classification)
        clf_h = tf.reshape(h, [-1, n_embd])
        pool_idx = tf.cast(tf.argmax(tf.cast(tf.equal(X[:, :, 0], clf_token), tf.float32), 1), tf.int32)
        clf_h = tf.gather(clf_h, tf.range(shape_list(X)[0], dtype=tf.int32)*n_ctx+pool_idx)
        clf_h = tf.reshape(clf_h, [-1, 2, n_embd])
        if train and clf_pdrop > 0:
            shape = shape_list(clf_h)
            shape[1] = 1
            clf_h = tf.nn.dropout(clf_h, 1-clf_pdrop, shape)
        clf_h = tf.reshape(clf_h, [-1, n_embd])
        clf_logits = clf(clf_h, 1, train=train)
        clf_logits = tf.reshape(clf_logits, [-1, 2])
        clf_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=clf_logits, labels=Y)
        return clf_logits, clf_losses, lm_losses
```
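In train.py the two losses returned above are then combined into a single training objective; a minimal sketch of that combination (lm_coef weights the auxiliary LM loss and defaults to 0.5 in the released code):

```python
import tensorflow as tf

def combined_loss(clf_losses, lm_losses, lm_coef=0.5):
    # classification loss plus the weighted auxiliary language-modelling loss
    return tf.reduce_mean(clf_losses) + lm_coef * tf.reduce_mean(lm_losses)
```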
Suitable tasks:
- Natural Language Inference
- Question Answering and commonsense reasoning
- Classification
- Semantic Similarity
Source: https://juejin.im/post/5c00a92d6fb9a049ab0d5669