当前位置：

首页
/
IT
/
程序
/
Objective-C
/
开源中文分词工具探析（四）：THULAC

开源中文分词工具探析（四）：THULAC

是一款相当不错的中文分词工具，准确率高、分词速度蛮快的；并且在工程上做了很多优化，比如：用 DAT 存储训练特征（压缩训练模型），加入了标点符号的特征（提高分词准确率）等。

1. 前言

THULAC 所采用的分词模型为（Structured Perceptron, SP），属于两种 CWS 模型中的 Character-Based Model，将中文分词看作为一个序列标注问题：对于字符序列 \(C=c_1^n\)，找出最有可能的标注序列 \(Y=y_1^n\)。定义 score 函数 \(S(Y,C)\) 为在 \(C\) 的情况下标注序列为 \(Y\) 的得分。SP 以最大熵准则建模 score 函数，分词结果则等同于最大 score 函数所对应的标注序列。记在时刻 \(t\) 的状态为 \(y\) 的路径 \(y_1^{t}\) 所对应的 score 函数最大值为

\[\delta_t(y) = \max S(y_1^{t-1}, C, y_t=y) \]

那么，则有递推式

\[\delta_{t+1}(y) = \max_{y'} \ \{\delta_t(y') + w_{y',y} + F(y_{t+1}=y,C) \} \]

其中，\(w_{y',y}\) 为转移特征 \((y',y)\) 所对应的权值，\(F(y_{t+1}=y, C)\) 为特征模板的特征值的加权之和：

\[F(y_{t+1}=y,C) = \sum_{i=1}^7 \alpha_i f_i(y_{t+1}=y,C) \]

其中，\(\alpha_i\) 为特征 \(f_i(y_{t+1}=y,C)\) 所对应的权重。

2. 分解

以下源码分析基于最近 lite 版本（没有显式的版本号，只能说是最近的了）。

训练模型

seg_only 模式（只分词没有 POS）对应的训练模型的数据包括三种：权重数据

cws_model.bin

、序列标注类别数据

cws_label.txt

、特征数据

cws_dat.bin

。权重数据对应于类

character.CBModel

，其格式如下：

int l_size (size of the labels): 4
int f_size (size of the features): 2453880
int[] ll_weights // weights of (label, label):
-42717
39325
-33792
...
-31595
26794
int[] fl_weights //weights of (feature, label):
-4958
2517
5373
-2930
-286
0
...

训练模型中的标注类别（label）共有 4 类 "0", "2", "3", "1"（见文件

cws_label.txt

），分别对应于中文分词中的标注类别 B、E、S、M；这一点可以在 C++ 版的

thulac_base.h

找到映照：

enum POC{
    kPOC_B='0',
    kPOC_M='1',
    kPOC_E='2',
    kPOC_S='3'
};

但是，THULAC 在解码时用到的 label，则是 BMES 在特征数据文件的索引位置，因此标注类别 B、M、E、S 映射到整数 0、3、1、2；这些映射可以在后面的构建 label 转移矩阵

labelTransPre

及 Viterbi 解码的代码中找到印证。那么，则 B 转移到 B 的权重 \(w_{B,B}=w_{0,0}=-42717\)，B 转移到 E 的权重 \(w_{B,E}=w_{0,1}=39325\)，B 转移到 S 的权重 \(w_{B,S}=w_{0,2}=-33792\)。此即与实际情况相对应，B 只能会转移到 M 和 E，而不可能转移到 S。

权重数据中标识对应于特征模板的 feature 共有 2453880 个。那么，label 与 label 之间的组合共有 4×4=16 种，即

ll_weights

的长度为 16；feature 与 label 之间的组合共有 4×2453880=9815520 种，即

fl_weights

的长度为 9815520。

特征数据则是用双数组 Trie 树来存储的，对应于类

base.Dat

，其格式如下：

int datSize: 7643071
Vector<Entry> dat：
0 0
0 1
0 51
0 52
...
25053 0
5 25088
...

datSize

为

dat

的长度，等于 7643071；类

Entry

表示双数组中的的 base 与 check 值。那么，问题来了：一是特征长什么样？二是知道特征如何得到对应的权重？三是为什么 DAT 的长度远大于字符特征的个数 2453880？

特征

首先，我们来看看特征长什么样，参看特征生成方法

CBNGramFeature::featureGeneration

：

public void featureGeneration(String seq, Indexer indexer, Counter bigramCounter){
  ...
  for(int i = 0; i < seq.length(); i ++){
    mid = seq.charAt(i);
    left = (i > 0) ? (seq.charAt(i-1)) : (SENTENCE_BOUNDARY);
    ...
    key = ((char)mid)+((char)SEPERATOR) + "1";
    key = ((char)left)+((char)SEPERATOR) + "2";
    key = ((char)right)+((char)SEPERATOR) + "3";
    key = ((char)left)+((char)mid)+((char)SEPERATOR) + "1";
    key = ((char)mid)+((char)right)+((char)SEPERATOR) + "2";
    key = ((char)left2)+((char)left)+((char)SEPERATOR) + "1"; // should be + "3"
    key = ((char)right)+((char)right2)+((char)SEPERATOR) + "1"; // should be + "4"
    ...
  }
}

特征模板共定义 7 个字符特征：3 个 unigram 字符特征与 4 个 bigram 字符特征。在处理特征时，字符后面加上了空格，然后在加上标识 1、2、3、4，用以区分特征的种类。值得指出的是 Java 版作者写错了最后两个 bigram 特征，应该是加上数字 3、4；在 C++ 版的函数

NGramFeature::feature_generation

可找到印证。通过特征模板定义，我们发现 THULAC 既考虑到了前面 2 个字符 (各种组合) 对当前字符标注的影响，也考虑到了后面 2 个字符的影响。

接下来，为了解决第二个问题，我们来看看用 Viterbi 算法解码前的代码——THULAC 先将特征值的加权之和 \(F(t_i,C)\) 计算出来，然后按 label 次序逐个放入 values[i*4+label] 中。主干代码如下：

/**
 * @param datSize DAT size
 * @param ch1 first character
 * @param ch2 second character
 * @return [unigram字符特征的base + " ", bigram字符特征的base + " "]
 */
public Vector findBases(int datSize, int ch1, int ch2) {
  uniBase = dat.get(ch1).base + SEPERATOR;
  biBase = dat.get(ind).base + SEPERATOR;
}
/**
 * 按label 0,1,2,3 将特征值的加权之和放入values数组中
 * @param valueOffset values数组偏置量，在putValues中调用时按步长4递增
 * @param base unigram字符特征或bigram字符特征的base加上空格的index
 * @param del 标识'1', '2', '3', '4' -> 49, 50, 51, 52
 * @param pAllowedLable null
 */
private void addValues(int valueOffset, int base, int del, int[] pAllowedLable) {
  int ind = dat.get(base).base + del; // 加上标识del后特征的index
  int offset = dat.get(ind).base; // 特征的base
  int weightOffset = offset * model.l_size; // 特征数组的偏移量
  int allowedLabel;
  if (model.l_size == 4) {
    values[valueOffset] += model.fl_weights[weightOffset];
    values[valueOffset + 1] += model.fl_weights[weightOffset + 1];
    values[valueOffset + 2] += model.fl_weights[weightOffset + 2];
    values[valueOffset + 3] += model.fl_weights[weightOffset + 3];
  }
}
public int putValues(String sequence, int len) {
  int base = 0;
  for (int i = 0; i < len; i++) {
    int valueOffset = i * model.l_size;
    if ((base = uniBases[i + 1]) != -1) {
      addValues(valueOffset, base, 49, null); // c_{i}t_{i}
    }
    if ((base = uniBases[i]) != -1) {
      addValues(valueOffset, base, 50, null); // c_{i-1}t_{i}
    }
    if ((base = uniBases[i + 2]) != -1) {
      addValues(valueOffset, base, 51, null); // c_{i+1}t_{i}
    }
    if ((base = biBases[i + 1]) != -1) {
      addValues(valueOffset, base, 49, null); // c_{i-1}c_{i}t_{i}
    }
    if ((base = biBases[i + 2]) != -1) {
      addValues(valueOffset, base, 50, null); // c_{i}c_{i+1}t_{i}
    }
    if ((base = biBases[i]) != -1) {
      addValues(valueOffset, base, 51, null); // c_{i-2}c_{i-1}t_{i}
    }
    if ((base = biBases[i + 3]) != -1) {
      addValues(valueOffset, base, 52, null); // c_{i+1}c_{i+2}t_{i}
    }
  }
}

注意：49 对应于数字 1 的 unicode 值，50 对应于数字 2 等。从上述代码中，我们发现特征数组 fl[4*i+j] 对应于特征的 base 为 i，label 为 j。拼接特征的流程如下：先得到 unigram 或 bigram 字符特征，然后加空格，再加标识。在拼接过程中，按照 DAT 的转移方程进行转移：

base[r] + c = s
check[s] = r

最后，我们回到第三个问题——为什么 DAT 的长度远大于特征数？这是因为在构建 DAT 时，需要存储很多的中间结果。

解码

THULAC 用于解码的类

character.CBTaggingDecoder

，Viterbi 算法实现对应于方法

AlphaBeta::dbDecode

；CBTaggingDecoder 的主要字段如下：

public char separator;
private int maxLength; // 定义的最大句子长度20000
private int len; // 待分词句子的长度
private String sequence; // 待分词句子的深拷贝
private int[][] allowedLabelLists; // 根据标点符号及句子结构，判断当前字符允许的label: int[len][]
private int[][] pocsToTags; // index -> allow-labels: int[16][2,3,4]
private CBNGramFeature nGramFeature;
private Dat dat; // 字符特征DAT
private CBModel model; // 权重
private Node[] nodes; // 分词DAG邻接表
private int[] values; // 特征模板F(t_i,X)的加权之和: int[i*4+label]
private AlphaBeta[] alphas; // 解码时用来记录path, [i, j]指c_{i}的标注为t_{j}的前一结点
private int[] result; // 分词后的标注数组
private String[] labelInfo; // ["0", "2", "3", "1"]
private int[] labelTrans; // 
private int[][] labelTransPre; // 可能情况的前labels: int[4][3]
public int threshold;
private int[][] labelLookingFor;

其中，二维数组

pocsToTags

为 index 与 allow labels 之间的映射，内容如下：

[null, [0, -1], [3, -1], [0, 3, -1], 
[1, -1], [0, 1, -1], [1, 3, -1], [0, 1, 3, -1], 
[2, -1], [0, 2, -1], [2, 3, -1], [0, 2, 3, -1], 
[1, 2, -1], [0, 1, 2, -1], [1, 2, 3, -1], [0, 1, 2, 3, -1]]

该数组的 index 表示了当前字符具有某些性质，比如：

1（[0]）表示词的开始，即标注 B；
2（[3]）表示词的中间字符，即标注 M；
4（[1]）表示词的结尾，即标注 E；
8（[2]）表示标点符号，即标注 S；
9（[0,2]）表示某一个词的开始，或者能单独成词
12（[1,2]）表示句子的结束；
15（[0,1,2,3]）为默认值。

在分词前，THULAC 根据标点符号等特征，给一些字符加入 allow labels 以提高分词准确性。比如，根据书名号确定里面为一个词，

val sentence = "倒模，替身算什么？钟汉良、ab《孤芳不自赏》抠图来充数"
val poc_cands = new POCGraph
val tagged = new TaggedSentence
val segged = new SegmentedSentence
val segmenter = new CBTaggingDecoder
val preprocesser = new Preprocesser
val prefix = "models/"
segmenter.init(prefix + "cws_model.bin", prefix + "cws_dat.bin", prefix + "cws_label.txt")
segmenter.setLabelTrans()
segmenter.segment(raw, poc_cands, tagged)
segmenter.get_seg_result(segged)
println(segged.mkString(" "))
// 倒模 ， 替身 算 什么 ？ 钟汉良 、 ab 《 孤芳不自赏 》 抠 图 来 充数

3. 参考资料

[1] Li, Z., & Sun, M. (2009). Punctuation as implicit annotations for Chinese word segmentation. Computational Linguistics, 35(4), 505-512.

来源: http://www.cnblogs.com/en-heng/p/6429355.html

与本文相关文章

暂无,快来抢沙发吧！