Preface
Although many NLP tasks have moved on to deep-learning approaches such as recurrent neural networks and attention models, the classical models are still worth knowing. This post looks at how to do Chinese word segmentation with a Hidden Markov Model (HMM).
Hidden Markov Models
A Hidden Markov Model is a directed graphical model. Graphical models express the probabilistic dependencies between variables clearly: nodes represent variables, and the edges between nodes represent their probabilistic dependence. Conditional random fields are another common graphical model. An HMM is defined as follows.
Let the set of all possible hidden states of the system be $S = \{s_1, s_2, \dots, s_n\}$ and the set of all observable symbols be $O = \{o_1, o_2, \dots, o_m\}$, so there are n hidden states and m observable symbols.
Further, let $y = (y_1, y_2, \dots, y_T)$ be the state sequence over T time steps, with a corresponding observation sequence $x = (x_1, x_2, \dots, x_T)$. These two are easy to understand: in NLP part-of-speech tagging, they are a sentence and its tags, for example "我 / 是 / 中国 / 人" and "pronoun / verb / noun / noun".
An HMM consists of three groups of probabilities: the transition probabilities, the observation (emission) probabilities, and the initial state probabilities.
Transitions between system states are governed by a transition matrix $A = [a_{ij}]_{n \times n}$; since there are n hidden states, it has n rows and n columns. Each entry is

$$a_{ij} = P(y_{t+1} = s_j \mid y_t = s_i), \quad 1 \le i, j \le n$$

i.e. the probability that, if the state at any time t is $s_i$, the state at the next time step is $s_j$ — the transition probability between any two states.
The observation probability matrix is $B = [b_{ij}]_{n \times m}$, where

$$b_{ij} = P(x_t = o_j \mid y_t = s_i), \quad 1 \le i \le n, \ 1 \le j \le m$$

the probability that, at any time t, observation $o_j$ is generated when the hidden state is $s_i$.
In addition, there is an initial state distribution $\pi = (\pi_1, \pi_2, \dots, \pi_n)$, giving the probability of each state at the initial time step, where

$$\pi_i = P(y_1 = s_i), \quad 1 \le i \le n$$

i.e. the probability that the state at t = 1 is $s_i$.
In summary, a Hidden Markov Model can be described by the triple $\lambda = (A, B, \pi)$.
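Concretely, the triple $\lambda = (A, B, \pi)$ can be held in plain nested dicts, which is the same layout the training code below builds from counts. A minimal sketch with made-up probabilities for the four segmentation tags B/M/E/S (the numbers and characters here are illustrative, not learned from data):

```python
transition_matrix = {            # A: P(next state | current state)
    'B': {'M': 0.3, 'E': 0.7},   # a word-begin can only continue (M) or end (E)
    'M': {'M': 0.2, 'E': 0.8},
    'E': {'B': 0.6, 'S': 0.4},
    'S': {'B': 0.5, 'S': 0.5},
}
observation_matrix = {           # B: P(character | state), toy values
    'B': {'图': 0.5, '看': 0.5},
    'M': {'书': 1.0},
    'E': {'馆': 1.0},
    'S': {'我': 0.4, '在': 0.3, '书': 0.3},
}
pi_state = {'B': 0.5, 'S': 0.5}  # π: only B or S can start a sentence

# every row of A and B, and π itself, is a probability distribution
for row in (list(transition_matrix.values())
            + list(observation_matrix.values()) + [pi_state]):
    assert abs(sum(row.values()) - 1.0) < 1e-9
```

Missing keys (e.g. `transition_matrix['B']['S']`) simply mean probability zero; the decoding code later substitutes a small smoothing constant for them.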
Sequence Labeling
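Segmentation is cast as sequence labeling by giving each character one of four tags: B (begin of a word), M (middle), E (end), or S (a single-character word) — the tag set the code below trains on and decodes. A small helper (not part of the original post) that derives these tags from an already-segmented sentence:

```python
def words_to_tags(words):
    """Map a list of segmented words to per-character B/M/E/S tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append('S')                                  # one-char word
        else:
            tags.extend(['B'] + ['M'] * (len(w) - 2) + ['E'])  # B M... E
    return tags

print(words_to_tags(['我', '在', '图书馆', '看', '书']))
# ['S', 'S', 'B', 'M', 'E', 'S', 'S']
```

Running such a helper over a segmented corpus is one way to produce the tab-separated training data the functions below expect.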
```python
import copy
import pickle
import re

# module-level state shared by the functions below (paths are placeholders)
data_path = 'train.txt'
model_path = 'hmm_model.pkl'
default_probability = 1e-10                    # smoothing for unseen events
observation_set, state_set = set(), set()      # characters seen / hidden tags
transition_matrix, observation_matrix, pi_state = {}, {}, {}


def read_data(filename):
    # one "character<TAB>tag" pair per line; a blank line ends a sentence
    sentences = []
    sentence = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            word_label = line.strip().split('\t')
            if len(word_label) == 2:
                observation_set.add(word_label[0])
                state_set.add(word_label[1])
                sentence.append(word_label)
            else:
                sentences.append(sentence)
                sentence = []
    return sentences
```
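The file format `read_data` expects can be inferred from its parsing logic: one "character<TAB>tag" pair per line, with a blank line separating sentences. A standalone sketch of the same parsing over an in-memory sample (the tags here are illustrative):

```python
sample = "我\tS\n在\tS\n图\tB\n书\tM\n馆\tE\n\n看\tS\n书\tS\n"

sentences, sentence = [], []
for line in sample.splitlines():
    pair = line.strip().split('\t')
    if len(pair) == 2:
        sentence.append(pair)
    else:                      # blank line ends the current sentence
        sentences.append(sentence)
        sentence = []
if sentence:                   # flush a trailing sentence with no blank line
    sentences.append(sentence)

print(len(sentences))  # 2
```

Note the trailing flush: as written, the original `read_data` silently drops the last sentence if the file does not end with a blank line.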
```python
def train():
    print('begin training......')
    sentences = read_data(data_path)
    # count emissions, initial states and transitions
    for sentence in sentences:
        pre_label = -1                       # sentinel: no previous tag yet
        for word, label in sentence:
            observation_matrix.setdefault(label, {})
            observation_matrix[label][word] = observation_matrix[label].get(word, 0) + 1
            if pre_label == -1:              # first character of the sentence
                pi_state[label] = pi_state.get(label, 0) + 1
            else:
                transition_matrix.setdefault(pre_label, {})
                transition_matrix[pre_label][label] = transition_matrix[pre_label].get(label, 0) + 1
            pre_label = label
    # normalize counts into probabilities (maximum likelihood estimates)
    for key, value in transition_matrix.items():
        number_total = sum(value.values())
        for k, v in value.items():
            transition_matrix[key][k] = 1.0 * v / number_total
    for key, value in observation_matrix.items():
        number_total = sum(value.values())
        for k, v in value.items():
            observation_matrix[key][k] = 1.0 * v / number_total
    number_total = sum(pi_state.values())
    for k, v in pi_state.items():
        pi_state[k] = 1.0 * v / number_total
    print('finish training.....')
    save_model()
```
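The three normalization loops at the end of `train()` are the same maximum-likelihood step applied to three tables: divide each count by its row total. In isolation, with toy counts for transitions out of one state:

```python
# toy transition counts out of state 'B': seen 'B'->'M' 30 times, 'B'->'E' 70 times
counts = {'M': 30, 'E': 70}
total = sum(counts.values())                       # 100
probs = {k: 1.0 * v / total for k, v in counts.items()}
print(probs)  # {'M': 0.3, 'E': 0.7}
```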
```python
def load_model():
    print('loading model...')
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    return model


def save_model():
    print('saving model...')
    model = [transition_matrix, observation_matrix, pi_state, state_set, observation_set]
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
```
```python
def predict():
    # assumes the model globals are already populated by train() or load_model()
    text = '我在图书馆看书'
    min_probability = -1 * float('inf')
    words = [{} for _ in text]    # words[t][state]: best path probability
    path = {}                     # state -> best tag path ending in that state
    # initialization: π(state) * P(first character | state)
    for state in state_set:
        words[0][state] = 1.0 * pi_state.get(state, default_probability) \
            * observation_matrix.get(state, {}).get(text[0], default_probability)
        path[state] = [state]
    # Viterbi recursion over the remaining characters
    for t in range(1, len(text)):
        new_path = {}
        for state in state_set:
            max_probability = min_probability
            max_state = ''
            for pre_state in state_set:
                probability = words[t - 1][pre_state] \
                    * transition_matrix.get(pre_state, {}).get(state, default_probability) \
                    * observation_matrix.get(state, {}).get(text[t], default_probability)
                max_probability, max_state = max((max_probability, max_state),
                                                 (probability, pre_state))
            words[t][state] = max_probability
            tmp = copy.deepcopy(path[max_state])
            tmp.append(state)
            new_path[state] = tmp
        path = new_path
    # pick the best final state and cut the tag path into words
    max_probability, max_state = max((words[len(text) - 1][s], s) for s in state_set)
    result = []
    p = re.compile('BM*E|S')
    for i in p.finditer(''.join(path[max_state])):
        start, end = i.span()
        word = text[start:end]
        result.append(word)
    print(result)
```
Decoding the example sentence, two snapshots of the `path` dictionary (the best tag path ending in each state), one early in the recursion and one after the last of the seven characters:

```
'M': ['S']
'B': ['S']
'E': ['S']
'S': ['S']

'M': ['S', 'S', 'B', 'M', 'E', 'B', 'M']
'B': ['S', 'S', 'B', 'M', 'E', 'S', 'B']
'E': ['S', 'S', 'B', 'M', 'E', 'B', 'E']
'S': ['S', 'S', 'B', 'M', 'E', 'S', 'S']
```
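The final step of `predict()` turns the winning tag path into words with the regex `BM*E|S`: each match spans exactly one word, either a full B…E run or a lone S. In isolation, using one of the candidate tag paths for the example sentence:

```python
import re

tags = 'SSBMESS'           # a candidate best path for 我在图书馆看书
text = '我在图书馆看书'
p = re.compile('BM*E|S')
# each regex match covers one word; its span indexes back into the text
words = [text[m.start():m.end()] for m in p.finditer(tags)]
print(words)  # ['我', '在', '图书馆', '看', '书']
```

Because tags and characters are aligned one-to-one, `m.span()` on the tag string can be used directly as a slice into the original text.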
Source: https://juejin.im/post/5bf5f927f265da615a417373