Text Analysis with NLTK

NLTK (Natural Language Toolkit) is a powerful Python package that provides a set of natural language processing algorithms, such as tokenization, part-of-speech tagging, stemming, named entity recognition, and classification. Install NLTK from the command line, then import it in Python:

pip install nltk
import nltk
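Most of the examples in this article also need data files that are not installed with the package itself. The helper below is a minimal sketch of fetching them with nltk.download; the function name download_resources and the resource list are assumptions made for illustration, and newer NLTK releases may ask for differently named resources (for example "punkt_tab"). If a resource is missing, NLTK raises a LookupError whose message names the resource to download.

```python
import nltk

# Resource names assumed for the examples in this article; newer NLTK
# releases may request "punkt_tab" or "averaged_perceptron_tagger_eng".
RESOURCES = ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]


def download_resources(names=RESOURCES):
    """Download each named resource; return a dict of name -> success flag."""
    return {name: nltk.download(name, quiet=True) for name in names}
```

Call download_resources() once per environment; it needs network access, so it is not run automatically here.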

I. Tokenization

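Text is made up of paragraphs, paragraphs are made up of sentences, and sentences are made up of words. Tokenization is the first step in text analysis: it breaks a passage of text into smaller units (such as words or sentences), and each unit is called a token. When a sentence is split, the tokens are words; when a paragraph is split, the tokens are sentences. NLTK supports both sentence tokenization and word tokenization.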

1. Sentence tokenization

Sentence tokenization splits a paragraph into sentences:

from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=sent_tokenize(text)
print(tokenized_text)

Result of sentence tokenization:

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
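Note that the result keeps "Mr. Smith" in one sentence. For contrast, the sketch below applies a naive rule (split on whitespace after '.', '?' or '!') using only the standard library. It is not how sent_tokenize works; it only illustrates why a plain regular expression is not enough for abbreviations.

```python
import re

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""

# Naive rule: split on any whitespace that follows '.', '?' or '!'
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule produces five pieces instead of four, because it wrongly breaks after the abbreviation "Mr.".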

2. Word tokenization

Word tokenization splits a sentence into words:

from nltk.tokenize import word_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=word_tokenize(text)
print(tokenized_text)

Result of word tokenization:

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']

As you can see, punctuation marks are also included in the result after tokenization.
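The result also shows the contraction "shouldn't" split into 'should' and "n't". As far as I understand, word_tokenize builds on Treebank-style rules, so the sketch below uses NLTK's TreebankWordTokenizer directly to show the same behavior on a single sentence. This class needs no extra data download.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Treebank-style tokenization separates the negation "n't" from its verb
tokens = tokenizer.tokenize("You shouldn't eat cardboard")
print(tokens)
```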

II. Processing tokens

Processing the tokens involves removing punctuation, removing stop words, and normalizing the vocabulary.

1. Removing punctuation

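Apply the following to each token to remove punctuation from the string. string.punctuation contains all ASCII punctuation characters, and passing it as the third argument to str.maketrans deletes those characters from the token.

import string
s = 'abc.'
# The third argument of str.maketrans lists the characters to delete
s = s.translate(str.maketrans("", "", string.punctuation))
print(s)  # abc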
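Applied to a whole token list, there are two reasonable readings of "removing punctuation", sketched below with the standard library only. The helper name is_punctuation is an assumption made for illustration, and the token list is a shortened copy of the word tokenization result shown earlier.

```python
import string

tokens = ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
          'The', 'sky', 'is', 'pinkish-blue', '.']


def is_punctuation(token):
    """Return True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


# Option A: drop tokens made only of punctuation, leave other tokens intact
without_punct_tokens = [t for t in tokens if not is_punctuation(t)]

# Option B: delete punctuation characters inside every token, then drop empties
table = str.maketrans("", "", string.punctuation)
stripped = [t.translate(table) for t in tokens]
stripped = [t for t in stripped if t]

print(without_punct_tokens)
print(stripped)
```

Option A keeps 'Mr.' and 'pinkish-blue' as they are, while option B turns them into 'Mr' and 'pinkishblue'. Which one fits depends on whether hyphens and abbreviation periods matter for the later analysis.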

2. Removing stop words

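Stop words are the noise words in a text: they carry little meaning on their own. Common English stop words include is, am, are, this, a, an, and the. The NLTK corpus ships a stop word list, and you need to filter those words out of the token list yourself.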

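import nltk
from nltk.corpus import stopwords
# `text` is the sample string defined in the tokenization examples above
stop_words = set(stopwords.words("english"))
word_tokens = nltk.tokenize.word_tokenize(text.strip())
# The stop word list is lowercase, so compare tokens in lowercase
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]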
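One detail worth seeing in isolation is letter case: NLTK's English stop word list is lowercase, so a capitalized token such as 'The' only matches if it is lowercased first. The sketch below uses a small hand-written set as a stand-in for stopwords.words("english") (an assumption, so that it runs without the corpus).

```python
# Stand-in for stopwords.words("english"); the real list is much longer
stop_words = {"is", "am", "are", "this", "a", "an", "the"}

tokens = ['The', 'sky', 'is', 'pinkish-blue', '.']

# Case-sensitive comparison: 'The' slips through
case_sensitive = [w for w in tokens if w not in stop_words]

# Case-insensitive comparison: 'The' is removed as well
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```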

III. Lexicon normalization

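Lexicon normalization reduces the various derived forms of a word to a common base form. Stemming strips affixes to obtain a stem, which need not be a real word, while lemmatization maps a word to its dictionary form (lemma). The example below uses two NLTK tools: the Porter stemmer and the WordNet lemmatizer.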

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()   # requires the "wordnet" corpus
stem = PorterStemmer()

word = "flying"
# "v" tells the lemmatizer to treat the word as a verb
print("Lemmatized Word:", lem.lemmatize(word, "v"))
print("Stemmed Word:", stem.stem(word))
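To make the idea of reducing derived forms to one stem more concrete, here is a small sketch that runs the Porter stemmer over a family of related words. The word list is chosen for illustration; PorterStemmer needs no extra data download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Several derived forms of the same word
words = ["connect", "connected", "connecting", "connection", "connections"]

# Map each word to its stem
stems = {word: stemmer.stem(word) for word in words}

for word, stemmed in stems.items():
    print(word, "->", stemmed)
```

All five forms collapse to the single stem "connect", which is what lets later steps such as counting or classification treat them as one term.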

IV. Part-of-speech tagging

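The main goal of part-of-speech (POS) tagging is to identify the grammatical category of a given word. The tagger looks at the word's context within the sentence and assigns it the corresponding tag.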

import nltk
sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens = nltk.word_tokenize(sent)
# pos_tag returns a list of (word, tag) pairs
print(nltk.pos_tag(tokens))
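For English, nltk.pos_tag uses Penn Treebank tags by default, which are terse. The sketch below attaches a readable description to each (word, tag) pair. The describe helper and the tag subset are assumptions made for illustration, and the sample pairs are hand-written rather than actual tagger output.

```python
# A small subset of Penn Treebank tags and their meanings
TAG_DESCRIPTIONS = {
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
    "IN": "preposition or subordinating conjunction",
    "CD": "cardinal number",
}


def describe(tagged_tokens):
    """Attach a description to each (word, tag) pair; 'unknown' if not listed."""
    return [(word, tag, TAG_DESCRIPTIONS.get(tag, "unknown"))
            for word, tag in tagged_tokens]


# Hand-written pairs for illustration, not the output of nltk.pos_tag
sample = [("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"),
          ("in", "IN"), ("1879", "CD")]

described = describe(sample)
for word, tag, meaning in described:
    print(f"{word:10} {tag:4} {meaning}")
```

The same helper can be applied to the list returned by nltk.pos_tag(tokens); tags outside the subset are reported as 'unknown'.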

V. Classification

Omitted.


Source: https://www.cnblogs.com/ljhdo/p/10571791.html
