Author: 你好我是森林
Date: 2018-04-01
Natural Language Processing
Summarizing Data
In the previous section we looked at how to break a text down into n-grams, i.e., phrases n words long. At its most basic, that collection can be used to determine the most common words and phrases in the text. In addition, we can extract the sentences surrounding those most common phrases to produce a plausible-sounding summary of the original.
As an example, let's analyze the full text of William Henry Harrison's inaugural address; the URL of the text appears in the code below.
```python
from urllib.request import urlopen
import re
import string
from collections import Counter


def cleanSentence(sentence):
    # Strip punctuation and whitespace from each word, and drop stray
    # single characters unless they are the words "a" or "i"
    sentence = sentence.split(' ')
    sentence = [word.strip(string.punctuation + string.whitespace)
                for word in sentence]
    sentence = [word for word in sentence
                if len(word) > 1 or (word.lower() == 'a' or word.lower() == 'i')]
    return sentence


def cleanInput(content):
    # Normalize case, flatten newlines, drop non-ASCII characters,
    # then split the text into sentences
    content = content.upper()
    content = re.sub('\n', ' ', content)
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')
    sentences = content.split('.')
    return [cleanSentence(sentence) for sentence in sentences]


def getNgramsFromSentence(content, n):
    # Slide an n-word window over a single sentence
    output = []
    for i in range(len(content) - n + 1):
        output.append(content[i:i + n])
    return output


def getNgrams(content, n):
    # Count every n-gram across all sentences
    content = cleanInput(content)
    ngrams = Counter()
    for sentence in content:
        newNgrams = [' '.join(ngram)
                     for ngram in getNgramsFromSentence(sentence, n)]
        ngrams.update(newNgrams)
    return ngrams


content = str(
    urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read(),
    'utf-8')
ngrams = getNgrams(content, 3)
print(ngrams)
```
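The code above stops at printing the raw n-gram counts; the sentence-extraction step described at the start of this section isn't shown. Here is a minimal sketch of that idea, reusing the ngrams Counter and content string from the code above (the helper getFirstSentenceContaining is my own hypothetical addition, not from the original post):

```python
def getFirstSentenceContaining(ngram, content):
    # Return the first sentence of the raw text containing the n-gram.
    # The text is upper-cased to match the normalization in cleanInput.
    for sentence in content.upper().split('.'):
        if ngram in sentence:
            return sentence.strip()
    return ''

# A crude "summary": one sentence for each of the three most common trigrams
for ngram, count in ngrams.most_common(3):
    print(getFirstSentenceContaining(ngram, content))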
The Natural Language Toolkit
The Natural Language Toolkit (NLTK) is a Python library for identifying and tagging the parts of speech found in English text.
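Tagging itself isn't demonstrated in this post, but a minimal sketch of what it looks like may help. It assumes the 'punkt' and 'averaged_perceptron_tagger' data packages have been downloaded (see the next section), and the example sentence is my own:

```python
from nltk import pos_tag, word_tokenize

# Tag each token with a part-of-speech label (NNP = proper noun,
# VBD = past-tense verb, DT = determiner, JJ = adjective, NN = noun)
tokens = word_tokenize('William Henry Harrison delivered a long speech')
print(pos_tag(tokens))
# roughly: [('William', 'NNP'), ('Henry', 'NNP'), ('Harrison', 'NNP'),
#           ('delivered', 'VBD'), ('a', 'DT'), ('long', 'JJ'), ('speech', 'NN')]
```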
Installation and Setup
Installation instructions are on the NLTK website (http://www.nltk.org/install.html). Installation itself is simple, for example with pip:
```
psysh git:(master) pip install nltk
Collecting nltk
  Using cached nltk-3.2.5.tar.gz
Requirement already satisfied: six in /usr/local/lib/python3.6/site-packages (from nltk)
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /Users/demo/Library/Caches/pip/wheels/18/9c/1f/276bc3f421614062468cb1c9d695e6086d0c73d67ea363c501
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.2.5
You are using pip version 9.0.1, however version 9.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
```
Then verify that it works:
```
psysh git:(master) python
Python 3.6.4 (default, Mar  1 2018, 18:36:50)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>>
```
Typing nltk.download() brings up the NLTK downloader.
By default it downloads all of the packages, which saves beginners the trouble of hunting down missing packages later. If you'd rather not download everything, you can fetch individual packages by name, as sketched below.
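A minimal sketch; the package names here are assumed from standard NLTK usage, not something shown in the original post:

```python
import nltk

# Download only the data packages the examples in this post rely on:
nltk.download('punkt')                       # tokenizer models for word_tokenize
nltk.download('book')                        # sample texts for `from nltk.book import *`
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger
```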
Statistical Analysis with NLTK
Statistical analysis with NLTK usually starts from a Text object. A Text object can be created from an ordinary Python string as follows:
```python
from nltk import word_tokenize
from nltk import Text

tokens = word_tokenize('哈哈哈哈哈')
text = Text(tokens)
```
The argument to word_tokenize can be any Python string. If you don't have a long string at hand but still want to try out the features, NLTK comes with several books built in, which you can load with an import:
```python
from nltk.book import *
```
Count the distinct words in a text, then compare the result with the total word count:
```python
>>> words = set(text6)
>>> len(text6)/len(words)
7.833333333333333
```
In other words, each distinct word in text6 is used about eight times on average.
Not much content today, but it can be hard to digest. Haha.
Source: http://www.jianshu.com/p/12bfc7dd527a