Python Data Analysis 8: Web Page Text Processing

1, Strip the HTML tags from the page, such as <br/>

from bs4 import BeautifulSoup
preData=BeautifulSoup(data,'html.parser').get_text()
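
To make the effect of tag stripping concrete, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup. The names TextExtractor and strip_tags are made up for this illustration, and BeautifulSoup copes with malformed markup far better than this does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, text):
        # Called for each run of text found between tags
        self.parts.append(text)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_tags('Good movie.<br/>Watch it!'))  # Good movie.Watch it!
```

Note that the text on either side of <br/> ends up adjacent with no space in between. BeautifulSoup's get_text() behaves the same way by default; it accepts a separator argument, e.g. get_text(' '), if the pieces should stay apart.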

2, Remove punctuation and other non-letter characters with a regular expression.

import re
# Replace every character in preData that is not an ASCII letter (a-z, A-Z) with a space
preData=re.sub(r'[^a-zA-Z]',' ',preData)
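
A quick check of what this pattern keeps and drops, as a small sketch on a made-up sample string:

```python
import re

sample = "It's 2018: great film, isn't it?"
# Every character outside a-z and A-Z becomes one space
cleaned = re.sub(r'[^a-zA-Z]', ' ', sample)
print(cleaned.split())
# ['It', 's', 'great', 'film', 'isn', 't', 'it']
```

Digits disappear along with the punctuation, and contractions are split into fragments such as 's' and 't'. If numbers or apostrophes matter for the analysis, the character class would need to be widened.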

3, Lowercase the text and split it into words on whitespace

words=preData.lower().split()

4, Remove stop words

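# The stop-word list can be downloaded yourself (one-time step):
# import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
# English list assumed here, since step 2 keeps only a-z letters
stops=set(stopwords.words('english'))
words_notstop=[w for w in words if w not in stops]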
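
The filtering itself does not depend on NLTK. The sketch below uses a tiny hand-written stop-word set (an assumption for illustration, not the NLTK list) to show the same list comprehension at work:

```python
# Tiny hand-written stop-word set, for illustration only
stop_set = {'the', 'is', 'a', 'of', 'and', 'it'}

words = 'the plot of the film is a mess'.split()
# Keep only the words that are not in the stop-word set
words_notstop = [w for w in words if w not in stop_set]
print(words_notstop)  # ['plot', 'film', 'mess']
```

Holding the stop words in a set rather than a list keeps each `w not in ...` check fast, which matters once the text runs to many thousands of words.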

5, Join the remaining words back into a single string

sentence=' '.join(words_notstop)
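
Steps 2 to 5 can be combined into one function. This is a sketch under the assumption that the tags have already been stripped in step 1; the function name clean_text and the small stop_set are hypothetical and only serve the example.

```python
import re


def clean_text(text, stop_set):
    """Apply steps 2-5 to text whose HTML tags were already removed."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)   # step 2: letters only
    words = letters_only.lower().split()             # step 3: lowercase, split
    kept = [w for w in words if w not in stop_set]   # step 4: drop stop words
    return ' '.join(kept)                            # step 5: rejoin


# Hand-written stop-word set, for illustration only
stop_set = {'the', 'is', 'a', 'this'}
print(clean_text('This film is GREAT, a must-see!', stop_set))
# film great must see
```

Passing the stop-word set in as a parameter makes it easy to swap in the NLTK list or a custom one without touching the function.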

Source: http://www.bubuko.com/infodetail-2689273.html
