Python Web Crawler (7): Scraping Static Data in Detail

Goal

Scrape the data from http://seputu.com/ and save it to a CSV file.

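Importing the libraries

lxml parses page source such as HTML and extracts data from it. For reference: https://www.cnblogs.com/zhangxinqi/p/9210211.html

requests fetches the web page.

chardet detects the character encoding used by the page.

csv writes the extracted text to a CSV file.

re provides regular expressions.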

from lxml import etree
import requests
import chardet
import csv
import re

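Fetching the page

Build the request headers and pass them to requests.get so that the request looks like it comes from a browser. The header values can be copied from the Network tab of the browser's developer tools.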

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
headers = {
    'User-Agent': user_agent
}
r = requests.get('http://seputu.com/', headers=headers)
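
As a hedged variation on the request above, the sketch below wraps the call in a small helper that adds a timeout and fails loudly on HTTP errors. The helper name `fetch` and the 10-second timeout are assumptions made for illustration, not part of the original script. The last lines build the request without sending it, so the headers can be inspected offline.

```python
import requests

USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
)


def fetch(url, timeout=10):
    """Hypothetical helper: GET a page with a browser-like User-Agent."""
    response = requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return response


# Prepare (but do not send) the request to check what would go on the wire.
prepared = requests.Request(
    'GET', 'http://seputu.com/', headers={'User-Agent': USER_AGENT}
).prepare()
print(prepared.headers['User-Agent'])
```

Without a timeout, requests.get can wait indefinitely on an unresponsive server, which is why one is passed explicitly here.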

Detecting and converting the encoding

r.encoding=chardet.detect(r.content)['encoding']
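
Why this step matters can be shown without touching the network. When a server sends a text response without declaring a charset, requests falls back to ISO-8859-1 for r.text, which garbles Chinese pages served as UTF-8. The sketch below reproduces that effect with plain bytes; the sample string is only an illustration.

```python
text = '盗墓笔记'
raw = text.encode('utf-8')          # what the server actually sends

wrong = raw.decode('iso-8859-1')    # decoding with the wrong fallback encoding
right = raw.decode('utf-8')         # decoding with the correct encoding

print(repr(wrong))  # mojibake: one character per byte
print(right)        # the original text
```

Note that chardet works by heuristics, so its guess can be wrong or missing for very short or unusual content; checking the detected value before assigning it to r.encoding is a reasonable precaution.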

Parsing the page

HTML=etree.HTML(r.text)

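Extracting the page information

Open the site in a browser, locate the tags to extract with the element inspector, and then pull the corresponding content out of the HTML.

The fields extracted here are h2_title, href, and title. The title attribute is split into groups with a regular expression, and the data is read from those groups.

Note: Python's re module does not support every form of zero-width assertion. In particular, a look-behind must match a fixed-width pattern. Using capture groups instead avoids this error.

For example, the following code raises an error: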

import re
box_title='[2012-5-23 21:14:42] 盗墓笔记 贺岁篇 真相'
pattern=re.compile(r'(?<=\[.*\]\s).*')
result1=re.search(pattern, box_title)
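
The failure and the group-based workaround can be compared side by side. This is a minimal sketch using the same sample title; the variable names `lookbehind_error`, `date`, and `real_title` are chosen here for illustration.

```python
import re

box_title = '[2012-5-23 21:14:42] 盗墓笔记 贺岁篇 真相'

# A look-behind containing .* has no fixed width, so compiling it fails.
try:
    re.compile(r'(?<=\[.*\]\s).*')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)

# Capture groups do the same job: group 1 is the timestamp, group 2 the title.
match = re.search(r'\s*\[(.*)\]\s+(.*)', box_title)
date, real_title = match.group(1), match.group(2)
print(date)
print(real_title)
```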

rows holds the two-dimensional data that will be written to the CSV file.

# Compile once: group 1 is the bracketed timestamp, group 2 is the chapter title.
pattern = re.compile(r'\s*\[(.*)\]\s+(.*)')
div_mulus = HTML.xpath('.//*[@class="mulu"]')
rows = []
for div_mulu in div_mulus:
    div_h2 = div_mulu.xpath('./div[@class="mulu-title"]/center/h2/text()')
    if len(div_h2) > 0:
        h2_title = div_h2[0]
        a_s = div_mulu.xpath('./div[@class="box"]/ul/li/a')
        for a in a_s:
            href = a.xpath('./@href')[0]
            box_title = a.xpath('./@title')[0]
            result1 = re.search(pattern, box_title)
            if result1 is None:
                continue  # skip titles that do not match the expected format
            rows.append([h2_title, result1.group(2), href, result1.group(1)])
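
To try the extraction logic without hitting the live site, the same XPath expressions can be run against a small inline document. The HTML below is a made-up fragment that only mimics the structure the expressions expect (a "mulu" block with a "mulu-title" heading and a "box" list); the real page may differ.

```python
import re
from lxml import etree

# Hypothetical sample markup, shaped like the structure the XPath assumes.
SAMPLE = """
<div class="mulu">
  <div class="mulu-title"><center><h2>Volume One</h2></center></div>
  <div class="box"><ul>
    <li><a href="http://example.com/1.html" title="[2012-5-23 21:14:42] Chapter One">Chapter One</a></li>
    <li><a href="http://example.com/2.html" title="[2012-5-24 08:00:00] Chapter Two">Chapter Two</a></li>
  </ul></div>
</div>
"""

pattern = re.compile(r'\s*\[(.*)\]\s+(.*)')
tree = etree.HTML(SAMPLE)
rows = []
for div in tree.xpath('.//*[@class="mulu"]'):
    h2 = div.xpath('./div[@class="mulu-title"]/center/h2/text()')
    if not h2:
        continue  # a block without a heading is skipped
    for a in div.xpath('./div[@class="box"]/ul/li/a'):
        m = pattern.search(a.get('title', ''))
        if m:
            rows.append([h2[0], m.group(2), a.get('href'), m.group(1)])

for row in rows:
    print(row)
```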

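Saving the data

Build a one-dimensional list of column headers to go with the two-dimensional rows from the previous step. Open the file in write mode ('w') and use a csv writer: writerow writes the header row, and writerows writes all the data rows.

The final print statement signals that the script completed normally.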

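headers = ['title', 'real_title', 'href', 'date']
# newline='' prevents blank lines between rows on Windows;
# utf-8 keeps the Chinese titles intact.
with open('text.csv', 'w', newline='', encoding='utf-8') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)
print('finished')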
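
A quick way to check the output is to read the file back with csv.reader and compare it with what was written. The sketch below does this in a temporary directory with a single made-up row; the file name `check.csv` and the row contents are illustrative only.

```python
import csv
import os
import tempfile

header = ['title', 'real_title', 'href', 'date']
rows = [
    ['Volume One', 'Chapter, One', 'http://example.com/1.html', '2012-5-23 21:14:42'],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'check.csv')
    # Write the header and rows the same way as the main script.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    # Read everything back as a list of lists.
    with open(path, newline='', encoding='utf-8') as f:
        read_back = list(csv.reader(f))

print(read_back == [header] + rows)
```

The comma inside 'Chapter, One' is there on purpose: the writer quotes such fields, so they survive the round trip as a single column.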

Source: http://www.bubuko.com/infodetail-3085517.html
