Python Crawler Series - First Look: Scraping Travel Reviews

For now the crawler is built on the requests package. Its documentation is linked below and is convenient for looking things up.

http://docs.python-requests.org/en/master/

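The goal is to scrape product reviews from a travel site. Inspecting the site's traffic shows that the JSON data is fetched with a POST request. In short:

GET appends the data to the URL as a query string.

POST sends the data separately, in the request body.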
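
The difference can be seen without touching the network by preparing both kinds of request and inspecting them. This is only a small sketch: the URL and field names are placeholders, not the real site.

```python
import requests

# Placeholder endpoint for illustration; nothing is actually sent.
url = "https://example.com/api"
fields = {"key1": "value1", "key2": "value2"}

# GET: the fields are encoded into the query string of the URL.
get_req = requests.Request("GET", url, params=fields).prepare()
print(get_req.url)   # https://example.com/api?key1=value1&key2=value2
print(get_req.body)  # None

# POST: the URL is unchanged and the fields travel in the request body.
post_req = requests.Request("POST", url, data=fields).prepare()
print(post_req.url)   # https://example.com/api
print(post_req.body)  # key1=value1&key2=value2
```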

The content sent via POST generally takes one of three forms: form, JSON, and multipart. This post covers the first two.

1. Content in form
Content-Type: application/x-www-form-urlencoded

Put the content in a dict and pass it to the data parameter.

payload = {
	'key1': 'value1', 'key2': 'value2'	
}
r = requests.post(url, data=payload)
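
To check what requests does with such a dict, the prepared request can be inspected locally. In this sketch the URL is a placeholder and the values were chosen only to show the percent-encoding.

```python
import requests

# Values with a space and an ampersand, to show how they are escaped.
payload = {"key1": "value 1", "key2": "a&b"}

# Prepare the request without sending it (placeholder URL).
req = requests.Request("POST", "https://example.com/api", data=payload).prepare()

print(req.headers["Content-Type"])  # application/x-www-form-urlencoded
print(req.body)                     # key1=value+1&key2=a%26b
```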
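
2. Content in JSON
Content-Type: application/json

Convert the dict to a JSON string and pass it to the data parameter. In this form requests does not set the Content-Type header for you, so add it via headers if the server requires it.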

payload = {
	'some': 'data'	
}
r = requests.post(url, data=json.dumps(payload))

Or pass the dict directly to the json parameter.

payload = {
	'some': 'data'	
}
r = requests.post(url, json=payload)
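
The two JSON variants are not quite identical, which a local comparison of the prepared requests makes visible. This is a sketch with a placeholder URL, and nothing is sent.

```python
import json
import requests

url = "https://example.com/api"  # placeholder endpoint
payload = {"some": "data"}

# Variant 1: serialize by hand and pass the string as data.
via_data = requests.Request("POST", url, data=json.dumps(payload)).prepare()

# Variant 2: let requests serialize the dict via the json parameter.
via_json = requests.Request("POST", url, json=payload).prepare()

# Only the json parameter sets the Content-Type header automatically.
print(via_data.headers.get("Content-Type"))  # None
print(via_json.headers.get("Content-Type"))  # application/json

# Both bodies decode to the same object.
print(json.loads(via_data.body) == json.loads(via_json.body))  # True
```

With the data variant, the header can be supplied explicitly through headers={'Content-Type': 'application/json'} when the server expects it.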

Here is a simple piece of code for reference.

import requests
import json

# Fetch one page of product comments and print the raw response text.
def getCommentStr():
    url = r"https://package.com/user/comment/product/queryComments.json"
    header = {
        'User-Agent':           r'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Accept':               r'application/json, text/javascript, */*; q=0.01',
        'Accept-Language':      r'en-US,en;q=0.5',
        # 'br' is left out: requests only decodes Brotli if a brotli package is installed.
        'Accept-Encoding':      r'gzip, deflate',
        'Content-Type':         r'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With':     r'XMLHttpRequest',
        # Content-Length is omitted: requests computes it from the body.
        'DNT':                  '1',
        'Connection':           r'keep-alive',
        'TE':                   r'Trailers'
    }
    params = {
        'pageNo':               '2',
        'pageSize':             '10',
        'productId':            '2590732030',
        'rateStatus':           'ALL',
        'type':                 'all'
    }
    # The fields go in the form-encoded body; the timeout keeps the call from hanging forever.
    r = requests.post(url, headers=header, data=params, timeout=10)
    print(r.text)

getCommentStr()
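
A natural next step is to request several pages and decode the JSON instead of printing raw text. The sketch below reuses the URL and form fields from the code above. The helper names build_payload, fetch_page, and crawl are made up for illustration, and since the layout of the returned JSON is not described here, the decoded data is returned without being interpreted.

```python
import requests

URL = "https://package.com/user/comment/product/queryComments.json"

def build_payload(page_no, page_size=10, product_id="2590732030"):
    """Return the form fields for one page of comments."""
    return {
        "pageNo": str(page_no),
        "pageSize": str(page_size),
        "productId": product_id,
        "rateStatus": "ALL",
        "type": "all",
    }

def fetch_page(session, page_no):
    """POST the request for one page and return the decoded JSON body."""
    r = session.post(URL, data=build_payload(page_no), timeout=10)
    r.raise_for_status()  # fail loudly on 4xx/5xx responses
    return r.json()       # raises a ValueError subclass if the body is not JSON

def crawl(first_page, last_page):
    """Fetch a range of pages over one shared session (reuses the connection)."""
    with requests.Session() as session:
        session.headers.update({"X-Requested-With": "XMLHttpRequest"})
        return [fetch_page(session, n) for n in range(first_page, last_page + 1)]
```

Calling crawl(1, 3) would send three POST requests. Whether the server also needs the other headers shown earlier is something to verify against the real site.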

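Tips

For cookies, I suspect you can use the browser's request-editing feature to remove the cookies one at a time from the request and see which ones are actually unnecessary.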
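
The same elimination can be scripted. The following is a rough sketch, not a tested recipe for any particular site: find_required_cookies and still_works are hypothetical names, and the check function here is a stand-in. In real use it would send the request with cookies=trial and decide whether the response still looks right.

```python
def find_required_cookies(cookies, still_works):
    """Drop cookies one at a time, keeping only those the request needs."""
    required = dict(cookies)
    for name in list(cookies):
        # Try the request without this one cookie.
        trial = {k: v for k, v in required.items() if k != name}
        if still_works(trial):
            required = trial  # still fine without it, so drop it for good
    return required

# Stand-in check for the demo: pretend the server only needs session_id.
def fake_check(cookies):
    return "session_id" in cookies

all_cookies = {"session_id": "abc", "tracking": "xyz", "theme": "dark"}
print(find_required_cookies(all_cookies, fake_check))  # {'session_id': 'abc'}
```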

While testing the code, I prefer to save the scraped data as a str and work from that copy, which also takes some load off the server.
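
One way to read this tip is as a small local cache: fetch once, save the text to a file, and reuse it on later runs. In this sketch load_or_fetch is a hypothetical helper and fake_fetch stands in for the real request (for example, a function returning r.text).

```python
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return cached text from path, calling fetch() only on a cache miss."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = fetch()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text

# Demo with a stand-in fetch function that counts how often it is called.
calls = []

def fake_fetch():
    calls.append(1)
    return '{"comments": []}'

cache_path = os.path.join(tempfile.mkdtemp(), "comments_page2.json")
first = load_or_fetch(cache_path, fake_fetch)   # cache miss: calls fake_fetch
second = load_or_fetch(cache_path, fake_fetch)  # cache hit: reads the file
print(first == second, len(calls))  # True 1
```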

Source: http://www.bubuko.com/infodetail-2826792.html
