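Preface
I have been blogging for a while, and I had somehow overlooked that the posts on the blog channel can be scraped as well, so I went ahead and did it.
Open the browser developer tools with F12 and watch the network requests; the data API is easy to find.
The extracted link looks like this (note that the script below requests the same path on www.csdn.net rather than blog.csdn.net):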
https://blog.csdn.net/api/articles?type=more&category=newarticles&shown_offset=1540381234000000
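To make the link's parameters easier to read, here is a minimal sketch that splits the query string and decodes `shown_offset`. It assumes `shown_offset` is a Unix timestamp in microseconds, which the post does not state explicitly (the script only calls it a timestamp); `offset_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def offset_to_datetime(shown_offset):
    """Convert a shown_offset value to a UTC datetime.

    Assumption: the value is a Unix timestamp in microseconds.
    """
    seconds, micros = divmod(int(shown_offset), 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


url = "https://blog.csdn.net/api/articles?type=more&category=newarticles&shown_offset=1540381234000000"
params = parse_qs(urlparse(url).query)  # every value comes back as a list of strings

print(params["type"][0], params["category"][0])
print(offset_to_datetime(params["shown_offset"][0]).isoformat())
```

Under that assumption, the offset in the sample link corresponds to 24 October 2018, which is consistent with a "newest articles" feed paged by time.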
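import time

import pymongo
import requests

START_URL = "https://www.csdn.net/api/articles?type=more&category=newarticles&shown_offset={}"
HEADERS = {
    "Accept": "application/json",
    "Host": "www.csdn.net",
    "Referer": "https://www.csdn.net/nav/newarticles",
    "User-Agent": "your own browser's User-Agent string",
    "X-Requested-With": "XMLHttpRequest",
}

# The original post never defines `collection`. The connection string and the
# database/collection names below are assumptions; adjust them to your setup.
client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["csdn"]["articles"]


def get_url(url):
    # Loop instead of recursing once per page, so a long crawl cannot hit
    # Python's recursion limit.
    while url:
        try:
            res = requests.get(url, headers=HEADERS, timeout=3)
            articles = res.json()  # the method is json(), not JSON()
            next_url = None
            if articles["status"]:
                need_data = articles["articles"]
                if need_data:
                    collection.insert_many(need_data)  # insert the batch
                    print("Inserted {} records".format(len(need_data)))
                    last_shown_offset = articles["shown_offset"]  # timestamp of the last item
                    if last_shown_offset:
                        time.sleep(1)
                        next_url = START_URL.format(last_shown_offset)
            url = next_url
        except Exception as e:
            print(e)
            print("Pausing for 60s; the failing URL is {}".format(url))
            time.sleep(60)  # wait 60s after a failure, then retry the same URL


if __name__ == "__main__":
    # Start from the offset shown in the sample link above.
    get_url(START_URL.format("1540381234000000"))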
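The paging logic can be tried out without touching the network or MongoDB by passing in the fetch step as a function. The following is only a sketch: `crawl`, `fake_fetch`, `FAKE_PAGES` and the `max_pages` cap are hypothetical names introduced here, and the fake pages merely imitate the `status` / `articles` / `shown_offset` fields that the script reads from the response.

```python
def crawl(fetch, first_offset, max_pages=1000):
    """Follow shown_offset from page to page and collect every article.

    `fetch` takes an offset and returns the decoded JSON for that page.
    """
    collected = []
    offset = first_offset
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        page = fetch(offset)
        if not page.get("status") or not page.get("articles"):
            break  # error status or empty page: nothing more to fetch
        collected.extend(page["articles"])
        offset = page.get("shown_offset")
        if not offset:
            break  # no further offset means this was the last page
    return collected


# Fake three-page API keyed by offset (made-up data, for illustration only).
FAKE_PAGES = {
    "0": {"status": True, "articles": [{"id": 1}, {"id": 2}], "shown_offset": "10"},
    "10": {"status": True, "articles": [{"id": 3}], "shown_offset": "20"},
    "20": {"status": True, "articles": [], "shown_offset": ""},
}


def fake_fetch(offset):
    return FAKE_PAGES[offset]


result = crawl(fake_fetch, "0")
print([article["id"] for article in result])  # → [1, 2, 3]
```

Keeping the fetch step separate from the loop means the stop conditions (bad status, empty page, missing offset) can be checked offline before the crawler is pointed at the real API.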
Source: http://www.bubuko.com/infodetail-2906855.html