Set the crawl delay
settings.py
- from random import random
- DOWNLOAD_DELAY = random()*5
Note: the delay is computed once, when the settings file is loaded; it is not re-randomized for each request.
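The difference between a value fixed at load time and a per-request random wait can be sketched in plain Python. This is only an illustration, not Scrapy's code: `per_request_delay` is a hypothetical helper that mimics the documented behaviour of Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting (on by default), which waits between 0.5 and 1.5 times DOWNLOAD_DELAY before each request.

```python
import random

# Evaluated once, the same way settings.py is evaluated when it is imported.
DOWNLOAD_DELAY = random.random() * 5


def per_request_delay(base_delay):
    """Hypothetical helper: one jittered wait in [0.5 * base, 1.5 * base]."""
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


# The base value never changes during the run; only the jitter differs.
delays = [per_request_delay(DOWNLOAD_DELAY) for _ in range(5)]
for d in delays:
    print(f"base={DOWNLOAD_DELAY:.2f}s  wait={d:.2f}s")
```

So even with a base delay fixed at load time, the actual waits still vary from request to request as long as RANDOMIZE_DOWNLOAD_DELAY is left enabled.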
Disable cookies
- # Disable cookies (enabled by default)
- COOKIES_ENABLED = False
Note: the setting name is COOKIES_ENABLED; the COOKIES_ENABLES spelling is not a Scrapy setting and has no effect.
Set a random User-Agent header
In settings.py, add the list of User-Agent strings and enable the middleware
- USER_AGENT_LIST = [
- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
- "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
- "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
- "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
- "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
- "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
- "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
- "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
- "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
- "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
- "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
- ]
- DOWNLOADER_MIDDLEWARES = {
-     # 'tj_spider.middlewares.TjSpiderDownloaderMiddleware': 543,
-     'tj_spider.middlewares.RandomUserAgentMiddleware': 400,
- }
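The priority number matters because downloader middlewares run their process_request hooks in ascending order, and a header written with setdefault is kept by whichever middleware writes it first. The sketch below is a simulation with plain functions and dicts, not Scrapy objects. It assumes the built-in UserAgentMiddleware sits at priority 500, as in recent Scrapy releases (check DOWNLOADER_MIDDLEWARES_BASE for your version), and the header values are placeholders.

```python
# Simulation only: plain functions and a dict stand in for Scrapy objects.
def random_user_agent(headers):
    headers.setdefault("User-Agent", "ua-from-list")  # placeholder value


def builtin_user_agent(headers):
    headers.setdefault("User-Agent", "Scrapy-default")  # placeholder value


def run_process_request(chain):
    """Call each middleware in ascending priority order, as Scrapy does."""
    headers = {}
    for middleware in sorted(chain, key=chain.get):
        middleware(headers)
    return headers


# 400 comes from the settings above; 500 is the assumed built-in priority.
result = run_process_request({random_user_agent: 400, builtin_user_agent: 500})
print(result["User-Agent"])   # the random middleware ran first

# With a priority above 500 the built-in default would be set first.
swapped = run_process_request({random_user_agent: 600, builtin_user_agent: 500})
print(swapped["User-Agent"])
```

If the ordering is a concern, the built-in middleware can also be switched off by mapping 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware' to None in DOWNLOADER_MIDDLEWARES.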
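In middlewares.py, add the middleware that sets a random User-Agent
- from scrapy import signals
- from tj_spider.settings import USER_AGENT_LIST
- import random
- class RandomUserAgentMiddleware(object):
-     def process_request(self, request, spider):
-         # Pick one User-Agent at random for each outgoing request
-         rand_use = random.choice(USER_AGENT_LIST)
-         if rand_use:
-             # setdefault leaves an explicitly set User-Agent untouched
-             request.headers.setdefault('User-Agent', rand_use)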
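The middleware's behaviour can be checked without running a crawl. The following is a minimal sketch, assuming a stand-in `FakeRequest` class (hypothetical, not part of Scrapy) whose headers are a plain dict, and a shortened placeholder list in place of the real USER_AGENT_LIST.

```python
import random

# Placeholder values; the real list lives in settings.py.
USER_AGENT_LIST = ["agent-a", "agent-b", "agent-c"]


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, holding only headers."""

    def __init__(self, headers=None):
        self.headers = dict(headers or {})


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        rand_use = random.choice(USER_AGENT_LIST)
        if rand_use:
            # Only fills the header when the request has none yet.
            request.headers.setdefault("User-Agent", rand_use)


middleware = RandomUserAgentMiddleware()

plain = FakeRequest()
middleware.process_request(plain, spider=None)
print(plain.headers["User-Agent"])   # one of the placeholder agents

preset = FakeRequest({"User-Agent": "custom-agent"})
middleware.process_request(preset, spider=None)
print(preset.headers["User-Agent"])  # unchanged, because of setdefault
```

Because setdefault is used, a request that already carries its own User-Agent keeps it; assigning with `request.headers['User-Agent'] = rand_use` instead would override it on every request.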
Source: http://www.bubuko.com/infodetail-2762752.html