It's time to talk about the scrapy-redis framework. Before diving in, let's raise four questions:
We want to go distributed; what advantages does distributed crawling give us?
Why doesn't Scrapy support distributed crawling on its own?
What problems have to be solved to make Scrapy distributed?
How does scrapy-redis solve those problems?
Now let's answer them one by one:
A distributed crawler has two main advantages:
1) It makes full use of the bandwidth of multiple machines to speed up crawling.
2) It makes full use of the IP addresses of multiple machines to speed up crawling.
In Crawler Class (16) | Scrapy Framework Structure and How It Works, we already walked through Scrapy's run flow, shown in Figure 26-1 below:

Figure 26-1: Scrapy architecture diagram

1) When a Spider wants to crawl the page at some URL, it builds a Request object from that URL, sets a callback function, and submits the Request to the Scrapy Engine.
2) The Request enters the Scheduler, which queues it according to some policy; at each later tick the Scheduler dequeues a Request and sends it to the Downloader.

In Scrapy, this entire flow runs on a single machine: no other server can pull requests out of the current Scheduler's queue, and deduplication likewise happens in the current server's memory. That is why Scrapy does not support distributed crawling by itself.
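To make this concrete, here is a minimal sketch of a single-process scheduler with in-memory deduplication (a simplification for illustration only, not Scrapy's actual classes). Both the pending queue and the set of seen fingerprints live inside one process's memory, which is exactly why a second machine cannot join in:

class InMemoryScheduler:
    """Simplified stand-in for Scrapy's default scheduler plus dupefilter."""

    def __init__(self):
        self.queue = []    # pending requests, visible only to this process
        self.seen = set()  # request fingerprints, visible only to this process

    def enqueue_request(self, url):
        fp = hash(url)     # stand-in for Scrapy's real request fingerprint
        if fp in self.seen:
            return False   # duplicate: silently dropped
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def next_request(self):
        # The downloader on THIS machine is the only possible consumer.
        return self.queue.pop(0) if self.queue else None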
Based on the analysis above, making Scrapy distributed means solving three problems (see the sketch after this list):
1) The requests queue needs to be managed centrally.
2) The deduplication logic needs to be managed centrally.
3) The data-storage logic needs to be managed centrally.
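The fix for all three is the same: move that shared state out of process memory into a datastore every node can reach, namely Redis. Before looking at how scrapy-redis does it, here is a minimal hand-rolled sketch of the idea using the redis-py client (the host, port, and 'myspider:*' key names are assumptions for illustration, though the key names happen to match scrapy-redis's defaults):

import json

import redis

# One Redis server shared by all crawler machines (address is an assumption).
r = redis.Redis(host='localhost', port=6379)

def push_request(url):
    # Centralized dedup: SADD returns 1 only if no node has seen this URL yet.
    if r.sadd('myspider:dupefilter', url):
        # Centralized queue: any node may consume what any node produced.
        r.lpush('myspider:requests', url)

def pop_request():
    # Whichever node is free pulls the next request from the shared queue.
    raw = r.rpop('myspider:requests')
    return raw.decode('utf-8') if raw else None

def store_item(item):
    # Centralized storage: all nodes' items end up in one list.
    r.rpush('myspider:items', json.dumps(item))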
So how does scrapy-redis solve these problems?
Let's start with the scrapy-redis GitHub page, https://github.com/rmax/scrapy-redis . Its Usage section clearly lists the settings you need to configure:
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start urls list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'
These settings boil down to three key pieces: SCHEDULER handles the queue (task distribution), DUPEFILTER_CLASS handles deduplication (task dedup), and RedisPipeline handles persistence (data storage).
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}
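With these three settings enabled, the shared state lives under a handful of Redis keys. While a crawl is running, you can peek at them with redis-py as a sanity check (the key names below are the library's documented defaults for a spider named 'myspider'; adjust them if you override the corresponding settings):

import redis

r = redis.Redis(host='localhost', port=6379)
print(r.zcard('myspider:requests'))    # pending requests; the default PriorityQueue is a sorted set
print(r.scard('myspider:dupefilter'))  # fingerprints of requests already scheduled
print(r.llen('myspider:items'))        # items appended by RedisPipeline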
Creating the spider itself also needs one adjustment.
A regular, non-distributed spider looks like this:
from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'

    def parse(self, response):
        # do stuff
        pass
To make it distributed, change the base class from Spider to RedisSpider:
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'

    def parse(self, response):
        # do stuff
        pass
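One practical difference: a RedisSpider does not read start_urls from the class; it pops its seed URLs from a Redis list, named '<spider name>:start_urls' by default (see REDIS_START_URLS_KEY above). So after starting the spider processes, you kick off the crawl by pushing URLs into that list, for example with redis-py ('http://example.com' is just a placeholder):

import redis

# Seed the shared start-urls list; idle spider nodes will pick work up from here.
r = redis.Redis(host='localhost', port=6379)
r.lpush('myspider:start_urls', 'http://example.com')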
Apologies, but due to time constraints this chapter stops here; it has been a late night of overtime. Readers, get some rest too, and we'll continue tomorrow.
In the next chapter, we'll read through the scrapy-redis source code to see in more detail how it solves these three problems: distributing tasks, deduplicating tasks, and gathering all the spiders' scraped data in one place.
Source: http://www.jianshu.com/p/ec80c267d12b