The Scrapy Framework: CrawlSpider

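Introduction to CrawlSpider

Suppose you want a crawler to collect the data of the entire "Qiushibaike" (糗百) site. What approaches are available?

Approach 1: recursive crawling with Scrapy's base Spider class (the parse method yields Request objects whose callback is parse itself).

Approach 2: automatic crawling with CrawlSpider (more concise, because link following is declared as rules instead of being written by hand).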

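I. Overview

CrawlSpider is a subclass of Spider. Besides inheriting Spider's features, it adds more powerful features of its own, the most notable being the link extractor (LinkExtractor). Spider is the base class of all spiders and is designed only to crawl the pages listed in start_urls. When you need to keep crawling the URLs extracted from pages that have already been fetched, CrawlSpider is the better fit.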

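II. Usage

1. Create a Scrapy project: scrapy startproject projectName

2. Create the spider file: scrapy genspider -t crawl spiderName www.xxx.com

-- Compared with the usual command, this one adds "-t crawl", which means the generated spider file is based on the CrawlSpider class rather than the base Spider class.

3. Inspect the generated spider file

The spider file (.py):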

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# CrawlSpider is imported instead of Spider, together with LinkExtractor (the link extractor) and Rule (the rule parser)
class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://dig.chouti.com/r/scoff/hot/1']
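    # allow takes a regular expression; links whose URL matches it are extracted
    # Rule sends a request for every link the extractor returns and passes each response to the callback
    # follow controls whether the rules are applied again to the pages reached through those links (i.e. whether to keep extracting URLs across the whole site)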
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item

Example 1: (whole-site extraction)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://dig.chouti.com/r/scoff/hot/1']
    # Defining the extractor separately reads better
    link=LinkExtractor(allow=r'/r/scoff/hot/\d+')
    rules = (
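        # follow=True so the rule is also applied to each page it reaches; with
        # follow=False only the links found on the start page would be requested
        Rule(link, callback='parse_item', follow=True),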
    )
    def parse_item(self, response):
        print(response)
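# This single pattern covers every listing page we want, including the first one,
# because the start page URL already carries a page number: https://dig.chouti.com/r/scoff/hot/1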

Example 2: (the first page has no page number in its URL)

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    # Defining the extractors separately reads better
    link=LinkExtractor(allow=r'/text/page/\d+/')
    link1=LinkExtractor(allow=r'/text/')
    rules = (
        Rule(link,callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        print(response)
# Note the start URL, https://www.qiushibaike.com/text/
# The first page carries no page number
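
Why two rules are needed can be checked with the standard re module. As far as I know, LinkExtractor tests each allow pattern with a regex search against the link's absolute URL, so re.search is used here as a stand-in for it. The second and third sample URLs are made up for illustration.

```python
import re

# The two allow patterns from Example 2.
paged = re.compile(r'/text/page/\d+/')
first = re.compile(r'/text/')

urls = [
    "https://www.qiushibaike.com/text/",         # start URL, no page number
    "https://www.qiushibaike.com/text/page/2/",  # hypothetical paginated URL
    "https://www.qiushibaike.com/pic/",          # hypothetical link to another section
]

# Keep the URLs each pattern would accept.
matched_paged = [u for u in urls if paged.search(u)]
matched_first = [u for u in urls if first.search(u)]
```

The paged pattern misses the unnumbered first page, which is why the second rule exists. The looser /text/ pattern accepts the paginated URL as well, which is the kind of over-matching that anchoring a pattern avoids.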

Example 3: (the regex matches many similar URLs, so anchor its start or end)

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/pic/']
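    # Defining the extractors separately reads better
    # Remember to escape the "?" with a backslash
    link = LinkExtractor(allow=r'/pic/page/\d+\?s=')
    # Matching the first page with a bare /pic/ would also pick up many unrelated links, so anchor the end with $
    link1 = LinkExtractor(allow=r'/pic/$')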
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        print(response)
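
The effect of escaping "?" and of anchoring with "$" can be checked with plain re, again assuming the allow patterns are applied as a regex search over the URL. The query value in url_page and the whole of url_other are hypothetical examples, not URLs taken from the site.

```python
import re

url_page = "https://www.qiushibaike.com/pic/page/2?s=5180000"  # hypothetical query value
url_first = "https://www.qiushibaike.com/pic/"
url_other = "https://www.qiushibaike.com/pic/12345/detail"     # hypothetical unrelated link

escaped = re.compile(r'/pic/page/\d+\?s=')   # "\?" matches a literal question mark
unescaped = re.compile(r'/pic/page/\d+?s=')  # bare "?" only makes \d+ lazy
anchored = re.compile(r'/pic/$')             # URL must end right after /pic/
loose = re.compile(r'/pic/')                 # /pic/ anywhere in the URL

escaped_hits_page = escaped.search(url_page) is not None
unescaped_hits_page = unescaped.search(url_page) is not None
anchored_hits_first = anchored.search(url_first) is not None
anchored_hits_other = anchored.search(url_other) is not None
loose_hits_other = loose.search(url_other) is not None
```

Without the backslash the pagination URL is not matched at all, and without the trailing "$" the first-page pattern also accepts deeper /pic/ links.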

Note: if allow is left empty, every URL in the page is matched.

Source: http://www.bubuko.com/infodetail-2974291.html
