Scrapy Crawler Framework, Lecture 7 [Using Item Pipelines]

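Item Pipeline usage in detail:

What an Item Pipeline is for:

Cleaning HTML data

Validating scraped data (checking that an item contains certain fields)

Deduplicating (and dropping duplicates) [this is only a safety net at the data level; the real deduplication happens on URLs, at the request stage]

Saving scraped results to a database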

Core Item Pipeline methods (4 of them)

(1) open_spider(spider)
(2) close_spider(spider)
(3) from_crawler(cls, crawler)
(4) process_item(item, spider)

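Let's go through them one by one, friends:

1. open_spider(spider) [spider: the Spider object that was opened]

Optional. Called when the Spider is opened; mainly used for initialization work such as connecting to a database.

2. close_spider(spider) [spider: the Spider object that was closed]

Optional. Called when the Spider is closed; mainly used for cleanup work such as closing the database connection.

3. from_crawler(cls, crawler) [first parameter: the class itself; second parameter: the crawler object]

Optional. Called when the pipeline is created, before open_spider(). It is a class method, marked with @classmethod, and it is tied to the __init__ function: it builds and returns the pipeline instance, usually passing values read from crawler.settings into __init__. We will not go into detail here (in general we leave it unchanged).

4. process_item(item, spider) [first parameter: the Item object being processed; second parameter: the Spider object that produced the Item]

Required. Scrapy calls this method on every enabled Item Pipeline to process each Item. It must either return an item object or raise a DropItem exception.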
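To make the call order of the four methods concrete, here is a minimal sketch that runs without Scrapy. The names FakeCrawler, run_demo, and the PIPELINE_LABEL setting are hypothetical and exist only for this illustration; in a real project Scrapy itself drives these calls and passes real Crawler and Spider objects.

```python
class LifecyclePipeline:
    """Records the order in which the four hook methods run."""

    calls = []  # shared log, for demonstration only

    def __init__(self, label):
        self.label = label
        self.calls.append("__init__")

    @classmethod
    def from_crawler(cls, crawler):
        # Read a value from the settings and hand it to __init__.
        cls.calls.append("from_crawler")
        return cls(label=crawler.settings.get("PIPELINE_LABEL", "demo"))

    def open_spider(self, spider):
        self.calls.append("open_spider")

    def process_item(self, item, spider):
        self.calls.append("process_item")
        return item

    def close_spider(self, spider):
        self.calls.append("close_spider")


class FakeCrawler:
    """Hypothetical stand-in for Scrapy's Crawler; only exposes settings."""
    settings = {"PIPELINE_LABEL": "demo"}


def run_demo(items):
    """Hand-rolled driver that mimics the order described above."""
    LifecyclePipeline.calls.clear()
    spider = object()  # placeholder; a real run passes the Spider instance
    pipeline = LifecyclePipeline.from_crawler(FakeCrawler())
    pipeline.open_spider(spider)
    out = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return out, list(LifecyclePipeline.calls)


if __name__ == "__main__":
    _, order = run_demo([{"title": "a"}, {"title": "b"}])
    print(order)
```

The point of the sketch is only the ordering: from_crawler() creates the object (which runs __init__), open_spider() runs once, process_item() runs once per item, and close_spider() runs once at the end.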

Worked examples (the examples below come from the official documentation: https://doc.scrapy.org/en/latest/topics/item-pipeline.html)

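from scrapy.exceptions import DropItem


class PricePipeline(object):
    # VAT multiplier applied to prices that do not include tax yet
    vat_factor = 1.15

    def process_item(self, item, spider):
        # .get() avoids a KeyError when the field was never set
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            # No usable price: discard the item
            raise DropItem("Missing price in %s" % item)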

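Code walkthrough:

First, a PricePipeline class is defined.

It sets the VAT factor to 1.15.

The main method, process_item, does the following: it checks the item's price field. If a price is present and it does not yet include VAT, the price is multiplied by 1.15; the item is then returned. If the price is missing, the item is dropped.

Summary: this method either returns the item so that the next pipeline can process it, or raises an exception.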
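The "return it or raise" rule is easiest to see with two pipelines in a row. Below is a rough sketch, not Scrapy's actual internals: run_chain is a hypothetical driver written for this illustration, RequirePrice and AddVat are made-up pipeline names that split the price example into two steps, and the try/except import only provides a local stand-in for DropItem so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        """Local stand-in for scrapy.exceptions.DropItem."""


class RequirePrice:
    """First stage: reject items that have no price."""

    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("Missing price in %s" % item)
        return item


class AddVat:
    """Second stage: add VAT to prices that exclude it."""

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get("price_excludes_vat"):
            item["price"] = item["price"] * self.vat_factor
        return item


def run_chain(item, pipelines, spider=None):
    """Hypothetical driver: pass the item through each pipeline in order.

    Returns the processed item, or None if some pipeline dropped it.
    """
    for pipeline in pipelines:
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None  # later pipelines never see a dropped item
    return item


if __name__ == "__main__":
    chain = [RequirePrice(), AddVat()]
    print(run_chain({"price": 100, "price_excludes_vat": True}, chain))
    print(run_chain({"price": 0}, chain))
```

An item without a price never reaches AddVat, which is exactly why a pipeline must return the item when it wants processing to continue.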

Deduplicating data

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    def __init__(self):
        # IDs of every item seen so far in this run
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

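Code walkthrough:

First, a DuplicatesPipeline class is defined.

Compared with the previous example there is now an __init__ initializer, which creates a set(). A set stores each value only once, which is what makes the duplicate check possible.

The main method, process_item, first checks whether the item's id has been seen before. If it has, the item is dropped; otherwise the id is added to the set and the item is passed on to the next pipeline.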

Writing data to a file

import json


class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # One JSON object per line (JSON Lines format)
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

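Code walkthrough:

First, we define a JsonWriterPipeline class.

It defines three methods:

First: open_spider() runs when the Spider is opened. Its job is simple: open the file and get ready to write data.

Second: close_spider() runs when the Spider is closed. Its job is just as simple: close the file.

Third (the main one): process_item() converts the item to a dict, serializes it to a JSON string with json.dumps(), writes that string to the file as one line, and finally returns the item to the next pipeline.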
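Since the project later in this lecture scrapes Chinese text, two details of the file format are worth a quick sketch: json.dumps() escapes non-ASCII characters unless ensure_ascii=False is passed, and a .jl file is read back one line at a time. The helper names write_jsonl and read_jsonl and the sample record are hypothetical, made up for this illustration only.

```python
import json
import os
import tempfile


def write_jsonl(path, items):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, parsing each non-empty line as its own JSON document."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Sample record invented for the demo; not real scraped data.
    path = os.path.join(tempfile.mkdtemp(), "items.jl")
    write_jsonl(path, [{"title": "两室一厅", "price": 1500}])
    print(read_jsonl(path))
```

The same two arguments (encoding="utf-8" on open() and ensure_ascii=False on json.dumps()) can be added to the pipeline above if readable Chinese text in items.jl is wanted.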

A combined example

import pymongo


class MongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() replaces insert(), which was removed in PyMongo 4
        self.db[self.collection_name].insert_one(dict(item))
        return item

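Code walkthrough:

First, we define a MongoPipeline class.

Here we changed the __init__ initializer so that it takes the MongoDB connection URI and the database name. Because of that we also changed the from_crawler() factory method (the one that creates the pipeline object): it reads the connection URI and the database name from the settings and passes them in.

Finally, we define three methods:

First: open_spider() runs when the Spider is opened and connects to the MongoDB database.

Second: close_spider() runs when the Spider is closed and closes the database connection.

Third: process_item() inserts the item into the database.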
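For from_crawler() to find anything, the two setting names it reads have to exist in settings.py. A minimal fragment might look like the one below; the setting names MONGO_URI and MONGO_DATABASE come from the code above, while the values are assumptions for a MongoDB server running locally on its default port.

```python
# settings.py (fragment)
# Assumed values: a local MongoDB on the default port, database "items".
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "items"
```

If MONGO_DATABASE is left out, the pipeline above falls back to 'items' anyway, because that is the default passed to crawler.settings.get().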

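Hands-on project (using rental listings for Zhenjiang on 58.com (58 同城) as the example): scrape each listing's title, price, and detail-page URL.

I ran this on Ubuntu 16.04.

Open a terminal and activate the virtual environment: source course_python3.5/bin/activate

Create a new directory named project: mkdir project

Create the project with scrapy startproject city58, then cd city58, then create the spider with scrapy genspider city58_test (note that the spider name must not be the same as the project name; genspider also expects the target domain as a second argument, which is not shown here).

Now let's get started for real.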

(1) Edit items.py

(2) Edit the city58_test.py file (here we use the pyquery selector library)

(3) Now the key part: edit the pipelines.py file. You can refer to the examples analyzed above.
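As a rough idea of what pipelines.py could contain for this project, here is a sketch that combines validation, light cleaning, and writing to a file. It is not the original author's code: the field names title, price, and url are assumed from the project description, the output file name city58.jl is made up, and the try/except import only supplies a local stand-in for DropItem so the sketch runs where Scrapy is not installed.

```python
import json

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        """Local stand-in for scrapy.exceptions.DropItem."""


class City58Pipeline:
    """Validate, tidy, and store rental listings as JSON lines."""

    required_fields = ("title", "price", "url")  # assumed field names

    def __init__(self, path="city58.jl"):  # assumed output file name
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        for field in self.required_fields:
            value = item.get(field)
            if isinstance(value, str):
                value = value.strip()  # remove stray whitespace from the page
            if not value:
                raise DropItem("Missing %s in %s" % (field, item))
            item[field] = value
        # Keep Chinese text readable in the output file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Only items that carry all three fields reach the file; everything else is dropped before it is written.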

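(4) Finally, enable the pipeline in settings.py

A quick tip for you here: the number after each pipeline is its priority. The smaller the number, the earlier that pipeline runs.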
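The setting in question is ITEM_PIPELINES, a dict that maps each pipeline's import path to its priority number. The fragment below is a sketch under assumptions: City58Pipeline is the class name that scrapy startproject city58 would normally generate, and the DuplicatesPipeline entry is hypothetical, added only to show how two priorities compare.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    "city58.pipelines.DuplicatesPipeline": 200,  # hypothetical second pipeline
    "city58.pipelines.City58Pipeline": 300,
}

# Illustration only: lower numbers come first in the processing order.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

With these numbers the duplicate check would run before the storage pipeline, so duplicates are dropped before anything is written. By convention the values are chosen from the 0 to 1000 range.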

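(5) Project run results (partial). Next time you want rental listings, friends, come find me and I'll grab them for you in seconds. Haha!

We can also find the file we wrote in the same directory.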

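Summary:

(1) First, we learned what pipelines are for.

(2) We covered the core methods, process_item() in particular.

(3) Finally, we practiced with examples and a project. Later on we will keep learning how to use pipelines for more advanced work, so stay tuned. And remember: at the end you must enable the item pipeline in the settings file (ITEM_PIPELINES).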

Source: https://www.cnblogs.com/518894-lu/p/9053939.html
