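Being a Developer with Product Thinking: Installing Scrapy

Ten minutes a day, one development problem solved.

If you would like to know what I am working on, see "Being a Developer with Product Thinking: Course Outline" here: https://www.cnblogs.com/hunttown/p/10490965.html

Today we look at the Scrapy crawler: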

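At the time of writing, Scrapy runs on Python 2.7 and Python 3.3 or later (newer releases support Python 3 only, so check the documentation for the version you install). It is written in pure Python and depends on a few key Python packages, among others:

1. lxml, an efficient XML and HTML parser

2. parsel, an HTML/XML data extraction library built on top of lxml

3. w3lib, a multi-purpose helper for dealing with URLs and web page encodings

4. twisted, an asynchronous networking framework

5. cryptography and pyOpenSSL, to handle various network-level security needs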

The minimum versions Scrapy is tested against are:

a. Twisted 14.0
b. lxml 3.4
c. pyOpenSSL 0.14
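
To make the minimum-version list concrete, here is a minimal sketch that compares whatever is installed against those three minimums. It assumes Python 3.8 or later (for importlib.metadata); the helper names version_tuple, meets_minimum and check_dependencies are made up for this post and are not part of Scrapy.

```python
from importlib import metadata  # standard library since Python 3.8 (assumption)

# Minimum versions quoted in the list above.
MINIMUMS = {"Twisted": "14.0", "lxml": "3.4", "pyOpenSSL": "0.14"}


def version_tuple(text):
    """Turn '18.9.0' into (18, 9, 0); stop at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_dependencies(minimums=MINIMUMS):
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

Calling check_dependencies() after installation gives a quick overview: True means the package is new enough, False means it is too old, and None means it was not found at all.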

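I. Installing on Linux is recommended; you can set up a virtual machine for this

1. Installing CentOS 6.5 on VMWare: https://www.cnblogs.com/hunttown/p/5450343.html

II. Once the Linux environment is ready, you also need to install pip, which the commands below rely on

2. Installing pip on CentOS 6.5: https://www.cnblogs.com/hunttown/p/9626827.html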

III. Install the dependencies

[MyCentOS6 ~]$ sudo yum install python-devel python-pip libxml2-devel libxslt-devel zlib-devel libffi-devel openssl-devel

IV. Since python3 was already set up when we installed pip, we can install Scrapy directly

[MyCentOS6 ~]$ pip install scrapy

V. After that, install a few more packages that we will need later

1. Installing simplejson on Linux: https://www.cnblogs.com/hunttown/p/9796242.html

2. Installing pymysql on Linux: https://www.cnblogs.com/hunttown/p/9828754.html

3. Installing jieba on Linux: https://www.cnblogs.com/hunttown/p/9828893.html

4. Installing Gensim on Linux: https://www.cnblogs.com/hunttown/p/9828946.html

5. Installing .whl files on Linux: https://www.cnblogs.com/hunttown/p/9829152.html
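
Once Scrapy and the extra packages are in place, it can be handy to confirm that Python can actually find them. The following is a minimal sketch using only the standard library; missing_modules is a hypothetical helper name, and the module names passed to it are the usual import names of the packages listed above.

```python
import importlib.util


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    wanted = ["scrapy", "simplejson", "pymysql", "jieba", "gensim"]
    missing = missing_modules(wanted)
    if missing:
        print("Still missing: %s" % ", ".join(missing))
    else:
        print("All packages can be imported.")
```

An empty result means everything is importable. The check only looks modules up and does not import them, so it should stay fast even for heavy packages such as gensim.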

VI. Create a crawler project

[MyCentOS6 ~]$ scrapy startproject tutorial
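
The command creates an outer tutorial directory holding scrapy.cfg and an inner Python package, also named tutorial, that contains the spiders folder. Here is a small sketch of those paths, assuming Python 3.4 or later for pathlib; expected_layout and spider_file are illustrative helper names, not Scrapy functions.

```python
from pathlib import Path


def expected_layout(project_name, base_dir="."):
    """Paths that `scrapy startproject <project_name>` is expected to create."""
    root = Path(base_dir) / project_name   # outer directory, holds scrapy.cfg
    package = root / project_name          # inner Python package
    return {
        "config": root / "scrapy.cfg",
        "settings": package / "settings.py",
        "spiders_dir": package / "spiders",
    }


def spider_file(project_name, filename, base_dir="."):
    """Where a new spider module should be saved."""
    return expected_layout(project_name, base_dir)["spiders_dir"] / filename
```

For this project, spider_file("tutorial", "quotes_spider.py") points at tutorial/tutorial/spiders/quotes_spider.py, relative to the directory where startproject was run.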

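Then create a file named quotes_spider.py in the project's tutorial/spiders directory (that is, tutorial/tutorial/spiders relative to where you ran startproject):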

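import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique spider name; used on the command line: scrapy crawl quotes
    name = "quotes"

    def start_requests(self):
        # Initial requests; each downloaded response is handed to self.parse
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # URL ".../page/1/" -> page "1" -> file "quotes-1.html"
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        # response.body is bytes, so the file is opened in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)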

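The Spider above subclasses scrapy.Spider and defines a few attributes and methods:

a. name: identifies the spider. It must be unique within a project; in other words, you cannot give the same name to two different Spiders.

b. start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) from which the Spider starts crawling. Subsequent requests are generated successively from these initial ones.

c. parse(): the method called to handle the response downloaded for each request. The response parameter is an instance of TextResponse, which holds the page content and offers further helpful methods for working with it.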
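
The one line in parse() that tends to confuse people is the way the page number is cut out of the URL. Below is a standalone sketch of that logic, plus a slightly more forgiving variant. It assumes Python 3 (urllib.parse), and the function names are made up for illustration.

```python
from urllib.parse import urlparse


def page_filename(url):
    """Same logic as parse(): take the second-to-last '/'-separated piece."""
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


def page_filename_robust(url):
    """Variant that also copes with URLs lacking a trailing slash."""
    page = urlparse(url).path.rstrip("/").split("/")[-1]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))         # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2"))          # quotes-page.html
print(page_filename_robust("http://quotes.toscrape.com/page/2"))   # quotes-2.html
```

The original version relies on the trailing slash: without it, index -2 lands on "page" rather than the number. For the two start URLs used here that never happens, so the simple version is fine.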

Run the spider from the project's top-level directory:

[MyCentOS6 ~]$ scrapy crawl quotes

If you want to store the scraped data in a file:

[MyCentOS6 ~]$ scrapy crawl quotes -o quotes.json
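
One thing to keep in mind: the -o option exports the items that parse() yields, and the spider above only saves raw HTML without yielding anything, so there would be no items to export. Here is a minimal sketch of a callback body that does yield items. The CSS selectors (div.quote, span.text, small.author) are assumptions about the markup of quotes.toscrape.com, .get() assumes a reasonably recent Scrapy/parsel, and parse_quotes is a hypothetical helper name.

```python
def parse_quotes(response):
    """Yield one dict per quote block found in a Scrapy response.

    `response` is expected to behave like scrapy.http.TextResponse,
    i.e. to offer a .css() method returning selectors.
    """
    for quote in response.css("div.quote"):
        yield {
            "text": quote.css("span.text::text").get(),
            "author": quote.css("small.author::text").get(),
        }


# Hypothetical wiring inside QuotesSpider (Python 3):
#     def parse(self, response):
#         yield from parse_quotes(response)
```

With parse() wired up like this, each yielded dict becomes one entry in quotes.json.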

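Today's summary:

1. The original approach was to use urllib2: fetch the URL, read the HTML, then pull out the data you want with regular expressions.

2. Today there is Scrapy, a fast, high-level web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a very wide range of uses.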
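
For contrast, the older style from point 1 might look roughly like the sketch below. It uses urllib.request, the Python 3 successor to urllib2, and a regular expression that assumes each quote sits in a span with class "text"; both function names are made up for this example.

```python
import re
from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen

# Assumed markup: <span class="text" ...>quote text</span>
QUOTE_PATTERN = re.compile(r'<span class="text"[^>]*>(.*?)</span>', re.S)


def fetch(url):
    """Download a page and decode it as UTF-8 (the encoding is an assumption)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def extract_quotes(html):
    """Pull the quote texts out of raw HTML with a regular expression."""
    return [match.strip() for match in QUOTE_PATTERN.findall(html)]
```

This works for a single page, but scheduling requests, retries, encodings and export formats all have to be written by hand, which is exactly the part Scrapy takes over.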

Source: http://www.bubuko.com/infodetail-3038729.html
