Scrapy is a mature crawling framework, and once you actually use it, it turns out to be far less difficult than you might expect. Even for small projects, Scrapy can be more convenient, simpler, and more efficient than requests/urllib/urllib2. Without further ado, the following walks through how to use Scrapy to crawl the picture galleries from the Mzitu site and save them to your hard drive. Installing Python and Scrapy, and how Scrapy works internally, are not covered here; please consult Google/Baidu if you need that background.
I. Development Tools
- Pycharm 2017
- Python 2.7
- Scrapy 1.5.0
- requests
II. Crawling Process
1. Create the mzitu project
Go to the E:\Code\PythonSpider directory and run the scrapy startproject mzitu command to create a new crawler project:
```
scrapy startproject mzitu
```
After the command finishes, the generated directory structure looks like this:
- mzitu
  - mzitu
    - __init__.py
    - items.py
    - middlewares.py
    - pipelines.py
    - settings.py
    - spiders
      - __init__.py
      - Mymzitu.py
  - scrapy.cfg
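Note that `scrapy startproject` only generates the project skeleton; `spiders/Mymzitu.py` is not created automatically. You can add the file by hand (as in step 3 below) or scaffold it with `scrapy genspider Mymzitu www.mzitu.com` run from inside the project directory.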
2. Enter the mzitu project and edit items.py
Define title to store the name of the directory a gallery's images are saved into.
Define img to store the image URL.
Define name to store the image file name.
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MzituItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    img = scrapy.Field()
    name = scrapy.Field()
```
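Since a scrapy.Item behaves like a dict, the definition can be sanity-checked in a Python shell. The values below are made-up placeholders, not data from the site:

```python
from mzitu.items import MzituItem

# Fields declared with scrapy.Field() can be set like dict keys;
# assigning any other key raises KeyError.
item = MzituItem(title='example-gallery', name='01.jpg')
item['img'] = 'http://example.com/a/01.jpg'  # hypothetical URL, for illustration only
print(dict(item))
```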
3. Edit spiders/Mymzitu.py
```python
# -*- coding: utf-8 -*-
import scrapy
from mzitu.items import MzituItem
from lxml import etree
import requests
import sys

# Python 2 workaround so non-ASCII gallery titles don't trigger encoding errors.
reload(sys)
sys.setdefaultencoding('utf8')


class MymzituSpider(scrapy.Spider):

    def get_urls():
        # Fetch the front page once at class-definition time and collect the gallery URLs.
        url = 'http://www.mzitu.com'
        headers = {}
        headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
        r = requests.get(url, headers=headers)
        html = etree.HTML(r.text)
        urls = html.xpath('//*[@id="pins"]/li/a/@href')
        return urls

    name = 'Mymzitu'
    allowed_domains = ['www.mzitu.com']
    start_urls = get_urls()

    def parse(self, response):
        item = MzituItem()
        #item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract()
        item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract().split('(')[0]
        item['img'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract()
        item['name'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract().split('/')[-1]
        yield item

        # Follow the next-page link so the whole gallery is crawled.
        next_url = response.xpath('//div[@class="pagenavi"]/a/@href')[-1].extract()
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
```
What we want to crawl are the newest galleries on the Mzitu site, whose front page is http://www.mzitu.com. Looking at the page source shows that each gallery's URL sits inside an <li> tag; the get_urls function in the code above collects them and returns a list of URLs. One thing worth stressing: to write crawlers in Python you need to master at least one of re, XPath, or Beautiful Soup, otherwise you cannot really get started. Here XPath is used to extract the URLs, and both lxml and Scrapy support XPath.
```python
def get_urls():
    url = 'http://www.mzitu.com'
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    urls = html.xpath('//*[@id="pins"]/li/a/@href')
    return urls
```
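To see what that expression actually selects, here is a tiny self-contained lxml example against a made-up HTML fragment that mimics the structure of the index page (the real page is of course much larger):

```python
from lxml import etree

# Minimal stand-in for the index page: a ul#pins with one gallery link per li.
snippet = '''
<ul id="pins">
  <li><a href="http://www.mzitu.com/101"><img src="a.jpg"/></a></li>
  <li><a href="http://www.mzitu.com/102"><img src="b.jpg"/></a></li>
</ul>
'''
html = etree.HTML(snippet)
print(html.xpath('//*[@id="pins"]/li/a/@href'))
# -> ['http://www.mzitu.com/101', 'http://www.mzitu.com/102']
```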
name defines the spider's name, allowed_domains is the list of domains the spider is allowed to crawl, and start_urls is the list of URLs the crawl starts from.
```python
name = 'Mymzitu'
allowed_domains = ['www.mzitu.com']
start_urls = get_urls()
```
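Because start_urls = get_urls() runs at class-definition time, the index page is downloaded with requests the moment the module is imported. That works, but an alternative worth knowing (shown only as a sketch, not the author's approach) is to let Scrapy fetch the index page itself and extract the gallery links in a callback; parse_index is a hypothetical method name:

```python
    # These two methods would live inside MymzituSpider, replacing the
    # requests/lxml-based get_urls() and the start_urls assignment.
    def start_requests(self):
        yield scrapy.Request('http://www.mzitu.com', callback=self.parse_index)

    def parse_index(self, response):
        # Same XPath as get_urls(), but evaluated with Scrapy's own selector.
        for href in response.xpath('//*[@id="pins"]/li/a/@href').extract():
            yield scrapy.Request(href, callback=self.parse)
```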
The parse callback analyses a gallery detail page, extracts the gallery title, the image URL, and the image file name, and at the same time grabs the next-page link so the whole gallery is crawled in a loop:
```python
def parse(self, response):
    item = MzituItem()
    #item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract()
    item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract().split('(')[0]
    item['img'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract()
    item['name'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract().split('/')[-1]
    yield item

    next_url = response.xpath('//div[@class="pagenavi"]/a/@href')[-1].extract()
    if next_url:
        yield scrapy.Request(next_url, callback=self.parse)
```
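One caveat: the [0] and [-1] indexing raises IndexError whenever an XPath matches nothing, for example on a page with a slightly different layout. A slightly more defensive variant of the same callback, offered only as a sketch, uses extract_first() and response.urljoin():

```python
    def parse(self, response):
        item = MzituItem()
        title = response.xpath('//h2[@class="main-title"]/text()').extract_first()
        img = response.xpath('//div[@class="main-image"]/p/a/img/@src').extract_first()
        if title and img:
            item['title'] = title.split('(')[0]
            item['img'] = img
            item['name'] = img.split('/')[-1]
            yield item
        # urljoin keeps the request valid even if the site ever uses relative links.
        next_links = response.xpath('//div[@class="pagenavi"]/a/@href').extract()
        if next_links:
            yield scrapy.Request(response.urljoin(next_links[-1]), callback=self.parse)
```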
4. Edit pipelines.py to download the images
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import requests
import os


class MzituPipeline(object):
    def process_item(self, item, spider):
        headers = {
            'Referer': 'http://www.mzitu.com/'
        }
        local_dir = 'E:\\data\\mzitu\\' + item['title']
        local_file = local_dir + '\\' + item['name']
        if not os.path.exists(local_dir):
            os.makedirs(local_dir)
        with open(local_file, 'wb') as f:
            f.write(requests.get(item['img'], headers=headers).content)
        return item
```
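Two small hardening ideas for the pipeline, sketched below under the same assumptions (local target directory E:\data\mzitu, downloads via requests): build the paths with os.path.join instead of hand-concatenated backslashes, and skip writing the file when the response is not HTTP 200.

```python
# -*- coding: utf-8 -*-
import os

import requests


class MzituPipeline(object):
    # Same behaviour as above, with portable path handling and a basic status check.
    def process_item(self, item, spider):
        headers = {'Referer': 'http://www.mzitu.com/'}
        local_dir = os.path.join('E:\\data\\mzitu', item['title'])
        local_file = os.path.join(local_dir, item['name'])
        if not os.path.exists(local_dir):
            os.makedirs(local_dir)
        r = requests.get(item['img'], headers=headers, timeout=10)
        if r.status_code == 200:
            with open(local_file, 'wb') as f:
                f.write(r.content)
        else:
            spider.logger.warning('Skipping %s (HTTP %s)', item['img'], r.status_code)
        return item
```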
5. Add a RotateUserAgentMiddleware class to middlewares.py (it needs the random module and UserAgentMiddleware imported at the top of the file):
```python
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random User-Agent from the list below for every request.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # A list of Chrome user agents on several platforms; for more user agent
    # strings see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
```
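To confirm that the rotation actually happens, one optional tweak (not part of the original middleware) is to log the chosen agent inside process_request using Scrapy's per-spider logger:

```python
    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # Debug aid only: shows which agent each outgoing request uses.
            spider.logger.debug('Using User-Agent: %s', ua)
```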
6. settings.py configuration
```python
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

DOWNLOADER_MIDDLEWARES = {
    'mzitu.middlewares.MzituDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'mzitu.middlewares.RotateUserAgentMiddleware': 400,
}
```
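One thing the snippet above does not show: for the MzituPipeline from step 4 to run at all, it also has to be enabled in settings.py, since the default template leaves ITEM_PIPELINES commented out. Assuming the project layout generated above:

```python
ITEM_PIPELINES = {
    'mzitu.pipelines.MzituPipeline': 300,
}
```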
7. Run the spider
Go into the E:\Code\PythonSpider\mzitu directory and run the scrapy crawl Mymzitu command to start the crawler.
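If you also want a record of what was scraped, appending `-o items.json` to the command (a standard Scrapy option) exports the title/img/name fields of every item to a JSON file alongside the downloaded images.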
Source: https://www.cnblogs.com/Eivll0m/p/8453842.html