Getting Started with Scrapy, the Python Scraping Framework: Page Extraction

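Scrapy's appeal is that it is a framework anyone can readily adapt to their own needs. This article introduces page extraction with Scrapy and walks through it in detail with sample code.

Preface

Scrapy is a very good scraping framework. It not only provides basic components that work out of the box, but can also be heavily customized to your own requirements. This article covers page extraction with Scrapy and is shared here for reference and study.

Before starting, for an introduction to the Scrapy framework, see this article: /article/16/0804/241147.html

Below we create a crawler project that scrapes images, using Tuchong (图虫网) as the example.

1. Content analysis

Open Tuchong. Under "Discover" (发现) > "Tags" (标签) in the top menu, images are grouped by category. Clicking a tag, for example "美女" (beauties), leads to https://tuchong.com/tags/美女/ , which we use as the crawler's entry point. Let's analyze that page:

The page shows a series of albums. Clicking an album displays its images full screen, and scrolling down loads more albums; there are no numbered pagination controls. In Chrome, right-click and choose "Inspect" to open the developer tools and examine the page source. The content section looks like this: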

<div class="content">
    <div class="widget-gallery">
        <ul class="pagelist-wrapper">
            <li class="gallery-item...

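We can tell that each li.gallery-item is the entry point of one album, stored under ul.pagelist-wrapper, and that div.widget-gallery is the container. An XPath selection would be: //div[@class="widget-gallery"]/ul/li. Following the usual page logic, we would find the link under each li.gallery-item and then go one level deeper to scrape the images.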
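
As a rough offline check of that XPath, the sketch below runs the same path against a small, well-formed stand-in for the rendered markup. It uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) rather than Scrapy's own selectors, and the href values and link text are made up for illustration, because the article's markup is truncated.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the rendered page.
# The <a href="..."> children are assumptions, not taken from the real site.
SAMPLE = """
<div class="content">
  <div class="widget-gallery">
    <ul class="pagelist-wrapper">
      <li class="gallery-item"><a href="https://example.com/post/1/">first</a></li>
      <li class="gallery-item"><a href="https://example.com/post/2/">second</a></li>
    </ul>
  </div>
</div>
"""


def gallery_links(markup):
    """Return the href of the first link inside each album <li>."""
    root = ET.fromstring(markup.strip())
    # Same path as in the text, made relative to the root element.
    items = root.findall(".//div[@class='widget-gallery']/ul/li")
    links = []
    for li in items:
        anchor = li.find("a")
        if anchor is not None:
            links.append(anchor.get("href"))
    return links


links = gallery_links(SAMPLE)
print(links)
```

Running the same function on markup whose div.widget-gallery is empty returns an empty list, which is a quick way to notice that the albums are not in the static HTML.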

However, if you request the page with an HTTP debugging tool such as Postman, the content you get back is:

<div class="content">
 <div class="widget-gallery"></div>
</div>

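In other words, there is no actual album content, so we can conclude that the page uses Ajax: the albums are requested and inserted into div.widget-gallery only when a browser loads the page. The developer tools show the following XHR request address:

https://tuchong.com/rest/tags/美女/posts?page=1&count=20&order=weekly&before_timestamp=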

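The parameters are simple: page is the page number, count is the number of albums per page, order is the sort order, and before_timestamp is empty. Because Tuchong is a feed-style site, before_timestamp is presumably a time value, with different times returning different content. Here we drop it and simply crawl backwards from the newest page, ignoring time.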
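
A minimal sketch of assembling that request URL from its parameters, using only the standard library. The helper name build_posts_url and the BASE constant are hypothetical; the tag is percent-encoded as UTF-8, and before_timestamp is left out, as described above.

```python
from urllib.parse import quote, urlencode

# Endpoint pattern taken from the XHR address; the {tag} placeholder is ours.
BASE = "https://tuchong.com/rest/tags/{tag}/posts"


def build_posts_url(tag, page, count=20, order="weekly"):
    """Return the URL for one page of a tag's albums (before_timestamp omitted)."""
    query = urlencode({"page": page, "count": count, "order": order})
    return BASE.format(tag=quote(tag)) + "?" + query


first_page = build_posts_url("美女", 1)
print(first_page)
```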

The response is JSON, which makes scraping easier. The result looks like this:

{
    "postList": [{
        "post_id": "15624611",
        "type": "multi-photo",
        "url": "https://weishexi.tuchong.com/15624611/",
        "site_id": "443122",
        "author_id": "443122",
        "published_at": "2017-10-28 18:01:03",
        "excerpt": "10 月 18 日",
        "favorites": 4052,
        "comments": 353,
        "rewardable": true,
        "parent_comments": "165",
        "rewards": "2",
        "views": 52709,
        "title": "微风不燥 秋意正好",
        "image_count": 15,
        "images": [{
            "img_id": 11585752,
            "user_id": 443122,
            "title": "",
            "excerpt": "",
            "width": 5016,
            "height": 3840
        },
        {
            "img_id": 11585737,
            "user_id": 443122,
            "title": "",
            "excerpt": "",
            "width": 3840,
            "height": 5760
        },
        ...],
        "title_image": null,
        "tags": [{
            "tag_id": 131,
            "type": "subject",
            "tag_name": "人像",
            "event_type": "",
            "vote": ""
        },
        {
            "tag_id": 564,
            "type": "subject",
            "tag_name": "美女",
            "event_type": "",
            "vote": ""
        }],
        "favorite_list_prefix": [],
        "reward_list_prefix": [],
        "comment_list_prefix": [],
        "cover_image_src": "https://photo.tuchong.com/443122/g/11585752.webp",
        "is_favorite": false
    }],
    "siteList": {...
    },
    "following": false,
    "coverUrl": "https://photo.tuchong.com/443122/ft640/11585752.webp",
    "tag_name": "美女",
    "tag_id": "564",
    "url": "https://tuchong.com/tags/美女/",
    "more": true,
    "result": "SUCCESS"
}

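The property names make their meaning easy to guess. Here we only care about the postList property: each element of that array is one album. We need several of an album's properties:

url: the page address for viewing a single album

post_id: the album ID, which should be unique on the site and can be used to check whether the content has already been scraped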
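
To illustrate how post_id could be used to skip albums that were already scraped, here is a small sketch. The SAMPLE string is a trimmed version of the response, with the same album deliberately repeated so the duplicate check has something to reject; the function name new_posts and the in-memory set are assumptions, not part of the original project.

```python
import json

# Trimmed response: only fields used here are kept, and the album is
# repeated on purpose to exercise the duplicate check.
SAMPLE = """
{
  "postList": [
    {"post_id": "15624611", "type": "multi-photo",
     "url": "https://weishexi.tuchong.com/15624611/", "image_count": 15},
    {"post_id": "15624611", "type": "multi-photo",
     "url": "https://weishexi.tuchong.com/15624611/", "image_count": 15}
  ],
  "more": true,
  "result": "SUCCESS"
}
"""


def new_posts(body, seen_ids):
    """Yield albums whose post_id has not been seen yet, recording each one."""
    for post in body.get("postList", []):
        if post["post_id"] in seen_ids:
            continue
        seen_ids.add(post["post_id"])
        yield post


seen = set()
fresh = list(new_posts(json.loads(SAMPLE), seen))
print([post["url"] for post in fresh])
```

In a real crawl the set of seen IDs would have to be persisted between runs (for example in a file or database) for the check to be useful.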

(PROJECT)
│  scrapy.cfg
│
└─tuchong
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  photo.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-36.pyc
    │
    └─__pycache__
            settings.cpython-36.pyc
            __init__.cpython-36.pyc

# tuchong/items.py
import scrapy

class TuchongItem(scrapy.Item):
    post_id = scrapy.Field()
    site_id = scrapy.Field()
    title = scrapy.Field()
    type = scrapy.Field()
    url = scrapy.Field()
    image_count = scrapy.Field()
    images = scrapy.Field()
    tags = scrapy.Field()
    excerpt = scrapy.Field()
    ...

# tuchong/spiders/photo.py (initial skeleton)
import scrapy

class PhotoSpider(scrapy.Spider):
    name = 'photo'
    allowed_domains = ['tuchong.com']
    start_urls = ['http://tuchong.com/']

    def parse(self, response):
        pass

# tuchong/spiders/photo.py (rewritten to request the JSON endpoint)
import json

import scrapy

from ..items import TuchongItem

class PhotoSpider(scrapy.Spider):
    name = 'photo'
    # allowed_domains = ['tuchong.com']
    # start_urls = ['http://tuchong.com/']

    def start_requests(self):
        url = 'https://tuchong.com/rest/tags/%s/posts?page=%d&count=20&order=weekly'
        # Crawl 10 pages of 20 albums each,
        # yielding Request objects with parse as the callback
        for page in range(1, 11):
            yield scrapy.Request(url=url % ('美女', page), callback=self.parse)

    # Callback: fill the TuchongItem fields from the response
    def parse(self, response):
        # response.text replaces the deprecated body_as_unicode()
        body = json.loads(response.text)
        items = []
        for post in body['postList']:
            item = TuchongItem()
            item['type'] = post['type']
            item['post_id'] = post['post_id']
            item['site_id'] = post['site_id']
            item['title'] = post['title']
            item['url'] = post['url']
            item['excerpt'] = post['excerpt']
            item['image_count'] = int(post['image_count'])
            item['images'] = {}
            # Convert images into a {img_id: img_url} dict
            for img in post.get('images', []):
                img_id = img['img_id']
                url = 'https://photo.tuchong.com/%s/f/%s.jpg' % (item['site_id'], img_id)
                item['images'][img_id] = url
            item['tags'] = []
            # Convert tags into a list of tag_name values
            for tag in post.get('tags', []):
                item['tags'].append(tag['tag_name'])
            items.append(item)
        return items
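
The two conversions inside parse can also be tried in isolation, without running a crawl. The sketch below pulls them out into plain functions; the names image_urls and tag_names are hypothetical, the image URL pattern is the one used in the spider, and the sample post contains only the fields these functions read.

```python
def image_urls(post):
    """Map each img_id to its image URL, following the spider's URL pattern."""
    site_id = post["site_id"]
    return {
        img["img_id"]: "https://photo.tuchong.com/%s/f/%s.jpg" % (site_id, img["img_id"])
        for img in post.get("images", [])
    }


def tag_names(post):
    """Return just the tag_name of every tag attached to the album."""
    return [tag["tag_name"] for tag in post.get("tags", [])]


# Cut-down album entry based on the JSON response shown earlier.
post = {
    "site_id": "443122",
    "images": [{"img_id": 11585752}, {"img_id": 11585737}],
    "tags": [{"tag_name": "人像"}, {"tag_name": "美女"}],
}

urls = image_urls(post)
names = tag_names(post)
print(urls)
print(names)
```

Both functions fall back to an empty result when the images or tags key is missing, which mirrors the post.get(...) calls in the spider.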
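
# tuchong/pipelines.py
from scrapy.exceptions import DropItem  # required by the raise statements below
...
    def process_item(self, item, spider):
        # Raise scrapy.exceptions.DropItem for items that fail the checks;
        # print the URL of items that pass
        if int(item['image_count']) < 3:
            raise DropItem("Too few images: " + item['url'])
        elif item['type'] != 'multi-photo':
            raise DropItem("Wrong type: " + item['url'])
        else:
            print(item['url'])
        return item
...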
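
The filtering rules in process_item can be exercised without Scrapy installed. The following sketch is a stand-alone approximation: DropItem here is a local stand-in for scrapy.exceptions.DropItem, check_item and filter_items are hypothetical helper names, and the sample items (including the "text" type) are invented for the demonstration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so this runs without Scrapy."""


def check_item(item):
    """Apply the same two checks as process_item; return the item if it passes."""
    if int(item["image_count"]) < 3:
        raise DropItem("Too few images: " + item["url"])
    if item["type"] != "multi-photo":
        raise DropItem("Wrong type: " + item["url"])
    return item


def filter_items(items):
    """Split items into those kept and the reasons the others were dropped."""
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(check_item(item))
        except DropItem as exc:
            dropped.append(str(exc))
    return kept, dropped


# Invented sample items; only the fields the checks read are present.
samples = [
    {"url": "https://example.com/a/", "type": "multi-photo", "image_count": 15},
    {"url": "https://example.com/b/", "type": "multi-photo", "image_count": 2},
    {"url": "https://example.com/c/", "type": "text", "image_count": 5},
]

kept, dropped = filter_items(samples)
print(len(kept), dropped)
```

Inside Scrapy the engine catches DropItem itself and counts it in the item_dropped_count statistic, so the try/except here only exists to imitate that behavior.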

# tuchong/settings.py
ITEM_PIPELINES = {
    'tuchong.pipelines.TuchongPipeline': 300,  # pipeline path: priority (lower numbers run first)
}

scrapy crawl photo

[scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 491,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10224,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 11, 27, 7, 20, 24, 414201),
 'item_dropped_count': 5,
 'item_dropped_reasons_count/DropItem': 5,
 'item_scraped_count': 15,
 'log_count/DEBUG': 18,
 'log_count/INFO': 8,
 'log_count/WARNING': 5,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 11, 27, 7, 20, 23, 867300)}

scrapy crawl photo -o output.json  # export as a JSON file
scrapy crawl photo -o output.csv   # export as a CSV file

# tuchong/pipelines.py
...
    def process_item(self, item, spider):
        ...
        else:
            print(item['url'])
            self.myblog.add_post(item)  # myblog is a database class that handles the database operations
        return item
...
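
The article does not show how myblog is implemented. Purely as an illustration, the sketch below shows one way such a class might look using the standard library's sqlite3 module. The class name MyBlog, the table layout, and the choice of SQLite are all assumptions; only the add_post(item) call comes from the snippet above.

```python
import json
import sqlite3


class MyBlog:
    """Hypothetical storage class; the real implementation is not shown in the article."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # post_id is the primary key, so each album is stored at most once.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS posts ("
            "post_id TEXT PRIMARY KEY, title TEXT, url TEXT, tags TEXT)"
        )

    def add_post(self, item):
        """Insert one album; return True if a row was added, False if it already existed."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO posts (post_id, title, url, tags) VALUES (?, ?, ?, ?)",
            (item["post_id"], item["title"], item["url"],
             json.dumps(item["tags"], ensure_ascii=False)),
        )
        self.conn.commit()
        return cur.rowcount == 1

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]


blog = MyBlog()
item = {
    "post_id": "15624611",
    "title": "微风不燥 秋意正好",
    "url": "https://weishexi.tuchong.com/15624611/",
    "tags": ["人像", "美女"],
}
first = blog.add_post(item)
second = blog.add_post(item)
print(first, second, blog.count())
```

Using post_id as the primary key lets the database enforce the uniqueness discussed in the content analysis, so re-running the crawl does not create duplicate rows.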

Source: http://www.phperz.com/article/18/0129/361505.html
