Life is short, I use Python
Previous articles in this series:
Python Web Scraping for Beginners (1): Introduction https://www.geekdigging.com/2019/11/13/3303836941/
Python Web Scraping for Beginners (2): Prerequisites (1) Installing the Basic Libraries https://www.geekdigging.com/2019/11/20/2586166930/
Python Web Scraping for Beginners (3): Prerequisites (2) Linux Basics https://www.geekdigging.com/2019/11/21/1005563697/
Python Web Scraping for Beginners (4): Prerequisites (3) Docker Basics https://www.geekdigging.com/2019/11/22/3679472340/
Python Web Scraping for Beginners (5): Prerequisites (4) Database Basics https://www.geekdigging.com/2019/11/24/334078215/
Python Web Scraping for Beginners (6): Prerequisites (5) Installing the Scraping Frameworks https://www.geekdigging.com/2019/11/25/1881661601/
Python Web Scraping for Beginners (7): HTTP Basics https://www.geekdigging.com/2019/11/26/1197821400/
Python Web Scraping for Beginners (8): Web Page Basics https://www.geekdigging.com/2019/11/27/101847406/
Python Web Scraping for Beginners (9): Crawler Basics https://www.geekdigging.com/2019/11/28/1668465912/
Python Web Scraping for Beginners (10): Session and Cookies https://www.geekdigging.com/2019/12/01/2475257648/
Python Web Scraping for Beginners (11): urllib Basics (1) https://www.geekdigging.com/2019/12/02/2333822325/
Python Web Scraping for Beginners (12): urllib Basics (2) https://www.geekdigging.com/2019/12/03/819896244/
Python Web Scraping for Beginners (13): urllib Basics (3) https://www.geekdigging.com/2019/12/04/2992515886/
Python Web Scraping for Beginners (14): urllib Basics (4) https://www.geekdigging.com/2019/12/05/104488944/
Python Web Scraping for Beginners (15): urllib Basics (5) https://www.geekdigging.com/2019/12/07/2788855167/
Python Web Scraping for Beginners (16): urllib in Practice, Scraping Meizitu Images https://www.geekdigging.com/2019/12/09/1691033431/
Python Web Scraping for Beginners (17): Requests Basics https://www.geekdigging.com/2019/12/10/1910005577/
Python Web Scraping for Beginners (18): Advanced Requests https://www.geekdigging.com/2019/12/11/1468953802/
Python Web Scraping for Beginners (19): XPath Basics https://www.geekdigging.com/2019/12/12/3568648672/
Python Web Scraping for Beginners (20): Advanced XPath https://www.geekdigging.com/2019/12/13/2569867940/
Python Web Scraping for Beginners (21): The Beautiful Soup Parsing Library (Part 1) https://www.geekdigging.com/2019/12/15/2789385418/
Python Web Scraping for Beginners (22): The Beautiful Soup Parsing Library (Part 2) https://www.geekdigging.com/2019/12/16/876770087/
Python Web Scraping for Beginners (23): Getting Started with the pyquery Parsing Library https://www.geekdigging.com/2019/12/17/876770088/
Python Web Scraping for Beginners (24): Douban Movie Rankings 2019 https://www.geekdigging.com/2019/12/18/1275791678/
Python Web Scraping for Beginners (25): Scraping Stock Data https://www.geekdigging.com/2019/12/19/1066903974/
Python Web Scraping for Beginners (26): Why You Can't Even Afford a Second-Hand Home in Shanghai https://www.geekdigging.com/2019/12/20/788803015/
Python Web Scraping for Beginners (27): The Selenium Test Automation Framework, from Getting Started to Giving Up (Part 1) https://www.geekdigging.com/2019/12/22/151891020/
Python Web Scraping for Beginners (28): The Selenium Test Automation Framework, from Getting Started to Giving Up (Part 2) https://www.geekdigging.com/2019/12/24/1100772905/
Python Web Scraping for Beginners (29): Using Selenium to Get Product Information from a Large E-commerce Site https://www.geekdigging.com/2019/12/25/7469407721/
Python Web Scraping for Beginners (30): Proxy Basics https://www.geekdigging.com/2019/12/26/9565104888/
Python Web Scraping for Beginners (31): Building a Simple Proxy Pool Yourself https://www.geekdigging.com/2019/12/30/8571162753/
Python Web Scraping for Beginners (32): Getting Started with the Asynchronous Request Library AIOHTTP https://www.geekdigging.com/2019/12/31/5591013128/
Python Web Scraping for Beginners (33): The Scrapy Framework, Basics (1) https://www.geekdigging.com/2020/01/05/4332169375/
Python Web Scraping for Beginners (34): The Scrapy Framework, Basics (2) https://www.geekdigging.com/2020/01/06/5952331367/
Python Web Scraping for Beginners (35): The Scrapy Framework, Basics (3) Selector https://www.geekdigging.com/2020/01/07/5324623576/
Python Web Scraping for Beginners (36): The Scrapy Framework, Basics (4) Downloader Middleware https://www.geekdigging.com/2020/01/08/8971187302/
Python Web Scraping for Beginners (37): The Scrapy Framework, Basics (5) Spider Middleware https://www.geekdigging.com/2020/01/09/4676268033/
Python Web Scraping for Beginners (38): The Scrapy Framework, Basics (6) Item Pipeline https://www.geekdigging.com/2020/01/10/2131379794/
Python Web Scraping for Beginners (39): Getting Started with the JavaScript Rendering Service Scrapy-Splash https://www.geekdigging.com/2020/01/11/4748644439/
Python Web Scraping for Beginners (40): The Scrapy Framework, Basics (7) Integrating Selenium in Practice https://www.geekdigging.com/2020/01/13/546552645/

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

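The two middleware dictionaries above are only part of what scrapy-splash expects in settings.py. The sketch below lists the remaining settings described in the scrapy-splash README. The SPLASH_URL value is an assumption (a Splash container listening locally on its default port 8050) and is not taken from the original; replace it with the address of your own Splash instance.

```python
# Address of the Splash instance.
# Assumption: Splash runs locally (e.g. in Docker) on its default port 8050.
SPLASH_URL = 'http://localhost:8050'

# Duplicate filter that takes Splash arguments into account, so that two
# requests for the same URL with different rendering arguments are not merged.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Only relevant when Scrapy's HTTP cache is enabled: a cache storage that is
# aware of Splash requests.
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```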
yield SplashRequest(url, self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,

        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render.html
    splash_url='<url>',      # optional; overrides SPLASH_URL
    slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
)

yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},       # optional; a dict with headers sent to Splash
        'dont_process_response': True,  # optional, default is False
        'dont_send_headers': True,  # optional, default is False
        'magic_response': False,    # optional, default is True
    }
})

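To see what these rendering arguments amount to outside Scrapy, it can help to call the Splash HTTP API directly. The following is a minimal sketch that only builds the GET URL for a rendering endpoint and does not send it. The helper name build_render_url is hypothetical, and the Splash address http://localhost:8050 is an assumption that does not come from the original.

```python
from urllib.parse import urlencode


def build_render_url(splash_url, target_url, endpoint='render.html', **args):
    """Build a GET URL for a Splash rendering endpoint.

    `args` are Splash rendering arguments such as wait=0.5, html=1 or png=1.
    """
    params = {'url': target_url}
    params.update(args)
    return '{}/{}?{}'.format(splash_url.rstrip('/'), endpoint, urlencode(params))


if __name__ == '__main__':
    # Assumption: a local Splash instance on the default port 8050.
    print(build_render_url('http://localhost:8050', 'https://www.jd.com/',
                           endpoint='render.json', html=1, png=1, wait=0.5))
```

Opening the printed URL in a browser, with Splash running, is a quick way to check a set of arguments before wiring them into a spider.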
function main(splash, args)
  splash:go("https://www.jd.com/")
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end

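A Lua script like the one above is run through Splash's execute endpoint, which receives the script in the lua_source argument; any other arguments become available in the script as args. The sketch below only prepares such a request with the standard library and does not send it. The helper name build_execute_request, the shortened LUA_SCRIPT, and the Splash address are assumptions for illustration, not part of the original.

```python
import json
from urllib.request import Request

# A shortened variant of the script above that reads the target from args.url.
LUA_SCRIPT = """
function main(splash, args)
  splash:go(args.url)
  return {url = splash:url(), cookies = splash:get_cookies()}
end
"""


def build_execute_request(splash_url, lua_source, **args):
    """Prepare (but do not send) a POST request to Splash's execute endpoint."""
    payload = dict(args, lua_source=lua_source)
    return Request(
        splash_url.rstrip('/') + '/execute',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


if __name__ == '__main__':
    # Assumption: a local Splash instance on the default port 8050.
    req = build_execute_request('http://localhost:8050', LUA_SCRIPT,
                                url='https://www.jd.com/')
    print(req.full_url)
    # To actually send it: urllib.request.urlopen(req).read()
```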
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash, args)
  splash:go(args.url)
  return {
    url = splash:url(),
    jpeg = splash:jpeg(),
    har = splash:har(),
    cookies = splash:get_cookies()
  }
end
"""


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['www.jd.com']
    start_urls = ['http://www.jd.com/']

    def start_requests(self):
        url = 'https://www.jd.com/'
        # No endpoint is given, so SplashRequest uses its default, render.html;
        # lua_script above is not sent with this request.
        yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        # Log the HTML rendered by Splash.
        self.logger.debug(response.text)

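The spider above defines lua_script, but the request as written uses the default render.html endpoint and never sends the script. One way to run it, following the scrapy-splash README, is to pass endpoint='execute' and args={'lua_source': ...} to SplashRequest. The sketch below shows the pieces that would be needed; the helper names execute_kwargs and decode_jpeg are hypothetical, and the base64 handling assumes Splash's documented behaviour of base64-encoding binary values (such as the result of splash:jpeg()) when they are returned inside a table.

```python
import base64


def execute_kwargs(lua_source, **extra_args):
    """Keyword arguments that route a SplashRequest to the execute endpoint.

    Intended use inside a spider (requires scrapy-splash):
        yield SplashRequest(url, self.parse, **execute_kwargs(lua_script))
    """
    args = dict(extra_args, lua_source=lua_source)
    return {'endpoint': 'execute', 'args': args}


def decode_jpeg(result):
    """Return the raw JPEG bytes from the table returned by the Lua script.

    `result` is the decoded JSON object, e.g. `response.data` in scrapy-splash.
    """
    return base64.b64decode(result['jpeg'])
```

With the execute endpoint the response body is JSON rather than HTML, so a parse callback would read fields such as url, har and cookies from response.data instead of logging response.text.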
Source: https://www.cnblogs.com/babycomeon/p/12199536.html