Blogger's Weibo: http://weibo.com/234654758
GitHub: https://github.com/thinkgamer

This article is the opening piece of my Scrapy crawler series; I will follow it up with more content on the framework from time to time. Before Scrapy, I wrote crawlers mainly with BeautifulSoup, requests, and urllib, but as I used them more I gradually realized that both their features and their efficiency fell short, so I turned to the Scrapy framework and tried to build some interesting things with it.
Scrapy is an application framework for crawling websites and extracting structured data, which is very useful for data mining, information processing, and similar tasks.
```
pip install Scrapy
```
Scrapy itself is a Python package and depends on several other Python packages, including lxml, parsel, w3lib, Twisted, cryptography, and pyOpenSSL.
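After installing, you can verify the setup with Scrapy's built-in version command:

```
scrapy version
```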
Create a new project:

```
scrapy startproject tiebaSpider
```
The directory structure is:

```
tiebaSpider/
    scrapy.cfg            # configuration file
    tiebaSpider/          # the project's Python module
        __init__.py
        items.py          # data model definitions (see the sketch below)
        pipelines.py      # pipelines, used to persist the scraped data
        settings.py       # project settings
        spiders/          # put your own spider logic here
            __init__.py
```
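For example, items.py holds the data models for the records you scrape. A minimal sketch (the TiebaItem name and its fields are hypothetical, not part of the original project):

```python
# items.py: data model for one scraped record (hypothetical example)
import scrapy

class TiebaItem(scrapy.Item):
    username = scrapy.Field()  # poster or replier name
    post_url = scrapy.Field()  # link to the post
```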
Create a new file tieba.py under spiders/ with the following content:
```python
# coding: utf-8
import scrapy


class TiebaSpider(scrapy.Spider):
    name = "tiebaSpider"

    def start_requests(self):
        urls = ['https://tieba.baidu.com/f?kw=戒赌&ie=utf-8&pn=0']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # save the raw page, named by its pn= offset
        page = response.url.split("pn=")[1]
        filename = "tieba-%s.html" % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```
Run the spider:

```
scrapy crawl tiebaSpider
```
We can also write it more simply like this:
```python
# coding: utf-8
import scrapy


class TiebaSpider(scrapy.Spider):
    name = "tiebaSpider"
    start_urls = ['https://tieba.baidu.com/f?kw=戒赌&ie=utf-8&pn=0']

    def parse(self, response):
        page = response.url.split("pn=")[1]
        filename = "tieba-%s.html" % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```
This works because parse is Scrapy's default callback for requests generated from start_urls; of course, if you write several callbacks of your own, you must specify the callback explicitly in each Request.
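For instance, here is a minimal sketch of wiring up a custom callback (the spider name, the XPath, and parse_detail are hypothetical, just to show the mechanism):

```python
# coding: utf-8
import scrapy


class DetailSpider(scrapy.Spider):
    name = "detailSpider"  # hypothetical spider
    start_urls = ['https://tieba.baidu.com/f?kw=戒赌&ie=utf-8&pn=0']

    def parse(self, response):
        # parse() runs automatically for every URL in start_urls;
        # follow-up requests are routed to a custom callback explicitly
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(url=response.urljoin(href),
                                 callback=self.parse_detail)

    def parse_detail(self, response):
        # custom callback: just log which page arrived
        self.log('Got detail page %s' % response.url)
```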
The task: scrape as many Tieba usernames as possible, to be used for promotion. For now we use the method below (a newer method will come in a later update): for a given forum name, crawl the first three pages of its posts and collect the usernames of both the posters and the repliers; only the usernames are saved.
The idea of the implementation: look up the specified forum, collect the hrefs of all posts on its first three pages and save them to a file, then start a second spider that visits each post one by one and parses out the usernames.
The project layout:

```
tieba/
    data/                 # output files land here
    tieba/
        spiders/
            __init__.py
            tieba1.py     # collects the links of all posts
            tieba2.py     # collects poster and replier usernames
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
    name.txt              # stores the forum names
    scrapy.cfg
```
tieba1.py
```python
# coding: utf-8
import scrapy
import urllib
import time


class TiebaSpider(scrapy.Spider):
    name = 'tieba'

    def __init__(self):
        self.urls = []
        # load the forum names
        fr = open("name.txt", "r")
        for one in fr.readlines():
            # first three pages of each forum, 50 posts per page
            for i in range(0, 3):
                self.urls.append('https://tieba.baidu.com/f?kw='
                                 + urllib.quote(one.strip())
                                 + '&ie=utf-8&pn=' + str(i * 50))
        fr.close()

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        sel = scrapy.Selector(response)
        # hrefs of all post titles on the page
        ahref_list = sel.xpath('//a[re:test(@class, "j_th_tit ")]//@href').extract()
        fw = open("data/%s_all_href.txt" % time.strftime('%Y%m%d'), "a")
        for ahref in ahref_list:
            href = "https://tieba.baidu.com" + ahref
            fw.write(href + "\n")
        fw.close()
```
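Note that urllib.quote is Python 2 syntax; under Python 3 the same URL construction would use urllib.parse.quote. A minimal sketch (build_page_urls is a hypothetical helper, not part of the original code):

```python
# Python 3 version of the URL construction above (sketch)
from urllib.parse import quote

def build_page_urls(forum_name, pages=3, posts_per_page=50):
    # Tieba paginates with pn = page_index * 50
    return ['https://tieba.baidu.com/f?kw=' + quote(forum_name)
            + '&ie=utf-8&pn=' + str(i * posts_per_page)
            for i in range(pages)]

print(build_page_urls('戒赌'))
```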
tieba2.py
```python
# coding: utf-8
import scrapy
import time
from scrapy.http.request import Request


class TiebaSpider2(scrapy.Spider):
    name = 'tieba2'

    def __init__(self):
        self.urls = []
        # load the post links collected by tieba1.py
        fr = open("data/%s_all_href.txt" % time.strftime('%Y%m%d'), "r")
        for one in fr.readlines():
            self.urls.append(one.strip())
        fr.close()

    def start_requests(self):
        for one in self.urls:
            yield scrapy.Request(url=one, callback=self.parse)

    def parse_uname(self, response):
        sel = scrapy.Selector(response)
        # usernames of posters and repliers on this page
        name_list = sel.xpath('//li[re:test(@class, "d_name")]//a/text()').extract()
        fw = open("data/%s_all_name.txt" % time.strftime('%Y%m%d'), "a")
        for name in list(set(name_list)):
            fw.write(name.encode("utf-8"))
            fw.write("\n")
        fw.close()

    def parse(self, response):
        sel = scrapy.Selector(response)
        # some posts may have been deleted, so guard the whole block
        try:
            # how many pages this post has
            num = int(sel.xpath('//span[re:test(@class, "red")]//text()').extract()[1])
            # walk every page of the post to collect usernames
            for page_num in range(1, num + 1):
                one_url = response.url + "?pn=" + str(page_num)
                yield Request(url=one_url, callback=self.parse_uname)
        except Exception:
            pass
```
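With both spiders in place, the whole pipeline can be run in two steps from the project root (assuming name.txt has been filled in first):

```
scrapy crawl tieba     # step 1: collect post links into data/
scrapy crawl tieba2    # step 2: collect usernames from those posts
```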
The GitHub address of the code: click to view
Compared with BeautifulSoup, urllib, and requests, Scrapy is quicker, more efficient, and more flexible; moreover, parsing page content with lxml and parsel is faster than with BeautifulSoup.
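For reference, parsel (the selector library Scrapy builds on) can also be used standalone; here is a minimal sketch with a made-up HTML snippet:

```python
from parsel import Selector

# made-up HTML fragment mimicking Tieba's d_name markup
html = ('<ul>'
        '<li class="d_name"><a>alice</a></li>'
        '<li class="d_name"><a>bob</a></li>'
        '</ul>')
sel = Selector(text=html)
# the same XPath style used with scrapy.Selector above
print(sel.xpath('//li[@class="d_name"]//a/text()').extract())
```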
Source: http://blog.csdn.net/gamer_gyt/article/details/75043398