CSS Selectors: BeautifulSoup4
Beautiful Soup is also an HTML/XML parser; its main job is likewise to parse HTML/XML documents and extract data from them.
Install with pip:
pip install beautifulsoup4
Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
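
To make the parse-and-extract workflow concrete, here is a minimal sketch. The HTML snippet and the helper name `extract_basics` are made up for illustration, and the standard-library `html.parser` backend is assumed so that nothing beyond `beautifulsoup4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
SAMPLE_HTML = """
<html><head><title>Demo page</title></head>
<body>
  <p class="intro">Hello, <b>world</b>!</p>
  <a href="/first">First link</a>
  <a href="/second">Second link</a>
</body></html>
"""


def extract_basics(markup):
    """Parse markup and return the title, the intro text, and all link targets."""
    soup = BeautifulSoup(markup, "html.parser")
    title = soup.title.get_text()
    # get_text() concatenates the text of nested tags such as <b>.
    intro = soup.find("p", class_="intro").get_text()
    links = [a["href"] for a in soup.find_all("a")]
    return title, intro, links


if __name__ == "__main__":
    print(extract_basics(SAMPLE_HTML))
```

The same `BeautifulSoup` object also supports CSS selectors through `select()`, which is the approach used in the scraping script later in this post.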
Tool | Speed | Difficulty of use | Difficulty of installation |
---|---|---|---|
Regular expressions | Fastest | Hard | None (built in) |
BeautifulSoup | Slow | Easiest | Easy |
lxml | Fast | Easy | Moderate |
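
The ease-of-use column can be illustrated with a small side-by-side sketch: the same extraction written once with a regular expression and once with a BeautifulSoup CSS selector. The markup and both function names are hypothetical, lxml is left out to keep dependencies minimal, and the sketch makes no attempt to measure speed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
SAMPLE_HTML = (
    '<ul><li class="job">Engineer</li>'
    '<li class="job">Designer</li><li>Other</li></ul>'
)


def jobs_with_regex(markup):
    # The pattern is tied to the exact attribute layout of the sample markup;
    # an extra attribute or different quoting would silently break it.
    return re.findall(r'<li class="job">(.*?)</li>', markup)


def jobs_with_bs4(markup):
    # The selector describes the structure instead of the raw characters.
    soup = BeautifulSoup(markup, "html.parser")
    return [li.get_text() for li in soup.select("li.job")]


if __name__ == "__main__":
    print(jobs_with_regex(SAMPLE_HTML))
    print(jobs_with_bs4(SAMPLE_HTML))
```

Both functions return the same list here, but the regular expression depends on the exact text of the tag, whereas the selector keeps working if the markup gains attributes or whitespace.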
Scraping the Tencent job listings page with BeautifulSoup4
URL: http://hr.tencent.com/position.php?&start=10#a

    # bs4_tencent.py
    import json  # results are stored as JSON
    from urllib.request import Request, urlopen

    from bs4 import BeautifulSoup


    def tencent():
        url = 'http://hr.tencent.com/'
        request = Request(url + 'position.php?&start=10#a')
        response = urlopen(request)
        resHtml = response.read()
        html = BeautifulSoup(resHtml, 'lxml')
        # Use CSS selectors to collect the even and odd table rows
        result = html.select('tr[class="even"]')
        result2 = html.select('tr[class="odd"]')
        result += result2
        items = []
        for site in result:
            item = {}
            # The first cell holds the job title and the link to its detail page
            name = site.select('td a')[0].get_text()
            detailLink = site.select('td a')[0].attrs['href']
            catalog = site.select('td')[1].get_text()
            recruitNumber = site.select('td')[2].get_text()
            workLocation = site.select('td')[3].get_text()
            publishTime = site.select('td')[4].get_text()
            item['name'] = name
            item['detailLink'] = url + detailLink
            item['catalog'] = catalog
            item['recruitNumber'] = recruitNumber
            item['workLocation'] = workLocation
            item['publishTime'] = publishTime
            items.append(item)
        # Disable ASCII escaping so the output is written as UTF-8
        line = json.dumps(items, ensure_ascii=False)
        with open('tencent.json', 'w', encoding='utf-8') as output:
            output.write(line)


    if __name__ == "__main__":
        tencent()
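
Because the script above depends on a live page, the row-extraction step can also be tried offline. The following is a sketch under stated assumptions: the sample markup is invented to imitate the expected layout (`tr` rows with class `even` or `odd`, five `td` cells each), `parse_rows` and `FIELDS` are hypothetical names, and `html.parser` is used instead of lxml to avoid an extra dependency.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup that imitates the row layout the script above expects.
SAMPLE_HTML = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-08-01</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Designer</a></td>
    <td>Design</td><td>1</td><td>Beijing</td><td>2018-08-02</td>
  </tr>
</table>
"""

# Field names for the second to fifth cells of each row.
FIELDS = ["catalog", "recruitNumber", "workLocation", "publishTime"]


def parse_rows(markup, base_url=""):
    """Return one dict per even/odd table row, in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    items = []
    # A selector group matches both classes in a single pass.
    for row in soup.select("tr.even, tr.odd"):
        cells = row.select("td")
        link = cells[0].select_one("a")
        item = {
            "name": link.get_text(strip=True),
            "detailLink": base_url + link["href"],
        }
        for field, cell in zip(FIELDS, cells[1:]):
            item[field] = cell.get_text(strip=True)
        items.append(item)
    return items


if __name__ == "__main__":
    rows = parse_rows(SAMPLE_HTML, "http://hr.tencent.com/")
    print(json.dumps(rows, ensure_ascii=False, indent=2))
```

One design difference from the script: selecting `tr.even, tr.odd` together keeps the rows in page order, whereas concatenating two separate result lists places all even rows before all odd rows.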
Source: http://www.bubuko.com/infodetail-2723816.html