Python Web Scraping [Part 1] [The BeautifulSoup4 Parser]

CSS selectors: BeautifulSoup4

Beautiful Soup is another HTML/XML parser; its main job is to parse HTML/XML documents and extract data from them.

Install with pip:
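
As a minimal sketch of what "parse and extract" looks like in practice, the snippet below feeds a small HTML string (invented for this illustration) to BeautifulSoup and reads out a tag's text and an attribute. It assumes only that `beautifulsoup4` is installed, and uses Python's built-in `html.parser` backend so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this example
sample_html = '<html><body><h1 id="title">Hello</h1><a href="/jobs">Jobs</a></body></html>'

soup = BeautifulSoup(sample_html, 'html.parser')

heading_text = soup.h1.get_text()   # text inside the first <h1> tag
link_href = soup.a.attrs['href']    # value of the href attribute on the first <a> tag
print(heading_text, link_href)
```

Passing `'lxml'` instead of `'html.parser'` selects the lxml backend, which has to be installed separately.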

pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

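Scraping tool	Speed	Difficulty of use	Difficulty of installation
Regular expressions	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate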
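
To make the ease-of-use column concrete, here is a rough sketch that extracts the same values once with a regular expression and once with BeautifulSoup. The markup and variable names are made up for the example, and lxml is left out so the snippet depends only on `beautifulsoup4`; it does not measure speed.

```python
import re

from bs4 import BeautifulSoup

# Sample markup invented for this comparison
sample_html = '<ul><li class="job">Engineer</li><li class="job">Designer</li></ul>'

# Regular expression: built in, but the pattern is tied to the exact markup
regex_jobs = re.findall(r'<li class="job">(.*?)</li>', sample_html)

# BeautifulSoup: the CSS selector states the intent and tolerates markup changes better
soup = BeautifulSoup(sample_html, 'html.parser')
soup_jobs = [li.get_text() for li in soup.select('li.job')]

print(regex_jobs)
print(soup_jobs)
```

Both approaches return the same list here; the regular expression would stop matching as soon as an attribute were added or reordered, which is the kind of fragility the "Hard" rating refers to.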

Scraping the Tencent careers page with BeautifulSoup4

URL: http://hr.tencent.com/position.php?&start=10#a

# bs4_tencent.py
import json  # results are stored in JSON format
import urllib.request

from bs4 import BeautifulSoup


def tencent():
    url = 'http://hr.tencent.com/'
    request = urllib.request.Request(url + 'position.php?&start=10#a')
    response = urllib.request.urlopen(request)
    resHtml = response.read()
    html = BeautifulSoup(resHtml, 'lxml')
    # Use CSS selectors to collect the even and odd table rows
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2
    items = []
    for site in result:
        item = {}
        # The first cell holds the job title and the link to its detail page
        name = site.select('td a')[0].get_text()
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()
        item['name'] = name
        item['detailLink'] = url + detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation
        item['publishTime'] = publishTime
        items.append(item)
    # Disable ASCII escaping and write the file as UTF-8
    line = json.dumps(items, ensure_ascii=False)
    with open('tencent.json', 'w', encoding='utf-8') as output:
        output.write(line)


if __name__ == "__main__":
    tencent()
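
The row-parsing logic above can also be tried offline. The sketch below runs the same kind of selectors against a hand-written table; the sample markup, the `parse_rows` helper and its `base_url` parameter are assumptions made for illustration, and the real page's structure may differ.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup imitating the job listing table; the real page may differ
sample_html = '''
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-08-01</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">产品经理</a></td>
    <td>Product</td><td>1</td><td>Beijing</td><td>2018-08-02</td>
  </tr>
</table>
'''


def parse_rows(markup, base_url='http://hr.tencent.com/'):
    """Turn each even/odd table row into a dict (illustrative helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    items = []
    # A single selector group matches rows of either class
    for row in soup.select('tr.even, tr.odd'):
        cells = row.select('td')
        link = cells[0].select('a')[0]
        items.append({
            'name': link.get_text(),
            'detailLink': base_url + link.attrs['href'],
            'catalog': cells[1].get_text(),
            'recruitNumber': cells[2].get_text(),
            'workLocation': cells[3].get_text(),
            'publishTime': cells[4].get_text(),
        })
    return items


items = parse_rows(sample_html)
# ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes
line = json.dumps(items, ensure_ascii=False)
print(line)
```

Keeping the parsing in a function that takes markup as input makes it possible to check the selectors without a network connection.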

Source: http://www.bubuko.com/infodetail-2723816.html
