Getting a Web Page's Title with Python
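This uses urllib2 and lxml on Python 2.x, which should be faster than BeautifulSoup4 (come to think of it, why does everyone reach for BS4? Wouldn't a single XPath expression do the job?)
If you don't have lxml installed yet, install it with pip (urllib2 ships with Python 2's standard library):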
pip install lxml
Interactive shell demo:
>>> from lxml import etree
>>> import urllib2
>>> page = etree.HTML(urllib2.urlopen('https://blog.csdn.net/z690798364/article/details/79960358').read().decode('utf-8'))
>>> print page.xpath(u"/html/head/title")[0].text
Lxml 解析网页用法笔记 - z690798364 的专栏 - CSDN 博客
Wrapped up as a function:
from lxml import etree
import urllib2
#...
def get_site_title(link):
    # Send browser-like headers so the server is less likely to answer 403
    send_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive'
    }
    request = urllib2.Request(link, headers=send_headers)
    html = urllib2.urlopen(request).read().decode('utf-8')
    title = etree.HTML(html).xpath("/html/head/title")
    # xpath() returns a list; it is empty (never None) when nothing matches
    if not title:
        raise ValueError('target miss')
    return title[0].text
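For readers on Python 3, where urllib2 no longer exists, the same idea might be ported roughly as follows. This is only a minimal sketch, assuming lxml is installed. The helper name extract_title, the split into two functions, and the 10-second timeout are illustrative choices that are not in the original post. It also hands the raw bytes to etree.HTML instead of decoding them as UTF-8 first, so that lxml can pick up the charset declared in the page, which is a deliberate departure from the code above.

```python
import urllib.request

from lxml import etree


def extract_title(html_bytes):
    """Return the text of /html/head/title from raw HTML bytes."""
    root = etree.HTML(html_bytes)
    nodes = root.xpath("/html/head/title")
    # xpath() returns a list, which is empty when nothing matches
    if not nodes:
        raise ValueError("target miss")
    # .text is None for an empty <title></title>
    return (nodes[0].text or "").strip()


def get_site_title(link, timeout=10):
    """Fetch link with browser-like headers and return its title."""
    send_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }
    request = urllib.request.Request(link, headers=send_headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_title(response.read())
```

Calling get_site_title with the URL from the shell demo should give the same title as before. Keeping the parsing in extract_title means it can be tried on a literal HTML string without touching the network.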
Source: http://www.bubuko.com/infodetail-2936747.html