Reading Notes on "Web Scraping with Python" (Part 2)

1 Finding tags by name and attribute

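As before, fetch the whole page and create a BeautifulSoup object from it. The lxml parser used here is not bundled with BeautifulSoup and has to be installed separately.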

pip3 install lxml
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
>>> bsObj = BeautifulSoup(html, "lxml")

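findAll() returns every tag on the page that matches the given criteria. Here it extracts only the text inside <span class="green"></span> tags, which gives a list of character names. get_text() strips all tags and returns a string containing only the text.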

>>> nameList = bsObj.findAll("span", {"class":"green"})
>>> for name in nameList:
        print(name.get_text())
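
The same pattern can be tried without a network connection. The following is a minimal sketch, assuming a small hypothetical HTML snippet (SAMPLE_HTML is not the real page) and the built-in "html.parser" in place of lxml. It uses find_all, the current BeautifulSoup 4 spelling of findAll; both names refer to the same method.

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
<p><span class="red">"Well, Prince,"</span> said
<span class="green">Anna Pavlovna</span> to
<span class="green">Prince Vasili</span>.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect every <span class="green"> and strip the markup from each one.
names = [span.get_text() for span in soup.find_all("span", {"class": "green"})]
print(names)
```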

2 The parameters of findAll() and find() in detail

findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

tag-- accepts a single tag name or a collection of tag names; the example below returns a list of all the heading tags in the HTML document

>>> bsObj.findAll({"h1", "h2", "h3", "h4", "h5", "h6"})
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

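attributes-- a dictionary of tag attributes and the values they must match; the example below returns the span tags in the HTML document whose class is either green or red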

>>> bsObj.findAll("span", {"class":{"green", "red"}})
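
As a rough illustration of the tag and attributes parameters together, the sketch below runs both kinds of query against a hypothetical inline snippet (SAMPLE_HTML is an assumption, not the real page). It passes lists rather than sets, since a list of alternatives is the form shown in the BeautifulSoup documentation.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the blue span is there to show it gets filtered out.
SAMPLE_HTML = """
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<p><span class="red">quote</span> <span class="green">name</span>
<span class="blue">other</span></p>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# tag: a list of names matches any of them, in document order.
headings = soup.find_all(["h1", "h2", "h3"])
print([h.name for h in headings])

# attributes: a list of values matches a tag whose class is any of them.
colored = soup.find_all("span", {"class": ["green", "red"]})
print([s.get_text() for s in colored])
```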

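recursive-- a Boolean (default True). When True, the search covers the children of the tag, their children, and so on down the tree; when False, only the direct children (the top-level tags) are examined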
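
The difference is easiest to see on a small nested structure. A minimal sketch, assuming a hypothetical snippet with one direct and one nested paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> is a direct child of the div, one is nested deeper.
SAMPLE_HTML = '<div id="outer"><p>direct</p><section><p>nested</p></section></div>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outer = soup.find("div", {"id": "outer"})

# Default (recursive=True): every descendant <p> is found.
all_p = [p.get_text() for p in outer.find_all("p")]

# recursive=False: only <p> tags that are direct children of the div.
direct_p = [p.get_text() for p in outer.find_all("p", recursive=False)]

print(all_p, direct_p)
```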

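text-- matches on the text content of tags rather than on their names or attributes. A plain string must match the whole text exactly; the example below counts how many times "the prince" appears on the page surrounded by tags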

>>> nameList=bsObj.findAll(text="the prince")
>>> print(len(nameList))
7
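
Because a plain string is compared against each whole text node, text that merely contains the phrase is not counted. The sketch below contrasts an exact match with a regular-expression match on a hypothetical snippet. It uses string=, the name recent BeautifulSoup 4 releases use for the text argument.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet: the phrase appears twice as a whole text node
# and once inside a longer run of text.
SAMPLE_HTML = """
<p><span>the prince</span> spoke, and <span>the prince</span> left.
Later the prince returned.</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact match: only text nodes equal to "the prince".
exact = soup.find_all(string="the prince")

# Regex match: any text node that contains "the prince".
partial = soup.find_all(string=re.compile("the prince"))

print(len(exact), len(partial))
```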

limit-- returns only the first x results, in the order they appear on the page (a limit of 1 is equivalent to find)
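
A short sketch of limit on a hypothetical three-item list. Note that findAll with a limit of 1 still returns a list, whereas find returns the tag itself:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three list items.
soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")

# Stop after the first two matches, in document order.
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]

# limit=1 yields a one-element list; find() yields the same tag directly.
only_one = soup.find_all("li", limit=1)
first = soup.find("li")

print(first_two, only_one[0] is first)
```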

keywords-- lets you select tags that have a particular attribute, passed as a keyword argument

>>> allText = bsObj.findAll(id="text")
>>> print(allText[0].get_text())

The following two lines are equivalent

bsObj.findAll(id="text")
bsObj.findAll("", {"id":"text"})

Because class is a reserved word in Python, it needs a trailing underscore when passed as a keyword argument

bsObj.findAll(class_="green")
bsObj.findAll("", {"class":"green"})
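
The equivalence can be checked directly. A minimal sketch on a hypothetical snippet, passing the dictionary through the attrs keyword so that no tag name has to be supplied:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one green and one red span.
SAMPLE_HTML = '<p id="text"><span class="green">A</span><span class="red">B</span></p>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword form: class_ avoids clashing with Python's reserved word "class".
by_keyword = [t.get_text() for t in soup.find_all(class_="green")]

# Dictionary form: the attribute name is an ordinary string key.
by_dict = [t.get_text() for t in soup.find_all(attrs={"class": "green"})]

# The id attribute needs no underscore because "id" is not reserved.
by_id = soup.find_all(id="text")

print(by_keyword, by_dict, len(by_id))
```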

3 BeautifulSoup objects

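BeautifulSoup object: the parsed document itself, such as bsObj in the examples above

Tag object: what a BeautifulSoup object returns, as a list or as a single object, from findAll and find, or when you access a child tag directly (e.g. bsObj.div.h1)

NavigableString object: represents the text inside a tag

Comment object: represents an HTML comment in the document and is used to find comments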
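
To see these four types side by side, the sketch below parses a hypothetical one-line document and inspects what each access returns. The lambda passed to string= is the idiom the BeautifulSoup documentation uses for locating comments.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup containing a tag, some text, and a comment.
soup = BeautifulSoup("<div><h1>Title</h1><!-- a note --></div>", "html.parser")

h1 = soup.div.h1      # Tag, reached by drilling down through child tags
text = h1.string      # NavigableString holding the text inside <h1>
comment = soup.find(string=lambda s: isinstance(s, Comment))  # Comment node

print(type(soup).__name__, type(h1).__name__,
      type(text).__name__, type(comment).__name__)
```

Comment is a subclass of NavigableString, so code that handles text nodes may need an isinstance check to skip comments.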

4 Navigating the parse tree: finding tags by their position in the document

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://www.pythonscraping.com/pages/page3.html")
>>> bsObj = BeautifulSoup(html, "lxml")

>>> for child in bsObj.find("table", {"id":"giftList"}).children:
        print(child)

>>> for sibling in bsObj.find("table", {"id":"giftList"}).tr.next_siblings:
        print(sibling)

>>> print(bsObj.find("img", {"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
$15.00
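
The three navigation steps above can also be tried offline. This is a sketch under stated assumptions: SAMPLE_HTML is a hypothetical, whitespace-free stand-in for the gift table (the real page has more rows and different image paths), and "html.parser" replaces lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the gift table, written without whitespace between
# tags so that no whitespace-only text nodes show up among the siblings.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th><th>Image</th></tr>"
    '<tr><td>Vegetable Basket</td><td>$15.00</td><td><img src="img1.jpg"></td></tr>'
    '<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td><td><img src="img2.jpg"></td></tr>'
    "</table>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children: direct children only (here, the three rows).
rows = list(table.children)

# .next_siblings: everything after the first row, i.e. the rows minus the header.
data_rows = list(table.tr.next_siblings)

# img -> parent <td> -> previous <td> (the cost cell) -> its text.
price = soup.find("img", {"src": "img1.jpg"}).parent.previous_sibling.get_text()

print(len(rows), len(data_rows), price)
```

On a real page the markup usually contains newlines between tags, so .children and .next_siblings also yield whitespace text nodes alongside the tags.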

Source: http://www.bubuko.com/infodetail-2543502.html
