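I. Basic usage of Beautiful Soup
In short, Beautiful Soup is a Python library whose main job is to extract data from web pages. The official description reads:
Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolkit that parses a document and hands you the data you want to scrape. Because it is simple, a complete application takes very little code.
For more, see the official documentation.
1. Installation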
pip3 install beautifulsoup4
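(1) Parsers
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup uses Python's built-in parser, html.parser. The lxml parser is more capable and faster, so installing it is recommended: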
pip3 install lxml
Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does. It can be installed with:
pip install html5lib
(2) Parser comparison
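One way to see how the parsers differ is to feed the same invalid markup to each of them. The sketch below is only illustrative: the `broken` string and the `results` dictionary are names chosen for this example, and any parser that is not installed is skipped rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: an <a> that is never closed, then a stray </p>.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        results[parser] = None

for parser, markup in results.items():
    print(f"{parser:12} -> {markup}")
```

With html.parser the stray `</p>` is simply dropped, giving `<a></a>`. According to the Beautiful Soup documentation, lxml additionally wraps the result in `<html><body>`, while html5lib builds a full HTML5 document and pairs the stray tag with an opening `<p>`.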
2. Quick start
The HTML below is used as the running example. It is an excerpt from Alice in Wonderland (referred to from here on as the "Alice" document):
- html_doc = """
- <html><head><title>The Dormouse's story</title></head>
- <body>
- <p class="title"><b>The Dormouse's story</b></p>
- <p class="story">Once upon a time there were three little sisters; and their names were
- <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
- <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
- <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
- and they lived at the bottom of a well.</p>
- <p class="story">...</p>
- """
Parsing this document with BeautifulSoup yields a BeautifulSoup object, which can print the document as a conventionally indented structure:
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html_doc, "html.parser")  # a bs4.BeautifulSoup object, built with the html.parser parser
- print(soup.prettify())  # print with standard indentation
Output:
- <html>
-  <head>
-   <title>
-    The Dormouse's story
-   </title>
-  </head>
-  <body>
-   <p class="title">
-    <b>
-     The Dormouse's story
-    </b>
-   </p>
-   <p class="story">
-    Once upon a time there were three little sisters; and their names were
-    <a class="sister" href="http://example.com/elsie" id="link1">
-     Elsie
-    </a>
-    ,
-    <a class="sister" href="http://example.com/lacie" id="link2">
-     Lacie
-    </a>
-    and
-    <a class="sister" href="http://example.com/tillie" id="link3">
-     Tillie
-    </a>
-    ; and they lived at the bottom of a well.
-   </p>
-   <p class="story">
-    ...
-   </p>
-  </body>
- </html>
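As a small end-to-end illustration of the quick start, the sketch below parses a shortened, hypothetical version of the "Alice" document and pulls out the page title and the link targets. The names `snippet`, `page_title`, and `links` are chosen for this example only.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical version of the "Alice" document.
snippet = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(snippet, "html.parser")

# The text inside <title>.
page_title = soup.title.string

# Map each link's text to its href attribute.
links = {a.get_text(): a.get("href") for a in soup.find_all("a")}

print(page_title)
print(links)
```

Passing the parser name explicitly keeps the result the same regardless of which third-party parsers happen to be installed.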
II. Traversing the document tree with Beautiful Soup
A few simple ways to navigate the parsed structure:
The simplest way to navigate the tree is to name the tag you want.
- soup.head
- # <head><title>The Dormouse's story</title></head>
- soup.title
- # <title>The Dormouse's story</title>
- soup.body.b
- # <b>The Dormouse's story</b>
- soup.a  # there are three <a> tags in total; attribute access returns only the first
- # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
- soup.find_all("a")  # returns every <a> tag as a list
- # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
- head_tag = soup.head
- head_tag.contents
- # [<title>The Dormouse's story</title>]
- soup.contents[1].name  # .contents is a list, so it can be indexed
- # 'html'
- title_tag = soup.title
- for child in title_tag.children:
-     print(child)
- # The Dormouse's story
- print(soup.head.contents)  # only one direct child tag
- # [<title>The Dormouse's story</title>]
- for i in soup.head.descendants:  # one child tag, plus one grandchild (the string)
-     print(i)
- # <title>The Dormouse's story</title>
- # The Dormouse's story
- # Note: a string counts as a node in its own right
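To make the difference between direct children and all descendants concrete, here is a minimal sketch on a small made-up fragment (not the "Alice" document). It separates tag nodes from string nodes with `isinstance`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# A small, made-up fragment used only for this illustration.
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

# .children yields only the direct children of <div>.
direct = [child.name for child in div.children]

# .descendants walks the whole subtree, tags and strings alike.
tags = [node.name for node in div.descendants if isinstance(node, Tag)]
texts = [str(node) for node in div.descendants if isinstance(node, NavigableString)]

print(direct)  # ['p', 'p']
print(tags)    # ['p', 'b', 'p']
print(texts)   # ['one ', 'two', 'three']
```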
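- print(soup.title.string)
- # The Dormouse's story
- print(soup.head.string)  # a tag with a single child takes that child's .string, even through nested tags
- # The Dormouse's story
- print(soup.body.string)  # <body> has several children, so .string is ambiguous
- # None
- for i in soup.body:  # with several children, iterate over them instead
-     print(i)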
- for string in soup.stripped_strings:
-     print(repr(string))
- # "The Dormouse's story"
- # "The Dormouse's story"
- # 'Once upon a time there were three little sisters; and their names were'
- # 'Elsie'
- # ','
- # 'Lacie'
- # 'and'
- # 'Tillie'
- # ';\nand they lived at the bottom of a well.'
- # '...'
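The string accessors can be compared side by side on a tiny made-up paragraph. This is a sketch for illustration, with variable names chosen here: `.string` gives up when there are several children, `.strings` keeps whitespace, `.stripped_strings` trims it, and `get_text()` joins everything into one string.

```python
from bs4 import BeautifulSoup

# A small, made-up paragraph with surrounding whitespace.
soup = BeautifulSoup("<p>\n  Hello, <b>world</b>\n</p>", "html.parser")
p = soup.p

single = p.string                      # None: <p> has more than one child
raw = list(p.strings)                  # whitespace preserved
clean = list(p.stripped_strings)       # whitespace trimmed, empty strings dropped
joined = p.get_text(" ", strip=True)   # one string, pieces joined by a space

print(single)
print(raw)
print(clean)
print(joined)
```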
- print(soup.title.parent)
- # <head><title>The Dormouse's story</title></head>
- for i in soup.a.parents:  # walks outward, from the innermost parent to the document
-     print(i.name)
- # p
- # body
- # html
- # [document]
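The chain of parents can also be collected into a list, for instance to build a breadcrumb-style path. The sketch below uses a made-up one-line document; `ancestors` and `breadcrumb` are names chosen for this example.

```python
from bs4 import BeautifulSoup

# A small, made-up document used only for this illustration.
soup = BeautifulSoup(
    "<html><body><p><a id='x'>link</a></p></body></html>", "html.parser"
)
link = soup.find("a", id="x")

# .parents yields each ancestor in turn, ending with the BeautifulSoup object,
# whose name is "[document]".
ancestors = [parent.name for parent in link.parents]

# Reverse the list to read the path from the outside in.
breadcrumb = " > ".join(reversed(ancestors))

# find_parent() jumps straight to the nearest ancestor with a given name.
body = link.find_parent("body")

print(ancestors)
print(breadcrumb)
```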
- sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>", "lxml")  # lxml wraps the fragment in <html><body>
- print(sibling_soup.prettify())
- # <html>
- #  <body>
- #   <a>
- #    <b>
- #     text1
- #    </b>
- #    <c>
- #     text2
- #    </c>
- #   </a>
- #  </body>
- # </html>
- sibling_soup.b.next_sibling
- # <c>text2</c>
- sibling_soup.c.previous_sibling
- # <b>text1</b>
- for i in enumerate(soup.a.next_siblings, 1):  # walk forward through the following siblings
-     print(i)
- # (1, ',\n')
- # (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>)
- # (3, ' and\n')
- # (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>)
- # (5, ';\nand they lived at the bottom of a well.')
- for i in enumerate(soup.a.previous_siblings, 1):  # walk backward through the preceding siblings
-     print(i)
- # (1, 'Once upon a time there were three little sisters; and their names were\n')
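As the output above shows, the sibling iterators return text nodes as well as tags. When only the tag siblings matter, `find_next_siblings()` filters them. A minimal sketch on a made-up paragraph, with illustrative variable names:

```python
from bs4 import BeautifulSoup

# A small, made-up paragraph with text between the links.
soup = BeautifulSoup(
    '<p><a id="1">A</a>, <a id="2">B</a> and <a id="3">C</a>.</p>', "html.parser"
)
first = soup.a

# .next_siblings includes the strings that sit between the tags.
every_sibling = [str(s) for s in first.next_siblings]

# find_next_siblings("a") keeps only the following <a> tags.
tag_siblings = [s["id"] for s in first.find_next_siblings("a")]

print(every_sibling)
print(tag_siblings)
```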
- # .next_element and .previous_element follow parse order. Recall how the document begins:
- # <html><head><title>The Dormouse's story</title></head>
- # <p class="title"><b>The Dormouse's story</b></p>
- print(soup.find("a", id="link2").next_element)
- # Lacie
- print(soup.find("a", id="link2").previous_element)
- # ,
- for element in soup.find("a", id="link3").next_elements:
-     print(repr(element))
- # 'Tillie'
- # ';\nand they lived at the bottom of a well.'
- # '\n'
- # <p class="story">...</p>
- # '...'
- # '\n'
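The difference between `.next_sibling` and `.next_element` is easy to miss: the first stays on the same level of the tree, while the second follows the order in which the parser met each node, so it descends into a tag before moving on. A minimal sketch on a made-up fragment, with variable names chosen for this example:

```python
from bs4 import BeautifulSoup

# A small, made-up fragment used only for this illustration.
soup = BeautifulSoup("<p><a>one</a><b>two</b></p><i>three</i>", "html.parser")
a = soup.a

# Same level of the tree: the tag that follows <a> inside <p>.
sibling = str(a.next_sibling)

# Parse order: the string inside <a> comes immediately after the <a> tag itself.
element = str(a.next_element)

# .next_elements keeps going past the end of <p> into the rest of the document.
following = [str(e) for e in a.next_elements]

print(sibling)
print(element)
print(following)
```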
Source: http://www.bubuko.com/infodetail-2512203.html