
I. Basic usage of Beautiful Soup

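Simply put, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

It is a toolbox that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.

For more details, see the official documentation.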

1. Installation

pip3 install beautifulsoup4

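(1) Parsers

Beautiful Soup supports the HTML parser in the Python standard library (html.parser) as well as several third-party parsers. If no third-party parser is installed, the standard-library parser is used. The lxml parser is more capable and faster, so installing it is recommended: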

pip3 install lxml

Another option is html5lib, a parser written in pure Python that parses pages the same way a browser does. It can be installed with:

pip install html5lib

(2) Parser comparison
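One way to see how the parsers differ is to feed each of them the same invalid markup. The following is a minimal sketch, not part of the original article: the `broken` string and the `results` dictionary are names chosen here for illustration, and lxml and html5lib are only tried if they happen to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: it closes a <p> that was never opened.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        results[parser] = None

for parser, markup in results.items():
    print(parser, "->", markup)
```

With html.parser the stray `</p>` is simply dropped. The third-party parsers generally repair the fragment into a fuller document, so the same input can produce a different tree depending on which parser is named.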

2. Quick start

The following HTML snippet is used as the example throughout. It is an excerpt from Alice in Wonderland (referred to below as the Alice document):

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Parsing this snippet with BeautifulSoup yields a BeautifulSoup object, which can print the document with standard nested indentation:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")  # a bs4.BeautifulSoup object, built with the html.parser parser
print(soup.prettify())   # print with standard indentation

Output:

<html>
    <head>
        <title>
            The Dormouse's story
        </title>
    </head>
    <body>
        <p class="title">
            <b>
                The Dormouse's story
            </b>
        </p>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                Elsie
            </a>
            ,
            <a class="sister" href="http://example.com/lacie" id="link2">
                Lacie
            </a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">
                Tillie
            </a>
            ; and they lived at the bottom of a well.
        </p>
        <p class="story">
            ...
        </p>
    </body>
</html>

II. Traversing the document tree with Beautiful Soup

A few simple ways to navigate the structured data:

The simplest way to navigate the tree is to name the tag you want.

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>

soup.body.b
# <b>The Dormouse's story</b>

soup.a  # there are three <a> tags in total; only the first is returned
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
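To make the difference between attribute access and find_all concrete, here is a small self-contained sketch. It goes slightly beyond the original text by reading tag attributes with dictionary-style access (`tag["href"]`) and text with `get_text()`; the shortened `html_doc` and the `links` dictionary are illustrative names.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a             # attribute access returns only the first match
all_links = soup.find_all("a")  # find_all returns every match as a list

# Map each link's text to its href attribute.
links = {a.get_text(): a["href"] for a in all_links}

print(first_link["id"])  # link1
print(len(all_links))    # 3
print(links)
```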

head_tag = soup.head
head_tag.contents
# [<title>The Dormouse's story</title>]
soup.contents[1].name  # indexing works too
# 'html'

title_tag = head_tag.contents[0]
for child in title_tag.children:
    print(child)
    # The Dormouse's story

print(soup.head.contents)  # only one direct child
# [<title>The Dormouse's story</title>]
for i in soup.head.descendants:  # one child tag, plus one grandchild (the string inside it)
    print(i)
# <title>The Dormouse's story</title>
# The Dormouse's story
# Note: a string counts as a node of its own in the tree
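The distinction between direct children and all descendants can be checked by counting nodes. This is a minimal sketch on a one-line document; the variable names `direct` and `nested` are chosen here for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

direct = head_tag.contents           # list of direct children only
nested = list(head_tag.descendants)  # every nested node, in document order

print(len(direct))  # 1: only the <title> tag
print(len(nested))  # 2: the <title> tag and the string inside it
```

`.contents` is a list, while `.children` and `.descendants` are iterators, so they need `list()` before they can be counted or indexed.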

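print(soup.title.string)
# The Dormouse's story
print(soup.head.string)   # works through nested tags as long as each level has a single child
# The Dormouse's story
print(soup.body.string)   # body has several children, so .string cannot pick one
# None
for i in soup.body:       # with several children, iterate over them instead
    print(i)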

for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
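The rules for `.string` and `.stripped_strings` are easier to see on a tiny document. The sketch below also uses `get_text()`, which the original text does not cover, to join all strings under a tag; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")
p = soup.p

single = p.b.string               # <b> has exactly one child, so .string works
ambiguous = p.string              # <p> has three children, so .string is None
pieces = list(p.stripped_strings) # each string, with surrounding whitespace removed
joined = p.get_text()             # all strings concatenated as they appear

print(single)     # world
print(ambiguous)  # None
print(pieces)     # ['Hello,', 'world', '!']
print(joined)     # Hello, world!
```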

print(soup.title.parent)
# <head><title>The Dormouse's story</title></head>

for i in soup.a.parents:  # walks outward, from the innermost parent to the outermost
    print(i.name)
# p
# body
# html
# [document]
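As a small application of `.parents`, the sketch below builds a readable path from the document root down to a tag. `tag_path` is a hypothetical helper written for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a id='link1'>Elsie</a></p></body></html>", "html.parser"
)

def tag_path(tag):
    """Return the names from the document root down to `tag` (hypothetical helper)."""
    # .parents yields the nearest parent first, so reverse to start at the root.
    names = [tag.name] + [parent.name for parent in tag.parents]
    return " > ".join(reversed(names))

path = tag_path(soup.a)
print(path)  # [document] > html > body > p > a
```

The outermost parent is the BeautifulSoup object itself, whose name is `[document]`.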

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
print(sibling_soup.prettify())
# <a>
#  <b>
#   text1
#  </b>
#  <c>
#   text2
#  </c>
# </a>

sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>

for i in enumerate(soup.a.next_siblings, 1):  # siblings after the first <a>
    print(i)
# (1, ',\n')
# (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>)
# (3, ' and\n')
# (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>)
# (5, ';\nand they lived at the bottom of a well.')
for i in enumerate(soup.a.previous_siblings, 1):  # siblings before the first <a>
    print(i)
# (1, 'Once upon a time there were three little sisters; and their names were\n')
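Because the strings between tags count as siblings too, it is often useful to keep only the tags. The sketch below shows two ways to do that, assuming a shortened version of the Alice paragraph: filtering with `isinstance(..., Tag)` and calling `find_next_sibling`, a Beautiful Soup method the original text does not mention.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a>;</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Keep only sibling nodes that are tags, skipping the strings in between.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
sibling_ids = [t["id"] for t in tag_siblings]

# find_next_sibling jumps straight to the next sibling matching the name.
next_a = first.find_next_sibling("a")

print(sibling_ids)   # ['link2', 'link3']
print(next_a["id"])  # link2
```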

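next_element and previous_element follow the order in which the nodes were parsed, which is not the same as sibling order. In the fragment below, the element after the <title> tag is the string inside it, not the <p> tag:

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>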

print(soup.find("a", id="link2").next_element)
# Lacie

print(soup.find("a", id="link2").previous_element)
# ,

for element in soup.find("a", id="link3").next_elements:
    print(repr(element))
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# <p class="story">...</p>
# '...'
# '\n'
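The difference between `next_sibling` and `next_element` shows up clearly on a single tag. This is a minimal sketch with made-up markup: the sibling is the node after the tag at the same level, while the next element is whatever the parser saw next, which here is the string inside the tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>Tillie</a>; the end</p>", "html.parser")
a = soup.a

sibling = a.next_sibling                  # next node at the same level as <a>
element = a.next_element                  # next node in parse order: the string inside <a>
after_text = a.next_element.next_element  # parse order then moves on to the sibling

print(repr(sibling))     # '; the end'
print(repr(element))     # 'Tillie'
print(repr(after_text))  # '; the end'
```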

Source: http://www.bubuko.com/infodetail-2512203.html
