Installing, Configuring, and Using BeautifulSoup4, a Python 3 Parsing Library

只存在于虚拟的King · Published 2023-12-03 19:30:00

Preface

Beautiful Soup is a Python library for parsing HTML and XML. It makes extracting data from web pages convenient, and it offers a powerful API along with a choice of parsers.

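Three features of Beautiful Soup:

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it, in which case you only need to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can try different parsing strategies or trade speed for flexibility.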
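
The encoding behaviour in the second point can be sketched roughly as follows. This is a minimal illustration rather than part of the original walkthrough: the sample bytes and variable names are made up, and the built-in html.parser is used so that no third-party parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 encoded bytes with no charset declaration.
raw = "<p>café</p>".encode("latin-1")

# Name the original encoding instead of letting Beautiful Soup guess it.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string          # already a Unicode string
utf8_bytes = soup.p.encode()  # Tag.encode() defaults to UTF-8

print(text)
print(utf8_bytes)
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
```

If from_encoding is left out, Beautiful Soup falls back to detecting the encoding itself, which is usually fine for documents that declare a charset.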

1. Installing and Configuring Beautiful Soup 4

Beautiful Soup 4 is published on PyPI, so it can be installed with a Python package manager. The package name is beautifulsoup4.

$ pip install beautifulsoup4
or, on older setups that still provide the now-deprecated easy_install:
$ easy_install beautifulsoup4

It can also be installed from a downloaded source package:

# wget https://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
# tar xf beautifulsoup4-4.1.0.tar.gz
# cd beautifulsoup4-4.1.0
# python setup.py install

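Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it supports third-party parsers such as lxml.

Install the parsers:

$ pip install lxml
$ pip install html5lib

lxml is the recommended parser because it is faster. On Python 2 versions earlier than 2.7.3 and Python 3 versions earlier than 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those releases is not reliable enough.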
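
One way to act on this recommendation is to prefer lxml and fall back to the standard-library parser when lxml is missing. The sketch below assumes that approach; make_soup is a hypothetical helper name, not part of Beautiful Soup, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not available; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>parser</b></p>")
print(soup.b.string)
```

Different parsers can build slightly different trees from invalid markup, so it is worth testing against the parser that will run in production.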

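2. Basic Usage of BeautifulSoup

Passing a string or an open file handle to the BeautifulSoup constructor returns a document object. Beautiful Soup picks a suitable parser for the document; if you name a parser explicitly, that parser is used instead. Beautiful Soup turns a complex HTML document into a tree of Python objects, and every node falls into one of four types: Tag, NavigableString, BeautifulSoup, and Comment.

Note: Beautiful Soup 4 is imported from the bs4 package.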

from bs4 import BeautifulSoup

# The examples below are all tested against this document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(html_doc, "lxml")
soup1 = BeautifulSoup(markup, "lxml")
tag = soup.a
navstr = tag.string
comment = soup1.b.string
print(type(tag))      # Tag: an element of the document
print(type(comment))  # Comment: holds the text of a document comment
print(type(navstr))   # NavigableString: wraps a piece of text
print(type(soup))     # BeautifulSoup: the document as a whole

# Output:
<class 'bs4.element.Tag'>
<class 'bs4.element.Comment'>
<class 'bs4.element.NavigableString'>
<class 'bs4.BeautifulSoup'>

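(1) Node selectors (tag)

Accessing a node by its tag name as an attribute selects that element. Selections can be chained, and each one returns a bs4.element.Tag object.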

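soup = BeautifulSoup(html_doc, 'lxml')
print(soup.head)      # the head tag
print(soup.p.b)       # the b node under the first p node
print(soup.a.string)  # the text of the a tag; only the first a is used

The name attribute returns the name of a node:

soup.body.name

The attrs attribute returns a node's attributes as a dictionary, and an attribute can also be read by indexing the tag directly. The value is a list or a string depending on the attribute: multi-valued attributes such as class return a list, the others return a string.

soup.p.attrs           # all attributes of the p node
soup.p.attrs['class']  # the class attribute of the p node
soup.p['class']        # the class attribute, read directly from the tag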
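
The list-versus-string behaviour can be checked with a small standalone sketch. The markup and variable names here are illustrative assumptions, and html.parser is used only to avoid a third-party dependency.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story intro" id="first">text</p>', "html.parser")
p = soup.p

classes = p["class"]      # multi-valued attribute: a list of strings
pid = p["id"]             # single-valued attribute: a plain string
missing = p.get("title")  # .get() returns None instead of raising KeyError

print(classes, pid, missing)
```

Using .get() is a reasonable default when scraping pages whose attributes are not guaranteed to be present.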

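The string attribute returns the text contained in a node. It is None when the node has more than one child.

soup.p.string  # the text of the first p node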
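
A short sketch of when string returns text and when it returns None, using made-up markup. get_text(), which the article lists later among the other methods, is shown as the alternative for nodes with several children.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b></p><div>one <i>two</i></div>", "html.parser")

single = soup.p.string          # <p> has one child chain, so its text is returned
ambiguous = soup.div.string     # <div> has two children, so .string is None
all_text = soup.div.get_text()  # joins every string under <div>

print(repr(single), repr(ambiguous), repr(all_text))
```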

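The contents attribute returns a node's direct children as a list:

soup.body.contents  # direct children only, not deeper descendants

The children attribute also returns the direct children, but as an iterator:

soup.body.children

The descendants attribute returns all descendants as a generator:

soup.body.descendants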
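
The difference between direct children and descendants is easier to see on a tiny tree. The following is a sketch with assumed sample markup that contains no whitespace between tags, so the counts stay small.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li><b>a</b></li><li>b</li></ul>", "html.parser")
ul = soup.ul

direct = [child.name for child in ul.children]  # the two <li> tags only
everything = list(ul.descendants)               # tags and strings at every depth
tag_names = [node.name for node in everything if isinstance(node, Tag)]

print(direct)
print(len(everything), tag_names)
```

Note that descendants yields text nodes as well as tags, which is why the isinstance check is used before reading name.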

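The parent attribute returns the parent node; parents returns a generator over all ancestors.

soup.b.parent
soup.b.parents

next_sibling returns the next sibling node and previous_sibling the previous one. Note that a line break is also a node, so the sibling you get is often a string or whitespace rather than a tag.

soup.a.next_sibling
soup.a.previous_sibling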
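
The whitespace caveat can be demonstrated with a two-link snippet. This is a sketch on assumed markup; find_next_sibling(), covered in the method-selector section, is used here as one way to skip the text node and reach the next tag.

```python
from bs4 import BeautifulSoup

html = '<p><a id="x">1</a>\n<a id="y">2</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling            # the newline between the two tags
tag_next = first.find_next_sibling("a")  # skips strings, returns the next <a>

print(repr(raw_next))
print(tag_next)
```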

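next_siblings and previous_siblings return all following and all preceding sibling nodes respectively, as generators:

soup.a.next_siblings
soup.a.previous_siblings

The next_element and previous_element attributes return the object that was parsed immediately after or immediately before the current one:

soup.a.next_element
soup.a.previous_element

The next_elements and previous_elements iterators walk forward or backward through the document in parse order:

soup.a.next_elements
soup.a.previous_elements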
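
Because next_element follows parse order rather than tree level, it often differs from next_sibling. A minimal sketch on assumed markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>1</a><b>2</b></p>", "html.parser")
a = soup.a

sibling = a.next_sibling  # the <b> tag, on the same level as <a>
element = a.next_element  # the string "1", parsed right after <a> was opened
following = [str(node) for node in a.next_elements]

print(sibling, repr(element))
print(following)
```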

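(2) Method selectors

Everything so far selects nodes through attributes. That is very fast, but it is not flexible enough for more complex selections. Fortunately, Beautiful Soup also provides query methods such as find_all() and find().

find_all(name, attrs, recursive, text, limit, **kwargs): returns all elements that match the conditions. Its parameters:

name matches tags by name. It can be a string, or a filter such as a regular expression, a list, or True.

attrs takes the attributes to match as a dictionary, for example attrs={'id': '123'} for the common id attribute. Attributes can also be passed as keyword arguments; because class is a reserved word in Python, the keyword form needs a trailing underscore: class_='element'. The result is a list of Tag objects.

text matches the text of a node; it accepts a string or a compiled regular expression. Beautiful Soup 4.4.0 and later also accept this argument under the name string.

recursive controls the search depth: set recursive=False to search only direct children.

limit caps the number of results returned, much like the LIMIT keyword in SQL.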

import re
from bs4 import BeautifulSoup

# The examples below are all tested against this text
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
 ddd
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span>中文</span>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup))
print(soup.find_all('span'))                                      # search by tag name
print(soup.find_all('a', id='link1'))                             # tag name plus attribute
print(soup.find_all('a', attrs={'class': 'sister', 'id': 'link3'}))  # several attributes
print(soup.find_all('p', class_='title'))                         # class is reserved, so class_ is passed via **kwargs
print(soup.find_all(string=re.compile('Tillie')))                 # text filter (named text= before Beautiful Soup 4.4.0)
print(soup.find_all('a', limit=2))                                # cap the number of results

# Output:
<class 'bs4.BeautifulSoup'>
[<span>中文</span>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<p class="title"><b>The Dormouse's story</b></p>]
['Tillie']
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
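
The run above does not exercise the list and regular-expression forms of name or the recursive parameter. The sketch below fills that gap on a small assumed document; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

doc = "<html><head><title>T</title></head><body><b>x</b><p>y</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

by_list = [t.name for t in soup.find_all(["b", "p"])]         # any name in the list
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with "b"
deep = soup.html.find_all("title")                            # searches all descendants
shallow = soup.html.find_all("title", recursive=False)        # direct children of <html> only

print(by_list, by_regex)
print(len(deep), len(shallow))
```

The title tag sits under head, not directly under html, which is why the recursive=False search comes back empty.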

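find(name, attrs, recursive, text, **kwargs): returns a single element, the first match, as a Tag object, or None when nothing matches.

Its parameters are the same as those of find_all(), except that there is no limit.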
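
Since find() yields None on a miss while find_all() yields an empty list, code that chains attribute access after find() needs a guard. A minimal sketch, where first_href is a hypothetical helper name introduced only for this example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no links here</p>", "html.parser")


def first_href(tree):
    """Return the href of the first <a> in tree, or None if there is none."""
    link = tree.find("a")
    if link is None:  # find() matched nothing
        return None
    return link.get("href")


print(first_href(soup))
print(len(soup.find_all("a")))
```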

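There are many other query methods. They are used the same way as find_all() and take the same filter arguments, except that they have no recursive parameter; only the part of the tree they search differs.

find_parents(name, attrs, text, limit, **kwargs) and find_parent(name, attrs, text, **kwargs): the former returns all matching ancestor nodes, the latter returns the nearest one (the direct parent when no filter is given).

find_next_siblings(name, attrs, text, limit, **kwargs) and find_next_sibling(name, attrs, text, **kwargs): iterate over the nodes after the current tag. The former returns all following siblings that match, the latter returns the first one.

find_previous_siblings(name, attrs, text, limit, **kwargs) and find_previous_sibling(name, attrs, text, **kwargs): iterate over the nodes before the current tag. The former returns all preceding siblings that match, the latter returns the first one.

find_all_next(name, attrs, text, limit, **kwargs) and find_next(name, attrs, text, **kwargs): iterate over the tags and strings that come after the current tag. The former returns all matching nodes, the latter returns the first one.

find_all_previous() and find_previous(): iterate over the tags and strings that come before the current tag. The former returns all matching nodes before it, the latter returns the first one.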
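
The scoped search methods can be compared side by side on a small tree. This is a sketch on assumed markup with three links; the ids and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<div><p><a id="l1">1</a><a id="l2">2</a><a id="l3">3</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="l2")

parent = middle.find_parent("div").name                          # nearest <div> ancestor
ancestors = [t.name for t in middle.find_parents(["p", "div"])]  # innermost first
after = [t["id"] for t in middle.find_next_siblings("a")]        # siblings after l2
before = [t["id"] for t in middle.find_previous_siblings("a")]   # siblings before l2

print(parent, ancestors, after, before)
```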

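(3) CSS selectors

Beautiful Soup also supports CSS selectors. If you are not familiar with CSS selectors, see http://www.w3school.com.cn/cssref/css_selectors.asp

Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax: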

In [10]: soup.select('title')
Out[10]: [<title>The Dormouse's story</title>]

Search level by level through tag names:

In [12]: soup.select('body a')
Out[12]: 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find the direct children of a tag:

In [13]: soup.select('head > title')
Out[13]: [<title>The Dormouse's story</title>]

Find sibling tags:

In [14]: soup.select('#link1 ~ .sister')
Out[14]: 
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by CSS class name:

In [15]: soup.select('.title')
Out[15]: [<p class="title"><b>The Dormouse's story</b></p>]

In [16]: soup.select('[class~=title]')
Out[16]: [<p class="title"><b>The Dormouse's story</b></p>]

Find by tag id:

In [17]: soup.select('#link1')
Out[17]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [18]: soup.select('a#link2')
Out[18]: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find by the presence of an attribute:

In [20]: soup.select('a[href]')
Out[20]: 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by attribute value:

In [22]: soup.select('a[href="http://example.com/elsie"]')
Out[22]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [23]: soup.select('a[href^="http://example.com/"]') # match the start of the value
Out[23]: 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [24]: soup.select('a[href$="tillie"]') # match the end of the value
Out[24]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [25]: soup.select('a[href*=".com/el"]') # substring match
Out[25]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

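Node selection by tag name, method selectors, and CSS selectors achieve much the same thing. Selecting by tag name is the fastest of the three, while method selectors offer more convenient and more powerful queries and are easier to get started with.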
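
In scraping code, select() is often combined with attribute access to pull out values, and select_one() is the single-result counterpart. The sketch below assumes a shortened version of the sample document and uses html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = [a["href"] for a in soup.select("p.story > a.sister")]  # all matches
second = soup.select_one("#link2").get_text()                   # first match only
nothing = soup.select_one("#missing")                           # None when nothing matches

print(hrefs)
print(second, nothing)
```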

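(4) Modifying tags

Beautiful Soup's strength is searching documents. Modification is needed less often, so it gets only a brief introduction here; see the official Beautiful Soup documentation for the full set of methods.

Beautiful Soup can change a tag's attribute values and add or remove attributes and content. Some commonly used methods follow.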

In [26]: markup='<a href="http://www.baidu.com/">baidu</a>'
In [28]: soup=BeautifulSoup(markup,'lxml')
In [29]: soup.a.string='百度'
In [30]: soup.a
Out[30]: <a href="http://www.baidu.com/">百度</a>
# Any child nodes under the a node are replaced as well

Tag.append() adds content to a tag, much like the .append() method of a Python list:

In [30]: soup.a
Out[30]: <a href="http://www.baidu.com/">百度</a>

In [31]: soup.a.append('一下')

In [32]: soup.a
Out[32]: <a href="http://www.baidu.com/">百度一下</a>

The new_tag() method creates a new tag:

In [33]: soup=BeautifulSoup('<b></b>','lxml')

In [34]: new_tag=soup.new_tag('a',href="http://www.python.org") # create a tag; the first argument must be the tag name

In [35]: soup.b.append(new_tag) # add it under the b node

In [36]: new_tag.string='python' # set the tag's text

In [37]: soup.b
Out[37]: <b><a href="http://www.python.org">python</a></b>

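Other methods:

insert() inserts an element at a given position

insert_before() inserts content immediately before the current tag or text node

insert_after() inserts content immediately after the current tag or text node

clear() removes the contents of the current tag

extract() removes the current tag from the document tree and returns it

prettify() formats the Beautiful Soup document tree and returns it as a Unicode string; it can also be called on a tag node

get_text() returns the text contained in a tag, including the text of descendant tags

The soup.original_encoding attribute records the encoding that was detected automatically

from_encoding: pass this argument when creating the BeautifulSoup object to specify the encoding, which saves the time otherwise spent guessing it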
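
A few of the modification methods listed above can be tried together on a one-line document. This is a sketch on assumed markup; the comments show the state of the tree after each step.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")

italic = soup.new_tag("i")
italic.string = "it"
soup.b.insert_before(italic)  # <p><i>it</i><b>bold</b> text</p>

removed = soup.b.extract()    # detach <b>; it is returned for later use
after_extract = str(soup)     # <p><i>it</i> text</p>

soup.p.clear()                # drop everything inside <p>
after_clear = str(soup)       # <p></p>

print(removed, after_extract, after_clear)
```

extract() is useful when the removed node is needed elsewhere; clear() simply empties a tag in place.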

To parse only part of a document, use the SoupStrainer class to create a content filter. It accepts the same arguments as the search methods.

from bs4 import BeautifulSoup, SoupStrainer

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
 ddd
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span>中文</span>
"""


only_a_tags = SoupStrainer('a')  # the filter: keep only a tags

soup = BeautifulSoup(html_doc, 'lxml', parse_only=only_a_tags)

print(soup.prettify())

# Output:
<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>
<a class="sister" href="http://example.com/tillie" id="link3">
 Tillie
</a>

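Handling Beautiful Soup exceptions:

HTMLParser.HTMLParseError: malformed start tag

HTMLParser.HTMLParseError: bad end tag

Both exceptions are raised by the parser, specifically the built-in HTML parser of old Python releases (the HTMLParseError class was removed in Python 3.5). The fix is to install lxml or html5lib.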
