A Detailed Guide to BeautifulSoup, the Python Web Scraping Tool

terry, 2 years ago (2023-09-24) Views: 59 #Backend Development

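1. Module introduction

Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document tree. Beautiful Soup can save you hours or even days of work.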

2. How to use it

1. Install Beautiful Soup

pip install beautifulsoup4

2. Import the module

from bs4 import BeautifulSoup

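3. Choose a parser and parse the content

soup = BeautifulSoup(markup, parser)

Commonly used parsers: html.parser, lxml, xml, html5lib

Some parsers have to be installed first, for example: pip3 install lxml

Beautiful Soup supports the HTML parser in Python's standard library by default, and it also supports several third-party parsers: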
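Because third-party parsers may or may not be installed, it can help to fall back to the built-in one. The following is a minimal sketch, not part of the original article: the helper name make_soup and its default parser order are assumptions chosen for illustration. It relies on Beautiful Soup raising FeatureNotFound when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    make_soup is a hypothetical helper, not a Beautiful Soup API.
    """
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)  # world
```

Since html.parser ships with Python, the last entry in the default order should always be available.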

Python standard library
Usage: BeautifulSoup(markup, "html.parser")
Advantages: built into Python; moderate speed; good tolerance of malformed documents
Disadvantages: poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2

lxml HTML parser
Usage: BeautifulSoup(markup, "lxml")
Advantages: fast; good tolerance of malformed documents
Disadvantages: requires a C library to be installed

lxml XML parser
Usage: BeautifulSoup(markup, "xml")
Advantages: fast; the only parser that supports XML
Disadvantages: requires a C library to be installed

html5lib
Usage: BeautifulSoup(markup, "html5lib")
Advantages: best tolerance of malformed documents; parses documents the way a browser does; produces HTML5-format documents; does not depend on external extensions
Disadvantages: slow

Some simple ways to navigate structured data

# Get a Tag, which is simply one of the tags in the HTML
soup.title                    # the whole title tag: <title>The Dormouse's story</title>
soup.title.name               # the name of the title tag: title
soup.title.parent.name        # the name of the title tag's parent: head
soup.p                        # the first p tag: <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']               # the class attribute of the first p tag: ['title']
soup.p.get('class')           # equivalent to the line above
soup.a                        # the first a tag
soup.find_all('a')            # all a tags
soup.find(id="link3")         # the tag whose id attribute is link3
soup.a['class'] = "newClass"  # attributes and contents can be modified
del soup.a['class']           # an attribute can also be deleted
soup.find('a').get('id')      # the id attribute of the first a tag
soup.title.string             # the text of the title tag: The Dormouse's story
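The calls above need a parsed document to run against. Here is a self-contained sketch that assumes the "Dormouse's story" sample document from the official Beautiful Soup documentation, which the comments above appear to reference. The exact markup below is an assumption, not something the original article shows.

```python
from bs4 import BeautifulSoup

# Assumed sample document, modeled on the official Beautiful Soup docs.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)                      # The Dormouse's story
print(soup.title.parent.name)                 # head
print(soup.p["class"])                        # ['title']
print([a["id"] for a in soup.find_all("a")])  # ['link1', 'link2', 'link3']
print(soup.find(id="link3")["href"])          # http://example.com/tillie
```

Note that class is a multi-valued attribute, so Beautiful Soup returns it as a list rather than a plain string.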

3. Special usage

1. Get tags with specific attributes

Method 1: match a single attribute
soup.find_all('div', id="even")                                  # all div tags with id="even"
soup.find_all('div', attrs={'id': "even"})                       # same result
Method 2: match several attributes
soup.find_all('div', id="even", class_="square")                 # all div tags with id="even" and class="square"
soup.find_all('div', attrs={"id": "even", "class": "square"})    # same result
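To see both forms return the same tags, here is a small sketch against made-up markup. The three div elements are assumptions for illustration. Repeating an id is not valid HTML, but it mirrors the "all div tags with id=even" wording above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div id="even" class="square">A</div>
<div id="even" class="circle">B</div>
<div id="odd" class="square">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all("div", id="even")
by_both = soup.find_all("div", id="even", class_="square")
by_attrs = soup.find_all("div", attrs={"id": "even", "class": "square"})

print([d.string for d in by_id])     # ['A', 'B']
print([d.string for d in by_both])   # ['A']
print([d.string for d in by_attrs])  # ['A']
```

The keyword is spelled class_ because class is a reserved word in Python. The attrs dictionary form avoids that and also works for attribute names that are not valid Python identifiers, such as data-* attributes.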

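2. Get the value of a tag

1. prettify()

The prettify() method formats the Beautiful Soup document tree and returns it as a Unicode string, with each XML/HTML tag on its own line: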

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html5lib")  # html5lib adds the <html>, <head>, and <body> wrappers shown below
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
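The <html>, <head>, and <body> wrappers in the output above come from the parser, not from the markup itself. As a rough sketch, assuming the built-in html.parser is used instead, the fragment is left unwrapped. prettify() can also be called on a single tag.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# html.parser does not wrap a fragment in <html> or <body>.
soup = BeautifulSoup(markup, "html.parser")
pretty = soup.prettify()
print(pretty)

# prettify() also works on an individual tag, not just the whole document.
print(soup.i.prettify())
```

Pinning the parser explicitly keeps the output the same from one machine to the next, since the default choice depends on which parsers happen to be installed.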

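2. get_text()

If you only want the text inside a tag, call the get_text() method. It collects all the text within the tag, including the text of its descendant tags, and returns the result as a Unicode string: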

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")
soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'
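The raw result keeps the leading and trailing newlines from the markup. get_text() also accepts a separator string and a strip flag, and the stripped_strings generator yields the same pieces one at a time. Below is a minimal sketch using the same markup; the "|" separator is an arbitrary choice for illustration.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

raw = soup.get_text()
joined = soup.get_text("|", strip=True)  # strip each piece, join with "|"
pieces = list(soup.stripped_strings)     # the same pieces as a list

print(repr(raw))  # '\nI linked to example.com\n'
print(joined)     # I linked to|example.com
print(pieces)     # ['I linked to', 'example.com']
```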

Reference documentation

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

