Beautiful Soup (bs4)

Created: 2025-03-03 15:47

1. Introduction
2. Importing the library and parsing a document
3. Common operations
4. Using different parsers
5. A practical example

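1. Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML documents. It provides a simple API for parsing a document, traversing the resulting tree, and pulling out the information you need.
Documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
Installation: pip3 install beautifulsoup4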

2. Importing the library and parsing a document

from bs4 import BeautifulSoup

# Sample HTML document
html_doc = """
<html>
  <head><title>示例网页</title></head>
  <body>
    <p class="title"><b>标题</b></p>
    <p class="story">这是一个示例段落。
      <a href="http://example.com/1" class="link" id="link1">链接1</a>
      <a href="http://example.com/2" class="link" id="link2">链接2</a>
    </p>
    <p class="story">另一个段落。</p>
  </body>
</html>
"""

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # the built-in 'html.parser' also works
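To check what the parser produced, it can help to inspect the resulting object. The following is a minimal sketch; `snippet` and `soup_demo` are illustrative names that do not appear in the original, and it uses the built-in 'html.parser' so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Small stand-in document (illustrative, shorter than html_doc above)
snippet = '<p class="title"><b>标题</b></p>'

# 'html.parser' ships with Python, so no extra package is needed here
soup_demo = BeautifulSoup(snippet, 'html.parser')

# Tags are reachable as attributes of the parsed object
print(soup_demo.p.b.string)  # 标题

# prettify() returns the tree as an indented string, handy for debugging
pretty = soup_demo.prettify()
print(pretty)
```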

3. Common operations

(1) Extract text content
    # Extract the text of the whole document
    print(soup.get_text())
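By default, get_text() keeps all whitespace from the source. The sketch below shows the `separator` and `strip` arguments, which are often more convenient; `snippet` and `soup_demo` are illustrative names, not part of the original example.

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="story">这是一个示例段落。 '
    '<a href="http://example.com/1">链接1</a> '
    '<a href="http://example.com/2">链接2</a></p>'
)
soup_demo = BeautifulSoup(snippet, 'html.parser')

# Default: text nodes are concatenated exactly as they appear
raw_text = soup_demo.get_text()
print(raw_text)  # 这是一个示例段落。 链接1 链接2

# strip=True trims each text node and drops whitespace-only ones;
# separator is placed between the remaining pieces
clean_text = soup_demo.get_text(separator='|', strip=True)
print(clean_text)  # 这是一个示例段落。|链接1|链接2

# stripped_strings yields the same pieces one by one
pieces = list(soup_demo.stripped_strings)
print(pieces)
```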


(2) Get tag contents
    # Get the contents of the <title> tag
    title_tag = soup.title
    print(title_tag.string)  # Output: 示例网页

    # Get the text of every <p> tag
    p_tags = soup.find_all('p')
    for p in p_tags:
        print(p.text)
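The example above uses both .string and .text, and the two are not interchangeable. A small sketch of the difference, with `snippet` and `soup_demo` as illustrative names:

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="title"><b>标题</b></p>'
    '<p class="story">段落 <a href="#">链接1</a></p>'
)
soup_demo = BeautifulSoup(snippet, 'html.parser')
title_p, story_p = soup_demo.find_all('p')

# .string works when a tag has a single child (here: p > b > text)
print(title_p.string)  # 标题

# .string is None when a tag has more than one child
print(story_p.string)  # None

# .text (same as get_text()) joins all descendant text nodes
print(story_p.text)    # 段落 链接1
```

When a tag may contain nested markup, .text is usually the safer choice.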


(3) Find a single tag
    # Find the first <a> tag
    first_link = soup.a
    print(first_link['href'])  # Output: http://example.com/1
    print(first_link.text)     # Output: 链接1
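Both soup.a and soup.find('a') return None when the tag does not exist, so indexing the result directly can fail on real pages. A minimal defensive sketch follows; `first_href` is a hypothetical helper introduced only for illustration.

```python
from bs4 import BeautifulSoup

def first_href(soup):
    """Hypothetical helper: href of the first <a>, or None if there is none."""
    link = soup.find('a')  # equivalent to soup.a
    if link is None:
        return None
    return link.get('href')

with_link = BeautifulSoup('<p><a href="http://example.com/1">链接1</a></p>', 'html.parser')
without_link = BeautifulSoup('<p>没有链接</p>', 'html.parser')

print(first_href(with_link))     # http://example.com/1
print(first_href(without_link))  # None
```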


(4) Find all tags
    # Find every <a> tag
    links = soup.find_all('a')
    for link in links:
        print(link['href'], link.text)
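find_all() accepts more than a single tag name. The sketch below shows a list of names and the `limit` argument; `snippet` and `soup_demo` are illustrative names.

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="title"><b>标题</b></p>'
    '<p class="story">'
    '<a href="http://example.com/1">链接1</a>'
    '<a href="http://example.com/2">链接2</a>'
    '</p>'
)
soup_demo = BeautifulSoup(snippet, 'html.parser')

# A list of names matches any of them; results come back in document order
names = [tag.name for tag in soup_demo.find_all(['a', 'b'])]
print(names)  # ['b', 'a', 'a']

# limit stops the search after the given number of matches
first_only = soup_demo.find_all('a', limit=1)
print(len(first_only))  # 1

# Unlike find(), find_all() returns an empty list when nothing matches
missing = soup_demo.find_all('img')
print(len(missing))  # 0
```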


(5) Find by attribute
    # Find every tag whose class is "link"
    link_tags = soup.find_all(class_='link')
    for tag in link_tags:
        print(tag['href'])

    # Find the tag whose id is "link2"
    link2 = soup.find(id='link2')
    print(link2['href'])  # Output: http://example.com/2
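Attribute searches can also be written with `href=True` (any value) or as CSS selectors via select() and select_one(). This is a sketch with an illustrative `snippet`; the third link and the extra "external" class are assumptions added to make the filtering visible.

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="story">'
    '<a href="http://example.com/1" class="link" id="link1">链接1</a>'
    '<a href="http://example.com/2" class="link external" id="link2">链接2</a>'
    '<a id="anchor">无链接</a>'
    '</p>'
)
soup_demo = BeautifulSoup(snippet, 'html.parser')

# class_ matches a tag if any one of its classes equals the value
by_class = [t['id'] for t in soup_demo.find_all(class_='link')]
print(by_class)  # ['link1', 'link2']

# href=True keeps only tags that have an href attribute at all
with_href = [t['id'] for t in soup_demo.find_all('a', href=True)]
print(with_href)  # ['link1', 'link2']

# The same kind of query written as CSS selectors
by_css = [t['id'] for t in soup_demo.select('p.story > a.external')]
print(by_css)  # ['link2']
print(soup_demo.select_one('#link2')['href'])  # http://example.com/2
```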


(6) Get parent and child tags
    # Get the parent tag
    parent_tag = first_link.parent
    print(parent_tag.name)  # Output: p

    # Get the children (text nodes are included, not only tags)
    children = parent_tag.children
    for child in children:
        print(child)
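Because .children also yields text nodes, loops that expect tags often need a filter. A sketch of two ways to get only the child tags, plus sibling navigation; `snippet` and `soup_demo` are illustrative names.

```python
from bs4 import BeautifulSoup, Tag

snippet = (
    '<p class="story">这是一个示例段落。'
    '<a href="http://example.com/1">链接1</a> '
    '<a href="http://example.com/2">链接2</a></p>'
)
soup_demo = BeautifulSoup(snippet, 'html.parser')
p = soup_demo.p

# All direct children: text, <a>, whitespace text, <a>
all_children = list(p.children)
print(len(all_children))  # 4

# Keep only the tags, either by type check or with recursive=False
tag_children = [c for c in p.children if isinstance(c, Tag)]
direct_tags = p.find_all(recursive=False)
print(len(tag_children), len(direct_tags))  # 2 2

# next_sibling may be a text node; find_next_sibling skips to the next tag
second = p.a.find_next_sibling('a')
print(second['href'])  # http://example.com/2
```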


(7) Get tag attributes
    # Get all attributes of a tag
    print(first_link.attrs)
    # Output: {'href': 'http://example.com/1', 'class': ['link'], 'id': 'link1'}

    # Get one specific attribute
    print(first_link['href'])  # Output: http://example.com/1
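Indexing with tag['name'] raises KeyError when the attribute is missing, while tag.get() returns None or a default. The sketch below also shows that class is returned as a list; `snippet` and `link` are illustrative names.

```python
from bs4 import BeautifulSoup

snippet = '<a href="http://example.com/1" class="link external" id="link1">链接1</a>'
link = BeautifulSoup(snippet, 'html.parser').a

# get() is the safe way to read an attribute that may be absent
print(link.get('title'))             # None
print(link.get('title', 'untitled')) # untitled

# has_attr() checks for presence without reading the value
print(link.has_attr('href'))         # True

# class can hold several values, so it comes back as a list
print(link['class'])                 # ['link', 'external']
print(link['id'])                    # link1
```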


(8) Modify the document
    # Change a tag's text
    first_link.string = "新的链接文本"
    print(first_link)
    # Output: <a class="link" href="http://example.com/1" id="link1">新的链接文本</a>
    # (attributes are printed in alphabetical order by default)

    # Add a new tag
    new_tag = soup.new_tag('a', href="http://example.com/3")
    new_tag.string = "链接3"
    soup.body.append(new_tag)
    print(soup.body)
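Attributes can be edited like dictionary entries, and whole tags can be removed with decompose(). A minimal sketch, using an illustrative `snippet` rather than the html_doc above:

```python
from bs4 import BeautifulSoup

snippet = (
    '<p>'
    '<a href="http://example.com/1" id="link1">链接1</a>'
    '<a href="http://example.com/2" id="link2">链接2</a>'
    '</p>'
)
soup_demo = BeautifulSoup(snippet, 'html.parser')

link1 = soup_demo.find(id='link1')
link1['href'] = 'https://example.com/1'  # change an existing attribute
link1['title'] = 'first'                 # add a new attribute
del link1['id']                          # remove an attribute

# decompose() removes the tag and its contents from the tree
soup_demo.find(id='link2').decompose()

result = str(soup_demo.p)
print(result)
# <p><a href="https://example.com/1" title="first">链接1</a></p>
```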

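4. Using different parsers

Beautiful Soup supports several parsers. The common ones are:
html.parser: Python's built-in parser; slower than lxml, but needs no extra installation.
lxml: fast and feature-rich; recommended. Install it with pip3 install lxml.
html5lib: the most lenient parser, but also the slowest. Install it with pip3 install html5lib.

soup = BeautifulSoup(html_doc, 'html.parser')  # use the built-in parser
soup = BeautifulSoup(html_doc, 'lxml')         # use the lxml parser
soup = BeautifulSoup(html_doc, 'html5lib')     # use the html5lib parser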
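Requesting a parser that is not installed raises bs4.FeatureNotFound. One possible way to cope is to try parsers in order of preference, as sketched below; `make_soup` is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=('lxml', 'html.parser')):
    """Hypothetical helper: parse with the first parser that is available."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError('none of the requested parsers is installed')

# Unclosed tags are repaired by whichever parser ends up being used
soup_demo = make_soup('<p>未闭合的段落<a href="/x">链接')
print(soup_demo.a['href'])  # /x
```

Different parsers may repair invalid markup differently, so the resulting tree can vary slightly depending on which one is picked.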

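5. A practical example

Suppose we want to extract every link from a web page:

import requests
from bs4 import BeautifulSoup

# Fetch the page
url = 'http://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
html_content = response.text

# Parse the page
soup = BeautifulSoup(html_content, 'lxml')

# Extract every link; href=True skips <a> tags that have no href attribute
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])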
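Links on real pages are often relative or duplicated. The sketch below resolves them against a base URL with urllib.parse.urljoin and removes duplicates; `extract_links` and the `sample` markup are hypothetical, and the function works on an HTML string so it can be tried without network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Hypothetical helper: absolute URLs of all <a href>, in order, no duplicates."""
    soup = BeautifulSoup(html, 'html.parser')
    seen = set()
    result = []
    for a in soup.find_all('a', href=True):
        absolute = urljoin(base_url, a['href'])  # resolve relative paths
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

sample = (
    '<a href="/about">关于</a>'
    '<a name="top">锚点</a>'
    '<a href="http://example.com/about">重复</a>'
    '<a href="news/1.html">新闻</a>'
)
found = extract_links(sample, 'http://example.com/blog/')
print(found)
# ['http://example.com/about', 'http://example.com/blog/news/1.html']
```

With requests, the same function could be called as extract_links(response.text, response.url).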

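Please credit the source when reposting. You are welcome to verify the sources cited in this article and to point out any errors or unclear wording.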