Python BeautifulSoup: extract text from a paragraph and split it by <br/>

python html web-scraping

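I'm new to Beautiful Soup.

How can I extract the text of a paragraph from HTML source, split that text wherever a <br/> occurs, and store it in an array, so that each element of the array is one chunk of the paragraph text (as delimited by <br/>)?

For example, for the following paragraph: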

<p>
    <strong>Pancakes</strong>
    <br/> 
    A <strong>delicious</strong> type of food
    <br/>
</p>

What I tried:

import bs4 as bs

soup = bs.BeautifulSoup("<p>Pancakes<br/> A delicious type of food<br/></p>")
p = soup.findAll('p')
p[0] = p[0].getText()
print(p)

Is there a way to code this so that I get an array containing the paragraph text split at every <br/> in the paragraph?

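Try this:

from bs4 import BeautifulSoup, NavigableString

html = '<p>Pancakes<br/> A delicious type of food<br/></p>'

soup = BeautifulSoup(html, 'html.parser')
p = soup.findAll('p')
# Keep only the text nodes that are direct children of the first <p>
result = [str(child).strip() for child in p[0].children
          if isinstance(child, NavigableString)]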

Update: deep recursion

from bs4 import BeautifulSoup, NavigableString, Tag

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"

soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p').find_all(text=True, recursive=True)
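
To see what this recursive approach yields, here is a minimal sketch. It uses `stripped_strings`, a standard bs4 generator, in place of `find_all(text=True)`; that substitution is my choice for illustration, not part of the answer above. Every text node becomes its own element, so the text is also split at the <strong> tags, not only at <br/>.

```python
from bs4 import BeautifulSoup

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields every text node under <p>, with surrounding whitespace removed
pieces = list(soup.find("p").stripped_strings)
print(pieces)  # ['Pancakes', 'A', 'delicious', 'type of food']
```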


Split only by <br/>

from bs4 import BeautifulSoup, NavigableString, Tag

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"

soup = BeautifulSoup(html, 'html.parser')
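text = ''
for child in soup.find_all('p')[0]:
    if isinstance(child, NavigableString):
        # Keep surrounding spaces so words next to inline tags stay separated
        text += str(child)
    elif isinstance(child, Tag):
        if child.name != 'br':
            # Inline tags such as <strong> contribute their text unchanged
            text += child.text
        else:
            # A <br/> marks a chunk boundary
            text += '\n'

# Collapse whitespace inside each chunk and drop empty chunks
result = [' '.join(chunk.split()) for chunk in text.split('\n') if chunk.strip()]
print(result)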
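
An alternative sketch of the same idea, in case it is useful: replace each <br/> with a sentinel string, let `get_text()` flatten the remaining tags, and split on the sentinel. The name `SEP` and its value are hypothetical, chosen on the assumption that the character never occurs in the page text. Note that `replace_with` modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

SEP = "\x00"  # hypothetical sentinel, assumed not to appear in the page text

# Swap each <br> for the sentinel; all other tags are left alone
for br in p.find_all("br"):
    br.replace_with(SEP)

# get_text() flattens the remaining tags (e.g. <strong>) into plain text
chunks = p.get_text().split(SEP)

# Collapse runs of whitespace and drop empty chunks (e.g. after a trailing <br/>)
result = [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
print(result)  # ['Pancakes', 'A delicious type of food']
```

Using a sentinel rather than '\n' keeps line breaks that already exist in the HTML source from being mistaken for <br/> boundaries.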


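Thanks for the solution! However, I was wondering whether it is possible to make it not "drop" the other tags in the paragraph. For example, with your code, if the paragraph is "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>", it outputs ['A', 'type of food'] instead of ['Pancakes', 'A delicious type of food']. Sorry to bother you again.

The output now looks like ['Pancakes', 'A', 'delicious', 'type of food'], but I'm still looking for ['Pancakes', 'A delicious type of food']. I only want to split the text at <br/>, not at any other tag.

['Pancakes A delicious type of food']? Are you sure? Not ['Pancakes', 'A delicious type of food']?

Oh, my mistake. Yes, I'm looking for the output ['Pancakes', 'A delicious type of food'].

Updated, as described above.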