How to identify whether a 'span' child tag exists in the 'p' tags returned by BeautifulSoup?

html python-3.x if-statement

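I am building a web scraper that scrapes an online novel from an index page; the code creates an epub file for each book of the novel. The novel's translators set up the web pages in two different formats.

The first format is a `p` tag with a `span` tag inside. The `span` tag carries a set of CSS for each part of the paragraph, depending on whether that part is normal text or italicized.

The other format is text directly inside a `p` tag, with no `span` tag and no CSS. I have been able to use BeautifulSoup to get just the part of the page that contains the novel. I have been trying to write an `if` statement that says: if `span` exists in the chapter content, run one piece of code, otherwise run another.

I tried using

    if chapter.find('span') != []:

and

    if chapter.find_all('span') != []:

from BeautifulSoup, but these calls return actual values rather than booleans. I tested this by printing "yes" or "no" depending on whether the chapter has the tag, and the output is either always "yes" or always "no" when I check two different chapters to see whether they have different formats.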

The code I am using:

    #get link for chapter 1 from index
    r = requests.get(data[1]['link'])
    soup = BeautifulSoup(r.content, 'html.parser')

    # if webpage announcement change 0 to 1
    chapter = soup.find_all('div', {"class" : "fr-view"})[0].find_all('p')

Depending on the chapter, the output is:

    #chapter equals this
    [<p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">Chapter 1 - title</span></p>,
    <p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">stuff</span></p>,
    <p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: italic; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">italizes</span><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap"> stuff</span></p>]

or:

    #chapter equals this
    [<p>Chapter 6 - title</p>,
    <p>stuff</p>]

I am trying to write an `if` statement that can read the chapter and tell me whether the `span` tag is present, so that I can run the correct code.

Using your code snippet:

html = """

<p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">Chapter 1 - title</span></p>,
<p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">stuff</span></p>,
<p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: italic; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">italizes</span><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap"> stuff</span></p>

<p>Chapter 6 - title</p>,
<p>stuff</p>
"""

Output:

found span
found span
found span
no span
no span

I think this is what you are looking for.
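The loop that produces the output above is not shown in this answer. The following is only a minimal sketch of what it may have looked like, assuming each `p` is tested with `find('span')` (the call mentioned in the comment at the end of this page). The shortened `html` string and the helper name `span_report` are illustrative, not taken from the original.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the html string above (inline styles trimmed).
html = """
<p dir="ltr"><span style="font-style: normal">Chapter 1 - title</span></p>
<p dir="ltr"><span style="font-style: italic">italizes</span><span> stuff</span></p>
<p>Chapter 6 - title</p>
<p>stuff</p>
"""

soup = BeautifulSoup(html, "html.parser")


def span_report(soup):
    """Return one 'found span' / 'no span' entry per <p> tag."""
    results = []
    for p in soup.find_all("p"):
        # Tag.find returns the first matching Tag, or None when nothing
        # matches, so comparing against None gives a clean boolean.
        if p.find("span") is not None:
            results.append("found span")
        else:
            results.append("no span")
    return results


for line in span_report(soup):
    print(line)
```

The check is done on each individual `p` tag because `find_all('p')` returns a list of tags, not a single tag.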

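In Beautiful Soup 4.7+, Beautiful Soup uses a new CSS selector library called Soup Sieve. Using `find_all` and `find` is a great way to conditionally filter tags, but I wanted to show an alternative approach in which you can do complex filtering with CSS selectors. Soup Sieve offers a lot of useful features, and since Beautiful Soup depends on it, it should already be installed if you are using Beautiful Soup 4.7+.

In this example, we simply search for `p` tags and then use Soup Sieve's API directly to create a filter to compare against the returned tags. Just another way of doing things.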

from bs4 import BeautifulSoup
import soupsieve as sv

html = """

<p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">Chapter 1 - title</span></p>,
<p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">stuff</span></p>,
<p dir="ltr"><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: italic; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap">italizes</span><span style="color: rgb(0, 0, 0); background-color: transparent; font-weight: 400; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap"> stuff</span></p>

<p>Chapter 6 - title</p>,
<p>stuff</p>
"""

soup = BeautifulSoup(html, "html.parser")
css_match = sv.compile(':has(span)')
for i in soup.select('p'):
    if css_match.match(i):
        print('found span')
    else:
        print('no span')
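Since the question asks for a single decision per chapter rather than per paragraph, the same compiled matcher can be combined with `any()`. This is a sketch only: the two small HTML documents and the helper name `chapter_uses_spans` are made up for illustration, and the `fr-view` lookup simply mirrors the question's code.

```python
from bs4 import BeautifulSoup
import soupsieve as sv

# Compiled once and reused for every paragraph.
has_span = sv.compile(":has(span)")


def chapter_uses_spans(paragraphs):
    """True if at least one <p> in the chapter contains a <span>."""
    return any(has_span.match(p) for p in paragraphs)


# Hypothetical chapters, one per format described in the question.
styled = BeautifulSoup(
    '<div class="fr-view"><p dir="ltr">'
    '<span style="font-style: italic">stuff</span></p></div>',
    "html.parser",
)
plain = BeautifulSoup(
    '<div class="fr-view"><p>Chapter 6 - title</p><p>stuff</p></div>',
    "html.parser",
)

for soup in (styled, plain):
    chapter = soup.find_all("div", {"class": "fr-view"})[0].find_all("p")
    if chapter_uses_spans(chapter):
        print("span format")
    else:
        print("plain format")
```

Using `any()` stops at the first paragraph that matches, so a chapter is treated as the `span` format as soon as one styled paragraph is seen.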

I tried this method, but it only outputs "found span" for both formats. I think this is because `i.find('span')` is not a boolean function, but a function that returns the string value of the `span` tag.
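For reference, `find` returns either a `Tag` object or `None`, not a string or a boolean, and `find_all` returns a list-like result. The sketch below, using made-up two-paragraph HTML, shows how the comparisons from the question behave on a single `p` tag. It does not reproduce the commenter's exact setup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><span>a</span></p><p>b</p>", "html.parser")
with_span, without_span = soup.find_all("p")
paragraphs = (with_span, without_span)

# find returns a Tag or None; neither is equal to [], so "!= []"
# is True in both cases and cannot tell the formats apart.
always_true = [p.find("span") != [] for p in paragraphs]

# Comparing against None distinguishes the two cases.
by_none = [p.find("span") is not None for p in paragraphs]

# find_all returns a list-like ResultSet that is empty (falsy)
# when nothing matches, so its truthiness also works.
by_truthiness = [bool(p.find_all("span")) for p in paragraphs]

print(always_true, by_none, by_truthiness)
```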