Python Beautiful Soup: extract tag contents but exclude some using regex or otherwise

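I have an HTML file containing multiple tables (two of them are shown below). I only want to extract the strings Quatermass 2 and Ghostbusters from any td tag that has width "41%". My problem is that each table also has a "Title" cell with width "41%", which I do not want to extract.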

        <table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
            <tr>
                <td width='41%' align='left'>Title</td>
                <td width='10%' align='left'>Year</td>
                    <table width='99%' border='0' cellpadding='1' class="normal">
            <tr>
                <td width='41%' align='left'><strong>Quatermass 2</strong></td>
                <td width='10%' align='left'>1957</td>


        <table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
            <tr>
                <td width='41%' align='left'>Title</td>
                <td width='10%' align='left'>Year</td>
                    <table width='99%' border='0' cellpadding='1' class="normal">
            <tr>
                <td width='41%' align='left'><strong>Quatermass 3</strong></td>
                <td width='10%' align='left'>1958</td>

The current output is:

Title
Quatermass 2
Title
Ghostbusters

I tried using a not-equal operator in the print statement, but it does not work:

for name in soup.find_all("td", {"width": "41%"}):
    print((name).get_text(!='Title'))

Can I add a regex clause to the find_all call to exclude the 'Title' string?

You can pass the regex ^(?!Title$), which matches any string other than exactly Title, to the find_all method:

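import re
# ... (soup is the BeautifulSoup object built from the HTML in the question)
# Keep only the 41%-wide cells whose string is not exactly "Title".
for name in soup.find_all("td", {"width": "41%"}, string=re.compile(r'^(?!Title$)')):
    print(name.get_text())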

Output:

Quatermass 2
Quatermass 3
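
To see why the negative lookahead excludes only the exact string Title, here is a minimal sketch that uses the re module on its own. The sample strings are made up for illustration, and it assumes Beautiful Soup applies a compiled pattern with search(), which is my understanding of how the string argument is matched.

```python
import re

# Negative lookahead anchored at the start: the match fails only when
# the whole string is exactly "Title".
pattern = re.compile(r"^(?!Title$)")

# Hypothetical cell texts, not taken from the question's HTML.
samples = ["Title", "Quatermass 2", "Title (1957)", "Subtitle"]

# Keep every string the pattern matches.
kept = [s for s in samples if pattern.search(s)]
print(kept)
```

Strings that merely start with or contain Title, such as "Title (1957)", are still kept; only the exact header text is dropped.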

Use a CSS selector with :not(:contains(text)).

Your sample HTML does not contain Ghostbusters, only Quatermass 3, so the example below prints:

Quatermass 2
Quatermass 3

from bs4 import BeautifulSoup
html = '''<table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
            <tr>
                <td width='41%' align='left'>Title</td>
                <td width='10%' align='left'>Year</td>
                    <table width='99%' border='0' cellpadding='1' class="normal">
            <tr>
                <td width='41%' align='left'><strong>Quatermass 2</strong></td>
                <td width='10%' align='left'>1957</td>


        <table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
            <tr>
                <td width='41%' align='left'>Title</td>
                <td width='10%' align='left'>Year</td>
                    <table width='99%' border='0' cellpadding='1' class="normal">
            <tr>
                <td width='41%' align='left'><strong>Quatermass 3</strong></td>
                <td width='10%' align='left'>1958</td>'''
soup = BeautifulSoup(html, 'html.parser')
# Select the 41%-wide cells, skipping any whose text contains "Title".
for tag in soup.select("td[width='41%']:not(:contains(Title))"):
    print(tag.text)
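
If neither a regex nor a CSS pseudo-class is wanted, the filtering can also be done in plain Python after find_all, which is close to what the question's print attempt was aiming for. The following is only a sketch: the helper name titles_without_header and the trimmed-down HTML are assumptions made for illustration, not part of the original answers.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed version of the question's two tables (assumed for illustration).
html = """
<table>
  <tr><td width='41%'>Title</td><td width='10%'>Year</td></tr>
  <tr><td width='41%'><strong>Quatermass 2</strong></td><td width='10%'>1957</td></tr>
</table>
<table>
  <tr><td width='41%'>Title</td><td width='10%'>Year</td></tr>
  <tr><td width='41%'><strong>Quatermass 3</strong></td><td width='10%'>1958</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def titles_without_header(soup, header="Title"):
    """Return the text of every 41%-wide cell except the header cell."""
    cells = soup.find_all("td", attrs={"width": "41%"})
    # strip=True trims surrounding whitespace so the comparison is exact.
    texts = (cell.get_text(strip=True) for cell in cells)
    return [text for text in texts if text != header]


titles = titles_without_header(soup)
print(titles)
```

This does an exact comparison, so a film whose name merely contains the word Title is kept, whereas the :contains selector would drop it. Note also that recent Soup Sieve releases deprecate :contains in favor of :-soup-contains, so the selector above may emit a warning depending on the installed version.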