Python BeautifulSoup: removing tags that are followed by a specific tag with a specific attribute


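I'm new here, and right now I'm amazed by BeautifulSoup. However, there is something I can't manage to do.

What I want is to remove some tags that are followed by a specific tag with a specific attribute.

Let me show you: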

#Import modules
from bs4 import BeautifulSoup
import requests

#Parse URL
url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

#This is the table which I want to extract
table = soup.find_all('table')[4]

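After getting the right table to work with, I see that some 'tr' tags are followed by a 'td' with the attribute 'colspan'.

What I ultimately want is to delete those specific 'tr' tags, because I need the 'tr' tags that come after them.

There are 3 'td' tags with the 'colspan' attribute in total: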

#Output for 'td' with 'colspan'

print(table.select('td[colspan]'))

[<td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
 <td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
 <td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>]

Here is an excerpt of the HTML with an example of one specific 'tr' I want to delete, marked with a comment that says THIS ONE!:

 <td align="center">
    2:1
   </td>
   <td class="one">
    AC Milan
   </td>
   <td>
    <a href="/Cagliari-AC_Milan-2320071-2320071.html">
     <img alt="More details about  -  soccer game" border="0" height="14" src="/imgs/detail3.gif" width="14"/>
    </a>
   </td>
  </tr>
  <tr class="predict"> <!-- THIS ONE! -->
   <td colspan="13">
    <img height="10" src="/imgs/line.png" width="100%"/>
   </td>
   <tr class="predict">
    <td>
     27 May
    </td>
    <td>
     38
    </td>
    <td>
     FT
    </td>
    <td align="right" class="one">

By the way, I also want to delete the 'td colspan' and the 'img'.

Any ideas?

*Latest version of Python installed

*Latest version of the BeautifulSoup module installed

Find the specific tag you want to delete, then call decompose() or extract() on it.
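As a rough illustration of the two removal calls, here is a minimal sketch on a made-up fragment (the HTML and the ids "a" and "b" are assumptions for the example, not taken from the page in the question). It shows that decompose() discards the tag, while extract() detaches it and hands it back:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only to compare the two calls.
soup = BeautifulSoup('<div><p id="a">one</p><p id="b">two</p></div>', 'html.parser')

# decompose() removes the tag from the tree and destroys it.
soup.find('p', id='a').decompose()

# extract() removes the tag from the tree and returns it, so it can be reused.
removed = soup.find('p', id='b').extract()

print(soup)     # the <div> no longer contains either <p>
print(removed)  # the detached <p id="b"> is still a usable tag
```

If the removed tag is never needed again, decompose() is probably the simpler choice; extract() is useful when the tag should be inspected or re-inserted elsewhere.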

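Edit

To find the specific tags, you can first find all the tr tags, then check whether each one has a td with the attribute colspan=13 and, if so, decompose it.

You already have the table and td[colspan], so you can get the parent element of each td in the table. Then change the parser from html.parser to lxml, like this: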

from bs4 import BeautifulSoup
import requests

#Parse URL
url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml') #change the parser from html.parser to lxml

#This is the table which I want to extract
table = soup.find_all('table')[4]
for tdcol in table.select('td[colspan]'):
    tdcol.parent.decompose()
print(table.prettify())

The following items are then removed from the table:

<tr class="predict"><td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td></tr>
<tr class="predict"><td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td></tr>
<tr class="predict"><td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td></tr>
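The parser seems to matter here because the separator row in the excerpt is never closed. With html.parser the rows after it appear to end up nested inside it, so decomposing the parent row takes them along. The sketch below is one possible parser-independent alternative, not the answer's method: it removes only the td (and the img inside it) and then unwraps the parent row, so anything nested in it is kept. The HTML fragment is a made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the page: the separator <tr> is never closed.
html = (
    '<table>'
    '<tr class="predict"><td>AC Milan</td></tr>'
    '<tr class="predict"><td colspan="13"><img src="/imgs/line.png"/></td>'
    '<tr class="predict"><td>27 May</td></tr>'
    '</table>'
)

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

for td in table.select('td[colspan]'):
    row = td.parent   # keep a reference before the td is destroyed
    td.decompose()    # removes the td together with the img inside it
    row.unwrap()      # drops the <tr> tag itself but keeps whatever is nested in it

# Text of the rows that survive the cleanup.
remaining = [tr.get_text(strip=True) for tr in table.find_all('tr')]
print(remaining)
```

With lxml the separator row would have no nested rows, so unwrap() simply removes the empty tag; the same loop should therefore work with either parser.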

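Please state clearly which tr tags you want to delete.

Sorry @MD.KhairulBasar, there are just three 'tr' tags, which look like this: but the problem is that there are so many of them... I want the ones that come after these tags:

@MD.KhairulBasar I updated my answer, and it works well.

This looks like what I want... It does delete those tags, but the rest of the information is deleted too! Is there a way to keep the rest of the information?

@Tiny.D What I want to delete is just three 'tr' tags, which look like this: but the problem is that there are so many of them... I want the ones that come after these:

@Tiny.D In addition, something strange is happening. When I run the for loop, I get this message: 'NoneType' object has no attribute 'decompose'. But when I run it again, the rest of the information is deleted.

This is exactly what I wanted. Thanks @Tiny.D. The question is... why use lxml instead of html.parser? I need to figure that out!

Amazing, this works great @MD.KhairulBasar. The question is, why use lxml instead of html.parser?

@edmudowright The reason I use the lxml parser is that BeautifulSoup's official documentation recommends it. lxml is faster than html.parser. If the HTML is malformed, lxml will repair it.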

import requests
from bs4 import BeautifulSoup

url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')

table = soup.find_all('table')[4]
# Check every tr.predict row and remove the ones holding a td with colspan="13"
for t in table.find_all("tr", class_="predict"):
    check = t.find("td", colspan="13")
    if check is not None:
        t.decompose()
