Python: Using BeautifulSoup to exclude inner tags and select specific tags

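This may be a basic question, but I haven't been able to figure it out. I'm still learning how to use BeautifulSoup.

I'm trying to parse HTML that looks like this: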

<dl class="">
<div>
<ol>
<li><label>Tournament Name</label>TCG Saturday</li>
<li><label id="tournament_id" data-tournament-id="000002">Tournament ID</label>000002</li>
<li><label>Category</label>TCG: Unlimited</li>
<li><label>Registration</label>12:15PM to 1:15PM</li>
<li><label>Status</label>Complete</li>
</ol>
</div>
</dl>
I tried:

soup = BeautifulSoup(html)
for lis in soup.find_all('li'):
    print(lis.text)
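But this also reads the text of each label tag and runs it together with the text that follows. It also picks up text from other li elements on the page and prints it.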

Tournament NameTCG Saturday
Tournament ID000002
CategoryTCG: Unlimited
Registration12:15PM to 1:15PM
StatusComplete
I can also use:

soup = BeautifulSoup(html)
for lis in soup.find_all('label'):
    print(lis.text)
But this doesn't return the text that comes after each label (which is understandable).

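I can't figure out how to parse this HTML so that I get either:

1) only the text inside each li tag, excluding the text of the label tag (from the output shown above), or

2) the text in the li tag for a specific label (for example, specify the "Tournament ID" label and get back "000002").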

From the documentation:

decompose()
Removes a tag from the tree, then completely destroys it and its contents.

Code:

from bs4 import BeautifulSoup

data = '''
<dl class="">
<div>
<ol>
<li><label>Tournament Name</label>TCG Saturday</li>
<li><label id="tournament_id" data-tournament-id="000002">Tournament ID</label>000002</li>
<li><label>Category</label>TCG: Unlimited</li>
<li><label>Registration</label>12:15PM to 1:15PM</li>
<li><label>Status</label>Complete</li>
</ol>
</div>
</dl>
'''

soup = BeautifulSoup(data, 'html.parser')
for lis in soup.find_all('li'):
    lis.label.decompose()
print(soup.text)
Output:

TCG Saturday
000002
TCG: Unlimited
12:15PM to 1:15PM
Complete
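The decompose() approach covers case 1 but modifies the tree, and `lis.label.decompose()` raises an AttributeError for any li that has no label. The following is a minimal sketch of a non-destructive alternative that also covers case 2 from the question. The helper names `own_text` and `value_for_label` are hypothetical, and the sample HTML is a shortened copy of the question's markup with one label-less li added as an assumption, to stand in for the other list items on the page.

```python
from bs4 import BeautifulSoup

# Shortened sample; the last <li> is an assumed stand-in for unrelated list items.
html = '''
<ol>
<li><label>Tournament Name</label>TCG Saturday</li>
<li><label id="tournament_id" data-tournament-id="000002">Tournament ID</label>000002</li>
<li><label>Category</label>TCG: Unlimited</li>
<li>No label here</li>
</ol>
'''

soup = BeautifulSoup(html, 'html.parser')


def own_text(tag):
    """Join only the text nodes that are direct children of `tag`."""
    # recursive=False skips text nested inside child tags such as <label>.
    return ''.join(tag.find_all(string=True, recursive=False)).strip()


def value_for_label(soup, label_text):
    """Return the text following the <label> whose text equals `label_text`, or None."""
    label = soup.find('label', string=label_text)
    if label is None or label.parent is None:
        return None
    return own_text(label.parent)


# Case 1: text of every <li> that contains a <label>, without the label text.
values = [own_text(li) for li in soup.find_all('li') if li.label is not None]
print(values)

# Case 2: look up the value that belongs to one specific label.
print(value_for_label(soup, 'Tournament ID'))
```

Because nothing is removed from the tree, the labels stay available for later lookups, and the `li.label is not None` check skips list items that do not follow the label/value pattern.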