Python: extracting class and text
I want to extract data from a website, and I need to know whether it includes certain equipment. In the example below, I know that A has CD, but it does not have CDA. From my code, I extract all the li elements from the HTML, as follows:
<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li>
<li>
<p>b1<span>1</span></p>
</li>
<li>
<p>b2<span>2</span></p>
</li>
The output I want is:
specChecked, CD
specChecked, VCD
, CDA
(Alternatively, I could replace specChecked with 1 and the empty class with 0.)
- You can use has_attr to check whether an li has a class attribute, link.get("class") to get the class value, and link.text to extract the text.
You can do something like the following to get the items with the desired class as well as the items with an empty class:
from bs4 import BeautifulSoup
content = """
<div class="ABC">
<h3>A</h3>
<ul>
<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li>
</ul>
<h3>B</h3>
<div class="buyCarDetailContentSpecContent ">
<ul>
<li>
<p>b1<span>1</span></p>
</li>
<li>
<p>b2<span>2</span></p>
</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(content, "html.parser")
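# Select li elements whose class is "specChecked" or the empty string
for item in soup.find_all('li', class_=["specChecked", ""]):
    # item['class'] is a list of class names, so join it before printing
    print("{}, {}".format(' '.join(item['class']), item.text))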
There is no need to check whether each li has a class attribute; you can use soup.find_all('li', class_=True) instead.
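As a rough illustration of the class_=True suggestion, the sketch below runs it against a trimmed copy of the markup from this page. The variable names html, with_class, and texts are made up for the example, and get_text(strip=True) is used only to tidy the printed text.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup used in the answers on this page.
html = """
<ul>
  <li class="specChecked"><p>CD</p></li>
  <li class="specChecked"><p>VCD</p></li>
  <li class=""><p>CDA</p></li>
</ul>
<ul>
  <li><p>b1<span>1</span></p></li>
  <li><p>b2<span>2</span></p></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# class_=True asks for li tags that have a class attribute;
# the b1/b2 items have none, so find_all leaves them out.
with_class = soup.find_all("li", class_=True)

# Collect the visible text of each selected item.
texts = [li.get_text(strip=True) for li in with_class]

for li in with_class:
    # The class value is a list of names, so join it before printing.
    print(" ".join(li.get("class", [])), li.get_text(strip=True), sep=", ")
```

This moves the filtering into the find_all call, so the loop body no longer needs a has_attr check.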
Code for the has_attr approach:
s = """<div class="ABC">
<h3>A</h3>
<ul>
<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li>
</ul>
<h3>B</h3>
<div class="buyCarDetailContentSpecContent ">
<ul>
<li>
<p>b1<span>1</span></p>
</li>
<li>
<p>b2<span>2</span></p>
</li>
</ul>
</div>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")
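for link in soup.find_all('li'):
    # Skip li elements that have no class attribute at all
    if link.has_attr("class"):
        # link.get("class") returns the class names as a list
        print(link.get("class", ""), link.text)
Output: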
[u'specChecked'], u'CD'
[u'specChecked'], u'VCD'
[u''], u'CDA'
Output of the find_all('li', class_=["specChecked", ""]) loop:
specChecked, CD
specChecked, VCD
, CDA
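The question also mentions replacing specChecked with 1 and the empty class with 0. One possible way to do that is sketched below. The names spec_flags and sample are hypothetical, and the markup is a trimmed copy of the HTML used on this page.

```python
from bs4 import BeautifulSoup


def spec_flags(html):
    """Return (flag, text) pairs for every li that has a class attribute.

    flag is 1 when "specChecked" is among the classes, otherwise 0.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.find_all("li"):
        classes = li.get("class")  # None when the attribute is absent
        if classes is None:
            continue  # skip items such as b1/b2 that carry no class
        flag = 1 if "specChecked" in classes else 0
        rows.append((flag, li.get_text(strip=True)))
    return rows


# Trimmed copy of the markup used in the answers on this page.
sample = """
<ul>
  <li class="specChecked"><p>CD</p></li>
  <li class="specChecked"><p>VCD</p></li>
  <li class=""><p>CDA</p></li>
</ul>
<ul>
  <li><p>b1<span>1</span></p></li>
</ul>
"""

for flag, text in spec_flags(sample):
    print("{}, {}".format(flag, text))
# 1, CD
# 1, VCD
# 0, CDA
```

Checking for the attribute with li.get("class") is None keeps the empty class="" item (flag 0) while still dropping items that have no class attribute.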