Python 3.x: How to ignore the text inside a class when scraping data from a website
I am trying to extract comments from a website. Whenever someone replies to a comment, the earlier post is included in the reply as a quote. I want to ignore these quoted posts when extracting the comments.
import requests
from bs4 import BeautifulSoup

url = "https://www.f150forum.com/f118/do-all-2018-f150-trucks-come-adaptive-cruise-control-369065/index2/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Collect the text of every div with class "ism-true" (the comments)
comments_lst = soup.findAll('div', attrs={"class": "ism-true"})
comments = []
for item in comments_lst:
    result = [item.get_text(strip=True, separator=" ")]
    comments.append(result)

# Collect the text of every div with class "panel alt2" (the quoted posts)
quotes = []
for item in soup.findAll('div', attrs={"class": "panel alt2"}):
    result = [item.get_text(strip=True, separator=" ")]
    quotes.append(result)
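In the final result, I do not want the data in the quotes list to be included in my comments. I tried using an if statement, but the result was not correct.
For example, comments[6] gives the following result: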
'Quote: Originally Posted by jeff_the_pilot What the difference between adaptive cruise control on 2018 versus 2017? I believe mine brakes if I encroach another vehicle. It will work in stop and go traffic!'
My expected result:
It will work in stop and go traffic!
You need to add some logic to remove the text contained in the divs with class panel alt2:
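comments = []
for item in comments_lst:
    result = [item.get_text(strip=True, separator=" ")]
    # The := assignment expression requires Python 3.8 or later
    if div := item.find('div', class_="panel alt2"):
        # Split the full text on the last word of the quote and keep what follows
        result[0] = ' '.join(result[0].split(div.text.split()[-1])[1:])
    comments.append(result)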
>>> comments[6]
[' It will work in stop and go traffic!']
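The string handling in that answer can be looked at in isolation. The sketch below uses only the standard library and a hypothetical helper name, strip_quoted_prefix, which is not part of the original answer. The quote text is an assumption about what the panel alt2 div contains for this post. It also shows the main caveat: the split relies on the last word of the quote, so it can misbehave if that word appears elsewhere in the post.

```python
def strip_quoted_prefix(full_text, quote_text):
    """Drop everything up to the last word of the quote (hypothetical helper).

    Mirrors the expression used in the answer:
    ' '.join(full_text.split(last_word)[1:])
    """
    words = quote_text.split()
    if not words:
        # Nothing was quoted, so there is nothing to strip
        return full_text
    last_word = words[-1]
    # Everything after the first occurrence of last_word is kept;
    # later occurrences of last_word are dropped and replaced by a space
    return " ".join(full_text.split(last_word)[1:])


# Assumed quote text for the post shown in the question
quote = ("Originally Posted by jeff_the_pilot What the difference between "
         "adaptive cruise control on 2018 versus 2017? I believe mine brakes "
         "if I encroach another vehicle.")
full = "Quote: " + quote + " It will work in stop and go traffic!"

reply = strip_quoted_prefix(full, quote)
print(repr(reply))
```

As in the answer's output, the result keeps a leading space, so calling .strip() on it may be worthwhile.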
This will get all messages without the quoted posts:
import requests
from bs4 import BeautifulSoup

url = "https://www.f150forum.com/f118/do-all-2018-f150-trucks-come-adaptive-cruise-control-369065/index2/"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

msgs = []
for msg in soup.select('[id^="post_message_"]'):
    # Remove every div that wraps a "Quote:" label before reading the text.
    # Note: newer soupsieve releases prefer :-soup-contains() over :contains()
    for div in msg.select('div:has(> div > label:contains("Quote:"))'):
        div.extract()
    msgs.append(msg.get_text(strip=True, separator='\n'))

#print(msgs) # <-- uncomment to see all messages without Quoted messages
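print(msgs[6])

Prints:

It will work in stop and go traffic!

It says invalid syntax near ":=". I tried using "!=" instead, but it threw an error saying "name 'div' is not defined".

@anonymous13 Oops, I was using Python 3.8 syntax. You can replace it with div = item.find('div', class_="panel alt2") on its own line, followed by if div: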
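The idea of removing the quote from the parsed tree before reading the text can be tried offline, without the := operator. This is a minimal sketch under stated assumptions: the HTML snippet is made up for illustration and only imitates the forum's markup (a wrapper div holding a "Quote:" label and a panel alt2 div), and it assumes the direct parent of each panel alt2 div is the wrapper to remove.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates a post with a quote and a post without one
html = """
<div id="post_message_1">
  <div>
    <div><label>Quote:</label></div>
    <div class="panel alt2">Originally Posted by someone: Old text.</div>
  </div>
  It will work in stop and go traffic!
</div>
<div id="post_message_2">No quote here.</div>
"""

soup = BeautifulSoup(html, "html.parser")

messages = []
for msg in soup.select('[id^="post_message_"]'):
    # Re-query after each removal so nested quotes are never touched twice
    panel = msg.select_one("div.panel.alt2")
    while panel is not None:
        # Assumption: the panel's parent is the quote wrapper, not the post itself
        panel.parent.decompose()
        panel = msg.select_one("div.panel.alt2")
    messages.append(msg.get_text(strip=True, separator=" "))

print(messages)
```

Because the quote is removed as a tree node rather than by string splitting, the result does not depend on which words the quote or the reply happen to contain.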