Python 3.x: How to ignore the text inside one class when scraping a website with Python

python-3.x web-scraping

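I am trying to extract comments from a website. Whenever someone replies to a comment, the earlier post they quoted is included in the reply's text. I want to leave those quoted posts out when extracting.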

import requests
from bs4 import BeautifulSoup

url = "https://www.f150forum.com/f118/do-all-2018-f150-trucks-come-adaptive-cruise-control-369065/index2/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Each comment, with any quoted post still embedded in its text
comments_lst = soup.find_all('div', attrs={"class": "ism-true"})
comments = []
for item in comments_lst:
    result = [item.get_text(strip=True, separator=" ")]
    comments.append(result)

# The quoted posts on their own
quotes = []
for item in soup.find_all('div', attrs={"class": "panel alt2"}):
    result = [item.get_text(strip=True, separator=" ")]
    quotes.append(result)

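In the final result, I don't want the data in the quotes list to appear in my comments. I tried filtering with an if statement, but the result was not correct.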

For example, comments[6] gives the following result:

'Quote: Originally Posted by jeff_the_pilot What the difference between adaptive cruise control on 2018 versus 2017? I believe mine brakes if I encroach another vehicle. It will work in stop and go traffic!'

My expected result:

It will work in stop and go traffic!

You need to add some logic to remove the text contained in the divs with class panel alt2:

comments = []
for item in comments_lst:
    result = [item.get_text(strip=True, separator=" ")]
    # The := operator requires Python 3.8 or later
    if div := item.find('div', class_="panel alt2"):
        # Keep only what follows the last word of the quoted post
        result[0] = ' '.join(result[0].split(div.text.split()[-1])[1:])
    comments.append(result)

>>> comments[6]
[' It will work in stop and go traffic!']
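
Splitting on the last word of the quote works for this post, but it may misfire if that word appears again, either earlier in the quote or in the reply itself. As a rough sketch of one alternative, the hypothetical helper below (strip_leading_quote is not part of the answer above) drops the quote only when the post actually starts with it, comparing word by word. It uses plain strings copied from the question; on the real page both strings would presumably need to come from get_text called with the same arguments so that the words line up.

```python
def strip_leading_quote(full_text, quote_text):
    """Drop quote_text from the start of full_text, comparing word by word."""
    full_words = full_text.split()
    quote_words = quote_text.split()
    # Only strip when the post really starts with the quoted words.
    if quote_words and full_words[:len(quote_words)] == quote_words:
        return " ".join(full_words[len(quote_words):])
    # Otherwise return the text unchanged (apart from normalised whitespace).
    return " ".join(full_words)


# Sample strings taken from the question, standing in for scraped text.
quote = ("Quote: Originally Posted by jeff_the_pilot What the difference between "
         "adaptive cruise control on 2018 versus 2017? I believe mine brakes if I "
         "encroach another vehicle.")
post = quote + " It will work in stop and go traffic!"

print(strip_leading_quote(post, quote))
```

A post whose quote is not at the very start is returned as-is by this sketch, so removing the quote element from the parsed tree before reading the text is likely the more robust route.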

This will get all messages without the quoted parts:

import requests
from bs4 import BeautifulSoup

url = "https://www.f150forum.com/f118/do-all-2018-f150-trucks-come-adaptive-cruise-control-369065/index2/"

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

msgs = []
for msg in soup.select('[id^="post_message_"]'):
    # Remove each quote wrapper; :-soup-contains replaces the deprecated :contains
    for div in msg.select('div:has(> div > label:-soup-contains("Quote:"))'):
        div.extract()
    msgs.append(msg.get_text(strip=True, separator='\n'))

#print(msgs) # <-- uncomment to see all messages without Quoted messages

print(msgs[6])
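
To try the selector without hitting the forum, here is a small offline sketch. The HTML is a hypothetical, simplified guess at the page's post structure (a wrapper div holding a "Quote:" label and the panel alt2 div), not markup copied from the site, and it assumes a soupsieve version that supports :-soup-contains (2.1 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a simplified guess at the forum's post structure.
html = """
<div id="post_message_1">
  <div>
    <div><label>Quote:</label></div>
    <div class="panel alt2">Originally Posted by jeff_the_pilot
      I believe mine brakes if I encroach another vehicle.</div>
  </div>
  It will work in stop and go traffic!
</div>
<div id="post_message_2">No quote in this one.</div>
"""

soup = BeautifulSoup(html, "html.parser")

msgs = []
for msg in soup.select('[id^="post_message_"]'):
    # Match the wrapper div whose direct child div holds the "Quote:" label
    for div in msg.select('div:has(> div > label:-soup-contains("Quote:"))'):
        div.decompose()  # remove the whole quote block from the tree
    msgs.append(msg.get_text(strip=True, separator="\n"))

print(msgs)
```

Because the quote block is removed from the parsed tree before get_text is called, the result does not depend on what words the quote or the reply contain.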

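It says invalid syntax near ":=". I tried using "!=" instead, but that raised an error saying "name 'div' is not defined".

@anonymous13 Oops, I was using Python 3.8 syntax. You can replace it with div = item.find('div', class_="panel alt2") followed by if div: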

It will work in stop and go traffic!