Python: Web scraping with Selenium and BeautifulSoup does not update the extracted code after scrolling

python selenium-webdriver web-scraping beautifulsoup selenium-chromedriver

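I'm trying to scrape reviews of some games on Steam. The reviews page shows only 10 reviews until you scroll to the bottom of the page, at which point more are loaded. I use Selenium to scroll, but the BeautifulSoup object, which I expect to contain 20 reviews, still has only 10. Here is my code: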

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome(r'E:\Download\chromedriver.exe')
driver.get('https://steamcommunity.com/app/466560/reviews/?browsefilter=toprated&snr=1_5_100010_')
SCROLL_PAUSE_TIME = 0.5
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')

How can I fix it?

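You need to wait until the element with ID action_wait is no longer visible. To detect that there are no more reviews, look for the corresponding text on the page, or simply set the maximum number of reviews you want.

In this example the result is limited to 100. You can increase it, but if you don't want to wait any longer, just press Ctrl+C and the data collected so far will be processed by BeautifulSoup.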

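from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# driver is the webdriver.Chrome instance created in the question
driver.get('https://.....')
maxResult = 100
currentResults = 0
pageSource = ''

try:
    print('press "Ctrl + C" to stop loop and process using BeautifulSoup.')
    while currentResults < maxResult:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # wait until the loading indicator (ID action_wait) is hidden again
        WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.ID, "action_wait")))
        currentResults = len(driver.find_elements(By.CSS_SELECTOR, '.apphub_Card.modalContentLink.interactable'))
        print('currentResults: %s' % currentResults)
        # take a fresh snapshot of the page after every scroll
        pageSource = driver.page_source
except KeyboardInterrupt:
    print('Cancelled by user')
except Exception:
    # any other error (for example a wait timeout): parse what was captured so far
    pass

soup = BeautifulSoup(pageSource, 'html.parser')

reviews = soup.select('.apphub_Card.modalContentLink.interactable')

print('reviews count by BeautifulSoup: %s' % len(reviews))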


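The page uses jQuery to update itself, loading 10 records per scroll. Each request uses an offset to fetch the next set. When the list is exhausted, an end-of-list text becomes visible, and you can use that to scroll all the way to the end. If you want to stop at a particular point instead, the loop exit condition should be that len(d.find_elements(By.CSS_SELECTOR, '.reviewInfo')) has reached the number of reviews you want.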
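The exit condition described in this answer can be sketched as a small loop. This is only an illustration under stated assumptions, not the answerer's code: the names scroll_until, scroll_reviews, pause and max_idle_rounds are hypothetical, and stopping after a few rounds without growth is a stand-in for checking the end-of-list text. The Selenium wiring assumes a driver d that is already on the reviews page.

```python
import time
from typing import Callable


def scroll_until(scroll: Callable[[], None],
                 count_items: Callable[[], int],
                 target: int,
                 pause: float = 1.0,
                 max_idle_rounds: int = 3) -> int:
    """Scroll until at least `target` items are loaded or the list stops growing.

    `scroll` performs one scroll and `count_items` returns how many items
    are currently on the page; both are supplied by the caller.
    """
    count = count_items()
    idle_rounds = 0
    while count < target and idle_rounds < max_idle_rounds:
        scroll()
        time.sleep(pause)  # give the page time to append the next batch
        new_count = count_items()
        # A round without growth may mean the list is exhausted.
        idle_rounds = idle_rounds + 1 if new_count == count else 0
        count = new_count
    return count


def scroll_reviews(d, target: int) -> int:
    """Hypothetical wiring of scroll_until to a Selenium driver `d`."""
    # Imported lazily so scroll_until itself has no Selenium dependency.
    from selenium.webdriver.common.by import By
    return scroll_until(
        scroll=lambda: d.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);"),
        count_items=lambda: len(d.find_elements(By.CSS_SELECTOR, ".reviewInfo")),
        target=target,
    )
```

After the loop returns, read d.page_source again and pass that fresh string to BeautifulSoup, since a soup built before scrolling does not pick up reviews that were loaded later.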


I did this and checked len(soup.text); each time I scrolled down and got new text, it kept increasing.

@IslamTaha I don't quite understand. What do you mean by "each time I scrolled down"? driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")?

It has actually updated now. Really strange, because when I tried printing the soup 30 minutes ago it was still the same.