Scrolling a web page with Selenium WebDriver in Python

python selenium selenium-webdriver web-scraping

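I am currently using Selenium WebDriver with Python to parse this web page (the URL is in the code below) and extract all startup URLs. I have tried all the relevant methods mentioned in a related post, as well as other suggestions found online.

However, none of them worked for this site. It only loads the first 25 startups. Some sample code: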

import time
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

webdriver = webdriver.Chrome(executable_path='chromedriver')

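# Write into csv file
# Note: BLD is not defined in this snippet; it is presumably a pathlib.Path
# pointing at the project's build directory.
filename = "startups_urls.csv"
f = open(BLD / "processed/startups_urls.csv", "w")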
headers = "startups_urls\n"
f.write(headers)

url = "https://startup-map.berlin/companies.startups/f/all_locations/allof_Berlin/data_type/allof_Verified"

webdriver.get(url)
time.sleep(3)

# Get scroll height
last_height = webdriver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    webdriver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(3)

    # Calculate new scroll height and compare with last scroll height
    new_height = webdriver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

htmlSource = webdriver.page_source
page_soup = BeautifulSoup(htmlSource, "html.parser")
startups = page_soup.findAll("div", {"class": "type-element type-element--h3 hbox entity-name__name entity-name__name--black"})
if startups != []:
    for startup in startups:
        startups_href = startup.a["href"]
        startups_url = "https://startup-map.berlin" + startups_href
        f.write(startups_url + "\n")
else:
    print("NaN.") 
              
f.close()
webdriver.close()

Any suggestions? Thanks a lot.

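You can get an indication of the scrolling progress from the position of the vertical-thumb element.

So what you can do is read the translateY value from its style and compare it with the previous value, much as you currently compare new_height with last_height.

This CSS selector locates that element: #window-scrollbar .vertical-thumb

So you can do something like this: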
element = webdriver.find_element_by_css_selector("#window-scrollbar .vertical-thumb")
attributeValue = element.get_attribute("style")

Now the attributeValue string contains something like this:
position: relative; display: block; width: 100%; background-color: rgba(34, 34, 34, 0.6); border-radius: 4px; z-index: 1500; height: 30px; transform: translateY(847px);

You can then find the substring that contains translateY and extract the number from it as follows:
element = webdriver.find_element_by_css_selector("#window-scrollbar .vertical-thumb")
attributeValue = element.get_attribute("style")

index = attributeValue.find('translateY(')
sub_string = attributeValue[index:]
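new_y_value = int(''.join(filter(str.isdigit, sub_string)))

In Python 2, int(filter(str.isdigit, sub_string)) works directly, but in Python 3 filter returns an iterator, so the digits have to be joined into a string first, as above. Alternatively, extract the digits with a regular expression:
new_y_value = re.findall(r'\d+', sub_string)

re.findall returns a list of strings such as ['847'], so convert the first element with int() before comparing. To use re, you first have to import it:
import re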
Just send Keys.PAGE_DOWN.
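A rough sketch of how that suggestion might be turned into a loop: keep pressing Page Down until the number of loaded entries stops growing. The helper name press_until_count_stable, the patience parameter, and the wiring in the comments are assumptions for illustration. Whether this page reacts to Page Down at all is not confirmed in the thread.

```python
import time


def press_until_count_stable(press_key, count_items, pause=1.0,
                             patience=3, max_presses=500):
    """Call press_key() repeatedly until count_items() stops growing.

    Stops after `patience` consecutive presses without new items
    (or after max_presses) and returns the final item count.
    """
    last_count = count_items()
    stalled = 0
    for _ in range(max_presses):
        press_key()
        time.sleep(pause)  # wait for lazily loaded entries
        count = count_items()
        if count > last_count:
            last_count, stalled = count, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return last_count


# Hypothetical wiring with Selenium (needs a running browser session):
#
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.common.keys import Keys
#
#   body = webdriver.find_element(By.TAG_NAME, "body")
#   total = press_until_count_stable(
#       press_key=lambda: body.send_keys(Keys.PAGE_DOWN),
#       count_items=lambda: len(
#           webdriver.find_elements(By.CSS_SELECTOR, "div.entity-name__name")),
#   )
```

The class name in the selector is taken from the question's findAll call. Counting elements rather than comparing document.body.scrollHeight avoids relying on the page height, which did not change in the question's attempt.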
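Thanks. I just tried that, but it did not work either. It shows the error message "'WebElement' object has no attribute 'page_source'". I have just updated the code in the question. Could you tell me what is wrong? Thanks a lot.

Thank you very much. But I got an error message: "int() argument must be a string, a bytes-like object or a number, not 'filter'". Could you clarify a bit more?

OK, we need to debug it. Unfortunately, I do not have Python installed on my machine at all, so I will ask you a few questions. When debugging, does attributeValue contain a string with a value like the one shown in the answer? Does index hold an integer? What does sub_string contain?

Yes, attributeValue shows 'position: relative; display: block; width: 100%; background-color: rgba(34, 34, 34, 0.6); border-radius: 4px; z-index: 1500; height: 499px; transform: translateY(0px);'. index = attributeValue.find('translateY(0px);') returns 151. I did not use index = attributeValue.find('translateY('), which returns -1. sub_string contains 'translateY(0px);'.

Good, so the problem really is the last line of code. In that case, please import re and try new_y_value = re.findall('\d+', sub_string). I will also update the answer to make this clear.

Thanks. new_y_value returns ['0'], but this does not help me with the web scraping. Could you tell me where in the code from my question I should put this?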