Python: web scraping when the page requires scrolling down

python web-scraping

For example, I want to scrape the titles of the first 200 questions listed on a web page. I tried the following code:

import requests
from bs4 import BeautifulSoup

url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
print("url")
print(url)
r = requests.get(url) # HTTP request
print("r")
print(r)
html_doc = r.text # Extracts the html
print("html_doc")
print(html_doc)
soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
print("soup")
print(soup)

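It gives me the page as text. If we search that text for href='/ we can see that the HTML does contain some question titles. The problem is that there are not enough of them: on the actual web page, a user has to scroll down manually to trigger extra loading.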

Does anyone know how I could simulate "scrolling down" from a program, so that more of the page content gets loaded?

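Infinite scrolling on a web page is implemented in JavaScript. So, to find out which URL we need to call and which parameters to send, we either have to study the JS code inside the page thoroughly or, preferably, examine the requests the browser makes while we scroll down the page. The browser's Developer Tools let us inspect those requests.

The more you scroll down, the more requests are generated. So your requests now go to that URL instead of the normal page URL, but remember to send the correct headers and payload.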
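The request-replay idea above could be sketched roughly as follows. Everything specific here is an assumption for illustration: XHR_URL, the header values, and the offset/limit payload fields are hypothetical placeholders, and the real ones have to be copied from the request shown in the browser's Network tab. The `post` argument is expected to behave like `requests.Session().post`.

```python
# Hypothetical endpoint and headers: replace with the values seen in the
# browser's Network tab while scrolling. None of these come from the real site.
XHR_URL = "https://example.com/ajax/more_questions"
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
}


def build_payload(page, page_size=20):
    """Build the (assumed) paging payload for one simulated scroll."""
    return {"offset": page * page_size, "limit": page_size}


def fetch_pages(post, pages):
    """Yield the decoded JSON of `pages` consecutive paging requests.

    `post` is any callable with the same interface as requests.Session().post.
    """
    for page in range(pages):
        response = post(XHR_URL, headers=HEADERS, json=build_payload(page))
        response.raise_for_status()  # stop early on HTTP errors
        yield response.json()


def demo():
    """Example wiring with the requests library (performs real HTTP calls)."""
    import requests  # imported here so the helpers above need no third-party code

    with requests.Session() as session:
        for chunk in fetch_pages(session.post, 3):
            print(chunk)
```

Passing the `post` callable in keeps the paging logic separate from the HTTP client, so it can be tried against a stub before pointing it at a real site.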

Another, easier solution is to use Selenium.

I recommend using Selenium rather than bs.
Selenium can both control the browser and parse the page: scroll down, click buttons, and so on.

This example scrolls down in Instagram to collect all the users who liked a post.

If the content only loads when you scroll down, that probably means the page is loading it dynamically with JavaScript.

You could try a web client that loads the page and executes the JavaScript in it, and simulate the scrolling by injecting some JS such as document.body.scrollTop = sY;
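A minimal sketch of the JS-injection idea, assuming the web client is a Selenium WebDriver (its `execute_script` method runs JavaScript in the page). Instead of setting `scrollTop` directly, this variant calls `window.scrollTo` and stops once the page height no longer grows. The function name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are made up for this illustration.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is expected to be a Selenium WebDriver (anything exposing
    execute_script works). `pause` is a crude delay that gives the page
    time to load the next batch of content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the document.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the real bottom
        last_height = new_height
    return rounds
```

With a real browser this would be called as `scroll_to_bottom(driver)` right after `driver.get(url)`. The `max_rounds` cap is there so a truly endless feed cannot keep the loop running forever.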
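I could not find the response using requests, but you can use Selenium. First print the number of questions after the initial load, then send the End key to simulate scrolling down. After sending the End key, you can see the number of questions go from 20 to 40.
I used driver.implicitly_wait to wait 5 seconds before reading the DOM again, in case the script runs ahead of the DOM being loaded. You can improve on this by using EC with Selenium.
The page loads 20 questions per scroll. So if you want to scrape 100 questions, you need to send the End key 5 times.
To use the code below, you need to install chromedriver.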

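Thanks for the complete code... but are you sure driver.implicitly_wait(5) works? In my tests the browser closes immediately, and questions2 comes out the same as questions. Also, we need to scroll down to trigger the extra load, and I do not see any scrolling in the code.

Sending the End key is what simulates the scroll down. I have updated the code with wait and time.sleep. This is probably not the best way, but I am not sure how to use EC to wait for an element to appear in the DOM.
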
    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait

    # Fill in the paths to chromedriver and to the Chrome binary.
    CHROMEDRIVER_PATH = ""
    CHROME_PATH = ""
    WINDOW_SIZE = "1920,1080"

    chrome_options = Options()
    # chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
    chrome_options.binary_location = CHROME_PATH
    # Do not load images, to speed up page loads.
    prefs = {'profile.managed_default_content_settings.images': 2}
    chrome_options.add_experimental_option("prefs", prefs)

    url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"

    # The leading './/' keeps the search inside the list element.
    ITEM_XPATH = './/div[@class="pagedlist_item"]'


    def scrape(url, times):
        if not url.startswith('http'):
            raise Exception('URLs need to start with "http"')

        # Selenium 4 style; older releases took executable_path= and chrome_options=.
        driver = webdriver.Chrome(
            service=Service(CHROMEDRIVER_PATH),
            options=chrome_options
        )
        driver.get(url)

        counter = 1
        while counter <= times:
            # Count the questions loaded so far.
            q_list = driver.find_element(By.CLASS_NAME, 'TopicAllQuestionsList')
            questions = q_list.find_elements(By.XPATH, ITEM_XPATH)
            print(len(questions))

            # Send the End key to simulate scrolling to the bottom.
            html = driver.find_element(By.TAG_NAME, 'html')
            html.send_keys(Keys.END)

            # The wait object is created but not used yet; the fixed
            # time.sleep is what actually gives the next batch time to load.
            wait = WebDriverWait(driver, 5)
            time.sleep(5)

            questions2 = q_list.find_elements(By.XPATH, ITEM_XPATH)
            print(len(questions2))

            counter += 1

        driver.quit()


    if __name__ == '__main__':
        scrape(url, 5)
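
The fixed time.sleep(5) in the code above could probably be replaced by an explicit wait, which is what the mention of EC was pointing at. The sketch below does not use one of Selenium's built-in expected conditions but a small custom condition, since the goal is "more items than before" rather than "element present". The names `more_items_than` and `scroll_and_wait` are made up for this illustration. The locator is a plain `(by, value)` tuple, for example `("xpath", '//div[@class="pagedlist_item"]')`.

```python
def more_items_than(locator, count):
    """Build a wait condition that succeeds once more than `count` elements match.

    `locator` is a (by, value) tuple such as ("xpath", "//div"). The returned
    callable takes the driver, which is how WebDriverWait.until invokes it.
    """
    def condition(driver):
        found = driver.find_elements(*locator)
        # WebDriverWait keeps polling while the condition returns a falsy value.
        return found if len(found) > count else False
    return condition


def scroll_and_wait(driver, locator, timeout=10):
    """Send End once, then block until new items show up (or time out)."""
    # Imported here so that more_items_than stays usable without Selenium.
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait

    before = len(driver.find_elements(*locator))
    driver.find_element("tag name", "html").send_keys(Keys.END)
    # Raises selenium's TimeoutException if nothing new loads within `timeout`.
    return WebDriverWait(driver, timeout).until(more_items_than(locator, before))
```

Inside the loop of scrape, a call to `scroll_and_wait(driver, locator)` would stand in for the send_keys call plus the sleep. It returns as soon as the next batch is in the DOM instead of always pausing the full five seconds.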