
Python: Load all posts with Selenium, then extract the posts

Tags: python, selenium, selenium-webdriver, web-scraping, web-crawler

I am going to crawl this URL: https://healthunlocked.com/positivewellbeing/posts

I wrote the commands below to first click "See more posts" to load all the posts, and then extract the full text of each post. I am trying to run the code, but it is taking far too long! I have been running it for the past two days and I am still waiting for it to finish. I suppose it is still trying to load the posts in the first part of the code, since I have not seen any output (extracted posts) yet. I do not know whether I am doing this right.

My code is as follows:

import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# `driver` is assumed to be an already-created webdriver instance
wait = WebDriverWait(driver, 10)
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

## Load all posts by clicking the "See more posts" button repeatedly
while (driver.find_element_by_xpath('//*[@id="__next"]/main/div[2]/div[1]/div[1]/div[3]/div[31]/button')):
    time.sleep(5)
    driver.find_element_by_xpath('//*[@id="__next"]/main/div[2]/div[1]/div[1]/div[3]/div[31]/button').click()

## Extract posts
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
time.sleep(3)
lst_post = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//div[@class='results-post']/a")]
for lst in lst_post:
    time.sleep(5)
    driver.get(lst)
    post_body = wait.until(EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/main/div[2]/div[1]/div[1]/div[1]")))
    like_count = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".post-action--like")))
    # print(ascii(post_body.text))
    print(post_body.text)
    print('\n')

The site appears to use an API to fetch the post list and the post data:

Post list: https://solaris.healthunlocked.com/posts/positivewellbeing/latest

Post URL: https://solaris.healthunlocked.com/posts/positivewellbeing/{postId}

Using requests, you can call these APIs instead of using Selenium, which will be much faster.

Also, this way you can control when to stop scraping by recording the last post ID. For example, you can resume from where you stopped, if needed.
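
For instance, here is a minimal sketch of one paginated call; the endpoint and the createdBeforePostId parameter are the ones used in the script below, not taken from any official documentation:

import requests

base = "https://solaris.healthunlocked.com/posts/positivewellbeing/latest"

# First page: the newest posts
first_page = requests.get(base).json()

# Next page: posts created before the oldest post we just received
last_id = first_page[-1]["postId"]
next_page = requests.get(base, params={"createdBeforePostId": last_id}).json()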

The code below fetches all posts created during the last month and then fetches each post's details:

import requests
from datetime import datetime, timedelta

allPostUrl = 'https://solaris.healthunlocked.com/posts/positivewellbeing/latest'

now = datetime.today()
postFromTime = now + timedelta(days=-1*30)  # last month

fetchAllPost = False
nextPost = ""
posts = []

# Page through the post list using the createdBeforePostId cursor
while not fetchAllPost:
    url = f'{allPostUrl}{f"?createdBeforePostId={nextPost}" if nextPost else ""}'
    print(f"GET {url}")
    r = requests.get(url)
    result = r.json()
    posts.extend(result)
    if len(result) > 0 and nextPost != result[-1]["postId"]:
        lastCreated = datetime.strptime(result[-1]["dateCreated"], '%Y-%m-%dT%H:%M:%S.%fZ')
        if lastCreated < postFromTime:
            # reached posts older than one month: stop
            fetchAllPost = True
        else:
            nextPost = result[-1]["postId"]
    else:
        fetchAllPost = True

print(f"received {len(posts)} posts")

# Fetch each post's details to get its body and like count
data = []
for idx, post in enumerate(posts):
    url = f'https://solaris.healthunlocked.com/posts/positivewellbeing/{post["postId"]}'
    print(f"[{idx+1}/{len(posts)}] GET {url}")
    r = requests.get(url)
    result = r.json()
    data.append({
        "body": result["body"],
        "likes": result["numRatings"]
    })

print(data)
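
If you want to keep the scraped posts instead of just printing them, a minimal follow-up sketch (assuming the data list built above) could write them to a CSV file:

import csv

# Persist the scraped posts; `data` is the list of
# {"body": ..., "likes": ...} dicts built by the script above.
with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["body", "likes"])
    writer.writeheader()
    writer.writerows(data)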

I was bored, so I took my own approach. It scrapes all the links, visits them, goes back, and scrapes the links that have not been visited yet.

from selenium import webdriver
from selenium.common.exceptions import ElementClickInterceptedException


driver = webdriver.Chrome()
driver.implicitly_wait(6)
driver.get("https://healthunlocked.com/positivewellbeing/posts")
# click accept cookies
driver.find_element_by_id("ccc-notify-accept").click()

post_links = set()
while True:
    driver.get("https://healthunlocked.com/positivewellbeing/posts")
    all_posts = [post for post in
                 driver.find_element_by_class_name("results-posts").find_elements_by_class_name("results-post") if
                 "results-post" == post.get_attribute("class")]
    # handle clicking more posts
    while len(all_posts) <= len(post_links):
        see_more_posts = [btn for btn in driver.find_elements_by_class_name("btn-secondary")
                          if btn.text == "See more posts"]
        try:
            see_more_posts[0].click()
        except ElementClickInterceptedException:
            # handle floating box covering the "See more posts" button
            driver.execute_script("return document.getElementsByClassName('floating-box-sign-up')[0].remove();")
            see_more_posts[0].click()
        all_posts = [post for post in
                     driver.find_element_by_class_name("results-posts").find_elements_by_class_name("results-post") if
                     "results-post" == post.get_attribute("class")]

    # populate links
    start_from = len(post_links)
    for post in all_posts[start_from:]:  # skip posts already collected, to avoid visiting the same links
        # save link
        link = post.find_element_by_tag_name("a").get_attribute("href")
        post_links.add(link)

    # visit the site and scrape info
    for post_site in list(post_links)[start_from:]:
        driver.get(post_site)
        post_text = driver.find_element_by_class_name("post-body").text
        for btn in driver.find_element_by_class_name("post-actions__buttons").find_elements_by_tag_name("button"):
            if "Like" in btn.text:
                post_like = btn.text.split()[1][1]

        print(f"\n{post_text}\nLikes -->{post_like}\n")
Are you running Selenium in headless mode? Otherwise it opens a new window and you should be able to see what is happening, that is, whether the script is still loading posts or has already finished loading.

Could you check the "while" loop and see whether I did it right?

Probably not: driver.find_element_by_xpath() raises a NoSuchElementException when it cannot find the element you specified, so my guess is that once you reach the end of the posts the script will crash instead of moving on to the second part.

Thanks! Do you mean I should change the "while" loop like this: while (driver.find_element_by_xpath('//*[@id="__next"]/main/div[2]/div[1]/div[1]/div[3]/div[31]/button')): try: time.sleep(5) driver.find_element_by_xpath('//*[@id="__next"]/main/div[2]/div[1]/div[1]/div[3]/div[31]/button').click() except: break out of the loop when it stops loading?

Depends on how far back the posts go... I can share my approach if you like.

Thank you very much for your code. I am a beginner; could you give me a link showing how to modify your code to start from the login page? I need to log in and then extract the posts, because the posts shown are different when you are logged in to the site with an account. Also, the code only returns 270 posts, while the page has more than 8,000 posts. Thanks a lot for your help.

It works great, but every post is returned twice. It seems that it extracts the posts from the beginning again after each load.

After that, the for loop for post in all_posts[start_from:] ... the post_links set has the correct links.
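
For reference, here is a minimal sketch of the loop change discussed above, wrapping the element lookup in try/except so the script can move on to the extraction step once the button disappears (driver and the XPath are the ones from the question):

import time
from selenium.common.exceptions import NoSuchElementException

button_xpath = '//*[@id="__next"]/main/div[2]/div[1]/div[1]/div[3]/div[31]/button'

## Load all posts, then fall through to the extraction step
while True:
    try:
        button = driver.find_element_by_xpath(button_xpath)  # raises when the button is gone
    except NoSuchElementException:
        break  # no more "See more posts" button: all posts are loaded
    time.sleep(5)
    button.click()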