Python: extracting user comments from a news website


Here is the link from which I need to extract all the comments.


But my code only extracts the first 10 comments. Another 10 comments are loaded dynamically after the button is clicked. How can I extract all of these comments using Python and Selenium?

The idea is to check how many "More" elements there are on the page — every time the button is clicked and more comments are loaded, one more red "More" button appears. Implementation:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def wait(dr, x):
    # wait up to 50 seconds for all elements matching the XPath to be present
    element = WebDriverWait(dr, 50).until(
        EC.presence_of_all_elements_located((By.XPATH, x))
    )
    return element


browser = webdriver.Firefox()
browser.get("http://www.dinamalar.com/user_comments.asp? uid=14701&name=%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AF%8D%E0%AE%9A%E0%AF%86%E0%AE%B2%E0%AF%8D%E0%AE%B5%E0%AE%A9%E0%AF%8D")
for elem in wait(browser, '//*[@id="commsec"]/div[2]/div[1]'):
    print elem.text

Note that I also removed the extra space in the URL.

Thanks, that works great. I'm a beginner at this — how do I extract the comment texts? @VinayakumarR I would use an XPath here:

comments = [element.text for element in browser.find_elements_by_xpath("//div[@class='boxcmt1']//a[@class='heading']/following-sibling::div")]

Please test it. Thanks — after adding that line to the existing code it showed an I/O warning ("Non-ASCII character found"), and when I tried to run it, it raised an error. This works fine: comments = [element.text for element in browser.find_elements_by_xpath("//div[@class='boxcmt1']//a[@class='heading']")], but it returns unicode. @VinayakumarR Sure it does, but what is the problem with unicode? Thanks
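For context on that exchange: the Python 2 "Non-ASCII character" I/O warning is fixed by declaring the source encoding with `# -*- coding: utf-8 -*-` at the top of the file, and the unicode strings that `element.text` returns are harmless as long as an explicit encoding is used when writing them out. A minimal Python 3 sketch (the sample Tamil strings and the file name are stand-ins for real scraped comments):

```python
# Sample strings standing in for what element.text would return;
# in Python 3 every str is unicode, so no special handling is needed.
comments = [u"நல்ல செய்தி", u"மிக்க நன்றி"]

# Write with an explicit encoding so non-ASCII text round-trips intact
with open("comments.txt", "w", encoding="utf-8") as f:
    for text in comments:
        f.write(text + "\n")

# Reading back with the same encoding recovers the original strings
with open("comments.txt", encoding="utf-8") as f:
    restored = f.read().splitlines()

print(restored == comments)  # → True
```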
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver


browser = webdriver.Firefox()
wait = WebDriverWait(browser, 10)
browser.get("http://www.dinamalar.com/user_comments.asp?uid=14701&name=%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AF%8D%E0%AE%9A%E0%AF%86%E0%AE%B2%E0%AF%8D%E0%AE%B5%E0%AE%A9%E0%AF%8D")

# initial wait for the page to load
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".morered")))

pages = 1
while True:
    browser.find_elements_by_css_selector(".morered")[-1].click()

    # wait for more "load more" buttons to be present
    try:
        wait.until(lambda browser: len(browser.find_elements_by_css_selector(".morered")) > pages)
    except TimeoutException:
        break  # no more data loaded, exit the loop

    print("Comments loaded: %d" % len(browser.find_elements_by_css_selector(".dateg")))

    pages += 1

browser.close()
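As a side note on the `wait.until(lambda ...)` line above: `WebDriverWait.until()` accepts any callable that takes the driver, so the check can be packaged as a reusable custom expected condition in the style of Selenium's built-in ones. A sketch (the class name is made up for illustration; the stub driver only exists to show the condition's behaviour without launching a browser):

```python
# Custom expected condition: truthy result ends the wait, falsy keeps polling.
class count_of_elements_greater_than(object):
    def __init__(self, selector, count):
        self.selector = selector
        self.count = count

    def __call__(self, driver):
        return len(driver.find_elements_by_css_selector(self.selector)) > self.count


# Stub driver that pretends to find a fixed number of elements,
# just to demonstrate how the condition evaluates.
class StubDriver(object):
    def __init__(self, n):
        self.n = n

    def find_elements_by_css_selector(self, selector):
        return [object()] * self.n


condition = count_of_elements_greater_than(".morered", 1)
print(condition(StubDriver(1)))  # → False: still only one "More" button
print(condition(StubDriver(2)))  # → True: a new button appeared
```

With this in place, the loop body could read `wait.until(count_of_elements_greater_than(".morered", pages))` instead of the inline lambda.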