How can I scrape data from the "Comments" tab of a New York Times online article using Python?


Below is the URL of a New York Times article: the URL containing the comments tab is

It has a comments tab, and I want to scrape all the comments from the site using Python's BeautifulSoup library.

Below is my code, but the result is empty. I suspect the problem is that it doesn't tell the computer where to find the source link. Can someone fix it? Thanks, everyone!

import bs4
import requests

session = requests.Session()
url = "http://www.nytimes.com/2017/01/04/world/asia/china-xinhua-donald-trump-twitter.html"
page = session.get(url).text
soup = bs4.BeautifulSoup(page, 'html.parser')  # pass a parser explicitly to silence the warning
comments = soup.find_all(class_='comments-panel')
for e in comments:
    print(e.get_text())  # the original `print comments.string` is Python 2 and iterates the wrong variable
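Independent of the empty result (the comments are injected by JavaScript and are simply not in the static HTML that `requests` fetches), the loop above also has a `.string` pitfall worth knowing: `.string` returns `None` whenever a tag has nested children, so `.get_text()` is the safer call. A minimal, self-contained sketch (the HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics a panel with nested children.
html = """
<div class="comments-panel">
  <p>First comment</p>
  <p>Second comment</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
panel = soup.find(class_='comments-panel')

print(panel.string)                      # None - the div has nested <p> children
print(panel.get_text(" ", strip=True))   # "First comment Second comment"
```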

The "Comments" tab containing all the comments is hidden and is revealed by a JavaScript event. Following @eLRuLL's suggestion, you can use Selenium to open the comments tab and retrieve the comments as follows (in Python 3):

Edit:

To retrieve all comments and all replies to comments, you need to 1) select the "READ MORE" and "SEE ALL REPLIES" elements and 2) iterate over them and click them. I have modified my code example accordingly:

import time
from bs4 import BeautifulSoup
from selenium import webdriver, common

driver = webdriver.Firefox(executable_path='.../geckodriver')  # adapt the path to the geckodriver

# set the browser window size to desktop view
driver.set_window_size(2024, 1000)

url = 'http://www.nytimes.com/2017/01/04/world/asia/china-xinhua-donald-trump-twitter.html'
driver.get(url)

# wait for the page to be fully loaded
time.sleep(5)

# click the 'SEE ALL COMMENTS' button, then click each 'READ MORE' link
driver.find_element_by_css_selector('button.button.comments-button.theme-speech-bubble').click()
while True:
    try:
        driver.find_element_by_css_selector('div.comments-expand.comments-thread-expand').click()
        time.sleep(3)
    except common.exceptions.ElementNotVisibleException:
        break

# select the links SEE ALL REPLIES and click them
replies = driver.find_elements_by_css_selector('div.comments-expand.comments-subthread-expand')
for reply in replies:
    reply.click()
    time.sleep(3)

# get source code and close the browser
page  = driver.page_source
driver.close()

soup = BeautifulSoup(page, 'html.parser')

comments = soup.find_all('div', class_='comments-panel')
print(comments[0].prettify())
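Once `driver.page_source` has been handed to BeautifulSoup, the individual comments can be pulled out of the expanded panel. A minimal sketch, assuming hypothetical class names (`comment`, `commenter`) that only mimic the structure - inspect the live NYT markup with the developer tools and adjust the selectors accordingly:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for driver.page_source; the class names
# 'comment' and 'commenter' are assumptions, not the real NYT markup.
page = """
<div class="comments-panel">
  <div class="comment"><span class="commenter">Alice</span><p>Great article.</p></div>
  <div class="comment"><span class="commenter">Bob</span><p>I disagree.</p></div>
</div>
"""
soup = BeautifulSoup(page, 'html.parser')

parsed = []
for c in soup.select('div.comments-panel div.comment'):
    name = c.select_one('span.commenter').get_text(strip=True)
    text = c.select_one('p').get_text(strip=True)
    parsed.append((name, text))

print(parsed)  # [('Alice', 'Great article.'), ('Bob', 'I disagree.')]
```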

Have you looked at the page source with your browser's developer tools? You should be able to find the comments section easily. Don't you want `class='comments-view'` or the tab content? I suggest you try a browser emulator.

@depperm: Thanks, but your suggestion gives the same result. @MattDMo - I tried using Chrome's Inspect to look for the source of the comments section and, as depperm said, I tried almost every class - comments-panel, comments-view, and so on. Every one of them produced the same result: empty! Thank you.

@Benjamin - a very clever solution. I modified it a bit, and it scrapes comments, but not all of them: there are comments that stay hidden unless we click "READ MORE".

Perhaps a more efficient alternative to scraping is the New York Times Community API, which gives access to registered users' comments:

@Benjamin: I can tell you have a deep understanding of HTML, CSS and JavaScript. Two more questions: (1) Running the code raises an error: "selenium.common.exceptions.WebDriverException: Message: unknown error: Element is not clickable at point (728, 817) (Session info: chrome=55.0.2883.87) (Driver info: chromedriver=2.25.426923 (0390b88869384d6eb0d5d09729679f934aab9eed), platform=Windows NT 6.1.7601 SP1 x86_64)". (2) For the CSS selector part, how did you find 'button.button.comments-button.theme-speech-bubble'? Sorry, I have almost no experience with web programming.

This morning I tested the extended code with the Firefox geckodriver (because of a different computer setup) - geckodriver works fine - but I did not change the webdriver in the post, assuming the chromedriver would work the same way. That does not seem to be the case: the chromedriver raises the same error, and at the moment I don't know where the problem lies, so I have changed the webdriver in the post to the Firefox geckodriver. Regarding your second question: right-click the speech-bubble icon next to the article headline -> Inspect Element, and Chrome or Firefox will show you the element with its class attributes.