
Python: ArticleException error while learning web scraping with GoogleNews

Tags: python, web, screen-scraping, google-news

I am not a programmer or a Python expert; I just copied some code from a tutorial and am practicing with it to collect data for my future research. However, I get the error below. Can anyone help me solve this problem? Thanks.

from GoogleNews import GoogleNews
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk
nltk.download('punkt')

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent

googlenews = GoogleNews(start='05/01/2020', end='05/31/2020')
googlenews.search('Coronavirus')
result=googlenews.result()
df=pd.DataFrame(result)
print(df.head())


# Fetch result pages 2-19 and rebuild the DataFrame each time
for i in range(2, 20):
    googlenews.getpage(i)
    result = googlenews.result()
    df = pd.DataFrame(result)

list = []

for ind in df.index:
    dict={}
    article = Article(df['link'][ind],config=config)
    article.download()
    article.parse()
    article.nlp()
    dict['Date']=df['date'][ind]
    dict['Media']=df['media'][ind]
    dict['Title']=article.title
    dict['Article']=article.text
    dict['Summary']=article.summary
    list.append(dict)
 
news_df=pd.DataFrame(list)
news_df.to_excel("articles.xlsx")

ArticleException                          Traceback (most recent call last)
<ipython-input-37-e5be28c653bc> in <module>
      3     article = Article(df['link'][ind],config=config)
      4     article.download()
----> 5     article.parse()
      6     article.nlp()
      7     dict['Date']=df['date'][ind]

~\anaconda3\lib\site-packages\newspaper\article.py in parse(self)
    189 
    190     def parse(self):
--> 191         self.throw_if_not_downloaded_verbose()
    192 
    193         self.doc = self.config.get_parser().fromstring(self.html)

~\anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
    529             raise ArticleException('You must `download()` an article first!')
    530         elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
--> 531             raise ArticleException('Article `download()` failed with %s on URL %s' %
    532                   (self.download_exception_msg, self.url))
    533 

ArticleException: Article `download()` failed with HTTPSConnectionPool(host='www.washingtonpost.com', port=443): Read timed out. (read timeout=7) on URL https://www.washingtonpost.com/health/2020/05/13/coronavirus-treatments/
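
The traceback shows what is happening inside newspaper3k: `download()` does not raise on a network failure; it records the failed state, and the later call to `parse()` raises `ArticleException` via `throw_if_not_downloaded_verbose()`. The root cause here is therefore the read timeout against washingtonpost.com, not the parsing step. As a minimal sketch, you can check the download state explicitly before parsing (the `ArticleDownloadState` import path and its `SUCCESS` member are inferred from the traceback above, so treat them as an assumption):

from newspaper import Article
from newspaper.article import ArticleDownloadState  # path inferred from the traceback

url = "https://www.washingtonpost.com/health/2020/05/13/coronavirus-treatments/"
article = Article(url, config=config)  # config as built in the question's code
article.download()
# download() swallows network errors; parse() only raises afterwards.
if article.download_state == ArticleDownloadState.SUCCESS:
    article.parse()
else:
    print('Download failed:', article.download_exception_msg)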

This may be a silly question, but can you open that article normally in a browser over the same internet connection?
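
If the page does load in a browser, the usual workaround is to raise newspaper3k's request timeout (the `read timeout=7` in the error matches the library's default of 7 seconds) and to skip articles that still fail, so one slow site does not abort the whole loop. A minimal sketch under those assumptions, reusing `df`, `user_agent`, and `pd` from the code above:

from newspaper import Article, Config
from newspaper.article import ArticleException

config = Config()
config.browser_user_agent = user_agent
config.request_timeout = 30  # default is 7 seconds, per the traceback

rows = []
for ind in df.index:
    article = Article(df['link'][ind], config=config)
    try:
        article.download()
        article.parse()
        article.nlp()
    except ArticleException as exc:
        # Skip articles that time out or otherwise fail to download.
        print(f"Skipping {df['link'][ind]}: {exc}")
        continue
    rows.append({
        'Date': df['date'][ind],
        'Media': df['media'][ind],
        'Title': article.title,
        'Article': article.text,
        'Summary': article.summary,
    })

news_df = pd.DataFrame(rows)
news_df.to_excel("articles.xlsx")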