
Python: ArticleException error while learning web scraping with GoogleNews

Tags: python, web, screen-scraping, google-news

I am not a programmer or a Python expert; I just copied some code from a tutorial and am practicing with it to collect data for my future research. However, I get the error below. Can anyone help me solve this problem? Thanks.

from GoogleNews import GoogleNews
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk
nltk.download('punkt')

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent

googlenews = GoogleNews(start='05/01/2020', end='05/31/2020')
googlenews.search('Coronavirus')
result=googlenews.result()
df=pd.DataFrame(result)
print(df.head())


# Fetch result pages 2-19 and rebuild the DataFrame each time
for i in range(2, 20):
    googlenews.getpage(i)
    result = googlenews.result()
    df = pd.DataFrame(result)

list = []

for ind in df.index:
    dict={}
    article = Article(df['link'][ind],config=config)
    article.download()
    article.parse()
    article.nlp()
    dict['Date']=df['date'][ind]
    dict['Media']=df['media'][ind]
    dict['Title']=article.title
    dict['Article']=article.text
    dict['Summary']=article.summary
    list.append(dict)
 
news_df=pd.DataFrame(list)
news_df.to_excel("articles.xlsx")

ArticleException                          Traceback (most recent call last)
<ipython-input-37-e5be28c653bc> in <module>
      3     article = Article(df['link'][ind],config=config)
      4     article.download()
----> 5     article.parse()
      6     article.nlp()
      7     dict['Date']=df['date'][ind]

~\anaconda3\lib\site-packages\newspaper\article.py in parse(self)
    189 
    190     def parse(self):
--> 191         self.throw_if_not_downloaded_verbose()
    192 
    193         self.doc = self.config.get_parser().fromstring(self.html)

~\anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
    529             raise ArticleException('You must `download()` an article first!')
    530         elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
--> 531             raise ArticleException('Article `download()` failed with %s on URL %s' %
    532                   (self.download_exception_msg, self.url))
    533 

ArticleException: Article `download()` failed with HTTPSConnectionPool(host='www.washingtonpost.com', port=443): Read timed out. (read timeout=7) on URL https://www.washingtonpost.com/health/2020/05/13/coronavirus-treatments/
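
The traceback shows what is happening inside newspaper3k: `download()` does not raise on a network failure; it records the failed state, and the later call to `parse()` raises `ArticleException` via `throw_if_not_downloaded_verbose()`. The root cause here is therefore the read timeout against washingtonpost.com, not the parsing step. As a minimal sketch, you can check the download state explicitly before parsing (the `ArticleDownloadState` import path and its `SUCCESS` member are inferred from the traceback above, so treat them as an assumption):

from newspaper import Article
from newspaper.article import ArticleDownloadState  # path inferred from the traceback

url = "https://www.washingtonpost.com/health/2020/05/13/coronavirus-treatments/"
article = Article(url, config=config)  # config as built in the question's code
article.download()
# download() swallows network errors; parse() only raises afterwards.
if article.download_state == ArticleDownloadState.SUCCESS:
    article.parse()
else:
    print('Download failed:', article.download_exception_msg)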

This may be a silly question, but can you open that article normally in a browser over the same internet connection?
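
If the page does load in a browser, the usual workaround is to raise newspaper3k's request timeout (the `read timeout=7` in the error matches the library's default of 7 seconds) and to skip articles that still fail, so one slow site does not abort the whole loop. A minimal sketch under those assumptions, reusing `df`, `user_agent`, and `pd` from the code above:

from newspaper import Article, Config
from newspaper.article import ArticleException

config = Config()
config.browser_user_agent = user_agent
config.request_timeout = 30  # default is 7 seconds, per the traceback

rows = []
for ind in df.index:
    article = Article(df['link'][ind], config=config)
    try:
        article.download()
        article.parse()
        article.nlp()
    except ArticleException as exc:
        # Skip articles that time out or otherwise fail to download.
        print(f"Skipping {df['link'][ind]}: {exc}")
        continue
    rows.append({
        'Date': df['date'][ind],
        'Media': df['media'][ind],
        'Title': article.title,
        'Article': article.text,
        'Summary': article.summary,
    })

news_df = pd.DataFrame(rows)
news_df.to_excel("articles.xlsx")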