Python: ArticleException error while learning web scraping with GoogleNews
I am not a programmer or a Python expert; I just copied some code from a tutorial and am practising with it so I can collect data for my future research. However, I get the error below. Can anyone help me fix it? Thanks.
from GoogleNews import GoogleNews
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk
nltk.download('punkt')
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
googlenews=GoogleNews(start='05/01/2020',end='05/31/2020')
googlenews.search('Coronavirus')
result=googlenews.result()
df=pd.DataFrame(result)
print(df.head())
for i in range(2, 20):
    googlenews.getpage(i)
    result = googlenews.result()
    df = pd.DataFrame(result)

list = []
for ind in df.index:
    dict = {}
    article = Article(df['link'][ind], config=config)
    article.download()
    article.parse()
    article.nlp()
    dict['Date'] = df['date'][ind]
    dict['Media'] = df['media'][ind]
    dict['Title'] = article.title
    dict['Article'] = article.text
    dict['Summary'] = article.summary
    list.append(dict)

news_df = pd.DataFrame(list)
news_df.to_excel("articles.xlsx")
ArticleException Traceback (most recent call last)
<ipython-input-37-e5be28c653bc> in <module>
3 article = Article(df['link'][ind],config=config)
4 article.download()
----> 5 article.parse()
6 article.nlp()
7 dict['Date']=df['date'][ind]
~\anaconda3\lib\site-packages\newspaper\article.py in parse(self)
189
190 def parse(self):
--> 191 self.throw_if_not_downloaded_verbose()
192
193 self.doc = self.config.get_parser().fromstring(self.html)
~\anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
529 raise ArticleException('You must `download()` an article first!')
530 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
--> 531 raise ArticleException('Article `download()` failed with %s on URL %s' %
532 (self.download_exception_msg, self.url))
533
ArticleException: Article `download()` failed with HTTPSConnectionPool(host='www.washingtonpost.com', port=443): Read timed out. (read timeout=7) on URL https://www.washingtonpost.com/health/2020/05/13/coronavirus-treatments/
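The traceback shows that `download()` timed out on one Washington Post URL (read timeout = 7, newspaper's default `request_timeout`), and because the exception is unhandled it aborts the whole loop. Two workarounds that seem worth trying (not from the original post, just a sketch): raise `config.request_timeout` above the 7-second default, and wrap each article in a try/except so a single failing URL is skipped rather than fatal. The `ArticleException` class and `fetch_article` function below are stand-ins so the sketch runs without the newspaper library installed:

```python
class ArticleException(Exception):
    """Stand-in for newspaper.article.ArticleException."""

def fetch_article(url):
    # Placeholder for Article(url).download()/parse(); here we pretend
    # any washingtonpost.com URL times out, mirroring the traceback.
    if "washingtonpost.com" in url:
        raise ArticleException(f"download() failed: read timeout on {url}")
    return {"url": url, "text": "article body"}

def collect(urls):
    rows, failed = [], []
    for url in urls:
        try:
            rows.append(fetch_article(url))
        except ArticleException:
            failed.append(url)  # record the bad URL and keep going
    return rows, failed

rows, failed = collect([
    "https://example.com/a",
    "https://www.washingtonpost.com/health/2020/05/13/coronavirus-treatments/",
])
print(len(rows), len(failed))  # prints: 1 1
```

In the real script, the body of the `try` block would be `article.download()`, `article.parse()`, `article.nlp()`, and the dict-building lines, and you could additionally set `config.request_timeout = 20` (an assumed value) before constructing articles.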
Possibly a silly question, but can you open the article normally in a browser over the same internet connection?