Python: handling article exceptions in newspaper


I have a bit of code that uses newspaper to go through various media outlets and download articles from them. This has worked well for quite a while, but recently it has started acting up. I know what the problem is, but since I'm new to Python I'm not sure of the best way to address it. Basically, I think I need to make a change so that the occasional malformed web address doesn't break the whole script, but instead lets it give up on that web address and move on to the others.

The source of the error is when I try to download an article with:

article.download()
Certain articles, which apparently change every day, throw the following error, although the script keeps running:

    Traceback (most recent call last):
      File "C:\Anaconda3\lib\encodings\idna.py", line 167, in encode
        raise UnicodeError("label too long")
    UnicodeError: label too long

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "C:\Anaconda3\lib\site-packages\newspaper\mthreading.py", line 38, in run
        func(*args, **kargs)
      File "C:\Anaconda3\lib\site-packages\newspaper\source.py", line 350, in download_articles
        html = network.get_html(url, config=self.config)
      File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 39, in get_html
        return get_html_2XX_only(url, config, response)
      File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 60, in get_html_2XX_only
        url=url, **get_request_kwargs(timeout, useragent))
      File "C:\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
        return session.request(method=method, url=url, **kwargs)
      File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 502, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 612, in send
        r = adapter.send(request, **kwargs)
      File "C:\Anaconda3\lib\site-packages\requests\adapters.py", line 440, in send
        timeout=timeout
      File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
        chunked=chunked)
      File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 356, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "C:\Anaconda3\lib\http\client.py", line 1107, in request
        self._send_request(method, url, body, headers)
      File "C:\Anaconda3\lib\http\client.py", line 1152, in _send_request
        self.endheaders(body)
      File "C:\Anaconda3\lib\http\client.py", line 1103, in endheaders
        self._send_output(message_body)
      File "C:\Anaconda3\lib\http\client.py", line 934, in _send_output
        self.send(msg)
      File "C:\Anaconda3\lib\http\client.py", line 877, in send
        self.connect()
      File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 166, in connect
        conn = self._new_conn()
      File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 141, in _new_conn
        (self.host, self.port), self.timeout, **extra_kw)
      File "C:\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection
        for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
      File "C:\Anaconda3\lib\socket.py", line 733, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
The next bit is supposed to parse each article, run natural language processing on it, and write certain elements to a dataframe, so I then have:

for paper in papers:
    for article in paper.articles:
        article.parse()
        print(article.title)
        article.nlp()
        if article.publish_date is None:
            d = datetime.now().date()
        else:
            d = article.publish_date.date()
        stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title, article.summary, article.keywords, article.url]
        i += 1
This is probably also a bit sloppy, but that's a question for another day.

This runs fine until it reaches one of those URLs with the error, at which point it throws an article exception and the script crashes:

    C:\Anaconda3\lib\site-packages\PIL\TiffImagePlugin.py:709: UserWarning: Corrupt EXIF data.  Expecting to read 2 bytes but only got 0.
      warnings.warn(str(msg))

    ArticleException                          Traceback (most recent call last)
    <ipython-input-17-2106485c4bbb> in <module>()
          4 for paper in papers:
          5     for article in paper.articles:
    ----> 6         article.parse()
          7         print(article.title)
          8         article.nlp()

    C:\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
        183 
        184     def parse(self):
    --> 185         self.throw_if_not_downloaded_verbose()
        186 
        187         self.doc = self.config.get_parser().fromstring(self.html)

    C:\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
        519         if self.download_state == ArticleDownloadState.NOT_STARTED:
        520             print('You must `download()` an article first!')
    --> 521             raise ArticleException()
        522         elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
        523             print('Article `download()` failed with %s on URL %s' %

    ArticleException: 
So what is the best way to keep this from killing my script? Should I address it at the download stage, where the unicode error occurs, or at the parse stage, by telling it to ignore those bad addresses? And how would I go about implementing that fix?


Thanks very much in advance for any advice.

I ran into the same problem, and although in general using except: pass is not recommended, the following worked for me:

    try:
        a.parse()
        file.write(a.title + '\n')
    except:
        pass
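
If you want a bit more visibility than a bare except: pass, one option is to note which URL failed and move on. The following is just a sketch, not part of the original answer: the URL is a placeholder and the article is built the same way as the `a` object assumed above.

    from newspaper import Article

    url = 'http://example.com/some-article'  # placeholder URL
    a = Article(url)

    try:
        a.download()
        a.parse()
        print(a.title)
    except Exception as e:
        # Record which URL failed and why, instead of silently dropping it
        print('Skipping {}: {}'.format(url, e))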

I found that Navid is correct for this exact problem.

However, parse() is just one of the functions that can trip you up. I wrap all of the calls in a try/except structure like this:

word_list = []

for words in google_news.articles:

    try:
        words.download()
        words.parse()
        words.nlp()

    except:
        pass

    word_list.append(words.keywords)
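
One caveat with the structure above: when one of the calls fails, the append still runs, so word_list can pick up empty keyword lists for failed articles. A small variation (a sketch, assuming `google_news` is a source built with `newspaper.build()`; the URL is a placeholder) keeps the append inside the try, or uses `continue` in the except, so only successfully processed articles contribute keywords:

    import newspaper

    # Placeholder source; build it the same way as in the answer above
    google_news = newspaper.build('https://news.google.com', memoize_articles=False)

    word_list = []

    for words in google_news.articles:
        try:
            words.download()
            words.parse()
            words.nlp()
            # Only reached if download/parse/nlp all succeeded
            word_list.append(words.keywords)
        except Exception:
            # Skip this article entirely and move on to the next one
            continue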

You can try/catch the ArticleException. Don't forget to import the newspaper module.

try:
  article.download()
  article.parse()
except newspaper.article.ArticleException:
  # do something
  pass
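
Applied to the loop from the question, that could look roughly like the sketch below. The outlet URLs are placeholders and the dataframe bookkeeping from the question is omitted; depending on your newspaper/requests versions, a malformed URL may surface as an exception other than ArticleException, in which case you may need to broaden the except clause.

    import newspaper

    # Placeholder outlets; build the sources the same way as in the question
    outlets = ['http://cnn.com', 'http://bbc.com']
    papers = [newspaper.build(url, memoize_articles=False) for url in outlets]

    for paper in papers:
        for article in paper.articles:
            try:
                article.download()
                article.parse()
                article.nlp()
            except newspaper.article.ArticleException:
                # Download or parse failed for this URL; skip it and keep going
                continue
            print(article.title)
            # ...write title, summary, keywords, etc. to the dataframe as before...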
