Python Newspaper with web archive (Wayback Machine)


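I am trying to use the Python library Newspaper3k with archives from the Wayback Machine, which stores old versions of archived websites. In theory, old news articles can be queried and downloaded from these archives.

For example, the code below queries the archive for CNBC at a specific archive date: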

import newspaper
url = 'http://web.archive.org/web/20161201123529/http://www.cnbc.com/'
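paper = newspaper.build(url, memoize_articles=False)
Although the archived site itself contains links to actual news articles from 2016-12-01, the newspaper module does not seem to pick up these links. Instead, you get URLs like this: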

https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/
These are not actual articles from the archived version of CNBC. However, newspaper works well on today's version of the site.


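I think it gets confused by the format of the URL (which contains two http prefixes). Does anyone have suggestions for how to extract articles from the archive?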

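This is an interesting question, and I will add it to my documentation on GitHub.

I tried using newspaper.build but could not get it to work correctly, so I used newspaper's Source class instead.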

from time import sleep
from random import randint
from newspaper import Config
from newspaper import Source

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

wayback_cnbc = Source(url='https://web.archive.org/web/20180301012621/https://www.cnbc.com/', config=config,
                  memoize_articles=False, language='en', number_threads=20, thread_timeout_seconds=2)

wayback_cnbc.build()
for article_extract in wayback_cnbc.articles:
   article_extract.download()
   article_extract.parse()

   print(article_extract.publish_date)
   print(article_extract.title)
   print(article_extract.url)
   print('')

   # this sleep timer is helping with some timeout issues
   # that were happening when querying
   sleep(randint(1,3))
The example above produced the following output:

None
Media
https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/
    
None
CNBC Video
https://web.archive.org/web/20180301012621/https://www.cnbc.com/video/

2017-11-08 00:00:00
CNBC Healthy Returns
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2017/11/08/healthy-returns.html

2018-02-28 00:00:00
Markets in Asia decline as dollar steadies; Nikkei falls 307 points 
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/asia-markets-stocks-dollar-and-china-caixin-pmi-in-focus.html

2018-02-28 00:00:00
S&P 500 rises, but on track to snap longest monthly win streak since 1959
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/us-stocks-interest-rates-fed-markets.html
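
The output above mixes section pages (Media, CNBC Video) with real articles, and every URL carries the Wayback prefix. The following is a minimal sketch, not part of newspaper itself, for splitting a snapshot URL into its capture time and original URL and for keeping only URLs that look like articles. The helper names `split_wayback_url` and `looks_like_article` are hypothetical, and the date-in-path check is an assumption based only on the CNBC URLs shown above; other sites may need a different rule.

```python
import re
from datetime import datetime

# Snapshot URLs have the form
#   https://web.archive.org/web/<14-digit timestamp>/<original URL>
# The optional [a-z_]* part allows a modifier suffix such as "id_".
WAYBACK_RE = re.compile(r'^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$')

# Assumption: article paths contain /YYYY/MM/DD/, as in the CNBC output above.
DATE_PATH_RE = re.compile(r'/\d{4}/\d{2}/\d{2}/')


def split_wayback_url(url):
    """Return (snapshot_time, original_url), or None for non-snapshot URLs."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    snapshot_time = datetime.strptime(match.group(1), '%Y%m%d%H%M%S')
    return snapshot_time, match.group(2)


def looks_like_article(url):
    """True if url is a snapshot URL whose original path contains a date."""
    parts = split_wayback_url(url)
    return parts is not None and DATE_PATH_RE.search(parts[1]) is not None
```

With these helpers, the loop over wayback_cnbc.articles could skip any article_extract whose url fails looks_like_article before calling download(), which would also avoid URLs such as the blog.archive.org link mentioned in the question.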
     
I hope this answer helps you query articles from the Wayback Machine. Let me know if you have any questions.