Web scraping: downloading the full revision history of a Wikipedia page


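I'd like to download the full revision history of a single Wikipedia article, but I've hit a roadblock.

It is easy to download an entire Wikipedia article, or to fetch part of its history using the Special:Export URL parameters: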

curl -d "" 'https://en.wikipedia.org/w/index.php?title=Special:Export&pages=Stack_Overflow&limit=1000&offset=1' -o "StackOverflow.xml"
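For reference, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not a complete downloader: it fetches a single chunk with exactly the parameters shown in the curl command, and the function names and the User-Agent string are hypothetical placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

INDEX_URL = "https://en.wikipedia.org/w/index.php"


def build_export_url(page, limit=1000, offset="1"):
    """Build the Special:Export URL with the same parameters as the curl call."""
    query = urlencode({
        "title": "Special:Export",
        "pages": page,
        "limit": limit,
        "offset": offset,
    })
    return f"{INDEX_URL}?{query}"


def fetch_export(page, limit=1000, offset="1",
                 user_agent="history-example/0.1 (hypothetical contact info)"):
    """POST an empty body (like curl -d "") and return the raw XML bytes."""
    request = Request(
        build_export_url(page, limit, offset),
        data=b"",  # a non-None body makes urllib send a POST
        headers={"User-Agent": user_agent},
    )
    with urlopen(request) as response:
        return response.read()
```

Calling fetch_export("Stack_Overflow") and writing the result to StackOverflow.xml should mirror the curl command, though it still returns only one slice of the history per request.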

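Of course, I could download the whole site, including every revision of every article, but that is many terabytes of data, far more than I need.

Is there a pre-built way to do this? (It seems like there must be.)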

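While wandering around looking for clues to a different question of my own (which is my way of saying I know nothing substantial about this topic), I came across this just after reading your question. Take a look at the revisions method.

Edit: here is sample code using the mwclient module: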

import pickle

import mwclient

print('getting page...')
# Recent mwclient releases connect over HTTPS by default.
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Stack_Overflow']

print('extracting revisions (may take a really long time, depending on the page)...')
# page.revisions() yields one dict of metadata per revision (no page text).
revisions = list(page.revisions())

print('saving to file...')
with open('StackOverflowRevisions.pkl', 'wb') as f:
    pickle.dump(revisions, f)
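Once the pickle file exists, the saved metadata can be loaded back and summarized. The following is a small sketch that assumes each revision is a dict with a 'user' key, as mwclient returns with its default properties; the helper names and the "<hidden>" label are made up for illustration.

```python
import pickle
from collections import Counter


def load_revisions(path):
    """Load the list of revision dicts written by pickle.dump.

    Only unpickle files you created yourself; pickle is not safe
    for untrusted input.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def edits_per_user(revisions):
    """Count revisions per user name.

    Revisions whose user has been hidden may lack the 'user' key,
    so they are grouped under a placeholder label.
    """
    return Counter(rev.get("user", "<hidden>") for rev in revisions)
```

For example, edits_per_user(load_revisions('StackOverflowRevisions.pkl')).most_common(10) would list the ten most active editors of the page.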

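The example above only fetches information about the revisions, not the actual content. Here is a short Python script that downloads a page's full content and metadata history into individual JSON files: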

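import json
import time

import mwclient

site = mwclient.Site('en.wikipedia.org')
page = site.pages['Wikipedia']

# Walk the history twice in lockstep: once for metadata, once for content.
# This assumes both iterators return the same revisions in the same order.
revision_pairs = zip(page.revisions(), page.revisions(prop='content'))
for i, (info, content) in enumerate(revision_pairs):
    # mwclient returns the timestamp as a time.struct_time; store it as a string.
    info['timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%S", info['timestamp'])
    print(i, info['timestamp'])
    # One JSON file per revision, named after its timestamp.
    with open("%s.json" % info['timestamp'], "w", encoding="utf-8") as f:
        json.dump({'info': info, 'content': content}, f, indent=4)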
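If installing mwclient is not an option, the same history can be paged through the MediaWiki Action API directly. The sketch below is an alternative under stated assumptions, not part of the original answer: it uses the standard prop=revisions query with continuation, requests metadata only, and the function names and User-Agent string are hypothetical. The fetch argument is injectable so the paging logic can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the Action API and decode the JSON reply."""
    request = Request(
        f"{API_URL}?{urlencode(params)}",
        headers={"User-Agent": "history-example/0.1 (hypothetical contact info)"},
    )
    with urlopen(request) as response:
        return json.load(response)


def iter_revisions(title, fetch=fetch_json):
    """Yield revision metadata dicts for one page, following continuation."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",  # pages come back as a list
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": "max",
    }
    while True:
        data = fetch(params)
        for page in data["query"]["pages"]:
            yield from page.get("revisions", [])
        if "continue" not in data:
            break
        # Merge the continuation tokens into the next request.
        params = {**params, **data["continue"]}
```

Iterating over iter_revisions("Stack_Overflow") would then produce one dict per revision, newest first by default. Adding "content" to rvprop should also return the page text, at the cost of much smaller batches per request.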

That's great, thanks Bill! I've added some sample code to the answer for completeness.

You're welcome, well done! I was about to add some myself.