Python: How do I get the complete Wikipedia revision history list for an article?

Tags: python, web-scraping, wikipedia-api, revision-history

How can I get the complete Wikipedia revision history list? (I don't want to scrape.)

Package link:

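If you need more than 500 revision entries, you have to use the API with action=query, prop=revisions, and the rvcontinue parameter. The value of rvcontinue is taken from the previous response, so you cannot get the whole list with a single request:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Coffee&rvcontinue=...
To get more specific information of your choice, you also have to use the rvprop parameter:

&rvprop=ids|flags|timestamp|user|userid|size|sha1|contentmodel|comment|parsedcomment|content|tags|parsetree|flagged
A summary of all available parameters can be found in the API documentation.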
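As a minimal sketch of how the parameters above fit together, the helper below assembles a request URL. The function name build_revisions_url and its default values are illustrative assumptions, not part of the original answer; only the endpoint and the parameter names come from the text above.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_revisions_url(title, rvprop=("ids", "timestamp", "user", "comment"),
                        rvlimit=500, rvcontinue=None):
    """Return a revisions query URL for one page title (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "format": "xml",
        "titles": title,
        "rvlimit": rvlimit,
        "rvprop": "|".join(rvprop),  # properties are pipe-separated
    }
    if rvcontinue is not None:
        # Value copied from the previous response; omitted on the first request.
        params["rvcontinue"] = rvcontinue
    # urlencode percent-encodes the pipe characters and the title.
    return API_URL + "?" + urlencode(params)


print(build_revisions_url("Coffee"))
```

Leaving rvcontinue out produces the first request; every later request passes the value returned by the previous one.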

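Here is how to get the full revision history of a Wikipedia page in C#:

Below are the last 10 revisions of the "Coffee" article, taken from the result of the GetRevisions("Coffee") call in the Python listing further down this page (the API returns them newest first). If you need more specific revision information, remember that you can use the rvprop parameter in your request.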

for i in revisions[0:10]:
    print(i)

#<rev revid="698019402" parentid="698018324" user="Termininja" timestamp="2016-01-03T13:51:27Z" comment="short link" />
#<rev revid="698018324" parentid="697691358" user="AXRL" timestamp="2016-01-03T13:39:14Z" comment="/* See also */" />
#<rev revid="697691358" parentid="697690475" user="Zekenyan" timestamp="2016-01-01T05:31:33Z" comment="first coffee trade" />
#<rev revid="697690475" parentid="697272803" user="Zekenyan" timestamp="2016-01-01T05:18:11Z" comment="since country of origin is not first sighting of someone drinking coffee I have removed the origin section completely" />
#<rev revid="697272803" parentid="697272470" minor="" user="Materialscientist" timestamp="2015-12-29T11:13:18Z" comment="Reverted edits by [[Special:Contribs/Media3dd|Media3dd]] ([[User talk:Media3dd|talk]]) to last version by Materialscientist" />
#<rev revid="697272470" parentid="697270507" user="Media3dd" timestamp="2015-12-29T11:09:14Z" comment="/* External links */" />
#<rev revid="697270507" parentid="697270388" minor="" user="Materialscientist" timestamp="2015-12-29T10:45:46Z" comment="Reverted edits by [[Special:Contribs/89.197.43.130|89.197.43.130]] ([[User talk:89.197.43.130|talk]]) to last version by Mahdijiba" />
#<rev revid="697270388" parentid="697265765" user="89.197.43.130" anon="" timestamp="2015-12-29T10:44:02Z" comment="/* See also */" />
#<rev revid="697265765" parentid="697175433" user="Mahdijiba" timestamp="2015-12-29T09:45:03Z" comment="" />
#<rev revid="697175433" parentid="697167005" user="EvergreenFir" timestamp="2015-12-28T19:51:25Z" comment="Reverted 1 pending edit by [[Special:Contributions/2.24.63.78|2.24.63.78]] to revision 696892548 by Zefr: [[WP:CENTURY]]" />
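The entries above are raw XML strings, so reading individual fields takes one more step. The sketch below parses a single entry with the standard library; parse_rev is a hypothetical helper name, and it assumes each string is a self-closed <rev ... /> element like the ones printed above (an entry that carries page content would not be).

```python
import xml.etree.ElementTree as ET


def parse_rev(rev_xml):
    """Turn one '<rev ... />' string into a dict of its attributes."""
    return dict(ET.fromstring(rev_xml).attrib)


sample = ('<rev revid="698019402" parentid="698018324" user="Termininja" '
          'timestamp="2016-01-03T13:51:27Z" comment="short link" />')
rev = parse_rev(sample)
print(rev["user"], rev["timestamp"])
# Termininja 2016-01-03T13:51:27Z
```

Flags such as minor or anon appear as attributes with an empty value, so a membership test like "minor" in rev is enough to detect them.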

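If you use pywikibot, you can pull a generator that will run through the full revision history for you. For example, to get a generator that steps through all revisions (including their content) of the page "pagename" on English Wikipedia, use: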

import pywikibot

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "pagename")
revs = page.revisions(content=True)
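You can apply more parameters to the query; they are described in the pywikibot API documentation.

Of note is:

revisions(reverse=False, total=None, content=False, rollback=False, starttime=None, endtime=None)

Generator which loads the version history as Revision instances.


pywikibot seems to be the approach many Wikipedia editors take to automate their editing.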

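Isn't the complete list of revisions get_revs? What's missing?

@Kevin: It isn't. It only has 500 entries. @Morgan Thrapp: I'm open to any package/code that does the job.

Look at the API; it only lets you get 500 revisions.

@Morgan Thrapp Thanks for the info! Is there a workaround? Is scraping allowed?

For anyone who, like me, struggled at this point with how to handle revs: you can convert it to a list with list(revs).

That certainly works if the list is small enough to fit in memory. Mine wasn't :) so instead I loop: for rev in revs: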
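The trade-off raised in the last two comments can be sketched without a live wiki. In the example below, fake_revisions is a hypothetical stand-in for page.revisions() that yields plain dicts; real pywikibot revisions are Revision instances, so the field access shown here is an assumption made only to keep the sketch runnable offline.

```python
from itertools import islice


def fake_revisions(n):
    """Hypothetical stand-in for page.revisions(); yields items lazily."""
    for revid in range(n, 0, -1):  # newest first
        yield {"revid": revid, "user": "user%d" % (revid % 3)}


def count_by_user(revs):
    """Consume any iterable of revisions one item at a time."""
    counts = {}
    for rev in revs:
        counts[rev["user"]] = counts.get(rev["user"], 0) + 1
    return counts


# Take only the first five items without materializing the rest.
first_five = list(islice(fake_revisions(1000), 5))
print([rev["revid"] for rev in first_five])

# Aggregate over the whole history while holding one revision at a time.
print(count_by_user(fake_revisions(1000)))
```

Calling list() on the generator holds every revision in memory at once, while the loop keeps only the running totals.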
import re
from urllib.parse import quote
from urllib.request import Request, urlopen

def GetRevisions(pageTitle):
    url = ("https://en.wikipedia.org/w/api.php?action=query&format=xml"
           "&prop=revisions&rvlimit=500&titles=" + quote(pageTitle))
    revisions = []                                        #list of all accumulated revisions
    cont_param = ''                                       #continuation parameter for the next request
    while True:
        #web request; the User-Agent value is a placeholder, replace it with your own
        request = Request(url + cont_param, headers={"User-Agent": "revision-history-example/0.1"})
        response = urlopen(request).read().decode("utf-8")
        revisions += re.findall('<rev [^>]*>', response)  #adds all revisions from the current request to the list

        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:                                      #break the loop if 'continue' element missing
            break

        cont_param = "&rvcontinue=" + quote(cont.group(1))  #continuation value from which to start the next request

    return revisions

revisions = GetRevisions("Coffee")

print(len(revisions))
#10418
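As an alternative to matching XML with regular expressions in GetRevisions, the same continuation loop can be written against the JSON output of the API. This is a sketch under stated assumptions, not the original author's method: the names fetch_json and iter_revisions and the User-Agent string are made up for illustration, and the fetch function is passed in as a parameter so the loop can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Perform one API request and decode the JSON body."""
    request = Request(
        API_URL + "?" + urlencode(params),
        headers={"User-Agent": "revision-history-example/0.1"},  # placeholder value
    )
    with urlopen(request) as response:
        return json.load(response)


def iter_revisions(title, fetch=fetch_json):
    """Yield revision dicts for a title until no 'continue' block is returned."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": "max",
        "format": "json",
        "formatversion": "2",  # 'pages' is a list in this format version
    }
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", []):
            yield from page.get("revisions", [])
        if "continue" not in data:
            break
        # Merge the continuation values (including rvcontinue) into the next request.
        params = {**params, **data["continue"]}
```

A call such as for rev in iter_revisions("Coffee") would then walk the history one batch at a time, issuing a new request only when the current batch is exhausted.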