Python: get the first paragraph of a Wikipedia article

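I am using the code below to fetch the first paragraph of a Wikipedia article. This is the text I am after, and I only need this one paragraph. Is that possible, or is there a better alternative?

'''Papori''' ({{lang-as|'''পাপৰী'''}}) is an [[Assamese language]] feature film directed by [[Jahnu Barua]]. The film stars Gopi Desai, Sushil Goswami, Chetana Das and Dulal Roy. The film was released in 1986.

Here is my code: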

#!/usr/bin/python
from lxml import etree
import urllib
from BeautifulSoup import BeautifulSoup

class AppURLopener(urllib.FancyURLopener):
    version = "WikiDownloader"

urllib._urlopener = AppURLopener()
query = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=papori&rvsection=0'
#data = { 'catname':'', 'wpDownload':1, 'pages':"\n".join(pages)}
#data = urllib.urlencode(data)
f = urllib.urlopen(query)
s = f.read()
#doc = etree.parse(f)
#print(s)
soup = BeautifulSoup(s)
secondPTag = soup.findAll('rev')
print secondPTag

Update: can anyone help me remove the text between {{ and }}? It is not needed. Thanks.

Yes, it is possible. You can use an HTML parser for this.

To remove a substring, use a regular expression like this:

>>> import re
>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'
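Since the text between the braces changes from article to article, a pattern is needed rather than a fixed substring. The following is only a minimal sketch, assuming the templates are not nested; the sample string and the name `remove_flat_templates` are made up for illustration.

```python
import re

# Hypothetical sample; real wikitext often nests templates,
# which this simple pattern does not handle.
sample = "'''Papori''' ({{lang-as|X}}) is a film.{{cite web|url=example}}"

# Non-greedy match from "{{" to the nearest "}}"; DOTALL lets it span line breaks.
FLAT_TEMPLATE = re.compile(r"\{\{.*?\}\}", re.DOTALL)

def remove_flat_templates(text):
    """Delete every {{...}} span, assuming no template contains another."""
    return FLAT_TEMPLATE.sub("", text)

print(remove_flat_templates(sample))
```

With nested input such as "a{{b{{c}}d}}e" the match stops at the first "}}" and leaves "d}}" behind, so nested templates need a depth-aware approach.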

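To remove everything from "{{" up to '''Papori''', delete the nested "{{...}}" expressions one by one with a regular expression until the text no longer starts with "{{".

To remove everything from the first "{{" to the matching "}}", scan the text character by character and track the nesting depth.

The code for both variants appears after the comments below, followed by a complete example and its output.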

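I am using Beautiful Soup. It's awesome. But I want to remove the text between {{ and }}. How can I remove it? Thanks.

@vivek: to remove "remove_this" you can use email.replace("remove_this", "").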

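Thanks, but the text is dynamic. I want to remove everything from {{ to }}. Thanks.

@user559744: I've added a variant that removes everything from the first "{{" to the matching "}}".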

import re

# Remove everything from the first "{{" up to '''Papori''';
# rev_data holds the revision wikitext returned by the API.
prefix, sep, rest = rev_data.partition("{{")
if sep: # found the first "{{"
    rest = sep + rest # put it back
    while rest.startswith("{{"):
        # remove nested "{{expr}}" one by one until there is none
        rest, n = re.subn(r"{{(?:[^{]|(?<!{){)*?}}", "", rest, count=1)
        if n == 0:
            break # the first "{{" is unmatched; can't remove it
    else: # deletion is successful
        rev_data = prefix + rest
print(rev_data)
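To see why the loop peels templates from the inside out, it can help to run the same idea on a small input. This is a sketch only: the function name `strip_leading_templates` and the sample wikitext are hypothetical, and the pattern is the one used above with the braces escaped.

```python
import re

# Innermost template: "{{", then anything that does not open another "{{", then "}}".
INNERMOST = re.compile(r"\{\{(?:[^{]|(?<!\{)\{)*?\}\}")

def strip_leading_templates(text):
    """Remove the run of templates beginning at the first "{{".

    Nested templates are deleted first, so the outer one eventually
    becomes a flat "{{...}}" that the pattern can match.
    """
    prefix, sep, rest = text.partition("{{")
    if not sep:
        return text              # no template at all
    rest = sep + rest            # put the "{{" back
    while rest.startswith("{{"):
        rest, n = INNERMOST.subn("", rest, count=1)
        if n == 0:
            return text          # unmatched "{{": leave the input untouched
    return prefix + rest

# Made-up wikitext: an infobox containing a nested template, then the paragraph.
sample = "{{Infobox film|name={{lang|as|X}}}}\n'''Papori''' is a film."
print(repr(strip_leading_templates(sample)))
```

The first pass deletes the inner "{{lang|as|X}}", the second deletes the now-flat infobox, and the loop stops because the remaining text starts with a newline rather than "{{".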

# remove everything from the first "{{" to matching "}}"
prefix, sep, rest = rev_data.partition("{{")
if sep: # found the first "{{"
    depth = 1
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and prevc == c:  # found "{{"
            depth += 1
            prevc = None # match "{{{ " only once
        elif c == "}" and prevc == c: # found "}}"
            depth -= 1
            if depth == 0: # found matching "}}"
                rev_data = prefix + rest[i+1:] # after matching "}}"
                break
            prevc = None # match "}}} " only once
        else:
            prevc = c
print(rev_data)
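The depth counter above stops after the first template, so later ones such as {{lang-as|...}} or {{cite web|...}} stay in the text. If all of them should go, the same counting idea can be applied across the whole string. This is a rough sketch, not part of the original answer; `remove_templates` is a hypothetical name, triple-brace parameters are not treated specially, and an unclosed "{{" swallows the rest of the text.

```python
def remove_templates(text):
    """Return text with every {{...}} span removed, including nested ones."""
    out = []
    depth = 0                      # how many "{{" are currently open
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:         # keep characters only outside templates
                out.append(text[i])
            i += 1
    return "".join(out)

print(remove_templates("'''Papori''' ({{lang-as|X}}) film.{{cite web|title=T}}"))
```

A stray "}}" with no opening braces is kept as ordinary text, since the depth never goes below zero.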

#!/usr/bin/env python3
# Complete example: fetch the wikitext of section 0 and strip the first template.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as etree

# download & parse xml, find rev data
params = dict(action="query", prop="revisions", rvprop="content",
              format="xml", titles="papori", rvsection=0)
request = urllib.request.Request(
    "http://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params),
    headers={"User-Agent": "WikiDownloader/1.0",
             "Referer": "http://stackoverflow.com/q/7937855"})
with urllib.request.urlopen(request) as response:
    tree = etree.parse(response)
rev_data = tree.findtext('.//rev')

# remove everything from the first "{{" to matching "}}"
prefix, sep, rest = rev_data.partition("{{")
if sep: # found the first "{{"
    depth = 1
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and prevc == c:  # found "{{"
            depth += 1
            prevc = None # match "{{{ " only once
        elif c == "}" and prevc == c: # found "}}"
            depth -= 1
            if depth == 0: # found matching "}}"
                rev_data = prefix + rest[i+1:] # after matching "}}"
                break
            prevc = None # match "}}} " only once
        else:
            prevc = c
print(rev_data)

Output:

'''Papori''' ({{lang-as|'''পাপৰী'''}}) is an [[Assamese
language]] feature film directed by [[Jahnu Barua]]. The film
stars Gopi Desai, [[Biju Phukan]], Sushil Goswami, Chetana Das
and Dulal Roy. The film was released in 1986.<ref name="ab">{{cite
web|url=http://www.chaosmag.in/barua.html|title=Papori – 1986 –
Assamese film|publisher=Chaosmag|accessdate=4 February
2010}}</ref>
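The output still carries wiki markup: a <ref> footnote, [[...]] links and ''' bold quotes. If plain text is the goal, a few more substitutions can get close. This is a sketch under assumptions, not a full wikitext parser: `to_plain_text` is a hypothetical helper, the sample is shortened from the output above, and links with several pipes (for example images) are not handled.

```python
import re

# <ref .../> or <ref ...>...</ref>, possibly spanning several lines.
REF = re.compile(r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>", re.DOTALL)
# [[target]] or [[target|label]]; keep the label, or the target if there is no label.
LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]")
# Runs of two or more apostrophes mark italic/bold text.
QUOTES = re.compile(r"'{2,}")

def to_plain_text(wikitext):
    """Strip footnotes, link brackets and bold/italic quotes, then tidy whitespace."""
    text = REF.sub("", wikitext)
    text = LINK.sub(r"\1", text)
    text = QUOTES.sub("", text)
    return " ".join(text.split())

sample = ("'''Papori''' is an [[Assamese\nlanguage]] feature film directed by "
          "[[Jahnu Barua]]. The film was released in 1986."
          "<ref name=\"ab\">{{cite web|title=Papori}}</ref>")
print(to_plain_text(sample))
```

Removing the remaining {{...}} templates first keeps the result cleaner, since a template outside a <ref> would otherwise survive these substitutions.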