Parsing MediaWiki sections with a Python regular expression

My text looks like this:

==Mainsection1==  
Some text here  
===Subsection1.1===  
Other text here  

==Mainsection2==  
Text goes here  
===Subsecttion2.1===  
Other text goes here.

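In the text above, main sections 1 and 2 have different names, which can be anything the user wants. The same goes for the subsections.

What I want the regex to do is get the text of each main section, including its subsections (if any). Yes, this is a wiki page. Every main section name starts with == and ends with ==. Every subsection name is wrapped in more than two = signs.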

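regex = re.compile('==(.*)==([^=]*)', re.MULTILINE)
regex.findall(text)

But the above returns every section separately. That is, it returns a main section just fine, but it also treats each subsection as a section of its own.

I hope someone can help me out, because this has been bugging me for a while.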

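Edit: the result should be:

[('Mainsection1', 'Some text here\n===Subsection1.1===\nOther text here\n'), ('Mainsection2', 'Text goes here\n===Subsecttion2.1===\nOther text goes here.\n')]

Edit 2:
I have rewritten my code without regular expressions. I came to the conclusion that parsing it myself is easy enough, and the result is easier for me to understand.

Here is my code: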

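def createTokensFromText(text):
    """Split wiki text into (main heading, body) pairs; subsections stay in the body."""
    sections = []
    cur_section = None
    cur_lines = []

    for line in text.split('\n'):
        line = line.strip()
        # A main heading starts with exactly two '=' signs, not three or more.
        if line.startswith('==') and not line.startswith('==='):
            if cur_section:
                # Close the previous main section before starting a new one.
                sections.append((cur_section, '\n'.join(cur_lines)))
                cur_lines = []
            cur_section = line
            continue
        # Lines before the first main heading are dropped.
        if cur_section:
            cur_lines.append(line)

    # Flush the last section.
    if cur_section:
        sections.append((cur_section, '\n'.join(cur_lines)))
    return sections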

Thanks everyone for the help.

All the answers provided helped me a lot.

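The problem here is that ==(.*)== matches ==(=Subsection=)==, so the first thing to do is make sure there is no = inside the title:

==([^=]*)==([^=]*)

Then we need to make sure there is no = before the start of the match; otherwise the first = of the three is skipped and the subsection heading is matched. A negative lookbehind takes care of that:

(?<!=)==([^=]*)==([^=]*)

(?<!...) means "match only if not preceded by ...". We can do the same at the end with a negative lookahead, (?!=), placed after the closing ==, so that a heading followed by another = is rejected as well.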
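To make the difference concrete, here is a small sketch that runs the original heading pattern and a lookaround-guarded one against the sample text from the question. The variable names (`text`, `naive`, `strict`) are only for illustration; the guarded pattern is the heading part of the expressions discussed in this thread, without the body group.

```python
import re

text = (
    "==Mainsection1==\n"
    "Some text here\n"
    "===Subsection1.1===\n"
    "Other text here\n"
    "\n"
    "==Mainsection2==\n"
    "Text goes here\n"
    "===Subsecttion2.1===\n"
    "Other text goes here."
)

# Greedy '.*' lets the title swallow the extra '=' of a subsection heading.
naive = re.findall(r"==(.*)==", text)

# Lookbehind and lookahead reject headings with more than two '=' on either side.
strict = re.findall(r"(?<!=)==([^=]*)==(?!=)", text)

print(naive)   # ['Mainsection1', '=Subsection1.1=', 'Mainsection2', '=Subsecttion2.1=']
print(strict)  # ['Mainsection1', 'Mainsection2']
```

The naive pattern reports the subsections as sections named `=Subsection1.1=` and `=Subsecttion2.1=`, which is exactly the behaviour described in the question.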
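First, you should know that I understand a little Python but have never formally programmed in it... Codepad says this works, so here goes! :D Sorry the expression is so complex: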
(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))

Edit: broken down, the expression basically says:
(?<!=)        # First, look behind to assert that there is not an equals sign
==            # Match two equals signs
([^=]+)       # Capture one or more characters that are not an equals sign
==            # Match two equals signs
(?!=)         # Then verify that there are no equals signs following this
(             # Start a capturing group
  [\s\S]*?    #   Match zero or more of ANY character (even CrLf), but BE LAZY
  (?=         #   Look ahead to verify that either...
    $         #     this is the end of the string
    |         #     -OR-
    (?<!=)    #     when I look behind there is no equals sign
    ==        #     then there are two equals signs
    [^=]+     #     then one or more characters that are not equals signs
    ==        #     then two equals signs
    (?!=)     #     then verify that there are no equals signs following this
  )           #   End look-ahead group
)             # End capturing group
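The breakdown above maps fairly directly onto Python's `re.VERBOSE` flag, which ignores whitespace inside the pattern and allows `#` comments. This is only a sketch under the assumption that the input looks like the sample in the question; the names `SECTION_RE` and `wiki_text` are made up for the example.

```python
import re

SECTION_RE = re.compile(
    r"""
    (?<!=)==          # opening '==' that is not preceded by '='
    ([^=]+)           # group 1: the section title
    ==(?!=)           # closing '==' that is not followed by '='
    (                 # group 2: the section body
      [\s\S]*?        #   any characters, newlines included, lazily
      (?=             #   stop right before...
        $             #     the end of the text
        |             #     or
        (?<!=)==[^=]+==(?!=)   # the next main heading
      )
    )
    """,
    re.VERBOSE,
)

wiki_text = (
    "==Mainsection1==\nSome text here\n===Subsection1.1===\nOther text here\n\n"
    "==Mainsection2==\nText goes here\n===Subsecttion2.1===\nOther text goes here."
)

sections = SECTION_RE.findall(wiki_text)
for title, body in sections:
    print(title, repr(body))
```

Each tuple holds the title and everything up to the next main heading, so the subsection headings stay inside the body.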

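If you like, I can break it down for you! Just ask! EDIT (see the edit above) END EDIT

Well, we are almost there, but I left something out. I still need the text of the subsections, so that the result would be: [('Mainsection1', 'Some text here\n===Subsection1.1===\nOther text here\n') Many thanks.

I'm afraid we have reached the limit of my regex knowledge. I would just use the first part of the regex ((?...) to detect the headings, and then take the text between the matches. Do you want me to elaborate on that idea, or is it not an option?

If you want to, that would work, please go ahead! Many thanks so far.

Maybe you would be better off using an existing MediaWiki markup parser? At first glance, mwlib looks the most promising.

This is not a good job for regular expressions. You would be better off using a real parser (such as PLY or PyParsing), or better yet: a library someone else has already written.

It may not be a really good job for regex, but it is certainly doable. The question is how close your particular syntax is to any available wiki parser, and what reasons you might have for deviating from a "standard", or at least popular, syntax.

So far the regex seems to be workable. But thanks for the mwlib link, that works well too. Many thanks.

Could you break the regex down for me?

Thank you very much! Very clean explanation.

@Fox If my answer or any other answer helped you, please pick the best answer by clicking the check mark below the voting arrows at the top of the answer. It helps encourage more good and useful answers in the future :D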
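import re

# x is the wiki text from the question.
# Match main headings only: exactly two '=' on each side.
section = re.compile(r"(?<!=)==([^=]*)==(?!=)")

result = []
mo = section.search(x)
previous_end = 0
previous_section = None
while mo is not None:
    start = mo.start()
    if previous_section:
        # The text between the previous heading and this one is the previous body.
        result.append((previous_section, x[previous_end:start]))
    previous_section = mo.group(0)
    previous_end = mo.end()
    mo = section.search(x, previous_end)
# The last section runs to the end of the text.
if previous_section:
    result.append((previous_section, x[previous_end:]))
print(result)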

[('==Mainsection1==',
  '  \nSome text here  \n===Subsection1.1===  \nOther text here  \n\n'),
 ('==Mainsection2==',
  '  \nText goes here  \n===Subsecttion2.1===  \nOther text goes here. ')]
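The same "find the headings, then take the text in between" idea can also be sketched with `re.split`, which keeps captured groups in its result. This is an alternative to the loop above, not part of the original answer, and the names `HEADING`, `split_sections` and `sample` are invented for the example.

```python
import re

HEADING = re.compile(r"(?<!=)==([^=]*)==(?!=)")

def split_sections(text):
    """Return (title, body) pairs for each main section of text."""
    # With a capturing group, re.split keeps the captured titles:
    # [preamble, title1, body1, title2, body2, ...]
    parts = HEADING.split(text)
    # Skip the preamble, then pair every title with the body that follows it.
    return list(zip(parts[1::2], parts[2::2]))

sample = (
    "==Mainsection1==\nSome text here\n===Subsection1.1===\nOther text here\n\n"
    "==Mainsection2==\nText goes here\n===Subsecttion2.1===\nOther text goes here."
)

for title, body in split_sections(sample):
    print(title, repr(body))
```

Unlike the loop, this returns the bare title rather than the full `==...==` heading, and text before the first heading is discarded.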

import re

wikiText = """==Mainsection1==
Some text here
===Subsection1.1===
Other text here

==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here. """

# Raw string so that \s and \S reach the regex engine unchanged.
outputArray = re.findall(r'(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))', wikiText)
print(outputArray)

[('Mainsection1', '\nSome text here\n===Subsection1.1===\nOther text here\n\n'), ('Mainsection2', '\nText goes here\n===Subsecttion2.1===\nOther text goes here. ')]

The same expression with named groups:

(?<!=)==(?P<SectionName>[^=]+)==(?!=)(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))
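As a rough illustration of how the named groups might be used, the sketch below feeds that expression to `re.finditer` and builds a dictionary keyed by section title. The helper name `sections_by_name`, the `PATTERN` constant and the choice to strip the body are my own assumptions, not something from the answers.

```python
import re

PATTERN = re.compile(
    r"(?<!=)==(?P<SectionName>[^=]+)==(?!=)"
    r"(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))"
)

def sections_by_name(wiki_text):
    """Map each main section title to its body, with outer whitespace stripped."""
    return {
        m.group("SectionName"): m.group("SectionContent").strip()
        for m in PATTERN.finditer(wiki_text)
    }

sample = (
    "==Mainsection1==\nSome text here\n===Subsection1.1===\nOther text here\n\n"
    "==Mainsection2==\nText goes here\n===Subsecttion2.1===\nOther text goes here."
)

print(sections_by_name(sample)["Mainsection2"])
```

Note that a dictionary keeps only the last body when two main sections share a title, so a list of pairs is the safer choice if duplicate titles are possible.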