Python regex pattern with positive lookahead and lookbehind


I have a file that looks like this:

maar beroepsmensen
    p( maar | <s> )     =  0.005859305 [ -2.232154 ]
    p( beroepsmensen | maar ...)    =  7.865118e-06 [ -5.104295 ] # <- first match: 7.865118e-06
    p( kunnen | beroepsmensen ...)  =  6.842439e-08 [ -5.104295 ]
    p( </s> | kunnen ...)   =  0.04018713 [ -1.395913 ]
1 sentences, 2 words, 0 OOVs
0 zeroprobs, logprob= -8.732362 ppl= 814.3052 ppl1= 23237.04

dan scootermobiel
    p( dan | <s> )  =  0.005859305 [ -2.232154 ]
    p( scootermobiel | dan)     =  0.827746 [ -9.106363 ] # <- second match: 0.827746
    p( he | scootermobiel)  =  0.2520393 [ -3.123365 ]
    p( </s> | he ...)   =  0.04499642 [ -1.346822 ]
1 sentences, 2 words, 0 OOVs
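I also have a list containing some words, for example mylst = ['beroepsmensen', 'scootermobiel'].

I want to loop over the list and, for each word, find the first number on the line with the pattern p( ith_word_from_list | anotherword ) = 9.9999999. See the matches marked in the toy example above. Note that the other word after the | can be followed by three dots, and that the number is sometimes written in e-notation.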

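So far I have managed to write a regex that finds all the leading numbers, with an optional decimal part and an optional e- exponent, using a positive lookahead for what follows the number: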

\d+(\.\d+)?(e-\d+)?(?= +\[)

(The number of spaces after the number can also vary.)
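To make the limitation concrete, here is a minimal sketch of a lookahead-only pattern of the kind described. It is my own reconstruction, with non-capturing groups so that `re.findall` returns whole matches, and the name `number_before_bracket` is just illustrative. It picks up the number on every `p( ... )` line, whichever word appears inside the parentheses, which is why something has to constrain the text before the number.

```python
import re

# A number with an optional fractional part and an optional negative exponent,
# followed (positive lookahead) by one or more spaces and an opening bracket.
number_before_bracket = re.compile(r'\d+(?:\.\d+)?(?:e-\d+)?(?= +\[)')

sample = (
    "    p( maar | <s> )     =  0.005859305 [ -2.232154 ]\n"
    "    p( beroepsmensen | maar ...)    =  7.865118e-06 [ -5.104295 ]\n"
)

# The lookahead alone cannot tell the two lines apart: both numbers match.
found = number_before_bracket.findall(sample)
print(found)
```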


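However, I have not managed to write a positive lookbehind that matches the pattern before the number, for example some kind of lookbehind assertion.

How about a regex like this: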

\s*p\s*\(\s*\w+\s*\|\s*\w+\s*\)\s*=\s*([\de\-\.]+)\s*\[\s*[\-\.\de]+\s*\]\s*
All you need to do is extract group 1 from each match.
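A side note on why a capture group is the easier route here: Python's built-in `re` module only accepts fixed-width lookbehind, so a prefix containing `\s*` or `\w+` cannot go inside `(?<=...)`. The sketch below is only an illustration (the sample line and variable names are mine): it shows the compile error and the capture-group alternative.

```python
import re

line = "    p( scootermobiel | dan)     =  0.827746 [ -9.106363 ]"

# A variable-width lookbehind is rejected by the standard re module.
try:
    re.compile(r'(?<=\|\s*\w+\s*\)\s*=\s*)[\de.\-]+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Matching the prefix normally and capturing the number avoids the restriction.
match = re.search(r'\|\s*\w+\s*\)\s*=\s*([\de.\-]+)', line)
value = match.group(1)
print(lookbehind_ok, value)
```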

The complete code looks like this:

import re

pattern = r'\s*p\s*\(\s*\w+\s*\|\s*\w+\s*\)\s*=\s*([\de\-\.]+)\s*\[\s*[\-\.\de]+\s*\]\s*'

f = """maar beroepsmensen
    p( maar | <s> )     =  0.005859305 [ -2.232154 ]
    p( beroepsmensen | maar )    =  7.865118e-06 [ -5.104295 ] # <- first match: 7.865118e-06
    p( </s> | beroepsmensen ...)    =  0.04018713 [ -1.395913 ]
1 sentences, 2 words, 0 OOVs
0 zeroprobs, logprob= -8.732362 ppl= 814.3052 ppl1= 23237.04

dan scootermobiel
    p( dan | <s> )  =  0.005859305 [ -2.232154 ]
    p( scootermobiel | dan)     =  0.827746 [ -9.106363 ] # <- second match: 0.827746
    p( </s> | scootermobiel ...)    =  0.04499642 [ -1.346822 ]
1 sentences, 2 words, 0 OOVs"""

print(re.findall(pattern, f))

The output will be ['7.865118e-06', '0.827746']

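Thanks for this solution, Robo Mop. Unfortunately my toy example was unclear: the input file can contain more lines matching the pattern \s*p\s*\(\s*\w+\s*\|\s*\w+\s*\)\s*=. I will edit my question.

Can you explain what you mean by more lines in the input file?

Thanks for your comment, because it made me realize that my question was not precise enough. I am only interested in lines where an element of the list is in the first part of the parentheses, like p( ith_word | otherword ).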
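Following up on that clarification, one possible approach is to build the pattern per word, so that only lines with the list element in the first position can match. This is a sketch rather than a tested solution for the real file. The helper name `first_probability` is hypothetical, `[^)]*` is my assumption for covering contexts such as `<s>` or a trailing `...`, and `re.escape` is there in case a word contains regex metacharacters.

```python
import re

mylst = ['beroepsmensen', 'scootermobiel']

text = """maar beroepsmensen
    p( maar | <s> )     =  0.005859305 [ -2.232154 ]
    p( beroepsmensen | maar ...)    =  7.865118e-06 [ -5.104295 ]
    p( kunnen | beroepsmensen ...)  =  6.842439e-08 [ -5.104295 ]
    p( </s> | kunnen ...)   =  0.04018713 [ -1.395913 ]
dan scootermobiel
    p( dan | <s> )  =  0.005859305 [ -2.232154 ]
    p( scootermobiel | dan)     =  0.827746 [ -9.106363 ]
    p( he | scootermobiel)  =  0.2520393 [ -3.123365 ]
"""


def first_probability(word, text):
    """Return the first number on a 'p( word | ... ) = number' line, or None."""
    pattern = re.compile(
        r'p\s*\(\s*' + re.escape(word) + r'\s*\|'  # word must come first, before the |
        r'[^)]*\)\s*=\s*'                          # any context, then ") ="
        r'(\d+(?:\.\d+)?(?:e[-+]?\d+)?)'           # group 1: number, optional exponent
    )
    match = pattern.search(text)
    return match.group(1) if match else None


results = {word: first_probability(word, text) for word in mylst}
print(results)
```

Lines such as `p( kunnen | beroepsmensen ...)` are skipped because the word has to appear directly after `p(`. The values come back as strings, so wrap them in `float()` if you need numbers.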