Matching paragraphs under a title using regex and Python

python regex python-2.7

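I have a section that I need to match. My condition is: match everything, including the title. I already match the title pattern; what I need now is to match the paragraphs that start with the word "fig". That mostly works, but I noticed that as soon as a non-matching line is encountered, matching stops and nothing further is captured.
Another condition is that a paragraph with fewer than 3 words must not be matched.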

Here is the sample text:

List of tables and figure captions:

Figure 1 shows study area and locations of borewell and surface water sampling  points. Low lying area on the western side is clearly visible.


Figure 2 displays nothing much.
no match
here


Fig.y yhth hyt htyh hyt htyh th thyt htyht thh

Table xvnm,mcxnv  bvv nd vdm v

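There can be any number of lines between paragraphs. What happens here is that, after the end of the line in the paragraph that begins with Figure 2, the following words are not matched because they do not start with "Fig", even though the sentence after them does. How can I still match the line that starts with Fig.y?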

This is my regex:

'((?:^(?:Supp[elmntary]*\s|list\sof\s)?[^\n]*Fig[ures]*[^\n]*(?:Captions?|Legends?|Lists?)[^\n])(?:(?!^)[^\n]+|(?!\n\w+\s*\w+\s*:?\s*$)\n|Fig)*)'

Flags used: re.I, re.M, re.S (DOTALL)

I tried adding this as a lookahead:

(?:.*^Fig[^\n]*$){0,}

But this does not work, because I cannot find a way to skip the lines containing "no match" and "here".

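Thanks for your help. I will use the new answer.

It is possible that I still have not fully understood your requirements, but I will take another stab at it. I assume that an appropriate regex for capturing titles can be plugged in from your original regex.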

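# Python 2.7
# Typos may exist, didn't test yet
import re

def emitRecord(matches):
  # Print one record: the title line followed by its figure lines
  if len(matches) > 0:
    print "----- Start record -----"
    print "\n".join(matches)
    print "----- End record -----"

matches = []
seenTitle = False
# Placeholder: plug in the title pattern from the question
titleRegex = re.compile(r'expression to capture titles here')
# A line starting with "fig" or "figure" followed by a non-letter
figureRegex = re.compile(r'^(?:fig|figure)[^a-z]', re.I)
with open('text.txt', 'r') as text:
  for line in text:
    # Blank lines only separate paragraphs
    if not line.strip(): continue
    if titleRegex.search(line):
      # A new title closes the previous record and starts the next one
      seenTitle = True
      emitRecord(matches)
      matches = [line.strip()]
    elif seenTitle:
      # Skip lines with fewer than 3 words, then keep only figure lines
      if len(line.split()) < 3: continue
      if figureRegex.search(line): matches.append(line.strip())
# Emit the last record once the file is exhausted
emitRecord(matches)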
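
To try the same line-by-line idea without a file on disk, the sketch below applies it to the sample text from the question. It is only an illustration under assumptions: `TITLE_RE` is a hypothetical, simplified stand-in for the real title pattern, `collect_records` and `SAMPLE` are names made up here, and each line is treated as one paragraph, as in the sample.

```python
from __future__ import print_function  # lets the sketch run on Python 2.7 and 3
import re

# Hypothetical, simplified title pattern; substitute the real one from the question.
TITLE_RE = re.compile(r'^list\s+of\s+.*figure\s+captions', re.I)
# A figure line starts with "fig" or "figure" followed by a non-letter.
FIGURE_RE = re.compile(r'^(?:fig|figure)[^a-z]', re.I)


def collect_records(text):
    """Return a list of records, each shaped [title, figure line, ...]."""
    records = []
    current = None  # stays None until the first title is seen
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate paragraphs
        if TITLE_RE.search(line):
            current = [line]
            records.append(current)
        elif (current is not None
              and len(line.split()) >= 3      # drop paragraphs under 3 words
              and FIGURE_RE.search(line)):
            current.append(line)
    return records


SAMPLE = """List of tables and figure captions:

Figure 1 shows study area and locations of borewell and surface water sampling  points. Low lying area on the western side is clearly visible.


Figure 2 displays nothing much.
no match
here


Fig.y yhth hyt htyh hyt htyh th thyt htyht thh

Table xvnm,mcxnv  bvv nd vdm v
"""

for record in collect_records(SAMPLE):
    print("\n".join(record))
```

Because every line is tested on its own, the short lines "no match" and "here" are simply skipped and do not stop the later Fig.y line from being collected.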

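Sorry, this does not work for me. My text is laid out in XML format, and you did not take the variations in the title into account. The above can be done using a regex alone. The lookahead for the title is the main challenge (and one of the reasons why there are no other answers yet).

I have made another attempt. Apologies if I have missed again; I may just not fully understand your use case.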
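
For readers who prefer to stay with regular expressions, one possible two-step sketch is to locate the heading first and then collect figure lines from the remaining text, with the three-word rule expressed as a lookahead. Everything here is an assumption for illustration: `HEADING_RE` is a loose stand-in for the title pattern in the question, `FIGURE_RE`, `figure_paragraphs` and `SAMPLE` are hypothetical names, each line is treated as one paragraph, and XML markup is not handled.

```python
from __future__ import print_function  # lets the sketch run on Python 2.7 and 3
import re

# Loose, hypothetical heading pattern: a line mentioning fig/figure(s)
# followed later by caption(s), legend(s) or list(s).
HEADING_RE = re.compile(
    r'^[^\n]*fig(?:ure)?s?[^\n]*(?:captions?|legends?|lists?)[^\n]*$',
    re.I | re.M)

# A line that starts with fig/figure plus a non-letter. The lookahead
# requires at least three words separated by non-newline whitespace.
FIGURE_RE = re.compile(
    r'^(?=\S+[^\S\n]+\S+[^\S\n]+\S)fig(?:ure)?[^a-z\n][^\n]*$',
    re.I | re.M)


def figure_paragraphs(text):
    """Return (heading, [figure lines]) or (None, []) if no heading is found."""
    heading = HEADING_RE.search(text)
    if heading is None:
        return None, []
    rest = text[heading.end():]  # only look below the heading
    figures = [m.group(0).strip() for m in FIGURE_RE.finditer(rest)]
    return heading.group(0).strip(), figures


SAMPLE = """List of tables and figure captions:

Figure 1 shows study area and locations of borewell.


Figure 2 displays nothing much.
no match
here


Fig.y yhth hyt htyh hyt htyh th thyt htyht thh

Table xvnm,mcxnv  bvv nd vdm v
"""

heading, figures = figure_paragraphs(SAMPLE)
print(heading)
for line in figures:
    print(line)
```

Using `finditer` over the remainder means non-matching lines are never part of a match, so nothing has to be skipped explicitly; the trade-off is that the result is a list of lines rather than one contiguous match.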