Python/Regex: How to split a string by a regex pattern while keeping the pattern in the matches?

python regex

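I have a string containing sentences of the following form:

"Ms Smith to talk to her colleague Ms Smith to create new events for the team. team Leader's assistant to organise morning stand-up session. to drive around the city."

The sentences may or may not have punctuation or correct capitalisation.
The text may also contain noise (extra characters, words).
I want to slice it at the following structures:
- "Miss/Ms/Mr/Mrs <name> to"
- "Miss/Ms/Mr/Mrs <name>'s <role> to"
- "team leader to"
- "team leader's <role> to"
- ". to"

I want to split it into the following parts: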

["Ms Smith to talk to her colleague",
"Ms Smith to create new events for the team.",
"team Leader's assistant to organise morning stand-up session.",
"to drive around the city."]

My current solution works, but it is very un-Pythonic, and I am sure there is a way to avoid the while loop:

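import re

def slice(text):
    parts = []
    # Capture everything before the next "<title> <name> to", "team leader to",
    # "...'s <role> to" or ". to". Raw string so the backslash reaches the regex
    # engine; IGNORECASE because capitalisation in the text is unreliable.
    rule = re.compile(
        r"(^.+?)(?:(?:miss [a-z]+|ms [a-z]+|mrs [a-z]+|mr [a-z]+|team leader)(?:'s [a-z ]+?)?|\.) to.+?$",
        re.IGNORECASE,
    )
    while True:
        try:
            part = rule.findall(text)[0]
            parts.append(part)
            # Remove part from text for next iteration
            text = text[len(part):]
        except IndexError:
            # findall returned empty list
            break
    # Add the remainder
    parts.append(text)
    return parts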

Thanks for your help.

You can do everything you want with just findall and capturing subgroups. Is this the output you want?

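import re

s = (
    "Ms Smith to talk to her colleague Ms Smith to create new events for the team. "
    "team Leader's assistant to organise morning stand-up session. to drive around the city."
)

# The space is escaped because re.VERBOSE ignores unescaped whitespace in the pattern.
roles = r"Miss|Ms|Mr|Mrs|team\ leader"
matches = re.findall(rf"""
    (
        \ ?
        (
            (?:{roles})?     # Read the role
            (?:[\w'\-\ ]*?)  # and name into group 2, the "identity"
        )
        to
        ([\w'\-\ ]+?)   # Read the other words into group 3, the "task"
        (?={roles}|\.)  # until the next role or a dot
        [\.\ ]?
    )
    """,
    s,
    flags=re.IGNORECASE | re.VERBOSE,
)

print("Full matches:")
for m in matches:
    print(" *", m[0].strip())


print("\nSplit by identity and task:")
for full, identity, task in matches:
    print(f" * Identity: '{identity.strip()}', task: '{task.strip()}', full match: '{full.strip()}'")

Output:

Full matches:
 * Ms Smith to talk to her colleague
 * Ms Smith to create new events for the team.
 * team Leader's assistant to organise morning stand-up session.
 * to drive around the city.

Split by identity and task:
 * Identity: 'Ms Smith', task: 'talk to her colleague', full match: 'Ms Smith to talk to her colleague'
 * Identity: 'Ms Smith', task: 'create new events for the team', full match: 'Ms Smith to create new events for the team.'
 * Identity: 'team Leader's assistant', task: 'organise morning stand-up session', full match: 'team Leader's assistant to organise morning stand-up session.'
 * Identity: '', task: 'drive around the city', full match: 'to drive around the city.'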
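
If the goal is literally to split while keeping each delimiter at the start of its part, a zero-width lookahead with re.split may also work. The following is only a minimal sketch, not the approach from the answer: the names TITLE, OWNER, SPLIT_RE and split_keep_delimiters are hypothetical, and the delimiter is an assumption modelled on the rule in the question. It consumes only the whitespace in front of each delimiter, so the delimiter text itself stays in the following part.

```python
import re

# Hypothetical building blocks, modelled on the rule in the question.
TITLE = r"(?:miss|ms|mrs|mr)\s+[a-z]+"
OWNER = rf"(?:{TITLE}|team\s+leader)(?:'s\s+[a-z ]+?)?"

SPLIT_RE = re.compile(
    rf"\s+(?={OWNER}\s+to\b)"  # whitespace right before "<owner> to"
    r"|(?<=\.)\s+(?=to\b)",    # whitespace between a dot and a bare "to"
    re.IGNORECASE,
)


def split_keep_delimiters(text):
    """Split at the whitespace before each delimiter; the delimiter is kept."""
    return SPLIT_RE.split(text.strip())


s = (
    "Ms Smith to talk to her colleague Ms Smith to create new events for the team. "
    "team Leader's assistant to organise morning stand-up session. to drive around the city."
)
for part in split_keep_delimiters(s):
    print(part)
```

On the sample string this yields the four parts listed in the question. Because the lookahead requires whitespace before a delimiter, noisy text in which two sentences run together without a space would not be split by this sketch.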

Do you also want to split when a sentence ends with a period?

@RahulP No, the text may contain extra periods (it was OCR'd).

For a non-trivial rule like this you can use re.VERBOSE, which lets you add comments and break a large rule into short, self-explanatory chunks.

Would you like my readability suggestion:

matches = re.findall(rf"""
    (
        \ ?
        (
            (?:{roles})?     # Read the role name into the "identity" group
            (?:[\w'\-\ ]*?)
        )
        to
        ([\w'\-\ ]+?)   # Read the words into the "task" group
        (?={roles}|\.)  # until the next role starts or a dot is found
        [\.\ ]?
    )
    """,
    s,
    flags=re.IGNORECASE | re.VERBOSE,
)
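
One thing to watch for when moving a rule to re.VERBOSE: unescaped whitespace in the pattern is ignored, so a literal space (as in "team leader") has to be escaped or placed in a character class. A small sketch of the difference, using hypothetical variable names:

```python
import re

# Under re.VERBOSE the unescaped space is dropped, so this is really "teamleader".
loose = re.compile(r"team leader", re.VERBOSE | re.IGNORECASE)
# Escaping the space (or writing [ ]) keeps it as a literal space.
strict = re.compile(r"team\ leader", re.VERBOSE | re.IGNORECASE)

sample = "team Leader's assistant"
loose_hit = loose.search(sample)
strict_hit = strict.search(sample)

print(loose_hit)            # no match object: the space was ignored
print(strict_hit.group(0))  # the matched role text
```

The same applies to any role list that is interpolated into a verbose pattern, which is why a multi-word role needs its space escaped before it is inserted.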