How can I combine these regexes in Python to match the variants in this text?


I have data in this format:

Charter by <company> from <origin> to <destination>

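Any of the by, from, and to chunks may be missing. Writing a separate pattern for every combination is tedious but workable for this simple, contrived sample data. My actual data is much more complex, though: each line can have up to 10 such "chunks", so manually specifying each of the 2^10 possible combinations is infeasible.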

I thought this pattern would work:

pattern = r"^Charter( by ([\S ]+))?( from ([\S ]+))?( to ([\S ]+))?$"
match = re.split(pattern, line)

because it allows each chunk to be optional. However, for the line

Charter by Maersk from China to England

the split returns
['', ' by Maersk from China to England', 'Maersk from China to England', None, None, None, None, '']

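Clearly, the problem is that the first [\S ]+ matches all the way to the end of the string rather than stopping at " from" (note the leading space). I'm not sure how to handle this, because company names, origins, and destinations may all contain spaces. Once I have nailed down the pattern, pulling the pieces out should be easier.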
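Assuming that by, if present, always comes before from, and that from, if present, comes before to, let's use three regexes to capture the by, from, and to values, each depending on what may follow that chunk. For example, when capturing the by value, we extract everything between by and the next "from", "to", or the end of the string.
Code: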
import re

data = """Charter by Maersk from China to England
Charter from France
Charter by Safmarine to Poland
Charter by Safmarine from Los Angeles
Charter
Charter to New York
"""

company_pattern = re.compile(r"by (.*?)(?:from|to|$)")
origin_pattern = re.compile(r"from (.*?)(?:to|$)")
destination_pattern = re.compile(r"to (.*?)$")
for line in data.splitlines():
    match = company_pattern.search(line)
    company = match.group(1).strip() if match else ""

    match = origin_pattern.search(line)
    origin = match.group(1).strip() if match else ""

    match = destination_pattern.search(line)
    destination = match.group(1).strip() if match else ""

    print([company, origin, destination])

Prints:
['Maersk', 'China', 'England']
['', 'France', '']
['Safmarine', '', 'Poland']
['Safmarine', 'Los Angeles', '']
['', '', '']
['', '', 'New York']

Note that (?:...) denotes a non-capturing group, and .*? is a non-greedy match of any character, repeated zero or more times.
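One caveat with the three patterns above: the bare alternatives from and to can also match inside a value, for example the "to" in Stockholm. A minimal sketch of a stricter variant follows. It assumes keywords are always separated from values by spaces; the sample line and the names loose_origin and strict_origin are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer above vs. a stricter variant (an assumption, not the
# answerer's code) that only treats " to" as a keyword when it is a whole word.
loose_origin = re.compile(r"from (.*?)(?:to|$)")
strict_origin = re.compile(r"\bfrom (.*?)(?= to\b|$)")

line = "Charter by Maersk from Stockholm to Oslo"  # hypothetical sample line

loose = loose_origin.search(line).group(1).strip()
strict = strict_origin.search(line).group(1).strip()

print(repr(loose))   # 'S'  (stops at the "to" inside "Stockholm")
print(repr(strict))  # 'Stockholm'
```

The lookahead (?= to\b|$) checks for the next keyword without consuming it, so the same idea should carry over to the by and to patterns.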
Simply use the non-greedy form of the pattern:
pattern = r"^Charter( by ([\S ]+?))?( from ([\S ]+?))?( to ([\S ]+?))?$"

On your example, this gives:
['', ' by Maersk', 'Maersk', ' from China', 'China', ' to England', 'England', '']
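To see why the lazy quantifier changes the result, the greedy and lazy forms can be compared side by side. This is a small sketch using re.match instead of re.split; the helper name chunks is invented for this illustration.

```python
import re

line = "Charter by Maersk from China to England"

greedy = r"^Charter( by ([\S ]+))?( from ([\S ]+))?( to ([\S ]+))?$"
lazy = r"^Charter( by ([\S ]+?))?( from ([\S ]+?))?( to ([\S ]+?))?$"

def chunks(pattern, text):
    """Return the (by, from, to) values; hypothetical helper for this demo."""
    m = re.match(pattern, text)
    # Groups 2, 4 and 6 are the inner capturing groups that hold the values.
    return m.group(2, 4, 6) if m else None

print(chunks(greedy, line))  # ('Maersk from China to England', None, None)
print(chunks(lazy, line))    # ('Maersk', 'China', 'England')
```

With the greedy form the first group swallows the rest of the line, so the later optional groups match nothing. With the lazy form each group stops as soon as the remainder of the pattern can match.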

I think I've got it; let me know whether it helps:
regexp = "^by (.*) from (.*) to (.*)$"

. means any character
* means zero or more times
The regex you need is probably:
^Charter(?: by ([\S ]+?))?(?: from ([\S ]+?))?(?: to ([\S ]+?))?$



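Notes:
(...) is a capturing group, i.e. one you can access via .groups(); (?:...) is a non-capturing group, which does not appear in .groups().
[\S ]+ is greedy: it matches as much as possible. [\S ]+? is lazy: it matches the shortest possible text.
A (...)? or (?:...)? group is optional: it may or may not be present in the text.
re.split is the wrong tool here: use re.match (or re.search) instead.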
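A minimal sketch of the re.match suggestion, assuming the six sample lines used elsewhere on this page. Chunks that are missing come back as None from .groups().

```python
import re

# Outer groups are non-capturing, so .groups() holds only the three values.
pattern = re.compile(
    r"^Charter(?: by ([\S ]+?))?(?: from ([\S ]+?))?(?: to ([\S ]+?))?$"
)

lines = [
    "Charter by Maersk from China to England",
    "Charter from France",
    "Charter by Safmarine to Poland",
    "Charter by Safmarine from Los Angeles",
    "Charter",
    "Charter to New York",
]

results = []
for line in lines:
    m = pattern.match(line)
    # A line that does not fit the format at all yields None instead of a tuple.
    results.append(m.groups() if m else None)

for row in results:
    print(row)
# ('Maersk', 'China', 'England')
# (None, 'France', None)
# ('Safmarine', None, 'Poland')
# ('Safmarine', 'Los Angeles', None)
# (None, None, None)
# (None, None, 'New York')
```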
Here is an approach with three regexes, re.findall, and a helper function:
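def joiner(matches):
    # Join the groups of the first match, e.g. ('Los', ' Angeles') -> 'Los Angeles'.
    return ''.join(matches[0]) if matches else ''

# One pattern per keyword: a word, optionally followed by a capitalized second word.
# The order (by, from, to) matches the rows of the output below.
patterns = [re.compile(r'by ([A-Za-z]+)(\s[A-Z][a-z]*)?'),
            re.compile(r'from ([A-Za-z]+)(\s[A-Z][a-z]*)?'),
            re.compile(r'to ([A-Za-z]+)(\s[A-Z][a-z]*)?')]

results = [[joiner(re.findall(p, line)) for line in data.splitlines()] for p in patterns]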

Output:
[['Maersk', '', 'Safmarine', 'Safmarine', '', ''],
 ['China', 'France', '', 'Los Angeles', '', ''],
 ['England', '', 'Poland', '', '', 'New York']]

The speed is not too bad:
In [175]: %timeit [[joiner(re.findall(p, line)) for line in data.splitlines()] for p in patterns]
10000 loops, best of 3: 42.8 µs per loop

It would be faster and simpler without city names like "New York" and "Los Angeles".
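This doesn't help. The question explicitly states that by…, from…, and to… may be missing.
Thanks; combining it with named groups, like so: pattern = "^Charter( by (?P<...>[\S ]+?))?( from (?P<...>[\S ]+?))?( to (?P<...>[\S ]+?))?$" looks like exactly what I need.
Thanks; unfortunately the data is not contrived, so spaces in location names are unavoidable.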
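The named-group idea from the comment above also scales to many chunks if the pattern is assembled from a list of keywords. A sketch under stated assumptions: the group names company, origin, and destination and the helper build_pattern are invented for illustration, and the keywords are taken to appear in a fixed order.

```python
import re

# (keyword, group name) pairs in the order they may appear. The group names
# are hypothetical and only serve this illustration.
CHUNKS = [("by", "company"), ("from", "origin"), ("to", "destination")]

def build_pattern(chunks):
    """Assemble one optional, lazily matched named group per keyword."""
    parts = [
        rf"(?: {re.escape(keyword)} (?P<{name}>[\S ]+?))?"
        for keyword, name in chunks
    ]
    return re.compile("^Charter" + "".join(parts) + "$")

pattern = build_pattern(CHUNKS)

m = pattern.match("Charter by Safmarine from Los Angeles")
print(m.groupdict())
# {'company': 'Safmarine', 'origin': 'Los Angeles', 'destination': None}
```

Extending this to ten chunks would then only mean adding entries to CHUNKS, rather than enumerating the combinations by hand.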