Python regex - matching numbers with an optional range


python regex


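Using Python's re module, I am trying to get dollar values from statements such as these:

"$305,000 - $349,950" should give a tuple like this: (305000, 349950)
"Mid $2M's Buyers" -> (2000000)
"... Buyers Guide $1.29M+" --> (1290000)
"...$485,000 and $510,000" -> (485000, 510000)

The pattern below works for single values, but if there is a range (as in the first and last bullets above), it only gives the last number (i.e. 349950 and 510000).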

_pattern = r"""(?x)
    ^
    .*
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
    (?:\s(?:\-|\band\b|\bto\b)\s)?
    (?P<target2>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer2>[kKmM]?\s?[mM]?)
    )?
    .*?
    $
    """

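When I try

target2 = match.group("target2").strip()

target2 always comes up as None.

I am definitely not a regex expert, but I really can't see what I am doing wrong. The multiplier group works, and as far as I can tell the target2 group is the same pattern, i.e. an optional match at the end.
I hope my wording is understandable...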
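+1 for using verbose mode for the regex pattern.
The .* at the start of the pattern is greedy, so it first tries to match the whole line. It then backtracks until target1 can match. Everything else in the pattern is optional, so matching target1 against the last amount on the line already counts as a successful match. You can try making the first .* non-greedy by adding a ?, like this: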
_pattern = r"""(?x)
    ^
    .*?                   # <-- add the ?
    (?P<target1>
    ... snip ...
    """

You could come up with some regex logic and combine it with a function that converts the abbreviated numbers. Here is some Python sample code:
# -*- coding: utf-8 -*-
import re
import locale
from locale import atof

# atof() needs a locale that treats "," as the thousands separator
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

string = """"$305,000 - $349,950"
"Mid $2M's Buyers"
"... Buyers Guide $1.29M+"
"...$485,000 and $510,000"
"""

def convert_number(number, unit):
    if unit == "K":
        exp = 10**3
    elif unit == "M":
        exp = 10**6
    return (atof(number) * exp)

matches = []
rx = r"""
    \$(?P<value>\d+[\d,.]*)         # match a dollar sign 
                                    # followed by numbers, dots and commas
                                    # make the first digit necessary (+)
    (?P<unit>M|K)?                  # match M or K and save it to a group
    (                               # opening parenthesis
        \s(?:-|and)\s               # match a whitespace, dash or "and"
        \$(?P<value1>\d+[\d,.]*)    # the same pattern as above
        (?P<unit1>M|K)?
    )?                              # closing parenthesis, 
                                    # make the whole subpattern optional (?)
"""
for match in re.finditer(rx, string, re.VERBOSE):
    if match.group('unit') is not None:
        value1 = convert_number(match.group('value'), match.group('unit'))
    else:
        value1 = atof(match.group('value'))
    m = value1  # a single value is stored as a plain number, not a tuple
    if match.group('value1') is not None:
        if match.group('unit1') is not None:
            value2 = convert_number(match.group('value1'), match.group('unit1'))
        else:
            value2 = atof(match.group('value1'))
        m = (value1, value2)
    matches.append(m)

print(matches)
# [(305000.0, 349950.0), 2000000.0, 1290000.0, (485000.0, 510000.0)]


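The code uses quite a bit of logic: it first imports the locale module for the atof() function, defines a function convert_number(), and then searches for ranges using the regular expression explained in the code comments. Obviously, you could add other currency symbols such as € or £, but they were not in the original examples.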
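
The answer relies on locale.atof(), which only works as intended if the en_US.UTF-8 locale is available on the machine. Where that is not guaranteed, the conversion could be approximated by hand. This is a minimal sketch, assuming amounts always use "," for thousands and "." for decimals; convert_number_plain and MULTIPLIERS are hypothetical names and not part of the answer:

```python
# Hypothetical lookup table for the unit suffixes used in the answer.
MULTIPLIERS = {"K": 10**3, "M": 10**6}

def convert_number_plain(number, unit=None):
    """Convert text such as '305,000' or '1.29' (with unit 'M') to a float.

    Assumes ',' is a thousands separator and '.' is the decimal point.
    """
    value = float(number.replace(",", ""))
    # A missing or unknown unit leaves the value unscaled.
    return value * MULTIPLIERS.get(unit, 1)

print(convert_number_plain("305,000"))
print(convert_number_plain("2", "M"))
print(convert_number_plain("1.29", "M"))
```

Compared with the original convert_number(), this variant also accepts a missing unit, so the caller does not need a separate branch for plain numbers.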
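Unfortunately, the first option did not work. The result is the same, although it now reports the first number in the range. I thought your second suggestion was a good idea, but it does not work either. By the way, the syntax seems to be re.search(pattern, line). In any case, the more group always seems to be None...
Fixed the call to re.search(). Can you use re.findall() with a pattern that contains only target1? It should return a list of all matches.
Yes, I can get that to work :) Without the more group it returns e.g. [('305000', ''), ('349950', '')] for a range. I think there is some problem with the regex in the more group, but in any case re.findall() works well. If you add something about that to your answer, I will mark it as accepted. Thanks for your help.
I like this solution. Although it is not a pure regex solution, it is easier to understand.
@murphy: In my opinion it does not necessarily have to be regex only :)
Yes. If I have to debug this code in, say, 3 months, there is a good chance I will understand it right away. If I used an evil regex now, it would be good for my ego, but 3 months later I would hate myself :) +1
Thanks Jan, I am actually already doing something similar to convert the abbreviations. My problem with the pattern in the question was more that, for some reason, it did not find the second target when the input was a range. I.e. "$340000-$400000" does not yield the lower bound and only matches $400000.
I see, and I would have marked your answer as the correct one. It is just that @RootTwo came up with a pattern that worked for me before you posted.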
_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
    (?P<more>\s(?:\-|\band\b|\bto\b)\s)?
    """

match = re.search(_pattern, line)
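# the pattern has three groups, so fetch the two of interest by name
target1 = match.group("target1")
more = match.group("more")
if more:
    # re.search() takes no start argument; a compiled pattern's search() does
    target2 = re.compile(_pattern).search(line, match.end())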
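
The comments report that the more group always came back as None. One plausible explanation, offered here as an assumption rather than something the thread confirms, is that the \s? inside the multiplier group consumes the space right after the first amount, so the separator can no longer find its leading \s. A minimal sketch that drops the optional whitespace from the multiplier; AMOUNT, RANGE and find_range are hypothetical names:

```python
import re

# Hypothetical simplification of the amount pattern: the multiplier no longer
# allows whitespace, so the space before "-", "and" or "to" is left over for
# the separator.
AMOUNT = r"[€$£]\d{1,3}[,.]?\d{0,3}(?:[,.]\d{3})*[kKmM]?"

RANGE = re.compile(
    r"(?P<target1>" + AMOUNT + r")"
    r"(?:\s(?:-|and\b|to\b)\s"
    r"(?P<target2>" + AMOUNT + r"))?"
)

def find_range(line):
    """Return (target1, target2); target2 is None when there is no range."""
    match = RANGE.search(line)
    if match is None:
        return None
    return match.group("target1"), match.group("target2")

# With the optional \s? appended, the amount swallows the trailing space:
eats_space = re.match(AMOUNT + r"\s?", "$305,000 - $349,950")
print(repr(eats_space.group()))  # '$305,000 '

for line in ['$305,000 - $349,950', "Mid $2M's Buyers",
             '... Buyers Guide $1.29M+', '...$485,000 and $510,000']:
    print(find_range(line))
```

Making the whole separator-plus-second-amount part a single optional group means target2 is only attempted when a separator was actually found.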

_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
"""

targets = re.findall(_pattern, line)
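
Because the pattern contains two groups, re.findall() returns one (target1, multiplyer1) tuple per match rather than plain strings. A small sketch of how the results might be unpacked; find_targets is a hypothetical helper name, and the strip() is there because the multiplier group can capture a trailing space:

```python
import re

_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
"""

def find_targets(line):
    # findall() yields (target1, multiplyer1) tuples; keep target1 and trim
    # the optional whitespace the multiplier group may have captured.
    return [target.strip() for target, _multiplier in re.findall(_pattern, line)]

print(find_targets("$305,000 - $349,950"))  # ['$305,000', '$349,950']
print(find_targets("Mid $2M's Buyers"))     # ['$2M']
```

A range then simply shows up as a list with two entries, and a single value as a list with one.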
