Python: searching for multiple regex substrings


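I have a SQLite table that contains sales records. Field 13 holds the shipping price, and there are essentially three possibilities:

a price, e.g. £15.20
free
not specified

The problem is that the field does not always contain only these words: it can say "shipping is £15.20" or "free shipping", for example, and I need to normalize it to the possibilities above. I use regular expressions: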

def correct_shipping(db_data):
    pattern=re.compile("\£(\d+.\d+)") #search for price
    pattern_free=re.compile("free") #search for free shipping
    pattern_not=re.compile("not specified") #search for shipping not specified

    for every_line in db_data:
        try:
            found=pattern.search(every_line[13].replace(',','')).group(1)
        except:
            try:
                found=pattern_free.search(every_line[13]).group()
            except:
                found=pattern_not.search(every_line[13]).group()

        if found:
            query="UPDATE MAINTABLE SET Shipping='"+found+"' WHERE Id="+str(every_line[0])
            db_cursor.execute(query)
    db_connection.commit()

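But this code raises an exception:

AttributeError: 'NoneType' object has no attribute 'group'

The first value of the form "5.20" triggers it, because none of the patterns is found.

The question is how to search the strings properly (is try/except necessary at all?), or how to ignore the exception when none of the strings is found (although that is not a very good solution).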

Don't search for the pound sign. Search for the digits, and then add the currency symbol yourself:

import re

strings = [
    "5.20",
    "$5.20",
    "$.50",
    "$5",
    "Shipping is free",
    "Shipping: not specified",
    "free",
    "not specified",
]

pattern = r"""
    \d*                     #A digit 0 or more times 
    [.]?                    #A dot, optional
    \d+                     #A digit, one or more times 
    | free                  #Or the word free
    | not \s+ specified     #Or the phrase "not specified"
"""

regex = re.compile(pattern, flags=re.X)
results = []

for string in strings:
    md = regex.search(string)  # returns None when nothing matches

    if md:
        match = md.group()
        if re.search(r"\d", match):  # a number was matched, so prefix the symbol
            match = "$" + match
        results.append(match)
    else:
        print("Error--no match!")

print(results)

--output:--
['$5.20', '$5.20', '$.50', '$5', 'free', 'not specified', 'free', 'not specified']
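
To tie this back to the table in the question, the same pattern can drive the UPDATE. The following is only a sketch under stated assumptions: the table name MAINTABLE and the columns Id and Shipping come from the question, but the two-column schema, the sample rows, and the helper name normalize are made up for illustration, and it prefixes "£" rather than "$". It assumes Python 3 and passes values as query parameters instead of concatenating them into the SQL string.

```python
import re
import sqlite3

# Same idea as above: match the digits (or the fixed phrases), add "£" later.
SHIPPING = re.compile(r"\d*[.]?\d+|free|not\s+specified")


def normalize(text):
    """Hypothetical helper: return the normalized value, or None if no match."""
    match = SHIPPING.search(text.replace(",", ""))
    if match is None:
        return None
    value = match.group()
    return "£" + value if value[-1].isdigit() else value


# Assumed in-memory table; the real one in the question has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MAINTABLE (Id INTEGER PRIMARY KEY, Shipping TEXT)")
conn.executemany(
    "INSERT INTO MAINTABLE (Id, Shipping) VALUES (?, ?)",
    [(1, "Shipping is £15.20"), (2, "free shipping"), (3, "5.20"), (4, "n/a")],
)

rows = conn.execute("SELECT Id, Shipping FROM MAINTABLE").fetchall()
for row_id, shipping in rows:
    value = normalize(shipping)
    if value is not None:  # leave unrecognized rows untouched
        conn.execute(
            "UPDATE MAINTABLE SET Shipping = ? WHERE Id = ?", (value, row_id)
        )
conn.commit()

result = conn.execute("SELECT Id, Shipping FROM MAINTABLE ORDER BY Id").fetchall()
print(result)
```

Rows that match nothing (such as "n/a" here) are skipped rather than raising, which is one way to answer the "how do I ignore a non-match" part of the question.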

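The first problem is that your code doesn't handle failure properly. If you want to use functions that return None when there is no match, you have to either check for None, or handle the AttributeError that results from trying to call group on it. You could just add one more layer of try/except under the first two, but that is hard to read. A function like this would be much simpler: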
match = pattern.search(every_line[13].replace(',',''))
if match:
    return match.group(1)
match = pattern_free.search(every_line[13])
if match:
    return match.group()
match = pattern_not.search(every_line[13])
if match:
    return match.group()

This uses the same regexps as your code, but it doesn't have the problem of trying to call group whether or not each match succeeded, so it just works.
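
For reference, here is a self-contained version of that idea, as a rough sketch rather than the definitive fix. The function name normalize_shipping and the sample strings are assumptions for illustration, and unlike the question's code the price pattern is a raw string with the dot escaped. It assumes Python 3, where string literals are Unicode.

```python
import re

# Patterns from the question, written as raw strings with the dot escaped.
PRICE = re.compile(r"£(\d+\.\d+)")
FREE = re.compile(r"free")
NOT_SPECIFIED = re.compile(r"not specified")


def normalize_shipping(text):
    """Return the normalized shipping value, or None if nothing matches."""
    match = PRICE.search(text.replace(",", ""))
    if match:
        return match.group(1)
    match = FREE.search(text)
    if match:
        return match.group()
    match = NOT_SPECIFIED.search(text)
    if match:
        return match.group()
    return None  # the caller decides what to do with unrecognized values


for text in ["Shipping is £15.20", "free shipping", "not specified", "5.20"]:
    print(repr(text), "->", normalize_shipping(text))
```

A value such as "5.20", which has no pound sign, simply yields None here instead of raising AttributeError.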

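There are ways to simplify this further. For example, you don't need regexps to search for fixed strings like "free"; you can just use str.find or str.index. Alternatively, you can use search with a single regexp that contains a three-way alternation, instead of doing three separate searches.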
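
Both simplifications might look roughly like this. This is a sketch only: the function names are hypothetical, the patterns assume Python 3 and an escaped dot, and the `in` operator is used in place of str.find.

```python
import re

# One pattern, three alternatives; group 1 is only set for the price branch.
SHIPPING = re.compile(r"£(\d+\.\d+)|free|not specified")


def normalize_with_one_regex(text):
    match = SHIPPING.search(text.replace(",", ""))
    if match is None:
        return None
    # match.group(1) is None when "free" or "not specified" matched.
    return match.group(1) or match.group()


def normalize_without_regex_for_words(text):
    # Fixed strings need no regex: the `in` operator (or str.find) is enough.
    match = re.search(r"£(\d+\.\d+)", text.replace(",", ""))
    if match:
        return match.group(1)
    for word in ("free", "not specified"):
        if word in text:
            return word
    return None
```

One difference worth knowing: the single alternation returns whichever alternative occurs first in the string, while separate searches check the price first. For "free over £20.00" the first function returns "free" and the second returns "20.00".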

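The next problem is that your first pattern is wrong. You shouldn't backslash-escape anything except regexp special characters (or Python special characters, but you should be using raw strings so that you don't need to escape those), and the pound sign is not one of them.
More importantly, if this is Python 2.x, you should never put non-ASCII characters into str literals; only put them in unicode literals. (And only if you specify an encoding for your source file.)
Python's regexp engine can handle Unicode, but not if you give it mojibake, such as a UTF-8 pound sign decoded as Latin-1 or whatever. (In fact, even if you get all the encodings right, it is better to give it Unicode patterns and search strings rather than encoded ones. Otherwise it has no way of knowing that it is searching Unicode, or that some characters are longer than one byte, and so on.)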
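
A small check of the pattern itself, assuming Python 3 (where ordinary string literals are already Unicode). The names loose and strict are just labels for this sketch; it compares the question's unescaped dot with an escaped one, both written as raw strings and without a backslash before the pound sign.

```python
import re

loose = re.compile(r"£(\d+.\d+)")    # "." matches any character
strict = re.compile(r"£(\d+\.\d+)")  # "\." matches only a literal dot

loose_hit = loose.search("£10X5")
strict_hit = strict.search("£10X5")
price_hit = strict.search("Shipping is £15.20")

print(loose_hit.group(1))   # the loose pattern accepts "10X5" as a price
print(strict_hit)           # the strict pattern rejects it
print(price_hit.group(1))
```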
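What is a "regex substring"?
Maybe this question will be useful:
Why are you backslash-escaping the pound sign? It isn't a special character in regular expressions. Also, is this Python 2.x or 3.x? Do you have an encoding declaration on your source file?
Note that "." in a regular expression matches any character, so the regex \d+.\d+ will match the string "10X5".
How do you think that helps? (And how do you know it doesn't hurt? What if there are other numbers in his source data?)
Thank you very much for your insights, especially on the regex patterns.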