Python: matching a string against a pattern string with number, string, and date placeholders, with good error reporting


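I want to write a function:

match_string(input, pattern, valid_words, date_format)

input is an ordinary string. pattern is a string containing number, word, and date placeholders, for example '# is a number ## is a string ### is a date'. Here I used #, ##, and ### for the number, string, and date placeholders respectively, but I am not tied to any particular representation of the placeholders.

match_string should return True if input "matches" pattern; that is, if it has a number where there is a number placeholder, a word from the given list of valid words where there is a word placeholder, and a date in the given date_format where there is a date placeholder.

Finally, I need match_string to return detailed error information: if input fails to match at one of the placeholders, it should say "not a number", "not in word list", or "not a date". If it fails to match the ordinary parts of the pattern, it should just raise an error or return False.

That is a lot to ask :) but the question is how to implement match_string. I think regular expressions, string formatting, and custom error definitions should help me, but I am having trouble putting them together. I hope this question will help others by showing a flexible use of regular expressions in Python.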

Examples:

> match_string('1 is a number foo is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', [foo], '%m-%d-%y')
True

> match_string('foo is a number bar is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', [foo], '%m-%d-%y')
Error: number expected for 'foo'

> match_string('1 is a number bar is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', [foo], '%m-%d-%y')
Error: invalid word: bar

> match_string('1 is a number foo is a string January is a date', '# is a number ## is a string ### is a date', [foo], '%m-%d-%y')
Error: invalid date format January

Thanks in advance.

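I think manipulating the string directly or modifying a single regex pattern would be more efficient. Unfortunately, the latter would only return true or false.

Anyway, I wrote a script that meets the requirements. It accepts up to 31 days in every month, but it should be fairly easy to add more restrictions.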

from re import sub, match

def match_string(input, pattern, valid_words, date_format):
    errors = []

    # makes sure that input and pattern are compatible
    regex_pattern = sub(r'#{1,3}', '(.+?)', pattern)
    if not match(regex_pattern, input):
        return 'Error: Input doesn\'t match pattern!'   

    # converts the data_format to a regex
    date_regex = sub(r'%d', '(?P<day>\d+)', date_format)
    date_regex = sub(r'%m', '(?P<month>\d+)', date_regex)
    date_regex = sub(r'%y', '(?P<year>\d+)', date_regex)

    # extracts the dates
    regex_pattern = sub(r'###', '(.+?)', pattern)
    regex_pattern = sub(r'##', '(?:.+?)', regex_pattern)
    regex_pattern = sub(r'#', '(?:.+?)', regex_pattern)
    for date in match(regex_pattern, input).groups():
        m = match(date_regex, date)
        if not m:
            errors.append('Error: %s is not a valid date!' % date)
        else:
            if int(m.group('day')) < 1 or 31 < int(m.group('day')):
                errors.append('Error: %s is not a valid day!' % m.group('day'))
            if int(m.group('month')) < 1 or 12 < int(m.group('month')):
                errors.append('Error: %s is not a valid month!' % m.group('month'))

    # extracts the generic words
    regex_pattern = sub(r'###', '(?:.+?)', pattern)
    regex_pattern = sub(r'##', '(.+?)', regex_pattern)
    regex_pattern = sub(r'#', '(?:.+?)', regex_pattern)
    for word in match(regex_pattern, input).groups():
        if not word.strip() in valid_words:
            errors.append('Error: %s is not a valid word!' % word)

    # extracts the numbers
    regex_pattern = sub(r'###', '(?:.+?)', pattern)
    regex_pattern = sub(r'##', '(?:.+?)', regex_pattern)
    regex_pattern = sub(r'#', '(.+?)', regex_pattern)
    for number in match(regex_pattern, input).groups():
        if not match(r'\d+', number):
            errors.append('Error: %s is not a valid number!' % number)

    if len(errors) == 0:
        return True
    else:
        return '\n'.join(errors)

print match_string('1 and 2 are numbers foo and bar are strings 12-1-2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '%m-%d-%y')
print
print match_string('1 is a number foo is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
print
print match_string('foo is a number bar is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
print
print match_string('1 is a number bar is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
print
print match_string('1 is a number foo is a string January is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
As you can see, your second example produces two errors rather than one.
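The 31-days-in-every-month limitation could probably be lifted by letting datetime.strptime do the date checking. This is only a minimal sketch, assuming Python 3; check_date is a hypothetical helper name, not part of the script above. Note that strptime reads %y as a two-digit year, so a four-digit year such as 2013 needs %Y in the format string.

```python
from datetime import datetime

def check_date(text, date_format):
    """Return None if text parses under date_format, else an error message.

    Hypothetical helper; strptime rejects impossible dates such as
    February 30, so no manual day or month range checks are needed.
    """
    try:
        datetime.strptime(text, date_format)
    except ValueError:
        return 'Error: %s is not a valid date!' % text
    return None

print(check_date('12-1-2013', '%m-%d-%Y'))   # a real calendar date
print(check_date('2-30-2013', '%m-%d-%Y'))   # February has no 30th day
print(check_date('January', '%m-%d-%Y'))     # not in the expected format
```

The trade-off is that the error message no longer says whether the day or the month was at fault; the exception text from strptime could be included if that detail matters.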

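Edit:

I rewrote the program without using regular expressions. It should be more efficient. It appears to traverse the input only once, although it still reads some characters more than once through the startswith() method.

This version returns as soon as an error is detected, so it only reports the first error for each input.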

def match_string(input, pattern, valid_words, date_format):
    print '\n> match_string(\'%s\', \'%s\', %s, \'%s\')' % (input, pattern, valid_words, date_format)

    digits = '0123456789'
    inputIndex = 0
    patternIndex = 0

    while inputIndex < len(input) and patternIndex < len(pattern):
        if pattern[patternIndex] == '#':
            patternIndex += 1
            if pattern[patternIndex] == '#':
                patternIndex += 1
                if pattern[patternIndex] == '#':

                    # validate date
                    date_formatIndex = 0
                    while inputIndex < len(input) and date_formatIndex < len(date_format):

                        if input[inputIndex] == date_format[date_formatIndex]:
                            inputIndex += 1
                            date_formatIndex += 1

                        elif input[inputIndex] in digits:

                            startIndex = inputIndex
                            while inputIndex < len(input) and input[inputIndex] in digits:
                                inputIndex += 1
                            number = int(input[startIndex:inputIndex])

                            if date_format[date_formatIndex:].startswith('%y'):
                                placeholder = True
                            elif date_format[date_formatIndex:].startswith('%m'):
                                if number < 1 or 12 < number:
                                    return 'Error: expected a month between 1 and 12\n input   %d -> "...%s"\n pattern %d -> "...%s"\n date format %d -> "...%s"' % (startIndex, input[startIndex:], patternIndex - 2, pattern[patternIndex - 2:], date_formatIndex, date_format[date_formatIndex:])   

                            elif date_format[date_formatIndex:].startswith('%d'):
                                if number < 1 or 31 < number:
                                    return 'Error: expected a day between 1 and 31\n input   %d -> "...%s"\n pattern %d -> "...%s"\n date format %d -> "...%s"' % (startIndex, input[startIndex:], patternIndex - 2, pattern[patternIndex - 2:], date_formatIndex, date_format[date_formatIndex:])   

                            else:
                                return 'Error: input doesn\'t match date format\n input   %d -> "...%s"\n pattern %d -> "...%s"\n date format %d -> "...%s"' % (startIndex, input[startIndex:], patternIndex - 2, pattern[patternIndex - 2:], date_formatIndex, date_format[date_formatIndex:])   

                            date_formatIndex += 2

                        else:
                            return 'Error: input doesn\'t match date format\n input   %d -> "...%s"\n pattern %d -> "...%s"\n date format %d -> "...%s"' % (inputIndex, input[inputIndex:], patternIndex - 2, pattern[patternIndex - 2:], date_formatIndex, date_format[date_formatIndex:])   

                    patternIndex += 1

                else:
                    # validate word
                    valid = False
                    for word in valid_words:
                        if input[inputIndex:].startswith(word):
                            valid = True
                            inputIndex += len(word)
                            break
                    if not valid:
                        return 'Error: expected a valid word\n input   %d -> "...%s"\n pattern %d -> "...%s"' % (inputIndex, input[inputIndex:], patternIndex - 2, pattern[patternIndex - 2:])                    

            else:
                # validate number
                if not input[inputIndex] in digits:
                    return 'Error: expected a number\n input   %d -> "...%s"\n pattern %d -> "...%s"' % (inputIndex, input[inputIndex:], patternIndex - 1, pattern[patternIndex - 1:])
                while inputIndex < len(input) and input[inputIndex] in digits:
                    inputIndex += 1

        elif input[inputIndex] != pattern[patternIndex]:
            return 'Error: input and pattern do not match\n input   %d -> "...%s"\n pattern %d -> "...%s"' % (inputIndex, input[inputIndex:], patternIndex, pattern[patternIndex:])
        else:
            inputIndex += 1            
            patternIndex += 1
    return True

print match_string('1 and 2 are numbers foo and bar are strings 12-1-2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '%m-%d-%y')
print match_string('1 is a number foo is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
print match_string('foo is a number bar is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
print match_string('1 is a number bar is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
print match_string('1 is a number foo is a string January is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
print match_string('1 and 2 are numbers foo and bar are strings 15-1-2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '%m-%d-%y')
print match_string('1 and 2 are numbers foo and bar are strings 08-42-2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '%m-%d-%y')
print match_string('1 and 2 are numbers foo and bar are strings 08;4;2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '%m-%d-%y')
print match_string('1 and 2 are numbers foo and bar are strings 08-4-2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '~%m-%d-%y')
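The question also mentions custom error definitions, while both versions above return error strings. The following is one possible sketch of an exception-based variant, assuming Python 3. The names MatchError, NotANumber, InvalidWord, InvalidDate, and match_or_raise are all hypothetical, and the sketch assumes that each placeholder stands for a single token without whitespace and that the date format uses %Y for four-digit years.

```python
import re
from datetime import datetime

class MatchError(Exception):
    """Base class for placeholder errors (hypothetical name)."""

class NotANumber(MatchError):
    pass

class InvalidWord(MatchError):
    pass

class InvalidDate(MatchError):
    pass

def match_or_raise(text, pattern, valid_words, date_format):
    # Split into literal text and placeholders; the greedy '#{1,3}'
    # recognises '###' before '##' before '#'.
    parts = re.split(r'(#{1,3})', pattern)
    kinds = [p for p in parts if p.startswith('#')]
    # Assumption: every placeholder value is one whitespace-free token.
    # Literal text is escaped so it cannot act as regex syntax.
    regex = ''.join(r'(\S+)' if p.startswith('#') else re.escape(p)
                    for p in parts)
    m = re.fullmatch(regex, text)
    if m is None:
        return False  # the ordinary parts of the pattern do not match
    for kind, value in zip(kinds, m.groups()):
        if kind == '#' and not value.isdigit():
            raise NotANumber('number expected for %r' % value)
        if kind == '##' and value not in valid_words:
            raise InvalidWord('invalid word: %s' % value)
        if kind == '###':
            try:
                datetime.strptime(value, date_format)
            except ValueError:
                raise InvalidDate('invalid date format %s' % value)
    return True

pattern = '# is a number ## is a string ### is a date'
try:
    match_or_raise('foo is a number bar is a string 12-1-2013 is a date',
                   pattern, ['foo'], '%m-%d-%Y')
except MatchError as exc:
    print('Error:', exc)
```

Like the second version above, this stops at the first failing placeholder. Catching MatchError handles all three cases at once, while the subclasses let a caller tell them apart.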

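Now we know what you want and what you need. But what is your question? How should match_string be implemented?
You should at least include some example inputs and expected outputs for the different cases. Post your best attempt, even if you feel it is far off. Also include test cases with both positive and negative results (what your program should and should not do in each case). Once again: what specific problems do you face in implementing this method?
Awesome, @Akinakes! I implemented a variant of this. One small question: is sub() needed to replace the '#' with a regex? Could we just use replace()?
String operations are preferable to regex where possible, so replace() is better than sub() in this case. I did not think about it when I wrote the code. Note that implementing your own replace method could be faster still by manipulating the string directly, so that you traverse the string only once instead of several times as in my code.
@Neil I rewrote the program without using regular expressions for the comparison. The later version should be faster.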
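To illustrate the replace() suggestion from the comments, here is a minimal sketch, assuming Python 3. capture_only is a hypothetical helper name. The one subtlety is ordering: the longest placeholder has to be replaced first, otherwise replacing '#' would also rewrite the '#' characters inside '##' and '###'. Like the regex version in the answer, the sketch assumes the literal parts of the pattern contain no regex metacharacters.

```python
import re

def capture_only(pattern, placeholder):
    """Build a regex that captures only `placeholder` ('#', '##' or '###').

    Hypothetical helper using str.replace() instead of re.sub().
    """
    regex = pattern
    for token in ('###', '##', '#'):  # longest placeholder first
        group = '(.+?)' if token == placeholder else '(?:.+?)'
        regex = regex.replace(token, group)
    return regex

pattern = '# is a number ## is a string ### is a date'
text = '1 is a number foo is a string 12-1-2013 is a date'
for placeholder in ('#', '##', '###'):
    m = re.match(capture_only(pattern, placeholder), text)
    print(placeholder, m.groups())
```

Because the replacement groups contain no '#', the three replace() calls cannot interfere with each other once the order is right.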
Output:

> match_string('1 and 2 are numbers foo and bar are strings 12-1-2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '%m-%d-%y')
True

> match_string('1 is a number foo is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
True

> match_string('foo is a number bar is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
Error: expected a number
 input   0 -> "...foo is a number bar is a string 12-1-2013 is a date"
 pattern 0 -> "...# is a number ## is a string ### is a date"

> match_string('1 is a number bar is a string 12-1-2013 is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
Error: expected a valid word
 input   14 -> "...bar is a string 12-1-2013 is a date"
 pattern 14 -> "...## is a string ### is a date"

> match_string('1 is a number foo is a string January is a date', '# is a number ## is a string ### is a date', ['foo'], '%m-%d-%y')
Error: input doesn't match date format
 input   30 -> "...January is a date"
 pattern 29 -> "...### is a date"
 date format 0 -> "...%m-%d-%y"

> match_string('1 and 2 are numbers foo and bar are strings 15-1-2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '%m-%d-%y')
Error: expected a month between 1 and 12
 input   44 -> "...15-1-2013 is a date"
 pattern 42 -> "...### is a date"
 date format 0 -> "...%m-%d-%y"

> match_string('1 and 2 are numbers foo and bar are strings 08-42-2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '%m-%d-%y')
Error: expected a day between 1 and 31
 input   47 -> "...42-2013 is a date"
 pattern 42 -> "...### is a date"
 date format 3 -> "...%d-%y"

> match_string('1 and 2 are numbers foo and bar are strings 08;4;2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '%m-%d-%y')
Error: input doesn't match date format
 input   46 -> "...;4;2013 is a date"
 pattern 42 -> "...### is a date"
 date format 2 -> "...-%d-%y"

> match_string('1 and 2 are numbers foo and bar are strings 08-4-2013 is a date', '# and # are numbers ## and ## are strings ### is a date', ['foo', 'bar'], '~%m-%d-%y')
Error: input doesn't match date format
 input   44 -> "...08-4-2013 is a date"
 pattern 42 -> "...### is a date"
 date format 0 -> "...~%m-%d-%y"