Python: lexical analysis or a series of regular expressions for parsing unstructured text into structured form

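I'm trying to write some code that works like Google Calendar's quick add feature. You know, the one where you can type in any of the following:

1) 24 September 2010, John's birthday
2) John's birthday, 24 September 2010
3) 24 September 2010, birthday of John Doe
4) 24 September 2010: John's birthday
5) John's birthday 24 September 2010

It can then figure out that we want an event on 24 September 2010 and use the rest of the input as the event text.

I want to do this in Python.

I'm considering a design where I write regular expressions that match all of the cases listed above and extract the date. But I'm sure there is a smarter way to approach this, since I clearly have no training in lexical analysis or in the various styles of parser. I'm looking for a good way to tackle this problem.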

Note: the Python code here is a rough sketch of how a solution might look, not a finished or tested implementation.

Regular expressions are good at finding and extracting data from text when it is in a fixed format (for example, a DD/MM/YYYY date).
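As a minimal sketch of that idea, the snippet below pulls a DD/MM/YYYY date out of a line and treats whatever is left as the event text. The names `DATE_RE` and `extract_date` are made up for this illustration, and the single date format is an assumption; it is not meant to cover all five input forms.

```python
import re

# Assumed fixed format: DD/MM/YYYY with a 1-2 digit day and month and a 4-digit year.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def extract_date(text):
    """Return ((day, month, year), remaining_text), or (None, text) if no date is found."""
    match = DATE_RE.search(text)
    if match is None:
        return None, text
    day, month, year = (int(group) for group in match.groups())
    # Everything outside the matched span is treated as the event text;
    # leftover separators and spaces at either end are trimmed.
    rest = (text[:match.start()] + text[match.end():]).strip(" ,:")
    return (day, month, year), rest

print(extract_date("John's birthday, 24/09/2010"))
```

This works well as long as the date format really is fixed; once the wording around the date starts to vary, the pattern list tends to grow quickly.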

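A lexer/parser pair is good at processing data that is structured but somewhat variable in format. The lexer splits the text into tokens, which are units of information of a given type (number, string and so on). The parser takes that series of tokens and does something depending on the order in which they appear.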
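To make the split between the two roles concrete, here is a small sketch. The token types `NUMBER`, `WORD` and `PUNCT` and the functions `tokenize` and `describe` are hypothetical names chosen for this example only; the "parser" is deliberately trivial and simply reacts to where the number sits in the token stream.

```python
import re

# Hypothetical token types for illustration: NUMBER, WORD and PUNCT.
TOKEN_RE = re.compile(r"(?P<NUMBER>\d+)|(?P<WORD>[A-Za-z']+)|(?P<PUNCT>[,:])")

def tokenize(text):
    """Lexer step: turn text into (type, value) tokens.

    Characters that match no token type (such as whitespace) are skipped.
    """
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def describe(tokens):
    """Parser step: decide what to do based on the order of the token types."""
    kinds = [kind for kind, _ in tokens]
    if kinds[:1] == ["NUMBER"]:
        return "starts with a number"
    if kinds[-1:] == ["NUMBER"]:
        return "ends with a number"
    return "no leading or trailing number"

tokens = tokenize("John's birthday, 24")
print(tokens)
print(describe(tokens))
```

The lexer never needs to know what the tokens mean, and the parser never looks at raw characters, which is what keeps each half simple.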

Looking at your data, you have a basic (subject, verb, object) structure in varying order: (person, 'birthday', date).

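I would use a regular expression to handle each date, whether it is written purely in digits or with a month name, as a single token and return it with the type DATE. You can probably do the same for other date formats, using a map to convert month names and abbreviations such as "September" and "Sep" to 9.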
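A possible sketch of the month map and the date regex is shown below. The accepted formats (listed in the comment), the `MONTHS` table and the `to_date` helper are assumptions made for this illustration; a real input set may need more variants.

```python
import re
from datetime import date

# Map full month names and their three-letter abbreviations to month numbers.
MONTHS = {}
for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"], start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number

# Assumed formats: "24-9-2010", "24/9/2010", "24th Sep 2010", "24th of September 2010".
DATE_RE = re.compile(
    r"(\d{1,2})(?:st|nd|rd|th)?"                          # day, optional ordinal suffix
    r"(?:[-/](\d{1,2})[-/]|\s+(?:of\s+)?([A-Za-z]+)\s+)"  # numeric month or month name
    r"(\d{4})",                                           # four-digit year
    re.IGNORECASE)

def to_date(text):
    """Convert a date string in one of the assumed formats to a datetime.date."""
    match = DATE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised date: {text!r}")
    day, month_num, month_name, year = match.groups()
    if month_num is not None:
        month = int(month_num)
    else:
        month = MONTHS.get(month_name.lower())
        if month is None:
            raise ValueError(f"unknown month: {month_name!r}")
    return date(int(year), month, int(day))

print(to_date("24th of September 2010"))
```

Normalising every variant to one `date` value at the lexer level means the parser only ever has to deal with a single DATE token.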

You can then return everything else as strings (separated by whitespace).

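You then have:

1) DATE ',' STRING 'birthday'

2) STRING 'birthday' ',' DATE

3) DATE 'birthday' 'of' STRING

4) DATE ':' STRING 'birthday'

5) STRING 'birthday' DATE

Note: 'birthday', ',', ':' and 'of' are keywords here, so: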

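import re

class Lexer:
    DATE = 1
    STRING = 2
    COMMA = 3
    COLON = 4
    BIRTHDAY = 5
    OF = 6

    keywords = {'birthday': BIRTHDAY, 'of': OF, ',': COMMA, ':': COLON}

    # Assumed date formats: 24-9-2010, 24/9/10, 24th Sep 2010, 24th of September 2010
    date_re = re.compile(r"(\d{1,2}(?:st|nd|rd|th)?"
                         r"(?:[-/]\d{1,2}[-/]|\s+(?:of\s+)?[A-Za-z]+\s+)\d{2,4})\s*")
    # A separator, or a run of characters up to the next space or separator
    word_re = re.compile(r"([,:]|[^\s,:]+)\s*")

    text, pos, saved = '', 0, None  # input line, read position, pushed-back token

    @classmethod
    def feed(cls, text):
        """Load a new input line and reset the lexer state."""
        cls.text, cls.pos, cls.saved = text.strip(), 0, None

    @classmethod
    def next_token(cls):
        """Return the next (type, value) pair, or (None, None) at end of input."""
        if cls.saved is not None:  # hand back the token stored by keep()
            token, cls.saved = cls.saved, None
            return token
        if cls.pos >= len(cls.text):
            return None, None
        match = cls.date_re.match(cls.text, cls.pos)  # dates take priority over words
        kind = cls.DATE
        if match is None:
            match = cls.word_re.match(cls.text, cls.pos)
            kind = cls.keywords.get(match.group(1).lower(), cls.STRING)
        cls.pos = match.end()  # also skips the whitespace after the token
        return kind, match.group(1)

    @classmethod
    def keep(cls, type, value):
        """Push one token back so that the next next_token() call returns it."""
        cls.saved = (type, value)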

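All forms except 3 use the person's name in possessive form (ending in 's, or in a bare s when the apostrophe is left out). This can be tricky, because "Alexis" could be read as "Alexi" plus an s, but since you restrict where these forms may appear, it is easy to detect: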

def parseNameInPluralForm():
    name = parseName()
    # Strip the possessive ending: "John's" -> "John", "Johns" -> "John"
    if name.endswith("'s"):
        name = name[:-2]
    elif name.endswith("s"):
        name = name[:-1]
    return name

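Now, a name can be either first-name or first-name last-name (yes, I know Japan swaps these around, but from a processing perspective the problem above does not need to tell first names and last names apart). The following handles both forms: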

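class ParseError(Exception):
    """Raised when the input matches none of the supported forms."""

def parseName():
    kind, firstName = Lexer.next_token()
    if kind != Lexer.STRING: raise ParseError()
    kind, lastName = Lexer.next_token()
    if kind == Lexer.STRING:  # first-name last-name
        return firstName + ' ' + lastName
    else:  # not part of the name: push the token back for the caller
        Lexer.keep(kind, lastName)
        return firstName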

Finally, you can handle forms 1-5 with something like this:

def parseBirthday():
    kind, data = Lexer.next_token()
    if kind == Lexer.DATE:  # 1, 3 & 4
        date = data
        kind, data = Lexer.next_token()
        if kind == Lexer.COLON or kind == Lexer.COMMA:  # separator after the date
            kind, data = Lexer.next_token()
        if kind == Lexer.BIRTHDAY:  # 3: 'birthday' 'of' name
            kind, data = Lexer.next_token()
            if kind != Lexer.OF: raise ParseError()
            person = parseName()
        else:  # 1 & 4: possessive name followed by 'birthday'
            Lexer.keep(kind, data)
            person = parseNameInPluralForm()
            kind, data = Lexer.next_token()
            if kind != Lexer.BIRTHDAY: raise ParseError()
    elif kind == Lexer.STRING:  # 2 & 5
        Lexer.keep(kind, data)
        person = parseNameInPluralForm()
        kind, data = Lexer.next_token()
        if kind != Lexer.BIRTHDAY: raise ParseError()
        kind, data = Lexer.next_token()
        if kind == Lexer.COMMA:  # 2
            kind, data = Lexer.next_token()
        if kind != Lexer.DATE: raise ParseError()
        date = data
    else:
        raise ParseError()
    return person, date