
Python: how do I rewrite a simple tokenizer to use regular expressions?


This is an optimized version of the tokenizer as first written, and it works reasonably well. A secondary tokenizer can parse this function's output to create classified tokens with greater specificity.

def tokenize(source):
    return (token for token in (token.strip() for line
            in source.replace('\r\n', '\n').replace('\r', '\n').split('\n')
            for token in line.split('#', 1)[0].split(';')) if token)
My question is: how can this be written simply using the re module? Below is my unsuccessful attempt:

import re

def tokenize2(string):
    search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)
    for match in search.finditer(string):
        for item in match.groups():
            yield item
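Part of what goes wrong here is worth demonstrating: in the re module, a repeated capture group retains only the text of its final repetition, so any middle statements are dropped. A quick check with the pattern above:

```python
import re

# The attempted pattern: a lazy repeated group for ';'-separated statements.
pattern = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)

# The repeated group (?:;(.+?))*? keeps only its last repetition,
# so the middle statement 'b' is lost entirely.
print(pattern.match('a;b;c').groups())  # ('a', 'c')

# A line with a trailing comment leaves the second group as None.
print(pattern.match('a#b').groups())    # ('a', None)
```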
Edit: this is the kind of output I am looking for from the tokenizer. Parsing that text should be easy.

>>> def tokenize(source):
    return (token for token in (token.strip() for line
            in source.replace('\r\n', '\n').replace('\r', '\n').split('\n')
            for token in line.split('#', 1)[0].split(';')) if token)

>>> for token in tokenize('''\
a = 1 + 2; b = a - 3 # create zero in b
c = b * 4; d = 5 / c # trigger div error

e = (6 + 7) * 8
# try a boolean operation
f = 0 and 1 or 2
a; b; c; e; f'''):
    print(repr(token))


'a = 1 + 2'
'b = a - 3 '
'c = b * 4'
'd = 5 / c '
'e = (6 + 7) * 8'
'f = 0 and 1 or 2'
'a'
'b'
'c'
'e'
'f'
>>> 

I may be way off here -

>>> def tokenize(source):
...     search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)
...     return (token.strip() for line in source.split('\n') if search.match(line)
...                   for token in line.split('#', 1)[0].split(';') if token)
... 
>>> 
>>> 
>>> for token in tokenize('''\
... a = 1 + 2; b = a - 3 # create zero in b
... c = b * 4; d = 5 / c # trigger div error
... 
... e = (6 + 7) * 8
... # try a boolean operation
... f = 0 and 1 or 2
... a; b; c; e; f'''):
...     print(repr(token))
... 
'a = 1 + 2'
'b = a - 3'
'c = b * 4'
'd = 5 / c'
'e = (6 + 7) * 8'
'f = 0 and 1 or 2'
'a'
'b'
'c'
'e'
'f'
>>> 

Where applicable, I would move the re.compile out of the def's scope.
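A sketch of that tweak, using the same generator as the answer above with the pattern compiled once at module level:

```python
import re

# Compiled once at import time rather than on every call.
SEARCH = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)

def tokenize(source):
    # The regex only filters out non-matching (e.g. empty) lines;
    # the splitting on '#' and ';' does the real work.
    return (token.strip() for line in source.split('\n') if SEARCH.match(line)
            for token in line.split('#', 1)[0].split(';') if token)
```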

Here is one based on your tokenize2 function:

import re

def tokenize2(source):
    search = re.compile(r'([^;#\n]+)[;\n]?(?:#.+)?', re.MULTILINE)
    for match in search.finditer(source):
        for item in match.groups():
            yield item

>>> for token in tokenize2('''\
... a = 1 + 2; b = a - 3 # create zero in b
... c = b * 4; d = 5 / c # trigger div error
... 
... e = (6 + 7) * 8
... # try a boolean operation
... f = 0 and 1 or 2
... a; b; c; e; f'''):
...     print(repr(token))
... 
'a = 1 + 2'
' b = a - 3 '
'c = b * 4'
' d = 5 / c '
'e = (6 + 7) * 8'
'f = 0 and 1 or 2'
'a'
' b'
' c'
' e'
' f'
>>> 
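The stray whitespace in those tokens comes from capturing text around ';' without trimming it. A small variation (my own, not from the original post) that strips each capture so the output matches the first tokenizer exactly:

```python
import re

# Same pattern as tokenize2, but each captured token is stripped
# and whitespace-only captures are discarded.
PATTERN = re.compile(r'([^;#\n]+)[;\n]?(?:#.+)?')

def tokenize2_stripped(source):
    for match in PATTERN.finditer(source):
        token = match.group(1).strip()
        if token:
            yield token
```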

Would it be feasible to apply the regex match in an if clause at the end of the generator comprehension?

No; one of the problems is that input like a;b;c returns only ('a', 'c'), while a#b returns ('a', None).

Thanks! I was hoping to do all of the tokenizing in a single regular expression, but the code runs well enough. Anyone is still welcome to write a lambda source: re.finditer(PATTERN, source, FLAGS), defining the pattern and the flags; it would be a good learning experience. Should you .strip() the return values?
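For the record, a sketch of the lambda-style one-liner that the last comment suggests. The pattern and flags here are my own guesses, not from the post: comments are consumed by a non-capturing alternative so they never produce tokens, and captures are stripped afterwards.

```python
import re

# Either consume a comment (no capture), or capture a run of
# statement characters; strip and drop empty results afterwards.
PATTERN = r'#[^\n]*|([^;#\n]+)'
FLAGS = 0

tokenize3 = lambda source: (m.group(1).strip()
                            for m in re.finditer(PATTERN, source, FLAGS)
                            if m.group(1) and m.group(1).strip())
```

This also handles a comment-only line that follows a blank line, a case where the answer's single pattern can leak the comment text into a token.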