Extracting the source text of a Python function from a source code string
Suppose I have valid Python source code, as a string:
code_string = """
# A comment.
def foo(a, b):
    return a + b
class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]
""".strip()
Goal: I want to get the lines containing the source text of a function definition, whitespace preserved. For the code string above, I want to get the string

def foo(a, b):
    return a + b

and

    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

Or, equivalently, I would be happy to get the functions' line numbers within the code string: foo spans lines 2-3, while __init__ spans lines 5-9.
Attempts

I can parse the code string into its AST:

import ast

code_ast = ast.parse(code_string)

and I can find the FunctionDef nodes, e.g.:

function_def_nodes = [node for node in ast.walk(code_ast)
                      if isinstance(node, ast.FunctionDef)]

Each FunctionDef node's lineno attribute tells us that function's first line. We can estimate the function's last line with:

last_line = max(node.lineno for node in ast.walk(function_def_node)
                if hasattr(node, 'lineno'))
However, this doesn't work well when a function ends with syntactic elements that don't show up as AST nodes, such as the final ] in __init__.

I doubt there is a way to do this using only the AST, since in cases like __init__ the AST simply doesn't have enough information.

I can't use the inspect module, since it only works on "live objects", and I only have the Python code as a string. I can't eval the code, since that would be a huge security problem.

In theory I could write a parser for Python, but that really seems like overkill.
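As an aside: on Python 3.8 and newer, the AST alone is actually sufficient, because nodes carry an end_lineno attribute and ast.get_source_segment can slice the function's text out of the original string (trailing syntax such as the final ] in __init__ is included). A minimal sketch, separate from the attempts above:

```python
import ast

code_string = """
# A comment.
def foo(a, b):
    return a + b
class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]
""".strip()

tree = ast.parse(code_string)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        # end_lineno covers trailing syntax such as the closing ']'.
        print(node.name, node.lineno, node.end_lineno)
        # The exact source text of the function:
        print(ast.get_source_segment(code_string, node))
```

This reports foo spanning lines 2-3 and __init__ spanning lines 5-9, matching the goal above.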
One heuristic suggested in the comments is to use the lines' leading whitespace. However, that can break for weird-but-valid functions with strange indentation, such as:
def baz():
 return [
1,
    ]

class Baz(object):
  def hello(self, x):
    return self.hello(
        x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass
A more robust solution is to use the tokenize module. The following code handles strange indentation, comments, multi-line tokens, one-line function blocks, and empty lines within function blocks:
import tokenize
from io import BytesIO
from collections import deque

code_string = """
# A comment.
def foo(a, b):
    return a + b

class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',

        ]

    def test(self): pass
    def abc(self):
        '''multi-
        line token'''

def baz():
 return [
  1,
    ]

class Baz(object):
  def hello(self, x):
    a = \
1
    return self.hello(
        x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    pass
# unmatched parenthesis: (
""".strip()

file = BytesIO(code_string.encode())
tokens = deque(tokenize.tokenize(file.readline))
lines = []
while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
        start_line, _ = token.start
        last_token = token

        # Read the rest of the def header (and a possible one-line body).
        while tokens:
            token = tokens.popleft()
            if token.type == tokenize.NEWLINE:
                break
            last_token = token

        if last_token.type == tokenize.OP and last_token.string == ':':
            # Not a one-line function: consume the indented block.
            indents = 0
            while tokens:
                token = tokens.popleft()
                if token.type == tokenize.NL:
                    continue
                if token.type == tokenize.INDENT:
                    indents += 1
                elif token.type == tokenize.DEDENT:
                    indents -= 1
                    if not indents:
                        break
                else:
                    last_token = token

        lines.append((start_line, last_token.end[0]))
print(lines)
This produces:
[(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]
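Given 1-based, inclusive (start, end) line pairs like these, recovering the actual source text is then just a slice over the split lines. A small sketch (the code_string and pair here are a reduced example, not the full test input above):

```python
# Recover function source text from 1-based (start, end) line pairs.
code_string = "# A comment.\ndef foo(a, b):\n    return a + b"
pairs = [(2, 3)]  # as produced by the tokenize-based scan

source_lines = code_string.splitlines()
snippets = ['\n'.join(source_lines[start - 1:end]) for start, end in pairs]
print(snippets[0])
```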
Note, however, that the continuation line:

a = \
1

is treated by tokenize as one line, even though it is really two lines, as you can see if you print the tokens:
TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(24, 21), end=(24, 22), line='  def hello(self, x):\n')
TokenInfo(type=5 (INDENT), string='    ', start=(25, 0), end=(25, 4), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line='    a = 1\n')
TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line='    a = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line='    a = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(25, 9), end=(25, 10), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line='    return self.hello(\n')
You can see that the continuation really is treated as one line, 'a = 1\n', with a single line number, 25. Evidently this is a flaw/limitation of the tokenize module.

I thought a small parser was in order, to try to account for strange exceptions like these:
import re

code_string = """
# A comment.
def foo(a, b):
    return a + b

class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

def baz():
 return [
1,
    ]

class Baz(object):
  def hello(self, x):
    return self.hello(
        x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass

def test_multiline():
    \"""
    asdasdada
    sdadd
    \"""
    pass

def test_comment(
        a #)
):
    return [a,
            # ]
            a]

def test_escaped_endline():
    return "asdad \
asdsad \
asdas"

def test_nested():
    return {():[[],
        {
        }
      ]
    }

def test_strings():
    return '\""" asdasd' + \"""
12asd
12312
"asd2" [
\"""
    \"""
def test_fake_def_in_multiline()
    \"""
print(123)
a = "def in_string():"

def after():
    print("NOPE")
\"""Phew this ain't valid syntax\""" def something(): pass
""".strip()
code_string += '\n'  # end with a blank line so the last function gets flushed

func_list = []
func = ''
tab = ''
brackets = {'(': 0, '[': 0, '{': 0}
close = {')': '(', ']': '[', '}': '{'}
string = ''
tab_f = ''
c1 = ''
c2 = ''
multiline = False
check = False
for line in code_string.split('\n'):
    tab = re.findall(r'^\s*', line)[0]
    if re.findall(r'^\s*def', line) and not string and not multiline:
        func += line + '\n'
        tab_f = tab
        check = True
    if func:
        if not check:
            if sum(brackets.values()) == 0 and not string and not multiline:
                if len(tab) <= len(tab_f):
                    func_list.append(func)
                    func = ''
                    c1 = ''
                    c2 = ''
                    continue
            func += line + '\n'
    check = False
    for c0 in line:
        if c0 == '#' and not string and not multiline:
            break
        if c1 != '\\':
            if c0 in ['"', "'"]:
                if c2 == c1 == c0 == '"' and string != "'":
                    multiline = not multiline
                    string = ''
                    continue
                if not multiline:
                    if c0 in string:
                        string = ''
                    else:
                        if not string:
                            string = c0
        if not string and not multiline:
            if c0 in brackets:
                brackets[c0] += 1
            if c0 in close:
                b = close[c0]
                brackets[b] -= 1
        c2 = c1
        c1 = c0

for f in func_list:
    print('-' * 40)
    print(f)
Rather than reinventing a parser, I would use Python itself. Basically, I would use the compile() built-in, which can check whether a string is valid Python code by compiling it. I pass it a string composed of selected lines, going from each def down to the farthest line that does not fail to compile:
code_string = """
#A comment
def foo(a, b):
    return a + b

def bir(a, b):
    c = a + b
    return c

class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

def baz():
    return [
        1,
    ]
""".strip()
lines = code_string.split('\n')

# Look for lines with the 'def' keyword.
defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

# Get the indentation of each 'def'.
indents = {}
for i in defidxs:
    ll = lines[i].split('def')
    indents[i] = len(ll[0])

# Extract the strings.
end = len(lines) - 1
while end > 0:
    if end < defidxs[-1]:
        defidxs.pop()
    try:
        start = defidxs[-1]
    except IndexError:  # Break if there are no more 'def's.
        break
    # Empty lines between functions will cause an error; skip them.
    if len(lines[end].strip()) == 0:
        end = end - 1
        continue
    try:
        # Remove the indentation, or compile will fail.
        fixlines = [ll[indents[start]:] for ll in lines[start:end + 1]]
        body = '\n'.join(fixlines)
        compile(body, '<string>', 'exec')  # If it fails, it throws an exception.
        print(body)
        end = start  # No need to try fewer lines if it succeeded.
    except SyntaxError:
        pass
    end = end - 1
Note that the functions are printed in the reverse order of the one in which they appear in code_string.
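The core idea, using compile() as a validity oracle, can be seen in isolation (a small sketch, not part of the answer's code):

```python
# compile() succeeds on a complete definition and raises SyntaxError
# on a truncated one, which is what the loop above exploits.
complete = "def foo(a, b):\n    return a + b"
truncated = "def foo(a, b):\n    return a +"

compile(complete, '<string>', 'exec')  # no exception: valid code

try:
    compile(truncated, '<string>', 'exec')
    is_valid = True
except SyntaxError:
    is_valid = False
print(is_valid)  # False
```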
This even handles strangely indented code, but I think it will fail if you have nested functions.

I guess you could just iterate over the lines, and when a line matches ^(\s*)def\s.*$, extract the matched group (the leading whitespace) and take that line plus all following lines.

You mean, take all following lines whose leading whitespace is greater than that of the def? Otherwise you would also capture any following functions defined at the same indentation level.

Oops, yes. Anyway, you get the idea.

Hmm, it doesn't work if there is strange indentation inside the function, e.g. def baz():\n  return [\n1,\n   ]

Ah, I didn't even know that was valid Python. It looks like there is no easy text-processing approach, then.

This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.

Oops, there actually wasn't any logic to handle weird indentation. Added now.

Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case with no indents) would probably be more robust.

Using INDENT and DEDENT was indeed my first thought as well, although it wasn't obvious to me how to easily handle the single-logical-line case. I have now rewritten the code so that it uses INDENT and DEDENT, but noticed that tokenize treats a continuation line as one line even though it really spans multiple lines, so the line numbers returned by tokenize will be off in that case. Evidently this is a flaw/limitation of the tokenize module. Writing parsers is hard.

I haven't run your code, but just glancing at it, I think it fails for multi-line strings (delimited by \""") and for escaped string delimiters, and it doesn't understand comments (which may contain unbalanced brackets or string delimiters).

Please try it. I should include the case where brackets appear inside strings; open and close brackets should not be counted there. Edit: escaped delimiters are an exception, I will include that.

You aren't checking for comments, so you can't tell whether a closing bracket should be counted (it shouldn't be if it's inside a comment).

Escaped characters and comments are now included. Sorry, I do tend to write parsers by just starting with something simple and adding things as I find exceptions; not the best practice, I realize.
For the bracket-counting parser above, this prints:

----------------------------------------
def foo(a, b):
    return a + b

----------------------------------------
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

----------------------------------------
def baz():
 return [
1,
    ]

----------------------------------------
  def hello(self, x):
    return self.hello(
        x - 1)

----------------------------------------
def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass

----------------------------------------
def test_multiline():
    """
    asdasdada
    sdadd
    """
    pass

----------------------------------------
def test_comment(
        a #)
):
    return [a,
            # ]
            a]

----------------------------------------
def test_escaped_endline():
    return "asdad asdsad asdas"

----------------------------------------
def test_nested():
    return {():[[],
        {
        }
      ]
    }

----------------------------------------
def test_strings():
    return '""" asdasd' + """
12asd
12312
"asd2" [
"""

----------------------------------------
def after():
    print("NOPE")
The compile-based approach above prints:
def baz():
    return [
        1,
    ]
def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
def bir(a, b):
    c = a + b
    return c
def foo(a, b):
    return a + b