Extracting the source text of a Python function from a source code string
Suppose I have valid Python source code, as a string:
code_string = """
# A comment.
def foo(a, b):
    return a + b
class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]
""".strip()
Goal: I want to get the lines containing the source text of a function definition, whitespace preserved. For the code string above, I want to get the string

def foo(a, b):
    return a + b

and

    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

Or, equivalently, I would be happy to get the functions' line numbers within the code string: foo spans lines 2-3, while __init__ spans lines 5-9.
Attempts

I can parse the code string into its AST:

import ast

code_ast = ast.parse(code_string)

and I can find the FunctionDef nodes, e.g.:

function_def_nodes = [node for node in ast.walk(code_ast)
                      if isinstance(node, ast.FunctionDef)]

Each FunctionDef node's lineno attribute tells us that function's first line. We can estimate the function's last line with:

last_line = max(node.lineno for node in ast.walk(function_def_node)
                if hasattr(node, 'lineno'))
However, this doesn't work well when a function ends with syntactic elements that don't show up as AST nodes, such as the final ] in __init__.

I doubt there is a way to do this using only the AST, since in cases like __init__ the AST simply doesn't have enough information.

I can't use the inspect module, since it only works on "live objects", and I only have the Python code as a string. I can't eval the code, since that would be a huge security problem.

In theory I could write a parser for Python, but that really seems like overkill.
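As an aside: on Python 3.8 and newer, the AST alone is actually sufficient, because nodes carry an end_lineno attribute and ast.get_source_segment can slice the function's text out of the original string (trailing syntax such as the final ] in __init__ is included). A minimal sketch, separate from the attempts above:

```python
import ast

code_string = """
# A comment.
def foo(a, b):
    return a + b
class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]
""".strip()

tree = ast.parse(code_string)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        # end_lineno covers trailing syntax such as the closing ']'.
        print(node.name, node.lineno, node.end_lineno)
        # The exact source text of the function:
        print(ast.get_source_segment(code_string, node))
```

This reports foo spanning lines 2-3 and __init__ spanning lines 5-9, matching the goal above.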
One heuristic suggested in the comments is to use the lines' leading whitespace. However, that can break for weird-but-valid functions with strange indentation, such as:
def baz():
 return [
1,
    ]

class Baz(object):
  def hello(self, x):
    return self.hello(
        x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass
A more robust solution is to use the tokenize module. The following code handles strange indentation, comments, multi-line tokens, one-line function blocks, and empty lines within function blocks:
import tokenize
from io import BytesIO
from collections import deque

code_string = """
# A comment.
def foo(a, b):
    return a + b

class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',

        ]

    def test(self): pass
    def abc(self):
        '''multi-
        line token'''

def baz():
 return [
  1,
    ]

class Baz(object):
  def hello(self, x):
    a = \
1
    return self.hello(
        x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    pass
# unmatched parenthesis: (
""".strip()

file = BytesIO(code_string.encode())
tokens = deque(tokenize.tokenize(file.readline))
lines = []
while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
        start_line, _ = token.start
        last_token = token

        # Read the rest of the def header (and a possible one-line body).
        while tokens:
            token = tokens.popleft()
            if token.type == tokenize.NEWLINE:
                break
            last_token = token

        if last_token.type == tokenize.OP and last_token.string == ':':
            # Not a one-line function: consume the indented block.
            indents = 0
            while tokens:
                token = tokens.popleft()
                if token.type == tokenize.NL:
                    continue
                if token.type == tokenize.INDENT:
                    indents += 1
                elif token.type == tokenize.DEDENT:
                    indents -= 1
                    if not indents:
                        break
                else:
                    last_token = token

        lines.append((start_line, last_token.end[0]))
print(lines)
This produces:
[(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]
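Given 1-based, inclusive (start, end) line pairs like these, recovering the actual source text is then just a slice over the split lines. A small sketch (the code_string and pair here are a reduced example, not the full test input above):

```python
# Recover function source text from 1-based (start, end) line pairs.
code_string = "# A comment.\ndef foo(a, b):\n    return a + b"
pairs = [(2, 3)]  # as produced by the tokenize-based scan

source_lines = code_string.splitlines()
snippets = ['\n'.join(source_lines[start - 1:end]) for start, end in pairs]
print(snippets[0])
```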
Note, however, that the continuation line:

a = \
1

is treated by tokenize as one line, even though it is really two lines, as you can see if you print the tokens:
TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(24, 21), end=(24, 22), line='  def hello(self, x):\n')
TokenInfo(type=5 (INDENT), string='    ', start=(25, 0), end=(25, 4), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line='    a = 1\n')
TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line='    a = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line='    a = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(25, 9), end=(25, 10), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line='    return self.hello(\n')
You can see that the continuation really is treated as one line, 'a = 1\n', with a single line number, 25. Evidently this is a flaw/limitation of the tokenize module.

I thought a small parser was in order, to try to account for strange exceptions like these:
import re

code_string = """
# A comment.
def foo(a, b):
    return a + b

class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

def baz():
 return [
1,
    ]

class Baz(object):
  def hello(self, x):
    return self.hello(
        x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass

def test_multiline():
    \"""
    asdasdada
    sdadd
    \"""
    pass

def test_comment(
        a #)
):
    return [a,
            # ]
            a]

def test_escaped_endline():
    return "asdad \
asdsad \
asdas"

def test_nested():
    return {():[[],
        {
        }
      ]
    }

def test_strings():
    return '\""" asdasd' + \"""
12asd
12312
"asd2" [
\"""
    \"""
def test_fake_def_in_multiline()
    \"""
print(123)
a = "def in_string():"

def after():
    print("NOPE")
\"""Phew this ain't valid syntax\""" def something(): pass
""".strip()
code_string += '\n'  # end with a blank line so the last function gets flushed

func_list = []
func = ''
tab = ''
brackets = {'(': 0, '[': 0, '{': 0}
close = {')': '(', ']': '[', '}': '{'}
string = ''
tab_f = ''
c1 = ''
c2 = ''
multiline = False
check = False
for line in code_string.split('\n'):
    tab = re.findall(r'^\s*', line)[0]
    if re.findall(r'^\s*def', line) and not string and not multiline:
        func += line + '\n'
        tab_f = tab
        check = True
    if func:
        if not check:
            if sum(brackets.values()) == 0 and not string and not multiline:
                if len(tab) <= len(tab_f):
                    func_list.append(func)
                    func = ''
                    c1 = ''
                    c2 = ''
                    continue
            func += line + '\n'
    check = False
    for c0 in line:
        if c0 == '#' and not string and not multiline:
            break
        if c1 != '\\':
            if c0 in ['"', "'"]:
                if c2 == c1 == c0 == '"' and string != "'":
                    multiline = not multiline
                    string = ''
                    continue
                if not multiline:
                    if c0 in string:
                        string = ''
                    else:
                        if not string:
                            string = c0
        if not string and not multiline:
            if c0 in brackets:
                brackets[c0] += 1
            if c0 in close:
                b = close[c0]
                brackets[b] -= 1
        c2 = c1
        c1 = c0

for f in func_list:
    print('-' * 40)
    print(f)
Rather than reinventing a parser, I would use Python itself. Basically, I would use the compile() built-in, which can check whether a string is valid Python code by compiling it. I pass it a string composed of selected lines, going from each def down to the farthest line that does not fail to compile:
code_string = """
#A comment
def foo(a, b):
    return a + b

def bir(a, b):
    c = a + b
    return c

class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

def baz():
    return [
        1,
    ]
""".strip()
lines = code_string.split('\n')

# Look for lines with the 'def' keyword.
defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

# Get the indentation of each 'def'.
indents = {}
for i in defidxs:
    ll = lines[i].split('def')
    indents[i] = len(ll[0])

# Extract the strings.
end = len(lines) - 1
while end > 0:
    if end < defidxs[-1]:
        defidxs.pop()
    try:
        start = defidxs[-1]
    except IndexError:  # Break if there are no more 'def's.
        break
    # Empty lines between functions will cause an error; skip them.
    if len(lines[end].strip()) == 0:
        end = end - 1
        continue
    try:
        # Remove the indentation, or compile will fail.
        fixlines = [ll[indents[start]:] for ll in lines[start:end + 1]]
        body = '\n'.join(fixlines)
        compile(body, '<string>', 'exec')  # If it fails, it throws an exception.
        print(body)
        end = start  # No need to try fewer lines if it succeeded.
    except SyntaxError:
        pass
    end = end - 1
Note that the functions are printed in the reverse order of the one in which they appear in code_string.
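The core idea, using compile() as a validity oracle, can be seen in isolation (a small sketch, not part of the answer's code):

```python
# compile() succeeds on a complete definition and raises SyntaxError
# on a truncated one, which is what the loop above exploits.
complete = "def foo(a, b):\n    return a + b"
truncated = "def foo(a, b):\n    return a +"

compile(complete, '<string>', 'exec')  # no exception: valid code

try:
    compile(truncated, '<string>', 'exec')
    is_valid = True
except SyntaxError:
    is_valid = False
print(is_valid)  # False
```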
This even handles strangely indented code, but I think it will fail if you have nested functions.

I guess you could just iterate over the lines, and when a line matches ^(\s*)def\s.*$, extract the matched group (the leading whitespace) and take that line plus all following lines.

You mean, take all following lines whose leading whitespace is greater than that of the def? Otherwise you would also capture any following functions defined at the same indentation level.

Oops, yes. Anyway, you get the idea.

Hmm, it doesn't work if there is strange indentation inside the function, e.g. def baz():\n  return [\n1,\n   ]

Ah, I didn't even know that was valid Python. It looks like there is no easy text-processing approach, then.

This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.

Oops, there actually wasn't any logic to handle weird indentation. Added now.

Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case with no indents) would probably be more robust.

Using INDENT and DEDENT was indeed my first thought as well, although it wasn't obvious to me how to easily handle the single-logical-line case. I have now rewritten the code so that it uses INDENT and DEDENT, but noticed that tokenize treats a continuation line as one line even though it really spans multiple lines, so the line numbers returned by tokenize will be off in that case. Evidently this is a flaw/limitation of the tokenize module. Writing parsers is hard.

I haven't run your code, but just glancing at it, I think it fails for multi-line strings (delimited by \""") and for escaped string delimiters, and it doesn't understand comments (which may contain unbalanced brackets or string delimiters).

Please try it. I should include the case where brackets appear inside strings; open and close brackets should not be counted there. Edit: escaped delimiters are an exception, I will include that.

You aren't checking for comments, so you can't tell whether a closing bracket should be counted (it shouldn't be if it's inside a comment).

Escaped characters and comments are now included. Sorry, I do tend to write parsers by just starting with something simple and adding things as I find exceptions; not the best practice, I realize.
For the bracket-counting parser above, this prints:

----------------------------------------
def foo(a, b):
    return a + b

----------------------------------------
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

----------------------------------------
def baz():
 return [
1,
    ]

----------------------------------------
  def hello(self, x):
    return self.hello(
        x - 1)

----------------------------------------
def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass

----------------------------------------
def test_multiline():
    """
    asdasdada
    sdadd
    """
    pass

----------------------------------------
def test_comment(
        a #)
):
    return [a,
            # ]
            a]

----------------------------------------
def test_escaped_endline():
    return "asdad asdsad asdas"

----------------------------------------
def test_nested():
    return {():[[],
        {
        }
      ]
    }

----------------------------------------
def test_strings():
    return '""" asdasd' + """
12asd
12312
"asd2" [
"""

----------------------------------------
def after():
    print("NOPE")
The compile-based approach above prints:
def baz():
    return [
        1,
    ]
def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
def bir(a, b):
    c = a + b
    return c
def foo(a, b):
    return a + b