Python regex for comments in a long string

python regex

I am trying to design a good regular expression for Python comments that are located within a long string. So far I have

Regex:

#(.?|\n)*

String:

'### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

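I feel there is a better way to get all of the individual comments out of the string, but I am no expert in regular expressions. Does anyone have a better solution?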

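Regex will work fine if you do two things:

1.  Remove all string literals (since they can contain # characters).

2.  Capture everything that starts with a # character and goes on to the end of the line.

Below is a demonstration: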

>>> from re import findall, sub
>>> string = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['### this is a comment', '# this call outputs an xml stream of the current parameter dictionary.', '# wow another comment']
>>>

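The sub(...) call removes anything of the form '...' or "...". This means you do not have to worry about comments inside string literals.

(?s) sets the DOTALL flag, which allows . to match newlines.

Finally, #.* captures everything that starts with a # character and goes on to the end of the line.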
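The two steps can be wrapped in a small helper. This is only a sketch in Python 3 syntax: the names STRING_LITERAL, COMMENT and find_comments are made up for illustration and do not appear in the answer, and re.DOTALL is used in place of the inline (?s) flag.

```python
import re

# Step 1 pattern: a single- or double-quoted literal. The lazy .*? stops at
# the first closing quote, and re.DOTALL lets '.' cross line breaks.
STRING_LITERAL = re.compile("|".join([r"'.*?'", r'".*?"']), re.DOTALL)

# Step 2 pattern: a '#' and everything after it up to the end of the line.
COMMENT = re.compile(r"#.*")


def find_comments(source):
    """Strip quoted literals first, then collect what looks like a comment."""
    without_strings = STRING_LITERAL.sub("", source)
    return COMMENT.findall(without_strings)


sample = (
    "# Comment 1\n"
    "for i in range(10): # Comment 2\n"
    "    print('#foo')\n"
    "    print(\"abc#bar\")\n"
)
print(find_comments(sample))  # ['# Comment 1', '# Comment 2']
```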

For a more complete test, put this sample code in a file named test.py:
# Comment 1
for i in range(10): # Comment 2
    print('#foo')
    print("abc#bar")
    print("""
#hello
abcde#foo
""")  # Comment 3
    print('''#foo
    #foo''')  # Comment 4

The above solution still works:
>>> from re import findall, sub
>>> string = open('test.py').read()
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['# Comment 1', '# Comment 2', '# Comment 3', '# Comment 4']
>>>
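One case that may be worth checking before relying on this is a quote character inside a comment. The sketch below reuses the same two patterns in Python 3 syntax; find_comments and the tricky input are made up for illustration. Because the first pass cannot tell an apostrophe in a comment from the start of a string literal, it removes the text between the two apostrophes.

```python
import re


def find_comments(source):
    # Same two steps as the demo: drop quoted spans, then grab '#...' runs.
    stripped = re.sub('(?s)\'.*?\'|".*?"', '', source)
    return re.findall("#.*", stripped)


# Two comments, each containing an apostrophe.
tricky = "x = 1  # don't do this\ny = 2  # it's fine\n"

# Everything between the two apostrophes is treated as a string literal and
# removed, so the two comments collapse into one mangled result.
print(find_comments(tricky))  # ['# dons fine']
```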

Since this is Python code in a string, I would use the tokenize module to parse it and extract the comments:
import tokenize
import StringIO

text = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something():\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

tokens = tokenize.generate_tokens(StringIO.StringIO(text).readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokens:
    if toktype == tokenize.COMMENT:
        print ttext

Prints:
### this is a comment
# this call outputs an xml stream of the current parameter dictionary.
# wow another comment
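
The snippet above is written for Python 2 (the StringIO module and the print statement). A rough Python 3 equivalent might look like the following; extract_comments and the shortened sample are assumptions made for illustration, not part of the answer.

```python
import io
import tokenize


def extract_comments(source):
    """Return the text of every COMMENT token found in `source`."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    ]


sample = (
    "### this is a comment\n"
    "a = '# not a comment'\n"
    "for i in range(3):    # wow another comment\n"
    "    print(i ** 2)\n"
)
print(extract_comments(sample))
# ['### this is a comment', '# wow another comment']
```

The # inside the string literal is left alone because the tokenizer reports it as part of a STRING token rather than a COMMENT token.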

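Note that the code in the string has a syntax error: the : is missing after the do_something() function definition.

Also note that the ast module would not help here, since it does not preserve comments.

Get the comments from the matched group at index 1:
(#+[^\\\n]*)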


Sample code:
import re
# A plain raw string is used instead of ur'...', which is Python 2 only syntax.
p = re.compile(r'(#+[^\\\n]*)')
test_str = u"..."

re.findall(p, test_str)

Matches:
1.  ### this is a comment
2.  # this call outputs an xml stream of the current parameter dictionary.
3.  # wow another comment
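
A quick way to see what this pattern does with a # inside a string literal is to run it on a line such as a = "#foo". The sketch below uses Python 3 syntax; COMMENT_PATTERN and the two-line input are made up for illustration.

```python
import re

# One or more '#', then everything up to a backslash or the end of the line.
COMMENT_PATTERN = re.compile(r'(#+[^\\\n]*)')

code = 'a = "#foo"\nb = 1  # real comment\n'

# The pattern knows nothing about string literals, so the '#' inside the
# quotes is reported as well, together with the closing quote.
print(COMMENT_PATTERN.findall(code))  # ['#foo"', '# real comment']
```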

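I tried that. The problem is that tokenize.untokenize is not a reliable function, and I am using it to transform code. If the ast module preserved comments, I would have 3 weeks of my life back.

@baallezx Could you elaborate on why tokenize cannot be used here? Thanks.

@alecxe, there are a lot of problems with tokenize.untokenize. It breaks if it encounters a line continuation character `\`, plus some other cases I cannot think of off the top of my head. I will try it again, using token.start and token.end as references for placement in the string, and get back to you. Maybe it will work.

@baallezx Thanks, yes, give it a try. It may be best to stick with this approach and work through the problems yourself, or with the help of the SO community by creating separate questions. I am still pretty sure this is the most robust approach (especially compared to the regex solutions).

I do not think this is doable with a Python regex, because # could be part of something like a = "#foo". More complicated cases with more opening and closing \" or \' characters are possible, so I would not be surprised if someone could prove via the pumping lemma that it cannot be done with regex alone. @alecxe has the better solution.

Why not split the string on newlines using str = str.split('\n') and then iterate over the result?

@RevanProdigalKnight, because I have to compare the preceding and following n characters with the regex results in order to merge 2 strings into a 3rd string. I have tried that approach, and splitting the lines adds complexity. This is all because I am doing a code transformation on a file, and after the transformation I have to add the comments back in the appropriate places.

I have not tested enough cases yet, but so far this is exactly what I need. I will keep testing all cases.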
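
The idea of using token.start and token.end as placement references could begin with something like the sketch below, in Python 3 syntax. It only records where each comment sits as (row, column) pairs; putting comments back after a transformation is not attempted here, and comments_with_positions is a made-up name.

```python
import io
import tokenize


def comments_with_positions(source):
    """Return (start, end, text) for each comment token.

    start and end are (row, column) tuples; rows count from 1, columns from 0.
    """
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            found.append((tok.start, tok.end, tok.string))
    return found


source = "x = 1  # first\n# second\n"
for start, end, text in comments_with_positions(source):
    print(start, end, text)
# (1, 7) (1, 14) # first
# (2, 0) (2, 8) # second
```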