使用Python正则表达式提取代码的非注释部分_Python_Regex

使用Python正则表达式提取代码的非注释部分

python regex

使用Python正则表达式提取代码的非注释部分,python,regex,Python,Regex,我试图使用Python提取c代码的非注释部分。到目前为止，我的代码可以在这些示例中提取非_注释，如果找不到，它将直接返回 // comment /// comment non_comment; non_comment; /* comment */ non_comment; // comment /* comment */ non_comment; /* comment */ non_comment; /* comment */ /* comment */ non_comment; // comm

我试图使用Python提取c代码的非注释部分。到目前为止，我的代码可以在这些示例中提取非_注释，如果找不到，它将直接返回

// comment
/// comment
non_comment;
non_comment; /* comment */
non_comment; // comment
/* comment */ non_comment;
/* comment */ non_comment; /* comment */
/* comment */ non_comment; // comment

这是源代码，我使用doctest对不同的场景进行单元测试

import re
import doctest

def remove_comment(expr):
  """
  >>> remove_comment('// comment')
  ''
  >>> remove_comment('/// comment')
  ''
  >>> remove_comment('non_comment;')
  'non_comment;'
  >>> remove_comment('non_comment; /* comment */')
  'non_comment;'
  >>> remove_comment('non_comment; // comment')
  'non_comment;'
  >>> remove_comment('/* comment */ non_comment;')
  'non_comment;'
  >>> remove_comment('/* comment */ non_comment; /* comment */')
  'non_comment;'
  >>> remove_comment('/* comment */ non_comment; // comment')
  'non_comment;'
  """
  expr = expr.strip()
  if expr.startswith(('//', '///')):
      return ''
  # throw away /* ... */ comment, and // comment at the end
  pattern = r'(/\*.*\*/\W*)?(\w+;)(//|/\*.*\*/\W*)?'
  r = re.search(pattern, expr)
  return r.group(2).strip() if r else ''    

doctest.testmod()

然而，不知怎么的，我不喜欢这个代码，我相信应该有更好的方法来处理这个问题。有人知道更好的方法吗？谢谢

不要提取所有非注释，而是尝试删除注释，将其替换为

\/\/.*\/\*[^*]*\*\/是模式。它将捕获由/*.*/或以/

开头的任何内容。要查找注释，您还必须查找引用项，因为注释语法可以嵌入在字符串中。反之亦然，字符串可以嵌入到注释中

下面的正则表达式捕获组1中的注释和组2中的非注释

因此，要删除注释-

re.sub(r'(?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|\'[^\'\\]*(?:\\[\S\s][^\'\\]*)*\'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:/\*|//))|[^/"\'\\\r\n]*))+|[^/"\'\\\r\n]+)+|[\S\s][^/"\'\\\r\n]*)', r'\2', sourceTxt)

若要仅获取非注释，您可以只匹配所有，将组2项目保存到数组中

此正则表达式保留格式并使用断言。还有一个没有格式的精简版本，它没有使用断言

演示PCRE：演示Python：

可读正则表达式

    # raw:   (?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
    # delimited:  /(?m)((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:\/\*|\/\/))|[^\/"'\\\r\n]*))+|[^\/"'\\\r\n]+)+|[\S\s][^\/"'\\\r\n]*)/

    (?m)                             # Multi-line modifier
    (                                # (1 start), Comments
         (?:
              (?: ^ [ \t]* )?                  # <- To preserve formatting
              (?:
                   /\*                              # Start /* .. */ comment
                   [^*]* \*+
                   (?: [^/*] [^*]* \*+ )*
                   /                                # End /* .. */ comment
                   (?:                              # <- To preserve formatting
                        [ \t]* \r? \n
                        (?=
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                   )?
                |
                   //                               # Start // comment
                   (?:                              # Possible line-continuation
                        [^\\]
                     |  \\
                        (?: \r? \n )?
                   )*?
                   (?:                              # End // comment
                        \r? \n
                        (?=                              # <- To preserve formatting
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                     |  (?= \r? \n )
                   )
              )
         )+                               # Grab multiple comment blocks if need be
    )                                # (1 end)

 |                                 ## OR

    (                                # (2 start), Non - comments
         # Quotes
         # ======================
         (?:                              # Quote and Non-Comment blocks
              "
              [^"\\]*                          # Double quoted text
              (?: \\ [\S\s] [^"\\]* )*
              "
           |                                 # --------------
              '
              [^'\\]*                          # Single quoted text
              (?: \\ [\S\s] [^'\\]* )*
              '
           |                                 # --------------

              (?:                              # Qualified Linebreak's
                   \r? \n
                   (?:
                        (?=                              # If comment ahead just stop
                             (?: ^ [ \t]* )?
                             (?: /\* | // )
                        )
                     |                                 # or,
                        [^/"'\\\r\n]*                    # Chars which doesn't start a comment, string, escape,
                                                         # or line continuation (escape + newline)
                   )
              )+
           |                                 # --------------
              [^/"'\\\r\n]+                    # Chars which doesn't start a comment, string, escape,
                                               # or line continuation (escape + newline)

         )+                               # Grab multiple instances

      |                                 # or,
         # ======================
         # Pass through

         [\S\s]                           # Any other char
         [^/"'\\\r\n]*                    # Chars which doesn't start a comment, string, escape,
                                          # or line continuation (escape + newline)

    )                                # (2 end), Non - comments

您可以删除注释：您能试试吗？：\/\/\/？\124;\/\*\ s*\w+\s*？：$\124;\*\/？\ 124;\ w+？哇，这太棒了！我只是做了一些类似的事情，用了两行而不是一行。我使用re.subr'/.*'，expr，后跟re.subr'/\*[\w]*\*/'，expr。但你的更好。顺便问一下，[^*]是什么意思？这意味着它将捕获任何不是*的字符。将[^*]替换为。你会明白我为什么这么做。很高兴知道，但我需要一段时间来消化。我将暂时保留[\w]*，因为它对我的大脑来说更加明确和自然。谢谢一种更简单的方法是，它在一行中失败，即“printf%s，/*无注释*/；不发表评论：“对不起，这种复杂程度是不必要的。当然字符串中可能有注释语法，但这种情况发生的频率是多少？在大多数情况下，您不需要300个字符的正则表达式模式。@emsimpson92-只为您准备。为了提高性能，我只是把它变得更复杂。我还在下面添加了preserve-formatting一个，过去20年的标准注释剥离器regex，它来自一个Perl-god，而这个极其复杂的注释剥离器就是从这个Perl-god派生出来的。享受这些吧，我可能不会在这里呆太久了。。。

    # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)


    (                                # (1 start), Comments 
         /\*                              # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                                # End /* .. */ comment
      |  
         //                               # Start // comment
         (?: [^\\] | \\ \n? )*?           # Possible line-continuation
         \n                               # End // comment
    )                                # (1 end)
 |  
    (                                # (2 start), Non - comments 
         "
         (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
         "
      |  '
         (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
         ' 
      |  [\S\s]                           # Any other char
         [^/"'\\]*                        # Chars which doesn't start a comment, string, escape,
                                          # or line continuation (escape + newline)
    )                                # (2 end)