Using regular expressions in Perl to find strings in source code (Regex, String, Perl)


I am learning regular expressions in Perl.

I want to write a script that takes a C source file and finds the strings in it.

This is my code:

my $file1= @ARGV;
open my $fh1, '<', $file1;
while(<>)
{
  @words = split(/\s/, $_);
  $newMsg = join '', @words;
  push  @strings,($newMsg =~ m/"(.*\\*.*\\*.*\\*.*)"/) if($newMsg=~/".*\\*.*\\*.*\\*.*"/);
  print Dumper(\@strings);
foreach(@strings)
    {
    print"strings: $_\n"; 
    }

What do I have to do?

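It seems you are trying to capture multiple lines within a string with the following regular expression:

my $your_regexp = qr{
    (
        .*  # anything
        \\* # any number of backslashes
        .*  # anything
        \\* # any number of backslashes
        .*  # anything
        \\* # any number of backslashes
        .*  # anything
    )
}x;

But this looks more like grasping at straws than a deliberate plan.

So you have two problems:

- finding everything between double quotes (")
- handling the case where there may be multiple lines between those quotes

Regular expressions can match across multiple lines. The /s modifier makes . match newlines as well, and a negated character class such as [^"] matches newlines in any case. So try:

my $your_new_regexp = qr{
    \"       # opening quote mark
    ([^\"]+) # anything that's not a quote mark, capture
    \"       # closing quote mark
}xs;

Actually, you may have a third problem:

- removing the trailing backslash/newline pairs from the strings

You can handle this with a search-and-replace: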

foreach ( @strings ) {
    $_ =~ s/\\\n//g;
}
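
Putting these pieces together, a minimal sketch might look like the following. It assumes the whole C file has already been read into one scalar; the variable $source and its sample contents are made up for illustration. It also inherits the limits of the simple pattern: escaped quotes inside a literal and quotes inside comments are not handled.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the slurped contents of a C file.
my $source = join "\n",
    'static const char *text =',
    '"first line\\',
    'second line";',
    'int main(void) { puts("hello"); return 0; }',
    '';

# Same pattern as above, compiled with qr so it can be reused.
my $string_re = qr{
    "        # opening quote mark
    ([^"]+)  # anything that is not a quote mark, captured
    "        # closing quote mark
}xs;

# In list context with /g, the match returns every captured group.
my @strings = $source =~ /$string_re/g;

# Drop each backslash/newline pair (a C line continuation).
s/\\\n//g for @strings;

print "string: $_\n" for @strings;
```

Because the file is matched as a single string rather than line by line, a literal that spans several lines is found in one piece.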

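Here is a fun solution. It uses an experimental C parser. We can use the

c2ast.pl

program that ships with the module to convert a piece of C source into an abstract syntax tree, and dump it to some file (using Data::Dumper). Then we can extract all strings with a bit of magic.

Unfortunately, the AST objects have no methods, but since they are auto-generated, we know what they look like on the inside.

They are blessed arrayrefs.
- Some contain a single unblessed arrayref of items
- Others contain zero or more items (lexemes or objects)
"Lexemes" are arrayrefs with two fields of position information and the string contents at index 2.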

This information can be extracted from ...

The code:

use strict; use warnings;
use Scalar::Util 'blessed';
use feature 'say';

our $VAR1;
require "./test.dump"; # populates $VAR1; the "./" is needed since Perl 5.26, where "." is no longer in @INC

my @strings = map extract_value($_), find_strings($$VAR1);
say for @strings;

sub find_strings {
  my $ast = shift;
  return $ast if $ast->isa("C::AST::string");
  return map find_strings($_), map flatten($_), @$ast;
}

sub flatten {
  my $thing = shift;
  return $thing if blessed($thing);
  return map flatten($_), @$thing if ref($thing) eq "ARRAY";
  return (); # we are not interested in other references, or unblessed data
}

sub extract_value {
  my $string = shift;
  return unless blessed($string->[0]);
  return unless $string->[0]->isa("C::AST::stringLiteral");
  return $string->[0][0][2];
}

Rewriting find_strings from recursion to iteration:

sub find_strings {
  my @unvisited = @_;
  my @found;
  while (my $ast = shift @unvisited) {
    if ($ast->isa("C::AST::string")) {
      push @found, $ast;
    } else {
      push @unvisited, map flatten($_), @$ast;
    }
  }
  return @found;
}
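
To see the traversal work without installing the parser, here is a small sketch that runs the iterative version on a hand-built tree. Only the class names C::AST::string and C::AST::stringLiteral come from the code above; the other class names, the position values, and the tree shape are assumptions made up for illustration, not real c2ast.pl output.

```perl
use strict;
use warnings;
use Scalar::Util 'blessed';

# Return the blessed objects reachable through plain (unblessed) arrayrefs.
sub flatten {
    my $thing = shift;
    return $thing if blessed($thing);
    return map { flatten($_) } @$thing if ref($thing) eq 'ARRAY';
    return ();    # plain scalars and other references are ignored
}

# Iterative traversal with an explicit work list instead of recursion.
sub find_strings {
    my @unvisited = @_;
    my @found;
    while (my $ast = shift @unvisited) {
        if ($ast->isa('C::AST::string')) {
            push @found, $ast;
        }
        else {
            push @unvisited, map { flatten($_) } @$ast;
        }
    }
    return @found;
}

# Hand-built stand-in for a dumped AST (hypothetical data).
my $literal = bless([ [ 10, 7, '"World"' ] ], 'C::AST::stringLiteral');
my $string  = bless([ $literal ], 'C::AST::string');
my $ident   = bless([ [ 3, 4, 'main' ] ], 'C::AST::identifier');
my $root    = bless([ [ $ident, $string ] ], 'C::AST::translationUnit');

# The lexeme text sits at index 2, as described above.
my @values = map { $_->[0][0][2] } find_strings($root);
print "$_\n" for @values;
```

The work list replaces the call stack, so deeply nested trees cannot trigger deep recursion.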

Test C code:

/* A "comment" */
#include <stdio.h>

static const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent"; 

int main() {
        printf("Hello %s:\n%s\n", "World", text2);
        return 0;
}

which produces

"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"World"
"Hello %s\n"
"" 
"" 
"" 
"" 
"" 
""

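Note that there are some empty strings that do not come from our source code, but from the included files. Filtering these out may not be impossible, but it is a bit impractical.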

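Here is a simple way to extract all strings in a source file. There is one important decision to make: do we preprocess the code? If not, we may miss strings that are generated through macros. We would also have to treat # as a comment character.

As this is a quick-and-dirty solution, the syntactic correctness of the C code is not an issue. We will, however, honour comments.

Now, if the source has been preprocessed (with gcc -E source.c), then multiline strings have already been folded into one line! Comments have been removed as well. Sweet. The only comments left are the mentions of line numbers and source files, which are there for debugging purposes. So basically all we have to do is: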

$ gcc -E source.c | perl -nE'
  next if /^#/;  # skip line directives etc.
  say $1 while /(" (?:[^"\\]+ | \\.)* ")/xg;
'

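The output (with the test file from my other answer as input) contains a lot of garbage, which seems to come from __asm__ blocks, but this works astonishingly well.

Note the regex I used:

/(" (?:[^"\\]+ | \\.)* ")/x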

"         # a literal '"'
(?:       # the begin of a non-capturing group
  [^"\\]+ # a character class that matches anything but '"' or '\', repeated once or more
|
  \\.     # an escape sequence like '\n', '\"', '\\' ...
)*        # zero or more times
"         # closing '"'

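As a quick check of the pattern broken down above, the following sketch runs it on a single line containing escape sequences. The sample line is invented for illustration.

```perl
use strict;
use warnings;

my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# Hypothetical line of preprocessed C. Inside Perl single quotes the
# backslashes stay literal, so the second literal contains \" twice.
my $line = 'printf("Hello %s\n", "say \"hi\"");';

# List context with /g returns every captured literal.
my @literals = $line =~ /($string_re)/g;
print "$_\n" for @literals;
```

The \\. alternative consumes a backslash together with the character after it, which is why an escaped quote does not end the literal early.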
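What are the limitations of this solution?

- We need a preprocessor.
  - This code uses gcc.
  - clang also supports the -E option, but I don't know how its output is formatted.
- Character literals are a failure mode: for example, myfunc('"', a_variable, '"') would be extracted as "', a_variable, '".
- We also extract strings from other source files (false positives).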
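
The character-literal failure mode can be reproduced in a few lines. The second half of the sketch is one possible mitigation, not part of the original approach: match character literals first and throw them away, so that their quotes are never taken for the start of a string. The sample line and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# Two character literals that each hold a double quote.
my $line = q{myfunc('"', a_variable, '"');};

# Naive extraction: the quotes inside the character literals are paired up.
my @false_positives = $line =~ /($string_re)/g;
print "naive: $_\n" for @false_positives;

# Possible mitigation: consume character literals in the first alternative;
# only the second alternative captures, so $1 is undefined for them.
my @strings;
while ($line =~ / ' (?:[^'\\]+ | \\.)* ' | ($string_re) /xg) {
    push @strings, $1 if defined $1;
}
print "filtered: $_\n" for @strings;
```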

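Oh wait, we can fix the false positives from other source files by parsing the source-file comments that the preprocessor inserted. They look like this: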

# 29 "/usr/include/stdio.h" 2 3 4

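So if we remember the current filename and compare it with the filename we want, we can skip the unwanted strings. This time, I'll write it as a full script instead of a one-liner.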

use strict; use warnings;
use autodie;  # automatic error handling
use feature 'say';

my $source = shift @ARGV;
my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# open a pipe from the preprocessor
open my $preprocessed, "-|", "gcc", "-E", $source;

my $file;
while (<$preprocessed>) {
  $file = $1 if /^\# \s+ \d+ \s+ ($string_re)/x;
  next if /^#/;
  next if $file ne qq("$source");
  say $1 while /($string_re)/xg;
}
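
The filtering logic can be tried without a compiler by feeding it canned lines. In this sketch the sub name strings_from and the sample lines are hypothetical; the lines only imitate the shape of gcc -E output and are not real compiler output.

```perl
use strict;
use warnings;

my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# Return the string literals that belong to $source, given preprocessor lines.
sub strings_from {
    my ($source, @lines) = @_;
    my $file = '';    # current file according to the last marker seen
    my @found;
    for (@lines) {
        $file = $1 if /^\# \s+ \d+ \s+ ($string_re)/x;    # remember the file
        next if /^#/;                       # skip the markers themselves
        next if $file ne qq("$source");     # line belongs to another file
        push @found, $1 while /($string_re)/g;
    }
    return @found;
}

# Hand-written lines imitating preprocessor output (assumed format).
my @lines = (
    '# 1 "source.c"',
    '# 1 "/usr/include/stdio.h" 1 3 4',
    'extern char *tmpnam(char *s) __asm__ ("" "tmpnam");',
    '# 3 "source.c" 2',
    'int main() { printf("Hello %s\n", "World"); }',
);

my @found = strings_from('source.c', @lines);
print "$_\n" for @found;
```

Initialising $file to the empty string avoids an "uninitialized value" warning if a code line appears before the first marker.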

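If you can't use a convenient preprocessor to fold multiline strings and remove comments, this gets a lot uglier, because we have to account for all of that ourselves. Basically, you want to slurp the whole file at once instead of iterating over it line by line. Then, skip over any comments. Don't forget to ignore preprocessor di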