Regex: recursively list files that contain m of n regular expressions

regex bash search awk grep

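I have a directory with lots of files. I have n search patterns, and I would like to list all files that match at least m of them.

Example: from the files below, list the ones that contain at least two of str1, str2, str3 and str4.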

$ ls -l dir/
total 16
-rw-r--r--. 1 me me 10 Jun 22 14:22 a
-rw-r--r--. 1 me me  5 Jun 22 14:22 b
-rw-r--r--. 1 me me 10 Jun 22 14:22 c
-rw-r--r--. 1 me me  9 Jun 22 14:22 d
-rw-r--r--. 1 me me 10 Jun 22 14:22 e
$ cat dir/a
str1
str2
$ cat dir/b
str2
$ cat dir/c
str2
str3
$ cat dir/d
str
str4
$ cat dir/e
str2
str4

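I managed to do this with a rather ugly for loop over the find results that spawns n grep processes for each file. That is obviously very inefficient and takes a long time on directories with many files: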

for f in $(find dir/ -type f); do
    c=0
    grep -qs 'str1' "$f" && ((c++))
    grep -qs 'str2' "$f" && ((c++))
    grep -qs 'str3' "$f" && ((c++))
    grep -qs 'str4' "$f" && ((c++))
    [[ $c -ge 2 ]] && echo "$f"
done

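I am pretty sure there is a much better way to do this, but I am not sure how to approach it. From what I understand from the man page (i.e. the sections on -e and -m), this is not possible with grep alone.

What would be the right tool? Is this possible with awk?

Bonus: by using find I can define more precisely which files to search (i.e. -prune certain subdirectories, or only search files matching -iname '*.txt'), and I would like to keep that ability with any other solution.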
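The file-selection part of that bonus requirement can also be kept outside of find. The following is only a rough sketch in Python, assuming that pruning by directory name and filtering by a case-insensitive glob are enough; the function name candidate_files and its parameters are made up for illustration and are not part of any answer below.

```python
import fnmatch
import os


def candidate_files(top, prune=(), name_glob="*"):
    """Yield non-directory entries under `top`, skipping pruned directories.

    Roughly comparable to: find top -name DIR -prune -o -iname GLOB -print
    (hypothetical helper, not taken from the thread).
    """
    for root, dirs, files in os.walk(top):
        # Editing `dirs` in place stops os.walk from descending into them.
        dirs[:] = [d for d in dirs if d not in prune]
        for name in files:
            # Lower-case both sides to mimic the case-insensitive -iname.
            if fnmatch.fnmatchcase(name.lower(), name_glob.lower()):
                yield os.path.join(root, name)
```

Each yielded path can then be handed to whatever m-of-n check is used, for example `for path in candidate_files("dir", prune=("skipme",), name_glob="*.txt"): ...`, where "skipme" is a placeholder directory name.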

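Update: below are some statistics about the performance of the different implementations.

find

awk (script from the answer)

python (I am a python noob, please let me know if this can be optimized):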

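import os

patterns = ["str1", "str2", "str3", "str4"]

for root, dirs, files in os.walk("dir"):
    for name in files:
        filepath = os.path.join(root, name)
        # Read the file once: a file object is exhausted after one pass,
        # so iterating over it once per pattern would miss matches.
        with open(filepath, "r", errors="replace") as fh:
            lines = fh.readlines()
        # Count how many distinct patterns occur on at least one line.
        c = sum(any(p in line for line in lines) for p in patterns)
        if c >= 2:
            print(filepath)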

c++ (script from the answer)

Here is an option using awk, since you tagged the question with it as well:

find dir -type f -exec \
awk '/str1|str2|str3|str4/{c++} END{if(c>=2) print FILENAME;}' {} \;

However, it counts duplicates, so a file containing

str1
str1

would be listed.
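To make the difference concrete, here is a small sketch in Python (not part of the original answer) that contrasts the two ways of counting. The function names are invented for illustration: matching_lines mirrors what the one-liner counts, while distinct_patterns counts what the question asks for.

```python
import re

PATTERNS = ["str1", "str2", "str3", "str4"]


def matching_lines(lines, patterns=PATTERNS):
    """Count lines that match any pattern (what the awk one-liner does)."""
    combined = re.compile("|".join(patterns))
    return sum(1 for line in lines if combined.search(line))


def distinct_patterns(lines, patterns=PATTERNS):
    """Count patterns that match at least one line (what the question wants)."""
    compiled = [re.compile(p) for p in patterns]
    return sum(1 for rx in compiled if any(rx.search(line) for line in lines))


if __name__ == "__main__":
    duplicated = ["str1", "str1"]
    print(matching_lines(duplicated), distinct_patterns(duplicated))  # 2 1
    same_line = ["str1 str2"]
    print(matching_lines(same_line), distinct_patterns(same_line))    # 1 2
```

The second case shows the opposite problem: two different patterns on a single line are counted only once by the per-line approach.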

Since programming languages and performance came up, here is a C++ version. I have not compared it with my own awk version, though.

#include <cstddef>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

int main() {
    const fs::path dir = "dir";
    std::vector<std::string_view> strs{   // or: std::array<std::string_view, 4>
        "str1",
        "str2",
        "str3",
        "str4",
    };

    std::string line;
    int count;     // matches in a file
    size_t strsco; // number of strings to check in strs

    // a lambda to find matches on the current line
    auto matcher = [&](const fs::directory_entry& de) {
        for(size_t idx = 0; idx < strsco;) {
            if(line.find(strs[idx]) != std::string::npos) {
                // a match was found

                if(++count >= 2) {
                    std::cout << de.path() << '\n';
                    // or the below if the quotation marks surrounding the path are
                    // unwanted:
                    // std::cout << de.path().native() << '\n';
                    return false;
                }

                // swap the found string_view with the last in the vector
                // to remove it from future matches in this file.
                // idx is not advanced here, so the string that was swapped
                // into this slot is still checked against the current line.
                --strsco;
                std::swap(strs[idx], strs[strsco]);
            } else {
                ++idx;
            }
        }
        return true;
    };

    // do a "find dir -type f"
    for(const fs::directory_entry& de : fs::recursive_directory_iterator(dir)) {
        if(de.is_regular_file()) { // -type f

            // open the found file
            if(std::ifstream file(de.path()); file) {
                // reset counters
                count = 0;
                strsco = strs.size();
                // read line by line until the file stream is depleted or matcher()
                // returns false
                while(std::getline(file, line) && matcher(de));
            }
        }
    }
}

If you use a compiler other than g++, make sure to enable speed optimizations. The program requires C++17.

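@David You're welcome! The speed difference is odd, but when speed matters I don't use awk at all :-)

@David For speed, I would use C++ and have the program do both the find and the pattern matching, but that takes more than a one-liner. perl and python are usually good enough, though. I don't know that one, so I can't say. I would never use php.

I don't know about perl, but the python solution is much slower than the awk one.

Somewhere there should be a site with timing statistics for multiple languages across multiple tasks; that would be very interesting. For example, they could take the code from there and add timings for each tool/language for each task. Of course, I'm not volunteering :-).

@EdMorton I'll pass on that as well :-)

This works like a charm, thank you. I am still struggling to learn the awk syntax. Would you mind explaining the script?

@David Explaining the syntax would be a waste of time, since it is all written in the manual.

Wow, that is really fast! Thank you very much.

Did you post the timing from the third run, to remove caching effects from the results? I assume it does not matter to you whether a script takes 0.002s or 0.006s, since both are within the blink of an eye. If you have larger files where performance would matter, can you test the timings with those? Also, you are comparing a C++ program that has str1 etc. hard-coded with an awk program that reads the values from a file, which is obviously not an apples-to-apples comparison.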
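The point about discarding the first runs can be turned into a small timing harness. This is a minimal sketch, assuming wall-clock time of a whole command is what is being compared; the helper name time_command and the default counts for warmup and runs are arbitrary choices for illustration.

```python
import subprocess
import time


def time_command(argv, runs=5, warmup=2):
    """Return wall-clock seconds for `runs` executions of `argv`.

    The first `warmup` executions are run but not recorded, so that
    filesystem caches are already warm when measuring starts.
    """
    for _ in range(warmup):
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        samples.append(time.perf_counter() - start)
    return samples


# Example call, using a command from this thread:
# time_command(["find", "dir", "-type", "f", "-exec",
#               "awk", "-v", "m=2", "-f", "prog.awk", "reg.txt", "{}", "+"])
```

Comparing the minimum or median of the returned samples is usually more stable than comparing single runs, especially for commands that finish in a few milliseconds.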

$ cat reg.txt
str1
str2
str3
str4

$ cat prog.awk
# reads regexps from the first input file
# parameterized by `m'
# requires gawk or mawk for `nextfile'
FNR == NR {
  reg[NR] = $0
  next
}
FNR == 1 {
  for (i in reg)
    tst[i]
  cnt = 0
}
{
  for (i in tst) {
    if ($0 ~ reg[i]) {
      if (++cnt == m) {
        print FILENAME
        nextfile
      }
      delete tst[i]
    }
  }
}

$ find dir -type f -exec awk -v m=2 -f prog.awk reg.txt {} +
dir/a
dir/c
dir/e
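For readers more comfortable with Python than awk, the same idea (patterns read from a file, each pattern counted once per file, stop reading as soon as m patterns have matched) might look roughly like the sketch below. It is not a translation endorsed by the answer's author: the function names are invented, empty pattern lines are skipped (the awk script does not do that), and Python's re syntax differs from awk's regular expressions in some details.

```python
import re


def load_patterns(path):
    """Read one regular expression per line, skipping empty lines."""
    with open(path, "r") as fh:
        return [re.compile(line.rstrip("\n")) for line in fh if line.strip()]


def matches_at_least(path, regexes, m):
    """Return True once `m` distinct regexes have matched in `path`."""
    if m <= 0:
        return True
    pending = list(regexes)  # regexes not yet seen in this file
    found = 0
    with open(path, "r", errors="replace") as fh:
        for line in fh:
            still_pending = []
            for rx in pending:
                if rx.search(line):
                    found += 1
                    if found == m:
                        return True  # early exit, like nextfile in awk
                else:
                    still_pending.append(rx)
            pending = still_pending
    return False
```

A caller would load the patterns once with `load_patterns("reg.txt")` and then test each file found under dir with `matches_at_least(path, regexes, 2)`.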

Compile command for the C++ program above, assuming it is saved as prog.cpp:

g++ -std=c++17 -O3 -o prog prog.cpp