Remove duplicate function blocks using awk/Python (general solution)

python regex bash awk


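I have a text file that contains several function blocks, some of which are duplicated. I want to create a new file that contains only the unique function blocks. For example, input.txt (I have updated the example):

and I would like output.txt to be: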

Func (a1,b1) abc1
{
xyz1;
    {
        xy1;
    }

xy1;
}

Func (a2,b2) abc2
{
xyz2;
    {
        xy2;
        rst2;
    }

xy2;
}

Func (a3,b3) abc3
{
xyz3;
    {
        xy3;
        rst3;
        def3;
    }

xy3;
}

I found a solution that uses awk to remove duplicate lines, something like:

$ awk '!a[$0]++' input.txt > output.txt

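The problem is that the solution above matches single lines only, not blocks of text. I wanted to combine this awk solution with a regular expression that matches a single function block:

'/^FUNC(.|\n)*?\n}/'

but I could not get it to work. Any suggestions or solutions would be very helpful.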
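To make the problem concrete, here is a minimal Python sketch of what line-level deduplication does to block-structured text. The function name `dedupe_lines` and the two-block sample are hypothetical, made up for illustration; they are not the asker's real input.

```python
def dedupe_lines(text):
    """Keep only the first occurrence of each line, like awk '!a[$0]++'."""
    seen = set()
    kept = []
    for line in text.split("\n"):
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


# Hypothetical sample: two different blocks, no duplicated block at all.
sample = "Func (a1,b1) abc1\n{\nxyz1;\n}\n\nFunc (a2,b2) abc2\n{\nxyz2;\n}"
result = dedupe_lines(sample)
print(result)
```

The second block comes out without its braces, because "{" and "}" were already seen as lines of the first block. That is why the unit of comparison has to be the whole block rather than the line.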

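If the code blocks are separated by blank lines, you can define the record separator (and the output record separator) accordingly.

NB. This works for the toy example, but it is fragile: any blank line inside a code block breaks the logic. Likewise, you cannot rely on the braces, since they can also appear inside a code block.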
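The blank-line approach and its weakness can be sketched in Python as follows. This is only an illustration of the idea, not the answerer's code; `dedupe_paragraphs` and the sample strings are assumed names and data.

```python
import re


def dedupe_paragraphs(text):
    """Split on runs of blank lines and keep the first copy of each chunk."""
    seen = set()
    kept = []
    for chunk in re.split(r"\n\n+", text.strip("\n")):
        if chunk not in seen:
            seen.add(chunk)
            kept.append(chunk)
    return "\n\n".join(kept)


# Works when a block contains no blank line of its own.
block = "Func (a1,b1) abc1\n{\nxyz1;\n}"
clean = dedupe_paragraphs(block + "\n\n" + block)

# Breaks when blocks contain blank lines: each block is cut into two chunks,
# and the shared tail "xy1;\n}" is treated as a duplicate and dropped.
block_a = "Func (a1,b1) abc1\n{\nxyz1;\n\nxy1;\n}"
block_b = "Func (a2,b2) abc2\n{\nxyz2;\n\nxy1;\n}"
broken = dedupe_paragraphs(block_a + "\n\n" + block_b)
print(broken)
```

In the second case the output ends at "xyz2;", so the second function loses its closing lines. The updated sample in the question has exactly this kind of internal blank line.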

Update

For the updated input, this may work better:

$ awk -v ORS='\n\n' '{record=($1~/^Func/)?$0:record RS $0} 
    /^}/ && !a[record]++{print record} '

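Here a record is defined as starting with the "Func" keyword and ending with a closing brace in the first column. The lines of the record are accumulated, and the record is printed once it is complete and has not been seen before. ORS is set so that there is a blank line between records.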
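For readers who are new to awk, the same logic can be sketched in Python. This is a rough equivalent under the same assumptions (a block starts on a line whose first field begins with "Func" and ends on a line starting with "}"); the name `dedupe_blocks` and the sample data are hypothetical.

```python
def dedupe_blocks(lines):
    """Collect Func...} records and keep the first copy of each one."""
    seen = set()
    kept = []
    record = []
    for line in lines:
        fields = line.split()
        if fields and fields[0].startswith("Func"):
            record = [line]          # a new record starts here
        else:
            record.append(line)      # keep accumulating the current record
        if line.startswith("}"):     # closing brace in the first column
            key = "\n".join(record)
            if key not in seen:
                seen.add(key)
                kept.append(key)
    return kept


block1 = "Func (a1,b1) abc1\n{\nxyz1;\n    {\n        xy1;\n    }\n\nxy1;\n}"
block2 = "Func (a2,b2) abc2\n{\nxyz2;\n    {\n        xy2;\n    }\n\nxy2;\n}"
text = block1 + "\n\n" + block2 + "\n\n" + block1
kept = dedupe_blocks(text.split("\n"))
print("\n\n".join(kept))
```

The indented inner braces do not end the record because only a "}" in the first column counts, and blank lines inside a block are simply part of the accumulated text.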

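Since the OP changed the requirements and the sample, I have rewritten the code. Please try it and let me know whether it helps (it reads the input file twice).

Here is a non-one-liner form of the solution as well: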

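awk '
# First pass (FNR==NR): remember each distinct Func header line.
FNR==NR && /Func/ && !a[$0]++{
  gsub(/^ +/,"");   # strip leading spaces from the header
  b[$0]++;          # record the header in b
  next}
# Second pass: at each Func header, keep the block only the first time.
FNR!=NR && /Func/{
  flag=($0 in b)?1:"";
  delete b[$0]}     # later duplicates are no longer found in b
flag                # print the line while flag is set
'   Input_file  Input_file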

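Adapt this code to your actual purpose (I do not know the exact conventions and format of the language in the sample). The code is self-commented.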

awk '
   # at every new function
   /^Func[[:space:]]*[(]/ {
     # print last function if kept
     if ( Keep ) print Code
     # new function name
     Name=$NF
     # define to keep or avoid (keep if not yet in the list)
     Keep = ! ( Name in List)
     # put fct name in list
     List[ Name ]
     # clean code in memory
     Code = ""
     }
     # at each line, load the line into the code
      # if code is not empty, add old code + new line
     { Code = ( Code ? Code "\n" : "" ) $0 }

   # at the end, print last code if needed
   END { if ( Keep ) print Code }  
   ' sample.txt

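The one-liner awk '$1=="Func"{ f=!seen[$NF]++ } f' file just assumes that every Func definition is on its own line and that the line ends with the function name.

All it does is look for a "Func" line and then set the flag f to true if this is the first time we have seen the function name at the end of that line, and to false otherwise (using the common awk idiom !seen[$NF]++, which you already use in your question but with the array named a[]). Then, if f is true (i.e. you are following the Func definition of a previously unseen function name), it prints the current line; otherwise it skips the line (i.e. you are following the Func definition of a function name that has already been seen).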
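The flag idiom described above may be easier to follow when spelled out. Below is a rough Python sketch of the same idea, with the same assumption that the function name is the last field of the Func line; `dedupe_by_name` and the sample lines are hypothetical.

```python
def dedupe_by_name(lines):
    """Print-flag approach: keep lines that follow a first-seen Func name."""
    seen = set()
    keep = False
    out = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "Func":
            name = fields[-1]          # function name is the last field
            keep = name not in seen    # true only the first time we see it
            seen.add(name)
        if keep:
            out.append(line)
    return out


lines = [
    "Func (a1,b1) abc1", "{", "xyz1;", "}", "",
    "Func (a2,b2) abc2", "{", "xyz2;", "}", "",
    "Func (a1,b1) abc1", "{", "different;", "}",
]
out = dedupe_by_name(lines)
print("\n".join(out))
```

Note the design choice: blocks are compared by function name only, not by content. In the sample the third block has a different body but the same name, so it is dropped. If two blocks with the same name but different bodies must both survive, the whole block text has to be the key instead.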

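Thanks, everyone, for the solutions. They are correct for the example I posted, but my actual task is more general. I found a general solution in Python, since the answers above did not quite work for me (possibly because of my limited knowledge of bash). My general solution in Python is as follows: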

import re
import os

testFolder = "./Path"

# Usage: remove duplicate function blocks from every .txt file in testFolder

# Pattern for one function block: from a line starting with FUNC up to the
# first line that starts with "}". The pattern spans multiple lines.
# (The keyword is FUNC here, as in the original pattern; the sample above uses Func.)
pattern = re.compile(r'^FUNC(?:.|\n)*?\n}', re.M)

# Create a "Reduced" folder in the output path
outputPath = os.path.join(testFolder, "Reduced")
os.makedirs(outputPath, exist_ok=True)

# Iterate through all the files available
for testFileName in os.listdir(testFolder):
    if testFileName.endswith(".txt"):
        # Read all the content into a single string
        with open(os.path.join(testFolder, testFileName), "r") as inputFile:
            fileContent = inputFile.read()

        # List of function blocks, in file order
        funcList = pattern.findall(fileContent)
        # Drop duplicates while keeping the first occurrence of each block
        uniqueFuncList = list(dict.fromkeys(funcList))

        # Write each function block to the output file, separated by a blank line
        with open(os.path.join(outputPath, testFileName), "w") as outputFile:
            for element in uniqueFuncList:
                outputFile.write(element + "\n\n")

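Yes, you are right. I have added an example that is closer to what I have.

You probably need a parser to do this robustly, especially if you also have free-form comments and need to handle braces and "Func" keywords inside comment sections.

I have removed all the comments from the file to make it easier.

Please also handle inner code that matches the ending.

@karakfa, could you explain your solution so that I can modify it for my own needs? I am new to awk.

Thanks for explaining your solution; maybe I can modify it to fit my requirements?

Assuming there is no

in the code; the sample shows that this is not the case.

@tanzil, please check my edited code and let me know whether it helps you.

++ve for the nice code, sir. Somehow my logic was also a bit similar :) Your code is still very clear.

Very efficient and short code. Since you use a lot of guru shortcuts here, you might add some comments for requesters who are new to awk.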
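On the comment about needing a parser: a small step in that direction is to track brace depth instead of relying on where the closing brace sits on the line. The sketch below is only an illustration under stated assumptions: `split_blocks` and `unique_blocks` are hypothetical names, a block is assumed to start on a line whose first field is the keyword, and braces inside comments or string literals are NOT handled (that is where a real parser would be needed).

```python
def split_blocks(text, keyword="Func"):
    """Return blocks that start at a keyword line and end where braces balance."""
    blocks = []
    current = None      # lines of the block being collected, None outside a block
    depth = 0           # current brace nesting depth
    opened = False      # whether the block's first "{" has been seen
    for line in text.split("\n"):
        if current is None:
            fields = line.split()
            if not fields or fields[0] != keyword:
                continue            # skip text between blocks
            current, depth, opened = [], 0, False
        current.append(line)
        depth += line.count("{") - line.count("}")
        opened = opened or "{" in line
        if opened and depth <= 0:   # braces balanced: the block is complete
            blocks.append("\n".join(current))
            current = None
    return blocks


def unique_blocks(text):
    """Drop repeated blocks, keeping first occurrences in file order."""
    return list(dict.fromkeys(split_blocks(text)))


# Inner braces deliberately sit in the first column here.
b1 = "Func (a1,b1) abc1\n{\nxyz1;\n{\nxy1;\n}\n\nxy1;\n}"
b2 = "Func (a2,b2) abc2\n{\nxyz2;\n{\nxy2;\n}\n\nxy2;\n}"
blocks = unique_blocks(b1 + "\n\n" + b2 + "\n\n" + b1)
print("\n\n".join(blocks))
```

Because the end of a block is found by counting braces, an inner "}" in the first column no longer terminates the block early, which is the case the column-based approaches cannot handle.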


$ awk '$1=="Func"{ f=!seen[$NF]++ } f' file
Func (a1,b1) abc1
{
xyz1;
    {
        xy1;
    }

xy1;
}

Func (a2,b2) abc2
{
xyz2;
    {
        xy2;
        rst2;
    }

xy2;
}

Func (a3,b3) abc3
{
xyz3;
    {
        xy3;
        rst3;
        def3;
    }

xy3;
}
