Python: If a column value is duplicated, remove the line and keep the whole line in another file (Python, Bash, Awk)


python bash awk

I'm trying to do something like this.

Input file

123 09
123 10
355 07
765 01
765 03
765 05

Output file 1

123 09
355 07
765 01

Output file 2

123 10
765 03
765 05

That is, if a value in column 1 is duplicated, I want to keep both occurrences (the whole lines), but I want the repeated ones written to a separate file.

I know I can use

awk '!a[$1]++' file

but is it possible to get output 2 as well?

I'm open to a Python script.

For both the first and the second output, you can use the following awk command:

awk '!seen[$1]++{print > "output1"; next} {print > "output2"}' file

cat output1
123  09
355  07
765  01

cat output2
123  10
765  03
765  05
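To make the `!seen[$1]++` test in the awk command easier to follow, here is a minimal Python sketch of the same routing rule. The function name `route_lines` and the destination labels are illustrative assumptions, not part of the original answer. The counter is read before it is incremented, so the test is true only the first time a key appears.

```python
from collections import defaultdict


def route_lines(lines):
    """Mimic awk's `!seen[$1]++`: pair each line with its destination."""
    # Like an awk array, a missing key reads as 0.
    seen = defaultdict(int)
    routed = []
    for line in lines:
        fields = line.split()
        # awk uses an empty $1 for a blank line; do the same here.
        key = fields[0] if fields else ""
        count_before = seen[key]  # value the test sees (post-increment)
        seen[key] += 1            # the ++ happens after the read
        dest = "output1" if count_before == 0 else "output2"
        routed.append((line, dest))
    return routed


if __name__ == "__main__":
    sample = ["123 09", "123 10", "355 07", "765 01", "765 03", "765 05"]
    for line, dest in route_lines(sample):
        print(dest, line)
```

For the sample data, 123 09, 355 07 and 765 01 are routed to output1 and the remaining three lines to output2.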

Using Python:

seen = set()
with open('data.txt') as fin, open('f1.txt', 'w') as fout1, open('f2.txt', 'w') as fout2:
    for line in fin:
        col = line.split()[0]
        if col in seen:
            fout2.write(line)
        else:
            seen.add(col)
            fout1.write(line)
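The script above raises an IndexError on a blank line, because `line.split()[0]` has nothing to index. A possible variant, sketched here as a reusable function, skips blank lines. The function name `split_by_first_column`, the choice to drop blank lines, and the temporary file names in the demo are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def split_by_first_column(src, first_path, repeat_path):
    """Write first occurrences of column 1 to first_path, repeats to repeat_path."""
    seen = set()
    with open(src) as fin, \
         open(first_path, "w") as first, \
         open(repeat_path, "w") as repeat:
        for line in fin:
            fields = line.split()
            if not fields:
                continue  # assumption: blank lines are dropped
            if fields[0] in seen:
                repeat.write(line)
            else:
                seen.add(fields[0])
                first.write(line)


if __name__ == "__main__":
    # Demo on the question's sample data, plus one blank line.
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "data.txt").write_text(
            "123 09\n123 10\n\n355 07\n765 01\n765 03\n765 05\n"
        )
        split_by_first_column(tmp / "data.txt", tmp / "f1.txt", tmp / "f2.txt")
        print((tmp / "f1.txt").read_text(), end="")
        print("---")
        print((tmp / "f2.txt").read_text(), end="")
```

A set keeps the membership test fast regardless of how many distinct keys the file contains.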

Try it. In the output1 file you will get:

123 09
355 07
765 01

An awk one-liner:

awk '{print >("file"(!a[$1]++?1:2))}' file

or

awk '{print >("file"(a[$1]++?2:1))}' file

You can do this directly in bash. For example:

#!/bin/bash

# file names
file=input.in
dupes=dupes.out
uniques=uniques.out

# an associative array to track seen keys (-A, not -a; needs bash 4 or later)
declare -A keys

# extracts a key from an input line via shell word splitting
get_key() {
  key=$1
}

# Removes old output files
rm -f "$dupes" "$uniques"

# process the input line by line
while IFS= read -r line; do
  get_key $line
  # skip blank lines: an empty subscript is an error for associative arrays
  [ -n "$key" ] || continue
  if [ -n "${keys[$key]}" ]; then
    # a duplicate
    echo "$line" >> "$dupes"
  else
    # not a duplicate
    keys[$key]=1
    echo "$line" >> "$uniques"
  fi
done < "$file"


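This could be shortened in several ways; I wrote it for clarity and flexibility, at the expense of brevity.

Regardless, it is important to understand that bash by itself is a quite powerful programming environment. One of the things that slows down many shell scripts is heavy use of external commands. Using external commands is not inherently bad, and sometimes it is the best or only way to get the job done, but where it is not, you should seriously consider avoiding them.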

Here is a simple, readable Python script that does the job. If you have any questions, please comment.

# open all the files
with open('output_1.txt', 'w') as out_1:
    with open('output_2.txt', 'w') as out_2:
        with open('input.txt', 'r') as f:
            # set of the column-1 values seen so far
            seen = set()
            # iterate over each row of the input file
            for row in f:
                # split the row on any run of whitespace
                fields = row.split()
                # skip blank rows
                if not fields:
                    continue
                col_1 = fields[0]

                # check if you have met col_1 before
                # if not, write the row in output_1
                if col_1 not in seen:
                    seen.add(col_1)
                    out_1.write(row)
                # otherwise write the row in output_2
                else:
                    out_2.write(row)

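Convert your input file into a dictionary, then for each key check whether it has more than one value; if it does, write the extra lines to the other file.

Thanks, but your output is "123 10 355 07 765 05"; I'm looking for "123 10 765 03 765 05".

There isn't even a 756 in the input file.

Edited the question (it is 765 now).

I misunderstood the question earlier. The answer has been updated, please check it.

@EdMorton I find the negated form easier to understand because I use it more. I'll add the positive one as well, then people can choose whichever they prefer :) At least this time I remembered the parentheses around the file name.

@EdMorton Yes. For selecting specific lines, I use it a lot to find unique lines, so using the negation, and sometimes even converting it to the positive form, comes naturally. Obviously, I know it's not great practice!

This is very clean.

You could also do this directly in assembly code, but why would you? The above will fail cryptically for various input file contents. It is hard to write text-processing loops robustly in shell because that is not what the shell is for. And obviously it will be very inefficient compared with an awk solution.

"One of the things that slows down many shell scripts is heavy use of external commands." Really? A while read loop makes a system call for every byte. It is probably the slowest way to process a file.

@EdMorton At least in assembly it would be fast.

@User112638726 Yes, really. Starting a subprocess is fairly expensive. As for bash's read builtin, you can verify it yourself with strace. Bash reads in buffer-sized blocks; in this particular case, my bash uses a 128-byte buffer for the job.

@EdMorton If this job has to be done in the context of a larger shell script, the slight inefficiency of bash is moot, because it will still cost less than launching an awk subprocess unless the input is far, far larger than this example. And yes, you do have to consider whether the contents of the input are safe for this particular implementation, but that is not inherently a disqualifier. At some level, you always have to consider whether any implementation is correct for the foreseeable inputs.

@JulnBurrnER At the scale where shell outperforms awk, the performance difference is so negligible that it really makes no difference, and even then shell only wins on the most basic tasks. For large files, the advantage goes to awk. I have tested this before, and shell stopped being the faster option after about 40 lines. If you have performance problems processing a 40-line file with awk, you will certainly still have them with a read loop. I can't think of a situation where a read loop is better suited to processing text than awk.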
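The dictionary-based suggestion from this thread can be sketched as follows. The function names `group_by_first_column` and `split_groups` are illustrative assumptions, as is the choice to skip blank lines. Lines are grouped by their first column; the first line of each group goes to one list and any further lines go to the other.

```python
def group_by_first_column(lines):
    """Map each column-1 value to the list of full lines that carry it."""
    groups = {}  # plain dicts keep insertion order in Python 3.7+
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are skipped
        groups.setdefault(fields[0], []).append(line)
    return groups


def split_groups(groups):
    """Return (first line per key, all later lines per key)."""
    firsts = [rows[0] for rows in groups.values()]
    repeats = [row for rows in groups.values() for row in rows[1:]]
    return firsts, repeats


if __name__ == "__main__":
    sample = ["123 09", "123 10", "355 07", "765 01", "765 03", "765 05"]
    firsts, repeats = split_groups(group_by_first_column(sample))
    print(firsts)
    print(repeats)
```

Unlike the streaming versions, this holds the whole input in memory, and the repeated lines come out grouped by key. That matches the input order only when equal keys are adjacent, as they are in the sample data.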
