Python 2.7: Compare consecutive lines in awk (or Python) and randomly select one of the duplicate lines


python-2.7 awk


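I want to use an awk or Python command (I prefer awk because I work with large files) to compare consecutive lines in a large file (~1 GB). Here is an example of the input and the output:

Input file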

#x   y
1    11        # Remarks (not part of the input file)  
10   12        # (Remark *1)
10   17        #
4    14
20   15        # (Remark *2)
20   16        #
20   17        #
20   22        #
5    19
10   20

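(Remark *1): Since the x value of this line is the same as that of the following line, either this line or the next one (chosen at random) should be printed to the output file.

(Remark *2): Since the x value of this line is the same as that of the next 3 lines, either this line or one of the next 3 lines (chosen at random) should be printed to the output file.

The output file I want looks like this:

or (because of the random selection, whenever the same x value appears on consecutive lines)

Basically, I want to check whether the x value of the current line is the same as that of the following consecutive line(s). If not, the current line should be printed. If so, only one randomly chosen line from the run of consecutive lines with the same x value should be printed (the y value does not matter for the comparison).

I hope someone can help me.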

$ cat tst.awk
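# Print one randomly chosen line from the buffered run, then reset the buffer.
function prtBuf(        idx) {          # idx is a local variable
    if (cnt > 0) {
        idx = int((rand() * cnt) + 1)   # random index in 1..cnt
        print buf[idx]
    }
    cnt = 0
}

BEGIN { srand() }                       # seed rand() from the time of day
$1 != prev { prtBuf() }                 # x changed: emit one line of the previous run
{ buf[++cnt]=$0; prev=$1 }              # buffer the current line
END { prtBuf() }                        # flush the last run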

$ awk -f tst.awk file
1    11        # Remarks (not part of the input file)
10   17        #
4    14
20   17        #
5    19
10   20

$ awk -f tst.awk file
1    11        # Remarks (not part of the input file)
10   12        # (Remark *1)
4    14
20   22        #
5    19
10   20

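I assumed that the x and y column headers in the example are not actually part of the input file, so I removed them. If they are present and you want them in the output, just add an NR==1{print;next} line at the start.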
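
Since the question also allows Python, here is a minimal sketch of the same idea using itertools.groupby and random.choice. The function name pick_one_per_run and the sample data are only illustrative, and it assumes the header line is not part of the input and that x is the first whitespace-separated field. Like the awk script, it holds only the current run of lines in memory, not the whole file.

```python
import random
from itertools import groupby


def pick_one_per_run(lines, rng=random):
    """Yield one randomly chosen line from each run of consecutive
    lines that share the same first whitespace-separated field."""
    # Skip blank lines so that split()[0] is always defined.
    non_blank = (line for line in lines if line.strip())
    for _, run in groupby(non_blank, key=lambda line: line.split()[0]):
        # list(run) holds only the current run, never the whole input.
        yield rng.choice(list(run))


# Illustrative data taken from the question (header and remarks left out).
sample = ["1 11", "10 12", "10 17", "4 14",
          "20 15", "20 16", "20 17", "20 22", "5 19", "10 20"]
result = list(pick_one_per_run(sample))
print("\n".join(result))
```

For a real file, passing an open file object (for example sys.stdin) instead of the list should work the same way, because the function only iterates over its argument line by line.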

Looks like an application of reservoir sampling.

Yes, thanks for the keyword! Since I am an awk newbie, I would appreciate some more help.
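
To make the reservoir sampling remark concrete, here is a rough Python sketch with a reservoir of size 1 per run. It keeps a single candidate line instead of buffering the whole run, which may matter if one x value repeats over a very large number of lines. The name reservoir_pick_per_run is hypothetical, and the sketch again assumes that x is the first whitespace-separated field and that there is no header line.

```python
import random


def reservoir_pick_per_run(lines, rng=random):
    """Yield one uniformly random line per run of equal first fields,
    keeping only one candidate line in memory at a time."""
    prev_key = None
    chosen = None   # current candidate for the run in progress
    seen = 0        # number of lines seen so far in this run
    for line in lines:
        if not line.strip():
            continue            # ignore blank lines
        key = line.split()[0]
        if seen and key != prev_key:
            yield chosen        # the run ended: emit its candidate
            seen = 0
        seen += 1
        # Replace the candidate with probability 1/seen, which leaves
        # every line of the run equally likely to be the final pick.
        if rng.randrange(seen) == 0:
            chosen = line
        prev_key = key
    if seen:
        yield chosen            # flush the last run


# Illustrative data taken from the question (header and remarks left out).
rows = ["1 11", "10 12", "10 17", "4 14",
        "20 15", "20 16", "20 17", "20 22", "5 19", "10 20"]
picked = list(reservoir_pick_per_run(rows))
print("\n".join(picked))
```

The first line of each run is always taken as the initial candidate (randrange(1) can only return 0), and each later line replaces it with a shrinking probability, so no list of the run is ever built.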
