Bash: fastest way to find the index of the first character that is not equal across all strings

bash shell optimization

Suppose I have several input lines like this:

blablabla this is always the same 123
blablabla this is always the same 321
blablabla this is always the same 4242
blablabla this is al 242
blablabla this is always 2432
...

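There is a prefix at the beginning that may or may not be common to all of the strings. In my case it depends on some code. What I want to do is strip all leading characters that are identical across all strings. In this case, I expect: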

ways the same 123
ways the same 321
ways the same 4242
 242
ways 2432
...

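I have a solution that outputs the correct result, but it is very slow. I only need a solution in bash. Any help would be much appreciated.

[Update] I edited my initial script to demonstrate the solutions currently in this thread.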

#!/bin/bash

# setup test data 
tempf=$( mktemp )
echo "blablabla this is always the same 123
blablabla this is always the same 321
blablabla this is always the same 4242
blablabla this is al 242
blablabla this is always 2432" > $tempf 

# BASELINE by myself 
find_index_baseline () {

    longest_line=$( cat $tempf | wc -L )  # determine end of iteration sequence 
    for i in $( seq 1 $longest_line ) # iterate over char at position i 
    do
        # find number of different chars by 
        #  - printing all data using echo 
        #  - cutting out the i'th character 
        #  - unique sort resulting character set 
        #  - count resulting characters 
        diffchars=$( cat $tempf | cut -c${i} | sort -u | wc -l )
        [ $diffchars -ge 2 ] && break # if more than 1 character, then break 
    done
    idx=$(( $i - 1 )) # save index 
    cat $tempf | while read line; do echo "${line:$idx}"; done 
}

# OPTIMIZED by anishsane 
find_index_anishsane () {

   awk 'NR==1{a=$0; next} #Record first line
     NR==FNR{ #For entire first pass,
         while(match($0, a)!=1) #Find the common part in string
             a=substr(a,1,length(a)-1); 
         next;
     }
     # In second pass
     FNR==1{a=length(a)} # This is just an optimization. You could also use sub/gensub based logic

     {print substr($0,a+1)} # Print the substring 
     ' $tempf $tempf
}

# OPTIMIZED by 123 
find_index_123 () {
    awk 'NR==1{
           pos=split($0,a,"")
     }
     NR==FNR{
          split($0,b,"")
          for(i=1;i<=pos;i++)
             if(b[i]!=a[i]){
                pos=i
                break
           }
           next
        }
    NR!=FNR{
       print substr($0,pos)
    }' $tempf $tempf
}

echo "--- BASELINE (run once)"
time find_index_baseline > /dev/null # even slow when running once :) 
echo "---- ANISHSANE x100"
time for i in {1..100}; do find_index_anishsane > /dev/null; done
echo "---- 123 x100"
time for i in {1..100}; do find_index_123 > /dev/null; done

rm -f $tempf
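
The baseline spawns cut, sort and wc once per character position, so a version that stays entirely inside the shell may be worth timing as well. The following is only a sketch and not part of the original thread: strip_common_prefix is a hypothetical name, it assumes bash 4 or later (for mapfile), and it has not been benchmarked here.

```shell
#!/bin/bash

# Hypothetical helper: reads lines on stdin and prints them without
# their longest common prefix, using bash builtins only (no forks).
strip_common_prefix () {
    local -a lines
    local line prefix

    mapfile -t lines                    # bash >= 4: slurp stdin into an array
    (( ${#lines[@]} )) || return 0      # nothing to do for empty input

    prefix=${lines[0]}                  # start with the first line as candidate
    for line in "${lines[@]}"; do
        # Shrink the candidate until the current line starts with it.
        # The right-hand side is quoted, so it is compared literally.
        while [[ "${line:0:${#prefix}}" != "$prefix" ]]; do
            prefix=${prefix%?}          # drop the last character
        done
    done

    for line in "${lines[@]}"; do
        printf '%s\n' "${line:${#prefix}}"
    done
}

# Usage (with the $tempf file created by the script above):
#   strip_common_prefix < "$tempf"
```

Because the candidate prefix only ever shrinks, the loop terminates at the empty string in the worst case, which is a prefix of every line.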

Using awk:

awk 'NR==1{a=$0; next} #Record first line
     NR==FNR{ #For entire first pass,
         while(match($0, a)!=1) #Find the common part in string
             a=substr(a,1,length(a)-1); 
         next;
     }
     # In second pass
     FNR==1{a=length(a)} # This is just an optimization. You could also use sub/gensub based logic

     {print substr($0,a+1)} # Print the substring 
     ' test-input.log test-input.log # Pass the file twice


Output:

ways the same 123
ways the same 321
ways the same 4242
 242
ways 2432

Timings:

Bash based code:
real    0m0.055s
user    0m0.008s
sys     0m0.000s

awk based code:
real    0m0.005s
user    0m0.000s
sys     0m0.004s
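
One caveat with the match()-based loop: awk interprets the second argument of match() as a regular expression, so input lines that contain characters such as ., *, [ or ( may be compared incorrectly or may raise a regex error. A possible variant, sketched here under the assumption that a literal comparison is what is wanted, compares substrings instead. The function name strip_common_prefix_awk and the variable names are illustrative and not from the original answer.

```shell
#!/bin/bash

# Hypothetical helper: strip_common_prefix_awk FILE
# Same two-pass idea as above, but with a literal substr() comparison
# instead of match(), so regex metacharacters in the data are harmless.
strip_common_prefix_awk () {
    awk '
        NR == 1 { prefix = $0; next }        # first line seeds the candidate prefix
        NR == FNR {                          # first pass over the file
            # shrink the candidate until this line starts with it
            while (substr($0, 1, length(prefix)) != prefix)
                prefix = substr(prefix, 1, length(prefix) - 1)
            next
        }
        FNR == 1 { skip = length(prefix) }   # second pass: compute the cut once
        { print substr($0, skip + 1) }       # print everything after the prefix
    ' "$1" "$1"                              # pass the file twice
}

# Usage:
#   strip_common_prefix_awk test-input.log
```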

This uses two passes and, in the first pass, records the furthest position up to which all lines still agree.

awk 'NR==1{
           pos=split($0,a,"")
     }
     NR==FNR{
          split($0,b,"")
          for(i=1;i<=pos;i++)
             if(b[i]!=a[i]){
                pos=i
                break
           }
           next
        }
    NR!=FNR{
       print substr($0,pos)
    }' file{,}
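
Two details of the script above are worth keeping in mind. Splitting on an empty separator to get one array element per character is an extension that not every awk implements, and when the first line is itself a prefix of every line (for example, when all lines are identical) pos is never lowered, so the last character of the first line is still printed. The sketch below is one possible way around both, assuming the same two-pass structure. It tracks the length of the common prefix instead of the index of the first mismatch; the names strip_common_prefix_scan, first and keep are illustrative only.

```shell
#!/bin/bash

# Hypothetical helper: strip_common_prefix_scan FILE
# Character-by-character scan using substr() rather than split($0, a, "").
strip_common_prefix_scan () {
    awk '
        NR == 1 { first = $0; keep = length(first) }   # keep = common prefix length so far
        NR == FNR {                                    # first pass
            for (i = 1; i <= keep; i++)
                if (substr($0, i, 1) != substr(first, i, 1)) {
                    keep = i - 1                       # mismatch at i: prefix ends at i-1
                    break
                }
            next
        }
        { print substr($0, keep + 1) }                 # second pass
    ' "$1" "$1"
}

# Usage:
#   strip_common_prefix_scan file
```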

Here is a Python solution that does the job:

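# Python 3 (the original answer targeted Python 2: itertools.izip and the print statement)
from itertools import takewhile
import sys

def allEqual(x):
    # True if the sequence is empty or every element equals the first one
    return not x or len(x) == x.count(x[0])

lines = sys.stdin.read().splitlines()
# zip(*...) yields one tuple per character column; count the leading columns
# whose characters all agree. set() drops duplicate lines first.
prefixLen = sum(1 for _ in takewhile(allEqual, zip(*set(lines))))
for l in lines:
    print(l[prefixLen:])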

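The allEqual function reports whether all elements of a given sequence (for example a tuple or a list) are equal, or whether the sequence is empty.

The prefixLen expression transposes the distinct input lines into character columns and counts the leading columns in which all characters agree, which gives the length of the longest common prefix. Finally, the main program prints every input line with that prefix removed.

So far this seems to be faster than the awk-based solutions, for example: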

$ for i in {1..10000};do echo -e "blablabla this is always the same 123\nblablabla this is always the same 321\nblablabla this is always the same 4242\nblablabla this is al 242\nblablabla this is always 2432" >> testdata.txt;done
$ time awk -f 123.awk testdata.txt{,} > /dev/null

real    0m3.858s
user    0m3.826s
sys 0m0.030s
$ time awk -f anishane.awk testdata.txt testdata.txt > /dev/null

real    0m0.517s
user    0m0.511s
sys 0m0.005s
$ time python frerich.py < testdata.txt > /dev/null

real    0m0.099s
user    0m0.082s
sys 0m0.014s

They also produce the same output:

$ awk -f anishane.awk testdata.txt testdata.txt | md5
8a3880cb99a388092dd549c8dc4a9cc3
$ awk -f 123.awk testdata.txt{,} | md5
8a3880cb99a388092dd549c8dc4a9cc3
$ python frerich.py < testdata.txt | md5
8a3880cb99a388092dd549c8dc4a9cc3

I don't really understand how you get that output, or what the logic is. Could you explain in a bit more detail?

Do you mean how I arrived at the solution in my implementation?

Never mind, I think I got it after rereading it a few times. How big are the files?

I updated the code with more documentation. Well, not that big. Usually about 10-15 lines.

I suspect any solution will end up not using the shell language itself but some equally readily available language. Probably sed, awk, Perl or Python.

You must have posted this while I was starting to write mine :( Do you think the answers differ between the two approaches?

@anishsane Wow, that is fast! Thanks. I will upvote this and update my initial code with a comparison of the two approaches. On my machine the performance gain is even more than a factor of 10.

I guess for a realistic benchmark you would need a much larger input data set.

@FrerichRaa Note that the benchmark should match the use case. In my case I don't need large data sets for this, but many repetitions.

I agree with @FrerichRaabe. I did not check with large data sets.

Thanks, 123! This is even a...