Performance: selecting random lines from a file in Bash takes too long

performance algorithm bash random

I have a script with the following syntax:

./script number file

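where number is the number of lines I want to take from file. The lines are chosen at random and then printed twice. For a very large file (about 1,000,000 lines) this algorithm runs far too slowly. I don't understand why, since the printing only involves array accesses.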

#!/bin/bash

max=`wc -l $2 | cut -d " " -f1`

users=(`shuf -i 0-$max -n $1`)
pages=(`shuf -i 0-$max -n $1`)

readarray lines < $2

for (( i = 0; i < $1; i++ )); do
    echo L ${lines[${users[i]}]} ${lines[${pages[i]}]} 
done

for (( i = 0; i < $1; i++ )); do
    echo U ${lines[${users[i]}]} ${lines[${pages[i]}]} 
done

Just use shuf to select the lines; that is what it was designed for. For example (see notes):
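One way this could look, as a minimal sketch rather than the answer's exact code. It assumes GNU shuf and Bash 4 or later (for readarray), and pick_pairs is a hypothetical function name introduced here. shuf -n draws the lines straight from the file, so the whole file never has to be loaded into a Bash array.

```bash
#!/bin/bash
# pick_pairs COUNT FILE
# Hypothetical helper (the name is an assumption, not from the original answer).
pick_pairs() {
    local count=$1 file=$2 i
    local -a users pages

    # shuf -n picks COUNT distinct random lines from FILE.
    # readarray -t stores them without the trailing newlines.
    readarray -t users < <(shuf -n "$count" -- "$file")
    readarray -t pages < <(shuf -n "$count" -- "$file")

    # Print every pair twice, first tagged L and then tagged U,
    # in the same order as the script in the question.
    for (( i = 0; i < ${#users[@]}; i++ )); do
        printf 'L %s %s\n' "${users[i]}" "${pages[i]}"
    done
    for (( i = 0; i < ${#users[@]}; i++ )); do
        printf 'U %s %s\n' "${users[i]}" "${pages[i]}"
    done
}
```

Inside the script from the question it would be called as pick_pairs "$1" "$2". The arrays now hold only the selected lines, not the full file.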

Note 2:

If $1 is large, it is better not to use arrays at all. Here is one possible solution:

lines="$(paste -d' ' <(shuf -n "$1" "$2") <(shuf -n "$1" "$2"))"
sed 's/^/L /' <<<"$lines"
sed 's/^/U /' <<<"$lines"

Maybe you can do without arrays altogether, using only file utilities and temporary files:
# Put the shuf outputs in two separate files:

shuf -n "$1" "$2" > shuf_users
shuf -n "$1" "$2" > shuf_pages

# paste the two:
paste -d ' ' shuf_users shuf_pages | sed 's/^/L /'
paste -d ' ' shuf_pages shuf_users | sed 's/^/U /'

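In @rici's solution, the culprit is probably also the two loops that output the lines (for loops, for example, are very slow in Bash).

You should use mktemp to create the temporary files shuf_users and shuf_pages. This exercise is left to the reader.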
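For readers who want a starting point for that exercise, here is a minimal sketch. It assumes GNU shuf and a mktemp that works without arguments (as in GNU coreutils); paired_lines and the local variable names are hypothetical and not part of the original answer.

```bash
#!/bin/bash
# paired_lines COUNT FILE
# Hypothetical wrapper name, used only for this sketch.
paired_lines() {
    local count=$1 file=$2 users pages

    # mktemp creates uniquely named files, so concurrent runs do not collide.
    users=$(mktemp) || return 1
    pages=$(mktemp) || { rm -f -- "$users"; return 1; }

    # Put the shuf outputs in two separate files.
    shuf -n "$count" -- "$file" > "$users"
    shuf -n "$count" -- "$file" > "$pages"

    # Paste the two; the second pass swaps the columns, as in the answer.
    paste -d ' ' "$users" "$pages" | sed 's/^/L /'
    paste -d ' ' "$pages" "$users" | sed 's/^/U /'

    # Remove the temporary files once both passes are printed.
    rm -f -- "$users" "$pages"
}
```

In a standalone script, a trap on EXIT would be a reasonable way to make sure the files are also removed when the script is interrupted.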
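The following should do what you want fairly quickly. Bash arrays are slow, and as far as I can tell the Bash maintainers have not implemented them efficiently, so this version avoids them and works with temporary files and sed instead.

File (the script calls itself recursively, so keep the name): ranlines.bsh
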
#!/bin/bash
# Usage: ranlines.bsh number file
# Line count plus one, so the modulus below can reach the last line.
declare -i max=$(wc -l < "$2")+1
declare STR=""
declare -i random_line=0
declare tmp_file
tmp_file=$(mktemp) || exit 1
declare -r usr_file="/tmp/_user_3434"
declare -r pgs_file="/tmp/_pgs_4343"

## seed the tmp_file with 0 so that line number 0 is never picked
echo "0" >> "$tmp_file"

for (( i = 0; i < $1; i++ )); do
 while :; do
   ## combine two $RANDOM values (15 bits each) so that files with
   ## more than 32767 lines are covered
   random_line=$(( (RANDOM << 15 | RANDOM) % max ))
   ## if the number is already in the tmp_file (whole-line match),
   ## draw a new one
   grep -qxF -- "$random_line" "$tmp_file" && continue
   echo "$random_line" >> "$tmp_file"
   break
 done
 ## build the sed print string
 STR="$STR${random_line}p;"
done
rm "$tmp_file"

if [[ $# -eq 2 ]]; then
 ## first pass: write the "user" lines
 sed -n "$STR" "$2" > "$usr_file"
 ## call ourselves again, this time for the pages
 bash "$0" "$1" "$2" "U"
else
 ## a third argument means this is the second pass
 declare -i random_slct=$1+1
 sed -n "$STR" "$2" > "$pgs_file"
 ## the inner sed (GNU syntax) emits $1 copies of the L or U prefix
 paste <(sed -n "${random_slct}q; a L" "$2") "$usr_file" "$pgs_file"
 paste <(sed -n "${random_slct}q; a U" "$2") "$pgs_file" "$usr_file"
 rm "$pgs_file" "$usr_file"
fi
exit 0

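Arrays are notoriously inefficient in Bash. You should be able to do this with a for loop up to number, using the $RANDOM Bash variable with a modulus to get line numbers within bounds; you can then build a string and print the lines with sed -n '4p;500p;245p;6773334p;34322p'.

Is there a speed difference between the readarray users forms?

If $1 is large, the first will be faster, because the second needs the intermediate step of building the string in memory and then creating a pipe to read the string from memory. (At least, I think it would be faster; I haven't actually benchmarked it.)

@rici You may be right, but it's worth checking. Horkyze, could you check with your data and tell us which method is fastest?

@gniourf_gniourf: OK, added that suggestion. It's orthogonal to the question, IMHO.

Tx, so @Horkyze be careful: I swapped shuf_users and shuf_pages in the second paste statement.

@Horkyze: have a look at the solution in @rici's note 2, it's nice! If I didn't misread your question, that's how I would do it. :)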