Faster text data processing in Bash
I have a speed problem. I have a bash script that parses information from TheTvDb.com. It downloads nearly 40,000 lines of data, reduces that to about 5,000 lines, and writes the result to the hard drive. It then reads the file and parses it into several files that are later used as lookup tables. Basically, it collects all the information it sees before each "</Episode>" tag, writes it to the corresponding files, and then resets for the next record. It has to sync on the "</Episode>" tag because there is a "FirstAired" tag outside the Episode tags. This ensures the data is pulled in order rather than depending on each individual tag associated with an episode. Here is the code in question:
if [ -f "$mythicalLibrarian/$NewShowName/$NewShowName.xml" ]; then
Ename=""
actualEname=""
FAired=""
SeasonNr=""
EpisodeNr=""
recordNumber=0
echo "Parsing Downloaded information: $NewShowName.xml "
while read line
do
if [[ $line == \<\/Episode\> ]]; then
(( ++recordNumber ))
echo -ne "Building Record:$recordNumber ${actualEname:0:20} \r" 1>&2
echo "$actualEname" >> "$mythicalLibrarian/$NewShowName/$NewShowName.actualEname.txt"&
Ename=`echo "$actualEname" |sed 's/;.*//'`
echo "$Ename" >> "$mythicalLibrarian/$NewShowName/$NewShowName.Ename.txt"&
echo "$FAired" >> "$mythicalLibrarian/$NewShowName/$NewShowName.FAired.txt"&
echo "$SeasonNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.S.txt"&
echo "$EpisodeNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.E.txt"&
Ename=""
actualEname=""
FAired=""
SeasonNr=""
EpisodeNr=""
#Get actual show name
elif [[ $line == \<EpisodeName\>* ]]; then
actualEname=`echo "$line" | sed -e s/'<\/EpisodeName>'// -e s/'<EpisodeName>'// -e s/'\&amp\;'/'\&'/ -e s/'\&quot\;'/'\"'/ -e s/'\&amp\;'/'\&'/ -e s/'\&ndash\;'/'-'/ -e s/'\&lt\;'/'\<'/ -e 's/'\&gt\;'/'\>'/' |tr -d '|\?\*\<\"\:\>\+\\\[\]\/'`
#Get OriginalAirDate
elif [[ $line == \<FirstAired\>* ]]; then
FAired=`echo "$line" | sed -e s/'<FirstAired>'//g -e s/'<\/FirstAired>'//g`
#Get Season number
elif [[ $line == \<SeasonNumber\>* ]]; then
SeasonNr=`echo "$line" |sed -e s/'<SeasonNumber>'// -e s/'<\/SeasonNumber>'//`
#Get Episode number
elif [[ $line == \<EpisodeNumber\>* ]]; then
EpisodeNr=`echo "$line" |sed -e 's/<EpisodeNumber>//' -e 's/<\/EpisodeNumber>//'`
fi
done < "$mythicalLibrarian/$NewShowName/$NewShowName.xml"
chmod 777 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".actualEname.txt
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".Ename.txt
chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".FAired.txt
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".S.txt
chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".E.txt
GotNewInformation=1
elif [ ! -f "$mythicalLibrarian/$NewShowName/$NewShowName.xml" ]; then
echo "COULD NOT DOWNLOAD:www.thetvdb.com/api/$APIkey/series/$SeriesID/all/$Language.xml">>"$mythicalLibrarian"/output.log
fi
The most likely culprit is the large amount of process forking (sed, tr) that happens in this script.
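You would likely get faster results by calling a program with an XML parser to read the input and write the various files. If it needs to stay in bash, you can find some tools that can perform XSLT transformations to convert the XML into the format used in your files and split it up.
Personally, I would do this sort of thing in Perl.
BASH is designed around data manipulation and file operations.
Bash is designed for interactive command processing and for connecting programs together through pipes. As far as I know, heavy data processing is not in the design space of any *sh.
Python or Perl would be a better choice for this problem.
I just tried the following: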
echo "Parsing Downloaded information: $NewShowName.xml "
while read line
do
if [[ $line == \<\/Episode\> ]]; then
(( ++recordNumber ))
echo -ne "Building Record:$recordNumber ${actualEname:0:20} \r" 1>&2
echo "$EpisodeName" >> "$mythicalLibrarian/$NewShowName/$NewShowName.actualEname.txt"&
Ename=`echo "$actualEname" |sed 's/;.*//'`
echo "$EpisodeName" >> "$mythicalLibrarian/$NewShowName/$NewShowName.Ename.txt"&
echo "$FirstAired" >> "$mythicalLibrarian/$NewShowName/$NewShowName.FAired.txt"&
echo "$SeasonNumber" >> "$mythicalLibrarian/$NewShowName/$NewShowName.S.txt"&
echo "$EpisodeNumber" >> "$mythicalLibrarian/$NewShowName/$NewShowName.E.txt"&
EpisodeName=""
actualEname=""
FirstAired=""
SeasonNumber=""
EpisodeNumber=""
else
var=`echo $line |tr '<>' ' '|awk '{print $1}'`
value=`echo "$line"|sed -e s/'<'"$var"'>'// -e s/'<\/'"$var"'>'// -e s/'\&amp\;'/'\&'/ -e s/'\&quot\;'/'\"'/ -e s/'\&amp\;'/'\&'/ -e s/'\&ndash\;'/'-'/ -e s/'\&lt\;'/'\<'/ -e 's/'\&gt\;'/'\>'/' |tr -d '|\?\*\<\"\:\>\+\\\[\]\/'`
eval $var="'$value'"
fi
done < "$mythicalLibrarian/$NewShowName/$NewShowName.xml"
Use something that is designed for processing XML. Also, remove the & from the end of all the echo statements and you will get a considerable speedup.
Test 1:
$ time { for i in {1..1000}; do echo "hello"& done >/dev/null; } | cat
real 0m10.357s
user 0m2.764s
sys 0m15.441s
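When this is run at the command line, cat eats the "Done" messages. A colon could be used instead of cat to suppress the "Done" messages from this first timing test. It is not the program itself that does this, but the fact that the background processes are part of a pipeline.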
Test 2:
$ time { for i in {1..1000}; do echo "hello"; done >/dev/null; }
real 0m0.152s
user 0m0.132s
sys 0m0.020s
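Note that this was done on a very slow old machine.
You can also gain speed by using Bash's regex and string manipulation features instead of repeatedly spawning several external utilities inside a loop.
For example: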
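elif [[ $line == \<EpisodeName\>* ]]; then
actualEname=${line//<\/EpisodeName>/}
actualEname=${actualEname//<EpisodeName>/}
# \& keeps the ampersand literal (bash 5.2+ otherwise treats a bare & as "the matched text")
actualEname=${actualEname//&amp;/\&}
actualEname=${actualEname//&ndash;/-}
for string in '|' '&lt;' '&gt;' '&quot;' '?' '*' '<' '>' ':' '"' '+' '\' '[' ']' '/'
do
# quoting "$string" makes ?, *, [ and \ match literally instead of acting as glob patterns
actualEname=${actualEname//"$string"}
done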
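The tag stripping above can also be written with prefix and suffix removal, which works for any single-line element. The following is only a sketch: the helper name `element_text` and its two-argument calling convention are made up for illustration, and it assumes the opening tag, the text, and the closing tag all sit on one line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): store the text between
# <Tag> and </Tag> in the variable named by $2, using only builtins.
element_text() {
    local text=$1
    text=${text#*>}     # remove the shortest prefix ending in '>' (the opening tag)
    text=${text%<*}     # remove the shortest suffix starting with '<' (the closing tag)
    # printf -v assigns to the named variable without forking a subshell
    printf -v "$2" '%s' "$text"
}

element_text '<SeasonNumber>3</SeasonNumber>' SeasonNr
element_text '<FirstAired>2009-09-21</FirstAired>' FAired
echo "$SeasonNr $FAired"    # prints "3 2009-09-21"
```

Returning the result through a variable name rather than through `$(...)` is deliberate: command substitution forks a subshell on every call, which is the cost this whole exercise is trying to avoid. The caller should not pass `text` as the target name, since that collides with the local variable.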
Test 4:
$ time { for i in {1..100}; do
line='<EpisodeName>&lt;foo&amp;bar&ndash;baz&gt;Season&ndash;3&ndash;&quot;quux&quot;?*<>:"+\[]/</EpisodeName>'
actualEname=${line//<\/EpisodeName>/}
actualEname=${actualEname//<EpisodeName>/}
actualEname=${actualEname//&amp;/\&}
actualEname=${actualEname//&ndash;/-}
for string in '|' '&lt;' '&gt;' '&quot;' '\?' '\*' '<' '>' ':' '"' '+' '\\' '[' ']' '\/'
do
actualEname=${actualEname//$string}
done
done; }
real 0m5.403s
user 0m2.492s
sys 0m2.960s
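Wow, Dennis Williamson, it now parses in under half a second. It just flashes on the screen. It used to take 15 seconds, but now it is so fast I can't even tell what happened.
These are the changes Dennis Williamson suggested. I am just posting them here.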
echo "Parsing Downloaded information: $NewShowName.xml "
while read line
do
if [[ $line == \<\/Episode\> ]]; then
(( ++recordNumber ))
echo -ne "Building Record:$recordNumber ${actualEname:0:20} \r" 1>&2
echo "$actualEname" >> "$mythicalLibrarian/$NewShowName/$NewShowName.actualEname.txt"
echo "$Ename" >> "$mythicalLibrarian/$NewShowName/$NewShowName.Ename.txt"
echo "$FAired" >> "$mythicalLibrarian/$NewShowName/$NewShowName.FAired.txt"
echo "$SeasonNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.S.txt"
echo "$EpisodeNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.E.txt"
Ename=""
actualEname=""
FAired=""
SeasonNr=""
EpisodeNr=""
#Get actual show name
elif [[ $line == \<EpisodeName\>* ]]; then
# Note: ${var/pattern/} removes only the first match; ${var//pattern/} removes every match
line=${line/<\/EpisodeName>/}
line=${line/<EpisodeName>/}
line=${line/&lt;/}
line=${line/&gt;/}
line=${line/&quot;/}
line=${line/&amp;/\&}
line=${line/\|/}
line=${line/\?/}
line=${line/\*/}
line=${line/\:/}
line=${line/\+/}
line=${line/\\/}
line=${line/\//}
line=${line/\[/}
line=${line/\]/}
line=${line/\'/}
line=${line/\"/}
actualEname=${line/&ndash;/-}
Ename=${actualEname/;*/}
#Get OriginalAirDate
elif [[ $line == \<FirstAired\>* ]]; then
line=${line/<\/FirstAired>/}
line=${line/<FirstAired>/}
FAired=$line
#Get Season number
elif [[ $line == \<SeasonNumber\>* ]]; then
line=${line/<\/SeasonNumber>/}
line=${line/<SeasonNumber>/}
SeasonNr=$line
#Get Episode number
elif [[ $line == \<EpisodeNumber\>* ]]; then
line=${line/<\/EpisodeNumber>/}
line=${line/<EpisodeNumber>/}
EpisodeNr=$line
fi
done < "$mythicalLibrarian/$NewShowName/$NewShowName.xml"
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".actualEname.txt
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".Ename.txt
chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".FAired.txt
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".S.txt
chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".E.txt
GotNewInformation=1
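For anyone who wants to try the approach without the real download, here is a cut-down, self-contained sketch of the same loop. The sample XML, the temporary directory and the short output file names are all made up for illustration. It also opens each output file once on its own file descriptor instead of appending with >> on every record, which is my own variation and not part of the changes above.

```shell
#!/usr/bin/env bash
# Sketch only: the sample data and the output file names are invented.
out=$(mktemp -d)

cat > "$out/sample.xml" <<'EOF'
<Episode>
  <EpisodeName>Pilot; Part 1</EpisodeName>
  <EpisodeNumber>1</EpisodeNumber>
  <FirstAired>2009-09-21</FirstAired>
  <SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
  <EpisodeName>Tom &amp; Jerry: Again?</EpisodeName>
  <EpisodeNumber>2</EpisodeNumber>
  <FirstAired>2009-09-28</FirstAired>
  <SeasonNumber>1</SeasonNumber>
</Episode>
EOF

recordNumber=0
actualEname='' Ename='' FAired='' SeasonNr='' EpisodeNr=''

# The default IFS trims leading/trailing whitespace, so indented tags still
# match the patterns below; -r keeps backslashes in the data literal.
while read -r line; do
    case $line in
        '</Episode>')
            # End of a record: write one line to each lookup table, then reset.
            (( ++recordNumber ))
            printf '%s\n' "$actualEname" >&3
            printf '%s\n' "$Ename"       >&4
            printf '%s\n' "$FAired"      >&5
            printf '%s\n' "$SeasonNr"    >&6
            printf '%s\n' "$EpisodeNr"   >&7
            actualEname='' Ename='' FAired='' SeasonNr='' EpisodeNr=''
            ;;
        '<EpisodeName>'*)
            line=${line#*>}; line=${line%<*}   # strip opening and closing tag
            line=${line//&amp;/\&}             # decode one entity
            line=${line//[?:]/}                # drop two filename-unsafe characters
            actualEname=$line
            Ename=${actualEname%%;*}           # text before the first ';'
            ;;
        '<FirstAired>'*)    line=${line#*>}; FAired=${line%<*} ;;
        '<SeasonNumber>'*)  line=${line#*>}; SeasonNr=${line%<*} ;;
        '<EpisodeNumber>'*) line=${line#*>}; EpisodeNr=${line%<*} ;;
    esac
done < "$out/sample.xml" \
     3> "$out/actualEname.txt" 4> "$out/Ename.txt" 5> "$out/FAired.txt" \
     6> "$out/S.txt" 7> "$out/E.txt"

echo "records: $recordNumber"    # prints "records: 2"
```

Only two of the special characters and one entity are handled here to keep the sketch short; the full list from the code above would slot into the EpisodeName branch in the same way.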