Faster text data processing in Bash


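I have a speed problem. I have a bash script that parses information from TheTvDb.com. It downloads nearly 40,000 lines of data, reduces that to about 5,000 lines, and writes them to disk. It then reads the file and parses it into several files that are later used as lookup tables. Basically, it collects all the information it sees before each "</Episode>" tag, writes it to the corresponding files, and then resets for the next record.

It has to sync on the "</Episode>" tag because there is a "FirstAired" tag outside the Episode tags. Syncing this way ensures the data is written in order instead of depending on each individual tag that belongs to an episode.

Here is the code in question: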

  if [ -f "$mythicalLibrarian/$NewShowName/$NewShowName.xml" ]; then
   Ename=""
   actualEname=""
   FAired=""
   SeasonNr=""
   EpisodeNr=""
    recordNumber=0

    echo "Parsing Downloaded information: $NewShowName.xml "
    while read line
   do

     if [[ $line == \<\/Episode\> ]]; then
      (( ++recordNumber ))
      echo -ne "Building Record:$recordNumber ${actualEname:0:20}            \r" 1>&2 
     echo "$actualEname" >> "$mythicalLibrarian/$NewShowName/$NewShowName.actualEname.txt"&
      Ename=`echo "$actualEname" |sed 's/;.*//'`
     echo "$Ename" >> "$mythicalLibrarian/$NewShowName/$NewShowName.Ename.txt"&
     echo "$FAired" >> "$mythicalLibrarian/$NewShowName/$NewShowName.FAired.txt"&
     echo "$SeasonNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.S.txt"&
     echo "$EpisodeNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.E.txt"&
     Ename=""
     actualEname=""
     FAired=""
     SeasonNr=""
     EpisodeNr=""

#Get actual show name 
     elif [[ $line == \<EpisodeName\>* ]]; then
      actualEname=`echo "$line" | sed -e s/'<\/EpisodeName>'// -e s/'<EpisodeName>'// -e s/'\&amp\;'/'\&'/ -e s/'\&quot\;'/'\"'/ -e s/'\&amp\;'/'\&'/ -e s/'\&ndash\;'/'-'/ -e s/'\&lt\;'/'\<'/ -e 's/'\&gt\;'/'\>'/' |tr -d '|\?\*\<\"\:\>\+\\\[\]\/'`


#Get OriginalAirDate
    elif [[ $line == \<FirstAired\>* ]]; then
      FAired=`echo "$line" | sed -e s/'<FirstAired>'//g -e s/'<\/FirstAired>'//g`

#Get Season number
     elif [[ $line == \<SeasonNumber\>* ]]; then
      SeasonNr=`echo "$line" |sed -e s/'<SeasonNumber>'// -e s/'<\/SeasonNumber>'//`

#Get Episode number
    elif [[ $line == \<EpisodeNumber\>* ]]; then
      EpisodeNr=`echo "$line" |sed -e 's/<EpisodeNumber>//' -e 's/<\/EpisodeNumber>//'`

    fi
   done < "$mythicalLibrarian/$NewShowName/$NewShowName.xml"


   chmod 777 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".actualEname.txt
   chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".Ename.txt
   chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".FAired.txt
   chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".S.txt
   chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".E.txt
    GotNewInformation=1


  elif [ ! -f "$mythicalLibrarian/$NewShowName/$NewShowName.xml" ]; then
   echo "COULD NOT DOWNLOAD:www.thetvdb.com/api/$APIkey/series/$SeriesID/all/$Language.xml">>"$mythicalLibrarian"/output.log
  fi

The most likely cause is the large amount of process forking that happens in that script (sed, tr).

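You would get faster results by calling a program with an XML parser that reads the file in and writes the various output files. If you need to keep it in bash, you can find something that does XSLT to transform the XML into the format you use in the files and split it up.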

Personally, I would do this sort of thing in Perl.
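
As a rough illustration of the "one program reads the whole file" idea, here is a minimal sketch that uses a single awk process rather than a real XML parser or XSLT. awk does not understand XML, so this assumes one tag per line, as in the question's input. The sample data, the temporary directory and the `body` helper are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: one awk process handles the whole file, instead of one sed/tr per line.
# Assumption: each tag sits on its own line. awk is NOT an XML parser.
workdir=$(mktemp -d)                      # hypothetical scratch directory
cat > "$workdir/show.xml" <<'EOF'
<Episode>
  <EpisodeName>Pilot</EpisodeName>
  <FirstAired>2009-01-01</FirstAired>
  <SeasonNumber>1</SeasonNumber>
  <EpisodeNumber>1</EpisodeNumber>
</Episode>
<Episode>
  <EpisodeName>Second</EpisodeName>
  <FirstAired>2009-01-08</FirstAired>
  <SeasonNumber>1</SeasonNumber>
  <EpisodeNumber>2</EpisodeNumber>
</Episode>
EOF

awk -v out="$workdir/show" '
    # Return the text between <tag> and </tag> on the current line.
    function body(tag,    s) {
        s = $0
        sub("^[ \t]*<" tag ">", "", s)
        sub("</" tag ">.*$", "", s)
        return s
    }
    /^[ \t]*<EpisodeName>/   { name   = body("EpisodeName") }
    /^[ \t]*<FirstAired>/    { aired  = body("FirstAired") }
    /^[ \t]*<SeasonNumber>/  { season = body("SeasonNumber") }
    /^[ \t]*<EpisodeNumber>/ { ep     = body("EpisodeNumber") }
    # Sync on the closing tag: write one line per lookup file, then reset.
    /^[ \t]*<\/Episode>/ {
        print name   > (out ".Ename.txt")
        print aired  > (out ".FAired.txt")
        print season > (out ".S.txt")
        print ep     > (out ".E.txt")
        name = aired = season = ep = ""
    }
' "$workdir/show.xml"

cat "$workdir/show.Ename.txt"
```

The entity decoding and character stripping from the question are left out here to keep the sketch short.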

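BASH is designed around data manipulation and file operations.

Bash was designed for interactive command processing and for connecting programs together through pipes. As far as I know, heavy data processing is not in the design space of any *sh.

Python or Perl would be a better choice for this problem.

I just tried the following: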

        echo "Parsing Downloaded information: $NewShowName.xml "
        while read line
        do




            if [[ $line == \<\/Episode\> ]]; then
                (( ++recordNumber ))
                echo -ne "Building Record:$recordNumber ${actualEname:0:20}            \r" 1>&2 
                echo "$EpisodeName" >> "$mythicalLibrarian/$NewShowName/$NewShowName.actualEname.txt"&
                Ename=`echo "$actualEname" |sed 's/;.*//'`
                echo "$EpisodeName" >> "$mythicalLibrarian/$NewShowName/$NewShowName.Ename.txt"&
                echo "$FirstAired" >> "$mythicalLibrarian/$NewShowName/$NewShowName.FAired.txt"&
                echo "$SeasonNumber" >> "$mythicalLibrarian/$NewShowName/$NewShowName.S.txt"&
                echo "$EpisodeNumber" >> "$mythicalLibrarian/$NewShowName/$NewShowName.E.txt"&
                EpisodeName=""
                actualEname=""
                FirstAired=""
                SeasonNumber=""
                EpisodeNumber=""
            else 
                var=`echo $line |tr '<>' ' '|awk '{print $1}'`

                value=`echo "$line"|sed -e s/'<'"$var"'>'// -e s/'<\/'"$var"'>'// -e s/'\&amp\;'/'\&'/ -e s/'\&quot\;'/'\"'/ -e s/'\&amp\;'/'\&'/ -e s/'\&ndash\;'/'-'/ -e s/'\&lt\;'/'\<'/ -e 's/'\&gt\;'/'\>'/' |tr -d '|\?\*\<\"\:\>\+\\\[\]\/'`
                eval $var="'$value'"
            fi

Use a tool that is designed for processing XML.

Remove the & from the end of all the echo statements and you will get a considerable speedup.

Test 1:

$ time { for i in {1..1000}; do echo "hello"& done >/dev/null; } | cat

real    0m10.357s
user    0m2.764s
sys     0m15.441s
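
When this is run at the command line, cat swallows the "Done" job messages. A colon can be used instead of cat to suppress the "Done" messages from the first timing test. It is not the program that does this; it happens because the background processes are part of a pipeline.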

Test 2:

$ time { for i in {1..1000}; do echo "hello"; done >/dev/null; }

real    0m0.152s
user    0m0.132s
sys     0m0.020s

Note that this was done on a very slow old machine.
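
The gap between the two tests comes from forking: a trailing & makes Bash start a new process even when the command is a builtin such as echo. The following minimal sketch makes that visible by recording $BASHPID (available in Bash 4 and later) from a foreground and a background echo. The temporary file is only scaffolding for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: a trailing & forks a subshell, even for the echo builtin.
tmp=$(mktemp)

echo "$BASHPID" > "$tmp"        # foreground: runs inside the current shell
fg_pid=$(<"$tmp")

echo "$BASHPID" > "$tmp" &      # background: Bash forks first, then runs echo
wait                            # let the background job finish writing
bg_pid=$(<"$tmp")

echo "foreground echo ran in PID $fg_pid, background echo ran in PID $bg_pid"
rm -f "$tmp"
```

The two PIDs differ, so five backgrounded echo statements per record mean five extra forks per record.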

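You can also gain speed by using Bash's regex and string-handling features instead of repeatedly spawning several external utilities inside a loop.

For example: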

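elif [[ $line == \<EpisodeName\>* ]]; then
    actualEname=${line//<\/EpisodeName>/}
    actualEname=${actualEname//<EpisodeName>/}
    # \& keeps the ampersand literal (Bash 5.2+ expands a bare & to the matched text)
    actualEname=${actualEname//&amp;/\&}
    actualEname=${actualEname//&ndash;/-}
    # Quoting "$string" matches each item literally, so ? * [ and \ need no escaping
    for string in '|' '&lt;' '&gt;' '&quot;' '?' '*' '<' '>' ':' '"' '+' '\' '[' ']' '/'
    do
        actualEname=${actualEname//"$string"}
    done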
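
To try the parameter-expansion approach in isolation, here is a minimal self-contained sketch. The function name `clean_episode_name` and the sample title are hypothetical and not part of the original script. The sketch strips the tags, decodes two entities, and deletes the unwanted strings without starting sed or tr.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clean one <EpisodeName> line using only parameter
# expansion, so no external process is started.
clean_episode_name() {
    local name=$1 s
    name=${name#'<EpisodeName>'}       # strip the opening tag from the front
    name=${name%'</EpisodeName>'}      # strip the closing tag from the end
    name=${name//'&amp;'/\&}           # \& yields a literal ampersand
    name=${name//'&ndash;'/-}
    # Delete every occurrence of each string; quoting "$s" makes the match
    # literal, so glob characters such as ? and * are not treated as patterns.
    for s in '&lt;' '&gt;' '&quot;' '|' '?' '*' '<' '>' ':' '"' '+' '\' '[' ']' '/'; do
        name=${name//"$s"}
    done
    printf '%s\n' "$name"
}

clean_episode_name '<EpisodeName>Q&amp;A: What&ndash;If? [Part 1/2]</EpisodeName>'
# prints: Q&A What-If Part 12
```

Calling the function through `$( ... )` would fork again, so inside a hot loop the expansions are better written inline, as in the fragment above.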

Test 4:

$ time { for i in {1..100}; do
    line='<EpisodeName>&lt;foo&amp;bar&ndash;baz&gt;Season&ndash;3&ndash;&quot;quux&quot;?*<>:"+\[]/</EpisodeName>'
    actualEname=${line//<\/EpisodeName>/}
    actualEname=${actualEname//<EpisodeName>/}
    actualEname=${actualEname//&amp;/\&}
    actualEname=${actualEname//&ndash;/-}
    for string in '|' '&lt;' '&gt;' '&quot;' '\?' '\*' '<' '>' ':' '"' '+' '\\' '[' ']' '\/'
    do
        actualEname=${actualEname//$string}
    done
done; }

real    0m5.403s
user    0m2.492s
sys     0m2.960s

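Wow, Dennis Williamson, it now parses in less than half a second. It just flashes on the screen. It used to take 15 seconds, and now it is so fast I can't even tell what happened.

These are the changes Dennis Williamson suggested. I am just posting them here.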

            echo "Parsing Downloaded information: $NewShowName.xml "
            while read line
            do

                if [[ $line == \<\/Episode\> ]]; then
                    (( ++recordNumber ))
                    echo -ne "Building Record:$recordNumber ${actualEname:0:20}           \r" 1>&2 
                    echo "$actualEname" >> "$mythicalLibrarian/$NewShowName/$NewShowName.actualEname.txt"

                    echo "$Ename" >> "$mythicalLibrarian/$NewShowName/$NewShowName.Ename.txt"
                    echo "$FAired" >> "$mythicalLibrarian/$NewShowName/$NewShowName.FAired.txt"
                    echo "$SeasonNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.S.txt"
                    echo "$EpisodeNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.E.txt"
                    Ename=""
                    actualEname=""
                    FAired=""
                    SeasonNr=""
                    EpisodeNr=""

#Get actual show name   
                elif [[ $line == \<EpisodeName\>* ]]; then
                    line=${line/<\/EpisodeName>/}
                    line=${line/<EpisodeName>/}
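                    # // removes every occurrence, as tr -d did; a single / stops after the first match
                    line=${line//&lt;/}
                    line=${line//&gt;/}
                    line=${line//&quot;/}
                    # \& keeps the ampersand literal (Bash 5.2+ expands a bare & to the matched text)
                    line=${line//&amp;/\&}
                    line=${line//\|/}
                    line=${line//\?/}
                    line=${line//\*/}
                    line=${line//\:/}
                    line=${line//\+/}
                    line=${line//\\/}
                    line=${line//\//}
                    line=${line//\[/}
                    line=${line//\]/}
                    line=${line//\'/}
                    line=${line//\"/}
                    actualEname=${line//&ndash;/-}
                    Ename=${actualEname/;*/}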

#Get OriginalAirDate
                elif [[ $line == \<FirstAired\>* ]]; then
                    line=${line/<\/FirstAired>/}
                    line=${line/<FirstAired>/}
                    FAired=$line

#Get Season number
                elif [[ $line == \<SeasonNumber\>* ]]; then
                    line=${line/<\/SeasonNumber>/}
                    line=${line/<SeasonNumber>/}
                    SeasonNr=$line

#Get Episode number
                elif [[ $line == \<EpisodeNumber\>* ]]; then
                    line=${line/<\/EpisodeNumber>/}
                    line=${line/<EpisodeNumber>/}
                    EpisodeNr=$line
                fi
            done < "$mythicalLibrarian/$NewShowName/$NewShowName.xml"


            chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".actualEname.txt
            chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".Ename.txt
            chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".FAired.txt
            chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".S.txt
            chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".E.txt
            GotNewInformation=1
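
One further tweak that is not part of the suggested changes: each `echo ... >> file` reopens its file for every record. The sketch below opens each lookup file once with exec and writes through file descriptors instead. The sample XML, the temporary directory and the reduced character clean-up are assumptions made to keep the example short and runnable.

```shell
#!/usr/bin/env bash
# Sketch only: sample data and paths are made up for illustration.
workdir=$(mktemp -d)
cat > "$workdir/show.xml" <<'EOF'
<Episode>
  <EpisodeName>Pilot: Part 1; Alternate Title</EpisodeName>
  <FirstAired>2009-01-01</FirstAired>
  <SeasonNumber>1</SeasonNumber>
  <EpisodeNumber>1</EpisodeNumber>
</Episode>
<Episode>
  <EpisodeName>Tom &amp; Jerry?</EpisodeName>
  <FirstAired>2009-01-08</FirstAired>
  <SeasonNumber>1</SeasonNumber>
  <EpisodeNumber>2</EpisodeNumber>
</Episode>
EOF

# Open each lookup file once; the loop writes to descriptors 3-6
# instead of reopening a file for every record.
exec 3> "$workdir/show.Ename.txt" 4> "$workdir/show.FAired.txt" \
     5> "$workdir/show.S.txt"     6> "$workdir/show.E.txt"

recordNumber=0
Ename="" FAired="" SeasonNr="" EpisodeNr=""
while read -r line; do                  # read strips the leading indentation
    if [[ $line == '</Episode>' ]]; then
        (( ++recordNumber ))
        printf '%s\n' "$Ename"     >&3
        printf '%s\n' "$FAired"    >&4
        printf '%s\n' "$SeasonNr"  >&5
        printf '%s\n' "$EpisodeNr" >&6
        Ename="" FAired="" SeasonNr="" EpisodeNr=""
    elif [[ $line == '<EpisodeName>'* ]]; then
        line=${line#'<EpisodeName>'}
        line=${line%'</EpisodeName>'}
        line=${line//'&amp;'/\&}        # \& stays a literal ampersand
        line=${line//[?:]/}             # drop two of the unwanted characters
        Ename=${line%%;*}               # keep the text before the first ;
    elif [[ $line == '<FirstAired>'* ]]; then
        line=${line#'<FirstAired>'}
        FAired=${line%'</FirstAired>'}
    elif [[ $line == '<SeasonNumber>'* ]]; then
        line=${line#'<SeasonNumber>'}
        SeasonNr=${line%'</SeasonNumber>'}
    elif [[ $line == '<EpisodeNumber>'* ]]; then
        line=${line#'<EpisodeNumber>'}
        EpisodeNr=${line%'</EpisodeNumber>'}
    fi
done < "$workdir/show.xml"

exec 3>&- 4>&- 5>&- 6>&-                # close the descriptors
echo "records: $recordNumber"
```

With only a few thousand records the gain over the version above is likely small; the main speedup still comes from dropping the external processes and the trailing &.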