Why is grep --ignore-case 50 times slower?

performance bash time grep

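I was very surprised to see that adding the --ignore-case option to grep can slow a search down by a factor of 50. I tested this on two different machines and got the same result. I am curious to find an explanation for such a huge performance difference.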

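I would also like to see an alternative to grep for case-insensitive searches. I don't need regular expressions, just fixed-string searching. First, the test file is a 50 MB plain text file containing some dummy data, which you can generate with the following code: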

Create test.txt

yes all work and no play makes Jack a dull boy | head -c 50M > test.txt
echo "Jack is no fun" >> test.txt
echo "Jack is no Fun" >> test.txt

Demonstration

Here is a demonstration of the slowness. Adding the --ignore-case option makes the command 57 times slower.

$ time grep fun test.txt
all work and no plJack is no fun
real    0m0.061s

$ time grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m3.498s

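Possible explanation

Searching on Google, I found a discussion about grep being slow in UTF-8 locales. So I ran the test below, and it did speed things up. The default locale on my machine is en_US.UTF-8, so setting it to POSIX seems to have boosted performance, but now of course I cannot search Unicode text correctly, which is undesirable. Even so, it is still roughly 2.3 times slower than the case-sensitive search (0.142 s versus 0.061 s).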

$ time LANG=POSIX grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.142s
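One detail worth keeping in mind with this test: setting LANG only takes effect when LC_ALL and LC_CTYPE are unset or empty, because the locale variables are consulted in the order LC_ALL, then LC_CTYPE, then LANG. The sketch below is only an illustration of that precedence; effective_ctype is a hypothetical helper name, not part of grep or glibc.

```shell
#!/bin/sh
# effective_ctype: hypothetical helper that prints the locale name a program
# would use for character classification, following the POSIX precedence
# LC_ALL > LC_CTYPE > LANG. Empty values count as unset.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # With nothing set, programs fall back to the POSIX (C) locale.
        printf 'POSIX\n'
    fi
}
```

If this prints something other than POSIX after LANG=POSIX has been set, one of the higher-priority variables is overriding it, and LC_ALL=C would be the more reliable way to force the plain locale for a single command.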

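Alternatives

We can use Perl instead, which is faster, but still more than 6 times slower than the case-sensitive grep (0.388 s versus 0.061 s). The POSIX-locale grep above is in turn about 2.7 times as fast as Perl.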

$ time perl -ne '/fun/i && print' test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.388s

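So I would love to find a fast, correct alternative, and an explanation if anyone has one.

Update - CentOS

The two machines tested above run Ubuntu 11.04 (Natty Narwhal) and 12.04 (Precise Pangolin). Running the same tests on a CentOS 5.3 machine produces the following interesting results: the performance is nearly identical in both cases. CentOS 5.3 was released in early 2009 and ships grep 2.5.1, while Ubuntu 12.04 ships grep 2.10. So something may have changed in the newer version, or the two distributions may differ in some other way.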

$ time grep fun test.txt
Jack is no fun
real    0m0.026s

$ time grep --ignore-case fun test.txt
Jack is no fun
Jack is no Fun
real    0m0.027s

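I think this bug report helps in understanding why it is slow:

To do a case-insensitive search, grep first has to convert the entire 50 MB file to a single case. That takes time. Not only that, it also involves memory copies.

In your test case, you generate the file first. That means it will be cached in memory. The first grep run only needs the cached pages; it does not even have to touch the disk.

The case-insensitive grep does the same, but then it tries to modify that data. This means the kernel takes an exception for every modified 4 kB page and ends up having to copy the entire 50 MB into new memory, one page at a time.

Basically, I would expect this to be slower. Maybe not 57 times slower, but definitely slower.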

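The slowness comes from grep (in a UTF-8 locale) constantly accessing the files "/usr/lib/locale/locale-archive" and "/usr/lib/gconv/gconv-modules.cache".

This can be shown with a tracing utility. Both files come from glibc.

The reason is that it needs to do a Unicode-aware comparison for the current locale, and judging by Marat's answer, it does not do so very efficiently.

This shows how much faster it is when Unicode is not taken into account: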

$ time LC_CTYPE=C grep -i fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.192s

Of course, this alternative does not work for characters from other languages, such as Ñ/ñ, Ø/ø, Ð/ð, Æ/æ and so on.
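One way to live with that limitation is to use the C locale only when the search string itself is plain ASCII, and fall back to the slower locale-aware search otherwise. The following is a minimal sketch under that assumption; grepi is a hypothetical wrapper name, and it does fixed-string matching (-F) since the question does not need regular expressions.

```shell
#!/bin/sh
# grepi: hypothetical wrapper for fixed-string, case-insensitive search.
# Usage: grepi PATTERN [FILE...]   (arguments after PATTERN are treated as files)
grepi() {
    pattern=$1
    shift
    # Delete every printable ASCII byte (space through tilde);
    # whatever remains is a non-ASCII or control byte.
    rest=$(printf '%s' "$pattern" | LC_ALL=C tr -d ' -~')
    if [ -z "$rest" ]; then
        # Pure ASCII pattern: byte-wise matching in the C locale is enough.
        LC_ALL=C grep -F -i -- "$pattern" "$@"
    else
        # Non-ASCII pattern: keep the current locale so case folding stays correct.
        grep -F -i -- "$pattern" "$@"
    fi
}
```

For example, grepi fun test.txt would take the fast path, while grepi ñandú test.txt would not. The fast path can still miss the rare non-ASCII characters that case-fold to ASCII letters, and how grep treats non-ASCII bytes in the file under the C locale may differ between versions, so this is a trade-off rather than a drop-in replacement.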

Another alternative is to rewrite the regular expression so that it matches case-insensitively by itself:

$ time grep '[Ff][Uu][Nn]' test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.193s

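This is reasonably fast, but of course it is a pain to convert every character into a class, and unlike the approach above, it is not easy to turn into an alias or an sh script.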
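That said, the conversion can be scripted for simple cases. The sketch below assumes the search string contains only ASCII letters, digits and spaces; ci_pattern is a hypothetical helper name, and regex metacharacters in the input are passed through unescaped.

```shell
#!/bin/sh
# ci_pattern: hypothetical helper that rewrites each ASCII letter of its
# argument as a two-character bracket class, e.g. f -> [Ff].
# Other characters are copied unchanged (metacharacters are NOT escaped).
ci_pattern() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c ~ /[A-Za-z]/)
                out = out "[" toupper(c) tolower(c) "]"
            else
                out = out c
        }
        print out
    }'
}
```

It could then be used as grep "$(ci_pattern fun)" test.txt, which expands to the same [Ff][Uu][Nn] pattern typed by hand above.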

For comparison, on my system:

$ time grep fun test.txt
all work and no plJack is no fun
real    0m0.085s

$ time grep -i fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m3.810s

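I think you are off on this one. The file is tiny, only 50 MB. More importantly, look at my update: CentOS performs both searches in nearly the same execution time.

50 MB is 12,500 memory pages, roughly 50 minutes of MP3, and 5 times the hotmail attachment limit... I am not sure I would call that "tiny". Anyway, like I said, 57x does seem too slow.

I ran the test myself: converting the case and then running a regular grep is much faster than grep -i.

Can you provide a summary?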