Finding whether a sentence contains a specific phrase in Ruby
Right now I check whether a sentence contains a specific word by splitting the sentence into an array and then calling include? on it. For example:
"This is my awesome sentence.".split(" ").include?('awesome')
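But I'd like to know the fastest way to do this with a phrase. For example, I want to check whether the sentence "This is my awesome sentence." contains the phrase "my awesome sentence". I'm piecing together sentences and comparing them against a large number of phrases, so speed matters somewhat.
If you aren't familiar with regular expressions, I believe they can solve your problem. Basically, you create a regex object that looks for "awesome" (most likely case-insensitively), and then you can do: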
/regex/.match(string)
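This returns a MatchData object, or nil if nothing matched. If you want the index of the character where the match starts, you can do the following: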
match = "This is my awesome sentence." =~ /awesome/
puts match #This will return the index of the first letter, so the first a in awesome
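The case-insensitive regex mentioned above can be built from a whole phrase rather than a single word. This is a minimal sketch, assuming Ruby 2.4 or later for match?; the mixed-case phrase is a made-up example, and Regexp.escape is used so that any regex metacharacters in the phrase are matched literally.

```ruby
sentence = "This is my awesome sentence."
phrase   = "My Awesome sentence"   # hypothetical phrase with different casing

# Escape the phrase so it is matched literally, and ignore case.
pattern = Regexp.new(Regexp.escape(phrase), Regexp::IGNORECASE)

found    = pattern.match?(sentence)   # true or false, no MatchData is built
position = sentence =~ pattern        # index of the first matched character, or nil

puts found      # true
puts position   # 8
```

match? only answers yes or no, which is all a containment check needs; =~ is the one to use when the position matters.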
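I'd read this article for more detail, since it explains things better than I could. If you don't want to learn too much about it and just want to jump straight into using it, I'd recommend:
You can easily check whether a string contains another string using square brackets, like so: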
irb(main):084:0> "This is my awesome sentence."["my awesome sentence"]
=> "my awesome sentence"
irb(main):085:0> "This is my awesome sentence."["cookies for breakfast?"]
=> nil
If the substring is found, it returns that substring; if not, it returns nil. It should be very fast.
Here are some variations:
require 'benchmark'
lorem = ('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut' # !> unused literal ignored
'enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in' # !> unused literal ignored
'reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,' # !> unused literal ignored
'sunt in culpa qui officia deserunt mollit anim id est laborum.' * 10) << ' foo'
lorem.split.include?('foo') # => true
lorem['foo'] # => "foo"
lorem.include?('foo') # => true
lorem[/foo/] # => "foo"
lorem[/fo{2}/] # => "foo"
lorem[/foo$/] # => "foo"
lorem[/fo{2}$/] # => "foo"
lorem[/fo{2}\Z/] # => "foo"
/foo/.match(lorem)[-1] # => "foo"
/foo$/.match(lorem)[-1] # => "foo"
/foo/ =~ lorem # => 621
n = 500_000
puts RUBY_VERSION
puts "n=#{ n }"
Benchmark.bm(25) do |x|
  x.report("array search:") { n.times { lorem.split.include?('foo') } }
  x.report("literal search:") { n.times { lorem['foo'] } }
  x.report("string include?:") { n.times { lorem.include?('foo') } }
  x.report("regex:") { n.times { lorem[/foo/] } }
  x.report("wildcard regex:") { n.times { lorem[/fo{2}/] } }
  x.report("anchored regex:") { n.times { lorem[/foo$/] } }
  x.report("anchored wildcard regex:") { n.times { lorem[/fo{2}$/] } }
  x.report("anchored wildcard regex2:") { n.times { lorem[/fo{2}\Z/] } }
  x.report("/regex/.match") { n.times { /foo/.match(lorem)[-1] } }
  x.report("/regex$/.match") { n.times { /foo$/.match(lorem)[-1] } }
  x.report("/regex/ =~") { n.times { /foo/ =~ lorem } }
  x.report("/regex$/ =~") { n.times { /foo$/ =~ lorem } }
  x.report("/regex\Z/ =~") { n.times { /foo\Z/ =~ lorem } }
end
Results with Ruby 1.8.7:
1.8.7
n=500000
                                user     system      total        real
array search:              21.250000   0.000000  21.250000 ( 21.296039)
literal search:             0.660000   0.000000   0.660000 (  0.660102)
string include?:            0.610000   0.000000   0.610000 (  0.612433)
regex:                      0.950000   0.000000   0.950000 (  0.946308)
wildcard regex:             2.840000   0.000000   2.840000 (  2.850198)
anchored regex:             0.950000   0.000000   0.950000 (  0.951270)
anchored wildcard regex:    2.870000   0.010000   2.880000 (  2.874209)
anchored wildcard regex2:   2.870000   0.000000   2.870000 (  2.868291)
/regex/.match               1.470000   0.000000   1.470000 (  1.479383)
/regex$/.match              1.480000   0.000000   1.480000 (  1.498106)
/regex/ =~                  0.680000   0.000000   0.680000 (  0.677444)
/regex$/ =~                 0.700000   0.000000   0.700000 (  0.704486)
/regexZ/ =~                 0.700000   0.000000   0.700000 (  0.701943)
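So, from the results, using a fixed-string search like 'foobar'['foo'] is slower than using a regex, 'foobar'[/foo/], which in turn is slower than the equivalent 'foobar' =~ /foo/. That ordering depends on the Ruby version, though: in the 1.8.7 numbers above, the fixed-string search (0.66 s) is faster than [/foo/] (0.95 s) and about even with =~ (0.68 s).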
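The OP's original solution suffers badly because it walks the string twice: once to split it into individual words, and a second time to iterate over the array looking for the actual target word. Its performance gets worse as the string grows.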
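Apart from the double pass, splitting also cannot find a phrase at all, because each array element is a single whitespace-delimited token. A small sketch using the sentence from the question (the variable names are only for illustration):

```ruby
sentence = "This is my awesome sentence."
phrase   = "my awesome sentence"

# split(" ") produces one token per word, punctuation included.
words = sentence.split(" ")

via_split   = words.include?(phrase)       # false: no single token equals the phrase
word_lookup = words.include?("sentence")   # false: the token is "sentence." with the period
via_include = sentence.include?(phrase)    # true: a plain substring search

p via_split, word_lookup, via_include
```

The substring search scans the original string once and allocates no intermediate array.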
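One thing I find interesting about Ruby's performance is that an anchored regex is slightly slower than an unanchored one. In Perl, when I first ran this kind of benchmark some years ago, the opposite was true.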
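The anchors compared in these benchmarks ($, \Z and \z) are not interchangeable in general; they only agree on the test strings here because those contain no newlines. A short sketch with a made-up multi-line string shows the difference:

```ruby
text = "foo\nbar foo\n"   # hypothetical input with a trailing newline

dollar  = text =~ /foo$/    # 0:   $ matches at the end of any line
big_z   = text =~ /foo\Z/   # 8:   \Z matches at the end of the string, allowing one final newline
small_z = text =~ /foo\z/   # nil: \z matches only at the very end of the string

p dollar, big_z, small_z
```

When the input can contain newlines, the anchor should be chosen for its meaning first and its speed second.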
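Here's an updated version using the fruity gem. The different expressions return different results. Any of them can be used if you only want to see whether the target string exists. If you want to see whether the value is at the end of the string, as in the strings being tested, or want to get the position of the target, then some are definitely faster than others, so choose accordingly.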
require 'fruity'
TARGET_STR = (' ' * 100) + ' foo'
TARGET_STR['foo'] # => "foo"
TARGET_STR[/foo/] # => "foo"
TARGET_STR[/fo{2}/] # => "foo"
TARGET_STR[/foo$/] # => "foo"
TARGET_STR[/fo{2}$/] # => "foo"
TARGET_STR[/fo{2}\Z/] # => "foo"
TARGET_STR[/fo{2}\z/] # => "foo"
TARGET_STR[/foo\Z/] # => "foo"
TARGET_STR[/foo\z/] # => "foo"
/foo/.match(TARGET_STR)[-1] # => "foo"
/foo$/.match(TARGET_STR)[-1] # => "foo"
/foo/ =~ TARGET_STR # => 101
/foo$/ =~ TARGET_STR # => 101
/foo\Z/ =~ TARGET_STR # => 101
TARGET_STR.include?('foo') # => true
TARGET_STR.index('foo') # => 101
TARGET_STR.rindex('foo') # => 101
puts RUBY_VERSION
puts "TARGET_STR.length = #{ TARGET_STR.length }"
puts
puts 'compare fixed string vs. unanchored regex'
compare do
  fixed_str { TARGET_STR['foo'] }
  unanchored_regex { TARGET_STR[/foo/] }
end
puts
puts 'compare /foo/ to /fo{2}/'
compare do
  unanchored_regex { TARGET_STR[/foo/] }
  unanchored_regex2 { TARGET_STR[/fo{2}/] }
end
puts
puts 'compare unanchored vs. anchored regex' # !> assigned but unused variable - delay
compare do
  unanchored_regex { TARGET_STR[/foo/] }
  anchored_regex_dollar { TARGET_STR[/foo$/] }
  anchored_regex_Z { TARGET_STR[/foo\Z/] }
  anchored_regex_z { TARGET_STR[/foo\z/] }
end
puts
puts 'compare /foo/, match and =~'
compare do
  unanchored_regex { TARGET_STR[/foo/] }
  unanchored_match { /foo/.match(TARGET_STR)[-1] }
  unanchored_eq_match { /foo/ =~ TARGET_STR }
end
puts
puts 'compare fixed, unanchored, Z, include?, index and rindex'
compare do
  fixed_str { TARGET_STR['foo'] }
  unanchored_regex { TARGET_STR[/foo/] }
  anchored_regex_Z { TARGET_STR[/foo\Z/] }
  include_eh { TARGET_STR.include?('foo') }
  _index { TARGET_STR.index('foo') }
  _rindex { TARGET_STR.rindex('foo') }
end
Which results in:
# >> 2.2.3
# >> TARGET_STR.length = 104
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 2x ± 0.1
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 19.999999999999996% ± 10.0%
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_eq_match is faster than unanchored_regex by 2x ± 0.1 (results differ: 101 vs foo)
# >> unanchored_regex is faster than unanchored_match by 3x ± 0.1
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 3 seconds.
# >> _rindex is similar to include_eh (results differ: 101 vs true)
# >> include_eh is faster than _index by 10.000000000000009% ± 10.0% (results differ: true vs 101)
# >> _index is faster than fixed_str by 19.999999999999996% ± 10.0% (results differ: 101 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 39.99999999999999% ± 10.0%
# >> anchored_regex_Z is similar to unanchored_regex
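The "(results differ: 101 vs true)" notes in this output come from the expressions returning different kinds of values, which is also what should drive the choice between them. A minimal sketch on a string shaped like TARGET_STR; end_with? is not one of the benchmarked calls and is included here only as an assumption about what an "is it at the end?" check might use:

```ruby
target = (' ' * 100) + ' foo'   # same shape as TARGET_STR

exists   = target.include?('foo')    # true or false: does it occur at all?
position = target.index('foo')       # Integer offset of the first occurrence, or nil
matched  = target['foo']             # the matched substring itself, or nil
at_end   = target.end_with?('foo')   # true or false: is it the tail of the string?

p exists, position, matched, at_end
```

All four are truthy when the target is present, so any of them works in a plain if; the difference only matters when the return value itself is used.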
Modifying the size of the string reveals some things worth knowing.
Changing it to 1,000 characters:
# >> 2.2.3
# >> TARGET_STR.length = 1004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 4096 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 50.0% ± 10.0%
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is faster than anchored_regex_Z by 10.000000000000009% ± 10.0%
# >> anchored_regex_Z is faster than unanchored_regex by 3x ± 0.1
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 4096 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 1001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 2x ± 0.1
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 4 seconds.
# >> _rindex is faster than anchored_regex_Z by 2x ± 1.0 (results differ: 1001 vs foo)
# >> anchored_regex_Z is faster than include_eh by 2x ± 0.1 (results differ: foo vs true)
# >> include_eh is faster than fixed_str by 10.000000000000009% ± 10.0% (results differ: true vs foo)
# >> fixed_str is similar to _index (results differ: foo vs 1001)
# >> _index is similar to unanchored_regex (results differ: 1001 vs foo)
Bumping it up to 10,000:
# >> 2.2.3
# >> TARGET_STR.length = 10004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 512 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 256 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 3 seconds.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 21x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 256 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 10001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 10.000000000000009% ± 10.0%
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 18 seconds.
# >> _rindex is faster than anchored_regex_Z by 2x ± 0.1 (results differ: 10001 vs foo)
# >> anchored_regex_Z is faster than include_eh by 15x ± 1.0 (results differ: foo vs true)
# >> include_eh is similar to _index (results differ: true vs 10001)
# >> _index is similar to fixed_str (results differ: 10001 vs foo)
# >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%
Ruby v2.6.5 results:
# >> 2.6.5
# >> n=500000
# >>                                 user     system      total        real
# >> array search:               6.744581   0.012204   6.756785 (  6.766078)
# >> literal search:             0.351014   0.000334   0.351348 (  0.351866)
# >> string include?:            0.325576   0.000493   0.326069 (  0.326331)
# >> regex:                      0.373231   0.000512   0.373743 (  0.374197)
# >> wildcard regex:             0.371914   0.000356   0.372270 (  0.372549)
# >> anchored regex:             0.373606   0.000568   0.374174 (  0.374736)
# >> anchored wildcard regex:    0.374923   0.000349   0.375272 (  0.375729)
# >> anchored wildcard regex2:   0.136772   0.000384   0.137156 (  0.137474)
# >> /regex/.match               0.662532   0.003377   0.665909 (  0.666605)
# >> /regex$/.match              0.671762   0.005036   0.676798 (  0.677691)
# >> /regex/ =~                  0.322114   0.000404   0.322518 (  0.322917)
# >> /regex$/ =~                 0.332067   0.000995   0.333062 (  0.334226)
# >> /regexZ/ =~                 0.078958   0.000069   0.079027 (  0.079082)
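And the fruity comparisons on 2.6.5 (their output appears further below).
Here's a non-answer showing the benchmark of @TheTinMan's code on Ruby 1.9.2 on OS X. Note the difference in relative performance, in particular the improvements in the second and third tests: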
                                user     system      total        real
array search:               7.960000   0.000000   7.960000 (  7.962338)
literal search:             0.450000   0.010000   0.460000 (  0.445905)
string include?:            0.400000   0.000000   0.400000 (  0.400932)
regex:                      0.510000   0.000000   0.510000 (  0.512635)
wildcard regex:             0.520000   0.000000   0.520000 (  0.514800)
anchored regex:             0.510000   0.000000   0.510000 (  0.513328)
anchored wildcard regex:    0.520000   0.000000   0.520000 (  0.517759)
/regex/.match               0.940000   0.000000   0.940000 (  0.943471)
/regex$/.match              0.940000   0.000000   0.940000 (  0.936782)
/regex/ =~                  0.440000   0.000000   0.440000 (  0.446921)
/regex$/ =~                 0.450000   0.000000   0.450000 (  0.447904)
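I ran these with Benchmark.bmbm, but there was no difference between the rehearsal round and the real timings shown above.
"But I'd like to know what the fastest way is." Then benchmark your alternatives and find out. It's easy, and it's a good habit to get into, because you don't have to guess; you can know which of your options is fastest.
This would match the 10 in the sentence.
While this was selected as the solution, it doesn't answer the OP's question "...what the fastest way to do this is...". We often have preconceived ideas about the fastest way to accomplish something. Benchmarks help (dis)prove those ideas.
I did the original version of this in Perl some years ago. Back then, Perl's fixed-string search using index() was the fastest, with anchored regex searches a very close second. It's interesting that Ruby's regex searches are so much faster; knowing that, I'll change how I do some things. Regex patterns can have their downsides, but used carefully they speed up text processing a great deal.
It's interesting that the regex engine appears to do some optimization with /foo\Z/, but that's probably not very useful in the real world.
@theTinMan Very interesting; I wonder why my results differ from yours for str['foo'] and str.include?('foo').
Ha, my benchmarks were also run on 1.9.2 on Mac OS X. I think bmbm tries to account for some perceived anomalies when testing, but I don't think it simulates real-world conditions any better. Garbage collection and memory allocation can still happen at odd times during a program's run, which makes me think plain ol' Benchmark.bm is good enough. I don't know why there would be a difference, unless my larger loop size triggered more memory initialization or garbage-collection activity. Given the current results, I think it's a toss-up between a fixed-string search and a regex. There are clearly some algorithms to avoid like the plague.
@theTinMan Actually, my test code is identical to yours; my laptop just happens to be pretty fast. :)
That would do it. My personal Mac laptop is several years old. I just ran the code on my quad-core Mac Pro, and your results still blew me away. Now I'm depressed. :-)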
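The advice to benchmark your own alternatives applies directly to the OP's case of checking one sentence against many phrases. The following is only a sketch under assumptions: the phrase list and iteration count are made up, Regexp.union is one possible way to fold the phrases into a single pattern rather than something the answers above measured, and match? assumes Ruby 2.4 or later. No timings are claimed here; run it on your own data and Ruby version.

```ruby
require 'benchmark'

# Hypothetical data: one sentence and a few phrases to look for.
sentence = "This is my awesome sentence."
phrases  = ["my awesome sentence", "cookies for breakfast", "is my"]

# Approach 1: one fixed-string search per phrase.
by_include = phrases.select { |phrase| sentence.include?(phrase) }

# Approach 2: a single regex that matches any of the phrases.
union     = Regexp.union(phrases)
any_match = union.match?(sentence)

n = 100_000
Benchmark.bm(15) do |x|
  x.report("include? loop:") { n.times { phrases.any? { |phrase| sentence.include?(phrase) } } }
  x.report("Regexp.union:")  { n.times { union.match?(sentence) } }
end
```

Swapping Benchmark.bm for Benchmark.bmbm adds the rehearsal round discussed in the comments above.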
The fruity comparisons on Ruby 2.6.5, at three string sizes:
# >> 2.6.5
# >> TARGET_STR.length = 104
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 32768 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 2x ± 0.1
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_regex is similar to unanchored_regex2
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 16384 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is similar to anchored_regex_dollar
# >> anchored_regex_dollar is similar to unanchored_regex
# >>
# >> compare /foo/, match and =~
# >> Running each test 16384 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 101 vs foo)
# >> unanchored_regex is faster than unanchored_match by 3x ± 1.0 (results differ: foo vs )
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 65536 times. Test will take about 3 seconds.
# >> _rindex is similar to include_eh (results differ: 101 vs true)
# >> include_eh is similar to _index (results differ: true vs 101)
# >> _index is similar to fixed_str (results differ: 101 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 2x ± 0.1
# >> anchored_regex_Z is faster than unanchored_regex by 19.999999999999996% ± 10.0%
# >> 2.6.5
# >> TARGET_STR.length = 1004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 32768 times. Test will take about 2 seconds.
# >> fixed_str is faster than unanchored_regex by 7x ± 1.0
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_regex is similar to unanchored_regex2
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 3x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_eq_match is faster than unanchored_regex by 10.000000000000009% ± 10.0% (results differ: 1001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 39.99999999999999% ± 10.0% (results differ: foo vs )
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 65536 times. Test will take about 4 seconds.
# >> _rindex is similar to include_eh (results differ: 1001 vs true)
# >> include_eh is similar to _index (results differ: true vs 1001)
# >> _index is similar to fixed_str (results differ: 1001 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 2x ± 1.0
# >> anchored_regex_Z is faster than unanchored_regex by 4x ± 1.0
# >> 2.6.5
# >> TARGET_STR.length = 10004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 8192 times. Test will take about 2 seconds.
# >> fixed_str is faster than unanchored_regex by 31x ± 10.0
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 512 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 3 seconds.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 27x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 512 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 10001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 10.000000000000009% ± 10.0% (results differ: foo vs )
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 65536 times. Test will take about 14 seconds.
# >> _rindex is faster than _index by 2x ± 1.0
# >> _index is similar to include_eh (results differ: 10001 vs true)
# >> include_eh is similar to fixed_str (results differ: true vs foo)
# >> fixed_str is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 26x ± 1.0