Ruby regex to match strings with a repeating pattern
I'm trying to find a regex that matches URLs containing three or more repeats of a segment (where a segment may span any number of directories), for example:
s1 = 'http://www.foo.com/bar/bar/bar/'
s2 = 'http://www.foo.com/baz/biz/baz/biz/baz/biz/etc'
s3 = '/foo/bar/foo/bar/foo/bar/'
and that does not match URLs such as:
s4 = '/foo/bar/foo/bar/foo/barbaz'
First, I tried:
re1 = /((.+\/)+)\1\1/
which works for:
re1 === s1 #=> true
re1 === s2 #=> true
But as the number of segments increases, the time the regex takes to match grows exponentially:
require 'benchmark'
Benchmark.bm do |b|
  (10..15).each do |num|
    str = '/foo/bar' * num
    puts str
    b.report("#{num} repeats:") { /((.+\/)+)\1\1/ === str }
  end
end
user system total real
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
10 repeats: 0.060000 0.000000 0.060000 ( 0.054839)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
11 repeats: 0.210000 0.000000 0.210000 ( 0.213492)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
12 repeats: 0.870000 0.000000 0.870000 ( 0.871879)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
13 repeats: 3.370000 0.010000 3.380000 ( 3.399224)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
14 repeats: 13.580000 0.110000 13.690000 ( 13.790675)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
15 repeats: 54.090000 0.210000 54.300000 ( 54.562672)
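One way to see where this growth may come from: because . also matches '/', the inner (.+\/)+ can carve up the same stretch of text in many different ways, and a backtracking engine may end up trying a large share of them. The sketch below is an illustration of that counting argument, not a profile of Ruby's regex engine. split_ways is a hypothetical helper, introduced here, that counts how many ways a string can be cut into consecutive pieces that each match .+\/.

```ruby
# Hypothetical helper (not part of the question): counts the ways +str+ can be
# written as consecutive pieces, each matching .+\/ (at least one character
# followed by a slash). Returns 0 if no such split exists.
def split_ways(str)
  return 1 if str.empty?
  # A piece may end at any '/' that has at least one character before it;
  # the remainder of the string is then split recursively.
  (1...str.size).sum do |i|
    str[i] == '/' ? split_ways(str[(i + 1)..-1]) : 0
  end
end

(1..5).each do |num|
  str = 'foo/bar/' * num
  puts "#{num} repeats: #{split_ways(str)} ways"
end
```

Each extra repeat of foo/bar/ adds two slashes and multiplies the count by four, which is in line with the roughly fourfold jump between consecutive timings in the output above.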
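Then I tried a second regex, re2, similar to one given elsewhere.
It has no performance problem, and it matches the strings I want to match:
re2 === s3 #=> true
but it also matches strings I don't want it to match, for example: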
re2 === s4 #=> true, but should be false
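I'm close with the second regex. What am I missing?
Change . to [^\/]. This reduces the complexity of the regex, because it no longer tries to match "any" character: a [^\/]+ run cannot extend across a slash, so the engine has far fewer ways to split the string while backtracking.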
require 'benchmark'
Benchmark.bm do |b|
  (10..15).each do |num|
    str = '/foo/bar' * num
    puts str
    b.report("#{num} repeats:") { /(([^\/]+\/)+)\1\1/ === str }
  end
end
10 repeats: 0.000000 0.000000 0.000000 ( 0.000015)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
11 repeats: 0.000000 0.000000 0.000000 ( 0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
12 repeats: 0.000000 0.000000 0.000000 ( 0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
13 repeats: 0.000000 0.000000 0.000000 ( 0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
14 repeats: 0.000000 0.000000 0.000000 ( 0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
15 repeats: 0.000000 0.000000 0.000000 ( 0.000005)
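As a quick check against the question's sample strings, the sketch below applies the revised regex to s1 through s4. The strings are copied from the question; the name re3 and the extra string s5 are introduced here purely for illustration.

```ruby
s1 = 'http://www.foo.com/bar/bar/bar/'
s2 = 'http://www.foo.com/baz/biz/baz/biz/baz/biz/etc'
s3 = '/foo/bar/foo/bar/foo/bar/'
s4 = '/foo/bar/foo/bar/foo/barbaz'

# re1 from the question with . replaced by [^\/] (the name re3 is hypothetical).
re3 = /(([^\/]+\/)+)\1\1/

{ s1: s1, s2: s2, s3: s3, s4: s4 }.each do |name, s|
  puts "#{name}: #{re3.match?(s)}"
end

# Illustrative caveat: [^\/]+ is free to start in the middle of a directory
# name, so the tail of 'xbar' is accepted as the first of three 'bar/' repeats.
s5 = '/xbar/bar/bar/'
puts "s5: #{re3.match?(s5)}"
```

Whether a string like s5 should count as a match depends on what is meant by a repeated segment; if it should not, the pattern would need to be tied to a leading slash.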
Definitions
Suppose:
str = 'http://www.example.com/dog/baz/biz/baz/biz/baz/biz/cat/'
We can define '/dog', '/baz', '/biz', and so on as segments. A group consists of one or more contiguous segments, for example '/dog', '/baz', '/dog/baz', '/baz/biz', and so on.
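To make these definitions concrete, the sketch below lists the segments of str and a few two-segment groups. The scan pattern and the variable names segments and groups_of_two are illustrative choices, not part of the answer.

```ruby
str = 'http://www.example.com/dog/baz/biz/baz/biz/baz/biz/cat/'

# A segment: a '/' that is not preceded by another '/', followed by one or
# more non-slash characters. The lookbehind keeps the host out of the list.
segments = str.scan(/(?<!\/)\/[^\/]+/)
p segments

# Groups of two contiguous segments, joined back into strings.
groups_of_two = segments.each_cons(2).map(&:join)
p groups_of_two.first(3)
```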
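Problem
As I understand it, the problem is to determine whether a given string contains three (or more) contiguous, equal groups followed by a forward slash. s2 passes this test by way of the substring: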
'/baz/biz/baz/biz/baz/biz/'
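Algorithm
I don't believe this determination can be made with a single regex, but we can write a regex that determines whether there are at least three (or any given number of) contiguous, equal groups, given the number of segments per group. Suppose this is done by a method named contiguous_fixed_group_size?, called as follows: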
contiguous_fixed_group_size?(str, segments_per_group, nbr_groups)
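which returns true or false. To check that the string has at least 3 contiguous, equal groups (for a given value of segments_per_group), we call this method with nbr_groups = 3. I think it best to defer the construction of this method for now; for the moment, assume it is available to us.
The approach I have taken is to call this method with different values of segments_per_group and determine whether it returns true for at least one of them.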
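Main method
The first step is to determine the number of segments in the string (where str holds the string above):
nbr_segments = str.scan(/(?<!\/)\/(?!\/)/).size - 1 #=> 8
Since nbr_groups groups of segments_per_group segments each must fit within nbr_segments segments:
segments_per_group <= nbr_segments/nbr_groups
which for this string and nbr_groups = 3 gives segments_per_group <= 8/3 => 2 (integer division).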
We can therefore determine whether str contains (at least) nbr_groups contiguous, equal groups as follows:
(1..nbr_segments/nbr_groups).any? do |segs_per_group|
  contiguous_fixed_group_size?(str, segs_per_group, nbr_groups)
end
  #=> true
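The method contiguous_fixed_group_size?, deferred earlier, can be written as:
def contiguous_fixed_group_size?(str, segments_per_group, nbr_groups)
  # Capture one group of segments_per_group segments, require nbr_groups-1
  # further copies of it, then a closing '/'.
  r = /(?<!\/)((?:\/[^\/]+){#{segments_per_group}})\1{#{nbr_groups-1}}\//
  str.match?(r)
end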
We can wrap the search over group sizes in a method:
def contiguous?(str, nbr_groups)
  nbr_segments = str.scan(/(?<!\/)\/(?!\/)/).size - 1
  (1..nbr_segments/nbr_groups).any? do |segs_per_grp|
    contiguous_fixed_group_size?(str, segs_per_grp, nbr_groups)
  end
end
For
str = s2
segments_per_group = 2
nbr_groups = 3
the regex is:
r #=> /(?<!\/)((?:\/[^\/]+){2})\1{2}\//
Written in free-spacing mode:
r = /
    (?<!\/)                    # match must not be preceded by a forward slash
                               # (negative lookbehind)
    (                          # begin capture group 1
      (?:                      # begin non-capture group
        \/[^\/]+               # match a forward slash followed by 1+ characters
                               # other than a forward slash
      )                        # end non-capture group
      {#{segments_per_group}}  # execute non-capture group segments_per_group times
    )                          # end capture group 1
    \1{#{nbr_groups-1}}        # match the contents of capture group 1
                               # nbr_groups-1 more times
    \/                         # match a forward slash
    /x                         # free-spacing regex definition mode
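To see what this regex captures, the sketch below builds it for segments_per_group = 2 and nbr_groups = 3 and inspects the match against s2 from the question. It uses the single-line form of the regex; the variable m is introduced here for illustration.

```ruby
s2 = 'http://www.foo.com/baz/biz/baz/biz/baz/biz/etc'
segments_per_group = 2
nbr_groups = 3

# Single-line equivalent of the free-spacing definition.
r = /(?<!\/)((?:\/[^\/]+){#{segments_per_group}})\1{#{nbr_groups - 1}}\//

m = r.match(s2)
puts m.pre_match  # text before the match (scheme and host)
puts m[1]         # capture group 1: one group of two segments
puts m[0]         # whole match: three equal groups plus the closing '/'
```

Here capture group 1 is '/baz/biz' and the whole match is '/baz/biz/baz/biz/baz/biz/', the substring identified in the problem statement.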
contiguous?(str, 3) #=> true
contiguous?(str, 2) #=> true
contiguous?(str, 1) #=> true
contiguous?(str, 4) #=> false
str = 'http://www.example.com/dog/baz/biz/baz/bix/baz/biz/cat/'
contiguous?(str, 3) #=> false
contiguous?(str, 2) #=> false
contiguous?(str, 1) #=> true
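As a cross-check on the regex-based methods, the same test can be sketched without backreferences by splitting the path into segments and comparing slices. This is an alternative formulation, not the answer's method, and it is not guaranteed to agree with the regex on every edge case. The method name contiguous_by_segments? and the way it strips a scheme and host are assumptions made for illustration.

```ruby
# Hypothetical alternative to contiguous?: compares slices of a segment array.
def contiguous_by_segments?(str, nbr_groups)
  # Assumption: drop a leading scheme and host such as 'http://www.foo.com'.
  path = str.sub(%r{\A[a-z][a-z0-9+.-]*://[^/]*}i, '')
  parts = path.split('/', -1)
  # Keep only pieces with a '/' on both sides: drop whatever precedes the
  # first slash and whatever follows the last one.
  segments = parts[1...-1] || []
  (1..segments.size / nbr_groups).any? do |size|
    window = size * nbr_groups
    (0..segments.size - window).any? do |start|
      slice = segments[start, window]
      # All nbr_groups chunks of +size+ segments must be non-empty and equal.
      slice.none?(&:empty?) && slice.each_slice(size).to_a.uniq.size == 1
    end
  end
end

samples = {
  s1: 'http://www.foo.com/bar/bar/bar/',
  s2: 'http://www.foo.com/baz/biz/baz/biz/baz/biz/etc',
  s3: '/foo/bar/foo/bar/foo/bar/',
  s4: '/foo/bar/foo/bar/foo/barbaz'
}
samples.each { |name, s| puts "#{name}: #{contiguous_by_segments?(s, 3)}" }
```

On the question's samples this reports a match for s1, s2 and s3 and no match for s4, since 'barbaz' is never followed by a slash.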