Ruby: search for and mark paired patterns on a line
Tags: ruby, perl, bash, python-2.7
I need to search for and mark patterns that are split at some position on a line. Below is a short list of sample patterns, which are stored in a separate file:
CAT,TREE
LION,FOREST
OWL,WATERFALL
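A match occurs when the item in column 2 appears on the same line, after the item in column 1. For example:
THEREISACATINTHETREE. (matches)
If the item in column 2 appears first on the line, there is no match. For example:
THETREEHASACAT. (does not match)
Also, if the items in column 1 and column 2 touch, there is no match. For example:
THECATTREEHASMANYBIRDS. (does not match)
Once a match is found, I need to mark it with \start{n} (placed right after the column 1 item) and \end{n} (placed right before the column 2 item), where n is a simple counter that increases each time a match is found. For example: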
THEREISACAT\start{1}INTHE\end{1}TREE.
Here is a more complex example:
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
This becomes:
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.
Sometimes there are multiple matches at the same position:
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
This becomes:
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
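- There are no spaces in the file
- Many non-Latin characters appear in the file
- A pattern match only counts when both items are on the same line (for example, "CAT" on line 1 does not match "TREE" on line 2, because they are on different lines)
How can I find these matches and mark them in this way?
Here is a partial answer. It meets all of your requirements except the last one, for which there is no single simple solution. I will leave that one for you to figure out :-)
I chose a rule-based approach rather than regular expressions. In similar projects I have found that a simple rule-based parser is easier to maintain, more portable, and often faster than regular expressions. I am not using any truly Ruby-specific features here, so it should be easy to port to Python or Perl. It should even be portable to C without much effort.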
patterns = [
  ['CAT', 'TREE'],
  ['LION', 'FOREST'],
  ['OWL', 'WATERFALL']
]
lines = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
]
newlines = []
START_TAG_LENGTH = 9
END_TAG_LENGTH = 7
lines.each do |line|
  newline = line.dup
  before = {}
  n = 1
  patterns.each do |pair|
    a = 0
    matches = [[], []]
    len = pair[0].length
    pair.each do |pattern|
      b = 0
      while (c = line.index(pattern, b))
        matches[a] << c
        b = c + 1
      end
      break if b == 0 && a > 0
      a += 1
    end
    matches[0].each_with_index do |d, f|
      bd = 0; be = 0
      e = matches[1][f]
      next if (d > e) || (d + len == e)
      d = d + len
      before.each { |g, h| bd += h if g <= d }
      newline.insert(d + bd, "\\start{#{n}}")
      before[d] ||= 0
      before[d] += START_TAG_LENGTH
      before.each { |g, h| be += h if g <= e }
      newline.insert(e + be, "\\end{#{n}}")
      before[e] ||= 0
      before[e] += END_TAG_LENGTH
    end
    n += 1
  end
  newlines << newline
end
puts newlines
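Note that it fails on the last one. Still, this should give you a good start. If you need help understanding what some of the code does, don't hesitate to ask.
By the way, I'm just curious: what are you using this for?
Here is a Perl approach: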
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# couples of patterns to search for
my @patterns = (
    ['CAT', 'TREE'],
    ['LION', 'FOREST'],
    ['OWL', 'WATERFALL'],
);
# loop over all sentences
while (my $line = <DATA>) {
    chomp $line;    #remove linefeed
    my $count = 1;  #counter of start/end
    foreach my $pats (@patterns) {
        #$p1=first pattern, $p2=second
        my ($p1, $p2) = @$pats;
        #split on patterns, keep them, remove empty
        my @s = grep {$_} split /($p1|$p2)/, $line;
        #$start=position where to put the \start
        #$end=position where to put the \end
        my ($start, $end) = (undef, undef);
        #loop on all elements given by split
        for my $i (0 .. $#s) {
            # current element
            my $cur = $s[$i];
            #if = first pattern, keep its position in the array
            if ($cur eq $p1) {
                $start = $i;
            }
            #if = second pattern, keep its position in the array
            if ($cur eq $p2) {
                $end = $i;
            }
            #if both are defined and second pattern after first pattern
            # insert \start and \end
            if (defined($start) && defined($end) && $end > $start + 1) {
                $s[$start] .= "\\start{$count}";
                $s[$end] = "\\end{$count}" . $s[$end];
                undef $end;
                $count++;
            }
        }
        # recompose the line
        $line = join '', @s;
    }
    say $line;
}
__DATA__
THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACATINTHETREE.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
CAT...TREE...CAT...TREE
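For anyone following along in Ruby: the central trick of the Perl code above, splitting on the two patterns while keeping them as tokens, has a close Ruby counterpart. The sketch below only illustrates that one step and is not part of the original answer; `tokenize` is a made-up helper name.

```ruby
# Hypothetical helper (not from the answers): split a line on two literal
# patterns and keep the patterns themselves as separate tokens.
def tokenize(line, first, second)
  # Regexp.union escapes the literals; the capture group makes split keep them.
  separator = /(#{Regexp.union(first, second)})/
  # Drop the empty strings that split produces between adjacent separators.
  line.split(separator).reject(&:empty?)
end

p tokenize('THEREISACATINTHETREE.', 'CAT', 'TREE')
# => ["THEREISA", "CAT", "INTHE", "TREE", "."]
p tokenize('THECATTREEHASMANYBIRDS.', 'CAT', 'TREE')
# => ["THE", "CAT", "TREE", "HASMANYBIRDS."]
```

In the second token list the first pattern is immediately followed by the second one, which is the "touching" case; that is what the `$end > $start + 1` test in the Perl code rules out.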
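Take a look at this (Ruby).
Edit: I inserted some comments and clarified some variable names.
First, you have to find all occurrences of the start string and the end string of each pattern. Then you need to work out which of them can be combined (they cannot if the end string comes before the start string, or if it sits at the same position, meaning the two touch). After that, the tags can be generated and inserted into the output string. Note that you have to add the number of characters already inserted to each position, because the length of the string changes as tags are inserted. Also, the tags must be sorted by position before they are inserted; otherwise, working out how far each position has to be shifted becomes very complicated. Here is a short example in Ruby: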
patterns = [['CAT','TREE'], ['LION','FOREST'], ['OWL','WATERFALL']]
strings = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
  'ACATONATREEANDANOTHERCATONANOTHERTREE.',
  'ACATONATREEBUTNOCATTREE.'
]
strings.each do |string|
  matches = {}; tags = []
  counter = shift = 0
  output = string.dup
  patterns.each do |sstr,estr|                 # loop through all patterns
    posa = []; posb = []
    string.scan(sstr){posa << $~.end(0)}       # remember found positions and
    string.scan(estr){posb << $~.begin(0)}     # find all valid combinations (next line)
    matches[[sstr,estr]] = posa.product(posb).reject{|s,e|s>=e}
  end
  matches.each do |pat,pos|                    # loop through all matches
    pos.each do |s,e|
      tags << [s,"\\start{#{counter += 1}}"]   # generate and remember \start{}
      tags << [e,"\\end{#{counter}}"]          # and \end{} tags
    end
  end
  tags.sort.each do |pos,tag|                  # sort and loop through tags
    output.insert(pos+shift,tag)               # insert tag and increment
    shift += tag.chars.count                   # shift by num. of inserted chars
  end
  puts string, output                          # print result
end
Output:
input: THEREISACATINTHETREE.
output: THEREISACAT\start{1}INTHE\end{1}TREE.
input: THETREEHASACAT.
(does not match)
input: THECATTREEHASMANYBIRDS.
(does not match)
input: THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
output: THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
input: THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
output: THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
input: ACATONATREEANDANOTHERCATONANOTHERTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEANDANOTHERCAT\start{3}ONANOTHER\end{2}\end{3}TREE.
input: ACATONATREEBUTNOCATTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEBUTNOCAT\end{2}TREE.
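As a side note on the shift bookkeeping described above: one alternative, sketched here under the assumption that the tag positions have already been computed, is to insert the tags from right to left, so that positions further left are never disturbed and no running offset is needed. `insert_tags` is a hypothetical helper name, not something from the answer.

```ruby
# Hypothetical helper: insert [position, text] tags into a copy of +line+.
# Positions refer to the ORIGINAL line. Working from the highest position
# down means earlier positions stay valid, so no shift counter is needed.
def insert_tags(line, tags)
  out = line.dup
  # Walking the sorted list backwards also keeps tags that share a
  # position in their sorted order in the final string.
  tags.sort.reverse_each { |pos, text| out.insert(pos, text) }
  out
end

puts insert_tags('THEREISACATINTHETREE.', [[11, '\start{1}'], [16, '\end{1}']])
# prints THEREISACAT\start{1}INTHE\end{1}TREE.
```

As in the code above, sorting the tag strings puts \end tags before \start tags at the same position, and it is a plain string comparison, so the order of counters with different digit counts (9 vs. 10) would need extra care.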
Tests:
require 'test/unit'
class TestPatternMarker < Test::Unit::TestCase
  def setup
    @patterns = [
      ['CAT' , 'TREE'     ],
      ['LION', 'FOREST'   ],
      ['OWL' , 'WATERFALL']
    ]
    @marker = PatternMarker.new(@patterns)
  end
  def test_should_parse_simple
    @marker.parse 'THEREISACATINTHETREE.'
    assert @marker.match?
    assert_equal 'THEREISACAT\start{1}INTHE\end{1}TREE.', @marker.output
  end
  def test_should_parse_reverse
    @marker.parse 'THETREEHASACAT.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end
  def test_should_parse_touching
    @marker.parse 'THECATTREEHASMANYBIRDS.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end
  def test_should_parse_multiple_patterns
    @marker.parse 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.', @marker.output
  end
  def test_should_mark_multiple_matches_at_same_place
    @marker.parse 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.', @marker.output
  end
  def test_should_mark_all_possible_matches
    @marker.parse 'CATFOOTREEFOOCATFOOTREE.'
    assert @marker.match?
    assert_equal 'CAT\start{1}\start{2}FOO\end{1}TREEFOOCAT\start{3}FOO\end{2}\end{3}TREE.', @marker.output
  end
  def test_should_accept_input
    @marker.parse 'CATINTREE'
    assert @marker.match?
    assert_equal 'CATINTREE', @marker.input
    @marker.parse 'FOOBAR'
    assert !@marker.match?
    assert_equal 'FOOBAR', @marker.input
  end
  def test_should_only_accept_valid_patterns
    assert_raise ArgumentError do PatternMarker.new([]) end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR']) end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'],['FOO','BAR','BAZ']) end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'],['BAZ']) end
    assert_nothing_raised do PatternMarker.new([['FOO','BAR']]) end
  end
end
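The PatternMarker class that these tests exercise is not shown on this page. The following is a minimal sketch of what such a class might look like, assuming the "mark every valid pair" behaviour the tests expect. Only the interface (`new`, `parse`, `match?`, `input`, `output`) comes from the tests; the body, including the private helper `occurrences`, is a guess and not the author's code.

```ruby
# Hypothetical reconstruction; only the public interface is taken from the tests.
class PatternMarker
  attr_reader :input, :output

  def initialize(patterns)
    valid = patterns.is_a?(Array) && !patterns.empty? &&
            patterns.all? do |pair|
              pair.is_a?(Array) && pair.size == 2 &&
                pair.all? { |s| s.is_a?(String) && !s.empty? }
            end
    raise ArgumentError, 'expected a non-empty list of [first, second] pairs' unless valid
    @patterns = patterns
  end

  def parse(line)
    @input = line
    tags = []      # each entry: [position, kind (0 = end, 1 = start), n, text]
    counter = 0
    @patterns.each do |first, second|
      starts = occurrences(line, first).map { |i| i + first.length }
      ends = occurrences(line, second)
      starts.product(ends).each do |s, e|
        next if s >= e   # second item comes first, or the two items touch
        counter += 1
        tags << [s, 1, counter, "\\start{#{counter}}"]
        tags << [e, 0, counter, "\\end{#{counter}}"]
      end
    end
    @output = line.dup
    # Insert from right to left so earlier positions stay valid.
    tags.sort.reverse_each { |pos, _kind, _n, text| @output.insert(pos, text) }
    @matched = !tags.empty?
    @output
  end

  def match?
    @matched
  end

  private

  # Character indices of all non-overlapping occurrences of +literal+.
  def occurrences(line, literal)
    found = []
    from = 0
    while (i = line.index(literal, from))
      found << i
      from = i + literal.length
    end
    found
  end
end

marker = PatternMarker.new([%w[CAT TREE]])
marker.parse('THEREISACATINTHETREE.')
puts marker.output   # prints THEREISACAT\start{1}INTHE\end{1}TREE.
```

It searches with String#index rather than a regex, so the pattern words need no escaping, and the indices are character-based, which should suit the non-Latin input mentioned in the question.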
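Edit: added tests and simplified some of the code.
Here is my Perl approach. It is quick and dirty. It would probably be better if I used Marpa instead of regexps for the parsing. Anyway, it gets the job done.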
use strict;
use Test::More;
use Data::Dumper;
# patterns to search for
my @patterns = (
    'CAT,TREE',
    'LION,FOREST',
    'OWL,WATERFALL',
);
#lines
my @lines = qw(
    THEREISACATINTHETREE.
    THETREEHASACAT.
    THECATTREEHASMANYBIRDS.
    THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
    THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
    THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREESORBIGTREES.
);
my @expected_output = (
    'THEREISACAT\start{1}INTHE\end{1}TREE.',
    'Does not Match',
    'Does not Match',
    'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.',
    'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.',
    'THECAT\start{1}\start{2}\start{3}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREESORBIG\end{3}TREES.',
);
#is(check_line($lines[0]),$expected_output[0]);die;
my $no=0;
for(my $i=0;$i<scalar(@lines );$i++){
    is(check_line($lines[$i]),$expected_output[$i]);
    $no++;
}
done_testing( $no );
sub check_line{
    my $in = shift;
    my $out = '';
    my $match = 1;
    foreach my $pattern_line (@patterns){
        my ($first,$second) = split(/,/,$pattern_line);
        #warn "$first,$second,$in\n";
        if ($in !~ m#$first.+?$second#is){
            next;
        }
        #matched
        while ($in =~ s#($first)(.+?)($second)#$1\\start\{$match\}$2\\end\{$match\}_SECOND_#is){
            $match++;
            #warn "Found match: $match\n";
        }
        $in =~ s#_SECOND_#$second#gis;
        #$in =~ s#\\start\{(\d+)\}\\start\{(\d+)\}#\\start\{$2\}\\start\{$1\}#gis;
        my ($end,$start) = $in =~ m#\\start\{(\d+)\}(?:\\start\{(\d+)\})+#gis;
        my $stmp = join("",map {"\\start\{$_\}"} ($start..$end));
        #print Dumper($in,$start,$end,$stmp);
        $in =~ s#\\start\{($end)\}.*?\\start\{($start)\}#$stmp#is;
    }
    return 'Does not Match' if $match ==1;
    $out = $in;
    return $out;
}
Here is one written entirely in bash (no external commands). Not too hard! It expects the input lines on stdin.
#!/bin/bash
words=("CAT TREE" "LION FOREST" "OWL WATERFALL")
function doit () {
    if [[ "$line" =~ (.*)$word1(.*)$word2(.*) ]]; then
        line="${BASH_REMATCH[1]}$alt_w1\\start{$count}${BASH_REMATCH[2]}$word2\\end{$count}${BASH_REMATCH[3]}"
        (( count += 1 ))
        doit
    elif [[ "$line" =~ $alt_w1 ]]; then
        line=${line//$alt_w1/$word1}
        [[ "$line" =~ (.*)$word2(.*) ]]
        line="${BASH_REMATCH[1]}$alt_w2${BASH_REMATCH[2]}"
        doit
    elif [[ "$line" =~ $alt_w2 ]]; then
        line=${line//$alt_w2/$word2}
    fi
}
while read line; do
    count=1
    for pair in "${words[@]}"; do
        word1=${pair% *}
        word2=${pair#* }
        alt_w1="${word1:0:1}XYZZYX${word1:1}"
        alt_w2="${word2:0:1}XYZZYX${word2:1}"
        doit
    done
    echo "$line"
done
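Assumptions:
- The text will never contain "XYZZYX" (the string can be changed).
- The words will never contain characters that are special in regular expressions, e.g. *[]^$+ (such characters in the input lines are fine).
- The words will always be at least two characters long.
- The words will never be substrings of other words you are searching for, e.g. CAT and CATTLE. In fact this might work, but the results would be very confusing.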
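The second assumption matters for any approach that builds a regular expression from the search words. In Ruby, the language of the question, the usual remedy is to escape each word first. A minimal sketch (the word 'C*T' and the sample lines are made up for illustration):

```ruby
# Illustration only: make a literal search word safe for use inside a regex.
word = 'C*T'                              # contains a regex metacharacter
safe = Regexp.new(Regexp.escape(word))    # matches the literal text "C*T"

puts('THEC*TINTHETREE' =~ safe)           # prints 3
puts('THECATINTHETREE'.match?(safe))      # prints false
```

Without the escape, `/C*T/` would mean "zero or more C followed by T" and would match at the very first T of either line.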
Here is my solution, written in the less popular Python:
# Python 2.7 (uses the print statement)
patterns = [u'CAT,TREE', u'LION,FOREST', u'OWL,WATERFALL']
strings = [u'THEREISACATINTHETREE.',
           u'THETREEHASACAT.',
           u'THECATTREEHASMANYBIRDS.',
           u'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
           u'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
           u'ACATONATREEANDANOTHERCATONANOTHERTREE.',
           u'ACATONATREEBUTNOCATTREE.' ]
def findMatch(needles, haystack, label):
    needles = needles.split(',')
    matches = haystack.split(needles[0])
    if len(matches) > 1:
        submatches = matches[1].split(needles[1])
        if len(submatches) > 1:
            return u''.join([matches[0], needles[0], u'\\start{'+label+'}', submatches[0], u'\\end{'+label+'}', needles[1], submatches[1]])
    return False
for s in strings:
    i = 0
    res = s
    for pat in patterns:
        i = i + 1
        temp = findMatch(pat, res, str(i))
        if (temp):
            res = temp
    print ('searching in '+s+' yields '+res).encode('utf-8')
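Comments:
bash is a poor choice for this task; it can be done, but only with a lot of complexity. Perl is well suited to the job, since it was more or less created for tasks like this.
The requirements are very unclear. Take CAT…TREE…CAT…TREE. Does the first CAT match both TREEs? Or does the second occurrence of CAT intervene? Can the two CATs share the same terminating TREE? Is the result CAT\start{1}\start{2}…\end{1}TREE…CAT\start{3}…\end{2}\end{3}TREE?
Full, automatic UTF-8 handling is really easy in Perl, and Perl lives and breathes regular expressions. I would give it a try, although I don't know the answers to the questions @Kaz raised. There is also the question of how to handle graphemes with combining characters, because you can run into some odd cases, and I don't think you want to match a partial grapheme.
If the requirements were more specific, this would be a rather neat (and challenging!) golf problem.
I was happy to find a solution, but it left me with a question: why do you need something like this? :)
Nice. Definitely cleaner than my version. But this algorithm does not cover …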
Test run output for the Ruby tests:
Loaded suite pattern
Started
........
Finished in 0.003910 seconds.
8 tests, 21 assertions, 0 failures, 0 errors, 0 skips
Test run options: --seed 31173