Ruby: searching for and marking paired patterns on a line

I need to search for and mark patterns that are split across some position on a line. Below is a short list of sample patterns, which are kept in a separate file, for example:

CAT,TREE
LION,FOREST
OWL,WATERFALL
A match occurs when the item from column 2 appears after the item from column 1 on the same line. For example:

THEREISACATINTHETREE. (matches)
If the item from column 2 appears first on the line, there is no match, for example:

THETREEHASACAT. (does not match)
In addition, if the items from column 1 and column 2 touch, there is no match, for example:

THECATTREEHASMANYBIRDS. (does not match)
Once any match is found, I need to mark it with \start{n} (placed after the column 1 item) and \end{n} (placed before the column 2 item), where n is a simple counter that increases whenever a match is found. For example:

THEREISACAT\start{1}INTHE\end{1}TREE.
Here is a more complex example:

THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
This becomes:

THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.
Sometimes there are multiple matches at the same place:

THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
This becomes:

THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
  • There are no spaces in the file
  • Many non-Latin characters appear in the file
  • A pattern match only has to be found within a single line (e.g. "CAT" on line 1 does not match "TREE" on line 2, because they are on different lines)

How can I find these matches and mark them this way?

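Here is a partial answer. It meets all of your requirements except the last one, which has no single simple solution. I'll leave that one for you to figure out :-)

I chose a rule-based approach rather than regular expressions. In earlier, similar projects I found that a simple rule-based parser is easier to maintain, more portable, and usually faster than regular expressions. I'm not using any truly Ruby-specific features here, so it should be easy to port to Python or Perl. It should even be portable to C without much effort.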

patterns = [
  ['CAT', 'TREE'],
  ['LION', 'FOREST'],
  ['OWL', 'WATERFALL']
]

lines = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
]

newlines = []

START_TAG_LENGTH = 9
END_TAG_LENGTH = 7

lines.each do |line|

  newline = line.dup
  before = {}
  n = 1

  patterns.each do |pair|

    a = 0

    matches = [[], []]
    len = pair[0].length

    pair.each do |pattern|
      b = 0
      while (c = line.index(pattern, b))
        matches[a] << c
        b = c + 1
      end
      break if b == 0 && a > 0
      a += 1
    end

    matches[0].each_with_index do |d, f|
      bd = 0; be = 0
      e = matches[1][f]
      next if e.nil? || (d > e) || (d + len == e)  # skip unpaired, reversed, or touching occurrences
      d = d + len
      before.each { |g, h| bd += h if g <= d }
      newline.insert(d + bd, "\\start{#{n}}")
      before[d] ||= 0
      before[d] += START_TAG_LENGTH
      before.each { |g, h| be += h if g <= e }
      newline.insert(e + be, "\\end{#{n}}")
      before[e] ||= 0
      before[e] += END_TAG_LENGTH
    end

    n += 1

  end

  newlines << newline

end

puts newlines
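Note that it fails on the last line. Still, this should give you a good start. If you need help understanding what some of the code does, don't hesitate to ask.

By the way, I'm just curious: what are you using this for?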

Here is a Perl approach:

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# couples of patterns to search for
my @patterns = (
    ['CAT', 'TREE'],
    ['LION', 'FOREST'],
    ['OWL', 'WATERFALL'],
);

# loop over all sentences
while (my $line = <DATA>) {
    chomp $line;    #remove linefeed
    my $count = 1;  #counter of start/end
    foreach my $pats (@patterns) {
        #$p1=first pattern, $p2=second
        my ($p1, $p2) = @$pats;

        #split on patterns, keep them, remove empty
        my @s = grep {$_} split /($p1|$p2)/, $line;

        #$start=position where to put the \start
        #$end=position where to put the \end
        my ($start, $end) = (undef, undef);

        #loop on all elements given by split
        for my $i (0 .. $#s) {
            # current element
            my $cur = $s[$i];

            #if = first pattern, keep its position in the array
            if ($cur eq $p1) {
                $start = $i;
            }

            #if = second pattern, keep its position in the array
            if ($cur eq $p2) {
                $end = $i;
            }

            #if both are defined and second pattern after first pattern
            # insert \start and \end
            if (defined($start) && defined($end) && $end > $start + 1) {
                $s[$start] .= "\\start{$count}";
                $s[$end] = "\\end{$count}" . $s[$end];
                undef $end;
                $count++;
            }
        }
        # recompose the line
        $line = join '', @s;
    }
    say $line;
}

__DATA__
THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACATINTHETREE.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
CAT...TREE...CAT...TREE
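The core trick in the Perl code above is splitting on a capturing group, which keeps the two words as separate tokens, so "touching" simply means the two tokens are adjacent. As a rough illustration only, here is that tokenization step in Ruby. `tokenize` is a hypothetical helper name, not something from the answer.

```ruby
# Hypothetical helper: split a line on either word of a pair and keep the
# words themselves as tokens (a capturing group makes split return them).
def tokenize(line, first, second)
  words = Regexp.union(first, second)        # matches either word literally
  line.split(/(#{words})/).reject(&:empty?)  # drop the empty strings split leaves behind
end

tokens = tokenize('THECATTREEHASMANYBIRDS.', 'CAT', 'TREE')
# CAT and TREE end up as neighbouring tokens, so this pair is touching.
touching = tokens.index('TREE') == tokens.index('CAT') + 1
puts tokens.inspect
puts touching
```

With a line such as THEREISACATINTHETREE. the two words are separated by another token, which corresponds to the `$end > $start + 1` check in the Perl loop.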
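Take a look at this (Ruby):

Edit: I added some comments and renamed a few variables for clarity.

First, you have to find all occurrences of the start strings and end strings from the patterns. Then you need to work out which of them can be combined (they cannot if the end string comes before the start string, or if the two are at the same position and therefore touching). Then the tags can be generated and inserted into the output string. Note that you need to add the number of characters already inserted to each position, because the string grows as tags are inserted. Also, the tags must be sorted by position before they are inserted; otherwise working out how far each position has to be shifted becomes very complicated. Here is a short example in Ruby: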

patterns = [['CAT','TREE'], ['LION','FOREST'], ['OWL','WATERFALL']]
strings = ['THEREISACATINTHETREE.', 'THETREEHASACAT.', 'THECATTREEHASMANYBIRDS.', 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.', 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.', 'ACATONATREEANDANOTHERCATONANOTHERTREE.', 'ACATONATREEBUTNOCATTREE.']

strings.each do |string|
  matches = {}; tags = []
  counter = shift = 0
  output = string.dup

  patterns.each do |sstr,estr|                # loop through all patterns
    posa = []; posb = [];                     #
    string.scan(sstr){posa << $~.end(0)}      # remember found positions and
    string.scan(estr){posb << $~.begin(0)}    # find all valid combinations (next line)
    matches[[sstr,estr]] = posa.product(posb).reject{|s,e|s>=e}
  end

  matches.each do |pat,pos|                   # loop through all matches
    pos.each do |s,e|                         # 
      tags << [s,"\\start{#{counter += 1}}"]  # generate and remember \start{}
      tags << [e,"\\end{#{counter}}"]         # and \end{} tags
    end
  end

  tags.sort.each do |pos,tag|                 # sort and loop through tags
    output.insert(pos+shift,tag)              # insert tag and increment
    shift += tag.chars.count                  # shift by num. of inserted chars
  end

  puts string, output                         # print result
end
Output:

input: THEREISACATINTHETREE.
output: THEREISACAT\start{1}INTHE\end{1}TREE.

input: THETREEHASACAT.
(does not match)

input: THECATTREEHASMANYBIRDS.
(does not match)

input: THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
output: THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.

input: THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
output: THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.

input: ACATONATREEANDANOTHERCATONANOTHERTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEANDANOTHERCAT\start{3}ONANOTHER\end{2}\end{3}TREE.

input: ACATONATREEBUTNOCATTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEBUTNOCAT\end{2}TREE.
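The shift bookkeeping can also be avoided altogether by inserting the tags from right to left, so that positions further left are never disturbed. This is a different technique from the one used in the answer. The sketch below assumes the tags have already been computed as [position, text] pairs, and `insert_tags` is a made-up name.

```ruby
# Hypothetical helper: insert [position, text] tags into a copy of the string.
# Working from the highest position down means no offset has to be tracked.
def insert_tags(string, tags)
  output = string.dup
  # Stable sort by position (ties keep the caller's order), then walk backwards.
  ordered = tags.sort_by.with_index { |(pos, _text), i| [pos, i] }
  ordered.reverse_each { |pos, text| output.insert(pos, text) }
  output
end

tags = [[11, '\start{1}'], [16, '\end{1}']]
puts insert_tags('THEREISACATINTHETREE.', tags)
```

Because tags at the same position are inserted in reverse, they come out in the order the caller supplied, so `\start{1}\start{2}` stays in counter order without relying on string comparison of the tag text.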
Tests:

require 'test/unit'

class TestPatternMarker < Test::Unit::TestCase
  def setup
    @patterns = [
      ['CAT' , 'TREE'     ],
      ['LION', 'FOREST'   ],
      ['OWL' , 'WATERFALL']
    ]

    @marker = PatternMarker.new(@patterns)
  end

  def test_should_parse_simple
    @marker.parse 'THEREISACATINTHETREE.'
    assert @marker.match?
    assert_equal 'THEREISACAT\start{1}INTHE\end{1}TREE.', @marker.output
  end

  def test_should_parse_reverse
    @marker.parse 'THETREEHASACAT.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end

  def test_should_parse_touching
    @marker.parse 'THECATTREEHASMANYBIRDS.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end

  def test_should_parse_multiple_patterns
    @marker.parse 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.', @marker.output
  end

  def test_should_mark_multiple_matches_at_same_place
    @marker.parse 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.', @marker.output
  end

  def test_should_mark_all_possible_matches
    @marker.parse 'CATFOOTREEFOOCATFOOTREE.'
    assert @marker.match?
    assert_equal 'CAT\start{1}\start{2}FOO\end{1}TREEFOOCAT\start{3}FOO\end{2}\end{3}TREE.', @marker.output
  end

  def test_should_accept_input
    @marker.parse 'CATINTREE'
    assert @marker.match?
    assert_equal 'CATINTREE', @marker.input
    @marker.parse 'FOOBAR'
    assert !@marker.match?
    assert_equal 'FOOBAR', @marker.input
  end

  def test_should_only_accept_valid_patterns
    assert_raise ArgumentError do PatternMarker.new([])                                end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'])                     end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'],['FOO','BAR','BAZ']) end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'],['BAZ'])             end
    assert_nothing_raised      do PatternMarker.new([['FOO','BAR']])                   end
  end
end

Edit: added tests and simplified some of the code.
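The tests above refer to a `PatternMarker` class that is not shown here. The following is only a guess at what such a class might look like, reconstructed from the test expectations and the algorithm described earlier in this answer; it is not the author's implementation. The tie-breaking rule for an `\end` and a `\start` landing on the same position is my own assumption.

```ruby
# Hypothetical PatternMarker, reconstructed only from the tests above.
class PatternMarker
  attr_reader :input, :output

  # patterns: a non-empty array of [start_word, end_word] pairs.
  def initialize(patterns)
    ok = patterns.is_a?(Array) && !patterns.empty? &&
         patterns.all? { |pair| pair.is_a?(Array) && pair.size == 2 }
    raise ArgumentError, 'expected a non-empty array of [start, end] pairs' unless ok

    @patterns = patterns
    @matched = false
  end

  def match?
    @matched
  end

  def parse(string)
    @input = string
    tags = []
    counter = 0

    @patterns.each do |start_word, end_word|
      # Position just after each start word, and just before each end word.
      starts = positions(string, start_word).map { |i| i + start_word.length }
      ends = positions(string, end_word)
      starts.product(ends).each do |s, e|
        next if s >= e # end word comes first, or the two words touch

        counter += 1
        # Assumption: at equal positions, \end tags (kind 0) precede \start tags (kind 1).
        tags << [s, 1, counter, "\\start{#{counter}}"]
        tags << [e, 0, counter, "\\end{#{counter}}"]
      end
    end

    @matched = counter.positive?
    @output = string.dup
    shift = 0
    tags.sort.each do |pos, _kind, _n, tag|
      @output.insert(pos + shift, tag) # shift accounts for tags already inserted
      shift += tag.length
    end
    @output
  end

  private

  # Character index of every occurrence of word in string.
  def positions(string, word)
    found = []
    i = 0
    while (i = string.index(word, i))
      found << i
      i += 1
    end
    found
  end
end

marker = PatternMarker.new([%w[CAT TREE], %w[LION FOREST], %w[OWL WATERFALL]])
puts marker.parse('CATFOOTREEFOOCATFOOTREE.')
```

Sorting on the numeric counter rather than on the tag text keeps `\start{2}` ahead of `\start{10}` when more than nine matches share a position.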

Here is my Perl approach. It's quick and dirty.

It would probably be better if I used Marpa for the parsing instead of regexps.

Anyway, it gets the job done.

use strict;
use Test::More;
use Data::Dumper;

# patterns to search for
my @patterns = (
    'CAT,TREE',
    'LION,FOREST',
    'OWL,WATERFALL',
);
#lines
my @lines = qw(
THEREISACATINTHETREE.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREESORBIGTREES.
);


my @expected_output = (
'THEREISACAT\start{1}INTHE\end{1}TREE.',
'Does not Match',
'Does not Match',
'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.',
'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.',
'THECAT\start{1}\start{2}\start{3}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREESORBIG\end{3}TREES.',
);

#is(check_line($lines[0]),$expected_output[0]);die;

my $no=0;
for(my $i=0;$i<scalar(@lines );$i++){   
    is(check_line($lines[$i]),$expected_output[$i]);
    $no++;
}
done_testing( $no );

sub check_line{
    my $in      = shift;
    my $out = '';
    my $match = 1;
    foreach my $pattern_line (@patterns){
        my ($first,$second) = split(/,/,$pattern_line);
        #warn "$first,$second,$in\n";
        if ($in !~ m#$first.+?$second#is){
            next;
        }
        #matched    

        while ($in =~ s#($first)(.+?)($second)#$1\\start\{$match\}$2\\end\{$match\}_SECOND_#is){
            $match++;
            #warn "Found match: $match\n";
        }
        $in =~ s#_SECOND_#$second#gis;
        #$in =~ s#\\start\{(\d+)\}\\start\{(\d+)\}#\\start\{$2\}\\start\{$1\}#gis;
        my ($end,$start) = $in =~ m#\\start\{(\d+)\}(?:\\start\{(\d+)\})+#gis;

        my $stmp = join("",map {"\\start\{$_\}"} ($start..$end));
        #print Dumper($in,$start,$end,$stmp);
        $in =~ s#\\start\{($end)\}.*?\\start\{($start)\}#$stmp#is;


    }
    return 'Does not Match' if $match ==1;
    $out = $in;
    return $out;
}

Here is one written entirely in bash (no external commands). Not too hard! It expects the input lines on stdin.

#!/bin/bash

words=("CAT TREE" "LION FOREST" "OWL WATERFALL")

function doit () {
  if [[ "$line" =~ (.*)$word1(.*)$word2(.*) ]]; then
    line="${BASH_REMATCH[1]}$alt_w1\\start{$count}${BASH_REMATCH[2]}$word2\\end{$count}${BASH_REMATCH[3]}"
    (( count += 1 ))
    doit
  elif [[ "$line" =~ $alt_w1 ]]; then
    line=${line//$alt_w1/$word1}
    [[ "$line" =~ (.*)$word2(.*) ]]
    line="${BASH_REMATCH[1]}$alt_w2${BASH_REMATCH[2]}"
    doit
  elif [[ "$line" =~ $alt_w2 ]]; then
    line=${line//$alt_w2/$word2}
  fi
}

while read line; do
  count=1
  for pair in "${words[@]}"; do
    word1=${pair% *}
    word2=${pair#* }
    alt_w1="${word1:0:1}XYZZYX${word1:1}"
    alt_w2="${word2:0:1}XYZZYX${word2:1}"
    doit
  done
  echo "$line"
done
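Assumptions:

  • The text will never contain "XYZZYX" (the string can be changed)
  • The words will never contain characters that are special in regular expressions, e.g. *[]^$+ (the input lines may contain them, though)
  • The words will always be at least two characters long
  • The words will never be substrings of other words you are searching for (in fact this may work, but the results would be very confusing)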
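The restriction on regex metacharacters exists because the words are dropped into a regular expression unescaped. In a language with an escaping helper that restriction can be lifted. As an aside, this is roughly what that looks like in Ruby with `Regexp.escape`; the sample word `C*T` is made up for illustration.

```ruby
# A made-up word containing a regex metacharacter.
word = 'C*T'

# Regexp.escape backslash-escapes metacharacters so the word matches literally.
literal = Regexp.new(Regexp.escape(word))

puts Regexp.escape(word)            # the escaped form of the word
puts 'THEC*TINATREE' =~ literal     # index where the literal word starts
```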

Here is my solution, written in the not-so-popular Python:

    patterns = [u'CAT,TREE', u'LION,FOREST', u'OWL,WATERFALL']
    
    strings = [u'THEREISACATINTHETREE.',
               u'THETREEHASACAT.',
               u'THECATTREEHASMANYBIRDS.',
               u'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
               u'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
               u'ACATONATREEANDANOTHERCATONANOTHERTREE.',
               u'ACATONATREEBUTNOCATTREE.' ]
    
    def findMatch(needles, haystack, label):
        needles = needles.split(',')
        matches = haystack.split(needles[0])
    
        if len(matches) > 1:
            submatches = matches[1].split(needles[1])
    
            if len(submatches) > 1:
                return u''.join([matches[0], needles[0], u'\\start{'+label+'}', submatches[0], u'\\end{'+label+'}', needles[1], submatches[1]])
    
        return False
    
    for s in strings:
        i = 0
        res = s
        for pat in patterns:
            i = i + 1
            temp = findMatch(pat, res, str(i))
    
            if (temp):
                res = temp
    
        print ('searching in '+s+' yields '+res).encode('utf-8')
    

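bash is a poor choice for this task; it can be done, but only with a lot of complexity. Perl is well suited to the job, since to some extent it was created for tasks like this.

The requirements are very underspecified. Take CAT…TREE…CAT…TREE. Does the first CAT match both TREEs? Or does the second occurrence of CAT intervene? Can the two CATs share the same terminating TREE? Is the result CAT\start{1}\start{2}…\end{1}TREE…CAT\start{3}…\end{2}\end{3}TREE?

Full, automatic UTF-8 handling is really easy in Perl, and Perl lives and breathes regular expressions. I'll give it a try, although I don't know the answers to the questions @Kaz raised. There is also the question of how to handle graphemes with combining characters, because you can run into some odd cases, and I assume you don't want to match part of a grapheme.

If the requirements were more specific, this would be a rather neat (and challenging!) golf problem.

I was happy to find a solution, but I have an interesting question: why do you need something like this? :)

Nice. Definitely cleaner than my version. But this algorithm does not cover…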
    
    
    Loaded suite pattern
    Started
    ........
    Finished in 0.003910 seconds.
    
    8 tests, 21 assertions, 0 failures, 0 errors, 0 skips
    
    Test run options: --seed 31173
    