Regex: cleaning up a string with multiple angle brackets
Tags: regex, tidy
I have the following HTML code:
<div class="article">this is a div article content</div>
What I actually need, though, is:
<div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>
Right now this works, but it only replaces the first match inside the tag, i.e. div.
The following code may help:
class HTMLCleaner(object):
    def parse(self, html):
        output = []
        parsing_tag = False  # True while we are between '<' and its '>'
        html = iter(html)
        for char in html:
            if char == '<':
                if parsing_tag:
                    # A tag opened inside another tag: discard everything
                    # up to and including its closing '>'. The default
                    # value stops the loop if the input ends early.
                    drop_char = next(html, '>')
                    while drop_char != '>':
                        drop_char = next(html, '>')
                    continue
                parsing_tag = True
            elif char == '>':
                parsing_tag = False
            output.append(char)
        return ''.join(output)

html = '<<hl>div</hl> <hl>class</hl>="<hl>article</hl>">this is a <hl>div</hl> <hl>article</hl> content</<hl>div</hl>>'
parser = HTMLCleaner()
print(parser.parse(html))
The output for the given input is:
<div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>
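I believe this is what you are looking for.
The code basically drops every tag that opens while another tag has not finished being parsed.
Comment: Wouldn't it be easier to isolate the body of the original tag, run the highlighter on it, and then re-wrap the highlighted text in the tag?
Comment: I can't really do that, because the text arrives already highlighted, straight from Solr.
Comment: I like this solution, because I had gotten stuck trying to do it with a regex... sometimes you just have to think about it from a different angle.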
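The rule described in the answer (drop any tag that opens while another tag is still open) can also be sketched as a plain function that jumps ahead with `str.find` instead of consuming an iterator. This is only an illustrative variant, not the answerer's code: the name `strip_nested_tags` is hypothetical, and it assumes, as the answer does, that a `>` never appears inside an attribute value.

```python
def strip_nested_tags(html):
    """Drop every tag that opens while another tag is still open."""
    output = []
    inside_tag = False
    i = 0
    while i < len(html):
        char = html[i]
        if char == '<' and inside_tag:
            # A tag opened inside another tag: skip past its closing '>'.
            end = html.find('>', i)
            if end == -1:
                break  # unterminated nested tag: drop the remainder
            i = end + 1
            continue
        if char == '<':
            inside_tag = True
        elif char == '>':
            inside_tag = False
        output.append(char)
        i += 1
    return ''.join(output)


dirty = ('<<hl>div</hl> <hl>class</hl>="<hl>article</hl>">this is a '
         '<hl>div</hl> <hl>article</hl> content</<hl>div</hl>>')
print(strip_nested_tags(dirty))
```

Highlight tags in the text content are left alone because they open while no other tag is open, so running the function on already clean markup returns it unchanged.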
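Partially cleaned output referred to in the question, where only the first match inside the tag (div) was replaced:
<div <hl>class</hl>="<hl>article</hl>">this is a <hl>div</hl> <hl>article</hl> content</div>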
The same approach in JavaScript:
var stripIllegalTags = function(html) {
    var output = '',
        dropChar,
        parsingTag = false;
    for (var i = 0; i < html.length; i++) {
        var character = html[i];
        if (character == '<') {
            if (parsingTag) {
                // Nested tag: skip ahead to its closing '>' (or the end of the input).
                do {
                    i++;
                    dropChar = html[i];
                } while (i < html.length && dropChar != '>');
                continue;
            }
            parsingTag = true;
        } else if (character == '>') {
            parsingTag = false;
        }
        output += character;
    }
    return output;
};