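Skipping HTML tags in a PHP regular expression
Tags: php, html, regex, quotes, smart-quotes
I'm a stickler for proper English (yes, I know "stickler" and "proper English" are an oxymoron). I've built a CMS for my company's website, but one thing is really getting on my nerves: creating "smart" quotes in published content.
I have a regex that does this, but I run into problems when I hit HTML tags in the copy. For example, a story published through my CMS may contain a bunch of plain text plus some HTML tags, such as link tags, whose quotes I obviously don't want changed to "smart" quotes.
Fifteen years ago I was a Perl regex whiz, but I'm drawing a complete blank on this. What I want to do is process a string, ignore all text inside HTML tags, replace every other quote in the string with a "smart" quote, and then return the string with its HTML tags intact.
I have a function that I cobbled together to handle the most common scenarios I've run into in the CMS, but I hate that it's ugly and not at all elegant, and if unforeseen tags come along, my solution breaks completely.
The code (please don't laugh, it was cobbled together with half a bottle of Scotch) is ugly, as I said, and I'm open to more elegant solutions. It works, but it will break in the future if unforeseen tags come along. For the record, I'd like to reiterate that I'm not trying to get a regex to parse HTML tags; I'm trying to get it to ignore them while it parses all the rest of the text in the string.
Is there any solution? I've done a lot of searching online and can't seem to find one, and, somewhat to my surprise, I'm very unfamiliar with PHP's regex implementation.
OK. After SLaks suggested DOM parsing, I answered my own question, but now I have the problem that the regex doesn't work on the string that gets created. Here's my code: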
function educate_quotes($string) {
    $pattern = array(
        '/"(\w+)"/',      //quotes
        "/(\w+)'(\w+)/",  //apostrophe
        "/'(\w+)'/",      //single quotes
        "/'\b/",          //right single
        "/--/"            //emdash
    );
    $replace = array(
        "“"."$1"."”",     //quotes
        "$1"."’"."$2",    //apostrophe
        "’"."$1"."‘",     //single quotes
        "‘",              //right single
        "—"               //emdash
    );
    $xml = new DOMDocument();
    $xml->loadHTML($string);
    $text = (string)$xml->textContent;
    $smart = preg_replace($pattern,$replace,$text);
    $xml->textContent = $smart;
    $html = $xml->saveHTML();
    return $html;
}
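The DOM parsing works fine; the problem now is that my regex (I've changed it from the one above, but only after the one above was already failing on the newly created string) doesn't actually replace any of the quotes in the string.
Also, when the string contains imperfect HTML, I get the following annoying warning: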
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 2 in /home/leifw/now/cms_functions.php on line 418
Since I can't count on reporters to always write perfect HTML, that's a problem too.
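For what it's worth, here is a rough sketch of how the DOM route could be made to cope with both issues. Reading textContent returns the text with every tag stripped, so the converted text can't be mapped back onto the markup; walking the individual text nodes keeps the tags where they are. The loadHTML() warnings can be kept quiet with libxml_use_internal_errors(). The function name educate_text_nodes, the wrapper div and the $convert callback are all made up for illustration, and the XML encoding prefix is just a commonly used workaround to have the input read as UTF-8. It also assumes the fragment contains no stray closing div that would end the wrapper early.

```php
<?php
// Hypothetical helper: applies $convert to every text node and leaves tags
// and attribute values untouched.
function educate_text_nodes(string $html, callable $convert): string
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment in a div so its top-level nodes are easy to find again.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Rewrite text nodes only; attribute values such as href are never visited.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $node) {
        $node->nodeValue = $convert($node->nodeValue);
    }

    // Serialize just the children of the wrapper, not the implied html/body.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage with a deliberately tiny converter (double quotes only) and a stray </p>.
$result = educate_text_nodes(
    '<p>"This" has a <a href="http://somewhere.com">link</a></p></p>',
    function (string $text): string {
        return preg_replace(array('/"\b/', '/\b"/'), array('“', '”'), $text);
    }
);
```

Note that //text() also matches text inside script and style elements, so a real version would probably want to skip those.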
Is it possible to split on the HTML tags and then piece the string back together?
$text = "<div sdfas=\"sdfsd\" >ksdfsdf\"dfsd\" dfs </div> <span sdf='dsfs'> dfsd 'dsf ds' </span> ";
$new_text = preg_split("/(<.*?>)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
echo htmlspecialchars(print_r($new_text, 1));
So what you get is:
Array
(
[0] =>
[1] => <div sdfas="sdfsd" >
[2] => ksdfsdf"dfsd" dfs
[3] => </div>
[4] =>
[5] => <span sdf='dsfs'>
[6] => dfsd 'dsf ds'
[7] => </span>
[8] =>
)
Then you can run preg_replace on every piece that contains no < and piece the whole thing back together.
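A minimal sketch of that last step, assuming the same split pattern as above. Because the pattern has exactly one capture group and no PREG_SPLIT_NO_EMPTY flag, the pieces alternate: even indexes are text, odd indexes are tags, so the index can stand in for the "contains no <" check. The function name replace_outside_tags and the $convert callback are made up for illustration, and the little converter only handles double quotes.

```php
<?php
// Hypothetical helper: runs $convert on the text between tags and leaves the
// tags (including any quotes in their attributes) alone.
function replace_outside_tags(string $html, callable $convert): string
{
    // Same pattern as in the answer; the capture group keeps the tags in the result.
    $parts = preg_split('/(<.*?>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        // Even indexes hold text, odd indexes hold the captured tags.
        if ($i % 2 === 0 && $part !== '') {
            $parts[$i] = $convert($part);
        }
    }

    // Piece the whole thing back together.
    return implode('', $parts);
}

$smart = replace_outside_tags(
    '<a href="http://somewhere.com">"quotes"</a>',
    function (string $text): string {
        // Opening quote before a word, closing quote after a word.
        return preg_replace(array('/"\b/', '/\b"/'), array('“', '”'), $text);
    }
);
echo $smart; // <a href="http://somewhere.com">“quotes”</a>
```

One caveat with this pattern: without the s modifier the dot does not match newlines, so a tag that spans several lines would not be recognized as a tag.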
Following A. Lau's suggestion, I think I have a solution, and it turned out to be regex after all rather than an XML parser.
Here's my code:
$string = '<p>"This" <b>is</b> a "string" with <a href="http://somewhere.com">quotes</a> in it. <img src="blah.jpg" alt="This is an alt tag"></p><p>Whatever, you know?</p>';
// Split on tags; PREG_SPLIT_DELIM_CAPTURE keeps the tags themselves in the array.
$new_string = preg_split("/(<.*?>)/", $string, -1, PREG_SPLIT_DELIM_CAPTURE);
echo "<pre>";
print_r($new_string);
echo "</pre>";
// Convert quotes only in the pieces that are not tags.
for ($i = 0; $i < count($new_string); $i++) {
    $str = $new_string[$i];
    if ($str) {
        if (strpos($str, "<") === false) {
            $new_string[$i] = convert_quotes($str);
        }
    }
}
$str = join('', $new_string);
echo $str;

function convert_quotes($string) {
    // Patterns are applied in order, each on the result of the previous one.
    $pattern = array(
        '/\b"/',          // double quote after a word: right double
        '/"\b/',          // double quote before a word: left double
        '/"/',            // any remaining double quote: right double
        "/(\w+)'(\w+)/",  // apostrophe inside a word
        "/\b'/",          // single quote after a word: right single
        "/'\b/",          // single quote before a word: left single
        "/'$/",           // single quote at end of string: right single
        "/--/"            // em dash
    );
    $replace = array(
        "”",              // right double
        "“",              // left double
        "”",              // right double
        "$1"."’"."$2",    // apostrophe
        "’",              // right single
        "‘",              // left single
        "’",              // right single
        "—"               // em dash
    );
    return preg_replace($pattern, $replace, $string);
}
This outputs:
Array
(
[0] =>
[1] => <p>
[2] => "This"
[3] => <b>
[4] => is
[5] => </b>
[6] => a "string" with
[7] => <a href="http://somewhere.com">
[8] => quotes
[9] => </a>
[10] => in it.
[11] => <img src="blah.jpg" alt="This is an alt tag">
[12] =>
[13] => </p>
[14] =>
[15] => <p>
[16] => Whatever, you know?
[17] => </p>
[18] =>
)
<p>“This” <b>is</b> a “string” with <a href="http://somewhere.com">quotes</a> in it. <img src="blah.jpg" alt="This is an alt tag"></p><p>Whatever, you know?</p>
SLaks, I know that, but I figured that since I wasn't trying to parse the HTML, I'd do it this way.