Skipping HTML tags in a PHP regular expression

php html regex

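I'm a stickler for proper English (yes, I know "stickler" and "proper English" is an oxymoron). I've created a CMS for use on my company's websites, but one thing is really getting on my nerves: creating "smart" quotes in the published content.

I have a regex that does it, but I run into problems when I encounter HTML tags in the copy. For instance, a published story in my CMS might contain a bunch of plain text plus a few HTML tags, such as link tags, which contain quotes that I don't want changed into "smart" quotes, for obvious reasons.

Fifteen years ago I was a Perl regex ace, but I'm drawing a complete blank on this one. What I want to do is process a string, ignore all text inside HTML tags, replace all the quotes in the rest of the string with "smart" quotes, and then return the string with its HTML tags intact.

I have a function that I cobbled together to handle the most common scenarios I run into with the CMS, but I hate it: it's ugly, not at all elegant, and if unforeseen tags present themselves my solution breaks down completely.

Here's the code (please don't laugh, it was cobbled together over half a bottle of Scotch):

As I said, I know the code is ugly, and I'm open to more elegant solutions. It works, but it will break in the future if unforeseen tags show up. For the record, I'd like to reiterate that I'm not trying to get a regular expression to parse the HTML tags; I'm trying to get it to ignore them while it processes all the rest of the text in the string.

Any solutions? I've done a lot of searching online and can't seem to find one, which surprises me, and I'm quite unfamiliar with PHP's regular expression implementation.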

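OK. After SLaks suggested DOM parsing, I answered my own question, but now I have a problem: the regex doesn't work on the string that gets created. Here's my code: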

function educate_quotes($string) {
    $pattern = array(
        '/"(\w+)"/',       // double-quoted word
        "/(\w+)'(\w+)/",   // apostrophe inside a word
        "/'(\w+)'/",       // single-quoted word
        "/'\b/",           // remaining single quote before a word
        "/--/"             // em dash
    );

    $replace = array(
        '&#8220;$1&#8221;',  // left and right double quotes
        '$1&#8217;$2',       // apostrophe
        '&#8216;$1&#8217;',  // left and right single quotes
        '&#8216;',           // left single quote
        '&#8212;'            // em dash (U+2014); &#151; is a Windows-1252 code, not the Unicode em dash
    );

    $xml = new DOMDocument();
    $xml->loadHTML($string);
    $text = (string)$xml->textContent;
    $smart = preg_replace($pattern, $replace, $text);
    $xml->textContent = $smart;
    $html = $xml->saveHTML();
    return $html;
}


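The DOM parsing works fine; the problem now is that my regex (I've changed it from the one above, but only after the one above was already failing on the newly created string) isn't actually replacing any of the quotes in the string.

Also, I get the following annoying warning when there is imperfect HTML code in the string:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 2 in /home/leifw/now/cms_functions.php on line 418

Since I can't count on reporters always using perfect HTML code, that's a problem as well.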
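For what it's worth, here is a minimal sketch of how the DOM route could be made to touch only the text between tags while keeping libxml quiet about sloppy markup. The function name educate_text_nodes, the wrapper id smart-root, and the two simplified quote rules are assumptions made for illustration; they are not part of the code above. It inserts real Unicode characters rather than numeric entities, because a "&" written into a text node would be escaped when the document is saved.

```php
<?php
// Hypothetical helper: rewrite quotes only inside text nodes, leaving tags and attributes alone.
function educate_text_nodes(string $html): string
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // The wrapper div (id is an assumption of this sketch) lets us pull the fragment back out.
    // The leading XML encoding hint asks libxml to read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="smart-root">' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Visit every text node except those inside script or style elements.
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $text = $node->nodeValue;
        // Simplified rules: paired double quotes, then apostrophes inside words.
        $text = preg_replace('/"([^"]*)"/u', "\u{201C}\$1\u{201D}", $text) ?? $text;
        $text = preg_replace("/(\w)'(\w)/u", "\$1\u{2019}\$2", $text) ?? $text;
        $node->nodeValue = $text;
    }

    // Serialize only the children of the wrapper, not the html/body shell libxml adds.
    $root = $xpath->query('//div[@id="smart-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo educate_text_nodes('<p>"Hi" <a href="x.html">there</a>, it\'s me</p>');
```

Note that the parser still repairs broken markup as it sees fit, so the returned HTML may differ structurally from what the reporter typed; the sketch only suppresses the warnings.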

Is it possible to split based on the HTML <> tags and then piece it back together?

$text = "<div sdfas=\"sdfsd\" >ksdfsdf\"dfsd\" dfs </div> <span sdf='dsfs'> dfsd 'dsf ds' </span> ";
$new_text = preg_split("/(<.*?>)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
echo htmlspecialchars(print_r($new_text, 1));

So what you get is:

Array
(
    [0] => 
    [1] => <div sdfas="sdfsd" >
    [2] => ksdfsdf"dfsd" dfs 
    [3] => </div>
    [4] =>  
    [5] => <span sdf='dsfs'>
    [6] =>  dfsd 'dsf ds' 
    [7] => </span>
    [8] =>  
)

Then, for each piece that contains no <> tag, you can run preg_replace on it, and then piece the whole thing back together.
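The split-and-rejoin idea above could be sketched roughly as follows. The function name smarten_outside_tags and the three simplified replacement rules are assumptions for illustration only. The sketch relies on the fact that, with a single capture group and PREG_SPLIT_DELIM_CAPTURE, the tags land at odd indexes and the text between them at even indexes.

```php
<?php
// Hypothetical helper: apply quote replacements only to the text between tags.
function smarten_outside_tags(string $html): string
{
    // Captured tags end up at odd indexes, plain text at even indexes.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0 && $part !== '') {
            // Simplified rules: paired double quotes, apostrophes, then paired single quotes.
            $part = preg_replace('/"([^"]*)"/', '&#8220;$1&#8221;', $part);
            $part = preg_replace("/(\w)'(\w)/", '$1&#8217;$2', $part);
            $part = preg_replace("/'([^']*)'/", '&#8216;$1&#8217;', $part);
            $parts[$i] = $part;
        }
    }

    return implode('', $parts);
}

$text = "<div sdfas=\"sdfsd\" >ksdfsdf\"dfsd\" dfs </div> <span sdf='dsfs'> dfsd 'dsf ds' </span> ";
echo smarten_outside_tags($text);
// <div sdfas="sdfsd" >ksdfsdf&#8220;dfsd&#8221; dfs </div> <span sdf='dsfs'> dfsd &#8216;dsf ds&#8217; </span>
```

One caveat: a literal ">" inside an attribute value would end the tag match early, so this is only as robust as the markup is regular.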

Using A. Lau's suggestion, I think I have a solution, and it turned out to be a regex after all rather than an XML parser.

Here's my code:

$string = '<p>"This" <b>is</b> a "string" with <a href="http://somewhere.com">quotes</a> in it. <img src="blah.jpg" alt="This is an alt tag"></p><p>Whatever, you know?</p>';

// Split on tags, keeping each tag as its own array element.
$new_string = preg_split("/(<.*?>)/", $string, -1, PREG_SPLIT_DELIM_CAPTURE);

echo "<pre>";
print_r($new_string);
echo "</pre>";

// Convert quotes only in the pieces that are not tags.
for ($i = 0; $i < count($new_string); $i++) {
    $str = $new_string[$i];
    if ($str && strpos($str, "<") === false) {
        $new_string[$i] = convert_quotes($str);
    }
}

$str = implode('', $new_string);
echo $str;

function convert_quotes($string) {
    $pattern = array(
        '/\b"/',           // double quote right after a word: closing
        '/"\b/',           // double quote right before a word: opening
        '/"/',             // any remaining double quote: treated as closing
        "/(\w+)'(\w+)/",   // apostrophe inside a word
        "/\b'/",           // single quote right after a word: closing
        "/'\b/",           // single quote right before a word: opening
        "/'$/",            // single quote at end of string: closing
        "/--/"             // em dash
    );

    $replace = array(
        '&#8221;',         // right double quote
        '&#8220;',         // left double quote
        '&#8221;',         // right double quote
        '$1&#8217;$2',     // apostrophe
        '&#8217;',         // right single quote
        '&#8216;',         // left single quote
        '&#8217;',         // right single quote
        '&#8212;'          // em dash (U+2014); &#151; is a Windows-1252 code, not the Unicode em dash
    );
    return preg_replace($pattern, $replace, $string);
}

The split produces:

Array
(
    [0] => 
    [1] => <p>
    [2] => "This" 
    [3] => <b>
    [4] => is
    [5] => </b>
    [6] =>  a "string" with 
    [7] => <a href="http://somewhere.com">
    [8] => quotes
    [9] => </a>
    [10] =>  in it. 
    [11] => <img src="blah.jpg" alt="This is an alt tag">
    [12] => 
    [13] => </p>
    [14] => 
    [15] => <p>
    [16] => Whatever, you know?
    [17] => </p>
    [18] => 
)

and the rejoined string renders as:

“This” is a “string” with quotes in it. This is an alt tag

Whatever, you know?
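A variation on the same idea, in case it is useful: instead of splitting and rejoining, a single preg_replace_callback pass can match either a whole tag or a quoted phrase and hand tags back unchanged. This is a swapped-in technique, not the code above; the function name smarten_with_callback is hypothetical, and only paired double quotes are handled to keep the sketch short.

```php
<?php
// Hypothetical helper: one pass over the string, skipping anything that looks like a tag.
function smarten_with_callback(string $html): string
{
    // Alternative 1 matches a whole tag; alternative 2 matches a double-quoted phrase in text.
    $pattern = '/<[^>]*>|"([^"<>]*)"/';

    return preg_replace_callback($pattern, function (array $m): string {
        if ($m[0][0] === '<') {
            return $m[0]; // a tag: return it untouched, attribute quotes included
        }
        return '&#8220;' . $m[1] . '&#8221;';
    }, $html);
}

$string = '<p>"This" <b>is</b> a "string" with <a href="http://somewhere.com">quotes</a> in it. <img src="blah.jpg" alt="This is an alt tag"></p><p>Whatever, you know?</p>';
echo smarten_with_callback($string);
```

Because the tag alternative consumes everything from "<" to ">", the quotes inside href, src and alt are never offered to the second alternative. A quoted phrase that spans a tag, such as "<b>bold</b>", is left as it is by this sketch.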

SLaks, I know that, but I figured that since I wasn't trying to parse HTML, I'd go ahead and do it this way.