PHP regex optimization


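I am trying to optimize a PHP regular expression and am looking for guidance from the wonderful Stack Overflow community.

I am trying to capture predefined matches in a block of HTML, for example: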

##test##

##!test2##

##test3|id=5##

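A sample of the text it will run against:

Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus vitae pharetra.

So far I have two options. Which one is best from an optimization standpoint?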

Option 1

~##(!?)(test|test2|test3)(|\S+?)##~s

Option 2

~\##(\S+)##~s

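The `!` in the `##!test2##` example flags special behavior for that item while it is being processed. It could instead be moved into an attribute, such as `##test3|force=true&id=5##`. In that case there would be:

Option 3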

~##(test|test2|test3)(|\S+?)##~s

The biggest factors we care about are performance and optimization.

Thanks in advance for your help and insight.

As others have mentioned, you need to time your expressions. Python has the wonderful timeit module, whereas for PHP you need to come up with your own solution:

<?php

$string = <<<DATA
Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus vitae pharetra.
DATA;

function timeit($regex, $string, $number) {
    $start = microtime(true);

    for($i=0;$i<$number;$i++) {
        preg_match_all($regex, $string, $matches);
    }

    return microtime(true) - $start;
}

$expressions = ['~##(!?)(test|test2|test3)(|\S+?)##~s', '~\##(\S+)##~s', '~##(test|test2|test3)(|\S+?)##~s'];
$cnt = 1;
foreach ($expressions as $expression) {
    echo "Expression " . $cnt . " took " . timeit($expression, $string, 10**5) . "\n";
    $cnt++;
}
?>

Obviously, you can use other strings and more iterations, but this will give you a general idea.
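A single timed run can be skewed by whatever else the machine is doing, so one possible refinement is to repeat each measurement and keep the fastest repeat. The following is only a sketch under stated assumptions: `bestOf()` is a hypothetical helper name, the iteration and repeat counts are arbitrary, the subject string is shortened, and `hrtime()` requires PHP 7.3 or later.

```php
<?php
// Sketch only: bestOf() is a hypothetical helper, not a PHP built-in.

$string = 'Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. '
    . 'Curabitur ##test3|id=5## egestas ullamcorper sollicitudin.';

/**
 * Time $iterations calls to preg_match_all(), repeat that $repeats times,
 * and return the fastest repeat in milliseconds.
 */
function bestOf(string $regex, string $subject, int $iterations, int $repeats): float
{
    $best = INF;
    for ($r = 0; $r < $repeats; $r++) {
        $start = hrtime(true);                    // monotonic clock, in nanoseconds
        for ($i = 0; $i < $iterations; $i++) {
            preg_match_all($regex, $subject, $matches);
        }
        $elapsed = (hrtime(true) - $start) / 1e6; // nanoseconds to milliseconds
        $best = min($best, $elapsed);
    }
    return $best;
}

// The three expressions from the question.
$expressions = [
    'option 1' => '~##(!?)(test|test2|test3)(|\S+?)##~s',
    'option 2' => '~\##(\S+)##~s',
    'option 3' => '~##(test|test2|test3)(|\S+?)##~s',
];

$timings = [];
foreach ($expressions as $label => $expression) {
    $timings[$label] = bestOf($expression, $string, 2000, 5);
    printf("%s: %.3f ms (best of 5)\n", $label, $timings[$label]);
}
```

Taking the minimum rather than the mean is a design choice: it tends to filter out interference from other processes, at the cost of hiding variance.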

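If you need to analyze and process the matched substrings according to which characters occur in them, separating the components in the regex step seems the most logical approach. Once accuracy and ease of handling are settled, turn your attention to pattern optimization.

My pattern contains three capture groups; only the middle one requires a positive-length string. Negated character classes are used for pattern efficiency. I am assuming that your substrings will not contain the `#` that is used to delimit them. If they may contain `#`, please update your question and I will update my answer.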

Pattern explanation:

/          // pattern delimiter
##         // match leading substring delimiter
(!)?       // optionally capture: an exclamation mark
([^#|]+)   // greedily capture: one or more non-hash, non-pipe characters
\|?        // optionally match: a pipe
([^#]+)?   // optionally capture: one or more non-hash characters
##         // match trailing substring delimiter
/          // pattern delimiter

Code (the listing appears at the end of this page):

Output:

$m = array (
  0 => '##test##',
  1 => '',
  2 => 'test',
)

---
$m = array (
  0 => '##test3|id=5##',
  1 => '',
  2 => 'test3',
  3 => 'id=5',
)
post-pipe substring found

---
$m = array (
  0 => '##!test2##',
  1 => '!',
  2 => 'test2',
)
exclamation found

---
'Lorem ipsum dolor sit amet, [some replacement text] consectetur adipiscing elit. Pellentesque id congue massa. Curabitur [some replacement text] egestas ullamcorper sollicitudin. Mauris venenatis sed metus [some replacement text] vitae pharetra.'

If you are performing a custom replacement process, this approach will "optimize" your string handling.
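If the goal is to collect the components rather than replace them, the same pattern could feed a small parser. This is a sketch only: `parsePlaceholders()` is a hypothetical helper, and treating the post-pipe text as a query string (via `parse_str()`) is an assumption based on the `force=true&id=5` form mentioned in the question.

```php
<?php
// Sketch only: parsePlaceholders() is a hypothetical helper, and parsing the
// post-pipe text as a query string is an assumption, not a stated requirement.

function parsePlaceholders(string $text): array
{
    // Same pattern as in the answer: optional "!", name, optional "|attributes".
    preg_match_all('/##(!)?([^#|]+)\|?([^#]+)?##/', $text, $matches, PREG_SET_ORDER);

    $items = [];
    foreach ($matches as $m) {
        $attributes = [];
        // $m[3] is absent when nothing follows the name.
        parse_str($m[3] ?? '', $attributes);
        $items[] = [
            'name'       => $m[2],
            'flagged'    => ($m[1] ?? '') === '!',
            'attributes' => $attributes,
        ];
    }
    return $items;
}

$items = parsePlaceholders('a ##test## b ##!test2## c ##test3|force=true&id=5## d');
print_r($items);
```

`PREG_SET_ORDER` groups the captures per match, which keeps each placeholder's flag, name, and attributes together.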

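But how do I benchmark them and find out which one is best?

Run them, and look at the memory usage and the time it takes to run the code.

I agree with Andreas; the only way is to run a large-scale test (10000+) and measure and compare your results.

The earlier comments are correct, but you are missing other major issues. You need to escape the pipe symbol (`|`), as in `(\|\S+?)`. You do not need to escape the hash symbol (`#`). It is also not entirely clear which parameters the regex is supposed to match. For what you are trying to do, though, the simplest and probably fastest regex might look like this:

~##[^\s#]+?##~s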
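The effect of the unescaped pipe can be seen on one of the question's own tokens. In the sketch below, the variable names are illustrative, and making the escaped group optional with a trailing `?` is my own adaptation so that tokens without attributes still match.

```php
<?php
// Sketch only: variable names are illustrative.

$subject = '##test3|id=5##';

// Unescaped pipe: (|\S+?) means "empty string OR \S+?", not a literal pipe.
preg_match('~##(test|test2|test3)(|\S+?)##~', $subject, $unescaped);

// Escaped pipe, with the attribute group made optional (an adaptation).
preg_match('~##(test|test2|test3)(\|\S+?)?##~', $subject, $escaped);

var_export([$unescaped[1], $unescaped[2]]); // 'test' and '3|id=5'
echo "\n";
var_export([$escaped[1], $escaped[2]]);     // 'test3' and '|id=5'
echo "\n";
```

With the unescaped pipe, the first alternative `test` succeeds and the lazy `\S+?` absorbs the rest, so the name is captured as `test` instead of `test3`.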

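Avoid alternation wherever possible, because the engine has to enter each branch to find a path that succeeds; the best case is a match on the first branch. Less pattern usually means more efficiency. Apply modifiers only as needed: `s` affects `.`, which you are not even using. Be greedy where possible; the engine likes it: `~##[^#]*.##~`

Thanks! This benchmark script was very, very helpful.

@MrC: If it helped you, you can upvote it or accept it as the answer (the green check mark on the left).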
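To illustrate the earlier remark that the `s` modifier only changes how `.` behaves, here is a small check. It is a sketch only, and the two-line subject string is an assumption chosen to make the difference visible.

```php
<?php
// Sketch only: the two-line subject is an illustrative assumption.

$subject = "##a##\n##b##";

// \S never matches a newline, so the s modifier makes no difference here.
preg_match_all('~##(\S+)##~s', $subject, $withS);
preg_match_all('~##(\S+)##~', $subject, $withoutS);
$sameForNonSpace = ($withS === $withoutS);

// A dot does change: with s it can cross the newline and swallow both tokens.
preg_match_all('~##(.+)##~s', $subject, $dotWithS);
preg_match_all('~##(.+)##~', $subject, $dotWithoutS);

var_dump($sameForNonSpace);       // bool(true)
var_dump(count($dotWithS[1]));    // int(1)
var_dump(count($dotWithoutS[1])); // int(2)
```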

Code for the three-capture-group answer (`preg_replace_callback`):

$string='Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus ##!test2## vitae pharetra.';

$result=preg_replace_callback(
    '/##(!)?([^#|]+)\|?([^#]+)?##/',
    function($m){
        echo '$m = ';
        var_export($m);
        echo "\n";
        // execute custom processing:
        if(isset($m[1][0])){  //check first character of element (element will always be set because $m[2] will always be set)
            echo "exclamation found\n";
        }
        // $m[2] is required (will always be set)
        if(isset($m[3])){  // will only be set if there is a positive-length string in it
            echo "post-pipe substring found\n";
        }
        echo "\n---\n";
        return '[some replacement text]';
    },$string);

var_export($result);
