PHP: fastest way to find multiple keywords in a text?

php html arrays regex parsing

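I have a large array of keywords (more than a thousand), and I need to search a large HTML file to find which of the keywords are present in the text. I then need to return the indexes of the keywords that were found.

For example, if my array is: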

$keywords = array("love", "money", "minute", "loop"); // etc.

If there are instances of the words "money" and "loop", I would like to get this array:

$results = array("1", "3"); // first $keyword element is 0

I tried using preg_match_all, but I am not sure how to take $matches and return the indexes of the keywords.

Here is the code I have so far:

$keywords = array("love", "money", "minute", "loop");

$html = file_get_contents($url);

preg_match_all("#(love|money|minute|loop)#i", $html, $matches);

var_dump($matches);

The result looks like this:

array(2) {
  [0]=>
  array(4) {
    [0]=>
    string(6) "minute"
    [1]=>
    string(6) "minute"
    [2]=>
    string(5) "money"
    [3]=>
    string(5) "Money"
  }
  [1]=>
  array(4) {
    [0]=>
    string(6) "minute"
    [1]=>
    string(6) "minute"
    [2]=>
    string(5) "money"
    [3]=>
    string(5) "Money"
  }
}

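What is the fastest/best way to do this in PHP? Is preg_match_all the right tool? I want to avoid a foreach that would make my function crawl the whole HTML more than a thousand times (not very efficient).

How do I get the indexes of my keywords? For example, that the keywords found are number 0 and number 3, regardless of how many times each one occurs.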

You can use the PREG_OFFSET_CAPTURE flag to get the offsets:

$matches=[];
$html = "love and money make the world loop around in a loop three times per minute";
preg_match_all("#love|money|minute|loop#i", $html, $matches, PREG_OFFSET_CAPTURE);
foreach ($matches[0] as $m) echo $m[0]." found at index ".$m[1]."\n";

// output:
love found at index 0
money found at index 9
loop found at index 30
loop found at index 47
minute found at index 68

Whether this runs fast enough is something you can measure yourself. If it does, there is no need to look for more complicated alternatives.
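The offsets above are positions in the text, while the question asks for positions in $keywords. One possible way to bridge that gap is sketched below. It assumes plain ASCII keywords with no duplicates, and findKeywordIndexes is a hypothetical helper name that does not come from the thread.

```php
<?php
// Hypothetical helper: returns the indexes (into $keywords) of the keywords found in $text.
function findKeywordIndexes(array $keywords, string $text): array
{
    if ($keywords === []) {
        return [];
    }

    // Lookup table: lowercased keyword => its index in the original array.
    $indexByKeyword = array_flip(array_map('strtolower', $keywords));

    // Escape every keyword so regex metacharacters are matched literally.
    $alternatives = array_map(function ($keyword) {
        return preg_quote($keyword, '#');
    }, $keywords);
    $pattern = '#' . implode('|', $alternatives) . '#i';

    // A single pass over the text collects every occurrence.
    preg_match_all($pattern, $text, $matches);

    $found = [];
    foreach ($matches[0] as $match) {
        $key = strtolower($match);
        if (isset($indexByKeyword[$key])) {
            $found[$indexByKeyword[$key]] = true; // array keys deduplicate repeats
        }
    }

    $indexes = array_keys($found);
    sort($indexes);
    return $indexes;
}

$keywords = array("love", "money", "minute", "loop");
$html = "Money makes the world loop around in a loop";
print_r(findKeywordIndexes($keywords, $html)); // prints indexes 1 and 3
```

One caveat: with a plain alternation the regex engine takes the first alternative that matches at a given position, so a keyword that is a prefix of another keyword (for example "love" and "lovely") can hide the longer one.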

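Just an alternative that you do not see very often: str_word_count() with 2 as its second argument splits the string into an array of words keyed by their starting position. Then use array_intersect() to match that array against the keywords.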

$keywords = array("love", "money", "minute", "loop");
// string courtesy of Joni's answer
$html = "love and money make the world loop around in a loop three times per minute";
$words = str_word_count($html, 2);
$match = array_intersect($words, $keywords);
print_r($match);

This gives:

Array
(
    [0] => love
    [9] => money
    [30] => loop
    [47] => loop
    [68] => minute
)

I am not sure how this performs compared with a regex, so just try it.

Or, as a one-liner to save screen space:

print_r(array_intersect(str_word_count($html, 2), $keywords));

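If you only want to know whether the keywords are present, simply reverse the order of the arrays passed to array_intersect() (for a case-insensitive match, apply strtolower() to the text first).

This gives: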

Array
(
    [0] => love
    [1] => money
    [2] => minute
    [3] => loop
)
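The reversed call is only described in words above. A minimal sketch of what it might look like follows, assuming the keywords are already lowercase and using a shorter sample string than the one in the answer.

```php
<?php
$keywords = array("love", "money", "minute", "loop");
$html = "Love and MONEY make the world go around";

// Lowercase the text first so the comparison is case-insensitive,
// then split it into a plain list of words (format 1, no offsets needed).
$words = str_word_count(strtolower($html), 1);

// The keywords come first, so the result keeps the keys of $keywords.
$match = array_intersect($keywords, $words);
print_r($match); // keys 0 (love) and 1 (money)
```

Because the keys of the result are the keyword indexes, array_keys($match) would give the list the question asks for.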

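Update:

Performance-wise, my solution can be optimized by flipping the arrays, because checking whether a key exists is much faster than scanning the string values of each array.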

$match = array_flip(array_intersect_key(array_flip($keywords), array_flip(str_word_count(strtolower($html), 1))));

If you only need to see which keywords occur in the text, you can map stripos over the keyword array:

$result = array_map(function ($keyword) use (&$html) {
    return stripos($html, $keyword) !== false;
}, $keywords);

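Note that stripos finds a string inside another string. It has no notion of words, so if you do not want to match keywords that occur as part of a longer word, you need a regex with word boundaries. The expression you are currently using does not do that either, though, so this may not be an issue.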
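For readers who do need whole-word matching, here is a minimal sketch of the word-boundary variant mentioned above. It keeps the per-keyword structure of the stripos version, so it still scans the text once per keyword. The sample string is an assumption chosen to show the difference.

```php
<?php
$keywords = array("love", "money", "minute", "loop");
$html = "Gloves cost money";

$result = array_map(function ($keyword) use ($html) {
    // \b anchors the keyword at word boundaries; preg_quote escapes metacharacters.
    $pattern = '#\b' . preg_quote($keyword, '#') . '\b#i';
    return preg_match($pattern, $html) === 1;
}, $keywords);

var_dump($result); // only index 1 ("money") is true

// For comparison: stripos also reports "love", because it occurs inside "Gloves".
var_dump(stripos($html, "love") !== false); // bool(true)
```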

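Index? As in a word count? In "love peace and coding", if the keyword is "peace", do you want 1 returned? And does "code" count as part of the string above?

No. If you only care whether the substring exists, and not where or how many times, then "match all" is definitely not optimal. Regex comparisons are also slower than string comparisons.

preg_match_all is not designed to tell you the index of a match in a separate array. Write some code that takes the matches and looks them up in the array to get the index. It is a loop…

@Andreas By index I mean which keywords appear in the text: whether it is keyword 1, keyword 2 or keyword 75.

Doesn't this search the entire content once per keyword? If you have 1000 keywords, it may scan the whole string 1000 times. It stops scanning as soon as it finds a word, but if a keyword is not present it searches all the way to the end of the string.

It is simple, but I would expect this solution to perform worse than the ones in the other answers, because it scans the large document a thousand times.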
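The performance concern raised in these comments can be checked empirically. Below is a rough timing sketch that compares the stripos map with the flipped-array lookup from the other answer. The keyword list and filler text are synthetic assumptions, and the timings depend on the PHP version and the data, so treat the numbers as indicative only.

```php
<?php
// Synthetic data (assumption): 1000 made-up, letters-only keywords and a long filler text.
$keywords = [];
for ($i = 0; $i < 1000; $i++) {
    $keywords[] = 'kw' . chr(97 + $i % 26) . chr(97 + intdiv($i, 26) % 26) . chr(97 + intdiv($i, 676) % 26);
}
$html = str_repeat('lorem ipsum dolor sit amet ', 5000) . $keywords[10] . ' ' . $keywords[500];

// Approach 1: one stripos scan per keyword.
$start = microtime(true);
$byStripos = array_keys(array_filter(array_map(function ($keyword) use ($html) {
    return stripos($html, $keyword) !== false;
}, $keywords)));
$striposSeconds = microtime(true) - $start;

// Approach 2: split the text into words once, then look keywords up by array key.
$start = microtime(true);
$wordSet = array_flip(str_word_count(strtolower($html), 1));
$byFlip = array_values(array_intersect_key(array_flip($keywords), $wordSet));
$flipSeconds = microtime(true) - $start;

printf("stripos map:  %.4f s\n", $striposSeconds);
printf("flipped keys: %.4f s\n", $flipSeconds);
var_dump($byStripos === $byFlip); // both approaches report the same keyword indexes here
```

The two approaches agree on this data only because every keyword appears as a whole word. On real HTML, stripos also matches inside longer words and inside markup, so the results can differ.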