Problem getting the word count of a web page in PHP


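I am trying to get the word count of a domain's web page, but I get more words than expected. For example, on Google.com my function returns 180 words, while a manual count gives about 30. I noticed that it also includes words from the style tags and JavaScript tags, which is kind of weird. I also checked this, and it reported only 6. Where am I going wrong?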

function get_page_stats($domain) {
    $str = file_get_contents($domain);
    $str = strip_tags(strtolower($str));
    $words = str_word_count($str, 1);
    $words = array_count_values($words); // added as per Avinash Babu answer
    var_dump($words);
}
get_page_stats('http://google.com');

You can use array_count_values() for this.

A simple example:

<?php
$str = '<h1>Hello</h1> this will show <a href="ur_html_file">word</a> count of all word used this time... hello!';

print_r(array_count_values(str_word_count(strip_tags(strtolower($str)), 1)));
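
The question asks for a total, while array_count_values() returns per-word frequencies. A minimal sketch of how the two relate, reusing the sample string above (the variable names $frequencies and $total are illustrative and not part of the original answer):

```php
<?php
$str = '<h1>Hello</h1> this will show <a href="ur_html_file">word</a> count of all word used this time... hello!';

// Per-word frequencies, exactly as in the example above.
$frequencies = array_count_values(str_word_count(strip_tags(strtolower($str)), 1));

// The total number of words is the sum of all frequencies.
$total = array_sum($frequencies);

// Sort by frequency, highest first; arsort() keeps the word => count pairs intact.
arsort($frequencies);

echo "Total words: $total\n";                            // 13
echo "Distinct words: " . count($frequencies) . "\n";    // 10
```

Punctuation such as "..." and "!" is dropped by str_word_count(), so "time..." and "hello!" are counted as "time" and "hello".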

I managed to get good filtering by removing the style tags and script tags, together with their contents, from the whole page:

function get_page_stats($domain) {
    $str = file_get_contents($domain);
    // remove everything between the style tags
    $str = preg_replace('/<style\\b[^>]*>(.*?)<\\/style>/s', '', $str);
    // remove everything between the script tags
    $str = preg_replace('/<script\\b[^>]*>(.*?)<\\/script>/s', '', $str);
    // remove html tags
    $str = strip_tags(strtolower($str));
    // count the words
    $words = str_word_count($str, 1);
    $words = array_count_values($words);
    var_dump($words);
}
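
As an alternative to the regular expressions, the same filtering can be sketched with PHP's DOM extension. This swaps in a different technique from the one in the answer: it parses the HTML and collects only the text nodes that do not sit inside script, style, or noscript elements. The function name count_visible_words() is hypothetical, and the sketch assumes the dom extension is enabled and that the HTML has already been fetched into a string.

```php
<?php
// Hypothetical helper (not from the original answers): word frequencies of an
// HTML string, ignoring text inside <script>, <style> and <noscript> elements.
function count_visible_words(string $html): array {
    if (trim($html) === '') {
        return [];
    }

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node that has no script/style/noscript ancestor.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//text()[not(ancestor::script or ancestor::style or ancestor::noscript)]'
    );

    // Join the text nodes with spaces so adjacent elements do not run together.
    $parts = [];
    foreach ($nodes as $node) {
        $parts[] = $node->nodeValue;
    }

    $text = strtolower(implode(' ', $parts));
    return array_count_values(str_word_count($text, 1));
}

$html = '<html><head><style>body { color: red; }</style></head>'
      . '<body><h1>Hello</h1><p>hello world</p>'
      . '<script>var x = window.foo;</script></body></html>';

print_r(count_visible_words($html)); // hello => 2, world => 1
```

Joining the text nodes with spaces is a trade-off: it keeps "Hello" and "hello world" apart, but it also splits a word that is interrupted by inline markup, such as foo<b>bar</b>. The result still counts text the browser may hide with CSS, so it remains an approximation.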

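strip_tags does not remove what is between the tags; it only removes the tags themselves. At the very least you would have to restrict the check to the content between specific tags, but even that content can include scripts and styles, so you will probably want to filter those out too. It is hard to get an exact number for a page.

Well, it filters slightly better, but still not enough. google.com now gives 70, including values such as function, window, yaw, for, var... I don't see them on the page, only in the HTML source.

If you strip the tags and some of those tags contain script, you will be counting the JavaScript inside them, right?

@Hammerstein, exactly. I am trying to figure out how to do it with a regex, but I am not good at regex.

But as you said, a regex can help match strings in the web page and get the result.

Managed to get it done: I added $str = preg_replace('/<style\\b[^>]*>(.*?)<\\/style>/s', '', $str); $str = preg_replace('/<script\\b[^>]*>(.*?)<\\/script>/s', '', $str); after file_get_contents, and it filters very well.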
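
The point made in the comments, that strip_tags() removes only the tags and keeps whatever is between them, can be checked with a short sketch. The sample HTML string and variable names are made up for illustration:

```php
<?php
$html = '<p>Hello world</p> <script>var total = 1;</script>';

// strip_tags() drops the <p> and <script> tags but keeps the JavaScript source.
$stripped = strip_tags($html);
var_dump($stripped); // string(26) "Hello world var total = 1;"

// "var" and "total" are therefore counted as words alongside "Hello" and "world".
$count = str_word_count($stripped);
var_dump($count); // int(4)
```

This is why the script and style elements have to be removed together with their contents before strip_tags() is applied.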