Text: Extract all text from a website to build a concordance

text web

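How can I get all the text on a website? I don't just mean Ctrl+A/Ctrl+C. I'd like to be able to extract all the text from a website (and all its associated pages) and use it to build a concordance of the words on that site. Any ideas?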

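I was intrigued by this, so I've written the first part of a solution.

The code is written in PHP because of its convenient strip_tags function. It's also rough and procedural, but I feel it demonstrates my ideas.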

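<?php
$url = "http://www.stackoverflow.com";

// To use this you'll need to get a key for the Readability Parser API: http://readability.com/developers/api/parser
$token = "";

// Build the request URL. The page URL and token are URL-encoded so that
// characters such as "?" or "&" inside $url cannot corrupt the query string.
$requestUrl = "http://www.readability.com/api/content/v1/parser?url=" . urlencode($url) . "&token=" . urlencode($token);

// Make an HTTP GET request to the Readability API
$response = file_get_contents($requestUrl);
if ($response === false)
{
    die("The request to the parser API failed\n");
}

// Decode the returned JSON; only the content string in the JSON object is of interest
$parserResponse = json_decode($response);
if (!isset($parserResponse->content))
{
    die("The parser response did not contain a content field\n");
}
$content = $parserResponse->content;

// Strip the HTML tags from the article content
$wordsOnPage = strip_tags($content);

// Split on any run of whitespace (spaces, tabs, newlines) and drop empty strings
$wordSplit = preg_split('/\s+/', $wordsOnPage, -1, PREG_SPLIT_NO_EMPTY);

$wordCounter = array();

// Loop through each word in the article, keeping count of how many times it has been seen
foreach ($wordSplit as $word)
{
    incrementWordCounter($wordCounter, $word);
}

// Sort the array by count, ascending, so the most frequent words are at the end
asort($wordCounter);

// And dump the array
var_dump($wordCounter);

// Adds one to the count for $word; the counter array is passed by reference
// instead of being accessed through a global variable
function incrementWordCounter(array &$wordCounter, $word)
{
    if (isset($wordCounter[$word]))
    {
        $wordCounter[$word] = $wordCounter[$word] + 1;
    }
    else
    {
        $wordCounter[$word] = 1;
    }
}

?>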
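
The script above only counts words, whereas a concordance usually also records where each word occurs. As a rough sketch of that next step (not part of the original answer), the snippet below maps every word to the short contexts in which it appears. The function name buildConcordance, the $window parameter and the sample HTML are all hypothetical, chosen purely for illustration, and the tokenising rule (letters, digits and apostrophes count as word characters) is an assumption you may well want to change.

```php
<?php
// Hypothetical helper (not from the original answer): builds a simple
// keyword-in-context concordance from plain text.
// $window is the number of neighbouring words kept on each side of a hit.
function buildConcordance(string $text, int $window = 2): array
{
    // Assumption: a "word" is any run of letters, digits or apostrophes.
    $tokens = preg_split("/[^\\p{L}\\p{N}']+/u", $text, -1, PREG_SPLIT_NO_EMPTY);

    $concordance = [];
    foreach ($tokens as $i => $token) {
        // Lower-case the key so "The" and "the" share one entry;
        // fall back to strtolower() if the mbstring extension is missing.
        $key = function_exists('mb_strtolower')
            ? mb_strtolower($token, 'UTF-8')
            : strtolower($token);

        // Take up to $window words before and after the current word.
        $start = max(0, $i - $window);
        $context = array_slice($tokens, $start, $i - $start + $window + 1);

        $concordance[$key][] = implode(' ', $context);
    }

    // Concordances are conventionally listed alphabetically.
    ksort($concordance, SORT_STRING);

    return $concordance;
}

// Sample input standing in for the tag-stripped page content.
$sample = strip_tags('<p>The cat sat.</p> <p>The dog sat too.</p>');
$concordance = buildConcordance($sample, 1);

// Each entry lists the contexts in which the word was found:
// 'sat' => ['cat sat The', 'dog sat too']
print_r($concordance['sat']);
```

To use it with the script above, pass $wordsOnPage in place of $sample. Note that the contexts here run across sentence boundaries; if that matters, one option is to split the text into sentences first and build the contexts per sentence.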
