PHP: parse web page text content and display it in a <div> as is

php regex html

Here I am parsing the page text:

<?php
$url= 'http://www.paulgraham.com/herd.html';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
foreach($xpath->query("//script") as $script) {
    $script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; //inherited from DOMNode
$text=escapeshellarg($textContent);
$test = preg_replace("/[^a-zA-Z]+/", " ", html_entity_decode($text));

echo $test; // This gives the entire content on one line, losing the actual page text format
echo nl2br($textContent);  // This is not on a single line, but comes out in an unusual form

?>

I also tried an HTML tag for this, but it still displays the entire content on one line.

$test = preg_replace("/[^a-zA-Z]+/", " ", html_entity_decode($text));

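What should I change here so that I get paragraphs with line breaks, just like the original page?

I only need the text content, not the images, buttons, and everything else.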

What if you replace that line with:

$test = preg_replace("/<br>/", "\r\n", html_entity_decode($text));
$test = preg_replace("/<.+?>/", " ", $test);
$test = preg_replace("/[^a-zA-Z\r\n]+/", " ", $test);
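The three replacement steps can be sketched as follows. One assumption worth stating: the steps only have tags to work with when they run on a raw HTML string, because `$doc->textContent` no longer contains any markup. The function name `htmlToLetterLines` and the sample input are hypothetical, and `\n` is used instead of `\r\n` for brevity.

```php
<?php
// Hypothetical helper (not from the original post): applies the three
// replacement steps to a raw HTML string, assuming the tags are still present.
function htmlToLetterLines(string $html): string
{
    // Step 1: turn <br>, <br/> and <br /> into newlines.
    $text = preg_replace('/<br\s*\/?>/i', "\n", $html);
    // Step 2: replace every remaining tag with a space.
    $text = preg_replace('/<.+?>/s', ' ', $text);
    // Decode entities such as &amp; once the tags are gone.
    $text = html_entity_decode($text);
    // Step 3: collapse every run of characters that is neither a letter
    // nor a newline into a single space.
    $text = preg_replace('/[^a-zA-Z\n]+/', ' ', $text);
    // Drop the spaces left around each newline.
    $text = preg_replace('/ *\n */', "\n", $text);
    return trim($text);
}

// Made-up sample input for illustration.
$sample = 'First line.<br>Second <b>bold</b> line.<br />Third &amp; last.';
echo htmlToLetterLines($sample), PHP_EOL;
```

Keeping `\n` inside the negated character class is what lets the line breaks survive step 3. Regex-based tag stripping is fragile on real pages, so this is only an illustration of the order of the steps.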

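Try putting \n into the negated character class.

Why are you using escapeshellarg?

This [^a-zA-Z]+ simply removes everything that is not a letter. I suggest doing the replacement in several steps. First, change <br> to \r\n or \n. Second, remove the HTML tags with /<.+?>/. Third, remove everything else you want gone.

@PatrickEvans: It is necessary for me, because I take the entire content as a string argument and pass it to a program.

@Darka: You got it right, but I could not implement it. If you could add it to your answer, that would be great, and I could accept it as my answer.

Thanks, but could you add an explanation? You have not included a-z here, so where is it taken into account?

Why would you want to use it? If you do want to use it, see the updated answer.

@Darpa: It still does not give the expected result. Please check: `$doc->loadHTMLFile($url); libxml_clear_errors(); $xpath = new DOMXPath($doc); foreach($xpath->query("//script") as $script) { $script->parentNode->removeChild($script); } $textContent = $doc->textContent; //inherited from DOMNode $text=escapeshellarg($textContent); $test = preg_replace("/<br>/", "\r\n", html_entity_decode($text)); $test = preg_replace("/<.+?>/", " ", $test); echo $test; ?>`

@Darka: Thank you very much for your efforts, but the updated answer also gives the output on one line.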
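A likely reason the regex steps still produce one line is that `textContent` has already dropped every tag, so there is no `<br>` left to replace. A minimal sketch of one way around this, assuming the line breaks are added to the DOM tree before the text is read: the helper name `htmlToTextWithBreaks` and the list of block-level elements are assumptions for illustration, not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): extracts the text of an
// HTML string while keeping a line break for <br> and after block elements.
function htmlToTextWithBreaks(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Remove script and style elements so their contents stay out of the text.
    // iterator_to_array() snapshots the node list before the tree is modified.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Replace each <br> with a newline text node.
    foreach (iterator_to_array($xpath->query('//br')) as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    // Append a newline to each block-level element (assumed list, extend as needed).
    $blocks = '//p | //div | //li | //tr | //h1 | //h2 | //h3';
    foreach (iterator_to_array($xpath->query($blocks)) as $block) {
        $block->appendChild($doc->createTextNode("\n"));
    }

    $text = $doc->documentElement->textContent;
    // Collapse horizontal whitespace, then squeeze blank lines and stray spaces.
    $text = preg_replace('/[ \t\r]+/', ' ', $text);
    $text = preg_replace('/ *\n[ \n]*/', "\n", $text);
    return trim($text);
}

// Made-up sample input for illustration.
$sample = '<html><body><p>First paragraph.</p><p>Second<br>paragraph.</p>'
        . '<script>var x = 1;</script></body></html>';
echo htmlToTextWithBreaks($sample), PHP_EOL;
```

To show the result in a browser, the newlines still have to be made visible, for example with `nl2br(htmlspecialchars($text))` or by placing the text inside a `<pre>` element. If the text is later passed to an external program, `escapeshellarg()` would be applied as the last step, after all replacements.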