PHP: Preserve line breaks inside <p> tags using DOMXPath?
Tags: php, html, dom, xpath
I am currently using PHP and DOMXPath to get the contents of all <p> elements in a web page:
<?php
...
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");
foreach ($paragraphs as $paragraph){
echo $paragraph->textContent . "<br />";
}
Current output of the above PHP:
Some happy talk about our great product.We would love for you to buy it!
Random information and what notIsn't that cool?
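I also tried $paragraphs = $doc->getElementsByTagName("p"); and it gave me the same result.
Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate every word in a paragraph, and the current output does not allow that because the last word before a break runs into the first word after it.
If there is another way to retrieve the strings inside <p> elements while preserving <br> or '\n', that would be fine too.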
Edit
After further investigation, the HTML in question is actually a list of anchors separated by <br> tags, with no actual line breaks:
<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>
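As it turns out, this works properly with the original HTML given.
Update: Solved
With help from @ircmaxell's answer, and the comments left by @netcoder and @Gordon, this problem has been solved. It is not very elegant, but it will do for now.
For example: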
foreach ($paragraphs as $paragraph){
$p_text = new DOMDocument();
$p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
//Do whatever, in this case get all of the words in an array.
$words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
print_r($words);
}
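This takes the inner HTML of each paragraph (as suggested by @netcoder) and replaces instances of <br> with "\r\n" (as suggested by @ircmaxell), so the breaks are still present when textContent is read afterwards.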
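The DOMinnerHTML() helper called above is not shown in the post. A minimal sketch of what such a helper might look like follows; the implementation and the sample HTML are assumptions for illustration, not the poster's actual code. It relies on DOMDocument::saveHTML() accepting a node argument, which requires PHP 5.3.6 or later.

```php
<?php
// Hypothetical implementation of the DOMinnerHTML() helper used above.
// It serializes each child of the element and concatenates the results.
function DOMinnerHTML(DOMNode $element)
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML() with a node argument needs PHP >= 5.3.6.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

// Hypothetical sample input, loosely modeled on the question's markup.
$doc = new DOMDocument();
$doc->loadHTML('<p>Classic Checking<br> Interest Checking</p>');
$p = $doc->getElementsByTagName('p')->item(0);

$inner = DOMinnerHTML($p);
// Swap the <br> markup for a literal line break, as in the loop above.
$withBreaks = str_ireplace(array('<br>', '<br />'), "\r\n", $inner);

echo $inner, "\n";
```

Because the <br> survives in the serialized string, it can be swapped for any separator before the text is split into words.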
There is obviously still room for improvement, but it solves my problem for now.
Thanks, everyone, for the help.
Ben
One possibility:
echo simplexml_import_dom($paragraph)->asXML();
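As a rough illustration of why this helps: asXML() serializes the element with its markup, so the <br> elements stay visible in the string instead of being dropped the way textContent drops them. The sample HTML below is an assumption, not the page from the question.

```php
<?php
// Hypothetical sample input, loosely modeled on the question's markup.
$doc = new DOMDocument();
$doc->loadHTML('<p>Classic Checking<br> Interest Checking</p>');

$xpath = new DOMXPath($doc);
$paragraph = $xpath->query('/html/body//p')->item(0);

// Serialize the paragraph including its child markup.
$xml = simplexml_import_dom($paragraph)->asXML();
echo $xml, "\n";
```

The result is markup rather than plain text, so the tags still have to be replaced or stripped before splitting into words.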
What I would do is replace the <br> elements with literal line breaks:
$doc = new DOMDocument();
$doc->loadHTML($html);
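// Copy the live node list into an array first: replacing nodes while
// iterating the DOMNodeList directly can skip elements.
$brs = iterator_to_array($doc->getElementsByTagName('br'));
foreach ($brs as $node) {
$node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}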
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");
foreach ($paragraphs as $paragraph){
echo $paragraph->textContent . "<br />";
}
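Since the goal in the question is to separate every word, the same idea can be carried one step further. The following is a sketch under assumptions: the sample HTML is made up, and preg_split() on whitespace is one possible way to tokenize, not something the answer prescribes.

```php
<?php
// Hypothetical sample input, modeled on the anchor list in the question.
$html = '<html><body><p><a href="/a">Classic Checking</a><br> '
      . '<a href="/b">Interest Checking</a><br> <a href="/c">CDs</a></p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Snapshot the live node list so that replacing nodes cannot skip any.
foreach (iterator_to_array($doc->getElementsByTagName('br')) as $br) {
    $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
}

$xpath = new DOMXPath($doc);
$words = array();
foreach ($xpath->query('/html/body//p') as $paragraph) {
    // Split on any run of whitespace and drop empty strings.
    $parts = preg_split('/\s+/', $paragraph->textContent, -1, PREG_SPLIT_NO_EMPTY);
    $words = array_merge($words, $parts);
}

print_r($words);
```

Splitting on \s+ treats spaces and the inserted line breaks alike, so no separate replacement pass for "\r\n" is needed.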
I had the same situation. I use:
$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));
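A sketch of the full round trip for this placeholder approach might look like the following. The sample HTML and variable names are assumptions. The encoded tag is plain text to the parser, so it passes through textContent untouched and can be decoded afterwards.

```php
<?php
// Hypothetical sample input.
$html = '<p>Classic Checking<br>Interest Checking</p>';

$document = new DOMDocument();
// '<br>' becomes '%3Cbr%3E', which the parser treats as ordinary text.
$document->loadHTML(str_replace('<br>', urlencode('<br>'), $html));

$text = $document->getElementsByTagName('p')->item(0)->textContent;

// Decode the placeholder back into the original tag.
$restored = urldecode($text);

echo $text, "\n", $restored, "\n";
```

Note that urldecode() also converts any "+" into a space and decodes other percent sequences in the text, so replacing only the placeholder with str_replace() may be the safer way back when the content can contain those characters.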
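I use urldecode() to change it back for display or for insertion into the database.
Comments:
@Ben: Are you sure? What PHP version? It works as expected on PHP 5.3.3.
Note: preserves inner tags (e.g.: …
@netcoder: Pretty sure, although I won't rule out that I'm doing something wrong. Unfortunately, our host is on PHP 5.2.12.
@Ben: It works fine on PHP 5.2.10 as well. How are you outputting it? In a web browser? If so, what are you looking at: the rendered output or the page source?
@Ben: See …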