Filtering paragraphs from XML using PHP and HTMLPurifier, SimpleXmlElement or DOM
Tags: php, dom, xpath
I am trying to remove the social media buttons from the description field of this XML, leaving only the paragraphs (it is too large to post here).
Edit: since some people cannot access the XML, here is part of the content of one of the description tags:
<description>
<!-- TWITTER https://twitter.com/about/resources/buttons#tweet --> <script> document.write('<a href="https://www.twitter.com/tst_oficial" class="twitter-follow-button" data-show-count="false" data-lang="pt">Seguir</a>'); !function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src="//platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
<!-- CURTIR SITE FACEBOOK (Enviar) --> <iframe class="fb_ltr" src="http://www.facebook.com/plugins/like.php?href=https://www.facebook.com/TSTJus&layout=button_count&show_faces=false&action=like&colorscheme=light&width=25&height=25&locale=pt_BR" scrolling="no" frameborder="0" style="border:0px; margin-left:30px; overflow:hidden; width:120px; height:25px;vertical-align:bottom;" allowTransparency="true"></iframe>
<!-- GOOGLE PLUS +1--> <script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script>
<g:plusone size="medium" href="https://plus.google.com/103151838647081346830" style="border-left:-200px"></g:plusone>
</div> </br></br>
<div class="modelo_noticia">
<div>
<div style="float: left; width:47%; text-align:center; margin: 0 9px 0 0;"><a href="/image/journal/article?img_id=5733388&t=1377023456174" target="_blank" style="text-decoration:none; color:black;"><img src="/image/journal/article?img_id=5733388&t=1377023456174" style="margin: 0 5px; width:98%;"/><span style="font-style:italic;"></span> </a></div>
<p> </p>
<p style="text-align: justify;"> <span style="font-size:12px;">"A CLT continua atual enq...a.</span></p>
<p style="text-align: justify;"> <span style="font-size:12px;">...or.</span></p>
<p style="text-align: justify;"> <span style="font-size:12px;">O min...do".</span></p>
<p style="text-align: justify;"> <span style="font-size:12px;">Ca...as".</span></p>
<p style="text-align: justify;"> <span style="font-size:12px;">Ao enc...izou.</span></p>
<p style="text-align: justify;"> <span style="font-size:12px;">Também parti...o.</span></p>
<p style="text-align: justify;"> <span style="font-size:12px;">Ao a...ócio".</span></p>
<p style="text-align: justify;"> <span style="font-size:12px;"><strong>Debate: reforma na CLT</strong></span></p>
<p style="text-align: justify;"> <span style="font-size:12px;">O min...s.</span></p>
<p style="text-align: justify;"> <span style="font-size:12px;">Ao...disse.</span></p>
<p style="text-align: justify;"> <span style="font-size:12px;">O m...o o país". </span></p> <p style="text-align: justify;"> <span style="font-size:12px;">(Fernanda Loureiro)</span></p>
</div>
<div style="clear:both;"></div>
</div>
<DIV style="vertical-align:bottom !important">
<!-- FACEBOOK CURTIR --> <!-- <script src="http://connect.facebook.net/pt_BR/all.js#xfbml=1"></script>
<fb:like layout="button_count" show_faces="true" width="80"></fb:like>-->
<iframe class="fb_ltr" src="http://www.facebook.com/plugins/like.php?href=http://www.tst.jus.br/noticias/-/asset_publisher/89Dk/content/{rss=true}&layout=button_count&show_faces=false&action=like&colorscheme=light&width=25&height=25&locale=pt_BR" scrolling="no" frameborder="0" style="border:none;border:0;margin-left:0; overflow:hidden; width:95px; height:25px;horizontal-align:left;vertical-align:bottom;" allowTransparency="true"></iframe>
<!-- TWITTAR --> <span style="margin-left:20px;"> <script type="text/javascript"> var endereco; endereco = window.location.href; document.write('<a href="http://twitter.com/share?url=' + endereco + '" class="twitter-share-button" data-text="Presidente do TST diz que trabalho precisa ser valorizado sem perda de competitividade" data-count="horizontal" data-via="tst_oficial">Tweet</a>') </script><script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script> </span>
<!-- OK FACEBOOK Recomendar --> <!--<iframe id="f2ee48257c" name="f1f8d54994" frameborder="0" scrolling="no" style="border: none; overflow: hidden; height: 20px; width: 200px;" title="Like this content on Facebook." class="fb_ltr" src="http://www.facebook.com/plugins/like.php?api_key=228619377180035&locale=pt_BR&sdk=joey&channel_url=http://www.facebook.com/TSTJus?fref=ts&version=18%23cb%3Df360a99c9c&origin=http://www.tst.jus.br/noticias&href=http://www.tst.jus.br/noticias%26relation%3Dparent.parent&node_type=link&width=150&font=arial&layout=button_count&colorscheme=light&show_faces=false&send=true&extended_social_context=false&action=recommend" allowTransparency="true"></iframe>-->
<iframe border="0" frameborder="0" scrolling="no" class="fb_ltr" id="f2ee48257c" name="f1f8d54994" style="border:none;margin-left:0; overflow:hidden; width:200px; height:25px;horizontal-align:left;vertical-align:bottom;" allowTransparency="true" title="Enviar notícia no Facebook" class="fb_ltr" src="http://www.facebook.com/plugins/like.php?api_key=228619377180035&locale=pt_BR&sdk=joey&channel_url=http://www.tst.jus.br/noticias%3Fversion%3D18%23cb%3Df360a99c9c%26origin%3Dhttp://www.tst.jus.br/noticias%26relation%3Dparent.parent&href=http://www.tst.jus.br/noticias&node_type=link&width=150&font=arial&layout=button_count&colorscheme=light&show_faces=false&send=true&extended_social_context=false&action=recommend"></iframe>
<!-- YOUTUBE --> <a href="http://www.youtube.com/tst" target="_blank"> <img src="http://www.tst.jus.br/image/image_gallery?uuid=49d1dfeb-fba6-48be-9984-c2ba7dac709e&groupId=10157&t=1359131490760" border="0" title="Inscrição no Canal Youtube do TST" alt="Inscrição no Canal Youtube do TST"></a>
</DIV> </br>
</description>
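I have already tried regular expressions, but I could only get the first paragraph ('#<p[^>]*>(.*)</p>#isU'). With SimpleXmlElement and DOM I kept getting errors (I don't know them well, but they seem to be the best approach), and finally I tried HTMLPurifier, which filters everything out and returns nothing relevant.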
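One possible reason the regex returned only the first paragraph is that preg_match() stops at the first match. This is an assumption, since the question does not show the call. A minimal sketch with preg_match_all(), using a made-up two-paragraph string in place of the real description and an assumed reconstruction of the attempted pattern:

```php
<?php
// Hypothetical sample input; the real description is much longer.
$html = '<p style="text-align: justify;"><span>First.</span></p>'
      . '<p style="text-align: justify;"><span>Second.</span></p>';

// preg_match() returns after the first match; preg_match_all() collects all of them.
// The U modifier makes (.*) ungreedy, so each match ends at the nearest </p>.
preg_match_all('#<p[^>]*>(.*)</p>#isU', $html, $matches);

// $matches[1] holds the inner HTML of each paragraph; strip the <span> wrappers.
$paragraphs = array_map('strip_tags', $matches[1]);

print_r($paragraphs);
```

Regular expressions remain fragile on markup like this (comments, scripts, nested tags), which is why a DOM parser is usually the safer choice.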
Here is how I ended up doing it (following Puggan Se's suggestion; the full code is at the end of this post).
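Do you have any idea how I can improve it?
Thank you very much for your help!

Is $description XML? Could you parse it, use XPath to get all the p elements, and then echo only the content of each p?

Please post at least a valid snippet of your XML; I can't access the link.

Sorry, I don't know why you can't access it, but I have edited the question to include the most important part of the XML. Thank you!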
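The suggestion in the comments (parse the description, then select the p elements with XPath) could look roughly like the sketch below. The helper name extractParagraphs() and the sample markup are hypothetical, and the XML-declaration prefix is a common workaround for making loadHTML() treat the input as UTF-8, not something taken from the question.

```php
<?php
// Sketch only: extractParagraphs() is a hypothetical helper, not part of the question.
function extractParagraphs(string $html): array
{
    // Real-world markup (custom tags, stray closing tags) triggers libxml
    // warnings; collect them internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The declaration prefix hints to libxml that the fragment is UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//p') as $p) {
        // textContent drops the <span>/<strong> wrappers; treat non-breaking
        // spaces as ordinary spaces, then skip paragraphs that end up empty.
        $text = trim(str_replace("\u{00A0}", ' ', $p->textContent));
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return $paragraphs;
}

// Made-up fragment shaped like the description above.
$sample = '<script>var x = 1;</script>'
        . '<iframe src="http://example.com"></iframe>'
        . '<div><p> </p>'
        . '<p style="text-align: justify;"><span>First paragraph.</span></p>'
        . '<p><span><strong>Second</strong> paragraph.</span></p></div>';

$result = extractParagraphs($sample);
print_r($result);
```

Selecting //p rather than //p/span keeps paragraphs whose text is not wrapped in a span, and the script and iframe elements are simply never selected.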
// Final version (following Puggan Se's suggestion)
$i = 0;
$feed = '<XML STRING>'; // the whole XML string goes here
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // drop whitespace-only text nodes
$dom->loadXML($feed, LIBXML_PARSEHUGE); // LIBXML_PARSEHUGE for long XMLs
$dom->formatOutput = true; // only affects saveXML() output
$xml = new DOMXPath($dom);
// register the same prefix that the dc:date query below uses
$xml->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');
// the description HTML is not well-formed; keep parser warnings out of the output
libxml_use_internal_errors(true);

// evaluate the node lists
$source = $xml->evaluate("//channel/title");
$titles = $xml->evaluate("//item/title");
$links = $xml->evaluate("//item/link");
$dates = $xml->evaluate("//item/dc:date");
$descriptions = $xml->evaluate("//item/description");

// echo the channel's title
if ($source->length > 0) {
    $source = $source->item(0)->nodeValue;
    echo $source . '<br /><br />';
}

// echo the items
foreach ($titles as $title) {
    echo "{$titles->item($i)->nodeValue}<br /><br />";
    echo "{$links->item($i)->nodeValue}<br /><br />";
    echo "{$dates->item($i)->nodeValue}<br /><br />";
    // keep only the <p><span> text from <description>
    $description = "{$descriptions->item($i)->nodeValue} ";
    // fixed: the original passed an undefined $conteudo here
    // ('HTML-ENTITIES' is deprecated as of PHP 8.2)
    $description = mb_convert_encoding($description, 'HTML-ENTITIES', 'UTF-8');
    $domtmp = new DOMDocument();
    $domtmp->loadHTML($description);
    $xmltmp = new DOMXPath($domtmp);
    $desc = $xmltmp->evaluate("//p/span");
    foreach ($desc as $node) {
        echo "<p>{$node->nodeValue}</p>";
    }
    $i++;
}
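One possible improvement: the loop above walks several parallel node lists with a shared counter $i, so the lists fall out of step as soon as one item lacks a link or a dc:date. A sketch of iterating over the item elements and evaluating each field relative to the current item follows. The two-item feed is made up for illustration and only assumes the same dc namespace as the question.

```php
<?php
// Hypothetical two-item feed standing in for the real one.
$feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example channel</title>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <dc:date>2013-08-20T00:00:00Z</dc:date>
    </item>
    <item>
      <title>Second item</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
XML;

$dom = new DOMDocument();
$dom->loadXML($feed);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');

$items = [];
foreach ($xpath->query('//item') as $item) {
    // The second argument makes each expression relative to the current <item>;
    // string() yields '' when the child element is missing.
    $items[] = [
        'title' => $xpath->evaluate('string(title)', $item),
        'link'  => $xpath->evaluate('string(link)', $item),
        'date'  => $xpath->evaluate('string(dc:date)', $item),
    ];
}

print_r($items);
```

With this shape, a missing element gives an empty string for that item only, instead of shifting every later date onto the wrong title.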