Filtering paragraphs out of XML using PHP and HTMLPurifier, SimpleXmlElement or DOM


php dom xpath


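I'm trying to remove the social media buttons from the description field of this XML, leaving only the paragraphs (the XML is too large to post here).

EDIT: since some people couldn't access the XML, here is part of the content of one of the description tags: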

    <description>
 <!-- TWITTER https://twitter.com/about/resources/buttons#tweet --> <script> document.write('<a href="https://www.twitter.com/tst_oficial" class="twitter-follow-button" data-show-count="false" data-lang="pt">Seguir</a>'); !function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src="//platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
 <!-- CURTIR SITE FACEBOOK (Enviar) --> <iframe class="fb_ltr" src="http://www.facebook.com/plugins/like.php?href=https://www.facebook.com/TSTJus&layout=button_count&show_faces=false&action=like&colorscheme=light&width=25&height=25&locale=pt_BR" scrolling="no" frameborder="0" style="border:0px; margin-left:30px; overflow:hidden; width:120px; height:25px;vertical-align:bottom;" allowTransparency="true"></iframe>
 <!-- GOOGLE PLUS +1--> <script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script> 
 <g:plusone size="medium" href="https://plus.google.com/103151838647081346830" style="border-left:-200px"></g:plusone>
 </div> </br></br> 
 <div class="modelo_noticia">
  <div>
   <div style="float: left; width:47%; text-align:center; margin: 0 9px 0 0;"><a href="/image/journal/article?img_id=5733388&t=1377023456174" target="_blank" style="text-decoration:none; color:black;"><img src="/image/journal/article?img_id=5733388&t=1377023456174" style="margin: 0 5px; width:98%;"/><span style="font-style:italic;"></span> </a></div>
   <p> &nbsp;</p>
   <p style="text-align: justify;"> <span style="font-size:12px;">"A CLT continua atual enq...a.</span></p>
   <p style="text-align: justify;"> <span style="font-size:12px;">...or.</span></p>
   <p style="text-align: justify;"> <span style="font-size:12px;">O min...do".</span></p>
   <p style="text-align: justify;"> <span style="font-size:12px;">Ca...as".</span></p>
   <p style="text-align: justify;"> <span style="font-size:12px;">Ao enc...izou.</span></p> 
   <p style="text-align: justify;"> <span style="font-size:12px;">Também parti...o.</span></p>
   <p style="text-align: justify;"> <span style="font-size:12px;">Ao a...ócio".</span></p> 
   <p style="text-align: justify;"> <span style="font-size:12px;"><strong>Debate: reforma na CLT</strong></span></p>
   <p style="text-align: justify;"> <span style="font-size:12px;">O min...s.</span></p>
   <p style="text-align: justify;"> <span style="font-size:12px;">Ao...disse.</span></p>
   <p style="text-align: justify;"> <span style="font-size:12px;">O m...o o país". &nbsp;&nbsp;</span></p>  <p style="text-align: justify;"> <span style="font-size:12px;">(Fernanda Loureiro)</span></p>
  </div>
  <div style="clear:both;"></div>
 </div>
 <DIV style="vertical-align:bottom !important">
  <!-- FACEBOOK CURTIR --> <!-- <script src="http://connect.facebook.net/pt_BR/all.js#xfbml=1"></script>
  <fb:like layout="button_count" show_faces="true" width="80"></fb:like>-->
  <iframe class="fb_ltr" src="http://www.facebook.com/plugins/like.php?href=http://www.tst.jus.br/noticias/-/asset_publisher/89Dk/content/{rss=true}&layout=button_count&show_faces=false&action=like&colorscheme=light&width=25&height=25&locale=pt_BR" scrolling="no" frameborder="0" style="border:none;border:0;margin-left:0; overflow:hidden; width:95px; height:25px;horizontal-align:left;vertical-align:bottom;" allowTransparency="true"></iframe>
  <!-- TWITTAR --> <span style="margin-left:20px;"> <script type="text/javascript"> var endereco; endereco = window.location.href; document.write('<a href="http://twitter.com/share?url=' + endereco + '" class="twitter-share-button" data-text="Presidente do TST diz que trabalho precisa ser valorizado sem perda de competitividade" data-count="horizontal" data-via="tst_oficial">Tweet</a>') </script><script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script> </span>
  <!-- OK FACEBOOK Recomendar --> <!--<iframe id="f2ee48257c" name="f1f8d54994" frameborder="0" scrolling="no" style="border: none; overflow: hidden; height: 20px; width: 200px;" title="Like this content on Facebook." class="fb_ltr" src="http://www.facebook.com/plugins/like.php?api_key=228619377180035&amp;locale=pt_BR&sdk=joey&channel_url=http://www.facebook.com/TSTJus?fref=ts&version=18%23cb%3Df360a99c9c&origin=http://www.tst.jus.br/noticias&href=http://www.tst.jus.br/noticias%26relation%3Dparent.parent&node_type=link&width=150&font=arial&layout=button_count&colorscheme=light&show_faces=false&send=true&extended_social_context=false&action=recommend" allowTransparency="true"></iframe>-->
  <iframe border="0" frameborder="0" scrolling="no" class="fb_ltr" id="f2ee48257c" name="f1f8d54994" style="border:none;margin-left:0; overflow:hidden; width:200px; height:25px;horizontal-align:left;vertical-align:bottom;" allowTransparency="true" title="Enviar notícia no Facebook" class="fb_ltr" src="http://www.facebook.com/plugins/like.php?api_key=228619377180035&locale=pt_BR&sdk=joey&channel_url=http://www.tst.jus.br/noticias%3Fversion%3D18%23cb%3Df360a99c9c%26origin%3Dhttp://www.tst.jus.br/noticias%26relation%3Dparent.parent&amp;href=http://www.tst.jus.br/noticias&node_type=link&amp;width=150&amp;font=arial&amp;layout=button_count&amp;colorscheme=light&show_faces=false&send=true&amp;extended_social_context=false&action=recommend"></iframe> 
  <!-- YOUTUBE --> <a href="http://www.youtube.com/tst" target="_blank"> <img src="http://www.tst.jus.br/image/image_gallery?uuid=49d1dfeb-fba6-48be-9984-c2ba7dac709e&groupId=10157&t=1359131490760" border="0" title="Inscrição no Canal Youtube do TST" alt="Inscrição no Canal Youtube do TST"></a>
 </DIV> </br>
</description>



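I've already tried a regular expression, but it only returns the first paragraph ('#<p[^>]*>(.*)</p>#isU'). With SimpleXmlElement and DOM I keep getting errors (I don't know them well, but they seem to be the best approach). Finally I tried HTMLPurifier, which filters out everything and returns nothing relevant.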
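One possible explanation for getting only the first paragraph, assuming the pattern was run through preg_match(): that function stops at the first match, whereas preg_match_all() collects every match. The sketch below uses a made-up, shortened $html string in place of a real description body; it is an illustration of the difference, not a recommendation to parse HTML with regular expressions.

```php
<?php
// Hypothetical, shortened stand-in for one <description> body.
$html = '<p> &nbsp;</p>'
      . '<p style="text-align: justify;"> <span style="font-size:12px;">First paragraph.</span></p>'
      . '<p style="text-align: justify;"> <span style="font-size:12px;">Second paragraph.</span></p>';

$pattern = '#<p[^>]*>(.*)</p>#isU';

// preg_match() stops after the first <p>...</p> it finds.
preg_match($pattern, $html, $first);

// preg_match_all() keeps scanning until the end of the string.
preg_match_all($pattern, $html, $all);

// $all[1] holds the inner HTML of every paragraph; strip the remaining tags.
$texts = array_map(
    static fn (string $inner): string => trim(strip_tags($inner)),
    $all[1]
);
```

Here $first[1] only ever contains the first (empty) paragraph, while $texts has one entry per paragraph. Entities such as &nbsp; are left undecoded in this sketch.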
Here is how I ended up doing it, following Puggan Se's suggestion (the full code is at the end of this post).

Do you know how I could improve it?

Thank you very much for your help!
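Is $description XML? Can you parse it, then use XPath to get all the p elements and echo only the content of each p?
Please post at least a valid snippet of your XML; I can't access the link.
Sorry, I don't know why you can't access it, but I've edited the question to include the most important part of the XML. Thank you!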
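The suggestion in the first comment (parse the description, then query every p with XPath) could look roughly like the sketch below. The function name extractParagraphs and the $sample string are made up for illustration. The sketch assumes the description body is an HTML fragment rather than well-formed XML, which is why it uses loadHTML(), and it does not deal with non-ASCII input encoding.

```php
<?php
/**
 * Return the text of every non-empty <p> in an HTML fragment.
 * Hypothetical helper; not part of any library.
 */
function extractParagraphs(string $html): array
{
    // The feed's markup is not valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $paragraphs = [];
    foreach ($xpath->query('//p') as $p) {
        // Trim ordinary whitespace and the U+00A0 left behind by &nbsp;.
        $text = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $p->textContent);
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return $paragraphs;
}

// Made-up sample shaped like the description shown in the question.
$sample = '<script>document.write("x");</script>'
        . '<div class="modelo_noticia"><p> &nbsp;</p>'
        . '<p style="text-align: justify;"> <span style="font-size:12px;">First paragraph.</span></p>'
        . '<p style="text-align: justify;"> <span style="font-size:12px;"><strong>Debate</strong></span></p></div>'
        . '<iframe src="http://www.facebook.com/plugins/like.php"></iframe>';

$paragraphs = extractParagraphs($sample);
```

Querying //p instead of //p/span keeps paragraphs whose text is not wrapped in a span, and the empty check drops filler paragraphs such as the one holding only &nbsp;.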
    $i = 0;
    $feed = '<XML STRING>'; // the whole XML string goes here

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;        // drop whitespace-only text nodes
    $dom->loadXML($feed, LIBXML_PARSEHUGE);  // LIBXML_PARSEHUGE for long XMLs
    $dom->formatOutput = true;               // only affects saveXML() output

    $xml = new DOMXPath($dom);
    // register the prefix that the dc:date query below actually uses
    $xml->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');

    $source       = $xml->evaluate('//channel/title');
    $titles       = $xml->evaluate('//item/title');
    $links        = $xml->evaluate('//item/link');
    $dates        = $xml->evaluate('//item/dc:date');
    $descriptions = $xml->evaluate('//item/description');

    // echo the channel's title
    if ($source->length > 0) {
        echo $source->item(0)->nodeValue . ' ';
    }

    // the description markup is not valid HTML; keep parser warnings quiet
    libxml_use_internal_errors(true);

    // echo the items
    foreach ($titles as $title) {
        echo "{$title->nodeValue} ";
        echo "{$links->item($i)->nodeValue} ";
        echo "{$dates->item($i)->nodeValue} ";

        // keep only the text from <description>
        $description = $descriptions->item($i)->nodeValue;
        // was $conteudo, an undefined variable
        // note: 'HTML-ENTITIES' is deprecated as of PHP 8.2
        $description = mb_convert_encoding($description, 'HTML-ENTITIES', 'UTF-8');

        $domtmp = new DOMDocument();
        $domtmp->loadHTML($description);
        $xmltmp = new DOMXPath($domtmp);
        foreach ($xmltmp->evaluate('//p/span') as $node) {
            echo $node->nodeValue;
        }
        $i++;
    }
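As an alternative to selecting the paragraphs, the original goal (removing the social media buttons) can also be approached by deleting the unwanted nodes and keeping whatever is left. The sketch below is one way to do that under a few assumptions: the function name stripWidgets is made up, the list of unwanted nodes (script, iframe, comments) is guessed from the sample description, and non-ASCII encoding is not handled.

```php
<?php
/**
 * Remove script, iframe and comment nodes from an HTML fragment and
 * return the remaining body markup. Hypothetical helper.
 */
function stripWidgets(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array before modifying the tree.
    $unwanted = iterator_to_array(
        $xpath->query('//script | //iframe | //comment()'),
        false
    );
    foreach ($unwanted as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize only the children of <body>, not the html/body wrapper.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Made-up sample shaped like the description shown in the question.
$clean = stripWidgets(
    '<!-- TWITTER --><script>document.write("x");</script>'
    . '<div class="modelo_noticia"><p>Only the text.</p></div>'
    . '<iframe class="fb_ltr" src="http://www.facebook.com/plugins/like.php"></iframe>'
);
```

This keeps the surrounding div and any inline formatting, so it suits cases where the cleaned HTML is displayed again; for plain text, selecting the paragraphs as in the code above is simpler. Other leftovers in the feed, such as the g:plusone element or the YouTube link, would need their own entries in the XPath expression.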