Does PHP lack an XML-safe entity decoding function? Is there no xml_entity_decode?


php xml


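The problem: I need an XML file that is "fully encoded" in UTF-8; that is, no symbol is represented by an entity and every symbol is encoded in UTF-8, except for the only 3 symbols that are XML-reserved: "&" (amp), "<" (lt) and ">" (gt). I also need a built-in function that does this fast: one that transforms entities into real UTF-8 characters without corrupting my XML.
PS: this is a "real-world problem" (!); for example, in the US there are 2.8 million scientific articles with ... (also known as ...). To process them as ordinary UTF-8 XML text, we need to change the numeric entities into UTF-8 characters.
Attempted solution: the natural function for this task is html_entity_decode, but it destroys the XML code (!) because it also converts the 3 XML-reserved symbols.
Illustrating the problem. Suppose: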

    $xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';
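To make the breakage concrete, here is a minimal sketch. It assumes the fragment is written with its entities spelled out (two `&#160;`, one `&lt;` and one `&#x222C;`); the variable names `$decoded` and `$ok` are only illustrative. It decodes the fragment with `html_entity_decode` and then tries to parse the result again.

```php
<?php
// Assumed input: the question's fragment with its entities spelled out.
$xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

// Decode everything, the way html_entity_decode does by design.
$decoded = html_entity_decode($xmlFrag, ENT_NOQUOTES | ENT_HTML5, 'UTF-8');
echo $decoded, "\n";

// The numeric references are now real UTF-8 characters, but "&lt;" has
// also become a bare "<", so the string is no longer well-formed XML.
libxml_use_internal_errors(true); // collect parse errors instead of printing warnings
$doc = new DOMDocument();
$ok = $doc->loadXML($decoded);
var_dump($ok); // bool(false)
```

The parse fails because `A<B and A=...` is read as the start of a `B` element, which is exactly the corruption the question wants to avoid.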
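Perhaps another question is "why is there no other option that does what I expect?" This matters for many other XML applications (!), not only for me.

I do not need a workaround as an answer... OK, here is my ugly function; maybe it helps you understand the problem: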

    function xml_entity_decode($s) {
        // Illustration (by a user-defined function) of how the
        // hypothetical PHP built-in function MUST work.
        static $XENTITIES = array('&amp;', '&gt;', '&lt;');
        static $XSAFENTITIES = array('#_x_amp#;', '#_x_gt#;', '#_x_lt#;');
        // Hide the 3 XML-reserved entities behind placeholders.
        $s = str_replace($XENTITIES, $XSAFENTITIES, $s);
        //$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any PHP version
        $s = html_entity_decode($s, ENT_HTML5 | ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+
        // Put the reserved entities back.
        $s = str_replace($XSAFENTITIES, $XENTITIES, $s);
        return $s;
    }
    // You see? No benchmark is needed: it is not as fast as a direct call to
    // html_entity_decode; an XML-safe option there would be ideal.
Note (corrected): the ENT_HTML5 flag must be set for the conversion to work.
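A small sketch of why the flag matters: with the HTML 4.01 table, `html_entity_decode` leaves entities that exist only in HTML5 untouched. The sample string and the variable names are illustrative assumptions.

```php
<?php
// &eacute; exists in HTML 4.01; &lpar; and &rpar; were added in HTML5.
$s = '&eacute; &lpar;x&rpar;';

$html401 = html_entity_decode($s, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');
$html5   = html_entity_decode($s, ENT_NOQUOTES | ENT_HTML5, 'UTF-8');

echo $html401, "\n"; // é &lpar;x&rpar;
echo $html5, "\n";   // é (x)
```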
Try this function:

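    function xmlsafe($s, $intoQuotes = 1) {
        if ($intoQuotes) {
            // Text destined for an attribute value: escape the double quote as well.
            return str_replace(array('&', '>', '<', '"'), array('&amp;', '&gt;', '&lt;', '&quot;'), $s);
        }
        // Element content: decode HTML entities first, then re-escape the reserved characters.
        return str_replace(array('&', '>', '<'), array('&amp;', '&gt;', '&lt;'), html_entity_decode($s));
    }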

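In addition, a usage example:

    echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';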

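This code, used in production, does not seem to have any problems with UTF-8. Load the JATS XML document together with its DTD, because the DTD defines every mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving: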

    $doc = new DOMDocument;
    // LIBXML_DTDLOAD loads the DTD; LIBXML_NOENT substitutes entities with their values.
    $doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
    $doc->encoding = 'UTF-8';
    $doc->save($outputFile);
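The same idea can be tried without any external DTD file. The sketch below is an assumption-laden miniature: it declares two entities in an internal DTD subset (standing in for the much larger JATS DTD), parses a string instead of a file, and uses illustrative names such as `$xml` and `$out`.

```php
<?php
// A tiny document whose internal DTD subset declares the named entities.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY eacute "&#233;">
  <!ENTITY nbsp   "&#160;">
]>
<p>caf&eacute;&nbsp;and A&lt;B, A=&#x222C;dxdy</p>
XML;

$doc = new DOMDocument();
// LIBXML_NOENT asks the parser to substitute entity references.
$doc->loadXML($xml, LIBXML_NOENT);
// Serializing as UTF-8 lets non-ASCII characters appear as literal characters.
$doc->encoding = 'UTF-8';
$out = $doc->saveXML();
echo $out;
```

The named and numeric entities should come out as literal UTF-8 characters, while `&lt;` stays escaped because the serializer always escapes reserved characters in text nodes.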

    // Method of a class that provides $this->charset; is_php() is a
    // version-check helper supplied by the surrounding framework.
    public function entity_decode($str, $charset = NULL)
    {
        if (strpos($str, '&') === FALSE) {
            return $str;
        }

        static $_entities;

        isset($charset) OR $charset = $this->charset;
        $flag = is_php('5.4') ? ENT_COMPAT | ENT_HTML5 : ENT_COMPAT;

        do {
            $str_compare = $str;

            // Decode standard entities, avoiding false positives
            if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches)) {
                if ( ! isset($_entities)) {
                    $_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));

                    // If we're not on PHP 5.4+, add the possibly dangerous HTML 5
                    // entities to the array manually
                    if ($flag === ENT_COMPAT) {
                        $_entities[':'] = '&colon;';
                        $_entities['('] = '&lpar;';
                        $_entities[')'] = '&rpar;';
                        $_entities["\n"] = '&newline;';
                        $_entities["\t"] = '&tab;';
                    }
                }

                $replace = array();
                // array_unique() can leave gaps in the keys, so iterate by value.
                $matches = array_unique(array_map('strtolower', $matches[0]));
                foreach ($matches as $match) {
                    if (($char = array_search($match.';', $_entities, TRUE)) !== FALSE) {
                        $replace[$match] = $char;
                    }
                }

                $str = str_ireplace(array_keys($replace), array_values($replace), $str);
            }

            // Decode numeric & UTF16 two byte entities
            $str = html_entity_decode(
                preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;])|(?:0*\d{2,4}(?![0-9;]))))/iS', '$1;', $str),
                $flag,
                $charset
            );
        } while ($str_compare !== $str);

        return $str;
    }
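This question keeps producing "false answers" (see the answers). That is perhaps because people do not pay attention, and also because there is no answer: PHP lacks a built-in solution.
...So let's repeat my workaround (which is NOT an answer!) to avoid more confusion.
The best workaround. Pay attention:

The function xml_entity_decode() below is the best workaround (better than any other).

The function below is not an answer to the question; it is only a workaround.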

    function xml_entity_decode($s) {
        // Illustrates how the (hypothetical) PHP built-in function must work.
        static $XENTITIES = array('&amp;', '&gt;', '&lt;');
        static $XSAFENTITIES = array('#_x_amp#;', '#_x_gt#;', '#_x_lt#;');
        $s = str_replace($XENTITIES, $XSAFENTITIES, $s);
        $s = html_entity_decode($s, ENT_HTML5 | ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+
        $s = str_replace($XSAFENTITIES, $XENTITIES, $s);
        return $s;
    }

To test and demonstrate that you have a better solution, please first use this simple benchmark:

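    $countBchMk_MAX = 1000;
    $xml = file_get_contents('sample1.xml'); // big and complex XML string
    $start_time = microtime(TRUE);
    for ($countBchMk = 0; $countBchMk < $countBchMk_MAX; $countBchMk++) {
        $A = xml_entity_decode($xml);
        /* alternative to compare against:
        $doc = new DOMDocument;
        $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
        $doc->encoding = 'UTF-8';
        $A = $doc->saveXML();
        */
    }
    $end_time = microtime(TRUE);
    echo "\nEND $countBchMk_MAX BENCHMARKS WITH ",
        ($end_time - $start_time) / $countBchMk_MAX,
        " seconds\n";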
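One case the placeholder workaround does not protect is a numeric reference to a reserved character: `&#60;` is decoded to a bare `<` just like `&lt;` would be. The sketch below is one possible alternative, not a PHP built-in and not the asker's method. The function name `xml_entity_decode_keep_reserved` is hypothetical, and treating both quote characters as reserved is an extra assumption beyond the 3 symbols named in the question. It decodes each entity individually and keeps the original text whenever the result is a reserved character. No claim is made about its speed; it would have to go through the benchmark above.

```php
<?php
// Hypothetical helper: decode every entity except those that resolve
// to a character with special meaning in XML.
function xml_entity_decode_keep_reserved(string $xml): string
{
    $reserved = ['&', '<', '>', '"', "'"];
    return preg_replace_callback(
        // Decimal references, hexadecimal references, or named entities.
        '/&(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i',
        function (array $m) use ($reserved): string {
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Unknown entities come back unchanged; reserved ones are kept as written.
            return in_array($char, $reserved, true) ? $m[0] : $char;
        },
        $xml
    );
}

$in = '<p>A&lt;B, A&#60;B, &eacute; &#x222C; #_x_amp#; &amp;</p>';
echo xml_entity_decode_keep_reserved($in), "\n";
// <p>A&lt;B, A&#60;B, é ∬ #_x_amp#; &amp;</p>
```

Because no placeholder strings are involved, a literal `#_x_amp#;` in the input passes through untouched.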
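I had the same problem, because someone used HTML templates to create XML instead of using SimpleXML. Sigh... Anyway, I came up with the following. It is not as fast as yours, but it is not an order of magnitude slower either, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, however unlikely that string is to appear in the source XML.
Note: I am assuming the default encoding is UTF-8.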

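    // Search for named entities (strings like "&abc1;").
    echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode it as an XML entity. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" will become "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "<Foo>&euro;&amp;foo&Ccedil;</Foo>") . "\n";

    /* <Foo>€&amp;fooÇ</Foo> */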
Also, if you want to replace special characters with numbered entities (in case you do not want UTF-8 XML):
    // Search for named entities (strings like "&abc1;").
    $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode it as an XML entity. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" will become "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

    // Encode every character from U+0080 to U+FFFF as a numbered entity.
    echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

    /* <Foo>&#8364;&amp;foo &#199;</Foo> */

    // Search for named entities (strings like "&abc1;").
    $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode it as an XML entity. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" will become "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

    // Decode any remaining (uncaught) numbered entities to UTF-8.
    echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

    /* <Foo>€&amp;foo Ç</Foo> */

    <Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>

    php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    <Foo>€&amp;foo Ç é &amp; ∬</Foo>
    =====
    Time taken: 2.0397531986237

    php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    <Foo>€&amp;foo Ç é #_x_amp#; &#8748;</Foo>
    =====
    Time taken: 4.045273065567

    php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    <Foo>&#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;</Foo>
    =====
    Time taken: 5.4407880306244

    php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    <Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
    =====
    Time taken: 5.5400078296661

echo xml_entity_decode(''); //Output  instead expected €