PHP: Can't decode HTML entities in title

php encoding utf-8

I'm having trouble decoding the entities in a YouTube video title.

Here is my code:

$url = 'http://www.youtube.com/watch?v=p7NMsywVQhY';
$html = @file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);

$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;

//decode the '&#x202a;' in the title
$title = html_entity_decode($title,ENT_QUOTES,'UTF-8'); //does not seem to have any effect
//decode the utf data
$title = utf8_decode($title);

$title comes back fine, except that question marks appear where &#x202a; was originally in the title.

Thanks.
I don't know whether PHP provides a function for this, but you can use preg_replace like this:
$string = preg_replace('/&#x([0-9a-f]+);/ei', 'chr(hexdec("$1"))', $string);
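
A note on the snippet above: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, and chr() only produces a single byte, so it cannot represent a code point such as U+202A. A minimal sketch of the same idea using preg_replace_callback and mb_chr follows. It assumes PHP 7.2 or later with the mbstring extension, and the helper name decodeHexEntities is hypothetical, not something from the original thread.

```php
<?php
// Hypothetical helper: decode hexadecimal numeric entities such as &#x202a; into UTF-8.
function decodeHexEntities(string $string): string
{
    return preg_replace_callback(
        '/&#x([0-9a-f]{1,6});/i',
        function (array $m): string {
            // mb_chr() returns the UTF-8 encoding of the code point, or false if it is invalid.
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // Leave the entity untouched when it does not name a valid code point.
            return $char === false ? $m[0] : $char;
        },
        $string
    );
}

echo bin2hex(decodeHexEntities('&#x202a;')); // e280aa
```

This only turns the entity into the real (invisible) character. It does not remove it, so a later conversion to a single-byte charset can still produce a question mark.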

Try this to force correct charset detection:
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;

echo $title;

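Comments:

&#x202a; is "left-to-right embedding" in Unicode; it is not supposed to be a printable character.

OK, so how do I remove these kinds of codes from the string?

Search and replace is probably the best option.

Is there a regex that can remove all of these kinds of codes? The one MatTheCat posted doesn't work well.

That won't work, because there is no character that represents #2029. It is a Unicode control character. Think of it as the equivalent of the lower 27 ASCII characters: they have an effect, but no visual representation.

Won't this break the encoding if the document isn't UTF-8?

This doesn't seem to have any effect.

Hmm, when I apply this regex, the question marks remain in the string.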
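
The search-and-replace idea from the comments can be sketched as follows. The question marks most likely come from utf8_decode(), which converts to ISO-8859-1 and substitutes "?" for anything that charset cannot represent, U+202A included. One option, shown here as a sketch rather than a definitive fix, is to keep the title in UTF-8 and strip the invisible formatting characters with a Unicode-aware regex. The helper name stripFormatCharacters and the sample title are made up for illustration, and the string escape syntax assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: remove invisible Unicode "format" characters from a UTF-8 string.
function stripFormatCharacters(string $title): string
{
    // \p{Cf} matches the Unicode "format" category, which includes U+202A
    // (left-to-right embedding) and the other directional marks.
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    $clean = preg_replace('/\p{Cf}/u', '', $title);
    // preg_replace() returns null when the subject is not valid UTF-8;
    // in that case fall back to the unmodified input.
    return $clean === null ? $title : $clean;
}

$title = "\u{202A}Some video title\u{202C} - YouTube"; // made-up sample title
echo stripFormatCharacters($title); // Some video title - YouTube
```

Note that \p{Cf} also removes characters such as the soft hyphen and the zero-width joiner, which may not be wanted in every script. If only the directional marks are a problem, a narrower class such as [\x{200E}\x{200F}\x{202A}-\x{202E}] could be used instead. utf8_decode() is also deprecated as of PHP 8.2, so leaving the string in UTF-8 avoids that call entirely.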