How to correctly reconstruct a string in MATLAB from a string decoded with the wrong character set

matlab encoding

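Edit: added some new information to make the question clearer.

In MATLAB releases before R2012b, if the character set of the web content is not UTF-8, urlread returns a string that was decoded with the wrong character set. (This was improved in MATLAB R2012b.)

For example: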

% A Chinese website whose content is encoded in GB2312
url = 'http://www.cnbeta.com/articles/213618.htm';
html = urlread(url)

Because MATLAB decodes the HTML as UTF-8 rather than GB2312, the Chinese characters in html are displayed incorrectly.
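The effect can be reproduced offline with a minimal sketch. It assumes a two-character sample string (中文, built from code points so the example does not depend on the file encoding) and assumes that the local MATLAB installation recognises the charset name 'GB2312'. The sample string and the variable names are illustrative and do not come from the original post.

```matlab
% Sample text "中文", built from Unicode code points (an illustrative choice).
original = char([20013 25991]);

% Bytes a GB2312-encoded web server would send for this text.
gbBytes = unicode2native(original, 'GB2312');   % uint8 row vector

% Decoding with the charset the page really uses recovers the text.
good = native2unicode(gbBytes, 'GB2312');

% Decoding the same bytes as UTF-8, as pre-R2012b urlread does, garbles it.
bad = native2unicode(gbBytes, 'UTF-8');

disp(isequal(good, original));   % 1 (true)
disp(isequal(bad, original));    % 0 (false)
```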

If I read a Chinese website that is encoded in UTF-8, everything works fine:

% A Chinese website whose content is encoded in UTF-8
url = 'http://www.baidu.com/';
html = urlread(url)

So, is there a way to correctly reconstruct the string from html? I tried the following, but it did not work:

>> bytes = unicode2native(html,'utf8');
>> str = native2unicode(bytes,'gb2312')
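One likely reason this round trip fails is that decoding GB2312 bytes as UTF-8 is lossy. Byte sequences that are not valid UTF-8 are typically replaced by a substitution character, so re-encoding the garbled string as UTF-8 does not give back the bytes the server sent. The sketch below illustrates this with a two-character sample; the sample string and the simulation of urlread's decoding step are assumptions made for illustration.

```matlab
% GB2312 bytes for the sample text "中文" (code points 20013 and 25991).
gbBytes = unicode2native(char([20013 25991]), 'GB2312');

% Simulate what pre-R2012b urlread does: decode the raw bytes as UTF-8.
html = native2unicode(gbBytes, 'UTF-8');

% The attempted repair: re-encode as UTF-8, then decode as GB2312.
bytes = unicode2native(html, 'UTF-8');
str = native2unicode(bytes, 'GB2312');

% The re-encoded bytes differ from the original ones, so the text is lost.
disp(isequal(bytes, gbBytes));
```

Because the information is already gone by the time urlread returns, the decoding has to be corrected at a point where the raw bytes are still available.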

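However, I know of one way to work around the urlread problem: type

edit urlread.m

in the console, then, near line 108 (in MATLAB R2011b), replace the line that decodes the downloaded bytes as 'UTF-8' with one that decodes them as 'gb2312'.

Save the file, and urlread now works for GB2312-encoded websites.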

In fact, this workaround shows why urlread sometimes fails: urlread always decodes the content as UTF-8, even when the content is not encoded in UTF-8.

What exactly is the question? It seems you already have a solution: simply create a function named urlread_gb that can read GB2312.

OK guys, I have the same problem here... Is it possible to do this conversion in a "nicer" way?

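The original line in urlread.m (MATLAB R2011b, near line 108):

output = native2unicode(typecast(byteArrayOutputStream.toByteArray','uint8'),'UTF-8');

The replacement line for GB2312-encoded websites:

output = native2unicode(typecast(byteArrayOutputStream.toByteArray','uint8'),'gb2312');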
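Along the lines of the urlread_gb suggestion in the comments, a separate helper can avoid editing urlread.m altogether. The sketch below is a hypothetical function (urlread_charset is not part of MATLAB). It assumes that urlwrite stores the response bytes unmodified, so they can be read back from a temporary file and decoded with a charset chosen by the caller.

```matlab
function str = urlread_charset(url, charset)
%URLREAD_CHARSET Read a URL and decode its bytes with a given charset.
%   Hypothetical helper, not part of MATLAB. Example charset: 'gb2312'.

    % Download the raw, undecoded bytes to a temporary file.
    tmpFile = tempname;
    urlwrite(url, tmpFile);
    cleanup = onCleanup(@() delete(tmpFile));  % remove the file on exit

    % Read the bytes back exactly as they were received.
    fid = fopen(tmpFile, 'r');
    bytes = fread(fid, inf, '*uint8')';
    fclose(fid);

    % Decode with the charset the page actually uses.
    str = native2unicode(bytes, charset);
end
```

With this file on the path, a call such as html = urlread_charset('http://www.cnbeta.com/articles/213618.htm', 'gb2312') should return correctly decoded text without touching the built-in function. In R2012b and later, urlread also accepts a 'Charset' option, which may make such a helper unnecessary.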