C#: character encoding when downloading a web page
Tags: c#, html, encoding, character-encoding

I have to download web pages into a text file and analyze the words. The pages use different charsets: iso-8859-1, windows-1252, and so on. I tried several solutions from SO, such as the one below, and more, but none of them worked. I still read "m&iacute;nimo" where I should read "mínimo", or "m&eacute;xico" where I should read "méxico". Can someone help me find the right way? Thanks.
public static string DownloadString(string address)
{
    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get correct charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);
    // read response into memory stream
    MemoryStream memoryStream;
    using (Stream responseStream = objResponse.GetResponseStream())
    {
        memoryStream = new MemoryStream();
        byte[] buffer = new byte[1024];
        int byteCount;
        do
        {
            byteCount = responseStream.Read(buffer, 0, buffer.Length);
            memoryStream.Write(buffer, 0, byteCount);
        } while (byteCount > 0);
    }
    // set stream position to beginning
    memoryStream.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(memoryStream, encoding);
    strWebPage = sr.ReadToEnd();
    // Check real charset meta-tag in HTML
    int CharsetStart = strWebPage.IndexOf("charset=");
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);
        string RealCharset =
            strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart);
        // real charset meta-tag in HTML differs from supplied server header???
        if (RealCharset != Charset)
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);
            // reset stream position to beginning
            memoryStream.Seek(0, SeekOrigin.Begin);
            // reread response stream with the correct encoding
            StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding);
            strWebPage = sr2.ReadToEnd();
            // Close and clean up the StreamReader
            sr2.Close();
        }
    }
    // dispose the first stream reader object
    sr.Close();
    return strWebPage;
}
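This is not an encoding problem. Those funny strings, like &iacute;, are called HTML entities. After converting the text to the correct encoding, use an HTML-decoding method from the System.Web assembly, such as HttpUtility.HtmlDecode, to convert the HTML entities into the characters they represent.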
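
As a minimal sketch of the entity-decoding step, assuming the page text has already been read with the right encoding: the example below uses `HttpUtility.HtmlDecode` from the `System.Web` namespace, which is one way to do it and not necessarily the exact call the answer had in mind. The class name `EntityDecoding` and the method name `Decode` are hypothetical, introduced only for illustration.

```csharp
using System;
using System.Web; // HttpUtility is defined in this namespace

public static class EntityDecoding
{
    // Hypothetical helper: replaces HTML entities such as &iacute;
    // with the characters they stand for.
    public static string Decode(string html)
    {
        return HttpUtility.HtmlDecode(html);
    }

    public static void Main()
    {
        // The two sample strings come from the question.
        Console.WriteLine(Decode("m&iacute;nimo"));
        Console.WriteLine(Decode("m&eacute;xico"));
    }
}
```

With the question's code, this would amount to passing the string returned by `DownloadString` through `Decode` before analyzing the words. The two sample inputs decode to "mínimo" and "méxico"; what the console actually displays also depends on its output encoding.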