C# character encoding when downloading a web page

I have to download web pages into a text file and analyze the words.

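The pages use different character sets (iso-8859-1, windows-1252, ...). I have tried several solutions from SO and elsewhere, but none of them worked: I still read m&iacute;nimo where I should read mínimo, or m&eacute;xico.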

Can anyone point me in the right direction? Thanks.

using System.IO;
using System.Text;

public static string DownloadString(string address)
{
    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
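    // get charset and encoding from the server's header;
    // the header may carry no charset, so fall back to UTF-8 in that case
    string Charset = objResponse.CharacterSet;
    Encoding encoding = string.IsNullOrEmpty(Charset)
        ? Encoding.UTF8
        : Encoding.GetEncoding(Charset);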

    // read response into memory stream
    MemoryStream memoryStream;
    using (Stream responseStream = objResponse.GetResponseStream())
    {
        memoryStream = new MemoryStream();

        byte[] buffer = new byte[1024];
        int byteCount;
        do
        {
            byteCount = responseStream.Read(buffer, 0, buffer.Length);
            memoryStream.Write(buffer, 0, byteCount);
        } while (byteCount > 0);
    }

    // set stream position to beginning
    memoryStream.Seek(0, SeekOrigin.Begin);

    StreamReader sr = new StreamReader(memoryStream, encoding);
    strWebPage = sr.ReadToEnd();

    // Check real charset meta-tag in HTML
    int CharsetStart = strWebPage.IndexOf("charset=");
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
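        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';', '>' }, CharsetStart);
        // if no usable terminator is found, keep the charset from the header
        // (Substring would throw on a negative length)
        string RealCharset = CharsetEnd > CharsetStart
            ? strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart)
            : Charset;

        // real charset meta-tag in HTML differs from supplied server header???
        // (charset names are case-insensitive, so compare them that way)
        if (!string.Equals(RealCharset, Charset, System.StringComparison.OrdinalIgnoreCase))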
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

            // reset stream position to beginning
            memoryStream.Seek(0, SeekOrigin.Begin);

            // reread response stream with the correct encoding
            StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding);

            strWebPage = sr2.ReadToEnd();
            // Close and clean up the StreamReader
            sr2.Close();
        }
    }

    // dispose the first stream reader object
    sr.Close();

    return strWebPage;
}

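This is not an encoding problem. Those funny strings, such as &iacute;, are called HTML entities.

After converting to the correct encoding, use the HTML decoder from the System.Web assembly to convert the HTML entities.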
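
The two steps described in the answer can be sketched roughly as follows. This is only an illustration under assumptions: the class name EntityDemo and the method DecodePage are hypothetical, and the sketch calls System.Net.WebUtility.HtmlDecode, which may not be the method the answer had in mind (System.Web also offers HttpUtility.HtmlDecode, which matches the description but needs a reference to that assembly). The sample bytes stand in for a downloaded response body.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper, not part of the original question or answer.
public static class EntityDemo
{
    // Turns raw response bytes into readable text in two steps.
    public static string DecodePage(byte[] rawBytes, string charsetName)
    {
        // Step 1: bytes -> text, using the character set reported for the page.
        string html = Encoding.GetEncoding(charsetName).GetString(rawBytes);

        // Step 2: HTML entities such as &iacute; -> the characters they stand for.
        // Assumption: WebUtility.HtmlDecode is used here in place of the
        // System.Web decoder mentioned in the answer.
        return WebUtility.HtmlDecode(html);
    }

    public static void Main()
    {
        // Simulated iso-8859-1 page fragment mixing entities and a raw accented byte.
        byte[] raw = Encoding.GetEncoding("iso-8859-1")
            .GetBytes("m&iacute;nimo, M&eacute;xico, caf\u00e9");

        Console.WriteLine(DecodePage(raw, "iso-8859-1"));
    }
}
```

In the question's DownloadString method, the equivalent change would be to pass strWebPage through the decoder just before returning it. The order matters: the entities are plain ASCII text inside the page, so they only need decoding once the bytes have been read with the right encoding.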