C#: How do I convert a UTF-8 string to Unicode?

Tags: c#, string, unicode, utf-8

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

For now, my implementation is the following:

public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}
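
I am playing with the word "déjá". I have already converted it to UTF-8, so I started testing my method with the string "déjÃ".

Unfortunately, with this implementation the string simply stays the same.

Where am I going wrong?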

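"I have a string that displays UTF-8 encoded characters"

There is no such thing in .NET. The string class can only store strings in the UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes in a string will not come to a good end; UTF-8 uses byte values that do not correspond to a valid Unicode code point. The content will be destroyed when the string is normalized. So by the time your DecodeFromUtf8() starts running, it is already too late to recover the string.

Only handle UTF-8 encoded text as byte[], and use UTF8Encoding.GetString() to convert it.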
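
A minimal sketch of that advice, assuming nothing beyond System.Text (the class name Utf8RoundTrip and the sample word are only illustrative): the UTF-8 form lives in a byte[] and only becomes a string again by decoding.

```csharp
using System;
using System.Text;

static class Utf8RoundTrip
{
    // Encode a .NET (UTF-16) string into UTF-8 bytes.
    public static byte[] ToUtf8(string text) => Encoding.UTF8.GetBytes(text);

    // Decode UTF-8 bytes back into a .NET (UTF-16) string.
    public static string FromUtf8(byte[] utf8) => Encoding.UTF8.GetString(utf8);

    static void Main()
    {
        byte[] utf8 = ToUtf8("d\u00E9j\u00E0");                 // "déjà"
        Console.WriteLine(BitConverter.ToString(utf8));          // 64-C3-A9-6A-C3-A0
        Console.WriteLine(FromUtf8(utf8) == "d\u00E9j\u00E0");   // True
    }
}
```

Encoding.UTF8 is a UTF8Encoding instance, so this is the same GetString() call the answer refers to.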

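So the problem is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.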

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
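
The range check is only a commented-out assertion in the code above. As a hedged sketch of what an explicit check could look like (the names MisdecodedUtf8 and DecodeChecked are hypothetical, not part of the answer), the variant below rejects chars above U+00FF and also uses a strict UTF8Encoding so that malformed byte sequences raise an exception instead of silently turning into U+FFFD:

```csharp
using System;
using System.Text;

static class MisdecodedUtf8
{
    // Hypothetical helper: treats each char as one UTF-8 byte and fails loudly otherwise.
    public static string DecodeChecked(string utf8String)
    {
        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; ++i)
        {
            char c = utf8String[i];
            if (c > 0xFF)
            {
                throw new ArgumentException(
                    $"Char U+{(int)c:X4} at index {i} does not fit in a byte.",
                    nameof(utf8String));
            }
            utf8Bytes[i] = (byte)c;
        }

        // throwOnInvalidBytes: true makes malformed UTF-8 throw DecoderFallbackException.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        return strictUtf8.GetString(utf8Bytes);
    }

    static void Main()
    {
        Console.WriteLine(DecodeChecked("d\u00C3\u00A9j\u00C3\u00A0") == "d\u00E9j\u00E0"); // True
    }
}
```

Whether to throw or to fall back to the input unchanged is a design choice; throwing is used here only because it makes a wrong assumption about the input visible.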

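What you are seeing looks like a string that was decoded incorrectly from another encoding, most likely code page 1252, the default on US Windows. Assuming nothing else was lost, here is how to reverse it. One non-obvious loss is the non-breaking space (U+00A0) at the end of your string, which is not displayed. Of course, it would be better to read the data source correctly in the first place, but perhaps the data was stored incorrectly to begin with.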

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string junk = "déjÃ\xa0";  // Bad Unicode string

        // Turn string back to bytes using the original, incorrect encoding.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

        // Use the correct encoding this time to convert back to a string.
        string good = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(good);
    }
}

Result:

déjà
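
To see where the invisible U+00A0 comes from, the faulty read can be simulated and then reversed. This is only a sketch under assumptions: it uses ISO-8859-1 instead of code page 1252 because that encoding is available without extra setup (on .NET Core and .NET 5+, code page 1252 typically needs the CodePagesEncodingProvider to be registered first), and the two agree for the bytes in this example. The names MojibakeDemo, Garble and Repair are illustrative.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // ISO-8859-1 maps bytes 0x00-0xFF straight to U+0000-U+00FF.
    static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Simulates the faulty read: UTF-8 bytes decoded with a single-byte encoding.
    public static string Garble(string text) =>
        Latin1.GetString(Encoding.UTF8.GetBytes(text));

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled) =>
        Encoding.UTF8.GetString(Latin1.GetBytes(garbled));

    static void Main()
    {
        string garbled = Garble("d\u00E9j\u00E0");                // "déjà"
        Console.WriteLine(garbled.Length);                        // 6
        Console.WriteLine((int)garbled[5]);                       // 160, the invisible U+00A0
        Console.WriteLine(Repair(garbled) == "d\u00E9j\u00E0");   // True
    }
}
```

For text containing bytes in the range 0x80-0x9F the two encodings differ, so the repair has to use whichever encoding actually produced the garbled string.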

If you have a UTF-8 string in which every byte is correct ('Ö' -> [195, 0], [150, 0]), you can use the following:

public static string Utf8ToUtf16(string utf8String)
{
    /***************************************************************
     * Every .NET string will store text with the UTF-16 encoding, *
     * known as Encoding.Unicode. Other encodings may exist as     *
     * Byte-Array or incorrectly stored with the UTF-16 encoding.  *
     *                                                             *
     * UTF-8 = 1 to 4 bytes per char                               *
     *    ["100" for the ansi 'd']                                 *
     *    ["206" and "186" for the Greek 'κ']                      *
     *                                                             *
     * UTF-16 = 2 or 4 bytes per char                              *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["186, 3" for the Greek 'κ']                             *
     *                                                             *
     * UTF-8 inside UTF-16                                         *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["206, 0" and "186, 0" for the Greek 'κ']                *
     *                                                             *
     * First we need to get the UTF-8 Byte-Array and remove all    *
     * 0 byte (binary 0) while doing so.                           *
     *                                                             *
     * Binary 0 means end of string on UTF-8 encoding while on     *
     * UTF-16 one binary 0 does not end the string. Only if there  *
     * are 2 binary 0, then the UTF-16 encoding will end the       *
     * string. Because of .NET we don't have to handle this.       *
     *                                                             *
     * After removing binary 0 and receiving the Byte-Array, we    *
     * can use the UTF-8 encoding to string method now to get a    *
     * UTF-16 string.                                              *
     *                                                             *
     ***************************************************************/

    // Get UTF-8 bytes and remove binary 0 bytes (filler)
    List<byte> utf8Bytes = new List<byte>(utf8String.Length);
    foreach (byte utf8Byte in utf8String)
    {
        // Remove binary 0 bytes (filler)
        if (utf8Byte > 0) {
            utf8Bytes.Add(utf8Byte);
        }
    }

    // Convert UTF-8 bytes to UTF-16 string
    return Encoding.UTF8.GetString(utf8Bytes.ToArray());
}

Or the native method:

[DllImport("kernel32.dll")]
private static extern Int32 MultiByteToWideChar(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr, Int32 cbMultiByte, [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr, Int32 cchWideChar);

public static string Utf8ToUtf16(string utf8String)
{
    Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
    if (iNewDataLen > 1)
    {
        StringBuilder utf16String = new StringBuilder(iNewDataLen);
        MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

        return utf16String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see [link missing].
I hope this helps.

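That's not a UTF-8 string. It is a corrupted string, badly converted from bytes using the wrong encoding.

UTF-8 is Unicode.

C# strings have 16-bit characters, so they cannot possibly be UTF-8 encoded. I don't think the system understands what you are trying to do.

Where did you get the badly encoded string from?

@AlexeyFrunze and richard: if it helps, read "Unicode" in the question as "UTF-16". C#'s native string encoding is UTF-16, which the documentation calls Unicode.

You may need to learn a bit more about what you are trying to do first...

You are pointing out the confusion I wanted to avoid. My string is a Unicode string as well as a .NET string, and the debugger displays it as déjÃ. So my goal is to obtain another (.NET) string that displays as déjá (in the debugger, for instance).

You missed the point of the answer: you cannot make this work properly for every possible UTF-8 encoded string. That you can make it work for déjÃ is just a coincidence. That you are already having trouble should be a hint: there is an extra space after the last Ã. A special one, a non-breaking space, code point U+00A0, which happens to be a valid Unicode code point.

Thanks, I think I get it. You mean I cannot use a string to store UTF-8 bytes. However, as you mention, it may work by accident, and if I could make the accident happen, that would be a great help. In other words, I still don't know how to do the conversion in this case.

You can try your luck with Encoding.Default.GetBytes() to recover the byte[]. I strongly recommend the System.Random class instead; it has a more predictable outcome.

I finally found something that (seems to) work. First, I get a byte[] from this infamous UTF-8 string. In this array I noticed that all the odd indexes contain 0, so I removed all of them and called unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes). Finally, I returned Encoding.Unicode.GetString(unicodeBytes). Then I picked text samples in many languages (thanks, Wikipedia), built one big string, converted it into my infamous UTF-8 format, decoded it, and got exactly the original string back. No randomness, no accident.

Thanks barnes53, this answers my question exactly, since it produces the result I expected. You managed to work out what I meant from my confusing question.

Just to make sure: the converted string is still UTF-16; it only contains UTF-8 encoded data. You cannot handle a string in the UTF-8 encoding, because .NET always uses UTF-16 to handle strings.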
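
The approach the asker describes in the comments could be sketched roughly as follows. This is one reading of that comment, not the asker's actual code: it assumes the initial byte[] came from Encoding.Unicode.GetBytes(), and the class name CommentApproach is hypothetical.

```csharp
using System;
using System.Text;

static class CommentApproach
{
    public static string DecodeFromUtf8(string utf8String)
    {
        // UTF-16LE bytes: low byte at even indexes, high byte at odd indexes.
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf8String);

        // Keep only the low bytes; assumes every high byte is 0 (chars <= U+00FF).
        byte[] encodedBytes = new byte[utf16Bytes.Length / 2];
        for (int i = 0; i < encodedBytes.Length; ++i)
        {
            encodedBytes[i] = utf16Bytes[2 * i];
        }

        // Re-encode the recovered UTF-8 bytes as UTF-16 and build the string.
        byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);
        return Encoding.Unicode.GetString(unicodeBytes);
    }

    static void Main()
    {
        Console.WriteLine(DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0") == "d\u00E9j\u00E0"); // True
    }
}
```

Dropping the odd-index bytes has the same effect as casting each char to a byte, so this only works when the input really holds one UTF-8 byte per char.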