C++: UTF-16 to char conversion (Linux/Ubuntu)


I am trying to help a friend with a project that was supposed to take one hour and has already taken three days. Needless to say, I feel very frustrated and angry ;-) ooooh... I breathe.

The program, written in C++, just reads a bunch of files and processes them. The problem is that the files it reads are encoded in UTF-16 (because they contain words written in different languages), and simply using ifstream doesn't seem to work (it reads and outputs garbage). It took me a while to realise this was because the files are in UTF-16.

I have now spent literally the whole afternoon searching the web for information about reading UTF-16 files and converting the content of a UTF-16 line to char! I just can't figure it out! It's a nightmare. I tried to learn about <locale> and <codecvt>, wstring, and so on, which I have never used before (I specialise in graphics applications, not desktop applications). I just can't get it.

This is what I have done so far (but it doesn't work):

std::wifstream file2(fileFullPath);
std::locale loc(std::locale(), new std::codecvt_utf16<char32_t>);
std::cout.imbue(loc);
while (!file2.eof()) {
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl;
}

UTF-8 is capable of representing all valid Unicode characters (code points), which is better than UTF-16 (which covers the first 1.1 million code points). [Although, as the comments note, there are no valid Unicode code points beyond the 1.1 million mark, so UTF-16 is "safe" for all currently available code points - and probably for a long time to come, unless we get extraterrestrial visitors with a very complex written language...]

It does this by using multiple bytes/words to store a single code point (what we would call a character) when necessary. In UTF-8 this is marked by the highest bit being set: in the first byte of a "multibyte" character the top two bits are set, and in each following byte the top bit is set and the next bit down is zero.
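
To make the bit layout concrete, here is a minimal sketch of encoding a single code point into UTF-8 by hand (my illustration, not code from the answer; the helper name utf32_to_utf8 is hypothetical):

#include <vector>

// Encode one code point (< 0x110000) as a UTF-8 byte sequence.
std::vector<unsigned char> utf32_to_utf8(unsigned int cp)
{
    std::vector<unsigned char> out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out.push_back(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(0xC0 | (cp >> 6));
        out.push_back(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(0xE0 | (cp >> 12));
        out.push_back(0x80 | ((cp >> 6) & 0x3F));
        out.push_back(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(0xF0 | (cp >> 18));
        out.push_back(0x80 | ((cp >> 12) & 0x3F));
        out.push_back(0x80 | ((cp >> 6) & 0x3F));
        out.push_back(0x80 | (cp & 0x3F));
    }
    return out;
}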

To convert an arbitrary code point to UTF-8, you can use the code I provided in an earlier answer. (Yes, that question asks for the reverse of what you want, but the code in my answer covers both directions of the conversion.)

Converting from UTF-16 to an "integer" (a code point) works much the same way, except for the length of the input. If you are lucky, you may even get away with not doing it at all...

UTF-16 uses the range D800-DBFF for the first unit, which holds 10 bits of data; it is followed by a unit in DC00-DFFF, which holds the next 10 bits of data.
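
A quick worked example (mine, not from the answer): for U+1F600, subtracting 0x10000 leaves 0xF600; the high 10 bits are 0x3D, so the lead unit is 0xD800 | 0x3D = 0xD83D, and the low 10 bits are 0x200, so the trail unit is 0xDC00 | 0x200 = 0xDE00. U+1F600 is therefore stored as the pair D83D DE00.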

Code for converting between UTF-32 and UTF-16 (I have only tested it a little, but it appears to work fine):

#include <iostream>
#include <vector>

std::vector<int> utf32_to_utf16(int charcode)
{
    std::vector<int> r;
    if (charcode < 0x10000)
    {
        if ((charcode & 0xFC00) == 0xD800) // parentheses needed: == binds tighter than &
        {
            std::cerr << "Error: Invalid UTF-32 character" << std::endl;
            return r;
        }
        r.push_back(charcode);
        return r;
    }
    // Split into a surrogate pair: high 10 bits go into D800-DBFF, low 10 into DC00-DFFF.
    charcode -= 0x10000;
    r.push_back(0xD800 | ((charcode >> 10) & 0x3FF));
    r.push_back(0xDC00 | (charcode & 0x3FF));
    return r;
}
If I simply convert all the files to UTF-8 from the command line (using a Linux command), am I going to potentially lose information?

No, all UTF-16 data can be losslessly converted to UTF-8. That is probably the best thing to do.
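
For example (my suggestion, not part of the original answer; file names are placeholders), a file can be converted on Linux with iconv:

iconv -f UTF-16 -t UTF-8 input.txt > output.txt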


When wide characters were introduced, they were intended to be a text representation used exclusively internal to a program, never written to disk as wide characters. The wide streams reflect this by converting the wide characters you write out into narrow characters in the output file, and by converting narrow characters in the file into wide characters in memory when reading.

std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).

std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ASCII in the file is converted to wide characters.
Here's one way to rewrite your code:

// when reading UTF-16 you must use binary mode
std::wifstream file2(fileFullPath, std::ios::binary);

// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);

// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));

// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work.

std::wstring line;
while (std::getline(file2, line)) {
  std::wcout << line << std::endl;
}
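
For convenience, here is the same snippet assembled into a complete program (my assembly, not part of the original answer; the file name is a placeholder). It compiles with -std=c++11; note that <codecvt> was later deprecated in C++17, but it still works:

#include <codecvt>
#include <cwchar>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

int main()
{
    const char *fileFullPath = "input-utf16.txt"; // placeholder file name

    std::wifstream file2(fileFullPath, std::ios::binary);
    file2.imbue(std::locale(
        std::locale(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>));

    std::wcout.imbue(std::locale(""));

    std::wstring line;
    while (std::getline(file2, line))
        std::wcout << line << std::endl;
}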
I have adapted, corrected, and tested Mats Petersson's impressive solution:

#include <cstddef>
#include <cstdio>
#include <vector>

// Placeholder typedefs: the original presumably took these from ICU / ConvertUTF.
typedef unsigned short UTF16;
typedef unsigned int   UTF32;

int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    // Parentheses needed: != binds tighter than &. Also guard against an
    // unpaired lead surrogate at the end of the input.
    if ((t & 0xFC00) != 0xD800 || coded.size() < 2)
    {
        return t;
    }
    // Combine the 10 high bits from the lead unit with the 10 low bits
    // from the trail unit.
    int charcode = (coded[1] & 0x3FF) | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}

#ifdef __cplusplus    // If used by C++ code,
extern "C" {          // we need to export the C interface
#endif
void convert_utf16_to_utf32(UTF16 *input,
                            size_t input_size, // in UTF-16 code units
                            UTF32 *output)
{
     const UTF16 * const end = input + input_size;
     while (input < end){
       const UTF16 uc = *input++;
       std::vector<int> vec;
       vec.push_back(uc);
       // A lead surrogate (D800-DBFF) is followed by its trail unit;
       // consume both so the pair is decoded together.
       if ((uc & 0xFC00) == 0xD800 && input < end){
         vec.push_back(*input++);
         printf("LEAD %.4x TRAIL %.4x\n", vec[0], vec[1]);
       }
       *output++ = utf16_to_utf32(vec);
     }
}
#ifdef __cplusplus
}
#endif
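
A quick sanity check (my own test, relying on the placeholder typedefs above): feed the function a known surrogate pair plus one BMP character.

int main()
{
    // U+1F600 is the surrogate pair D83D DE00; U+0041 ('A') passes through.
    UTF16 in[3]  = { 0xD83D, 0xDE00, 0x0041 };
    UTF32 out[2] = { 0, 0 };
    convert_utf16_to_utf32(in, 3, out);
    printf("%x %x\n", out[0], out[1]); // prints 1f600 41 (after the LEAD/TRAIL debug line)
}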
From the comments:

You won't lose any information converting UTF-16 to UTF-8. I think your mistake is believing that C++ will do the conversion for you; I'm not entirely sure, but I don't believe it does. In any case, I would just convert the UTF-16 to UTF-8. It's simple, and it will certainly take you less than three days.

Instead of reading up on UTF-16, I foolishly tried to brute-force a solution by copy/pasting web code I didn't fully understand :-(

So you are sure that converting from 16 to 8 won't cause any loss of information? The question is why UTF-16 was used for the foreign languages in the first place. I assumed it was necessary because some alphabets have more characters than can be encoded with UTF-8?

UTF-16 and UTF-8 are both complete encodings of Unicode; I'm sure you won't lose any information. UTF-16 was probably used because the files came from Java/DotNET.

For reference, here is the original code from the question again, annotated with its problems:
std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.
while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}