Boost: How do I convert a code point to UTF-8?


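I have some code that reads Unicode code points (escaped in a string as 0xF00).

Since I'm using Boost, I'm wondering whether the following is the best (and correct) approach: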

unsigned int codepoint{0xF00};
boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint+1);
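Before any transcoding, the escaped text has to become a number. The following is only a minimal sketch of that step, assuming the input looks like "0xF00"; the helper name parse_code_point is hypothetical and not part of Boost or the standard library. It rejects surrogates and values above 0x10FFFF, which are not valid code points to encode.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: parses text such as "0xF00" into a code point.
// Returns false if the text is not a hex number, or if the value is a
// surrogate (0xD800..0xDFFF) or lies above 0x10FFFF.
bool parse_code_point(const std::string& text, char32_t& out) {
    if (text.empty()) return false;
    errno = 0;
    char* end = nullptr;
    // Base 16 accepts an optional "0x"/"0X" prefix.
    const unsigned long value = std::strtoul(text.c_str(), &end, 16);
    if (end == text.c_str() || *end != '\0' || errno == ERANGE) return false;
    if (value > 0x10FFFF) return false;
    if (value >= 0xD800 && value <= 0xDFFF) return false;
    out = static_cast<char32_t>(value);
    return true;
}
```

The result is a char32_t, so the value already has the width expected of a UTF-32 code unit.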


You can do this with the standard library by converting UTF-32 (code points) to UTF-8:

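#include <codecvt>  // std::codecvt_utf8 (deprecated since C++17)
#include <locale>   // std::wstring_convert (deprecated since C++17)
#include <string>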

std::string codepoint_to_utf8(char32_t codepoint) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(&codepoint, &codepoint + 1);
}

As mentioned, a code point in this form is (conveniently) UTF-32, so what you're looking for is a transcoding.
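To make "transcoding" concrete, here is a hand-rolled sketch of the UTF-32 to UTF-8 mapping for a single code point. It is an illustration only, not how any particular library implements it; the function name encode_utf8 and the convention of returning an empty string for invalid input are assumptions made for this example.

```cpp
#include <string>

// Illustrative sketch: encodes one code point as UTF-8.
// Returns an empty string for surrogates (0xD800..0xDFFF) and for
// values above 0x10FFFF, which cannot be encoded.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For the code point 0xF00 from the question this yields the three bytes E0 BC 80. In real code a tested library is usually the safer choice.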

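For a solution that doesn't rely on functions deprecated since C++17, isn't really ugly, and doesn't require a hefty third-party library, you can use the very lightweight UTF8-CPP (four small headers!) and its function utf8::utf32to8.

It would look something like this: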

const uint32_t codepoint{0xF00};
std::vector<unsigned char> result;

try
{
   utf8::utf32to8(&codepoint, &codepoint + 1, std::back_inserter(result));
}
catch (const utf8::invalid_code_point&)
{
   // something
}


(If you're allergic to exceptions, there's also utf8::unchecked::utf32to8.)

(And consider reading into a std::vector<char8_t> or a std::u8string, since C++20.)

(Finally, note that I've specifically used uint32_t to ensure the input has the proper width.)

I tend to use this library in projects until I need something a little heavier for other purposes (at which point I usually switch to ICU).

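C++17 has deprecated a number of convenience functions for processing UTF. Unfortunately, the last remaining ones will be deprecated in C++20 (*). That being said, std::codecvt is still valid. From C++11 to C++17 you can use std::codecvt<char32_t, char, std::mbstate_t>; starting with C++20 it will be std::codecvt<char32_t, char8_t, std::mbstate_t>.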

Here is some code that converts a code point (up to 0x10FFFF) to UTF-8:

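#include <cstddef>
#include <cwchar>
#include <locale>

// codepoint is the code point to convert
// buf is a char array of size sz (should be at least 4 to convert any code point)
// on return, sz is the number of bytes of buf used by the UTF-8 converted string
// the return value is the return value of std::codecvt::out (0 for ok)
std::codecvt_base::result to_utf8(char32_t codepoint, char *buf, std::size_t& sz) {
    std::locale loc("");
    const std::codecvt<char32_t, char, std::mbstate_t> &cvt =
                   std::use_facet<std::codecvt<char32_t, char, std::mbstate_t>>(loc);

    std::mbstate_t state{};  // zero-initialized conversion state

    const char32_t *last_in;
    char *last_out;
    std::codecvt_base::result res = cvt.out(state, &codepoint, 1+&codepoint, last_in,
                                            buf, buf+sz, last_out);
    sz = last_out - buf;
    return res;
}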

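(*) std::codecvt will still exist in C++20. Simply put, the default instantiations will no longer be std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t> but std::codecvt<char16_t, char8_t, std::mbstate_t> and std::codecvt<char32_t, char8_t, std::mbstate_t> (note char8_t instead of char).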

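After reading about the shaky state of UTF-8 support in C++, I stumbled upon the corresponding C support, c32rtomb, which looks promising and likely won't be deprecated any time soon.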

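#include <clocale>
#include <cuchar>
#include <climits>
#include <string>

// buf must have room for at least MB_LEN_MAX bytes.
// Returns the number of bytes written, or (size_t)-1 if the conversion fails.
size_t to_utf8(char32_t codepoint, char *buf)
{
    // Save the current locale name first: setlocale(LC_ALL, "...") returns the
    // new locale, and a later call may invalidate the returned pointer.
    const std::string saved = std::setlocale(LC_ALL, nullptr);
    std::setlocale(LC_ALL, "en_US.utf8");
    std::mbstate_t state{};
    std::size_t len = std::c32rtomb(buf, codepoint, &state);
    std::setlocale(LC_ALL, saved.c_str());  // restore the previous locale
    return len;
}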

If your application's current locale is already UTF-8, you can of course omit the back-and-forth calls to setlocale.

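Do you want to process any code point (up to 0x10FFFF) or only those in the Basic Multilingual Plane (up to 0xFFFF)? In the former case you would convert char32_t to UTF-8, in the latter case char16_t to UTF-8...

@SergeBallesta, I haven't really decided. The unsigned int is retrieved through std::strtol, so values beyond 0xFFFF are possible; I'm probably interested in the complete solution. So either solution could work.

Just a heads-up: wstring_convert and codecvt_utf8 have been deprecated since C++17. There is no replacement in the standard library; the current recommendation is to use a dedicated library.

But the point is that you [accidentally] have UTF-32!

Well, I can't really use this option, since it is deprecated. Do you have a non-deprecated workaround? :)

Could the sentence "char will be replaced by char8_t" be misleading? Also, you wrote char16_t instead of char32_t in the demo; deliberate?

@LightnessRacesinOrbit: Yes, it could be misleading when read after your comment. I have edited it (and fixed the stupid 16...). Thanks for noticing.

@LightnessRacesinOrbit: Er... English is not my first language, and I could not understand your last comment. I assume Naaaais means "nice", but what is np?

It's internet-speak for "no problem".

This is a good solution (using only a small header-only library) and the best so far in my opinion; I might use it. However, I already have Boost, I see no reason to change given the current requirements, and I expect it to do what I need without any problems. One reason for asking the question is that I find the Boost documentation lacking here, or I just couldn't find it. I hope that makes sense.

@darune Sounds reasonable. By the way, it's "warning", one word :)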

Usage of the c32rtomb-based to_utf8:

char32_t codepoint{0xfff};
char buf[MB_LEN_MAX]{};
size_t len = to_utf8(codepoint, buf);