How can I convert ISO-8859-7 strings to UTF-8 in C++?

c++ unicode character-encoding

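I'm working with machines that are more than ten years old and use ISO 8859-7 to represent Greek characters, one byte per character. I need to capture those characters and convert them to UTF-8 so that I can inject them into JSON that is sent over HTTPS. Also, I'm using GCC v4.4.7 and I'd rather not upgrade, so I can't use codecvt or the like.

Example: for "ΟΛΑ" I get the char values [0xcf, 0xcb, 0xc1], and I need to write the string "\u039F\u039B\u0391".

PS: I'm not a charset expert, so please avoid philosophical answers like "ISO 8859 is a subset of Unicode, so you just need to implement the algorithm".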

Given that there are so few values to map, a simple solution is to use a lookup table.

Pseudocode:

id_offset    = 0x80  // 0x00 .. 0x7F same in UTF-8
c1_offset    = 0x20  // 0x80 .. 0x9F control characters

table_offset = id_offset + c1_offset

table = [
    u8"\u00A0",  // 0xA0
    u8"‘",       // 0xA1
    u8"’",
    u8"£",
    u8"€",
    u8"₯",
    // ... Refer to ISO 8859-7 for full list of characters.
]

let S be the input string
let O be an empty output string
for each char C in S
    reinterpret C as unsigned char U
    if U less than id_offset       // same in both encodings
        append C to O
    else if U less than table_offset  // control code
        append char '\xC2' to O  // lead byte
        append char C to O
    else
        append string table[U - table_offset] to O
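
For reference, the pseudocode above could be turned into C++ roughly as follows. This is only a sketch, not the answerer's code: it stores Unicode code points instead of UTF-8 string literals (so it needs no u8 literals and should build with an old compiler), and the names kIso8859_7, append_utf8 and iso8859_7_to_utf8 are made up for illustration. The three bytes that ISO 8859-7 leaves unassigned are replaced with '?', which is an assumption about what you want to happen there.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Unicode code points for ISO 8859-7 bytes 0xA0..0xFF.
// 0 marks the three bytes the standard leaves unassigned (0xAE, 0xD2, 0xFF).
static const unsigned short kIso8859_7[96] = {
    0x00A0, 0x2018, 0x2019, 0x00A3, 0x20AC, 0x20AF, 0x00A6, 0x00A7,  // A0..A7
    0x00A8, 0x00A9, 0x037A, 0x00AB, 0x00AC, 0x00AD, 0x0000, 0x2015,  // A8..AF
    0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x0384, 0x0385, 0x0386, 0x00B7,  // B0..B7
    0x0388, 0x0389, 0x038A, 0x00BB, 0x038C, 0x00BD, 0x038E, 0x038F,  // B8..BF
    0x0390, 0x0391, 0x0392, 0x0393, 0x0394, 0x0395, 0x0396, 0x0397,  // C0..C7
    0x0398, 0x0399, 0x039A, 0x039B, 0x039C, 0x039D, 0x039E, 0x039F,  // C8..CF
    0x03A0, 0x03A1, 0x0000, 0x03A3, 0x03A4, 0x03A5, 0x03A6, 0x03A7,  // D0..D7
    0x03A8, 0x03A9, 0x03AA, 0x03AB, 0x03AC, 0x03AD, 0x03AE, 0x03AF,  // D8..DF
    0x03B0, 0x03B1, 0x03B2, 0x03B3, 0x03B4, 0x03B5, 0x03B6, 0x03B7,  // E0..E7
    0x03B8, 0x03B9, 0x03BA, 0x03BB, 0x03BC, 0x03BD, 0x03BE, 0x03BF,  // E8..EF
    0x03C0, 0x03C1, 0x03C2, 0x03C3, 0x03C4, 0x03C5, 0x03C6, 0x03C7,  // F0..F7
    0x03C8, 0x03C9, 0x03CA, 0x03CB, 0x03CC, 0x03CD, 0x03CE, 0x0000   // F8..FF
};

// Appends the UTF-8 encoding of a code point in the range U+0000..U+FFFF.
static void append_utf8(std::string& out, unsigned cp) {
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts an ISO 8859-7 string to UTF-8 (hypothetical helper name).
std::string iso8859_7_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // worst case: three bytes per input byte
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char u = static_cast<unsigned char>(in[i]);
        unsigned cp = u;  // 0x00..0x9F map to the same code point
        if (u >= 0xA0) {
            cp = kIso8859_7[u - 0xA0];
            if (cp == 0) cp = '?';  // assumption: replace unassigned bytes
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With the bytes from the question, iso8859_7_to_utf8("\xcf\xcb\xc1") yields the six bytes CE 9F CE 9B CE 91, which is "ΟΛΑ" in UTF-8. Keeping code points in the table has the side benefit that the same table can be used to emit JSON \uXXXX escapes instead of raw UTF-8.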

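All that said, I would recommend saving yourself some time by using a library.

One way is to use the POSIX iconv interface. On Linux, the functions needed (iconv_open, iconv and iconv_close) are even included in libc, so no extra linking is required. On your old machines you may need to install libiconv, but I doubt it.

The conversion can be as simple as this: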

#include <iconv.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>

// A wrapper for the iconv functions
class Conv {
public:
    // Open a conversion descriptor for the two selected character sets
    Conv(const char* to, const char* from) : cd(iconv_open(to, from)) {
        if(cd == reinterpret_cast<iconv_t>(-1))
            throw std::runtime_error(std::strerror(errno));
    }

    Conv(const Conv&) = delete;

    ~Conv() { iconv_close(cd); }

    // the actual conversion function
    std::string convert(const std::string& in) {
        const char* inbuf = in.c_str();
        size_t inbytesleft = in.size();

        // make the "out" buffer big to fit whatever we throw at it and set pointers
        std::string out(inbytesleft * 6, '\0');
        char* outbuf = &out[0];  // non-const std::string::data() needs C++17
        size_t outbytesleft = out.size();

        // the const_cast shouldn't be needed but my "iconv" function declares it
        // "char**" not "const char**"
        size_t non_rev_converted = iconv(cd, const_cast<char**>(&inbuf),
                                         &inbytesleft, &outbuf, &outbytesleft);

        if(non_rev_converted == static_cast<size_t>(-1)) {
            // here you can add misc handling like replacing erroneous chars
            // and continue converting etc.
            // I'll just throw...
            throw std::runtime_error(std::strerror(errno));
        }

        // shrink to keep only what we converted
        out.resize(outbuf - out.data());

        return out;
    }

private:
    iconv_t cd;
};

int main() {
    Conv cvt("UTF-8", "ISO-8859-7");

    // create a string from the ISO-8859-7 data
    // (no std::begin/std::end or auto, so this also builds with old compilers)
    const unsigned char data[] = {0xcf, 0xcb, 0xc1};
    std::string iso88597_str(data, data + sizeof data);

    std::string utf8 = cvt.convert(iso88597_str);
    std::cout << utf8 << '\n';
}

Using this, you can generate a mapping table from ISO-8859-7 to UTF-8 and include that table in your project instead of using iconv.
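
A table generator along those lines might look like the sketch below. It is not the answerer's generator (the comments mention a std::unordered_map; this one prints a plain array instead), and utf8_for_byte, print_table and the emitted array name iso8859_7_to_utf8 are hypothetical. It assumes an iconv implementation whose iconv takes char** for the input buffer, as glibc does, and it emits "?" for bytes that iconv refuses to convert.

```cpp
#include <iconv.h>

#include <cassert>
#include <cstdio>
#include <string>

// Converts a single ISO-8859-7 byte to UTF-8 through an open descriptor.
// Returns "?" when iconv reports that the byte cannot be converted.
std::string utf8_for_byte(iconv_t cd, unsigned char b) {
    assert(cd != reinterpret_cast<iconv_t>(-1));
    char in[1] = { static_cast<char>(b) };
    char out[8];
    char* inptr = in;
    char* outptr = out;
    size_t inleft = sizeof in;
    size_t outleft = sizeof out;
    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == static_cast<size_t>(-1)) {
        iconv(cd, NULL, NULL, NULL, NULL);  // reset the descriptor after an error
        return "?";
    }
    return std::string(out, outptr - out);
}

// Prints a C array with one UTF-8 string literal per byte 0x80..0xFF.
// Paste the output into a header; index it with (byte - 0x80).
void print_table(iconv_t cd) {
    std::printf("static const char* const iso8859_7_to_utf8[128] = {\n");
    for (int b = 0x80; b <= 0xFF; ++b) {
        const std::string u = utf8_for_byte(cd, static_cast<unsigned char>(b));
        std::printf("    \"");
        for (size_t i = 0; i < u.size(); ++i)
            std::printf("\\x%02X", static_cast<unsigned>(static_cast<unsigned char>(u[i])));
        std::printf("\",  // 0x%02X\n", b);
    }
    std::printf("};\n");
}
```

To use it, open a descriptor with iconv_open("UTF-8", "ISO-8859-7"), call print_table once on any machine that has iconv, and redirect the output into a header. The old machines then only need the generated array, not iconv itself.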

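OK, I decided to do this myself rather than look for a compatible library. Here is how I did it.

The main problem was working out how to fill the two bytes of the UTF-8 form from the single ISO byte. I used the debugger to read the values of the same character, first as written by the old machine and then as written in a constant string (which is UTF-8 by default). I started with "Ο" and "Π" and saw that in UTF-8 the first byte is always 0xCE, while the second byte is the ISO value plus an offset (-0x30). I built the code below to implement this and ran it against a test string filled with all the Greek letters, upper and lower case. I then realized that starting from "π" (0xF0 in ISO) both the first byte and the offset for the second byte change, so I added a test to decide which of the two rules to apply.

The function below returns a bool that tells the caller whether the original string contained any ISO characters (needed for other purposes), and it overwrites the original string, passed by pointer, with the new one. The caller's buffer must therefore have room for up to twice the original length plus the terminator. I used char arrays instead of strings for consistency with the rest of the project, which is essentially a C project written in C++.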

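#include <cstring>

// Converts the ISO 8859-7 Greek letters in "in" to UTF-8, in place.
// The caller's buffer must have room for 2 * strlen(in) + 1 bytes.
// Returns true if the input contained any non-ASCII (ISO) bytes.
bool iso_to_utf8(char* in){
    bool wasISO=false;

    if(in == NULL)
        return wasISO;

    // count chars
    size_t len=strlen(in);
    if(!len)
        return wasISO;

    // create and size new buffer: worst case two bytes per char, plus terminator
    char *out = new char[2*len+1];
    // fill with 0's, useful for watching the string as it gets built
    memset(out, 0, 2*len+1);

    // index for old buffer
    size_t i=0;
    // index for new buffer
    size_t j=0;
    // for each char in old buffer
    while(in[i]!='\0'){
        // get plain value, whether or not char is signed on this platform
        int val = in[i] & 0xFF;
        if(val < 0x80){
            // it's already utf8-compliant, take it as it is
            out[j++] = in[i];
        }else{
            // it's ISO (only the Greek letters are handled correctly here)
            wasISO=true;
            // first byte to CF or CE
            out[j++] = static_cast<char>(val > 0xEF ? 0xCF : 0xCE);
            // second byte to plain value normalized
            out[j++] = static_cast<char>(val - (val > 0xEF ? 0x70 : 0x30));
        }
        i++;
    }
    // add string terminator
    out[j]='\0';
    // paste into old char array and release the temporary buffer
    strcpy(in, out);
    delete[] out;

    return wasISO;
}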
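
To see why the two-rule shortcut works and where it stops working: from 0xB4 upward, the assigned Greek bytes of ISO 8859-7 (except 0xB7, 0xBB and 0xBD, which keep their Latin-1 meaning) map to the code point byte + 0x2D0, and the CE/CF rule is simply the ordinary two-byte UTF-8 encoding of that code point. The sketch below compares the two; shortcut_utf8 and greek_block_utf8 are names invented for this illustration.

```cpp
#include <cassert>
#include <string>

// The two-rule shortcut from the answer above: lead byte CE or CF plus an offset.
std::string shortcut_utf8(unsigned char b) {
    assert(b >= 0x80);  // only meaningful for non-ASCII bytes
    std::string s;
    s += static_cast<char>(b > 0xEF ? 0xCF : 0xCE);
    s += static_cast<char>(b - (b > 0xEF ? 0x70 : 0x30));
    return s;
}

// Ordinary two-byte UTF-8 encoding of the code point b + 0x2D0,
// which is where the Greek letters of ISO 8859-7 live in Unicode.
std::string greek_block_utf8(unsigned char b) {
    const unsigned cp = b + 0x2D0u;
    std::string s;
    s += static_cast<char>(0xC0 | (cp >> 6));    // lead byte: 110xxxxx
    s += static_cast<char>(0x80 | (cp & 0x3F));  // continuation: 10xxxxxx
    return s;
}
```

The two functions agree for every byte from 0xB0 to 0xFF, so the shortcut is right exactly where "code point = byte + 0x2D0" is right. Below 0xB0 they diverge, and the real mapping is different anyway: 0xA4 is the euro sign U+20AC, which needs the three bytes E2 82 AC, whereas the shortcut produces CE 74, which is not valid UTF-8.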

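You're basically asking "what library can I use to convert one encoding to another that is compatible with my ancient compiler?". That is somewhat off-topic here; check softwarerecs.stackexchange.com.

I would like to achieve this without an external library.

That is not possible "in general", because encoding mappings are not fixed. Of course, simply mapping 256 characters from the ISO encoding to UTF-8 is fine. Unless you also want to do the reverse conversion.

"I would like to achieve this without an external library": libiconv? Those functions are even included in GNU's libc, which is very common, so you don't even have to link an extra library on Linux.

That might be a low-cost solution I can fall back on if I get desperate. I'll keep it as a backup plan; it's a good solution.

I just used libiconv to generate a std::unordered_map. That map can then be included on its own, without using iconv or any other library.

@tedlynmo Clever use of metaprogramming. I like it. In this case I'd prefer an array table, though. Thanks!

When I get to a computer I'll add a godbolt link to the result as a comment. I agree, an array is much better in this case.

@afe This is the table you need. It has three ? (\x3f) in the upper range. Those are the code points that are unused in iso-8859-7.

Does this work for the iso-8859-7 characters 0xa1, 0xa2, 0xa4, 0xa5 and 0xaf?

Since you ask, I guess not. That was out of scope; I only focused on the Greek characters, without symbols. By following the steps described, all the missing characters can easily be added.

I haven't tested your version, but it looks like producing 3-byte utf8 sequences won't work. The map I linked to is exact for all iso-8859-7 characters, and it is faster.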