How do you read a text file in C, given that you know its character encoding, and then display it on the console?

c unicode character-encoding

Take this Java program as an example:

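import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

import com.google.common.io.CharStreams; // Guava, not part of the JDK

public final class Meh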
{
    private static final String HELLO = "Hello world";

    private static final Charset UTF32 = Charset.forName("UTF-32");

    public static void main(final String... args)
        throws IOException
    {
        final Path tmpfile = Files.createTempFile("test", "txt");

        try (
            final Writer writer = Files.newBufferedWriter(tmpfile, UTF32);
        ) {
            writer.write(HELLO);
        }

        final String readBackFromFile;

        try (
            final Reader reader = Files.newBufferedReader(tmpfile, UTF32);
        ) {
            readBackFromFile = CharStreams.toString(reader);
        }

        Files.delete(tmpfile);

        System.out.println(HELLO.equals(readBackFromFile));
    }
}

This program prints `true`. Now, some notes:

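A `Charset` in Java is a class wrapping a character encoding, in both directions: you can use a `CharsetDecoder` to decode a stream of bytes into a stream of characters, or a `CharsetEncoder` to encode a stream of characters into a stream of bytes.
This is why Java has `char` vs `byte`.
For historical reasons, however, a `char` is only a 16-bit unsigned number: when Java was born, Unicode did not define code points outside of what is now known as the BMP (Basic Multilingual Plane; that is, any code point defined in the range U+0000-U+FFFF).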

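With all of that out of the way, the code above does the following:

Given some "text", represented here as a `String`, it first converts that text into a byte sequence and then writes it to a file.
It then reads the file back: the file is only a byte sequence, but applying the reverse conversion recovers the "original text" stored in it.
Note that `CharStreams.toString()` is not in the standard JDK; it is a class from Guava.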

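Now, as to C... My problem is as follows:

While discussing this matter, I learned about the C11 standard.
However, there does not seem to be an equivalent of Java's `Charset`; another remark from the chat room was that with C you are SOL, but that C++ has `codecvt`...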

And yes, I am aware that UTF-32 depends on endianness; with Java, big-endian is the default.

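But basically: how would I write the program above in C? Say I want to write either the writing side or the reading side in C, how would I go about it?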

In C, you would typically use a library such as libiconv, libunistring, or ICU.

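If you only want to handle UTF-32, you can directly write and read arrays of 32-bit integers holding Unicode code points, in either little-endian or big-endian order. Unlike UTF-8 or UTF-16, a UTF-32 string needs no special encoding or decoding. You can use any 32-bit integer type; I prefer C99's `uint32_t` over C11's `char32_t`. For example: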

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    // Could also contain non-ASCII code points.
    static const uint32_t hello[] = {
        'H', 'e', 'l', 'l', 'o', ' ',
        'w', 'o', 'r', 'l', 'd'
    };
    static size_t num_chars = sizeof(hello) / sizeof(uint32_t);

    const char *path = "test.txt";

    FILE *outstream = fopen(path, "wb");
    if (outstream == NULL) {
        perror(path);
        exit(1);
    }

    // Write big-endian 32-bit integers.
    for (size_t i = 0; i < num_chars; i++) {
        uint32_t code_point = hello[i];

        for (int j = 0; j < 4; j++) {
            int c = (code_point >> ((3 - j) * 8)) & 0xFF;
            fputc(c, outstream);
        }
    }

    fclose(outstream);

    FILE *instream = fopen(path, "rb");
    if (instream == NULL) {
        perror(path);
        exit(1);
    }

    // Get file size.
    fseek(instream, 0, SEEK_END);
    long file_size = ftell(instream);
    if (file_size < 0) {
        perror("ftell");
        exit(1);
    }
    rewind(instream);

    if (file_size % 4) {
        fprintf(stderr, "File contains partial UTF-32\n");
        exit(1);
    }
    if ((unsigned long)file_size > SIZE_MAX) {
        fprintf(stderr, "File too large\n");
        exit(1);
    }

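    size_t num_chars_in = file_size / sizeof(uint32_t);
    // Request at least one byte so that an empty file does not call malloc(0).
    uint32_t *read_back = malloc(file_size ? (size_t)file_size : 1);
    if (read_back == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(1);
    }

    // Read big-endian 32-bit integers.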
    for (size_t i = 0; i < num_chars_in; i++) {
        uint32_t code_point = 0;

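        for (int j = 0; j < 4; j++) {
            int c = fgetc(instream);
            if (c == EOF) {
                fprintf(stderr, "Unexpected end of file\n");
                exit(1);
            }
            // Cast before shifting: shifting a signed int into the sign bit is undefined.
            code_point |= (uint32_t)c << ((3 - j) * 8);
        }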

        read_back[i] = code_point;
    }

    fclose(instream);

    bool equal = num_chars == num_chars_in
                 && memcmp(hello, read_back, file_size) == 0;
    printf("%s\n", equal ? "true" : "false");

    free(read_back);

    return 0;
}

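I suppose you do not want answers about [java]?
I do not think "this is why Java has char vs byte" is entirely correct... Can you back that up with a good official link?
@hagrawal you can read the JLS on the subject, but that is how it is ;)
@PeterLawrey: OK, agreed; but does it warrant removing the tag?
@fge IMHO tags should reflect the answers you expect, not the technologies mentioned in the question. If that is what you are looking for in questions, you can search by word.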

$ gcc -std=c99 -Wall so.c -o so
$ ./so
true
$ hexdump -C test.txt
00000000  00 00 00 48 00 00 00 65  00 00 00 6c 00 00 00 6c  |...H...e...l...l|
00000010  00 00 00 6f 00 00 00 20  00 00 00 77 00 00 00 6f  |...o... ...w...o|
00000020  00 00 00 72 00 00 00 6c  00 00 00 64              |...r...l...d|
0000002c