C++: buffer size for reading a UTF-8 encoded file with ICU (ICU4C)

c++ unicode c++11 fstream icu

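I am trying to read a UTF-8 encoded file with ICU4C on Windows with msvc11. I need to determine the buffer size in order to build a UnicodeString. Since the ICU4C API has no fseek-like function, I thought I could use the underlying C file: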

#include <unicode/ustdio.h>
#include <stdio.h>
/*...*/
UFILE *in = u_fopen("utfICUfseek.txt", "r", NULL, "UTF-8");
FILE* inFile = u_fgetfile(in);
fseek(inFile,  0, SEEK_END); /* Access violation here */
int size = ftell(inFile);
auto uChArr = new UChar[size];

What bothers me is that I have to create an extra buffer for the chars and only then convert them into the UnicodeString I actually need.
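On the sizing question itself: a UTF-8 sequence of 1 to 3 bytes becomes one UTF-16 code unit and a 4-byte sequence becomes two, so the file's byte count is always a safe upper bound for the number of UChar elements. If the exact count matters, it can be derived from the lead bytes. The sketch below uses only the standard library; `utf16_units_needed` is a hypothetical helper name, not an ICU function, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count the UTF-16 code units needed to hold a well-formed UTF-8 string.
// Hypothetical helper for illustration; it is not part of ICU.
std::size_t utf16_units_needed(const std::string& utf8)
{
    std::size_t units = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) == 0x80)
            continue;                 // continuation byte: counted with its lead byte
        units += (byte >= 0xF0) ? 2   // 4-byte sequence -> surrogate pair
                                : 1;  // 1- to 3-byte sequence -> one code unit
    }
    return units;
}
```

Because the result never exceeds `utf8.size()`, allocating one UChar per input byte is wasteful at worst, never too small.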

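Here is an alternative to using ICU.

With the standard std::fstream you can read all or part of the file into a standard std::string and then iterate over it with a Unicode-aware iterator: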

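#include <cerrno>
#include <fstream>
#include <iterator>
#include <string>

std::string get_file_contents(const char *filename)
{
  // Open in binary mode so the UTF-8 bytes are read unmodified.
  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (in)
  {
    std::string contents;
    // Seek to the end to learn the size and reserve it up front.
    in.seekg(0, std::ios::end);
    contents.reserve(in.tellg());
    in.seekg(0, std::ios::beg);
    contents.assign((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
    in.close();
    return(contents);
  }
  throw(errno);
}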
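For comparison, here is a variant that sizes the string from the tellg() result and fills it with a single read() call instead of assign(). This is only a sketch: the name `get_file_contents_read` and the use of std::runtime_error are assumptions made for illustration, and whether it is actually faster would have to be measured.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Variant of get_file_contents that allocates once and calls read() once.
// The function name is hypothetical.
std::string get_file_contents_read(const char* filename)
{
    std::ifstream in(filename, std::ios::in | std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open file");

    // Determine the size in bytes by seeking to the end.
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 0)
        throw std::runtime_error("cannot determine file size");
    in.seekg(0, std::ios::beg);

    std::string contents(static_cast<std::size_t>(size), '\0');
    if (size > 0)
        in.read(&contents[0], static_cast<std::streamsize>(size));

    // Keep only the bytes that were actually read.
    contents.resize(static_cast<std::size_t>(in.gcount()));
    return contents;
}
```

If ICU is still needed afterwards, the resulting bytes can be converted in one step with icu::UnicodeString::fromUTF8.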

Then, in your code:

std::string myString = get_file_contents( "foobar" );
unicode::iterator< std::string, unicode::utf8 /* or utf16/32 */ > iter = myString.begin();

while ( iter != myString.end() )
{
    ...
    ++iter;
}
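The `unicode::iterator` above is not part of the standard library. To show what such an iterator has to do on each step, here is a minimal standard-library sketch that decodes one code point at a time. The names `next_code_point` and `count_code_points` are hypothetical, and the decoder assumes well-formed UTF-8: it does not reject overlong forms or encoded surrogates, and it returns U+FFFD for an invalid lead byte, a bad continuation byte, or a truncated sequence.

```cpp
#include <cstddef>
#include <string>

// Decode the code point starting at s[pos] and advance pos past it.
// Precondition: pos < s.size(). On a malformed sequence, returns U+FFFD
// and advances by one byte. Hypothetical helper, not a library API.
char32_t next_code_point(const std::string& s, std::size_t& pos)
{
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t extra;   // number of continuation bytes expected
    char32_t cp;         // payload bits collected so far
    if (lead < 0x80)                { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else                            { ++pos; return 0xFFFD; }

    if (pos + extra >= s.size()) { ++pos; return 0xFFFD; }  // truncated
    for (std::size_t i = 1; i <= extra; ++i)
    {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) { ++pos; return 0xFFFD; }   // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    pos += extra + 1;
    return cp;
}

// Walk the whole string, mirroring the iterator loop, and count code points.
std::size_t count_code_points(const std::string& s)
{
    std::size_t pos = 0, count = 0;
    while (pos < s.size())
    {
        next_code_point(s, pos);
        ++count;
    }
    return count;
}
```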

Well, either you want to read the whole file at once for some kind of post-processing, in which case icu::UnicodeString is not really the best container...

#include <iostream>
#include <fstream>
#include <sstream>

int main()
{
    std::ifstream in( "utfICUfSeek.txt" );
    std::stringstream buffer;
    buffer << in.rdbuf();
    in.close();
    // ...
    return 0;
}

...or I have no idea what your real problem is. ;)

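Thanks for your answer. get_file_contents is what I was looking for, although I don't know which is faster: the assign(...) function, which has linear complexity (number (7)), or the read function given the tellg() result (see edit). The iterator solution is interesting and I will explore the source, but I may also need ICU's collation and locales, so I may not be able to drop this library. So far, this library has seemed outdated.

I think I was trying to avoid creating a separate buffer for the chars, because you end up with both a UTF-8 string in RAM (in that char array) and a UTF-16 string (internally, in the UnicodeString). So one solution would be to implement a function that reads UTF-8 code units and deduces the code points from them one by one. It would convert one group of UTF-8 code units (forming a valid code point) into UTF-16 code units, "load" them into the UnicodeString, then try to deduce the next group of code units sufficient to form a code point, and so on. This approach needs at most 4 bytes of UTF-8 code units in RAM at a time, plus a UTF-16 buffer of at most TotalCodePoints*4 bytes. However, you would have to load the UChar buffer (UChar is just a typedef for a 16-bit integer) into the UnicodeString without copying the UTF-16 string (it would be fitting if it released the buffer once the string is deleted).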

#include <iostream>
#include <fstream>

#include <unicode/unistr.h>
#include <unicode/ustream.h>

int main()
{
    std::ifstream in( "utfICUfSeek.txt" );
    icu::UnicodeString uStr;
    in >> uStr;
    // ...
    in.close();
    return 0;
}