Serialization: storing a set of protobufs on disk

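I use protobuf as the serializer to format data on disk. I may have a large set of protobuf objects, say, millions of them. What is the best way to lay them out on disk? The protobuf objects will be read sequentially one by one, or read by random access through an external index.

I used to use a length(int) + protobuf_object + length(int) ... format, but it fails if one of the protobufs happens to be dirty. And if many of the protobuf objects are small, it may carry some overhead.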

(I hope I have understood your question correctly and that my answer fits your use case!)

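One technique for storing an arbitrary stream of Protocol Buffers messages on disk is to define a wrapper message in which every field is declared repeated (which implies optional). When you read the bytes back, you get an instance of the wrapper class and call its hasX() methods (or, for repeated fields, the generated count accessors) to find out what you actually have. The problem with this approach in your case is that you get neither random access nor true streaming (all the Foo messages are grouped together, followed by all the Bars), and if your data is too large you will not be able to fit the whole batch of messages in memory.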

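In fact, what you need is essentially a way to store data of any kind so that it can be streamed or accessed at random. That is a general problem, not one specific to Protocol Buffers.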

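Your problems are:

Delimiting records... (see note)
...in such a way that corruption can be detected, and tolerated or repaired
...while maintaining an index that allows random access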
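
To make the first two points concrete, here is a minimal sketch of a record frame that stores a length and a CRC-32 in front of each payload, so that a damaged record is detected rather than silently derailing the reader. The frame layout, the names AppendRecord and ReadRecord, and the use of an in-memory std::string as a stand-in for the file are all assumptions made for illustration; nothing here is prescribed by Protocol Buffers. The payload would be the bytes obtained by serializing one message.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib/PNG).
uint32_t Crc32(const std::string& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial if the dropped bit was 1.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Appends `value` to `out` as 4 little-endian bytes.
void PutU32(std::string* out, uint32_t value) {
    for (int i = 0; i < 4; ++i) {
        out->push_back(static_cast<char>((value >> (8 * i)) & 0xFF));
    }
}

// Reads 4 little-endian bytes at `pos`; the caller guarantees they exist.
uint32_t GetU32(const std::string& in, size_t pos) {
    assert(pos + 4 <= in.size());
    uint32_t value = 0;
    for (int i = 0; i < 4; ++i) {
        value |= static_cast<uint32_t>(static_cast<unsigned char>(in[pos + i])) << (8 * i);
    }
    return value;
}

// Hypothetical frame layout: [length:4][crc32:4][payload:length].
void AppendRecord(std::string* file, const std::string& payload) {
    PutU32(file, static_cast<uint32_t>(payload.size()));
    PutU32(file, Crc32(payload));
    file->append(payload);
}

// Reads the record starting at *pos. On success stores it in *payload,
// advances *pos past the record, and returns true. Returns false (leaving
// *pos untouched) if the record is truncated or its checksum does not match.
bool ReadRecord(const std::string& file, size_t* pos, std::string* payload) {
    if (*pos > file.size() || file.size() - *pos < 8) return false;
    const uint32_t length = GetU32(file, *pos);
    const uint32_t expected_crc = GetU32(file, *pos + 4);
    if (file.size() - *pos - 8 < length) return false;
    std::string candidate = file.substr(*pos + 8, length);
    if (Crc32(candidate) != expected_crc) return false;
    *payload = candidate;
    *pos += 8 + length;
    return true;
}
```

When ReadRecord returns false, a reader can either stop or jump to the next known-good offset taken from an external index, depending on how much corruption it is willing to tolerate.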

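You could use the index to provide some kind of integrity check, but even then you need a mechanism that ensures the index and the data correspond to each other and stay in sync.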
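
One way to picture the index side is to keep an (offset, length) pair per record and to cross-check that length against the length prefix stored in the data on every lookup. The sketch below is only an illustration under assumptions: IndexedLog, IndexEntry, and the two std containers standing in for the data file and the index file are hypothetical, and a real implementation would also have to decide how to flush both files consistently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One index entry per record: where the record starts and how long its payload is.
struct IndexEntry {
    uint64_t offset;
    uint32_t length;
};

struct IndexedLog {
    std::string data;               // stand-in for the data file
    std::vector<IndexEntry> index;  // stand-in for the external index

    // Appends [length:4 little-endian][payload] and records it in the index.
    // Returns the record number.
    size_t Append(const std::string& payload) {
        assert(payload.size() <= 0xFFFFFFFFu);
        const uint32_t length = static_cast<uint32_t>(payload.size());
        index.push_back({data.size(), length});
        for (int i = 0; i < 4; ++i) {
            data.push_back(static_cast<char>((length >> (8 * i)) & 0xFF));
        }
        data.append(payload);
        return index.size() - 1;
    }

    // Random access by record number. Returns false if the entry does not
    // exist or if the index and the data file disagree about the record.
    bool Get(size_t record, std::string* payload) const {
        if (record >= index.size()) return false;
        const IndexEntry& entry = index[record];
        if (entry.offset > data.size() || data.size() - entry.offset < 4) return false;
        uint32_t stored = 0;
        for (int i = 0; i < 4; ++i) {
            const unsigned char byte = static_cast<unsigned char>(data[entry.offset + i]);
            stored |= static_cast<uint32_t>(byte) << (8 * i);
        }
        if (stored != entry.length) return false;  // index and data out of sync
        if (data.size() - entry.offset - 4 < stored) return false;  // truncated
        payload->assign(data, entry.offset + 4, stored);
        return true;
    }
};
```

The length comparison is a cheap consistency check, not a proof of integrity; a stale index entry that happens to point at bytes encoding the same length would still pass it.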

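So, although it may not be the ideal solution, one way to achieve what you want, especially if integrity is a concern, is to store this information in a database that can hold binary data and return it quickly. Random access and data integrity then become the database provider's responsibility. Any traditional database that can store BLOBs will do, though I would also consider a NoSQL store such as MongoDB.

Note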

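If you define your Protocol Buffers carefully (that is, you know the type and length of every stored field), you do not actually need to delimit your records, because their length never changes. However, this defeats one of the features of Protocol Buffers, namely its future-proof nature. If you design your .proto so that the message size is fixed, you cannot add new fields and still fit the same file format, in which it is safe to say that each new message starts after x bytes.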
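
As a small illustration of what "carefully" implies: protobuf encodes int32, int64, uint32, uint64, bool, and enum fields as base-128 varints, whose encoded size depends on the value, so only the fixed-width types (fixed32, fixed64, sfixed32, sfixed64, float, double) give a constant size. The helper names below are hypothetical and the code does not use the protobuf library; it only reproduces the size arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes a value occupies when encoded as a base-128 varint:
// 7 payload bits per byte, so the size grows with the magnitude of the value.
size_t VarintSize(uint64_t value) {
    size_t bytes = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}

// With truly fixed-size records, record i starts at i * record_size,
// which is what makes delimiters unnecessary.
uint64_t FixedRecordOffset(uint64_t record_index, uint64_t record_size) {
    assert(record_size > 0);
    return record_index * record_size;
}
```

For example, VarintSize(1) is 1 but VarintSize(300) is 2, so a message holding a single int32 field changes length with its value and cannot be located with FixedRecordOffset.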

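If you only need sequential access, the simplest way to store multiple messages is to write the size of each object before it, as the documentation describes.

For example, you could create a "MessagesFile" class with the following member functions to open, read, and write your messages: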

// Opens the file in append mode and wraps it in a FileOutputStream
// and a CodedOutputStream.
bool Open(const std::string& filename,
          int buffer_size = kDefaultBufferSize) {

    file_ = open(filename.c_str(),
                 O_WRONLY | O_APPEND | O_CREAT,           // open mode
                 S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);  // file permissions (0644); a data file does not need the setuid bit

    if (file_ != -1) {
        file_ostream_ = new FileOutputStream(file_, buffer_size);
        ostream_ = new CodedOutputStream(file_ostream_);
        return true;
    } else {
        return false;
    }
}

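// Appends one message, preceded by its size as a 32-bit little-endian integer.
bool Serialize(const google::protobuf::Message& message) {
    // ByteSizeLong() replaces the deprecated ByteSize().
    ostream_->WriteLittleEndian32(
        static_cast<google::protobuf::uint32>(message.ByteSizeLong()));
    return message.SerializeToCodedStream(ostream_);
}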

// Reads the next message using a FileInputStream wrapped in a
// CodedInputStream. Returns false at end of file or on a parse error.
bool Next(google::protobuf::Message *msg) {
    google::protobuf::uint32 size;
    if (!istream_->ReadLittleEndian32(&size)) {
        return false;  // no more records
    }
    CodedInputStream::Limit msgLimit = istream_->PushLimit(size);
    bool parsed = msg->ParseFromCodedStream(istream_);
    istream_->PopLimit(msgLimit);  // restore the limit even if parsing failed
    return parsed;
}

Then, to write your messages, use:

MessagesFile file;
file.Open("your_file.dat");

file.Serialize(your_message1);
file.Serialize(your_message2);
...
// close the file

To read all of your messages:

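// Note: the Open() shown above opens the file for appending only; reading
// needs a read-mode equivalent that sets up istream_ (not shown here).
MessagesFile reader;
reader.Open("your_file.dat");

MyMsg msg;
while( reader.Next(&msg) ) {
    // use your message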
}
...
// close the file

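It might be useful to know how you access the data: reading sequentially only, random access, random writes, searching by some criterion? Also, define "fails when one protobuf is dirty"; do you mean "I can't just overwrite that part of the file, because if the length changes there will be a gap (with garbage) in the file, or it will overwrite the next piece of data"?

Yes, I only want to read it sequentially, or look records up through an "index" into the file. No random writes are needed.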