Java: Storing a large map in resources

java dictionary serialization


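I need to use a large file containing String,String pairs, and because I want to ship it with a JAR, I opted to include a serialized and gzipped version in the resources folder of the application. This is how I created the serialization: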

ObjectOutputStream out = new ObjectOutputStream(
            new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(OUT_FILE_PATH, false))));
out.writeObject(map);
out.close();

I chose to use a HashMap. The resulting file is 60MB and the map contains about 4 million entries.

Now, whenever I need the map, I deserialize it using:

final InputStream in = FileUtils.getResource("map.ser.gz");
final ObjectInputStream ois = new ObjectInputStream(new BufferedInputStream(new GZIPInputStream(in)));
map = (Map<String, String>) ois.readObject();
ois.close();


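This takes about 10~15 seconds. Is there a better way to store such a large map in a JAR? I ask because I also use the Stanford CoreNLP library, which uses large model files itself but seems to perform better in that regard. I tried to locate the code that reads the model files but gave up.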

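You can apply a technique from the book Java Performance: The Definitive Guide by Scott Oaks, which stores the zipped content of the object in a byte array. For this we need a wrapper class, which I call MapHolder here: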

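import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MapHolder implements Serializable {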
    // This will contain the zipped content of my map
    private byte[] content;
    // My actual map defined as transient as I don't want to serialize its 
    // content but its zipped content
    private transient Map<String, String> map;

    public MapHolder(Map<String, String> map) {
        this.map = map;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(baos);
            ObjectOutputStream oos = new ObjectOutputStream(
                new BufferedOutputStream(zip))) {
            oos.writeObject(map);
        }
        this.content = baos.toByteArray();
        out.defaultWriteObject();
        // Clear the temporary field content
        this.content = null;
    }

    private void readObject(ObjectInputStream in) throws IOException,
        ClassNotFoundException {
        in.defaultReadObject();
        try (ByteArrayInputStream bais = new ByteArrayInputStream(content);
            GZIPInputStream zip = new GZIPInputStream(bais);
            ObjectInputStream ois = new ObjectInputStream(
                new BufferedInputStream(zip))) {
            this.map = (Map<String, String>) ois.readObject();
            // Clean the temporary field content
            this.content = null;
        }
    }

    public Map<String, String> getMap() {
        return this.map;
    }
}

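As you may have noticed, the stream no longer needs to be zipped from the outside: the content is zipped internally while the MapHolder instance is serialized.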
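The writing side of this approach can be sketched as follows. This is a minimal round trip under stated assumptions, not the book's code: `ZippedMap` is a compact, hypothetical stand-in for the MapHolder class above, and `HolderRoundTrip`, `serialize`, and `deserialize` are illustrative names. The point it shows is that the outer stream is a plain ObjectOutputStream with no GZIP layer around it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class HolderRoundTrip {

    /** Hypothetical compact stand-in for MapHolder: serializes only the gzipped bytes. */
    static final class ZippedMap implements Serializable {
        private static final long serialVersionUID = 1L;
        private byte[] content;                     // gzipped, serialized form of the map
        private transient Map<String, String> map;  // never written directly

        ZippedMap(Map<String, String> map) { this.map = map; }

        Map<String, String> getMap() { return map; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            // Closing the ObjectOutputStream also finishes the GZIP stream.
            try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(baos))) {
                oos.writeObject(map);
            }
            content = baos.toByteArray();
            out.defaultWriteObject();
            content = null; // temporary field, only needed during serialization
        }

        @SuppressWarnings("unchecked")
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            try (ObjectInputStream ois = new ObjectInputStream(
                    new GZIPInputStream(new ByteArrayInputStream(content)))) {
                map = (Map<String, String>) ois.readObject();
            }
            content = null;
        }
    }

    /** Serializes the holder with a plain ObjectOutputStream (no outer GZIP layer). */
    static byte[] serialize(Map<String, String> map) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ZippedMap(map));
        }
        return bytes.toByteArray();
    }

    /** Reads the holder back and unwraps the map. */
    static Map<String, String> deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return ((ZippedMap) in.readObject()).getMap();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> original = new HashMap<>();
        original.put("key", "value");
        Map<String, String> restored = deserialize(serialize(original));
        System.out.println(original.equals(restored));
    }
}
```

For a real build step, the byte array would be written to a file in the resources folder instead of being kept in memory. Whether this is faster than the original setup for a 4-million-entry map has to be measured.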

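Your problem is that you zipped the data. Store it as plain text.

The performance hit is most likely in unzipping the stream. JARs are already zipped, so storing the file zipped saves no space.

Basically:

Store the file as plain text.

Use Files.lines(Paths.get("myfilename.txt")) to stream the lines.

Consume each line with minimal code.

Something like this, assuming the data is in the format key=value (like a properties file):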

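Map<String, String> map = new HashMap<>();
// Files.lines keeps the file open until the stream is closed
try (Stream<String> lines = Files.lines(Paths.get("myfilename.txt"))) {
    lines.map(s -> s.split("=", 2)) // split on the first '=' only
         .forEach(a -> map.put(a[0], a[1]));
}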

Disclaimer: the code was thumbed in on my phone, so it may not compile or work (but there's a reasonable chance it will).
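Because the data ships inside a JAR, it is a classpath resource rather than a file on disk, so a path-based read does not apply directly. The following is a minimal sketch of reading key=value lines from a resource stream instead. `TextMapLoader`, `parse`, `loadResource`, and the expected-entries parameter are illustrative names, not part of the original answer, and whether this beats deserialization should be measured rather than assumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class TextMapLoader {

    /** Parses key=value lines, splitting on the first '=' so values may contain '='. */
    static Map<String, String> parse(Reader source, int expectedEntries) throws IOException {
        // Presize the table so it does not rehash repeatedly while filling.
        int capacity = Math.max(16, (int) (expectedEntries / 0.75f) + 1);
        Map<String, String> map = new HashMap<>(capacity);
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq < 0) {
                    continue; // skip lines without a separator
                }
                map.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return map;
    }

    /** Loads a classpath resource by name; the name is supplied by the caller. */
    static Map<String, String> loadResource(String name, int expectedEntries) throws IOException {
        InputStream in = TextMapLoader.class.getResourceAsStream(name);
        if (in == null) {
            throw new IOException("resource not found: " + name);
        }
        return parse(new InputStreamReader(in, StandardCharsets.UTF_8), expectedEntries);
    }

    public static void main(String[] args) throws IOException {
        // Small in-memory demo; a real call would use loadResource with the resource name.
        Map<String, String> demo = parse(new StringReader("a=1\nb=2\n"), 2);
        System.out.println(demo);
    }
}
```

Using indexOf and substring avoids the regex machinery behind String.split, and presizing the HashMap avoids repeated resizing for millions of entries.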

You could consider one of the many fast serialization libraries:

protobuf
FlatBuffers
Cap'n Proto
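The idea behind these libraries, a compact format tailored to the data instead of general-purpose Java serialization, can be illustrated with the standard library alone. The sketch below does not use protobuf, FlatBuffers, or Cap'n Proto; it is a hand-rolled length-prefixed format built on DataOutputStream and DataInputStream, and `BinaryMapCodec` is a hypothetical name. It assumes every key and value fits within the 65535-byte encoded limit of writeUTF.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

final class BinaryMapCodec {

    /** Writes the entry count, then each key and value as modified UTF-8. */
    static void write(Map<String, String> map, OutputStream target) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(target))) {
            out.writeInt(map.size());
            for (Map.Entry<String, String> entry : map.entrySet()) {
                // writeUTF throws UTFDataFormatException above 65535 encoded bytes
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
        }
    }

    /** Reads a map written by write(), presizing the table from the stored count. */
    static Map<String, String> read(InputStream source) throws IOException {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(source))) {
            int size = in.readInt();
            if (size < 0) {
                throw new IOException("corrupt entry count: " + size);
            }
            Map<String, String> map = new HashMap<>((int) (size / 0.75f) + 1);
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                map.put(key, value);
            }
            return map;
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> original = new HashMap<>();
        original.put("key", "value");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(original, bytes);
        System.out.println(read(new ByteArrayInputStream(bytes.toByteArray())));
    }
}
```

Storing the entry count first lets the reader allocate the HashMap once at its final size. Any speedup over ObjectInputStream for this data set is an expectation to verify with a benchmark, not a guarantee.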

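What takes 10~15 seconds, writing the map or reading it? What do you want to improve?

His second code block clearly tells you that reading the file takes 10-15 seconds.

Check this to improve serialization performance, and have a look at the flush method.

Some benchmarks here may help.

FileUtils.getResource("map.ser.gz") returns an input stream for the file contained in the resources folder of the JAR.

I used your solution and saw a minimal speedup.

Multiple problems with this one. It is not a file in the file system but a resource in my JAR, although reading the lines was no problem. Using a stream and splitting each line individually is actually slower than deserializing.

@eike I know it's in the jar. That's the point: when it is added to the jar, it gets zipped already.

OK, I misunderstood your question (the file is a serialized object, not a text file), but the basic principle of my answer still applies: don't zip the file, put it in the jar as-is.

Yes, not zipping the file makes it faster.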

Deserializing a MapHolder that was written without an outer GZIP layer:

final ByteArrayInputStream in = new ByteArrayInputStream(
    Files.readAllBytes(Paths.get("/tmp/map.ser"))
);
final ObjectInputStream ois = new ObjectInputStream(in);
MapHolder holder = (MapHolder) ois.readObject();
map = holder.getMap();
ois.close();
