Java - IDX files: how to read them?


java image-processing


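In my machine learning course, the professor asked us to implement a classifier for handwritten digits using an SVM. He gave us this dataset, which also includes a set for testing the classifier.

I tried to write a Java class that reads the dataset and converts it into a data structure (for example a HashMap), but I ran into a lot of trouble reading the files. The dataset's website describes the structure of these IDX files as follows:

The basic format is: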

magic number 
size in dimension 0 
size in dimension 1 
size in dimension 2 
..... 
size in dimension N 
data

The magic number is an integer (MSB first). The first 2 bytes are always 0.

The third byte encodes the type of the data:

0x08: unsigned byte 
0x09: signed byte 
0x0B: short (2 bytes) 
0x0C: int (4 bytes) 
0x0D: float (4 bytes) 
0x0E: double (8 bytes)

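The 4th byte encodes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices.

The size in each dimension is a 4-byte integer (MSB first, high endian, as in most non-Intel processors).

The data is stored as in a C array, i.e. the index in the last dimension changes the fastest.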
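The header rules above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class `IdxHeader` and its methods `read`, `elementSize`, and `elementIndex` are hypothetical names that do not appear in the original post. It relies on the fact that `DataInputStream.readInt()` reads big-endian (MSB first) integers.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not from the original post) that parses an IDX header. */
final class IdxHeader {
    final int typeCode;   // third byte of the magic number, e.g. 0x08
    final int[] sizes;    // one entry per dimension

    IdxHeader(int typeCode, int[] sizes) {
        this.typeCode = typeCode;
        this.sizes = sizes;
    }

    /** Reads the magic number and the dimension sizes from the start of the stream. */
    static IdxHeader read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int magic = data.readInt();               // big-endian, as the format requires
        if ((magic >>> 16) != 0) {
            throw new IOException("Not an IDX file: the first two bytes must be 0");
        }
        int typeCode = (magic >>> 8) & 0xFF;      // third byte: data type
        int dimensions = magic & 0xFF;            // fourth byte: number of dimensions
        int[] sizes = new int[dimensions];
        for (int i = 0; i < dimensions; i++) {
            sizes[i] = data.readInt();            // each size is a 4-byte big-endian integer
        }
        return new IdxHeader(typeCode, sizes);
    }

    /** Size in bytes of one element, for the type codes listed in the format description. */
    static int elementSize(int typeCode) {
        switch (typeCode) {
            case 0x08: case 0x09: return 1;       // unsigned byte, signed byte
            case 0x0B: return 2;                  // short
            case 0x0C: case 0x0D: return 4;       // int, float
            case 0x0E: return 8;                  // double
            default: throw new IllegalArgumentException("Unknown IDX type code: " + typeCode);
        }
    }

    /** Element position in C-array order: the last index changes fastest. */
    int elementIndex(int... indices) {
        int offset = 0;
        for (int d = 0; d < sizes.length; d++) {
            offset = offset * sizes[d] + indices[d];
        }
        return offset;
    }

    public static void main(String[] args) throws IOException {
        // Hand-built header: magic 0x00000801, then one dimension of size 10000.
        byte[] header = {0, 0, 0x08, 0x01, 0, 0, 0x27, 0x10};
        IdxHeader h = IdxHeader.read(new ByteArrayInputStream(header));
        System.out.println("type=0x" + Integer.toHexString(h.typeCode)
                + " dims=" + h.sizes.length + " items=" + h.sizes[0]);
    }
}
```

Multiplying `elementIndex(...)` by `elementSize(typeCode)` and adding the header length (4 bytes plus 4 bytes per dimension) would give the byte offset of an element within the file.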

So I started with the file t10k-labels-idx1-ubyte:

[offset] [type]          [value]          [description] 
0000     32 bit integer  0x00000801(2049) magic number (MSB first) 
0004     32 bit integer  10000            number of items 
0008     unsigned byte   ??               label 
0009     unsigned byte   ??               label 
........ 
xxxx     unsigned byte   ??               label
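For the label-file layout shown in this table, one possible reading routine is sketched below. The class name `IdxLabelReader` and the method `readLabels` are hypothetical and not part of the original post. The sketch assumes the file follows the layout exactly (magic number 0x00000801, an item count, then one unsigned byte per label).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical reader (not from the original post) for an IDX label file. */
final class IdxLabelReader {

    /** Reads the header and then one unsigned byte per label. */
    static int[] readLabels(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        int magic = data.readInt();               // offset 0: big-endian magic number
        if (magic != 0x00000801) {
            throw new IOException("Unexpected magic number: " + magic);
        }
        int count = data.readInt();               // offset 4: number of items
        if (count < 0) {
            throw new IOException("Negative item count: " + count);
        }
        int[] labels = new int[count];
        for (int i = 0; i < count; i++) {
            labels[i] = data.readUnsignedByte();  // 0..255; avoids Java's signed byte
        }
        return labels;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            // Assumption: the path to an uncompressed label file is passed as the only argument.
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                System.out.println("Read " + readLabels(in).length + " labels");
            }
        } else {
            // In-memory sample: magic 0x00000801, 3 items, labels 7, 2, 1.
            byte[] sample = {0, 0, 0x08, 0x01, 0, 0, 0, 3, 7, 2, 1};
            int[] labels = readLabels(new ByteArrayInputStream(sample));
            System.out.println("Read " + labels.length + " labels");
        }
    }
}
```

Using `readUnsignedByte()` rather than a plain `byte` keeps values above 127 positive, which matters for the unsigned-byte data type (0x08) even though digit labels themselves are small.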

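That is its structure, so I wrote a Java class that should read the file and compute the magic number and the number of items. This is the class: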

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ReadTest {

    static String testLabel = "t10k-labels-idx1-ubyte";

    public static void main(String[] args) {

        byte[] bytes = null;
        byte[] fileName = testLabel.getBytes(StandardCharsets.UTF_8);

        // Read the whole file into memory; bytes stays null if reading fails.
        try {bytes = Files.readAllBytes(Paths.get(testLabel));}
        catch (IOException e) {System.err.println("Error: " + e.getMessage());}

        if(bytes != null) {

            List<Byte> listBytes = new ArrayList<>();
            List<Byte> listName = new ArrayList<>();

            for(int i=0; i<bytes.length; i++) listBytes.add(bytes[i]);
            for(int i=0; i<fileName.length; i++) listName.add(fileName[i]);

            System.out.println("I read " + listBytes.size() + " bytes!");
            System.out.println("Filename: " + listName.size() + " bytes!");

            // Check whether every byte of the file name also occurs somewhere in the file.
            if(listBytes.containsAll(listName)) System.out.println("Gotcha!");

            // The first 8 bytes with their order reversed.
            byte[] magic = {bytes[3], bytes[2], bytes[1], bytes[0]};
            byte[] items = {bytes[7], bytes[6], bytes[5], bytes[4]};

            // ByteBuffer is big-endian by default, so [RAW] matches the MSB-first format.
            System.out.println("[REVERSE] Magic: " + ByteBuffer.wrap(magic).getInt() + ". Items: " + ByteBuffer.wrap(items).getInt());
            System.out.println("[AFTER 32] Magic: " + ByteBuffer.wrap(bytes, 32, 4).getInt() + ". Items: " + ByteBuffer.wrap(bytes, 36, 4).getInt());
            System.out.println("[RAW] Magic: " + ByteBuffer.wrap(bytes, 0, 4).getInt() + ". Items: " + ByteBuffer.wrap(bytes, 4, 4).getInt());

        }

    }
}
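The [REVERSE] and [RAW] lines of the class above differ only in byte order. A small sketch on a hand-built 8-byte header shows the effect without needing the dataset. The header bytes are an assumption taken from the layout table (magic number 0x00000801, 10000 items), and `ByteOrderDemo`, `bigEndian`, and `littleEndian` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical demo (not from the original post) of big- vs little-endian decoding. */
final class ByteOrderDemo {

    /** Decodes 4 bytes starting at offset as an MSB-first int (ByteBuffer's default order). */
    static int bigEndian(byte[] b, int offset) {
        return ByteBuffer.wrap(b, offset, 4).getInt();
    }

    /** Decodes the same 4 bytes LSB-first, equivalent to reversing them by hand. */
    static int littleEndian(byte[] b, int offset) {
        return ByteBuffer.wrap(b, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        // Assumed header: magic 0x00000801 followed by the item count 10000 (0x00002710).
        byte[] header = {0, 0, 0x08, 0x01, 0, 0, 0x27, 0x10};

        System.out.println("[RAW] Magic: " + bigEndian(header, 0)
                + ". Items: " + bigEndian(header, 4));       // 2049 and 10000
        System.out.println("[REVERSE] Magic: " + littleEndian(header, 0)
                + ". Items: " + littleEndian(header, 4));    // 17301504 and 270991360
    }
}
```

Because the IDX format is MSB first and `ByteBuffer` defaults to big-endian, no manual byte reversal is needed in Java; the reversed variant produces the wrong values.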

I read it and tried it, but it gives me an OutOfMemoryError. I also tried using only the part of the code that computes the numbers, but it gives me the same result as my own code.

What error are you getting? I was able to run your program and got the output you expect: "I read 10008 bytes! Filename: 50 bytes! [REVERSE] Magic: 17301504. Items: 270991360 [AFTER 32] Magic: 67110660. Items: 66305 [RAW] Magic: 2049. Items: 10000"

One more thing you might want to check: the archive "t10k-labels-idx1-ubyte.gz", when decompressed, contains a file named "t10k-labels.idx1-ubyte". Your code uses a different name (a dash instead of a dot): "t10k-labels-idx1-ubyte".

I don't know why, but when I extracted the files their names stayed the way I had written them. I changed them to .id*-ubyte and everything started working! Thanks a lot, I think you saved my day (:
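Since the problem in this thread came down to the file name, one way to guard against it is to look for the file under both spellings before reading. The following is only a sketch: `IdxFileLocator` and `firstExisting` are hypothetical names, and the two candidate names are the ones quoted in the comments above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper (not from the original post) that picks the first file name that exists. */
final class IdxFileLocator {

    /** Returns the first candidate that is a regular file inside dir, if any. */
    static Optional<Path> firstExisting(Path dir, String... candidates) {
        for (String name : candidates) {
            Path p = dir.resolve(name);
            if (Files.isRegularFile(p)) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: the extracted files live in the current working directory.
        Optional<Path> labels = firstExisting(Paths.get("."),
                "t10k-labels-idx1-ubyte",    // name used in the question's code
                "t10k-labels.idx1-ubyte");   // name reported after decompression
        if (labels.isPresent()) {
            System.out.println("Using " + labels.get());
        } else {
            System.err.println("Label file not found under either name");
        }
    }
}
```

Printing a clear "not found" message makes a naming mismatch visible immediately instead of surfacing later as an unrelated read error.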