Java: parsing multiple large CSV files and adding all records to an ArrayList


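I currently have about 12 CSV files, each with roughly 1.5 million records.

I am using univocity-parsers as my CSV reader/parser library.

With univocity-parsers, I read each file and add all of its records to an ArrayList using the addAll() method. Once all 12 files have been parsed and added to the list, my code prints the size of the ArrayList at the end.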

for (int i = 0; i < 12; i++) {
    myList.addAll(parser.parseAll(getReader("file-" + i + ".csv")));
}


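It works fine at first, until it reaches the sixth consecutive file. After that it seems to take forever in the IntelliJ IDE output window and does not print the ArrayList size even after an hour, whereas the files before the sixth were processed quite quickly.

In case it helps, I am using a MacBook Pro (mid-2014) running OS X Yosemite.

This is a textbook exercise on fork and join.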

In parseAll, they preallocate the list with room for 10,000 elements:

/**
 * Parses all records from the input and returns them in a list.
 *
 * @param reader the input to be parsed
 * @return the list of all records parsed from the input.
 */
public final List<String[]> parseAll(Reader reader) {
    List<String[]> out = new ArrayList<String[]>(10000);
    beginParsing(reader);
    String[] row;
    while ((row = parseNext()) != null) {
        out.add(row);
    }
    return out;
}

Try a larger initial capacity, then check whether that helps.

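The problem is that you are running out of memory. When that happens, the computer slows to a crawl because it starts swapping memory out to disk and back again.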
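One way to check this diagnosis is to print how much heap the JVM is using after each file. The following is a minimal sketch; HeapCheck is a hypothetical helper name, not part of univocity-parsers, and it relies only on the standard java.lang.Runtime methods.

```java
// Hypothetical helper (not part of univocity-parsers) for watching heap usage.
final class HeapCheck {

    /** Maximum heap the JVM will try to use, in bytes (controlled by -Xmx). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Heap currently in use, in bytes: allocated heap minus its free part. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Human-readable summary, e.g. for printing after each parsed file. */
    static String summary() {
        long mb = 1024L * 1024L;
        return "used " + (usedHeapBytes() / mb) + " MB of max " + (maxHeapBytes() / mb) + " MB";
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Calling HeapCheck.summary() inside the loop over the files should show the used figure climbing toward the maximum as more rows are retained. The maximum can be raised with the JVM option -Xmx, but that only postpones the problem if the data keeps growing.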

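Reading everything into memory is definitely not the best strategy. Since you are only interested in computing some statistics, you do not need addAll() at all.

In computer science the goal is always a balance between memory consumption and execution speed. You can always trade one for the other: spend memory to gain speed, or give up speed to save memory.

So loading the whole files into memory is convenient for you, but it is not a solution, not even in the future when computers ship with terabytes of memory.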

// Counts records by streaming them one at a time; no row is kept after it is read.
public int getNumRecords(CsvParser parser, Reader reader, int start) {
    int toret = start;

    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        ++toret;
    }

    return toret;
}

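As you can see, this function uses no memory (apart from the row currently being read). You can call it in a loop over the CSV files and end up with the total number of rows. The next step is to create a class for all your statistics, replacing int start with an object: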

class Statistics {
   public Statistics() {
       numRows = 0;
       numComedies = 0;
   }

   public void countRow() {
       ++numRows;
   }

   public void countComedies() {
        ++numComedies;
   }

   // more things...
   private int numRows;
   private int numComedies;
}

public void calculateStatistics(CsvParser parser, Reader reader, Statistics stats) {
    parser.beginParsing(reader);

    String[] row;
    while ((row = parser.parseNext()) != null) {
        stats.countRow();
        // inspect row here and update the other counters, e.g. stats.countComedies()
    }
}
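Because the question comes from a fork/join exercise, the same streaming idea can be spread over several files at once. The following is only a sketch under stated assumptions: ParallelRowCount, CountTask, countLines and countAll are hypothetical names, and to keep the example self-contained it counts physical lines with a plain BufferedReader instead of calling the univocity parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: counts lines of many files in parallel without retaining any row.
final class ParallelRowCount {

    /** Counts the lines of files[from..to), halving the range until one file is left. */
    static final class CountTask extends RecursiveTask<Long> {
        private final List<Path> files;
        private final int from;
        private final int to;

        CountTask(List<Path> files, int from, int to) {
            this.files = files;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from == 0) {
                return 0L;
            }
            if (to - from == 1) {
                return countLines(files.get(from));
            }
            int mid = (from + to) >>> 1;
            CountTask left = new CountTask(files, from, mid);
            CountTask right = new CountTask(files, mid, to);
            left.fork();                         // run the left half asynchronously
            long rightCount = right.compute();   // do the right half in this thread
            return left.join() + rightCount;     // wait for the left half and combine
        }
    }

    /** Streams one file line by line; assumes UTF-8 text and one record per line. */
    static long countLines(Path file) {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    static long countAll(List<Path> files) {
        return ForkJoinPool.commonPool().invoke(new CountTask(files, 0, files.size()));
    }

    public static void main(String[] args) {
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        System.out.println("Total lines: " + countAll(files));
    }
}
```

Note that countLines counts every physical line, so a header row is included and a quoted field containing a line break would be counted twice. With a real CSV parser, the body of countLines would become the parseNext() loop shown above; this sketch assumes each task would then use its own parser instance rather than sharing one across threads. Only the per-file totals are combined, so memory use stays flat no matter how many rows there are.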

Hope this helps.

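I am the creator of this library. If you only want to count rows, use a RowProcessor. You do not even need to count the rows yourself, because the parser does that for you: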

// Let's create our own RowProcessor to analyze the rows
static class RowCount extends AbstractRowProcessor {

    long rowCount = 0;

    @Override
    public void processEnded(ParsingContext context) {
        // this returns the number of the last valid record.
        rowCount = context.currentRecord();
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();

    //Creates an instance of our own custom RowProcessor, defined above.
    RowCount myRowCountProcessor = new RowCount();

    CsvParserSettings settings = new CsvParserSettings();


    //Here you can select the column indexes you are interested in reading.
    //The parser will return values for the columns you selected, in the order you defined
    //By selecting no indexes here, no String objects will be created
    settings.selectIndexes(/*nothing here*/);

    //When you select indexes, the columns are reordered so they come in the order you defined.
    //By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
    settings.setColumnReorderingEnabled(false);

    //We instruct the parser to send all rows parsed to your custom RowProcessor.
    settings.setRowProcessor(myRowCountProcessor);

    //Finally, we create a parser
    CsvParser parser = new CsvParser(settings);

    //And parse! All rows are sent to your custom RowProcessor (RowCount)
    //I'm using a 150MB CSV file with 3.1 million rows.
    parser.parse(new File("c:/tmp/worldcitiespop.txt"));

    //Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Rows: " + myRowCountProcessor.rowCount);
    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");

}

Output:

Rows: 3173959
Time taken: 1062 ms

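Edit: I saw your comment about needing to use the actual data in the rows. In that case, process the rows in the rowProcessed() method of the RowProcessor class; that is the most efficient way to handle them.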

Edit 2:

If you only want to count rows, use getInputDimension from CsvRoutines:

    CsvRoutines csvRoutines = new CsvRoutines();
    InputDimension d = csvRoutines.getInputDimension(new File("/path/to/your.csv"));
    System.out.println(d.rowCount());
    System.out.println(d.columnCount());

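Maybe if you only care about the number of items, as your question suggests, there is no need to store everything in memory.

@qqilihq Hi, that is only my first step. My second step is to derive some statistics from the data, e.g. how many comedy books there are. It comes from the concurrent processing exercises in a textbook.

In fact, the number of preallocated rows is not an issue here, because ArrayList expands very quickly. The problem is that his VM does not have enough memory to hold everything.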
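To put a rough number on the last comment, the sketch below counts how many times a list would have to reallocate its backing array while growing from 10,000 slots to one file's worth of rows. It assumes the growth rule used by common OpenJDK ArrayList implementations (new capacity = old capacity + old capacity / 2), which is an implementation detail rather than a documented guarantee; GrowthEstimate is a hypothetical name.

```java
// Hypothetical estimate of ArrayList reallocations under an assumed 1.5x growth rule.
final class GrowthEstimate {

    /** Number of reallocations needed to grow from initialCapacity to at least target. */
    static int resizes(long initialCapacity, long target) {
        if (initialCapacity < 2) {
            throw new IllegalArgumentException("initialCapacity must be at least 2");
        }
        long capacity = initialCapacity;
        int count = 0;
        while (capacity < target) {
            capacity += capacity >> 1; // assumed rule: old + old / 2
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 10,000 preallocated slots, 1.5 million rows in one file
        System.out.println(resizes(10_000, 1_500_000)); // prints 13
    }
}
```

Thirteen array copies over 1.5 million insertions is negligible, which is consistent with the comment: the initial capacity affects speed only marginally, while the memory needed to hold every row stays the same.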
