Java method that batch-processes text files is much slower than performing the same operation on each file individually


java text methods


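I wrote a method, processTrainDirectory, that imports and processes all text files in a given directory. Processing each file individually takes roughly the same time (about 90 ms), but when I use the batch method on a directory, the time per file keeps increasing (from 90 ms to more than 4000 ms after 300 files). The batch import method is as follows: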

public void processTrainDirectory(String folderPath, Category category) {
    File folder = new File(folderPath);
    File[] listOfFiles = folder.listFiles();
    if (listOfFiles != null) {
        for (File file : listOfFiles) {
            if (file.isFile()) {
                processTrainText(file.getPath(), category);
            }
        }
    } else {
        System.out.println(foo);
    }
}

public void processTrainText(String path, Category category) {
    trainTextAmount++;
    Map<String, Integer> text = prepareText(path);
    update(text, category);
}

As I said, processTrainText (shown above) is called once for every text file in the directory. When it is called from processTrainDirectory, the time each call takes gradually increases.

This is my Category class (all unneeded methods removed for clarity):

import java.util.HashMap;
import java.util.Map;

public class Category {
    private String categoryName;
    private double prior;
    private Map<String, Integer> frequencies;
    private Map<String, Double> probabilities;
    private int textAmount;
    private BayesianClassifier bc;

    public Category(String categoryName, BayesianClassifier bc){
        this.categoryName = categoryName;
        this.bc = bc;
        this.frequencies = new HashMap<>();
        this.probabilities = new HashMap<>();
        this.textAmount = 0;
        this.prior = 0.00;
    }

    public void addWord(String word){
        this.frequencies.put(word, 0);
        this.probabilities.put(word, 0.0);
    }

    public void updateFrequency(Map.Entry<String, Integer> entry){
        if(!this.frequencies.containsKey(entry.getKey())){
            this.frequencies.put(entry.getKey(), entry.getValue());
        }
        else {
            this.frequencies.put(entry.getKey(), this.frequencies.get(entry.getKey()) + entry.getValue());
        }
    }

    public void updateProbability(Map.Entry<String, Integer> entry){
        double chance = ((double) this.frequencies.get(entry.getKey()) + 1) / (sumFrequencies() + bc.getVocabulary().size());
        this.probabilities.put(entry.getKey(), chance);
    }

    public Integer sumFrequencies(){
        Integer sum = 0;
        for (Integer integer : this.frequencies.values()) {
            sum = sum + integer;
        }
        return sum;
    }  
}


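What does this method do?

update(text, category);

Whatever it does, this call may well be your bottleneck. If you call it for a single file with no accumulated context and it updates an ordinary data structure, it will always take about the same time. If it updates something that holds data from previous iterations, I am fairly sure it will take longer and longer. In that case, check the complexity of update() and reduce the bottleneck.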

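Update: when it computes the sum of the frequencies, your updateProbability method goes over all the data collected so far, so the more files you have processed, the longer it takes. This is your bottleneck.

There is no need to recompute the sum every time. Store it, and update it whenever a frequency changes, to keep the amount of computation to a minimum.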
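A minimal sketch of that idea, assuming the counts live in a HashMap as in the question. The class RunningTotalCounts and its method names are hypothetical and do not appear in the original code; only the idea of maintaining the sum incrementally comes from the answer.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: word counts plus a running total of all counts. */
final class RunningTotalCounts {
    private final Map<String, Integer> frequencies = new HashMap<>();
    private long total = 0; // sum of all values in frequencies, kept up to date on every add

    /** Adds count occurrences of word and updates the running total in O(1). */
    void add(String word, int count) {
        frequencies.merge(word, count, Integer::sum);
        total += count;
    }

    /** Returns the sum of all counts without iterating over the map. */
    long total() {
        return total;
    }

    /** Add-one smoothed estimate, mirroring the formula used in updateProbability. */
    double probability(String word, int vocabularySize) {
        int count = frequencies.getOrDefault(word, 0);
        return (count + 1.0) / (total + vocabularySize);
    }
}
```

With this layout, reading the total costs the same no matter how many files have already been processed, so the per-file time no longer depends on the size of the map.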

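It looks like the time per file grows linearly, so the total time grows quadratically. That means every file is processing the data of all previous files. And indeed it does: updateProbability calls sumFrequencies, which iterates over the entire frequencies map, and that map grows with every file. That is the culprit. Just create an int sumFrequencies field and update it in updateFrequency.

As a further improvement, consider using Guava, whose Multiset does the counting in a simpler and more efficient way (no autoboxing). After fixing the code, consider getting it reviewed; there are quite a few small issues.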
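To illustrate why a linearly growing per-file cost adds up to quadratic total work, the following sketch counts how many map entries are visited when the sum is recomputed after each file. It is a simplified model, not the poster's code: the class QuadraticGrowthDemo is hypothetical, and it assumes every file contributes only new words and that the full pass happens just once per file.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo: counts map entries visited when the sum is recomputed per file. */
final class QuadraticGrowthDemo {

    /**
     * Simulates processing `files` files that each add `wordsPerFile` new words,
     * with one full pass over the map per file, and returns the number of entries visited.
     */
    static long entriesVisited(int files, int wordsPerFile) {
        Map<String, Integer> frequencies = new HashMap<>();
        long visited = 0;
        for (int f = 0; f < files; f++) {
            for (int w = 0; w < wordsPerFile; w++) {
                frequencies.merge("f" + f + "w" + w, 1, Integer::sum);
            }
            // One full pass over the map, as a recomputed sum would require.
            for (int ignored : frequencies.values()) {
                visited++;
            }
        }
        return visited;
    }
}
```

With 5 new words per file, entriesVisited(10, 5) returns 275 and entriesVisited(20, 5) returns 1050, so doubling the number of files roughly quadruples the work. If the pass runs once per word rather than once per file, the cost grows faster still.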
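Mother-in-law's death...? Lovely.

ToalCalm, funny as it is, it isn't very appropriate. That may be the reason.

Have you profiled the application?

@Kayaman Yes, I profiled it, and most of the time is spent in update. But my question is this: when I run processTrainText 200 times, it takes 200 * 90 ms; when I call processTrainDirectory on a directory with 200 files, it takes 200 * 2000 ms, because the time for each step of every execution keeps increasing when it is called from processTrainDirectory.

@realpoinsist My class is only 150 lines, so I added the whole class for you!

So if someone wants to downvote this answer, fine. But at least the popup tells you to leave a comment explaining why.

@Stefan What popup? Nobody is obliged to comment. I believe it may be because your answer is not really an answer but a comment with a guess. You should have waited for more information before answering.

@realpoint If I downvote, I get a notification, and I leave a comment anyway. I agree this should have been a comment, you are right...

Holy moly dude, after we had spent 3 days on it, you made our program 2.300 times faster. At first, processing 600 texts took 28 minutes; now it takes 6 seconds in total. You are now officially the hero of two random students somewhere in the Netherlands.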