Scalability of Latent Semantic Analysis in WEKA

machine-learning nlp artificial-intelligence

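I am using Weka for document classification research. I need to establish a baseline against which I can show that my contribution improves classification. However, using the default Latent Semantic Analysis in the WEKA API results in an OutOfMemoryError.

After some preprocessing, my dataset consists of 25765 attributes used across 9603 instances. That is the training set; the test set has the same number of class and normal attributes, but 3299 instances.

I have 8 GB of RAM and have already set the Java heap size to 4 GB, but I still run out of memory. Here is the error message: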

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at weka.core.matrix.Matrix.getArrayCopy(Matrix.java:301)
at weka.core.matrix.SingularValueDecomposition.<init>(SingularValueDecomposition.java:76)
at weka.core.matrix.Matrix.svd(Matrix.java:913)
at weka.attributeSelection.LatentSemanticAnalysis.buildAttributeConstructor(LatentSemanticAnalysis.java:511)
at weka.attributeSelection.LatentSemanticAnalysis.buildEvaluator(LatentSemanticAnalysis.java:416)
at weka.attributeSelection.AttributeSelection.SelectAttributes(AttributeSelection.java:596)
at weka.filters.supervised.attribute.AttributeSelection.batchFinished(AttributeSelection.java:455)
at weka.filters.Filter.useFilter(Filter.java:682)
at test.main(test.java:44)

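I have tested my code with a smaller dataset, where everything works fine, so this is not a code-related problem. Can anyone explain how I can scale LSA up to meet my needs? Or is there another, similar procedure I could apply that is more scalable?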

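You won't like this answer, but WEKA can't handle it. The implementation uses a full SVD no matter what. So once you have more than a few thousand data points, just performing the full SVD takes an enormous amount of time.

Not to mention that WEKA uses far more memory than is generally needed.

On top of all that, Weka creates a dense matrix to perform the SVD. You are probably using it on sparse data, which destroys any hope of doing LSA with Weka.

In practice, you will have to use a tool other than Weka to do LSA.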
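As a rough back-of-the-envelope check of the dense-matrix point, the sketch below estimates how much heap one dense copy of the training matrix needs. It assumes 8-byte doubles and ignores per-array overhead; the class and method names are made up for illustration. The stack trace fails inside `Matrix.getArrayCopy`, which suggests at least two dense copies have to be alive at once, so two copies are counted here.

```java
public class DenseMemoryEstimate {

    // Bytes needed to store a rows x cols matrix of doubles densely.
    // Uses long arithmetic so the product cannot overflow an int.
    static long denseBytes(long rows, long cols) {
        return rows * cols * Double.BYTES;
    }

    // Convenience conversion to GiB (2^30 bytes).
    static double toGiB(long bytes) {
        return bytes / (double) (1L << 30);
    }

    public static void main(String[] args) {
        // Training set dimensions taken from the question.
        long oneCopy = denseBytes(9603, 25765);
        long twoCopies = 2 * oneCopy; // original matrix plus the copy made for the SVD

        System.out.printf("one dense copy:  %d bytes (%.2f GiB)%n", oneCopy, toGiB(oneCopy));
        System.out.printf("two dense copies: %d bytes (%.2f GiB)%n", twoCopies, toGiB(twoCopies));
    }
}
```

One copy comes to 1,979,370,360 bytes (about 1.98 GB), so two copies alone take almost all of a 4 GB heap before the decomposition allocates any of its own working arrays.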
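To illustrate why a sparse, truncated approach scales where a dense full SVD does not, here is a minimal sketch that finds only the largest singular value by power iteration, touching nothing but the nonzero entries. This is not Weka's code, nor that of any particular library; the class, the method names, and the coordinate-triplet storage are assumptions made for the example. A real LSA needs the top k singular vectors rather than just one.

```java
import java.util.Arrays;

public class SparseTopSingular {

    // y = A x, visiting only the stored nonzeros.
    // The matrix is given as triplets: vals[k] sits at (rows[k], cols[k]).
    static double[] multiply(int m, int[] rows, int[] cols, double[] vals, double[] x) {
        double[] y = new double[m];
        for (int k = 0; k < vals.length; k++) {
            y[rows[k]] += vals[k] * x[cols[k]];
        }
        return y;
    }

    // y = A^T x, again visiting only the stored nonzeros.
    static double[] multiplyTransposed(int n, int[] rows, int[] cols, double[] vals, double[] x) {
        double[] y = new double[n];
        for (int k = 0; k < vals.length; k++) {
            y[cols[k]] += vals[k] * x[rows[k]];
        }
        return y;
    }

    // Scales v to unit length in place and returns its original norm.
    // A zero vector is left unchanged.
    static double normalize(double[] v) {
        double sum = 0.0;
        for (double e : v) {
            sum += e * e;
        }
        double norm = Math.sqrt(sum);
        if (norm > 0.0) {
            for (int i = 0; i < v.length; i++) {
                v[i] /= norm;
            }
        }
        return norm;
    }

    // Power iteration on A^T A. Starts from the all-ones vector, which is
    // an assumption: it must not be orthogonal to the leading singular
    // vector (this holds for nonnegative data such as term counts).
    static double topSingularValue(int m, int n, int[] rows, int[] cols,
                                   double[] vals, int iterations) {
        double[] v = new double[n];
        Arrays.fill(v, 1.0);
        normalize(v);
        double sigma = 0.0;
        for (int it = 0; it < iterations; it++) {
            double[] u = multiply(m, rows, cols, vals, v);
            sigma = normalize(u);          // ||A v|| approaches the top singular value
            if (sigma == 0.0) {
                return 0.0;                // A v vanished, nothing to iterate on
            }
            v = multiplyTransposed(n, rows, cols, vals, u);
            normalize(v);
        }
        return sigma;
    }

    public static void main(String[] args) {
        // Tiny 2 x 2 diagonal example with entries 3 and 2.
        int[] rows = {0, 1};
        int[] cols = {0, 1};
        double[] vals = {3.0, 2.0};
        System.out.println(topSingularValue(2, 2, rows, cols, vals, 200));
    }
}
```

Each iteration costs time proportional to the number of nonzeros and needs only two extra vectors of length m and n, so the 9603 x 25765 dense array is never materialized.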

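If you are not attached to Weka, gensim has a very solid/scalable LSA implementation.

@RobNeuhaus Thanks for the input, but unfortunately I am attached to WEKA.

Indeed, I don't like this answer, but it answers my question! Thanks a lot.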