Stanford Topic Modeling Toolbox: Exception

java scala stanford-nlp

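I am trying to use the Stanford Topic Modeling Toolbox. I downloaded the "tmt-0.4.0.jar" file and am working through the examples. Examples 0 and 1 run fine, but when I run Example 2 (with no code changes) I get the following exception: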

[cell] loading pubmed-oa-subset.csv.term-counts.cache.70108071.gz
[Concurrent] 32 permits
Exception in thread "Thread-3" java.lang.ArrayIndexOutOfBoundsException: -1
    at scalanlp.stage.text.TermCounts$class.getDF(TermFilters.scala:64)
    at scalanlp.stage.text.TermCounts$$anon$2.getDF(TermFilters.scala:84)
    at scalanlp.stage.text.TermMinimumDocumentCountFilter$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(TermFilters.scala:172)
    at scalanlp.stage.text.TermMinimumDocumentCountFilter$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(TermFilters.scala:172)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:390)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$class.foreach(Iterator.scala:660)
    at scala.collection.Iterator$$anon$22.foreach(Iterator.scala:382)
    at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
    at scala.collection.IterableViewLike$$anon$5.foreach(IterableViewLike.scala:82)
    at scala.collection.TraversableOnce$class.size(TraversableOnce.scala:104)
    at scala.collection.IterableViewLike$$anon$5.size(IterableViewLike.scala:82)
    at scalanlp.stage.text.DocumentMinimumLengthFilter.filter(DocumentFilters.scala:31)
    at scalanlp.stage.text.DocumentMinimumLengthFilter.filter(DocumentFilters.scala:28)
    at scalanlp.stage.generic.Filter$$anonfun$apply$1.apply(Filter.scala:38)
    at scalanlp.stage.generic.Filter$$anonfun$apply$1.apply(Filter.scala:38)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:390)
    at edu.stanford.nlp.tmt.data.concurrent.concurrent$$anonfun$map$2.apply(concurrent.scala:100)
    at edu.stanford.nlp.tmt.data.concurrent.concurrent$$anonfun$map$2.apply(concurrent.scala:88)
    at edu.stanford.nlp.tmt.data.concurrent.concurrent$$anon$4.run(concurrent.scala:45)

Why am I getting this exception, and how can I fix it? Thank you very much for your help.

PS: The code is identical to Example 2 on the website:

// Stanford TMT Example 2 - Learning an LDA model
// http://nlp.stanford.edu/software/tmt/0.4/

// tells Scala where to find the TMT classes
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;

import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;

val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}

val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

// turn the text into a dataset ready to be used with LDA
val dataset = LDADataset(text);

// define the model parameters
val params = LDAModelParams(numTopics = 30, dataset = dataset,
  topicSmoothing = 0.01, termSmoothing = 0.01);

// Name of the output model folder to generate
val modelPath = file("lda-"+dataset.signature+"-"+params.signature);

// Trains the model: the model (and intermediate models) are written to the
// output folder.  If a partially trained model with the same dataset and
// parameters exists in that folder, training will be resumed.
TrainCVB0LDA(params, dataset, output=modelPath, maxIterations=1000);

// To use the Gibbs sampler for inference, instead use
// TrainGibbsLDA(params, dataset, output=modelPath, maxIterations=1500);

The tool's author has posted the answer:

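This usually happens when you have a stale .cache file; unfortunately, the error message is not particularly helpful. Try deleting the cache folder in the directory you run from, then re-run.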
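
The cleanup suggested above can also be scripted. The following is only a minimal sketch, not part of TMT: `findStaleCaches` and `deleteStaleCaches` are hypothetical helper names, and the `.cache.` marker is an assumption taken from the file name in the log line (`pubmed-oa-subset.csv.term-counts.cache.70108071.gz`). If your cache files are named or located differently, adjust the marker and directory accordingly.

```scala
import java.io.File

// Hypothetical helper, not part of TMT: lists the files directly inside
// `dir` whose names contain `marker`. The default marker is an assumption
// based on the cache file name shown in the log output.
def findStaleCaches(dir: File, marker: String = ".cache."): Seq[File] = {
  val entries = dir.listFiles() // null when `dir` is not a readable directory
  if (entries == null) Seq.empty
  else entries.filter(f => f.isFile && f.getName.contains(marker)).toSeq
}

// Deletes the matching files and returns only those that were actually removed.
def deleteStaleCaches(dir: File, marker: String = ".cache."): Seq[File] =
  findStaleCaches(dir, marker).filter(_.delete())
```

Calling `deleteStaleCaches(new File("."))` before the `source` definition would clear matching files from the working directory, so the term counts are recomputed on the next run. It may be worth calling `findStaleCaches` first to check what would be deleted.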

Without code to compare against, we can't diagnose this.

I've added the code :) Thanks for your help :)