Java Spark 2.2.2: CountVectorizerModel index 24691 out of bounds for vector of size 23262


Hello everyone, and have a nice day. I would like some help based on your experience. I am trying to convert a collection of text documents into vectors of token counts based on a custom vocabulary (an array of size 24693), using CountVectorizerModel.

Here is the simple code:

CountVectorizerModel cvm2 = new CountVectorizerModel(vocabulary)
                .setInputCol(NEXT)
                .setOutputCol(NEXT_RAW_FEATURES);
        cvm2.transform(dataset).show(false);
And here is the full exception:

Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$8: (array<string>) => vector)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: requirement failed: Index 24691 out of bounds for vector of size 23262
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.linalg.SparseVector.<init>(Vectors.scala:570)
    at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:212)
    at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$8.apply(CountVectorizer.scala:265)
    at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$8.apply(CountVectorizer.scala:248)
    ... 16 more
How can I fix the

Index 24691 out of bounds for vector of size 23262

error? Do I need to adjust something like setMinTF() before transforming? I don't know what to do, which is why I am here. Basically, I cannot understand why this happens or how to solve it. I would be grateful if someone could help me.

Your vocab array contains duplicates; you need to remove the duplicate entries from the array. CountVectorizerModel builds a term-to-index map from the vocabulary array, so duplicate terms collapse into a single map entry: the resulting vector size is the number of distinct terms (23262 here), while the stored indices can still range up to the original array length minus one (24692), which triggers the out-of-bounds error.
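As a minimal sketch of the fix (the vocabulary contents here are hypothetical, for illustration only), you can deduplicate the array while preserving first-occurrence order with a LinkedHashSet, then pass the deduplicated array to the model constructor:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;

public class DedupeVocab {

    // Remove duplicate terms while preserving first-occurrence order,
    // so that the index range matches the vector size.
    static String[] dedupe(String[] vocabulary) {
        return new LinkedHashSet<>(Arrays.asList(vocabulary))
                .toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary with "spark" duplicated.
        String[] vocabulary = {"spark", "java", "spark", "mllib"};

        String[] deduped = dedupe(vocabulary);
        System.out.println(deduped.length);            // prints 3
        System.out.println(String.join(",", deduped)); // prints spark,java,mllib

        // Then construct the model with the deduplicated array, e.g.:
        // CountVectorizerModel cvm2 = new CountVectorizerModel(deduped)
        //         .setInputCol(NEXT)
        //         .setOutputCol(NEXT_RAW_FEATURES);
    }
}
```

After deduplication, the vector size and the maximum term index agree, and the transform should no longer throw.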