Apache Spark: error when predicting new data with an LDA model on Databricks using pyspark.ml.clustering


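I have trained and saved an LDA model on Databricks using pyspark.ml.clustering, and now I need to predict topics for new data. However, when I try to use the prediction results, I get an error.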

This is the schema of the input data (the tokenizedText dataframe):

This is a summary of the training code:

from pyspark.ml.feature import IDF, HashingTF, Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA, LDAModel, LocalLDAModel

# Build the vocabulary from the training tokens and convert them to term-count vectors
counter = CountVectorizer(inputCol="Tokens", outputCol="term_frequency", minDF=10)
counterModel = counter.fit(tokenizedText)
vectorizedLaw = counterModel.transform(tokenizedText)

# Fit a 6-topic LDA model on the term-count vectors
lda_tf = LDA(k=6, maxIter=100, featuresCol="term_frequency", seed=135)
model_TF = lda_tf.fit(vectorizedLaw)
Then I predict on the training data, and everything works:

predictions_TF = model_TF.transform(vectorizedLaw)
predictions_TF.select("topicDistribution").show(5, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------+
|topicDistribution                                                                                                          |
+---------------------------------------------------------------------------------------------------------------------------+
|[0.013571512425910889,0.6217200752455205,0.2961210273974943,0.02637133808190742,0.01725914968266371,0.02495689716650296]   |
|[0.05687289141286662,0.06042583918761498,0.07003525643520062,0.10832523389472587,0.072220782376337,0.6321199966932549]     |
|[0.021911946837097802,0.02328240957204057,0.02699809068833656,0.8610644212787785,0.027841122709084537,0.038902008914662105]|
|[0.004887677638064053,0.0051942680450804005,0.00600826758941306,0.009373274444250117,0.4726839053470049,0.5018526069361875]|
|[0.013570322581255357,0.014437643205896269,0.01669607176655056,0.02612486131653466,0.6240083606668544,0.30516274046290875] |
+---------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows
So I decided to save the model:

model_TF.save('/dbfs/mnt/docs/model6_4_pyspark')
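Finally, I wrote the code to predict on new reviews. I loaded the model and repeated the same steps on the new text (I am quite sure the schema of the new data is the same as that of the training df):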

However, I get an error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 215.0 failed 1 times, most recent failure: Lost task 0.0 in stage 215.0 (TID 601, ip-10-172-225-237.us-west-2.compute.internal, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(LDAModel$$Lambda$5764/1348803876: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
This is the full error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-464806927610545> in <module>
----> 1 predictions.select("topicDistribution").show(5)

/databricks/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
    439         """
    440         if isinstance(truncate, bool) and truncate:
--> 441             print(self._jdf.showString(n, 20, vertical))
    442         else:
    443             print(self._jdf.showString(n, int(truncate), vertical))

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    126     def deco(*a, **kw):
    127         try:
--> 128             return f(*a, **kw)
    129         except py4j.protocol.Py4JJavaError as e:
    130             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o2276.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 215.0 failed 1 times, most recent failure: Lost task 0.0 in stage 215.0 (TID 601, ip-10-172-225-237.us-west-2.compute.internal, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(LDAModel$$Lambda$5764/1348803876: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
    at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
    at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
    at org.apache.spark.scheduler.Task.run(Task.scala:117)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$11(Executor.scala:657)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:660)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: (4731,0) not in [-4157,4157) x [-6,6)
    at breeze.linalg.DenseMatrix$mcD$sp.apply$mcD$sp(DenseMatrix.scala:106)
    at breeze.linalg.DenseMatrix$mcD$sp.apply(DenseMatrix.scala:103)
    at breeze.linalg.DenseMatrix$mcD$sp.apply(DenseMatrix.scala:52)
    at breeze.linalg.Matrix.apply(Matrix.scala:44)
    at breeze.linalg.Matrix.apply$(Matrix.scala:44)
    at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:52)
    at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:52)
    at breeze.linalg.SliceMatrix.apply(SliceMatrix.scala:23)
    at breeze.linalg.Matrix.$anonfun$toDenseMatrix$1(Matrix.scala:125)
    at breeze.linalg.Matrix.$anonfun$toDenseMatrix$1$adapted(Matrix.scala:124)
    at breeze.linalg.MatrixConstructors.$anonfun$tabulate$2(Matrix.scala:230)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    at breeze.linalg.MatrixConstructors.$anonfun$tabulate$1(Matrix.scala:229)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    at breeze.linalg.MatrixConstructors.tabulate(Matrix.scala:229)
    at breeze.linalg.MatrixConstructors.tabulate$(Matrix.scala:227)
    at breeze.linalg.DenseMatrix$.tabulate(DenseMatrix.scala:360)
    at breeze.linalg.Matrix.toDenseMatrix(Matrix.scala:124)
    at breeze.linalg.Matrix.toDenseMatrix$(Matrix.scala:123)
    at breeze.linalg.SliceMatrix.toDenseMatrix(SliceMatrix.scala:13)
    at breeze.linalg.Matrix.toDenseMatrix$mcD$sp(Matrix.scala:123)
    at breeze.linalg.Matrix.toDenseMatrix$mcD$sp$(Matrix.scala:123)
    at breeze.linalg.SliceMatrix.toDenseMatrix$mcD$sp(SliceMatrix.scala:13)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$.variationalTopicInference(LDAOptimizer.scala:618)
    at org.apache.spark.ml.clustering.LDAModel.$anonfun$getTopicDistributionMethod$1(LDA.scala:502)
    ... 14 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2478)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2427)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2426)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2426)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1131)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1131)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1131)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2678)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2625)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2613)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:917)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2313)
    at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:298)
    at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:308)
    at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:82)
    at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
    at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:508)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
    at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2986)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3692)
    at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2710)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3684)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3682)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2710)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2917)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:304)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:341)
    at sun.reflect.GeneratedMethodAccessor410.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(LDAModel$$Lambda$5764/1348803876: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
    at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
    at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:187)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
    at org.apache.spark.scheduler.Task.run(Task.scala:117)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$11(Executor.scala:657)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:660)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.lang.IndexOutOfBoundsException: (4731,0) not in [-4157,4157) x [-6,6)
    at breeze.linalg.DenseMatrix$mcD$sp.apply$mcD$sp(DenseMatrix.scala:106)
    at breeze.linalg.DenseMatrix$mcD$sp.apply(DenseMatrix.scala:103)
    at breeze.linalg.DenseMatrix$mcD$sp.apply(DenseMatrix.scala:52)
    at breeze.linalg.Matrix.apply(Matrix.scala:44)
    at breeze.linalg.Matrix.apply$(Matrix.scala:44)
    at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:52)
    at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:52)
    at breeze.linalg.SliceMatrix.apply(SliceMatrix.scala:23)
    at breeze.linalg.Matrix.$anonfun$toDenseMatrix$1(Matrix.scala:125)
    at breeze.linalg.Matrix.$anonfun$toDenseMatrix$1$adapted(Matrix.scala:124)
    at breeze.linalg.MatrixConstructors.$anonfun$tabulate$2(Matrix.scala:230)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    at breeze.linalg.MatrixConstructors.$anonfun$tabulate$1(Matrix.scala:229)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    at breeze.linalg.MatrixConstructors.tabulate(Matrix.scala:229)
    at breeze.linalg.MatrixConstructors.tabulate$(Matrix.scala:227)
    at breeze.linalg.DenseMatrix$.tabulate(DenseMatrix.scala:360)
    at breeze.linalg.Matrix.toDenseMatrix(Matrix.scala:124)
    at breeze.linalg.Matrix.toDenseMatrix$(Matrix.scala:123)
    at breeze.linalg.SliceMatrix.toDenseMatrix(SliceMatrix.scala:13)
    at breeze.linalg.Matrix.toDenseMatrix$mcD$sp(Matrix.scala:123)
    at breeze.linalg.Matrix.toDenseMatrix$mcD$sp$(Matrix.scala:123)
    at breeze.linalg.SliceMatrix.toDenseMatrix$mcD$sp(SliceMatrix.scala:13)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$.variationalTopicInference(LDAOptimizer.scala:618)
    at org.apache.spark.ml.clustering.LDAModel.$anonfun$getTopicDistributionMethod$1(LDA.scala:502)
    ... 14 more
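
The innermost cause, `IndexOutOfBoundsException: (4731,0) not in [-4157,4157) x [-6,6)`, suggests that the model's topic matrix has 4157 term rows and 6 topic columns (matching k=6), while a feature vector in the new data refers to term index 4731. One possible explanation, which is only an assumption because the prediction code is not shown, is that the new text was vectorized with a CountVectorizer fitted again on the new data rather than with the original `counterModel`. The sketch below shows a size check and a way to reuse a saved vectorizer. The helper names and the `vectorizer_path` argument are hypothetical; `CountVectorizerModel.load`, `LDAModel.vocabSize()` and the vector `size` attribute are existing PySpark APIs.

```python
def check_feature_size(model_vocab_size, feature_vector_size):
    """Describe a vocabulary-size mismatch, or return None if the sizes agree.

    Hypothetical helper. With PySpark the two numbers could be read as
    loaded_model.vocabSize() and new_vectorized.first()["term_frequency"].size.
    """
    if model_vocab_size == feature_vector_size:
        return None
    return (
        f"LDA model expects vectors of size {model_vocab_size}, "
        f"but the new data has vectors of size {feature_vector_size}; "
        "vectorize the new data with the CountVectorizerModel fitted at training time."
    )


def vectorize_with_saved_vocabulary(new_tokenized_df, vectorizer_path):
    """Apply a previously saved CountVectorizerModel instead of fitting a new one.

    Assumes counterModel.save(vectorizer_path) was called after training.
    """
    # Imported here so the size check above can be used without a Spark installation.
    from pyspark.ml.feature import CountVectorizerModel

    saved_vectorizer = CountVectorizerModel.load(vectorizer_path)
    return saved_vectorizer.transform(new_tokenized_df)


# Illustrative numbers only: 4157 comes from the exception, 5000 is made up.
print(check_feature_size(4157, 5000))
```

If the check reports a mismatch, passing the output of `vectorize_with_saved_vocabulary` to the loaded LDA model's `transform` should give vectors of the size the model was trained on, provided the assumption about the cause holds.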