Java: Testing Spark SQL


I wrote a test for the application using Spark SQL, and this test does not work. Without the Spark SQL module, all the tests pass (RDD).

Library versions:

  • JUnit: 4.12
  • Spark Core: 2.2.1
  • Spark SQL: 2.2.1

The test is:

List<Claim> claims = FileResource.loadListObjOfFile("cg-32-claims-load.json", Claim[].class);
assertTrue(claims.size() == 1000L);

Dataset<Claim> dataset = getSparkSession().createDataset(claims, Encoders.bean(Claim.class));
assertTrue(dataset.count() == 1000L);

Dataset<ResultBean> resDataSet = dataset
        .groupByKey((MapFunction<Claim, Integer>) Claim::getMbrId, Encoders.INT())
        .mapGroups((MapGroupsFunction<Integer, Claim, ResultBean>) (key, values) -> new ResultBean(), Encoders.bean(ResultBean.class));

assertTrue(resDataSet.count() == 42L);
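
The getSparkSession() helper is not shown in the question; a minimal sketch of what such a helper might look like in a test (the app name and local master are assumptions, not taken from the question):

import org.apache.spark.sql.SparkSession;

// Hypothetical test helper: builds a local SparkSession so the Spark SQL test
// can run inside the test JVM without a cluster.
private SparkSession getSparkSession() {
    return SparkSession.builder()
            .appName("claims-test")   // illustrative name
            .master("local[*]")       // run Spark locally with all available cores
            .getOrCreate();
}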
I get an exception on the last line of the test. The application throws this exception only in tests (a simple main class works fine).

It looks like Spark SQL cannot initialize the JavaBean for some reason. Stack trace:

+- AppendColumns <function1>, initializejavabean(newInstance(class test.input.Claim), (setDiag1,diag1#28.toString), .... [input[0, java.lang.Integer, true].intValue AS value#84]
   +- LocalTableScan [birthDt#23, birthDtStr#24, clmFromDt#25, .... pcdCd#45, plcOfSvcCd#46, ... 2 more fields]

    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    ....
    Caused by: java.lang.AssertionError: index (23) should < 23
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:133)
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:352)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply2_7$(generated.java:52)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:600)
    at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
    at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:41)
    at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:36)
    at org.apache.spark.sql.execution.LocalTableScanExec.rdd$lzycompute(LocalTableScanExec.scala:48)
    at org.apache.spark.sql.execution.LocalTableScanExec.rdd(LocalTableScanExec.scala:48)
    at org.apache.spark.sql.execution.LocalTableScanExec.doExecute(LocalTableScanExec.scala:52)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.AppendColumnsExec.doExecute(objects.scala:272)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:88)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    ... 86 more

This error occurs when something is wrong in the bean class; it helps to check that the bean class has a getter and a setter for every field.
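
As a minimal sketch (the class and field names are hypothetical, not the actual Claim fields), a bean used with Encoders.bean needs a public no-arg constructor and a matching getter/setter pair for every property:

import java.io.Serializable;

// Hypothetical bean for illustration: every field has both a getter and a setter,
// plus a public no-arg constructor, so Encoders.bean can derive a consistent schema.
public class MemberClaim implements Serializable {
    private int mbrId;
    private String diagCd;

    public MemberClaim() { }  // required no-arg constructor

    public int getMbrId() { return mbrId; }
    public void setMbrId(int mbrId) { this.mbrId = mbrId; }

    public String getDiagCd() { return diagCd; }
    public void setDiagCd(String diagCd) { this.diagCd = diagCd; }
}

A getter without a matching setter (or the other way around) can make the schema Spark derives for reading the bean disagree with the one used for writing it, which would fit the row-width mismatch in the "index (23) should < 23" assertion above.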


Hope this helps anyone who is stuck!

How many columns does the dataset have? Could the maximum number of columns for a Scala case class be hit in this case? Hmm... the bean has 23 columns.