Apache Spark: getting "org.apache.spark.sql.AnalysisException" when creating a Dataset from an RDD
I recently started working with Spark's Dataset API and I am trying out a few examples. Below is one such example, which fails with an AnalysisException.
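A minimal sketch that reproduces the failure (assuming, as the rest of the question suggests, an RDD of (String, Int) tuples converted with as[Fruits]; the object name comes from the stack trace, the app name is illustrative):

import org.apache.spark.sql.SparkSession

case class Fruits(name: String, quantity: Int)

object SocketStreamWordcountApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DatasetExample")
      .getOrCreate()
    import spark.implicits._

    val source = Array(("mango", 1), ("Guava", 2), ("mango", 2), ("guava", 2))
    // The tuple columns are named _1 and _2, so as[Fruits] cannot resolve `name`
    val sourceDS = spark.createDataset(spark.sparkContext.parallelize(source)).as[Fruits]
    sourceDS.show()
  }
}

When I execute the above code, I get: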
19/06/02 18:04:42 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/06/02 18:04:42 INFO CodeGenerator: Code generated in 405.026891 ms
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_1, _2];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:110)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:295)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:354)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:237)
at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
at scala.collection.immutable.List.map(List.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:354)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:93)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:116)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:82)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolveAndBind(ExpressionEncoder.scala:258)
at org.apache.spark.sql.Dataset.deserializer$lzycompute(Dataset.scala:214)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$deserializer(Dataset.scala:213)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:72)
at org.apache.spark.sql.Dataset.as(Dataset.scala:431)
at SocketStreamWordcountApp$.main(SocketStreamWordcountApp.scala:20)
at SocketStreamWordcountApp.main(SocketStreamWordcountApp.scala)
19/06/02 18:04:43 INFO SparkContext: Invoking stop() from shutdown hook
I thought this is supposed to work whenever we create a new Dataset, or convert an RDD to a Dataset using as[T]. Isn't it?
Just to experiment, I also tried creating a DataFrame and converting the DataFrame to a Dataset as shown below, but I still end up with the same error:
val sourceDS = spark.sparkContext.parallelize(source).toDF().as[Fruits]
// or val sourceDS = spark.createDataFrame(source).as[Fruits]
Any help would be greatly appreciated.

The column names of the input DataFrame have to match the field names of the case class. So you need an intermediate Dataset[Row] with renamed columns:
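A sketch of that fix, assuming the Array of (String, Int) tuples from the question:

// Rename the tuple columns (_1, _2) to match the case class fields,
// then the Dataset[Row] converts cleanly to Dataset[Fruits]
val sourceDS = spark.createDataFrame(source)
  .toDF("name", "quantity")
  .as[Fruits]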
or do the renaming along the way in a single chain:
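For example (same assumption about source):

val sourceDS = spark.sparkContext
  .parallelize(source)
  .toDF("name", "quantity")  // rename while converting from the RDD
  .as[Fruits]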
Of course, the reasonable solution is to start with Fruits from the beginning:
val source = Array(Fruits("mango", 1), Fruits("Guava", 2), Fruits("mango", 2), Fruits("guava", 2))
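after which the direct conversion works with no renaming (a minimal sketch):

val sourceDS = spark.createDataset(source)  // Dataset[Fruits] directly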
Starting with Spark 2.3, the DataFrame's column names must match the names of the case class parameters; in earlier versions (e.g. 2.1.1) the only restriction was an equal number of columns/parameters. You can create a sequence of Fruits instead of tuples this way:
case class Fruits(name: String, quantity: Int)
val source = Array(Fruits("mango", 1), Fruits("Guava", 2), Fruits("mango", 2), Fruits("guava", 2))
val sourceDS = spark.createDataset(source)
val resultDS = sourceDS.filter(_.name == "mango").filter(_.quantity > 1)  // the quantity predicate was cut off; "> 1" is an assumed placeholder
I think @user11589880's answer will work, but here is an alternative for you to consider:
val sourceDS = Seq(Fruits("Mango", 1), Fruits("Guava", 2)).toDS

The type of sourceDS will be Dataset[Fruits].