How to pivot a streaming Dataset in Apache Spark?

Tags: apache-spark, spark-structured-streaming, apache-spark-2.0

I am trying to pivot a Spark streaming Dataset (Structured Streaming), but I get an AnalysisException (excerpt below).

Could someone confirm that pivoting is indeed not supported in Structured Streaming (Spark 2.0), or suggest an alternative approach?

Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
tl;dr

The pivot aggregation is not directly supported by Spark Structured Streaming (up to and including 2.4.4).

As a workaround, use DataStreamWriter.foreachBatch or the more general DataStreamWriter.foreach.
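A minimal Scala sketch of the foreachBatch workaround (not the original answer's code; it assumes the built-in rate source and its value/timestamp columns, as in the example further down). Each micro-batch handed to foreachBatch is a regular, non-streaming Dataset, so pivot works on it:

// Hypothetical sketch: pivot inside foreachBatch, where each micro-batch
// is a plain batch Dataset rather than a streaming one.
val pivotEachBatch: (org.apache.spark.sql.DataFrame, Long) => Unit = { (batch, _) =>
  batch
    .groupBy("value")
    .pivot("timestamp") // fine here: batch is not a streaming Dataset
    .count
    .show(false)
}

val query = spark
  .readStream
  .format("rate")
  .load
  .writeStream
  .foreachBatch(pivotEachBatch)
  .start

The Java example further down takes the same approach via foreachBatch.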
I am using the most recent version of Spark at the time of writing, 2.4.4:
scala> spark.version
res0: String = 2.4.4
UnsupportedOperationChecker (which you can find in the stack trace) checks whether the (logical plan of a) streaming query uses supported operations only.
When you execute pivot you have to call groupBy first, because that is the only interface that gives you access to pivot.

There are two issues with pivot:

1. pivot wants to know how many columns to generate values for, and therefore does a collect, which is not possible with a streaming Dataset.

2. pivot is actually another aggregation (beside groupBy) that Spark Structured Streaming does not support (a sketch illustrating this case follows the stack trace below).

Let's have a look at issue 1 first, with no columns to pivot on defined:
val sq = spark
.readStream
.format("rate")
.load
.groupBy("value")
.pivot("timestamp") // <-- pivot with no values
.count
.writeStream
.format("console")
scala> sq.start
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
rate
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.throwError(UnsupportedOperationChecker.scala:389)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1$adapted(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:51)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:62)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:60)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:384)
... 49 elided
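For completeness, a sketch of issue 2 (not shown in the excerpt above): even when the pivot values are given explicitly, so that no collect is needed, the query still cannot be started, because pivot is rewritten into aggregations that Structured Streaming does not support. The pivot values below are hypothetical:

// Sketch only: explicit pivot values avoid the collect (issue 1),
// but pivot is still an unsupported aggregation on a streaming Dataset (issue 2).
val sq2 = spark
  .readStream
  .format("rate")
  .load
  .groupBy("timestamp")
  .pivot("value", Seq(0L, 1L)) // hypothetical explicit pivot values
  .count
  .writeStream
  .format("console")
// sq2.start still fails once the unsupported-operations check runs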
Here is a simple Java example based on Jacek's answer above.

JSON array:
[{
"customer_id": "d6315a00",
"product": "Super widget",
"price": 10,
"bought_date": "2019-01-01"
},
{
"customer_id": "d6315a00",
"product": "Super widget",
"price": 10,
"bought_date": "2019-01-01"
},
{
"customer_id": "d6315a00",
"product": "Super widget",
"price": 10,
"bought_date": "2019-01-02"
},
{
"customer_id": "d6315a00",
"product": "Food widget",
"price": 4,
"bought_date": "2019-08-20"
},
{
"customer_id": "d6315cd0",
"product": "Food widget",
"price": 4,
"bought_date": "2019-09-19"
}, {
"customer_id": "d6315e2e",
"product": "Bike widget",
"price": 10,
"bought_date": "2019-01-01"
}, {
"customer_id": "d6315a00",
"product": "Bike widget",
"price": 10,
"bought_date": "2019-03-10"
},
{
"customer_id": "d631614e",
"product": "Garage widget",
"price": 4,
"bought_date": "2019-02-15"
}
]
Java code:

package io.centilliard;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.from_json;
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.DataStreamWriter;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.Function2;
import scala.runtime.BoxedUnit;
public class Pivot {
public static void main(String[] args) throws StreamingQueryException, AnalysisException {
StructType schema = new StructType(new StructField[]{
new StructField("customer_id", DataTypes.StringType, false, Metadata.empty()),
new StructField("product", DataTypes.StringType, false, Metadata.empty()),
new StructField("price", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("bought_date", DataTypes.StringType, false, Metadata.empty())
});
ArrayType arrayType = new ArrayType(schema, false);
SparkSession spark = SparkSession
.builder()
.appName("SimpleExample")
.getOrCreate();
// Create a DataSet representing the stream of input lines from Kafka
Dataset<Row> dataset = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "utilization")
.load()
.selectExpr("CAST(value AS STRING) as json");
Column col = new Column("json");
Column data = from_json(col,arrayType).as("data");
Column explode = explode(data);
Dataset<Row> customers = dataset.select(explode).select("col.*");
DataStreamWriter<Row> dataStreamWriter = customers.writeStream();
StreamingQuery dataStream = dataStreamWriter.foreachBatch(new Function2<Dataset<Row>, Object, BoxedUnit>() {
@Override
public BoxedUnit apply(Dataset<Row> dataset, Object object) {
dataset
.groupBy("customer_id","product","bought_date")
.pivot("product")
.sum("price")
.orderBy("customer_id")
.show();
return null;
}
})
.start();
dataStream.awaitTermination();
}
}
+-----------+-------------+-----------+-----------+-----------+-------------+------------+
|customer_id| product|bought_date|Bike widget|Food widget|Garage widget|Super widget|
+-----------+-------------+-----------+-----------+-----------+-------------+------------+
| d6315a00| Bike widget| 2019-03-10| 20| null| null| null|
| d6315a00| Super widget| 2019-01-02| null| null| null| 20|
| d6315a00| Super widget| 2019-01-01| null| null| null| 40|
| d6315a00| Food widget| 2019-08-20| null| 8| null| null|
| d6315cd0| Food widget| 2019-09-19| null| 8| null| null|
| d6315e2e| Bike widget| 2019-01-01| 20| null| null| null|
| d631614e|Garage widget| 2019-02-15| null| null| 8| null|
+-----------+-------------+-----------+-----------+-----------+-------------+------------+
In most cases you can use a conditional aggregation as a workaround.

The equivalent of
df.groupBy("timestamp").
pivot("name", Seq("banana", "peach")).
sum("value")
is shown in the sketch below.
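The original text breaks off at this point, so what follows is a minimal sketch of what the conditional aggregation could look like (same name, value, and timestamp columns as above; one sum(when(...)) per pivot value), not the original author's exact code:

import org.apache.spark.sql.functions.{col, sum, when}

df.filter(col("name").isin("banana", "peach"))
  .groupBy("timestamp")
  .agg(
    sum(when(col("name") === "banana", col("value"))).as("banana"),
    sum(when(col("name") === "peach", col("value"))).as("peach"))

Unlike pivot, this is a single streaming aggregation, so Structured Streaming accepts it (subject to the usual output-mode rules for streaming aggregations).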