How to pivot a streaming Dataset in Apache Spark?

Tags: apache-spark, spark-structured-streaming, apache-spark-2.0

I am trying to pivot a streaming Spark Dataset (Structured Streaming), but I get an AnalysisException (excerpt below).

Can someone confirm that pivot is indeed not supported in Structured Streaming (Spark 2.0), or suggest an alternative approach?

    Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
    kafka
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)


tl;dr The pivot aggregation is not directly supported by Spark Structured Streaming (up to and including 2.4.4).

As a workaround, use DataStreamWriter.foreachBatch (as the Java example further down does) or the more general DataStreamWriter.foreach.
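
For instance, a minimal Scala sketch of the foreachBatch workaround could look like this, assuming the Spark 2.4 spark-shell as in the rest of this answer; the rate source and the groupBy("value").pivot("timestamp") columns simply mirror the failing example below and are illustrative, not code from the original question:

    import org.apache.spark.sql.DataFrame

    val sq = spark
      .readStream
      .format("rate")
      .load
      .writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // Inside foreachBatch every micro-batch arrives as a regular (batch) Dataset,
        // so groupBy followed by pivot works just as it does in batch queries.
        batchDF
          .groupBy("value")
          .pivot("timestamp")
          .count()
          .show()
      }
      .start()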


I'm using the latest version of Spark, 2.4.4:

scala> spark.version
res0: String = 2.4.4
UnsupportedOperationChecker (which you can find in the stack trace) checks whether the (logical plan of a) streaming query uses supported operations only.

When you execute pivot you have to execute groupBy first, since that's the only interface that gives you pivot.
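
For reference, here is a tiny batch-mode sketch (my own illustrative snippet, not from the original answer) showing that interface: pivot is only available on the RelationalGroupedDataset returned by groupBy:

    // spark-shell auto-imports spark.implicits._, which provides toDF on Seq.
    val staticDF = Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("id", "key", "value")

    // groupBy returns a RelationalGroupedDataset, which is where pivot lives.
    staticDF.groupBy("id").pivot("key").sum("value").show()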

There are two issues with pivot:

  • pivot wants to know how many columns to generate values for and hence does a collect, which is not possible with a streaming Dataset.

  • pivot is actually another aggregation (besides groupBy) that Spark Structured Streaming does not support.

Let's take a look at issue 1, where no columns to pivot on are defined:

    val sq = spark
      .readStream
      .format("rate")
      .load
      .groupBy("value")
      .pivot("timestamp") // <-- pivot with no values
      .count
      .writeStream
      .format("console")
    scala> sq.start
    org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
    rate
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.throwError(UnsupportedOperationChecker.scala:389)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1(UnsupportedOperationChecker.scala:38)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1$adapted(UnsupportedOperationChecker.scala:36)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:36)
      at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:51)
      at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:62)
      at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:60)
      at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
      at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
      at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
      at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
      at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
      at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
      at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
      at org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
      at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:384)
      ... 49 elided
    

Here is a simple Java example based on Jacek's answer above:

JSON array:

    [{
            "customer_id": "d6315a00",
            "product": "Super widget",
            "price": 10,
            "bought_date": "2019-01-01"
        },
        {
            "customer_id": "d6315a00",
            "product": "Super widget",
            "price": 10,
            "bought_date": "2019-01-01"
        },
        {
            "customer_id": "d6315a00",
            "product": "Super widget",
            "price": 10,
            "bought_date": "2019-01-02"
        },
        {
            "customer_id": "d6315a00",
            "product": "Food widget",
            "price": 4,
            "bought_date": "2019-08-20"
        },
        {
            "customer_id": "d6315cd0",
            "product": "Food widget",
            "price": 4,
            "bought_date": "2019-09-19"
        }, {
            "customer_id": "d6315e2e",
            "product": "Bike widget",
            "price": 10,
            "bought_date": "2019-01-01"
        }, {
            "customer_id": "d6315a00",
            "product": "Bike widget",
            "price": 10,
            "bought_date": "2019-03-10"
        },
        {
            "customer_id": "d631614e",
            "product": "Garage widget",
            "price": 4,
            "bought_date": "2019-02-15"
        }
    ]

Java code:

    package io.centilliard;
    
    import static org.apache.spark.sql.functions.explode;
    import static org.apache.spark.sql.functions.from_json;
    
    import org.apache.spark.sql.AnalysisException;
    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.DataStreamWriter;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.streaming.StreamingQueryException;
    import org.apache.spark.sql.types.ArrayType;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;
    
    import scala.Function2;
    import scala.runtime.BoxedUnit;
    
    public class Pivot {
    
        public static void main(String[] args) throws StreamingQueryException, AnalysisException {
    
            StructType schema = new StructType(new StructField[]{
                    new StructField("customer_id", DataTypes.StringType, false, Metadata.empty()),  
                    new StructField("product", DataTypes.StringType, false, Metadata.empty()),          
                    new StructField("price", DataTypes.IntegerType, false, Metadata.empty()),               
                    new StructField("bought_date", DataTypes.StringType, false, Metadata.empty())
                });
    
            ArrayType  arrayType = new ArrayType(schema, false);
    
            SparkSession spark = SparkSession
                    .builder()
                    .appName("SimpleExample")
                    .getOrCreate();
    
            // Create a DataSet representing the stream of input lines from Kafka
            Dataset<Row> dataset = spark
                            .readStream()
                            .format("kafka")                
                            .option("kafka.bootstrap.servers", "localhost:9092")
                            .option("subscribe", "utilization")
                            .load()
                            .selectExpr("CAST(value AS STRING) as json");
    
            Column col = new Column("json");        
            Column data = from_json(col,arrayType).as("data");  
            Column explode = explode(data);
            Dataset<Row> customers = dataset.select(explode).select("col.*");
    
            DataStreamWriter<Row> dataStreamWriter = new DataStreamWriter<Row>(customers);
    
            // foreachBatch hands each micro-batch to this callback as a regular (batch)
            // Dataset, so batch-only operations such as pivot can be used inside it.
            StreamingQuery dataStream = dataStreamWriter.foreachBatch(new Function2<Dataset<Row>, Object, BoxedUnit>() {
    
                @Override
                public BoxedUnit apply(Dataset<Row> dataset, Object object) {               
    
                    dataset
                    .groupBy("customer_id","product","bought_date")
                    .pivot("product")               
                    .sum("price")               
                    .orderBy("customer_id")
                    .show();
    
                    return null;
                }
            })
            .start();
    
            dataStream.awaitTermination();
        }
    
    }
    
    +-----------+-------------+-----------+-----------+-----------+-------------+------------+
    |customer_id|      product|bought_date|Bike widget|Food widget|Garage widget|Super widget|
    +-----------+-------------+-----------+-----------+-----------+-------------+------------+
    |   d6315a00|  Bike widget| 2019-03-10|         20|       null|         null|        null|
    |   d6315a00| Super widget| 2019-01-02|       null|       null|         null|          20|
    |   d6315a00| Super widget| 2019-01-01|       null|       null|         null|          40|
    |   d6315a00|  Food widget| 2019-08-20|       null|          8|         null|        null|
    |   d6315cd0|  Food widget| 2019-09-19|       null|          8|         null|        null|
    |   d6315e2e|  Bike widget| 2019-01-01|         20|       null|         null|        null|
    |   d631614e|Garage widget| 2019-02-15|       null|       null|            8|        null|
    +-----------+-------------+-----------+-----------+-----------+-------------+------------+
    
    

In most cases you can use conditional aggregation as a workaround. The equivalent of

    df.groupBy("timestamp").
       pivot("name", Seq("banana", "peach")).
       sum("value")
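
would be a conditional aggregation along the following lines (a sketch added here; the when/sum pattern below is the standard conditional-aggregation idiom rather than code quoted from the original answer):

    import org.apache.spark.sql.functions.{col, sum, when}

    df.groupBy("timestamp").agg(
      sum(when(col("name") === "banana", col("value"))).alias("banana"),
      sum(when(col("name") === "peach", col("value"))).alias("peach"))

Because this is a single groupBy aggregation with no pivot involved, Structured Streaming can run it on a streaming Dataset, given a suitable output mode.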