PySpark: aggregate data by checking whether a value exists (not count or sum)


I have a dataset like this:

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (4, 1, 1, "2018-06-05", "Region C"),
    (5, 3, 2, "2018-06-03", "Region D"),
    (6, 1, 2, "2018-06-03", "Region A"),
    (7, 4, 4, "2018-06-03", "Region A"),
    (8, 4, 4, "2018-06-03", "Region B"),
    (9, 5, 4, "2018-06-03", "Region A"),
    (10, 5, 4, "2018-06-03", "Region B"),
])\
  .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

I can count each customer's orders per region like this:

from pyspark.sql.functions import count

temp_result = test.groupBy("customerid").pivot("location").agg(count("orderid")).na.fill(0)
temp_result.show()

Now, instead of a sum or count, I simply want to aggregate by whether a value exists at all (i.e. 0 or 1).

I can get that result with the following:

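from pyspark.sql.functions import col, when

# Turn every pivoted count column into a 0/1 flag, leaving the key columns alone.
for field in temp_result.schema.fields:
    name = str(field.name)
    if name not in ["customerid", "overall_count", "overall_amount"]:
        temp_result = temp_result.withColumn(name, when(col(name) >= 1, 1).otherwise(0))

But is there a simpler way to get it?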

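You're basically almost there; only a small tweak is needed to get the result you want. In the aggregation, add a comparison on the count and cast the resulting boolean to an integer (if needed):

The result is: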

+----------+--------+--------+--------+--------+
|customerid|Region A|Region B|Region C|Region D|
+----------+--------+--------+--------+--------+
|         5|       1|       1|       0|       0|
|         1|       1|       1|       1|       0|
|         3|       0|       0|       0|       1|
|         2|       0|       1|       0|       0|
|         4|       1|       1|       0|       0|
+----------+--------+--------+--------+--------+

If you get a Spark error, you can use this solution instead, which does the count comparison in an additional step:

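from pyspark.sql.functions import col, count, sum as sum_  # alias keeps Python's built-in sum intact

# Step 1: count orders per (customer, location) and reduce the count to a 0/1 flag.
# Step 2: pivot the locations into columns; each cell sums a single 0/1 flag.
temp_result = test.groupBy("customerid", "location")\
                  .agg(count("orderid").alias("count"))\
                  .withColumn("count", (col("count") > 0).cast("integer"))\
                  .groupBy("customerid")\
                  .pivot("location")\
                  .agg(sum_("count")).na.fill(0)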

temp_result.show()

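The 0/1 values matter because later I need to convert the result into a matrix and do a multiplication.

u"Aggregate expression required for pivot, found 'cast((count(orderid) > cast(0 as bigint)) as int)';" is what I get. Am I missing something?

@cqcn Which Spark version do you have?

It's on Databricks, 2.3.1.

@cqcn1991 Strange, I have the same version, so this shouldn't be a problem. I've added a different solution to my answer that should work for you, but unfortunately it's more complicated.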
