Python: How to convert groupBy to reduceByKey in a PySpark dataframe?

I have written PySpark code using groupBy and the sum function. I feel that performance is suffering because of the group by, and I would like to use reduceByKey instead, but I am new to this area. Please find my scenario below.

Step 1: Read the data of a Hive table join query through sqlContext and store it in a dataframe (a sketch follows after these steps).

Step 2: There are 15 input columns in total. 5 of them are key fields and the rest are numeric values.

Step 3: In addition to the input columns above, a few more columns need to be derived from the numeric columns, and a few columns use default values.

Step 4: I have used groupBy and the sum function. How can I do similar logic the Spark way, using map and reduceByKey?
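
For Step 1, the snippet below is a minimal sketch (not the original query) of reading a Hive join query into a DataFrame; the database, table, and column names are placeholders. The question uses sqlContext, whose sqlContext.sql(...) call works the same way as spark.sql(...) here.

from pyspark.sql import SparkSession

#hypothetical Hive join query; database, table, and column names are placeholders
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
input_df = spark.sql("""
    SELECT a.key1, a.key2, a.key3, b.key4, b.key5,
           a.metric1, a.metric2, b.metric3
    FROM mydb.table_a a
    JOIN mydb.table_b b
      ON a.key1 = b.key1
""")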

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit, concat, round, sum

#get (or reuse) a SparkSession and its SparkContext; in the pyspark shell these already exist as spark and sc
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

#sample data
df = sc.parallelize([(1, 2, 3, 4), (5, 6, 7, 8)]).toDF(["col1", "col2", "col3", "col4"])

#populate col5, col6, col7
col5 = when((col('col1') == 0) & (col('col3') != 0), round(col('col4')/ col('col3'), 2)).otherwise(0)
col6 = when((col('col1') == 0) & (col('col4') != 0), round((col('col3') * col('col4'))/ col('col1'), 2)).otherwise(0)
col7 = col('col2')
df1 = df.withColumn("col5", col5).\
    withColumn("col6", col6).\
    withColumn("col7", col7)

#populate col8, col9, col10
col8 = when((col('col1') != 0) & (col('col3') != 0), round(col('col4')/ col('col3'), 2)).otherwise(0)
col9 = when((col('col1') != 0) & (col('col4') != 0), round((col('col3') * col('col4'))/ col('col1'), 2)).otherwise(0)
col10= concat(col('col2'), lit("_NEW"))
df2 = df.withColumn("col5", col8).\
    withColumn("col6", col9).\
    withColumn("col7", col10)

#final dataframe
final_df = df1.union(df2)
final_df.show()

#groupBy calculation
final_df.groupBy("col1", "col2", "col3", "col4").agg(sum("col5")........sum("coln")).show()
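
The elided list of sum(...) calls above can get unwieldy when there are many numeric columns. A common pattern, sketched below with an illustrative column list (only col5 and col6 are numeric in the sample data; your real derived columns would go here), is to build the aggregation expressions with a list comprehension:

#aggregate several numeric columns at once; numeric_cols is illustrative, not the real column list
numeric_cols = ["col5", "col6"]
agg_exprs = [sum(c).alias("sum_" + c) for c in numeric_cols]
final_df.groupBy("col1", "col2", "col3", "col4").agg(*agg_exprs).show()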

There is no reduceByKey in Spark SQL.

groupBy plus an aggregation function works almost the same way as RDD.reduceByKey. Spark will automatically choose whether it should behave like RDD.groupByKey (i.e. for collecting lists) or like RDD.reduceByKey.

The performance of Dataset.groupBy plus an aggregation function should be better than or equal to RDD.reduceByKey. The Catalyst optimizer takes care of how the aggregation is performed under the hood.
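
One way to see this, assuming the sample final_df from above, is to inspect the physical plan. The exact output depends on the Spark version, but it typically shows two HashAggregate steps: a partial (map-side) aggregation before the shuffle and a final one after it, which is the same pattern reduceByKey uses.

#print the physical plan; look for a partial HashAggregate, an Exchange, then a final HashAggregate
final_df.groupBy("col1", "col2", "col3", "col4").agg(sum("col5")).explain()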

As far as I know, it only adds an extra final aggregation step on the executors, not on the driver, in Spark SQL groupBy + aggregation.

Thanks for the reply. Can't we apply reduceByKey on a dataframe? Many articles say the same thing about reduceByKey: because it reduces the number of rows before the final stage, it is faster than groupBy for large datasets.

@user3150024 Those articles are about RDDs. Datasets have an abstraction layer, and the Catalyst optimizer optimizes the queries :)

Are there any other ways to improve performance? I tried increasing the number of executors, but it is not reflected; only 8 vCores with two cores are being used. Should I convert the dataframe to an RDD and apply reduceByKey? Would that work?

@user3150024 A DataFrame's groupBy and agg should be at least as fast as reduceByKey. It also sounds like you have some issues related to your cluster setup.
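
For completeness, this is roughly what the RDD route discussed in the comments would look like; it is a sketch only (the key and value columns come from the sample data above, and it assumes col5 has no nulls), and it bypasses the Catalyst and Tungsten optimizations, so it is generally not expected to be faster than groupBy + agg.

#sketch of the RDD approach discussed above; usually not faster than DataFrame groupBy + agg
pair_rdd = final_df.rdd.map(lambda r: ((r['col1'], r['col2'], r['col3'], r['col4']), r['col5']))
summed = pair_rdd.reduceByKey(lambda a, b: a + b)
print(summed.take(10))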