Apache Spark: How to find the top n keys by value in PySpark?

I have a PySpark dataframe with the following schema:

root
 |-- query: string (nullable = true)
 |-- collect_list(docId): array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- prod_count_dict: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

The dataframe looks like this:

+--------------------+--------------------+--------------------+
|               query| collect_list(docId)|     prod_count_dict|
+--------------------+--------------------+--------------------+
|1/2 inch plywood ...|[471097-153-12CC,...|[530320-62634-100...|
|             1416445|[1416445-83-HHM5S...|[1054482-2251-FFC...
Note that the prod_count_dict column is a map holding key:value pairs, like:

{x: 12, a: 16, b:1, f:3, ....}
What I want to do is pick only the keys with the top n largest values from the key:value pairs and store them in another column, as a list corresponding to that row, like: [x, a, ...].
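
For example, with n = 2 (a toy illustration in plain Python; the dictionary is made up):

d = {"x": 12, "a": 16, "b": 1, "f": 3}
top_n = [k for k, v in sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:2]]
# top_n == ["a", "x"]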

I tried the code below, but it gives me an error. Is there a way to solve this particular problem?

@F.udf(StringType())
def create_label(x):
    # If the dictionary has fewer than 20 items, I want to return the keys of all the items in the dict.
    if len(x) >= 20:
        val_sort = sorted(list(x.values()), reverse = True)
        cutoff = {k: v for (k, v) in x.items() if v > val_sort[20]}
        return cutoff.keys()
    else:
        return x.keys()

label_df = label_count_df.withColumn("label", create_label("prod_count_dict"))
label_df.show()

First, I would explode the map column:

df = df.select("*", f.explode("prod_count_dict").alias("key", "value"))
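To see what this step produces, here is a minimal sketch with made-up data; exploding a map column yields one row per (key, value) pair alongside the original columns:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
# toy frame standing in for the real one
toy_df = spark.createDataFrame(
    [("q1", {"x": 12, "a": 16, "b": 1})],
    ["query", "prod_count_dict"],
)
toy_df.select("*", f.explode("prod_count_dict").alias("key", "value")).show()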
After that, you can use a Window function to keep the top n entries per query:

import pyspark.sql.functions as f
from pyspark.sql import Window

# rank each key within its query by descending value
w = Window.partitionBy(df['query']).orderBy(df['value'].desc())

df = (df.select('*', f.rank().over(w).alias('rank'))
        .filter(f.col('rank') <= 2)  # set n here
        .drop('rank'))
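
Putting both steps together, a sketch of an end-to-end version (assuming the column names above; n, top_keys, and label are illustrative names; row_number is used instead of rank so ties cannot yield more than n keys):

import pyspark.sql.functions as f
from pyspark.sql import Window

n = 2  # number of top keys to keep per query

exploded = df.select("query", f.explode("prod_count_dict").alias("key", "value"))
w = Window.partitionBy("query").orderBy(f.col("value").desc())

top_keys = (exploded
            .withColumn("rank", f.row_number().over(w))
            .filter(f.col("rank") <= n)
            .groupBy("query")
            .agg(f.collect_list("key").alias("label")))

# join the list column back onto the original dataframe if needed
label_df = df.join(top_keys, on="query", how="left")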

The UDF logic you have written is essentially correct; you only need to change the code where you actually use it. That is easy to do with .map on the rdd:

# Let the udf that you have written be a normal python function
def create_label(x):
    # If the dictionary has 20 items or fewer, return the keys of all the items in the dict.
    if len(x) > 20:  # len(x) == 20 would make val_sort[20] an IndexError
        val_sort = sorted(x.values(), reverse=True)
        # keep keys whose value exceeds the 21st-largest value
        cutoff = {k: v for (k, v) in x.items() if v > val_sort[20]}
        return list(cutoff.keys())  # return a list so Spark can infer the element type
    else:
        return list(x.keys())
The part you need to change is:

label_df_col = ['query','prod_count_dict']
label_df = label_count_df.rdd.map(lambda x:(x.query, create_label(x.prod_count_dict))).toDF(label_df_col)
label_df.show()

This should work.

That works. I think I could also do it with df.withColumn.
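
For reference, a withColumn variant could look like the sketch below; the key fixes are declaring ArrayType(StringType()) as the UDF return type and returning a plain list (the cutoff logic is the same as above):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(ArrayType(StringType()))
def create_label(x):
    # return all keys when the dict has at most 20 entries
    if len(x) > 20:
        val_sort = sorted(x.values(), reverse=True)
        # keep keys whose value exceeds the 21st-largest value
        return [k for k, v in x.items() if v > val_sort[20]]
    return list(x.keys())

label_df = label_count_df.withColumn("label", create_label("prod_count_dict"))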