How to iterate over an array column in Pyspark while joining
In pyspark, I have dataframe_a:
+-----------+----------------------+
| str1 | array_of_str |
+-----------+----------------------+
| John | [mango, apple] |
| Tom | [mango, orange] |
| Matteo    | [apple, banana]      |
+-----------+----------------------+
and dataframe_b with:
+-----------+----------------------+
| key | value |
+-----------+----------------------+
| mango | 1 |
| apple | 2 |
| orange    | 3                    |
+-----------+----------------------+
I want to create a new column joined_result of array type that maps each element of array_of_str (from dataframe_a) to its value in dataframe_b, e.g.:
+-----------+----------------------+----------------------------------+
| str1 | array_of_str | joined_result |
+-----------+----------------------+----------------------------------+
| John | [mango, apple] | [1, 2] |
| Tom | [mango, orange] | [1, 3] |
| Matteo    | [apple, banana]      | [2]                              |
+-----------+----------------------+----------------------------------+
I don't know how to do this. I know I can use a udf with a lambda function, but I can't get it to work :( Help!
Thanks in advance.

My answer to your question:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

# Collect dataframe_b into a plain Python dict the udf can close over.
lookup_list = map(lambda row: row.asDict(), dataframe_b.collect())
lookup_dict = {lookup['key']: lookup['value'] for lookup in lookup_list}

def mapper(keys):
    # Skip keys with no match in dataframe_b (e.g. 'banana'),
    # matching the desired [2] for Matteo.
    return [lookup_dict[key] for key in keys if key in lookup_dict]

# Declare the return type explicitly; the default StringType would
# stringify the returned list.
dataframe_a = dataframe_a.withColumn(
    'joined_result',
    F.udf(mapper, ArrayType(IntegerType()))('array_of_str'))
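Since the udf itself only wraps a dict lookup, the mapper logic can be sanity-checked in plain Python before wiring it into Spark (a sketch; the dict literal mirrors dataframe_b's contents):

```python
# Plain-Python check of the mapper logic used inside the udf.
lookup_dict = {'mango': 1, 'apple': 2, 'orange': 3}

def mapper(keys):
    # Keys missing from the lookup (e.g. 'banana') are skipped.
    return [lookup_dict[key] for key in keys if key in lookup_dict]

print(mapper(['mango', 'apple']))   # [1, 2]
print(mapper(['apple', 'banana']))  # [2]
```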
Could it work by passing the second dataframe directly? :-) E.g. dataframe_a = dataframe_a.withColumn('joined_result', F.udf(mapper, dataframe_b)('arr_of_str'))
You can't do that: the udf runs over a single dataframe (here dataframe_a). Moreover, the udf executes in the PVM (Python virtual machine), so you have to pass it Python objects such as a dictionary, not a dataframe. You can treat the second dataframe as a lookup table and assume it won't be very large. I don't know your exact case, but this solution runs well for me.

Somehow I get the new column joined_result with nested arrays: [[1], [2]].

I think I solved it; I added [0] to the lookup in my answer (lookup_dict[key][0]). Check again and let me know :-)

Thanks for your help :) It didn't work; it still returns a nested array, and now only the first character of each value =)