Filtering a Spark DataFrame by iterating over a dictionary of lists in Python
I have a dictionary that looks like this (but longer):

a_dict = {"E1": ["a", 10, 20, "red"], "E2": ["b", 7, 14, "green"], "E3": ["c", 40, 50, "blue"]}

I want to filter a Spark DataFrame with every list in the dictionary at the same time. Let's look at an example DataFrame:
+----+-----+------+
|User|value| color|
+----+-----+------+
|   a|   12|   red|
|   a|   21|   red|
|   b|    8| green|
|   b|   13| green|
|   c|   41|  blue|
|   b|   72|   red|
|   c|   52|  blue|
|   a|   13|yellow|
+----+-----+------+
What I'm doing right now is:

for key, value in a_dict.items():
    df = df.filter((df.user == value[0])
                   & (df.value > value[1])
                   & (df.value < value[2])
                   & (df.color == value[3]))
I'd like to know whether there is a faster way, without the for loop and without reassigning the DataFrame on every iteration.

You can create a DataFrame from the dictionary values and perform a left-semi join to filter the original DataFrame:
a_dict = {"E1": ["a", 10, 20, "red"], "E2": ["b", 7, 14, "green"], "E3": ["c", 40, 50, "blue"]}
df2 = spark.createDataFrame(a_dict.values(), ['user', 'value1', 'value2', 'color'])
result = df.join(
    df2,
    (df['user'] == df2['user']) &
    (df['color'] == df2['color']) &
    (df['value'].between(df2['value1'], df2['value2'])),
    'left_semi'
)
result.show()
+----+-----+-----+
|User|value|color|
+----+-----+-----+
| c| 41| blue|
| b| 8|green|
| b| 13|green|
| a| 12| red|
+----+-----+-----+
It works perfectly, thank you. If you don't mind, suppose I have one more value in df2:

df2 = spark.createDataFrame(a_dict.values(), ['user', 'value1', 'value2', 'color', 'ID'])

How can I keep each matched ID value in df? The result should then have a column with the corresponding ID. Is that possible? Sorry, I'm not very familiar with joins.

You can try a left join instead of a left-semi join. If that still doesn't work, you can ask a new question with the necessary details.