Pyspark join dataframes on comma-separated values in a column
So I have two dataframes that I want to join. The second table stores comma-separated values, one of which matches a column in Table A. How can I do this in Pyspark? Below is an example. Table A has:
+-------+---------+
|deal_id|deal_name|
+-------+---------+
| 613760|ABCDEFGHI|
| 613740|TEST123  |
| 598946|OMG      |
+-------+---------+
Table B has:
+----------------------------------+---------+
|deal_id                           |deal_type|
+----------------------------------+---------+
|613760,613761,613762,613763      |Direct De|
|613740,613750,613770,613780,613790|Direct   |
|598946                            |In       |
+----------------------------------+---------+
Expected result - join Table A and Table B whenever Table A's deal_id matches one of the comma-separated values in Table B. For example, TableA.deal_id 613760 appears in row 1 of Table B, so I want that row returned:
+-------+---------+---------+
|deal_id|deal_name|deal_type|
+-------+---------+---------+
| 613760|ABCDEFGHI|Direct De|
| 613740|TEST123  |Direct   |
| 598946|OMG      |In       |
+-------+---------+---------+
Any help is appreciated - I need this in Pyspark. Thanks.

Sample data:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark_session = SparkSession.builder.getOrCreate()

# Table A: one deal_id per row
tuples_a = [('613760', 'ABCDEFGHI'),
            ('613740', 'TEST123'),
            ('598946', 'OMG'),
            ]
schema_a = StructType([
    StructField('deal_id', StringType(), nullable=False),
    StructField('deal_name', StringType(), nullable=False)
])

# Table B: comma-separated deal_ids per row
tuples_b = [('613760,613761,613762,613763', 'Direct De'),
            ('613740,613750,613770,613780,613790', 'Direct'),
            ('598946', 'In'),
            ]
schema_b = StructType([
    StructField('deal_id', StringType(), nullable=False),
    StructField('deal_type', StringType(), nullable=False)
])

df_a = spark_session.createDataFrame(data=tuples_a, schema=schema_a)
df_b = spark_session.createDataFrame(data=tuples_b, schema=schema_b)
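If you want to sanity-check the inputs before joining, the two frames can simply be displayed:

# confirm both frames loaded as expected
df_a.show(10, False)
df_b.show(10, False)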
You need to split the column and explode it in order to join:
from pyspark.sql.functions import split, col, explode

# split the comma-separated deal_id string into an array, explode it into one
# row per id, then rename the exploded column back to deal_id for the join
df_b = df_b.withColumn('split', split(col('deal_id'), ','))\
           .withColumn('exploded', explode(col('split')))\
           .drop('deal_id', 'split')\
           .withColumnRenamed('exploded', 'deal_id')

df_a.join(df_b, on='deal_id', how='left_outer')\
    .show(10, False)
And the expected result:
+-------+---------+---------+
|deal_id|deal_name|deal_type|
+-------+---------+---------+
|613760 |ABCDEFGHI|Direct De|
|613740 |TEST123  |Direct   |
|598946 |OMG      |In       |
+-------+---------+---------+
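A note on the design choice: exploding df_b gives one row per id, so the join becomes a plain equality on deal_id. If you would rather leave df_b untouched, a sketch of an alternative (assuming the same df_a and df_b as above) is to join on an expression that checks membership in the split array:

from pyspark.sql import functions as F

# alternative: keep df_b intact and join on array membership instead of exploding
result = df_a.alias('a')\
             .join(df_b.alias('b'),
                   F.expr("array_contains(split(b.deal_id, ','), a.deal_id)"),
                   'left_outer')\
             .select('a.deal_id', 'a.deal_name', 'b.deal_type')
result.show(10, False)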
Using find_in_set:
dfA.alias('d1').join(dfB.alias('d2'), expr('find_in_set(d1.deal_id, d2.deal_id) > 0'), 'left').select('d1.*', 'd2.deal_type')
I tried the following code: merg_orders_df = dfA.alias('d1').join(dfB.alias('d2'), F.expr('F.find_in_set(d1.deal_id, d2.deal_id) > 0'), 'full') and I get an attribute error: AttributeError: module 'pyspark.sql.functions' has no attribute 'find_in_set'. Please help fix the code.
find_in_set is a Spark SQL built-in function and it is invoked from inside F.expr(), so just remove the F. prefix from F.find_in_set.
... worked like a charm.. Thanks
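For reference, a minimal sketch of the corrected call (written against the df_a/df_b frames built above rather than the asker's dfA/dfB; find_in_set returns the 1-based position of the id in the comma-delimited list, or 0 when it is absent):

from pyspark.sql import functions as F

# find_in_set is a Spark SQL built-in, so it appears inside the expr() string
# without the F. prefix
merged = df_a.alias('d1')\
             .join(df_b.alias('d2'),
                   F.expr("find_in_set(d1.deal_id, d2.deal_id) > 0"),
                   'left')\
             .select('d1.*', 'd2.deal_type')
merged.show(10, False)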