MySQL / Spark SQL: joining two DataFrames that have no primary key
I have the following two DataFrames and want to join them to create data in a new schema:
df = sqlContext.createDataFrame(
    [("A011021", "15", "2020-01-01", "2020-12-31", "4"),
     ("A011021", "15", "2020-01-01", "2020-12-31", "4"),
     ("A011021", "15", "2020-01-01", "2020-12-31", "4"),
     ("A011021", "15", "2020-01-01", "2020-12-31", "3")],
    ["rep_id", "sales_target", "start_date", "end_date", "st_new"])
df.createOrReplaceTempView('df')
+--------------+------------+----------+----------+------+
|rep_id        |sales_target|start_date|end_date  |st_new|
+--------------+------------+----------+----------+------+
|A011021 |15 |2020-01-01|2020-12-31|4 |
|A011021 |15 |2020-01-01|2020-12-31|4 |
|A011021 |15 |2020-01-01|2020-12-31|4 |
|A011021 |15 |2020-01-01|2020-12-31|3 |
|A011022 |6 |2020-01-01|2020-12-31|3 |
|A011022 |6 |2020-01-01|2020-12-31|3 |
+--------------+------------+----------+----------+------+
df2 = sqlContext.createDataFrame(
    [("A011021", "15", "2020-01-01", "2020-12-31", "2020-01-01", "2020-03-31"),
     ("A011021", "15", "2020-01-01", "2020-12-31", "2020-04-01", "2020-06-30"),
     ("A011021", "15", "2020-01-01", "2020-12-31", "2020-07-01", "2020-09-30"),
     ("A011021", "15", "2020-01-01", "2020-12-31", "2020-10-01", "2020-12-31")],
    ["rep_id", "sales_target", "start_date", "end_date", "new_sdt", "new_edt"])
df2.createOrReplaceTempView('df2')
+--------------+------------+----------+----------+-----------+----------+
|rep_id        |sales_target|start_date|end_date  |new_sdt    |new_edt   |
+--------------+------------+----------+----------+-----------+----------+
|A011021 |15 |2020-01-01|2020-12-31|2020-01-01 |2020-03-31|
|A011021 |15 |2020-01-01|2020-12-31|2020-04-01 |2020-06-30|
|A011021 |15 |2020-01-01|2020-12-31|2020-07-01 |2020-09-30|
|A011021 |15 |2020-01-01|2020-12-31|2020-10-01 |2020-12-31|
|A011022 |6 |2020-01-01|2020-06-30|2020-01-01 |2020-03-31|
|A011022 |6 |2020-01-01|2020-06-30|2020-04-01 |2020-06-30|
+--------------+------------+----------+----------+-----------+----------+
When I run a query to join the two tables, I get duplicated results, as shown below:
select ds1.*, ds2.new_sdt, ds2.new_edt from df ds1
inner join df2 ds2
on ds1.rep_id = ds2.rep_id
where ds1.rep_id = 'A011021'
+--------------+------------+----------+----------+------+-----------+----------+
|rep_id        |sales_target|start_date|end_date  |st_new|new_sdt    |new_edt   |
+--------------+------------+----------+----------+------+-----------+----------+
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-01-01 |2019-12-31|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-01-01 |2019-12-31|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-01-01 |2019-12-31|
|A011021 |15 |2020-01-01|2020-12-31|3 |2020-01-01 |2020-03-31|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-04-01 |2020-03-31|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-04-01 |2020-03-31|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-04-01 |2020-03-31|
|A011021 |15 |2020-01-01|2020-12-31|3 |2020-04-01 |2020-06-30|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-07-01 |2020-06-30|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-07-01 |2020-06-30|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-07-01 |2020-06-30|
|A011021 |15 |2020-01-01|2020-12-31|3 |2020-07-01 |2020-09-30|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-10-01 |2020-09-30|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-10-01 |2020-09-30|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-10-01 |2020-09-30|
|A011021 |15 |2020-01-01|2020-12-31|3 |2020-10-01 |2020-12-30|
+--------------+------------+----------+----------+------+-----------+----------+
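The 16 rows above are exactly the per-key cross product: an inner join on rep_id alone pairs each of the four df rows for A011021 with each of the four df2 rows. A minimal sketch of that multiplication (plain Python, illustrative values only):

```python
# Joining on rep_id alone pairs every left row with every matching right row,
# so a key shared by 4 and 4 rows produces 4 * 4 = 16 output rows.
left_st_new = ["4", "4", "4", "3"]         # st_new values in df for A011021
right_quarters = ["Q1", "Q2", "Q3", "Q4"]  # the four quarter rows in df2
joined = [(s, q) for s in left_st_new for q in right_quarters]
print(len(joined))  # 16 rows, matching the duplicated output above
```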
Is there a way, with Spark SQL or PySpark functions, to get only the distinct new_sdt/new_edt quarter rows for a given rep_id? Any help is appreciated.
The expected result is:
select ds1.*, ds2.new_sdt, ds2.new_edt from df ds1
inner join df2 ds2
on ds1.rep_id = ds2.rep_id
+--------------+------------+----------+----------+------+-----------+----------+
|rep_id        |sales_target|start_date|end_date  |st_new|new_sdt    |new_edt   |
+--------------+------------+----------+----------+------+-----------+----------+
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-01-01 |2020-03-31|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-04-01 |2020-06-30|
|A011021 |15 |2020-01-01|2020-12-31|4 |2020-07-01 |2020-09-30|
|A011021 |15 |2020-01-01|2020-12-31|3 |2020-10-01 |2020-12-31|
|A011022 |6 |2020-01-01|2020-12-31|3 |2020-01-01 |2020-03-31|
|A011022 |6 |2020-01-01|2020-12-31|3 |2020-04-01 |2020-06-30|
+--------------+------------+----------+----------+------+-----------+----------+
One answer joined the two frames and posted this output (all columns from both sides):
+-------+------------+----------+----------+------+-------+------------+----------+----------+----------+----------+
| rep_id|sales_target|start_date| end_date|st_new| rep_id|sales_target|start_date| end_date| new_sdt| new_edt|
+-------+------------+----------+----------+------+-------+------------+----------+----------+----------+----------+
|A011021| 15|2020-01-01|2020-12-31| 4|A011021| 15|2020-01-01|2020-12-31|2020-04-01|2020-06-30|
|A011021| 15|2020-01-01|2020-12-31| 4|A011021| 15|2020-01-01|2020-12-31|2020-01-01|2020-03-31|
|A011021| 15|2020-01-01|2020-12-31| 4|A011021| 15|2020-01-01|2020-12-31|2020-07-01|2020-09-30|
|A011021| 15|2020-01-01|2020-12-31| 3|A011021| 15|2020-01-01|2020-12-31|2020-10-01|2020-12-31|
|A011022| 6|2020-01-01|2020-12-31| 3|A011022| 6|2020-01-01|2020-06-30|2020-04-01|2020-06-30|
|A011022| 6|2020-01-01|2020-12-31| 3|A011022| 6|2020-01-01|2020-06-30|2020-01-01|2020-03-31|
+-------+------------+----------+----------+------+-------+------------+----------+----------+----------+----------+
Comments:
- What is the `customer_number` column? — Oh sorry, it should be rep_id; I've updated the query. Thanks.
- Can you also print the output? I'll check it on my side. — Added the output as well. @Yuva
- Thank you. If you look at st_new, it should be 4, 4, 4, 3, in that order, and that's the part I'm finding hard; any suggestions to get it right? st_new, new_sdt and new_edt are exploded data.
- Thanks. I can't add any column-level validation, because the number of records changes with the number of quarters; see my example with rep_id A011022.
- @Yuva please verify your input. I don't see A011022 anywhere.