Spark SQL: joining two DataFrames that have no primary key

Tags: mysql, python-3.x, pyspark, apache-spark-sql, pyspark-dataframes

I have the following two DataFrames, and I want to join them to create a new dataset:

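# Assumes an existing sqlContext (for example, from the PySpark shell).
df = sqlContext.createDataFrame(
    [
        ("A011021", "15", "2020-01-01", "2020-12-31", "4"),
        ("A011021", "15", "2020-01-01", "2020-12-31", "4"),
        ("A011021", "15", "2020-01-01", "2020-12-31", "4"),
        ("A011021", "15", "2020-01-01", "2020-12-31", "3"),
    ],
    ["rep_id", "sales_target", "start_date", "end_date", "st_new"],
)
# Register the first DataFrame (df, not df2) as the temp view 'df'.
df.createOrReplaceTempView('df')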
+--------------+------------+----------+----------+------+
|rep_id        |sales_target|start_date|end_date  |st_new|
+--------------+------------+----------+----------+------+
|A011021       |15          |2020-01-01|2020-12-31|4     |
|A011021       |15          |2020-01-01|2020-12-31|4     |
|A011021       |15          |2020-01-01|2020-12-31|4     |
|A011021       |15          |2020-01-01|2020-12-31|3     |
|A011022       |6           |2020-01-01|2020-12-31|3     |
|A011022       |6           |2020-01-01|2020-12-31|3     |
+--------------+------------+----------+----------+------+

df2 = sqlContext.createDataFrame(
    [
        ("A011021", "15", "2020-01-01", "2020-12-31", "2020-01-01", "2020-03-31"),
        ("A011021", "15", "2020-01-01", "2020-12-31", "2020-04-01", "2020-06-30"),
        ("A011021", "15", "2020-01-01", "2020-12-31", "2020-07-01", "2020-09-30"),
        ("A011021", "15", "2020-01-01", "2020-12-31", "2020-10-01", "2020-12-31"),
    ],
    ["rep_id", "sales_target", "start_date", "end_date", "new_sdt", "new_edt"],
)
df2.createOrReplaceTempView('df2')
+--------------+------------+----------+----------+-----------+----------+
|rep_id        |sales_target|start_date|end_date  |new_sdt    |new_edt   |
+--------------+------------+----------+----------+-----------+----------+
|A011021       |15          |2020-01-01|2020-12-31|2020-01-01 |2020-03-31|
|A011021       |15          |2020-01-01|2020-12-31|2020-04-01 |2020-06-30|
|A011021       |15          |2020-01-01|2020-12-31|2020-07-01 |2020-09-30|
|A011021       |15          |2020-01-01|2020-12-31|2020-10-01 |2020-12-31|
|A011022       |6           |2020-01-01|2020-06-30|2020-01-01 |2020-03-31|
|A011022       |6           |2020-01-01|2020-06-30|2020-04-01 |2020-06-30|
+--------------+------------+----------+----------+-----------+----------+

When I run a query to join the two tables, I get duplicated results, as shown below:

select ds1.*, ds2.new_sdt, ds2.new_edt from df ds1
inner join df2 ds2
on ds1.rep_id=ds2.rep_id
where ds1.rep_id='A011021'

+--------------+------------+----------+----------+------+-----------+----------+
|rep_id        |sales_target|start_date|end_date  |st_new|new_sdt    |new_edt   |
+--------------+------------+----------+----------+------+-----------+----------+
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-01-01 |2019-12-31|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-01-01 |2019-12-31|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-01-01 |2019-12-31|
|A011021       |15          |2020-01-01|2020-12-31|3     |2020-01-01 |2020-03-31|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-04-01 |2020-03-31|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-04-01 |2020-03-31|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-04-01 |2020-03-31|
|A011021       |15          |2020-01-01|2020-12-31|3     |2020-04-01 |2020-06-30|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-07-01 |2020-06-30|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-07-01 |2020-06-30|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-07-01 |2020-06-30|
|A011021       |15          |2020-01-01|2020-12-31|3     |2020-07-01 |2020-09-30|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-10-01 |2020-09-30|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-10-01 |2020-09-30|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-10-01 |2020-09-30|
|A011021       |15          |2020-01-01|2020-12-31|3     |2020-10-01 |2020-12-30|
+--------------+------------+----------+----------+------+-----------+----------+

Is there a way, using Spark SQL or PySpark functions, to get only the distinct new_sdt, new_edt quarterly data for a given rep_id? Please help.

The expected result is:

select ds1.*, ds2.new_sdt, ds2.new_edt from df ds1
inner join df2 ds2
on ds1.rep_id=ds2.rep_id

+--------------+------------+----------+----------+------+-----------+----------+
|rep_id        |sales_target|start_date|end_date  |st_new|new_sdt    |new_edt   |
+--------------+------------+----------+----------+------+-----------+----------+
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-01-01 |2020-03-31|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-04-01 |2020-06-30|
|A011021       |15          |2020-01-01|2020-12-31|4     |2020-07-01 |2020-09-30|
|A011021       |15          |2020-01-01|2020-12-31|3     |2020-10-01 |2020-12-31|
|A011022       |6           |2020-01-01|2020-12-31|3     |2020-01-01 |2020-03-31|
|A011022       |6           |2020-01-01|2020-12-31|3     |2020-04-01 |2020-06-30|
+--------------+------------+----------+----------+------+-----------+----------+
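
The 16 rows in the question follow from joining on rep_id alone: each of the 4 rows for A011021 in the first DataFrame matches each of the 4 rows in the second. A minimal plain-Python sketch of that effect (no Spark involved; the tuples are trimmed to the relevant columns, and the names left, right, joined, and paired are illustrative, not from the original post):

```python
# st_new values from df, and quarter ranges from df2, both for rep_id A011021
left = [("A011021", "4"), ("A011021", "4"), ("A011021", "4"), ("A011021", "3")]
right = [
    ("A011021", "2020-01-01", "2020-03-31"),
    ("A011021", "2020-04-01", "2020-06-30"),
    ("A011021", "2020-07-01", "2020-09-30"),
    ("A011021", "2020-10-01", "2020-12-31"),
]

# An inner join on rep_id alone keeps every (left, right) pair with equal keys.
joined = [(l[0], l[1], r[1], r[2]) for r in right for l in left if l[0] == r[0]]
print(len(joined))  # 16

# Pairing rows by their position within the rep_id gives one row per quarter.
paired = [(l[0], l[1], r[1], r[2]) for l, r in zip(left, right)]
print(len(paired))  # 4
```

Pairing by position only works if both sides carry an explicit position column, because Spark DataFrames have no inherent row order.
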
  • Assign a unique id
  • Do an inner join
  • Drop the redundant columns
  • Expected output

    +-------+------------+----------+----------+------+-------+------------+----------+----------+----------+----------+
    | rep_id|sales_target|start_date|  end_date|st_new| rep_id|sales_target|start_date|  end_date|   new_sdt|   new_edt|
    +-------+------------+----------+----------+------+-------+------------+----------+----------+----------+----------+
    |A011021|          15|2020-01-01|2020-12-31|     4|A011021|          15|2020-01-01|2020-12-31|2020-04-01|2020-06-30|
    |A011021|          15|2020-01-01|2020-12-31|     4|A011021|          15|2020-01-01|2020-12-31|2020-01-01|2020-03-31|
    |A011021|          15|2020-01-01|2020-12-31|     4|A011021|          15|2020-01-01|2020-12-31|2020-07-01|2020-09-30|
    |A011021|          15|2020-01-01|2020-12-31|     3|A011021|          15|2020-01-01|2020-12-31|2020-10-01|2020-12-31|
    |A011022|           6|2020-01-01|2020-12-31|     3|A011022|           6|2020-01-01|2020-06-30|2020-04-01|2020-06-30|
    |A011022|           6|2020-01-01|2020-12-31|     3|A011022|           6|2020-01-01|2020-06-30|2020-01-01|2020-03-31|
    +-------+------------+----------+----------+------+-------+------------+----------+----------+----------+----------+
    

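Comments:

What is the customer_number column?

Oh, sorry, it should be rep_id. I have updated the query. Thanks.

Thanks. Can you print the output as well? I will check it on my side.

Added the output as well, @Yuva.

Thank you. If you look at st_new, it should be 4, 4, 4, 3, in that order; that is where I am finding it hard. Any suggestions on how to get it right? st_new and new_sdt, new_edt are exploded data.

Thanks. I cannot add any column-level validation, because the number of records varies with the number of quarters; see my example for rep_id A011022. Thanks.

@Yuva Please validate your input. I do not see A011022 anywhere.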