Apache Spark: merging similar column names when joining two DataFrames with PySpark

Tags: apache-spark, pyspark-dataframes

In the program below, a duplicate column is created when the two DataFrames are joined in PySpark:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("Join").getOrCreate()
>>> emp_data = [{"Emp_id": 123, "Emp_name": "Raja"}, {"Emp_id": 456, "Emp_name": "Ravi"}]
>>> dept_data = [{"Emp_id": 123, "Dep_name": "Computer"}, {"Emp_id": 456, "Dep_name": "Economy"}]
>>> df = spark.createDataFrame(emp_data)
>>> df1 = spark.createDataFrame(dept_data)
>>> df2 = df.join(df1, df.Emp_id == df1.Emp_id, how='inner')

>>> df.show()
+------+--------+
|Emp_id|Emp_name|
+------+--------+
|   123|    Raja|
|   456|    Ravi|
+------+--------+

>>> df1.show()
+--------+------+
|Dep_name|Emp_id|
+--------+------+
|Computer|   123|
| Economy|   456|
+--------+------+

>>> df2.show()
+------+--------+--------+------+
|Emp_id|Emp_name|Dep_name|Emp_id|
+------+--------+--------+------+
|   123|    Raja|Computer|   123|
|   456|    Ravi| Economy|   456|
+------+--------+--------+------+
Is there another way to get the join result with the overlapping key column merged into a single column, as in SAS, like the data shown below?

+------+--------+--------+
|Emp_id|Emp_name|Dep_name|
+------+--------+--------+
|   123|    Raja|Computer|
|   456|    Ravi| Economy|
+------+--------+--------+
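
For context, the duplicate column is more than cosmetic: referring to Emp_id by name on the joined frame is ambiguous, and Spark rejects it at analysis time. A minimal illustration, assuming the session above (the exact message varies by Spark version):

>>> df2.select("Emp_id")
# raises pyspark.sql.utils.AnalysisException: Reference 'Emp_id' is ambiguous, could be: Emp_id, Emp_id.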

In your join condition, replace df.Emp_id == df1.Emp_id with ['Emp_id']. When the join key is given as a list of column names, Spark performs an equi-join on those names and keeps a single Emp_id column in the result:


df2 = df.join(df1, ['Emp_id'], how='inner')
df2.show()

#+------+--------+--------+
#|Emp_id|Emp_name|Dep_name|
#+------+--------+--------+
#|   123|    Raja|Computer|
#|   456|    Ravi| Economy|
#+------+--------+--------+
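
If you need to keep an expression-based join condition (for example, when the key columns have different names on each side), a common alternative, sketched here rather than taken from the original answer, is to drop the duplicate column after the join by referencing it through its source DataFrame:

# same join as in the question, then remove df1's copy of the key column
df2 = df.join(df1, df.Emp_id == df1.Emp_id, how='inner').drop(df1.Emp_id)
df2.show()

#+------+--------+--------+
#|Emp_id|Emp_name|Dep_name|
#+------+--------+--------+
#|   123|    Raja|Computer|
#|   456|    Ravi| Economy|
#+------+--------+--------+

Passing the Column object df1.Emp_id (rather than the string "Emp_id") tells drop() exactly which of the two identically named columns to remove.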