Python: How to resolve duplicate column names when joining two dataframes in PySpark?

Tags: python, apache-spark, pyspark, apache-spark-sql

I have two files, A and B, that are exactly identical. I am trying to perform inner and outer joins on these two dataframes. Since every column is duplicated, the existing answers were of no help. The other questions I came across have only one or two duplicate columns; my problem is that the files are complete duplicates of each other: both the data and the column names.

My code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import DataFrameReader, DataFrameWriter
from datetime import datetime

import time

# @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

print("All imports were successful.")

df = spark.read.orc(
    's3://****'
)
print("First dataframe read with headers set to True")
df2 = spark.read.orc(
    's3://****'
)
print("Second dataframe read with headers set to True")

# df3 = df.join(df2, ['c_0'], "outer")

# df3 = df.join(
#     df2,
#     df["column_test_1"] == df2["column_1"],
#     "outer"
# )

df3 = df.alias('l').join(df2.alias('r'), on='c_0') #.collect()

print("Dataframes have been joined successfully.")
output_file_path = 's3://****'

df3.write.orc(
    output_file_path
)
print("Dataframe has been written to csv.")
job.commit()
The error I am facing is:

pyspark.sql.utils.AnalysisException: u'Duplicate column(s): "c_4", "c_38", "c_13", "c_27", "c_50", "c_16", "c_23", "c_24", "c_1", "c_35", "c_30", "c_56", "c_34", "c_7", "c_46", "c_49", "c_57", "c_45", "c_31", "c_53", "c_19", "c_25", "c_10", "c_8", "c_14", "c_42", "c_20", "c_47", "c_36", "c_29", "c_15", "c_43", "c_32", "c_5", "c_37", "c_18", "c_54", "c_3", "__created_at__", "c_51", "c_48", "c_9", "c_21", "c_26", "c_44", "c_55", "c_2", "c_17", "c_40", "c_28", "c_33", "c_41", "c_22", "c_11", "c_12", "c_52", "c_6", "c_39" found, cannot save to file.;'
End of LogType:stdout

There is no shortcut here. PySpark expects the left and right dataframes to have distinct sets of field names (apart from the join key).

One solution would be to prefix each field name with either "left_" or "right_", as shown below:

# Obtain columns lists
left_cols = df.columns
right_cols = df2.columns

# Prefix each dataframe's field with "left_" or "right_"
df = df.selectExpr([col + ' as left_' + col for col in left_cols])
df2 = df2.selectExpr([col + ' as right_' + col for col in right_cols])

# Perform join on the renamed key columns (a plain 'c_0' no longer exists after prefixing)
df3 = df.join(df2, on=df['left_c_0'] == df2['right_c_0'])
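
The same bulk rename can also be done without selectExpr, for example by passing a full list of new names to toDF. A minimal sketch, assuming the df and df2 dataframes from the question and c_0 as the join key:

# Rename all columns at once by handing toDF the complete list of new names
df_left = df.toDF(*["left_" + c for c in df.columns])
df_right = df2.toDF(*["right_" + c for c in df2.columns])

# Join on the renamed key columns (outer join, as the question asks about)
df3 = df_left.join(df_right, df_left["left_c_0"] == df_right["right_c_0"], "outer")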

I did something similar, but in Scala; you can convert it to PySpark as well (a rough PySpark sketch follows after the Scala steps):

  • Rename the column names in each dataframe

    // dataFrame1 and dataFrame2 must be declared as var for the reassignment below
    dataFrame1.columns.foreach(columnName => {
      dataFrame1 = dataFrame1.withColumnRenamed(columnName, s"left_$columnName")
    })
    
    dataFrame2.columns.foreach(columnName => {
      dataFrame2 = dataFrame2.withColumnRenamed(columnName, s"right_$columnName")
    })
    
  • Now join by referencing the renamed column names

    val resultDF = dataFrame1.join(dataFrame2, dataFrame1("left_c_0") === dataFrame2("right_c_0"))
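
A rough PySpark translation of the Scala steps above, assuming the df and df2 dataframes from the question and c_0 as the join key:

# Prefix every column of each dataframe; the loop variable is named c
# rather than col to avoid shadowing pyspark.sql.functions.col
for c in df.columns:
    df = df.withColumnRenamed(c, "left_" + c)
for c in df2.columns:
    df2 = df2.withColumnRenamed(c, "right_" + c)

# Join on the renamed key columns
result_df = df.join(df2, df["left_c_0"] == df2["right_c_0"], "outer")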
    

Here is a helper function that joins two dataframes and aliases the right-hand columns:

def join_with_aliases(left, right, on, how, right_prefix):
    # Rename every right-hand column that is not a join key; the join keys
    # keep their original names so the join condition still matches
    renamed_right = right.selectExpr(
        [
            col + f" as {col}{right_prefix}"
            for col in right.columns
            if col not in on
        ]
        + on
    )
    return left.join(renamed_right, on=on, how=how)
Here is an example of how to use it:

df1 = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"]], ("id", "value"))
df2 = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"]], ("id", "value"))

join_with_aliases(
   left=df1,
   right=df2,
   on=["id"],
   how="inner",
   right_prefix="_right"
).show()

+---+-----+-----------+
| id|value|value_right|
+---+-----+-----------+
|  1|    a|          a|
|  3|    c|          c|
|  2|    b|          b|
+---+-----+-----------+

How do I explicitly select the columns? Do you mean
df.select('c_0' as 'df_c_0', 'c_1' as 'df_c_1', ... 'c_49' as 'df_c_49').join(df2.select('c_0' as 'df2_c_0', 'c_1' as 'df2_c_1', ... 'c_49' as 'df2_c_49'))
?

Possible duplicate; the answers there are the same: you need to alias the column names.

No, none of those answers solved my problem. Yes, it was my own shortcoming that I could not take the aliasing any further, but asking this question helped me learn about the selectExpr function. So I ask you to retract the duplicate comment; it will help newcomers like me.

My vote to close as a duplicate is just one vote. Regardless of the outcome, another four people (or a gold-badge holder) would still have to agree with me. The solution you are looking for is included there: at the bottom they show how to rename all the columns dynamically, and selectExpr is not needed (although it is an option). If you still feel this question is different, please ask and explain exactly how it differs.

It is also recommended not to use col as the variable in the for loop, because it shadows the native PySpark function col, which would then no longer be recognized.
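
To illustrate that last comment, here is a minimal sketch of the recommended pattern, assuming the df dataframe from the question: using a loop variable other than col keeps the imported pyspark.sql.functions.col function usable.

from pyspark.sql.functions import col

# Use c (not col) as the loop variable, so the imported col function is not shadowed
prefixed_df = df.select([col(c).alias("left_" + c) for c in df.columns])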