pyspark: AnalysisException when joining two DataFrames

Tags: pyspark, apache-spark-sql, spark-dataframe

I created two DataFrames using Spark SQL:

df1 = sqlContext.sql(""" ...""")
df2 = sqlContext.sql(""" ...""")
I am trying to join the two DataFrames on the my_id column, as shown below:

from pyspark.sql.functions import col

combined_df = df1.join(df2, col("df1.my_id") == col("df2.my_id"), 'inner')
Then I got the error below. Any idea what I am missing? Thanks!

AnalysisException                         Traceback (most recent call last)
<ipython-input-11-45f5313387cc> in <module>()
      3 from pyspark.sql.functions import col
      4 
----> 5 combined_df = df1.join(df2, col("df1.my_id") == col("df2.my_id"), 'inner')
      6 combined_df.take(10)

/usr/local/spark-latest/python/pyspark/sql/dataframe.py in join(self, other, on, how)
    770                 how = "inner"
    771             assert isinstance(how, basestring), "how should be basestring"
--> 772             jdf = self._jdf.join(other._jdf, on, how)
    773         return DataFrame(jdf, self.sql_ctx)
    774 

/usr/local/spark-latest/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/spark-latest/python/pyspark/sql/utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: "cannot resolve '`df1.my_id`' given input columns: [...

Not sure about pyspark, but if the field name is the same in both DataFrames, this should work:

combineDf = df1.join(df2, 'my_id', 'outer')

Hope this helps!
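
As an aside, here is a minimal sketch of that name-based join; the sample data and column names are made up purely for illustration. Passing the column name as a string also leaves a single my_id column in the result instead of two ambiguous ones:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data, only to demonstrate the behaviour.
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["my_id", "val1"])
df2 = spark.createDataFrame([(1, "x"), (3, "y")], ["my_id", "val2"])

# Joining on the column name string resolves my_id on both sides
# and keeps it once in the output.
combined_df = df1.join(df2, 'my_id', 'outer')
combined_df.show()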

I think the problem with your code is that you are trying to use "df1.my_id" as a column name, rather than just col('my_id'). That is why the error says cannot resolve 'df1.my_id' given input columns.

You can do this without importing col at all:

combined_df = df1.join(df2, df1.my_id == df2.my_id, 'inner')
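
If you do want to keep the qualified df1.my_id / df2.my_id style (for example when both sides share several column names), a common workaround, sketched here rather than taken from the answer above, is to alias each DataFrame and qualify the columns by those aliases (the alias names "a" and "b" are arbitrary):

from pyspark.sql.functions import col

# Give each DataFrame an alias so qualified column references can be resolved.
combined_df = df1.alias("a").join(df2.alias("b"), col("a.my_id") == col("b.my_id"), 'inner')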