Python: Unable to join two RDDs in pyspark


python apache-spark join pyspark


I have two DataFrames called df1 and df2, but when I try to join them, the join does not work. Below are the schema and sample output for each DataFrame.

df1
Out[160]: DataFrame[BibNum: string, CallNumber: string, CheckoutDateTime: string, ItemBarcode: string, ItemCollection: string, ItemType: string]

[Row(BibNum=u'BibNum', CallNumber=u'CallNumber', CheckoutDateTime=u'CheckoutDateTime', ItemBarcode=u'ItemBarcode', ItemCollection=u'ItemCollection', ItemType=u'ItemType'),
 Row(BibNum=u'1842225', CallNumber=u'MYSTERY ELKINS1999', CheckoutDateTime=u'05/23/2005 03:20:00 PM', ItemBarcode=u'10035249209', ItemCollection=u'namys', ItemType=u'acbk')]



df2    
DataFrame[Author: string, BibNum: string, FloatingItem: string, ISBN: string, ItemCollection: string, ItemCount: string, ItemLocation: string, ItemType: string, PublicationDate: string, Publisher: string, ReportDate: string, Subjects: string, Title: string]

[Row(Author=u'Author', BibNum=u'BibNum', FloatingItem=u'FloatingItem', ISBN=u'ISBN', ItemCollection=u'ItemCollection', ItemCount=u'ItemCount', ItemLocation=u'ItemLocation', ItemType=u'ItemType', PublicationDate=u'PublicationYear', Publisher=u'Publisher', ReportDate=u'ReportDate', Subjects=u'Subjects', Title=u'Title'),
 Row(Author=u"O'Ryan| Ellie", BibNum=u'3011076', FloatingItem=u'Floating', ISBN=u'1481425730| 1481425749| 9781481425735| 9781481425742', ItemCollection=u'ncrdr', ItemCount=u'1', ItemLocation=u'qna', ItemType=u'jcbk', PublicationDate=u'2014', Publisher=u'Simon Spotlight|', ReportDate=u'09/01/2017', Subjects=u'Musicians Fiction| Bullfighters Fiction| Best friends Fiction| Friendship Fiction| Adventure and adventurers Fiction', Title=u"A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield| Frederick Gardner| Megan Petasky| and Allen Tam.")]

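When I try to join the two with this command:

df3=df1.join(df2, df1.BibNum==df2.BibNum)

there is no error, but the resulting DataFrame looks like this, with duplicated column names: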

DataFrame[BibNum: string, CallNumber: string, CheckoutDateTime: string, ItemBarcode: string, ItemCollection: string, ItemType: string, Author: string, BibNum: string, FloatingItem: string, ISBN: string, ItemCollection: string, ItemCount: string, ItemLocation: string, ItemType: string, PublicationDate: string, Publisher: string, ReportDate: string, Subjects: string, Title: string]

Finally, after getting df3 (the joined DataFrame), calling df3.take(2) raises this error:

list index out of range

Ultimately, I want to find out which ItemLocation is available by computing the day with the most checkouts (CheckoutDateTime).
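You need to join the DataFrames on their common columns; otherwise the result contains two conflicting columns with the same name, one from each DataFrame.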
common_cols = [x for x in df1.columns if x in df2.columns]
df3 = df1.join(df2, on=common_cols, how='outer')
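
For the schemas in the question, the common columns are not only BibNum, so joining on all of them also requires ItemCollection and ItemType to match. If that is not what you want, one possible alternative is to join on BibNum alone and rename the other overlapping columns first. The sketch below is only an illustration under that assumption: `common_columns`, `join_on_key`, and the `_right` suffix are hypothetical names, not part of pyspark, and the column lists are copied from the question.

```python
# Column lists copied from the schemas shown in the question.
df1_cols = ["BibNum", "CallNumber", "CheckoutDateTime",
            "ItemBarcode", "ItemCollection", "ItemType"]
df2_cols = ["Author", "BibNum", "FloatingItem", "ISBN", "ItemCollection",
            "ItemCount", "ItemLocation", "ItemType", "PublicationDate",
            "Publisher", "ReportDate", "Subjects", "Title"]


def common_columns(left_cols, right_cols):
    """Return the column names present in both lists, in left-hand order."""
    return [c for c in left_cols if c in right_cols]


def join_on_key(left, right, key="BibNum", suffix="_right", how="inner"):
    """Hypothetical helper: join on one key and rename the other overlaps.

    `left` and `right` are assumed to be pyspark DataFrames.
    """
    for name in common_columns(left.columns, right.columns):
        if name != key:
            # Give the right-hand copy a distinct name to avoid ambiguity.
            right = right.withColumnRenamed(name, name + suffix)
    # Passing the key as a column name (not as an expression such as
    # df1.BibNum == df2.BibNum) keeps a single BibNum column in the result.
    return left.join(right, on=key, how=how)


print(common_columns(df1_cols, df2_cols))
# ['BibNum', 'ItemCollection', 'ItemType']
```

With real DataFrames this would be called as df3 = join_on_key(df1, df2), after which ItemCollection and ItemType from df2 appear as ItemCollection_right and ItemType_right.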

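You can use an outer join or a left join, depending on what you need. Please do not ask multiple questions about the same problem. You already have an active answer at:

Do you see any output when you run df3.show()? Your code works for me, so I don't think the error is in df3.take(2). It must be somewhere else.

df3.show() does not work for me either; it gives the same index error. Strangely, I can only see the DataFrame with the overlapping df1 and df2 columns.

Can you explain what you mean by "I can only see the DataFrame with the overlapping df1 and df2 columns" and update the question?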