Hadoop PySpark: handling NULLs in joins


I am trying to join 2 dataframes in PySpark. My problem is that I want my "inner join" to give NULLs a pass, i.e. to match rows even when the join keys are NULL. I can see that in Scala there is the <=> operator for this, but <=> does not work in PySpark.

from pyspark.sql import Row

userLeft = sc.parallelize([
Row(id=u'1', 
    first_name=u'Steve', 
    last_name=u'Kent', 
    email=u's.kent@email.com'),
Row(id=u'2', 
    first_name=u'Margaret', 
    last_name=u'Peace', 
    email=u'marge.peace@email.com'),
Row(id=u'3', 
    first_name=None, 
    last_name=u'hh', 
    email=u'marge.hh@email.com')]).toDF()

userRight = sc.parallelize([
Row(id=u'2', 
    first_name=u'Margaret', 
    last_name=u'Peace', 
    email=u'marge.peace@email.com'),
Row(id=u'3', 
    first_name=None, 
    last_name=u'hh', 
    email=u'marge.hh@email.com')]).toDF()
Currently working version:

userLeft.join(userRight, (userLeft.last_name == userRight.last_name) & (userLeft.first_name == userRight.first_name)).show()

Current result:

    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |               email|first_name| id|last_name|               email|first_name| id|last_name|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+
Expected result:

    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |               email|first_name| id|last_name|               email|first_name| id|last_name|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+
    |  marge.hh@email.com|      null|  3|       hh|  marge.hh@email.com|      null|  3|       hh|
    |marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
    +--------------------+----------+---+---------+--------------------+----------+---+---------+

Use another value instead of null:

userLeft = userLeft.na.fill("unknown")
userRight = userRight.na.fill("unknown")

userLeft.join(userRight, ["last_name", "first_name"]).show()

+---------+----------+--------------------+---+--------------------+---+
|last_name|first_name|               email| id|               email| id|
+---------+----------+--------------------+---+--------------------+---+
|    Peace|  Margaret|marge.peace@email...|  2|marge.peace@email...|  2|
|       hh|   unknown|  marge.hh@email.com|  3|  marge.hh@email.com|  3|
+---------+----------+--------------------+---+--------------------+---+

For PySpark < 2.3.0 you can still build the <=> operator with an expression column like this:

import pyspark.sql.functions as F
df1.alias("df1").join(df2.alias("df2"), on = F.expr('df1.column <=> df2.column'))


For PySpark >= 2.3.0 you can use Column.eqNullSafe, or IS NOT DISTINCT FROM, as described in the other answer.
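
A minimal sketch of the eqNullSafe variant, again assuming the userLeft and userRight frames from the question:

# Column.eqNullSafe (Spark >= 2.3.0) treats two NULLs as equal
userLeft.join(
    userRight,
    userLeft.last_name.eqNullSafe(userRight.last_name)
    & userLeft.first_name.eqNullSafe(userRight.first_name)
).show()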

I tried this approach. For string and date columns I was able to convert nulls to a distinguishable value, e.g. the string "NULLCUSTOM" and the date "8888-01-01". But I could not come up with such a value for integer or float columns. Do you have any ideas?
You could use float("inf") if the column is of type float or double. For a column of type int or long that does not work, since it is not actually infinity there; for the id column you could instead use a sentinel such as 9223372036854775807 (the maximum long value) or -1.
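
If you go the sentinel-filling route for numeric columns, na.fill also accepts a per-column dict, so different types can get different replacements; the column names and sentinel values below are only illustrative and must be values that cannot occur in the real data:

# a sketch with hypothetical column names and sentinels
df = df.na.fill({
    "first_name": "NULLCUSTOM",            # string sentinel from the comment above
    "some_int_col": -1,                    # sentinel for an int column
    "some_long_col": 9223372036854775807,  # maximum long, as mentioned above
})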