Pandas: How to compare pairs of columns in PySpark using a UDF?

I have a DataFrame like the following:

+---+---+---+
| t1| t2|t3 |
+---+---+---+
|0  |1  |0  |
+---+---+---+
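I want to compare each column with every other column.

For example, column t1 has the value 0 and column t2 has the value 1, so the combined value is 1.

A logical OR has to be applied to every pair of columns.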

My expected output looks like this:

+----+---+---+---+
|t123| t1|t2 | t3|
+----+---+---+---+
|t1  |0  |1  |0  |
|t2  |1  |0  |1  |
|t3  |0  |1  |0  |
+----+---+---+---+
Please help me with this.

Try this:

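import pandas as pd

cols = df.columns
n = len(cols)
# Repeat the single row n times and convert it to booleans.
df1 = pd.concat([df] * n, ignore_index=True).eq(1)
# Repeat the transposed row as n columns, so row k holds column k's value throughout.
df2 = pd.concat([df.T] * n, axis=1, ignore_index=True).eq(1)
df2.columns = cols
df2 = df2.reset_index(drop=True)
# Element-wise logical OR, converted back to 0/1.
print((df1 | df2).astype(int))
Explanation: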

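  • Build df1 as the boolean DataFrame needed, by repeating df once per column
  • Build df2 as the matching boolean DataFrame, using the transpose
  • Apply a logical OR across the two DataFrames
  • Output: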

       t1  t2  t3
    0   0   1   0
    1   1   1   1
    2   0   1   0
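
The three steps above amount to an outer logical OR of the row with itself. As a minimal sketch of that idea, and not the answer's own method, the same matrix can be built with NumPy-style broadcasting. The input `df` is rebuilt here from the question's single row, and the names `flags` and `result` are assumptions introduced only for illustration.

```python
import pandas as pd

# Single-row input taken from the question (assumed to be a pandas DataFrame).
df = pd.DataFrame({"t1": [0], "t2": [1], "t3": [0]})

# 1-D boolean array with one flag per column.
flags = df.iloc[0].to_numpy().astype(bool)

# Outer logical OR via broadcasting: cell (i, j) is flags[i] | flags[j].
pairwise_or = flags[:, None] | flags[None, :]

# Label rows and columns with the original column names.
result = pd.DataFrame(pairwise_or.astype(int), index=df.columns, columns=df.columns)
result.index.name = "t123"
print(result)
```

This gives the same values as the output above, with the rows labelled t1, t2, t3. Note that a plain OR puts 1 in the (t2, t2) cell, whereas the expected output in the question has 0 there.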
    

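For PySpark, you can create an empty DataFrame and then append rows to it in a loop over the columns. The following works for any number of columns, not just 3.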

    >>> import pyspark.sql.functions as F
    >>> 
    >>> df.show()
    +---+---+---+
    | t1| t2| t3|
    +---+---+---+
    |  0|  1|  0|
    +---+---+---+
    
    >>> df1 = spark.createDataFrame(sc.emptyRDD(), df.schema)
    >>> df1 = df1.select(F.lit('').alias('t123'), F.col('*'))
    >>> df1.show()
    +----+---+---+---+
    |t123| t1| t2| t3|
    +----+---+---+---+
    +----+---+---+---+
    
    >>> for x in df.columns: 
    ...     mydf = df.select([(F.when(df[i]+df[x]==1,1).otherwise(0)).alias(i) for i in df.columns])
    ...     df1 = df1.union(mydf.select(F.lit(x).alias('t123'), F.col('*')))
    ... 
    >>> df1.show()
    +----+---+---+---+
    |t123| t1| t2| t3|
    +----+---+---+---+
    |  t1|  0|  1|  0|
    |  t2|  1|  0|  1|
    |  t3|  0|  1|  0|
    +----+---+---+---+
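
The loop above sets a cell to 1 when the two values sum to exactly 1. For 0/1 data that is an exclusive OR rather than a plain OR, which is why the (t2, t2) cell is 0 here, as in the question's expected output. The following pure-Python sketch compares the two rules on the question's row. The helper `pairwise` and the variable names are hypothetical, introduced only for this illustration, and no Spark session is needed to run it.

```python
# The single input row from the question, as a plain dict.
row = {"t1": 0, "t2": 1, "t3": 0}


def pairwise(row, rule):
    """Return {row_label: {col_label: 0 or 1}} for every pair of columns."""
    return {x: {i: int(rule(row[i], row[x])) for i in row} for x in row}


# Rule used in the PySpark loop: 1 only when exactly one of the two values is 1.
sum_is_one = pairwise(row, lambda a, b: a + b == 1)

# Plain logical OR: 1 when at least one of the two values is 1.
logical_or = pairwise(row, lambda a, b: bool(a or b))

for label in row:
    print(label, list(sum_is_one[label].values()), list(logical_or[label].values()))
```

The two rules agree on every cell except where both values are 1, so which one to use depends on whether a column compared with itself should count.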