
Dataframe: How to find the set of columns with non-null values in PySpark

Tags: dataframe, apache-spark, pyspark, apache-spark-sql


I have a PySpark DataFrame with n columns (column_1, column_2, ..., column_n). I have to add one more column, col collections, with comma-separated values.

Condition: if two or more columns in a row have a value, populate the collections column with those column names, comma separated. Sample data for three columns is below:

----------------------------------------------------------------------
| column_1  | column_2 | column_3 |         col collections          |
----------------------------------------------------------------------
|     -     |     -    |     -    |                -                 |
----------------------------------------------------------------------
|     1     |     -    |     -    |                -                 |
----------------------------------------------------------------------
|     -     |     1    |     -    |                -                 |
----------------------------------------------------------------------
|     -     |     -    |     1    |                -                 |
----------------------------------------------------------------------
|     1     |     1    |     -    | column_1,column_2                |
----------------------------------------------------------------------
|     1     |     1    |     1    | column_1,column_2,column_3       |
----------------------------------------------------------------------
|     1     |     -    |     -    |                -                 |
----------------------------------------------------------------------
|     -     |     1    |     1    | column_2,column_3                |
----------------------------------------------------------------------
Here is a solution:

import pandas as pd
from pyspark.sql.functions import concat_ws, udf
from pyspark.sql.types import StringType

pandas_df = pd.DataFrame({
    'column_1': [None, '1', None, None, '1', '1', '1'],
    'column_2': [None, None, '1', None, '1', '1', None],
    'column_3': [None, None, None, '1', None, '1', None]
})
df = spark.createDataFrame(pandas_df)
df.show()
# +--------+--------+--------+
# |column_1|column_2|column_3|
# +--------+--------+--------+
# |    null|    null|    null|
# |       1|    null|    null|
# |    null|       1|    null|
# |    null|    null|       1|
# |       1|       1|    null|
# |       1|       1|       1|
# |       1|    null|    null|
# +--------+--------+--------+

# Returns a UDF that maps a non-null value to the given column name,
# and null to null.
def non_null_to_name(name):
    return udf(lambda value: None if value is None else name, StringType())

# Keeps the joined string only if it names at least two columns,
# i.e. contains a comma; otherwise returns null.
at_least_two_udf = udf(lambda s: None if (s is None) or (',' not in s) else s,
                       StringType())

# Replace each column's value with its name (or null) ...
cols = []
for name in df.columns:
    f = non_null_to_name(name)
    cols += [f(df[name])]

# ... then join the names (concat_ws skips nulls) and apply the two-name filter.
df = df.withColumn('collection', at_least_two_udf(concat_ws(',', *cols)))
df.show()
# +--------+--------+--------+--------------------+
# |column_1|column_2|column_3|          collection|
# +--------+--------+--------+--------------------+
# |    null|    null|    null|                null|
# |       1|    null|    null|                null|
# |    null|       1|    null|                null|
# |    null|    null|       1|                null|
# |       1|       1|    null|   column_1,column_2|
# |       1|       1|       1|column_1,column_2...|
# |       1|    null|    null|                null|
# +--------+--------+--------+--------------------+
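
The same result is possible without Python UDFs, keeping the work in Spark's built-in functions and avoiding Python serialization overhead. Below is a minimal sketch of that alternative, assuming it is applied to the original three-column df (before the 'collection' column is added); when, lit, concat_ws, and instr are all standard pyspark.sql.functions.

from pyspark.sql import functions as F

# For each column, emit its name when the value is non-null, else null
# (when() without otherwise() yields null).
name_cols = [F.when(F.col(c).isNotNull(), F.lit(c)) for c in df.columns]

# concat_ws skips nulls, so this joins only the names of non-null columns;
# an all-null row yields the empty string.
joined = F.concat_ws(',', *name_cols)

# Keep the result only when it contains a comma, i.e. at least two names.
df = df.withColumn('collection', F.when(F.instr(joined, ',') > 0, joined))
df.show(truncate=False)

Note that show() truncates cell values to 20 characters by default, which is why column_1,column_2,column_3 appears cut off in the output above; show(truncate=False) prints the full strings.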

What have you tried so far?