
Apache Spark: get the distinct count from the array in each row using pyspark


I am using a pyspark dataframe and want to find the distinct count of the values in the array in each row:

Input:

col1
[1,1,1]
[3,4,5]
[1,2,1,2]

Expected output:
1
3
2  

I used the code below, but it gives me the length of the array instead:
output:
3
3
4

Please help me: how do I achieve this using a PySpark dataframe in Python?

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())  # returns the array length, not the distinct count
count = df.withColumn("Count", slen(df.col1))
count.show()

Thanks in advance!

For Spark 2.4+, you can use array_distinct and then simply take its size to get the count of distinct values in the array. For big data, a UDF will be very slow and inefficient; always try to use Spark's built-in functions first.

(Welcome to SO.)
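As a minimal, self-contained sketch (assuming a SparkSession is available; the data simply mirrors the question), the example DataFrame and the functions alias F used below can be set up like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Rebuild the example DataFrame from the question: one array column "col1"
df = spark.createDataFrame(
    [([1, 1, 1],), ([3, 4, 5],), ([1, 2, 1, 2],)],
    ["col1"],
)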

df.show()

+------------+
|        col1|
+------------+
|   [1, 1, 1]|
|   [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+

df.withColumn("count", F.size(F.array_distinct("col1"))).show()

+------------+-----+
|        col1|count|
+------------+-----+
|   [1, 1, 1]|    1|
|   [3, 4, 5]|    3|
|[1, 2, 1, 2]|    2|
+------------+-----+
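If you are stuck on a Spark version earlier than 2.4 (where array_distinct is not available), one possible fallback is a small UDF that counts the distinct elements per row directly; as noted above, this will be slower than the built-in functions, so treat it only as a sketch for older versions:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Count distinct elements per row by converting the array to a set (UDF fallback)
distinct_len = udf(lambda s: len(set(s)) if s is not None else None, IntegerType())
df.withColumn("count", distinct_len("col1")).show()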