
How to create an array column containing n elements in PySpark


I have a dataframe with one column of type integer.

I want to create a new column holding an array with n elements (n being the value in the first column).

For example:

from pyspark.sql.types import StructType, StructField, IntegerType

x = spark.createDataFrame([(1,), (2,), (3,)], StructType([StructField("myInt", IntegerType(), True)]))

+-----+
|myInt|
+-----+
|    1|
|    2|
|    3|
+-----+
The resulting dataframe I need looks like this:

+-----+---------+
|myInt|    myArr|
+-----+---------+
|    1|      [1]|
|    2|   [2, 2]|
|    3|[3, 3, 3]|
+-----+---------+
Note that the values in the array don't actually matter; only the count does.

So it would also be fine if the resulting dataframe looked like this:

+-----+------------------+
|myInt|             myArr|
+-----+------------------+
|    1|            [item]|
|    2|      [item, item]|
|    3|[item, item, item]|
+-----+------------------+
Using a udf:

from pyspark.sql.functions import udf

@udf("array<int>")
def rep_(x):
    # Build a list that repeats the value x, x times
    return [x for _ in range(x)]

x.withColumn("myArr", rep_("myInt")).show()
# +-----+---------+
# |myInt|    myArr|
# +-----+---------+
# |    1|      [1]|
# |    2|   [2, 2]|
# |    3|[3, 3, 3]|
# +-----+---------+
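One caveat: the udf above raises a TypeError if myInt is null, because range(None) is invalid. A minimal null-safe variant might look like the sketch below (propagating None for null inputs is an assumption on my part, not part of the original answer):

from pyspark.sql.functions import udf

@udf("array<int>")
def rep_safe(x):
    # Propagate nulls instead of raising (a hypothetical design choice)
    if x is None:
        return None
    return [x for _ in range(x)]

x.withColumn("myArr", rep_safe("myInt")).show()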

If possible, it is better to avoid UDFs, as they are less efficient. You can use array_repeat instead:

import pyspark.sql.functions as F

# Repeat the value in myInt, myInt times
x.withColumn('myArr', F.array_repeat(F.col('myInt'), F.col('myInt'))).show()

+-----+---------+
|myInt|    myArr|
+-----+---------+
|    1|      [1]|
|    2|   [2, 2]|
|    3|[3, 3, 3]|
+-----+---------+

Note that I had some issues with this on Spark 2.4.4, but it works fine on Spark 3.0.1.
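If you are stuck on a Spark 2.4.x release where passing a Column as the count argument to F.array_repeat misbehaves, one possible workaround (a sketch, not verified across all 2.4.x versions) is to call the SQL function through F.expr, so the count is resolved on the SQL side:

import pyspark.sql.functions as F

# SQL expression form of array_repeat; the second argument is a column
# reference handled by the SQL parser rather than the Python API
x.withColumn('myArr', F.expr('array_repeat(myInt, myInt)')).show()

And since the question notes that the array contents don't matter, F.sequence(F.lit(1), F.col('myInt')) is another built-in (available since Spark 2.4.0) that yields an array of length myInt, namely [1, 2, ..., n].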