Python: Create a dictionary/map-type column from an array column in PySpark, so that the key is the same for all array elements

I have a Spark dataframe as shown below. The dataframe's schema is as follows:

|-- array_list: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- len_of_array: integer (nullable = false)

Dataframe:

+---------------+--------------+
|array_list     |len_of_array  |
+---------------+--------------+
|[t1, t2, t3]   |3             |
|[t1, t2]       |2             |
|[t2]           |1             |
+---------------+--------------+
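(For reference, a minimal sketch that reproduces this sample dataframe; the column names are those from the question, and a SparkSession named spark is assumed:)

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data matching the dataframe shown above
df = spark.createDataFrame(
    [(["t1", "t2", "t3"],), (["t1", "t2"],), (["t2"],)],
    ["array_list"],
)
df = df.withColumn("len_of_array", F.size("array_list"))
df.show(truncate=False)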
How can we create a new column "mappings" in which a constant key "mapname" is mapped to each element of the array in column "array_list"? The expected output is:

+---------------+--------------+----------------------------------------------------+
|array_list     |len_of_array  |mappings                                            |
+---------------+--------------+----------------------------------------------------+
|[t1, t2, t3]   |3             |[[mapname -> t1], [mapname -> t2], [mapname -> t3]] |
|[t1, t2]       |2             |[[mapname -> t1], [mapname -> t2]]                  |
|[t2]           |1             |[[mapname -> t2]]                                   |
+---------------+--------------+----------------------------------------------------+
mapname (i.e., the key) is the string "mapname" and should be the same for every array element.


I created an extra column, "increasing_id", with the values 1, 2, 3, then tried to define a UDF and use it in a for loop to update each row of the mappings column, but it raises an error.

from pyspark.sql.functions import udf
from pyspark.sql import types as T

# UDF that wraps a single value in a one-entry map
# (note: the key here is "name", while the expected output above uses "mapname")
@udf(T.MapType(T.StringType(), T.StringType()))
def create_dict(name_value):
    return {"name": name_value}

Then, finally, populate the column values in a for loop:

from pyspark.sql.functions import array

for j in range(3):
    result = df2.where(df2.increasing_id == j).select("len_of_array")
    result = result.collect()[0][0]
    # df is filtered and reassigned on every pass
    df = df.where(df.increasing_id == j)\
           .withColumn('mappings',
                       array([create_dict(df.array_list.getItem(i)) for i in range(0, result)]))

Error:
An error was encountered:
list index out of range
Traceback (most recent call last):
IndexError: list index out of range
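(A plausible cause of this IndexError, inferred from the posted code rather than stated in the thread: increasing_id takes the values 1, 2, 3, while range(3) yields 0, 1, 2, so the j = 0 pass matches no rows and result.collect()[0][0] fails on an empty list. Separately, df is reassigned to a filtered subset of itself on every pass, which would leave it empty by the end, matching the behaviour described in the comments below. A minimal sketch with the off-by-one corrected, the reassignment issue aside:)

for j in range(1, 4):  # iterate over the actual increasing_id values 1, 2, 3
    result = df2.where(df2.increasing_id == j).select("len_of_array").collect()[0][0]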
The expected schema looks like this:

|-- mappings: array (nullable = false)
|    |-- element: map (containsNull = true)
|    |    |-- key: string
|    |    |-- value: string (valueContainsNull = true)
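
(For what it's worth, the loop and the per-element UDF can be avoided entirely: a higher-order function can build the map for every element in one pass. A minimal sketch, offered as an editorial suggestion rather than taken from the thread; transform with a SQL lambda requires Spark 2.4+:)

from pyspark.sql import functions as F

# Wrap every array element in a one-entry map with the constant key "mapname"
df = df.withColumn(
    "mappings",
    F.expr("transform(array_list, x -> map('mapname', x))"),
)
df.printSchema()  # mappings: array<map<string,string>>

(On Spark 3.1+, the same thing can be written without a SQL string as F.transform("array_list", lambda x: F.create_map(F.lit("mapname"), x)).)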

Comments:

Duplicate key names are not allowed in a map-type column. If there were duplicate key values, which value would you pick for a given key?

I want the key to be the same for all elements; when extracting, all the values will be used, since there is no need to look up any specific key.

That is not possible in Spark; map-type columns are not designed for this purpose.

I created an extra column, "increasing_id", indexed 1, 2, 3, then tried to define a UDF and use it in a for loop to update each row of the mappings column, but the final dataframe comes out empty. Attaching the code below:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

@udf(T.MapType(T.StringType(), T.StringType()))
def create_dict(name_value):
    return {"name": name_value}
And then, finally, populating the column values with a for loop:
for j in range(3):
    result = df2.where(df2.increasing_id == j).select("len_of_array")
    result = result.collect()[0][0]
    df = df.where(df.increasing_id == j)\
           .withColumn('mappings',
                       array([create_dict(df.array_list.getItem(i)) for i in range(0, result)]))
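
(If a UDF is preferred over the SQL transform shown earlier, the loop can also be replaced by a single UDF that maps the whole array at once. A sketch, again editorial rather than from the thread:)

from pyspark.sql.functions import udf
from pyspark.sql import types as T

# One UDF call per row: wrap each element of the array in a one-entry map
@udf(T.ArrayType(T.MapType(T.StringType(), T.StringType())))
def create_dicts(arr):
    return [{"mapname": x} for x in (arr or [])]

df = df.withColumn("mappings", create_dicts("array_list"))

(This avoids both collect() and the repeated filtering of df, so the final dataframe keeps all its rows.)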