Want to efficiently derive an array from an array of dictionaries in PySpark

Tags: python, pyspark, apache-spark-sql, pyspark-dataframes

I have a DataFrame that looks like this:

   +-------+--------------------------------+
   | Index | flagArray                      |
   +-------+--------------------------------+
   | 1     | [{start :1 , end :2, flag :A}, |
   |       |  {start :3 , end :5, flag :A}] |
   +-------+--------------------------------+
   | 2     | [{start :1 , end :5, flag :A}] |
   +-------+--------------------------------+
I want to derive a new column, flag2, from the start, end, and flag fields of flagArray:

   +-------+--------------------------------+-------------+
   | Index | flagArray                      | flag2       |
   +-------+--------------------------------+-------------+
   | 1     | [{start :1 , end :2, flag :A}, | [A,A,S,S,S] |
   |       |  {start :3 , end :5, flag :A}] |             |
   +-------+--------------------------------+-------------+
   | 2     | [{start :1 , end :5, flag :A}] | [A,A,A,A,A] |
   +-------+--------------------------------+-------------+
I have working code, but I would like a more efficient approach, because I will have millions of rows and about 300 elements in each flagArray:

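     from pyspark.sql.functions import udf
     from pyspark.sql.types import ArrayType, StringType

     @udf(ArrayType(StringType()))
     def set_flag(startIndex, endIndex, flag):
         row = []
         for i in range(len(startIndex)):
             start = int(startIndex[i])
             end = int(endIndex[i]) + 1  # end is inclusive, so add 1 for range()
             derFlag = flag[i]
             # Append the flag once for every position in [start, end]
             for _ in range(start, end):
                 row.append(derFlag)
         return row

     # Pull the struct fields out as parallel array columns, then apply the UDF
     df = df.select("*", "flagArray.start", "flagArray.end", "flagArray.flag")
     df = df.withColumn("flag2", set_flag(df.start, df.end, df.flag)).drop("start", "end", "flag")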

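Where do df.start, df.end, and df.flag come from?

Sorry, I left out a line of code. I have updated the question. Thanks.

In index 1, the second flag in the output is S, but in flagArray it is A? So if there are two entries, the second one becomes S?

Hi Mohammad, it is based entirely on the start and end fields of flagArray. I check the flag of each dictionary in flagArray and keep appending it to the output array. For index 1, dictionary 1 has start 1, end 2, and flag A, and dictionary 2 has start 3, end 5, and flag S, so my output array is [A,A,S,S,S]. For index 2, I have only one dictionary, with start 1, end 5, and flag A, so my output array is [A,A,A,A,A].

If your Spark version is 2.4+ and the column flagArray is an array of structs, you can try this:

     df.selectExpr("*", "flatten(transform(flagArray, x -> array_repeat(x.flag, x.end - x.start + 1))) as flag2").show()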
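To make the logic of that expression easier to follow, here is a minimal plain-Python sketch of what it computes for a single row. It is only an illustration, not Spark code: the helper name `expand_flags` and the sample `rows` dictionary are hypothetical, and the second entry of index 1 uses flag S, as described in the comments above.

```python
from typing import Dict, List, Union

Interval = Dict[str, Union[int, str]]


def expand_flags(flag_array: List[Interval]) -> List[str]:
    """Repeat each flag once per position in its inclusive [start, end] range."""
    # Mirrors transform + array_repeat: one list of repeated flags per struct.
    repeated = [
        [item["flag"]] * (int(item["end"]) - int(item["start"]) + 1)
        for item in flag_array
    ]
    # Mirrors flatten: concatenate the per-struct lists in their original order.
    return [flag for chunk in repeated for flag in chunk]


# Hypothetical sample data modelled on the two rows in the question.
rows = {
    1: [{"start": 1, "end": 2, "flag": "A"}, {"start": 3, "end": 5, "flag": "S"}],
    2: [{"start": 1, "end": 5, "flag": "A"}],
}

for index, flag_array in rows.items():
    print(index, expand_flags(flag_array))
```

Like the UDF in the question, this only uses the length of each range, so any gap between one entry's end and the next entry's start is not padded. The Spark SQL version is likely to be faster than the UDF because it runs inside the JVM and avoids serializing every row to a Python worker.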