PySpark: append lists to Spark DataFrame columns based on matching keys from a second DataFrame


I have two Spark DataFrames with the same column names. When the key columns match, I want to extend some of the list columns in the first df with the lists from the same columns in df2.

df1:
+---+---+------+--------+-----+-------+-------+-------+
|k1 |k2 |list1 |list2   |list3|list4  |list5  |list6  |
+---+---+------+--------+-----+-------+-------+-------+
|a  |121|[car1]|[price1]|[1]  |[False]|[0.000]|[vfdvf]|
|b  |11 |[car3]|[price3]|[2]  |[False]|[1.000]|[00000]|
|c  |23 |[car3]|[price3]|[4]  |[False]|[2.500]|[fdabh]|
|d  |250|[car6]|[price6]|[6]  |[True] |[0.450]|[00000]|
+---+---+------+--------+-----+-------+-------+-------+


df2:
+---+---+------+--------+-----+-------+-------+-------+
|k1 |k2 |list1 |list2   |list3|list4  |list5  |list6  |
+---+---+------+--------+-----+-------+-------+-------+
|m  |121|[car5]|[price5]|[5]  |[False]|[3.000]|[vfdvf]|
|b  |11 |[car8]|[price8]|[8]  |[False]|[2.000]|[mnfaf]|
|c  |23 |[car7]|[price7]|[7]  |[False]|[1.500]|[00000]|
|n  |250|[car9]|[price9]|[9]  |[False]|[0.450]|[00000]|
+---+---+------+--------+-----+-------+-------+-------+
Since the columns containing the item lists are related to each other, the order must be preserved. Is it possible to append the entire lists from df2 to df1 only when key1 and key2 match between the two dfs?

The result should look like this (I could not fit in the list6 column, but I would like to see it in the result, following the same pattern as the other list columns):


I am not yet familiar with using UDFs and could not find a similar question on Stack Overflow; the only similar question I found uses pandas, which is far too slow for my use case. Any insight on this would be greatly appreciated.

First, you need to create your own schema, as I have done below, and then your code will work; please use my updated code.

Try this: you don't need a UDF. First do an inner join, and then concatenate the columns:

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType, FloatType)

# get (or create) the active SparkSession
spark = SparkSession.builder.getOrCreate()

table_schema = StructType([StructField('key1', StringType(), True),
                     StructField('key2', IntegerType(), True),
                     StructField('list1', ArrayType(StringType()), False),
                     StructField('list2', ArrayType(StringType()), False),
                     StructField('list3', ArrayType(IntegerType()), False),
                     StructField('list4', StringType(), False),
                     StructField('list5', ArrayType(FloatType()), False),
                     StructField('list6', ArrayType(StringType()), False)
                     ])
df = spark.createDataFrame(
    [
        ("a", 121, ["car1"], ["price1"], [1], ["False"], [0.000], ["vfdvf"]),
        ("b", 11,  ["car3"], ["price3"], [2], ["False"], [1.000], [00000]),
        ("c", 23,  ["car3"], ["price3"], [4], ["False"], [2.500], ["fdabh"]),
        ("d", 250, ["car6"], ["price6"], [6], ["True"],  [0.450], [00000])
    ], table_schema
)

df2 = spark.createDataFrame(
    [
        ("m", 121, ["car5"], ["price5"], [5], ["False"], [3.000], ["vfdvf"]),
        ("b", 11,  ["car8"], ["price8"], [8], ["False"], [2.000], ["mnfaf"]),
        ("c", 23,  ["car7"], ["price7"], [7], ["False"], [1.500], [00000]),
        ("n", 250, ["car9"], ["price9"], [9], ["False"], [0.450], [00000])
    ], table_schema
)
df.createOrReplaceTempView("A")
df2.createOrReplaceTempView("B")
spark.sql("select a.key1,a.key2,concat(a.list1,b.list1)List1 ,concat(a.list2,b.list2)List2, \
concat(a.list3,b.list3)List3 ,concat(a.list4,b.list4)List4,\
          concat(a.list5,b.list5)List5 ,\
          concat(a.list6,b.list6)List6 \
from A a inner join B  b on a.key1=b.key1 order by a.key1").show(truncate=False)

+----+----+------------+----------------+------+--------------+----------+----------+
|key1|key2|List1       |List2           |List3 |List4         |List5     |List6     |
+----+----+------------+----------------+------+--------------+----------+----------+
|b   |11  |[car3, car8]|[price3, price8]|[2, 8]|[False][False]|[1.0, 2.0]|[0, mnfaf]|
|c   |23  |[car3, car7]|[price3, price7]|[4, 7]|[False][False]|[2.5, 1.5]|[fdabh, 0]|
+----+----+------------+----------------+------+--------------+----------+----------+
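
For reference, the same inner join plus per-column concatenation can also be written with the DataFrame API instead of a temp view and SQL. This is only a minimal sketch, assuming Spark 2.4 or later (where concat also accepts array columns) and the df/df2 frames built above; it additionally matches on both key columns, as the question asked:

    from pyspark.sql import functions as F

    # inner join on both keys, then concatenate each pair of list columns
    list_cols = ["list1", "list2", "list3", "list4", "list5", "list6"]
    joined = df.alias("a").join(df2.alias("b"), on=["key1", "key2"], how="inner")
    result = joined.select(
        "key1",
        "key2",
        *[F.concat(F.col("a." + c), F.col("b." + c)).alias(c) for c in list_cols]
    )
    result.show(truncate=False)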

I found the answer to my question and want to post it here to share it with others who may face the same problem, and for my own future reference.

    from pyspark.sql.types import (BooleanType, StringType, DoubleType,
                                   IntegerType, ArrayType, LongType, ByteType)

    # Generic list concatenation; the element type is fixed by the return
    # type supplied when the function is registered.
    def concatTypesFunc(array1, array2):
        final_array = array1 + array2
        return final_array

    # Register one SQL function per element type. Each return type needs its
    # own name, because re-registering the same name keeps only the last
    # registration.
    spark.udf.register("concat_types_bool", concatTypesFunc, ArrayType(BooleanType()))
    spark.udf.register("concat_types_str", concatTypesFunc, ArrayType(StringType()))
    spark.udf.register("concat_types_double", concatTypesFunc, ArrayType(DoubleType()))
    spark.udf.register("concat_types_int", concatTypesFunc, ArrayType(IntegerType()))
    spark.udf.register("concat_types_long", concatTypesFunc, ArrayType(LongType()))
    spark.udf.register("concat_types_byte", concatTypesFunc, ArrayType(ByteType()))

    # table_schema is the same schema defined in the answer above
    df = spark.createDataFrame(
        [
            ("a", 121, ["car1"], ["price1"], [1], ["False"], [0.000], ["vfdvf"]),
            ("b", 11,  ["car3"], ["price3"], [2], ["False"], [1.000], [00000]),
            ("c", 23,  ["car3"], ["price3"], [4], ["False"], [2.500], ["fdabh"]),
            ("d", 250, ["car6"], ["price6"], [6], ["True"],  [0.450], [00000])
        ], table_schema
    )

    df2 = spark.createDataFrame(
        [
            ("m", 121, ["car5"], ["price5"], [5], ["False"], [3.000], ["vfdvf"]),
            ("b", 11,  ["car8"], ["price8"], [8], ["False"], [2.000], ["mnfaf"]),
            ("c", 23,  ["car7"], ["price7"], [7], ["False"], [1.500], [00000]),
            ("n", 250, ["car9"], ["price9"], [9], ["False"], [0.450], [00000])
        ], table_schema
    )
    df.createOrReplaceTempView("a")
    df2.createOrReplaceTempView("b")
    spark.sql("select a.key1, a.key2, concat_types(a.list1,b.list1)List1 ,concat_types(a.list2,b.list2)List2, \
    concat_types(a.list3,b.list3)List3 ,concat_types(a.list4,b.list4)List4,\
              concat_types(a.list5,b.list5)List5 ,\
              concat_types(a.list6,b.list6)List6 \
    from a inner join b on a.key1=b.key1 order by a.key1").show(truncate=False)

Thanks for sharing this solution. I forgot to mention that my real data is much more complex: each of these list columns can have a different data type (string, integer, array, long, ByteType, DoubleType and BooleanType). I tried this solution and got a type-mismatch error (the argument requires string type, but column2 is of array type). Is there any way around this?

Please post the lists and arrays with the different types and I will change it accordingly.

Thanks Addy, I just edited the question with the data types I expect to see in each column of the real data.

Please check my updated code; I have created my own schema, so you won't get that error anymore.

The solution still doesn't work. I just noticed that concat in Spark SQL is only valid for string types, so I need to make the appropriate changes using a UDF. I will post the solution once I have it, for others' reference and my own future use. Thanks for sharing your ideas anyway.
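
One note on the concat remark in the comments above: in Spark 2.4 and later, the SQL concat function accepts array arguments as well as strings, so depending on the Spark version a UDF may not be strictly required for the array columns. A minimal check, assuming a running spark session:

    # concat on arrays (Spark 2.4+); on older versions this only works for strings
    spark.sql("select concat(array(1, 2), array(3, 4)) as merged").show()
    # -> [1, 2, 3, 4]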