Pyspark: pass the distinct values of one dataframe to another dataframe

I want to take the distinct values of a column from dataframe A and pass them into an explode on dataframe B, so that a duplicate row is created in dataframe B for each distinct value:

distinctSet = targetDf.select('utilityId').distinct()

utilisationFrequencyTable = utilisationFrequencyTable.withColumn("utilityId", psf.explode(assign_utilityId()))
The function:

assign_utilityId = psf.udf(
    lambda id: [x for x in id],
    ArrayType(LongType()))
How do I pass the distinctSet values to assign_utilityId?
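
For context, explode expects a column expression, so the distinct values have to either be collected to the driver and rebuilt as an array of literals, or joined in as a second dataframe (both routes are shown in the answers below). A minimal sketch of the collect-first route, assuming targetDf and utilisationFrequencyTable are defined as above:

from pyspark.sql import functions as psf

# Collect the distinct utilityId values to the driver as a plain Python list
distinct_ids = [row['utilityId'] for row in targetDf.select('utilityId').distinct().collect()]

# Rebuild them as an array column of literals and explode: one row per distinct value
utilisationFrequencyTable = utilisationFrequencyTable.withColumn(
    'utilityId',
    psf.explode(psf.array([psf.lit(i) for i in distinct_ids])))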

Update

I want to take the unique values from Dataframe 1 and create a new column in Dataframe 2, like this:

+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
|    0|   SUN|       0|      101|
|    0|   SUN|       1|      101|
|    0|   SUN|       0|      102|
|    0|   SUN|       1|      102|
+-----+------+--------+---------+

You don't need a udf for this. I have tried it with some sample inputs, please check:

>>> from pyspark.sql import functions as F

>>> df = spark.createDataFrame([(1,),(2,),(3,),(2,),(3,)],['col1'])
>>> df.show()
+----+
|col1|
+----+
|   1|
|   2|
|   3|
|   2|
|   3|
+----+

>>> df1 = spark.createDataFrame([(1,2),(2,3),(3,4)],['col1','col2'])
>>> df1.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2|   3|
|   3|   4|
+----+----+

>>> dist_val = df.select(F.collect_set('col1').alias('val')).first()['val']
>>> dist_val
[1, 2, 3]

>>> df1 = df1.withColumn('col3',F.array([F.lit(x) for x in dist_val]))
>>> df1.show()
+----+----+---------+
|col1|col2|     col3|
+----+----+---------+
|   1|   2|[1, 2, 3]|
|   2|   3|[1, 2, 3]|
|   3|   4|[1, 2, 3]|
+----+----+---------+
>>> df1.select("*",F.explode('col3').alias('expl_col')).drop('col3').show()
+----+----+--------+
|col1|col2|expl_col|
+----+----+--------+
|   1|   2|       1|
|   1|   2|       2|
|   1|   2|       3|
|   2|   3|       1|
|   2|   3|       2|
|   2|   3|       3|
|   3|   4|       1|
|   3|   4|       2|
|   3|   4|       3|
+----+----+--------+
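
The F.array([F.lit(x) for x in dist_val]) step is the key part: explode operates on a Column, so the Python list collected on the driver has to be rebuilt as a literal array column before it can be exploded. A minimal illustration of just that step, reusing df1 and dist_val from above:

>>> # explode needs a Column, so wrap each collected value in F.lit and combine with F.array
>>> arr_col = F.array([F.lit(x) for x in dist_val])
>>> df1.select('col1', 'col2', F.explode(arr_col).alias('expl_col')).show()

This yields the same result as the withColumn/explode/drop version above.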

Can you show some sample data? Thanks!! It worked!! I had missed this part: "F.array([F.lit(x) for x in dist_val])"
Alternatively, build the distinct set as its own dataframe and join it onto the target dataframe with no join condition (a cartesian product):

# Dataframe A holds the utilityId values, dataframe B holds the rows to duplicate
df = sqlContext.createDataFrame(sc.parallelize([(101,),(101,),(102,)]),['utilityId'])
df2 = sqlContext.createDataFrame(sc.parallelize([(0,'SUN',0),(0,'SUN',1)]),['index','status','timeSlot'])
# Keep only the distinct utilityId values
rdf = df.distinct()

# Join with no condition: every df2 row is paired with every distinct utilityId
>>> df2.join(rdf).show()
+-----+------+--------+---------+                                               
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
|    0|   SUN|       0|      101|
|    0|   SUN|       0|      102|
|    0|   SUN|       1|      101|
|    0|   SUN|       1|      102|
+-----+------+--------+---------+
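
Note that df2.join(rdf) with no join condition is a cartesian product. Depending on the Spark version it can be clearer (and on Spark 2.x may be required, unless spark.sql.crossJoin.enabled is set) to spell this out with the explicit crossJoin method, available since Spark 2.1:

>>> # Explicit cartesian product, equivalent to df2.join(rdf) with no condition
>>> df2.crossJoin(rdf).show()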