Pyspark 将一个数据帧的不同值传递到另一个数据帧
我想从数据帧A中获取列的不同值,并将其传递到数据帧B的分解中 函数为每个不同的值创建重复行(DataFrameB)Pyspark 将一个数据帧的不同值传递到另一个数据帧,pyspark,Pyspark,我想从数据帧A中获取列的不同值,并将其传递到数据帧B的分解中 函数为每个不同的值创建重复行(DataFrameB) distinctSet = targetDf.select('utilityId').distinct()) utilisationFrequencyTable = utilisationFrequencyTable.withColumn("utilityId", psf.explode(assign_utilityId())) 作用 assign_utilityId = ps
distinctSet = targetDf.select('utilityId').distinct())
utilisationFrequencyTable = utilisationFrequencyTable.withColumn("utilityId", psf.explode(assign_utilityId()))
作用
assign_utilityId = psf.udf(
lambda id: [x for x in id],
ArrayType(LongType()))
如何将distinctSet
值传递给assign\u utilityId
更新
我想从Dataframe 1中获取唯一值,并在Dataframe 2中创建新列。像这样
+-----+------+--------+--------+
|index|status|timeSlot|utilityId
+-----+------+--------+--------+
| 0| SUN| 0|101
| 0| SUN| 1|101
| 0| SUN| 0|102
| 0| SUN| 1|102
我们不需要一个udf。我已经尝试了一些输入,请检查
>>> from pyspark.sql import function as F
>>> df = spark.createDataFrame([(1,),(2,),(3,),(2,),(3,)],['col1'])
>>> df.show()
+----+
|col1|
+----+
| 1|
| 2|
| 3|
| 2|
| 3|
+----+
>>> df1 = spark.createDataFrame([(1,2),(2,3),(3,4)],['col1','col2'])
>>> df1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 2| 3|
| 3| 4|
+----+----+
>>> dist_val = df.select(F.collect_set('col1').alias('val')).first()['val']
>>> dist_val
[1, 2, 3]
>>> df1 = df1.withColumn('col3',F.array([F.lit(x) for x in dist_val]))
>>> df1.show()
+----+----+---------+
|col1|col2| col3|
+----+----+---------+
| 1| 2|[1, 2, 3]|
| 2| 3|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+----+----+---------+
>>> df1.select("*",F.explode('col3').alias('expl_col')).drop('col3').show()
+----+----+--------+
|col1|col2|expl_col|
+----+----+--------+
| 1| 2| 1|
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 1|
| 2| 3| 2|
| 2| 3| 3|
| 3| 4| 1|
| 3| 4| 2|
| 3| 4| 3|
+----+----+--------+
你能展示一些样本数据吗?谢谢!!它起作用了!!我错过了这一部分“F.array([F.lit(x)表示距离中的x])”
>>> from pyspark.sql import function as F
>>> df = spark.createDataFrame([(1,),(2,),(3,),(2,),(3,)],['col1'])
>>> df.show()
+----+
|col1|
+----+
| 1|
| 2|
| 3|
| 2|
| 3|
+----+
>>> df1 = spark.createDataFrame([(1,2),(2,3),(3,4)],['col1','col2'])
>>> df1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 2| 3|
| 3| 4|
+----+----+
>>> dist_val = df.select(F.collect_set('col1').alias('val')).first()['val']
>>> dist_val
[1, 2, 3]
>>> df1 = df1.withColumn('col3',F.array([F.lit(x) for x in dist_val]))
>>> df1.show()
+----+----+---------+
|col1|col2| col3|
+----+----+---------+
| 1| 2|[1, 2, 3]|
| 2| 3|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+----+----+---------+
>>> df1.select("*",F.explode('col3').alias('expl_col')).drop('col3').show()
+----+----+--------+
|col1|col2|expl_col|
+----+----+--------+
| 1| 2| 1|
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 1|
| 2| 3| 2|
| 2| 3| 3|
| 3| 4| 1|
| 3| 4| 2|
| 3| 4| 3|
+----+----+--------+
df = sqlContext.createDataFrame(sc.parallelize([(101,),(101,),(102,)]),['utilityId'])
df2 = sqlContext.createDataFrame(sc.parallelize([(0,'SUN',0),(0,'SUN',1)]),['index','status','timeSlot'])
rdf = df.distinct()
>>> df2.join(rdf).show()
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
| 0| SUN| 0| 101|
| 0| SUN| 0| 102|
| 0| SUN| 1| 101|
| 0| SUN| 1| 102|
+-----+------+--------+---------+