Python PySpark: how do I add a column from one DataFrame to another?
I have two DataFrames, each with 10 rows:
df1.show()
+-------------------+------------------+--------+-------+
| lat| lon|duration|stop_id|
+-------------------+------------------+--------+-------+
| -6.23748779296875| 106.6937255859375| 247| 0|
| -6.23748779296875| 106.6937255859375| 2206| 1|
| -6.23748779296875| 106.6937255859375| 609| 2|
| 0.5733972787857056|101.45503234863281| 16879| 3|
| 0.5733972787857056|101.45503234863281| 4680| 4|
| -6.851855278015137|108.64261627197266| 164| 5|
| -6.851855278015137|108.64261627197266| 220| 6|
| -6.851855278015137|108.64261627197266| 1669| 7|
|-0.9033176600933075|100.41548919677734| 30811| 8|
|-0.9033176600933075|100.41548919677734| 23404| 9|
+-------------------+------------------+--------+-------+
I want to add the bank_and_post column from df2 to df1.
df2 comes from this function:
import numpy as np
from pyspark.sql.functions import pandas_udf, col, lit
from pyspark.sql.types import DoubleType

def assignPtime(x, mu, std):
    # mu and std arrive as pandas Series (from lit), so take the scalar
    mu = mu.values[0]
    std = std.values[0]
    # sample a normal distribution and estimate per-bin densities
    x1 = np.random.normal(mu, std, 100000)
    a1, b1 = np.histogram(x1, density=True)
    val = x / 60  # duration in minutes
    for k, v in enumerate(val):
        prob = 0
        for i in range(len(b1) - 1):
            v1 = b1[i]
            v2 = b1[i + 1]
            if (v >= v1) and (v < v2):
                prob = a1[i]
        x[k] = prob  # in-place write into the input Series
    return x

ff = pandas_udf(assignPtime, returnType=DoubleType())
df2 = df1.select(ff(col("duration"), lit(15), lit(15)).alias("bank_and_post"))
df2.show()
+--------------------+
| bank_and_post|
+--------------------+
|0.021806558032484918|
|0.014366417828826784|
|0.021806558032484918|
| 0.0|
| 0.0|
|0.021806558032484918|
|0.021806558032484918|
|0.014366417828826784|
| 0.0|
| 0.0|
+--------------------+
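As an aside, the double loop in assignPtime can be replaced by a vectorized bin lookup with np.digitize. A sketch (the name assign_prob_vectorized and the seed parameter are my own; values falling outside the histogram range keep probability 0, matching the strict `v >= v1 and v < v2` check in the original loop):

```python
import numpy as np

def assign_prob_vectorized(x, mu, std, size=100000, seed=0):
    """Map each duration (seconds) to the density of the histogram bin
    its minute value falls into."""
    rng = np.random.default_rng(seed)
    dens, edges = np.histogram(rng.normal(mu, std, size), density=True)
    val = np.asarray(x, dtype=float) / 60
    # digitize returns 1-based bin numbers; 0 and len(edges) mean out of range
    idx = np.digitize(val, edges) - 1
    prob = np.zeros_like(val)
    in_range = (idx >= 0) & (idx < len(dens))
    prob[in_range] = dens[idx[in_range]]
    return prob
```

This also returns a fresh array instead of writing into the input, which sidesteps the read-only error below.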
I get the error:
ValueError: assignment destination is read-only
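The traceback points at `x[k] = prob` inside the UDF, not at anything Spark-side: with Arrow-enabled pandas UDFs the Series passed in can be backed by a read-only buffer, so writing into `x` in place fails. A common workaround (a sketch, with scalar mu/std for brevity) is to build the result in a fresh array instead of mutating the input:

```python
import numpy as np

def assign_ptime_safe(x, mu, std):
    # copy first: the input may be backed by a read-only (Arrow) buffer
    out = np.asarray(x, dtype=float).copy()
    x1 = np.random.normal(mu, std, 100000)
    a1, b1 = np.histogram(x1, density=True)
    val = out / 60
    for k, v in enumerate(val):
        prob = 0.0
        for i in range(len(b1) - 1):
            if b1[i] <= v < b1[i + 1]:
                prob = a1[i]
        out[k] = prob  # writes go to the copy, never the input
    return out
```

Inside the real pandas UDF, mu and std would still need the `mu.values[0]` / `std.values[0]` unwrapping from the original.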
Add a row-index column to both df1 and df2, then join the two DataFrames on that index column. Four variants follow: using row_number, using monotonically_increasing_id, using row_number over monotonically_increasing_id, and using zipWithIndex.
Comment: If I try df2 = df2.withColumn("stop_id", monotonically_increasing_id()) I still get ValueError: assignment destination is read-only. I wrote the function that generates df2. — @emax: I don't think that error is caused by monotonically_increasing_id; it looks numpy-related. Also note that monotonically_increasing_id does not guarantee consecutive numbers, so the two DataFrames were not joined correctly with that method.
df1 = spark.createDataFrame([(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,)], ["stop_id"])
df2 = spark.createDataFrame([("0.021806558032484918",), ("0.014366417828826784",), ("0.021806558032484918",), ("0.0",), ("0.0",), ("0.021806558032484918",), ("0.021806558032484918",), ("0.014366417828826784",), ("0.0",), ("0.0",)], ["bank_and_post"])
from pyspark.sql import *
from pyspark.sql.functions import *
# 1. Using row_number (orderBy(lit(1)) pulls all rows into one partition; fine for small data)
w = Window.orderBy(lit(1))
df4 = df2.withColumn("rn", row_number().over(w) - 1)
df3 = df1.withColumn("rn", row_number().over(w) - 1)
df3.join(df4, ["rn"]).drop("rn").show()
#+-------+--------------------+
#|stop_id| bank_and_post|
#+-------+--------------------+
#| 0|0.021806558032484918|
#| 1|0.014366417828826784|
#| 2|0.021806558032484918|
#| 3| 0.0|
#| 4| 0.0|
#| 5|0.021806558032484918|
#| 6|0.021806558032484918|
#| 7|0.014366417828826784|
#| 8| 0.0|
#| 9| 0.0|
#+-------+--------------------+
# 2. Using monotonically_increasing_id (ids are increasing but not guaranteed
#    consecutive, so joining on them directly is only reliable in simple cases)
df1.withColumn("mid", monotonically_increasing_id()).\
    join(df2.withColumn("mid", monotonically_increasing_id()), ["mid"]).\
    drop("mid").\
    orderBy("stop_id").\
    show()
#+-------+--------------------+
#|stop_id| bank_and_post|
#+-------+--------------------+
#| 0|0.021806558032484918|
#| 1|0.014366417828826784|
#| 2|0.021806558032484918|
#| 3| 0.0|
#| 4| 0.0|
#| 5|0.021806558032484918|
#| 6|0.021806558032484918|
#| 7|0.014366417828826784|
#| 8| 0.0|
#| 9| 0.0|
#+-------+--------------------+
# 3. Using row_number over monotonically_increasing_id (turns the sparse ids into
#    a consecutive 0..n-1 index)
w = Window.orderBy("mid")
df3 = df1.withColumn("mid", monotonically_increasing_id()).withColumn("rn", row_number().over(w) - 1)
df4 = df2.withColumn("mid", monotonically_increasing_id()).withColumn("rn", row_number().over(w) - 1)
df3.join(df4, ["rn"]).drop("rn", "mid").show()
#+-------+--------------------+
#|stop_id| bank_and_post|
#+-------+--------------------+
#| 0|0.021806558032484918|
#| 1|0.014366417828826784|
#| 2|0.021806558032484918|
#| 3| 0.0|
#| 4| 0.0|
#| 5|0.021806558032484918|
#| 6|0.021806558032484918|
#| 7|0.014366417828826784|
#| 8| 0.0|
#| 9| 0.0|
#+-------+--------------------+
# 4. Using zipWithIndex (each row is paired with its partition-ordered position)
df3 = df1.rdd.zipWithIndex().toDF().select("_1.*", "_2")
df4 = df2.rdd.zipWithIndex().toDF().select("_1.*", "_2")
df3.join(df4, ["_2"]).drop("_2").orderBy("stop_id").show()
#+-------+--------------------+
#|stop_id| bank_and_post|
#+-------+--------------------+
#| 0|0.021806558032484918|
#| 1|0.014366417828826784|
#| 2|0.021806558032484918|
#| 3| 0.0|
#| 4| 0.0|
#| 5|0.021806558032484918|
#| 6|0.021806558032484918|
#| 7|0.014366417828826784|
#| 8| 0.0|
#| 9| 0.0|
#+-------+--------------------+
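All four variants implement the same idea: attach a positional index to each DataFrame and join on it. Stripped of Spark, the pattern reduces to the following plain-Python sketch (the values are made up for illustration; in Spark the positions are only trustworthy when they come from something deterministic like zipWithIndex or row_number):

```python
stop_ids = [0, 1, 2, 3, 4]
probs = [0.0218, 0.0144, 0.0218, 0.0, 0.0]  # hypothetical bank_and_post values

# pair each row with its position (the zipWithIndex analogue) ...
indexed1 = dict(enumerate(stop_ids))
indexed2 = dict(enumerate(probs))

# ... then join on the shared positional key
common = sorted(indexed1.keys() & indexed2.keys())
joined = [(indexed1[i], indexed2[i]) for i in common]
```

This is exactly why joining on raw monotonically_increasing_id values can mismatch rows: if the keys in indexed1 and indexed2 were generated with different gaps, the positional join silently drops or misaligns rows.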