Python Spark DataFrame: updating values
I have 3 DataFrames:
1. Item dataframe:
+-------+---------+
|id_item|item_code|
+-------+---------+
| 991| A0049|
| 992| C1248|
| 993| C0860|
| 994| C0757|
| 995| C0682|
+-------+---------+
and
and
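Now, id_usn is inside a StructType, and I want to replace the id_usn in the summary DataFrame with the usn from the user DataFrame.
I am using Spark.
Please help me solve this problem.
Hope this helps: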
from pyspark.sql import functions as F
# Explode the summary array, then expose the struct fields as top-level columns.
sdf1 = (summarydf.select('id_item', 'summary', F.explode('summary').alias('col_summary'))
        .select('*', F.col('col_summary').id_usn.alias('id_usn'), F.col('col_summary').rating.alias('rating'))
        .drop('col_summary'))
# Join in item_code and usn, then rebuild the array of structs per item_code.
df = (sdf1.join(itemdf, 'id_item').join(userdf, 'id_usn')
      .select('item_code', F.struct('usn', 'rating').alias('tmpcol'))
      .groupby('item_code').agg(F.collect_list('tmpcol').alias('summary')))
+---------+--------------------+
|item_code| summary|
+---------+--------------------+
| C1248|[[39063291,0.0010...|
| A0049|[[39063291,0.5799...|
+---------+--------------------+
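To see what each stage of that pipeline produces, the same explode, join, and regroup steps can be traced in plain Python. This is only a sketch of the data flow, not Spark code. The user table is not shown in the question, so the single `user_rows` entry is an assumption inferred from the output above, and the ratings are placeholder values because the displayed ones are truncated.

```python
from collections import defaultdict

# Item rows copied from the question's item table.
item_rows = [(991, "A0049"), (992, "C1248"), (993, "C0860"),
             (994, "C0757"), (995, "C0682")]
# Assumed user table (id_usn -> usn); inferred from the answer's output.
user_rows = [(417567, 39063291)]
# Summary rows: (id_item, [(id_usn, rating), ...]); ratings are placeholders.
summary_rows = [
    (991, [(417567, 0.5)]),
    (992, [(417567, 0.1)]),
    (443, [(417585, 0.2)]),
]

# Step 1, "explode": one flat row per struct in the summary array.
exploded = [(id_item, id_usn, rating)
            for id_item, summary in summary_rows
            for id_usn, rating in summary]

# Step 2, inner joins: keep only rows that match both lookup tables.
item_code_by_id = dict(item_rows)
usn_by_id_usn = dict(user_rows)
joined = [(item_code_by_id[id_item], usn_by_id_usn[id_usn], rating)
          for id_item, id_usn, rating in exploded
          if id_item in item_code_by_id and id_usn in usn_by_id_usn]

# Step 3, "groupby + collect_list": rebuild the list of (usn, rating) per item_code.
grouped = defaultdict(list)
for item_code, usn, rating in joined:
    grouped[item_code].append((usn, rating))
result = dict(grouped)

print(result)
```

Because both joins are inner joins, id_item 443 disappears from the result: it has no match in the item table. If unmatched rows need to be kept in the Spark version, passing `how='left'` to `join` is one option to consider.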
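Have you tried using join followed by select? I'm afraid the question looks too simple to be true, so I'm trying to find what makes it worth asking.
I can try, but my problem is replacing id_usn, because it is inside a struct.
@JacekLaskowski: do you have any thoughts on this question?
You can explode the array, take the struct fields as separate columns, and join on them.
@Suresh: could you write the code in an answer?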
3. Summary dataframe
+-------+--------------------+
|id_item| summary |
+-------+--------------------+
| 991|[[417567,0.579901...|
| 992|[[417567,0.001029...|
| 443|[[417585,0.219624...|
+-------+--------------------+
and the schema of this DataFrame:
root
|-- id_item: integer (nullable = true)
|-- summary: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id_usn: long (nullable = true)
| | |-- rating: double (nullable = true)