Python: how to change a cell value based on a condition in a PySpark dataframe


I have a dataframe with the following columns:

category | category_id | bucket | prop_count | event_count | accum_prop_count | accum_event_count
--------------------------------------------------------------------------------------------------
nation   | nation      | 1      | 222        | 444         | 555              | 6677

A row is added from another function:

a_temp3 = sqlContext.createDataFrame([("nation","state",2,222,444,555)],schema)
a_df = a_df.unionAll(a_temp3)
Now, to modify it, I am trying a join with a condition:

a_temp4 = sqlContext.createDataFrame([("state","state",2,444,555,666)],schema)
a_df = a_df.join(a_temp4, [(a_df.category_id == a_temp4.category_id) & (some other cond here)], how = "inner")
But this code does not work; instead I get:

+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+
|  nation|      state|     2|       222|        444|             555|   state|      state|     2|       444|        555|             666|
+--------+-----------+------+----------+-----------+----------------+--------+-----------+------+----------+-----------+----------------+

How do I solve this? The correct output should have 2 rows, with the second row holding the updated values.

1) An inner join will delete rows from the initial dataframe; if you want to keep the same number of rows as a_df (on the left side), you need a left join.

2) When the columns have the same name, an == join condition will duplicate those columns; you can pass a list of column names instead.

3) I assume the "other condition" refers to bucket.
4) When a match exists, you want to keep the value coming from a_temp4 (when it does not, the join sets it to null); psf.coalesce lets you do exactly that:

import pyspark.sql.functions as psf

a_df = a_df.join(a_temp4, ["category_id", "bucket"], how="leftouter").select(
    psf.coalesce(a_temp4.category, a_df.category).alias("category"),
    "category_id",
    "bucket",
    psf.coalesce(a_temp4.prop_count, a_df.prop_count).alias("prop_count"),
    psf.coalesce(a_temp4.event_count, a_df.event_count).alias("event_count"),
    psf.coalesce(a_temp4.accum_prop_count, a_df.accum_prop_count).alias("accum_prop_count")
)
+--------+-----------+------+----------+-----------+----------------+
|category|category_id|bucket|prop_count|event_count|accum_prop_count|
+--------+-----------+------+----------+-----------+----------------+
|   state|      state|     2|       444|        555|             666|
|  nation|     nation|     1|       222|        444|             555|
+--------+-----------+------+----------+-----------+----------------+
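To see why the left join plus coalesce keeps the updated values: coalesce simply returns the first non-null argument for each row. A plain-Python sketch of that per-row logic (the helper and the dict rows are illustrative, not Spark API; no Spark required):

```python
def coalesce(*values):
    # Mirrors pyspark.sql.functions.coalesce: the first non-null value wins.
    for v in values:
        if v is not None:
            return v
    return None

# After the left outer join on (category_id, bucket), the unmatched row
# carries None in the a_temp4 columns, so the a_df value survives:
matched = {"temp4_prop_count": 444, "a_df_prop_count": 222}
unmatched = {"temp4_prop_count": None, "a_df_prop_count": 222}

print(coalesce(matched["temp4_prop_count"], matched["a_df_prop_count"]))      # 444
print(coalesce(unmatched["temp4_prop_count"], unmatched["a_df_prop_count"]))  # 222
```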

If you only have one row to update in your dataframe, you should consider coding the update directly instead of using a join:

def update_col(category_id, bucket, col_name, col_val):
    return psf.when(
        (a_df.category_id == category_id) & (a_df.bucket == bucket), col_val
    ).otherwise(a_df[col_name]).alias(col_name)

a_df.select(
    update_col("state", 2, "category", "state"),
    "category_id",
    "bucket",
    update_col("state", 2, "prop_count", 444),
    update_col("state", 2, "event_count", 555),
    update_col("state", 2, "accum_prop_count", 666)
).show()
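The when/otherwise pattern is also just a per-row conditional. A plain-Python sketch of the same update over dict rows (the helper name and row layout are illustrative, not Spark API; no Spark required):

```python
def update_col(row, category_id, bucket, col_name, col_val):
    # Mirrors psf.when(cond, col_val).otherwise(row[col_name]):
    # replace the cell only when the row matches the condition.
    if row["category_id"] == category_id and row["bucket"] == bucket:
        return col_val
    return row[col_name]

rows = [
    {"category": "nation", "category_id": "nation", "bucket": 1, "prop_count": 222},
    {"category": "nation", "category_id": "state",  "bucket": 2, "prop_count": 222},
]

updated = [
    {**r,
     "category": update_col(r, "state", 2, "category", "state"),
     "prop_count": update_col(r, "state", 2, "prop_count", 444)}
    for r in rows
]

print(updated[1]["category"], updated[1]["prop_count"])  # state 444
```

Only the row matching category_id == "state" and bucket == 2 is rewritten; the first row passes through unchanged, just as otherwise keeps the original column value.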