How to set column values from a different table in PySpark?

Tags: python, apache-spark, pyspark, pyspark-sql

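In PySpark, how do I set the value of a column in table A (listed_1) from a column in table B (list_date) when the condition (B.list_expire_value) > 5 | (B.list_date) < 6 holds? The B. prefix means the column belongs to table B.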

Currently I am doing:

  spark_df = table_1.join("table_2", on ="uuid").when((table_2['list_expire_value'] > 5) | (table_2['list_date'] < 6)).withColumn("listed_1", table_2['list_date'])

But I get an error. How can this be done?

Sample tables:

Table A
uuid  listed_1
001   abc
002   def
003   ghi

Table B
uuid  list_date  list_expire_value  col4
001   12         7                  dckvfd
002   14         3                  dfdfgi
003   3          8                  sdfgds

Expected output
uuid  listed1  list_expire_value  col4
001   12       7                  dckvfd
002   def      3                  dfdfgi
003   3        8                  sdfgds

The listed1 value for uuid 002 is not replaced, since that row does not fulfil the when condition.

Hope this helps:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Assumes an existing SparkContext `sc` with an active SparkSession (as in the pyspark shell).
A = sc.parallelize([('001','abc'),('002','def'),('003','ghi')]).toDF(['uuid','listed_1'])
B = sc.parallelize([('001',12,7,'dckvfd'),('002',14,3,'dfdfgi'),('003',3,8,'sdfgds')]).\
    toDF(['uuid','list_date','list_expire_value','col4'])

def cond_fn(x, y, z):
    # x: list_expire_value, y: list_date, z: listed_1
    if x > 5 or y < 6:
        return str(y)  # convert explicitly so the value matches the declared StringType
    else:
        return z

# Join the two tables on uuid, then compute the new column row by row with the UDF.
final_df = A.join(B, on="uuid")
udf_val = udf(cond_fn, StringType())
final_df = final_df.withColumn("listed1", udf_val(final_df.list_expire_value, final_df.list_date, final_df.listed_1))
final_df.select(["uuid", "listed1", "list_expire_value", "col4"]).show()


Don't forget to let us know whether this solved your problem :)
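The rule that the UDF encodes can be checked in plain Python against the sample rows, without starting Spark. This is only a sanity-check sketch: the helper name `pick_listed1` and the tuple layout are made up for illustration, and the date is converted to a string on the assumption that the resulting column should be a string column.

```python
def pick_listed1(list_expire_value, list_date, listed_1):
    """Return list_date (as a string) when the condition holds, else keep listed_1."""
    if list_expire_value > 5 or list_date < 6:
        return str(list_date)
    return listed_1

# Sample rows from the question, already joined on uuid:
# (uuid, listed_1, list_date, list_expire_value)
sample = [
    ("001", "abc", 12, 7),
    ("002", "def", 14, 3),
    ("003", "ghi", 3, 8),
]

listed1 = {
    uuid: pick_listed1(expire, date, listed)
    for uuid, listed, date, expire in sample
}
print(listed1)
```

Only uuid 002 keeps its original value, because neither half of the condition is true for that row (3 is not greater than 5, and 14 is not less than 6).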


The correct form of the PySpark SQL query is:

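from pyspark.sql import functions as F
# Take B.list_date when the condition holds, otherwise keep A.listed_1; the cast keeps both branches strings.
cond = (table_2.list_expire_value > 5) | (table_2.list_date < 6)
spark_df = table_1.join(table_2, 'uuid', 'inner').withColumn('listed1', F.when(cond, table_2.list_date.cast('string')).otherwise(table_1.listed_1)).drop(table_1.listed_1)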


Comments:

@mtoto I have added the expected output.

@tbone, with sqlContext this would become an UPDATE statement (set col value = x). That is not allowed in Spark, right?

No, you just create a new DataFrame that is the result of the SQL join.

I can't work out how to do the SQL join and update a column's value from a column of another table.

Why isn't uuid=003 updated? It satisfies the (table_2['list_date'] < 6) condition, right? Or is the data wrong?