PySpark: how to complete a df with values from another df
Tags: pyspark, apache-spark-sql
I have 2 dataframes: one df has unique, well-formatted values, while the other df has bad values. How can I fill in the bad values in one dataframe using the other? Example df with correct, unique values:
+----------------------------------------+--------------+
|company_id                              |company_name  |
+----------------------------------------+--------------+
+----------------------------------------+--------------+
|company_id |company_name |
+----------------------------------------+--------------+
|8f642dc67fccf861548dfe1c761ce22f795e91f0|Muebles |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy |
+----------------------------------------+--------------+
Example df with bad values:
+----------------------------------------+------------+
|company_id |company_name|
+----------------------------------------+------------+
|******* |MiPasajefy |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|NaN |
|NaN |MiPasajefy |
+----------------------------------------+------------+
The columns company_id and company_name are both key columns,
so the bad df with corrected values must be:
+----------------------------------------+------------+
|company_id |company_name|
+----------------------------------------+------------+
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy |
+----------------------------------------+------------+
from pyspark.sql import SparkSession

# Create a local session (the original snippet assumed `spark` already existed)
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("8f642dc67fccf861548dfe1c761ce22f795e91f0", "Muebles"),
        ("cbf1c8b09cd5b549416d49d220a40cbd317f952e", "MiPasajefy"),
    ],
    ("company_id", "company_name"),
)
df2 = spark.createDataFrame(
    [
        ("*****", "MiPasajefy"),
        ("cbf1c8b09cd5b549416d49d220a40cbd317f952e", "NaN"),
        ("NaN", "MiPasajefy"),
    ],
    ("company_id", "company_name"),
)
df.createOrReplaceTempView("A")
df2.createOrReplaceTempView("B")
spark.sql("""
    select a.Company_name, a.company_id
    from B b
    left join A a
      on (a.company_id = b.company_id or a.Company_name = b.Company_name)
""").show(truncate=False)
+------------+----------------------------------------+
|Company_name|company_id |
+------------+----------------------------------------+
|MiPasajefy |cbf1c8b09cd5b549416d49d220a40cbd317f952e|
|MiPasajefy |cbf1c8b09cd5b549416d49d220a40cbd317f952e|
|MiPasajefy |cbf1c8b09cd5b549416d49d220a40cbd317f952e|
+------------+----------------------------------------+