Python: How to compare the data types and columns of 2 DataFrames in PySpark
I have two DataFrames in PySpark, df1 and df2. The schemas look like this:
>>> df1.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- Zip: decimal(18,2) (nullable = true)
>>> df2.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- Zip: decimal(9,2) (nullable = true)
|-- nation: string (nullable = true)
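For reference, a minimal sketch that reproduces the two DataFrames (the sample rows and values are assumptions; nullability flags are left at their defaults):

from decimal import Decimal
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up sample rows; only the schemas matter for this question.
df1 = spark.createDataFrame(
    [(1, 'Alice', 'Street 1', Decimal('12345.00'))],
    'id int, name string, address string, Zip decimal(18,2)')
df2 = spark.createDataFrame(
    [(2, 'Bob', 'Street 2', Decimal('123.00'), 'US')],
    'id int, name string, address string, Zip decimal(9,2), nation string')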
Now I want to compare the two DataFrames for differences in columns and data types.
How can we achieve this in PySpark?
Expected output:
Columns:
ID  Col_Name  DataFrame
1   Nation    df2

Data types:
ID  Col_Name  DF1            DF2
1   id        None           None
2   name      None           None
3   address   None           None
4   Zip       Decimal(18,2)  Decimal(9,2)
5   nation    None           None
You can create a DataFrame of the column data types and manipulate it to get the desired result. I used Spark DataFrames here, but I imagine pandas would work too (see the sketch at the end).
import pyspark.sql.functions as F

# Turn each schema into a (col_name, dtype) DataFrame and tag its source.
# df.dtypes returns a list of (column name, dtype string) tuples.
type1 = spark.createDataFrame(
    df1.dtypes, 'col_name string, dtype string'
).withColumn('dataframe', F.lit('df1'))
type2 = spark.createDataFrame(
    df2.dtypes, 'col_name string, dtype string'
).withColumn('dataframe', F.lit('df2'))

# Columns that exist in only one DataFrame: anti-join in both directions,
# then drop the dtype column since it is not needed here.
result1 = type1.join(type2, 'col_name', 'left_anti').unionAll(
    type2.join(type1, 'col_name', 'left_anti')
).drop('dtype')
result1.show()
+--------+---------+
|col_name|dataframe|
+--------+---------+
| nation| df2|
+--------+---------+
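This result lacks the running ID column from the expected output. If you want it as well, one option (a sketch; it just numbers the rows, here ordered by column name) is row_number over a window:

from pyspark.sql.window import Window

# Hypothetical extra step: number rows to mimic the ID column.
result1.withColumn(
    'ID', F.row_number().over(Window.orderBy('col_name'))
).show()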
# Full outer join on the column name; where the dtypes differ,
# surface both of them, otherwise leave null.
result2 = type1.join(type2, 'col_name', 'full').select(
    'col_name',
    F.when(type1.dtype != type2.dtype, type1.dtype).alias('df1'),
    F.when(type1.dtype != type2.dtype, type2.dtype).alias('df2')
)
result2.show()
+--------+-------------+------------+
|col_name| df1| df2|
+--------+-------------+------------+
| name| null| null|
| nation| null| null|
| Zip|decimal(18,2)|decimal(9,2)|
| id| null| null|
| address| null| null|
+--------+-------------+------------+
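Note that in result2, columns whose data types match come back as null on both sides; only Zip actually differs. As for the pandas alternative mentioned above, a minimal sketch of the same comparison (assuming df1 and df2 are the PySpark DataFrames from the question):

import pandas as pd

# df.dtypes on a PySpark DataFrame is a list of (column, dtype) tuples.
t1 = pd.DataFrame(df1.dtypes, columns=['col_name', 'df1'])
t2 = pd.DataFrame(df2.dtypes, columns=['col_name', 'df2'])

# Outer merge: columns missing on one side become NaN, and differing
# dtype strings compare unequal, so both kinds of difference surface.
merged = t1.merge(t2, on='col_name', how='outer')
print(merged[merged['df1'].ne(merged['df2'])])
#   col_name            df1           df2
# 3      Zip  decimal(18,2)  decimal(9,2)
# 4   nation            NaN        string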