Python: `__getnewargs__` error when using a udf in PySpark
There is a dataframe with two columns (db and tb): db represents a database, and tb represents a tableName of that database:
+--------------------+--------------------+
| database| tableName|
+--------------------+--------------------+
|aaaaaaaaaaaaaaaaa...| tttttttttttttttt|
|bbbbbbbbbbbbbbbbb...| rrrrrrrrrrrrrrrr|
|aaaaaaaaaaaaaaaaa...| ssssssssssssssssss|
I have the following method in Python:
def _get_tb_db(db, tb):
    df = spark.sql("select * from {}.{}".format(db, tb))
    return df.dtypes
And this udf:
test = udf(lambda db, tb: _get_tb_db(db, tb), StringType())
When I run this:
df = df.withColumn("dtype", test(col("db"), col("tb")))
I get the following error:
pickle.PicklingError: Could not serialize object: Py4JError: An
error occurred while calling o58.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
I found some related discussions on Stack Overflow, but I don't know how to solve the problem. Is the error caused by creating another dataframe inside the UDF?
Similar to the solution in the linked discussion, I tried:
cols = copy.deepcopy(df.columns)
df = df.withColumn("dtype", scanning(cols[0], cols[1]))
But it still errors. Any solution?

The error means that you cannot use a Spark dataframe inside a UDF. However, since the dataframe holding the database and table names is most likely small, a plain Python for loop is enough. Below are some approaches that may help retrieve this metadata:
from pyspark.sql import Row
# assume dfs is the df containing database names and table names
dfs.printSchema()
root
|-- database: string (nullable = true)
|-- tableName: string (nullable = true)
Method 1: use df.dtypes
Run the SQL `select * from database.tableName limit 1` to generate a dataframe and return its dtypes, converted with str() so the result fits into a StringType() column.
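Method 1 is only described in prose above; a minimal sketch of it (assuming an active `spark` session and the `dfs` dataframe shown earlier; `dtypes_query`, `get_tb_dtypes`, and `collect_dtypes` are hypothetical helper names) could look like:

```python
def dtypes_query(db, tb):
    # probe query: one row is enough to materialize the schema
    return 'select * from `{}`.`{}` limit 1'.format(db, tb)

def get_tb_dtypes(spark, row):
    # run the probe and flatten the (col_name, col_type) pairs into a
    # plain string so the value fits a single StringType() column
    dtypes = spark.sql(dtypes_query(row.database, row.tableName)).dtypes
    return (row.database, row.tableName, str(dtypes))

def collect_dtypes(spark, dfs):
    # driver-side loop instead of a UDF: dfs is small, so collect() is cheap
    rows = [get_tb_dtypes(spark, r) for r in dfs.collect()]
    return spark.createDataFrame(rows, ['database', 'tableName', 'dtypes'])
```

The key difference from the failing UDF is that spark.sql() only ever runs on the driver here, so nothing Spark-internal has to be pickled.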
Note: using dtypes instead of str(dtypes) yields the following schema, where `_1` and `_2` are the column name and the column data type respectively:
root
 |-- database: string (nullable = true)
 |-- tableName: string (nullable = true)
 |-- dtypes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: string (nullable = true)
- With this method, each table gets only one row. With the next two methods, each column type of a table gets its own row.
Method 2: use `spark.sql("describe tableName")` to get a dataframe with this information directly, then use a reduce function to union the results of all tables:
from functools import reduce
def get_df_dtypes(db, tb):
    try:
        return spark.sql('desc `{}`.`{}`'.format(db, tb)) \
            .selectExpr(
                '"{}" as `database`'.format(db),
                '"{}" as `tableName`'.format(tb),
                'col_name',
                'data_type')
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(db, tb, e))
        pass
# an example table:
get_df_dtypes('default', 'tbl_df1').show()
+--------+---------+--------+--------------------+
|database|tableName|col_name| data_type|
+--------+---------+--------+--------------------+
| default| tbl_df1| array_b|array<struct<a:st...|
| default| tbl_df1| array_d| array<string>|
| default| tbl_df1|struct_c|struct<a:double,b...|
+--------+---------+--------+--------------------+
# use reduce function to union all tables into one df
df_dtypes = reduce(lambda d1, d2: d1.union(d2), [ get_df_dtypes(row.database, row.tableName) for row in dfs.collect() ])
Note: different Spark distributions/versions may return different results for `describe tbl_name` and similar commands; when retrieving the metadata, make sure you use the column names that your version actually produces in the queries.

Comments:
- What exactly are you trying to do?
- I added more explanation to the question. The UDF operates on each row and returns a row for each.
- I think it is because of `df = spark.sql("select * from {}.{}".format(db, tb))`: you are trying to query one db per row with a udf. You should try `map` on an `rdd` instead.
- Thanks, could you add some more code? How should I get the dtypes of each (db, tb) row?
- Comments are not for extended discussion; this conversation has ended.
Method 3: use spark.catalog.listColumns() to create lists of collections.Column objects, retrieve name and dataType, and merge the data:
data = []
DRow = Row('database', 'tableName', 'col_name', 'col_dtype')
for row in dfs.select('database', 'tableName').collect():
    try:
        for col in spark.catalog.listColumns(row.tableName, row.database):
            data.append(DRow(row.database, row.tableName, col.name, col.dataType))
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(row.database, row.tableName, e))
        pass
df_dtypes = spark.createDataFrame(data)
# DataFrame[database: string, tableName: string, col_name: string, col_dtype: string]
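If you only need the result on the driver rather than as another dataframe, the `data` list built in the loop above can also be folded into a plain dict; a sketch (`group_dtypes` is a hypothetical helper; it only assumes records in the (database, tableName, col_name, col_dtype) order built above):

```python
from collections import defaultdict

def group_dtypes(rows):
    # fold (database, tableName, col_name, col_dtype) records into
    # {(database, tableName): [(col_name, col_dtype), ...]}
    grouped = defaultdict(list)
    for r in rows:
        grouped[(r[0], r[1])].append((r[2], r[3]))
    return dict(grouped)
```

This works on the Row objects unchanged, since Row is a tuple subclass and supports positional indexing.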