Python: Creating a PySpark DataFrame column based on a class method

python function class apache-spark pyspark

I have a Python class with a method like this:

class Features():
    def __init__(self, json):
        self.json = json

    def get_email(self):
        email = self.json.get('fields', {}).get('email', None)
        return email

I am trying to use the get_email function on a PySpark DataFrame to create a new column based on another column, "raw_json", which contains JSON values:

df = data.withColumn('email', (F.udf(lambda j: Features.get_email(json.loads(j)), t.StringType()))('raw_json'))

So the desired PySpark DataFrame would look like this:

 +----------------+----------+
 |raw_json        |email     |
 +----------------+----------+
 |                |          |
 +----------------+----------+
 |                |          |
 +----------------+----------+

But I get an error saying:

TypeError: unbound method get_email() must be called with Features instance as first argument (got dict instance instead)

How can I achieve this?

I have seen a similar question before, but it was never resolved.

I think you have misunderstood how classes are used in Python. You are probably looking for the following:

udf = F.udf(lambda j: Features(json.loads(j)).get_email())
df = data.withColumn('email', udf('raw_json'))

Here, you instantiate a Features object and call that object's get_email method.
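As a minimal sketch of what the UDF body does for each row, the same logic can be checked in plain Python before involving Spark. The helper name extract_email, the sample JSON strings, and the choice to return None for null or malformed input are illustrative assumptions, not part of the original answer.

```python
import json


class Features:
    def __init__(self, json):
        # The parameter name shadows the json module only inside __init__.
        self.json = json

    def get_email(self):
        return self.json.get('fields', {}).get('email', None)


def extract_email(raw):
    """Hypothetical helper: parse one raw_json value and return its email."""
    if raw is None:
        # A null cell in the column arrives in the UDF as None.
        return None
    try:
        parsed = json.loads(raw)
    except ValueError:
        # Assumption: malformed JSON should yield null instead of failing the job.
        return None
    if not isinstance(parsed, dict):
        return None
    # Create an instance first, then call the method on that instance.
    return Features(parsed).get_email()


print(extract_email('{"fields": {"email": "a@example.com"}}'))  # a@example.com
print(extract_email('{"fields": {}}'))                          # None
```

Wrapped as F.udf(extract_email, t.StringType()) and applied to 'raw_json', a function like this would be used the same way as the lambda above.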

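But how do I apply the function to the "raw_json" column?

Did you first try Features = Features() and then use data.withColumn('email', Features.get_email(col("raw_json")))?

@XXavier I get an error when executing Features = Features(): TypeError: __init__() takes exactly 2 arguments (1 given)

@kihhfuee I edited my answer so that it applies to the raw_json column. Your error about the parse date not being defined has nothing to do with my suggestion; there is another error elsewhere in your code.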