How to use a Pandas UDF inside a class

pandas pyspark

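I am trying to figure out how to use self in PandasUDF.GroupBy.Apply and also pass arguments to it from inside a Python class method. I have tried many different approaches, but none of them worked. I also searched the internet for an example of a PandasUDF used inside a class with self and arguments, but could not find anything similar. I know how to do all of the above with Pandas.GroupBy.Apply.

The only way I could get it to work was to declare it as a static method:
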
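import pickle

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, FloatType, BinaryType
from xgboost import XGBRegressor


class Train: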
    return_type = StructType([
        StructField("div_nbr", FloatType()),
        StructField("store_nbr", FloatType()),
        StructField("model_str", BinaryType())
    ])
    function_type = PandasUDFType.GROUPED_MAP

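    def __init__(self):
        ...  # body elided in the original post

    def run_train(self):
        # sp_df is the input Spark DataFrame; it is not defined in the original snippet.
        output = sp_df.groupby(['A', 'B']).apply(self.model_train)
        output.show(10)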

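    # Decorators apply bottom-up: @pandas_udf wraps the plain function first,
    # then @staticmethod keeps self from being passed in as the first argument.
    @staticmethod
    @pandas_udf(return_type, function_type)
    def model_train(pd_df):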
        features_name = ['days_into_year', 'months_into_year', 'minutes_into_day', 'hour_of_day', 'recency']

        X = pd_df[features_name].copy()
        Y = pd.DataFrame(pd_df['trans_type_value']).copy()

        estimator_1 = XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=300, verbosity=1,
                                   objective='reg:squarederror', booster='gbtree', n_jobs=-1, gamma=0,
                                   min_child_weight=5, max_delta_step=0, subsample=0.6, colsample_bytree=0.8,
                                   colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1,
                                   scale_pos_weight=1, base_score=0.5, random_state=1234, missing=None,
                                   importance_type='gain')
        estimator_1.fit(X, Y)
        df_to_return = pd_df[['div_nbr', 'store_nbr']].drop_duplicates().copy()
        df_to_return['model_str'] = pickle.dumps(estimator_1)

        return df_to_return

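
What I actually want to achieve is to declare return_type, function_type, and features_name in __init__(), use them in the PandasUDF, and also pass in parameters to be used inside the function when running PandasUDF.GroupBy.Apply.

I would really appreciate it if anyone could help. I am a bit rusty with PySpark.
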
You need to write another function inside your @staticmethod that uses the decorator @pandas_udf.
@pissall I did not fully understand.
Write a new method, a static method that takes a PySpark DataFrame as input; inside that function, write the pandas UDF and then apply it to the DataFrame.
If I do that, how will I access self?
If you want to use self, why are you using a static method?
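
One way to read the suggestion in the comments, while still using values stored on self, is to build the UDF inside a regular method and let the inner function close over local copies of the instance attributes. The following is only a sketch under assumptions: the constructor arguments, make_model_train, and fit_fn are hypothetical names that do not appear in the question, and the key columns are assumed to be floats as in the original return_type. The per-group work is delegated to a caller-supplied fit_fn(X, y) that returns bytes (for example a pickled estimator), so any extra parameters travel with that callable.

```python
import pandas as pd


class Train:
    def __init__(self, key_cols, features_name, target_name):
        # Hypothetical constructor arguments; the original __init__ body is elided.
        self.key_cols = list(key_cols)
        self.features_name = list(features_name)
        self.target_name = target_name

    def make_model_train(self, fit_fn):
        """Return a plain per-group function that does not reference self."""
        # Copy the needed attributes into locals. The closure then captures only
        # these values, so the Train instance itself is not part of the function.
        keys, features, target = self.key_cols, self.features_name, self.target_name

        def model_train(pd_df):
            # fit_fn(X, y) is supplied by the caller and must return bytes.
            model_bytes = fit_fn(pd_df[features], pd_df[target])
            out = pd_df[keys].drop_duplicates().copy()
            out["model_str"] = model_bytes
            return out

        return model_train

    def run_train(self, sp_df, fit_fn):
        # Imported here so the pandas-only part above can be tried without Spark.
        from pyspark.sql.functions import pandas_udf, PandasUDFType
        from pyspark.sql.types import StructType, StructField, FloatType, BinaryType

        # Assumption: every key column is a float, as in the question's schema.
        return_type = StructType(
            [StructField(k, FloatType()) for k in self.key_cols]
            + [StructField("model_str", BinaryType())]
        )
        udf = pandas_udf(
            self.make_model_train(fit_fn),
            returnType=return_type,
            functionType=PandasUDFType.GROUPED_MAP,
        )
        return sp_df.groupby(self.key_cols).apply(udf)
```

With this layout, run_train can read anything from self when it builds the schema and the closure, while the function Spark receives is an ordinary function of one pandas DataFrame. On Spark 3.0 and later, GROUPED_MAP pandas UDFs are deprecated in favour of groupby(...).applyInPandas(func, schema), which accepts the same kind of plain function.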