Apache Spark: how to use a scikit-learn model in a structured query?

Tags: apache-spark, scikit-learn, pyspark, spark-structured-streaming

I am trying to apply a scikit-learn model, retrieved with pickle, to each row of a structured streaming DataFrame.

I tried using a pandas_udf (code version 1), and it gives me the following errors:

AttributeError: 'numpy.ndarray' object has no attribute 'isnull'
ValueError: Expected 2D array, got 1D array instead:
[.. ... .. ..]
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Code:

inputPath = "/FileStore/df_training/streaming_df_1_nh_nd/"
from pyspark.sql import functions as f
from pyspark.sql.types import *

data_schema = data_spark_ts.schema

import pandas as pd

from pyspark.sql.functions import col, pandas_udf, PandasUDFType   # User Defined Functions for pandas DataFrames
from pyspark.sql.types import LongType

get_prediction = pandas_udf(lambda x: gb2.predict(x), IntegerType())


streamingInputDF = (
  spark
    .readStream                       
    .schema(data_schema)               # Set the schema of the CSV data
    .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
    .csv(inputPath)
    .fillna(0)
    .withColumn("prediction", get_prediction( f.struct([col(x) for x in data_spark.columns]) ))
)

display(streamingInputDF.select("prediction"))
I also tried using a plain udf instead of a pandas_udf, but it gives me the same errors:

AttributeError: 'numpy.ndarray' object has no attribute 'isnull'
ValueError: Expected 2D array, got 1D array instead:
[.. ... .. ..]
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I don't know how to reshape my data.
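For reference, a minimal sketch (with made-up values) of what the reshape hint in that error means: scikit-learn's predict() expects a 2D array of shape (n_samples, n_features), never a flat 1D array.

import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4])  # 1D -> "Expected 2D array, got 1D array"
one_sample = x.reshape(1, -1)       # shape (1, 4): one row with four features
one_feature = x.reshape(-1, 1)      # shape (4, 1): four rows with one feature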

The model I am trying to apply was retrieved this way:

# Load the pickled model
import pickle
gb2 = None

with open('pickle_modello_unico.p', 'rb') as fp:
  gb2 = pickle.load(fp)
Its specification looks like this:

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=300,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

Is there any way to solve this?

I solved the problem by returning a pd.Series from the pandas_udf.

Here is the working code:

inputPath = "/FileStore/df_training/streaming_df_1_nh_nd/"
from pyspark.sql import functions as f
from pyspark.sql.types import *

data_schema = data_spark_ts.schema

import pandas as pd

from pyspark.sql.functions import col, pandas_udf, PandasUDFType   # User Defined Functions for pandas DataFrames
from pyspark.sql.types import LongType

get_prediction = pandas_udf(lambda x: pd.Series(gb2.predict(x)), StringType())


streamingInputDF = (
  spark
    .readStream                       
    .schema(data_schema)               # Set the schema of the CSV data
    .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
    .csv(inputPath)
    .withColumn("prediction", get_prediction( f.struct([col(x) for x in data_spark.columns]) ))
)

display(streamingInputDF.select("prediction"))
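My understanding of why this works, sketched below with a toy classifier standing in for the pickled gb2 (the features f1 and f2 are hypothetical): a scalar pandas_udf receives each batch as pandas data and must return a pd.Series of the same length, whereas scikit-learn's predict() returns a plain numpy ndarray, so wrapping the result in pd.Series restores that contract.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Toy classifier standing in for the pickled gb2 (assumes two numeric features).
clf = GradientBoostingClassifier(n_estimators=10)
clf.fit(np.array([[0, 0], [1, 1], [0, 1], [1, 0]]), [0, 1, 1, 0])

# A scalar pandas_udf batch arrives as pandas data; predict() yields an ndarray.
batch = pd.DataFrame({"f1": [0.0, 1.0], "f2": [0.0, 1.0]})
preds = pd.Series(clf.predict(batch))  # wrap the ndarray in a pd.Series

assert len(preds) == len(batch)        # the Series must match the batch length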


Comment: scikit-learn estimators do not return DataFrames; they return numpy arrays. The error "'numpy.ndarray' object has no attribute 'isnull'" occurs because a numpy array has no isnull() method; use isnan() instead.

Reply from the asker: I never call isnull(); should I be calling isnan() somewhere here?

Comment: I suspect the PySpark fillna() call ahead of the pandas UDF is invoking a function that does not match the underlying data type, but I would need a debugging environment to be sure.
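To illustrate the isnull()/isnan() point with made-up values:

import numpy as np
import pandas as pd

arr = np.array([1.0, np.nan, 3.0])
print(np.isnan(arr))    # [False  True False]: ndarrays work with np.isnan
print(pd.isnull(arr))   # pandas' top-level isnull() also accepts ndarrays
# arr.isnull()          # AttributeError: an ndarray has no isnull() method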