
Python pyspark.ml pipelines: are custom transformers required for basic preprocessing tasks?


Getting started with pyspark.ml and the Pipeline API, I find myself writing custom transformers for typical preprocessing tasks so that I can use them in pipelines. Examples:

from pyspark.ml import Pipeline, Transformer


class CustomTransformer(Transformer):
    # lazy workaround - a transformer needs to have these attributes
    _defaultParamMap = dict()
    _paramMap = dict()
    _params = dict()


class ColumnSelector(CustomTransformer):
    """Transformer that selects a subset of columns
    - to be used as pipeline stage"""

    def __init__(self, columns):
        self.columns = columns

    def _transform(self, data):
        return data.select(self.columns)


class ColumnRenamer(CustomTransformer):
    """Transformer that renames one column"""

    def __init__(self, rename):
        self.rename = rename

    def _transform(self, data):
        (colNameBefore, colNameAfter) = self.rename
        return data.withColumnRenamed(colNameBefore, colNameAfter)


class NaDropper(CustomTransformer):
    """Drops rows with at least one not-a-number element"""

    def __init__(self, cols=None):
        self.cols = cols

    def _transform(self, data):
        return data.dropna(subset=self.cols)


class ColumnCaster(CustomTransformer):
    """Casts one column to a given type"""

    def __init__(self, col, toType):
        self.col = col
        self.toType = toType

    def _transform(self, data):
        return data.withColumn(self.col, data[self.col].cast(self.toType))

They work, but I wonder whether this is a pattern or an anti-pattern: are transformers like these a good way to use the Pipeline API? Was it necessary to implement them, or is equivalent functionality already provided elsewhere?
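For context, the stages above chain like any built-in stage; a minimal, hypothetical usage sketch (the DataFrame and column names are made up, and it reuses the classes and the Pipeline import from the snippet above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", "a"), (None, "b")], ["x", "y"])

# drop rows where x is null, cast x to int, then rename it
pipeline = Pipeline(stages=[
    NaDropper(cols=["x"]),
    ColumnCaster("x", "int"),
    ColumnRenamer(("x", "x_int")),
])
result = pipeline.fit(df).transform(df)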

I'd say it is primarily opinion-based, although it does look unnecessarily verbose, and Python Transformers don't integrate well with the rest of the Pipeline API.

It is also worth pointing out that everything you have here can be achieved just as easily with SQLTransformer. For example:

from pyspark.ml.feature import SQLTransformer

def column_selector(columns):
    # __THIS__ stands for the underlying table of the input DataFrame
    return SQLTransformer(
        statement="SELECT {} FROM __THIS__".format(", ".join(columns))
    )


or an equivalent of the NaDropper above:

def na_dropper(columns):
    return SQLTransformer(
        statement="SELECT * FROM __THIS__ WHERE {}".format(
            " AND ".join(["{} IS NOT NULL".format(x) for x in columns])
        )
    )

With a little effort, you can use SQLAlchemy with the Hive dialect to avoid handwritten SQL.
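For illustration, a rough sketch of that last idea; this is an assumption rather than part of the original answer. It uses SQLAlchemy's 1.x-style expression language plus PyHive's HiveDialect (for backtick quoting, which Spark SQL expects), and the columns x and y are made up:

from pyhive.sqlalchemy_hive import HiveDialect
from sqlalchemy import Column, MetaData, String, Table, select

from pyspark.ml.feature import SQLTransformer

# __THIS__ is the placeholder table that SQLTransformer substitutes
this = Table("__THIS__", MetaData(), Column("x", String), Column("y", String))

# SELECT x, y FROM __THIS__ WHERE x IS NOT NULL, without handwritten SQL
stmt = select([this.c.x, this.c.y]).where(this.c.x.isnot(None))
transformer = SQLTransformer(statement=str(stmt.compile(dialect=HiveDialect())))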

Comments on the answer:

- Could you elaborate on "Python Transformers don't integrate well with the rest of the Pipeline API"? For example, there is no MLWritable by default (although there is a way). SQL is not the elegant alternative I was hoping for, but good answer -> accepted.
- How do you call the custom transformers?
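On the MLWritable point: since Spark 2.3, a custom Python transformer can opt into persistence by keeping its state in real Params and mixing in DefaultParamsReadable/DefaultParamsWritable. A minimal sketch of a persistable variant of the ColumnSelector above (the class name is made up):

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable


class PersistableColumnSelector(Transformer, DefaultParamsReadable,
                                DefaultParamsWritable):
    """Selects a subset of columns; save()/load() work out of the box."""

    columns = Param(Params._dummy(), "columns", "columns to select",
                    typeConverter=TypeConverters.toListString)

    @keyword_only
    def __init__(self, columns=None):
        super(PersistableColumnSelector, self).__init__()
        # keyword_only captures the kwargs actually passed by the caller
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        return dataset.select(self.getOrDefault(self.columns))

A pipeline model built from stages like this can then be saved and restored with PipelineModel.save and PipelineModel.load.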