Creating a custom Transformer in Java Spark ML

Tags: java, scala, apache-spark, apache-spark-mllib, transformer

I want to create a custom Spark Transformer in Java.

The transformer is a text preprocessor that acts like a Tokenizer. It takes an input column and an output column as parameters.

I looked around and found two Scala traits, HasInputCol and HasOutputCol.

How can I create a class that extends Transformer and implements HasInputCol and HasOutputCol?

My goal is to have something like this:

   // Dataset that has a String column named "text"
   Dataset<Row> dataset;

   CustomTransformer customTransformer = new CustomTransformer();
   customTransformer.setInputCol("text");
   customTransformer.setOutputCol("result");

   // result has 2 String columns named "text" and "result"
   Dataset<Row> result = customTransformer.transform(dataset);

You probably want to derive your CustomTransformer from UnaryTransformer. You can try something like this:

import org.apache.spark.ml.UnaryTransformer;
import org.apache.spark.ml.util.Identifiable$;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import scala.Function1;
import scala.collection.JavaConversions$;
import scala.collection.immutable.Seq;

import java.util.Arrays;

public class MyCustomTransformer extends UnaryTransformer<String, scala.collection.immutable.Seq<String>, MyCustomTransformer>
{
    private final String uid = Identifiable$.MODULE$.randomUID("mycustom");

    @Override
    public String uid()
    {
        return uid;
    }


    @Override
    public Function1<String, scala.collection.immutable.Seq<String>> createTransformFunc()
    {
        // can't use lambda syntax :( (scala.Function1 is not a Java functional interface here)
        return new scala.runtime.AbstractFunction1<String, Seq<String>>()
        {
            @Override
            public Seq<String> apply(String s)
            {
                // the transformation logic: lowercase, then split on whitespace
                String[] split = s.toLowerCase().split("\\s");
                // convert the Java list to an immutable Scala Seq
                return JavaConversions$.MODULE$.iterableAsScalaIterable(Arrays.asList(split)).toList();
            }
        };
    }


    @Override
    public void validateInputType(DataType inputType)
    {
        super.validateInputType(inputType);
        if (inputType != DataTypes.StringType)
            throw new IllegalArgumentException("Input type must be string type but got " + inputType + ".");
    }

    @Override
    public DataType outputDataType()
    {
        return DataTypes.createArrayType(DataTypes.StringType, true); // or false? depends on your data
    }
}
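
For completeness, here is a short usage sketch matching the question's goal (assuming dataset is a Dataset<Row> with a String column named "text"):

MyCustomTransformer transformer = new MyCustomTransformer();
transformer.setInputCol("text").setOutputCol("result");
Dataset<Row> result = transformer.transform(dataset); // "result" holds the lowercased tokens as array<string>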
As suggested, you can extend UnaryTransformer. However, it is quite tricky.

Note: all of the comments below apply to Spark version 2.2.0.

To address the reported "...Param null__inputCol does not belong to..." error, you should implement String uid() as follows:

@Override
public String uid() {
    return getUid();
}

// lazy initialization: requires a non-final field (private String uid;)
// and guarantees uid() never returns null, even when it is called from
// the superclass's initializers
private String getUid() {
    if (uid == null) {
        uid = Identifiable$.MODULE$.randomUID("mycustom");
    }
    return uid;
}
Apparently, they initialized uid in the constructor. The catch is that UnaryTransformer's inputCol (and outputCol) is initialized before uid is initialized in the inheriting class. See HasInputCol:

final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")

And this is how Param is constructed:

def this(parent: Identifiable, name: String, doc: String) = this(parent.uid, name, doc)

So when parent.uid is evaluated, the custom uid() implementation is called, and at that point uid is still null. By implementing uid() with lazy evaluation, you make sure uid() never returns null.
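
The same initialization-order pitfall can be demonstrated without Spark at all. A minimal Java sketch (the class names are mine, purely illustrative):

class Parent {
    // runs during Parent's initialization, before any Child field is assigned
    final String tag = name();

    String name() { return "parent"; }
}

class Child extends Parent {
    private final String name = "child"; // assigned only after Parent's initializers have run

    @Override
    String name() { return name; } // returns null when invoked from Parent's initializer
}

public class InitOrderDemo {
    public static void main(String[] args) {
        System.out.println(new Child().tag); // prints "null"
    }
}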

In your case, however:

Param d7ac3108-799c-4aed-a093-c85d12833a4e__inputCol does not belong to fe3d99ba-e4eb-4e95-9412-f84188d936e3

looks a bit different. Since "d7ac3108-799c-4aed-a093-c85d12833a4e" != "fe3d99ba-e4eb-4e95-9412-f84188d936e3", it seems that your implementation of the uid() method returns a new value on every call. Perhaps in your case it was implemented like this:

@Override
public String uid() {
    return Identifiable$.MODULE$.randomUID("mycustom");
}

By the way, when extending UnaryTransformer, make sure the transform function is serializable.
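
One way to achieve that (a sketch; the class name SplitIntoWords is mine, not from the answer): use a named class that implements Serializable instead of an anonymous inner class, which would also capture the enclosing transformer instance:

import java.io.Serializable;
import java.util.Arrays;
import scala.collection.JavaConversions$;
import scala.collection.immutable.Seq;

// hypothetical serializable transform function for createTransformFunc()
class SplitIntoWords extends scala.runtime.AbstractFunction1<String, Seq<String>> implements Serializable {
    @Override
    public Seq<String> apply(String s) {
        String[] split = s.toLowerCase().split("\\s");
        return JavaConversions$.MODULE$.iterableAsScalaIterable(Arrays.asList(split)).toList();
    }
}

createTransformFunc() can then simply return new SplitIntoWords();.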

Here is an example with only an input column, but you can easily add an output column following the same pattern. Note that it does not implement the reader and writer; you will need to look at the link above to see how to do that.

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.param.StringArrayParam;
import org.apache.spark.ml.util.DefaultParamsWritable;
import org.apache.spark.ml.util.MLReader;
import org.apache.spark.ml.util.MLWriter;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.collection.JavaConverters;
import scala.collection.Seq;

public class DropColumns extends Transformer implements Serializable, DefaultParamsWritable {

    private StringArrayParam _inputCols;
    private final String _uid;

    public DropColumns(String uid) {
        _uid = uid;
    }

    public DropColumns() {
        _uid = DropColumns.class.getName() + "_" + UUID.randomUUID().toString();
    }

    // Getters
    public String[] getInputCols() { return get(_inputCols).get(); }

    // Setters
    public DropColumns setInputCols(String[] columns) {
        _inputCols = inputCols();
        set(_inputCols, columns);
        return this;
    }

    public DropColumns setInputCols(List<String> columns) {
        String[] columnsString = columns.toArray(new String[columns.size()]);
        return setInputCols(columnsString);
    }

    public DropColumns setInputCols(String column) {
        String[] columns = new String[]{column};
        return setInputCols(columns);
    }

    // Overrides
    @Override
    public Dataset<Row> transform(Dataset<?> data) {
        List<String> dropCol = new ArrayList<String>();
        Dataset<Row> newData = null;
        try {
            for (String currColumn : this.get(_inputCols).get()) {
                dropCol.add(currColumn);
            }
            Seq<String> seqCol = JavaConverters.asScalaIteratorConverter(dropCol.iterator()).asScala().toSeq();
            newData = data.drop(seqCol);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return newData;
    }

    @Override
    public Transformer copy(ParamMap extra) {
        DropColumns copied = new DropColumns();
        copied.setInputCols(this.getInputCols());
        return copied;
    }

    @Override
    public StructType transformSchema(StructType oldSchema) {
        StructField[] fields = oldSchema.fields();
        List<StructField> newFields = new ArrayList<StructField>();
        List<String> columnsToRemove = Arrays.asList(get(_inputCols).get());
        for (StructField currField : fields) {
            String fieldName = currField.name();
            if (!columnsToRemove.contains(fieldName)) {
                newFields.add(currField);
            }
        }
        StructType schema = DataTypes.createStructType(newFields);
        return schema;
    }

    @Override
    public String uid() {
        return _uid;
    }

    @Override
    public MLWriter write() {
        return new DropColumnsWriter(this);
    }

    @Override
    public void save(String path) throws IOException {
        write().save(path); // save() is the public MLWriter API (saveImpl is protected)
    }

    public static MLReader<DropColumns> read() {
        return new DropColumnsReader();
    }

    public StringArrayParam inputCols() {
        return new StringArrayParam(this, "inputCols", "Columns to be dropped");
    }

    public DropColumns load(String path) {
        return ((DropColumnsReader) read()).load(path);
    }
}
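
A brief usage sketch (the DataFrame df and its column names are hypothetical):

// df has columns "a", "b" and "c"
DropColumns dropper = new DropColumns().setInputCols(new String[]{"b", "c"});
Dataset<Row> pruned = dropper.transform(df); // the schema of pruned contains only "a"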