Creating a custom Transformer in Java Spark ML
I want to create a custom Spark Transformer in Java. The transformer is a text preprocessor that acts like a Tokenizer; it takes an input column and an output column as parameters. Looking around, I found the two Scala traits HasInputCol and HasOutputCol. How can I create a class that extends Transformer and implements HasInputCol and HasOutputCol? My goal is to have something like this:
// Dataset that has a String column named "text"
Dataset<Row> dataset;

CustomTransformer customTransformer = new CustomTransformer();
customTransformer.setInputCol("text");
customTransformer.setOutputCol("result");

// result has two columns, named "text" and "result"
Dataset<Row> result = customTransformer.transform(dataset);
You probably want your CustomTransformer to extend UnaryTransformer. You could try something like this:
import org.apache.spark.ml.UnaryTransformer;
import org.apache.spark.ml.util.Identifiable$;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import scala.Function1;
import scala.collection.JavaConversions$;
import scala.collection.immutable.Seq;

import java.util.Arrays;

public class MyCustomTransformer extends UnaryTransformer<String, scala.collection.immutable.Seq<String>, MyCustomTransformer>
{
    private final String uid = Identifiable$.MODULE$.randomUID("mycustom");

    @Override
    public String uid()
    {
        return uid;
    }

    @Override
    public Function1<String, scala.collection.immutable.Seq<String>> createTransformFunc()
    {
        // can't use lambda syntax :(
        return new scala.runtime.AbstractFunction1<String, Seq<String>>()
        {
            @Override
            public Seq<String> apply(String s)
            {
                // do the logic
                String[] split = s.toLowerCase().split("\\s");
                // convert to Scala type
                return JavaConversions$.MODULE$.iterableAsScalaIterable(Arrays.asList(split)).toList();
            }
        };
    }

    @Override
    public void validateInputType(DataType inputType)
    {
        super.validateInputType(inputType);
        if (inputType != DataTypes.StringType)
            throw new IllegalArgumentException("Input type must be string type but got " + inputType + ".");
    }

    @Override
    public DataType outputDataType()
    {
        return DataTypes.createArrayType(DataTypes.StringType, true); // or false? depends on your data
    }
}
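The tokenizing logic inside createTransformFunc can be sanity-checked outside Spark. A minimal plain-Java sketch (the class name TokenizeDemo is made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {
    // Same core logic as the transform function: lowercase, then split on whitespace.
    static List<String> tokenize(String s) {
        return Arrays.asList(s.toLowerCase().split("\\s"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello Spark World")); // [hello, spark, world]
    }
}
```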
As suggested, you may extend UnaryTransformer. However, it is quite tricky.

NOTE: all the comments below apply to Spark version 2.2.0.

To address the reported issue, where they got "...Param null__inputCol does not belong to...", you should implement String uid() like this:
// note: uid is now a plain (non-final) field, initialized lazily
private String uid;

@Override
public String uid() {
    return getUid();
}

private String getUid() {
    if (uid == null) {
        uid = Identifiable$.MODULE$.randomUID("mycustom");
    }
    return uid;
}
Apparently, they initialize uid in the constructor. The problem is that UnaryTransformer's inputCol (and outputCol) are initialized before uid gets initialized in the inheriting class. See HasInputCol:

final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")

This is how Param is constructed:

def this(parent: Identifiable, name: String, doc: String) = this(parent.uid, name, doc)

So when parent.uid is evaluated, the custom uid() implementation is called, and at that point uid is still null. By implementing uid() with lazy evaluation, you make sure uid() never returns null.
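This pitfall is really just Java's initialization order: a superclass constructor runs before the subclass's field initializers, so an overridden method called from that constructor sees the subclass field as null. A minimal Spark-free sketch with made-up class names (the field uses a non-constant initializer, otherwise the compiler would inline it as a constant):

```java
import java.util.UUID;

class Base {
    final String seenDuringConstruction;

    Base() {
        // Calls the overridden uid() BEFORE Sub's field initializer has run.
        seenDuringConstruction = uid();
    }

    String uid() { return "base"; }
}

class Sub extends Base {
    // Assigned only after super() returns; non-constant, so not inlined.
    private final String uid = "mycustom_" + UUID.randomUUID();

    @Override
    String uid() { return uid; }
}

public class InitOrderDemo {
    public static void main(String[] args) {
        Sub s = new Sub();
        System.out.println(s.seenDuringConstruction); // null
        System.out.println(s.uid() != null);          // true
    }
}
```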
In your case, though:

Param d7ac3108-799c-4aed-a093-c85d12833a4e__inputCol does not belong to fe3d99ba-e4eb-4e95-9412-f84188d936e3

the error seems a bit different. Since "d7ac3108-799c-4aed-a093-c85d12833a4e" != "fe3d99ba-e4eb-4e95-9412-f84188d936e3", it looks like your implementation of the uid() method returns a new value on every call. It was probably implemented like this:
@Override
public String uid() {
    return Identifiable$.MODULE$.randomUID("mycustom");
}
By the way, when extending UnaryTransformer, make sure the transform function is serializable.
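The serializability trap is easy to hit with anonymous inner classes, because they capture a hidden reference to their enclosing instance. A Spark-free sketch (interface and class names are invented for illustration; Spark's real function type is scala.Function1):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializableFnDemo {
    // Stand-in for a serializable transform function.
    interface StringFn extends Serializable {
        String apply(String s);
    }

    // A transformer-like owner that is NOT serializable.
    static class Owner {
        StringFn makeFn() {
            // Anonymous inner class in an instance method: captures Owner.this.
            return new StringFn() {
                @Override public String apply(String s) { return s.toLowerCase(); }
            };
        }
    }

    // Created in a static context: no enclosing instance is captured.
    static StringFn makeStandaloneFn() {
        return new StringFn() {
            @Override public String apply(String s) { return s.toLowerCase(); }
        };
    }

    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false; // dragged in a non-serializable enclosing instance
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new Owner().makeFn())); // false
        System.out.println(canSerialize(makeStandaloneFn()));   // true
    }
}
```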
Here is an example with only an input column, but you can easily add an output column following the same pattern. It does not fully implement the reader and writer, though; you will need to look at the link above to see how to do that.
import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.param.StringArrayParam;
import org.apache.spark.ml.util.DefaultParamsWritable;
import org.apache.spark.ml.util.MLReader;
import org.apache.spark.ml.util.MLWriter;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.collection.JavaConverters;
import scala.collection.Seq;

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

public class DropColumns extends Transformer implements Serializable, DefaultParamsWritable {

    private StringArrayParam _inputCols;
    private final String _uid;

    public DropColumns(String uid) {
        _uid = uid;
    }

    public DropColumns() {
        _uid = DropColumns.class.getName() + "_" + UUID.randomUUID().toString();
    }

    // Getters
    public String[] getInputCols() { return get(_inputCols).get(); }

    // Setters
    public DropColumns setInputCols(String[] columns) {
        _inputCols = inputCols();
        set(_inputCols, columns);
        return this;
    }

    public DropColumns setInputCols(List<String> columns) {
        String[] columnsString = columns.toArray(new String[columns.size()]);
        return setInputCols(columnsString);
    }

    public DropColumns setInputCols(String column) {
        String[] columns = new String[]{column};
        return setInputCols(columns);
    }

    // Overrides
    @Override
    public Dataset<Row> transform(Dataset<?> data) {
        List<String> dropCol = new ArrayList<String>();
        Dataset<Row> newData = null;
        try {
            for (String currColumn : this.get(_inputCols).get()) {
                dropCol.add(currColumn);
            }
            Seq<String> seqCol = JavaConverters.asScalaIteratorConverter(dropCol.iterator()).asScala().toSeq();
            newData = data.drop(seqCol);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return newData;
    }

    @Override
    public Transformer copy(ParamMap extra) {
        DropColumns copied = new DropColumns();
        copied.setInputCols(this.getInputCols());
        return copied;
    }

    @Override
    public StructType transformSchema(StructType oldSchema) {
        StructField[] fields = oldSchema.fields();
        List<StructField> newFields = new ArrayList<StructField>();
        List<String> columnsToRemove = Arrays.asList(get(_inputCols).get());
        for (StructField currField : fields) {
            String fieldName = currField.name();
            if (!columnsToRemove.contains(fieldName)) {
                newFields.add(currField);
            }
        }
        return DataTypes.createStructType(newFields);
    }

    @Override
    public String uid() {
        return _uid;
    }

    // DropColumnsWriter and DropColumnsReader are not shown here
    @Override
    public MLWriter write() {
        return new DropColumnsWriter(this);
    }

    @Override
    public void save(String path) throws IOException {
        write().save(path); // MLWriter.saveImpl is protected; call save() instead
    }

    public static MLReader<DropColumns> read() {
        return new DropColumnsReader();
    }

    public StringArrayParam inputCols() {
        return new StringArrayParam(this, "inputCols", "Columns to be dropped");
    }

    public DropColumns load(String path) {
        return ((DropColumnsReader) read()).load(path);
    }
}