Spark MLlib classification input format using Java
How do I convert a list of DTOs into the input Dataset format that Spark ML expects? I have this DTO:
public class MachineLearningDTO implements Serializable {
    private double label;
    private double[] features;

    public MachineLearningDTO() {
    }

    public MachineLearningDTO(double label, double[] features) {
        this.label = label;
        this.features = features;
    }

    public double getLabel() {
        return label;
    }

    public void setLabel(double label) {
        this.label = label;
    }

    public double[] getFeatures() {
        return features;
    }

    public void setFeatures(double[] features) {
        this.features = features;
    }
}
and this code:
Dataset<MachineLearningDTO> mlInputDataSet = spark.createDataset(mlInputData, Encoders.bean(MachineLearningDTO.class));
LogisticRegression logisticRegression = new LogisticRegression();
LogisticRegressionModel model = logisticRegression.fit(MLUtils.convertMatrixColumnsToML(mlInputDataSet));
Then I get:
java.lang.UnsupportedOperationException: Cannot infer type for class
org.apache.spark.ml.linalg.VectorUDT because it is not bean-compliant
at
org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:437)
I figured it out. In case anyone else gets stuck on this, I wrote a simple converter, and it works:
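// Builds the (label, features) DataFrame that spark.ml estimators expect.
// Vectors and VectorUDT must come from org.apache.spark.ml.linalg,
// not from org.apache.spark.mllib.linalg.
private Dataset<Row> convertToMlInputFormat(List<MachineLearningDTO> data) {
    // One Row per DTO: the double label plus the features wrapped in a dense ML vector.
    List<Row> rowData = data.stream()
            .map(dto -> RowFactory.create(dto.getLabel(), Vectors.dense(dto.getFeatures())))
            .collect(Collectors.toList());
    // Explicit schema, so Spark does not have to infer the vector type from a bean.
    StructType schema = new StructType(new StructField[]{
            new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("features", new VectorUDT(), false, Metadata.empty())
    });
    return spark.createDataFrame(rowData, schema);
}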
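For completeness, here is a minimal sketch of how a DataFrame built this way could be fed to LogisticRegression. It is an illustration under assumptions, not part of the original answer: the class name MlInputExample, the helpers mlSchema and toRows, the local[*] master, and the four training points are all made up for the example. It relies on the default column names "label" and "features" that spark.ml estimators look for.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical example class; the names are illustrative only.
public class MlInputExample {

    // Schema with a double "label" column and an ML vector "features" column.
    static StructType mlSchema() {
        return new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("features", new VectorUDT(), false, Metadata.empty())
        });
    }

    // Turns parallel arrays of labels and feature arrays into Rows matching mlSchema().
    static List<Row> toRows(double[] labels, double[][] features) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            rows.add(RowFactory.create(labels[i], Vectors.dense(features[i])));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumption: a local session is good enough for trying this out.
        SparkSession spark = SparkSession.builder()
                .appName("MlInputExample")
                .master("local[*]")
                .getOrCreate();

        // Made-up training data: two examples per class.
        double[] labels = {0.0, 0.0, 1.0, 1.0};
        double[][] features = {{0.1, 1.2}, {0.3, 0.9}, {2.1, 0.2}, {1.8, 0.4}};

        Dataset<Row> training = spark.createDataFrame(toRows(labels, features), mlSchema());

        // fit() picks up the "label" and "features" columns by their default names.
        LogisticRegressionModel model = new LogisticRegression()
                .setMaxIter(10)
                .fit(training);

        // transform() appends a "prediction" column, among others.
        model.transform(training).select("label", "prediction").show();

        spark.stop();
    }
}
```

With the explicit schema there is no need for Encoders.bean or MLUtils at all, because the features column is already an org.apache.spark.ml.linalg vector when it reaches fit().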