Apache Spark Java: add a vector built from the values of different columns as a new column to a DataFrame


Suppose we have a DataFrame with four columns A, B, C, D. What I want is to combine the values of columns B, C, and D into a vector and add it to the existing DataFrame as a new column (say column E). I want to do this directly within the DataFrame, without converting it to an RDD, adding the vector values there, and then converting it back to a DataFrame, because that is not a good solution.


So I am looking for a Java solution that does this directly with the DataFrame API.

For this scenario you can use VectorAssembler. Here is some sample code:

import java.util.Arrays;

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.*;

import static org.apache.spark.sql.types.DataTypes.*;

StructType schema = createStructType(new StructField[]{
  createStructField("id", IntegerType, false),
  createStructField("hour", IntegerType, false),
  createStructField("mobile", DoubleType, false),
  createStructField("userFeatures", new VectorUDT(), false),
  createStructField("clicked", DoubleType, false)
});
// Build a one-row example DataFrame; "spark" is an existing SparkSession.
Row row = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
Dataset<Row> dataset = spark.createDataFrame(Arrays.asList(row), schema);

// Combine the hour, mobile and userFeatures columns into one vector column "features".
VectorAssembler assembler = new VectorAssembler()
  .setInputCols(new String[]{"hour", "mobile", "userFeatures"})
  .setOutputCol("features");

Dataset<Row> output = assembler.transform(dataset);
System.out.println(output.select("features", "clicked").first());
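Applied to your scenario, a minimal sketch could look like the following, assuming df is your existing Dataset&lt;Row&gt; and that B, C and D are numeric columns (the column names are taken from the question):

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// df is assumed to be your existing DataFrame with columns A, B, C, D.
// Combine B, C and D into a single vector column E; the original columns are kept.
VectorAssembler toVector = new VectorAssembler()
  .setInputCols(new String[]{"B", "C", "D"})
  .setOutputCol("E");

Dataset<Row> withE = toVector.transform(df);
withE.show(false);

transform() appends the new vector column without touching the existing ones, so no detour through an RDD is needed.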
For more usage details, see the Spark ML VectorAssembler documentation.


Hope this works.
