Error when trying zipWithIndex in Java Spark
I am trying to use zipWithIndex in Spark to add a column with row numbers, as shown below:
val df = sc.parallelize(Seq((1.0, 2.0), (0.0, -1.0), (3.0, 4.0), (6.0, -2.3))).toDF("x", "y")
val rddzip = df.rdd.zipWithIndex;
val newSchema = StructType(df.schema.fields ++ Array(StructField("rowid", LongType, false)))
val dfZippedWithId = spark.createDataFrame(rddzip.map{ case (row, index) => Row.fromSeq(row.toSeq ++ Array(index))}, newSchema)
But this is what I tried in Java:
JavaRDD<Row> rdd = (JavaRDD) df.toJavaRDD().zipWithIndex().map(t -> {
Row r = t._1;
Long index = t._2 + 1;
ArrayList<Object> list = new ArrayList<>();
for(Object item: JavaConverters.seqAsJavaListConverter(r.toSeq()).asJava()) {
list.add(item);
}
return RowFactory.create(JavaConverters.seqAsJavaListConverter(t._1.toSeq()).asJava().add(t._2));
});
StructType newSchema = df.schema()
.add(new StructField(name, DataTypes.LongType, true, Metadata.empty()));
return df.sparkSession().createDataFrame(rdd, newSchema);
Any help?

In the Scala version, what you pass to spark.createDataFrame is an RDD[Row]. In Java, you are passing a JavaPairRDD; you should map it to a JavaRDD<Row> first:
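import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// ss is assumed to be an existing SparkSession
Dataset<Row> df = ss.range(10).toDF();
df.show();
// zipWithIndex yields (row, index) pairs, not rows
JavaPairRDD<Row, Long> rddzip = df.toJavaRDD().zipWithIndex();
// Map each pair back to a Row with the index appended as the last value
JavaRDD<Row> rdd = rddzip.map(s -> {
    Row r = s._1;
    Object[] arr = new Object[r.size() + 1];
    for (int i = 0; i < arr.length - 1; i++) {
        arr[i] = r.get(i);
    }
    arr[arr.length - 1] = s._2;
    return RowFactory.create(arr);
});
// Extend the original schema with the non-nullable rowid column
StructType newSchema = df.schema().add(new StructField("rowid",
        DataTypes.LongType, false, Metadata.empty()));
Dataset<Row> df2 = ss.createDataFrame(rdd, newSchema);
df2.show();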
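A side note on the Java attempt in the question, offered as a likely explanation and not a confirmed diagnosis: `java.util.List.add` returns a `boolean`, not the list, so handing the result of `.add(...)` to `RowFactory.create` would build a row from a single boolean. List views over a Scala `Seq` are also generally not modifiable from Java, so the call may throw before that. The plain-Java sketch below (no Spark needed; `ListAddPitfall` and `append` are hypothetical names) shows the return type and the copy-then-append pattern the answer uses instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListAddPitfall {
    // Hypothetical helper: copies the values into a new array and appends one more.
    static Object[] append(List<Object> values, Object extra) {
        Object[] out = Arrays.copyOf(values.toArray(), values.size() + 1);
        out[values.size()] = extra;
        return out;
    }

    public static void main(String[] args) {
        List<Object> values = new ArrayList<>(Arrays.asList(1.0, 2.0));

        // List.add reports whether the list changed; it does not return the list.
        boolean changed = new ArrayList<>(values).add(0L);
        System.out.println(changed);

        // Copying into a fresh array keeps every original value plus the new one.
        Object[] row = append(values, 0L);
        System.out.println(Arrays.toString(row));
    }
}
```

The array returned by `append` has the same shape as the `Object[]` that the answer passes to `RowFactory.create`.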
Yes, it is the mapping step that I could not write in Java.

I think you can find what you need in this answer. The two show() calls in the Java code given above print:
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

+---+-----+
| id|rowid|
+---+-----+
|  0|    0|
|  1|    1|
|  2|    2|
|  3|    3|
|  4|    4|
|  5|    5|
|  6|    6|
|  7|    7|
|  8|    8|
|  9|    9|
+---+-----+
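The rowid values above start at 0 because zipWithIndex assigns 0-based indices. The question's code used `t._2 + 1`, which suggests 1-based row numbers were wanted; in the Spark lambda that would correspond to storing `s._2 + 1` in place of `s._2`. The plain-Java sketch below imitates the map step on an in-memory list to show the effect of such an offset. `RowIdOffset` and `withRowId` are hypothetical names, and the list index stands in for the index Spark would supply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowIdOffset {
    // Hypothetical stand-in for the map step: appends (index + offset) to each row's values.
    static List<Object[]> withRowId(List<Object[]> rows, long offset) {
        List<Object[]> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            Object[] src = rows.get(i);
            Object[] dst = Arrays.copyOf(src, src.length + 1);
            dst[src.length] = i + offset; // int + long is a long, boxed as Long
            out.add(dst);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {0L});
        rows.add(new Object[] {1L});

        // An offset of 1 gives 1-based row ids; an offset of 0 matches zipWithIndex.
        for (Object[] row : withRowId(rows, 1L)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```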