How to combine a Spark RDD and PairRDD in Java
I have a dataset with columns userId (String), itemId (int), and rating (int).

I want to map the string userId to a unique long value. I tried mapping the user ids with zipWithUniqueId(), which gives a PairRDD:
+------------+----------------+
| userId | userIdMapped |
+------------+----------------+
| abc13 | 0 |
+------------+----------------+
| qwe34 | 1 |
+------------+----------------+
I want to add the long value as another column and create a dataset like this:
+----------+----------+---------+----------------+
| userId | itemId | rating | userIdMapped |
+----------+----------+---------+----------------+
| abc13 | 23 | 1 | 0 |
+----------+----------+---------+----------------+
| qwe34 | 56 | 3 | 1 |
+----------+----------+---------+----------------+
| qwe34 | 35 | 4 | 1 |
+----------+----------+---------+----------------+
Instead, the userIdMap built by my attempt below comes out as a cross product:
+-----------------+----------------+
| userIdMapped | value |
+-----------------+----------------+
| 0 | abc13 |
+-----------------+----------------+
| 0 | qwe34 |
+-----------------+----------------+
| 1 | abc13 |
+-----------------+----------------+
| 1 | qwe34 |
+-----------------+----------------+
I tried the following:
JavaRDD<Feedback> feedbackRDD = spark.read().jdbc(MYSQL_CONNECTION_URL, feedbackQuery, connectionProperties)
        .javaRDD().map(Feedback.mapFunc);
JavaPairRDD<String, Long> mappedPairRDD = feedbackRDD.map(new Function<Feedback, String>() {
    public String call(Feedback p) throws Exception {
        return p.getUserId();
    }
}).distinct().zipWithUniqueId();
Dataset<Row> feedbackDS = spark.createDataFrame(feedbackRDD, Feedback.class);
Dataset<String> stringIds = spark.createDataset(mappedPairRDD.keys().collect(), Encoders.STRING());
Dataset<Long> valueIds = spark.createDataset(mappedPairRDD.values().collect(), Encoders.LONG());
Dataset<Row> longIds = valueIds.withColumnRenamed("value", "userIdMapped");
Dataset<Row> userIdMap = longIds.join(stringIds);
Dataset<Row> feedbackDSUserMapped = feedbackDS.join(userIdMap, feedbackDS.col("userId").equalTo(userIdMap.col("value")),
        "inner");
// Here the 'value' column contains the string user ids
Hence the resulting feedbackDSUserMapped is wrong.

I'm new to Spark, and I believe there must be a better way to do this. What is the best way to take the long values from the PairRDD and set them on the matching userId rows of the initial dataset (RDD)?

Any help would be much appreciated.
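Stripped of Spark, the mapping I'm after is just "assign each distinct userId the next sequential long id, then look it up per row". A minimal plain-Java sketch of that intent (the class and method names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UserIdMapping {
    // Assign each distinct userId the next sequential long id,
    // in order of first appearance.
    public static Map<String, Long> buildIdMap(List<String> userIds) {
        Map<String, Long> idMap = new LinkedHashMap<>();
        for (String id : userIds) {
            if (!idMap.containsKey(id)) {
                idMap.put(id, (long) idMap.size());
            }
        }
        return idMap;
    }

    public static void main(String[] args) {
        List<String> userIds = Arrays.asList("abc13", "qwe34", "qwe34");
        // prints {abc13=0, qwe34=1}
        System.out.println(buildIdMap(userIds));
    }
}
```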
The data will be used for an ALS model.

You can try the approach below: assign unique ids using built-in functions, then join back to the original dataset:
/**
 * Created by RGOVIND on 11/16/2016.
 */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;

import java.util.ArrayList;
import java.util.List;

public class SparkUserObjectMain {
    static public void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Stack Overflow App");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        List<UserObject> users = new ArrayList<UserObject>();

        // Seed the data
        UserObject user1 = new UserObject("abc13", "23", "1");
        UserObject user2 = new UserObject("qwe34", "56", "3");
        UserObject user3 = new UserObject("qwe34", "35", "4");
        users.add(user1);
        users.add(user2);
        users.add(user3);

        // How to encode the object? Use a bean encoder.
        Encoder<UserObject> userObjectEncoder = Encoders.bean(UserObject.class);
        // Create the user dataset
        Dataset<UserObject> usersDataSet = sqlContext.createDataset(users, userObjectEncoder);

        // Assign unique ids
        Dataset<Row> uniqueUsersWithId = usersDataSet.dropDuplicates("userId")
                .select("userId")
                .withColumn("id", functions.monotonically_increasing_id());

        // Join with the original
        Dataset<Row> joinedDataSet = usersDataSet.join(uniqueUsersWithId, "userId");
        joinedDataSet.show();
    }
}
Prints:
+------+------+------+------------+
|userId|itemId|rating| id|
+------+------+------+------------+
| abc13| 23| 1|403726925824|
| qwe34| 56| 3|901943132160|
| qwe34| 35| 4|901943132160|
+------+------+------+------------+
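As an aside (not part of the original answer): if contiguous ids starting at 0 are wanted on the RDD side, zipWithIndex() could be used instead of zipWithUniqueId(). A sketch assuming the feedbackRDD from the question:

```java
// zipWithIndex assigns contiguous indices 0..n-1 (at the cost of an extra
// pass over the data), unlike zipWithUniqueId() and
// monotonically_increasing_id(), whose ids are unique but sparse.
JavaPairRDD<String, Long> idByUser = feedbackRDD
        .map(Feedback::getUserId)
        .distinct()
        .zipWithIndex();
```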
Using StringIndexer solved the problem:
StringIndexer indexer = new StringIndexer()
.setInputCol("userId")
.setOutputCol("userIdMapped");
Dataset<Row> userJoinedDataSet = indexer.fit(feedbackDS).transform(feedbackDS);
This works, but unfortunately the data must be used for an ALS model, and monotonically_increasing_id() generates values beyond the int range. Thanks for the detailed answer though :)
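For the int-range concern: StringIndexer assigns contiguous label indices starting at 0.0, emitted as a double column, so they can safely be cast down to int for ALS. A sketch along the lines of the StringIndexer snippet above, assuming the feedbackDS from the question:

```java
import static org.apache.spark.sql.functions.col;

StringIndexer indexer = new StringIndexer()
        .setInputCol("userId")
        .setOutputCol("userIdMappedDouble");
Dataset<Row> indexed = indexer.fit(feedbackDS).transform(feedbackDS)
        // StringIndexer emits a double; ALS needs ids within int range
        .withColumn("userIdMapped", col("userIdMappedDouble").cast("int"))
        .drop("userIdMappedDouble");
```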