
How to combine a Spark RDD and PairRDD in Java


I have a dataset with userId (String), itemId (int), and rating (int) columns.

I want to map the string userId to a unique long value. I tried mapping the user IDs using zipWithUniqueId(), which gives a JavaPairRDD:

+------------+----------------+
|   userId   |  userIdMapped  |
+------------+----------------+
|    abc13   |        0       |   
+------------+----------------+
|    qwe34   |        1       |   
+------------+----------------+
I want to add the long value as another column and create a dataset as below:

+----------+----------+---------+----------------+
| userId   |  itemId  |  rating |  userIdMapped  |
+----------+----------+---------+----------------+
|  abc13   |    23    |    1    |       0        |
+----------+----------+---------+----------------+
|  qwe34   |    56    |    3    |       1        |
+----------+----------+---------+----------------+
|  qwe34   |    35    |    4    |       1        |
+----------+----------+---------+----------------+
However, the userIdMap dataset built by my attempt (shown below) ended up cross-joined:

+-----------------+----------------+
|   userIdMapped  |     value      |
+-----------------+----------------+
|         0       |     abc13      |
+-----------------+----------------+
|         0       |     qwe34      |
+-----------------+----------------+
|         1       |     abc13      |
+-----------------+----------------+
|         1       |     qwe34      |
+-----------------+----------------+
Here is what I tried:

JavaRDD<Feedback> feedbackRDD = spark.read().jdbc(MYSQL_CONNECTION_URL, feedbackQuery, connectionProperties)
            .javaRDD().map(Feedback.mapFunc);
// Extract the distinct user ids and pair each with a unique long
JavaPairRDD<String, Long> mappedPairRDD = feedbackRDD.map(new Function<Feedback, String>() {
    public String call(Feedback p) throws Exception {
        return p.getUserId();
    }
}).distinct().zipWithUniqueId();
Dataset<Row> feedbackDS = spark.createDataFrame(feedbackRDD, Feedback.class);
Dataset<String> stringIds = spark.createDataset(mappedPairRDD.keys().collect(), Encoders.STRING());
Dataset<Long> valueIds = spark.createDataset(mappedPairRDD.values().collect(), Encoders.LONG());
Dataset<Row> longIds = valueIds.withColumnRenamed("value", "userIdMapped");
// No join condition here, which is what produces the cross join above
Dataset<Row> userIdMap = longIds.join(stringIds);
Dataset<Row> feedbackDSUserMapped = feedbackDS.join(userIdMap, feedbackDS.col("userId").equalTo(userIdMap.col("value")),
            "inner");
// Here the 'value' column contains the string user ids
Hence the resulting feedbackDSUserMapped was wrong.

I am new to Spark, and I believe there must be a better way of doing this.

What is the best way to get the long values from the JavaPairRDD and set them for the relevant userId in the initial dataset (RDD)?

Any help would be much appreciated. The data is to be used for an ALS model.
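
For reference, one way to make the zip-based attempt work is to build the (userId, userIdMapped) mapping as a single two-column DataFrame straight from the pair RDD, so the final join has a real key instead of a cross product. A minimal sketch, reusing the spark session, mappedPairRDD, and feedbackDS from the snippet above (needs org.apache.spark.sql.RowFactory and org.apache.spark.sql.types.*):

// Turn each (userId, uniqueId) pair into a Row
JavaRDD<Row> mapRows = mappedPairRDD.map(t -> RowFactory.create(t._1(), t._2()));
// Explicit schema: one string column and one long column
StructType mapSchema = new StructType(new StructField[]{
        new StructField("userId", DataTypes.StringType, false, Metadata.empty()),
        new StructField("userIdMapped", DataTypes.LongType, false, Metadata.empty())
});
Dataset<Row> userIdMap = spark.createDataFrame(mapRows, mapSchema);
// Equi-join on the shared userId column instead of an unconditioned join
Dataset<Row> feedbackDSUserMapped = feedbackDS.join(userIdMap, "userId");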


You can try the following approach: assign unique ids using the built-in functions and a join back to the original dataset.

/**
 * Created by RGOVIND on 11/16/2016.
 */

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;

import java.util.ArrayList;
import java.util.List;

public class SparkUserObjectMain {
    static public void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Stack Overflow App");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        List<UserObject> users = new ArrayList<UserObject>();

        //seed the data
        UserObject user1 = new UserObject("abc13", "23", "1");
        UserObject user2 = new UserObject("qwe34", "56", "3");
        UserObject user3 = new UserObject("qwe34", "35", "4");
        users.add(user1);
        users.add(user2);
        users.add(user3);

        // Encoder for the UserObject bean
        Encoder<UserObject> userObjectEncoder = Encoders.bean(UserObject.class);
        //Create the user dataset
        Dataset<UserObject> usersDataSet = sqlContext.createDataset(users, userObjectEncoder);
        // assign unique ids
        Dataset<Row> uniqueUsersWithId = usersDataSet.dropDuplicates("userId").select("userId").withColumn("id", functions.monotonically_increasing_id());
        //join with original
        Dataset<Row> joinedDataSet = usersDataSet.join(uniqueUsersWithId, "userId");
        joinedDataSet.show();

    }
}
This prints:

+------+------+------+------------+
|userId|itemId|rating|          id|
+------+------+------+------------+
| abc13|    23|     1|403726925824|
| qwe34|    56|     3|901943132160|
| qwe34|    35|     4|901943132160|
+------+------+------+------------+
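
The snippet above assumes a UserObject bean that is not shown in the answer; a minimal sketch of what it could look like (the field types are a guess based on the string constructor arguments used above):

public class UserObject implements java.io.Serializable {
    private String userId;
    private String itemId;
    private String rating;

    // Encoders.bean requires a public no-arg constructor
    public UserObject() {}

    public UserObject(String userId, String itemId, String rating) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
    }

    // Encoders.bean also requires public getters/setters for every field
    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }
    public String getItemId() { return itemId; }
    public void setItemId(String itemId) { this.itemId = itemId; }
    public String getRating() { return rating; }
    public void setRating(String rating) { this.rating = rating; }
}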

Solved the problem using StringIndexer:

StringIndexer indexer = new StringIndexer()
              .setInputCol("userId")
              .setOutputCol("userIdMapped");
Dataset<Row> userJoinedDataSet = indexer.fit(feedbackDS).transform(feedbackDS);

This works, but unfortunately the data has to be used in an ALS model, and monotonically_increasing_id() generates values beyond the int range. Thanks for the detailed answer though :)
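
A note on why StringIndexer helps here: it assigns dense indices starting at 0, so the values stay well inside the int range, but it emits them as doubles. If an integer column is required (for example by the RDD-based ALS Rating class), the indexed column can be cast explicitly. A minimal sketch, assuming the userJoinedDataSet from above (needs org.apache.spark.sql.types.DataTypes):

// Cast the double index produced by StringIndexer down to int for ALS
Dataset<Row> alsReady = userJoinedDataSet.withColumn("userIdMapped",
        userJoinedDataSet.col("userIdMapped").cast(DataTypes.IntegerType));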