Java: Adding fields to a CSV with Spark

java apache-spark


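I have a CSV that contains spatial (latitude, longitude) and temporal (timestamp) data.

To make it useful to us, we convert the spatial information to a "geohash" and the temporal information to a "timehash".

The question is: how can I add the geohash and timehash as fields to every row of the CSV using Spark (the data is about 200 GB)?

We tried JavaPairRDD and its mapToPair function, but that leaves the problem of converting back to a JavaRDD and then to CSV. I think that was a poor solution, so I am asking for a simpler way.

Update to the question: after @Alvaro's help, I created this Java class: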

public class Hash {
    public static SparkConf Spark_Config;
    public static JavaSparkContext Spark_Context;

    UDF2 geohashConverter = new UDF2<Long, Long, String>() {
        public String call(Long latitude, Long longitude) throws Exception {
            // convert here
            return "calculate_hash";
        }
    };

    UDF1 timehashConverter = new UDF1<Long, String>() {
        public String call(Long timestamp) throws Exception {
            // convert here
            return "calculate_hash";
        }
    };

    public Hash(String path) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL Example")
                .config("spark.master", "local")
                .getOrCreate();

        spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
        spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);

        // write().csv(...) returns void, so its result cannot be assigned to a Dataset
        spark.read().csv(path)
                .withColumn("geohash", callUDF("geohashConverter", col("_c6"), col("_c7")))
                .withColumn("timehash", callUDF("timehashConverter", col("_c1")))
                .write().csv("C:/Users/Ahmed/Desktop/preprocess2");
    }

    public static void main(String[] args) {
        String path = "C:/Users/Ahmed/Desktop/cabs_trajectories/cabs_trajectories/green/2013";
        Hash h = new Hash(path);
    }
}


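Then I got a serialization problem, which disappears when I remove write().csv().

One of the most efficient approaches is to load the CSV with the Dataset API and transform the relevant columns with user-defined functions (UDFs). That way your data stays structured throughout and you never have to deal with tuples.

First, create the user-defined functions: geohashConverter, which takes two values (latitude and longitude), and timehashConverter, which takes only the timestamp.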

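import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.api.java.UDF2;

// Takes latitude and longitude and returns the geohash.
UDF2<Long, Long, String> geohashConverter = new UDF2<Long, Long, String>() {
    @Override
    public String call(Long latitude, Long longitude) throws Exception {
        // convert here
        return "calculate_hash";
    }
};

// Takes the timestamp and returns the timehash.
UDF1<Long, String> timehashConverter = new UDF1<Long, String>() {
    @Override
    public String call(Long timestamp) throws Exception {
        // convert here
        return "calculate_hash";
    }
};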

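Next, read your CSV file and apply the user-defined functions by calling withColumn. Each call creates a new column from the user-defined function you invoke through callUDF.

callUDF always takes a string with the name of the registered UDF to call, followed by one or more columns whose values are passed to the UDF.

Finally, just call write().csv("path").

Hope it helps.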

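Update

It would be very helpful if you posted the code that causes the problem, because the exception says almost nothing about which part of the code is not serializable.

In any case, from my own experience with Spark, I think the problem is the object you use to compute the hash. Remember that this object has to be distributed across the cluster. If it cannot be serialized, Spark throws a Task not serializable exception. You have two options:

Implement the Serializable interface in the class you use to compute the hash.

Create a static method that generates the hash, and call that method from the UDF.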
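The second option could look roughly like the following. This is a minimal, Spark-free sketch of the standard geohash encoding (alternating longitude and latitude bits, mapped to the geohash base-32 alphabet). The class name GeoHashUtil, the method encode, and the length parameter are hypothetical, and the sketch takes double coordinates rather than the Long used in the placeholder UDFs above.

```java
// Hypothetical helper: a static geohash encoder that a UDF can call
// without holding any object that would need to be serialized.
final class GeoHashUtil {
    // Standard geohash base-32 alphabet (no a, i, l, o).
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    private GeoHashUtil() {
    }

    // Encodes a latitude/longitude pair into a geohash of the given length.
    static String encode(double latitude, double longitude, int length) {
        double latMin = -90.0, latMax = 90.0;
        double lonMin = -180.0, lonMax = 180.0;
        StringBuilder hash = new StringBuilder(length);
        boolean useLongitude = true; // bits alternate, starting with longitude
        int bitCount = 0;
        int value = 0;

        while (hash.length() < length) {
            if (useLongitude) {
                double mid = (lonMin + lonMax) / 2;
                if (longitude >= mid) {
                    value = (value << 1) | 1;
                    lonMin = mid;
                } else {
                    value = value << 1;
                    lonMax = mid;
                }
            } else {
                double mid = (latMin + latMax) / 2;
                if (latitude >= mid) {
                    value = (value << 1) | 1;
                    latMin = mid;
                } else {
                    value = value << 1;
                    latMax = mid;
                }
            }
            useLongitude = !useLongitude;

            // Every 5 bits become one base-32 character.
            if (++bitCount == 5) {
                hash.append(BASE32.charAt(value));
                bitCount = 0;
                value = 0;
            }
        }
        return hash.toString();
    }
}
```

Inside the UDF, call would then reduce to something like `return GeoHashUtil.encode(latitude, longitude, 7);`, where the length 7 is only an example value. Because the method is static, the UDF does not need a reference to a helper object.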


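Update 2

"Then I got a serialization problem, which disappears when I remove write().csv()"

That is the expected behavior. When you remove write().csv(), nothing is executed at all. You need to understand how Spark works: in this code, every method called before csv() is a transformation, and Spark does not execute transformations until an action such as csv(), show(), or count() is called.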
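The same deferred execution can be observed without a cluster. The sketch below is only an analogy built on plain Java streams, not Spark code: the intermediate map step plays the role of a transformation and collect plays the role of an action. The names LazyEvaluationDemo, buildPipeline, runPipeline, and processed are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyEvaluationDemo {
    // Records which elements the mapping function has actually processed.
    static final List<String> processed = new ArrayList<>();

    // Only describes the pipeline; the mapping function does not run yet
    // (comparable to chaining Spark transformations).
    static Stream<String> buildPipeline() {
        return Stream.of("a", "b", "c").map(s -> {
            processed.add(s);
            return s.toUpperCase();
        });
    }

    // The terminal operation triggers the work
    // (comparable to a Spark action such as write().csv()).
    static List<String> runPipeline(Stream<String> pipeline) {
        return pipeline.collect(Collectors.toList());
    }
}
```

Any failure inside the mapping function would surface only when runPipeline is called, which mirrors why the serialization error appears only once write().csv() is present.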
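The problem is that you are creating and running the Spark job inside a class that is not serializable (and doing it in the constructor, which is even worse).

Creating the Spark job in a static method solves the problem. Remember that your Spark code has to be distributed across the cluster, so it must be serializable. This worked for me, and it should work for you too: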
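import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

public class Hash {
    public static void main(String[] args) {
        String path = "in/prueba.csv";

        // Created inside a static method, so the UDFs have no enclosing instance to capture.
        UDF2<Long, Long, String> geohashConverter = new UDF2<Long, Long, String>() {
            @Override
            public String call(Long latitude, Long longitude) throws Exception {
                // convert here
                return "calculate_hash";
            }
        };

        UDF1<Long, String> timehashConverter = new UDF1<Long, String>() {
            @Override
            public String call(Long timestamp) throws Exception {
                // convert here
                return "calculate_hash";
            }
        };

        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL Example")
                .config("spark.master", "local")
                .getOrCreate();

        spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
        spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);

        // With header=true, column names come from the header row; use those names
        // in col(...) if they differ from _c6, _c7 and _c1.
        spark
                .read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load(path)
                .withColumn("geohash", callUDF("geohashConverter", col("_c6"), col("_c7")))
                .withColumn("timehash", callUDF("timehashConverter", col("_c1")))
                .write().csv("resultados");
    }
}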
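Why moving the UDFs out of instance fields helps can be reproduced without Spark. The sketch below rests on one assumption about the cause: an anonymous class created in an instance context keeps a reference to its enclosing object, so serializing it also tries to serialize that object. CaptureDemo, Converter, and isSerializable are hypothetical names, Converter is only a stand-in for Spark's serializable UDF interfaces, and the instance version reads a field of the outer class so that the captured reference is explicit.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Deliberately NOT Serializable, like the Hash class in the question.
class CaptureDemo {
    // Hypothetical stand-in for a Spark UDF interface, which is also Serializable.
    interface Converter extends Serializable {
        String call(Long value);
    }

    private String prefix = "hash-";

    // Instance field: the anonymous class reads 'prefix', so it holds a
    // reference to the enclosing CaptureDemo object.
    final Converter instanceConverter = new Converter() {
        @Override
        public String call(Long value) {
            return prefix + value;
        }
    };

    // Created in a static context: there is no enclosing instance to capture.
    static Converter staticConverter() {
        return new Converter() {
            @Override
            public String call(Long value) {
                return "hash-" + value;
            }
        };
    }

    // Returns true if plain Java serialization of the object succeeds.
    static boolean isSerializable(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, isSerializable(new CaptureDemo().instanceConverter) fails because CaptureDemo itself is not serializable, while isSerializable(CaptureDemo.staticConverter()) succeeds.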

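The static imports that col and callUDF require, and the read, transform, write chain from the first part of the answer, here with named columns:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.callUDF;

spark.read()
        .option("header", "true") // named columns require a header row
        .csv("/source/path")
        .withColumn("geohash", callUDF("geohashConverter", col("latitude"), col("longitude")))
        .withColumn("timehash", callUDF("timehashConverter", col("timestamp")))
        .write().csv("/path/to/save");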
