Merging two columns in a Spark Dataset using Java

Tags: java, apache-spark-sql, apache-spark-dataset

I want to merge two columns in an Apache Spark Dataset. I tried the following, but it did not work. Can anyone suggest a solution?

Dataset<Row> df1 = spark.read().json("src/test/resources/DataSets/DataSet.json");
Dataset<Row> df1Map = df1.select(functions.array("beginTime", "endTime"));
df1Map.show();
Note that I am doing this with Java 8.

The contents of DataSet.json are:

{"IPsrc":"abc", "IPdst":"def", "beginTime": 1, "endTime":1}
{"IPsrc":"def", "IPdst":"abc", "beginTime": 2, "endTime":2}
{"IPsrc":"abc", "IPdst":"def", "beginTime": 3, "endTime":3}
{"IPsrc":"def", "IPdst":"abc", "beginTime": 4, "endTime":4}

Running the code above produces the following error:
com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'beginTime': was expecting ('true', 'false' or 'null')
at [Source: beginTime; line: 1, column: 19]

at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2462)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1621)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:689)
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3776)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3721)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:20)
at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:50)
at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
at org.apache.spark.sql.types.DataType.fromJson(DataType.scala)
The initial table output from df1.show() is:

+-----+-----+---------+-------+
|IPdst|IPsrc|beginTime|endTime|
+-----+-----+---------+-------+
|  def|  abc|        1|      1|
|  abc|  def|        2|      2|
|  def|  abc|        3|      3|
|  abc|  def|        4|      4|
+-----+-----+---------+-------+
The schema of df1 is as follows:

root
 |-- IPdst: string (nullable = true)
 |-- IPsrc: string (nullable = true)
 |-- beginTime: long (nullable = true)
 |-- endTime: long (nullable = true)

If you want to concatenate the two columns, you can do the following:

Dataset<Row> df1Map = df1.select(functions.concat(df1.col("beginTime"), df1.col("endTime")));
df1Map.show();
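If a separator between the two values is wanted, Spark's functions API also provides concat_ws. A minimal sketch; the session setup and the output column name begin_end_time are illustrative choices, not from the original post:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

public class ConcatExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("concat-example")
                .getOrCreate();

        Dataset<Row> df1 = spark.read().json("src/test/resources/DataSets/DataSet.json");

        // concat_ws joins the two columns with the given separator,
        // producing values such as "1-1" for beginTime=1, endTime=1
        Dataset<Row> merged = df1.select(
                functions.concat_ws("-", df1.col("beginTime"), df1.col("endTime"))
                         .alias("begin_end_time"));
        merged.show();
        spark.stop();
    }
}
```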

Edit: I tried this and it worked. Here is my code:

SparkSession spark = SparkSession
            .builder()
            .master("local")
            .appName("SO")
            .getOrCreate();

Dataset<Row> df1 = spark.read().json("src/main/resources/json/Dataset.json");
df1.printSchema();
df1.show();

Dataset<Row> df1Map = df1.select(functions.array("beginTime", "endTime"));
df1Map.show();
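One refinement: the array column above gets an auto-generated name, so alias() can be chained onto the expression to give the merged column a fixed name. A sketch under the same setup as the code above; the column name "times" is an arbitrary choice:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

public class ArrayAliasExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("array-alias")
                .getOrCreate();

        Dataset<Row> df1 = spark.read().json("src/main/resources/json/Dataset.json");

        // alias() names the merged column "times" instead of the
        // auto-generated name for array(beginTime, endTime)
        Dataset<Row> df1Map = df1.select(
                functions.array("beginTime", "endTime").alias("times"));
        df1Map.printSchema();
        df1Map.show();
        spark.stop();
    }
}
```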
You can try this:

Dataset<Row> df1Map = df1.withColumn("begin_end_time", concat(col("beginTime"), lit(" - "), col("endTime")));

Comments:

- Ideally I want to create an array containing the two columns. However, I tried your approach and got the following error: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'beginTime': was expecting ('true', 'false' or 'null') at [Source: beginTime; line: 1, column: 19]
- Thanks for trying. I found the problem: it apparently does not run from a unit test in IntelliJ, but it works otherwise. Could you tell me how to keep all the other columns? I only want to merge these two and leave the rest as they are. I don't know the schema in advance, so I don't know all the column names. Also, how do I rename this new array(beginTime, endTime) column?
- Good that you found it! You can use .withColumn(): Dataset<Row> df1Map = df1.withColumn("begin_end_time", functions.array("beginTime", "endTime"))
- Can you elaborate on what exactly does not work: do you get an error, or unexpected output? Please also provide the output/error.
- @AbhishekBansal updated the question with details. BTW, I am using a Dataset rather than a DataFrame. I also get the same error with the 'concat' approach suggested in the answer below.
- How are you loading the df1 dataset? Can you paste that code? See the updated question for details.
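Pulling the comment thread together: withColumn adds the merged array under a chosen name while leaving every other column in place, and drop can then remove the two source columns without knowing the rest of the schema. A sketch; the file path and the name begin_end_time are illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

public class MergeKeepColumns {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("merge-keep-columns")
                .getOrCreate();

        Dataset<Row> df1 = spark.read().json("src/main/resources/json/Dataset.json");

        // withColumn appends the merged array; every other column survives.
        // drop then removes the two originals, so the full schema need not be known.
        Dataset<Row> result = df1
                .withColumn("begin_end_time", functions.array("beginTime", "endTime"))
                .drop("beginTime", "endTime");

        result.printSchema();
        result.show();
        spark.stop();
    }
}
```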