
Serializing a table to nested JSON with Apache Spark


I have a set of records like the following example:

|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014|    MH43AJ411|  20000000|
| 10003014|    MH43AJ411|  20000001|
| 10003015|   MH12GZ3392|  20000002|
I want to convert it into JSON that should look like this:

{
    "ACCOUNTNO":10003014,
    "VEHICLE": [
        { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000000},
        { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000001}
    ]
}
{
    "ACCOUNTNO":10003015,
    "VEHICLE": [
        { "VEHICLENUMBER":"MH12GZ3392", "CUSTOMERID":20000002}
    ]
}
I have written the following program, but it fails to produce that output:

package com.report.pack1.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._


object sqltojson {

  def main(args:Array[String]) {
    System.setProperty("hadoop.home.dir", "C:/winutil/")
    val conf = new SparkConf().setAppName("SQLtoJSON").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._      
    val jdbcSqlConnStr = "jdbc:sqlserver://192.168.70.88;databaseName=ISSUER;user=bhaskar;password=welcome123;"      
    val jdbcDbTable = "[HISTORY].[TP_CUSTOMER_PREPAIDACCOUNTS]"
    val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> jdbcSqlConnStr,"dbtable" -> jdbcDbTable)).load()
    jdbcDF.registerTempTable("tp_customer_account")
    val res01 = sqlContext.sql("SELECT ACCOUNTNO, VEHICLENUMBER, CUSTOMERID FROM tp_customer_account GROUP BY ACCOUNTNO, VEHICLENUMBER, CUSTOMERID ORDER BY ACCOUNTNO ")
    res01.coalesce(1).write.json("D:/res01.json")      
  }
}

How can I serialize it in the given format? Thanks in advance.

You can use struct together with groupBy to get the desired result. The code below does this; I have added comments where needed.

// assumes a SparkSession is in scope and spark.implicits._ is imported (for toDF)
val df = Seq((10003014,"MH43AJ411",20000000),
  (10003014,"MH43AJ411",20000001),
  (10003015,"MH12GZ3392",20000002)
).toDF("ACCOUNTNO","VEHICLENUMBER","CUSTOMERID")

df.show
//output
//+---------+-------------+----------+
//|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
//+---------+-------------+----------+
//| 10003014|    MH43AJ411|  20000000|
//| 10003014|    MH43AJ411|  20000001|
//| 10003015|   MH12GZ3392|  20000002|
//+---------+-------------+----------+

//create a struct column, then group by the ACCOUNTNO column, and finally convert the DF to JSON
import org.apache.spark.sql.functions.{struct, collect_list}

df.withColumn("VEHICLE", struct("VEHICLENUMBER", "CUSTOMERID")).
  select("VEHICLE", "ACCOUNTNO"). //only select required columns
  groupBy("ACCOUNTNO").
  agg(collect_list("VEHICLE").as("VEHICLE")). //for each group, collect its vehicles into a list
  toJSON. //convert to JSON
  show(false)

//output
//+------------------------------------------------------------------------------------------------------------------------------------------+
//|value                                                                                                                                     |
//+------------------------------------------------------------------------------------------------------------------------------------------+
//|{"ACCOUNTNO":10003014,"VEHICLE":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}]}|
//|{"ACCOUNTNO":10003015,"VEHICLE":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}]}                                                   |
//+------------------------------------------------------------------------------------------------------------------------------------------+

You can also write this dataframe to a file using the same write statement you mentioned in your question.
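As a minimal sketch of that last step, the grouped dataframe can be written out the same way the question does (the output path "D:/res01_nested.json" is an example, not from the original post; this assumes the same df and imports as above):

```scala
import org.apache.spark.sql.functions.{struct, collect_list}

val nested = df.
  withColumn("VEHICLE", struct("VEHICLENUMBER", "CUSTOMERID")).
  groupBy("ACCOUNTNO").
  agg(collect_list("VEHICLE").as("VEHICLE"))

// coalesce(1) merges the result into a single part file, as in the question;
// write.json emits one JSON object per line (line-delimited JSON)
nested.coalesce(1).write.json("D:/res01_nested.json")
```

Note that write.json produces a directory containing a part file, not a single bare .json file.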

OK, thanks. But it comes out like this: {"value":"{\"ACCOUNTNO\":10003200,\"VEHICLE\":[{\"VEHICLENUMBER\":\"MH04FP4254\",\"CUSTOMERID\":20000287}]}"} Why the \ characters? The second thing is that many VEHICLENUMBER entries have not been merged into the list, because many VEHICLENUMBER values are duplicated. The data comes from a table in a remote SQL Server; of course, the table contains more than 3 million records, and yes, it has multiple rows for the same data. That is why I asked: should I add fields after the GROUP BY? If so, I would also receive the same vehicle number several times. The output you showed in your Stack Overflow answer is the result I actually want to get. Yes, the table contains duplicate data, so should I use DISTINCT? Please help me, my dear friend. The data/table you used is not actually my input; I think you have sidestepped my Scala code. The table I showed in the question is the result of an SQL query against a table with more than 3 million records in a remote SQL Server. I gave you that table for better understanding. Are you there? I need to group by the fields in the list.
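On the two points raised in the comment: the \ characters appear because toJSON produces a Dataset with a single string column named "value", and show escapes the quotes inside that string; writing with write.json avoids them. For the duplicates, a sketch under the assumption that they are exact repeated rows, deduplicating with dropDuplicates before grouping (the path "D:/res01_dedup.json" is an example):

```scala
import org.apache.spark.sql.functions.{struct, collect_list}

val deduped = df.
  // drop exact duplicate (ACCOUNTNO, VEHICLENUMBER, CUSTOMERID) rows,
  // equivalent to SELECT DISTINCT over these three columns
  dropDuplicates("ACCOUNTNO", "VEHICLENUMBER", "CUSTOMERID").
  withColumn("VEHICLE", struct("VEHICLENUMBER", "CUSTOMERID")).
  groupBy("ACCOUNTNO").
  agg(collect_list("VEHICLE").as("VEHICLE"))

// plain JSON on disk, no escaped quotes
deduped.coalesce(1).write.json("D:/res01_dedup.json")
```

If only the (VEHICLENUMBER, CUSTOMERID) pairs should be unique within an account, collect_set could replace collect_list instead, though it does not preserve order.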