Apache spark apachesparksql中collect_集(struct)的使用
我有一个包含200万个字段的表,它在Spark Dataframe中注册 该表如下所示:Apache spark apachesparksql中collect_集(struct)的使用,apache-spark,apache-spark-sql,Apache Spark,Apache Spark Sql,我有一个包含200万个字段的表,它在Spark Dataframe中注册 该表如下所示: CUSTADDRESSID ADDRESSTYPE ADDRESSLINE1 ADDRESSLINE2 ADDRESSLINE3 CITY STATE COUNTRY ZIP1 ISACTIVE ISCOMMUNICATION CREATEDDATE CREATEDUSER UPDATEDDATE UPDATEDUSER REASONCODE ZIP2 C
CUSTADDRESSID ADDRESSTYPE ADDRESSLINE1 ADDRESSLINE2 ADDRESSLINE3 CITY STATE COUNTRY ZIP1 ISACTIVE ISCOMMUNICATION CREATEDDATE CREATEDUSER UPDATEDDATE UPDATEDUSER REASONCODE ZIP2 C_ACCNO CUSTOMERID ACCOUNTGROUPID PREPAIDACCOUNTSTATUSID PREPAIDACCOUNTSTATUSDATE SOURCEOFENTRY REVENUECATEGORYID VEHICLENUMBER VEHICLECLASS SERIALNO HEXTAGID TAGSTATUS TAGSTARTEFFDATE TAGENDEFFDATE ISTAGBLACKLISTED ISBLACKLISTHOLD RCVERIFICATIONSTATUS EMAILADDRESS PHONENUMBER CCreatedDate CCreatedUser CUpdatedDate CUpdatedUser HISTID ACTION ISFEEWAIVER FEEWAIVERPASSTYPE VEHICLEIMGVERIFICATIONSTATUS TAGTID ISREVENUERECHARGE RowNumber
41 Mailing B309 PROGRESSIVE SIGNATURE SECTOR-6 GHANSOLI NAVI MUMBAI MH IND 400701 1 1 2013-06-07 12:55:54.827 bhagwadapos 2013-06-07 12:55:54.827 bhagwadapos NULL NULL 10003014 20000001 15 3079 2015-09-16 14:58:27.500 RegularRetailer 75 MH43AJ411 4 206158433290 91890704803000000C0A TAGINACTIVE 2014-08-08 14:24:12.227 2039-08-08 23:59:59.000 1 0 NULL shankarn75@rediffmail.com 9004419178 2013-06-07 12:56:16.650 bhagwadapos 2015-09-16 14:58:33.190 BatchProcess 15250 UPDATE NULL NULL NULL NULL NULL 1
我想转换成JSON,JSON文件应该是这样的,很抱歉我手工设计的:
ACCOUNTNO : 10003018
ADDRESS : Array
0 : Object
VEHICLE : Array
0 : Object
我已经编写了Spark SQL查询,但无法在ACCOUNTNO
下创建两个数组VEHICLE
和ADDRESS
这就是问题所在:
val query2 = "SELECT C_ACCNO AS ACCOUNTNO, collect_set(struct(VEHICLENUMBER, CUSTOMERID,ACCOUNTGROUPID,PREPAIDACCOUNTSTATUSID,PREPAIDACCOUNTSTATUSDATE,SOURCEOFENTRY,REVENUECATEGORYID,VEHICLECLASS,SERIALNO,HEXTAGID,TAGSTATUS,TAGSTARTEFFDATE,TAGENDEFFDATE,ISTAGBLACKLISTED,ISBLACKLISTHOLD,RCVERIFICATIONSTATUS,EMAILADDRESS,PHONENUMBER,CREATEDDATE,CREATEDUSER,UPDATEDDATE,UPDATEDUSER,ISFEEWAIVER,FEEWAIVERPASSTYPE,VEHICLEIMGVERIFICATIONSTATUS,TAGTID,ISREVENUERECHARGE)) as VEHICLE FROM joined_acc_add GROUP BY ACCOUNTNO ORDER BY ACCOUNTNO"
之后:
val res01 = sqlContext.sql(query2.toString)
res01.coalesce(1).write.json("D:/result01")
我需要帮助来查找查询中的错误。此查询正在抛出错误。我已完成。在一个对象下创建多个数组的查询应如下所示
val query1 = "SELECT PKcolumn1, collect_set(struct(column2, column3)), collect_set(struct(column4, column5)), collect_set(struct(column6, column7)) GROUP BY PKcolumn1 ORDER BY PKcolumn1"