Reading complex JSON in Spark SQL using Java


My JSON file looks like the sample below, and I am trying to read all the names under majorsector_percent with the following code.

Code:

    JavaSQLContext sQLContext = new JavaSQLContext(sc);
    sQLContext.jsonFile("C:/Users/HimanshuK/Downloads/world_bank/world_bank.json").registerTempTable("logs");
    sQLContext.sqlContext().cacheTable("logs");
    List s = sQLContext.sql("select majorsector_percent from logs limit 1 ").map(row -> new Tuple2<>(row.getString(0), row.getString(1))).collect();


JSON file:

     { "_id" : { "$oid" : "52b213b38594d8a2be17c780" }, "approvalfy" : 1999, "board_approval_month" : "November", "boardapprovaldate" : "2013-11-12T00:00:00Z", "borrower" : "FEDERAL DEMOCRATIC REPUBLIC OF ETHIOPIA", "closingdate" : "2018-07-07T00:00:00Z", "country_namecode" : "Federal Democratic Republic of Ethiopia!$!ET", "countrycode" : "ET", "countryname" : "Federal Democratic Republic of Ethiopia", "countryshortname" : "Ethiopia", "docty" : "Project Information Document,Indigenous Peoples Plan,Project Information Document", "envassesmentcategorycode" : "C", "grantamt" : 0, "ibrdcommamt" : 0, "id" : "P129828", "idacommamt" : 130000000, "impagency" : "MINISTRY OF EDUCATION", "lendinginstr" : "Investment Project Financing", "lendinginstrtype" : "IN", "lendprojectcost" : 550000000, "majorsector_percent" : [ { "Name" : "Education", "Percent" : 46 }, { "Name" : "Education", "Percent" : 26 }, { "Name" : "Public Administration, Law, and Justice", "Percent" : 16 }, { "Name" : "Education", "Percent" : 12 } ], "mjtheme" : [ "Human development" ], "mjtheme_namecode" : [ { "name" : "Human development", "code" : "8" }, { "name" : "", "code" : "11" } ], "mjthemecode" : "8,11", "prodline" : "PE", "prodlinetext" : "IBRD/IDA", "productlinetype" : "L", "project_abstract" : { "cdata" : "The development  }, "project_name" : "Ethiopia General Education Quality Improvement Project II",  "projectfinancialtype" : "IDA", "projectstatusdisplay" : "Active", "regionname" : "Africa", "sector1" : { "Name" : "Primary education", "Percent" : 46 }, "sector2" : { "Name" : "Secondary education", "Percent" : 26 }, "sector3" : { "Name" : "Public administration- Other social services", "Percent" : 16 }, "sector4" : { "Name" : "Tertiary education", "Percent" : 12 }, "sectorcode" : "ET,BS,ES,EP", "source" : "IBRD", "status" : "Active", "supplementprojectflg" : "N", "theme1" : { "Name" : "Education for all", "Percent" : 100 }, "themecode" : "65", "totalamt" : 130000000, "totalcommamt" : 130000000, 
"url" : "http://www.worldbank.org/projects/P129828/ethiopia-general-education-quality-improvement-project-ii?lang=en" }
But I get the error below because of a type cast. How should I handle a case like this, and how can I find out the schema?


java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to java.lang.String

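The problem is that the result of that query is a structure containing an array. When you map the result by calling row.getString on that column, the underlying object is not a String (here it is a scala.collection.mutable.ArrayBuffer), so the cast fails with a ClassCastException.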
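To make the failure concrete, here is a minimal plain-Java sketch that does not use Spark at all. It assumes, for illustration only, that a row cell is held as an Object and that a getString-style accessor is a plain cast. The CellAccess class and its methods are hypothetical names, not part of the Spark API, and the sample data is copied from the majorsector_percent field of the JSON above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for reading a row cell; NOT the Spark Row API.
public class CellAccess {

    // Mimics a getString-style accessor (assumption: it is a plain cast).
    // Throws ClassCastException when the cell does not hold a String.
    static String asString(Object cell) {
        return (String) cell;
    }

    // Treats the cell as a list of records and collects each "Name" value.
    @SuppressWarnings("unchecked")
    static List<String> names(Object cell) {
        List<String> result = new ArrayList<>();
        for (Map<String, Object> entry : (List<Map<String, Object>>) cell) {
            result.add((String) entry.get("Name"));
        }
        return result;
    }

    // Builds a cell shaped like majorsector_percent in the sample JSON.
    static Object sampleCell() {
        String[] sectorNames = {"Education", "Education",
                "Public Administration, Law, and Justice", "Education"};
        long[] percents = {46, 26, 16, 12};
        List<Map<String, Object>> cell = new ArrayList<>();
        for (int i = 0; i < sectorNames.length; i++) {
            Map<String, Object> entry = new LinkedHashMap<>();
            entry.put("Name", sectorNames[i]);
            entry.put("Percent", percents[i]);
            cell.add(entry);
        }
        return cell;
    }

    public static void main(String[] args) {
        Object cell = sampleCell();
        try {
            asString(cell); // same kind of mistake as in the question
        } catch (ClassCastException e) {
            System.out.println("cast failed: the cell is a list, not a String");
        }
        System.out.println(names(cell)); // read it as a list instead
    }
}
```

The point of the sketch is only that the accessor has to match the runtime type of the cell: a string accessor on a list-valued cell fails, while reading the cell as a list works.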

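The result of a SQL query is a DataFrame, so you can ask it for its schema, for example by calling printSchema() on it (the same method is available in the Java API).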

Alternatively, you can simplify the process by using a more specific query:

val directQuery = sqlContext.sql("select majorsector_percent.Name, majorsector_percent.Percent from logs limit 1 ")
directQuery: org.apache.spark.sql.DataFrame = [Name: array<string>, Percent: array<bigint>]

scala> directQuery.collect
res5: Array[org.apache.spark.sql.Row] = Array([WrappedArray(Education, Education, Public Administration, Law, and Justice, Education),WrappedArray(46, 26, 16, 12)])
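Note that the direct query returns two parallel arrays (all names, then all percents) rather than one array of records. If (name, percent) pairs are what you ultimately want, as the Tuple2 in the question suggests, the two arrays can be zipped by index. The following is a minimal plain-Java sketch, assuming the arrays have already been read out of the row as a List<String> and a List<Long>; how you obtain those lists depends on your Spark version. SectorPairs and zip are hypothetical helper names.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Spark.
public class SectorPairs {

    // Pairs names.get(i) with percents.get(i) for every index i.
    static List<Map.Entry<String, Long>> zip(List<String> names, List<Long> percents) {
        if (names.size() != percents.size()) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        List<Map.Entry<String, Long>> pairs = new ArrayList<>(names.size());
        for (int i = 0; i < names.size(); i++) {
            pairs.add(new SimpleImmutableEntry<>(names.get(i), percents.get(i)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Values copied from the collect output shown above.
        List<String> names = Arrays.asList("Education", "Education",
                "Public Administration, Law, and Justice", "Education");
        List<Long> percents = Arrays.asList(46L, 26L, 16L, 12L);
        for (Map.Entry<String, Long> pair : zip(names, percents)) {
            System.out.println(pair.getKey() + ": " + pair.getValue());
        }
    }
}
```

A list of entries is used instead of a map because the sample data repeats the name "Education", and a map keyed by name would silently drop the repeated rows.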


The JSON provided is not syntactically valid.