Spark JSON read failure


1:) I have to build some code that reads a JSON file in Spark. I am using spark.read.json("sample.json"). But even with a simple JSON file like the one below,

{
   {"id" : "1201", "name" : "satish", "age" : "25"}
   {"id" : "1202", "name" : "krishna", "age" : "28"}
   {"id" : "1203", "name" : "amith", "age" : "39"}
   {"id" : "1204", "name" : "javed", "age" : "23"}
   {"id" : "1205", "name" : "prudvi", "age" : "23"}
}
I get the wrong result:

+---------------+----+----+-------+
|_corrupt_record| age|  id|   name|
+---------------+----+----+-------+
|              {|null|null|   null|
|           null|  25|1201| satish|
|           null|  28|1202|krishna|
|           null|  39|1203|  amith|
|           null|  23|1204|  javed|
|           null|  23|1205| prudvi|
|              }|null|null|   null|
+---------------+----+----+-------+
That is the result I get for the example above.

2:) Also, I don't know how to handle a badly formatted JSON file like the one below:

{
    "title": "Person",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}
I find it very hard to process files like this. Apart from Spark, is there a consistent way to handle JSON files in Java/Scala?

Please help.


Thanks.

Your JSON file should look like this:

{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
The code is:

%spark.pyspark

# alias Zeppelin's built-in sqlContext so the rest of the snippet can use the shorter name
sqlc = sqlContext

# setup input
file_json = "hdfs://mycluster/user/test/test.json"

df = sqlc.read.json(file_json)
df.registerTempTable("myfile")

df2 = sqlc.sql("SELECT * FROM myfile")

df2.show()
Output:

+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
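
As a side note on part 2 of the question: the Person schema document shown there is actually valid JSON, it just spans multiple lines. From Spark 2.2 onwards the reader can parse such files with the multiLine option (the same option the Scala answer below uses), in which case the whole document becomes a single row with nested columns. A minimal PySpark sketch, assuming the document is saved under the hypothetical name person_schema.json:

# a minimal sketch, assuming Spark 2.2+ and that the schema document from
# part 2 of the question is saved as person_schema.json (hypothetical path)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_multiline_json").getOrCreate()

# multiLine tells Spark to parse whole JSON documents instead of one object per line
df = spark.read.option("multiLine", True).json("person_schema.json")

df.printSchema()          # nested columns: properties, required, title, type
df.show(truncate=False)

The file from part 1 is a different case: wrapping the records in an extra pair of braces makes it invalid JSON altogether, so multiLine does not help there and the one-object-per-line format shown above is the fix.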

Reading JSON data in the following format is quite simple:

[
  {
    "code": "AAA",
    "lat": "-17.3595",
    "lon": "-145.494",
    "name": "Anaa Airport",
    "city": "Anaa",
    "state": "Tuamotu Gambier",
    "country": "French Polynesia",
    "woeid": "12512819",
    "tz": "Pacific/Midway",
    "phone": "",
    "type": "Airports",
    "email": "",
    "url": "",
    "runway_length": "4921",
    "elev": "7",
    "icao": "NTGA",
    "direct_flights": "2",
    "carriers": "1"
  },

In Spark, using Scala, the data can be read as follows:

import org.apache.spark.{SparkConf, SparkContext}

object Airport_Data {
  val conf = new SparkConf().setAppName("Airport_Analysis").setMaster("local")
  val sc = new SparkContext(conf)
  val Sql_Context = new org.apache.spark.sql.SQLContext(sc)
  import Sql_Context.implicits._
  println("SQL Context and SPARK Context have been initialized")

  def data_show() = {
    // multiLine lets Spark parse JSON records that span several lines;
    // PERMISSIVE mode keeps malformed records instead of failing the whole read
    val airport_df = Sql_Context.read
      .option("multiLine", true)
      .option("mode", "PERMISSIVE")
      .json("C://Users//133288//Desktop//Flight//airports.json")
    airport_df.show()
    println("***************************")

    // Print the schema in tree format
    println("************Print the Schema in tree format***************")
    airport_df.printSchema()

    // Select only the "name" column
    println("************Select only the 'Specific' Column***************")
    airport_df.select("name").show()

    // Select every row, but increment "runway_length" by 200
    println("************select everybody, but increment the 'runway_length' by 200***************")
    airport_df.select($"name", $"runway_length" + 200).show()

    // Select airports whose runway_length is greater than 5000 and write them out
    println("*************Select Airport with runway length Greater than 5000************")
    airport_df.filter($"runway_length" > 5000)
      .write.parquet("C://Users//133288//Desktop//Flight//Airport_lenth_5000.parquet")
    airport_df.filter($"runway_length" > 5000)
      .write.csv("C://Users//133288//Desktop//Flight//Airport_lenth_5000")
  }

  def main(args: Array[String]): Unit = data_show()
}
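
One more note on the "mode" option used above: PERMISSIVE is Spark's default JSON parse mode, and it is also what produced the _corrupt_record column in the question's first output: any input line that cannot be parsed as a complete JSON object (the lone { and } lines there) is kept as a row, with its raw text stored in that column. In PySpark terms (matching the question's own spark.read.json call), a rough sketch of separating good and bad rows might look like the following; it assumes the file actually contains at least one malformed line, since otherwise Spark does not add the column at all.

# a rough sketch, assuming Spark 2.x and a line-delimited sample.json that
# contains at least one malformed line (otherwise _corrupt_record is absent)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("corrupt_record_demo").getOrCreate()

df = (spark.read
      .option("mode", "PERMISSIVE")                          # default mode, shown explicitly
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("sample.json"))

df.cache()  # Spark 2.3+ disallows queries that reference only the corrupt-record
            # column on the raw file; caching first is the documented workaround

good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
bad = df.filter(df["_corrupt_record"].isNotNull()).select("_corrupt_record")

good.show()
bad.show(truncate=False)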