Apache Spark: PERMISSIVE mode with a JSON file moves all records to the corrupt column
I am trying to ingest a JSON file with Spark, applying a schema manually to create the DataFrame. The problem is that if even a single record does not match the schema, Spark moves the entire file (all records) into the corrupt-record column.

Data
[{
    "RecordNumber": 2,
    "Zipcode": 704,
    "ZipCodeType": "STANDARD",
    "City": "PASEO COSTA DEL SUR",
    "State": "PR"
},
{
    "RecordNumber": 10,
    "Zipcode": 709,
    "ZipCodeType": "STANDARD",
    "City": "BDA SAN LUIS",
    "State": "PR"
},
{
    "Zipcode": "709aa",
    "ZipCodeType": "STANDARD",
    "City": "BDA SAN LUIS",
    "State": "PR"
}]
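For reference, the type mismatch is easy to confirm outside Spark: parsing the same array with Python's standard `json` module (a standalone sketch with the records inlined) shows that only the third `Zipcode` is a string while the others are integers.

```python
import json

# Inline copy of the three records from the file above.
data = json.loads("""
[{"RecordNumber": 2, "Zipcode": 704, "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR", "State": "PR"},
 {"RecordNumber": 10, "Zipcode": 709, "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS", "State": "PR"},
 {"Zipcode": "709aa", "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS", "State": "PR"}]
""")

# Only the third record violates the LongType expectation for Zipcode.
zip_types = [type(rec["Zipcode"]).__name__ for rec in data]
print(zip_types)  # ['int', 'int', 'str']
```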
Code
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataTypes._

val s = StructType(
  StructField("City", StringType, true) ::
  StructField("RecordNumber", LongType, true) ::
  StructField("State", StringType, true) ::
  StructField("ZipCodeType", StringType, true) ::
  StructField("Zipcode", LongType, true) ::
  StructField("corrupted_record", StringType, true) ::
  Nil)

val df2 = spark.read.
  option("multiline", "true").
  option("mode", "PERMISSIVE").
  option("columnNameOfCorruptRecord", "corrupted_record").
  schema(s).
  json("/tmp/test.json")

df2.show(false)
Output
scala> df2.filter($"corrupted_record".isNotNull).show(false)
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|City|RecordNumber|State|ZipCodeType|Zipcode|corrupted_record |
+----+------------+-----+-----------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|null|null |null |null |null |[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
},{
"Zipcode": "709aa",
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
Question
Since only the third record has a Zipcode that is a string ("Zipcode": "709aa") where I expect an integer, shouldn't only that record go into the corrupted_record column, with the other records parsed correctly?

You only have one record that is corrupt (that is what multiline=true gives you: the whole file is read as a single record), so everything went there.
If you want Spark to process the records individually, you need to use single-line JSON (one object per line); Spark can then distribute the parsing across multiple executors, which also scales better for larger files.
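If rewriting the input as JSON Lines is an option, the conversion is straightforward with Python's standard `json` module (a sketch with the records inlined; in practice you would read `/tmp/test.json` and write the result back out):

```python
import json

# Inline copy of the original JSON array (fields trimmed for brevity).
records = json.loads("""
[{"RecordNumber": 2, "Zipcode": 704, "City": "PASEO COSTA DEL SUR"},
 {"RecordNumber": 10, "Zipcode": 709, "City": "BDA SAN LUIS"},
 {"Zipcode": "709aa", "City": "BDA SAN LUIS"}]
""")

# JSON Lines: one self-contained object per line, no enclosing array.
jsonl = "\n".join(json.dumps(rec) for rec in records)
print(jsonl)
```

With the file in this shape, reading it without the `multiline` option parses each line independently, so only the `"709aa"` record should land in the corrupt column.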