Having trouble converting JSON to a Spark DataFrame

I've been trying to load JSON into a PySpark DataFrame, but I'm running into some trouble here.

Here's what I've tried so far (both with and without multiLine):

The JSON:

testjson = [
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
]

When I try to display the DataFrame, I get a _corrupt_record column. What am I doing wrong?
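
For reference, a minimal sketch of a read that reproduces the _corrupt_record symptom with this data (the exact call isn't shown in the question, so this is an assumption):

# Parallelizing the 1-tuples directly: Spark stringifies each tuple into
# something like ('{"id":434, ...}',), which is not valid JSON, so every
# row ends up in the _corrupt_record column.
df = spark.read.json(sc.parallelize(testjson))
df.show(truncate=False)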

Try converting it to a list of strings. Spark can't make sense of a list of one-element string tuples. Also, json.dumps is unnecessary, since Spark should be able to parse your JSON input directly.

# Extract the JSON string from each 1-tuple so Spark receives plain strings
df = spark.read.json(sc.parallelize([i[0] for i in testjson]))

df.show(truncate=False)
+--------------------------------------------------------------------+---+
|address                                                             |id |
+--------------------------------------------------------------------+---+
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
+--------------------------------------------------------------------+---+
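
If you'd rather keep the raw strings in a DataFrame column and parse them explicitly, here is a sketch using from_json (the schema below is written by hand from the sample records, so adjust it to your real data):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Explicit schema matching the sample records
schema = StructType([
    StructField("id", LongType()),
    StructField("address", ArrayType(StringType())),
])

raw = spark.createDataFrame(testjson, ["value"])  # one string column per 1-tuple
parsed = raw.select(F.from_json("value", schema).alias("j")).select("j.*")
parsed.show(truncate=False)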

Your JSON looks invalid. Try running it through a validator.
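
One quick way to check locally, assuming Python's standard json module (the comment doesn't name a specific validator):

import json

# json.loads raises json.JSONDecodeError if a record isn't valid JSON
for row in testjson:
    json.loads(row[0])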