Converting JSON lines in an RDD to a DataFrame in Apache Spark


I have about 17,000 files in S3 that look like this:

{"hour": "00", "month": "07", "second": "00", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"}
{"hour": "00", "month": "07", "second": "01", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"}
{"hour": "00", "month": "07", "second": "02", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"}
{"hour": "00", "month": "07", "second": "03", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"}
{"hour": "00", "month": "07", "second": "04", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"}
There is one file per day. Each file contains one record per second, so there are roughly 86,000 records per file. Each file has a name like "YYYY-MM-DD".

I use boto3 to generate the list of files in the bucket. Here I select only 10 of the files by using a prefix:

import boto3
s3_list = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('time-waits-for-no-man')
for object in my_bucket.objects.filter(Prefix='1972-05-1'):
    s3_list.append(object.key)
This returns a list of files (S3 keys). I then define a function to fetch a file and return its rows:

from pyspark.sql import Row

def FileRead(s3Key):
    s3obj = boto3.resource('s3').Object(bucket_name='bucket', key=s3Key)
    contents = s3obj.get()['Body'].read().decode('utf-8')
    # Each yielded Row currently holds the whole file as a single string
    yield Row(contents)
I then distribute this function with flatMap:

job = sc.parallelize(s3_list)
foo = job.flatMap(FileRead)
The problem: I can't work out how to get these rows into a DataFrame correctly:

>>> foo.toDF().show()
+--------------------+                                                          
|                  _1|
+--------------------+
|{"hour": "00", "m...|
|{"hour": "00", "m...|
|{"hour": "00", "m...|
|{"hour": "00", "m...|
|{"hour": "00", "m...|
|{"hour": "00", "m...|
|{"hour": "00", "m...|
|{"hour": "00", "m...|
|{"hour": "00", "m...|
|{"hour": "00", "m...|
+--------------------+

>>> foo.toDF().count()
10  

As the output shows, I get only one row per file, each holding an entire file as a single string, rather than one row per record. Can anyone tell me how to do this properly?

You should probably use the json reader directly (spark.read.json / sqlContext.read.json), but if you know the schema, you can try parsing the JSON strings manually:

from pyspark.sql.types import StructField, StructType, StringType
from pyspark.sql import Row
import json

fields = ['day', 'hour', 'minute', 'month', 'second', 'timezone', 'year']
schema = StructType([
    StructField(field, StringType(), True) for field in fields
])

def parse(s, fields):
    try:
        d = json.loads(s[0])
        return [tuple(d.get(field) for field in fields)]
    except:
        return []

spark.createDataFrame(foo.flatMap(lambda s: parse(s, fields)), schema)
You can also use get_json_object, applied here to a DataFrame with the raw JSON in a single string column (named value):

from pyspark.sql.functions import get_json_object

foo.toDF(["value"]).select([
    get_json_object("value", "$.{0}".format(field)).alias(field)
    for field in fields
])
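
For reference, here is a rough sketch of the spark.read.json route mentioned at the top of this answer. It assumes an S3 filesystem connector (e.g. s3a) is configured and reuses the bucket and prefix from the question, so treat the exact path as illustrative:

# Sketch only: read the newline-delimited JSON files straight from S3 and
# let Spark infer the schema. The s3a scheme and the glob pattern are
# assumptions based on the bucket/prefix named in the question.
df = spark.read.json("s3a://time-waits-for-no-man/1972-05-1*")
df.show()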

In the end, I used:

import json

def distributedJsonRead(s3Key):
    # Fetch one file from S3 and parse each JSON line into a dict.
    s3obj = boto3.resource('s3').Object(bucket_name='bucket', key=s3Key)
    contents = s3obj.get()['Body'].read().decode()
    result = []
    limit = 10  # only keep the first 10 records per file while testing
    for item in contents.split('\n'):
        if not item:
            continue  # skip blank trailing lines
        result.append(json.loads(item))
        if len(result) == limit:
            break
    return result

job = sc.parallelize(s3_list)
foo = job.flatMap(distributedJsonRead)
df = foo.toDF()

Thanks to @user6910411 for the inspiration.
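
A possible refinement, not part of the original post: because flatMap now yields plain dicts, the explicit schema from the earlier answer could be passed to createDataFrame instead of relying on toDF()'s inference. A minimal sketch:

# Sketch only: reuse the explicit `schema` defined in the earlier answer so
# column order and types are fixed rather than inferred from the dicts.
df = spark.createDataFrame(foo, schema)
df.show()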

Here is another way to solve the same problem:

from pyspark.sql.types import StructType, StructField, StringType

fields = ['hour', 'month', 'second', 'year', 'timezone', 'day', 'minute']

schema = StructType([
    StructField(field, StringType(), True) for field in fields
])

js = (
  {"hour": "00", "month": "07", "second": "00", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"},
  {"hour": "00", "month": "07", "second": "01", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"},
  {"hour": "00", "month": "07", "second": "02", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"},
  {"hour": "00", "month": "07", "second": "03", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"},
  {"hour": "00", "month": "07", "second": "04", "year": "1970", "timezone": "-00:00", "day": "12", "minute": "00"}
)

rdd = sc.parallelize(js)
jsDF = spark.createDataFrame(rdd, schema)

jsDF.show()
