
Python: how to map structured data to a SchemaRDD in Spark?


I asked a different question about this before, but a few things have changed, so I am asking again as a new question. I have structured data of which only a part is in JSON format, but I need to map each entire record to a SchemaRDD. The data looks like this:

03052015 04:13:20 {"recordType":"NEW","data":{"keycol":"val1","col2":"val2","col3":"val3"}}

Each line starts with a date, followed by a time and then the JSON-formatted text. I need to map not only the JSON text but also the date and time into the same structure.

I tried this in Python, but obviously it does not work, because Row does not accept an RDD (the jsonRDD in this case).

The goal is to be able to run queries like the following against the SchemaRDD:

select date, time, data.keycol, data.val1, data.val2, data.val3 from myOrder
How can I map the entire row to a SchemaRDD?


Any help is appreciated.

sqlContext.jsonRDD creates a SchemaRDD from an RDD of strings, where each string contains a JSON representation of a record. This code sample is from the Spark SQL documentation ():
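A minimal sketch of that pattern (PySpark 1.x API; the sample JSON string and variable names here are illustrative, adapted from the documented jsonRDD example rather than copied from it):

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# An RDD of strings, where each string holds one complete JSON document
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
peopleRDD = sc.parallelize(jsonStrings)

# jsonRDD samples the strings to infer a schema and returns a SchemaRDD
people = sqlContext.jsonRDD(peopleRDD)
people.printSchema()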


The coolest thing about jsonRDD is that you can provide an additional parameter stating the JSON's schema, which improves performance; to get that schema, call the schemaRDD.schema method.
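For example, a hedged sketch of reusing an inferred schema (assuming the Spark 1.2-era PySpark API, where jsonRDD accepts an optional schema argument and SchemaRDD exposes schema() as a method; peopleRDD is the illustrative RDD from the sketch above):

# Infer the schema once by sampling the data...
inferred = sqlContext.jsonRDD(peopleRDD)
schema = inferred.schema()          # a StructType describing the records

# ...then pass it back in so later loads skip the inference step
fast = sqlContext.jsonRDD(peopleRDD, schema)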

The simplest option is to add those fields to the JSON itself and then use jsonRDD.

My data:

03052015 04:13:20 {"recordType":"NEW","data":{"keycol":"val1","col1":"val5","col2":"val3"}}
03062015 04:13:20 {"recordType":"NEW1","data":{"keycol":"val2","col1":"val6","col2":"val3"}}
03072015 04:13:20 {"recordType":"NEW2","data":{"keycol":"val3","col1":"val7","col2":"val3"}}
03082015 04:13:20 {"recordType":"NEW3","data":{"keycol":"val4","col1":"val8","col2":"val3"}}
Code:

import json

def transform(data):
    # The first 18 characters are the date and time; the rest is the JSON text
    ts  = data[:18].strip()
    jss = data[18:].strip()
    jsj = json.loads(jss)
    # Fold the timestamp into the JSON document so jsonRDD maps it with the rest
    jsj['ts'] = ts
    return json.dumps(jsj)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
rdd = sc.textFile('/sparkdemo/sample.data')
tbl = sqlContext.jsonRDD(rdd.map(transform))
tbl.registerTempTable("myOrder")

sqlContext.sql("select ts, recordType, data.keycol, data.col1, data.col2 data from myOrder").collect()

Result:

[Row(ts=u'03052015 04:13:20', recordType=u'NEW', keycol=u'val1', col1=u'val5', data=u'val3'), Row(ts=u'03062015 04:13:20', recordType=u'NEW1', keycol=u'val2', col1=u'val6', data=u'val3'), Row(ts=u'03072015 04:13:20', recordType=u'NEW2', keycol=u'val3', col1=u'val7', data=u'val3'), Row(ts=u'03082015 04:13:20', recordType=u'NEW3', keycol=u'val4', col1=u'val8', data=u'val3')]

One problem in your code is that you call jsonRDD for every single line, which is not correct - it takes an RDD and returns a SchemaRDD.
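To make the difference concrete, a short sketch (the commented-out line is a hypothetical reconstruction of the mistake, not code from the question):

# Wrong: jsonRDD expects an RDD of JSON strings, not a single string per line
# rows = rdd.map(lambda line: sqlContext.jsonRDD(line))
# Right: transform each line into a JSON string first, then call jsonRDD once
tbl = sqlContext.jsonRDD(rdd.map(transform))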

This is exactly what I was looking for. Thank you very much.
03052015 04:13:20 {"recordType":"NEW","data":{"keycol":"val1","col1":"val5","col2":"val3"}}
03062015 04:13:20 {"recordType":"NEW1","data":{"keycol":"val2","col1":"val6","col2":"val3"}}
03072015 04:13:20 {"recordType":"NEW2","data":{"keycol":"val3","col1":"val7","col2":"val3"}}
03082015 04:13:20 {"recordType":"NEW3","data":{"keycol":"val4","col1":"val8","col2":"val3"}}
import json

def transform(data):
    ts  = data[:18].strip()
    jss = data[18:].strip()
    jsj = json.loads(jss)
    jsj['ts'] = ts
    return json.dumps(jsj)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
rdd = sc.textFile('/sparkdemo/sample.data')
tbl = sqlContext.jsonRDD(rdd.map(transform))
tbl.registerTempTable("myOrder")

sqlContext.sql("select ts, recordType, data.keycol, data.col1, data.col2 data from myOrder").collect()
[Row(ts=u'03052015 04:13:20', recordType=u'NEW', keycol=u'val1', col1=u'val5', data=u'val3'), Row(ts=u'03062015 04:13:20', recordType=u'NEW1', keycol=u'val2', col1=u'val6', data=u'val3'), Row(ts=u'03072015 04:13:20', recordType=u'NEW2', keycol=u'val3', col1=u'val7', data=u'val3'), Row(ts=u'03082015 04:13:20', recordType=u'NEW3', keycol=u'val4', col1=u'val8', data=u'val3')]