
Python 3.x: Converting streaming JSON into a DataFrame


Question: How can I convert a JSON string into a DataFrame and select only the keys I want?

I only started using Spark last week and am still learning, so please bear with me.

I am using Spark (2.4) Structured Streaming. The Spark application receives data from the Twitter stream (via a socket), and what is sent is the full tweet JSON string. Below is the DataFrame; each row is a complete JSON tweet:

+--------------------+
|               value|
+--------------------+
|{"created_at":"Tu...|
|{"created_at":"Tu...|
|{"created_at":"Tu...|
+--------------------+
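An aside on the table above: the braces being cut off ({"created_at":"Tu...) is just show()'s default 20-character display truncation; the underlying value string is the complete tweet. A plain-Python sketch of that display rule (show_cell is an illustrative helper, not a Spark API):

```python
def show_cell(s: str, width: int = 20) -> str:
    # Mimic DataFrame.show()'s default truncation: cells longer than
    # `width` keep the first width-3 characters plus a trailing "..."
    return s if len(s) <= width else s[:width - 3] + "..."

full_value = '{"created_at":"Tue Feb 19 ...","id_str":"1098082646511443968"}'
print(show_cell(full_value))
# → {"created_at":"Tu...
```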
As Venkata suggested, I did exactly that, translated into Python (the full code is below).

This is what it returns:

+------------------------------+-------------------+
|created_at                    |id_str             |
+------------------------------+-------------------+
|Wed Feb 20 04:51:18 +0000 2019|1098082646511443968|
|Wed Feb 20 04:51:18 +0000 2019|1098082646285082630|
|Wed Feb 20 04:51:18 +0000 2019|1098082646444441600|
|Wed Feb 20 04:51:18 +0000 2019|1098082646557642752|
|Wed Feb 20 04:51:18 +0000 2019|1098082646494797824|
|Wed Feb 20 04:51:19 +0000 2019|1098082646817681408|
+------------------------------+-------------------+
As you can see, the DataFrame contains only the 2 keys I wanted.
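Conceptually, from_json with a partial schema parses the whole tweet but keeps only the fields declared in the schema. A plain-Python sketch of the same idea (pick_keys and WANTED_KEYS are illustrative names, not Spark APIs):

```python
import json

WANTED_KEYS = ("created_at", "id_str")

def pick_keys(raw: str, keys=WANTED_KEYS):
    # Parse the full JSON string, then keep only the declared keys.
    # Missing keys come back as None, much like from_json yields null columns.
    tweet = json.loads(raw)
    return {k: tweet.get(k) for k in keys}

row = '{"created_at": "Wed Feb 20 04:51:18 +0000 2019", "id_str": "1098082646511443968", "text": "..."}'
print(pick_keys(row))
# → {'created_at': 'Wed Feb 20 04:51:18 +0000 2019', 'id_str': '1098082646511443968'}
```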

val schema = new StructType().add("id", StringType).add("pin", StringType)

val dataFrame = data
  .selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", schema).alias("tmp"))
  .select("tmp.*")
Hope this helps any other newbies.

Full code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StringType


spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Read the raw tweet JSON strings from the socket, one tweet per line
lines = spark.readStream.format('socket').option('host', '127.0.0.1').option('port', 9999).load()

# Declare only the keys we want; from_json ignores everything else in the tweet
schema = StructType().add('created_at', StringType(), False).add('id_str', StringType(), False)
df = lines.selectExpr('CAST(value AS STRING)').select(from_json('value', schema).alias('temp')).select('temp.*')

query = df.writeStream.format('console').option('truncate', 'false').start()

# This part is only used to keep the query alive for a while when running
# as an app. Not needed if using jupyter.
import time
time.sleep(10)
query.stop()
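To try the app locally without the Twitter stream, something must be listening on 127.0.0.1:9999 and writing one JSON tweet per line. A minimal sketch of such a feeder (serve_once and SAMPLE_TWEETS are illustrative; real tweets carry many more fields):

```python
import json
import socket

# Hypothetical sample data mimicking the two keys the Spark job extracts
SAMPLE_TWEETS = [
    {"created_at": "Wed Feb 20 04:51:18 +0000 2019", "id_str": "1098082646511443968", "text": "hello"},
    {"created_at": "Wed Feb 20 04:51:19 +0000 2019", "id_str": "1098082646817681408", "text": "world"},
]

def serve_once(host="127.0.0.1", port=9999):
    """Accept one connection and send each tweet as a newline-delimited JSON string."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            for tweet in SAMPLE_TWEETS:
                conn.sendall((json.dumps(tweet) + "\n").encode("utf-8"))

if __name__ == "__main__":
    serve_once()
```

Run the feeder first, then start the Spark app; alternatively, `nc -lk 9999` and pasting JSON lines by hand works too. Note this sketch serves a single connection and then exits.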

Here is a sample code snippet you can use to convert the JSON into a DataFrame:

val schema = new StructType().add("id", StringType).add("pin", StringType)

val dataFrame = data
  .selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", schema).alias("tmp"))
  .select("tmp.*")