JSON generated in Spark's structured streaming, accessible as a DataFrame in Python (pyspark) without RDDs

I am using Spark 2.4.3 and want to do structured streaming with data from a Kafka source. So far, the following code works:

from pyspark.sql import SparkSession
from ast import literal_eval

spark = SparkSession.builder \
    .appName("streamer") \
    .getOrCreate()

# Create DataFrame representing the stream
dsraw = spark.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "test") \
  .option("startingOffsets", """{"test":{"0":2707422}}""") \
  .load()

# Convert Kafka stream to something readable
ds = dsraw.selectExpr("CAST(value AS STRING)")

# Do query on the raw data
rawQuery = dsraw \
     .writeStream \
     .queryName("qraw") \
     .format("memory") \
     .start()
raw = spark.sql("select * from qraw")

# Do query on the converted data
dsQuery = ds \
     .writeStream \
     .queryName("qds") \
     .format("memory") \
     .start()
sdf = spark.sql("select * from qds")

# I have to access raw otherwise I get errors...
raw.select("value").show()

sdf.show()

# Make the JSON stuff accessible
sdf2 = sdf.rdd.map(lambda val: literal_eval(val['value']))
print(sdf2.first())
But what I would really like to know is whether the conversion in the next-to-last line is the most useful/fastest one. Do you have other ideas? Can I work with a (Spark) DataFrame instead of an RDD?

The output of the script is:

+--------------------+
|               value|
+--------------------+
|{
  "Signal": "[...|
|{
  "Signal": "[...|
+--------------------+
only showing top 20 rows

{'Signal': '[1234]', 'Value': 0.0, 'Timestamp': '2019-08-27T13:51:43.7146327Z'}

There are a few solutions, but only this adjusted one worked (credit to the original author):
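A minimal sketch of one DataFrame-only way to parse the messages, using from_json with an explicit schema; the schema fields and the query name qparsed are assumptions based on the sample record shown above, not the exact code from the credited answer:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Assumed schema, derived from the sample record above
schema = StructType([
    StructField("Signal", StringType(), True),
    StructField("Value", DoubleType(), True),
    StructField("Timestamp", StringType(), True),
])

# Parse the Kafka value as JSON and flatten it into columns,
# staying entirely in the DataFrame API (no RDD round-trip)
parsed = dsraw \
    .selectExpr("CAST(value AS STRING) AS value") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")

parsedQuery = parsed \
    .writeStream \
    .queryName("qparsed") \
    .format("memory") \
    .start()

spark.sql("select * from qparsed").show()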

With the output:

+--------+-----------+--------------------+
|  Signal|      Value|           Timestamp|
+--------+-----------+--------------------+
|[123456]|        0.0|2019-08-27T13:51:...|
|[123457]|        0.0|2019-08-27T13:51:...|
|[123458]| 318.880859|2019-08-27T13:51:...|
|[123459]|   285.5808|2019-08-27T13:51:...|
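For quick inspection during development, a console sink can be used instead of the in-memory table; this is only an assumed debugging alternative, not part of the solution above:

# Print each micro-batch to stdout instead of collecting it in memory
consoleQuery = parsed \
    .writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()

# Block until the streaming query is stopped
consoleQuery.awaitTermination()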
