
Python: printing PySpark streaming data from a Kafka topic


I am new to Kafka and PySpark and am trying to write a simple program. I have two files of JSON-formatted data in a Kafka topic that I am reading with PySpark streaming.

My producer code is as follows:

from kafka import KafkaProducer
import json
import boto3

class producer:
    def json_serializer(data):
        # Serialize a Python object to UTF-8 JSON bytes for Kafka
        return json.dumps(data).encode("utf-8")

    def read_s3():
        p1 = KafkaProducer(bootstrap_servers=['localhost:9092'],
                           value_serializer=producer.json_serializer)
        s3 = boto3.resource('s3')
        bucket = s3.Bucket('kakfa')
        # Read every object in the bucket; 'body' keeps the last one read
        for obj in bucket.objects.all():
            key = obj.key
            body = obj.get()['Body'].read().decode('utf-8')
        p1.send("Uber_Eats", body)
        p1.flush()
My consumer code is as follows:

from pyspark.sql import SparkSession
from kafka import KafkaConsumer

class consumer:
    def read_from_topic(self, spark):
        # Read the topic as a streaming DataFrame
        df = spark.readStream \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "localhost:9092") \
            .option("subscribe", "Uber_Eats") \
            .option("startingOffsets", "earliest") \
            .load()
        df.createOrReplaceTempView("kafka")
        spark.sql("select * from kafka")
        print(df.isStreaming())

    def get_consumer(self):
        consumer = KafkaConsumer("Uber_Eats", group_id='group1',
                                 bootstrap_servers="localhost:9092")
        return consumer

    def print_details(self, c1):
        # Read and print each message from the consumer
        try:
            for msg in c1:
                print(msg.topic, msg.value)
            print("Done")
        except Exception as e:
            print(e)
Main class:

from Producer_Group import *
from Consumer_Group import *
from Spark_Connection import *

class client:
    def transfer(self):
        spark = connection.get_connection(self)  # obtain the SparkSession
        producer.read_s3()                       # push the S3 objects to the topic
        c1 = consumer.get_consumer(spark)
        consumer.read_from_topic(self, spark)
        # consumer.print_details(self, c1)

c = client()
c.transfer()
Sample data from S3 that I am reading into the Kafka topic:

{
    
        {
            "Customer Number": "1",
            "Customer Name": "Aditya",
            "Restaurant Number": "2201",
            "Restaurant NameOrdered": "Bawarchi",
            "Number of Items": "3",
            "price": "10",
            "Operating Start hours": "9:00",
            "Operating End hours": "23:00"
        },
        {
            "Customer Number": "2",
            "Customer Name": "Sarva",
            "Restaurant Number": "2202",
            "Restaurant NameOrdered": "Sarvana Bhavan",
            "Number of Items": "4",
            "price": "20",
            "Operating Start hours": "8:00",
            "Operating End hours": "20:00"
        },
        {
            "Customer Number": "3",
            "Customer Name": "Kala",
            "Restaurant Number": "2203",
            "Restaurant NameOrdered": "Taco Bell",
            "Number of Items": "5",
            "price": "30",
            "Operating Start hours": "11:00",
            "Operating End hours": "21:00"
        }
    
}
What I have tried so far: I tried printing to the console so that I could check a condition and, if it passes, insert the record into a database. To check the condition, I read the data in the read_from_topic function and created a view (createOrReplaceTempView) to look at the data, but nothing is printed. Could someone guide me on how to print it and verify whether my data is being read correctly?

Thanks in advance.

"created a view (createOrReplaceTempView) to look at the data, but nothing is printed"

That is because spark.sql returns a new DataFrame.

If you want to print it, you need

spark.sql("select * from kafka").show()
However, that will only show (at least) two byte-array columns, key and value, not JSON strings, so you need to cast and parse them.
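A minimal sketch of that cast, assuming the streaming DataFrame df from read_from_topic and a schema guessed from the sample records shown in the question (the field list here is an assumption, not confirmed by the original answer):

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema based on the sample data shown above
schema = StructType([
    StructField("Customer Number", StringType()),
    StructField("Customer Name", StringType()),
    StructField("Operating Start hours", StringType()),
])

# Cast the binary Kafka value column to a string, then parse the JSON
parsed = df.selectExpr("CAST(value AS STRING) AS json_str") \
    .select(from_json(col("json_str"), schema).alias("data")) \
    .select("data.*")

The same cast can also be expressed directly in the SQL string, e.g. SELECT CAST(value AS STRING) FROM kafka.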
It is also worth pointing out that the data you have shown is not valid JSON, and since Spark can read files from S3 by itself, you do not need boto3 (and therefore Kafka is not strictly needed either, since you could take the S3 data straight to its final destination, with a Spark persist() in between).
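As a rough illustration of that point, reading the bucket directly with Spark might look like the sketch below (the s3a:// path, and the presence of the hadoop-aws package and S3 credentials on the cluster, are assumptions about your setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3_direct_read").getOrCreate()

# Read the JSON objects straight from the bucket, no boto3 or Kafka involved;
# assumes valid JSON input and a configured S3 filesystem (hadoop-aws)
df = spark.read.json("s3a://kakfa/")
df.show()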

I think I have told you before that KafkaConsumer and readStream.format("kafka") are two completely separate libraries: you should not use the first if you want to use Spark, and you should not use the second if your only goal is to consume from Kafka... so can you clarify what your goal is? Beyond that, you have only shown a class definition that is never called, so which part of this code is supposed to print anything?

I have included the "main class" and the sample data (JSON), please take a look. I am trying to read from the Kafka topic; each customer has an "Operating Start hours", and if it is later than 8:00 I should insert the record into a MySQL database, otherwise ignore it. That is my requirement. My goal is to consume only from the Kafka topic, do the transformations in Spark Streaming, and insert into the database based on that. @OneCricketeer thanks for your help.

I mean, Python's KafkaConsumer can also "do transformations", and you can combine it with other libraries to "write to a database", so it is still not clear why you think you need Spark. Besides, I would not even use Python for this, since Kafka ships with Kafka Connect precisely for writing to external systems. Apart from that, what happens when you run it? Because you should actually be getting an error in consumer.read_from_topic(self, spark).

I am trying the .show() method you gave, but it gives me an error: pyspark.sql.utils.AnalysisException: Queries with streaming sources must be executed with writeStream.start();

Right. You need start() and awaitTermination(). There is an example of this in the Spark source code.

Yes, I tried that and it gives me a different error: ERROR StreamMetadata: Error writing streaming metadata StreamMetadata(68220860-8bb2-4058-8799-64d5ef5fcc7d) to file:/C:/Users/komu0/AppData/Local/Temp/temporary-7975a168-d556-4ec3-abfb-f23c8c8508cd/metadata ExitCodeException exitCode=-1073741515

I do not know about that one, and I do not develop on Windows, but there seem to be some solutions for it.

Sure, thanks a lot.
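For reference, a minimal sketch of running such a streaming query with start() and awaitTermination(), using the console sink (the sink and output mode here are assumptions, not code from the question):

# df is the streaming DataFrame built in read_from_topic()
query = df.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

# Block until the streaming query is stopped or fails
query.awaitTermination()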