Parsing Kafka CSV-delimited data into columns with from_csv in PySpark on Windows
I am new to Kafka structured streaming. I am trying to convert delimited data coming from Kafka into a PySpark DataFrame using a schema and from_csv:
from pyspark.sql.functions import from_csv
from pyspark.sql.types import StructType, StructField, StringType, LongType

kafkaDataSchema = StructType([
    StructField("sid", StringType()),
    StructField("timestamp", LongType()),
    StructField("sensor", StringType()),
    StructField("value", StringType()),
])
kafkaStream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", self.config.get('kafka-config', 'bootstrap-servers')) \
    .option("subscribe", self.config.get('kafka-config', 'topic-list-input')) \
    .option("startingOffsets", self.config.get('kafka-config', 'startingOffsets')) \
    .load() \
    .selectExpr("CAST(value AS STRING)")

formattedStream = kafkaStream.select(from_csv(kafkaStream.value, kafkaDataSchema))
I get the following error:
Traceback (most recent call last):
  File "main.py", line 43, in <module>
    formattedStream = KafkaSource.readData(spark)
  File "src.zip/src/main/sources/KafkaSource.py", line 31, in readData
  File "src.zip/src/main/sources/KafkaSource.py", line 36, in formatKafkaData
  File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/functions.py", line 4082, in from_csv
TypeError: schema argument should be a column or string
How can I fix this?

Answer: from_csv expects the schema as a string (or a column), not as a StructType object. Pass the schema's DDL string instead:

from_csv(kafkaStream.value, kafkaDataSchema.simpleString())
Thanks, that worked. One more question: how do I select only the "sid" value from the DataFrame?