
How can Spark Streaming handle snappy-compressed data from Kafka?


The Kafka producer sample code is:

#!/usr/bin/env python
#-*- coding: utf-8 -*-

import ConfigParser as configparser
from pykafka import KafkaClient
import time
import snappy

config = configparser.ConfigParser()
config.read("conf.ini")

app_name = "test_word_counter"
kafka_hosts = config.get(app_name, 'kafka_hosts')
kafka_topic = config.get(app_name, 'kafka_topic')
print("kafka client: %s" % kafka_hosts)
print("kafka topic: %s" % kafka_topic)

kafka_client = KafkaClient(hosts=kafka_hosts)  # Create Kafka client
topic = kafka_client.topics[kafka_topic]  # This will create the topic if it does not exist

with topic.get_producer() as producer:  # Create Kafka producer on the given topic
    while True:
        msg = "just a test for snappy compress with kafka and spark"
        msg = snappy.compress(msg) # add snappy compress
        producer.produce(msg) # Send the message to Kafka
        print("send data len(%d)" % len(msg))
        print(msg)
        time.sleep(5)
The code is very simple: it compresses the data with python-snappy and then sends it to Kafka.
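As an aside, Kafka can also compress message batches at the protocol level, in which case the broker and consumers handle snappy transparently and the consumer sees plain text. A minimal sketch of that alternative, assuming pykafka's CompressionType producer option (reusing kafka_hosts and kafka_topic from the config above):

from pykafka import KafkaClient
from pykafka.common import CompressionType

kafka_client = KafkaClient(hosts=kafka_hosts)
topic = kafka_client.topics[kafka_topic]

# The producer snappy-compresses batches on the wire; consumers
# receive the original uncompressed text with no manual snappy calls.
with topic.get_producer(compression=CompressionType.SNAPPY) as producer:
    producer.produce("just a test for snappy compress with kafka and spark")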

The PySpark code is:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def word_counter(zk_host, topic):
    sc = SparkContext(appName="PythonStreamingKafkaWordCounter")  # a single SparkContext
    ssc = StreamingContext(sc, 30)  # 30-second batch interval

    kvs = KafkaUtils.createStream(ssc, zk_host, "spark-streaming-consumer", {topic: 2})  # 2 receiver threads
    lines = kvs.map(lambda x: x[1])  # each record is a (key, value) pair; keep the value
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)

    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
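The function would then be driven by something like the following (the ZooKeeper address and topic name here are placeholder values):

word_counter("zk-host:2181", "test_topic")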
Then run spark-submit:

spark-submit --jars /usr/local/services/metrics-spark-analyser/external/spark-streaming-kafka-0-8-assembly_2.11-2.0.2.jar spark_word_counter_consumer.py
I get the following Spark error message:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1: invalid continuation byte
The more detailed traceback is:

16/12/18 13:58:30 ERROR Executor: Exception in task 5.0 in stage 7.0 (TID 30)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/services/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/usr/local/services/spark/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/services/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/local/services/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/usr/local/services/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/usr/local/services/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally
  File "/usr/local/services/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
  File "/usr/local/services/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 73, in <lambda>
  File "/usr/local/services/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 36, in utf8_decoder
    return s.decode('utf-8')
  File "/usr/local/services/python/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1: invalid continuation byte
It looks like Spark Streaming cannot decompress the snappy data from Kafka.

Should I add some configuration in Spark?
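From the traceback, the failure happens in pyspark/streaming/kafka.py's utf8_decoder, the default decoder that createStream applies to keys and values, which assumes the raw bytes are UTF-8 text. One possible fix (a sketch, not verified against this setup; names as in word_counter above) is to pass a custom valueDecoder that decompresses the snappy payload before decoding:

import snappy

def snappy_decoder(value):
    # Decompress the raw snappy bytes, then decode the text inside.
    if value is None:
        return None
    return snappy.decompress(value).decode('utf-8')

kvs = KafkaUtils.createStream(ssc, zk_host, "spark-streaming-consumer",
                              {topic: 2}, valueDecoder=snappy_decoder)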

Thanks~

Software details:

  • Kafka 0.10.1.0
  • Spark 2.0.2 (pre-built for Hadoop 2.7)
  • python-snappy 0.5
PS:

I wrote a simple Kafka consumer that reads the snappy data from Kafka, and the snappy decompression succeeds there.
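A minimal sketch of such a standalone consumer (assuming pykafka's get_simple_consumer API, with kafka_hosts and kafka_topic as in conf.ini above):

import snappy
from pykafka import KafkaClient

client = KafkaClient(hosts=kafka_hosts)
consumer = client.topics[kafka_topic].get_simple_consumer()
for message in consumer:
    if message is not None:
        # message.value holds the raw snappy-compressed payload
        print(snappy.decompress(message.value))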