Apache Spark: UTF-8 encoding error when connecting a Flume Twitter stream to Spark in Python
Tags: apache-spark, pyspark, spark-streaming, flume-ng, flume-twitter

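I'm having trouble passing Twitter data collected by a Flume agent into a Spark stream. I can download tweets on their own using Flume alone, but once Spark is connected I get the error below. I suspect the problem is the default UTF-8 decoding in FlumeUtils.createStream(). How can I change it, and what should I change it to?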

Error in the pyspark terminal:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/streaming/flume.py", line 107, in func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/streaming/flume.py", line 36, in utf8_decoder
    return s.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 17: invalid continuation byte

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
17/01/01 15:36:41 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Command to start pyspark:

spark-submit --jars ~/project/spark-streaming-flume-assembly_2.11-2.0.2.jar ~/project/news_stream_flume/news_stream_analysis.py localhost 9999

Flume configuration:

# Name the components on this agent
FlumeAgent.sources = Twitter
FlumeAgent.sinks = spark
FlumeAgent.channels = MemChannel

# Twitter source
FlumeAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
FlumeAgent.sources.Twitter.consumerKey = x
FlumeAgent.sources.Twitter.consumerSecret =  y
FlumeAgent.sources.Twitter.accessToken = z
FlumeAgent.sources.Twitter.accessTokenSecret = xx
FlumeAgent.sources.Twitter.keywords = flume, spark

FlumeAgent.sinks.spark.type = avro
FlumeAgent.sinks.spark.channel = memoryChannel
FlumeAgent.sinks.spark.hostname = localhost
FlumeAgent.sinks.spark.port = 9999
FlumeAgent.sinks.spark.batch-size = 1

# Use a channel which buffers events in memory
FlumeAgent.channels.MemChannel.type = memory
FlumeAgent.channels.MemChannel.capacity = 10000
FlumeAgent.channels.MemChannel.transactionCapacity = 100

# Bind the source and sink to the channel
FlumeAgent.sources.Twitter.channels = MemChannel
FlumeAgent.sinks.spark.channel = MemChannel

Command to run the Flume agent:

flume-ng agent --name FlumeAgent --conf-file  /home/hduser/project/flume_config_2src_spark_avro  -f /usr/lib/flume-ng/conf/flume-conf.properties -Dflume.root.logger=DEBUG,console

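FlumeUtils.createStream takes a bodyDecoder argument, which is the function used to decode the event body into a string. The default implementation only checks for None and otherwise decodes as UTF-8: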

def utf8_decoder(s):
    """ Decode the unicode as UTF-8 """
    if s is None:
        return None
    return s.decode('utf-8')
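
  • In Python 2.x you should be able to replace it with your own function that uses the desired encoding, or even skip decoding entirely by passing the identity function (lambda x: x).

  • Python 3.x may need some additional steps (a JVM-side map with getBytes) to account for the String -> unicode mapping in Pyrolite.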
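
As a rough illustration of the first point, a replacement decoder could look like the sketch below. The names make_body_decoder, lenient_utf8_decoder, latin1_decoder and identity_decoder are hypothetical and not part of pyspark, and the choice of latin-1 or of errors="replace" is only an assumption: the right setting depends on what the Flume source actually puts in the event body. If the body is not text at all, the identity function is probably the safer choice.

```python
def make_body_decoder(encoding="utf-8", errors="replace"):
    """Build a function shaped like pyspark's default utf8_decoder.

    Hypothetical helper: the encoding and error policy are assumptions
    and should match what the Flume source actually emits.
    """
    def decode(s):
        # Mirror the default decoder: pass None through untouched.
        if s is None:
            return None
        return s.decode(encoding, errors)
    return decode


# UTF-8, but undecodable bytes become U+FFFD instead of raising.
lenient_utf8_decoder = make_body_decoder("utf-8", "replace")

# A single-byte encoding that accepts every byte value.
latin1_decoder = make_body_decoder("latin-1", "strict")

# Skip decoding entirely and hand the raw body to the next stage.
identity_decoder = lambda x: x

# Usage sketch (assumes a pyspark version that still ships
# pyspark.streaming.flume, such as the 2.0.2 used in the question,
# and an existing StreamingContext named ssc):
#
# from pyspark.streaming.flume import FlumeUtils
# stream = FlumeUtils.createStream(
#     ssc, "localhost", 9999, bodyDecoder=lenient_utf8_decoder)
```

Whichever decoder is chosen runs on every event body, so a lenient decoder hides bad bytes rather than explaining them. It is worth inspecting a few raw bodies with the identity decoder first to see what the source is sending.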