Apache Spark: PySpark direct streaming from Kafka


Goal

My goal is to get a simple Spark Streaming example working that uses the direct approach of interfacing with Kafka, but I can't get past a specific error.

The ideal result would be two console windows open: in one I can type sentences, and the other shows a "live" word count of all the sentences.

Console 1

the cat likes bacon

my cat ate bacon

Console 2

Time:

[("the", 2), ("cat", 1), ("likes", 1), ("bacon", 1)]

Time:

[("the", 3), ("cat", 2), ("likes", 1), ("bacon", 2), ("my", 1), ("ate", 1)]


Steps taken

Downloaded and untarred

kafka_2.10-0.8.2.0
spark-1.5.2-bin-hadoop2.6
Started the ZooKeeper and Kafka servers in separate screen sessions.

screen -S zk
bin/zookeeper-server-start.sh config/zookeeper.properties
Ctrl-a d to detach the screen

screen -S kafka
bin/kafka-server-start.sh config/server.properties
Ctrl-a d to detach the screen

Started a Kafka producer

Used a separate console window and typed words into it to simulate the stream.

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Started PySpark

Used the Spark Streaming Kafka package.

bin/pyspark --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.2
Ran the simple word count

Based on the example in the Spark documentation.
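The exact word-count code isn't shown above; a minimal sketch of what was presumably typed into the pyspark shell (where sc already exists), modelled on the standard direct Kafka word-count example, might look like this (the topic name and broker address are the ones used in the steps above):

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# sc is created automatically by bin/pyspark
ssc = StreamingContext(sc, 2)

# read the "test" topic directly from the broker started earlier
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"metadata.broker.list": "localhost:9092"})
lines = kvs.map(lambda x: x[1])  # keep only the message value, drop the key

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()

Note that this prints counts per batch; a running total across batches, like the Console 2 output above, would additionally need stateful streaming (for example updateStateByKey with a checkpoint directory).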


The Error

Typing words into the Kafka producer console produces results only once; then the error below appears once and no further results are produced (although the "Time" sections keep printing).

Any help or suggestions would be greatly appreciated.

Time: 2015-11-15 18:39:52
-------------------------------------------

15/11/15 18:42:57 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
        at java.net.ServerSocket.implAccept(ServerSocket.java:530)
        at java.net.ServerSocket.accept(ServerSocket.java:498)
        at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:645)
Traceback (most recent call last):
  File "/vagrant/install_files/spark-1.5.2-bin-hadoop2.6/python/pyspark/streaming/util.py", line 62, in call
    r = self.func(t, *rdds)
  File "/vagrant/install_files/spark-1.5.2-bin-hadoop2.6/python/pyspark/streaming/dstream.py", line 171, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/vagrant/install_files/spark-1.5.2-bin-hadoop2.6/python/pyspark/rdd.py", line 1299, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/vagrant/install_files/spark-1.5.2-bin-hadoop2.6/python/pyspark/context.py", line 917, in runJob
    return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File "/vagrant/install_files/spark-1.5.2-bin-hadoop2.6/python/pyspark/rdd.py", line 142, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/vagrant/install_files/spark-1.5.2-bin-hadoop2.6/python/pyspark/serializers.py", line 139, in load_stream
    yield self._read_with_length(stream)
  File "/vagrant/install_files/spark-1.5.2-bin-hadoop2.6/python/pyspark/serializers.py", line 156, in _read_with_length
    length = read_int(stream)
  File "/vagrant/install_files/spark-1.5.2-bin-hadoop2.6/python/pyspark/serializers.py", line 543, in read_int
    length = stream.read(4)
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
error: [Errno 104] Connection reset by peer

Try running:

spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.1 your_python_file_name.py

You can also set other parameters (--deploy-mode, etc.).

After creating the DStream's RDDs, we should use foreachRDD to iterate over the RDDs:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def handler(message):
    # runs on the driver once per batch; message is the batch RDD of (key, value) pairs
    records = message.collect()
    for record in records:
        # <data processing, whatever you want>
        pass

sc = SparkContext(appName="KafkaDirectStream")  # any app name; needed when running via spark-submit
ssc = StreamingContext(sc, 2)

topic = "test"
brokers = "localhost:9092"
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
kvs.foreachRDD(handler)

ssc.start()
ssc.awaitTermination()
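As an illustration only (not part of the original answer), a handler matching the word-count goal above could, assuming the batches are small enough to collect() onto the driver, look something like this:

from collections import Counter

def handler(message):
    # message is the batch RDD of (key, value) pairs from the direct Kafka stream
    records = message.collect()  # only reasonable for small demo batches
    counts = Counter()
    for _key, value in records:
        counts.update(value.split())
    print(sorted(counts.items()))

Using collect() pulls every record to the driver, which is fine for a demo but would not scale; the usual word-count transformations on the DStream itself (as in the sketch earlier) avoid that.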