Python: saving data back to Cassandra as an RDD

Tags: python, apache-spark, cassandra, pyspark, spark-cassandra-connector

I am trying to read messages from Kafka, process the data, and then add the data to Cassandra as if it were an RDD.

My problem is saving the data back to Cassandra:

from __future__ import print_function

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkConf, SparkContext

appName = 'Kafka_Cassandra_Test'
kafkaBrokers = '1.2.3.4:9092'
topic = 'test'
cassandraHosts = '1,2,3'
sparkMaster = 'spark://mysparkmaster:7077'


if __name__ == "__main__":
    conf = SparkConf()
    conf.set('spark.cassandra.connection.host', cassandraHosts)

    sc = SparkContext(sparkMaster, appName, conf=conf)

    ssc = StreamingContext(sc, 1)

    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": kafkaBrokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    counts.saveToCassandra('coreglead_v2', 'wordcount')  # fails, see the traceback below

    ssc.start()
    ssc.awaitTermination()
The error is:

[root@gasweb2 ~]# spark-submit --jars /var/spark/lib/spark-streaming-kafka-assembly_2.10-1.6.0.jar --packages datastax:spark-cassandra-connector:1.5.0-RC1-s_2.11 /var/spark/scripts/kafka_cassandra.py
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/var/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
datastax#spark-cassandra-connector added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found datastax#spark-cassandra-connector;1.5.0-RC1-s_2.11 in spark-packages
    found org.apache.cassandra#cassandra-clientutil;2.2.2 in central
    found com.datastax.cassandra#cassandra-driver-core;3.0.0-rc1 in central
    found io.netty#netty-handler;4.0.33.Final in central
    found io.netty#netty-buffer;4.0.33.Final in central
    found io.netty#netty-common;4.0.33.Final in central
    found io.netty#netty-transport;4.0.33.Final in central
    found io.netty#netty-codec;4.0.33.Final in central
    found io.dropwizard.metrics#metrics-core;3.1.2 in central
    found org.slf4j#slf4j-api;1.7.7 in central
    found org.apache.commons#commons-lang3;3.3.2 in central
    found com.google.guava#guava;16.0.1 in central
    found org.joda#joda-convert;1.2 in central
    found joda-time#joda-time;2.3 in central
    found com.twitter#jsr166e;1.1.0 in central
    found org.scala-lang#scala-reflect;2.11.7 in central
:: resolution report :: resolve 647ms :: artifacts dl 15ms
    :: modules in use:
    com.datastax.cassandra#cassandra-driver-core;3.0.0-rc1 from central in [default]
    com.google.guava#guava;16.0.1 from central in [default]
    com.twitter#jsr166e;1.1.0 from central in [default]
    datastax#spark-cassandra-connector;1.5.0-RC1-s_2.11 from spark-packages in [default]
    io.dropwizard.metrics#metrics-core;3.1.2 from central in [default]
    io.netty#netty-buffer;4.0.33.Final from central in [default]
    io.netty#netty-codec;4.0.33.Final from central in [default]
    io.netty#netty-common;4.0.33.Final from central in [default]
    io.netty#netty-handler;4.0.33.Final from central in [default]
    io.netty#netty-transport;4.0.33.Final from central in [default]
    joda-time#joda-time;2.3 from central in [default]
    org.apache.cassandra#cassandra-clientutil;2.2.2 from central in [default]
    org.apache.commons#commons-lang3;3.3.2 from central in [default]
    org.joda#joda-convert;1.2 from central in [default]
    org.scala-lang#scala-reflect;2.11.7 from central in [default]
    org.slf4j#slf4j-api;1.7.7 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   16  |   0   |   0   |   0   ||   16  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 16 already retrieved (0kB/14ms)
16/02/15 16:26:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/var/spark/scripts/kafka_cassandra.py", line 27, in <module>
    counts.saveToCassandra('coreglead_v2', 'wordcount')
AttributeError: 'TransformedDStream' object has no attribute 'saveToCassandra'

From searching around I have found that this seems to be related to a different library (which I cannot use, as I am on Cassandra 3.0 and it is not supported yet).

The goal is to create aggregated data from the individual messages (the wordcount is only for testing) and insert it into multiple tables.


I am close to just using the driver directly and writing the statements myself, but is there a better way to achieve this?
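
For concreteness, a minimal sketch of that manual approach, using the DataStax Python driver (cassandra-driver) inside foreachPartition; the host placeholder and the word/count column names are assumptions carried over from the code above:

from cassandra.cluster import Cluster

def write_partition(rows):
    # one connection per partition, opened on the worker itself
    cluster = Cluster(['cassandra-host'])  # placeholder; use your own node addresses
    session = cluster.connect('coreglead_v2')
    insert = session.prepare('INSERT INTO wordcount (word, count) VALUES (?, ?)')
    for word, count in rows:
        session.execute(insert, (word, count))
    cluster.shutdown()

counts.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))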

Looking at your code and reading your question description: it seems you are not actually using a Cassandra connector. Spark does not provide Cassandra support out of the box, so the RDD and DStream data types have no saveToCassandra method. You need to import an external Spark-Cassandra connector that extends the RDD and DStream types with Cassandra integration.

This is why you get the error: Python cannot find a saveToCassandra function on the DStream type, because none currently exists.


You need to get the DataStax connector, or another connector, to extend the DStream type with saveToCassandra. However, the Spark Cassandra Connector from DataStax that you are using does not support Python at the RDD/DStream level; only DataFrames are supported. See the connector documentation for more information.
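
To illustrate the DataFrame route, here is a minimal, untested sketch for Spark 1.6 with the DataStax connector on the classpath; the word/count column names are an assumption based on the wordcount example above:

from pyspark.sql import SQLContext, Row

def save_counts(rdd):
    if rdd.isEmpty():
        return
    sql_context = SQLContext.getOrCreate(rdd.context)
    # column names must match the columns of the Cassandra table
    df = sql_context.createDataFrame(rdd.map(lambda kv: Row(word=kv[0], count=kv[1])))
    df.write \
        .format('org.apache.spark.sql.cassandra') \
        .options(keyspace='coreglead_v2', table='wordcount') \
        .mode('append') \
        .save()

counts.foreachRDD(save_counts)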

I have written a wrapper around the above connector. The functionality the DataStax connector provides is not complete, but a lot is there. Also, if performance is important to you, it may be worth investigating the performance impact.
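
For a sense of what a DStream-level save could look like with such a wrapper, a hedged sketch follows; the pyspark_cassandra import paths and the dict-based row format are assumptions modelled on the open-source pyspark-cassandra package, not part of Spark or the DataStax connector:

import pyspark_cassandra            # assumed package: adds saveToCassandra to RDDs
import pyspark_cassandra.streaming  # assumed module: patches DStreams the same way

# rows are represented as dicts mapping column names to values
counts.map(lambda kv: {'word': kv[0], 'count': kv[1]}) \
    .saveToCassandra('coreglead_v2', 'wordcount')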


Finally, Spark ships with the ability to use the CqlInput/OutputFormat from Hadoop MapReduce. In my opinion this is not a very developer-friendly option, but it does exist.

Thanks for your reply. I am using the connector provided by DataStax; I specified it when running spark-submit. I am new to Python, so how do I know what I should import?

@JimWright How have you set up Spark and PySpark? Are you using DataStax Enterprise? Also, are you using the pyspark shell, or how are you trying to execute the code?

I am using the community edition,