Apache Spark: accessing Cassandra from PySpark

Tags: apache-spark, cassandra, pyspark

I am working on an Azure Data Lake and I want to access Cassandra from my PySpark script. I tried:

> pyspark --packages anguenot/pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Ivy Default Cache set to: /home/opnf/.ivy2/cache
The jars for the packages stored in: /home/opnf/.ivy2/jars
:: loading settings :: url = jar:file:/usr/hdp/2.5.5.0-157/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
anguenot#pyspark-cassandra added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found anguenot#pyspark-cassandra;0.7.0 in spark-packages
        found com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 in central
        found org.joda#joda-convert;1.2 in central
        found commons-beanutils#commons-beanutils;1.9.3 in central
        found commons-collections#commons-collections;3.2.2 in central
        found com.twitter#jsr166e;1.1.0 in central
        found io.netty#netty-all;4.0.33.Final in central
        found joda-time#joda-time;2.3 in central
        found org.scala-lang#scala-reflect;2.11.8 in central
        found net.razorvine#pyrolite;4.10 in central
        found net.razorvine#serpent;1.12 in central
:: resolution report :: resolve 710ms :: artifacts dl 33ms
        :: modules in use:
        anguenot#pyspark-cassandra;0.7.0 from spark-packages in [default]
        com.datastax.spark#spark-cassandra-connector_2.11;2.0.6 from central in [default]
        com.twitter#jsr166e;1.1.0 from central in [default]
        commons-beanutils#commons-beanutils;1.9.3 from central in [default]
        commons-collections#commons-collections;3.2.2 from central in [default]
        io.netty#netty-all;4.0.33.Final from central in [default]
        joda-time#joda-time;2.3 from central in [default]
        net.razorvine#pyrolite;4.10 from central in [default]
        net.razorvine#serpent;1.12 from central in [default]
        org.joda#joda-convert;1.2 from central in [default]
        org.scala-lang#scala-reflect;2.11.8 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   11  |   0   |   0   |   0   ||   11  |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 11 already retrieved (0kB/40ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/anguenot_pyspark-cassandra-0.7.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.datastax.spark_spark-cassandra-connector_2.11-2.0.6.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_pyrolite-4.10.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.joda_joda-convert-1.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/joda-time_joda-time-2.3.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar added multiple times to distributed cache.
18/04/17 14:52:39 WARN Client: Same path resource file:/home/opnf/.ivy2/jars/net.razorvine_serpent-1.12.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2.2.5.5.0-157
      /_/

Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
SparkSession available as 'spark'.
>>> import pyspark_cassandra
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_cassandra

Apparently nothing went wrong while the package was resolved and loaded, yet at the end I still cannot import it. What could be the cause?

The package is used slightly differently from what the documentation describes.

There is no need to import the package. Instead, to read a DataFrame, use:

sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_table", keyspace="my_keyspace")\
    .load()
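
For instance, a minimal sketch of reading and inspecting a table could look like this; my_keyspace and my_table are the answer's placeholder names, and the Cassandra host is assumed to come from the spark.cassandra.connection.host setting passed on the pyspark command line, as in the question:

# Sketch: read a Cassandra table into a DataFrame and inspect it.
# "my_table"/"my_keyspace" are placeholders; the host is taken from
# spark.cassandra.connection.host set when pyspark was started.
df = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_table", keyspace="my_keyspace")\
    .load()

df.printSchema()   # columns and types mapped from the Cassandra schema
df.show(5)         # a few rows, to confirm the connection works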
To write, use:

df.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(
        table="my_table", 
        keyspace="my_keyspace",
    )\
    .save()

(With mode('overwrite'), you may need to add .option('confirm.truncate', True).)
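
Putting the two together, an overwrite could look like the following sketch (same placeholder table and keyspace names as above):

# Sketch: overwrite an existing Cassandra table. The connector refuses
# to truncate a table unless confirm.truncate is set, hence the option.
df.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('overwrite')\
    .option('confirm.truncate', True)\
    .options(table="my_table", keyspace="my_keyspace")\
    .save()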

Is there a reason for using pyspark-cassandra instead of using the Spark Cassandra Connector directly? @AlexOtt This package seemed very convenient. Adding t