Adding external Python libraries in PySpark

I'm using PySpark (1.6) and I want to use the databricks spark-csv library. I've tried different approaches to do that, without success.

1 - I tried adding a jar that I downloaded, and running:

pyspark --jars THE_NAME_OF_THE_JAR
df = sqlContext.read.format('com.databricks:spark-csv').options(header='true', inferschema='true').load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv')
But I got this error:

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
File "/usr/hdp/2.5.3.0-37/spark/python/pyspark/sql/readwriter.py", line 137, in load
return self._df(self._jreader.load(path))
 File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
 File "/usr/hdp/2.5.3.0-37/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
 File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o53.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks:spark-csv. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks:spark-csv.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
    ... 14 more
But that gave the same error as well.

3 - The third way:

 pyspark --packages com.databricks:spark-csv_2.11:1.5.0
But that didn't work either; I got this:

Python 2.7.13 |Anaconda 4.3.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Ivy Default Cache set to: /home/F18076/.ivy2/cache
The jars for the packages stored in: /home/F18076/.ivy2/jars
:: loading settings :: url = jar:file:/usr/hdp/2.5.3.0-37/spark/lib/spark-assembly-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]

Spark 1.6 includes the spark-csv module, so you don't need any external library.

Actually, as far as I remember, you just need to put the jar file in the folder where you run pyspark. Then you only have to run this code:

df = (sqlContext.read.format('com.databricks.spark.csv')
     .options(header='true', inferschema='true')
     .load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv') )
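
Note the format string used here: com.databricks.spark.csv (dots, the data source name), rather than com.databricks:spark-csv (a colon, which is the Maven coordinate you pass to --packages). The ClassNotFoundException in the question comes from passing the Maven coordinate to .format(). As a quick usage check, the same data source name also works for writing; a minimal sketch, where the output path is made up:

# Write the DataFrame back out through spark-csv; '/tmp/refecart_out' is a hypothetical path
(df.write.format('com.databricks.spark.csv')
    .option('header', 'true')
    .save('/tmp/refecart_out'))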

To make this work, download the spark-csv jar first. When I was using Apache Spark 1.6.1, I downloaded this version: spark-csv_2.10-1.4.0.jar, because Spark was built against Scala 2.10.
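
One caveat with the --jars route: spark-csv itself depends on commons-csv and univocity-parsers (you can see them being resolved in the Ivy output further down in this thread), so if loading still fails with only the spark-csv jar, pass those jars as well, for example (the exact jar file names are an assumption based on the versions resolved below):

pyspark --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar,univocity-parsers-1.5.1.jar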

For me, with Spark 1.6.3, the following works:

pyspark --packages com.databricks:spark-csv_2.10:1.5.0

After running the command above, the console output includes:

com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.10;1.5.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central
Note that unless you have specifically built Spark 1.x against Scala 2.11 (you would know if you had), you need to use spark-csv_2.10:1.5.0, not spark-csv_2.11:1.5.0.
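
If you are not sure which Scala version your Spark build uses, one way to check from a running pyspark shell is to ask the JVM through the py4j gateway; a sketch, assuming the standard sc object the shell creates:

# Prints something like "version 2.10.5"; pick the matching spark-csv artifact (_2.10 or _2.11)
print(sc._jvm.scala.util.Properties.versionString())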

If you don't want to add --packages com.databricks:spark-csv_2.10:1.5.0 every time you invoke pyspark, you can also configure the package in $SPARK_HOME/conf/spark-defaults.conf by adding the following line (you may need to create the file if you have never set anything in it before):

spark.jars.packages               com.databricks:spark-csv_2.10:1.5.0

Finally, with older versions of Spark 1.x (at least 1.4 and 1.5, I believe) you could instead set the environment variable PYSPARK_SUBMIT_ARGS, for example:

export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.5.0 pyspark-shell"

Invoking pyspark would then pick up the required dependency automatically. However, this no longer works in Spark 1.6.3.


None of this is needed with Spark 2.x, because spark-csv has been folded into Spark 2 itself.
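
For reference, a minimal sketch of the Spark 2.x equivalent, reusing the path from the question (the spark session object is the one the Spark 2 shells create for you):

# Spark 2.x: CSV is a built-in data source, no external package required
df = spark.read.csv('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv',
                    header=True, inferSchema=True)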

When using PySpark 1.6 I could not access the spark-csv module. Could you give me an example? It would also be interesting to know how to include any external library in my jobs for other purposes.

Sorry, that was my mistake: in Spark 1.6 you still need to include the external package to read CSV files.

I can't help you much with adding external files, but if you use spark-csv you should run your job with something like "spark-submit --packages com.databricks:spark-csv_2.10:1.5.0"

export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.5.0 pyspark-shell"