Python: How to use a ZIP with --py-files on pyspark?


I'm trying to write some pyspark jobs that depend on a module I'd like to ship with the job, rather than installing it globally on the cluster.

I decided to try doing this with a zip file, but I can't seem to get it working, and I can't seem to find any examples of this being done in the wild either.

I build the zip by running:

mkdir -p ./build
cd ./build && python ../src/setup.py sdist --formats=zip

This creates a file called ./build/dist/mysparklib-0.1.zip. So far, so good.
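For context, the question never shows src/setup.py; a minimal one along these lines (names and contents assumed, not taken from the original post) is roughly what sdist would be packaging here:

# src/setup.py -- hypothetical sketch; the real file isn't shown in the question
from setuptools import setup, find_packages

setup(
    name='mysparklib',
    version='0.1',
    packages=find_packages(),  # later switched to a hard-coded list (see the note at the end)
)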

My job looks like this:

from pyspark import SparkContext

# See: http://spark.apache.org/docs/latest/quick-start.html

readme_filename = './README.md'

sc = SparkContext('local', 'helloworld app')

readme_data = sc.textFile(readme_filename).cache()

def test_a_filter(s):
    import mysparklib
    return 'a' in s

a_s = readme_data.filter(test_a_filter).count()
b_s = readme_data.filter(lambda s: 'b' in s).count()

print("""
**************************************
* Lines with a: {}; Lines with b: {} *
**************************************
""".format(a_s, b_s))

sc.stop()
(This is mostly adapted from the quickstart, except that I'm trying to import my module inside one of the filters.)

I kick off the job by running:

mkdir -p ./build
cd ./build && python ../src/setup.py sdist --formats=zip
spark-submit --master local[4] --py-files './build/dist/mysparklib-0.1.zip' ./jobs/helloworld.py
And while I can see that the zip file is getting included:

17/05/17 17:15:31 INFO SparkContext: Added file file:/Users/myuser/dev/mycompany/myproject/./build/dist/mysparklib-0.1.zip at file:/Users/myuser/dev/mycompany/myproject/./build/dist/mysparklib-0.1.zip with timestamp 1495055731604
it doesn't get imported:

17/05/17 17:15:34 INFO DAGScheduler: ResultStage 0 (count at /Users/myuser/dev/mycompany/myproject/./jobs/helloworld.py:15) failed in 1.162 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/myuser/dev/mycompany/myproject/spark/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/Users/myuser/dev/mycompany/myproject/spark/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/myuser/dev/mycompany/myproject/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2408, in pipeline_func
  File "/Users/myuser/dev/mycompany/myproject/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2408, in pipeline_func
  File "/Users/myuser/dev/mycompany/myproject/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2408, in pipeline_func
  File "/Users/myuser/dev/mycompany/myproject/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 345, in func
  File "/Users/myuser/dev/mycompany/myproject/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1040, in <lambda>
  File "/Users/myuser/dev/mycompany/myproject/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1040, in <genexpr>
  File "/Users/myuser/dev/mycompany/myproject/./jobs/helloworld.py", line 12, in test_a_filter
    import mysparklib
ModuleNotFoundError: No module named 'mysparklib'

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
As a sanity check, I ran python setup.py develop inside mysparklib and tried importing it from the CLI, and that works fine.
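That check amounts to something like the following (a sketch of the steps described above; exact paths are assumed):

# Hypothetical reproduction of the sanity check: install in develop mode, then import
cd ./src && python setup.py develop
python -c "import mysparklib; print(mysparklib.__file__)"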


Any ideas?

So I got it working! The core problem is that the sdist's directory structure isn't what Python expects when the zip is added to the module path (which is how --py-files works; you can confirm this by printing sys.path). In particular, the sdist zip contains the file /mysparklib-0.1/mysparklib/__init__.py, but what we need is a zip that contains /mysparklib/__init__.py.
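Both halves of that are easy to check. Here is a small sketch (not part of the original answer) that lists the archive's entries and prints sys.path from inside a task, reusing the readme_data RDD and the zip path from the job above:

# Sketch: inspect what --py-files ships vs. what the executors see.
import zipfile

# 1) The sdist zip's entries all live under 'mysparklib-0.1/', which is why
#    'import mysparklib' can't resolve when this zip is placed on sys.path.
print(zipfile.ZipFile('./build/dist/mysparklib-0.1.zip').namelist()[:5])

# 2) --py-files adds the zip to each executor's sys.path; print it from a task.
def show_sys_path(_partition):
    import sys
    return sys.path  # mapPartitions just needs an iterable back

print(readme_data.mapPartitions(show_sys_path).collect())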

So instead of running

cd ./build && python ../src/setup.py sdist --formats=zip

I'm now running

cd ./src && zip ../dist/mysparklib.zip -r ./mysparklib

and that does the trick.
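With that layout, the submit command just needs to point at the new archive. Assuming the job is launched from the project root (so the zip produced above lands in ./dist/), it would look something like this:

spark-submit --master local[4] --py-files './dist/mysparklib.zip' ./jobs/helloworld.py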


I did run into one dead end along the way(?): I unzipped my package to see what was inside, and my package wasn't there. Strange! So I changed two things: 1) I started running the sdist command from inside the ./src folder; 2) I changed the packages argument to hard-code mysparklib instead of counting on find_packages() to do the right thing. After that, the sdist archive does contain my package when I unpack it, but pyspark still fails! (I've confirmed that I can pip install the zipball and have it do the right thing.)
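The packages change described in 2) would look roughly like this (hypothetical, since the real setup.py isn't shown in the post):

# src/setup.py -- hypothetical; only the packages argument changes
from setuptools import setup

setup(
    name='mysparklib',
    version='0.1',
    packages=['mysparklib'],  # hard-coded instead of find_packages()
)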