Python 2.7: Error when using a global variable in a Spark project


In a PySpark project with multiple .py files, a file named settings.py declares all the global variables:

# settings.py

def prepareMyList():
    return ['35','19','10','25']

def setGlobal():
    global ageList
    ageList = prepareMyList()
Another file, utils.py, contains the filter method:

# utils.py

import settings

def returnIfTrue(row):
    if row[1] in settings.ageList:
        return row
filtering.py uses the method from utils.py to filter an RDD:

# filtering.py

import utils

def doFiltering(fileRDD):
    filteredRDD = fileRDD.filter(utils.returnIfTrue)
    return filteredRDD
main.py looks like this:

# main.py

from pyspark import SparkContext
import settings
import filtering

sc = SparkContext()
settings.setGlobal()
rawRDD = sc.textFile("/path/to/Data/")
splittedRDD = rawRDD.map(lambda l:l.split(","))
filteredRDD = filtering.doFiltering(splittedRDD)
for row in filteredRDD.collect():
    print row
Running the project raises an error:
AttributeError: 'module' object has no attribute 'ageList'

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/src/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
  process()
File "/usr/local/src/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
  serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/src/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
  vs = list(itertools.islice(iterator, batch))
File "utils.py", line 6, in returnIfTrue
  if row[1] in settings.ageList:
AttributeError: 'module' object has no attribute 'ageList'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    ... 1 more
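
The traceback shows the failure inside worker.py, that is, in a Spark worker process rather than in main.py. One likely reading: settings.setGlobal() runs only in the driver, while each worker imports settings afresh and never calls it. The sketch below simulates that situation in plain Python, without Spark; SETTINGS_SOURCE and load_settings are hypothetical names used only for this illustration.

```python
import types

# The contents of settings.py from the question, kept as a string so the
# module can be "imported" twice independently (hypothetical setup).
SETTINGS_SOURCE = '''
def prepareMyList():
    return ['35', '19', '10', '25']

def setGlobal():
    global ageList
    ageList = prepareMyList()
'''


def load_settings():
    """Build a fresh, independent copy of the settings module."""
    module = types.ModuleType("settings")
    exec(SETTINGS_SOURCE, module.__dict__)
    return module


# Driver side: main.py imports settings and calls setGlobal().
driver_settings = load_settings()
driver_settings.setGlobal()

# Worker side: the module is imported again, but setGlobal() is never called.
worker_settings = load_settings()

driver_has_list = hasattr(driver_settings, "ageList")
worker_has_list = hasattr(worker_settings, "ageList")
print(driver_has_list, worker_has_list)
```

In this simulation the driver copy has ageList and the worker copy does not, which matches the AttributeError raised from utils.py.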

But did you actually declare ageList? Try this:

ageList = None
def setGlobal():
    global ageList
    ageList = prepareMyList()

That does not work. It raises a TypeError: NoneType is not iterable.

Of course, since the default value is None. If you want an empty list, replace it with [].
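
Neither default addresses the case where setGlobal() is only ever called in the driver process: the workers would then see None or an empty list instead of the real values. A minimal sketch of one alternative, assuming the list is small enough to ship with the function: pass it explicitly through a closure rather than a module global. make_age_filter and keep_row are hypothetical names, not part of the original project, and the rows below are made-up sample data.

```python
def prepareMyList():
    # Same helper as in settings.py from the question.
    return ['35', '19', '10', '25']


def make_age_filter(age_list):
    """Return a predicate that captures the allowed ages in a closure."""
    ages = frozenset(age_list)  # set membership is cheaper than a list scan

    def keep_row(row):
        # row is a list of fields; the age is the second field, as in utils.py.
        return row[1] in ages

    return keep_row


# Plain-Python check of the predicate, with no Spark required.
rows = [['alice', '35'], ['bob', '40'], ['carol', '10']]
keep_row = make_age_filter(prepareMyList())
kept = [row for row in rows if keep_row(row)]
print(kept)
```

In filtering.py this would be used as fileRDD.filter(make_age_filter(ageList)), with ageList passed in from main.py. Because keep_row is a nested function, PySpark serializes it together with the captured values, so the workers do not depend on settings.ageList. For a large list, a broadcast variable created with sc.broadcast would be the more usual choice.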