
Apache Spark pyspark streaming: unable to execute rdd.count() on the workers


I have a pyspark streaming job that essentially does the following:

from datetime import datetime

def printrddcount(rdd):
    c = rdd.count()
    # as posted, a placeholder is printed instead of c (see below)
    print("{1}: Received an RDD of {0} rows".format("CANNOTCOUNT", datetime.now().isoformat()))
And then:

...
stream.foreachRDD(printrddcount)
stream = stream.map(parse(parse_event))
From what I understand, the printrddcount function will be executed on the workers. And yes, I know that calling print() inside a worker is a bad idea, but that is not the point here. I am fairly sure this code was working until very recently. (Also, it looked slightly different, because the content of 'c' was actually printed in the print statement, rather than just being computed and then thrown away...)
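For reference, the earlier, working shape was presumably just this (a sketch; the only assumed difference is that c is interpolated into the message):

from datetime import datetime

def printrddcount(rdd):
    c = rdd.count()
    # the earlier version printed the actual count
    print("{1}: Received an RDD of {0} rows".format(c, datetime.now().isoformat()))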

But now, it seems that (all of a sudden?) rdd.count() has stopped working and is making my worker processes die, saying:

UnpicklingError: NEWOBJ class argument has NULL tp_new
Full (python-only) stack trace:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/serializers.py", line 454, in loads
    return pickle.loads(obj)
UnpicklingError: NEWOBJ class argument has NULL tp_new
The line on which it actually fails is, indeed, the rdd.count().

Any idea why rdd.count() would fail?
If anything has to be serialized, it should be the rdd, right?

OK, I dug a little further. There is nothing wrong with rdd.count() itself.

The only thing that is wrong is that there is another transformation in the pipeline that somehow "corrupts" (closes? invalidates? something along those lines) the rdd, so that by the time it reaches the printrddcount function it can no longer be serialized, and it throws the error.
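To pin down what is actually serialized here: PySpark ships the function closures in the lineage to the workers, not the RDD itself. A minimal sketch of that step, using the cloudpickle module bundled with Spark 2.x (the wrapper function and logger below are illustrative, not the original code):

import logging

# cloudpickle ships with Spark 2.x (pyspark/cloudpickle.py) and is what
# PySpark uses to serialize closures before sending them to workers.
from pyspark import cloudpickle

log = logging.getLogger(__name__)

def wrapper(event):
    log.info("parsing")  # the closure captures the module-level logger
    return event

# Roughly what happens when the lineage is shipped: the *function* is
# pickled by value, together with everything it closes over -- logger
# included. On the Python 2 / Spark 2 setup in the trace, the logger's
# internals survive dumps() but cannot be rebuilt, so the job dies in
# pickle.loads() on the worker with the UnpicklingError shown above.
blob = cloudpickle.dumps(wrapper)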

The problem was in code that looked like this:

...
log = logging.getLogger(__name__)
...
def parse(parse_function):
    def parse_function_wrapper(event):
        new_event = None  # fall-through value when parsing fails
        try:
            log.info("parsing")  # capturing this module-level logger is the culprit
            new_event = parse_function(event)
        except ParsingFailedException:
            pass
        return new_event
    return parse_function_wrapper
And then:

...
stream.foreachRDD(printrddcount)
stream = stream.map(parse(parse_event))
Now, the log.info (I tried many variants; initially the logging was inside the exception handler) is what causes the problem. Which leads me to say that, for whatever reason, it is most likely the logger object that cannot be serialized.
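If that is the case, a common workaround (a hedged sketch, not part of the original post) is to resolve the logger inside the wrapper, so the pickled closure captures only the logging module and never a Logger instance:

import logging

class ParsingFailedException(Exception):
    """Stand-in for the exception used by the original job."""

def parse(parse_function):
    def parse_function_wrapper(event):
        # Look the logger up lazily, on whichever process runs the
        # wrapper; the closure then contains no Logger instance and
        # pickles cleanly.
        log = logging.getLogger(__name__)
        new_event = None
        try:
            log.info("parsing")
            new_event = parse_function(event)
        except ParsingFailedException:
            pass
        return new_event
    return parse_function_wrapper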

I am closing this thread myself, since it actually has nothing to do with rdd serialization; and, most likely, nothing to do with pyspark either.