
Time series created by a Python UDF cannot be written to Parquet

Tags: python, apache-spark, pyspark, user-defined-functions

I have the pyspark code below. In it, I fill missing end_date values in the DataFrame tz_inventory_aud_df2 with a date far in the future. I take the minimum start date from the same DataFrame, then create a time series with one entry for every date from that minimum start date up to the current date. I use a UDF to build a DataFrame containing those dates, and then left-join that DataFrame to tz_inventory_aud_df to get the sum of a field filtered by each date in the DataFrame I created. When I finally try to write the result to a Parquet file, the error below appears in my driver logs. Does anyone know what is causing it, and can you suggest how to fix it?

Code:

Error:

2020-03-17 08:03:05,437 WARN  [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logWarning(66)) - Lost task 0.1 in stage 12651.0 (TID 479153, ip-10-100-7-60.glue.dnsmasq, executor 7): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/root/appcache/application_1584428038308_0005/container_1584428038308_0005_01_000013/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/mnt/yarn/usercache/root/appcache/application_1584428038308_0005/container_1584428038308_0005_01_000013/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/root/appcache/application_1584428038308_0005/container_1584428038308_0005_01_000013/pyspark.zip/pyspark/worker.py", line 248, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/mnt/yarn/usercache/root/appcache/application_1584428038308_0005/container_1584428038308_0005_01_000013/pyspark.zip/pyspark/worker.py", line 83, in <lambda>
    return lambda *a: toInternal(f(*a))
  File "/mnt/yarn/usercache/root/appcache/application_1584428038308_0005/container_1584428038308_0005_01_000013/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "script_2020-03-17-06-55-38.py", line 1839, in generate_date_series
    return [start + datetime.timedelta(days=x) for x in range(0, (stop-start).days + 1)]
TypeError: unsupported operand type(s) for -: 'datetime.date' and 'NoneType'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
Update:

import datetime
import time

from pyspark.sql.functions import col, lit, to_date, unix_timestamp
from pyspark.sql.types import ArrayType, DateType

# Keep only current rows and fill missing start/end dates with sentinel values
tz_inventory_aud_df2 = tz_inventory_aud_df[tz_inventory_aud_df['current_flag'] == 1]\
    .fillna({'end_date': '3018-01-01 00:00:00',
             'start_date': '1990-01-01 00:00:00'})

# Minimum start date across the inventory audit data
bs_df = tz_inventory_aud_df2.agg({'start_date': 'min'})\
    .withColumn('min_date', to_date(col('min(start_date)')))

# Current date as a literal column
timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')

bs_df = bs_df.withColumn('current_date', to_date(unix_timestamp(lit(timestamp), 'yyyy-MM-dd').cast("timestamp")))

# Creating the time-series dataframe

# UDF: one entry per day from start to stop (inclusive)
def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

# Register UDF for later usage
spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()))

# mydf is a DataFrame with columns `min_date` and `current_date` of type DateType()
bs_df.createOrReplaceTempView("mydf")

filldate_df = spark.sql("SELECT explode(generate_date_series(min_date, current_date)) as dates FROM mydf")

# Join each generated date to the audit rows whose interval contains it,
# then sum the unit counts per product per day
daily_af_units = filldate_df.alias('a').join(tz_inventory_aud_df2.alias('b'),
        (col('b.current_flag') == 1)
        & (col('a.dates') >= col('b.start_date'))
        & (col('a.dates') < col('b.end_date')),
        how='inner'
    )\
    .select(col('b.product_id'),
            col('a.dates'),
            (col('b.available_units') + col('b.reserved_units') + col('b.packed_and_ready_units')).alias('daily_product_remaining')
            )\
    .alias('c')\
    .groupby(['product_id', 'dates']).sum()

daily_af_units = daily_af_units.withColumn("daily_product_remaining", daily_af_units["sum(daily_product_remaining)"])

daily_af_units = daily_af_units[['product_id', 'dates', 'daily_product_remaining']]
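
The Parquet write that triggers the error is not shown in the posted code. Below is a minimal sketch of what it might look like, preceded by a quick sanity check that the fillna really removed the nulls the UDF and the join depend on; the output path and the check itself are illustrative additions, not part of the original code.

from pyspark.sql.functions import col

# Illustrative sanity check: count rows where the date bounds are still null
# after fillna (both counts should be zero before generating the series).
null_starts = tz_inventory_aud_df2.filter(col('start_date').isNull()).count()
null_ends = tz_inventory_aud_df2.filter(col('end_date').isNull()).count()
print('null start_date rows:', null_starts, 'null end_date rows:', null_ends)

# Write the aggregated result to Parquet; the path is a placeholder.
daily_af_units.write.mode('overwrite').parquet('s3://your-bucket/daily_af_units/')
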
The problem is not caused by the write operation (remember, Spark evaluates lazily), but by this operation:

from datetime import date
import time
date.fromtimestamp(time.time()) - None

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-059d3edeb205> in <module>
      1 from datetime import date
      2 import time
----> 3 date.fromtimestamp(time.time()) - None

TypeError: unsupported operand type(s) for -: 'datetime.date' and 'NoneType'
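
One way to make the UDF itself tolerate a missing bound is to return an empty series instead of doing date arithmetic with None. A minimal sketch follows; the null guard is an illustration, not part of the original answer.

import datetime
from pyspark.sql.types import ArrayType, DateType

def generate_date_series(start, stop):
    # Guard: if either bound is null, return an empty list so explode()
    # simply yields no rows instead of failing the Python worker.
    if start is None or stop is None:
        return []
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()))

This keeps the task alive, but if min_date really can come back null, the underlying data issue still needs to be fixed upstream.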

Thanks for pointing that out. I've added an update above; the corrected version of the DataFrame is used later in the code. I also tried filling any null start_date values so the UDF wouldn't run into a problem. But I still get the same error when I try to write. Is there something I'm not understanding? The error is caused by null values in start_date, right?
tz_inventory_aud_df2=tz_inventory_aud_df.fillna({'end_date':'3018-01-01 00:00:00'})


bs_df=tz_inventory_aud_df2.agg({'start_date':'min'})\
        .withColumn('min_date',to_date(col('min(start_date)')))
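
For reference, the snippet just above fills only end_date, which matches the pre-update code described in the question. On Spark 2.4 or later, the date series can also be generated without a Python UDF at all, using the built-in sequence function; a minimal sketch against the bs_df built in the update (column names min_date and current_date as above):

from pyspark.sql.functions import col, explode, expr, sequence

# Built-in date-series generator (Spark 2.4+): one row per day between
# min_date and current_date, with no round trip through a Python worker.
filldate_df = bs_df.select(
    explode(
        sequence(col('min_date'), col('current_date'), expr('interval 1 day'))
    ).alias('dates')
)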