PySpark pyarrow pandas_udf: GROUPED_MAP returning a DataFrame with None/NaN in IntegerType or TimestampType columns

python pandas apache-spark pyspark

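Hi all,

I am currently experimenting with PySpark's pandas_udf, but unfortunately I run into problems when the DataFrame I return contains NA, None or NaN values. With FloatType the result is fine, but as soon as I use IntegerType, TimestampType, etc., I get an error and it no longer works.

Here are some examples of what works and what does not:

What works?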

Example 1)

Result:

    User    Sport       Age  Age_lag
0   Alice   Football    27   NaN
1   Bob     Basketball  34   27.0
2   Alice   Football    27   NaN
3   Bob     Basketball  34   27.0

Example 2)

If we change the type of Age_lag to IntegerType() and fill the NAs with -1, we still get a valid result (without NaNs).

Result:

    User    Sport       Age  Age_lag
0   Alice   Football    27   -1
1   Bob     Basketball  34   27
2   Alice   Football    27   -1
3   Bob     Basketball  34   27

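What does not work?

Example 3)

If we leave out .fillna(-1), we get the following error:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # True means the field accepts nulls
    custom_schema = StructType([
        StructField('User', StringType(), True),
        StructField('Sport', StringType(), True),
        StructField('Age', IntegerType(), True),
        StructField('Age_lag', IntegerType(), True),
    ])

    # The schema describes the output format of the UDF
    @pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
    def my_custom_function(pdf):
        # Input and output are both a pandas.DataFrame
        # Return a totally different DataFrame
        dt = pd.DataFrame({'User': ['Alice', 'Bob'],
                           'Sport': ['Football', 'Basketball'],
                           'Age': [27, 34]})
        dt['Age_lag'] = dt['Age'].shift(1)  # the first row becomes NaN
        return dt

    # df is the Spark DataFrame being grouped; it needs an 'id' column
    df.groupby('id').apply(my_custom_function).toPandas()

Result: pyarrow.lib.ArrowInvalid: Floating point value truncated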

Example 4)

Last but not least, if we just return a static DataFrame in which Age_lag contains a None, it does not work either:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # True means the field accepts nulls
    custom_schema = StructType([
        StructField('User', StringType(), True),
        StructField('Sport', StringType(), True),
        StructField('Age', IntegerType(), True),
        StructField('Age_lag', IntegerType(), True),
    ])

    # The schema describes the output format of the UDF
    @pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
    def my_custom_function(pdf):
        # Input and output are both a pandas.DataFrame
        # Return a totally different DataFrame
        dt = pd.DataFrame({'User': ['Alice', 'Bob'],
                           'Sport': ['Football', 'Basketball'],
                           'Age': [27, 34],
                           'Age_lag': [27, None]})
        return dt

    # df is the Spark DataFrame being grouped; it needs an 'id' column
    df.groupby('id').apply(my_custom_function).toPandas()
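
For what it is worth, the pandas side of examples 3 and 4 can be inspected without Spark. The sketch below only uses pandas and the toy data from the examples; the variable names (lagged, static, cast_failed) are just for illustration. It shows that both shift(1) and a literal None leave the column as float64 with a NaN, which is presumably what Arrow is then asked to squeeze into the integer column declared by IntegerType().

```python
import pandas as pd

# Same toy data as in examples 3 and 4.
dt = pd.DataFrame({'User': ['Alice', 'Bob'],
                   'Sport': ['Football', 'Basketball'],
                   'Age': [27, 34]})

# Example 3: shift() introduces a missing value, so pandas upcasts
# the integer column to float64 and stores the gap as NaN.
lagged = dt['Age'].shift(1)
print(lagged.dtype)

# Example 4: a None next to an integer is also stored as NaN
# in a float64 column.
static = pd.Series([27, None])
print(static.dtype)

# pandas itself refuses to cast such a column to a plain NumPy integer.
try:
    lagged.astype('int32')
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```

So the missing value is already represented as a float NaN before the DataFrame ever leaves the UDF.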
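Questions:

How do you handle this?

Is this a bad design?

(I can imagine thousands of cases where I really do want to return NAs and Nones.)

Do we really have to fill in all missing values and then convert them back afterwards? Or use floats where we actually have integers? And so on.
Will this be fixed in the near future? (pandas_udf is still quite new.)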

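This is an interesting question, but I believe it is better suited to the Apache Spark or Arrow developer mailing lists. If the behavior is intentional, it probably deserves a more specific error message and the limitation should be documented. If it is not, it should be filed as a bug. In the end, only the developers can speak about current and future work. Pandas works around all the problems NumPy has with missing values, and its approach is not fully interchangeable with Spark's.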
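
Until that is clarified, here is a minimal sketch of two ways to keep the column integer-typed on the pandas side. The names add_age_lag and SENTINEL are hypothetical, and only the pandas part is shown. Option A is the .fillna(-1) approach from example 2. Option B uses pandas' nullable 'Int64' extension dtype; whether the Arrow conversion inside pandas_udf accepts that dtype depends on the Spark, pandas and pyarrow versions in use, so treat it as an untested assumption.

```python
import pandas as pd

SENTINEL = -1  # hypothetical placeholder for "no previous value"

def add_age_lag(dt):
    """Return a copy of dt with two integer-typed variants of the lag column."""
    out = dt.copy()
    lag = out['Age'].shift(1)  # float64, first row is NaN

    # Option A: replace the gap with a sentinel, then cast to a plain integer.
    out['Age_lag_sentinel'] = lag.fillna(SENTINEL).astype('int32')

    # Option B: pandas' nullable integer dtype keeps the gap as <NA>.
    out['Age_lag_nullable'] = lag.astype('Int64')
    return out

dt = pd.DataFrame({'User': ['Alice', 'Bob'],
                   'Sport': ['Football', 'Basketball'],
                   'Age': [27, 34]})
result = add_age_lag(dt)
print(result.dtypes)
```

With option A the sentinel could be turned back into a null after the UDF on the Spark side, for example with F.when(F.col('Age_lag') == -1, None).otherwise(F.col('Age_lag')), at the cost of reserving -1 as a value that can never occur in the real data.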