Python: How to convert the Pandas dataframes returned by RDD.mapPartitions() into a Spark dataframe?

I have a Python function that returns a Pandas dataframe. I am calling this function from pyspark on Spark 2.2.0, but I cannot convert the RDD returned by mapPartitions() into a Spark dataframe. The following error is raised:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Simple code that illustrates the problem:

import pandas as pd

def func(data):
    pdf = pd.DataFrame(list(data), columns=("A", "B", "C"))
    pdf += 10 # Add 10 to every value. The real function is a lot more complex!
    return [pdf]

pdf = pd.DataFrame([(1.87, 0.6, 7.1), (-0.3, 0.1, 8.2), (2.8, 0.3, 6.1), (-0.2, 0.5, 5.9)], columns=("A", "B", "C"))

sdf = spark.createDataFrame(pdf)
sdf.show()
rddIn = sdf.rdd  # underlying RDD of Row objects

for i in rddIn.collect():
    print(i)

result = rddIn.mapPartitions(func)  # RDD whose elements are pandas DataFrames, one per partition

for i in result.collect():
    print(i)

resDf = spark.createDataFrame(result) # --> ValueError!
resDf.show()
The output is:

+----+---+---+
|   A|  B|  C|
+----+---+---+
|1.87|0.6|7.1|
|-0.3|0.1|8.2|
| 2.8|0.3|6.1|
|-0.2|0.5|5.9|
+----+---+---+
Row(A=1.87, B=0.6, C=7.1)
Row(A=-0.3, B=0.1, C=8.2)
Row(A=2.8, B=0.3, C=6.1)
Row(A=-0.2, B=0.5, C=5.9)
       A     B     C
0  11.87  10.6  17.1
     A     B     C
0  9.7  10.1  18.2
      A     B     C
0  12.8  10.3  16.1
     A     B     C
0  9.8  10.5  15.9

But the second-to-last line produces the ValueError shown above. I really want resDf.show() to look exactly the same as sdf.show(), just with 10 added to every value in the table. Ideally, the result RDD should have the same structure as rddIn, the RDD that goes into mapPartitions().
You have to convert the data to standard Python types and flatten it:

resDf = spark.createDataFrame(
    result.flatMap(lambda df: (r.tolist() for r in df.to_records()))
)

resDf.show()
# +---+------------------+----+----+                                              
# | _1|                _2|  _3|  _4|
# +---+------------------+----+----+
# |  0|11.870000000000001|10.6|17.1|
# |  0|               9.7|10.1|18.2|
# |  0|              12.8|10.3|16.1|
# |  0|               9.8|10.5|15.9|
# +---+------------------+----+----+
If you use Spark 2.3, this should also work:

from pyspark.sql.functions import pandas_udf, spark_partition_id
from pyspark.sql.functions import PandasUDFType

@pandas_udf(sdf.schema, functionType=PandasUDFType.GROUPED_MAP)  
def func(pdf):
    pdf += 10 
    return pdf

sdf.groupBy(spark_partition_id().alias("_pid")).apply(func)
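For completeness, a small sketch of how the grouped-map result might be materialised, reusing sdf and func from the snippet above (the name result23 is only illustrative; the _pid column is just a grouping device and does not appear in the output, since the UDF declares sdf.schema as its return schema):

result23 = sdf.groupBy(spark_partition_id().alias("_pid")).apply(func)  # func is applied to one pandas DataFrame per Spark partition
result23.show()  # columns A, B, C as in sdf, with 10 added to every value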

Thanks - I wouldn't have got this in a million years! If you use df.to_records(index=False) then you don't get the index value as the first column.
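A sketch of how the two ideas could be combined, so that resDf keeps the original column names instead of _1 ... _4 (this reuses result and sdf from the snippets above; passing sdf.schema to createDataFrame assumes the transformed rows still match the original column types):

resDf = spark.createDataFrame(
    result.flatMap(lambda df: (r.tolist() for r in df.to_records(index=False))),  # index=False drops the pandas index column
    schema=sdf.schema  # reuse the original column names A, B, C
)
resDf.show()  # same layout as sdf.show(), with 10 added to every value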