Pandas: How to make the PySpark-to-pandas DataFrame conversion more efficient, with or without PyArrow


pandas apache-spark pyspark


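I have also tried PyArrow. In my example, I obtain a Spark DataFrame from a spark.sql statement and then want to convert it to a pandas DataFrame. To measure the execution time, I ran the following statements: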

import time
startTime = time.time()
df=df.toPandas()
executionTime = (time.time() - startTime)
executionTime

This took 1021.55 seconds.

I also tried:

import time
startTime = time.time()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df=df.toPandas()
executionTime = (time.time() - startTime)
executionTime

This took 1008.71 seconds.

To give a rough idea of the size, the shape of the DataFrame is (944, 5). These are the data types in the Spark DataFrame:

import pandas as pd
pd.set_option('display.max_colwidth', None)  # prevent truncation of columns in Jupyter

def count_column_types(spark_df):
    """Count the number of columns per type."""
    # spark_df.dtypes yields (name, type) pairs: column 0 is the name, column 1 the type
    types = pd.DataFrame(spark_df.dtypes).groupby(1)[0]
    counts = types.agg(count='count', names=lambda x: " | ".join(set(x)))
    return counts.reset_index().rename(columns={1: "type"})

count_column_types(df)

    type           count    names
 0  bigint          1   col4
 1  date            1   col1
 2  decimal(20,4)   1   col5
 3  int             1   col2
 4  string          1   col3

Please let me know if there is any way to improve the efficiency.

The setting spark.sql.execution.arrow.pyspark.enabled matters if you use so-called pandas UDFs, but in your case it has no noticeable effect.

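Your problem is that toPandas has to collect all the data from the executors to the driver node, but before that it has to process your SQL query, and the main bottleneck may well be there (you did not show the query, so it is hard to say). You can try to find out where the bottleneck is: in the execution of the SQL query, or really in toPandas. To do that, try the following: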

df = spark.sql(....)
import time
startTime = time.time()
df.write.format("noop").mode("overwrite").save()
executionTime = (time.time() - startTime)
executionTime

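and compare this execution time with the time you measured for toPandas.

It throws an error at the write.format step: Py4JJavaError: An error occurred while calling o123.save: java.lang.ClassNotFoundException: Failed to find data source: noop. Please find packages at ...

That is because you need Spark 3 for that. Replace that line with df.count(), for example.

The count takes just as long, so I realized that the slow collection is not really caused by pandas. I am therefore making my query more efficient.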
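The comparison discussed in this thread can be wrapped in a small helper. This is only a minimal sketch: the names timed and compare_stages are hypothetical and not part of PySpark, and it assumes df is a Spark DataFrame that has not been cached, so each action re-runs the query. Note also that Spark may be able to skip some work for count(), so the first number is a rough estimate of the query cost rather than an exact one.

```python
import time


def timed(label, action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


def compare_stages(df):
    """Time the query alone, then the query plus the pandas conversion.

    Assumption: `df` is an uncached Spark DataFrame, so both actions
    below trigger a fresh execution of the underlying query.
    """
    # count() forces the query to run without shipping the rows to the driver
    _, query_seconds = timed("query only (count)", df.count)
    # toPandas() runs the query again and collects every row on the driver
    pandas_df, total_seconds = timed("query + toPandas", df.toPandas)
    return pandas_df, query_seconds, total_seconds
```

If the two timings are close, as reported above, most of the time goes into the query itself and tuning the SQL is more promising than tuning the conversion.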