PySpark DataFrame does not return all rows when converted to pandas with toPandas or PyArrow

pandas apache-spark pyspark

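When I try to convert a PySpark DataFrame to a pandas DataFrame with Arrow enabled, only part of the rows are converted. The PySpark DataFrame contains about 170,000 rows.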

>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>> result_pdf = train_set.select("*").toPandas()
>> result_pdf returns only 65000 rows.

I tried installing and updating pyarrow with the following commands:

>> conda install -c conda-forge pyarrow
>> pip install pyarrow
>> pip install pyspark[sql]

Then I ran:

>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>> result_pdf = train_set.select("*").toPandas()

Every time I run the conversion, I get the following warning messages:

C:\Users\MUM1342.conda\envs\snakes\lib\site-packages\pyarrow\__init__.py:152: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use "
C:\Users\MUM1342.conda\envs\snakes\lib\site-packages\pyspark\sql\dataframe.py:2138: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation.

Actual output:

> train_set.count
> 170256
> result_pdf.shape
> 6500

Expected output:

> train_set.count
> 170256
> result_pdf.shape
> 170256
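The comparison between `train_set.count` and `result_pdf.shape` can also be made automatically, so that a partial conversion fails loudly instead of passing with only a warning. The following is a minimal sketch, not part of the original question: the helper name `to_pandas_checked` is hypothetical, and it only assumes an object with `count()` and `toPandas()` methods, as a PySpark DataFrame has.

```python
def to_pandas_checked(spark_df):
    """Convert to pandas and raise if the row count does not match.

    Hypothetical helper: `spark_df` is assumed to expose `count()` and
    `toPandas()`, like a PySpark DataFrame.
    """
    expected = spark_df.count()   # number of rows on the Spark side
    pdf = spark_df.toPandas()     # conversion under test
    actual = len(pdf)             # for a pandas DataFrame, same as pdf.shape[0]
    if actual != expected:
        raise RuntimeError(
            f"toPandas returned {actual} of {expected} rows"
        )
    return pdf


# Usage with the names from the question (requires a live Spark session):
# result_pdf = to_pandas_checked(train_set)
```

Note that `shape` on a pandas DataFrame is a `(rows, columns)` tuple, so the row count is `result_pdf.shape[0]` or `len(result_pdf)`.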

Please try the following and see whether it works.

Enable Arrow-based columnar data transfer:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
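Since the question already sets this option, one way to check whether the Arrow path is what drops rows is to run the same conversion with the setting on and off and compare the row counts. This is only a diagnostic sketch under assumptions: the helper name `row_counts_with_and_without_arrow` is made up for illustration, and it relies only on `spark.conf.get`, `spark.conf.set`, and `toPandas()`.

```python
ARROW_KEY = "spark.sql.execution.arrow.enabled"  # key used in the question


def row_counts_with_and_without_arrow(spark, spark_df):
    """Return (rows_with_arrow, rows_without_arrow) for the same DataFrame.

    Hypothetical diagnostic helper; `spark` is assumed to be a SparkSession
    and `spark_df` a PySpark DataFrame.
    """
    previous = spark.conf.get(ARROW_KEY, "false")  # remember current setting
    try:
        spark.conf.set(ARROW_KEY, "true")
        with_arrow = len(spark_df.toPandas())
        spark.conf.set(ARROW_KEY, "false")
        without_arrow = len(spark_df.toPandas())
    finally:
        spark.conf.set(ARROW_KEY, previous)        # restore original setting
    return with_arrow, without_arrow


# Usage with the names from the question (requires a live Spark session):
# print(row_counts_with_and_without_arrow(spark, train_set))
```

If the two counts differ, the problem is likely in the Arrow conversion (for example a pyarrow version that does not match the installed PySpark) rather than in the data itself. Be aware that the non-Arrow conversion can be much slower on large DataFrames.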
