Python: pyspark and pandas read the same column differently
I have a dataframe, shown below, that pandas reads correctly. The code I am using is very simple:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the same partition directory with both engines
hede = spark.read.parquet(r"C:/users/batuhan.engin/desktop/date=2021-04-01")
he = pd.read_parquet(path=r"C:/users/batuhan.engin/desktop/date=2021-04-01", engine='pyarrow')
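A natural first check (my suggestion, not part of the original post) is whether the two engines even agree on the column's declared type. The path is the asker's, and the snippet assumes pyarrow is installed:

import pyarrow.dataset as ds

path = r"C:/users/batuhan.engin/desktop/date=2021-04-01"

# How pyarrow (and therefore pandas) sees the columns
print(ds.dataset(path, format="parquet").schema)

# How Spark sees the same columns
hede.printSchema()

If the two schemas disagree on sales_revenue (say, double in one and decimal in the other), that would already explain divergent values.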
The problem is that PySpark reads the sales_revenue column incorrectly, and I can't understand why:
he[he.product_id == 2461]:
    sales_type sales_channel  sales_quantity  sales_revenue  tax_amount currency  product_id  store_id
27     Regular     Wholesale             6.0     818.500000         NaN     None        2461       300
110    Regular     Wholesale             2.0     272.829987         NaN     None        2461        42
132    Regular     Wholesale            18.0    2475.540039         NaN     None        2461       314
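For what it's worth, the pandas values look like 32-bit float rounding artifacts (e.g. 272.829987 rather than 272.83), so it may help to confirm what dtype pandas assigned to the column, as a point of comparison with Spark's schema:

# Confirm how pandas typed the column (float32 vs float64 matters here)
print(he["sales_revenue"].dtype)
print(he.dtypes)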
But when I read it with PySpark, the sales_revenue column is wrong. In fact, I can't even imagine how pyspark arrives at the values it shows for sales_revenue:
hede.filter("product_id == 2461").show()
+----------+-------------+--------------+-------------+----------+--------+----------+--------+
|sales_type|sales_channel|sales_quantity|sales_revenue|tax_amount|currency|product_id|store_id|
+----------+-------------+--------------+-------------+----------+--------+----------+--------+
| Regular| Wholesale| 6.0| -186969.84| null| null| 2461| 300|
| Regular| Wholesale| 2.0| -444.8| null| null| 2461| 42|
| Regular| Wholesale| 18.0| 6553.92| null| null| 2461| 314|
+----------+-------------+--------------+-------------+----------+--------+----------+--------+
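Since the two readers disagree this badly, one way to see what is physically stored is to inspect the parquet footer of one data file directly. This is a hedged sketch; the glob pattern and file layout are assumptions about the asker's directory:

import glob
import pyarrow.parquet as pq

# Pick one data file inside the partition directory (pattern is an assumption)
f = glob.glob(r"C:/users/batuhan.engin/desktop/date=2021-04-01/*.parquet")[0]

pf = pq.ParquetFile(f)
print(pf.schema_arrow.field("sales_revenue").type)  # e.g. double vs decimal(18, 2)

# Physical type and min/max statistics from the first row group
idx = pf.schema_arrow.get_field_index("sales_revenue")
col = pf.metadata.row_group(0).column(idx)
print(col.physical_type, col.statistics)

If the footer statistics match the pandas values, the file itself is fine and the discrepancy lies in how Spark decodes it.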
Any ideas? Could it be a matter of versions or packages?

Comments:
- Welcome to Stack Overflow! Could you at least add how you load the data into the pandas and pyspark dataframes? As far as I know, there is no obvious reason for this to happen if both read from the same source.
- Could you share the actual data file (just a sample)?
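Regarding "versions or packages": one experiment worth trying (a guess on my part, not a confirmed diagnosis) is disabling Spark's vectorized parquet reader, which has been implicated in mis-decoded columns in some Spark versions. If the values come out right with it off, the problem is in Spark's fast decode path rather than in the file:

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

# Re-read after changing the config so the new reader is actually used
hede2 = spark.read.parquet(r"C:/users/batuhan.engin/desktop/date=2021-04-01")
hede2.filter("product_id == 2461").show()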