Python: pyspark and pandas read the same column differently
I have a dataframe, shown below, that pandas reads correctly. The code I am using is very simple:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the same partition directory with both engines
hede = spark.read.parquet(r"C:/users/batuhan.engin/desktop/date=2021-04-01")
he = pd.read_parquet(path=r"C:/users/batuhan.engin/desktop/date=2021-04-01", engine='pyarrow')
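A natural first check (my suggestion, not part of the original post) is whether the two engines even agree on the column's declared type. The path is the asker's, and the snippet assumes pyarrow is installed:

import pyarrow.dataset as ds

path = r"C:/users/batuhan.engin/desktop/date=2021-04-01"

# How pyarrow (and therefore pandas) sees the columns
print(ds.dataset(path, format="parquet").schema)

# How Spark sees the same columns
hede.printSchema()

If the two schemas disagree on sales_revenue (say, double in one and decimal in the other), that would already explain divergent values.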
The problem is that PySpark reads the sales_revenue column incorrectly, and I can't understand why:
he[he.product_id == 2461]:
    sales_type sales_channel  sales_quantity  sales_revenue  tax_amount currency  product_id  store_id
27     Regular     Wholesale             6.0     818.500000         NaN     None        2461       300
110    Regular     Wholesale             2.0     272.829987         NaN     None        2461        42
132    Regular     Wholesale            18.0    2475.540039         NaN     None        2461       314
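For what it's worth, the pandas values look like 32-bit float rounding artifacts (e.g. 272.829987 rather than 272.83), so it may help to confirm what dtype pandas assigned to the column, as a point of comparison with Spark's schema:

# Confirm how pandas typed the column (float32 vs float64 matters here)
print(he["sales_revenue"].dtype)
print(he.dtypes)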
But when I read it with PySpark, the sales_revenue column is wrong. In fact, I can't even imagine how pyspark arrives at the values it shows for sales_revenue:
hede.filter("product_id == 2461").show()
+----------+-------------+--------------+-------------+----------+--------+----------+--------+
|sales_type|sales_channel|sales_quantity|sales_revenue|tax_amount|currency|product_id|store_id|
+----------+-------------+--------------+-------------+----------+--------+----------+--------+
| Regular| Wholesale| 6.0| -186969.84| null| null| 2461| 300|
| Regular| Wholesale| 2.0| -444.8| null| null| 2461| 42|
| Regular| Wholesale| 18.0| 6553.92| null| null| 2461| 314|
+----------+-------------+--------------+-------------+----------+--------+----------+--------+
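Since the two readers disagree this badly, one way to see what is physically stored is to inspect the parquet footer of one data file directly. This is a hedged sketch; the glob pattern and file layout are assumptions about the asker's directory:

import glob
import pyarrow.parquet as pq

# Pick one data file inside the partition directory (pattern is an assumption)
f = glob.glob(r"C:/users/batuhan.engin/desktop/date=2021-04-01/*.parquet")[0]

pf = pq.ParquetFile(f)
print(pf.schema_arrow.field("sales_revenue").type)  # e.g. double vs decimal(18, 2)

# Physical type and min/max statistics from the first row group
idx = pf.schema_arrow.get_field_index("sales_revenue")
col = pf.metadata.row_group(0).column(idx)
print(col.physical_type, col.statistics)

If the footer statistics match the pandas values, the file itself is fine and the discrepancy lies in how Spark decodes it.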
Any ideas? Could it be a matter of versions or packages?

Comments:
- Welcome to Stack Overflow! Could you at least add how you load the data into the pandas and pyspark dataframes? As far as I know, there is no obvious reason for this to happen if both read from the same source.
- Could you share the actual data file (just a sample)?
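Regarding "versions or packages": one experiment worth trying (a guess on my part, not a confirmed diagnosis) is disabling Spark's vectorized parquet reader, which has been implicated in mis-decoded columns in some Spark versions. If the values come out right with it off, the problem is in Spark's fast decode path rather than in the file:

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

# Re-read after changing the config so the new reader is actually used
hede2 = spark.read.parquet(r"C:/users/batuhan.engin/desktop/date=2021-04-01")
hede2.filter("product_id == 2461").show()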