Apache Spark PySpark (2.4): sum with groupBy not working
I have a data file like this:
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode| Description|Quantity| InvoiceDate|UnitPrice|CustomerID| Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
| 536365| 85123A|WHITE HANGING HEA...| 6|2010-12-01 08:26:00| 2.55| 17850.0|United Kingdom|
| 536365| 71053| WHITE METAL LANTERN| 6|2010-12-01 08:26:00| 3.39| 17850.0|United Kingdom|
| 536365| 84406B|CREAM CUPID HEART...| 8|2010-12-01 08:26:00| 2.75| 17850.0|United Kingdom|
| 536365| 84029G|KNITTED UNION FLA...| 6|2010-12-01 08:26:00| 3.39| 17850.0|United Kingdom|
| 536365| 84029E|RED WOOLLY HOTTIE...| 6|2010-12-01 08:26:00| 3.39| 17850.0|United Kingdom|
| 536365| 22752|SET 7 BABUSHKA NE...| 2|2010-12-01 08:26:00| 7.65| 17850.0|United Kingdom|
| 536365| 21730|GLASS STAR FROSTE...| 6|2010-12-01 08:26:00| 4.25| 17850.0|United Kingdom|
| 536366| 22633|HAND WARMER UNION...| 6|2010-12-01 08:28:00| 1.85| 17850.0|United Kingdom|
| 536366| 22632|HAND WARMER RED P...| 6|2010-12-01 08:28:00| 1.85| 17850.0|United Kingdom|
| 536367| 84879|ASSORTED COLOUR B...| 32|2010-12-01 08:34:00| 1.69| 13047.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
When I run the following code:
from pyspark.sql.functions import sum as sum_,count
relatil_data.groupBy('InvoiceNo').agg(sum_('UnitPrice'))
it works fine and gives this output:
DataFrame[InvoiceNo: string, sum(UnitPrice): double]
But when I run the code below:
df=relatil_data.groupBy('InvoiceNo').agg(sum_('UnitPrice'))
df.show()
I get the following error:
C:\spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o4839.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 198.0 failed 1 times, most recent failure: Lost task 0.0 in stage 198.0 (TID 214, localhost, executor driver): java.io.FileNotFoundException: C:\Users\pg186028\AppData\Local\Temp\blockmgr-e7aa0c35-ca53-4602-8411-bf816e010a46\17\temp_shuffle_f694f1cf-e72f-41b6-bf65-97ade34afc7c (The system cannot find the path specified)
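The FileNotFoundException points at Spark's scratch directory under AppData\Local\Temp, which on Windows is often blocked by permissions or cleanup tools. A hedged configuration sketch, assuming a short writable path such as C:/tmp (the path is an assumption; any directory the user can write to should work):

```python
from pyspark.sql import SparkSession

# Config-fragment sketch: redirect Spark's scratch space (shuffle and temp
# files) away from the default temp directory. Must be set before the
# SparkSession is created, because getOrCreate() reuses an existing session.
spark = (SparkSession.builder
         .appName("retail")
         .config("spark.local.dir", "C:/tmp")  # assumed writable directory
         .getOrCreate())
```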
The same thing happens when I try to create a view and run SQL on it. I also tried the following code, without importing the functions:
relatil_data.groupBy('InvoiceNo').agg("UnitPrice":"sum")
To change the output column name from sum(UnitPrice), I tried the code below:
relatil_data.groupBy('InvoiceNo').agg("UnitPrice":"sum").withColumnRenamed("sum(UnitPrice)","Total_UnitPrice")
Both fail.

Comments:

- The difference is that df.show() calls an action. You need to fix the root cause of the file-not-found error.
- @ernest_k: I am trying to, but why does this error occur?
- @ernest_k: Could it be the double data type in the output DataFrame that makes it fail?
- Your code fails because Spark is losing shuffle files. That is usually a symptom of some larger problem and is unlikely to be specific to the code in question.
- Looks like a Windows permissions issue; try setting the Spark scratch directory to something like c:\tmp.
- Invalid syntax at the ":" in "UnitPrice":"sum". How did you create the DataFrame? With df = spark.read.format(...)?
- This is what I used: relatil_data = spark.read.option("inferSchema", "true").option("header", "true").option("inferSchema", "true").csv("2010-12-01.csv")
- Use curly braces: relatil_data.groupBy('InvoiceNo').agg({"UnitPrice": "sum"})