Python: adding a new column in a PySpark dataframe as a sum with map
I have a PySpark dataframe as below:
Stock | open_price | list_price
A     | 100        | 1
B     | 200        | 2
C     | 300        | 3
What I am trying to achieve, using map over the RDD, is to print out for each row the stock, the open_price multiplied by the list_price, and the sum of the entire open_price column:
(A, 100 , 600)
(B, 400, 600)
(C, 900, 600)
Using the table above as an example, the first row would be: A, 100*1, 100+200+300.
I was able to get the first two columns using the code below:
stockNames = sqlDF.rdd.map(lambda p: (p.stock,p.open_price*p.open_price) ).collect()
for name in stockNames:
    print(name)
However, when I try to include sum(p.open_price) as below:
stockNames = sqlDF.rdd.map(lambda p: (p.stock,p.open_price*p.open_price,sum(p.open_price)) ).collect()
for name in stockNames:
    print(name)
it gives me the error below:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 75.0 failed 1 times, most recent failure: Lost task 0.0 in stage 75.0 (TID 518, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-48-f08584cc31c6>", line 19, in <lambda>
TypeError: 'int' object is not iterable
How do I add the sum of open_price inside my map over the RDD?
Thank you in advance, as I am still quite new to RDDs and map.

Compute the sum separately:
df = spark.createDataFrame(
    [("A", 100, 1), ("B", 200, 2), ("C", 300, 3)],
    ("stock", "price", "list_price")
)
total = df.selectExpr("sum(price) AS total")
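Here total is a one-row, one-column DataFrame produced by the aggregation; total.first()[0] in the next snippet pulls that single value (600 for the sample data) out as a plain Python number.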
and add it as a column:
from pyspark.sql.functions import lit
df.withColumn("total", lit(total.first()[0])).show()
# +-----+-----+----------+-----+
# |stock|price|list_price|total|
# +-----+-----+----------+-----+
# |    A|  100|         1|  600|
# |    B|  200|         2|  600|
# |    C|  300|         3|  600|
# +-----+-----+----------+-----+
Or with a crossJoin:
df.crossJoin(total).show()
# +-----+-----+----------+-----+
# |stock|price|list_price|total|
# +-----+-----+----------+-----+
# |    A|  100|         1|  600|
# |    B|  200|         2|  600|
# |    C|  300|         3|  600|
# +-----+-----+----------+-----+
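The question also asked for the per-row product open_price * list_price. That part is not covered by the original answer, but a small sketch built on the same pattern (reusing the df and total defined above, where the price column is simply named price) could look like this:

from pyspark.sql.functions import col, lit

# Sketch, not from the original answer: add the per-row product from the question
# together with the precomputed total.
df.withColumn("product", col("price") * col("list_price")) \
  .withColumn("total", lit(total.first()[0])) \
  .show()
# +-----+-----+----------+-------+-----+
# |stock|price|list_price|product|total|
# +-----+-----+----------+-------+-----+
# |    A|  100|         1|    100|  600|
# |    B|  200|         2|    400|  600|
# |    C|  300|         3|    900|  600|
# +-----+-----+----------+-------+-----+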
RDD.map is not really applicable here (you could use it in place of withColumn, but it would be inefficient and I would not recommend it; a rough sketch of that route is shown after the comments below).

What error does this give you? What are you trying to sum?

I want the sum of the open_price column, i.e. 300 + 200 + 100.
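For completeness, here is a minimal sketch of what that RDD route could look like, reusing the df and column names from the answer above (this is not part of the original answer). The reason the original attempt fails is that Python's built-in sum() expects an iterable, while p.open_price inside the lambda is a single int per row, hence TypeError: 'int' object is not iterable. The column total therefore has to be computed once up front and only then referenced inside map:

# Compute the column total once (600 for the sample data) ...
total_price = df.rdd.map(lambda p: p.price).sum()

# ... then close over it in the per-row map instead of calling sum() on a single int.
rows = df.rdd.map(lambda p: (p.stock, p.price * p.list_price, total_price)).collect()
for row in rows:
    print(row)
# ('A', 100, 600)
# ('B', 400, 600)
# ('C', 900, 600)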