Python: adding a new column as a sum with map in a PySpark dataframe
Tags: python, apache-spark, pyspark, apache-spark-sql


I have a PySpark dataframe like below:

Stock | open_price | list_price
A     | 100        | 1
B     | 200        | 2
C     | 300        | 3
I'm trying to achieve the result below via map over the RDD, printing for each row the stock, open_price * open_price, and the sum of the whole open_price column:

(A, 100 , 600)
(B, 400, 600)
(C, 900, 600)
Using the table above as an example, the first row would be: A, 100*1, 100+200+300.

I was able to get the first two columns using the code below:

stockNames = sqlDF.rdd.map(lambda p: (p.stock,p.open_price*p.open_price) ).collect()
for name in stockNames:
    print(name)
However, when I try to take the sum, sum(p.open_price), like below:

stockNames = sqlDF.rdd.map(lambda p: (p.stock,p.open_price*p.open_price,sum(p.open_price)) ).collect()
for name in stockNames:
    print(name)
it gives me the error below:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 75.0 failed 1 times, most recent failure: Lost task 0.0 in stage 75.0 (TID 518, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
  File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
  File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-48-f08584cc31c6>", line 19, in <lambda>
TypeError: 'int' object is not iterable
How can I add the sum of open_price inside my map over the RDD?


Thank you in advance; I'm still quite new to RDDs and map.

Compute the sum separately:

df = spark.createDataFrame(
    [("A", 100, 1), ("B", 200, 2), ("C", 300, 3)],
    ("stock", "price", "list_price")
)

total = df.selectExpr("sum(price) AS total")
and add it as a column:

from pyspark.sql.functions import lit

df.withColumn("total", lit(total.first()[0])).show()

# +-----+-----+----------+-----+
# |stock|price|list_price|total|
# +-----+-----+----------+-----+
# |    A|  100|         1|  600|
# |    B|  200|         2|  600|
# |    C|  300|         3|  600|
# +-----+-----+----------+-----+
or use a crossJoin:

df.crossJoin(total).show()

# +-----+-----+----------+-----+
# |stock|price|list_price|total|
# +-----+-----+----------+-----+
# |    A|  100|         1|  600|
# |    B|  200|         2|  600|
# |    C|  300|         3|  600|
# +-----+-----+----------+-----+

RDD.map is not really applicable here (you could use it in place of withColumn, but it is inefficient and I wouldn't recommend it).
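
For completeness, a minimal sketch of that RDD route, assuming the question's DataFrame is still named sqlDF (total_open is a hypothetical name introduced here). Python's built-in sum() expects an iterable, but inside the lambda p.open_price is a single int per row, which is what raises the TypeError; compute the column total once, outside the per-row lambda, and reference it instead:

# Sketch only, not part of the original answer.
# Total of the open_price column, computed once: 100 + 200 + 300 = 600.
total_open = sqlDF.rdd.map(lambda p: p.open_price).sum()

# Each row now carries the precomputed column total.
stockNames = sqlDF.rdd.map(
    lambda p: (p.stock, p.open_price * p.open_price, total_open)
).collect()

for name in stockNames:
    print(name)

This still runs a separate job for the total and collects everything to the driver, which is why the DataFrame versions above are preferable.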

What error does it give you, and what exactly are you trying to sum?
I'm trying to get the sum of the open_price column: 300 + 200 + 100.