Python: how to update a PySpark DataFrame using foreach
I have a PySpark DataFrame and I want to process each row, then update/delete/insert rows based on some logic. I tried using "foreach" and "foreachPartition", but I don't understand how they would return the modified data so that the actual DataFrame gets updated.
data = [
    {
        "city": "s",
        "latitude": "51",
        "longitude": "5",
        "region": "Europe",
        "date_range": "date_last_year",
    },
    {
        "city": "s",
        "latitude": "5",
        "longitude": "5.67",
        "region": "Europe",
        "date_range": "date_all_time",
    },
    {
        "city": "Aalborg",
        "latitude": "57.03",
        "longitude": "9.007",
        "region": "Europe",
        "date_range": "date_last_year",
    },
    {
        "city": "Aalborg",
        "latitude": "57.033",
        "longitude": "9.0007",
        "region": "Europe",
        "date_range": "date_last_year",
    },
    {
        "city": "Aalborg",
        "latitude": "57.0",
        "longitude": "9.97",
        "region": "Europe",
        "date_range": "date_last_year",
    },
    {
        "city": "Aarau",
        "latitude": "47.32",
        "longitude": "8.05",
        "region": "Europe",
        "date_range": "date_last_year",
    },
]
from pyspark import SparkContext
from pyspark.sql import SQLContext, functions as sf

sc = SparkContext()
sqlContext = SQLContext(sc)
df = sc.parallelize(data).toDF()

def myfunction(row):
    if float(row.latitude) > 50:
        print('do_something')
        # need to access "df" to do some operations

df.foreach(myfunction)
df.show()
df.show()
# output
do_something
do_something
do_something
do_something
+-------+--------------+--------+---------+------+
| city| date_range|latitude|longitude|region|
+-------+--------------+--------+---------+------+
| s|date_last_year| 51| 5|Europe|
| s| date_all_time| 5| 5.67|Europe|
|Aalborg|date_last_year| 57.03| 9.007|Europe|
|Aalborg|date_last_year| 57.033| 9.0007|Europe|
|Aalborg|date_last_year| 57.0| 9.97|Europe|
| Aarau|date_last_year| 47.32| 8.05|Europe|
+-------+--------------+--------+---------+------+
I want to pass "df" into the foreach function, or have each foreach call return something so I can aggregate the results. How can this be done?

If you want to aggregate data, the function you are looking for is groupBy. If you give a concrete example of what you are trying to do, people can suggest an approach. foreach is not the best way to do this.