Python: I have two PySpark dataframes and want to compute the sum of a points column in the second dataframe based on column values in the first dataframe


This is my first dataframe, which contains the player points:

   Playername           pid  matches  points
0  Virat Kohli           10        2       0
1  Ravichandran Ashwin   11        2       9
2  Gautam Gambhir        12        2       1
3  Ravindra Jadeja       13        2       7
4  Amit Mishra           14        2       2
5  Mohammed Shami        15        2       2
6  Karun Nair            16        2       4
7  Hardik Pandya         17        2       0
8  Cheteshwar Pujara     18        2       9
9  Ajinkya Rahane        19        2       5
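
For anyone who wants to reproduce this, roughly the following rebuilds the same first dataframe (column names taken from the table above; in my actual code it is read from player_points.csv, as shown further down):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Recreate the first dataframe shown above (normally read from player_points.csv)
df1 = spark.createDataFrame(
    [('Virat Kohli', 10, 2, 0), ('Ravichandran Ashwin', 11, 2, 9),
     ('Gautam Gambhir', 12, 2, 1), ('Ravindra Jadeja', 13, 2, 7),
     ('Amit Mishra', 14, 2, 2), ('Mohammed Shami', 15, 2, 2),
     ('Karun Nair', 16, 2, 4), ('Hardik Pandya', 17, 2, 0),
     ('Cheteshwar Pujara', 18, 2, 9), ('Ajinkya Rahane', 19, 2, 5)],
    ['Playername', 'pid', 'matches', 'points'])
df1.show()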


Code

This avoids the nested loop, and we can also work on a copy so that the original dataset is not modified.

Note: it uses the same approach of replacing the player names with their points and then doing a row-wise sum.
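
A minimal, self-contained sketch of that replace-then-sum idea, using a made-up second dataframe with just two player columns and the df1 built above (the real input has many more columns, see below):

from functools import reduce
from operator import add

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

# Toy line-up dataframe standing in for the real second dataframe
toy_df2 = spark.createDataFrame(
    [('team A', 'Virat Kohli', 'Ravichandran Ashwin'),
     ('team B', 'Hardik Pandya', 'Ajinkya Rahane')],
    ['team', 'p1', 'p2'])

# Name -> points lookup built from df1; values as strings so that
# na.replace keeps the (string) column type
lookup = {row['Playername']: str(row['points']) for row in df1.collect()}

# Replace each player name with that player's points, cast back to int,
# then sum the player columns row-wise
toy = toy_df2.na.replace(lookup, subset=['p1', 'p2'])
for c in ['p1', 'p2']:
    toy = toy.withColumn(c, col(c).cast(IntegerType()))
toy = toy.withColumn('points', reduce(add, [col(c) for c in ['p1', 'p2']]))
toy.show()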

This is how I did it:

from functools import reduce
from operator import add

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

sc = SparkContext('local[*]')
spark = SparkSession(sparkContext=sc)

# df2: the match line-ups, df1: the player points shown above
df2 = spark.read.options(inferSchema='True', delimiter=',', header='True').csv("D:\\bop\\small_input_spark.csv")
df1 = spark.read.options(inferSchema='True', delimiter=',', header='True').csv("D:\\bop\\player_points.csv")

player_name = df1.select('Playername').collect()   # not actually used below
points = df1.select('points').collect()            # not actually used below

# Lookup of player name -> points from the first dataframe
dictn = {row['Playername']: row['points'] for row in df1.collect()}
print(dictn)

# An earlier UDF attempt that I abandoned:
# user_func = udf(lambda x: dictn.get(x), IntegerType())
# newdf = df2.withColumn('p1', 'p2', user_func(df2.p1, df2.p2))

# na.replace only swaps values of the same type as the column,
# so turn the points into strings before replacing the player names
dictn = {k: str(v) for k, v in dictn.items()}

player_cols = ["captain", "v-captain", "MoM",
               "p1", "p2", "p3", "p4", "p5", "p6",
               "p7", "p8", "p9", "p10", "p11"]

# Replace every player name in these columns with that player's points
df3 = df2.na.replace(dictn, subset=player_cols)

# The replaced values are still strings, so cast them back to integers
for c in player_cols:
    df3 = df3.withColumn(c, df3[c].cast(IntegerType()))

# Everything after the first four columns is numeric and gets summed
numeric_col_list = df3.schema.names[4:]

# Weighting: vice-captain counts half, man of the match counts double
df3 = df3.withColumn('v-captain', col('v-captain') / 2)
df3 = df3.withColumn('MoM', col('MoM') * 2)

# Row-wise sum of all the numeric columns
df3 = df3.withColumn('points', reduce(add, [col(x) for x in numeric_col_list]))
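
For comparison, a join-based sketch that avoids the replace/cast round-trip: unpivot the player columns, join against df1, and aggregate. This is not what I ran; it assumes the same p1–p11 columns and ignores the captain / v-captain / MoM weighting for brevity:

from pyspark.sql import functions as F

# Unpivot the eleven player columns into (slot, Playername) pairs, keeping a row id
player_cols = ['p' + str(i) for i in range(1, 12)]
stack_expr = "stack({}, {}) as (slot, Playername)".format(
    len(player_cols),
    ", ".join("'{0}', `{0}`".format(c) for c in player_cols))

long_df = (df2.withColumn('row_id', F.monotonically_increasing_id())
              .select('row_id', F.expr(stack_expr)))

# Join each player slot to the points table and sum per original row
totals = (long_df
          .join(df1.select('Playername', 'points'), on='Playername', how='left')
          .groupBy('row_id')
          .agg(F.sum('points').alias('points')))
totals.show()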
