Python PySpark: How to apply pandas_udf?
I am trying to apply a pandas_udf in PySpark.
I have a PySpark DataFrame that looks like this:
+-------------------+------------------+--------+-------+
|                lat|               lon|duration|stop_id|
+-------------------+------------------+--------+-------+
|  -6.23748779296875| 106.6937255859375|     247|      0|
|  -6.23748779296875| 106.6937255859375|    2206|      1|
|  -6.23748779296875| 106.6937255859375|     609|      2|
| 0.5733972787857056|101.45503234863281|   16879|      3|
| 0.5733972787857056|101.45503234863281|    4680|      4|
| -6.851855278015137|108.64261627197266|     164|      5|
| -6.851855278015137|108.64261627197266|     220|      6|
| -6.851855278015137|108.64261627197266|    1669|      7|
|-0.9033176600933075|100.41548919677734|   30811|      8|
|-0.9033176600933075|100.41548919677734|   23404|      9|
+-------------------+------------------+--------+-------+
I am trying a simple function that creates a column test, which is 1 if duration is greater than 1000 and 0 otherwise:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField('test', IntegerType(), True),
    StructField('stop_id', IntegerType(), True)
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def probTime(stop_df):
    stopid = stop_df['stop_id'].values[0]
    val = stop_df['duration'].values[0]
    test = 0
    if val > 1000:
        test = 1
    df = pd.DataFrame()
    df['prob_time'] = test
    df['stop_id'] = stopid
    return df
But I get an empty table:
sp = stop_df.groupBy("stop_id").apply(probTime)
sp.show(5)
+----+-------+
|test|stop_id|
+----+-------+
+----+-------+
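Instead of writing a function, you can do this directly in Spark with the "when" function.
1) Import the when function:
from pyspark.sql.functions import when
2) Use it to create a new column in the existing DataFrame:
stop_df = stop_df.withColumn('test', when(stop_df['duration'] > 1000, 1).otherwise(0))
The test column of the stop_df DataFrame will then hold the required values.

The problem occurs when you assign values to the new df inside the grouped function: you need to assign the values as lists. Take the following example: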
df = pd.DataFrame()
test = 1
stopid = 1
df['prob_time'] = test
df['stop_id'] = stopid
print(df)
This produces:
Empty DataFrame
Columns: [prob_time, stop_id]
Index: []
Compare this with:
df = pd.DataFrame()
test = 1
stopid = 1
df['prob_time'] = [test]
df['stop_id'] = [stopid]
print(df)
which produces:
   prob_time  stop_id
0          1        1
So you should change your code to the latter form.

I need to understand the function, because I will need to write more complex functions.
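For reference, here is a minimal sketch of how the grouped function from the question might look once the values are assigned as lists. It is written as a plain pandas function so it can be tried without a Spark session. The name prob_time and the sample group are made up for illustration. The output column is named test rather than prob_time on the assumption that it should match the schema in the question, since recent Spark versions match string column labels of the returned DataFrame against the schema field names.

```python
import pandas as pd

def prob_time(stop_df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per group: test is 1 if the first duration exceeds 1000."""
    stop_id = int(stop_df["stop_id"].iloc[0])
    duration = stop_df["duration"].iloc[0]
    test = 1 if duration > 1000 else 0
    # Wrapping the scalars in lists gives the new DataFrame one row;
    # assigning bare scalars to an empty DataFrame leaves it empty.
    return pd.DataFrame({"test": [test], "stop_id": [stop_id]})

# Hypothetical single group, shaped like the rows for stop_id 1 in the question.
group = pd.DataFrame({
    "lat": [-6.23748779296875],
    "lon": [106.6937255859375],
    "duration": [2206],
    "stop_id": [1],
})
result = prob_time(group)
print(result)
```

On Spark 3.x, a function like this can probably be passed directly as stop_df.groupBy("stop_id").applyInPandas(prob_time, schema="test int, stop_id int"). On older versions it can be decorated with @pandas_udf(schema, PandasUDFType.GROUPED_MAP) as in the question.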