Apache Spark pandas UDF - is it possible to return a boolean?
Tags: apache-spark, pyspark, pyspark-sql, pyspark-dataframes

I am trying to write a UDF that checks whether any element of an array of strings starts with a particular value. The result I want looks like this:
df.filter(list_contains(val, df.stringArray_column)).show()
The function list_contains should return True on every row where any element of df.stringArray starts with val.
An example:
df = spark.read.csv(path)
display(df.filter(list_contains('50', df.stringArray_column)))
The code above would display every row of df where some element of the stringArray column starts with 50.
I wrote a function in Python, but it is very slow:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def list_contains(val):
    # Returns a UDF that checks whether any element starts with val
    def list_contains_udf(column_list):
        for element in column_list:
            if element.startswith(val):
                return True
        return False
    return udf(list_contains_udf, BooleanType())
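Note that, as written, list_contains(val) returns a UDF, so the call in the filter should be curried: list_contains('50')(df.stringArray_column), not list_contains('50', df.stringArray_column). The closure pattern can be illustrated with plain Python lists (no Spark needed):

```python
# The same curried pattern without Spark: list_contains(val) returns a
# function, which is then applied to one row's array of strings.
def list_contains(val):
    def list_contains_udf(column_list):
        return any(element.startswith(val) for element in column_list)
    return list_contains_udf  # in Spark, wrap with udf(..., BooleanType())

starts_with_408 = list_contains('408')
print(starts_with_408(["408000", "641100"]))  # True
print(starts_with_408(["633400", "641100"]))  # False
```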
Thanks for your help!
Edit: here is some sample data, along with an example of the output I am looking for:
df.LIST: ["408000","641100"]
["633400","641100"]
["633400","791100"]
["633400","408100"]
["633400","641100"]
["408110","641230"]
["633400","647200"]
display(df.select('LIST').filter(list_contains('408')(df.LIST)))
output: LIST
["408000","641100"]
["633400","408100"]
["408110","641230"]
Updated answer: if we assume the arrays all have the same length, this can be done without a UDF. Let's try the approach below.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('prefix_finder').getOrCreate()

# sample data creation
my_df = spark.createDataFrame(
    [('scooby', ['cartoon', 'kidfriendly']),
     ('batman', ['dark', 'cars']),
     ('meshuggah', ['heavy', 'dark']),
     ('guthrie', ['god', 'guitar'])],
    schema=('character', 'tags'))
The dataframe my_df looks like this:
+---------+----------------------+
|character|tags |
+---------+----------------------+
|scooby |[cartoon, kidfriendly]|
|batman |[dark, cars] |
|meshuggah|[heavy, dark] |
|guthrie |[god, guitar] |
+---------+----------------------+
If we search for the prefix car, only the first and second rows should be returned, since car is a prefix of cartoon and of cars.
Here are the native Spark operations to achieve this:
num_items_in_arr = 2  # this was the assumption
prefix = 'car'
prefix_len = len(prefix)  # used below when truncating the tags

my_df2 = my_df.select(
    col('character'), col('tags'),
    *(col('tags').getItem(i).alias(f'tag{i}') for i in range(num_items_in_arr)))
The dataframe my_df2 looks like:
+---------+----------------------+-------+-----------+
|character|tags |tag0 |tag1 |
+---------+----------------------+-------+-----------+
|scooby |[cartoon, kidfriendly]|cartoon|kidfriendly|
|batman |[dark, cars] |dark |cars |
|meshuggah|[heavy, dark]         |heavy  |dark       |
|guthrie |[god, guitar] |god |guitar |
+---------+----------------------+-------+-----------+
Let's create a column concat_tags on my_df2, which we will use for regex matching:
cols_of_interest = [f'tag{i}' for i in range(num_items_in_arr)]

for idx, col_name in enumerate(cols_of_interest):
    # truncate each tag to the first prefix_len characters
    my_df2 = my_df2.withColumn(col_name, f.substring(col_name, 1, prefix_len))
    if idx == 0:
        my_df2 = my_df2.withColumn(col_name, f.concat(f.lit("("), col_name, f.lit(".*")))
    elif idx == len(cols_of_interest) - 1:
        my_df2 = my_df2.withColumn(col_name, f.concat(col_name, f.lit(".*)")))
    else:
        my_df2 = my_df2.withColumn(col_name, f.concat(col_name, f.lit(".*")))

my_df3 = my_df2.withColumn('concat_tags', f.concat_ws('|', *cols_of_interest)).drop(*cols_of_interest)
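To see what the loop builds, here is the same transformation for a single row in plain Python: truncate each tag to prefix_len characters, append .*, and join with | inside parentheses.

```python
# One row's worth of the concat_tags construction, without Spark:
tags = ["cartoon", "kidfriendly"]
prefix_len = 3  # len('car')
parts = [t[:prefix_len] + ".*" for t in tags]
concat_tags = "(" + "|".join(parts) + ")"
print(concat_tags)  # (car.*|kid.*)
```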
The dataframe my_df3 looks like:
+---------+----------------------+-------------+
|character|tags |concat_tags |
+---------+----------------------+-------------+
|scooby |[cartoon, kidfriendly]|(car.*|kid.*)|
|batman |[dark, cars] |(dar.*|car.*)|
|meshuggah|[heavy, dark]         |(hea.*|dar.*)|
|guthrie |[god, guitar] |(god.*|gui.*)|
+---------+----------------------+-------------+
Now we need to apply the regex match to the column concat_tags.
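The code for this step did not survive the scrape. A minimal sketch of one way to do it follows (my_df4 and the regexp_extract-via-expr call are assumptions, not the original answer's exact code). Note the roles are deliberately flipped: the prefix is the string being matched, and each row's concat_tags value is the pattern.

```python
import re

# Spark sketch (assumes my_df3, f and prefix from above):
#   my_df4 = my_df3.withColumn(
#       'matched', f.expr(f"regexp_extract('{prefix}', concat_tags, 1)"))
# regexp_extract returns the capture group on a match and '' otherwise.
# The same matching semantics, demonstrated with plain re:
def extract(prefix, pattern):
    m = re.match(pattern, prefix)
    return m.group(1) if m else ""

print(extract("car", "(car.*|kid.*)"))  # 'car'  -> row will be kept
print(extract("car", "(hea.*|dar.*)"))  # ''     -> row will be dropped
```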
The result looks like this:
+---------+----------------------+-------------+-------+
|character|tags |concat_tags |matched|
+---------+----------------------+-------------+-------+
|scooby |[cartoon, kidfriendly]|(car.*|kid.*)|car |
|batman |[dark, cars] |(dar.*|car.*)|car |
|meshuggah|[heavy, dark]         |(hea.*|dar.*)|       |
|guthrie |[god, guitar] |(god.*|gui.*)| |
+---------+----------------------+-------------+-------+
A bit of cleanup:
my_df5 = my_df4.filter(my_df4.matched != "").drop('concat_tags', 'matched')
And now we have the final dataframe:
+---------+----------------------+
|character|tags |
+---------+----------------------+
|scooby |[cartoon, kidfriendly]|
|batman |[dark, cars] |
+---------+----------------------+
Comments:
Could you post sample data and the expected result?
For Spark 2.4+, use the Spark SQL built-in function exists:
df.filter("exists(stringArray_column, x -> left(x, 3) = '408')").show()
link: @jxc damn, this is a very concise one-liner solution. Thank you!
@jxc where can I find more complex examples of exists? In fact, what I really need is "starts with" functionality in the function.
@Epicol assuming all the arrays have the same length, this works. Please check my updated answer.
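For the "starts with" need raised in the last comments, the exists predicate is not limited to a fixed left(x, 3) comparison; a like pattern gives true prefix matching regardless of length. A sketch, with the Spark SQL form in a comment (column name assumed) and the predicate's semantics in plain Python:

```python
# Spark SQL form (sketch; column name assumed):
#   df.filter("exists(stringArray_column, x -> x like '408%')").show()
# The predicate itself, in plain Python:
def exists_startswith(arr, prefix):
    return any(x.startswith(prefix) for x in arr)

print(exists_startswith(["408110", "641230"], "408"))  # True
print(exists_startswith(["633400", "647200"], "408"))  # False
```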