
Python: Pyspark filter using startswith from a list


I have a list of elements that may start the strings recorded in an RDD. If my element list contains yes and no, it should match yes23 and no3, but not 35yes or 41no. Using pyspark, how can I test against any element of a list or tuple?

An example would be:

+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    2| 45yes|
|    3| no123|
|    4|  75no|
+-----+------+
When I try:

Element_List = ['yes','no']
filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))
the resulting df should look like:

+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    3| no123|
+-----+------+
Instead, I get the error:

Py4JError: An error occurred while calling o250.startsWith. Trace:
py4j.Py4JException: Method startsWith([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

so it looks like startsWith cannot be used with a list of any kind. Is there a simple workaround?
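As an aside (not part of the original question): the tuple(Element_List) idea likely comes from Python's built-in str.startswith, which does accept a tuple of prefixes. PySpark's Column.startswith, by contrast, takes a single string (or Column), which is why the tuple is rejected on the JVM side with the Py4JException above. A quick plain-Python check of the built-in behavior:

```python
element_list = ['yes', 'no']

# The built-in str.startswith happily accepts a tuple of prefixes...
print("yes23".startswith(tuple(element_list)))   # True
print("35yes".startswith(tuple(element_list)))   # False

# ...but pyspark's Column.startswith only accepts a single prefix,
# so DF.label.startswith(tuple(Element_List)) fails as shown above.
```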

Write the expression like this:

from pyspark.sql.functions import col, lit
from functools import reduce

element_list = ['yes','no']

df = spark.createDataFrame(
    ["yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

starts_with = reduce(
    lambda x, y: x | y,
    [col("location").startswith(s) for s in element_list], 
    lit(False))

df.where(starts_with).show()
# +--------+
# |location|
# +--------+
# |   yes23|
# |     no3|
# +--------+

Note: this construction also supports negative filtering, i.e. df.where(~starts_with) will select the rows that do not start with any element of element_list.
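The reduce pattern above can be illustrated with plain Python booleans (a sketch using the sample labels from the question, not Spark code): starting from False, each prefix contributes one startswith test OR-ed into the accumulator, exactly as the Column version ORs one startswith expression per element onto lit(False).

```python
from functools import reduce

element_list = ['yes', 'no']
labels = ['yes342', '45yes', 'no123', '75no']

def starts_with_any(label):
    # Mirror the Column expression: OR together one startswith test
    # per prefix, seeded with False (the analogue of lit(False)).
    return reduce(lambda x, y: x | y,
                  [label.startswith(s) for s in element_list],
                  False)

matches = [l for l in labels if starts_with_any(l)]
print(matches)  # ['yes342', 'no123']

# Negative filtering, mirroring df.where(~starts_with):
non_matches = [l for l in labels if not starts_with_any(l)]
print(non_matches)  # ['45yes', '75no']
```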