Check whether values in the first dataframe start with any of the values in the second dataframe

I have two pyspark dataframes, as follows:
df1 = spark.createDataFrame(
    ["yes", "no", "yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")
df2 = spark.createDataFrame(
    ["yes", "no"],
    "string"
).toDF("location")
I want to check whether each value in the location column of df1 starts with any of the values in the location column of df2 (and vice versa), something like:
df1.select("location").startsWith(df2.location)
Here is the output I expect:
+-------------+
| location|
+-------------+
| yes|
| no|
| yes23|
| no3|
+-------------+
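The intended filter, keep a row of df1 when its location starts with at least one value from df2, can be sketched in plain Python (hypothetical variable names, just to pin down the semantics):

```python
# Keep each string from df1 that starts with at least one prefix from df2.
df1_values = ["yes", "no", "yes23", "no3", "35yes", '41no["maybe"]']
df2_values = ["yes", "no"]

kept = [s for s in df1_values if any(s.startswith(p) for p in df2_values)]
print(kept)  # ['yes', 'no', 'yes23', 'no3']
```

Note that "35yes" and '41no["maybe"]' are dropped: they contain a df2 value but do not start with one.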
In my opinion, this is easiest with Spark SQL:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')

joined = spark.sql("""
    select df1.*
    from df1
    join df2
    on df1.location rlike '^' || df2.location
""")