Python: Remove words from a PySpark DataFrame based on words in another PySpark DataFrame


I want to remove the words that appear in a secondary DataFrame from the text in a primary DataFrame.

Here is the primary DataFrame:

+----------+--------------------+
|  event_dt|           cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need  a line hold |
|2020-09-02|i have the  60 packs|
|2020-09-02|hello want you teach|
+----------+--------------------+

The secondary DataFrame has a single column. The words in the secondary DataFrame need to be removed from the cust_text column of the primary DataFrame wherever they appear. For example, 'want' would be removed from every row in which it appears (here, from rows 1 and 4).

The event_dt column and every row stay as they are; in the resulting DataFrame only the secondary DataFrame's words have been removed from the primary DataFrame, as shown below:

+----------+--------------------+
|  event_dt|           cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to        |
|2020-09-02|i line hold         |
|2020-09-02|i the 60 packs      |
|2020-09-02|you teach           |
+----------+--------------------+

Thanks for any help.

This should be a working solution for you - use array_except() to eliminate the unwanted strings, but in order to do that we need a little bit of preparation.
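For reference, here is a minimal sketch of what array_except() does on two array columns (assuming an active SparkSession named spark); note that the result keeps only distinct elements, which becomes relevant in one of the comments further down.

from pyspark.sql import functions as F

# array_except keeps the elements of the first array that are absent from the second,
# and drops duplicates in the process
demo = spark.createDataFrame([(["a", "b", "b", "c"], ["b"])], ["arr", "remove"])
demo.select(F.array_except("arr", "remove").alias("res")).show()
# +------+
# |   res|
# +------+
# |[a, c]|
# +------+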

The steps: create the DataFrames, make the text column an array for future use, then group the lookup DataFrame and collect all the lookup values into a variable, and finally apply the trick with array_except(). The full code and its output follow after the comments below.
Comments:
Can you check and let me know whether this solution works for you?
Great!! It worked. I added one thing, because the secondary DataFrame's word list has more than 7,000 words: from pyspark.sql.functions import lit; df_lookup = df.withColumn('col1', lit('1')); df.show(5)
array_except() removes duplicates from the source. I don't want that.
@dsk, array_except() removes duplicates from the source. I don't want that. Any idea how to solve it? (A possible workaround is sketched after the final output below.)
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Create the primary DataFrame
df = spark.createDataFrame([("2020-09-02", "hi fine i want to go"), ("2020-09-02", "i need  a line hold"), ("2020-09-02", "i have the  60 packs"), ("2020-09-02", "hello want you teach")], ["col1", "col2"])

# Make the text column an array of words for future use
df = df.withColumn("col2", F.split("col2", " "))
df.show(truncate=False)

# Create the single-column lookup DataFrame (col1 is just a constant grouping key)
df_lookup = spark.createDataFrame([(1, "want"), (1, "because"), (1, "need"), (1, "hello"), (1, "a"), (1, "give"), (1, "go")], ["col1", "col2"])
df_lookup.show()
+----------+---------------------------+
|col1      |col2                       |
+----------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|
|2020-09-02|[i, need, , a, line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach]  |
+----------+---------------------------+

+----+-------+
|col1|   col2|
+----+-------+
|   1|   want|
|   1|because|
|   1|   need|
|   1|  hello|
|   1|      a|
|   1|   give|
|   1|     go|
+----+-------+
# Group the lookup DataFrame and collect all the lookup words into a Python list
df_lookup_var = df_lookup.groupBy("col1").agg(F.collect_set("col2").alias("col2")).collect()[0][1]
print(df_lookup_var)
x = ",".join(df_lookup_var)
print(x)

# Attach the lookup words to every row as a literal array column
df = df.withColumn("filter_col", F.lit(x))
df = df.withColumn("filter_col", F.split("filter_col", ","))
df.show(truncate=False)

# And here is the trick: array_except() keeps only the words of col2 that are not in filter_col
df = df.withColumn("ArrayColumn", F.array_except("col2", "filter_col"))
df.show(truncate=False)
+----------+---------------------------+-----------------------------------------+---------------------------+
|col1      |col2                       |filter_col                               |ArrayColumn                |
+----------+---------------------------+-----------------------------------------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|[need, want, a, because, hello, give, go]|[hi, fine, i, to]          |
|2020-09-02|[i, need, , a, line, hold] |[need, want, a, because, hello, give, go]|[i, , line, hold]          |
|2020-09-02|[i, have, the, , 60, packs]|[need, want, a, because, hello, give, go]|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach]  |[need, want, a, because, hello, give, go]|[you, teach]               |
+----------+---------------------------+-----------------------------------------+---------------------------+
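Regarding the duplicates question in the comments: array_except() returns only distinct elements, so repeated words in cust_text would be collapsed. A possible workaround, sketched here as an assumption rather than part of the original answer, is to filter the word array with the filter higher-order function (available since Spark 2.4); this keeps duplicates and the original word order, and reuses the df, col2 and filter_col columns built above.

# Alternative to array_except(): drop only the lookup words, keeping duplicates and order
df = df.withColumn(
    "ArrayColumn",
    F.expr("filter(col2, x -> NOT array_contains(filter_col, x))")
)
# Optionally rebuild the cleaned text from the remaining words
df = df.withColumn("cust_text_clean", F.array_join("ArrayColumn", " "))
df.show(truncate=False)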