Python: remove words from one PySpark dataframe based on words in another PySpark dataframe
I want to remove the words of a secondary dataframe from a primary dataframe. This is the primary dataframe:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need a line hold |
|2020-09-02|i have the 60 packs|
|2020-09-02|hello want you teach|
+----------+--------------------+
Below is the single-column secondary dataframe. Wherever a word from the secondary dataframe appears in the cust_text column of the primary dataframe, it needs to be removed. For example, 'want' is removed from every row it appears in (here, rows 1 and 4). The event_dt column and the rows themselves stay as-is; only the secondary-dataframe words are removed from the primary dataframe, so the result looks like this:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to |
|2020-09-02|i line hold |
|2020-09-02|i the 60 packs |
|2020-09-02|you teach |
+----------+--------------------+
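The requested transformation can be sketched row-by-row in plain Python before bringing in Spark (the helper name is illustrative, not from the question):

```python
def remove_words(text, lookup):
    """Keep only the words of `text` that are not in the `lookup` set,
    preserving the original word order."""
    return " ".join(w for w in text.split() if w not in lookup)

lookup = {"want", "because", "need", "hello", "a", "give", "go"}
print(remove_words("hi fine i want to go", lookup))  # hi fine i to
print(remove_words("hello want you teach", lookup))  # you teach
```

The Spark solution below does the same thing at array level, once both sides are turned into array columns.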
Any help is appreciated.

This should be a working solution for you - use array_except() to eliminate the unwanted strings, but in order to do that we need a little preparation.
Create the dataframes here
Make the column an array for future use
Output
Now, just group by the lookup dataframe and collect all the lookup values into a variable, as below
And this is the trick
Can you please check and let me know if this solution works for you? - Great!! It worked. One thing I added, since the secondary dataframe's list has more than 7000 words: from pyspark.sql.functions import lit; df_lookup = df.withColumn('col1', lit('1')); df.show(5) - @dsk, array_except() removes duplicates from the source. I don't want that. Any idea how to solve it?
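On the duplicate issue raised in the comments: array_except() behaves like a set difference, so repeated words in cust_text collapse to one occurrence. A per-element filter keeps duplicates instead; in Spark 2.4+ SQL this could look like df.withColumn("ArrayColumn", F.expr("filter(col2, x -> NOT array_contains(filter_col, x))")) (a sketch, not part of the original answer). The difference between the two semantics, in plain Python:

```python
row = ["i", "need", "need", "a", "line", "need"]
lookup = {"want", "a"}

# array_except-like semantics: set difference, duplicates collapse
except_like = []
for w in row:
    if w not in lookup and w not in except_like:
        except_like.append(w)

# filter-like semantics: per-element test, duplicates survive
filter_like = [w for w in row if w not in lookup]

print(except_like)   # ['i', 'need', 'line']
print(filter_like)   # ['i', 'need', 'need', 'line', 'need']
```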
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Main dataframe: event date and customer text
df = spark.createDataFrame([("2020-09-02","hi fine i want to go"),("2020-09-02","i need a line hold"), ("2020-09-02", "i have the 60 packs"), ("2020-09-02", "hello want you teach")], ["col1","col2"])
# Turn the sentence column into an array of words
df = df.withColumn("col2", F.split("col2", " "))
df.show(truncate=False)

# Single-column lookup dataframe of words to remove (col1 is a constant grouping key)
df_lookup = spark.createDataFrame([(1,"want"),(1,"because"), (1, "need"), (1, "hello"),(1, "a"),(1, "give"), (1, "go")], ["col1","col2"])
df_lookup.show()
+----------+---------------------------+
|col1 |col2 |
+----------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|
|2020-09-02|[i, need, , a, line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach] |
+----------+---------------------------+
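Note the empty elements in rows 2 and 3 above: F.split(col, " ") splits on every single space, so runs of spaces (or trailing spaces) produce empty strings that survive into the result. Plain Python shows the same behavior, and one way around it; the Spark analogue would be splitting on a whitespace pattern such as F.split(col, r"\s+") (an assumption, same regex idea):

```python
s = "i need  a line hold"   # note the double space between "need" and "a"

# Splitting on a literal single space keeps an empty string per extra space
print(s.split(" "))         # ['i', 'need', '', 'a', 'line', 'hold']

# Splitting on any whitespace run drops the empties
print(s.split())            # ['i', 'need', 'a', 'line', 'hold']
```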
+----+-------+
|col1| col2|
+----+-------+
| 1| want|
| 1|because|
| 1| need|
| 1| hello|
| 1| a|
| 1| give|
| 1| go|
+----+-------+
# Collect all lookup words into a single Python list
df_lookup_var = df_lookup.groupBy("col1").agg(F.collect_set("col2").alias("col2")).collect()[0][1]
print(df_lookup_var)
x = ",".join(df_lookup_var)
print(x)

# Broadcast the lookup words onto every row as an array column
df = df.withColumn("filter_col", F.lit(x))
df = df.withColumn("filter_col", F.split("filter_col", ","))
df.show(truncate=False)

# array_except removes every element of filter_col from col2
df = df.withColumn("ArrayColumn", F.array_except("col2", "filter_col"))
df.show(truncate=False)
+----------+---------------------------+-----------------------------------------+---------------------------+
|col1 |col2 |filter_col |ArrayColumn |
+----------+---------------------------+-----------------------------------------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|[need, want, a, because, hello, give, go]|[hi, fine, i, to] |
|2020-09-02|[i, need, , a, line, hold] |[need, want, a, because, hello, give, go]|[i, , line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|[need, want, a, because, hello, give, go]|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach] |[need, want, a, because, hello, give, go]|[you, teach] |
+----------+---------------------------+-----------------------------------------+---------------------------+
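The expected output in the question is a string column, not an array, so a final step would join ArrayColumn back with spaces - Spark 2.4+ has F.array_join(col, " ") for this (and F.concat_ws(" ", col) also works). The plain-Python equivalent of that last step:

```python
# Rejoin the filtered word arrays into sentences (sample rows from the output above)
rows = [["hi", "fine", "i", "to"],
        ["you", "teach"]]
for arr in rows:
    print(" ".join(arr))
# hi fine i to
# you teach
```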