Python: remove words from one PySpark dataframe based on words in another PySpark dataframe
I want to remove the words of a secondary dataframe from a primary dataframe. This is the primary dataframe:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need a line hold |
|2020-09-02|i have the 60 packs|
|2020-09-02|hello want you teach|
+----------+--------------------+
Below is the single-column secondary dataframe. Wherever a word from the secondary dataframe appears in the cust_text column of the primary dataframe, it needs to be removed. For example, 'want' is removed from every row it appears in (here, rows 1 and 4). The event_dt column and the rows themselves stay as-is; only the secondary-dataframe words are removed from the primary dataframe, so the result looks like this:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to |
|2020-09-02|i line hold |
|2020-09-02|i the 60 packs |
|2020-09-02|you teach |
+----------+--------------------+
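The requested transformation can be sketched row-by-row in plain Python before bringing in Spark (the helper name is illustrative, not from the question):

```python
def remove_words(text, lookup):
    """Keep only the words of `text` that are not in the `lookup` set,
    preserving the original word order."""
    return " ".join(w for w in text.split() if w not in lookup)

lookup = {"want", "because", "need", "hello", "a", "give", "go"}
print(remove_words("hi fine i want to go", lookup))  # hi fine i to
print(remove_words("hello want you teach", lookup))  # you teach
```

The Spark solution below does the same thing at array level, once both sides are turned into array columns.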
Any help is appreciated.

This should be a working solution for you - use array_except() to eliminate the unwanted strings, but in order to do that we need a little preparation.
Create the dataframes here
Make the column an array for future use
Output
Now, just group by the lookup dataframe and collect all the lookup values into a variable, as below
And this is the trick
Can you please check and let me know if this solution works for you? - Great!! It worked. One thing I added, since the secondary dataframe's list has more than 7000 words: from pyspark.sql.functions import lit; df_lookup = df.withColumn('col1', lit('1')); df.show(5) - @dsk, array_except() removes duplicates from the source. I don't want that. Any idea how to solve it?
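On the duplicate issue raised in the comments: array_except() behaves like a set difference, so repeated words in cust_text collapse to one occurrence. A per-element filter keeps duplicates instead; in Spark 2.4+ SQL this could look like df.withColumn("ArrayColumn", F.expr("filter(col2, x -> NOT array_contains(filter_col, x))")) (a sketch, not part of the original answer). The difference between the two semantics, in plain Python:

```python
row = ["i", "need", "need", "a", "line", "need"]
lookup = {"want", "a"}

# array_except-like semantics: set difference, duplicates collapse
except_like = []
for w in row:
    if w not in lookup and w not in except_like:
        except_like.append(w)

# filter-like semantics: per-element test, duplicates survive
filter_like = [w for w in row if w not in lookup]

print(except_like)   # ['i', 'need', 'line']
print(filter_like)   # ['i', 'need', 'need', 'line', 'need']
```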
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Main dataframe: event date and customer text
df = spark.createDataFrame([("2020-09-02","hi fine i want to go"),("2020-09-02","i need a line hold"), ("2020-09-02", "i have the 60 packs"), ("2020-09-02", "hello want you teach")], ["col1","col2"])
# Turn the sentence column into an array of words
df = df.withColumn("col2", F.split("col2", " "))
df.show(truncate=False)

# Single-column lookup dataframe of words to remove (col1 is a constant grouping key)
df_lookup = spark.createDataFrame([(1,"want"),(1,"because"), (1, "need"), (1, "hello"),(1, "a"),(1, "give"), (1, "go")], ["col1","col2"])
df_lookup.show()
+----------+---------------------------+
|col1 |col2 |
+----------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|
|2020-09-02|[i, need, , a, line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach] |
+----------+---------------------------+
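Note the empty elements in rows 2 and 3 above: F.split(col, " ") splits on every single space, so runs of spaces (or trailing spaces) produce empty strings that survive into the result. Plain Python shows the same behavior, and one way around it; the Spark analogue would be splitting on a whitespace pattern such as F.split(col, r"\s+") (an assumption, same regex idea):

```python
s = "i need  a line hold"   # note the double space between "need" and "a"

# Splitting on a literal single space keeps an empty string per extra space
print(s.split(" "))         # ['i', 'need', '', 'a', 'line', 'hold']

# Splitting on any whitespace run drops the empties
print(s.split())            # ['i', 'need', 'a', 'line', 'hold']
```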
+----+-------+
|col1| col2|
+----+-------+
| 1| want|
| 1|because|
| 1| need|
| 1| hello|
| 1| a|
| 1| give|
| 1| go|
+----+-------+
# Collect all lookup words into a single Python list
df_lookup_var = df_lookup.groupBy("col1").agg(F.collect_set("col2").alias("col2")).collect()[0][1]
print(df_lookup_var)
x = ",".join(df_lookup_var)
print(x)

# Broadcast the lookup words onto every row as an array column
df = df.withColumn("filter_col", F.lit(x))
df = df.withColumn("filter_col", F.split("filter_col", ","))
df.show(truncate=False)

# array_except removes every element of filter_col from col2
df = df.withColumn("ArrayColumn", F.array_except("col2", "filter_col"))
df.show(truncate=False)
+----------+---------------------------+-----------------------------------------+---------------------------+
|col1 |col2 |filter_col |ArrayColumn |
+----------+---------------------------+-----------------------------------------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|[need, want, a, because, hello, give, go]|[hi, fine, i, to] |
|2020-09-02|[i, need, , a, line, hold] |[need, want, a, because, hello, give, go]|[i, , line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|[need, want, a, because, hello, give, go]|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach] |[need, want, a, because, hello, give, go]|[you, teach] |
+----------+---------------------------+-----------------------------------------+---------------------------+
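The expected output in the question is a string column, not an array, so a final step would join ArrayColumn back with spaces - Spark 2.4+ has F.array_join(col, " ") for this (and F.concat_ws(" ", col) also works). The plain-Python equivalent of that last step:

```python
# Rejoin the filtered word arrays into sentences (sample rows from the output above)
rows = [["hi", "fine", "i", "to"],
        ["you", "teach"]]
for arr in rows:
    print(" ".join(arr))
# hi fine i to
# you teach
```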