PySpark: comparing rows to generate noun chunks
I have a Spark DataFrame in which each row is one token of a sentence, together with its part of speech. I am trying to find the best way to compare each row with the next one in order to build the longest possible noun chunk.
+------+-----------+---------------------------+--------+-------+-------+-----+
|REV_ID|    SENT_ID|                   SENTENCE|TOKEN_ID|  TOKEN|  LEMMA|  POS|
+------+-----------+---------------------------+--------+-------+-------+-----+
|     1|          1|Ice hockey game took hours.|       1|    Ice|    ice| NOUN|
|     1|          1|Ice hockey game took hours.|       2| hockey| hockey| NOUN|
|     1|          1|Ice hockey game took hours.|       3|   game|   game| NOUN|
|     1|          1|Ice hockey game took hours.|       4|   took|   take| VERB|
|     1|          1|Ice hockey game took hours.|       5|  hours|   hour| NOUN|
+------+-----------+---------------------------+--------+-------+-------+-----+
I know a for loop would be inefficient, but I am not sure how else to get the expected result, which looks like this:
+------+-----------+---------------------------+--------+-------+-------+-----+----------------+
|REV_ID|    SENT_ID|                   SENTENCE|TOKEN_ID|  TOKEN|  LEMMA|  POS|      NOUN_CHUNK|
+------+-----------+---------------------------+--------+-------+-------+-----+----------------+
|     1|          1|Ice hockey game took hours.|       1|    Ice|    ice| NOUN| ice hockey game|
|     1|          1|Ice hockey game took hours.|       2| hockey| hockey| NOUN| ice hockey game|
|     1|          1|Ice hockey game took hours.|       3|   game|   game| NOUN| ice hockey game|
|     1|          1|Ice hockey game took hours.|       4|   took|   take| VERB|            NULL|
|     1|          1|Ice hockey game took hours.|       5|  hours|   hour| NOUN|            hour|
+------+-----------+---------------------------+--------+-------+-------+-----+----------------+
Try this, using window functions:
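from pyspark.sql import functions as F
from pyspark.sql.window import Window

# w: per-sentence window in token order, used for a running count of non-noun tokens
# (consecutive nouns share the same count). Add "REV_ID" to both partitionBy calls
# if SENT_ID restarts for each review.
w = Window.partitionBy("SENT_ID").orderBy("TOKEN_ID")
# w1: one group per noun run; ordered with a full frame so lemmas are joined in token order.
w1 = Window.partitionBy("SENT_ID", "list").orderBy("TOKEN_ID")\
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df\
.withColumn("list", F.sum(F.when(F.col("POS") == 'NOUN', F.lit(0)).otherwise(F.lit(1))).over(w))\
.withColumn("list", F.expr("""IF(POS!='NOUN',null,list)"""))\
.withColumn("NOUN_CHUNK", F.when(F.col("list").isNotNull(), F.array_join(F.collect_list("LEMMA").over(w1), ' '))\
.otherwise(F.lit(None))).drop("list").orderBy("SENT_ID", "TOKEN_ID").show()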
#+------+-------+--------------------+--------+------+------+----+---------------+
#|REV_ID|SENT_ID|            SENTENCE|TOKEN_ID| TOKEN| LEMMA| POS|     NOUN_CHUNK|
#+------+-------+--------------------+--------+------+------+----+---------------+
#|     1|      1|Ice hockey game t...|       1|   Ice|   ice|NOUN|ice hockey game|
#|     1|      1|Ice hockey game t...|       2|hockey|hockey|NOUN|ice hockey game|
#|     1|      1|Ice hockey game t...|       3|  game|  game|NOUN|ice hockey game|
#|     1|      1|Ice hockey game t...|       4|  took|  take|VERB|           null|
#|     1|      1|Ice hockey game t...|       5| hours|  hour|NOUN|           hour|
#+------+-------+--------------------+--------+------+------+----+---------------+
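If it helps to see why the running sum groups consecutive nouns, here is a minimal plain-Python sketch of the same idea, with no Spark involved. It is only an illustration of the logic, not the answer's actual code: the `tokens` list and the `noun_chunks` helper are hypothetical names made up for this example, and the rows are just the sample sentence from the question.

```python
# Sample rows from the question: (TOKEN_ID, LEMMA, POS) for one sentence.
tokens = [
    (1, "ice", "NOUN"),
    (2, "hockey", "NOUN"),
    (3, "game", "NOUN"),
    (4, "take", "VERB"),
    (5, "hour", "NOUN"),
]


def noun_chunks(rows):
    """Hypothetical helper: return one NOUN_CHUNK value per row (None for non-nouns)."""
    rows = sorted(rows)  # mirrors orderBy("TOKEN_ID")

    # Step 1: running count of non-noun tokens seen so far (the "list" column).
    # Nouns in the same uninterrupted run share the same count; non-nouns get None.
    group_ids = []
    non_nouns_seen = 0
    for _, _, pos in rows:
        if pos != "NOUN":
            non_nouns_seen += 1
        group_ids.append(non_nouns_seen if pos == "NOUN" else None)

    # Step 2: collect the lemmas of each group, in token order.
    lemmas_by_group = {}
    for (_, lemma, _), gid in zip(rows, group_ids):
        if gid is not None:
            lemmas_by_group.setdefault(gid, []).append(lemma)

    # Step 3: every noun row gets the joined lemmas of its group.
    return [
        " ".join(lemmas_by_group[gid]) if gid is not None else None
        for gid in group_ids
    ]


chunks = noun_chunks(tokens)
print(chunks)
```

For the sample sentence the group ids come out as 0, 0, 0, None, 1, so the first three rows share one chunk, the verb gets nothing, and "hour" forms a chunk of its own. The window version does the same thing per SENT_ID partition.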
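I tried implementing it, but it seems to return only the tokens that are not nouns, so every NOUN_CHUNK is null.
@user3242036 Double-check your code. It works fine on my cluster. Otherwise, provide more data showing that it does not work as described.
The only thing I changed was adding "REV_ID" to the window partitionBy, but that does not change what the code does.
I tried it with more data and it works fine. Please provide more data cases where it does not work.
I am seeing strange behaviour that I have never seen before. Sometimes it returns exactly what you showed and what I expect. Other times it returns an empty DataFrame. I did nothing other than press run again.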