Finding similar strings in a DataFrame column in PySpark without using a for loop (Python 3.x)
I have a DataFrame containing a column of strings. I want to find similar strings and mark them with a flag. I am using a function from the python-Levenshtein module and want to flag strings whose ratio is greater than 0.90 as "similar". Here is a sample of the DataFrame I have:
sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat"),
    (3, "Logistic,regression,model,are,neat")
], ["id", "sentence"])
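As context for why the question wants to avoid loops, the plain-Python approach it describes (python-Levenshtein with a 0.90 ratio threshold, applied pairwise with for loops) can be sketched as below. This is an illustration only: levenshtein_ratio is a hand-rolled stand-in for the module's Levenshtein.ratio (which uses a slightly different formula), and the flag format is simplified to referenceId_matchId.

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute, cost 1 each).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(a: str, b: str) -> float:
    # Normalised similarity in [0, 1]; approximates (is not identical to) Levenshtein.ratio.
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b), 1)

sentences = [
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat"),
    (3, "Logistic,regression,model,are,neat"),
]

flags = {i: "" for i, _ in sentences}
# The nested for loops the question wants to eliminate:
for pos, (i, si) in enumerate(sentences):
    for j, sj in sentences[pos + 1:]:
        if not flags[j] and levenshtein_ratio(si, sj) > 0.90:
            flags[j] = f"{i}_{j}"

print(flags)  # {0: '', 1: '', 2: '', 3: '2_3'}
```

The nested loop makes this quadratic in the number of rows, which is exactly what becomes a problem as the DataFrame grows.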
The desired output is:
+---+-----------------------------------+------------+
|id |sentence |similar_flag|
+---+-----------------------------------+------------+
|0 |Hi I heard about Spark | |
|1 |I wish Java could use case classes | |
|2 |Logistic regression models are neat|2_0 |
|3 |Logistic regression model is neat |2_1 |
|4 |Logistics regression model are neat|2_2 |
+---+-----------------------------------+------------+
where "2_1" means "2" is the id of the reference string (the first unique string used for matching) and "1" denotes the first string that matched it. I want to avoid for loops entirely. For smaller data I have already achieved the desired result in plain Python using for loops, and I would like the same result in PySpark, so I don't want to use any module other than python-Levenshtein. I came across this approach, but it requires me to give up the python-Levenshtein module. Also, my DataFrame may get very large (and is expected to grow daily), so that approach could lead to memory errors. Is there a better way to achieve the desired result?

I'll answer in three steps. First, you need to let df see all the options, so you may want a Cartesian product of the data using crossJoin, e.g.:
from pyspark.sql import functions as f

df_new = (
    sentenceDataFrame.crossJoin(
        sentenceDataFrame.select(
            f.col('sentence').alias('second_sentence'),
            f.col('id').alias('second_id')))
)
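To see what this step produces: a cross join pairs every row with every row, so n rows yield n × n combinations (16 for the 4 sentences here); the self-pairs are filtered out later. A plain-Python analogue of the pairing, for illustration only:

```python
from itertools import product

ids = [0, 1, 2, 3]  # the DataFrame's ids
pairs = list(product(ids, ids))                    # what crossJoin produces: 4 * 4 = 16 pairs
candidates = [(i, j) for i, j in pairs if i != j]  # drop self-pairs, as the answer does later

print(len(pairs), len(candidates))  # 16 12
```

Keep in mind the cross join grows quadratically with the row count, which matters for the question's daily-growing DataFrame.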
Secondly, take a look at pyspark.sql.functions.levenshtein. Once your sentences are lined up one against another, add a new column with the Levenshtein distance:
df_new_with_dist = df_new.withColumn('levehstein_distance',
    f.levenshtein(f.col("sentence"), f.col("second_sentence"))
)
df_new_with_dist.show()
+---+--------------------+--------------------+---------+-------------------+
| id| sentence| second_sentence|second_id|levehstein_distance|
+---+--------------------+--------------------+---------+-------------------+
| 0|Hi I heard about ...|Hi I heard about ...| 0| 0|
| 0|Hi I heard about ...|I wish Java could...| 1| 27|
| 0|Hi I heard about ...|Logistic,regressi...| 2| 29|
| 0|Hi I heard about ...|Logistic,regressi...| 3| 28|
| 1|I wish Java could...|Hi I heard about ...| 0| 27|
| 1|I wish Java could...|I wish Java could...| 1| 0|
| 1|I wish Java could...|Logistic,regressi...| 2| 32|
| 1|I wish Java could...|Logistic,regressi...| 3| 31|
| 2|Logistic,regressi...|Hi I heard about ...| 0| 29|
| 2|Logistic,regressi...|I wish Java could...| 1| 32|
| 2|Logistic,regressi...|Logistic,regressi...| 2| 0|
| 2|Logistic,regressi...|Logistic,regressi...| 3| 1|
| 3|Logistic,regressi...|Hi I heard about ...| 0| 28|
| 3|Logistic,regressi...|I wish Java could...| 1| 31|
| 3|Logistic,regressi...|Logistic,regressi...| 2| 1|
| 3|Logistic,regressi...|Logistic,regressi...| 3| 0|
+---+--------------------+--------------------+---------+-------------------+
Finally, filter out all rows where id == second_id. If you want to stick with your notation, e.g. 2_1, I'd suggest adding a groupBy(f.col('id')) and aggregating with f.min() over levehstein_distance. Then you can concatenate your ids, e.g.:
min_dist_df = (
    df_new_with_dist.where(f.col('id') != f.col('second_id'))
    .groupBy(f.col('id').alias('second_id'))
    .agg(f.min(f.col('levehstein_distance')).alias('levehstein_distance'))
)
(
    df_new_with_dist.join(min_dist_df,
                          on=['second_id', 'levehstein_distance'],
                          how='inner')
    .withColumn('similar_flag', f.concat(f.col('id'), f.lit('_'), f.col('second_id')))
    .select('id', 'sentence', 'similar_flag')
).show()
+---+--------------------+------------+
| id| sentence|similar_flag|
+---+--------------------+------------+
| 2|Logistic,regressi...| 2_3|
| 1|I wish Java could...| 1_0|
| 0|Hi I heard about ...| 0_1|
| 3|Logistic,regressi...| 3_2|
+---+--------------------+------------+
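The groupBy/f.min step above keeps, for each sentence, only the distance to its nearest neighbour, and the join then recovers which row that minimum came from. In plain-Python terms, using the id-2 and id-3 rows from the distance table above for illustration:

```python
from collections import defaultdict

# (id, second_id, distance) rows from the cross join, self-pairs already removed
rows = [(2, 0, 29), (2, 1, 32), (2, 3, 1),
        (3, 0, 28), (3, 1, 31), (3, 2, 1)]

best = defaultdict(lambda: float("inf"))
for i, _, d in rows:
    best[i] = min(best[i], d)   # mirrors f.min over the distance per group

print(dict(best))  # {2: 1, 3: 1}
```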
While this isn't exactly what you wanted, you can filter on and tune the levehstein_distance values to arrive at the answer you're after.
Thanks, everyone! This is a close-enough solution, and the cross join solves the main problem of avoiding the for loop. I looked through the Spark Python API docs but couldn't find an equivalent of the Levenshtein ratio. Levenshtein distance is a similar metric, but it could cause regressions (I have previously processed results in plain Python that need to match). I'll definitely try this solution.
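Following up on the comment above: Spark has no built-in Levenshtein ratio, but a ratio-like score can be derived from the built-in distance, e.g. 1 - distance / greatest(length(a), length(b)) in Spark SQL terms. Note this normalisation is an assumption on my part and is not identical to python-Levenshtein's ratio, so thresholded results may differ slightly from a plain-Python run. A quick plain-Python check using the distance of 1 from the table above:

```python
def ratio_from_distance(dist: int, a: str, b: str) -> float:
    # One common way to turn an edit distance into a [0, 1] similarity score.
    return 1.0 - dist / max(len(a), len(b), 1)

s2 = "Logistic,regression,models,are,neat"
s3 = "Logistic,regression,model,are,neat"

print(round(ratio_from_distance(1, s2, s3), 3))  # 0.971, above the 0.90 threshold
```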