Python 3.x: Find similar strings in a PySpark DataFrame column without using a for loop

I have a DataFrame that contains a column of strings. I want to find similar strings and mark them with a flag. I am using a function from the python-Levenshtein module and want to mark strings whose ratio is greater than 0.90 as "similar". Here is a sample of the DataFrame I have:

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat"),
    (3, "Logistic,regression,model,are,neat")
], ["id", "sentence"])
The desired output is:

+---+-----------------------------------+------------+
|id |sentence                           |similar_flag|
+---+-----------------------------------+------------+
|0  |Hi I heard about Spark             |            |
|1  |I wish Java could use case classes |            |
|2  |Logistic regression models are neat|2_0         |
|3  |Logistic regression model is neat  |2_1         |
|4  |Logistics regression model are neat|2_2         |
+---+-----------------------------------+------------+

Here, "2_1" means "2" is the "id" of the reference string (the first unique string used for matching) and "1" indicates the first string that matched it. I want to avoid for loops entirely. For smaller data I have already produced the desired result in plain Python using for loops, and I want to reproduce those results in PySpark, so I don't want to use any module other than python-Levenshtein. I have come across this approach, but it would require me to give up the python-Levenshtein module. Also, my DataFrame may be large (and is expected to grow every day), so that approach could lead to memory errors. Is there a better way to achieve the desired result?
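
For reference, a plain-Python baseline of the kind described might look like the sketch below. This is a reconstruction under assumptions (the asker's actual implementation is not shown, and the flag bookkeeping here is simplified to id-to-id pairs):

import Levenshtein  # pip install python-Levenshtein

sentences = [
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat"),
    (3, "Logistic,regression,model,are,neat"),
]

# Compare every pair once; flag pairs whose ratio exceeds 0.90 as similar.
flags = {}
for i, (id_a, a) in enumerate(sentences):
    for id_b, b in sentences[i + 1:]:
        if Levenshtein.ratio(a, b) > 0.90:
            flags[id_b] = f"{id_a}_{id_b}"

print(flags)  # {3: '2_3'}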

I'll answer in three steps. First, you need to let the df see all the options, so you may want a Cartesian product of the data, which you can get with crossJoin, for example:

from pyspark.sql import functions as f

# Pair every sentence with every other sentence (Cartesian product).
df_new = sentenceDataFrame.crossJoin(
    sentenceDataFrame.select(
        f.col('sentence').alias('second_sentence'),
        f.col('id').alias('second_id')))
Second, take a look at pyspark.sql.functions.levenshtein. Once your sentences are paired up one against another, add a new column with the Levenshtein distance:

df_new_with_dist = df_new.withColumn(
    'levenshtein_distance',
    f.levenshtein(f.col('sentence'), f.col('second_sentence')))

df_new_with_dist.show()

+---+--------------------+--------------------+---------+--------------------+
| id|            sentence|     second_sentence|second_id|levenshtein_distance|
+---+--------------------+--------------------+---------+--------------------+
|  0|Hi I heard about ...|Hi I heard about ...|        0|                   0|
|  0|Hi I heard about ...|I wish Java could...|        1|                  27|
|  0|Hi I heard about ...|Logistic,regressi...|        2|                  29|
|  0|Hi I heard about ...|Logistic,regressi...|        3|                  28|
|  1|I wish Java could...|Hi I heard about ...|        0|                  27|
|  1|I wish Java could...|I wish Java could...|        1|                   0|
|  1|I wish Java could...|Logistic,regressi...|        2|                  32|
|  1|I wish Java could...|Logistic,regressi...|        3|                  31|
|  2|Logistic,regressi...|Hi I heard about ...|        0|                  29|
|  2|Logistic,regressi...|I wish Java could...|        1|                  32|
|  2|Logistic,regressi...|Logistic,regressi...|        2|                   0|
|  2|Logistic,regressi...|Logistic,regressi...|        3|                   1|
|  3|Logistic,regressi...|Hi I heard about ...|        0|                  28|
|  3|Logistic,regressi...|I wish Java could...|        1|                  31|
|  3|Logistic,regressi...|Logistic,regressi...|        2|                   1|
|  3|Logistic,regressi...|Logistic,regressi...|        3|                   0|
+---+--------------------+--------------------+---------+--------------------+

Finally, filter out all rows where id == second_id. If you want to stick with your notation, e.g. 2_1, I suggest adding a groupBy(f.col('id')) and aggregating with f.min() on levenshtein_distance. Then you can concatenate your IDs, e.g.:

# For each sentence, keep the smallest distance to any other sentence.
min_dist_df = (
    df_new_with_dist.where(f.col('id') != f.col('second_id'))
                    .groupBy(f.col('id').alias('second_id'))
                    .agg(f.min(f.col('levenshtein_distance')).alias('levenshtein_distance'))
)


(
    df_new_with_dist.join(min_dist_df,
                          on=['second_id', 'levenshtein_distance'],
                          how='inner')
                    .withColumn('similar_flag',
                                f.concat(f.col('id'), f.lit('_'), f.col('second_id')))
                    .select('id', 'sentence', 'similar_flag')
).show()

+---+--------------------+------------+
| id|            sentence|similar_flag|
+---+--------------------+------------+
|  2|Logistic,regressi...|         2_3|
|  1|I wish Java could...|         1_0|
|  0|Hi I heard about ...|         0_1|
|  3|Logistic,regressi...|         3_2|
+---+--------------------+------------+


While this isn't exactly what you want, you can filter and tune the levenshtein_distance values to get the answer you are after.
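
For instance, since the question asks for a ratio rather than a raw distance, one possible normalization (an assumption on my part, not part of the answer above) is to divide the distance by the length of the longer string, continuing from df_new_with_dist:

# Sketch: approximate a similarity ratio from the built-in distance.
# Note this is not identical to python-Levenshtein's ratio(), which
# normalizes by the sum of the lengths rather than the maximum.
similar_pairs = (
    df_new_with_dist.where(f.col('id') != f.col('second_id'))
                    .withColumn('similarity',
                                1 - f.col('levenshtein_distance')
                                    / f.greatest(f.length('sentence'),
                                                 f.length('second_sentence')))
                    .where(f.col('similarity') > 0.90)
)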

Thank you! This is a close enough solution, and the cross join solves the main problem of avoiding the for loop. I looked through the Spark Python API documentation but couldn't find an equivalent of the Levenshtein ratio. Levenshtein distance is a similar metric, but it could cause regressions (I have already processed results in plain Python and need the outputs to match). I'll definitely give this solution a try.
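
If the exact python-Levenshtein ratio is required for backward compatibility, one option is to wrap it in a plain Python UDF and apply it after the crossJoin from the answer. A sketch, assuming the python-Levenshtein package is installed on all executors and the sentence columns are non-null:

from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType
import Levenshtein  # pip install python-Levenshtein

# Wrap the module's ratio() so it runs on each row of the cross join.
# A Python UDF is slower than the built-in f.levenshtein, but reproduces
# the exact metric used in the earlier plain-Python implementation.
levenshtein_ratio = f.udf(
    lambda a, b: float(Levenshtein.ratio(a, b)), DoubleType())

similar = (
    df_new.withColumn('ratio',
                      levenshtein_ratio(f.col('sentence'),
                                        f.col('second_sentence')))
          .where((f.col('id') != f.col('second_id')) & (f.col('ratio') > 0.90))
)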