Python: how to group similar-sounding words together

python python-3.x list

I'm trying to find all the similar-sounding words in a list.

I tried to get them with cosine similarity, but that doesn't do what I need.

from sklearn.metrics.pairwise import cosine_similarity
dataList = ['two','fourth','forth','dessert','to','desert']
cosine_similarity(dataList)

I know this isn't the right approach, and I can't seem to get a result like the following:

result = ['xx', 'xx', 'yy', 'yy', 'zz', 'zz']

where matching entries stand for words that sound alike.

First, you need the right method for finding similar-sounding words, that is, string similarity. I suggest:

Using:
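The snippet that belonged here did not survive on this page, so which library the answer used is not known. As a stand-in, here is a minimal pure-Python sketch of the classic American Soundex rules. The function name `soundex` matches the calls later in this answer, but the implementation itself is an assumption, not the author's code.

```python
# Hypothetical stand-in for the missing snippet: classic American Soundex.
# Consonant groups share a digit; vowels (and y) have no digit.
_CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        _CODES[ch] = str(digit)

def soundex(word, length=4):
    """Return the Soundex code of `word`, zero-padded to `length` characters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()            # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")    # its digit still suppresses a repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue                     # h and w do not break a run of equal digits
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit                     # a vowel resets prev to ""
        if len(code) == length:
            break
    return code.ljust(length, "0")

print(soundex("two"))
print(soundex("to"))
```

Any phonetic-encoding library that exposes a Soundex function could be swapped in; the rest of the answer only needs a callable that maps a word to its code.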

Output:

T000
T000

Now, you could write a function that processes the whole list, then sort the result to bring matching codes together:

# Assumes a soundex(word) function is already in scope; the snippet that
# defined or imported it is missing from this page.
def getSoundexList(dList):
    # Encode every word in the list
    return [soundex(x) for x in dList]   # ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']

dataList = ['two','fourth','forth','dessert','to','desert']
print(sorted(getSoundexList(dataList)))

Output:

['D263', 'D263', 'F630', 'F630', 'T000', 'T000']

Edit:

Another approach could be:

Using:

Output:

T000
T000

Edit 2:

If you want to group them, you can use groupby:

from itertools import groupby

# Assumes a soundex(word) function is already in scope.
def getSoundexList(dList):
    # groupby only merges adjacent equal items, so sort the codes first
    return sorted(soundex(x) for x in dList)

dataList = ['two','fourth','forth','dessert','to','desert']
print([list(g) for _, g in groupby(getSoundexList(dataList))])

Output:

[['D263', 'D263'], ['F630', 'F630'], ['T000', 'T000']]

Edit 3:

This is for @Eric Duminil. Suppose you want the names together with their respective values, using a dict:
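The code for this edit is missing from the page. One possible sketch pairs each word with its code, sorts the pairs by code, and groups them with `itertools.groupby`. The helper name `group_with_codes` and the `KNOWN_CODES` table are hypothetical; the table simply copies the codes shown in this answer's output so the block runs without a Soundex library.

```python
from itertools import groupby

# Codes copied from the output shown in this answer, so the block runs on its
# own; in practice pass a real soundex function as `key` instead.
KNOWN_CODES = {
    "two": "T000", "to": "T000",
    "fourth": "F630", "forth": "F630",
    "dessert": "D263", "desert": "D263",
}

def group_with_codes(words, key):
    """Group (word, code) pairs by code; `key` maps a word to its code."""
    # sorted() is stable, so words keep their input order within each group
    pairs = sorted(((w, key(w)) for w in words), key=lambda p: p[1])
    return [list(g) for _, g in groupby(pairs, key=lambda p: p[1])]

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
print(group_with_codes(dataList, KNOWN_CODES.get))
```

Sorting before grouping matters here because `groupby` only merges neighbouring items with equal keys.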
Output:

[[('dessert', 'D263'), ('desert', 'D263')], [('fourth', 'F630'), ('forth', 'F630')], [('two', 'T000'), ('to', 'T000')]]

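Edit 4 (for the OP):

Soundex:
Soundex is a system in which values are assigned to names in such a way that similar-sounding names get the same value. These values are known as soundex encodings. A search application based on soundex does not search for a name directly; it searches for the name's soundex encoding instead. By doing so, it retrieves all the names that sound like the one being sought.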
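To make the search idea concrete, here is a small sketch that indexes names by their encoding and looks a query up by its encoding rather than by its spelling. The helpers `build_index` and `search` are hypothetical names, and the `soundex` function is a plain implementation of the classic American Soundex rules included only so the block runs on its own; it is not taken from the answer.

```python
from collections import defaultdict

# Classic American Soundex digit groups (assumed encoder for this sketch).
_CODES = {ch: str(d)
          for d, group in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1)
          for ch in group}

def soundex(word, length=4):
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code, prev = letters[0].upper(), _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":                   # h and w do not separate equal digits
            continue
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit
        if len(code) == length:
            break
    return code.ljust(length, "0")

def build_index(names):
    """Map each soundex encoding to the names that share it."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return index

def search(index, query):
    """Return every indexed name whose encoding matches the query's."""
    return index.get(soundex(query), [])

index = build_index(['two', 'fourth', 'forth', 'dessert', 'to', 'desert'])
print(search(index, "too"))    # ['two', 'to']
print(search(index, "fort"))   # ['fourth', 'forth']
```

The query never has to appear in the indexed list; it only has to encode to the same value as the names it sounds like.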
Here is another case similar to the one above, inspired by this answer; it may be interesting :)