Ruby on Rails: group bulk text into groups based on a given similarity percentage

ruby-on-rails ruby nlp

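I went through the following NLP gems available on GitHub, but could not find the right solution.

Is there any gem or library that can group texts based on a given similarity percentage? All of the gems above help find the similarity between two strings, but grouping a large amount of data that way takes a lot of time.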

You can do this with plain Ruby plus one of the gems you listed.

I chose fuzzy-string-match because I like the name.

Here is how you use the gem:

require 'fuzzystringmatch'

# Create the matcher
jarow = FuzzyStringMatch::JaroWinkler.create( :native )

# Get the distance
jarow.getDistance(  "jones",      "johnson" )
# => 0.8323809523809523

# Round it
jarow.getDistance(  "jones",      "johnson" ).round(2)
# => 0.83

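Since you get back a float (a score between 0 and 1, where higher means more similar), you can use the round method to set the precision you want.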

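Now, to group similar results, you can use the group_by method from the Enumerable module.

You pass it a block, and group_by iterates over the collection. On each iteration the block returns the value you want to group by (in this case, the distance), and group_by returns a hash with the distances as keys and arrays of the strings that share that distance as values.
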
require 'fuzzystringmatch'

jarow = FuzzyStringMatch::JaroWinkler.create( :native )

target = "jones"
precision = 2
candidates = [ "Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen" ]

distances = candidates.group_by { |candidate|
  jarow.getDistance( target, candidate ).round(precision)
}

distances
# => {0.52=>["Jessica Jones"],
#     0.87=>["Jones"],
#     0.68=>["Johnson"],
#     0.55=>["thompson", "thompsen"],
#     0.83=>["john"]}
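Note that the group_by call above buckets candidates by their score against a single target. If what you need is to group the texts among themselves by a similarity threshold, a greedy pass like the one below is one possible sketch. Everything in it is an assumption for illustration: the helper names (levenshtein, similarity, group_by_similarity) are hypothetical, and the similarity function is a plain-Ruby stand-in (1 minus the Levenshtein distance divided by the longer length) so the example runs without any gem. You could swap in jarow.getDistance instead. Each text is compared only against the first member of each existing group, which keeps the number of comparisons well below all-pairs, at the cost of results that depend on input order.

```ruby
# Plain-Ruby Levenshtein edit distance (stand-in so no gem is required).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical similarity score in 0.0..1.0 (1.0 means identical).
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

# Greedy grouping: a text joins the first group whose representative
# (the group's first member) scores at least `threshold`; otherwise it
# starts a new group. Comparison is case-insensitive.
def group_by_similarity(texts, threshold)
  texts.each_with_object([]) do |text, groups|
    group = groups.find { |g| similarity(g.first.downcase, text.downcase) >= threshold }
    group ? group << text : groups << [text]
  end
end

candidates = ["Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen"]
groups = group_by_similarity(candidates, 0.8)
# => [["Jessica Jones"], ["Jones"], ["Johnson"], ["thompson", "thompsen"], ["john"]]
```

Lowering the threshold merges more texts into the same group; with this stand-in score, 0.8 only pairs up "thompson" and "thompsen".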

I hope this helps.