Ruby: How can I group words based on how often they are used in the same sentence?

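I have a body of text, 500 sentences. The sentences are clearly delimited; for simplicity, let's assume by a period. Each sentence has roughly 10-20 words.

I want to break it down into groups of words that are, statistically, used most often in the same sentence. Here is a simple example: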

This is a sentence about pink killer cats chasing madonna.
Sometimes when whales fight bricklayers, everyone drinks champaigne.
You know Madonna has little cats on her slippers.
When whales drink whiskey, your golf game is over.

I do have a list of stop words that get filtered out, and for the example above I can imagine wanting to build these groups:

Group 1: pink cats madonna
Group 2: whales drink

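Or something like that. I realize this may be a fairly complex task. I have been experimenting with TF-IDF similarity but haven't really gotten anywhere. I'm working in Ruby and would love to hear people's ideas/directions/suggestions.

I like this puzzle; here is my take on a possible solution.*

*That said, next time I'd suggest you don't just ask a question without showing what you tried and where you got stuck... otherwise it can look like you are dumping your class homework on us.

Let's say this is our text: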

text = 'This is a sentence about pink killer cats chasing madonna.
        Sometimes when whales fight bricklayers, everyone drinks champaigne.
        You know Madonna has little cats on her slippers.
        When whales drink whiskey, your golf game is over.'

As I see it, the task at hand has several stages.

Create a dictionary of "words".

Count how many times each word appears in the text:

require 'strscan'
words = {}
scn = StringScanner.new(text.downcase)
( words[scn.matched] =  words[scn.matched].to_i + 1 if scn.scan(/[\w]*/) ) while (scn.skip(/[^\w]*/) > 0) || !scn.eos?

Remove any word that appears only once, since it cannot be shared between sentences:

words.delete_if {|w, v| v <= 1}
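For comparison, the counting and filtering steps can also be written with String#scan instead of StringScanner. This is only a sketch of the same idea; the names counts and repeated are mine and not part of the code above.

```ruby
text = 'This is a sentence about pink killer cats chasing madonna.
        Sometimes when whales fight bricklayers, everyone drinks champaigne.
        You know Madonna has little cats on her slippers.
        When whales drink whiskey, your golf game is over.'

# Count every run of word characters, ignoring case.
counts = Hash.new(0)
text.downcase.scan(/\w+/) { |word| counts[word] += 1 }

# Keep only the words that occur more than once.
repeated = counts.select { |_word, n| n > 1 }

puts repeated.inspect
```

For the sample text, the repeated words are "is", "cats", "madonna", "when" and "whales", each appearing twice.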

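Populate a sentences hash with the words used in each sentence. A simplified approach tests each sentence for the word as a substring; in a real application you would split the sentence into words so that "cat" and "caterpillar" do not overlap.

Voilà, these are the common groups: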
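The population step and the pairwise comparison that produces the groups can be sketched as below. It follows the full listing further down, with the simplified substring test standing in for the word-by-word one; the counting at the top is rewritten with String#scan only to keep the snippet self-contained.

```ruby
text = 'This is a sentence about pink killer cats chasing madonna.
        Sometimes when whales fight bricklayers, everyone drinks champaigne.
        You know Madonna has little cats on her slippers.
        When whales drink whiskey, your golf game is over.'.downcase

# Words that occur more than once in the whole text.
counts = Hash.new(0)
text.scan(/\w+/) { |w| counts[w] += 1 }
words = counts.select { |_w, n| n > 1 }

# Map each sentence to the repeated words it contains.
sentences = {}
text.split(/\.\s*/).each { |s| sentences[s] = [] }
# Simplified substring test: "cat" would also match inside "caterpillar".
words.each_key { |w| sentences.each { |s, found| found << w if s.include?(w) } }

# Intersect every pair of sentences and record the words they share.
common_groups = {}
tmp_groups = sentences.values
until tmp_groups.empty?
  active_group = tmp_groups.pop
  tmp_groups.each do |g|
    common = active_group & g
    next if common.empty?
    common_groups[common] = [2, common_groups[common].to_i + 1].max
  end
end
```

Note that the counter is incremented once per matching pair of sentences, so for a group shared by three or more sentences it counts pairs rather than distinct sentences.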

common_groups.each {|g, c| puts "the word(s) #{g} were common to #{c} sentences."}

# => the word(s) ["is"] were common to 2 sentences.
# => the word(s) ["when", "whales"] were common to 2 sentences.
# => the word(s) ["cats", "madonna"] were common to 2 sentences.

The whole code might look like this:

text = 'This is a sentence about pink killer cats chasing madonna.
        Sometimes when whales fight bricklayers, everyone drinks champaigne.
        You know Madonna has little cats on her slippers.
        When whales drink whiskey, your golf game is over.'

require 'strscan'
text.downcase!
words = {}
scn = StringScanner.new(text)

( words[scn.matched] =  words[scn.matched].to_i + 1 if scn.scan(/[\w]*/) ) while (scn.skip(/[^\w]*/) > 0) || !scn.eos?

words.delete_if {|w, v| v <= 1}

sentences = {}
text.split(/\.[\s]*/).each {|s| sentences[s] = []}

# Better: split each sentence into words to avoid
# partial matches (cat vs. caterpillar):
sentences.each {|s, v| tmp = s.split(/[^\w]+/); words.each {|w, c| v << w if tmp.include? w} }
# The simplified version described above would be:
# words.each {|w, c| sentences.each {|s, v| v << w if s.include? w} }

common_groups = {}
tmp_groups = sentences.values
until tmp_groups.empty?
   active_group = tmp_groups.pop
   tmp_groups.each do |g|
        common = active_group & g
        next if common.empty?
        common_groups[common] = [2,(common_groups[common].to_i + 1)].max
   end
end

common_groups.each {|g, c| puts "the word(s) #{g} were common to #{c} sentences."}

# => the word(s) ["is"] were common to 2 sentences.
# => the word(s) ["when", "whales"] were common to 2 sentences.
# => the word(s) ["cats", "madonna"] were common to 2 sentences.
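Since the question mentions a stop-word list, a different way to get at the same groups is to count how often each pair of words shares a sentence. This is a swapped-in technique (pair co-occurrence counting), not the code above, and the stop_words list here is a made-up one for illustration only.

```ruby
require 'set'

text = 'This is a sentence about pink killer cats chasing madonna.
        Sometimes when whales fight bricklayers, everyone drinks champaigne.
        You know Madonna has little cats on her slippers.
        When whales drink whiskey, your golf game is over.'

# Hypothetical stop-word list; substitute your own.
stop_words = Set.new(%w[this is a about when you know has on her your sometimes everyone])

pair_counts = Hash.new(0)
text.downcase.split(/\.\s*/).each do |sentence|
  # Unique, non-stop-word tokens of this sentence.
  tokens = sentence.scan(/\w+/).uniq.reject { |w| stop_words.include?(w) }
  # Sorting first makes ["cats", "madonna"] and ["madonna", "cats"] the same key.
  tokens.sort.combination(2) { |pair| pair_counts[pair] += 1 }
end

# Pairs that appear together in more than one sentence.
frequent_pairs = pair_counts.select { |_pair, n| n > 1 }

frequent_pairs.each { |pair, n| puts "#{pair.join(' + ')} share #{n} sentences" }
```

With 500 sentences of 10-20 words the number of pairs stays manageable, and the counts are per sentence rather than per pair of sentences.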

EDIT

I fixed a problem in the code where the text was not persisted as lowercase (text.downcase! vs. text.downcase).
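A minimal illustration of the difference, using a made-up one-line string: downcase returns a new string and leaves the receiver alone, while downcase! changes the receiver in place.

```ruby
# dup keeps the example working even when string literals are frozen.
text = 'Madonna has little cats.'.dup

lowered = text.downcase   # new string; text itself is untouched
unchanged = text.dup      # snapshot taken before the in-place call

text.downcase!            # modifies text in place

puts unchanged
puts text
```

Also worth knowing: downcase! returns nil when nothing had to change, so its return value should not be chained.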

EDIT2

I revisited the partial-word problem (i.e. cat vs. caterpillar, or dog vs. dogma).
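To make the problem concrete, here is a small sketch with a made-up sentence. A substring test finds "cat" inside "caterpillar", while comparing against the split words, or using a word-boundary regex, does not.

```ruby
sentence = 'the caterpillar ignored the dogma'

# Substring test: matches inside longer words.
substring_hit = sentence.include?('cat')

# Whole-word test: split on non-word characters first.
tokens = sentence.split(/[^\w]+/)
token_hit = tokens.include?('cat')

# Alternative whole-word test with a word-boundary regex.
boundary_hit = sentence.match?(/\bdog\b/)

puts substring_hit  # true
puts token_hit      # false
puts boundary_hit   # false
```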