Ruby: How do I group words by how often they are used in the same sentence?
I have a body of text, about 500 sentences. The sentences are clearly delimited; for simplicity, let's assume by a period. Each sentence has roughly 10-20 words. I want to break the text into groups of words that are, statistically, used most often in the same sentence. Here is a simple example:
This is a sentence about pink killer cats chasing madonna.
Sometimes when whales fight bricklayers, everyone drinks champaigne.
You know Madonna has little cats on her slippers.
When whales drink whiskey, your golf game is over.
I do have a list of stop words that get filtered out. For the example above, I can imagine wanting to build these groups:
Group 1: pink cats madonna
Group 2: whales when drink
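Or something like that. I realize this is probably quite a complex task. I have been experimenting with TF-IDF similarity but haven't really gotten anywhere. I'm working in Ruby and would love to hear people's thoughts, directions, or suggestions.

I like this puzzle, and here is my take on a possible solution.*
*That said, I suggest that next time you don't just throw out a question without showing what you tried and where you got stuck... otherwise it may look like you are handing us your class assignment.
Let's assume this is our text: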
text = 'This is a sentence about pink killer cats chasing madonna.
Sometimes when whales fight bricklayers, everyone drinks champaigne.
You know Madonna has little cats on her slippers.
When whales drink whiskey, your golf game is over.'
As I see it, the task at hand has a number of stages:
require 'strscan'
# Count every word in the lowercased text, then drop the words that occur only once.
words = {}
scn = StringScanner.new(text.downcase)
( words[scn.matched] = words[scn.matched].to_i + 1 if scn.scan(/[\w]*/) ) while (scn.skip(/[^\w]*/) > 0) || !scn.eos?
words.delete_if {|w, v| v <= 1}
# Finally, report the word groups shared between sentences
# (common_groups is built in the complete code further down).
common_groups.each {|g, c| puts "the word(s) #{g} were common to #{c} sentences."}
# => the word(s) ["is"] were common to 2 sentences.
# => the word(s) ["when", "whales"] were common to 2 sentences.
# => the word(s) ["cats", "madonna"] were common to 2 sentences.
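As a side note, the word-counting stage can probably be written more compactly. The following is only a sketch, not the approach used in this answer: it swaps StringScanner for `String#scan` plus `Enumerable#tally`, and so assumes Ruby 2.7 or later. The names `word_counts` and `repeated` are made up for illustration.

```ruby
text = 'This is a sentence about pink killer cats chasing madonna.
Sometimes when whales fight bricklayers, everyone drinks champaigne.
You know Madonna has little cats on her slippers.
When whales drink whiskey, your golf game is over.'

# Extract every run of word characters from the lowercased text
# and count how many times each word occurs.
word_counts = text.downcase.scan(/\w+/).tally

# Keep only the words that occur more than once.
repeated = word_counts.select { |_word, count| count > 1 }

repeated.each { |word, count| puts "#{word}: #{count}" }
```

For the sample text this should leave the same five words that survive `delete_if` in the StringScanner version: is, cats, madonna, when, and whales.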
The complete code might look like this:
text = 'This is a sentence about pink killer cats chasing madonna.
Sometimes when whales fight bricklayers, everyone drinks champaigne.
You know Madonna has little cats on her slippers.
When whales drink whiskey, your golf game is over.'
require 'strscan'
text.downcase!
words = {}
scn = StringScanner.new(text)
( words[scn.matched] = words[scn.matched].to_i + 1 if scn.scan(/[\w]*/) ) while (scn.skip(/[^\w]*/) > 0) || !scn.eos?
words.delete_if {|w, v| v <= 1}
sentences = {}
text.split(/\.[\s]*/).each {|s| sentences[s] = []}
# Split each sentence into words before matching, to avoid
# partial matches (cat vs. caterpillar):
sentences.each {|s, v| tmp = s.split(/[^\w]+/); words.each {|w, c| v << w if tmp.include? w} }
# The simpler substring-based version below would also match partial words:
# words.each {|w, c| sentences.each {|s, v| v << w if s.include? w} }
common_groups = {}
tmp_groups = sentences.values
until tmp_groups.empty?
  active_group = tmp_groups.pop
  tmp_groups.each do |g|
    common = active_group & g
    next if common.empty?
    common_groups[common] = [2,(common_groups[common].to_i + 1)].max
  end
end
common_groups.each {|g, c| puts "the word(s) #{g} were common to #{c} sentences."}
# => the word(s) ["is"] were common to 2 sentences.
# => the word(s) ["when", "whales"] were common to 2 sentences.
# => the word(s) ["cats", "madonna"] were common to 2 sentences.
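The code above compares whole sentences with each other. A different way to get at "words that are often used in the same sentence" is to count word pairs directly. The sketch below is an alternative technique (pairwise co-occurrence counting), not the method of this answer. The stop-word list is a hypothetical stand-in for the asker's real list, and the threshold of two sentences is an arbitrary assumption.

```ruby
require 'set'

text = 'This is a sentence about pink killer cats chasing madonna.
Sometimes when whales fight bricklayers, everyone drinks champaigne.
You know Madonna has little cats on her slippers.
When whales drink whiskey, your golf game is over.'

# Hypothetical stop-word list, for illustration only.
stop_words = Set.new(%w[this is a about you know has on her your])

# One array of unique, non-stop words per sentence.
sentence_words = text.downcase.split(/\.\s*/).map do |sentence|
  sentence.scan(/\w+/).uniq.reject { |word| stop_words.include?(word) }
end

# Count in how many sentences each unordered pair of words appears.
# Sorting first makes [a, b] and [b, a] land on the same hash key.
pair_counts = Hash.new(0)
sentence_words.each do |words|
  words.sort.combination(2) { |pair| pair_counts[pair] += 1 }
end

# Keep the pairs shared by at least two sentences, most frequent first.
frequent_pairs = pair_counts.select { |_pair, count| count >= 2 }
                            .sort_by { |_pair, count| -count }

frequent_pairs.each do |pair, count|
  puts "#{pair.join(' + ')} appear together in #{count} sentences"
end
```

Note that "drink" and "drinks" count as different words here, so some form of stemming would be needed before this picks up the whales/drink grouping the question asks for. With 500 sentences of 10-20 words each, the number of pairs should stay manageable, but larger groups would need merging pairs that overlap.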
EDIT
I fixed an issue in the code where the text was not persistently converted to lowercase (text.downcase! vs. text.downcase).
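To make the difference behind that fix concrete, here is a small sketch. The variable names are invented for the example; the behaviour shown is that of the standard `String#downcase` and `String#downcase!` methods.

```ruby
original = 'You know Madonna has little cats on her slippers.'

# String#downcase returns a new string; the receiver keeps its case,
# so calling it without using the return value has no lasting effect.
copy = original.dup
copy.downcase
puts copy        # still contains "Madonna"

# String#downcase! changes the receiver in place.
in_place = original.dup
in_place.downcase!
puts in_place    # now all lowercase

# Caveat: downcase! returns nil when there was nothing to change,
# so its return value should not be chained on.
already_lower = 'cats'.dup
result = already_lower.downcase!
puts result.inspect
```

This is why the complete code calls `text.downcase!` once up front: the later sentence split then works on the same lowercased text that the word count used.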
EDIT2
I revisited the partial-word issue (i.e. cat vs. caterpillar, or dog vs. dogma).
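The partial-word problem can be seen in a few lines. This is a minimal sketch with a made-up sentence and made-up variable names. It contrasts substring matching with the token-based matching used in the complete code, and adds a word-boundary regexp as one more option.

```ruby
sentence = 'the caterpillar ignored the dogma'
candidates = %w[cat dog]

# Substring matching: "cat" is found inside "caterpillar", "dog" inside "dogma".
substring_hits = candidates.select { |word| sentence.include?(word) }

# Token matching: split the sentence into words first, then compare whole words.
tokens = sentence.split(/[^\w]+/)
token_hits = candidates.select { |word| tokens.include?(word) }

# A word-boundary regexp gives the same result without splitting.
regexp_hits = candidates.select do |word|
  sentence.match?(/\b#{Regexp.escape(word)}\b/)
end

puts "substring: #{substring_hits.join(', ')}"
puts "tokens:    #{token_hits.join(', ')}"
puts "regexp:    #{regexp_hits.join(', ')}"
```

Only the substring version reports matches here, which is the false positive the edit was meant to remove.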
Welcome to Stack Overflow. We would like to see your attempt at the code, along with a specific question about the problem you ran into while trying to solve it, rather than have us generate ideas for you shotgun-style.