Ruby & MongoDB with an inverted index gives some interesting results

ruby regex mongodb hash

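For my program I am building an inverted index from Twitter feed data. However, when I parse the tweets and write them to MongoDB, some interesting problems show up.

A typical entry should look like this:

{"ax"=>1, "easyjet"=>1, "from"=>2}

However, when some tweets are parsed, they end up in the db looking like this: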

{""=>{""=>{""=>{""=>{""=>{"giants"=>{"dhem"=>1, "giants"=>1, "giantss"=>1}}}}

I have the following lines of code that split the tweet and increment the values in the db:

def pull_hash_tags(tweet, lang)
    hash_tags = tweet.split.find_all { |word| /^#.+/.match word }
    t = tweet.gsub(/https?:\/\/[\S]+/,"") # removing urls
    t = t.gsub(/#\w+/,"") # removing hash tags
    t = t.gsub(/[^0-9a-z ]/i, '') # removing non-alphanumerics and keeping spaces
    t = t.gsub(/\r/," ")
    t = t.gsub(/\n/," ")
    hash_tags.each { |tag| add_to_hash(lang, tag, t) }
end

def add_to_hash(lang, tag, t)
    t.gsub(/\W+/, ' ').split.each do |word|
        @db.collection.update({"_id" => lang}, {"$inc" => {"#{tag}.#{word}" => 1}}, { :upsert => true })
    end
end

I am trying to get plain words (alphanumeric characters only), with no double spaces, no carriage returns, and so on.
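As a rough illustration of that goal, the cleanup could be written as one small function. This is only a sketch: normalize_tweet_text is a hypothetical name, not something from the code above. Note that in the posted code the non-alphanumeric filter runs before the \r and \n replacements, so line breaks are deleted rather than turned into spaces, and the words on either side of a line break can be joined. The sketch sidesteps that by turning every run of unwanted characters into a single space.

```ruby
# Hypothetical helper (not part of the original code): reduce a tweet to
# plain alphanumeric words separated by single spaces.
def normalize_tweet_text(tweet)
  tweet
    .gsub(%r{https?://\S+}, ' ')  # drop URLs
    .gsub(/#\w+/, ' ')            # drop hash tags
    .gsub(/[^0-9a-z]+/i, ' ')     # any run of other characters becomes one space
    .strip                        # no leading or trailing space
end

normalize_tweet_text("Go #giants!\r\nsee http://t.co/abc  now")
# => "Go see now"
```

Calling split on the result then yields the individual words without any empty strings.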

You should add t.strip; it looks like the problem might be leading/trailing whitespace.
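I would suggest adding a logger to the connection and then watching exactly what you are writing to the database. There may be a problem with your code.

With around 50GB of data, that will be hard to pin down exactly.

In that case, don't use a logger. Just add some code to the pull_hash_tags method to look for these anomalous documents.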
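One possible shape for such a check is sketched below. It rests on an assumption that the thread does not confirm: that the unexpected nesting comes from '.' characters ending up in the key, since MongoDB's dot notation treats each '.' in an update key as a path separator. The names anomalous_key? and find_anomalies are hypothetical, and the check runs on the strings alone, before anything is sent to the database.

```ruby
# Hypothetical helper (not in the original code): true when the key that
# add_to_hash would build does not have exactly two non-empty segments.
# Assumes the unexpected nesting comes from '.' characters in the tag or word.
def anomalous_key?(tag, word)
  # limit -1 keeps trailing empty segments, e.g. "#giants." => ["#giants", ""]
  parts = "#{tag}.#{word}".split('.', -1)
  parts.length != 2 || parts.any?(&:empty?)
end

# Hypothetical wrapper: collect the suspicious (tag, word) pairs of one tweet
# so they can be printed or stored for later inspection.
def find_anomalies(hash_tags, words)
  hash_tags.product(words).select { |tag, word| anomalous_key?(tag, word) }
end

find_anomalies(["#giants", "#giants..."], ["dhem"])
# => [["#giants...", "dhem"]]
```

Called from pull_hash_tags with hash_tags and t.split, this would flag only the offending tweets, which keeps the output small even on a very large data set.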