MongoDB: text search is slow when searching for very frequent terms

mongodb

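I have a collection of about 1 million documents (mostly movies), and I created a text index on one field. Almost all searches work fine: results come back in under 20 ms. The exception is a search for a single very frequent term, which can take up to 3000 ms! For example:

If I search the collection for "pulp" (only 40 documents contain it), it takes 1 ms.

If I search for "movie" (750,000 documents contain it), it takes 3000 ms. When profiling the request, explain('executionStats') shows that all the "movie" documents are scanned. I tried many indexes, sort + limit, and hints, but all 750,000 documents are still scanned and the result is still slow.

Is there a strategy for searching very frequent terms in the database faster?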
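
To make the diagnosis above concrete, here is a minimal sketch of pulling the relevant counters out of an `executionStats` explain result. The helper name `summarize_execution_stats` and the sample dictionary are assumptions for illustration; the sample is hand-written to mirror the figures quoted in the question, not real server output.

```python
def summarize_execution_stats(explain_output):
    """Extract the counters that show how much work a query did."""
    stats = explain_output['executionStats']
    return {
        'returned': stats['nReturned'],
        'keys_examined': stats['totalKeysExamined'],
        'docs_examined': stats['totalDocsExamined'],
        'millis': stats['executionTimeMillis'],
    }

# Against a live server, the input could come from something like:
#   explain_output = db.command(
#       'explain',
#       {'find': 'mycollection', 'filter': {'$text': {'$search': 'movie'}}},
#       verbosity='executionStats')

# Hand-written sample mirroring the numbers in the question (not real output).
sample = {
    'executionStats': {
        'nReturned': 750000,
        'totalKeysExamined': 750000,
        'totalDocsExamined': 750000,
        'executionTimeMillis': 3000,
    }
}
summary = summarize_execution_stats(sample)
```

When `docs_examined` is close to `returned`, the index is doing its job and the cost comes from the sheer number of matches, which is the situation described for "movie".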

I ended up building my own stop-word list, with the following code:

import pymongo
from bson.code import Code

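# Maximum number of occurrences of a word in the collection;
# above this count the word is treated as a stop word.
NB_MAX_COUNT = 20000
STOP_WORDS_FILE = 'stop_words.py'

# Placeholder: replace the URI and database name with your own connection details.
db = pymongo.MongoClient('mongodb://localhost:27017')['mydb']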

mapfn = Code("""function() {
    var words = this.field_that_is_text_indexed;
    if (words) {
        // quick lowercase to normalize per your requirements
        words = words.toLowerCase().split(/[ \/]/);
        for (var i = words.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (words[i])  {      // make sure there's something
               emit(words[i], 1); // store a 1 for each word
            }
        }
    }
};""")

reducefn = Code("""function( key, values ) {
    var count = 0;
    values.forEach(function(v) {
        count +=v;
    });
    return count;
};""")

with open(STOP_WORDS_FILE,'w') as fh:
    fh.write('# -*- coding: utf-8 -*-\n'
             'stop_words = [\n')

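    # Collection.map_reduce exists in PyMongo 3.x; it was removed in PyMongo 4.0.
    result = db.mycollection.map_reduce(mapfn, reducefn, 'words_count')
    for doc in result.find({'value': {'$gt': NB_MAX_COUNT}}):
        # %r quotes and escapes the word, so an apostrophe cannot break the generated file
        fh.write("%r,\n" % doc['_id'])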

    fh.write(']\n')
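
The answer does not show how the generated list is used afterwards. One plausible way, sketched here under assumptions, is to strip the frequent words from the user's query before passing it to `$text`. The function names and the sample `STOP_WORDS` set are hypothetical; in practice the set would come from `from stop_words import stop_words`.

```python
import re

# Hypothetical sample; the real list would be imported from the generated stop_words.py.
STOP_WORDS = {'movie', 'the'}

def strip_stop_words(query, stop_words=STOP_WORDS):
    """Remove stop words from a query, using the same lowercase and
    split-on-space-or-slash rule as the map function above."""
    terms = [t for t in re.split(r'[ /]', query.lower()) if t]
    kept = [t for t in terms if t not in stop_words]
    return ' '.join(kept)

def build_text_filter(query, stop_words=STOP_WORDS):
    """Build a $text filter, or return None when only stop words remain."""
    cleaned = strip_stop_words(query, stop_words)
    if not cleaned:
        return None
    return {'$text': {'$search': cleaned}}
```

A caller would then run `db.mycollection.find(build_text_filter(user_query))` only when the filter is not `None`, and handle the all-stop-words case separately (for example by asking the user for a more specific term).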

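This is expected behavior, if you think about it. A text index is much larger than the actual collection, because every word is indexed. So a search that touches a portion of the index larger than the whole collection will cause MongoDB to scan the collection itself. If you could force it to use the index, it would be even slower. To get results faster, maybe you could limit the results?

I tried limiting, but since I want sorted output, all the documents still get scanned... which is logical.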