Python: Partial matching with the GAE Search API


python google-app-engine search autocomplete


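Using the GAE Search API, is it possible to search for a partial match?

I'm trying to build autocomplete functionality where the search term is an incomplete word. For example

>b
>bui
>build

would all return "building".

How can this be done with GAE?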

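No, this is not possible, because the Search API implements full-text indexing.

Hope this helps!

Although full-text search does not support LIKE statements (partial matching), you can work around it.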

First, tokenize the data string into all of its possible substrings (hello = h, he, hel, lo, etc.)
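The `tokenize_autocomplete` helper called in the indexing step is not shown on this page. Here is a minimal sketch, assuming it should return every contiguous substring of each whitespace-separated word. The body is an assumption for illustration, not necessarily the original helper.

```python
def tokenize_autocomplete(phrase):
    """Return every contiguous substring of each word in phrase.

    Assumed behaviour: words are split on whitespace, and duplicates
    (e.g. the two 'l's in 'hello') are kept.
    """
    tokens = []
    for word in phrase.split():
        # Substring length runs from 1 up to the full word length.
        for length in range(1, len(word) + 1):
            # Slide a window of that length across the word.
            for start in range(len(word) - length + 1):
                tokens.append(word[start:start + length])
    return tokens


tokens = tokenize_autocomplete('hello')
print(len(tokens))  # 15
```

A word of length n produces n(n+1)/2 tokens this way, so long names make the indexed field grow quickly.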

Build an index + documents (Search API) using the tokenized strings

from google.appengine.api import search

index = search.Index(name='item_autocomplete')
for item in items:  # each item is an ndb model instance with a `name` property
    name = ','.join(tokenize_autocomplete(item.name))
    document = search.Document(
        doc_id=item.key.urlsafe(),
        fields=[search.TextField(name='name', value=name)])
    index.put(document)

Perform the search, and walah!

results = search.Index(name="item_autocomplete").search("name:elo")
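To see why indexing tokens makes partial terms match, here is a plain-Python sketch that imitates whole-token matching with a dictionary, with no App Engine SDK needed. The names `prefixes`, `build_index`, and `lookup` are made up for illustration and are not part of the Search API; the real service does its own tokenizing and ranking.

```python
from collections import defaultdict


def prefixes(word, min_length=1):
    """All prefixes of word that are at least min_length long."""
    return [word[:i] for i in range(min_length, len(word) + 1)]


def build_index(names):
    """Map every prefix token to the set of names that contain it."""
    index = defaultdict(set)
    for name in names:
        for word in name.lower().split():
            for token in prefixes(word):
                index[token].add(name)
    return index


def lookup(index, term):
    # Exact token lookup only, like a full-text match on a whole token.
    return sorted(index.get(term.lower(), set()))


index = build_index(['building', 'bridge'])
print(lookup(index, 'b'))      # ['bridge', 'building']
print(lookup(index, 'bui'))    # ['building']
print(lookup(index, 'build'))  # ['building']
```

The partial term only matches because it was stored as a token at indexing time; the lookup itself never does a substring comparison.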

I had the same problem with my typeahead control. My solution was to split the string into small parts:

name='hello world'
name_search = ' '.join([name[:i] for i in range(2, len(name) + 1)])
print(name_search)
# -> he hel hell hello hello  hello w hello wo hello wor hello worl hello world

Hope this helps.

Just like @Desmond Lua's answer, but with a different tokenize function:

def tokenize(word):
    token = []
    words = word.split(' ')
    for word in words:
        for i in range(len(word)):
            if i == 0:
                continue
            w = word[i]
            if i == 1:
                token += [word[0] + w]
                continue
            token += [token[-1:][0] + w]
    return ",".join(token)

It will parse

hello world

as

he,hel,hell,hello,wo,wor,worl,world

It works well for lightweight autocomplete.

My optimized version: no repeated tokens

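def tokenization(text):
    tokens = []
    min_length = 3  # shortest prefix to emit
    words = text.split()
    for word in words:
        if len(word) > min_length:
            # Prefixes of length min_length .. len(word) - 1; the full word
            # itself is not emitted, and words of 3 letters or fewer are skipped.
            for i in range(min_length, len(word)):
                token = word[0:i]
                if token not in tokens:
                    tokens.append(token)
    return tokens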

Jumping in very late here.

But here is my well-documented function that does the tokenizing. The docstring should help you understand and use it. Good luck!

def tokenize(string_to_tokenize, token_min_length=2):
  """Tokenizes a given string.

  Note: If a word in the string to tokenize is no longer than
  the minimum length of the token, then the whole word is added to the set
  of tokens and skipped from further processing.
  Avoids duplicate tokens by using a set to save the tokens.
  Example usage:
    tokens = tokenize('pack my box', 3)

  Args:
    string_to_tokenize: str, the string we need to tokenize.
    Example: 'pack my box'.
    token_min_length: int, the minimum length we want for a token.
    Example: 3.

  Returns:
    set, containing the tokenized strings. Example: set(['box', 'pac', 'my',
    'pack'])
  """
  tokens = set()
  token_min_length = token_min_length or 1
  for word in string_to_tokenize.split(' '):
    if len(word) <= token_min_length:
      tokens.add(word)
    else:
      for i in range(token_min_length, len(word) + 1):
        tokens.add(word[:i])
  return tokens

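This works great. I managed to modify Ferris's search.index function to automatically tokenize all text fields (a one-line change), and it "just works". Just don't try to display those fields to the user directly from the search results ;)

I also added name.lower(), because I ran into some strange problems with Russian: if a token started with a capital letter, I could not find it.

Friendly tip: the word is "voilà"!

I added an option to limit the substring length, to avoid cases where the search document grows too large.

Please add some more description to your posted answer, sir.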
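The two tweaks mentioned in the comments (lowercasing before tokenizing, and capping the substring length) could be combined along these lines. This is only a sketch: the function name `tokenize_limited` and the `max_length` parameter are hypothetical, and the default values are arbitrary.

```python
def tokenize_limited(text, min_length=2, max_length=10):
    """Prefix tokens of each word, lowercased and capped at max_length.

    Hypothetical helper: words no longer than min_length are kept whole,
    longer words contribute prefixes of min_length .. max_length characters.
    """
    tokens = set()
    for word in text.lower().split():
        if len(word) <= min_length:
            tokens.add(word)
            continue
        # Never slice past the end of the word or past the cap.
        upper = min(len(word), max_length)
        for i in range(min_length, upper + 1):
            tokens.add(word[:i])
    return tokens


print(sorted(tokenize_limited('Building', 2, 4)))  # ['bu', 'bui', 'buil']
```

If the indexed tokens are lowercased, the query term needs the same treatment before it is sent to the index, otherwise capitalized input will not match.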