Python: Highlighting exact phrase search results in Whoosh


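According to the Whoosh documentation (and an earlier question), you can search for an exact phrase in Whoosh by putting double quotes around the phrase you want to find. However, when I try to implement exact phrase search, I get back the results that the default search syntax produces. Does anyone know how I can modify the search so that it only matches the parts of the queried document (Project Gutenberg's Gulliver's Travels) that contain the exact phrase "government of reason"? I would be grateful for any suggestions.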

from whoosh.index import create_in
from whoosh.fields import *
from whoosh import qparser
import os, codecs, nltk

def remove_non_ascii(s):
    return "".join(x for x in s if ord(x) < 128)

if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

schema = Schema(content=TEXT(stored=True, analyzer=analysis.StandardAnalyzer(stoplist=None)))

ix = create_in("indexdir", schema)
writer = ix.writer()
gulliver = codecs.open("gulliver.txt","r","utf-8")
gulliver = gulliver.read().replace("_","")
writer.add_document(content=gulliver)
writer.commit()

searcher = ix.searcher()

parser = qparser.QueryParser("content", schema=ix.schema)
q = parser.parse(u"government of reason")
results = searcher.search(q)
results.fragmenter.charlimit = None

for hit in results:
    print " ".join( remove_non_ascii( nltk.clean_html( hit.highlights("content", top=1000000) ) ).split() )


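EDIT: Matt Chaput has offered some code, in a short thread, that is supposed to return the exact phrases within the hits for a given query, but I have not been able to get his approach to work.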
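The code mentioned in the edit is not reproduced here, and the following is not that method. As a library-independent fallback, here is a minimal sketch that assumes the stored text of a hit is available as a plain string (for example via hit["content"], since the field is stored): it finds exact occurrences of the phrase with a regular expression and returns each one with some surrounding context. The function name phrase_snippets, the context width, and the ** markers are hypothetical choices made for illustration only.

```python
import re


def phrase_snippets(text, phrase, context=40, mark="**"):
    """Return context snippets around each exact occurrence of `phrase`.

    Words of the phrase may be separated by any run of whitespace in the
    text, so a phrase broken across a line still matches. Matching is
    case-insensitive. This is plain string processing, not a Whoosh API.
    """
    # Build a pattern such as r"\bgovernment\s+of\s+reason\b".
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)

    snippets = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        snippet = (text[start:m.start()] + mark + m.group(0) + mark
                   + text[m.end():end])
        # Collapse newlines and repeated spaces for display.
        snippets.append(" ".join(snippet.split()))
    return snippets


# Small made-up sample; in the question this would be the stored book text.
sample = ("the reason of government is debated, but they are governed\n"
          "by a government of\nreason alone, he said.")
for s in phrase_snippets(sample, "government of reason", context=15):
    print(s)
```

Because this works on the stored text rather than on the index, it only reports places where the words appear in that exact order, regardless of which individual terms the search engine chose to highlight.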

Have you tried "government of reason"?

Yes, no luck. I tried '"government of reason"', with and without the unicode prefix, but no dice.