
Whoosh returning empty values (Python)


I'm using Whoosh to index and search texts in a variety of encodings. When I run searches against my indexed files, though, some of the matching results fail to appear in the output generated with the "highlight" function. I have a feeling this is related to encoding errors, but I can't figure out what is keeping the remaining results from being displayed. I would be very grateful for any light others can shed on this mystery.

Here is the script I used to create the index, along with the files I'm indexing:

from whoosh.index import create_in
from whoosh.fields import *
import glob, os, chardet

encodings = ['utf-8', 'ISO-8859-2', 'windows-1250', 'windows-1252', 'latin1', 'ascii'] #note: informational only; this list is not referenced below

def determine_string_encoding(string):
    result = chardet.detect(string)
    string_encoding = result['encoding']
    return string_encoding

#specify a list of paths that contain all of the texts we wish to index
#(raw strings keep the backslashes from being read as escape sequences)
text_dirs = [
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\hume",
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\complete_pope\clean"
]

#establish the schema to be used when storing texts; storing content allows us to retrieve highlighted extracts from texts in which matches occur
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))

#check to see if we already have an index directory. If we don't, make it
if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

#create writer object we'll use to write each of the documents in text_dir to the index
writer = ix.writer()

#create file in which we can write the encoding of each file to disk for review
with open("encodings_log.txt","w") as encodings_out:

    #for each directory in our list
    for i in text_dirs:

        #for each text file in that directory (j is now the path to the current file within the current directory)
        for j in glob.glob( i + "\\*.txt" ):

            #first, let's grab j's title. If the title is stored in the text file name, we can use this method:
            text_title = j.split("\\")[-1]

            #now let's read the file
            with open( j, "r" ) as text_file:
                text_content = text_file.read()

                #use method defined above to determine encoding of path and text_content
                path_encoding = determine_string_encoding(j)
                text_content_encoding = determine_string_encoding(text_content)

                #log the detected encoding so it can be reviewed later
                encodings_out.write(j + "\t" + str(text_content_encoding) + "\n")

                #because we know the encoding of the files in this directory, let's override the previous text_content_encoding value and specify that encoding explicitly
                if "clean" in j:
                    text_content_encoding = "iso-8859-1"

                #decode text_title, path, and text_content to unicode using the encodings we determined for each above
                unicode_text_title = unicode(text_title, path_encoding)
                unicode_text_path = unicode(j, path_encoding)
                unicode_text_content = unicode(text_content, text_content_encoding)

                #use writer method to add document to index
                writer.add_document( title = unicode_text_title, path = unicode_text_path, content = unicode_text_content )

#after you've added all of your documents, commit changes to the index
writer.commit()
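
As a quick sanity check (a minimal sketch, not part of the original script), the index can be reopened after the commit to confirm that every text file became a document:

from whoosh.index import open_dir

#reopen the freshly built index and count the committed documents
ix = open_dir("index")
print ix.doc_count()  #should equal the number of .txt files indexed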
This code seems to index the texts without trouble, but when I parse the index with the script below, three of the values written to the out.txt output file are blank: the first two lines and the sixth line are empty, though I expect all three to contain text. Here is the script I use to parse the index:

from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

with ix.searcher() as searcher: 
    parser = QueryParser("content", schema=ix.schema)

    #to enable Levenshtein-based parse, use plugin
    parser.add_plugin(FuzzyTermPlugin())

    #using ~2/3 means: allow for edit distance of two (where additions, subtractions, and insertions each cost one), but only count matches for which first three letters match. Increasing this denominator greatly increases speed
    query = parser.parse(u"swallow~2/3")
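    #for reference, the parsed query above should be equivalent to building the
    #query object directly; a sketch, assuming whoosh.query.FuzzyTerm's signature:
    #  from whoosh.query import FuzzyTerm
    #  query = FuzzyTerm("content", u"swallow", maxdist=2, prefixlength=3)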
    results = searcher.search(query)

    #see whoosh.query.Phrase, which describes the "slop" parameter (ie: number of words we can insert between any two words in our search query)

    #write query results to disk or html
    #write query results to disk or html
    with codecs.open("out.txt", "w", "utf-8") as out:

        for i in results:

            title = i["title"]
            highlight = i.highlights("content")
            clean_highlight = " ".join(highlight.split())

            out.write(clean_highlight + u"\n")

If anyone can explain why these three lines come up blank, I will be eternally grateful.

Holy moly, I may have figured this out! It appears that some of my text files (including both files whose paths contain "hume") exceed a threshold that governs Whoosh's index-creation behavior. If one tries to index a file that is too large, Whoosh appears to store that text as a string value rather than a unicode value. So, supposing one has an index with fields "path" (file path), "title" (file title), "content" (file content), and "encoding" (encoding of the current file), one can test whether the files in that index were indexed properly by running a script like the following:

from whoosh.qparser import QueryParser
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

phrase_to_search = u"swallow"

with ix.searcher() as searcher: 
    parser = QueryParser("content", schema=ix.schema)

    query = parser.parse( phrase_to_search )
    results = searcher.search(query)

    for hit in results:
        hit_encoding = hit["encoding"]

        with codecs.open(hit["path"], "r", hit_encoding) as fileobj:
            filecontents = fileobj.read()
            hit_highlight = hit.highlights("content", text=filecontents)
            hit_title = hit["title"]

            print type(hit_highlight), hit_title
If any of the printed values have type "str", the highlighter appears to be treating a portion of the designated file as a string rather than as unicode.

There are two ways to correct this problem: 1) Split the large files into smaller files, each containing fewer than 32K characters, and index those smaller files. This approach requires more curation but ensures reasonable processing speed (a sketch of this splitting step appears below). 2) Set an attribute on your results variable to raise the maximum number of characters the highlighter will analyze, which lets the results in the example above print properly to the terminal. To implement this solution in the code above, one can add the following line after the line that defines results:

results.fragmenter.charlimit = 100000


Adding this line allows any result that falls within the first 100,000 characters of the designated file to print to the terminal, though it increases processing time significantly. Alternatively, one can remove the character limit altogether with results.fragmenter.charlimit = None, though this also increases processing time when working with large files…
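
For the first approach, a minimal sketch of the splitting step might look like this; the 30,000-character chunk size and the ".partN.txt" naming scheme are assumptions chosen only to stay under the 32K threshold described above:

import codecs

def split_file(path, encoding, chunk_size=30000):
    #read the whole file as unicode, then write it back out in chunks
    #small enough to stay under Whoosh's ~32K highlighting threshold
    with codecs.open(path, "r", encoding) as f:
        text = f.read()
    for n, start in enumerate(range(0, len(text), chunk_size)):
        with codecs.open(path + ".part" + str(n) + ".txt", "w", encoding) as out:
            out.write(text[start:start + chunk_size])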

Is there anything I can add to the description above to help diagnose this situation better?