How do I perform a binary search on a text file to search for a keyword in Python?


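The text file contains two columns: an index number (5 characters) and a word (30 characters).

It is sorted in lexicographic order. I want to perform a binary search to find a keyword.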

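Consider using a set instead of a binary search to look up keywords from the file.

Set:

O(n) to create, O(1) to look up, O(1) to insert/delete

If the input file is separated by spaces, then: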

with open('file') as f:
    # split() without arguments collapses runs of padding spaces
    rows = (line.split() for line in f)
    keywords = set(row[1] for row in rows if len(row) > 1)

my_word in keywords  # evaluates to True or False

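Do you need to do a binary search? If not, try converting your flat file into a cdb (constant database). This gives you very fast hash lookups to find the index for a given word: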

import cdb

# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()

In a separate script, run queries against it:

import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345

Here is an interesting way to do it with Python's built-in bisect module:

import bisect
import os


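class Query(object):
    """Wraps the search term so bisect compares it with a record's word column."""

    def __init__(self, query, index=5):
        self.query = query
        self.index = index  # width of the leading index column

    def __lt__(self, comparable):
        # bisect.bisect() evaluates `query < record`; skip the index column first
        return self.query < comparable[self.index:]


class FileSearcher(object):
    """Exposes a file of fixed-width records as a read-only sequence."""

    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        # every record is followed by a line terminator
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        # seek straight to record number `item` and read exactly one record
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size).decode('ascii')


if __name__ == '__main__':
    # binary mode keeps byte offsets exact; records are assumed to be ASCII
    with open('data.dat', 'rb') as file_to_search:
        query = input('Query: ')
        wrapped_query = Query(query)

        searchable_file = FileSearcher(file_to_search)
        print("Located @ line: ", bisect.bisect(searchable_file, wrapped_query))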
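
As a complement to the class above, here is a minimal sketch of a variant that uses bisect.bisect_left and checks that the record found really matches, so it can return the index column or None. The names WordColumn and find_first, the sample words, and the record layout (5-character index, 30-character word, one '\n' byte) are assumptions made for illustration; they are not part of the original answer.

```python
import bisect
import io


class WordColumn:
    """Read-only sequence view over fixed-width records: item i is the word of record i."""

    def __init__(self, fh, index_width=5, word_width=30, newline_width=1):
        self.fh = fh
        self.index_width = index_width
        self.record_size = index_width + word_width + newline_width
        fh.seek(0, io.SEEK_END)
        self.count = fh.tell() // self.record_size

    def __len__(self):
        return self.count

    def __getitem__(self, i):
        if not 0 <= i < self.count:
            raise IndexError(i)
        self.fh.seek(i * self.record_size)
        record = self.fh.read(self.record_size).decode('ascii')
        # drop the index column, the padding and the newline
        return record[self.index_width:].rstrip()

    def index_of(self, i):
        """Return the index column of record i as a stripped string."""
        self.fh.seek(i * self.record_size)
        return self.fh.read(self.index_width).decode('ascii').strip()


def find_first(fh, word):
    """Return the index column of the first record whose word equals `word`, or None."""
    words = WordColumn(fh)
    pos = bisect.bisect_left(words, word)
    if pos < len(words) and words[pos] == word:
        return words.index_of(pos)
    return None


# Hypothetical sample data: 5-character index + 30-character word + '\n'
sample = io.BytesIO(b''.join(
    ('%5d%-30s\n' % (n, w)).encode('ascii')
    for n, w in [(17, 'apple'), (4, 'banana'), (23, 'chaos'), (8, 'zebra')]
))
print(find_first(sample, 'chaos'))   # 23
print(find_first(sample, 'cherry'))  # None
```

Because the view returns plain strings, no wrapper class around the query is needed, and only O(log n) records are read from the file.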

If you need to find a single keyword in a file:

line_with_keyword = next((line for line in open('file') if keyword in line), None)
if line_with_keyword is not None:
    print(line_with_keyword)  # found

To find multiple keywords, you can use set(), as in the set-based answer above. You can use dict() instead of set() there to keep the index information.
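
A minimal sketch of the dict() variant, assuming whitespace-separated "index word" lines. The helper name load_index and the sample data are hypothetical; in practice the lines would come from open('file').

```python
import io


def load_index(lines):
    """Map each word to its index string."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:      # skip blank or malformed lines
            index, word = parts[0], parts[1]
            table[word] = index
    return table


# Hypothetical sample standing in for open('file')
sample = io.StringIO("    1 apple\n    2 banana\n    3 chaos\n")
table = load_index(sample)

# Look up several keywords at once; each lookup is O(1) on average
wanted = ['chaos', 'apple', 'missing']
found = {w: table[w] for w in wanted if w in table}
print(found)  # {'chaos': '3', 'apple': '1'}
```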

Here is how you could do a binary search on a text file:

import bisect

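lines = open('file').readlines()  # O(n) list creation
# extract_keyword is a user-supplied function that returns the keyword of a line
keywords = list(map(extract_keyword, lines))  # bisect needs a sequence, not an iterator
i = bisect.bisect_left(keywords, keyword)  # O(log(n)) search
if i < len(keywords) and keyword == keywords[i]:
    print(lines[i])  # found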

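It has no advantage over the set() variant.

Note: all variants except the first one load the whole file into memory. The FileSearcher approach above does not need to load the whole file into memory.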

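It is quite possible, at a slight loss of efficiency, to perform a binary search on a sorted text file with records of unknown length, by repeatedly bisecting the range and reading forward past the line terminator. Below is what I use to look through a csv file with two header lines for a number in the first field. Give it an open file and the first field to look for. It should be fairly easy to modify this for your problem. A match on the very first line, at offset zero, will fail, so that may need to be special-cased. In my case the first two lines are headers and are skipped.

Please excuse my lack of polished Python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude calculations directly from the CSV files distributed by Maxmind.

Hope this helps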

========================================

import os

# See if the input loc is in file (f must be opened in text mode)
def look1(f, loc):
    # Compute filesize of open file sent to us
    hi = os.fstat(f.fileno()).st_size
    lo = 0
    lookfor = int(loc)
    # print("looking for: ", lookfor)
    while hi - lo > 1:
        # Find midpoint and seek to it
        loc = (hi + lo) // 2
        # print(" hi = ", hi, " lo = ", lo)
        # print("seek to: ", loc)
        f.seek(loc)
        # Skip to beginning of the next line (stop at end of file)
        while f.read(1) not in ('\n', ''):
            pass
        # Now skip past lines that are headers
        while True:
            # read line
            line = f.readline()
            # print("read_line: ", line)
            if not line:
                # End of file reached without finding a data row
                row = None
                break
            # Crude csv parsing, remove quotes, and split on ,
            row = line.replace('"', "").split(',')
            # Make sure 1st field is numeric
            if row[0].isdigit():
                break
        if row is None:
            # No data row starts after the midpoint: split into lower half
            hi = loc
            continue
        s = int(row[0])
        if lookfor < s:
            # Split into lower half
            hi = loc
            continue
        if lookfor > s:
            # Split into higher half
            lo = loc
            continue
        return row  # Found
    # If not found
    return False
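
For comparison, here is a minimal sketch of the same idea (bisect on byte offsets, then read forward to a line boundary) that also finds a match on the first line. It is an illustration under stated assumptions, not the function above: the names find_line and first_field and the sample data are hypothetical, the file is opened in binary mode, and header lines are treated as sorting before every number by giving them the key -1.

```python
import io


def first_field(line):
    """Key of a CSV line: the first field as int, or -1 for non-numeric header lines."""
    field = line.split(b',', 1)[0].strip().strip(b'"')
    return int(field) if field.isdigit() else -1


def find_line(f, key, key_of=first_field):
    """Binary search a sorted binary-mode file with variable-length lines.

    Returns one line whose key equals `key` (as bytes), or None.
    """
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell()          # candidate lines start at offsets in [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if mid == 0:
            f.seek(0)             # offset zero is always a line start
        else:
            f.seek(mid - 1)
            f.readline()          # move to the first line starting at or after mid
        start = f.tell()
        line = f.readline()
        if not line or start >= hi:
            hi = mid              # no candidate line starts in [mid, hi)
            continue
        k = key_of(line)
        if k < key:
            lo = f.tell()         # continue after this line
        elif k > key:
            hi = mid
        else:
            return line
    return None


# Hypothetical sample: two header lines, then rows sorted by a numeric first field
sample = io.BytesIO(
    b'"id","name"\n"locId","city"\n1,"one"\n5,"five"\n12,"twelve"\n40,"forty"\n'
)
print(find_line(sample, 12))  # b'12,"twelve"\n'
print(find_line(sample, 7))   # None
```

Seeking to mid - 1 before calling readline() is what lets a line that starts exactly at mid, or at offset zero, be found without a special case.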


I wrote a simple Python 3.6+ package that can do this. (See its page for more information!)

Installation:

pip install binary_file_search

Example file:

1,one
2,two_a
2,two_b
3,three

Usage:

from binary_file_search.BinaryFileSearch import BinaryFileSearch
with BinaryFileSearch('example.file', sep=',', string_mode=False) as bfs:
    # assert bfs.is_file_sorted()  # test if the file is sorted.
    print(bfs.search(2))

Result:

[[2, 'two_a'], [2, 'two_b']]

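Is the line length constant for every line? You mention "spaces". Do you mean spaces between the values, or "the index number is 5 characters" and "the data is 30 characters"?

Yes. The line length is constant for every line... I meant the latter: "the index number is 5 characters" and "the data is 30 characters".

"Columns" as in separated? The question arises because kriegar talks about separating spaces. What is between the two columns? Do they touch with no space in between?

What do you want to do once you have found the keyword?

Can you load the whole file into memory, or is it too large?

Thanks for the help! I tried to run this code, but the list "keywords" does not have any elements. It is empty.

Check my edit. Also check that the file is opened properly, etc.: for line in f.readlines(): print line

Hey, but isn't O(n) more expensive than O(log n)?? I need to run it on a huge corpus.

Looking up a key in a dictionary or checking whether an object is in a set is O(1); the O(n) is for creating the set or dict. A binary tree is O(n log n) to create.

+1: it does not need to load the whole file into memory; binary search is justified in this case.

+1; if each "record" also includes a newline character, the suggested record size should be 36.

This works great. Note that you get the last record, not the first. If you want the first, use bisect_left. You also have to change the wrapper class to wrap the data from the file, since that switches the comparison.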