Python: an efficient way to get the total number of lines and the index of a line in a file


Here is the cProfile output for 5 calls:

```
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    5    3.743    0.749    3.743    0.749 {posix.waitpid}
    6    0.756    0.126    0.756    0.126 {method 'readlines' of 'file' objects}
    5    0.070    0.014    0.070    0.014 {posix.read}
    5    0.058    0.012    0.058    0.012 {posix.fork}
```

I need to run the whole process 5 million times (and possibly more later), so I need as much improvement as I can get.

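`posix.waitpid` is the time spent waiting for the subprocess call (I have to wait until the call finishes and its output is ready), so I probably cannot improve that part any further.

I need to find the index of the line for which `startswith('xxx')` is true, and the total number of lines in the file. Is there any way to get this information faster than opening the file with `with open("yyy.txt") as f:`, calling `readlines`, or using `open("yyy.txt")`?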

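If the file is not too large to fit in memory, you can read the whole file at once rather than line by line. Then, instead of splitting the data into lines, find what you are looking for and count the newlines before it to get the line the item is on. Count all the newlines to get the total line count. Here is a function that does this: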
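```
def find_line_fast(file_name, start):
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    if buf.startswith(start):
        # The first line has no preceding newline, so check it separately.
        found_at = 1
    else:
        # Find a line that starts with the value of start.
        idx = buf.find('\n' + start)
        if idx != -1:
            # If found, count the newlines before the matching line.
            found_at = buf.count('\n', 0, idx + 1) + 1
    # Return the line found at, and the total number of lines (counted as newlines).
    return found_at, buf.count('\n')
```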

Below is a benchmark comparing the approach above with the readline and line-splitting approaches. The approach above is the fastest.
```
import datetime


def find_line_readline(file_name, start):
    # Iterate over the file line by line, counting as we go.
    count = 0
    found_at = -1
    with open(file_name) as f:
        for line in f:
            count += 1
            # Only record the first matching line.
            if found_at == -1 and line.startswith(start):
                found_at = count
    return found_at, count


def find_line_split(file_name, start):
    # Read the whole file, then split it into a list of lines.
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    for i, line in enumerate(buf.split('\n')):
        if line.startswith(start):
            found_at = i + 1
            break
    return found_at, buf.count('\n')


def find_line_fast(file_name, start):
    # Read the whole file and search the raw buffer without splitting it.
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    if buf.startswith(start):
        # The first line has no preceding newline, so check it separately.
        found_at = 1
    else:
        idx = buf.find('\n' + start)
        if idx != -1:
            found_at = buf.count('\n', 0, idx + 1) + 1
    return found_at, buf.count('\n')


n = 100
fname = "boggle_dict.txt"
st = "zymotic"
for fn in (find_line_readline, find_line_split, find_line_fast):
    # Run once to show the result, then time n repeated calls.
    at, count = fn(fname, st)
    print(fn.__name__, 'found "%s" on line: %d of %d' % (st, at, count))
    start = datetime.datetime.now()
    for i in range(n):
        fn(fname, st)
    print(n, '*', fn.__name__, 'took', datetime.datetime.now() - start)
    print()
```

Output:

```
find_line_readline found "zymotic" on line: 172819 of 172823
100 * find_line_readline took 0:00:14.289262

find_line_split found "zymotic" on line: 172819 of 172823
100 * find_line_split took 0:00:12.784887

find_line_fast found "zymotic" on line: 172819 of 172823
100 * find_line_fast took 0:00:01.144335
```
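The timings above were taken with `datetime`. As a rough alternative, the standard library's `timeit` module can repeat the measurement and report the best run. The sketch below is only an illustration: `count_and_find` and `make_sample_file` are hypothetical names, and the synthetic word list stands in for `boggle_dict.txt`, which is not assumed to be available.

```python
import os
import tempfile
import timeit


def count_and_find(file_name, start):
    """Hypothetical helper: read the whole file and search the raw buffer."""
    with open(file_name) as f:
        buf = f.read()
    if buf.startswith(start):
        # A match on the first line has no preceding newline.
        found_at = 1
    else:
        idx = buf.find('\n' + start)
        found_at = buf.count('\n', 0, idx + 1) + 1 if idx != -1 else -1
    return found_at, buf.count('\n')


def make_sample_file(n_lines=100000):
    """Write a synthetic word list with the target on the last line; return its path."""
    fd, path = tempfile.mkstemp(suffix='.txt', text=True)
    with os.fdopen(fd, 'w') as f:
        for i in range(n_lines - 1):
            f.write('word%d\n' % i)
        f.write('zymotic\n')
    return path


if __name__ == '__main__':
    path = make_sample_file()
    try:
        print(count_and_find(path, 'zymotic'))
        # repeat() returns one total per run; the minimum is the least noisy figure.
        timings = timeit.repeat(lambda: count_and_find(path, 'zymotic'),
                                repeat=3, number=20)
        print('best of 3: %.4f s for 20 calls' % min(timings))
    finally:
        os.remove(path)
```

Absolute numbers will depend on the machine, the file size, and the Python version, so they are only meaningful relative to other functions timed the same way.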

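Note: although it is more complex, you could also (assuming an ASCII or ASCII-superset encoding) use the `mmap` module to map the file without reading it in. You no longer depend on having enough RAM, and you can start processing immediately without waiting for the whole file to load (it provides a `.find` method directly).

@gammazero, thanks for your reply. I just got a chance to try it. I do not see a time difference over these 5 calls; the difference may become larger as the number of calls grows. One question: I have several lines that start with (start) and want the index of the first one. This code seems to find the last one, is that right?

I used a large file (a Boggle dictionary), searched for a line near the end of the file, and ran it 100 times to show a significant time difference. With a small file you will not see much of a difference.

I checked that all of the functions find only the first match. I copied the word being searched for to multiple places, and only the first was found. The `_readline` version does not search again after a match. The `_split` version breaks out of the search loop after the first match. The `_fast` version only searches up to the first match.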
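The `mmap` suggestion from the comments could look roughly like the sketch below. It is a minimal illustration, not the commenter's code: `find_line_mmap` and `_count_newlines` are hypothetical names, the search term is assumed to be passed as `bytes`, and the file is assumed to use an ASCII-compatible encoding. Since `mmap` objects offer `find` but no `count`, newlines are counted in fixed-size slices so that only one chunk is copied at a time.

```python
import mmap
import os


def _count_newlines(mm, end, chunk_size=1 << 20):
    """Count newline bytes in mm[0:end], copying at most chunk_size bytes at a time."""
    total = 0
    for pos in range(0, end, chunk_size):
        total += mm[pos:min(pos + chunk_size, end)].count(b'\n')
    return total


def find_line_mmap(file_name, start):
    """Return (1-based line of the first line starting with start, newline count).

    start must be bytes, e.g. b'zymotic'. The line is -1 if nothing matches.
    """
    with open(file_name, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            # An empty file cannot be memory-mapped.
            return -1, 0
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if mm[:len(start)] == start:
                # A match on the first line has no preceding newline.
                found_at = 1
            else:
                idx = mm.find(b'\n' + start)
                found_at = _count_newlines(mm, idx + 1) + 1 if idx != -1 else -1
            return found_at, _count_newlines(mm, size)
```

A call would look like `find_line_mmap("boggle_dict.txt", b"zymotic")`. Whether this beats reading the whole file depends on file size and available memory, so it is worth timing on your own data before adopting it.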