How can I make searching and counting faster in Python?

python list search

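GENERIC_TEXT_STORE is a list of strings. For example:

GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that\'s not a test']

def count_occurrences(string):
    count = 0
    for text in GENERIC_TEXT_STORE:
        count += text.count(string)
    return count

Given a string (for example 'this'), I want to find out how many times it occurs in GENERIC_TEXT_STORE. If GENERIC_TEXT_STORE is huge, this is very slow. How can I make the search and count faster? For example, would it be faster if I split the large GENERIC_TEXT_STORE list into several smaller lists?

If the multiprocessing module is useful here, how can I use it for this purpose?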

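First, check that your algorithm is really doing what you intend, as suggested in the comments on the question. The count() method counts substring matches; if you actually want to match complete words, restructuring the code will probably give you a big improvement. Something like this can serve as your condition:

any(word == string for word in text.split())

Multiprocessing will probably help, because you can split the list into smaller lists (one per core) and add up the results once every process has finished, which avoids inter-process communication during execution. In my tests, multiprocessing in Python varied a lot between operating systems: Windows and Mac can take quite a while just to spawn the processes, whereas Linux seems to be much faster. Some people say that setting the CPU affinity of each process with pstools is important, but I found it made little difference in my case.

Another option is to use Cython to compile your Python to C, or to rewrite the whole program in a faster language, but since you tagged this question as Python I assume you are not so interested in that.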
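The one-chunk-per-core idea could be sketched roughly as follows. This is a minimal illustration rather than a tuned implementation: the helper names `count_in_chunk` and `parallel_count` and the `workers` parameter are assumptions made for this example, and whether it beats the single-process loop depends on how long the processes take to start and how much data has to be copied to them.

```python
import os
from multiprocessing import Pool


def count_in_chunk(args):
    """Count occurrences of `needle` in one chunk of strings."""
    chunk, needle = args
    return sum(text.count(needle) for text in chunk)


def parallel_count(texts, needle, workers=None):
    """Split `texts` into one chunk per worker and add up the partial counts.

    `workers` is a hypothetical parameter; it defaults to the CPU count.
    """
    workers = workers or os.cpu_count() or 1
    if workers <= 1:
        # No point starting a pool for a single worker.
        return count_in_chunk((texts, needle))
    size = max(1, -(-len(texts) // workers))  # ceiling division
    chunks = [texts[i:i + size] for i in range(0, len(texts), size)]
    with Pool(processes=workers) as pool:
        # Each process returns a single integer, so results are only
        # exchanged once, when a chunk is finished.
        return sum(pool.map(count_in_chunk, [(c, needle) for c in chunks]))


if __name__ == "__main__":
    store = ["this is good", "this is a test", "that's not a test"] * 1000
    print(parallel_count(store, "this"))
```

The `if __name__ == "__main__":` guard matters on platforms that spawn rather than fork worker processes, because the module is re-imported in each worker.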

You can use re. The session below defines the functions to compare; timeit is then used to measure their execution time.

In [2]: GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that\'s not a test']

In [3]: def count_occurrences(string):
   ...:     count = 0
   ...:     for text in GENERIC_TEXT_STORE:
   ...:         count += text.count(string)
   ...:     return count

In [6]: import re

In [7]: def count(_str):
   ...:     return len(re.findall(_str,''.join(GENERIC_TEXT_STORE)))
   ...:
In [28]: def count1(_str):
    ...:     return ' '.join(GENERIC_TEXT_STORE).count(_str)
    ...:
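One caveat is worth checking before relying on the joined variants: joining the list can create matches that span two adjacent strings, and re.findall treats its first argument as a regular expression rather than as a literal string. The sketch below is not part of the original answer. The helper names `count_joined` and `count_regex` and the newline separator are assumptions chosen for illustration, and the separator has to be a string that can never occur in the search term.

```python
import re


def count_occurrences(needle, texts):
    """Reference behaviour: count the needle inside each string separately."""
    return sum(text.count(needle) for text in texts)


def count_joined(needle, texts, sep="\n"):
    """Join once and count once; `sep` must not occur in `needle`."""
    if sep in needle:
        raise ValueError("separator must not occur in the search string")
    return sep.join(texts).count(needle)


def count_regex(needle, texts, sep="\n"):
    """Same idea with re; re.escape makes the needle a literal pattern."""
    if sep in needle:
        raise ValueError("separator must not occur in the search string")
    return len(re.findall(re.escape(needle), sep.join(texts)))


store = ["ab", "cd"]
print("".join(store).count("bc"))      # 1: a match spanning two strings
print(count_occurrences("bc", store))  # 0
print(count_joined("bc", store))       # 0
print(len(re.findall("a.b", "axb")))   # 1: the "." matched "x"
print(count_regex("a.b", ["axb"]))     # 0
```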

When GENERIC_TEXT_STORE is the short list defined above:

In [9]: timeit count('this')
1.27 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: timeit count_occurrences('this')
697 ns ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [33]: timeit count1('this')
385 ns ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

For two larger versions of GENERIC_TEXT_STORE:

In [17]: timeit count('this')
1.07 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [18]: timeit count_occurrences('this')
3.35 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [37]: timeit count1('this')
275 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [20]: timeit count('this')
5.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: timeit count_occurrences('this')
33 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [40]: timeit count1('this')
3.98 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

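When the size of GENERIC_TEXT_STORE is over a million (1500000):

In [23]: timeit count('this')
50.3 ms ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: timeit count_occurrences('this')
283 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [43]: timeit count1('this')
40.7 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

When GENERIC_TEXT_STORE is large, count and count1 are several times faster than count_occurrences (roughly 5.6 and 7 times faster in the 1500000 run above), with count1 the fastest of the three.

Should your function count only words (e.g. is, good, not, that's), or should it also count substrings (e.g. "is g", "is tha", "is tes", "good")?

I'll second the request to clarify your requirements. If you search the same large collection of strings many times and the string argument is always a single word, you could store the collection as word counts. If string can be an arbitrary substring, such as "rbitrary s", maybe consider storing it as a trie.

If you only search GENERIC_TEXT_STORE once before its value changes, I don't think your code can easily be improved in the general case.

If the search is word-based rather than substring-based (and depending on the exact use case), you could also preprocess the data and build an index in which each word points to the list of strings it occurs in.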
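For repeated whole-word queries against a collection that does not change, the word-count and index ideas could look something like the sketch below. The function names `build_word_counts` and `build_word_index` are assumptions made for this example, and splitting on whitespace is only one possible definition of a word.

```python
from collections import Counter, defaultdict


def build_word_counts(texts):
    """Precompute how often each whitespace-separated word occurs."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def build_word_index(texts):
    """Map each word to the positions of the strings that contain it."""
    index = defaultdict(list)
    for position, text in enumerate(texts):
        for word in set(text.split()):
            index[word].append(position)
    return index


store = ["this is good", "this is a test", "that's not a test"]
word_counts = build_word_counts(store)
word_index = build_word_index(store)

print(word_counts["this"])  # 2
print(word_counts["is"])    # 2 (the substring count over store would be 4)
print(word_index["test"])   # [1, 2]
```

Building the tables costs one pass over the data, after which each lookup is a dictionary access. The results differ from text.count() whenever the search string also occurs inside longer words, as the "is" case shows.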