Python: keyword matching from a large repository against files

python python-2.7

I have a set of 400K keywords that need to be matched against 100K input files.

My current approach is shown below.

Code:

import glob
with open("keyword.txt") as inp:
    keyword_set=set([lin.strip().lower() for lin in inp])
for fil in glob.glob("file/path/*.txt"):
    with open(fil) as inp, open("output.txt","w") as out:
        file_txt = inp.read().lower()
        for val in keyword_set:
            if val in file_txt:
                out.write("{}\t{}".format(fil, val))

Keyword example:

BUENOS AIRES
Argentina

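Because I am looping over a large repository, this takes a lot of time (anywhere from a few seconds to a few minutes per file). Is there any way to increase throughput and reduce the time taken?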

Looking at your code:

    file_txt = inp.read().lower()
    for val in keyword_set:
        if val in file_txt:
            out.write("{}\t{}".format(fil, val))

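The statement

        if val in file_txt:

looks up a string in the text (a substring search), whose time complexity is on average no better than O(n*C), and it runs once for every keyword. If the text consists of words (as in your example), you can use a more suitable representation.

For example, represent all the words in the file as a set() (just like the keyword set). You can of course split the file text on whatever delimiter applies. If keywords can contain two or more words, you must also add consecutive word pairs (bigrams) and triples (trigrams) to the set. After that you can look each keyword up in the set; that lookup has O(1) average-case time complexity.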
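
import glob

with open("keyword.txt") as inp:
    keyword_set = set(lin.strip().lower() for lin in inp)

# Open the output once: reopening it with "w" for every input file
# would overwrite the matches written for earlier files.
with open("output.txt", "w") as out:
    for fil in glob.glob("file/path/*.txt"):
        with open(fil) as inp:
            file_txt = inp.read().lower()
        # Replace 'Your delimiter' with the separator used in your files.
        file_set = set(file_txt.split('Your delimiter'))
        # [ adding bigrams (or trigrams) to the set ]
        for val in keyword_set:
            if val in file_set:
                # One match per line, for downstream processing.
                out.write("{}\t{}\n".format(fil, val))
        # [You can use set intersection operation here instead of cycle]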
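
The two placeholders above (adding n-grams and using set intersection) could be filled in roughly as follows. This is only a sketch under stated assumptions: `word_ngrams` and `match_keywords` are hypothetical helper names, words are taken to be runs of `\w` characters, and keywords are assumed to be lowercased with their words separated by single spaces. To avoid building every n-gram size, it only generates the phrase lengths that actually occur among the keywords.

```python
import re


def word_ngrams(words, n):
    """Return the set of all runs of n consecutive words, joined by spaces."""
    return set(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def match_keywords(text, keyword_set):
    """Return the keywords that occur in text as whole-word sequences."""
    # Assumption: a word is a run of letters, digits or underscores.
    words = re.findall(r"\w+", text.lower())
    # Build n-grams only for phrase lengths present among the keywords;
    # length 0 (an empty keyword line) is skipped.
    lengths = set(len(kw.split()) for kw in keyword_set) - set([0])
    candidates = set()
    for n in lengths:
        candidates |= word_ngrams(words, n)
    # Set intersection replaces the explicit loop over all keywords.
    return keyword_set & candidates


keywords = set(["buenos aires", "argentina"])
found = match_keywords("Flights to Buenos Aires, Argentina.", keywords)
print(sorted(found))  # ['argentina', 'buenos aires']
```

Note that this matches whole words only, so unlike the substring test it would not report "argentina" inside a longer word such as "argentinas". Whether that difference is acceptable depends on your data.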

If you need to take the number of occurrences into account, use collections.Counter instead of set.
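
As a rough illustration of the Counter variant, the sketch below counts how often each single-word keyword appears in a text. The helper name `count_keywords` and the `\w+` tokenization are assumptions made for the example; multi-word keywords would additionally require counting consecutive word sequences.

```python
import re
from collections import Counter


def count_keywords(text, keyword_set):
    """Map each single-word keyword found in text to its occurrence count."""
    # Assumption: a word is a run of letters, digits or underscores.
    word_counts = Counter(re.findall(r"\w+", text.lower()))
    # Keep only the keywords that actually occur in the text.
    return dict((kw, word_counts[kw]) for kw in keyword_set if kw in word_counts)


counts = count_keywords(
    "Argentina beat Brazil; Argentina celebrated.",
    set(["argentina", "chile"]),
)
print(counts)  # {'argentina': 2}
```
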
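Guessing from the example, is there only one match per file? If so, add break on a new line after your out.write call, at the same indentation level.

@Chris_Rands No, that's not the case. Multiple keywords can match a single file; I am just writing them on separate lines for downstream processing.

So what you are saying is to convert the text into a set and do lookups. If I have long keywords of, say, 10 words, then I would have to build a set of 10-word sequences from the file contents and look those up.

@Pythonmaster, adding consecutive words to the set/Counter is a good idea. Code that relies on a substring search for every keyword is slow. If you can improve the representation, you will get good overall performance.

Thanks, man, your idea makes sense.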