Python: check whether a string contains at least one string from a list


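I am trying to do some matching in Python.

I have a list of strings (len ~3000) and a file, and I want to check whether each line in the file contains at least one of the strings in the list.

The most straightforward way is to check the strings one by one, but that takes time (not that long, though).

Is there a way to make the search faster?

For example: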

list = ["aq", "bs", "ce"]

if the line is "aqwerqwerqwer"  -> true (since has "aq" in it)
if the line is "qweqweqwe" -> false (has none of "aq", "bs" or "ce")

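You can use any with a generator expression, for example if any(s in line for s in lst): followed by the code to run.

That condition tests whether any item in lst can be found in line. If so, the body of the if statement (# Do stuff) runs.

See the demonstration below: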
>>> lst = ["aq", "bs", "ce"]
>>> if any(s in "aqwerqwerqwer" for s in lst):
...     print(True)
...
True
>>> if any(s in "qweqweqwe" for s in lst):
...     print(True)
...
>>>
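
To apply the same test to every line of a file, something like the following minimal sketch could work. The helper name lines_with_any and its parameters are assumptions made for illustration and are not part of the answer above; io.StringIO stands in for a real open file.

```python
import io

def lines_with_any(lines, needles):
    """Yield each line that contains at least one string from needles."""
    for line in lines:
        # any() stops at the first needle found in the line
        if any(s in line for s in needles):
            yield line

# io.StringIO stands in for an open file here (hypothetical sample data)
sample = io.StringIO("aqwerqwerqwer\nqweqweqwe\nxxbsxx\n")
matched = [line.rstrip("\n") for line in lines_with_any(sample, ["aq", "bs", "ce"])]
print(matched)
```

Because the helper is a generator, it never holds the whole file in memory; only the matching lines are collected by the caller.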

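You can use itertools.groupby:
from itertools import groupby
pats = ['pat', 'pat2', …]
matches = groupby(lines, key=lambda line: any(pat in line for pat in pats))

If your patterns are all single characters, you can optimize further with a set:
pats = set('abcd')
matches = groupby(lines, key=pats.intersection)

This will result in something like
[(matched patterns, lines matched),
 (empty list, lines not matched),
 (matched patterns, lines matched),
 …]

(Except that it is a generator rather than a list. With the any version the first element of each pair is simply True or False; with the set version it is the set of matched characters.) That is the main logic. Below is one way to iterate over the preprocessed generator to produce output: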
for matched_pats, linegrp in matches:
  # groupby yields (key, group) pairs; the key is what the key function returned
  for line in linegrp:
    if matched_pats:
      print('"{}" matched because of "{}"'.format(line, matched_pats))
    else:
      print('"{}" did not match'.format(line))
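
For reference, here is a small self-contained sketch of the groupby idea. The sample data and the key function matched_patterns are assumptions made for illustration. Unlike any(), this key function checks every pattern so that it can report which ones matched, which makes it slower on long pattern lists.

```python
from itertools import groupby

pats = ["aq", "bs", "ce"]
lines = ["aqwer", "xbsx", "qweqwe", "zzz", "cece"]  # hypothetical sample lines

def matched_patterns(line):
    # Tuple of the patterns found in this line; an empty tuple if none matched
    return tuple(p for p in pats if p in line)

# groupby only groups *consecutive* lines that share the same key
result = [(key, list(group)) for key, group in groupby(lines, key=matched_patterns)]
for key, group in result:
    print(key, group)
```

Note that the two non-matching lines end up in one group only because they are adjacent; groupby does not sort or merge groups across the input.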

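More complicated but faster: preprocess your list of strings into a prefix trie.
Then, for each line of the file, starting at each character position, see how far you can walk into the trie.
If you keep a queue of all active tries (partial matches in progress), you only need to look at each character position once as you scan through the line. You could also store a "minimum terminal depth" counter at each trie node, so that you can cut the comparison short as you approach the end of the string.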
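
The trie idea could be sketched roughly as follows. This is an illustration under stated assumptions rather than the answerer's implementation: the names build_trie, contains_any and _END are made up here, the trie is a plain nested dict, and the "minimum terminal depth" optimization is left out.

```python
_END = object()  # sentinel key marking the end of a complete word

def build_trie(words):
    """Build a nested-dict prefix trie from the given words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root

def contains_any(line, root):
    """Return True if any word stored in the trie occurs somewhere in line."""
    active = []  # trie nodes of the partial matches still in progress
    for ch in line:
        active.append(root)  # a new match attempt may start at this character
        next_active = []
        for node in active:
            child = node.get(ch)
            if child is None:
                continue  # this attempt failed; drop it
            if _END in child:
                return True  # a complete word has just been read
            next_active.append(child)
        active = next_active
    return False

trie = build_trie(["aq", "bs", "ce"])
print(contains_any("aqwerqwerqwer", trie), contains_any("qweqweqwe", trie))
```

Each character of the line is read once, and the work per character is bounded by the number of match attempts that are still alive at that point.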

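A simpler half-step is to reduce your big list of strings to a dict of lists of strings, indexed by the first three characters of each string you are looking for.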
from itertools import tee

def triwise(iterable):
    # based on pairwise, from the itertools documentation
    "s -> (s0,s1,s2), (s1,s2,s3), (s2,s3,s4), ..."
    a, b, c = tee(iterable, 3)
    next(b, None)
    next(c, None)
    next(c, None)
    return zip(a, b, c)  # use itertools.izip instead on Python 2

class Searcher:
    def __init__(self):
        self.index = {}

    def add_seek_strings(self, strings):
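        for s in strings:
            # Index by the first three characters; a string shorter than
            # three characters gets a short key that find_matches never looks up
            pre = s[:3]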
            if pre in self.index:
                self.index[pre].append(s)
            else:
                self.index[pre] = [s]

    def find_matches(self, target):
        offset = -1
        for a,b,c in triwise(target):
            offset += 1
            pre = a+b+c
            if pre in self.index:
                from_here = target[offset:]
                for seek in self.index[pre]:
                    if from_here.startswith(seek):
                        yield seek

    def is_match(self, target):
        for match in self.find_matches(target):
            return True
        return False

def main():
    srch = Searcher()
    srch.add_seek_strings(["the", "words", "you", "want"])

    with open("myfile.txt") as inf:
        matched_lines = [line for line in inf if srch.is_match(line)]

if __name__=="__main__":
    main()

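This is actually a good use case for a regular expression engine with an automatically generated regular expression.
Try the re_match function (listed further down).
The regular expression will match each line faster than a simple linear scan over every string. There are two reasons: regular expressions are implemented in C, and the regular expression is compiled into a state machine that checks each input character only once, rather than several times as in the naive solution.
I compared the two approaches in an IPython notebook. The test data consisted of 3000 strings matched against a list of 1 million lines. On my machine the naive approach took 1 min 46 s, while this solution took only 9.97 s.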
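
A comparison along those lines could be set up roughly as below. This is a sketch, not the notebook from the answer: the data is randomly generated, the sizes are much smaller than the 3000 strings and 1 million lines mentioned above, and the function names naive and with_regex are assumptions. Timings will vary by machine, so none are quoted here.

```python
import random
import re
import string
import timeit

random.seed(0)  # fixed seed so runs are repeatable

def random_word(length):
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

# Hypothetical test data, far smaller than in the answer above
needles = [random_word(4) for _ in range(300)]
lines = [random_word(40) for _ in range(2000)]

def naive(lines, needles):
    # One substring scan per needle, per line
    return [any(s in line for s in needles) for line in lines]

def with_regex(lines, needles):
    # One alternation pattern built from all needles, one search per line
    pattern = re.compile("|".join(re.escape(s) for s in needles))
    return [pattern.search(line) is not None for line in lines]

# Both approaches must agree before the timings mean anything
assert naive(lines, needles) == with_regex(lines, needles)

print("naive:", timeit.timeit(lambda: naive(lines, needles), number=3))
print("regex:", timeit.timeit(lambda: with_regex(lines, needles), number=3))
```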
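This will still do a linear search for every line of the file. Use set() instead.
@liori Converting line to a set itself takes linear time.
@Aशwiniच It doesn't matter, because I will use this set or list many times; it is definitely better than searching the list every time.
@YilunZhang any can work with any iterable data structure, I believe.
Ah, okay, I understood the OP's "has" as "is exactly equal to". Please ignore my comment.
Does this answer your question?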
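
The misunderstanding in the comments comes down to two different tests, which the short sketch below contrasts. The variable names are chosen here for illustration. A set lookup answers "is the whole line one of the strings?", while the question asks "does the line contain one of the strings?".

```python
wanted = {"aq", "bs", "ce"}
line = "aqwerqwerqwer"

# Hash lookup: True only if the entire line equals one of the strings
exact = line in wanted

# Substring scan: True if any of the strings occurs inside the line
substring = any(s in line for s in wanted)

print(exact, substring)
```

So a set speeds up exact matches but does not, by itself, help with the substring search asked about here.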
The regular expression approach, in code:
import re

def re_match(strings_to_match, my_file):
    # Build one alternation pattern; re.escape keeps special characters literal
    expression = re.compile(
        '(' +
        '|'.join(re.escape(item) for item in strings_to_match) +
        ')')

    # Return False as soon as a line contains none of the strings
    for line in my_file:
        if not expression.search(line):
            return False
    return True