Searching for different strings on the same line in Python

I would like to optimize the following code:

if re.search(str(stringA), line) and re.search(str(stringB), line):
    .....
    .....

I tried:

stringAB = stringA + '.*' + stringB
if re.search(str(stringAB), line):
    .....
    .....

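But the results I get are not reliable. I am using re.search here because it seems to be the only way I can search for the exact regular expression of the pattern specified in stringA and stringB.

The logic behind this code is modeled on the following egrep command example: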

stringA=Success
stringB=mysqlDB01

egrep "${stringA}" /var/app/mydata | egrep "${stringB}"

If there is a better way to do this without re.search, please let me know.

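One approach is to build a pattern that matches either word, using \b so that only whole words match, then use re.findall to collect every match in the string, and finally use set equality to make sure both words were matched.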

import re

stringA = "spam"
stringB = "egg"

words = {stringA, stringB}

# Make a pattern that matches either word
pat = re.compile(r"\b{}\b|\b{}\b".format(stringA, stringB))

data = [
    "this string has spam in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.findall(s)
    print(repr(s), found, set(found) == words)

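Output

'this string has spam in it' ['spam'] False
'this string has egg in it' ['egg'] False
'this string has egg in it and another egg too' ['egg', 'egg'] False
'this string has both egg and spam in it' ['egg', 'spam'] True
"the word spams shouldn't match" [] False
"and eggs shouldn't match, either" [] False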

A more efficient way to do set(found) == words is words.issubset(found), because it skips the explicit conversion of found to a set.
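As a minimal sketch of that variant, the check can be wrapped in a small helper. The name has_all_words is hypothetical and not part of the original answer; the pattern is the same either-word pattern as above.

```python
import re

stringA = "spam"
stringB = "egg"
words = {stringA, stringB}

# Same either-word pattern as in the example above
pat = re.compile(r"\b{}\b|\b{}\b".format(stringA, stringB))

def has_all_words(line):
    # has_all_words is a hypothetical helper name.
    # issubset accepts any iterable, so the list returned by findall
    # does not have to be converted to a set first.
    return words.issubset(pat.findall(line))

print(has_all_words("this string has both egg and spam in it"))        # True
print(has_all_words("this string has egg in it and another egg too"))  # False
```

Because the pattern can only ever match words that are in the set, the subset test gives the same answer here as the equality test.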

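As Jon Clements mentioned in a comment, we can simplify and generalize the pattern to handle any number of words, and we should use re.escape in case any of the words contain regex metacharacters.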

pat = re.compile(r"\b({})\b".format("|".join(re.escape(word) for word in words)))

Thanks, Jon!
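To illustrate why re.escape matters, here is a small sketch using made-up search terms, one of which contains a dot. The terms, the sample lines and the helper name has_all_words are assumptions for illustration only.

```python
import re

# Hypothetical search terms; "mysql.DB01" contains the metacharacter "."
words = {"Success", "mysql.DB01"}

# re.escape makes every word match literally, so "." only matches a dot
pat = re.compile(r"\b({})\b".format("|".join(re.escape(word) for word in words)))

def has_all_words(line):
    # With a single capturing group, findall returns the captured words
    return words.issubset(pat.findall(line))

print(has_all_words("Success: connected to mysql.DB01"))  # True
print(has_all_words("Success: connected to mysqlxDB01"))  # False
```

Without re.escape, the unescaped dot would also match the "x" in the second line. Note that escaping is only appropriate when the inputs are meant as literal text rather than as regular expressions.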

Here is a version that matches the words in the specified order. If it finds a match it prints the matching substring, otherwise it prints None.

import re

stringA = "spam"
stringB = "egg"
words = [stringA, stringB]

# Make a pattern that matches all the words, in order
pat = r"\b.*?\b".join([re.escape(word) for word in words])
pat = re.compile(r"\b" + pat + r"\b")

data = [
    "this string has spam and also egg, in the proper order",
    "this string has spam in it",
    "this string has spamegg in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.search(s)
    if found:
        found = found.group()
    print('{!r}: {!r}'.format(s, found))

Output: see the listing at the end of the page, after the comments.

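What type of objects are stringA and stringB? Presumably they are not actually strings, since you are calling str on them.

They are strings. I call str to make sure Python treats them as strings. By strings, I mean any pattern the user might want to search for in the file.

If s is already a string, then Python already knows it is a string object. str(s) just returns s.

Are you missing hits because stringA does not always come before stringB? Your attempt suggests that.

By the way: if `x and y` is already about as optimized as it can be, you may be attempting premature optimization.

It is not possible to make your solution more efficient. It already does the minimum amount of work needed to get the desired result. Apart from needlessly calling str on stringA and stringB.

Might be worth generalizing pat so it is something like: r'\b{}\b'.format('|'.join(re.escape(word) for word in words))? Although it does not much matter here, you could use .finditer to avoid building a list, for example: words.issubset(m.group() for m in pat.finditer(s))

@JonClements Good idea! Originally I did not use re.escape because I thought the strings might already be regexes, but I guess it is a good idea. I won't bother with .finditer, though, since there is probably not much benefit if the OP is searching single lines of text.

I am actually using `with open(logfile) as f`, iterating over a huge log file and searching for two patterns on each log line that is read. The strings must appear in the specified order: stringA, then stringB. That said, I can imagine a scenario where a user would want that reversed. So I wonder whether .finditer would help speed up reading a huge log file and checking each line for the two patterns?

@RoyMWell Please see the updated version at the end of my answer. Since you need to search line by line, .finditer does not give much benefit here: it is useful when each string being searched is many KB long and contains many matches.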
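The log-file scenario discussed in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: io.StringIO stands in for the `open(logfile)` call, the log lines are made up, and the names any_order, in_order and matching_lines are hypothetical helpers. The any-order check uses the .finditer idea from the comments, and the in-order check reuses the pattern construction from the answer.

```python
import io
import re

def any_order(words):
    # All words must appear on the line, in any order
    wanted = set(words)
    pat = re.compile(r"\b({})\b".format("|".join(re.escape(w) for w in wanted)))
    # The generator hands matches to issubset one by one, so no list is built
    return lambda line: wanted.issubset(m.group() for m in pat.finditer(line))

def in_order(words):
    # Same construction as the ordered version in the answer
    body = r"\b.*?\b".join(re.escape(w) for w in words)
    pat = re.compile(r"\b" + body + r"\b")
    return lambda line: pat.search(line) is not None

def matching_lines(lines, predicate):
    # Patterns are compiled once, before this loop, not once per line
    for line in lines:
        if predicate(line):
            yield line.rstrip("\n")

# Made-up log content; a real script would iterate over open(logfile)
LOG = (
    "12:00 Success backup mysqlDB01\n"
    "12:01 Failure backup mysqlDB01\n"
    "12:02 mysqlDB01 reports Success\n"
    "12:03 Success backup mysqlDB02\n"
)

print(list(matching_lines(io.StringIO(LOG), in_order(["Success", "mysqlDB01"]))))
# ['12:00 Success backup mysqlDB01']
print(list(matching_lines(io.StringIO(LOG), any_order(["Success", "mysqlDB01"]))))
# ['12:00 Success backup mysqlDB01', '12:02 mysqlDB01 reports Success']
```

Reversing the required order is then just a matter of passing the words to in_order in the opposite order.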

Output of the version that matches the words in order (spam, then egg):

'this string has spam and also egg, in the proper order': 'spam and also egg'
'this string has spam in it': None
'this string has spamegg in it': None
'this string has egg in it': None
'this string has egg in it and another egg too': None
'this string has both egg and spam in it': None
"the word spams shouldn't match": None
"and eggs shouldn't match, either": None