
Best way to find multiple strings in Python

python regex string


I need to be pointed in the right direction on a problem I'm working on:

Suppose I'm reading the output of a C program like this:

while True:
    ln = p.stdout.readline()
    if '' == ln:
        break
    #do stuff here with ln

My output looks like this:

TrnIq: Thread on CPU 37
TrnIq: Thread on CPU 37 but will be moved to CPU 44
IP-Thread on CPU 33
FANOUT Thread on CPU 37
Filter-Thread on CPU 38 but will be moved to CPU 51
TRN TMR Test 2 Supervisor Thread on CPU 34
HomographyWarp Traking Thread[0] on CPU 26

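I want to capture "TrnIq: Thread on" and "37" as two separate variables, a string and a number, from the output "TrnIq: Thread on CPU 37".

The other lines work much the same way, for example capturing "HomographyWarp Traking Thread[0] on" and the number 26 from "HomographyWarp Traking Thread[0] on CPU 26".

The only real challenge is lines like "Filter-Thread on CPU 38 but will be moved to CPU 51": for this line I need "Filter-Thread" and the number 51, not the first number, 38.

Python has so many different ways to do this that I don't even know where to start.

Thanks in advance.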

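Assuming ln is a single line of your data, a single re.match() call with two capture groups returns a tuple of the information; the CPU value can then be converted with int().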

For example:

>>> import re
>>> for ln in lines:
...     print(re.match(r'(.*?)(?: on CPU.*)?(?: (?:on|to) CPU )(.*)', ln).groups())
... 
('TrnIq: Thread', '37')
('TrnIq: Thread', '44')
('IP-Thread', '33')
('FANOUT Thread', '37')
('Filter-Thread', '51')
('TRN TMR Test 2 Supervisor Thread', '34')
('HomographyWarp Traking Thread[0]', '26')

Explanation:

(.*?)          # capture zero or more characters at the start of the string,
               #   as few characters as possible
(?: on CPU.*)? # optionally match ' on CPU' followed by any number of characters,
               #   do not capture this
(?: (?:on|to) CPU )  # match ' on CPU ' or ' to CPU ', but don't capture
(.*)           # capture the rest of the line
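
The conversion of the CPU value to int mentioned in this answer can be sketched as follows. This is only a minimal sketch: it assumes the last group is narrowed to \d+ so that int() never sees trailing text, and the names PATTERN and parse_line are hypothetical, not part of the original answer.

```python
import re

# Same pattern as in the answer, except that the last group only accepts digits.
PATTERN = re.compile(r'(.*?)(?: on CPU.*)?(?: (?:on|to) CPU )(\d+)')

def parse_line(ln):
    """Return (process_name, cpu_number), or None if the line does not match."""
    m = PATTERN.match(ln)
    if m is None:
        return None
    proc, cpu = m.groups()   # tuple unpacking of the two capture groups
    return proc, int(cpu)

print(parse_line("Filter-Thread on CPU 38 but will be moved to CPU 51"))
# ('Filter-Thread', 51)
```

Returning None for non-matching lines lets the caller skip unrelated output instead of hitting an AttributeError on a failed match.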


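So use a regex built around on\s+CPU:

^(.*)\s+…

In the case you describe, you always want the last CPU number on the line (the second one, when the thread is going to be moved), so a single regexp is enough:
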
# Test program
import re

lns = [
    "TrnIq: Thread on CPU 37",
    "TrnIq: Thread on CPU 37 but will be moved to CPU 44",
    "IP-Thread on CPU 33",
    "FANOUT Thread on CPU 37",
    "Filter-Thread on CPU 38 but will be moved to CPU 51",
    "TRN TMR Test 2 Supervisor Thread on CPU 34",
    "HomographyWarp Traking Thread[0] on CPU 26"
]

for ln in lns:
    # The greedy ".*" skips ahead, so the final "CPU <n>" on the line is the one captured.
    test = re.search(r"(?P<process>.*Thread\S* on).* CPU (?P<cpu>\d+)$", ln)
    print("%s: '%s' on CPU #%s" % (ln, test.group('process'), test.group('cpu')))

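In the general case, you may want to distinguish between different situations (for example a thread on a CPU, a moved thread, a subthread...). To do that, you can use several re.search() calls one after another. For example: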

# This search recognizes lines of the form "...Thread on CPU so-and-so", and
# also lines that add "...but will be moved to CPU some-other-cpu".
# (No space is required before "Thread", so "IP-Thread" matches as well.)
test = re.search(r"(?P<process>.*Thread) on CPU (?P<cpu1>\d+)(?: but will be moved to CPU (?P<cpu2>\d+))?", ln)
if test:
    # Here we capture Process Thread, both moved and non moved
    if test.group('cpu2'):
        # We have process, cpu1 and cpu2: moved thread
        process, cpu = test.group('process'), int(test.group('cpu2'))
    else:
        # Nonmoved task, we have test.group('process') and cpu1.
        process, cpu = test.group('process'), int(test.group('cpu1'))
else:
    # No match, try some other regexp. For example processes with a thread number
    # between square brackets: "Thread[0]", which are not captured by the regex above.
    # The brackets are escaped so they are not read as a character class.
    test = re.search(r"(?P<process>.*) Thread\[(?P<thread>\d+)\] on CPU (?P<cpu1>\d+)", ln)
    if test:
        # Here we have "HomographyWarp Traking" in process, "0" in thread, "26" in cpu1
        process, cpu = test.group('process'), int(test.group('cpu1'))

For best performance, test first for the kinds of lines that occur most often.
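
One way to apply that advice is to keep the compiled patterns in a list ordered by expected frequency and stop at the first match. The sketch below is an assumption-laden illustration: the names PATTERNS and classify are hypothetical, and the ordering (plain lines first) is a guess, not something measured on the real output.

```python
import re

# Ordered from (assumed) most frequent to least frequent line type.
PATTERNS = [
    ("plain", re.compile(r"^(?P<process>.*Thread) on CPU (?P<cpu>\d+)$")),
    ("moved", re.compile(
        r"^(?P<process>.*Thread) on CPU \d+ but will be moved to CPU (?P<cpu>\d+)$")),
    ("indexed", re.compile(
        r"^(?P<process>.*) Thread\[(?P<thread>\d+)\] on CPU (?P<cpu>\d+)$")),
]

def classify(ln):
    """Return (kind, process, cpu) for the first pattern that matches, else None."""
    for kind, pattern in PATTERNS:
        m = pattern.search(ln)
        if m:
            return kind, m.group("process"), int(m.group("cpu"))
    return None

print(classify("Filter-Thread on CPU 38 but will be moved to CPU 51"))
# ('moved', 'Filter-Thread', 51)
```

Compiling the patterns once, outside the loop, also avoids repeating the pattern lookup for each of the thousands of lines.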

Two regex searches are all you need:

import re

while True:
    ln = p.stdout.readline()
    if '' == ln:
        break

    start_match = re.search(r'^(.*?) on', ln)
    end_match = re.search(r'(\d+)$', ln)
    process = start_match and start_match.group(0)
    process_number = end_match and end_match.group(0)
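
The "x and x.group(0)" idiom in the loop above can look odd at first. As a small sketch (the function name two_searches is hypothetical), here is the same logic wrapped so that both the matching and the non-matching case can be tried directly:

```python
import re

def two_searches(ln):
    start_match = re.search(r'^(.*?) on', ln)
    end_match = re.search(r'(\d+)$', ln)
    # re.search() returns None when nothing matches, and "None and ..." is None,
    # so each variable ends up as either the matched text or None.
    process = start_match and start_match.group(0)
    process_number = end_match and end_match.group(0)
    return process, process_number

print(two_searches("IP-Thread on CPU 33"))   # ('IP-Thread on', '33')
print(two_searches("no match here"))         # (None, None)
```

Note that group(0) is the entire match, so the process string keeps its trailing " on"; group(1) would return only the text inside the parentheses.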

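Regular expressions seem like overkill to me here. [Disclaimer: I'm not fond of regular expressions but I do like Python, so I write in Python whenever I can and avoid writing regexes. For reasons I have never entirely understood, this is considered surprising.]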

s = """TrnIq: Thread on CPU 37
TrnIq: Thread on CPU 37 but will be moved to CPU 44
IP-Thread on CPU 33
FANOUT Thread on CPU 37
Filter-Thread on CPU 38 but will be moved to CPU 51
TRN TMR Test 2 Supervisor Thread on CPU 34
HomographyWarp Traking Thread[0] on CPU 26"""

for line in s.splitlines():
    words = line.split()
    if not ("CPU" in words and "on" in words): continue # skip uninteresting lines
    prefix_words = words[:words.index("on")+1]
    prefix = ' '.join(prefix_words)
    cpu = int(words[-1])
    print((prefix, cpu))  # extra parentheses: prints the pair as a tuple on Python 2 and 3

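This prints one (prefix, cpu) tuple per input line, for example ('TrnIq: Thread on', 37) for the first one.

I don't think I need to translate this code into English.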
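
Since the asker notes in the comments that the real output contains thousands of unrelated lines, a slightly more defensive variant of the split-based loop may be useful. This is a sketch under assumptions: the function name extract is hypothetical, and it assumes every interesting line ends with the two words "CPU" and a number.

```python
def extract(line):
    """Return (prefix, cpu) for a "... on ... CPU <n>" line, or None otherwise."""
    words = line.split()
    # Require the word "on" and at least "<something> CPU <n>".
    if "on" not in words or len(words) < 3:
        return None
    # The last two words must be "CPU" and a decimal number.
    if words[-2] != "CPU" or not words[-1].isdecimal():
        return None
    prefix = ' '.join(words[:words.index("on") + 1])
    return prefix, int(words[-1])

print(extract("Filter-Thread on CPU 38 but will be moved to CPU 51"))
# ('Filter-Thread on', 51)
```

Checking the last word before calling int() avoids a ValueError on lines that happen to contain both "on" and "CPU" but do not end with a number.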

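Can you provide some details? I'm still getting used to Python's syntax for this sort of thing.

You can read the documentation for Python's regular expression module.

I get the regular expression, just not the "and" and the match.group() function. The match.group() function returns a string, right?

Ah, if no match is found, re.search() returns None, so the "and" expression evaluates to None as well. And you are correct: group() returns the specified capture group as a string; here the code calls group(0), which is the entire match.

I guess I should clarify that these are just the output lines I'm interested in. There are thousands of other lines: `process_trn_ip_rslts: switching to trn_FILTER_ propagation state. trn_FILTER: total_update_timer=0.057454 sec. trn_ib->trn_ib_state.ip_part=1 DISPOSITION ACCEPT VALUE=2-a*_000004.pgm DISPOSITION ACCEPT sending frame 4 to socket 6 TrnIb: Sending image id=(1003080551, 750074, frame CNT 4) from IB to IP via socket=8. open_sock/bind OK, sock=79 create_cmd_sock/listen OK, erc=0 trn_filter, instance 3: socket_from_cmd: 35`

So this returns a tuple of 2 strings for each line? If I want to convert the number strings to actual numbers, does Python have a strtonum-like function?

@NASAIntern - You can change the .* at the end to \d+ so that you are sure to grab only digits rather than the rest of the line. Then you can use the int() built-in function to convert the string to a number.

Just for future reference, would you mind explaining this line a bit further? I get the regex characters, but how exactly does the function work: proc, cpu = re.match(r'(.*?)(?: on CPU.*)?(?: (?:on|to) CPU )(.*)', ln).groups()

Not sure why you were downvoted; your answer is correct and well explained. +1. This is very minor, but an import re line somewhere in the code might be nice, just to add a little clarity.

@NASAIntern - re.match() returns a match object, and its groups() method returns a tuple of the capture groups from the regex. Calling proc, cpu = ... is basically the same as groups = re.match(...).groups(); proc = groups[0]; cpu = groups[1].

Yes! Even I can understand it just by looking at it! Much simpler than the regular expressions! But which approach do you think is more efficient?

Efficient in what sense? I think of it as a measure of total time (time to write, time to debug, time to run, time to modify to handle cases I didn't predict) to get the output I need.

regexps will usually (not always) win on performance. I think they only win in terms of code, since I rarely run into use cases like this: a bit too complicated for the basic tools, and not complicated enough to justify using a real par…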
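
Which approach runs faster depends on the data and the Python version, so it is easier to measure than to argue. The following is a rough sketch using the standard timeit module; the names with_regex and with_split are hypothetical, the regex is one possible equivalent of the split-based logic rather than any answerer's exact pattern, and no particular timing result is implied.

```python
import re
import timeit

LINES = [
    "TrnIq: Thread on CPU 37",
    "Filter-Thread on CPU 38 but will be moved to CPU 51",
    "HomographyWarp Traking Thread[0] on CPU 26",
]

# Prefix up to and including the first " on", then the last "CPU <n>" on the line.
PATTERN = re.compile(r'^(.*? on) (?:.* )?CPU (\d+)$')

def with_regex(line):
    m = PATTERN.match(line)
    return (m.group(1), int(m.group(2))) if m else None

def with_split(line):
    words = line.split()
    if not ("CPU" in words and "on" in words):
        return None
    return ' '.join(words[:words.index("on") + 1]), int(words[-1])

# Both approaches should agree before their timings are compared.
assert [with_regex(l) for l in LINES] == [with_split(l) for l in LINES]

for fn in (with_regex, with_split):
    seconds = timeit.timeit(lambda: [fn(l) for l in LINES], number=10000)
    print(fn.__name__, round(seconds, 4))
```

Running this against a representative sample of the real output would give a more meaningful comparison than the three sample lines used here.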


Output of the split()-based loop:

('TrnIq: Thread on', 37)
('TrnIq: Thread on', 44)
('IP-Thread on', 33)
('FANOUT Thread on', 37)
('Filter-Thread on', 51)
('TRN TMR Test 2 Supervisor Thread on', 34)
('HomographyWarp Traking Thread[0] on', 26)