Python: slicing a string after certain keywords mentioned in a list
I'm new to Python and have run into a problem. I have a string that contains a conversation between two people:
str = " dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
I want to create two lists from the string, using dylankid and senpai as the names:
dylankid = [ ]
senpai = [ ]
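This is where I'm struggling. In the dylankid list, I want to put all the words in the string that come after "dylankid" but before the next "dylankid" or "senpai".
The same goes for the senpai list.
The result would look like this: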
dylankid = ["random words", "random words", "random words"]
senpai = ["random words", "random words", "random words"]
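The dylankid list contains all the messages from dylankid, and likewise for senpai.
I have looked into slicing and into using split() and re.compile(), but I can't find a way to specify where the slice should start and where it should stop.
I hope this is clear enough. Any help would be appreciated :)

The following code creates a dict whose keys are the people and whose values are lists of messages: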
from collections import defaultdict
import re
PATTERN = r'''
\s* # Any amount of space
(dylankid|senpai) # Capture person
:\s # Colon and single space
(.*?) # Capture everything, non-greedy
(?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
'''
s = " dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
res = defaultdict(list)
for person, message in re.findall(PATTERN, s, re.VERBOSE):
    res[person].append(message)
print(res['dylankid'])
print(res['senpai'])
It produces the following output:
['*random words*', '*random words*']
['*random words*', '*random words*']
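If more than two people take part, the pattern can be built from a list of names instead of being typed out by hand. The following is only a sketch under that assumption: the helper name split_by_speaker and its names parameter are made up for illustration and are not part of the answer above.

```python
import re
from collections import defaultdict


def split_by_speaker(text, names):
    """Group messages by speaker; `names` is a hypothetical list of user names."""
    # Escape each name so characters such as '.' are matched literally.
    alternation = "|".join(re.escape(name) for name in names)
    pattern = re.compile(
        rf"""
        \s*                          # Any amount of leading space
        ({alternation})              # Capture the person
        :\s                          # Colon and a single space
        (.*?)                        # Capture the message, non-greedy
        (?=\s(?:{alternation}):|$)   # Stop before the next person or at the end
        """,
        re.VERBOSE,
    )
    result = defaultdict(list)
    for person, message in pattern.findall(text):
        result[person].append(message)
    return dict(result)


s = " dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
print(split_by_speaker(s, ["dylankid", "senpai"]))
```

As with the original pattern, a message that itself contains " dylankid:" or " senpai:" would be cut at that point.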
This could be tightened up, but it should be easy to extend to more usernames:
from collections import defaultdict
# Input string
all_messages = " dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
# Expected users
users = ['dylankid', 'senpai']
starts = {'{}:'.format(x) for x in users}
D = defaultdict(list)
results = defaultdict(list)
# Read through the words in the input string, collecting the ones that follow a user name
current_user = None
for word in all_messages.split(' '):
    if word in starts:
        # A user name starts a new message; drop the trailing colon
        current_user = word[:-1]
        D[current_user].append([])
    elif current_user:
        D[current_user][-1].append(word)
# Join the collected words into messages
for user, all_parts in D.items():
    for part in all_parts:
        results[user].append(' '.join(part))
The result is:
defaultdict(
<class 'list'>,
{'senpai': ['*random words*', '*random words*'],
'dylankid': ['*random words*', '*random words*']}
)
You can use groupby, with d.__contains__ as the key function:
s = "dylankid: *random words d* senpai: *random words s* dylankid: *random words d* senpai: *random words s*"
from itertools import groupby
d = {"dylankid:": [], "senpai:":[]}
grps = groupby(s.split(" "), d.__contains__)
for k, v in grps:
    if k:
        d[next(v)].append(" ".join(next(grps)[1]))
print(d)
Output:
{'dylankid:': ['*random words d*', '*random words d*'], 'senpai:': ['*random words s*', '*random words s*']}
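Each time we hit a name that is in the dict, we consume that name with next(v), then take the next group of words, which runs up to the next name, and use str.join to join it back into a single string.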
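To see why this works, it may help to look at what groupby actually yields. The short example below uses a made-up input string and a set called names in place of the dict; both are only for illustration.

```python
from itertools import groupby

s = "dylankid: hello there senpai: hi"
names = {"dylankid:", "senpai:"}

# groupby yields (key, group) pairs. The key is True for a run of names
# and False for a run of ordinary words, so names and messages alternate.
groups = [
    (is_name, list(words))
    for is_name, words in groupby(s.split(" "), names.__contains__)
]
print(groups)
# [(True, ['dylankid:']), (False, ['hello', 'there']), (True, ['senpai:']), (False, ['hi'])]
```

Because the pairs alternate, calling next(grps) right after a True group returns the words that belong to that name.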
If a name might have no words after it, you can pass an empty list as the default value to the next call:
s = "dylankid: *random words d* senpai: *random words s* dylankid: *random words d* senpai: *random words s* senpai:"
from itertools import groupby
d = {"dylankid:": [], "senpai:":[]}
grps = groupby(s.split(" "), d.__contains__)
for k, v in grps:
    if k:
        d[next(v)].append(" ".join(next(grps,[[], []])[1]))
print(d)
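For reference, here is the same idea wrapped in a function so the effect of the default is easy to check. The function name group_messages and the short input are assumptions made for this sketch. A name with nothing after it ends up with an empty string as its message.

```python
from itertools import groupby


def group_messages(s, names):
    """Group words by the preceding name; `names` are given without the colon."""
    d = {name + ":": [] for name in names}
    grps = groupby(s.split(" "), d.__contains__)
    for is_name, group in grps:
        if is_name:
            name = next(group)
            # If the string ends with a name, there is no following group,
            # so (False, []) stands in for the missing words.
            _, words = next(grps, (False, []))
            d[name].append(" ".join(words))
    return d


print(group_messages("dylankid: hi there senpai:", ["dylankid", "senpai"]))
# {'dylankid:': ['hi there'], 'senpai:': ['']}
```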
Some timings on a larger string:
In [15]: dy, sn = "dylankid:", " senpai:"
In [16]: t = " foo " * 1000
In [17]: s = "".join([dy + t + sn + t for _ in range(1000)])
In [18]: %%timeit
....: d = {"dylankid:": [], "senpai:": []}
....: grps = groupby(s.split(" "), d.__contains__)
....: for k, v in grps:
....:     if k:
....:         d[next(v)].append(" ".join(next(grps, [[], []])[1]))
....:
1 loop, best of 3: 376 ms per loop
In [19]: %%timeit
....: PATTERN = '''
....: \s* # Any amount of space
....: (dylankid|senpai) # Capture person
....: :\s # Colon and single space
....: (.*?) # Capture everything, non-greedy
....: (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
....: '''
....: res = defaultdict(list)
....: for person, message in re.findall(PATTERN, s, re.VERBOSE):
....:     res[person].append(message)
....:
1 loop, best of 3: 753 ms per loop
Both return the same output:
In [20]: d = {"dylankid:": [], "senpai:": []}
In [21]: grps = groupby(s.split(" "), d.__contains__)
In [22]: for k, v in grps:
....:     if k:
....:         d[next(v)].append(" ".join(next(grps, [[], []])[1]))
....:
In [23]: PATTERN = '''
....: \s* # Any amount of space
....: (dylankid|senpai) # Capture person
....: :\s # Colon and single space
....: (.*?) # Capture everything, non-greedy
....: (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
....: '''
In [24]: res = defaultdict(list)
In [25]: for person, message in re.findall(PATTERN, s, re.VERBOSE):
....:     res[person].append(message)
....:
In [26]: d["dylankid:"] == res["dylankid"]
Out[26]: True
In [27]: d["senpai:"] == res["senpai"]
Out[27]: True
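Outside IPython, a comparison along these lines can be run with the timeit module. This is only a sketch: the function names with_groupby and with_regex are made up here, the input is much smaller than in the session above, and the numbers will differ from machine to machine, so none are shown.

```python
import re
import timeit
from collections import defaultdict
from itertools import groupby

PATTERN = r'''
    \s*                          # Any amount of space
    (dylankid|senpai)            # Capture person
    :\s                          # Colon and single space
    (.*?)                        # Capture everything, non-greedy
    (?=\sdylankid:|\ssenpai:|$)  # Until the following person or end of string
'''


def with_groupby(s):
    d = {"dylankid:": [], "senpai:": []}
    grps = groupby(s.split(" "), d.__contains__)
    for k, v in grps:
        if k:
            d[next(v)].append(" ".join(next(grps, [[], []])[1]))
    return d


def with_regex(s):
    res = defaultdict(list)
    for person, message in re.findall(PATTERN, s, re.VERBOSE):
        res[person].append(message)
    return res


# Same construction as in the session, scaled down to 100 x 100
t = " foo " * 100
s = "".join(["dylankid:" + t + " senpai:" + t for _ in range(100)])

# Both approaches should agree before their timings are compared
assert with_groupby(s)["dylankid:"] == with_regex(s)["dylankid"]
assert with_groupby(s)["senpai:"] == with_regex(s)["senpai"]

for fn in (with_groupby, with_regex):
    best = min(timeit.repeat(lambda: fn(s), number=1, repeat=3))
    print("{}: {:.4f} s".format(fn.__name__, best))
```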
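You may find that function useful, since it splits a string after a separator you specify.
Are there any random words that contain ":"?
@padraickenningham Yes, there is an error related to the names:
@padraickenningham No, there are not.
Two remarks: 1- re.finditer might be better... 2- it would be better to compile the pattern as pat and then use pat.finditer instead.
@IronFist: re.finditer could save memory when there are a lot of matches, but what would be the motivation for using re.compile in this particular scenario?
For the same reason I would use re.finditer: a performance improvement. But I guess that doesn't seem to be what the OP is interested in.
@IronFist: How would re.compile improve performance here? The documentation states: "it is more efficient to save the resulting regular expression object for reuse when the expression will be used several times in a single program". But in this case there is only one expression, and it is used only once.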
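For anyone curious what the variant discussed in these comments would look like, here is a minimal sketch with a compiled pattern and finditer. The name PAT is an assumption. As the last comment points out, compiling is unlikely to matter when the pattern is used once, and the re module also caches recently compiled patterns, so treat this as a matter of style rather than a measured speed-up.

```python
import re
from collections import defaultdict

# Compile once; PAT can then be reused for any number of strings.
PAT = re.compile(r'''
    \s*                          # Any amount of space
    (dylankid|senpai)            # Capture person
    :\s                          # Colon and single space
    (.*?)                        # Capture everything, non-greedy
    (?=\sdylankid:|\ssenpai:|$)  # Until the following person or end of string
''', re.VERBOSE)

s = " dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
res = defaultdict(list)
# finditer yields match objects one at a time instead of building a full list
for match in PAT.finditer(s):
    person, message = match.groups()
    res[person].append(message)
print(dict(res))
```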