Python: slicing a string after certain keywords mentioned in a list

python string list

I'm new to Python and have run into a problem. I have a string that contains a conversation between two people:

str = "  dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"

I want to create two lists from the string, using dylankid and senpai as their names:

dylankid = [ ]
senpai = [ ]

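This is where I'm struggling. In the dylankid list, I want to put all the words that come after "dylankid" in the string but before the next "dylankid" or "senpai". The same goes for the senpai list. It would look something like this: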

dylankid = ["random words", "random words", "random words"]
senpai = ["random words", "random words", "random words"]

The dylankid list contains all the messages from dylankid, and likewise for senpai.

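I have looked into slicing the string and into using split() and re.compile(), but I couldn't find a way to specify where the slice should start and where it should stop.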

I hope that's clear enough. Any help would be greatly appreciated :)

The following code creates a dict in which the keys are the people and the values are lists of their messages:

from collections import defaultdict
import re

PATTERN = r'''
    \s*                         # Any amount of space
    (dylankid|senpai)           # Capture person
    :\s                         # Colon and single space
    (.*?)                       # Capture everything, non-greedy
    (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
'''
s = "  dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
res = defaultdict(list)
for person, message in re.findall(PATTERN, s, re.VERBOSE):
    res[person].append(message)

print(res['dylankid'])
print(res['senpai'])

It produces the following output:

['*random words*', '*random words*']
['*random words*', '*random words*']
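If there are more than two speakers, the pattern above could be built from a list of names instead of being written out by hand. The following is only a sketch of that idea: the helper name split_by_speaker and its users parameter are made up for illustration, and it assumes that every message is introduced by a name followed by a colon and a space, as in the question.

```python
import re
from collections import defaultdict


def split_by_speaker(text, users):
    """Group messages by speaker (hypothetical helper, not from the answer)."""
    # Escape each name so that characters such as '.' are matched literally.
    names = "|".join(re.escape(user) for user in users)
    # Same structure as PATTERN above, with the names filled in.
    pattern = re.compile(
        r"\s*({names}):\s(.*?)(?=\s(?:{names}):|$)".format(names=names)
    )
    messages = defaultdict(list)
    for person, message in pattern.findall(text):
        messages[person].append(message)
    return dict(messages)


s = "  dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
result = split_by_speaker(s, ["dylankid", "senpai"])
print(result["dylankid"])  # ['*random words*', '*random words*']
print(result["senpai"])    # ['*random words*', '*random words*']
```

Adding a third speaker then only means adding another name to the list passed in.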

The following could be tightened up, but it should be easy to extend to more user names:

from collections import defaultdict

# Input string
all_messages = "  dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"

# Expected users
users = ['dylankid', 'senpai']

starts = {'{}:'.format(x) for x in users}
D = defaultdict(list)
results = defaultdict(list)

# Read through the words in the input string, collecting the ones that follow a user name
current_user = None
for word in all_messages.split(' '):
    if word in starts:
        current_user = word[:-1]
        D[current_user].append([])
    elif current_user:
        D[current_user][-1].append(word)

# Join the collected words into messages
for user, all_parts in D.items():
    for part in all_parts:
        results[user].append(' '.join(part))

The result is:

defaultdict(
    <class 'list'>,
    {'senpai': ['*random words*', '*random words*'],
    'dylankid': ['*random words*', '*random words*']}
)

You can use groupby, with the dict's __contains__ method as the key function:

s = "dylankid: *random words d* senpai: *random words s* dylankid: *random words d* senpai: *random words s*"
from itertools import groupby

d = {"dylankid:": [], "senpai:":[]}

grps = groupby(s.split(" "), d.__contains__)

for k, v in grps:
    if k:
        d[next(v)].append(" ".join(next(grps)[1]))
print(d)

Output:

{'dylankid:': ['*random words d*', '*random words d*'], 'senpai:': ['*random words s*', '*random words s*']}

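Each time we hit a name that is in the dict, we consume that name with next(v), then use str.join to join the next group of words, which runs up to the following name.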
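To see why this works, it may help to look at what groupby actually yields for the split string. The snippet below is a small illustration with a shortened, made-up input and a plain set of names in place of the dict. Consecutive tokens that share the same key (name or not a name) end up in the same group, so name groups and message groups alternate.

```python
from itertools import groupby

s = "dylankid: hello there senpai: hi"
names = {"dylankid:", "senpai:"}

# Each run of consecutive tokens with the same key becomes one group.
groups = [
    (is_name, list(tokens))
    for is_name, tokens in groupby(s.split(" "), names.__contains__)
]
for is_name, tokens in groups:
    print(is_name, tokens)

# True ['dylankid:']
# False ['hello', 'there']
# True ['senpai:']
# False ['hi']
```

This is why the loop in the answer calls next(grps) right after seeing a True group: the group that follows holds the words of that person's message.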

If a name happens to have no words after it, you can pass empty lists as the default value to the next call:

s = "dylankid: *random words d* senpai: *random words s* dylankid: *random words d* senpai: *random words s* senpai:"
from itertools import groupby

d = {"dylankid:": [], "senpai:":[]}
grps = groupby(s.split(" "), d.__contains__)

for k, v in grps:
    if k:
        d[next(v)].append(" ".join(next(grps,[[], []])[1]))
print(d)

Some timings on a larger string:

In [15]: dy, sn = "dylankid:", " senpai:"

In [16]: t = " foo " * 1000

In [17]: s = "".join([dy + t + sn + t for _ in range(1000)])

In [18]: %%timeit
   ....: d = {"dylankid:": [], "senpai:": []}
   ....: grps = groupby(s.split(" "), d.__contains__)
   ....: for k, v in grps:
   ....:     if k:
   ....:         d[next(v)].append(" ".join(next(grps, [[], []])[1]))
   ....: 
1 loop, best of 3: 376 ms per loop

In [19]: %%timeit
   ....: PATTERN = '''
   ....:     \s*                         # Any amount of space
   ....:     (dylankid|senpai)           # Capture person
   ....:     :\s                         # Colon and single space
   ....:     (.*?)                       # Capture everything, non-greedy
   ....:     (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
   ....: '''
   ....: res = defaultdict(list)
   ....: for person, message in re.findall(PATTERN, s, re.VERBOSE):
   ....:     res[person].append(message)
   ....: 
1 loop, best of 3: 753 ms per loop

Both return the same output:

In [20]: d = {"dylankid:": [], "senpai:": []}

In [21]: grps = groupby(s.split(" "), d.__contains__)

In [22]: for k, v in grps:
   ....:     if k:
   ....:         d[next(v)].append(" ".join(next(grps, [[], []])[1]))
   ....:

In [23]: PATTERN = '''
   ....:     \s*                         # Any amount of space
   ....:     (dylankid|senpai)           # Capture person
   ....:     :\s                         # Colon and single space
   ....:     (.*?)                       # Capture everything, non-greedy
   ....:     (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
   ....: '''

In [24]: res = defaultdict(list)

In [25]: for person, message in re.findall(PATTERN, s, re.VERBOSE):
   ....:         res[person].append(message)
   ....:     

In [26]: d["dylankid:"] == res["dylankid"]
Out[26]: True

In [27]: d["senpai:"] == res["senpai"]
Out[27]: True

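You may find this function useful, since it can split a string after a delimiter that you specify.

Are there any random words that contain a : ?

@padraickenningham Yes, there is one, associated with the name:

@padraickenningham No, there aren't.

Two comments: 1- re.finditer would probably be better... 2- it would be best to compile the pattern into pat and then use pat.finditer instead.

@IronFist: re.finditer can save memory when there are a lot of matches, but what is the motivation for using re.compile in this particular scenario?

For the same reason I would use re.finditer... a performance improvement, but I guess that doesn't seem to be what the OP is interested in.

@IronFist: How would re.compile improve performance here? The documentation states: "saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program". But in this case there is only one expression, and it is used only once.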
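For reference, the variant suggested in the comments might look roughly like the sketch below. It reuses the pattern from the regex answer, compiles it once into pat (the name used in the comments), and iterates with pat.finditer. As the last comment points out, compiling is unlikely to matter much when the pattern is used only once, so this is mainly a matter of style. finditer does avoid building the full list of matches up front.

```python
import re
from collections import defaultdict

# Same pattern as in the regex answer, compiled once.
pat = re.compile(r'''
    \s*                         # Any amount of space
    (dylankid|senpai)           # Capture person
    :\s                         # Colon and single space
    (.*?)                       # Capture everything, non-greedy
    (?=\sdylankid:|\ssenpai:|$) # Until we find following person or end of string
''', re.VERBOSE)

s = "  dylankid: *random words* senpai: *random words* dylankid: *random words* senpai: *random words*"
res = defaultdict(list)

# finditer yields match objects one at a time instead of a full list.
for match in pat.finditer(s):
    person, message = match.groups()
    res[person].append(message)

print(res['dylankid'])  # ['*random words*', '*random words*']
print(res['senpai'])    # ['*random words*', '*random words*']
```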