Python: find and replace patterns in dictionary values (lists of strings)


I have a dictionary of key:value pairs in which each value is a list of strings:

dictionarylst = {0:["example inside some sentence", "something else", "some blah"], 1:["testing", "some other word"], 2:["a new expression", "my cat is cute"]}
I also have a list of words, which can be single tokens or bigrams:

wordslist = ["expression 1", "my expression", "other", "blah"]
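I am trying to match every word in my word list against every text in every value of the dictionary. When there is a match, I want to replace just the matched pattern with a single space (keeping the rest of the text) and store the output in a new dictionary under the same key.

This is what I have tried so far: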

dictionarycleaned = {}
for key,value in dictionarylst.items():
    for text in value :
        for word in wordslist :
            if word in value :
                pattern = re.compile(r'\b({})\b'.format(word))
                matches = re.findall(pattern, text)
                dictionarycleaned[key] = [re.sub(i,' ', text) for i in matches]
            else :
                dictionarycleaned[key] = value
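This only matches a small fraction of the patterns in my word list. I have tried different variants, such as matching the pattern against the whole list of strings in each value, or iterating over wordslist before dictionarylst, but nothing seems to clean all of my data (the data set is very large).

Thanks for your suggestions.

Try this: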

import re
import pprint

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]

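# Copy each list as well, so dictionarylst itself is left untouched
# (dict.copy() alone would share the inner lists).
dictionarycleaned = {key: list(value) for key, value in dictionarylst.items()}
for key, value in dictionarylst.items():
    for n, text in enumerate(value):
        for word in wordslist:
            # Substitute on the running result so that several words
            # matching the same text are all replaced.
            text = re.sub(r"\b{}\b".format(re.escape(word)), " ", text)
        dictionarycleaned[key][n] = text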

pprint.pprint(dictionarycleaned)
Output:

pako@b00s:~/tests$ python dict.py 
{0: ['example inside some sentence', 'something else', 'some  '],
 1: ['testing', 'some   word'],
 2: ['a new expression', 'my cat is cute']}

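Since this is a plain string replacement, and provided the words in wordslist cannot contain a double quote ("), you can simply build a JSON string from the dict, do the replacement on that string, and rebuild the dict from the modified JSON string.

A sample program is given below: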

import json

d = {0:["example inside some sentence", "something else", "some blah"], 1:["testing", "some other word"], 2:["a new expression", "my cat is cute"]}
words = ["expression 1", "my expression", "other", "blah"]

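json_str = json.dumps(d)
for w in words:
    # Replace on the serialized JSON text and keep the result.
    json_str = json_str.replace(w, " ")

# JSON object keys are always strings, so the keys come back as "0", "1", "2".
req_dict = json.loads(json_str)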
This way you can get rid of the nested loops.
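One thing to watch with the JSON round trip is that the integer keys do not survive it. Below is a minimal sketch of restoring them, assuming every key in the original dict is an int; the helper name clean_via_json is made up for illustration. Note that this is still a plain substring replacement with no word-boundary check.

```python
import json

def clean_via_json(d, words):
    # Hypothetical helper: serialize, replace each word, then parse back.
    json_str = json.dumps(d)
    for w in words:
        json_str = json_str.replace(w, " ")
    # json.loads returns string keys; convert them back,
    # assuming all keys were ints to begin with.
    return {int(k): v for k, v in json.loads(json_str).items()}

d = {0: ["example inside some sentence", "something else", "some blah"],
     1: ["testing", "some other word"],
     2: ["a new expression", "my cat is cute"]}
words = ["expression 1", "my expression", "other", "blah"]

req_dict = clean_via_json(d, words)
print(req_dict)
```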

  • replace()
    is a built-in string method in Python that returns a copy of the string in which every occurrence of one substring is replaced with another substring.
Example:

dictionarylst = {0: ["example inside some sentence", "something else", "some blah"],
                 1: ["testing", "some other word"],
                 2: ["a new expression", "my cat is cute"]}

wordslist = ["expression 1", "my expression", "other", "blah"]
dictionarycleaned = {}

def match_pattern(wordslist,value):
    new_list = []
    for text in value:
        # temp holds the latest updated text
        temp = text
        for word in wordslist:
            if word in text:
                # remove the word from the running result (replaced with an empty string)
                temp = temp.replace(word,"")
        new_list.append(temp)
    return new_list


for k,v in dictionarylst.items():
    dictionarycleaned[k] = match_pattern(wordslist, v)

print(dictionarycleaned)
Output:

{0: ['example inside some sentence', 'something else', 'some '], 1: ['testing', 'some  word'], 2: ['a new expression', 'my cat is cute']}
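A caveat with str.replace is that it matches substrings anywhere, including inside longer words. The sketch below contrasts it with a word-boundary regex; the sample sentence is invented for illustration and is not from the question's data.

```python
import re

text = "another word and other words"  # illustrative input, not from the question
word = "other"

# Plain substring replacement also removes the "other" inside "another".
plain = text.replace(word, "")

# A word-boundary pattern only removes the standalone word.
# re.escape guards against regex metacharacters in the word.
bounded = re.sub(r"\b{}\b".format(re.escape(word)), "", text)

print(repr(plain))
print(repr(bounded))
```

If whole-word matching matters, as it does for the asker, the regex form is the safer choice.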

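Pako's answer is good, but you can optimize it further:

  • use compiled regular expressions to do the replacement
  • there is no need to copy the dictionary: just build new lists for the values

Full code: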

import re
import pprint

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
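wordslist = ["expression 1", "my expression", "other", "blah"]
# Compile each word once, escaped, with a word boundary on both sides.
regexs = [re.compile(r"\b{}\b".format(re.escape(word))) for word in wordslist]

dictionarycleaned = {}
for key, value in dictionarylst.items():
    cleaned = []
    for text in value:
        # Apply every pattern in turn to the same text.
        for regex in regexs:
            text = regex.sub(" ", text)
        cleaned.append(text)
    dictionarycleaned[key] = cleaned

pprint.pprint(dictionarycleaned)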
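If the word list is long, the per-word patterns could also be merged into one alternation so that each text is scanned only once. This is a sketch under that assumption, not something the answers above propose; build_pattern and clean_dict are illustrative names.

```python
import re

def build_pattern(words):
    # Hypothetical helper: try longer phrases first so they win over
    # shorter words that start at the same position.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    return re.compile(r"\b(?:{})\b".format(alternation))

def clean_dict(d, words):
    # With no words there is nothing to replace; return copies of the lists.
    if not words:
        return {key: list(texts) for key, texts in d.items()}
    pattern = build_pattern(words)
    return {key: [pattern.sub(" ", text) for text in texts]
            for key, texts in d.items()}

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]

dictionarycleaned = clean_dict(dictionarylst, wordslist)
print(dictionarycleaned)
```

The input dictionary is not modified, since the comprehension builds new lists for every key.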

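What is your expected output?
The expected output is a dictionary just like the input, but with the text cleaned (hence dictionarycleaned = {} in the code).
I tried this with re.sub and it works very well (I switched to re.sub because I need to match word boundaries). Thank you very much.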