Python: replacing string values in a DataFrame using a dictionary


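I have a DataFrame with a column named "cleaned_tweet". This column contains tweets with several abbreviations, and I want to replace those abbreviations with the proper English words. To do this, I prepared a dictionary named slangs, where the abbreviations are the keys and the desired English words or phrases are the values, and I want to replace every occurrence of these abbreviations with the values from the dictionary. I have looked at several other solutions on Stack Overflow, but none of them seem to work. Here is what I tried. I am using a nested for loop, and I believe I am very close to a solution, but I am doing something wrong that I can't figure out.

Here is the nested loop: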

for i in range(len(train_test_set)):
    for j in slangs:
        train_test_set['cleaned_tweet'][i] = train_test_set['cleaned_tweet'][i].replace(j, slangs[j])
When I run this code and then run print(train_test_set['cleaned_tweet'][0]), I get unexpected output like the following:

"#mopanthank whyour | hi | years oldwhyour | hi | years oldhesitationospecial editekissas insekissperience wall hacken whyour | hi | years oldunited statesing a hallwhyour | hi | years olducinogenic drwhyour | hi | years olduglwhyour | hi | years oldung ladye rainbowhwhy | would whyour | hi | years olduohesitationents | rapper from atalk later | ekissperience wall hacken whyour | hi | years oldunited statesing a hallwhyour | hi | years olducinogenic drwhyour | hi | years olduglwhyour | hi | years oldung ladye rainbowhwhy | would whyour | hi | years olduoue loversatileionwhyes | yeah | yes | your | hi | years oldu | team leaderantaonwhysomethingop it | somethingwhyour | hi | years oldupid idiotake careal edwhyour | hi | years olducatekissas insekissperience wall hacken whyour | hi | years oldunited statesing a hallwhyour | hi | years olducinogenic drwhyour | hi | years olduglwhyour | hi | years oldung ladye..."
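It looks like many unwanted values are being appended to the cell. The output is very large, so I can't copy all of it here. Below is the structure of my dataset and dictionary before running the code:


Can anyone tell me what I'm doing wrong?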

You can try using a dictionary together with the map() function. Something like this:

slangs = {'abbr1': 'word1', .........}
train_test_set['cleaned_tweet'] = train_test_set['cleaned_tweet'].map(slangs)
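
One caveat worth illustrating: Series.map looks up each whole cell value in the dictionary, so it fits cells that hold a single abbreviation. The following is a minimal sketch with made-up data (the abbreviations, the tweets, and the helper name expand are assumptions for illustration) that contrasts whole-cell mapping with a token-by-token lookup for tweets containing several words:

```python
import pandas as pd

# Hypothetical abbreviations and tweets, for illustration only.
slangs = {"u": "you", "gr8": "great"}
tweets = pd.Series(["u", "u are gr8"])

# Series.map looks up each whole cell value in the dictionary;
# cells that are not dictionary keys become NaN.
whole_cell = tweets.map(slangs)

def expand(text):
    """Replace each whitespace-separated token that is a known abbreviation."""
    return " ".join(slangs.get(token, token) for token in text.split())

# Token-wise lookup keeps words that are not in the dictionary.
per_token = tweets.apply(expand)
print(per_token.tolist())
```

Splitting on whitespace is a simplification: tokens with attached punctuation, such as "gr8!", would not be found in the dictionary.
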
If the same word has several abbreviations, you can define the dictionary with the words as keys and lists of the corresponding abbreviations as values. You can then swap the keys and values and follow the same approach. Something like this:

# define the dictionary with the words as the keys and the lists of the respective abbreviations as the values
slangs = {'word1': ['abbr11', 'abbr12', ....], 'word2': ['abbr21', 'abbr22',..]}
#swap keys in slangs: http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in slangs.items() for k in oldv}
train_test_set['cleaned_tweet'] = train_test_set['cleaned_tweet'].map(d)
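
The key swap can be tried on its own with a small runnable sketch. The words, abbreviations, and cell values below are made up for illustration, and the fallback to the original value for unknown cells is an added assumption, not part of the answer above:

```python
import pandas as pd

# Hypothetical data: each word maps to a list of its abbreviations.
slangs = {"you": ["u", "ya"], "great": ["gr8", "grt"]}

# Invert the mapping so that each abbreviation points to its word.
d = {abbr: word for word, abbrs in slangs.items() for abbr in abbrs}

cells = pd.Series(["u", "grt", "hello"])

# Look up each cell in the inverted dictionary; keep unknown values unchanged.
mapped = cells.map(lambda value: d.get(value, value))
print(mapped.tolist())
```
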

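I suggest using the Series.str.replace method, which accepts a callable as the replacement argument.

First, define a dictionary whose keys are the search expressions and whose values are the replacement text:

slangs = { 'lng1': 'val1', 'lng2': 'val2' }
Then use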

rx = r'\b(?:{})\b'.format("|".join(slangs.keys()))
# regex=True is needed for a callable replacement in recent pandas versions
train_test_set['cleaned_tweet'] = train_test_set['cleaned_tweet'].str.replace(rx, lambda x: slangs[x.group()], regex=True)
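Here, rx is a dynamically built regular expression of the form \b(?:abc|def|ghi|...)\b, where \b is a word boundary. This works if your search terms consist of letters, digits, or underscores; other kinds of terms need a differently built pattern to cover more scenarios. Once a match is found, it is passed to the lambda expression, lambda x: slangs[x.group()], which returns the dictionary value for the matched key.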
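
To see why the word boundaries matter, here is a minimal sketch with made-up data (the abbreviations and the tweet are assumptions for illustration). It contrasts plain substring replacement, as in the nested loop from the question, with the word-boundary regex. Escaping the keys with re.escape and sorting them longest first are extra precautions added here, not part of the answer above:

```python
import re
import pandas as pd

# Hypothetical abbreviations and tweet, for illustration only.
slangs = {"u": "you", "yr": "your", "gr8": "great"}
tweets = pd.Series(["u look gr8 in yr ugly sweater"])

# Naive substring replacement also rewrites the "u" inside "ugly".
naive = tweets[0]
for abbr, word in slangs.items():
    naive = naive.replace(abbr, word)

# Word-boundary pattern: escape each key and try longer keys first.
keys = sorted(slangs, key=len, reverse=True)
rx = r"\b(?:{})\b".format("|".join(re.escape(k) for k in keys))

# The callable receives a match object and returns the replacement text.
fixed = tweets.str.replace(rx, lambda m: slangs[m.group()], regex=True)
print(naive)
print(fixed[0])
```
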

If the dictionary has thousands of entries, consider building a regex trie.
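
The idea behind a regex trie is to merge keys that share a prefix so the regex engine does not have to try every alternative separately. The following is a hand-rolled sketch of that idea, not the tool the answer originally pointed to; the function names build_trie and trie_to_pattern and the sample data are assumptions for illustration:

```python
import re

def build_trie(words):
    """Store words in a nested dict; the empty-string key marks a word end."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_pattern(node):
    """Turn a trie node into a regex fragment that shares common prefixes."""
    ends_here = "" in node
    children = sorted(key for key in node if key != "")
    if not children:
        return ""
    alternatives = [re.escape(ch) + trie_to_pattern(node[ch]) for ch in children]
    if len(alternatives) == 1 and not ends_here:
        return alternatives[0]
    body = "(?:" + "|".join(alternatives) + ")"
    # If a word also ends at this node, the rest is optional.
    return body + "?" if ends_here else body

# Hypothetical abbreviations, for illustration only.
slangs = {"lol": "laughing out loud", "lmk": "let me know", "u": "you", "ur": "your"}

fragment = trie_to_pattern(build_trie(slangs))
pattern = re.compile(r"\b" + fragment + r"\b")
result = pattern.sub(lambda m: slangs[m.group()], "lmk if ur ok lol")
print(fragment)
print(result)
```

The compiled pattern can be passed to Series.str.replace together with the same lambda and regex=True.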