Python: Replacing string values in a DataFrame using a dictionary

python regex python-3.x loops dataframe

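I have a DataFrame with a column named "cleaned_tweet". This column contains tweets with a number of abbreviations, and I want to replace those abbreviations with the proper English words. To do this, I prepared a dictionary named "slangs", where each abbreviation is a key and the desired English word or phrase is the value, and I want to replace every occurrence of these abbreviations with the values from the dictionary. I have already looked at several other solutions on Stack Overflow, but none of them seem to work. Below is what I tried. I am using a nested for loop, and I believe I am very close to a solution, but I am doing something wrong that I cannot figure out.

Here is the nested loop: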

for i in range(len(train_test_set)):
    for j in slangs:
        train_test_set['cleaned_tweet'][i] = train_test_set['cleaned_tweet'][i].replace(j, slangs[j])

When I run this code and print

print(train_test_set['cleaned_tweet'][0])

I get unexpected output like this:

"#mopanthank whyour | hi | years oldwhyour | hi | years oldhesitationospecial editekissas insekissperience wall hacken whyour | hi | years oldunited statesing a hallwhyour | hi | years olducinogenic drwhyour | hi | years olduglwhyour | hi | years oldung ladye rainbowhwhy | would whyour | hi | years olduohesitationents | rapper from atalk later | ekissperience wall hacken whyour | hi | years oldunited statesing a hallwhyour | hi | years olducinogenic drwhyour | hi | years olduglwhyour | hi | years oldung ladye rainbowhwhy | would whyour | hi | years olduoue loversatileionwhyes | yeah | yes | your | hi | years oldu | team leaderantaonwhysomethingop it | somethingwhyour | hi | years oldupid idiotake careal edwhyour | hi | years olducatekissas insekissperience wall hacken whyour | hi | years oldunited statesing a hallwhyour | hi | years olducinogenic drwhyour | hi | years olduglwhyour | hi | years oldung ladye..."

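It looks like many unwanted values are being appended to the cell. The output is very large, so I cannot copy all of it here. Here is the structure of my dataset and dictionary before running the code:

Can anyone tell me what I am doing wrong?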

You could try using a dictionary with the map() function. Something like this:

slangs = {'abbr1': 'word1', .........}
train_test_set['cleaned_tweet'] = train_test_set['cleaned_tweet'].map(slangs)
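
One caveat may be worth checking before relying on this: as far as I know, Series.map with a dictionary looks up each whole cell value, so a tweet that is not itself a key comes back as NaN. The following is a minimal sketch, not the answerer's code; the sample tweets, the dictionary entries, and the helper name expand are all made up for illustration. It shows that behaviour next to a word-by-word variant:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
slangs = {"u": "you", "gr8": "great"}
train_test_set = pd.DataFrame({"cleaned_tweet": ["u r gr8", "gr8"]})

# Series.map with a dict looks up each whole cell value;
# cells that are not keys of the dict become NaN
whole_cell = train_test_set["cleaned_tweet"].map(slangs)

# Word-by-word variant (hypothetical helper): split on whitespace,
# look up each token, and keep unknown tokens unchanged
def expand(tweet):
    return " ".join(slangs.get(token, token) for token in tweet.split())

word_level = train_test_set["cleaned_tweet"].map(expand)
print(whole_cell.tolist())
print(word_level.tolist())
```

The word-by-word variant only recognises abbreviations separated by whitespace, so tokens with attached punctuation (for example "gr8!") are left as they are.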

If the same word has several abbreviations, you could define the dictionary with the words as keys and lists of the corresponding abbreviations as values. You can then swap the keys and values and follow the same approach. Something like this:

# Define the dictionary with the words as keys and lists of their abbreviations as values
slangs = {'word1': ['abbr11', 'abbr12', ....], 'word2': ['abbr21', 'abbr22',..]}
# Swap keys and values in slangs: http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in slangs.items() for k in oldv}
# Map with the inverted dictionary d (abbreviation -> word), not with slangs itself
train_test_set['cleaned_tweet'] = train_test_set['cleaned_tweet'].map(d)
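
To make the inversion step concrete, here is a small sketch with invented entries (the words and abbreviations are assumptions, not taken from the asker's dictionary). Each abbreviation in a list becomes its own key that points back at the full word:

```python
# Hypothetical entries: each word maps to a list of its abbreviations
slangs = {"you": ["u", "ya"], "great": ["gr8", "grt"]}

# Invert the mapping so every abbreviation is a key pointing at its word
d = {abbr: word for word, abbrs in slangs.items() for abbr in abbrs}
print(d)
```

If one abbreviation is listed under two different words, the entry processed last overwrites the earlier one, so the lists should not overlap.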

I suggest using the Series.str.replace method, which accepts a callable as the replacement argument.

First, define a dictionary where the keys are the search expressions and the values are the replacement texts:

slangs = { 'lng1': 'val1', 'lng2': 'val2' }

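Then use:

# Join the keys into a single alternation wrapped in word boundaries
rx = r'\b(?:{})\b'.format("|".join(slangs.keys()))
# regex=True is needed for a callable replacement; recent pandas versions default to literal matching
train_test_set['cleaned_tweet'] = train_test_set['cleaned_tweet'].str.replace(rx, lambda x: slangs[x.group()], regex=True)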

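Here, rx is a dynamically built regex of the form \b(?:abc|def|ghi|...)\b, where \b is a word boundary. This works if your search terms consist only of letters, digits, or underscores; other kinds of search terms need a differently built dynamic pattern. Once a match is found, it is passed to the lambda expression, lambda x: slangs[x.group()], which returns the dictionary value for the matched key.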
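
The difference from the nested loop in the question can be sketched as follows. This is an illustration under assumptions: the tweets and dictionary entries are invented, and re.escape plus longest-first ordering of the keys are my additions as a precaution, not part of the answer above. Plain str.replace also rewrites an abbreviation when it occurs inside a longer word, which is the likely reason for the garbled output, while the word-boundary regex touches only whole words:

```python
import re
import pandas as pd

# Hypothetical sample data, for illustration only
slangs = {"u": "you", "gr8": "great", "thx": "thanks"}
train_test_set = pd.DataFrame({"cleaned_tweet": ["thx u r gr8", "such fun"]})

# Substring replacement, as in the nested loop: "u" is also
# replaced inside "such" and "fun"
naive = train_test_set["cleaned_tweet"].copy()
for abbr, word in slangs.items():
    naive = naive.str.replace(abbr, word, regex=False)

# Whole-word replacement: escape the keys and try longer keys first
keys = sorted(slangs, key=len, reverse=True)
rx = r"\b(?:{})\b".format("|".join(re.escape(k) for k in keys))
fixed = train_test_set["cleaned_tweet"].str.replace(
    rx, lambda m: slangs[m.group()], regex=True
)

print(naive.tolist())
print(fixed.tolist())
```

Because every cell is processed in one vectorised call, this also avoids the row-by-row assignment train_test_set['cleaned_tweet'][i] = ... used in the loop.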

If you have thousands of dictionary entries, consider building a regex trie from the keys.
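
A regex trie merges keys that share a prefix so the regex engine does not have to try thousands of separate alternatives one by one. The sketch below is my own minimal version and not the solution the answer refers to; the function names build_trie, trie_to_pattern, and expand, and the sample dictionary, are hypothetical:

```python
import re

def build_trie(words):
    """Nested dicts keyed by character; the "" key marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_pattern(node):
    """Return a regex fragment matching every suffix stored below node."""
    ends_here = "" in node
    alternatives = [
        re.escape(ch) + trie_to_pattern(node[ch])
        for ch in sorted(node)
        if ch != ""
    ]
    if not alternatives:
        return ""
    if len(alternatives) == 1 and not ends_here:
        return alternatives[0]
    group = "(?:" + "|".join(alternatives) + ")"
    # A word may end at this node, so the remaining suffix is optional
    return group + "?" if ends_here else group

# Hypothetical entries, for illustration only
slangs = {"u": "you", "ur": "your", "lol": "laughing out loud", "l8r": "later"}
rx = re.compile(r"\b" + trie_to_pattern(build_trie(slangs)) + r"\b")

def expand(tweet):
    return rx.sub(lambda m: slangs[m.group()], tweet)

print(expand("lol ur l8r than u"))
```

The pattern string is available as rx.pattern, so it should be usable with Series.str.replace in the same way as the hand-built alternation above.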