Replace strings from similar categories in a column and map them to a new column in Python


I have an existing DataFrame (coffee_directions_df) that looks like this:

coffee_directions_df

Utterance                         Frequency   

Directions to Starbucks           1045
Directions to Tullys              1034
Give me directions to Tullys      986
Directions to Seattles Best       875
Show me directions to Dunkin      812
Directions to Daily Dozen         789
Show me directions to Starbucks   754
Give me directions to Dunkin      612
Navigate me to Seattles Best      498
Display navigation to Starbucks   376
Direct me to Starbucks            201
The DataFrame shows utterances people made and how often each one occurred.

For example, "Directions to Starbucks" was said 1045 times.

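I am trying to figure out how to replace similar words in the Utterance column with a single string, for example mapping "Starbucks", "Tullys" and "Seattles Best" to "Coffee". I have seen similar answers that suggest using a dictionary, such as the ones below, but I have not gotten it to work.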

{'Utterance':['Starbucks','Tullys','Seattles Best'],
      'Combi_Utterance':['Coffee','Coffee','Coffee']}

{'Utterance':['Dunkin','Daily Dozen'],
      'Combi_Utterance':['Donut','Donut']}

{'Utterance':['Give me','Show me','Navigate me','Direct me'],
      'Combi_Utterance':['V_me','V_me','V_me','V_me']}
The desired output looks like this:

coffee_directions_df

Utterance                         Frequency  Combi_Utterance
Directions to Starbucks           1045       Directions to Coffee
Directions to Tullys              1034       Directions to Coffee
Give me directions to Tullys      986        V_me to Coffee
Directions to Seattles Best       875        Directions to Coffee
Show me directions to Dunkin      812        V_me to Donut
Directions to Daily Dozen         789        Directions to Donut
Show me directions to Starbucks   754        V_me to Coffee
Give me directions to Dunkin      612        V_me to Donut
Navigate me to Seattles Best      498        V_me to Coffee
Display navigation to Starbucks   376        Display navigation to Coffee
Direct me to Starbucks            201        V_me to Coffee
Ultimately, I would like to be able to use this code to produce the final output:

df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
        )

print (df)
         Words  Frequency
0   Directions       6907
1         V_me       3863
2        Donut       2213
3       Coffee       5769
4        Other        376

Thanks

Here is one approach. Based on your previous question, I chose to use collections.Counter rather than pandas for the counting logic.

The required input is a mapping dictionary, rep_dict. We apply it to substrings of the strings in the df['Utterance'] series.

from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

# define dictionary of mappings
rep_dict = {'Starbucks': 'Coffee', 'Tullys': 'Coffee', 'Seattles Best': 'Coffee'}

# apply substring mapping
df['Utterance'] = df['Utterance'].replace(rep_dict, regex=True).str.lower()

# previous logic below
c = Counter()

for row in df.itertuples():
    for i in row[1].split():
        c[i] += row[2]

res = pd.DataFrame.from_dict(c, orient='index')\
                  .rename(columns={0: 'Count'})\
                  .sort_values('Count', ascending=False)

def add_combinations(df, lst):
    for i in lst:
        words = '_'.join(i)
        df.loc[words] = df.loc[df.index.isin(i), 'Count'].sum()
    return df.sort_values('Count', ascending=False)

lst = [('give', 'show', 'navigate', 'direct')]

res = add_combinations(res, lst)
Result:

                           Count
to                          3666
coffee                      3666
directions                  2411
give_show_navigate_direct   2245
me                          2065
show                         754
navigate                     678
give                         612
display                      376
navigation                   376
direct                       201
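
The answer above lowercases the text and overwrites Utterance. If the goal is the separate Combi_Utterance column from the question, a minimal sketch could look like the following. It swaps the single dict-based replace for a loop over Series.str.replace so the order of substitutions is explicit. The regex patterns, including the optional " directions" after the verb phrase, are assumptions chosen to reproduce the desired output on a subset of the sample rows; they are not part of the original answer.

```python
import pandas as pd

# A subset of the sample data from the question.
df = pd.DataFrame(
    [['Directions to Starbucks', 1045],
     ['Give me directions to Tullys', 986],
     ['Directions to Seattles Best', 875],
     ['Show me directions to Dunkin', 812],
     ['Directions to Daily Dozen', 789],
     ['Navigate me to Seattles Best', 498],
     ['Display navigation to Starbucks', 376]],
    columns=['Utterance', 'Frequency'])

# Assumed patterns: each regex (key) is replaced by a category label (value).
rep_dict = {
    r'Starbucks|Tullys|Seattles Best': 'Coffee',
    r'Dunkin|Daily Dozen': 'Donut',
    r'(?:Give|Show|Navigate|Direct) me(?: directions)?': 'V_me',
}

# Build the new column step by step; the original Utterance column is untouched.
combined = df['Utterance']
for pattern, label in rep_dict.items():
    combined = combined.str.replace(pattern, label, regex=True)
df['Combi_Utterance'] = combined

print(df)
```

Because "Direct me" requires a following " me", the pattern does not touch the word "Directions", so rows such as "Directions to Starbucks" become "Directions to Coffee".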

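Hi, if you have time, please take a look at my next question. Thanks as always for your help! (I followed your steps, but I am also trying to do something else.)

Thanks. @user_seaweed, if this answer worked, please accept it (the green check mark on the left).

Thanks, and sorry, I am still new here. Basically I am looking for a way to do this with a larger DataFrame. (Ideally, I would like to import rep_dict from a txt file rather than typing it all into the shell/terminal.)

Thanks. @user_seaweed, please see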
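
On the comment about keeping rep_dict in a text file, here is a minimal sketch. It assumes a hypothetical file rep_dict.txt with one comma-separated "substring,label" pair per line; the file name and format are assumptions, not something given in the thread. The sketch writes that file into a temporary directory only so that it runs on its own, and it uses literal (non-regex) replacement so that entries in the file are never interpreted as patterns.

```python
import csv
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical file contents: one "substring,label" pair per line.
mapping_text = (
    "Starbucks,Coffee\n"
    "Tullys,Coffee\n"
    "Seattles Best,Coffee\n"
    "Dunkin,Donut\n"
    "Daily Dozen,Donut\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "rep_dict.txt"  # assumed file name
    path.write_text(mapping_text, encoding="utf-8")

    # Read the pairs back into a dict, skipping malformed lines.
    with path.open(newline="", encoding="utf-8") as fh:
        rep_dict = {row[0].strip(): row[1].strip()
                    for row in csv.reader(fh) if len(row) == 2}

df = pd.DataFrame(
    [['Directions to Starbucks', 1045],
     ['Directions to Seattles Best', 875],
     ['Show me directions to Dunkin', 812]],
    columns=['Utterance', 'Frequency'])

# Apply each literal substring replacement in turn.
for old, new in rep_dict.items():
    df['Utterance'] = df['Utterance'].str.replace(old, new, regex=False)

print(df)
```

For a real file, only the path.open(...) part would be needed, pointed at wherever the mapping is stored.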