Python pandas: counting rows by substring occurrence


python python-3.x pandas


Suppose I have a DataFrame like this:

import pandas as pd

df=pd.DataFrame({'name': ['john','jack','jill','al','zoe','jenn','ringo','paul','george','lisa'], 'how do you feel?': ['excited', 'not excited', 'excited and nervous', 'worried', 'really worried', 'excited', 'not that worried', 'not that excited', 'nervous', 'nervous']})

      how do you feel?    name
0              excited    john
1          not excited    jack
2  excited and nervous    jill
3              worried      al
4       really worried     zoe
5              excited    jenn
6     not that worried   ringo
7     not that excited    paul
8              nervous  george
9              nervous    lisa

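I am interested in the counts, but grouped into three categories: 'excited', 'worried', and 'nervous'.

The catch is that 'excited and nervous' should be grouped with 'excited'. In fact, any string containing 'excited' should be included in that group, except strings such as 'not that excited' and 'not excited'. The same logic applies to 'worried' and 'nervous'. (Note that 'excited and nervous' actually falls into both the 'excited' group and the 'nervous' group.)

As you can see, a typical groupby will not work here, and the string search has to be flexible.

I have a solution, but I wonder whether anyone can find a better way, either more Pythonic or using a more appropriate method that I may not be aware of.

Here is my solution: define a function that returns the count of rows that contain the desired substring and do not contain a substring that negates it.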

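def get_perc(df, column_label, str_include, str_exclude):
    # Select rows that contain str_include but not str_exclude (case-insensitive).
    data=df[column_label][(~df[column_label].str.contains(str_exclude, case=False)) & \
    (df[column_label].str.contains(str_include,  case=False))]

    # Count the matching rows.
    num=data.count()

    return num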

Then call this function inside a loop, passing in the various 'str.contains' arguments, and collect the results in another DataFrame.

groups=['excited', 'worried', 'nervous']
column_label='how do you feel?'  # must match the DataFrame column name exactly

# Collect one row per group, then build the result DataFrame once.
rows=[]
for str_include in groups:
    num=get_perc(df, column_label, str_include, 'not|neither')
    rows.append({'group': str_include, 'num': num})
data=pd.DataFrame(rows, columns=['group','num'])

data

      group    num
0   excited      3
1   worried      2
2   nervous      3

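Can you think of a cleaner way? I tried a regular expression in 'str.contains' to avoid needing two boolean Series and the '&'. However, I could not do it without a capture group, which means I would have to use 'str.extract', and that does not seem to let me select the data in the same way.

Any help is much appreciated.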
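
The single-regex idea from the question can be sketched with a negative lookahead and non-capturing groups, so that one 'str.contains' call is enough. This is only an illustration under assumptions: the helper name negation_free_pattern is hypothetical, and unlike the original 'not|neither' substring test it matches whole words only.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'jack', 'jill', 'al', 'zoe', 'jenn', 'ringo', 'paul', 'george', 'lisa'],
    'how do you feel?': ['excited', 'not excited', 'excited and nervous', 'worried',
                         'really worried', 'excited', 'not that worried',
                         'not that excited', 'nervous', 'nervous'],
})

groups = ['excited', 'worried', 'nervous']
col = df['how do you feel?']


def negation_free_pattern(word, negations=('not', 'neither')):
    # Hypothetical helper: match `word` only if no negation word occurs
    # anywhere in the string. (?!...) is a negative lookahead and (?:...)
    # is a non-capturing group, so the pattern has no capture groups.
    neg = '|'.join(re.escape(n) for n in negations)
    return rf'^(?!.*\b(?:{neg})\b).*\b{re.escape(word)}\b'


# One boolean Series per group; summing it counts the matching rows.
counts = pd.Series({
    g: int(col.str.contains(negation_free_pattern(g), case=False).sum())
    for g in groups
})
print(counts)
```

Because the lookahead is anchored at the start of the string, a negation word anywhere in the answer excludes the row, which mirrors the intent of the two-mask version.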

You can simply provide a mapping, then group by the new Series that the mapping produces.

map_dict = {'excited and nervous':'excited', 'not that excited':'not excited', 
            'really worried':'worried', 'not that worried':'not worried'}
df.groupby(df['how do you feel?'].replace(map_dict)).size()

Output:

how do you feel?
excited        3
nervous        2
not excited    2
not worried    1
worried        2
dtype: int64
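
To compare this with the three groups asked about in the question, the mapped result can be restricted to those labels. The sketch below assumes the same df and map_dict as above and uses value_counts with reindex, which is a substitution for the groupby call shown in the answer. Since an exact-value mapping assigns each row to exactly one label, 'excited and nervous' counts only toward 'excited' here.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'jack', 'jill', 'al', 'zoe', 'jenn', 'ringo', 'paul', 'george', 'lisa'],
    'how do you feel?': ['excited', 'not excited', 'excited and nervous', 'worried',
                         'really worried', 'excited', 'not that worried',
                         'not that excited', 'nervous', 'nervous'],
})

map_dict = {'excited and nervous': 'excited', 'not that excited': 'not excited',
            'really worried': 'worried', 'not that worried': 'not worried'}
groups = ['excited', 'worried', 'nervous']

# Replace whole cell values using the mapping, then count each resulting label.
mapped = df['how do you feel?'].replace(map_dict)

# Keep only the three groups of interest, in a fixed order.
counts = mapped.value_counts().reindex(groups, fill_value=0)
print(counts)
```

With this approach 'nervous' comes out as 2 rather than 3, so it fits only if each answer should belong to a single group.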

You can do:

Method 1

Ignore the 'not' rows, then get the relevant groups from the indicator strings.
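
One way the Method 1 description could be realized is sketched below. It is an assumption about what was meant, not necessarily the answerer's exact code: rows containing the word 'not' are dropped with 'str.contains', and the remaining strings are turned into indicator columns with 'str.get_dummies'.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'jack', 'jill', 'al', 'zoe', 'jenn', 'ringo', 'paul', 'george', 'lisa'],
    'how do you feel?': ['excited', 'not excited', 'excited and nervous', 'worried',
                         'really worried', 'excited', 'not that worried',
                         'not that excited', 'nervous', 'nervous'],
})

col = 'how do you feel?'
groups = ['excited', 'worried', 'nervous']

# Step 1: keep only the rows that do not contain the word "not".
kept = df.loc[~df[col].str.contains(r'\bnot\b'), col]

# Step 2: split each remaining string on spaces into 0/1 indicator columns,
# then sum the columns for the groups of interest.
counts = kept.str.get_dummies(sep=' ')[groups].sum()
print(counts)
```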


Method 2
In [162]: dfs = df['how do you feel?'].str.get_dummies(sep=' ')

In [163]: dfs.loc[~dfs['not'].astype(bool), groups].sum()
Out[163]:
excited    3
worried    2
nervous    3
dtype: int64
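
To see why Method 2 works, it can help to look at the intermediate indicator frame. The sketch below rebuilds the same df and inspects what 'str.get_dummies(sep=" ")' produces: one column per distinct word and one row per answer. Note that this matches whole space-separated words only, so it would not catch variants such as 'excitedly'.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'jack', 'jill', 'al', 'zoe', 'jenn', 'ringo', 'paul', 'george', 'lisa'],
    'how do you feel?': ['excited', 'not excited', 'excited and nervous', 'worried',
                         'really worried', 'excited', 'not that worried',
                         'not that excited', 'nervous', 'nervous'],
})
groups = ['excited', 'worried', 'nervous']

# One indicator column per distinct word found in the answers.
dfs = df['how do you feel?'].str.get_dummies(sep=' ')
print(list(dfs.columns))

# Row 2 is 'excited and nervous', so it is flagged in both group columns.
print(dfs.loc[2, groups])

# Drop rows flagged as containing 'not', then count each group.
counts = dfs.loc[~dfs['not'].astype(bool), groups].sum()
print(counts)
```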

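Does 'excited and nervous' count as both, or only as excited?
In this case, both is fine.
You could also edit your question.
Nice one. I was just about to post: dummies_df = df['how do you feel?'].str.get_dummies(sep=' '); dummies_df.loc[dummies_df['not'] != 1, ['excited', 'worried', 'nervous']].sum(). You beat me to it :)
@Zero I really like this 'dummies' approach. I was using str.contains so that I could still group 'excited' with its variants and synonyms by using a regex group (such as 'excited|pumped|...'). Of course, doing that properly requires some lemmatization. NLTK combined with your approach could work well.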
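
The synonym idea from the last comment can be sketched with one regex alternation per group. Everything in this example is illustrative: the sample answers, the synonym lists, and the variable names are hypothetical, and no lemmatization is applied.

```python
import re

import pandas as pd

# Hypothetical sample answers, used only to illustrate the idea.
feelings = pd.Series(['excited', 'not excited', 'really pumped', 'worried',
                      'not that worried', 'anxious and thrilled'])

# Hypothetical synonym lists for each group.
synonyms = {
    'excited': ['excited', 'pumped', 'thrilled'],
    'worried': ['worried', 'anxious'],
}

# Rows that contain a negation word are excluded from every group.
negated = feelings.str.contains(r'\b(?:not|neither)\b', case=False)

counts = {}
for group, words in synonyms.items():
    # Non-capturing alternation of the escaped synonyms, matched as whole words.
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    hits = feelings.str.contains(pattern, case=False)
    counts[group] = int((hits & ~negated).sum())

print(counts)
```

A lemmatizer could be applied to the text before this step so that inflected forms map onto the same base word.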