Python: find a string in a sub-group column and flag the main group based on its occurrence

python pandas numpy

I have data like this:

Group   string
 A     Hello
 A     SearchListing
 A     GoSearch
 A     pen
 A     Hello
 B     Real-Estate
 B     Access
 B     Denied
 B     Group
 B     Group
 C     Glance
 C     NoSearch
 C     Home

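and so on.

I want to find every group whose strings contain the phrase "Search" and flag each group as 0/1. I also want per-group aggregates: the total number of strings, the number of unique strings, and the number of times "Search" occurs in the group. The final result I want looks like this: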

Group   containsSearch  TotalStrings  UniqueStrings  NoOfTimesSearch
 A           1              5             4              2
 B           0              5             4              0
 C           1              3             3              1

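I can do the aggregation with a simple groupby clause, but I am stuck on how to flag each group as 0/1 based on the presence of "Search" and how to count the number of times it occurs.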
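For anyone who wants to reproduce the problem, the sample rows above can be loaded into a DataFrame roughly as follows. This is only a sketch: the variable name `df` is an assumption, chosen to match the name used in the answers.

```python
import pandas as pd

# Sample data copied from the question; `df` is an assumed variable name.
df = pd.DataFrame({
    "Group": list("AAAAABBBBBCCC"),
    "string": [
        "Hello", "SearchListing", "GoSearch", "pen", "Hello",
        "Real-Estate", "Access", "Denied", "Group", "Group",
        "Glance", "NoSearch", "Home",
    ],
})

print(df)
```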

Let's try:

# int(...) works whether .any()/.sum() return Python or NumPy scalars
l1 = lambda x: int(x.str.lower().str.contains('search').any())
l1.__name__ = 'containsSearch'
l2 = lambda x: int(x.str.lower().str.contains('search').sum())
l2.__name__ = 'NoOfTimesSearch'

df.groupby('Group')['string'].agg(['count','nunique',l1,l2]).reset_index()

Output:

  Group  count  nunique  containsSearch  NoOfTimesSearch
0     A      5        4               1                2
1     B      5        4               0                0
2     C      3        3               1                1

Or use named functions, thanks @W-B:

def containsSearch(x):
    # 1 if any string in the group contains "search" (case-insensitive)
    return int(x.str.lower().str.contains('search').any())

def NoOfTimesSearch(x):
    # number of strings in the group that contain "search"
    return int(x.str.lower().str.contains('search').sum())


df.groupby('Group')['string'].agg(['count', 'nunique',
                                   containsSearch, NoOfTimesSearch]).reset_index()

Output:

  Group  count  nunique  containsSearch  NoOfTimesSearch
0     A      5        4               1                2
1     B      5        4               0                0
2     C      3        3               1                1

If you want to write a single function, do the following:

import pandas as pd

def my_agg(x):
    # x is the sub-DataFrame for one group
    has_search = x['string'].str.lower().str.contains('search')
    names = {
        'containsSearch': int(has_search.any()),
        'TotalStrings': x['string'].count(),
        'UniqueStrings': x['string'].nunique(),
        'NoOfTimesSearch': int(has_search.sum()),
    }

    return pd.Series(names)

df.groupby('Group').apply(my_agg)

       containsSearch  TotalStrings  UniqueStrings  NoOfTimesSearch
Group                                                              
A                   1             5              4                2
B                   0             5              4                0
C                   1             3              3                1
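The same table can also be produced without a custom function, by computing the match flag once per row and then using pandas named aggregation. The following is a sketch, not part of the original answers: the helper column name `has_search` is hypothetical, and the sample data is rebuilt here so the block runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Group": list("AAAAABBBBBCCC"),
    "string": [
        "Hello", "SearchListing", "GoSearch", "pen", "Hello",
        "Real-Estate", "Access", "Denied", "Group", "Group",
        "Glance", "NoSearch", "Home",
    ],
})

# Row-level flag: does the string contain "search" (case-insensitive)?
# `has_search` is a hypothetical helper column used only for aggregation.
has_search = df["string"].str.contains("search", case=False)

result = (
    df.assign(has_search=has_search)
      .groupby("Group")
      .agg(
          containsSearch=("has_search", "any"),
          TotalStrings=("string", "count"),
          UniqueStrings=("string", "nunique"),
          NoOfTimesSearch=("has_search", "sum"),
      )
      .astype(int)      # turn the boolean flag into 0/1
      .reset_index()
)

print(result)
```

Computing the flag once avoids repeating the string match inside every aggregation function.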

No reason. I guess that would be better.

I get an error: "'bool' object has no attribute 'astype'"
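The error quoted in the last comment appears to arise when `.any()` hands back a plain Python `bool`, which has no `.astype` method. A minimal sketch of one way around it is to wrap the result in the built-in `int()`, which accepts both Python and NumPy scalars. The short series `s` below is made up purely for illustration.

```python
import pandas as pd

# Hypothetical three-row series, for illustration only.
s = pd.Series(["Hello", "GoSearch", "pen"])
mask = s.str.lower().str.contains("search")

# int() accepts a Python bool as well as a NumPy scalar,
# so the code no longer relies on .astype being available.
flag = int(mask.any())
times = int(mask.sum())

print(flag, times)
```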