Python: count occurrences of strings for each unique row across multiple columns
I want to count the occurrences of certain strings across multiple columns and return the total counts in new columns. I know I can use value_counts to count the total occurrences of each value in a given column:
data['col'].value_counts(dropna=False)
Result:
[["win" TKO technical knockout] 336
[["win" UD unanimous decision] 307
[["win" KO knockout] 225
[["loss" UD unanimous decision] 97
[["loss" TKO technical knockout] 64
[["win" nan null] 53
[["draw" MD majority decision] 43
[["loss" KO knockout] 41
[["loss" MD majority decision] 35
[["loss" nan null] 32
[["loss" SD split decision] 29
[["unknown" nan null] 29
[["win" SD split decision] 27
[["draw" PTS null] 18
[["win" RTD corner retirement] 17
[["draw" SD split decision] 12
[["loss" RTD corner retirement] 11
[["win" MD majority decision] 9
[["loss" DQ disqualification] 6
[["win" PTS null] 6
[["unknown" NC null] 3
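The problem is that I want to count, for example, the occurrences of [["win" KO knockout] across each of the relevant columns (the relevant columns are col1 through col20).
Here is a sample of my data: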
{'col1': {0: ['["win" UD unanimous decision'],
1: ['["win" UD unanimous decision'],
2: ['["win" TKO technical knockout'],
3: ['["win" UD unanimous decision'],
4: ['["win" UD unanimous decision']},
'col2': {0: ['["win" TKO technical knockout'],
1: ['["win" TKO technical knockout'],
2: ['["win" TKO technical knockout'],
3: ['["win" UD unanimous decision'],
4: ['["win" UD unanimous decision']},
'col3': {0: ['["win" TKO technical knockout'],
1: ['["win" KO knockout'],
2: ['["win" TKO technical knockout'],
3: ['["win" TKO technical knockout'],
4: ['["win" UD unanimous decision']},
'col4': {0: ['["win" UD unanimous decision'],
1: ['["win" UD unanimous decision'],
2: ['["win" KO knockout'],
3: ['["win" TKO technical knockout'],
4: ['["win" UD unanimous decision']}}
In this case, the desired output is:
   win UD  win TKO  win KO
0       2        2       0
1       2        1       1
2       0        3       1
3       2        2       0
4       4        0       0
Update:
I also tried using size with groupby:
#list of column names
col_outcome = ['col'+str(i) for i in range(1,11)]
data.groupby(col_outcome).size()
However, this returns the following error message:
TypeError: unhashable type: 'list'
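The error message suggests that each cell holds a Python list, and lists cannot be hashed, so they cannot serve as groupby keys. Below is a minimal sketch of that idea, assuming a small made-up two-column frame (the data and the names `flat` and `sizes` are illustrative only, not from the original post). Unwrapping each one-element list to its string first makes grouping possible.

```python
import pandas as pd

# Hypothetical frame whose cells are one-element lists, as in the question
df = pd.DataFrame({
    'col1': [['["win" UD unanimous decision'], ['["win" KO knockout']],
    'col2': [['["win" UD unanimous decision'], ['["win" UD unanimous decision']],
})

# A list cannot be hashed, which is what groupby needs for its keys
try:
    hash(df.loc[0, 'col1'])
    message = None
except TypeError as exc:
    message = str(exc)

# Unwrap each list to the string it contains; strings are hashable
flat = df.apply(lambda s: s.str[0])
sizes = flat.groupby(['col1', 'col2']).size()
```

Note that grouping on the unwrapped columns counts identical rows, which is still not the per-row tally asked for in the question.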
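IIUC, let's use stack to reshape the "wide" dataframe to "long", then do a bit of string cleanup with a regex replace and extract, next group by the row index and apply value_counts, and finally use unstack to reshape the result: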
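import pandas as pd

# df is the wide dataframe from the question; regex=True is required on pandas >= 2.0
df.stack().str[0].str.replace(r'\[|"', '', regex=True)\
.str.extract(r'(\w+\s\w+)')\
.groupby(level=0)[0].apply(pd.Series.value_counts).unstack(fill_value=0)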
Output:
   win KO  win TKO  win UD
0       0        2       2
1       1        1       2
2       1        3       0
3       0        2       2
4       0        0       4
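To make the chain easier to follow, here is the same approach split into named steps and run against the four-column sample from the question. This is only a sketch: the intermediate names (`long`, `text`, `clean`, `label`, `counts`) and the helper variables `ud`, `tko`, `ko` are my own, and the frame is rebuilt by hand from the dictionary shown in the question.

```python
import pandas as pd

ud = ['["win" UD unanimous decision']
tko = ['["win" TKO technical knockout']
ko = ['["win" KO knockout']

# Sample data from the question: every cell is a one-element list
df = pd.DataFrame({
    'col1': [ud, ud, tko, ud, ud],
    'col2': [tko, tko, tko, ud, ud],
    'col3': [tko, ko, tko, tko, ud],
    'col4': [ud, ud, ko, tko, ud],
})

long = df.stack()                                    # index becomes (row, column)
text = long.str[0]                                   # unwrap each list to its string
clean = text.str.replace(r'\[|"', '', regex=True)    # drop '[' and '"' characters
label = clean.str.extract(r'(\w+\s\w+)')[0]          # keep the first two words, e.g. 'win UD'

# Count the labels within each original row, then spread them into columns
counts = (label.groupby(level=0)
               .apply(pd.Series.value_counts)
               .unstack(fill_value=0))
```

Printing `counts` should give one row per original row and one column per outcome label, with 0 where a label does not occur in that row.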
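Can you give us a sample of data and code so that we can run it by copy/paste?
@PrinceFrancis I added a data sample to my question as a dictionary, limited to 4 columns.
You asked to count the occurrences of [["win" KO knockout], but your expected result shows something else. I was confused, which is why I asked for a simple example.
What about df.stack().value_counts()? Or df.melt(value_name='vals')['vals'].value_counts()?
@PrinceFrancis It was only an example to make the question clearer; if it causes confusion, I will remove it.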
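For comparison, the two one-liners suggested in the comments count values over the whole frame rather than per row. A small sketch, assuming a made-up frame whose cells are already plain strings (the data and the names `overall` and `melted` are illustrative only):

```python
import pandas as pd

# Hypothetical frame with plain string cells (lists already unwrapped)
df = pd.DataFrame({
    'col1': ['win UD', 'win KO'],
    'col2': ['win UD', 'win UD'],
})

# Both expressions tally every cell in the frame, ignoring which row it came from
overall = df.stack().value_counts()
melted = df.melt(value_name='vals')['vals'].value_counts()
```

Both give frame-wide totals, so a per-row breakdown still needs a grouping step on the row index.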