Python Pandas: split strings and count values?


I have a pandas dataset in which one column holds comma-separated strings, e.g. 1,2,3,10:

import pandas as pd

data = [
  { 'id': 1, 'score': 9, 'topics': '11,22,30' },
  { 'id': 2, 'score': 7, 'topics': '11,18,30' },
  { 'id': 3, 'score': 6, 'topics': '1,12,30' },
  { 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)

I want to get the count and the mean score for each value in topics. So:

topic_id,count,mean
1,2,5
11,2,8
12,1,6

and so on. How can I do this?

I've got as far as:

df['topic_ids'] = df.topics.str.split(',')

But now I think I want to explode topic_ids out, so that there is a column for each unique value in the entire set of values…?

Unnest, then groupby and agg:

import numpy as np

df.topics = df.topics.str.split(',')
lens = df.topics.str.len()  # number of topics in each row
New_df = pd.DataFrame({'topics': np.concatenate(df.topics.values),
                       'id': df.id.repeat(lens),
                       'score': df.score.repeat(lens)})

New_df.groupby('topics').score.agg(['count','mean'])

Out[1256]: 
        count  mean
topics             
1           2   5.0
11          2   8.0
12          1   6.0
18          2   5.5
22          1   9.0
30          4   6.5
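
For pandas 0.25 or later, the same unnest step can be sketched with DataFrame.explode instead of np.concatenate and repeat. This is an alternative to the approach above, not the answerer's method. The cast to int is an added assumption so that the topics sort numerically rather than as strings.

```python
import pandas as pd

data = [
    {'id': 1, 'score': 9, 'topics': '11,22,30'},
    {'id': 2, 'score': 7, 'topics': '11,18,30'},
    {'id': 3, 'score': 6, 'topics': '1,12,30'},
    {'id': 4, 'score': 4, 'topics': '1,18,30'},
]
df = pd.DataFrame(data)

# Split each string into a list, then give every list element its own row.
# explode repeats id and score for each element automatically.
exploded = (
    df.assign(topics=df.topics.str.split(','))
      .explode('topics')
      .astype({'topics': int})  # assumption: topic ids are always integers
)

result = exploded.groupby('topics').score.agg(['count', 'mean'])
print(result)
```

Using assign leaves the original df untouched, so the string column is still available if another approach needs it.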

Here's one way. Reindex and stack, then groupby and agg:

import pandas as pd

data = [
  { 'id': 1, 'score': 9, 'topics': '11,22,30' },
  { 'id': 2, 'score': 7, 'topics': '11,18,30' },
  { 'id': 3, 'score': 6, 'topics': '1,12,30' },
  { 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
df.topics = df.topics.str.split(',')
df2 = pd.DataFrame(df.topics.tolist(), index=[df.id, df.score])\
                   .stack()\
                   .reset_index(name='topics')\
                   .drop(columns='level_2')

df2.groupby('topics').score.agg(['count', 'mean']).reset_index()

By mean score, do you mean

df.topics.str.split(',', expand=True).astype(int).mean(axis=1)

? All in one line:

(df.set_index(['id', 'score']).topics.str.split(',', expand=True).stack().reset_index(name='Topic').groupby('Topic').agg({'id': 'size', 'score': 'mean'}))

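@ScottBoston That would probably work too. More than one way!

Thanks! Unfortunately, with my real data I get an error on 'topics': np.concatenate(df.topics.values). The error is: ValueError: all the input arrays must have same number of dimensions. I guess this is because the split arrays have variable length. How do I handle that?

@Richard Did you reassign it after the split?

It was because there were some NaN values in my data. Replacing those fixed the problem. Thanks!

@Richard Aha, np.nan will cause the problem. You can replace the NaN :-)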
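
The comments trace the ValueError to NaN values in topics. str.split leaves those entries as NaN rather than lists, so np.concatenate can no longer join them. One possible way to handle this is to drop the rows with missing topics before splitting. This is a minimal sketch: the row with id 5 is hypothetical, and whether dropping or substituting a placeholder is right depends on the real data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 5],
    'score': [9, 7, 3],
    'topics': ['11,22,30', '11,18,30', np.nan],  # hypothetical row with no topics
})

# Drop rows whose topics are missing, then split the remaining strings.
clean = df.dropna(subset=['topics']).copy()
clean['topics'] = clean.topics.str.split(',')

# Repeat id and score once per topic so all columns have the same length.
lens = clean.topics.str.len()
New_df = pd.DataFrame({
    'topics': np.concatenate(clean.topics.values),
    'id': clean.id.repeat(lens),
    'score': clean.score.repeat(lens),
})

result = New_df.groupby('topics').score.agg(['count', 'mean'])
print(result)
```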
In [111]: def mean1(x): return np.array(x).astype(int).mean()

In [112]: df.topics.str.split(',', expand=False).agg([mean1, len])
Out[112]:
       mean1  len
0  21.000000       3
1  19.666667       3
2  14.333333       3
3  16.333333       3
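
Note that the session above works per row: it averages the topic ids within each row and does not compute a mean score per topic. The same per-row numbers can be sketched with explicit apply calls, which avoids relying on how Series.agg handles a list of functions in a given pandas version. The column names mean1 and len are copied from the output above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'score': [9, 7, 6, 4],
    'topics': ['11,22,30', '11,18,30', '1,12,30', '1,18,30'],
})

topic_lists = df.topics.str.split(',')

per_row = pd.DataFrame({
    # mean of the topic ids in each row (not the mean score)
    'mean1': topic_lists.apply(lambda x: np.array(x).astype(int).mean()),
    # number of topics in each row
    'len': topic_lists.apply(len),
})
print(per_row)
```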