Python: extracting a new DataFrame
I have a DataFrame:
topic student level week
1 a 1 1
1 b 2 1
1 a 3 1
2 a 1 2
2 b 2 2
2 a 3 2
2 b 4 2
3 c 1 2
3 b 2 2
3 c 3 2
3 a 4 2
3 b 5 2
I want to extract, for each student in each topic, the number of messages, and create a new df with three columns:
student topic messages
a 1 2
a 2 2
a 3 1
b 1 1
b 2 2
b 3 2
c 3 2
I want to skip rows with 0 messages.
Does anyone have any suggestions?
Thanks, everyone!
You can do it like this:
In [132]: df.groupby(['student','topic']).size().to_frame('messages').reset_index()
Out[132]:
student topic messages
0 a 1 2
1 a 2 2
2 a 3 1
3 b 1 1
4 b 2 2
5 b 3 2
6 c 3 2
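For anyone who wants to reproduce the answer above outside IPython, here is a minimal sketch. It assumes the question's table can be read as whitespace-separated text; the names `raw`, `counts` and `has_c1` are only for illustration. Because the grouping keys here are plain (non-categorical) columns, only the (student, topic) pairs that actually occur produce a group, so rows with 0 messages never appear and nothing needs to be filtered.

```python
import io

import pandas as pd

# The question's data, copied as whitespace-separated text.
raw = """topic student level week
1 a 1 1
1 b 2 1
1 a 3 1
2 a 1 2
2 b 2 2
2 a 3 2
2 b 4 2
3 c 1 2
3 b 2 2
3 c 3 2
3 a 4 2
3 b 5 2
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# size() counts the rows in each (student, topic) group.
counts = (
    df.groupby(["student", "topic"])
      .size()
      .to_frame("messages")
      .reset_index()
)
print(counts)

# Student c never posted in topic 1, so that pair is simply absent.
has_c1 = ((counts["student"] == "c") & (counts["topic"] == 1)).any()
print(has_c1)
```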
Timing:
In [208]: df = pd.concat([df] * 10**4, ignore_index=True)
In [209]: df.shape
Out[209]: (120000, 4)
In [210]: %timeit df.groupby(['student','topic']).size().to_frame('messages').reset_index()
10 loops, best of 3: 32.6 ms per loop
In [211]: %timeit df.groupby(['student','topic']).size().reset_index(name='messages')
10 loops, best of 3: 32.4 ms per loop
In [212]: from collections import Counter
In [213]: %%timeit
...: s = pd.Series(Counter(zip(df.student, df.topic)), name='messages')
...: s.rename_axis(['student', 'topic']).reset_index()
...:
10 loops, best of 3: 90.3 ms per loop
In [214]: %%timeit
...: s = pd.value_counts(list(zip(df.student, df.topic)))
...: pd.DataFrame(
...: np.column_stack([s.index.tolist(), s.values]),
...: columns=['student', 'topic', 'messages'])
...:
10 loops, best of 3: 83.4 ms per loop
You can use groupby + size + reset_index:
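A minimal sketch of the shorter chain that also appears in the timings, `groupby(...).size().reset_index(name='messages')`. The frame is rebuilt here from the question's values so the snippet runs on its own; the variable names `short` and `long` are just for illustration.

```python
import pandas as pd

# Same 12 rows as in the question.
df = pd.DataFrame({
    "topic":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "student": list("abaababcbcab"),
    "level":   [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5],
    "week":    [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
})

# reset_index(name=...) names the count column in one step ...
short = df.groupby(["student", "topic"]).size().reset_index(name="messages")

# ... and gives the same frame as going through to_frame() first.
long = df.groupby(["student", "topic"]).size().to_frame("messages").reset_index()

print(short)
print(short.equals(long))
```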
Thinking outside the box
Using Counter
import pandas as pd
from collections import Counter
s = pd.Series(Counter(zip(df.student, df.topic)), name='messages')
s.rename_axis(['student', 'topic']).reset_index()
student topic messages
0 a 1 2
1 a 2 2
2 a 3 1
3 b 1 1
4 b 2 2
5 b 3 2
6 c 3 2
Using pd.value_counts
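import numpy as np
import pandas as pd
# Count each (student, topic) tuple.
# Note: top-level pd.value_counts is deprecated in recent pandas releases;
# pd.Series(...).value_counts() is the suggested replacement.
s = pd.value_counts(list(zip(df.student, df.topic)))
# Stack the key tuples and their counts side by side into three columns.
pd.DataFrame(
    np.column_stack([s.index.tolist(), s.values]),
    columns=['student', 'topic', 'messages'])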
student topic messages
0 a 1 2
1 a 2 2
2 a 3 1
3 b 1 1
4 b 2 2
5 b 3 2
6 c 3 2
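Not from the original answers: on newer pandas (assumed 1.1 or later, where `DataFrame.value_counts` is available) the same counts can probably be had without zipping the columns by hand. A sketch under that assumption, with the frame reduced to the two columns that matter:

```python
import pandas as pd

# Only the two key columns from the question are needed for counting.
df = pd.DataFrame({
    "student": list("abaababcbcab"),
    "topic":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
})

counts = (
    df.value_counts(subset=["student", "topic"])  # Series indexed by (student, topic)
      .sort_index()                               # order by student, then topic
      .reset_index(name="messages")
)
print(counts)
```

Like `groupby(...).size()`, this only reports pairs that occur in the data, so there are no 0-message rows to drop.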
@Giada, glad I could help :)
Great, @jezrael! Thanks
@jezrael/MaxU... I like exploring :-)
Hmm, I think size is really fast because it is implemented in cython (not 100% sure)