Pandas: how to compute weighted word counts from a pivot table
This is my pivot table:
No Keyword Count
1 Sell Laptop Online 10
2 Buy Computer Online 8
3 Laptop and Case 5
This is what I want:
No Word Count
1 Online 18
2 Laptop 15
3 Sell 10
4 Buy 8
5 Computer 8
6 and 5
7 Case 5
What I tried:
df['Keyword'].str.split(expand=True).stack().value_counts()
But the result is:
No Word Count
1 Online 2
2 Laptop 2
3 Sell 1
4 Buy 1
5 Computer 1
6 and 1
7 Case 1
I want the word counts weighted by the Count column of the pivot table.
Use:
df1 = (df.set_index('Count')['Keyword']
.str.split(expand=True)
.stack()
.reset_index(name='Word')
.groupby('Word')['Count']
.sum()
.sort_values(ascending=False)
.reset_index())
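Explanation:
Set Count as the index with set_index so that this information is not lost.
Create a DataFrame from Keyword with str.split(expand=True).
Reshape it with stack into a Series with a MultiIndex.
Convert the index levels back to columns with reset_index.
Aggregate with groupby and sum, sort with sort_values, and finish with reset_index.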
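The steps above can be traced on the sample data with a minimal sketch like the one below. The DataFrame is rebuilt by hand from the table in the question, and the intermediate names `stacked` and `tidy` are only illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Keyword': ['Sell Laptop Online', 'Buy Computer Online', 'Laptop and Case'],
    'Count': [10, 8, 5],
})

# Keep Count in the index, split each keyword into one column per word,
# then stack the word columns into a single Series with a MultiIndex.
stacked = df.set_index('Count')['Keyword'].str.split(expand=True).stack()

# Move the index levels back into columns; the words land in 'Word'.
tidy = stacked.reset_index(name='Word')

# Sum the counts per word and sort from largest to smallest.
df1 = (tidy.groupby('Word')['Count']
           .sum()
           .sort_values(ascending=False)
           .reset_index())
print(df1)
```

Each word carries the Count of the row it came from, so the final sum is the weighted count rather than the number of occurrences.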
Another solution:
from itertools import chain
import pandas as pd
s = df['Keyword'].str.split()
df = pd.DataFrame({
'Word' : list(chain.from_iterable(s.values.tolist())),
'Count' : df['Count'].repeat(s.str.len())
})
print (df)
Word Count
0 Sell 10
0 Laptop 10
0 Online 10
1 Buy 8
1 Computer 8
1 Online 8
2 Laptop 5
2 and 5
2 Case 5
df1 = df.groupby('Word')['Count'].sum().sort_values(ascending=False).reset_index()
print (df1)
Word Count
0 Online 18
1 Laptop 15
2 Sell 10
3 Computer 8
4 Buy 8
5 and 5
6 Case 5
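Explanation:
Repeat each Count value once per word of the split Keyword and build a new DataFrame from the flattened words.
Then aggregate with groupby and sum, sort the Series with sort_values, and finish with reset_index.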
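On pandas 0.25 or newer, the same repeat-then-sum idea can probably be written more compactly with DataFrame.explode. This is a sketch of an alternative, not the method from the answer above; the sample frame is rebuilt from the question and the name `words` is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Keyword': ['Sell Laptop Online', 'Buy Computer Online', 'Laptop and Case'],
    'Count': [10, 8, 5],
})

# Split each keyword into a list of words, then give every word its own row.
# explode repeats the other columns (here Count) for each list element.
words = df.assign(Word=df['Keyword'].str.split()).explode('Word')

# Sum the repeated counts per word and sort from largest to smallest.
df1 = (words.groupby('Word')['Count']
            .sum()
            .sort_values(ascending=False)
            .reset_index())
print(df1)
```

This avoids the manual chain and repeat bookkeeping, at the cost of requiring a pandas version that provides explode.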
Here is a simple approach that only needs one-hot encoding:
df['Keyword'].str.get_dummies(sep=' ').mul(df['Count'],axis=0).sum(0).to_frame('Count')
Count
Buy 8
Case 5
Computer 8
Laptop 15
Online 18
Sell 10
and 5
If you need more speed, try scikit-learn's MultiLabelBinarizer, i.e.:
from sklearn.preprocessing import MultiLabelBinarizer
vec = MultiLabelBinarizer()
oh = (vec.fit_transform(df['Keyword'].str.split()) * df['Count'].values[:,None]).sum(0)
pd.DataFrame({'Count': oh ,'Word':vec.classes_})
Explanation:
get_dummies produces a one-hot encoded DataFrame:
Buy Case Computer Laptop Online Sell and
0 0 0 0 1 1 1 0
1 1 0 1 0 1 0 0
2 0 1 0 1 0 0 1
Multiply each row by its Count (axis=0):
Buy Case Computer Laptop Online Sell and
0 0 0 0 10 10 10 0
1 8 0 8 0 8 0 0
2 0 5 0 5 0 0 5
Sum each column, then convert the result to a DataFrame:
Buy 8
Case 5
Computer 8
Laptop 15
Online 18
Sell 10
and 5
dtype: int64
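The three steps above can be run one at a time with a short sketch like this one. The sample frame is rebuilt from the question, and the intermediate names `one_hot`, `weighted`, and `totals` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Keyword': ['Sell Laptop Online', 'Buy Computer Online', 'Laptop and Case'],
    'Count': [10, 8, 5],
})

# Step 1: one indicator column per distinct word (1 if the row contains it).
one_hot = df['Keyword'].str.get_dummies(sep=' ')

# Step 2: scale every row of indicators by that row's Count.
weighted = one_hot.mul(df['Count'], axis=0)

# Step 3: add up each word column and wrap the result in a DataFrame.
totals = weighted.sum(axis=0).to_frame('Count')
print(totals)
```

Because get_dummies only records whether a word is present, a word repeated inside a single keyword would be counted once for that row. The sample data has no such repeats, so the result matches the other approaches.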