Pandas: how to compute a weighted word count from a pivot table

This is my pivot table:

No  Keyword              Count
1   Sell Laptop Online   10
2   Buy Computer Online  8
3   Laptop and Case      5

This is what I want:

No   Word      Count
1    Online    18
2    Laptop    15
3    Sell      10
4    Buy        8
5    Computer   8
6    and        5
7    Case       5

What I did was:

df[['Keyword']].apply(lambda x: x.str.split(expand=True).stack()).stack().value_counts()

But the result is:

No   Word      Count
1    Online    2
2    Laptop    2
3    Sell      1
4    Buy       1
5    Computer  1
6    and       1
7    Case      1

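I want a weighted word count from the pivot table: each word's total should be the sum of Count over the keywords that contain it, not the number of keywords it appears in.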

With this input DataFrame df:

No  Keyword              Count
1   Sell Laptop Online   10
2   Buy Computer Online  8
3   Laptop and Case      5

use:

df1 = (df.set_index('Count')['Keyword']
         .str.split(expand=True)
         .stack()
         .reset_index(name='Word')
         .groupby('Word')['Count']
         .sum()
         .sort_values(ascending=False)
         .reset_index())

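Explanation:

1. Set Count as the index with set_index so that this information is not lost.
2. Create a DataFrame with str.split(expand=True).
3. Reshape it with stack.
4. Convert the MultiIndex to columns with reset_index.
5. Aggregate with sum.
6. Sort the resulting Series with sort_values.
7. Finally, call reset_index.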
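
The steps above can be traced one at a time on the sample data. This is only a sketch: the intermediate names (indexed, wide, stacked, long_form, totals) are illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Keyword': ['Sell Laptop Online', 'Buy Computer Online', 'Laptop and Case'],
    'Count': [10, 8, 5],
})

# 1. keep Count attached to each keyword by moving it into the index
indexed = df.set_index('Count')['Keyword']
# 2. one column per word position
wide = indexed.str.split(expand=True)
# 3. one row per (Count, word position) pair
stacked = wide.stack()
# 4. turn the MultiIndex back into columns; the words land in 'Word'
long_form = stacked.reset_index(name='Word')
# 5. add up the Count of every keyword that contains each word
totals = long_form.groupby('Word')['Count'].sum()
# 6-7. largest total first, then back to a two-column DataFrame
df1 = totals.sort_values(ascending=False).reset_index()
print(df1)
```

Words that tie on Count may come out in either order, so compare totals by word rather than by row position.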

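Another solution, which is faster for larger DataFrames:

from itertools import chain

import pandas as pd

# split each keyword into a list of words
s = df['Keyword'].str.split()

# flatten the word lists and repeat each Count once per word in its keyword
df = pd.DataFrame({
    'Word' : list(chain.from_iterable(s.values.tolist())), 
    'Count' : df['Count'].repeat(s.str.len())
})

print(df)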
       Word  Count
0      Sell     10
0    Laptop     10
0    Online     10
1       Buy      8
1  Computer      8
1    Online      8
2    Laptop      5
2       and      5
2      Case      5

df1 = df.groupby('Word')['Count'].sum().sort_values(ascending=False).reset_index()
print (df1)
       Word  Count
0    Online     18
1    Laptop     15
2      Sell     10
3  Computer      8
4       Buy      8
5       and      5
6      Case      5

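Explanation:

1. First repeat each Count value by the number of words in its split Keyword, and flatten the split words with chain.from_iterable, to build a new DataFrame.
2. Aggregate with sum, sort the Series with sort_values, and finally call reset_index.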
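
On pandas 0.25 or later, DataFrame.explode can perform the repeat-and-flatten step in a single call. This is an alternative sketch rather than part of the original answer, and it makes no claim about speed relative to the chain version; the names exploded and df1 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Keyword': ['Sell Laptop Online', 'Buy Computer Online', 'Laptop and Case'],
    'Count': [10, 8, 5],
})

# put the list of words in a new column, then emit one row per word;
# explode repeats the other columns (here Count) for every list element
exploded = df.assign(Word=df['Keyword'].str.split()).explode('Word')

# sum Count per word and sort from largest to smallest
df1 = (exploded.groupby('Word')['Count']
               .sum()
               .sort_values(ascending=False)
               .reset_index())
print(df1)
```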

Here is a simple approach that uses one-hot encoding:
df['Keyword'].str.get_dummies(sep=' ').mul(df['Count'],axis=0).sum(0).to_frame('Count')

          Count
Buy           8
Case          5
Computer      8
Laptop       15
Online       18
Sell         10
and           5

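If you need more speed, try scikit-learn's MultiLabelBinarizer, i.e.
from sklearn.preprocessing import MultiLabelBinarizer
vec = MultiLabelBinarizer()

# one-hot encode the split keywords, weight each row by its Count, then sum each column
oh = (vec.fit_transform(df['Keyword'].str.split()) * df['Count'].values[:, None]).sum(0)
pd.DataFrame({'Count': oh, 'Word': vec.classes_})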

Explanation:
get_dummies produces a one-hot encoded DataFrame:
    Buy  Case  Computer  Laptop  Online  Sell  and
 0    0     0         0       1       1     1    0
 1    1     0         1       0       1     0    0
 2    0     1         0       1       0     0    1

Multiply each row by its Count (mul with axis=0):
   Buy  Case  Computer  Laptop  Online  Sell  and
0    0     0         0      10      10    10    0
1    8     0         8       0       8     0    0
2    0     5         0       5       0     0    5

Sum each column; to_frame then converts the resulting Series to a DataFrame:
Buy          8
Case         5
Computer     8
Laptop      15
Online      18
Sell        10
and          5
dtype: int64
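
The multiply-then-sum steps amount to a matrix-vector product between the transposed one-hot table and the Count column. A minimal sketch of that equivalence on the sample data, using DataFrame.dot (the names dummies and weighted are illustrative, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Keyword': ['Sell Laptop Online', 'Buy Computer Online', 'Laptop and Case'],
    'Count': [10, 8, 5],
})

# one column per distinct word, 1 where the keyword contains it
dummies = df['Keyword'].str.get_dummies(sep=' ')

# rows of dummies.T are words and its columns are the original row labels,
# which align with the index of df['Count'] in the dot product
weighted = dummies.T.dot(df['Count'])
print(weighted.to_frame('Count'))
```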
