Python pandas: complex split, merge and pivot of multiple DataFrames
I have two pandas DataFrames that have to be merged and pivoted. In one of them, a column holds comma-separated strings. The DataFrames are:
import pandas as pd
import numpy as np
tableA = [(100, 'chocolate, sprinkles'),
(101, 'chocolate, sprinkles'),
(102, 'glazed')]
labels = ['product', 'tags']
dfA = pd.DataFrame.from_records(tableA, columns=labels)
tableB = [('A', 100),
('A', 101),
('B', 101),
('C', 100),
('C', 102),
('B', 101),
('A', 100),
('C', 102)]
labels = ['customer', 'product']
dfB = pd.DataFrame.from_records(tableB, columns=labels)
dfA:
product tags
0 100 chocolate, sprinkles
1 101 chocolate, sprinkles
2 102 glazed
dfB:
customer product
0 A 100
1 A 101
2 B 101
3 C 100
4 C 102
5 B 101
6 A 100
7 C 102
The result must look like this:
customer sprinkles chocolate glazed
A ? ? ?
B ? ? ?
C ? ? ?
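I have tried various functions, but all of them failed. Any suggestions would be much appreciated.
Here is some of my code. I know it does not work, but it should give you an idea of what I am trying to do: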
dfC=dfB.merge(dfA, left_on='product', right_on='product')
print(dfC)
This results in:
customer product tags
0 A 100 chocolate, sprinkles
1 C 100 chocolate, sprinkles
2 A 100 chocolate, sprinkles
3 A 101 chocolate, sprinkles
4 B 101 chocolate, sprinkles
5 B 101 chocolate, sprinkles
6 C 102 glazed
7 C 102 glazed
and
which results in:
var1 var2
0 A chocolate
1 A sprinkles
2 C chocolate
3 C sprinkles
4 A chocolate
5 A sprinkles
6 A chocolate
7 A sprinkles
8 B chocolate
9 B sprinkles
10 B chocolate
11 B sprinkles
12 C glazed
13 C glazed
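The code that produced this long table is not shown above. A minimal sketch of one way to arrive at a comparable frame, assuming pandas 0.25 or later (for `DataFrame.explode`); the names `dfS`, `var1` and `var2` are taken from the listing, and the row order may differ from it:

```python
import pandas as pd

dfA = pd.DataFrame({'product': [100, 101, 102],
                    'tags': ['chocolate, sprinkles', 'chocolate, sprinkles', 'glazed']})
dfB = pd.DataFrame({'customer': ['A', 'A', 'B', 'C', 'C', 'B', 'A', 'C'],
                    'product': [100, 101, 101, 100, 102, 101, 100, 102]})

# Attach the tag string to every purchase row.
dfC = dfB.merge(dfA, on='product')

# Split the tag string into a list, then give each tag its own row.
dfS = (dfC.assign(var2=dfC['tags'].str.split(','))
          .explode('var2')
          .rename(columns={'customer': 'var1'})
          [['var1', 'var2']]
          .reset_index(drop=True))

# Splitting on ',' leaves a leading space on the second tag, so strip it.
dfS['var2'] = dfS['var2'].str.strip()
print(dfS)
```

Stripping right after the split keeps `' sprinkles'` and `'sprinkles'` from being treated as two different tags later on.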
First, you need to strip var2:
dfS['var2'] = dfS['var2'].str.strip()
to remove the spaces. Then you can create one column per tag, for example:
dfS['chocolate'] = dfS['var2'].apply(lambda x: 1 if x == 'chocolate' else 0)
dfS['sprinkles'] = dfS['var2'].apply(lambda x: 1 if x == 'sprinkles' else 0)
dfS['glazed'] = dfS['var2'].apply(lambda x: 1 if x == 'glazed' else 0)
Then you can groupby var1 and aggregate the tag columns as sums, for example:
dfS.groupby('var1')[['chocolate', 'sprinkles', 'glazed']].sum().reset_index().rename(columns={'var1': 'customer'})
The output looks like this:
customer chocolate sprinkles glazed
0 A 3 3 0
1 B 2 2 0
2 C 1 1 2
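If the set of tags is not known in advance, writing one lambda per tag does not scale. As a sketch of an alternative (not the answer's own method), `pd.get_dummies` can build the indicator columns automatically; the frame below just re-creates the long table from the question with `var2` already stripped:

```python
import pandas as pd

# Long table as in the listing above (already stripped).
dfS = pd.DataFrame({
    'var1': ['A', 'A', 'C', 'C', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'var2': ['chocolate', 'sprinkles'] * 6 + ['glazed', 'glazed'],
})

# One 0/1 indicator column per distinct tag, without naming each tag by hand.
indicators = pd.get_dummies(dfS['var2']).astype(int)

# Sum the indicators per customer.
counts = (indicators.groupby(dfS['var1']).sum()
                    .rename_axis('customer')
                    .reset_index())
print(counts)
```

The tag columns come out in alphabetical order here, so reorder them explicitly if a particular column order is required.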
With the combined DataFrame dfs, you can use pd.crosstab to get the tag usage counts per customer:
pd.crosstab(dfs.var1,dfs.var2)
var2 chocolate glazed sprinkles
var1
A 3 0 3
B 2 0 2
C 1 2 1
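Thanks. But as you can see, although chocolate and sprinkles occur equally often in the original DataFrame, only one of them has values in the result. The same thing happened to me when I used table_pivot directly. Any suggestions?
Sorry, I have now added strip() to remove the spaces in the var2 column (hidden spaces are always tricky for me).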
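The hidden-space problem described in these comments can be reproduced on a small made-up frame. The data below is hypothetical and only serves to show why unstripped tags end up in separate columns:

```python
import pandas as pd

# Hypothetical sample: one tag keeps the space left over from splitting on ','.
raw = pd.DataFrame({'var1': ['A', 'A', 'C'],
                    'var2': ['chocolate', ' sprinkles', 'sprinkles']})

# Without stripping, ' sprinkles' and 'sprinkles' are counted as different tags.
raw_columns = pd.crosstab(raw['var1'], raw['var2']).columns.tolist()
print(raw_columns)

# After stripping, the two spellings collapse into one column.
clean = raw.assign(var2=raw['var2'].str.strip())
clean_columns = pd.crosstab(clean['var1'], clean['var2']).columns.tolist()
print(clean_columns)
```

Printing the column labels as a list makes the leading space visible, which is easy to miss in the tabular output.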