Python: reshape a DataFrame to turn categorical columns into individual columns
I have data like the following:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[list('ABCDE'),
                        ['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
                        ['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
                        ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
                        ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
                        list(np.random.randint(10, 100, 5)),
                        list(np.random.randint(10, 100, 5))]
                  ).T
df.columns = ['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']
   ID   Substance1   Substance2  Category1  Category2  Quantity1  Quantity2
0   A    Crude Oil  Natural Gas        Oil        Gas         85         14
1   B  Natural Gas   Salt water        Gas      Water         95         78
2   C     Gasoline  Waste water    Refined      Water         33         25
3   D       Diesel    Motor oil    Refined        Oil         49         54
4   E      Bitumen     Sour Gas        Oil        Gas         92         86
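The Category and Quantity columns refer to the corresponding Substance columns.
I want to expand the Category columns into a new column for each unique value, with the Quantity values as the cell values. Categories that do not occur for a row would be NaN. So the resulting frame would look like this: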
  ID  Oil  Gas  Water  Refined
0  A   85   14    NaN      NaN
1  B  NaN   95     78      NaN
2  C  NaN  NaN     25       33
3  D   54  NaN    NaN       49
4  E   92   86    NaN      NaN
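I tried .melt() followed by .pivot_table(), but for some reason the values get duplicated across the new category columns.

You need to reshape to long format with pd.wide_to_long and then use groupby: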
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(data=[list('ABCDE'),
                        ['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
                        ['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
                        ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
                        ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
                        list(np.random.randint(10, 100, 5)),
                        list(np.random.randint(10, 100, 5))]
                  ).T
df.columns = ['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']

# Stack the numbered Substance/Category/Quantity columns into one row per (ID, Num),
# then sum Quantity per ID and Category and spread the categories into columns.
pd.wide_to_long(df, ['Substance', 'Category', 'Quantity'], 'ID', 'Num', '', '.+')\
  .groupby(['ID', 'Category'])['Quantity'].sum()\
  .unstack().reset_index()
Output:
Category ID   Gas   Oil  Refined  Water
0         A  19.0  54.0      NaN    NaN
1         B  57.0   NaN      NaN   93.0
2         C   NaN   NaN     74.0   31.0
3         D   NaN  46.0     77.0    NaN
4         E  97.0  77.0      NaN    NaN
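To make the result easy to check against the table in the question, here is a minimal sketch of the same wide_to_long / groupby / unstack chain with the question's quantities hard-coded instead of random. Building the frame from a dict (so Quantity stays numeric) and the intermediate names long_df and result are my own choices, not part of the answer above.

```python
import pandas as pd

# Same layout as the question, with the quantities from its printed table.
df = pd.DataFrame({
    'ID': list('ABCDE'),
    'Substance1': ['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
    'Substance2': ['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
    'Category1': ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
    'Category2': ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
    'Quantity1': [85, 95, 33, 49, 92],
    'Quantity2': [14, 78, 25, 54, 86],
})

# One row per (ID, Num): the numeric suffix of each stub becomes the 'Num' level.
long_df = pd.wide_to_long(
    df, stubnames=['Substance', 'Category', 'Quantity'],
    i='ID', j='Num', sep='', suffix='.+',
).reset_index()

# Sum Quantity per (ID, Category), then move the categories into columns.
result = (long_df.groupby(['ID', 'Category'])['Quantity'].sum()
                 .unstack()
                 .reset_index())
print(result)
```

Each (ID, Category) pair occurs only once in this data, so the sum simply carries the single quantity through. If a row listed the same category twice, the two quantities would be added together.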
This works, but not quite. It created a large number of duplicate columns and summed all of the numbers, which gave inaccurate values. Adding two methods to the chain, reset_index and drop_duplicates, fixed it:
pd.wide_to_long(df, ['Substance', 'Category', 'Quantity'], 'ID', 'Num', '', '.+')\
  .reset_index().drop_duplicates(subset=['ID', 'Num'])\
  .groupby(['ID', 'Category'])['Quantity'].sum()\
  .unstack().reset_index()
I would add that this nice solution also needs some adjustments on older pandas (0.19).
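If the wide_to_long call is the part that needs adjusting, one possible alternative is to stack the Category/Quantity pairs by hand and pivot. This is only a sketch: the names parts, pairs and wide are hypothetical, the data is the question's table, and I have not verified it on pandas 0.19.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': list('ABCDE'),
    'Category1': ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
    'Category2': ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
    'Quantity1': [85, 95, 33, 49, 92],
    'Quantity2': [14, 78, 25, 54, 86],
})

# Take each numbered Category/Quantity pair and give it common column names.
parts = []
for n in (1, 2):
    part = df[['ID', 'Category%d' % n, 'Quantity%d' % n]].copy()
    part.columns = ['ID', 'Category', 'Quantity']
    parts.append(part)

# Stack the pairs into one long frame, then pivot categories into columns.
pairs = pd.concat(parts, ignore_index=True)
wide = pairs.pivot_table(index='ID', columns='Category',
                         values='Quantity', aggfunc='sum').reset_index()
print(wide)
```

The Substance columns are left out here because they do not appear in the desired output.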