Python: Reshape a dataframe to convert categorical columns into individual columns

python, pandas

I have data like the following:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[list('ABCDE'),
          ['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
          ['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
          ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
          ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
          list(np.random.randint(10, 100, 5)),
          list(np.random.randint(10, 100, 5))]
          ).T
df.columns =['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']

  ID   Substance1  Substance2 Category1 Category2 Quantity1 Quantity2
0  A    Crude Oil  Natural Gas      Oil       Gas        85        14
1  B  Natural Gas   Salt water      Gas     Water        95        78
2  C     Gasoline  Waste water  Refined     Water        33        25
3  D       Diesel    Motor oil  Refined       Oil        49        54
4  E      Bitumen     Sour Gas      Oil       Gas        92        86
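The Category and Quantity columns refer to the corresponding Substance column.

I want to expand the Category columns into one new column per unique value, with the Quantity values as the cell values. Categories that do not exist for a row should be NaN. The resulting frame would then look like this: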

  ID   Oil  Gas Water Refined
0  A    85   14   NaN     NaN
1  B   NaN   95    78     NaN
2  C   NaN  NaN    25      33
3  D    54  NaN   NaN      49  
4  E    92   86   NaN     NaN

I tried .melt() followed by .pivot_table(), but for some reason the values end up duplicated across the new category columns.

You need to melt the frame into long form and then use groupby. The code below does the melt step with pd.wide_to_long:
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the random quantities are reproducible

df = pd.DataFrame(data=[list('ABCDE'), 
          ['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
          ['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
          ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
          ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
          list(np.random.randint(10, 100, 5)),
          list(np.random.randint(10, 100, 5))]
          ).T
df.columns =['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']

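# Reshape wide -> long on the numbered stubs, then sum Quantity per (ID, Category)
# and spread the categories back out into columns.
(pd.wide_to_long(df, stubnames=['Substance', 'Category', 'Quantity'],
                 i='ID', j='Num', sep='', suffix='.+')
   .groupby(['ID', 'Category'])['Quantity'].sum()
   .unstack().reset_index())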
Output:

Category ID   Gas   Oil  Refined  Water
0         A  19.0  54.0      NaN    NaN
1         B  57.0   NaN      NaN   93.0
2         C   NaN   NaN     74.0   31.0
3         D   NaN  46.0     77.0    NaN
4         E  97.0  77.0      NaN    NaN
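
As a sanity check against the table the question asks for, here is a minimal sketch that runs the same wide_to_long / groupby / unstack pipeline on the literal quantities printed in the question rather than on random ones. The fixed values, the dict-style constructor, the intermediate name long_df and the `\d+` suffix are choices made for this illustration and are not part of the original answer.

```python
import pandas as pd

# Same frame as in the question, but with the quantities shown in its
# printed table so the result can be compared with the requested output.
df = pd.DataFrame({
    'ID': list('ABCDE'),
    'Substance1': ['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
    'Substance2': ['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
    'Category1': ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
    'Category2': ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
    'Quantity1': [85, 95, 33, 49, 92],
    'Quantity2': [14, 78, 25, 54, 86],
})

# Step 1: wide -> long. Each ID yields one row per numeric suffix (1, 2),
# with Substance / Category / Quantity lined up side by side.
long_df = pd.wide_to_long(
    df,
    stubnames=['Substance', 'Category', 'Quantity'],
    i='ID', j='Num', sep='', suffix=r'\d+',
).reset_index()

# Step 2: sum Quantity per (ID, Category), then turn the categories
# into columns. Missing combinations become NaN.
wide = (long_df.groupby(['ID', 'Category'])['Quantity'].sum()
               .unstack()
               .reset_index())

print(wide)
```

With these inputs, row A should carry Oil = 85 and Gas = 14 with NaN under Refined and Water, which matches the layout requested in the question (apart from column order, since unstack sorts the category names).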

This works, but not quite: it creates a lot of duplicated columns and adds all the numbers together, which leads to inaccurate values. But with two more methods added to the chain, reset_index and drop_duplicates:
pd.wide_to_long(df,['Substance','Category','Quantity'], 'ID','Num','','.+')\
  .reset_index().drop_duplicates(subset=['ID','Num'])\
  .groupby(['ID','Category'])['Quantity'].sum()\
  .unstack().reset_index()
I would add that this nice solution also needs some adjustment for older pandas versions such as 0.19.
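
The comment does not say which adjustments older pandas needs. One possible workaround, sketched here as an assumption and not as the commenter's actual fix, avoids wide_to_long altogether: stack each (Category, Quantity) pair by hand with pd.concat and then pivot. The fixed quantities are taken from the table in the question, the names pairs, long_df and wide are made up for the illustration, and the snippet has not been verified against pandas 0.19 itself.

```python
import pandas as pd

# Only the columns needed for the reshape; values copied from the
# table printed in the question.
df = pd.DataFrame({
    'ID': list('ABCDE'),
    'Category1': ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
    'Category2': ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
    'Quantity1': [85, 95, 33, 49, 92],
    'Quantity2': [14, 78, 25, 54, 86],
})

# Build the long form manually: one slice per numbered pair,
# renamed to common column names, then stacked vertically.
pairs = []
for n in (1, 2):
    part = df[['ID', 'Category{}'.format(n), 'Quantity{}'.format(n)]].copy()
    part.columns = ['ID', 'Category', 'Quantity']
    pairs.append(part)
long_df = pd.concat(pairs, ignore_index=True)

# Pivot the categories into columns. aggfunc='sum' only matters if the
# same category appears twice for one ID; absent categories become NaN.
wide = (long_df.pivot_table(index='ID', columns='Category',
                            values='Quantity', aggfunc='sum')
               .reset_index())

print(wide)
```

Because each Quantity is paired explicitly with its own Category before pivoting, a value cannot leak into a category it does not belong to, which is the symptom described in the question.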