Python: How do I insert missing rows into this dataset?
What I am trying to do is insert rows for the missing records in a dataset.
If you look at the dataset above, it contains 3 attribute columns followed by 2 numeric values. The third column, TTF, is incremental and should not skip any values. In this example, 2 rows are missing; they are shown at the bottom. I want my code to insert these two rows into the result set (i.e. Computer - Display is missing TTF 5, and Television - Power Supply is missing TTF 6). For each inserted row I would set the Repair value to 0 and the Running Total to the same value as the previous row.
I was thinking I would approach it by splitting out the column values and iterating over the first 2 columns, then going from 1 to 8 for the 3rd:
for i in range(len(Product)):
    for j in range(len(Module)):
        for k in range(1, 9):  # TTF runs from 1 to 8 inclusive
            # Check if the Repair value is there if not make it 0
            # If Repair value is missing, look up previous Running Total
            pass
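Does this seem like the best approach? Any help with actual code to accomplish this would be appreciated.
EDIT: Here is the code that reads the data into a DataFrame, since the Excel screenshot seemed to be confusing.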
>>> import pandas as pd
>>>
>>> df = pd.read_csv('minimal.csv')
>>>
>>> df
       Product        Module  TTF  Repair  Running Total
0     Computer       Display    1       3              3
1     Computer       Display    2       2              5
2     Computer       Display    3       1              6
3     Computer       Display    4       5             11
4     Computer       Display    6       4             15
5     Computer       Display    7       3             18
6     Computer       Display    8       2             20
7   Television  Power Supply    1       7              7
8   Television  Power Supply    2       6             13
9   Television  Power Supply    3       4             17
10  Television  Power Supply    4       5             22
11  Television  Power Supply    5       6             28
12  Television  Power Supply    7       7             35
13  Television  Power Supply    8       8             43
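Before inserting anything, it can help to confirm which TTF values are actually missing in each group. The following is a minimal sketch, assuming the data matches the printout above (it is rebuilt inline instead of being read from minimal.csv) and that TTF is expected to cover 1 to 8; the names `expected` and `missing` are made up for illustration.

```python
import pandas as pd

# Same rows as the printout above, rebuilt inline so the sketch is self-contained.
df = pd.DataFrame({
    "Product": ["Computer"] * 7 + ["Television"] * 7,
    "Module": ["Display"] * 7 + ["Power Supply"] * 7,
    "TTF": [1, 2, 3, 4, 6, 7, 8, 1, 2, 3, 4, 5, 7, 8],
    "Repair": [3, 2, 1, 5, 4, 3, 2, 7, 6, 4, 5, 6, 7, 8],
    "Running Total": [3, 5, 6, 11, 15, 18, 20, 7, 13, 17, 22, 28, 35, 43],
})

expected = set(range(1, 9))  # assumption: TTF should cover 1..8 with no gaps

# For each (Product, Module) pair, list the expected TTF values that never appear.
missing = {
    key: sorted(expected - set(group["TTF"]))
    for key, group in df.groupby(["Product", "Module"])
}
print(missing)
# {('Computer', 'Display'): [5], ('Television', 'Power Supply'): [6]}
```

This only reports the gaps; it does not modify the DataFrame.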
Let's use reindex with np.arange to create new TTF rows for the missing numbers:
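import numpy as np
import pandas as pd
# Repair values are random here, so the numbers printed below are one sample run.
df = pd.DataFrame({'Product':['Computer']*7 + ['Television']*7,'Module':['Display']*7 + ['Power Supply']*7,
                   'TTF':[1,2,3,4,6,7,8,1,2,3,4,5,7,8],'Repair':np.random.randint(1,8,14)})
df['Running Total'] = df['Repair'].cumsum()
print(df)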
Input dataframe:
          Module     Product  Repair  TTF  Running Total
0        Display    Computer       6    1              6
1        Display    Computer       2    2              8
2        Display    Computer       2    3             10
3        Display    Computer       4    4             14
4        Display    Computer       2    6             16
5        Display    Computer       3    7             19
6        Display    Computer       6    8             25
7   Power Supply  Television       3    1             28
8   Power Supply  Television       3    2             31
9   Power Supply  Television       5    3             36
10  Power Supply  Television       6    4             42
11  Power Supply  Television       4    5             46
12  Power Supply  Television       2    7             48
13  Power Supply  Television       2    8             50
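# Reindex each (Product, Module) group on TTF so the gaps in 1..8 appear as NaN rows.
df_out = df.set_index('TTF').groupby(['Product','Module'], group_keys=False).apply(lambda x: x.reindex(np.arange(1,9)))
# Note: the zero-filled values go into a new lowercase 'repair' column; 'Repair' itself is forward-filled below.
df_out['repair'] = df_out['Repair'].fillna(0)
# Forward-fill the remaining NaNs (Module, Product, Repair, Running Total) from the previous row.
df_out = df_out.ffill().reset_index()
print(df_out)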
Output:
    TTF        Module     Product  Repair  Running Total  repair
0     1       Display    Computer     6.0            6.0     6.0
1     2       Display    Computer     2.0            8.0     2.0
2     3       Display    Computer     2.0           10.0     2.0
3     4       Display    Computer     4.0           14.0     4.0
4     5       Display    Computer     4.0           14.0     0.0
5     6       Display    Computer     2.0           16.0     2.0
6     7       Display    Computer     3.0           19.0     3.0
7     8       Display    Computer     6.0           25.0     6.0
8     1  Power Supply  Television     3.0           28.0     3.0
9     2  Power Supply  Television     3.0           31.0     3.0
10    3  Power Supply  Television     5.0           36.0     5.0
11    4  Power Supply  Television     6.0           42.0     6.0
12    5  Power Supply  Television     4.0           46.0     4.0
13    6  Power Supply  Television     4.0           46.0     0.0
14    7  Power Supply  Television     2.0           48.0     2.0
15    8  Power Supply  Television     2.0           50.0     2.0
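In the output above, the zero-filled values land in a separate lowercase repair column while Repair itself is forward-filled, and the forward fill runs across the whole frame. If the goal is to write 0 into Repair directly and carry Running Total forward only within each Product/Module pair, one possible variant is sketched below. It is not the answer's code: it uses the question's data rebuilt inline, assumes TTF should span 1 to 8, and assumes that a Running Total with no earlier row in its group should be 0.

```python
import pandas as pd

# The question's data, rebuilt inline so the sketch is self-contained.
df = pd.DataFrame({
    "Product": ["Computer"] * 7 + ["Television"] * 7,
    "Module": ["Display"] * 7 + ["Power Supply"] * 7,
    "TTF": [1, 2, 3, 4, 6, 7, 8, 1, 2, 3, 4, 5, 7, 8],
    "Repair": [3, 2, 1, 5, 4, 3, 2, 7, 6, 4, 5, 6, 7, 8],
    "Running Total": [3, 5, 6, 11, 15, 18, 20, 7, 13, 17, 22, 28, 35, 43],
})

keys = ["Product", "Module"]
pairs = df[keys].drop_duplicates()

# Every (Product, Module, TTF) combination for TTF 1..8, only for pairs that exist.
full_index = pd.MultiIndex.from_tuples(
    [(p, m, t) for p, m in pairs.itertuples(index=False) for t in range(1, 9)],
    names=keys + ["TTF"],
)

# Reindexing adds the missing combinations as rows of NaN.
df_out = df.set_index(keys + ["TTF"]).reindex(full_index)

# Inserted rows get Repair = 0.
df_out["Repair"] = df_out["Repair"].fillna(0).astype(int)

# Carry Running Total forward within each group; a leading gap (none here)
# would fall back to 0, which is an assumption of this sketch.
df_out["Running Total"] = (
    df_out.groupby(level=keys)["Running Total"].ffill().fillna(0).astype(int)
)

df_out = df_out.reset_index()
print(df_out)
```

Because the key columns are part of the index during the reindex, the inserted rows already carry their Product and Module values, so nothing has to be forward-filled across group boundaries.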
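What you posted is a screenshot of what looks like an Excel file. You should really show how you read this data into Python/pandas and how you intend to proceed from there. Have you loaded a single Excel file into a single DataFrame?
I created the Excel screenshot because I thought it showed the problem best. In reality I am pulling a result set from Teradata, which creates the pandas df.
See also. It is hard to understand how to make progress from an Excel screenshot. Is it a single DF? In that case you only need to sort the data.
I have edited the original question to show the code that gets me to where I am.
Thank you! This is exactly what I was looking for. Any suggestions on how to make the lambda dynamic instead of hard-coding 1,9? That is, the data may have a different TTF range over time.
@MichaelMelillo You can use df.TTF.min() and df.TTF.max() in the np.arange function.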
df_out = df.set_index('TTF').groupby(['Product','Module'], group_keys=False).apply(lambda x: x.reindex(np.arange(df.TTF.min(), df.TTF.max()+1)))
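The line above takes the range from the minimum and maximum TTF of the whole frame. If each Product/Module pair should instead be filled only across its own TTF range, a per-group version might look like the sketch below. The helper name `fill_ttf_gaps` and the cut-down sample data are invented for illustration, and the loop over groups is a substitute for `groupby(...).apply`, not the answer's method.

```python
import numpy as np
import pandas as pd


def fill_ttf_gaps(frame, keys=("Product", "Module")):
    """Insert rows for missing TTF values, using each group's own min..max range."""
    keys = list(keys)
    pieces = []
    for key_values, group in frame.groupby(keys):
        lo, hi = group["TTF"].min(), group["TTF"].max()
        filled = group.set_index("TTF").reindex(np.arange(lo, hi + 1))
        # Restore the key columns on the inserted rows.
        for name, value in zip(keys, key_values):
            filled[name] = value
        filled["Repair"] = filled["Repair"].fillna(0)
        # The first row of each range always exists, so ffill leaves no NaN.
        filled["Running Total"] = filled["Running Total"].ffill()
        pieces.append(filled.rename_axis("TTF").reset_index())
    out = pd.concat(pieces, ignore_index=True)
    return out.astype({"Repair": int, "Running Total": int})


# Made-up sample in which the two groups cover different TTF ranges.
demo = pd.DataFrame({
    "Product": ["Computer"] * 3 + ["Television"] * 2,
    "Module": ["Display"] * 3 + ["Power Supply"] * 2,
    "TTF": [2, 3, 5, 1, 3],
    "Repair": [4, 1, 2, 5, 6],
})
demo["Running Total"] = demo.groupby(["Product", "Module"])["Repair"].cumsum()

out = fill_ttf_gaps(demo)
print(out)
```

With this sample, Computer/Display is filled over TTF 2 to 5 and Television/Power Supply over TTF 1 to 3, so neither group is padded out to the other's range.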