Python: How do I insert missing rows into this dataset?


python pandas numpy


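I am trying to insert rows for missing records into a dataset.

The dataset has 3 attribute columns followed by 2 numeric values. The third column, TTF, is incremental and should not skip any values. In this example, 2 rows are missing. I want my code to insert those two rows into the result set: Computer / Display is missing TTF 5, and Television / Power Supply is missing TTF 6. For an inserted row I would set Repair to 0 and Running Total to the same value as the previous row.

My thinking is to split out the columns, iterate over the first two, and then iterate over the third from 1 to 8: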

for i in range(len(Product)):
    for j in range(len(Module)):
        for k in range(1, 9):  # TTF 1..8 inclusive; range(1, 8) would stop at 7
            # Check if the Repair value is there; if not, make it 0
            # If the Repair value is missing, look up the previous Running Total
            pass

Does this seem like the best approach? Any help with actual code to accomplish this would be greatly appreciated.

EDIT: Here is the code that reads the data into a DataFrame, since the Excel screenshot seemed to cause confusion.

>>> import pandas as pd
>>> 
>>> df = pd.read_csv('minimal.csv')
>>> 
>>> df
       Product         Module   TTF   Repair   Running Total
0     Computer        Display     1        3               3
1     Computer        Display     2        2               5
2     Computer        Display     3        1               6
3     Computer        Display     4        5              11
4     Computer        Display     6        4              15
5     Computer        Display     7        3              18
6     Computer        Display     8        2              20
7   Television   Power Supply     1        7               7
8   Television   Power Supply     2        6              13
9   Television   Power Supply     3        4              17
10  Television   Power Supply     4        5              22
11  Television   Power Supply     5        6              28
12  Television   Power Supply     7        7              35
13  Television   Power Supply     8        8              43
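
Before inserting anything, it can help to confirm which TTF values are actually absent in each group. The sketch below is only an illustration: it rebuilds the frame from the printout above instead of reading `minimal.csv`, and it assumes the full range is 1 through 8 as stated in the question. The names `expected_ttf` and `missing` are made up for this example.

```python
import numpy as np
import pandas as pd

# Data copied from the DataFrame printout in the question.
df = pd.DataFrame({
    'Product': ['Computer'] * 7 + ['Television'] * 7,
    'Module': ['Display'] * 7 + ['Power Supply'] * 7,
    'TTF': [1, 2, 3, 4, 6, 7, 8, 1, 2, 3, 4, 5, 7, 8],
    'Repair': [3, 2, 1, 5, 4, 3, 2, 7, 6, 4, 5, 6, 7, 8],
    'Running Total': [3, 5, 6, 11, 15, 18, 20, 7, 13, 17, 22, 28, 35, 43],
})

# Assumption: every group should cover TTF 1..8 inclusive.
expected_ttf = np.arange(1, 9)

# For each (Product, Module) pair, list the expected TTF values that are absent.
missing = {
    key: np.setdiff1d(expected_ttf, group['TTF'].to_numpy()).tolist()
    for key, group in df.groupby(['Product', 'Module'])
}

print(missing)
```

For this data the check reports TTF 5 for Computer / Display and TTF 6 for Television / Power Supply, which matches the two rows the question says are missing.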

Let's use reindex with np.arange to create the new TTF rows for the missing numbers:

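import numpy as np
import pandas as pd

# Sample data with the same gaps as the question; Repair values are random
df = pd.DataFrame({'Product':['Computer']*7 + ['Television']*7,'Module':['Display']*7 + ['Power Supply']*7,
                 'TTF':[1,2,3,4,6,7,8,1,2,3,4,5,7,8],'Repair':np.random.randint(1,8,14)})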

df['Running Total'] = df['Repair'].cumsum()

print(df)

Input DataFrame:

          Module     Product  Repair  TTF  Running Total
0        Display    Computer       6    1              6
1        Display    Computer       2    2              8
2        Display    Computer       2    3             10
3        Display    Computer       4    4             14
4        Display    Computer       2    6             16
5        Display    Computer       3    7             19
6        Display    Computer       6    8             25
7   Power Supply  Television       3    1             28
8   Power Supply  Television       3    2             31
9   Power Supply  Television       5    3             36
10  Power Supply  Television       6    4             42
11  Power Supply  Television       4    5             46
12  Power Supply  Television       2    7             48
13  Power Supply  Television       2    8             50


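# Index by TTF, then reindex each (Product, Module) group to the full 1..8 range;
# TTF values absent from a group become rows of NaN.
df_out = df.set_index('TTF').groupby(['Product','Module'], group_keys=False).apply(lambda x: x.reindex(np.arange(1,9)))

# The zero-filled repairs go into a new lowercase 'repair' column;
# the original 'Repair' column is left with NaN at this point.
df_out['repair'] = df_out['Repair'].fillna(0)

# Forward-fill everything that is still NaN (Product, Module, Running Total and 'Repair').
df_out = df_out.ffill().reset_index()

print(df_out)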

Output:

    TTF        Module     Product  Repair  Running Total  repair
0     1       Display    Computer     6.0            6.0     6.0
1     2       Display    Computer     2.0            8.0     2.0
2     3       Display    Computer     2.0           10.0     2.0
3     4       Display    Computer     4.0           14.0     4.0
4     5       Display    Computer     4.0           14.0     0.0
5     6       Display    Computer     2.0           16.0     2.0
6     7       Display    Computer     3.0           19.0     3.0
7     8       Display    Computer     6.0           25.0     6.0
8     1  Power Supply  Television     3.0           28.0     3.0
9     2  Power Supply  Television     3.0           31.0     3.0
10    3  Power Supply  Television     5.0           36.0     5.0
11    4  Power Supply  Television     6.0           42.0     6.0
12    5  Power Supply  Television     4.0           46.0     4.0
13    6  Power Supply  Television     4.0           46.0     0.0
14    7  Power Supply  Television     2.0           48.0     2.0
15    8  Power Supply  Television     2.0           50.0     2.0
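
Note that in the output above the zeros land in the lowercase `repair` column, while `Repair` itself is forward-filled. If the goal is a single `Repair` column holding 0 for inserted rows, one possible variant is sketched below. It is not the answer's code: it uses the question's data, loops over the groups explicitly instead of calling `groupby(...).apply`, and the names `full_ttf`, `pieces` and `df_filled` are made up for this example. The 1 to 8 range is still an assumption taken from the question.

```python
import numpy as np
import pandas as pd

# Data copied from the DataFrame printout in the question.
df = pd.DataFrame({
    'Product': ['Computer'] * 7 + ['Television'] * 7,
    'Module': ['Display'] * 7 + ['Power Supply'] * 7,
    'TTF': [1, 2, 3, 4, 6, 7, 8, 1, 2, 3, 4, 5, 7, 8],
    'Repair': [3, 2, 1, 5, 4, 3, 2, 7, 6, 4, 5, 6, 7, 8],
    'Running Total': [3, 5, 6, 11, 15, 18, 20, 7, 13, 17, 22, 28, 35, 43],
})

# Assumption: every group should cover TTF 1..8 inclusive.
full_ttf = pd.Index(np.arange(1, 9), name='TTF')

pieces = []
for (product, module), group in df.groupby(['Product', 'Module'], sort=False):
    # Rows for TTF values absent from this group appear as NaN.
    filled = group.set_index('TTF').reindex(full_ttf)
    # Restore the group labels on the inserted rows.
    filled['Product'] = product
    filled['Module'] = module
    # Inserted rows get Repair = 0, written back into the same column.
    filled['Repair'] = filled['Repair'].fillna(0).astype(int)
    # Inserted rows carry the previous Running Total within the group.
    filled['Running Total'] = filled['Running Total'].ffill()
    pieces.append(filled.reset_index())

df_filled = pd.concat(pieces, ignore_index=True)
print(df_filled)
```

Because the forward fill runs inside each group, a gap at the very first TTF of a group would leave Running Total as NaN instead of borrowing a value from the previous group; how to treat that case is a decision the question does not cover.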

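What you posted is a screenshot of what looks like an Excel file. You should really show how you read this data into Python/pandas and how you intend to proceed from there. Have you loaded a single Excel file into a single DataFrame?

I created the Excel screenshot because I thought it displayed best in the question. In reality I am pulling a result set from Teradata, which creates the pandas DataFrame.

See also.

It is hard to understand how to make progress from the Excel screenshot. Is it a single DataFrame? In that case you would just need to sort the data.

I edited the original question to show the code, so it is clear where I am.

Thanks! This is exactly what I wanted. Any suggestions on how to make the lambda dynamic instead of hard-coding 1,9? That is, the data may have different TTF ranges over time.

@MichaelMelillo You can use df.TTF.min() and df.TTF.max() in the np.arange function.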

df_out = df.set_index('TTF').groupby(['Product','Module'], group_keys=False).apply(lambda x: x.reindex(np.arange(df.TTF.min(), df.TTF.max()+1)))
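
The same dynamic range can also be combined with a merge instead of a per-group reindex. The sketch below is an alternative under stated assumptions, not the code from the answer: it builds a grid of every existing (Product, Module) pair against the observed TTF range, left-merges the data onto it, and then fills the gaps. It relies on `how='cross'`, which needs pandas 1.2 or later, and the names `keys`, `pairs`, `ttf`, `grid` and `df_dyn` are made up for this example.

```python
import numpy as np
import pandas as pd

# Data copied from the DataFrame printout in the question.
df = pd.DataFrame({
    'Product': ['Computer'] * 7 + ['Television'] * 7,
    'Module': ['Display'] * 7 + ['Power Supply'] * 7,
    'TTF': [1, 2, 3, 4, 6, 7, 8, 1, 2, 3, 4, 5, 7, 8],
    'Repair': [3, 2, 1, 5, 4, 3, 2, 7, 6, 4, 5, 6, 7, 8],
    'Running Total': [3, 5, 6, 11, 15, 18, 20, 7, 13, 17, 22, 28, 35, 43],
})

keys = ['Product', 'Module']

# Only the (Product, Module) pairs that actually occur in the data.
pairs = df[keys].drop_duplicates()

# TTF range taken from the data itself rather than hard-coded.
ttf = pd.DataFrame({'TTF': np.arange(df['TTF'].min(), df['TTF'].max() + 1)})

# Every existing pair combined with every TTF value in the range.
grid = pairs.merge(ttf, how='cross')

# Left merge keeps the grid order; TTF values without data become NaN rows.
df_dyn = grid.merge(df, on=keys + ['TTF'], how='left')

# Inserted rows get Repair = 0 and the previous Running Total of their own group.
df_dyn['Repair'] = df_dyn['Repair'].fillna(0).astype(int)
df_dyn['Running Total'] = df_dyn.groupby(keys)['Running Total'].ffill()

print(df_dyn)
```

As in the comment above, the minimum and maximum are taken over the whole frame, so every group is expanded to the same range. If each group should instead span only its own minimum and maximum, the range would have to be computed per group.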