Python: using combine_first on DataFrames with duplicate dates as the index

python pandas dataframe

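I have two DataFrames, each containing weather data for several locations on different dates. Below is a simplified version of my data that reproduces the problem: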

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,30,size=(10, 4)), columns=['Temp', 'Precip', 'Wind', 'Pressure'])
df1 = pd.DataFrame(np.random.randint(0,30,size=(10, 4)), columns=['Temp', 'Precip', 'Wind', 'Pressure'])

df['Location'] =[2,2,3,3,4,4,5,5,6,6]
df1['Location'] =[2,2,3,3,4,4,5,5,6,6]

df is indexed by 18 and 19 May 2020, and df1 by 19 and 20 May 2020, as follows:

df.index = ["2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00"]
df1.index = ["2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00"]

df.index = pd.to_datetime(df.index)
df1.index = pd.to_datetime(df1.index)

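The way the DataFrames are structured means that each Location point has two days of data in each DataFrame: the 18th and 19th in df, and the 19th and 20th in df1. An example is shown below: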

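I want to combine the two DataFrames into df3, where every Location point has values for the 18th, 19th and 20th, with the 18th coming from df and the 19th and 20th coming from df1. That is, df1 overwrites df for each location on the shared date, and the data for all later dates is then appended, to produce something like the following: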

In reality I have hundreds of locations over many days, so this needs to work on the basis of the index (I think).

I have tried the combine_first method, as follows:

df.combine_first(df1)

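However, because of the duplicate dates in the index, this produces a DataFrame with far more rows than I want: there should be 15 in total, and there are many more.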

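I think this is caused by the index, because when I try an example with plain dates for a single location it works fine, but I don't know how to do this for data that has multiple locations in the same DataFrame. I would really appreciate your help.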
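A quick way to confirm that suspicion is to check whether the index is unique. The sketch below uses a small stand-in frame (the values and the name keyed are illustrative, not taken from the question) and shows that the dates alone repeat, while the pair (date, Location) identifies each row:

```python
import pandas as pd

# Two dates repeated for two locations, mirroring the layout in the question.
dates = pd.to_datetime(["2020-05-18 12:00:00", "2020-05-19 12:00:00"] * 2)
df = pd.DataFrame({"Temp": [1, 2, 3, 4], "Location": [2, 2, 3, 3]}, index=dates)

# The date index on its own is not unique ...
print(df.index.is_unique)

# ... and every row shares its date with at least one other row.
n_repeated = int(df.index.duplicated(keep=False).sum())
print(n_repeated)

# Adding Location as a second index level makes each row uniquely addressable.
keyed = df.set_index("Location", append=True)
print(keyed.index.is_unique)
```

If the last check prints True on the real data, the date and Location together can serve as the key for combining the frames.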

Edit: the answer marked below does solve that problem, but now I want to add new data whose length does not match the existing index, as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,30,size=(10, 4)), columns=(['Temp', 'Precip', 'Wind', 'Pressure']))
df1 = pd.DataFrame(np.random.randint(0,30,size=(11, 4)), columns=(['Temp', 'Precip', 'Wind', 'Pressure']))

df['Location'] =[2,2,3,3,4,4,5,5,6,6]
df1['Location'] =[1,2,2,3,3,4,4,5,5,6,6]

df.index = ["2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00"]
df1.index = ["2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00"]

df.index = pd.to_datetime(df.index)
df1.index = pd.to_datetime(df1.index)

df1

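Now I have an additional location with the value 1. I want to add this location to df while also updating the existing locations with the values from df1. When I use the following code: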

df = df.set_index(df.groupby(level=0).cumcount(), append=True)
df1 = df1.set_index(df1.groupby(level=0).cumcount(), append=True)

df = df.combine_first(df1).sort_index(level=[1,0]).reset_index(level=1, drop=True)
print (df)

It updates df with the values from df1, but drops the new location. Is there any way to fix this?
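One likely reason is that cumcount numbers rows by their order of appearance within each date, not by Location. The sketch below uses two tiny stand-in frames (old and new, with made-up values) to show which locations end up sharing the same helper number once an extra location appears first in the new data:

```python
import pandas as pd

day19 = pd.Timestamp("2020-05-19 12:00:00")

# Stand-in for df: locations 2 and 3 on the 19th.
old = pd.DataFrame({"Temp": [10, 11], "Location": [2, 3]}, index=[day19, day19])
# Stand-in for df1: an extra location 1 appears first on the same date.
new = pd.DataFrame({"Temp": [20, 21, 22], "Location": [1, 2, 3]}, index=[day19] * 3)

# Same helper level as in the code above: a running count per date.
old_keyed = old.set_index(old.groupby(level=0).cumcount(), append=True)
new_keyed = new.set_index(new.groupby(level=0).cumcount(), append=True)

# Line up the Location found in each (date, counter) slot of both frames.
paired = pd.concat(
    [old_keyed["Location"].rename("old"), new_keyed["Location"].rename("new")],
    axis=1,
)
print(paired)
```

Slot 0 holds location 2 in the old frame but location 1 in the new one, so rows for different locations are matched against each other and the combined result no longer lines up by location.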

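The problem here is the duplicated index values: combine_first aligns the two DataFrames with an outer join on the index by default, so duplicated dates produce extra rows. The solution is to add a helper level to a MultiIndex so that the index values are unique, and finally to sort and remove the helper level: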

df = df.set_index(df.groupby(level=0).cumcount(), append=True)
df1 = df1.set_index(df1.groupby(level=0).cumcount(), append=True)
 
df = df.combine_first(df1).sort_index(level=[1,0]).reset_index(level=1, drop=True)
print (df)
                     Temp  Precip  Wind  Pressure  Location
2020-05-18 12:00:00  24.0     3.0   5.0      28.0       2.0
2020-05-19 12:00:00   8.0    21.0   2.0       6.0       2.0
2020-05-20 12:00:00  10.0    12.0   4.0      15.0       2.0
2020-05-18 12:00:00  25.0     4.0   6.0      14.0       3.0
2020-05-19 12:00:00  19.0     8.0  13.0      14.0       3.0
2020-05-20 12:00:00   5.0     5.0  13.0       1.0       3.0
2020-05-18 12:00:00   6.0    27.0  16.0      15.0       4.0
2020-05-19 12:00:00  24.0     3.0  24.0      25.0       4.0
2020-05-20 12:00:00  13.0     5.0  28.0      22.0       4.0
2020-05-18 12:00:00  18.0    26.0  13.0      23.0       5.0
2020-05-19 12:00:00  13.0    27.0  15.0      16.0       5.0
2020-05-20 12:00:00  25.0    11.0   6.0      21.0       5.0
2020-05-18 12:00:00  23.0    21.0   3.0      22.0       6.0
2020-05-19 12:00:00   6.0    12.0  10.0       2.0       6.0
2020-05-20 12:00:00   2.0    12.0  12.0      14.0       6.0
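A possible variation, assuming each (date, Location) pair occurs at most once per frame, is to use Location itself as the extra index level instead of a positional counter. Because combine_first keeps the caller's non-null values, the sketch calls it on the df1 stand-in so that the newer data wins on shared dates, which is what the question asks for. The frames below are small illustrative stand-ins, and the names left, right and combined are not from the original answer:

```python
import pandas as pd

d18, d19, d20 = pd.to_datetime(
    ["2020-05-18 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00"]
)

# Stand-ins for df and df1; df1 also contains a new location 1 on the 19th.
df = pd.DataFrame(
    {"Temp": [1, 2, 3, 4], "Location": [2, 2, 3, 3]},
    index=[d18, d19, d18, d19],
)
df1 = pd.DataFrame(
    {"Temp": [100, 200, 300, 400, 500], "Location": [1, 2, 2, 3, 3]},
    index=[d19, d19, d20, d19, d20],
)

# Key both frames on (date, Location); this must be unique for the idea to work.
left = df.set_index("Location", append=True)
right = df1.set_index("Location", append=True)
assert left.index.is_unique and right.index.is_unique

# Values from df1 take priority; rows that exist only in df are filled in.
combined = right.combine_first(left)

# Move Location back to a column and order by location, then date.
df3 = (
    combined.reset_index(level="Location")
    .sort_index()
    .sort_values("Location", kind="mergesort")  # stable, keeps the date order
)
print(df3)
```

With this keying the new location is kept, since rows are matched by location rather than by their position within each date.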

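Comments:

Just modify df3 = df3.drop_duplicates(subset=["index", "Location"], keep="last").set_index("index").sort_index() - I'm not sure of the cause and am looking into it.

That means the DataFrame has no "index" column. On which line do you get the error?

When I run df3 = df3.drop_duplicates(subset=["index", "FarmID"], keep="last") I get the error.

If you have .reset_index before it, you should have an "index" column. Can you check your df3 before calling drop_duplicates?

This solves the problem for the example above and for my more complex data, so I have marked it as the answer. Thank you very much!

Sorry to bother you. I am having trouble adding new data with this method, because the new location has less data than the existing locations. I have edited the question above to illustrate the new problem and would be grateful if you get a chance to look at it.
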
df3 = pd.concat([df,df1]).reset_index()
df3 = df3.drop_duplicates(subset=["index","Location"], keep="last")
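# mergesort is stable, so the date order from sort_index() is kept within each Location
df3 = df3.set_index("index").sort_index().sort_values(by="Location", kind="mergesort")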

In [29]: df3
Out[29]:
                     Temp  Precip  Wind  Pressure  Location
index                                                      
2020-05-18 12:00:00     9      13    17        27         2
2020-05-19 12:00:00    23      27    22         0         2
2020-05-20 12:00:00    21      22     0         5         2
2020-05-18 12:00:00    22      27    19        13         3
2020-05-19 12:00:00     4      29    21         0         3
2020-05-20 12:00:00    12      28    11        25         3
2020-05-18 12:00:00    29       8    21        20         4
2020-05-19 12:00:00    10       3    15        25         4
2020-05-20 12:00:00    23       2    14         5         4
2020-05-18 12:00:00    11      19    17        17         5
2020-05-19 12:00:00    13       1    12         7         5
2020-05-20 12:00:00     4      18    25        19         5
2020-05-18 12:00:00     3      21    16        18         6
2020-05-19 12:00:00    16      12    11        12         6
2020-05-20 12:00:00    27      19    13        19         6
    
In [30]: df3.shape
Out[30]: (15, 5)
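The same concat and drop_duplicates approach also appears to cover the edited question, where df1 contains a location that df lacks. The sketch below reruns it on small stand-in frames with made-up values. It relies on reset_index naming the new column "index", which only happens when the index has no name; a named index would produce a column with that name instead.

```python
import pandas as pd

d18, d19, d20 = pd.to_datetime(
    ["2020-05-18 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00"]
)

# Stand-ins for df and df1; df1 has an extra location 1 with a single row.
df = pd.DataFrame(
    {"Temp": [1, 2, 3, 4], "Location": [2, 2, 3, 3]},
    index=[d18, d19, d18, d19],
)
df1 = pd.DataFrame(
    {"Temp": [100, 200, 300, 400, 500], "Location": [1, 2, 2, 3, 3]},
    index=[d19, d19, d20, d19, d20],
)

# Stack both frames; the unnamed date index becomes a column called "index".
stacked = pd.concat([df, df1]).reset_index()

# For each (date, Location) keep the last row, i.e. the one from df1.
deduped = stacked.drop_duplicates(subset=["index", "Location"], keep="last")

# Sort by location and date explicitly, then restore the date index.
df3 = deduped.sort_values(["Location", "index"]).set_index("index")
print(df3)
```

Location 1 survives because rows are only dropped when both the date and the Location match, so the frames do not need the same number of rows per date.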