Python: merging/joining DataFrames

python pandas


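I'm not sure whether this counts as a merge or a join.

I have a DataFrame A with the columns ['times', 'spots'], and another DataFrame B with the same columns ['times', 'spots'].

I want to change DataFrame A so that it holds the values from DataFrame B, interpolated at the times from A. So DataFrame A would end up with a new column, spots_B.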

OK, here is how to merge using suffixes:

# First import our libraries
>>> import pandas as pd
>>> import numpy as np
# Then create our dataframes
>>> df_A = pd.DataFrame(np.random.rand(3,2),columns=['times','spots'])
>>> df_B = pd.DataFrame(np.random.rand(3,2),columns=['times','spots'])
# Set default values
>>> df_A['times'] = [1,2,3]
>>> df_B['times'] = [1,2,3]
>>> df_A['spots'] = [44,55,66]
>>> df_B['spots'] = [77,88,99]
# Here is what both dataframes contain
>>> df_A
   times  spots
0      1     44
1      2     55
2      3     66
>>> df_B
   times  spots
0      1     77
1      2     88
2      3     99
# Now the merge. Note: this does not modify the first dataframe in place;
#   it creates a new dataframe. You can overwrite the first one by assigning
#   the result to df_A instead of df_merged.
# Note the use of the keyword suffixes. When the same column name exists in
#   both dataframes (and is not being merged on), pandas needs to tell the two
#   columns apart. By default '_x' is appended to the left dataframe's column
#   name and '_y' to the right dataframe's column name (left and right are
#   set by the first two arguments of the merge function).
#   The suffixes keyword lets you override '_x' and '_y' with your own values.
>>> df_merged = pd.merge(df_A,df_B,how='inner',on=['times'],suffixes=['_A','_B'])
>>> df_merged
   times  spots_A  spots_B
0      1       44       77
1      2       55       88
2      3       66       99
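
To make the default suffix behaviour described in the comments concrete, here is a small standalone sketch. It rebuilds the same two frames from dictionaries (a shortcut not used in the session above) and compares a merge without `suffixes` to one with explicit suffixes. The variable names `default_merged` and `custom_merged` are illustrative only.

```python
import pandas as pd

# Same data as in the session above, built directly from dictionaries.
df_A = pd.DataFrame({"times": [1, 2, 3], "spots": [44, 55, 66]})
df_B = pd.DataFrame({"times": [1, 2, 3], "spots": [77, 88, 99]})

# No suffixes argument: pandas falls back to '_x' (left) and '_y' (right)
# for the overlapping, non-key column 'spots'.
default_merged = pd.merge(df_A, df_B, how="inner", on="times")
print(list(default_merged.columns))

# Explicit suffixes replace '_x' and '_y'.
custom_merged = pd.merge(df_A, df_B, how="inner", on="times",
                         suffixes=("_A", "_B"))
print(list(custom_merged.columns))

# The inputs are left untouched; merge returns a new dataframe.
print(list(df_A.columns))
```

The key column `times` is never suffixed, because it is the column being merged on.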

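Now, from your question it looks like you don't want the first dataframe's spots column name to be modified. This can be done the same way, except that instead of suffixes=['_A','_B'] you use suffixes=['','_B']. This effectively sets the left dataframe's column suffix to nothing, so the name stays unchanged: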

>>> df_merged = pd.merge(df_A,df_B,how='inner',on=['times'],suffixes=['','_B'])
>>> df_merged
   times  spots  spots_B
0      1     44       77
1      2     55       88
2      3     66       99

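Voilà! I hope this helps. If I have misunderstood and you are actually looking for interpolation between A and B, let me know and I will edit this answer.

*EDIT 1*

Given your last comment, here is what I believe you are trying to achieve. Below, I show how to extend the merge with suffixes and then use the 'time' interpolation method to fill the NaNs in spots_B with interpolated values.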

# Start by creating our datetimes to set for the times column
>>> import datetime
>>> times_A = []
>>> times_B = []
>>> for i in range(1,4):
...   times_A.append(datetime.datetime(year=2011,month=5,day=i))
...
>>> for i in range(1,6,2):
...   times_B.append(datetime.datetime(year=2011,month=5,day=i))
...
# times_A: May 1st, 2011 - May 3rd, 2011
>>> times_A
[datetime.datetime(2011, 5, 1, 0, 0), datetime.datetime(2011, 5, 2, 0, 0), datetime.datetime(2011, 5, 3, 0, 0)]
# times_B: May 1st 2011, May 3rd 2011, May 5th 2011
>>> times_B
[datetime.datetime(2011, 5, 1, 0, 0), datetime.datetime(2011, 5, 3, 0, 0), datetime.datetime(2011, 5, 5, 0, 0)]
# So now times_B is missing May 2nd, and has an extra time, May 5th.
>>> df_A['times'] = times_A
>>> df_B['times'] = times_B
>>> df_A['spots'] = [44,55,66]
>>> df_B['spots'] = [44,66,88]
>>> df_A
                times  spots
0 2011-05-01 00:00:00     44
1 2011-05-02 00:00:00     55
2 2011-05-03 00:00:00     66
>>> df_B
                times  spots
0 2011-05-01 00:00:00     44
1 2011-05-03 00:00:00     66
2 2011-05-05 00:00:00     88

# Now it appears you only care about the times in df_A - so
#   left merge df_A with df_B (include all times from df_A and  
#   try to merge with df_B or NaN). Below the date May 5th was dropped.
>>> df_merged = pd.merge(df_A,df_B,how='left',on=['times'],suffixes=['','_B'])
>>> df_merged
                times  spots  spots_B
0 2011-05-01 00:00:00     44       44
1 2011-05-02 00:00:00     55      NaN
2 2011-05-03 00:00:00     66       66

# Here is the important part:
# Since it appears that your data is going to be a time series
#   you will need to set your dataframe index to be the times column.
>>> df_merged = df_merged.set_index(['times'])
>>> df_merged
            spots  spots_B
times
2011-05-01     44       44
2011-05-02     55      NaN
2011-05-03     66       66

# With the times as index we can use the appropriate
#   interpolation method for best results
>>> df_merged['spots_B'] = df_merged['spots_B'].interpolate(method='time')
>>> df_merged
            spots  spots_B
times
2011-05-01     44       44
2011-05-02     55       55
2011-05-03     66       66
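
The left merge above can fill the gap because May 1st and May 3rd occur in both frames. If none of the times in A appear in B, a left merge leaves spots_B entirely NaN and there is nothing to interpolate between. One possible way around that, sketched here under the assumption that the times in B are unique and that the times in A fall inside B's range, is to interpolate B on the union of both time indexes and then keep only A's times. The sample values and the names `s_B`, `all_times` and `spots_B_at_A` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: A's times (May 2nd, May 4th) never occur in B.
df_A = pd.DataFrame({
    "times": pd.to_datetime(["2011-05-02", "2011-05-04"]),
    "spots": [55, 77],
})
df_B = pd.DataFrame({
    "times": pd.to_datetime(["2011-05-01", "2011-05-03", "2011-05-05"]),
    "spots": [44, 66, 88],
})

# B's spots as a time-indexed Series (assumes B's times are unique).
s_B = df_B.set_index("times")["spots"]

# Union of both sets of times, so B gets NaN rows at A's times.
a_times = pd.DatetimeIndex(df_A["times"])
all_times = s_B.index.union(a_times)

# Interpolate along the time axis, then keep only A's times.
spots_B_at_A = (
    s_B.reindex(all_times)
       .interpolate(method="time")
       .reindex(a_times)
)

# Assign by position so A's original integer index is preserved.
df_A["spots_B"] = spots_B_at_A.to_numpy()
print(df_A)
```

With this data the new column should come out at roughly 55 and 77, the midpoints between B's neighbouring values. Times in A that lie before B's first time would stay NaN, since interpolation has no earlier value to start from.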

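Note: the default behaviour of interpolate() on a Series is to assume that all rows are equally spaced. If the time intervals are not equal, you need to reindex the dataframe with a time-series index. Once the index is a time series, you can pass method='time' to interpolate().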
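
As a rough illustration of that difference, the sketch below interpolates the same Series twice, once with the default method and once with method='time', on an index whose gaps are deliberately unequal. The data is invented for the example.

```python
import numpy as np
import pandas as pd

# Unequal gaps: 1 day between the first two points, 3 days to the last.
idx = pd.to_datetime(["2011-05-01", "2011-05-02", "2011-05-05"])
s = pd.Series([0.0, np.nan, 40.0], index=idx)

# Default (method='linear') ignores the index and treats rows as equally
# spaced, so the missing value lands halfway between its neighbours.
by_position = s.interpolate()

# method='time' weights by elapsed time: May 2nd is 1/4 of the way from
# May 1st to May 5th.
by_time = s.interpolate(method="time")

print(by_position.iloc[1], by_time.iloc[1])
```

Here the default method gives about 20 for May 2nd, while method='time' gives about 10, which is why the index matters when the sampling is irregular.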

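Note 2: please try to give as much detail as possible when asking a question. It helps the people who want to answer to fully understand your problem.

Comments:

Do you really mean interpolation, or do you just want to pull in the values from B that are missing in A? Can you give an example of inputs A and B and the desired output?

Actually, suppose df_A has times 1,2,3 and df_B has times 1,3,4. I basically want to have another column in A called spots_B. In your example the times are exactly the same, so no interpolation is needed. In my case I have to get the values that correspond to df_A's time points.

What is the relationship between spots A and spots B? Are the spots in A and B the same at the same time? Will there be no duplicate times in df_A and df_B? Also, what is the data type of your times: an actual datetime value, or a generic value like the ones I used above? Do you only care about spots A, trying to infer the values that correspond to new values found in B, and simply using the spots A value wherever df_A and df_B share a matching time key? You could probably edit your question to answer most of these. An input/output example would help in understanding it.

@coffeequant: in short, I am trying to understand what is being interpolated, if anything. It may be as simple as adding the time keys from df_B that are not found in df_A, and then filling the spots_A values of those new rows with the values from spots_B. I just want to make sure I am not making any wrong assumptions about your dataset.

Thanks a lot :) This is what I was looking for.

Great, glad I could help. If you could accept my answer as correct, it would be much appreciated.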