Python: conditional merge from multiple other DataFrames
I have four DataFrames (A, B, C and D). A has a series of timestamps and a column that references one of the other DataFrames:
A
Timestamp Source
----------- ------
2012-4-3 B
2013-12-20 C
2012-3-5 C
2014-12-7 D
2012-7-10 B
...
The other DataFrames contain more data:
B
Timestamp Foo Bar
----------- ---- ----
2012-1-1 1.5 1.3
2012-1-2 2.3 5.6
2012-1-3 3.4 3.3
...
2014-3-31 0.8 2.1
C
Timestamp Foo Bar
----------- ---- ----
2012-1-1 9.2 5.6
2012-1-2 4.8 7.6
2012-1-3 2.7 6.4
...
2014-3-31 7.0 6.5
D
Timestamp Foo Bar
----------- ---- ----
2012-1-1 6.8 4.2
2012-1-2 4.2 9.3
2012-1-3 5.5 0.7
...
2014-3-31 6.3 2.0
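I want to build a DataFrame from A, B, C and D that has three columns (Timestamp, Foo and Bar), where the values of Foo and Bar come from the row with the matching Timestamp in the DataFrame that A lists as Source.
Not all timestamps in A appear in the other three DataFrames; in that case I want the values of Foo and Bar to be np.nan. Not all timestamps in B, C and D appear in A, and those should not appear in the final DataFrame at all.
My current approach is to loop over every row in A and pull the values from the corresponding Source DataFrame: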
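import numpy as np

srcs = {'B': B, 'C': C, 'D': D}
A['Foo'] = np.nan
A['Bar'] = np.nan
for i in range(len(A)):
    ts = A['Timestamp'].iloc[i]
    src = srcs[A['Source'].iloc[i]]
    match = src[src['Timestamp'] == ts]
    if not match.empty:  # leave NaN when the timestamp is missing from the source
        # write through A.at; assigning to A.iloc[i].Foo only changes a temporary copy
        A.at[A.index[i], 'Foo'] = match['Foo'].iloc[0]
        A.at[A.index[i], 'Bar'] = match['Bar'].iloc[0]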
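There must be a more efficient, more pandas-esque way to do this?

It looks like you can do this with a MultiIndex. Your index would consist of the timestamp and the source. You can use the set_index method on the DataFrames to achieve this.
Here is some code that creates some fake DataFrames, each with a MultiIndex: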
# Imports for creating fake data
from random import random
from random import choice

import numpy as np
import pandas as pd

# Setup the sample data
others = ['B', 'C', 'D']  # assumed: the names of the source data sets (not defined in the original snippet)
A = pd.DataFrame({'TimeStamp': range(20), 'Source': [choice(others) for i in range(20)]})
# Create the MultiIndex on A
A.set_index(['TimeStamp', 'Source'], inplace=True)
A['Bar'] = [np.nan] * len(A)
A['Foo'] = [np.nan] * len(A)
B = pd.DataFrame({'TimeStamp': range(5),
                  'Foo': [random()*5+5 for i in range(5)],
                  'Bar': [random()*5+5 for i in range(5)]})
C = pd.DataFrame({'TimeStamp': range(5, 10),
                  'Foo': [random()*5+5 for i in range(5)],
                  'Bar': [random()*5+5 for i in range(5)]})
D = pd.DataFrame({'TimeStamp': range(10, 15),
                  'Foo': [random()*5+5 for i in range(5)],
                  'Bar': [random()*5+5 for i in range(5)]})
sources = {'B': B, 'C': C, 'D': D}
# create the MultiIndex on the Source data sets
for s, df in sources.items():
    df['Source'] = [s]*len(df)
    df.set_index(['TimeStamp', 'Source'], inplace=True)
Now you can index the source data sets (B, C and D) with the index of A:
for s, df in sources.items():
    # the source data set aligned to A's index; this contains NaN where df
    # has no corresponding index entry (reindex is used here because .loc
    # raises a KeyError for missing labels in recent pandas versions)
    temp = df.reindex(A.index)
    # dropping the NaN rows leaves only the rows of df matching the index in A
    temp = temp.dropna()
    if len(temp) > 0:
        A.update(temp)  # now assign the data to A, aligned on index and column labels
print(A)
The result looks like this:
Bar Foo
TimeStamp Source
0 D NaN NaN
1 C NaN NaN
2 D NaN NaN
3 B 7.927154 8.581380
4 B 7.638422 5.970348
5 D NaN NaN
6 C 6.938001 6.417248
7 B NaN NaN
8 C 5.131940 9.144621
9 B NaN NaN
10 D 9.186963 5.991877
11 D 8.070543 7.735040
12 C NaN NaN
13 B NaN NaN
14 C NaN NaN
15 D NaN NaN
16 C NaN NaN
17 C NaN NaN
18 C NaN NaN
19 B NaN NaN
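The same MultiIndex idea can probably be written without the loop. The following is only a minimal sketch, not the answerer's code: it stacks the sources into one frame keyed by (TimeStamp, Source) and reindexes it with A's keys. The tiny sample frames and the names `stacked`, `keys` and `result` are made up for illustration, and it assumes each (TimeStamp, Source) pair occurs at most once in the sources.

```python
import pandas as pd

# Hypothetical sample data, much smaller than the frames above
A = pd.DataFrame({"TimeStamp": [0, 1, 5, 6, 12],
                  "Source": ["B", "C", "C", "D", "D"]})
B = pd.DataFrame({"TimeStamp": [0, 1], "Foo": [1.0, 2.0], "Bar": [10.0, 20.0]})
C = pd.DataFrame({"TimeStamp": [5, 6], "Foo": [3.0, 4.0], "Bar": [30.0, 40.0]})
D = pd.DataFrame({"TimeStamp": [12, 13], "Foo": [5.0, 6.0], "Bar": [50.0, 60.0]})
sources = {"B": B, "C": C, "D": D}

# Tag every source row with its name and index the stack by (TimeStamp, Source)
stacked = pd.concat(
    [df.assign(Source=name) for name, df in sources.items()]
).set_index(["TimeStamp", "Source"])

# Look up A's (TimeStamp, Source) pairs; pairs missing from the stack become NaN rows
keys = A.set_index(["TimeStamp", "Source"]).index
result = stacked.reindex(keys).reset_index()
print(result)
```

Rows of B, C and D whose keys never occur in A are simply not selected, which matches the requirement in the question.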
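After setting up the DataFrames, I combined just B, C and D with pd.concat:
import pandas as pd

# keys= labels each block with the name of its source DataFrame
bdf = pd.concat([B, C, D], keys=['B', 'C', 'D'])
# drop the inner level (the old row numbers), keeping only the source label
bdf.reset_index(level=1, inplace=True, drop=True)
bdf.index.name = 'Source'
# turn the source label into an ordinary column
bdf.reset_index(inplace=True)
print(bdf)
which looks like this: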
Source Timestamp Foo Bar
0 B 2012-01-01 1.5 1.3
1 B 2012-04-03 3.1 4.1
2 B 2012-01-02 2.3 5.6
3 B 2012-01-03 3.4 3.3
4 B 2014-03-31 0.8 2.1
5 C 2012-01-01 9.2 5.6
6 C 2012-03-05 4.8 7.6
7 C 2012-01-02 4.8 7.6
8 C 2012-01-03 2.7 6.4
9 C 2014-03-31 7.0 6.5
10 D 2012-01-01 6.8 4.2
11 D 2012-01-02 4.2 9.3
12 D 2012-01-03 5.5 0.7
13 D 2014-03-31 6.3 2.0
Finally, a simple merge:
A.merge(bdf, how='left')
which looks like:
Timestamp Source Foo Bar
0 2012-04-03 B 3.1 4.1
1 2012-04-02 B NaN NaN
2 2013-12-20 C NaN NaN
3 2012-03-05 C 4.8 7.6
4 2014-12-07 D NaN NaN
5 2012-07-10 B NaN NaN
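For anyone who wants to run the concat-then-merge idea end to end, here is a self-contained sketch. The sample rows are a small made-up subset in the spirit of the tables above, and the explicit `on=` and `validate=` arguments are my additions rather than part of the answer; `validate="many_to_one"` assumes each (Timestamp, Source) pair is unique in the combined sources and raises an error otherwise.

```python
import pandas as pd

# Hypothetical sample data (a small subset, not the full tables)
A = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2012-04-03", "2013-12-20", "2012-03-05"]),
    "Source": ["B", "C", "C"],
})
B = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2012-01-01", "2012-04-03"]),
    "Foo": [1.5, 3.1], "Bar": [1.3, 4.1],
})
C = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2012-01-01", "2012-03-05"]),
    "Foo": [9.2, 4.8], "Bar": [5.6, 7.6],
})
D = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2012-01-01"]),
    "Foo": [6.8], "Bar": [4.2],
})

# Stack the sources; the dict keys become an index level named 'Source'
bdf = (
    pd.concat({"B": B, "C": C, "D": D}, names=["Source"])
    .reset_index(level="Source")   # move the source label into a column
    .reset_index(drop=True)        # discard the old per-frame row numbers
)

# A left merge keeps every row of A; unmatched rows get NaN for Foo and Bar
merged = A.merge(bdf, on=["Timestamp", "Source"], how="left",
                 validate="many_to_one")
print(merged)
```

Spelling out `on=` makes the join keys visible instead of relying on whichever column names the two frames happen to share.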
Hmm, one way would be to add a Source column to each df, set to B, C and D respectively, and then merge them on Timestamp and Source; not sure how messy that would be.
Wouldn't this result in a df with six separate columns (e.g. "Foo_x", "Bar_x", "Foo_y", "Bar_y", "Foo", "Bar")? How would you combine them into two columns ("Foo" and "Bar") based on Source?
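On the suffixed-columns worry: those columns appear when the sources are merged into A one after another on Timestamp alone. One way to sidestep that, sketched below with made-up sample data, is to merge each source only against the rows of A that name it, so Foo and Bar never collide. The names `parts` and `result` are illustrative, and the sketch assumes every Source value in A is a key of the `sources` dict.

```python
import pandas as pd

# Hypothetical sample data
A = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2012-04-03", "2013-12-20",
                                 "2012-03-05", "2012-01-01"]),
    "Source": ["B", "C", "C", "D"],
})
sources = {
    "B": pd.DataFrame({"Timestamp": pd.to_datetime(["2012-04-03"]),
                       "Foo": [3.1], "Bar": [4.1]}),
    "C": pd.DataFrame({"Timestamp": pd.to_datetime(["2012-01-01", "2012-03-05"]),
                       "Foo": [9.2, 4.8], "Bar": [5.6, 7.6]}),
    "D": pd.DataFrame({"Timestamp": pd.to_datetime(["2012-01-01"]),
                       "Foo": [6.8], "Bar": [4.2]}),
}

parts = []
for name, rows in A.groupby("Source"):
    # Merge only the rows of A that point at this source, keeping A's row
    # labels in a temporary 'index' column so the order can be restored
    part = (rows.reset_index()
                .merge(sources[name], on="Timestamp", how="left")
                .set_index("index"))
    parts.append(part)

# Reassemble in A's original row order
result = pd.concat(parts).sort_index()
result.index.name = None
print(result)
```

This yields a single Foo and a single Bar column, at the cost of one merge per distinct source.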