Python: conditional merge from several other DataFrames


I have four DataFrames (A, B, C, and D). A has a series of timestamps and a column that refers to one of the other DataFrames:

A

Timestamp    Source
-----------  ------
2012-4-3     B
2013-12-20   C
2012-3-5     C
2014-12-7    D
2012-7-10    B
...

The other DataFrames contain more data:

B

Timestamp   Foo  Bar
----------- ---- ----
2012-1-1    1.5  1.3
2012-1-2    2.3  5.6
2012-1-3    3.4  3.3
...
2014-3-31   0.8  2.1

C

Timestamp   Foo  Bar
----------- ---- ----
2012-1-1    9.2  5.6
2012-1-2    4.8  7.6
2012-1-3    2.7  6.4
...
2014-3-31   7.0  6.5

D

Timestamp   Foo  Bar
----------- ---- ----
2012-1-1    6.8  4.2
2012-1-2    4.2  9.3
2012-1-3    5.5  0.7
...
2014-3-31   6.3  2.0
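
I want to build a DataFrame from A, B, C, and D that has three columns (Timestamp, Foo, and Bar), where the values of Foo and Bar come from the row with the matching Timestamp in the DataFrame that A lists as the Source.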

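Not all timestamps in A appear in the other three DataFrames; in that case I want the values of Foo and Bar to be np.nan. Not all timestamps in B, C, and D appear in A, and those should not appear in the final DataFrame at all.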

My current approach is to loop over every row in A and pull the values from the corresponding DataFrame:

srcs = {'B': B, 'C': C, 'D': D}
A['Foo'] = np.nan
A['Bar'] = np.nan

for i in range(len(A)):
    ts = A.iloc[i].Timestamp
    src = A.iloc[i].Source
    A.iloc[i].Foo = srcs[src][srcs[src].Timestamp == ts].Foo
    A.iloc[i].Bar = srcs[src][srcs[src].Timestamp == ts].Bar

There must be a more efficient, more idiomatic pandas way to do this?

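It looks like you can do this with a MultiIndex. Your index would consist of the timestamp and the source. You can use the set_index method on a DataFrame to achieve this.

Here is some code that creates a few fake DataFrames, each with a MultiIndex: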

# Imports for creating fake data
from random import random
from random import choice

import numpy as np
import pandas as pd

# Setup the sample data
others = ['B', 'C', 'D']  # names of the source data sets (undefined in the original snippet)
A = pd.DataFrame({'TimeStamp':range(20), 'Source':[choice(others) for i in range(20)]})
# Create the MultiIndex on A
A.set_index(['TimeStamp', 'Source'], inplace=True)
A['Bar'] = [np.nan] * len(A)
A['Foo'] = [np.nan] * len(A)

B = pd.DataFrame({'TimeStamp':range(5), 
                  'Foo':[random()*5+5 for i in range(5)], 
                  'Bar':[random()*5+5 for i in range(5)]})
C = pd.DataFrame({'TimeStamp':range(5,10), 
                  'Foo':[random()*5+5 for i in range(5)], 
                  'Bar':[random()*5+5 for i in range(5)]})
D = pd.DataFrame({'TimeStamp':range(10,15), 
                  'Foo':[random()*5+5 for i in range(5)], 
                  'Bar':[random()*5+5 for i in range(5)]})

sources = {'B':B, 'C':C, 'D':D}

# create the MultiIndex on the Source data sets
for s, df in sources.items():
    df['Source'] = [s]*len(df)
    df.set_index(['TimeStamp', 'Source'], inplace=True)

Now you can use the index on A to index into the source data sets (B, C, and D):

for s, df in sources.items():

    temp = df.reindex(A.index)  # the source data set indexed by A's index
                                # (reindex is used because .loc can raise a
                                # KeyError for missing labels in recent pandas);
                                # this will contain NaN's where df does not
                                # have corresponding index entries
    temp = temp.dropna()        # dropping the NaN values leaves you with
                                # only the values in df matching the index in A
    if len(temp) > 0:
        A.loc[temp.index] = temp  # now assign the data to A

print(A)

The result looks like this:

                       Bar       Foo
TimeStamp Source                    
0         D            NaN       NaN
1         C            NaN       NaN
2         D            NaN       NaN
3         B       7.927154  8.581380
4         B       7.638422  5.970348
5         D            NaN       NaN
6         C       6.938001  6.417248
7         B            NaN       NaN
8         C       5.131940  9.144621
9         B            NaN       NaN
10        D       9.186963  5.991877
11        D       8.070543  7.735040
12        C            NaN       NaN
13        B            NaN       NaN
14        C            NaN       NaN
15        D            NaN       NaN
16        C            NaN       NaN
17        C            NaN       NaN
18        C            NaN       NaN
19        B            NaN       NaN
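
The per-source loop above can probably be collapsed into a single lookup. The following is only a sketch under assumptions: the small input frames are made up for illustration (integer TimeStamp values stand in for real timestamps), and the name `stacked` is not from the answer. It stacks the sources under a (TimeStamp, Source) index and reindexes by the pairs listed in A, so missing pairs come back as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical small inputs; integers stand in for real timestamps.
A = pd.DataFrame({'TimeStamp': [0, 1, 2, 3],
                  'Source': ['B', 'C', 'B', 'D']})
B = pd.DataFrame({'TimeStamp': [0, 1], 'Foo': [1.5, 2.3], 'Bar': [1.3, 5.6]})
C = pd.DataFrame({'TimeStamp': [1, 2], 'Foo': [9.2, 4.8], 'Bar': [5.6, 7.6]})
D = pd.DataFrame({'TimeStamp': [3, 4], 'Foo': [6.8, 4.2], 'Bar': [4.2, 9.3]})
sources = {'B': B, 'C': C, 'D': D}

# Tag each source with its name, stack them, and index by (TimeStamp, Source).
stacked = pd.concat(
    [df.assign(Source=name) for name, df in sources.items()]
).set_index(['TimeStamp', 'Source'])

# Look up every (TimeStamp, Source) pair from A; pairs that do not exist
# in the stacked sources become rows of NaN.
wanted = pd.MultiIndex.from_frame(A[['TimeStamp', 'Source']])
result = stacked.reindex(wanted)
print(result)
```

This relies on each (TimeStamp, Source) pair being unique in the stacked sources; reindex raises an error on a duplicated index.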

After setting up the sample DataFrames, I combined just B, C, and D with pd.concat:

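# Stack B, C and D; keys= adds an outer index level holding the source name
bdf = pd.concat([B, C, D], keys=['B', 'C', 'D'])
# Drop the inner level (the original row numbers), keeping only the source name
bdf.reset_index(level=1, inplace=True, drop=True)
bdf.index.name = 'Source'
# Turn the source name into an ordinary column
bdf.reset_index(inplace=True)

print(bdf)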

This looks like:

   Source  Timestamp  Foo  Bar
0       B 2012-01-01  1.5  1.3
1       B 2012-04-03  3.1  4.1
2       B 2012-01-02  2.3  5.6
3       B 2012-01-03  3.4  3.3
4       B 2014-03-31  0.8  2.1
5       C 2012-01-01  9.2  5.6
6       C 2012-03-05  4.8  7.6
7       C 2012-01-02  4.8  7.6
8       C 2012-01-03  2.7  6.4
9       C 2014-03-31  7.0  6.5
10      D 2012-01-01  6.8  4.2
11      D 2012-01-02  4.2  9.3
12      D 2012-01-03  5.5  0.7
13      D 2014-03-31  6.3  2.0

Finally, a simple merge:

A.merge(bdf, how='left')

which gives:

   Timestamp Source  Foo  Bar
0 2012-04-03      B  3.1  4.1
1 2012-04-02      B  NaN  NaN
2 2013-12-20      C  NaN  NaN
3 2012-03-05      C  4.8  7.6
4 2014-12-07      D  NaN  NaN
5 2012-07-10      B  NaN  NaN
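
Since the setup code is not shown, here is a self-contained sketch of the same concat-then-merge idea. The sample rows are assumptions loosely based on the tables in the question, not the answer's actual setup, and the merge keys are spelled out with on= rather than left implicit.

```python
import pandas as pd

# Assumed sample rows, loosely based on the tables in the question.
A = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2012-04-03', '2013-12-20', '2012-03-05']),
    'Source': ['B', 'C', 'C'],
})
B = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2012-01-01', '2012-04-03']),
    'Foo': [1.5, 3.1], 'Bar': [1.3, 4.1],
})
C = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2012-01-01', '2012-03-05']),
    'Foo': [9.2, 4.8], 'Bar': [5.6, 7.6],
})
D = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2012-01-01']),
    'Foo': [6.8], 'Bar': [4.2],
})

# keys= labels each block with its source name; names= names the two index
# levels so the outer one can be moved into a column by name.
bdf = (pd.concat([B, C, D], keys=['B', 'C', 'D'], names=['Source', 'row'])
         .reset_index(level='Source')
         .reset_index(drop=True))

# Left merge keeps every row of A; unmatched rows get NaN for Foo and Bar.
merged = A.merge(bdf, on=['Timestamp', 'Source'], how='left')
print(merged)
```

Naming the keys explicitly guards against the merge silently picking up any other column the frames happen to share.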

Hmm, one approach would be to add a Source column to each df, set to B, C, and D respectively, and then merge them on Timestamp and Source. Not sure how messy that would be.

Wouldn't that result in a df with 6 separate columns (e.g. "Foo_x", "Bar_x", "Foo_y", "Bar_y", "Foo", "Bar")? How would I combine them into two columns ("Foo" and "Bar") based on the source?
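
To illustrate the concern raised in the comments, the sketch below deliberately builds the wide frame (one Foo/Bar pair per source) and then collapses it back to two columns by picking, for each row, the pair that matches Source. The data and the explicit `_B` / `_C` suffixes are assumptions chosen for readability; pandas' default suffixes would be `_x` and `_y`.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs for illustration.
A = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2012-01-01', '2012-01-02', '2012-01-05']),
    'Source': ['B', 'C', 'B'],
})
B = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2012-01-01', '2012-01-02']),
    'Foo': [1.5, 2.3], 'Bar': [1.3, 5.6],
})
C = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2012-01-01', '2012-01-02']),
    'Foo': [9.2, 4.8], 'Bar': [5.6, 7.6],
})
sources = {'B': B, 'C': C}

# Wide merge on Timestamp only: yields Foo_B, Bar_B, Foo_C, Bar_C.
wide = A.copy()
for name, df in sources.items():
    renamed = df.rename(columns={'Foo': 'Foo_' + name, 'Bar': 'Bar_' + name})
    wide = wide.merge(renamed, on='Timestamp', how='left')

# Collapse: for each row, copy the pair whose suffix matches its Source.
for col in ['Foo', 'Bar']:
    wide[col] = np.nan
    for name in sources:
        mask = wide['Source'] == name
        wide.loc[mask, col] = wide.loc[mask, col + '_' + name]

result = wide[['Timestamp', 'Source', 'Foo', 'Bar']]
print(result)
```

Stacking the sources vertically and merging on both Timestamp and Source avoids the extra columns altogether, which is why that route tends to be simpler.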