Python: how to select a subset of a multi-index DataFrame based on conditions from another DataFrame
I have a DataFrame that looks like this:
                     dates         0
numbers letters
0       a       2013-01-01  0.261092
                2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-01 -1.515866
                2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-01 -0.253103
                2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-01 -0.108325
                2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-01 -1.922214
                2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-01 -0.419775
                2013-01-02  1.511700
                2013-01-03  0.994332
2       a       2013-01-01 -0.020299
                2013-01-02 -0.749474
                2013-01-03 -1.478558
        b       2013-01-01 -1.357671
                2013-01-02  0.161185
                2013-01-03 -0.658246
        c       2013-01-01 -0.564796
                2013-01-02 -0.333106
                2013-01-03 -2.814611
Now I have a list like this:
numbers letters
0 0 b
1 1 c
I need to select the rows whose index matches the entries in the list. The expected result is:
                     dates         0
numbers letters
0       b       2013-01-01 -1.515866
                2013-01-02  0.351942
                2013-01-03 -0.245463
1       c       2013-01-01 -0.419775
                2013-01-02  1.511700
                2013-01-03  0.994332
How do I select specific data from a MultiIndex DataFrame?
Assume you have the following DataFrame l holding the values you want to get:
In [28]: l
Out[28]:
numbers letters
0 0 b
1 1 c
If you need to select all rows where numbers is 0 or 1 and letters is in ['b','c'], you can use the df.query() method as follows:
In [29]: df.query("numbers in @l.numbers and letters in @l.letters")
Out[29]:
dates 0
numbers letters
0 b 2013-01-01 -1.515866
b 2013-01-02 0.351942
b 2013-01-03 -0.245463
c 2013-01-01 -0.253103
c 2013-01-02 -0.385411
c 2013-01-03 -1.740821
1 c 2013-01-01 -0.108325
c 2013-01-02 -0.212350
c 2013-01-03 0.021097
b 2013-01-01 -1.922214
b 2013-01-02 -1.769003
b 2013-01-03 -0.594216
c 2013-01-01 -0.419775
c 2013-01-02 1.511700
c 2013-01-03 0.994332
Or simply:
df.query("numbers in [0,1] and letters in ['b','c']")
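The asker's follow-up comment notes that this returns extra rows. A minimal sketch of why, assuming a small stand-in frame with one row per pair and a hypothetical column named value (neither is from the original post): the two `in` conditions are evaluated independently, so every combination of the listed numbers and letters passes.

```python
import pandas as pd

# Stand-in for the question's frame: one row per (numbers, letters) pair.
idx = pd.MultiIndex.from_product([[0, 1, 2], ["a", "b", "c"]],
                                 names=["numbers", "letters"])
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# The pairs that are actually wanted: (0, 'b') and (1, 'c').
l = pd.DataFrame({"numbers": [0, 1], "letters": ["b", "c"]})

# Each condition is checked on its own, so the result is the cross
# product {0, 1} x {'b', 'c'}, not just the two wanted pairs.
loose = df.query("numbers in @l.numbers and letters in @l.letters")
selected_pairs = sorted(set(loose.index))
print(selected_pairs)
```

Here (0, 'c') and (1, 'b') are selected as well, which is the "extra rows" effect.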
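UPDATE: if the match must be exact, i.e. only the pairs (0, 'b') and (1, 'c'), build one (numbers == ... and letters == '...') clause per row of l, join the clauses with or, and pass the resulting string to df.query(). The timings further down use exactly this query.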
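A minimal sketch of the exact-match query, assuming the same kind of stand-in frame (one row per pair, hypothetical column value). The timing section builds the string with l.apply(...).str.cat(sep=' or '); the list comprehension below is an equivalent way to produce it, not the answerer's exact code.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1, 2], ["a", "b", "c"]],
                                 names=["numbers", "letters"])
df = pd.DataFrame({"value": range(len(idx))}, index=idx)
l = pd.DataFrame({"numbers": [0, 1], "letters": ["b", "c"]})

# One "(numbers == N and letters == 'X')" clause per wanted pair.
clauses = [
    f"(numbers == {n} and letters == {s!r})"
    for n, s in zip(l["numbers"], l["letters"])
]
# Joining with "or" keeps a row if it matches any one complete pair.
q = " or ".join(clauses)

exact = df.query(q)
print(q)
print(exact)
```

Because each clause ties a number to its own letter, combinations such as (0, 'c') no longer match.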
You can also use an index intersection:
In [39]: l
Out[39]:
numbers letters
0 0 b
1 1 c
In [40]: df.loc[df.index.intersection(l.set_index(['numbers','letters']).index)]
Out[40]:
dates 0
numbers letters
0 b 2013-01-01 -1.515866
b 2013-01-02 0.351942
b 2013-01-03 -0.245463
1 c 2013-01-01 -0.108325
c 2013-01-02 -0.212350
c 2013-01-03 0.021097
c 2013-01-01 -0.419775
c 2013-01-02 1.511700
c 2013-01-03 0.994332
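Or, as suggested in the comments, index with the keys directly: df.loc[l.set_index(['numbers','letters']).index]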
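A minimal sketch of the index-based selection, again on a stand-in frame with one row per pair and a hypothetical column value. The pair (5, 'z') is made up for illustration, to show what the intersection step adds: it drops wanted pairs that do not exist in df before the label lookup.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1, 2], ["a", "b", "c"]],
                                 names=["numbers", "letters"])
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# (5, 'z') is a hypothetical pair that does not occur in df.
l = pd.DataFrame({"numbers": [0, 1, 5], "letters": ["b", "c", "z"]})

# Turn the wanted pairs into a MultiIndex with the same level names.
wanted = l.set_index(["numbers", "letters"]).index

# Keep only the wanted pairs that are present in df, then select by label.
present = df.index.intersection(wanted)
subset = df.loc[present]
print(subset)
```

When every wanted pair is known to exist in df, the intersection step can be skipped and wanted passed to df.loc directly, which is what the commenter proposes.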
Timings:
For a 27,000-row multi-index DF:
In [156]: df = pd.concat([df.reset_index()] * 10**3, ignore_index=True).set_index(['numbers','letters'])
In [157]: df.shape
Out[157]: (27000, 2)
In [158]: %%timeit
...: q = l.apply(lambda r: "(numbers == {} and letters == '{}')".format(r.numbers, r.letters),
...: axis=1) \
...: .str.cat(sep=' or ')
...: df.query(q)
...:
10 loops, best of 3: 21.3 ms per loop
In [159]: %%timeit
...: df.loc[l.set_index(['numbers','letters']).index]
...:
10 loops, best of 3: 20.2 ms per loop
In [160]: %%timeit
...: df.loc[df.index.intersection(l.set_index(['numbers','letters']).index)]
...:
10 loops, best of 3: 27.2 ms per loop
For a 270,000-row multi-index DF:
In [163]: %%timeit
...: q = l.apply(lambda r: "(numbers == {} and letters == '{}')".format(r.numbers, r.letters),
...: axis=1) \
...: .str.cat(sep=' or ')
...: df.query(q)
...:
10 loops, best of 3: 117 ms per loop
In [164]: %%timeit
...: df.loc[l.set_index(['numbers','letters']).index]
...:
1 loop, best of 3: 142 ms per loop
In [165]: %%timeit
...: df.loc[df.index.intersection(l.set_index(['numbers','letters']).index)]
...:
10 loops, best of 3: 185 ms per loop
Conclusion: the df.query() method, which uses the numexpr module internally, appears to be faster for larger DataFrames.
Comments:
Thanks for your answer. But if I do it this way, I get extra rows that I don't need. Is there an improvement?
@JHuang, I added another solution, please check.
Thanks again. However, when the data is very large this approach is not fast enough, because I would have to type all the conditions myself.
@JHuang, no, you shouldn't have to type anything. I just forgot to add In [14] first; it's fixed now. By the way, I've added an alternative solution as an additional answer.
Accessing by index is usually the best idea when the DataFrame gets large.
@Javier, I think you are right. Thank you for your comment!
Do you need the intersection step? It seems to me that you traverse the df index twice, once to build the intersection and once to select the slice. Wouldn't this work: df.loc[l.set_index(['numbers','letters']).index]? I had a similar situation in the past, and for me the biggest slowdown was the multi-index. When I switched to a simple index, data access became much faster, so maybe multi-index access isn't optimized?
@Javier, yes, df.loc[l.set_index(['numbers','letters']).index] works faster, but it is still slower than the df.query() method. I'll update the timings.