Python: how to create a filtered DataFrame with the least code

python pandas indexing dataframe

There are four cars: bmw, geo, vw, and porsche:

import pandas as pd
df = pd.DataFrame({
    'car':      ['bmw','geo','vw','porsche'],
    'warranty': ['yes','yes','yes','no'], 
    'dvd':      ['yes','yes','no','yes'], 
    'sunroof':  ['yes','no','no','no']})

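I want to create a filtered DataFrame that lists only the cars that have all three features: a DVD player, a sunroof, and a warranty (we know that here it is the bmw, which has every feature set to yes).

I can do it one column at a time with: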

cars_with_warranty = df['car'][df['warranty']=='yes']
print(cars_with_warranty)

Then I need to do a similar computation for the dvd and sunroof columns:

cars_with_dvd = df['car'][df['dvd']=='yes']
cars_with_sunroof = df['car'][df['sunroof']=='yes']

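I wonder whether there is a clever way to create the filtered DataFrame.

Edited later: the posted solution works well. But the resulting cars_with_all_three is a plain list variable. We need a DataFrame object with the single bmw car as its only row and all three columns in place: dvd, sunroof, and warranty (all three values set to yes).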

You can use a simple for loop with enumerate:

cars_with_all_three = []
for ind, car in enumerate(df['car']):
    if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes':
        cars_with_all_three.append(car)

If you run

print(cars_with_all_three)

you will get

['bmw']

Or, if you want to be really clever and use a one-liner, you can do this:

[car for ind, car in enumerate(df['car']) if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes']
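The follow-up edit in the question asks for a DataFrame rather than a list. A minimal sketch, assuming the car names are unique, that turns the list from the comprehension back into a filtered frame with Series.isin (the name filtered is chosen here only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# The list comprehension from the answer: names of cars with all three features
cars_with_all_three = [
    car for ind, car in enumerate(df['car'])
    if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes']

# Keep the rows whose 'car' value appears in that list
# (assumption: car names are unique, so no unrelated row is matched)
filtered = df[df['car'].isin(cars_with_all_three)]
print(filtered)
```

This keeps every column of the original frame, but it still pays for the Python-level loop, which is the slowest option in the timings further down.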

Hope this helps.

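You can use boolean indexing, combining one comparison per column with &; this is the expression timed as In [45] below.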
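The code for this answer did not survive on this page, but the same expression appears in the timings as In [45]. A minimal sketch around it, where the names mask and result are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes')

# Indexing with the boolean Series keeps only the rows where it is True
result = df[mask]
print(result)
```

The result is a DataFrame with all of the original columns, which is what the edited question asks for.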

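Another solution is to compare the feature columns with yes and then use all to check that every value in a row is True; this is the expression timed as In [46] below.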
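The snippet for this variant is also missing here, though the timings show it as In [46]. A hedged sketch of the idea, with features, all_yes and result as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

features = ['dvd', 'sunroof', 'warranty']

# Compare only the feature columns, then require True across each row
all_yes = (df[features] == 'yes').all(axis=1)

result = df[all_yes]
print(result)
```

Compared with chaining & conditions, this scales to many columns by editing only the features list.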

If the DataFrame has only the columns shown in the example, the solution with the least code is:

print(df[(df.set_index('car') == 'yes').all(axis=1).values])
   car  dvd sunroof warranty
0  bmw  yes     yes      yes
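To make the one-liner easier to follow, here is the same logic split into steps. This is only a sketch; the intermediate names flags and row_ok are not from the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# Moving 'car' into the index leaves only the feature columns to compare
flags = df.set_index('car') == 'yes'

# True only for cars where every remaining column is 'yes'
row_ok = flags.all(axis=1)

# row_ok is indexed by car name while df has a 0..n-1 index, so pass the
# underlying array (.values) to filter by position instead of by label
result = df[row_ok.values]
print(result)
```

The .values step matters: without it pandas would try to align the car-name index of row_ok with the integer index of df.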

Timings:

In [44]: %timeit ([car for ind, car in enumerate(df['car']) if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes'])
10 loops, best of 3: 120 ms per loop

In [45]: %timeit (df[(df.dvd == 'yes')&(df.sunroof == 'yes')&(df.warranty == 'yes')])
The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.09 ms per loop

In [46]: %timeit (df[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
1000 loops, best of 3: 1.53 ms per loop

In [47]: %timeit (df[(df.ix[:, [u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.51 ms per loop

In [48]: %timeit (df[(df.set_index('car') == 'yes').all(1).values])
1000 loops, best of 3: 1.64 ms per loop

In [49]: %timeit (mer(df))
The slowest run took 4.17 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 3.85 ms per loop

Code for timings:

df = pd.DataFrame({
    'car':      ['bmw','geo','vw','porsche'],
    'warranty': ['yes','yes','yes','no'], 
    'dvd':      ['yes','yes','no','yes'], 
    'sunroof':  ['yes','no','no','no']})

print (df)
df = pd.concat([df]*1000).reset_index(drop=True)

def mer(df):
    df = df.set_index('car')
    return df[df[[ u'dvd', u'sunroof', u'warranty']] == "yes"].dropna().reset_index()
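The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module; the helper names with_and and with_all are made up for this example, and the numbers will differ by machine and pandas version:

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# Same enlargement as in the timing setup: 4 rows repeated 1000 times
big = pd.concat([df] * 1000).reset_index(drop=True)


def with_and(frame):
    """Filter with one comparison per column joined by &."""
    return frame[(frame.dvd == 'yes') & (frame.sunroof == 'yes') & (frame.warranty == 'yes')]


def with_all(frame):
    """Filter by comparing the feature columns and reducing with all(axis=1)."""
    return frame[(frame[['dvd', 'sunroof', 'warranty']] == 'yes').all(axis=1)]


# Check that both approaches select the same rows before comparing speed
assert with_and(big).equals(with_all(big))

for func in (with_and, with_all):
    # Best of 3 repeats, 10 calls per repeat, reported per call
    seconds = min(timeit.repeat(lambda: func(big), number=10, repeat=3)) / 10
    print(f'{func.__name__}: {seconds * 1e3:.2f} ms per call')
```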

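Try this:

# Move 'car' into the index so only the feature columns are compared;
# cells that are not "yes" become NaN, and dropna() removes those rows.
indexed = df.set_index('car')
indexed[indexed[['dvd', 'sunroof', 'warranty']] == "yes"].dropna().reset_index()

   car  dvd sunroof warranty
0  bmw  yes     yes      yes

To get only the names of the matching cars, take the index values instead:

indexed[indexed[['dvd', 'sunroof', 'warranty']] == "yes"].dropna().index.values

['bmw']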
