Python: excluding columns from where()


python python-2.7 pandas


I have the following DataFrame:

import pandas as pd
import numpy as np    

pd_df = pd.DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
              'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', np.nan, 'banana', 'banana', 'banana'],
              'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

I want to apply where() only to the two columns Qu1 and Qu2 and keep the rest unchanged, so I created pd1:

pd1 = pd_df.where(pd_df.apply(lambda x: x.map(x.value_counts()))>=2,
                              "other")[['Qu1', 'Qu2']]

pd1['Qu3'] = pd_df['Qu3']
pd_df = []

The last two lines add the remaining column, Qu3, to pd1 and then discard the original pd_df.

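My question is: I originally wanted to apply where() to only part of the df and leave the rest of the columns unchanged. Is the code above dangerous for large data sets? Could I corrupt the original data this way? If so, what is the best way to do this?

Many thanks.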

You can simply take an explicit copy of the original df and then overwrite the selection on that copy:

In [40]:
pd1 = pd_df.copy()
pd1[['Qu1', 'Qu2']] = pd1[['Qu1', 'Qu2']].where(pd_df.apply(lambda x: x.map(x.value_counts()))>=2,
                              "other")
pd1

Out[40]:
      Qu1     Qu2      Qu3
0   other   other    apple
1  potato  banana   potato
2  cheese   apple  sausage
3  banana   apple   cheese
4  cheese   apple   cheese
5  banana   other   potato
6  cheese  banana   cheese
7  potato  banana   potato
8   other  banana      egg

The difference here is that we operate on only part of the df, rather than operating on the whole df and then selecting the columns of interest.
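To check the concern about corrupting the original data, here is a minimal sketch of the copy-based approach. It assumes a recent pandas on Python 3; the names `original`, `cols` and `mask` are introduced here purely for illustration. The original frame is compared against a snapshot taken before the operation.

```python
import numpy as np
import pandas as pd

pd_df = pd.DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
                      'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', np.nan, 'banana', 'banana', 'banana'],
                      'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

# Snapshot used only for the comparison at the end (illustrative, not needed in real code).
original = pd_df.copy()

cols = ['Qu1', 'Qu2']

# Work on an explicit copy so the source frame is never written to.
pd1 = pd_df.copy()

# True where a value occurs at least twice in its own column; NaN counts as False.
mask = pd_df[cols].apply(lambda s: s.map(s.value_counts())) >= 2
pd1[cols] = pd1[cols].where(mask, "other")

print(pd_df.equals(original))            # is the source frame still identical to the snapshot?
print(pd1['Qu3'].equals(pd_df['Qu3']))   # is the untouched column carried over as-is?
```

If both checks print True, the operation only affected the copy, and Qu3 was carried over unchanged.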

Update

If you just want to overwrite those columns, then simply select those columns:

In [48]:
pd_df[['Qu1', 'Qu2']] = pd_df[['Qu1', 'Qu2']].where(pd_df.apply(lambda x: x.map(x.value_counts()))>=2,
                              "other")
pd_df

Out[48]:
      Qu1     Qu2      Qu3
0   other   other    apple
1  potato  banana   potato
2  cheese   apple  sausage
3  banana   apple   cheese
4  cheese   apple   cheese
5  banana   other   potato
6  cheese  banana   cheese
7  potato  banana   potato
8   other  banana      egg
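For a large frame it may also help to build the frequency mask from the selected columns only, instead of applying value_counts() to every column. The sketch below wraps that in a helper; `replace_rare`, `min_count` and `fill` are hypothetical names chosen for this example, not part of pandas.

```python
import numpy as np
import pandas as pd


def replace_rare(df, cols, min_count=2, fill="other"):
    """Overwrite, in place, values in `cols` that occur fewer than `min_count` times.

    Hypothetical helper: only the selected columns are counted and rewritten,
    every other column of `df` is left untouched.
    """
    subset = df[cols]
    # Per-column frequency of each value; NaN maps to NaN and so fails the comparison.
    counts = subset.apply(lambda s: s.map(s.value_counts()))
    df[cols] = subset.where(counts >= min_count, fill)
    return df


pd_df = pd.DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
                      'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', np.nan, 'banana', 'banana', 'banana'],
                      'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

replace_rare(pd_df, ['Qu1', 'Qu2'])
print(pd_df)
```

The temporary objects created here (`subset` and `counts`) cover only the two selected columns, so their size grows with those columns and not with the whole frame.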

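Thanks. My dataset is about 30G; will copy create another 30G dataset in memory?

Your question showed you creating a copy of the 2 columns. If you just want to overwrite those columns in the original data, simply drop the copy line and run the second line on the original df.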
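As a rough way to reason about the memory question, the sketch below compares the estimated footprint of the whole frame with that of the two selected columns, using DataFrame.memory_usage(deep=True). The figures are estimates only: the real cost of copy() depends on the dtypes involved and on the pandas version, and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

pd_df = pd.DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
                      'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', np.nan, 'banana', 'banana', 'banana'],
                      'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

cols = ['Qu1', 'Qu2']

# Estimated size of the whole frame, i.e. the data a full copy() would have to cover.
full_bytes = pd_df.memory_usage(deep=True).sum()

# Estimated size of only the columns that where() actually needs to touch.
subset_bytes = pd_df[cols].memory_usage(deep=True).sum()

print(full_bytes, subset_bytes)
```

On a real 30G dataset the same comparison gives a sense of how much temporary data the column-only approach involves compared with copying everything.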