How to delete rows from a pandas DataFrame based on a conditional expression

python pandas


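I have a pandas DataFrame and I want to delete the rows in which the string in a particular column is longer than 2 characters.

I expected to be able to do this (per):

What am I doing wrong?

(Note: I know I can use df.dropna() to get rid of rows that contain any NaN, but I don't know how to remove rows based on a conditional expression.)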

When you do len(df['column name']) you get just one number, namely the number of rows in the DataFrame (i.e. the length of the column itself). If you want to apply len to each element in the column, use df['column name'].map(len). So try

df[df['column name'].map(len) < 2]
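To make the difference concrete, here is a minimal sketch. The four sample strings are made up for illustration; only the column label 'column name' comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column label comes from the question.
df = pd.DataFrame({"column name": ["a", "bb", "ccc", "dddd"]})

# len() on the column counts rows; it never looks at the strings.
n_rows = len(df["column name"])

# map(len) applies len to every element and returns a Series of lengths.
lengths = df["column name"].map(len)

# Keep only rows whose string is shorter than 2 characters.
short = df[lengths < 2]

# To remove only strings longer than 2, as the question words it, keep <= 2.
at_most_two = df[lengths <= 2]

print(n_rows)
print(lengths.tolist())
print(at_most_two)
```

Note that < 2 also removes strings of exactly two characters, so pick the comparison that matches what you actually want to keep.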

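To directly answer this question's original title, "How to delete rows from a pandas DataFrame based on a conditional expression" (which I understand is not necessarily the OP's problem, but may help other users who land on this question), one way is to use the drop method, passing it the index of the rows that match the condition.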
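A minimal sketch of the drop approach, assuming a made-up DataFrame with a score column (the column name is borrowed from the other examples in this thread; the data is hypothetical):

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [30, 60, 45, 90]})

# Boolean mask marking the rows to delete.
to_delete = df["score"] < 50

# drop() takes index labels, so pass the index of the matching rows.
kept = df.drop(df[to_delete].index)

print(kept)
```

The original index labels of the surviving rows are preserved; call reset_index(drop=True) afterwards if you want them renumbered.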

You can assign the DataFrame to a filtered version of itself:

df = df[df.score > 50]

This is faster than drop:

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test[test.x < 0]
# 54.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test.drop(test[test.x > 0].index, inplace=True)
# 201 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test.drop(test[test.x > 0].index)
# 194 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
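The cells above need IPython. Outside a notebook, a rough equivalent can be sketched with the standard timeit module. The helper names by_mask and by_drop are mine, and the frame is built once up front so that only the filtering step is timed; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Build the test frame once so that only the filtering step is timed.
base = pd.DataFrame({"x": np.random.randn(100_000)})

def by_mask():
    # Keep the rows that satisfy the condition.
    return base[base.x < 0]

def by_drop():
    # Look up the labels of the unwanted rows, then drop them.
    return base.drop(base[base.x >= 0].index)

# Both approaches should select exactly the same rows.
same_rows = by_mask().equals(by_drop())

t_mask = timeit.timeit(by_mask, number=20)
t_drop = timeit.timeit(by_drop, number=20)
print(f"same rows: {same_rows}")
print(f"mask: {t_mask:.4f} s, drop: {t_drop:.4f} s for 20 runs each")
```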


In pandas you can apply str.len to the column, compare the result with your bound, and use the resulting Boolean Series to filter:

df[df['column name'].str.len().lt(2)]
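One practical difference from map(len) shows up with missing values. The sketch below uses made-up data containing a NaN: str.len() returns NaN for that row and the comparison treats it as False, whereas calling len on a float NaN would normally raise a TypeError.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one empty string.
df = pd.DataFrame({"column name": ["a", "bb", np.nan, "", "ccc"]})

# str.len() yields NaN for the missing entry instead of raising.
lengths = df["column name"].str.len()

# lt(2) is the method form of "< 2"; comparisons against NaN are False.
mask = lengths.lt(2)

result = df[mask]
print(result)
```

So rows with missing strings are silently excluded here; if you want to keep them, combine the mask with df['column name'].isna().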

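If you want to drop DataFrame rows based on some complicated condition on a column's values, writing it in the way shown above can get complicated. I have the following simpler solution, which always works. Suppose the condition is on the column with the header 'name'; first get that column as a list:

text_data = df['name'].tolist()

Now apply some function to every element of the list and put the result in a pandas Series:

text_length = pd.Series([func(t) for t in text_data])

In my case I was just trying to get the number of tokens:

text_length = pd.Series([len(t.split()) for t in text_data])

Now add the above Series to the DataFrame as an extra column:

df = df.assign(text_length=text_length.values)

Now we can apply a condition on the new column, for example:

df = df[df.text_length > 10]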

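def pass_filter(df, label, length, pass_type):
    # Count the whitespace-separated tokens in each row of the chosen column.
    text_data = df[label].tolist()
    text_length = pd.Series([len(t.split()) for t in text_data])
    df = df.assign(text_length=text_length.values)
    # Keep the rows above or below the threshold, depending on pass_type.
    if pass_type == 'high':
        df = df[df.text_length > length]
    if pass_type == 'low':
        # Assumed completion: mirrors the 'high' branch.
        df = df[df.text_length < length]
    return df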
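For the token-count case specifically, the list and the helper column can probably be skipped, because the .str accessor can split and count in one chain. This is an alternative sketch, not the method above; the sample strings are made up.

```python
import pandas as pd

# Hypothetical data; the column label 'name' follows the example above.
df = pd.DataFrame({"name": ["one", "two words", "three word string", "a b c d"]})

# Split each string on whitespace, then count the pieces in each row.
n_tokens = df["name"].str.split().str.len()

# The Series can be used directly as a mask, so no temporary column is needed.
long_rows = df[n_tokens > 2]
short_rows = df[n_tokens < 2]

print(n_tokens.tolist())
print(long_rows)
```

For an arbitrary function that has no vectorized equivalent, df['name'].apply(func) gives the same kind of Series to compare against.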

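I will expand on @User's generic solution to provide a drop-free alternative. This is for people who were directed here by the question's title (not the OP's problem).

Say you want to delete all rows that contain negative values. A one-line solution is: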

df = df[(df > 0).all(axis=1)]

Step-by-step explanation:

Let's generate a 5x5 DataFrame of random, normally distributed values

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))
      A         B         C         D         E
0  1.764052  0.400157  0.978738  2.240893  1.867558
1 -0.977278  0.950088 -0.151357 -0.103219  0.410599
2  0.144044  1.454274  0.761038  0.121675  0.443863
3  0.333674  1.494079 -0.205158  0.313068 -0.854096
4 -2.552990  0.653619  0.864436 -0.742165  2.269755

Let the condition be that negatives are deleted. A Boolean DataFrame showing where the condition holds:

df > 0
      A     B      C      D      E
0   True  True   True   True   True
1  False  True  False  False   True
2   True  True   True   True   True
3   True  True  False   True  False
4  False  True   True  False   True

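A Boolean Series marking the rows in which every element satisfies the condition. Note that if any element in a row fails the condition, the row is marked False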

(df > 0).all(axis=1)
0     True
1    False
2     True
3    False
4    False
dtype: bool

Finally, filter the rows of the DataFrame by that condition

df[(df > 0).all(axis=1)]
      A         B         C         D         E
0  1.764052  0.400157  0.978738  2.240893  1.867558
2  0.144044  1.454274  0.761038  0.121675  0.443863

You can assign the result back to df to actually delete the rows, rather than just filtering as above

df = df[(df > 0).all(axis=1)]

This can easily be extended to filter out rows that contain NaN (missing values):

df = df[(~df.isnull()).all(axis=1)]
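As a quick sanity check, this mask should select the same rows as the df.dropna() call mentioned in the question. A small sketch on made-up data (notna() is the same as ~isnull()):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each of two different rows.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

# Keep the rows in which every value is present.
by_mask = df[df.notna().all(axis=1)]

# dropna() with default arguments drops rows that contain any NaN.
by_dropna = df.dropna()

print(by_mask.equals(by_dropna))
```

The mask form is mainly useful when the NaN test has to be combined with other conditions using & or |.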

It can also be simplified for cases such as: delete all rows in which column E is negative

df = df[(df.E>0)]

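Finally, I'd like to finish with some profiling numbers showing that @User's drop solution is slower than plain column-based filtering, and why: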

%timeit df_new = df[(df.E>0)]
345 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dft.drop(dft[dft.E < 0].index, inplace=True)
890 µs ± 94.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


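A column is basically a Series, i.e. a NumPy array, so it can be indexed at essentially no cost. For those interested in how the underlying memory organization affects execution speed, here is a great link: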

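I figured out a way using a list comprehension:

df[[(len(x) < 2) for x in df['column name']]]

but yours is much nicer. Thanks for your help!

In case someone needs a more complex comparison, a lambda can always be used:

df[df['column name'].map(lambda x: str(x) != "")]

For some reason, none of the other options worked for me except the one posted by @4lberto. I am using pandas 0.23.4 and Python 3.6.

I would add a .copy() at the end, in case you want to edit this DataFrame later (for example, assigning a new column would raise the "A value is trying to be set on a copy of a slice from a DataFrame" warning).

I just want to point out that the drop function supports in-place replacement. That is, your solution is the same as df.drop(df[df.score < 50].index, inplace=True). Still, I didn't know the "index" trick.

I just want to point out that before using this index trick you need to make sure your index values are unique (or call reset_index()). I found…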
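The last two points are easy to reproduce. The sketch below uses a made-up DataFrame with a deliberately non-unique index to show how the index trick can delete more rows than intended, and how .copy() gives an independent frame to modify.

```python
import pandas as pd

# Hypothetical data with repeated index labels.
df = pd.DataFrame({"score": [30, 60, 45, 90]}, index=[0, 0, 1, 1])

# Intended: remove the two rows with score < 50.
# drop() removes every row carrying a matching label, so all four rows go.
dropped = df.drop(df[df.score < 50].index)

# Boolean filtering selects row by row and keeps the two expected rows.
filtered = df[df.score >= 50].copy()

# With a fresh, unique index the drop approach gives the same rows.
reset = df.reset_index(drop=True)
dropped_ok = reset.drop(reset[reset.score < 50].index)

# Because of .copy(), adding a column leaves df untouched.
filtered["passed"] = True

print(len(dropped), filtered["score"].tolist(), dropped_ok["score"].tolist())
```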
