Python 2.7 pandas drop() - error - labels [] not contained in axis


I am using drop() to remove rows that have junk values (NaN, NaT, '') in certain columns:

for index, row in user_data_to_clean.iterrows():    
    if row.email != row.email or row.email == '' or row.email == ' ':
        user_data_to_clean.drop(index, inplace=True)
        email_count = email_count + 1

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-bb0cb6d83902> in <module>()
     24 
     25         if row.email != row.email or row.email == '' or row.email == ' ':
---> 26             user_data_to_clean.drop(index, inplace=True)
     27             email_count = email_count + 1
     28 

/home/eyebell/local_bin/janacare/virtenv/lib/python2.7/site-packages/pandas/core/generic.pyc in drop(self, labels, axis, level, inplace, errors)
   1871                 new_axis = axis.drop(labels, level=level, errors=errors)
   1872             else:
-> 1873                 new_axis = axis.drop(labels, errors=errors)
   1874             dropped = self.reindex(**{axis_name: new_axis})
   1875             try:

/home/eyebell/local_bin/janacare/virtenv/lib/python2.7/site-packages/pandas/indexes/base.pyc in drop(self, labels, errors)
   2964             if errors != 'ignore':
   2965                 raise ValueError('labels %s not contained in axis' %
-> 2966                                  labels[mask])
   2967             indexer = indexer[~mask]
   2968         return self.delete(indexer)

ValueError: labels [124] not contained in axis
What is the problem here?

I know that slicing is another way to achieve my goal,
but I would like to understand what went wrong here.

IIUC you can use a vectorized boolean mask instead of drop with iterrows(), because iterrows() is very slow. For masking NaN and NaT, use notnull():

Sample:

import pandas as pd
import numpy as np

user_data_to_clean = pd.DataFrame({'email':['','aa',' ', np.nan, 'dd'],
                   'a':[7,5,6,4,7],
                   'b':[7,8,9,1,2]})

print (user_data_to_clean)
   a  b email
0  7  7      
1  5  8    aa
2  6  9      
3  4  1   NaN
4  7  2    dd
Boolean mask:

print ((user_data_to_clean.email != '') &  
       (user_data_to_clean.email != ' ') & 
       (user_data_to_clean.email.notnull())) 

0    False
1     True
2    False
3    False
4     True
Name: email, dtype: bool 

print (user_data_to_clean[(user_data_to_clean.email != '') & 
                          (user_data_to_clean.email != ' ') & 
                          (user_data_to_clean.email.notnull()) ])

   a  b email
1  5  8    aa
4  7  2    dd  
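
The question also kept a running email_count of removed rows. With a mask, that count can be read off directly. The following is only a sketch on the sample data above; the names is_junk and cleaned are illustrative and not from the original post, and isin() is swapped in for the two != comparisons.

```python
import numpy as np
import pandas as pd

user_data_to_clean = pd.DataFrame({'email': ['', 'aa', ' ', np.nan, 'dd'],
                                   'a': [7, 5, 6, 4, 7],
                                   'b': [7, 8, 9, 1, 2]})

# True for rows whose email is NaN, '' or ' ' (the junk values from the question).
# is_junk and cleaned are hypothetical names chosen for this sketch.
is_junk = (user_data_to_clean.email.isnull() |
           user_data_to_clean.email.isin(['', ' ']))

email_count = int(is_junk.sum())         # number of rows that get removed
cleaned = user_data_to_clean[~is_junk]   # keep only the non-junk rows

print(email_count)
print(cleaned)
```

Because the frame is filtered once rather than modified row by row, no label can be dropped twice.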

I would do it this way:

Test DF:

In [43]: df = pd.DataFrame({'email':['x@x.x', 'aaa@aaa.com','  ', np.nan, 'a@mail.com', '1', 'xxx@gmail.com', '', np.nan], 'col': np.random.randint(0,100,9)})

In [44]: df
Out[44]:
   col          email
0   89          x@x.x
1   81    aaa@aaa.com
2   82
3   43            NaN
4   71     a@mail.com
5    3              1
6   48  xxx@gmail.com
7   48
8   71            NaN
Cleaning:

In [53]: df = df[(df.email.notnull()) & (df.email.str.strip().str.len() > 5)]

In [54]: df
Out[54]:
   col          email
1   97    aaa@aaa.com
4   77     a@mail.com
6   47  xxx@gmail.com
PS: if you want serious, robust (but slow) email validation, use a dedicated module for it.

If you need email_count, do this after cleaning:

email_count = len(df)
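
Note that len(df) after cleaning counts the rows that were kept, while the loop in the question counted the rows that were removed. A small sketch of the same filter with fixed data, computing both numbers; the names rows_before, kept_count and removed_count are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'email': ['x@x.x', 'aaa@aaa.com', '  ', np.nan, 'a@mail.com',
                             '1', 'xxx@gmail.com', '', np.nan]})

rows_before = len(df)

# Length of each email after stripping whitespace; NaN entries stay NaN,
# and NaN > 5 evaluates to False, so those rows are filtered out as well.
stripped_len = df.email.str.strip().str.len()
keep = df.email.notnull() & (stripped_len > 5)

df = df[keep]
kept_count = len(df)                       # rows that survived the cleaning
removed_count = rows_before - kept_count   # what the question's loop counted

print(kept_count)
print(removed_count)
```

The length threshold of 5 also discards short but well-formed addresses such as 'x@x.x', so it is a rough heuristic rather than a validation.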

Can you check user_data_to_clean.loc[124] instead? iloc looks at the position of the row, not its label. You may be trying to drop a row that was already dropped.

@ayhan: Thanks, it turned out to be my mistake. I was dropping rows that had already been dropped.

Thanks, I am using the validate_email module. It works well.

I am now using the boolean masking method. iterrows ran very slowly. Thank you very much.
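
The cause identified in the comments, dropping a label that is no longer in the index, can be reproduced in a few lines. This is a minimal sketch with made-up data; the index values are chosen only to mirror the label 124 from the traceback. Older pandas versions (as in the question) raise ValueError here, while newer ones raise KeyError, so the sketch catches both.

```python
import pandas as pd

# Hypothetical three-row frame whose index contains the label 124.
df = pd.DataFrame({'email': ['a@x.com', '', 'b@x.com']}, index=[123, 124, 125])

df = df.drop(124)          # first drop succeeds: label 124 exists

error_name = None
try:
    df = df.drop(124)      # label 124 is gone now, so this raises
except (KeyError, ValueError) as exc:
    # ValueError on older pandas, KeyError on newer versions
    error_name = type(exc).__name__
    print('%s: %s' % (error_name, exc))

# errors='ignore' makes drop() skip labels that are not present
df = df.drop(124, errors='ignore')
print(list(df.index))
```

Passing errors='ignore' silences the error, but it can also hide a logic mistake like the one found here, so filtering with a boolean mask is usually the safer choice.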