Filtering non-numeric data from two columns of a DataFrame with Pandas (Excel, NumPy, Pandas)

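I am loading a Pandas DataFrame from Excel that contains many data types. Two specific columns should be floats, but occasionally a researcher types in a free-text comment such as "not measured". I need to drop every row in which the value in either of those two columns is not a number, while preserving non-numeric data in the other columns. A simple use case looks like this (the real table has several thousand rows...)

This results in the following data table: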

    A   B   C               D
0   1   96  12              apples
1   2   33  Not measured    oranges
2   3   45  15              peaches
3   4       66              plums
4   5   8   42              pears

I am not clear on how to get to this table:

    A   B   C               D
0   1   96  12              apples
2   3   45  15              peaches
4   5   8   42              pears

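I tried dropna, but the column dtypes are "object" because of the non-numeric entries.

I cannot convert the values to floats without either converting the whole table or converting one Series at a time, which loses the relationship to the other data in the row. Perhaps there is something simple I am missing?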

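You can first create a subset containing columns `B` and `C`, convert it with `pd.to_numeric` and `errors='coerce'`, and check that every value in a row is not null. Then use the result as a boolean mask, as in the `apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)` expression timed below.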
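A minimal sketch of this first approach follows. The sample frame is an assumption reconstructed from the table in the question (in particular, the blank cell in column B is assumed to be an empty string), and `numeric`, `mask` and `filtered` are illustrative names.

```python
import pandas as pd

# Hypothetical sample frame, rebuilt from the table in the question.
# The blank cell in column B is assumed to be an empty string.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [96, 33, 45, "", 8],
    "C": [12, "Not measured", 15, 66, 42],
    "D": ["apples", "oranges", "peaches", "plums", "pears"],
})

# Convert only the two columns; anything that is not a number becomes NaN.
numeric = df[["B", "C"]].apply(pd.to_numeric, errors="coerce")

# A row is kept only if both converted values are present.
mask = numeric.notna().all(axis=1)

# Index the original frame, so columns A and D are left untouched.
filtered = df[mask]
print(filtered)  # rows 0, 2 and 4 remain
```

Because the mask is applied to the original frame, the other columns keep their values and the row relationships are preserved.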

The next solution uses `str.isdigit` with a null check on each column; its code and output are shown after the timings.

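But the solution that calls `pd.to_numeric` on each column separately and combines the two null checks is the fastest:

# keep a row only when both B and C parse as numbers
print(df[pd.to_numeric(df['B'], errors='coerce').notnull()
         & pd.to_numeric(df['C'], errors='coerce').notnull()])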

   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears
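When two per-column masks are combined, a row in which both B and C are non-numeric is worth checking. The sketch below uses a hypothetical three-row frame (`edge` is not from the question) to compare an xor of a `notnull` mask with an `isnull` mask against a plain `&` of two `notnull` masks.

```python
import pandas as pd

# Hypothetical frame: row 1 is non-numeric in BOTH columns,
# row 2 is non-numeric in C only.
edge = pd.DataFrame({
    "B": [96, "n/a", 45],
    "C": [12, "Not measured", "missing"],
})

b_ok = pd.to_numeric(edge["B"], errors="coerce").notna()
c_ok = pd.to_numeric(edge["C"], errors="coerce").notna()

# Equivalent to: B.notnull() ^ C.isnull()
xor_mask = b_ok ^ ~c_ok
# Both columns must be numeric.
and_mask = b_ok & c_ok

print(xor_mask.tolist())  # [True, True, False]
print(and_mask.tolist())  # [True, False, False]
```

The xor form keeps row 1 because both of its operands flip at once, so `&` is the safer combination when both columns can be invalid in the same row.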

Timings (the first two commands combine their masks with `^`):

#len(df) = 5k
df = pd.concat([df]*1000).reset_index(drop=True)

In [611]: %timeit df[pd.to_numeric(df['B'], errors='coerce').notnull() ^ pd.to_numeric(df['C'], errors='coerce').isnull()]
1000 loops, best of 3: 1.88 ms per loop

In [612]: %timeit df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()]
100 loops, best of 3: 16.1 ms per loop

In [613]: %timeit df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]
The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 3.49 ms per loop
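To rerun a comparison like this outside IPython, the standard `timeit` module can be used. This is only a sketch: the sample frame and the function names `with_apply` and `with_two_masks` are assumptions, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical 5-row frame modelled on the question, repeated to 5000 rows.
small = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [96, 33, 45, "", 8],
    "C": [12, "Not measured", 15, 66, 42],
    "D": ["apples", "oranges", "peaches", "plums", "pears"],
})
big = pd.concat([small] * 1000).reset_index(drop=True)


def with_apply(frame):
    """Convert both columns in one apply call, then require all non-null."""
    numeric = frame[["B", "C"]].apply(pd.to_numeric, errors="coerce")
    return frame[numeric.notna().all(axis=1)]


def with_two_masks(frame):
    """Convert each column separately and combine the masks with &."""
    b_ok = pd.to_numeric(frame["B"], errors="coerce").notna()
    c_ok = pd.to_numeric(frame["C"], errors="coerce").notna()
    return frame[b_ok & c_ok]


runs = 20
for func in (with_apply, with_two_masks):
    seconds = timeit.timeit(lambda: func(big), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1000:.2f} ms per call")
```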

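Thanks! For maintainability, I like the first solution with apply and notnull. It seems to work! I'll give it a day to see whether any problems come up, or whether anyone offers a simpler solution.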
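For maintainability, the apply-based approach can be wrapped in a small function that also stores the converted floats back into the result. `keep_numeric_rows` is a hypothetical helper, not part of pandas, and the sample frame is again an assumption based on the question's table.

```python
import pandas as pd


def keep_numeric_rows(frame, columns):
    """Drop rows with a non-numeric value in any of `columns`.

    Hypothetical helper: returns a copy in which `columns` hold floats
    and every other column is left as it was.
    """
    converted = frame[columns].apply(pd.to_numeric, errors="coerce")
    mask = converted.notna().all(axis=1)
    result = frame[mask].copy()
    # Replace each checked column with its numeric version.
    for col in columns:
        result[col] = converted.loc[mask, col]
    return result


# Hypothetical sample frame modelled on the question.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [96, 33, 45, "", 8],
    "C": [12, "Not measured", 15, 66, 42],
    "D": ["apples", "oranges", "peaches", "plums", "pears"],
})

clean = keep_numeric_rows(df, ["B", "C"])
print(clean.dtypes)
```

Returning a copy leaves the loaded frame unchanged, so the dropped rows can still be inspected afterwards.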

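The `str.isdigit` solution and its output. It relies on numeric cells being stored as numbers rather than digit strings, so that `str.isdigit` returns NaN for them:

print(df['B'].str.isdigit().isnull() & df['C'].str.isdigit().isnull())
0     True
1    False
2     True
3    False
4     True
dtype: bool

print(df[df['B'].str.isdigit().isnull() & df['C'].str.isdigit().isnull()])
   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears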
