Python: finding the indices where type conversion fails in a numpy array

python arrays numpy

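I have a one-dimensional numpy array of strings that I need to convert to a new data type. The new type can be int, float, or a datetime type. Some of the strings may be invalid for that type and cannot be converted, which raises an error, for example: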

>>> np.array(['10', '20', 'a'], dtype=int)
...
ValueError: invalid literal for int() with base 10: 'a'

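I want to find the index of the invalid value, which is 2 in this example. So far I can only think of two solutions, and neither is very good: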

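1. Parse the exception message with a regular expression to extract the invalid value, then look up that value's index in the original array. This seems messy and error-prone.
2. Parse the values in a loop in Python. This will probably be much slower than the numpy version; in an experiment I ran, the pure-Python loop was slow.

This seems like a very simple and common operation, so I expected the numpy library to have a built-in solution, but I could not find one.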

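You can use np.core.defchararray.isdigit (also available as np.char.isdigit) to mark the items that consist only of digits, then apply the logical-not operator (~) to mark the non-numeric items. After that, np.where gives you the corresponding indices: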

In [20]: arr = np.array(['10', '20', 'a', '4', '%'])

In [24]: np.where(~np.char.isdigit(arr))
Out[24]: (array([2, 4]),)
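
The comments on this answer point out that the digit test cannot be used for floats, datetimes, or negative numbers. A minimal sketch of that limitation follows; the sample array is made up for illustration, and np.char.isdigit is used as the shorter name for the same function:

```python
import numpy as np

# Hypothetical sample: every entry except 'a' is a valid float,
# and '-20' is also a valid int.
arr = np.array(['10', '-20', '3.5', 'a'])

# isdigit() is True only for strings made up entirely of digit characters,
# so the sign and the decimal point make valid numbers look invalid.
not_digits = ~np.char.isdigit(arr)
flagged = np.where(not_digits)[0]

print(flagged.tolist())  # [1, 2, 3]
```

Only index 3 is actually unconvertible to float, so this test is only suitable when the target type is a non-negative integer.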

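If you want to check for several types (such as float), you can write a custom function and apply it to the array with np.vectorize. Dates are a bit trickier, but if you want a general approach you will probably need dateutil.parser (from the python-dateutil package).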

You can use a function like the following:

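from dateutil import parser  # provided by the python-dateutil package

In [33]: def check_type(item):
    ...:     """Return True if item parses neither as a float nor as a date."""
    ...:     try:
    ...:         float(item)
    ...:     except ValueError:
    ...:         # Not a float: fall back to date parsing.
    ...:         try:
    ...:             parser.parse(item)
    ...:         except (ValueError, OverflowError):
    ...:             return True
    ...:         else:
    ...:             return False
    ...:     else:
    ...:         return False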

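Then vectorize the function with np.vectorize and pass the result to np.where. The demo (the In [45] and In [46] session) appears further down the page, after the comments.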
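
The same idea can be sketched in a self-contained form that takes the converter as a parameter. This is an illustration under assumptions, not the answerer's code: `invalid_indices`, `convert`, and `fails` are hypothetical names, and np.datetime64 stands in for dateutil.parser so that the example needs nothing beyond numpy (it accepts ISO-style date strings only, so it is stricter than dateutil).

```python
import numpy as np

def invalid_indices(arr, convert):
    """Return the indices of items for which convert(item) raises ValueError.

    `invalid_indices` and `convert` are illustrative names, not numpy API.
    """
    def fails(item):
        try:
            convert(item)
        except ValueError:
            return True
        return False

    # otypes=[bool] fixes the output dtype, so np.vectorize does not need a
    # trial call on the first element to infer it.
    mask = np.vectorize(fails, otypes=[bool])(arr)
    return np.flatnonzero(mask)

arr = np.array(['10', '-20', 'a', '4.5', '%'])
bad_as_int = invalid_indices(arr, int)      # '4.5' is not a valid int literal
bad_as_float = invalid_indices(arr, float)
bad_as_date = invalid_indices(np.array(['2018-05-01', 'a']), np.datetime64)

print(bad_as_int.tolist())    # [2, 3, 4]
print(bad_as_float.tolist())  # [2, 4]
print(bad_as_date.tolist())   # [1]
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so this reports every failing index but should not be expected to run at the speed of a native numpy cast.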

I would do it like this:

import numpy as np

custom_type = int
l = ['10', '20', 'a']
acc = np.array([], dtype=custom_type)
# enumerate keeps the index correct even after an earlier element has failed
for i, elem in enumerate(l):
    try:
        acc = np.concatenate((acc, np.array([elem], dtype=custom_type)))
    except ValueError:
        print("Failed to convert the type of the element in position {}".format(i))

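It turns out that I overestimated the difference between Python and numpy. Although the Python code I posted in the question was very slow, using a preallocated array makes it much faster: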

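import numpy as np

def python_parse(arr):
    """Convert arr to int, reporting the index of the first invalid item."""
    result = np.empty(shape=(len(arr),), dtype=int)
    for i, x in enumerate(arr):
        try:
            result[i] = x  # numpy converts the string on assignment
        except ValueError as err:
            raise ValueError(f'Failed at: {i}') from err
    return result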

This raises the error correctly and is almost as fast as np.array(strings, dtype=int), which surprised me a lot.

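I suspect this is much less efficient than simply iterating over a regular Python list, which the OP is already reluctant to do.

As @roganjosh said, I am trying to avoid a Python loop.

But I cannot use this for floats, datetimes, or negative numbers.

For those cases you have to use a Python-based approach.

Thank you for the effort. I managed to solve the problem in the end, and I'm afraid I prefer my own solution, particularly because it stops at the first error, but I like your idea.

Note that this only seems to give meaningful output on 1-D arrays. Try it on a higher-dimensional array built from the values '10', '20', 'a', '4', '%', '2'. I think for higher dimensions you would have to ravel the array and then work backwards.

I wonder how surprised you will be once you try this: change enumerate(arr) to enumerate(arr.tolist()) and time it again.

@PaulPanzer I am surprised, thank you! At first I was really shocked, because I thought this made numpy slower than Python, but I see that np.array(strings, dtype=int) also gets much faster when I add .tolist().

No need to panic, here is an explanation: the array's __getitem__ method is much more expensive than the list's, because (1) it has to be able to parse more complex indices, and (2) it has to create Python objects from the C elements stored in the array, whereas a list only needs to return a reference. Now, obviously tolist has to create those objects as well, but I assume creating them in bulk is cheaper. (3) tolist returns native Python objects such as int wherever possible, rather than np.int64, which __getitem__ does not. That also seems to favor list access.

@PaulPanzer Even so, it seems like something that could be improved in numpy. I think I will open an issue.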
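
The change suggested in the comments (iterating over arr.tolist() instead of the array itself) might look like the sketch below. The name `python_parse_tolist` and the sample arrays are illustrative, and no timings were measured for this sketch; the speed difference is the commenters' observation.

```python
import numpy as np

def python_parse_tolist(arr):
    """Variant of python_parse that iterates over arr.tolist()."""
    result = np.empty(shape=(len(arr),), dtype=int)
    # tolist() builds plain Python str objects once, up front, instead of
    # creating a numpy scalar on every step of the loop.
    for i, x in enumerate(arr.tolist()):
        try:
            result[i] = x
        except ValueError as err:
            raise ValueError(f'Failed at: {i}') from err
    return result

good = python_parse_tolist(np.array(['10', '20', '30']))

try:
    python_parse_tolist(np.array(['10', '20', 'a']))
except ValueError as err:
    message = str(err)

print(good.tolist())  # [10, 20, 30]
print(message)        # Failed at: 2
```

Like the original function, this stops at the first invalid item rather than collecting every failing index.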

Demo of check_type with np.vectorize and np.where:

In [45]: arr = np.array(['10.34', '-20', 'a', '4', '%', '2018-5-01'])

In [46]: vector_func = np.vectorize(check_type)
    ...: np.where(vector_func(arr))
    ...: 
Out[46]: (array([2, 4]),)
